swh.dataset.relational module

swh.dataset.relational.BLOOM_FILTER_COLUMNS: Dict[str, List[str | int]] = {'content': ['sha1', 'sha1_git', 'sha256'], 'directory': ['id'], 'directory_entry': ['directory_id', 'target'], 'origin': ['url'], 'origin_visit': ['origin'], 'origin_visit_status': ['origin'], 'release': ['id', 'author', 'target'], 'revision': ['id', 'author', 'committer', 'directory'], 'revision_extra_headers': [], 'revision_history': ['id', 'parent_id'], 'skipped_content': ['sha1', 'sha1_git', 'sha256'], 'snapshot': ['id'], 'snapshot_branch': ['snapshot_id', 'target']}

Columns for which we include Bloom filters, keyed by table name.

They speed up equality lookups on high-cardinality values: a reader can skip most stripes that contain no match without decompressing them.
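
The stripe-skipping idea can be illustrated with a minimal sketch. Everything below is hypothetical and for illustration only: `ToyBloomFilter`, `find_rows`, and the in-memory list-of-lists "stripes" are assumptions, not part of `swh.dataset` and not the actual on-disk filter format. The point is only that a Bloom filter answers "definitely absent" or "possibly present", so a reader needs to scan only the stripes whose filter reports a possible match.

```python
import hashlib
from typing import List, Tuple


class ToyBloomFilter:
    """Minimal Bloom filter; illustrative only, not the on-disk format."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str) -> List[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        return [
            int.from_bytes(
                hashlib.sha256(f"{seed}:{value}".encode()).digest()[:8], "big"
            )
            % self.num_bits
            for seed in range(self.num_hashes)
        ]

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def find_rows(
    stripes: List[List[str]], filters: List[ToyBloomFilter], needle: str
) -> Tuple[List[Tuple[int, int]], int]:
    """Return (stripe, row) positions equal to needle, and stripes scanned."""
    matches: List[Tuple[int, int]] = []
    scanned = 0
    for stripe_index, (rows, bloom) in enumerate(zip(stripes, filters)):
        if not bloom.might_contain(needle):
            continue  # skip the stripe without reading its rows
        scanned += 1
        for row_index, value in enumerate(rows):
            if value == needle:
                matches.append((stripe_index, row_index))
    return matches, scanned


# Hypothetical stand-in for the 'url' column of the 'origin' table,
# split into three stripes with one filter per stripe.
stripes = [
    ["https://example.org/a", "https://example.org/b"],
    ["https://example.org/c", "https://example.org/d"],
    ["https://example.org/e", "https://example.org/f"],
]
filters = []
for rows in stripes:
    bloom = ToyBloomFilter()
    for value in rows:
        bloom.add(value)
    filters.append(bloom)

matches, scanned = find_rows(stripes, filters, "https://example.org/d")
print(matches, scanned)
```

Because Bloom filters can return false positives but never false negatives, `scanned` may occasionally exceed the number of stripes that really contain the value, while `matches` is always complete. This also suggests why the filters help only with equality matches: a range or prefix query cannot be answered by hashing a single value.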