swh.dataset.athena module

This module implements the “athena” subcommands for the CLI. It can install and query a remote AWS Athena database.

swh.dataset.athena.create_database(database_name)
swh.dataset.athena.drop_table(database_name, table)
swh.dataset.athena.create_table(database_name, table, location_prefix)
swh.dataset.athena.repair_table(database_name, table)
swh.dataset.athena.query(client, query_string, *, desc='Querying', delay_secs=0.5, silent=False)
swh.dataset.athena.create_tables(database_name, dataset_location, output_location=None, replace=False)

Create the Software Heritage Dataset tables on AWS Athena.

Athena works on external columnar data stored in S3, but requires a schema for each table to run queries. This creates all the necessary tables remotely by using the relational schemas in swh.dataset.relational.
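As a rough illustration of what "creating a table remotely" involves, the sketch below builds a CREATE EXTERNAL TABLE statement from a list of columns. It is not the module's implementation: the helper name build_create_table_ddl, the column list, the bucket name, the per-table directory layout, and the ORC storage format are all assumptions made for the example.

```python
def build_create_table_ddl(database_name, table, columns, location_prefix):
    """Build a CREATE EXTERNAL TABLE statement for one table (illustrative only)."""
    # One "`name` type" entry per column, as Athena's Hive-style DDL expects.
    column_defs = ",\n".join(f"    `{name}` {sql_type}" for name, sql_type in columns)
    # Assumption: each table lives in its own directory under the prefix.
    location = f"{location_prefix.rstrip('/')}/{table}/"
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database_name}.{table} (\n"
        f"{column_defs}\n"
        f")\n"
        f"STORED AS ORC\n"  # assumption: the dataset is stored as ORC files
        f"LOCATION '{location}'"
    )


# Hypothetical schema and S3 location, used only for demonstration.
example_columns = [("id", "string"), ("date", "timestamp")]
ddl = build_create_table_ddl(
    "swh", "revision", example_columns, "s3://example-bucket/dataset/orc"
)
print(ddl)
```

The resulting statement only declares the schema and points Athena at the files in S3; no data is copied.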

swh.dataset.athena.human_size(n, units=['bytes', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB'])

Return a human-readable string representation of a size given in bytes.
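The exact output format is not documented here. A minimal sketch of how such a conversion might work, assuming 1024-based units and one decimal place, is shown below; human_size_sketch is a hypothetical stand-in, not the module's function.

```python
def human_size_sketch(n, units=("bytes", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB")):
    """Format a byte count using binary (1024-based) units (illustrative only)."""
    size = float(n)
    index = 0
    # Divide by 1024 until the value fits the current unit or units run out.
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    if index == 0:
        # Plain byte counts are shown without a decimal part.
        return f"{int(size)} {units[0]}"
    return f"{size:.1f} {units[index]}"


print(human_size_sketch(512))
print(human_size_sketch(1536))
print(human_size_sketch(1024 ** 3))
```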

swh.dataset.athena.run_query_get_results(database_name, query_string, output_location=None)

Run a query on AWS Athena and return the resulting data in CSV format.
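Athena queries run asynchronously, so a caller typically submits the query and then polls for completion. The sketch below shows that polling pattern under stated assumptions: it uses the response shape of the boto3 Athena client's get_query_execution call, while wait_for_query and FakeAthenaClient are hypothetical names introduced so the example runs without AWS credentials. It is not taken from this module's source.

```python
import time


def wait_for_query(client, query_execution_id, delay_secs=0.5):
    """Poll until the query reaches a terminal state and return that state."""
    while True:
        response = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        # Still QUEUED or RUNNING: wait before asking again.
        time.sleep(delay_secs)


class FakeAthenaClient:
    """Stand-in for a boto3 Athena client (hypothetical, for demonstration)."""

    def __init__(self, states):
        self._states = iter(states)

    def get_query_execution(self, QueryExecutionId):
        # Mimic the nested response structure returned by boto3.
        return {"QueryExecution": {"Status": {"State": next(self._states)}}}


fake_client = FakeAthenaClient(["QUEUED", "RUNNING", "SUCCEEDED"])
final_state = wait_for_query(fake_client, "example-id", delay_secs=0)
print(final_state)
```

With a real client, the CSV results of a successful query are written by Athena to the configured S3 output location.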

swh.dataset.athena.generate_subdataset(dataset_db, subdataset_db, subdataset_s3_path, swhids_file, output_location=None)