swh.scheduler.backend_es module

Elasticsearch backend

swh.scheduler.backend_es.get_elasticsearch(cls: str, args: Dict[str, Any] = {})[source]

Instantiate an Elasticsearch instance.

class swh.scheduler.backend_es.ElasticSearchBackend(**config)[source]

Bases: object

Elasticsearch backend to index tasks.

This uses an Elasticsearch client to communicate with the Elasticsearch instance.

initialize()[source]

create(index_name) → None[source]

Create and initialize index_name with the mapping used for all indices matching the swh-tasks- pattern.

compute_index_name(year, month)[source]

Given a year and a month, compute the index’s name.
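The naming step can be sketched as follows. This is a minimal illustration, not the module's code: the only detail taken from this page is the swh-tasks- prefix, while the exact layout (four-digit year, zero-padded month, hyphen separators) is an assumption.

```python
def compute_index_name(year: int, month: int) -> str:
    """Build a monthly index name such as 'swh-tasks-2019-03'.

    Assumption: the name is the 'swh-tasks-' prefix followed by the
    four-digit year and the zero-padded month. The real format may differ.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"swh-tasks-{year:04d}-{month:02d}"
```

Under this assumed layout, zero-padding the month keeps index names in chronological order when sorted lexicographically.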

mget(index_name, doc_ids, chunk_size=500, source=True)[source]

Retrieve the documents’ full content by id, as per the source setup.

The source parameter restricts what is returned, e.g.:

  • source=True: returns the original indexed data

  • source=False: returns the documents without the original _source field

  • source=['task_id']: returns only task_id in the _source field

Parameters
  • index_name (str) – Name of the index to query

  • doc_ids (generator) – Generator of ids to retrieve

  • chunk_size (int) – Number of document ids to send per retrieval request

  • source (bool, [str]) – Information to return

Yields

Indexed documents, as per the source setup
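The role of chunk_size can be illustrated with a small generic sketch. The helper name chunked is hypothetical and not part of this module. It only shows how a generator of ids can be consumed lazily in groups of at most chunk_size, each group standing for one retrieval request.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(doc_ids: Iterable[str], chunk_size: int = 500) -> Iterator[List[str]]:
    """Yield successive lists of at most chunk_size ids.

    Hypothetical helper: the ids are pulled lazily, so the whole
    generator is never materialized at once.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(doc_ids)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Each chunk would correspond to one retrieval request.
for ids in chunked((str(i) for i in range(5)), chunk_size=2):
    print(ids)
```

A larger chunk_size means fewer round trips but bigger requests. The default of 500 comes from the signature above.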

is_index_opened(index_name: str) → bool[source]

Determine whether an index is open.

streaming_bulk(index_name, doc_stream, chunk_size=500, source=True)[source]

Bulk index data and return the successfully indexed data, as per the source setup.

The source parameter restricts what is returned, e.g.:

  • source=True: returns the original indexed data

  • source=False: returns the documents without the original _source field

  • source=['task_id']: returns only task_id in the _source field

Note that:

  • if the index is closed, it will be opened

  • if the index does not exist, it will be created and opened

This keeps the index open for performance reasons.

Parameters
  • index_name (str) – Name of the index to write to

  • doc_stream (generator) – Generator of documents to index

  • chunk_size (int) – Number of documents to send per chunk

  • source (bool, [str]) – Information to return
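The three forms of the source parameter can be illustrated with a local sketch. The helper select_source and the sample document are hypothetical and not part of this module. The sketch only mimics the shape of the returned data, and assumes that the actual field selection is performed by Elasticsearch rather than in Python.

```python
from typing import Any, Dict, List, Union


def select_source(
    hit: Dict[str, Any], source: Union[bool, List[str]] = True
) -> Dict[str, Any]:
    """Return a copy of hit shaped according to source.

    Hypothetical helper mimicking the three documented cases:
    True keeps _source, False drops it, a list keeps only those fields.
    """
    # Start from the metadata fields, leaving _source aside.
    result = {key: value for key, value in hit.items() if key != "_source"}
    if source is False:
        return result
    original = hit.get("_source", {})
    if source is True:
        result["_source"] = dict(original)
    else:
        # Keep only the requested fields that are actually present.
        result["_source"] = {k: original[k] for k in source if k in original}
    return result


# Hypothetical indexed task document.
hit = {"_id": "1", "_source": {"task_id": 1, "status": "done"}}
print(select_source(hit, True))
print(select_source(hit, False))
print(select_source(hit, ["task_id"]))
```

Requesting only the fields that are needed, such as task_id, keeps responses small when many documents are indexed or retrieved at once.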