Import steps

Steps required to import data from various sources in different formats.

ReadStep

class ReadStep(load_class, options=None)

Read a raw data source from somewhere.

This step may be configured to read a local file from the hard disk or to download a file from an external service. In either case, it should return an iterable (such as a file-like object) that an ImportStep can loop over.

See also

Reader classes

Arguments:

  • str load_class: A Reader class or class name, see Reader classes.

  • dict options: Options to be passed down to the reader class.

Added context:

  • str mode: Will be set to 'import'

  • dict read
    • type reader: Will be set to the reader class.

    • dict options: The options supplied to the reader class.

ImportStep

class ImportStep(load_class, options=None)

Load raw data as a known format.

After reading in raw data with ReadStep, this data should be prepared and parsed as one of the supported formats.

See also

Loader classes

Arguments:

  • str load_class: A Loader class or class name, see Loader classes.

  • dict options: Options to be passed down to the loader class.

Added context:

None.
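
The hand-off between ReadStep and ImportStep can be sketched in plain Python. This is a minimal illustration, not the library's implementation: StringReader, CsvLoader, run_import and their read/load methods are hypothetical names, and only the context keys are taken from the "Added context" list of ReadStep.

```python
import csv
import io


class StringReader:
    """Hypothetical reader: exposes an in-memory string as iterable lines."""

    def __init__(self, text):
        self.text = text

    def read(self):
        # Any iterable of lines works here, e.g. an open file object.
        return io.StringIO(self.text)


class CsvLoader:
    """Hypothetical loader: parses raw lines into one dict per row."""

    def __init__(self, delimiter=","):
        self.delimiter = delimiter

    def load(self, raw):
        return csv.DictReader(raw, delimiter=self.delimiter)


def run_import(reader, loader):
    # Context keys mirror the "Added context" list of ReadStep.
    context = {
        "mode": "import",
        "read": {"reader": type(reader), "options": {}},
    }
    raw = reader.read()                # what a read step would hand over
    datasets = list(loader.load(raw))  # what an import step would produce
    return context, datasets


context, datasets = run_import(
    StringReader("title;price\nLamp;12\nDesk;80\n"),
    CsvLoader(delimiter=";"),
)
```

The reader only has to return something iterable, so the loader can be swapped without touching the reading side.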

ValueMapStep

class ValueMapStep(model, attribute_map, defaults=None, **options)

Map values in datasets of an import pipeline to correct attributes in our model.

__init__(model, attribute_map, defaults=None, **options)

Create a new ValueMapStep instance.

Parameters
  • model – A str name of a model class in the system or a ModelImpl instance to be used as a field_map. May also be a dict with a direct key to field map for testing purposes.

  • attribute_map (dict) –

    A dict mapping model fields to import column names. You may specify multiple values using another dict as a value. Import columns may be dot-separated paths in the source.

    Examples:

    Import the title column from our source to the name property in our model both as English and German values:

    {'name': {'en': 'title', 'de': 'title'}}
    

    Given this multilingual source data structure:

    {'fields': {
      'name': {
        'de': 'Deutsch',
        'en': 'English',
      },
    }}
    

    You may import each language to a destination explicitly using this attribute_map:

    {'name': {'en': 'fields.name.en', 'de': 'fields.name.de'}}
    

    Or import it dynamically, using all defined fields:

    {'name': 'fields.name'}
    

    Adding multiple values to a field is also supported for multivalue attributes:

    {'things': ['fields.thing1', 'fields.thing2', 'fields.thing3']}
    

  • defaults (dict) – Default values to use when a value is not found via attribute_map.

  • kwargs

    Additional options to pass down to fields.

    key null_values

    List of values that map to a literal None.

    key true_values

    List of values that map to a literal True.

    key false_values

    List of values that map to a literal False.

    key datetimeformat

    Parse datetime values with this format.

    key dateformat

    Parse date values with this format.

    key timeformat

    Parse time values with this format.
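
The attribute_map forms listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the ValueMapStep implementation: resolve_path and apply_attribute_map are hypothetical helpers, and dropping unresolved paths from a multivalue list is an assumption about the behaviour.

```python
def resolve_path(source, path):
    """Follow a dot-separated path through nested dicts; None if missing."""
    current = source
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def apply_attribute_map(source, attribute_map, defaults=None):
    """Build a model-shaped dict from one source record."""
    result = {}
    for attribute, spec in attribute_map.items():
        if isinstance(spec, dict):
            # One source path per key, e.g. one per language.
            value = {k: resolve_path(source, p) for k, p in spec.items()}
        elif isinstance(spec, list):
            # Multivalue attribute: keep every path that resolves (assumption).
            resolved = (resolve_path(source, p) for p in spec)
            value = [v for v in resolved if v is not None]
        else:
            value = resolve_path(source, spec)
        if value is None and defaults and attribute in defaults:
            value = defaults[attribute]
        result[attribute] = value
    return result


source = {
    "title": "Lamp",
    "fields": {
        "name": {"de": "Deutsch", "en": "English"},
        "thing1": "a",
        "thing3": "c",
    },
}

explicit = apply_attribute_map(
    source, {"name": {"en": "fields.name.en", "de": "fields.name.de"}}
)
dynamic = apply_attribute_map(source, {"name": "fields.name"})
multi = apply_attribute_map(
    source, {"things": ["fields.thing1", "fields.thing2", "fields.thing3"]}
)
fallback = apply_attribute_map(
    source, {"sku": "fields.sku"}, defaults={"sku": "n/a"}
)
```

Here explicit and dynamic produce the same multilingual value, and fallback shows defaults being used when the path does not exist in the source.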

Added context:

None.
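
How options such as null_values, true_values, false_values and dateformat might act on a raw value can be sketched as follows. The coerce_value helper is hypothetical, the order of the checks is an assumption, and so is the use of strptime-style format strings; the library may handle these differently.

```python
from datetime import datetime


def coerce_value(value, null_values=(), true_values=(), false_values=(),
                 dateformat=None):
    """Map a raw import value to None, True, False or a date where possible."""
    if value in null_values:
        return None
    if value in true_values:
        return True
    if value in false_values:
        return False
    if dateformat is not None and isinstance(value, str):
        try:
            # Assumes a strptime-style format such as "%d.%m.%Y".
            return datetime.strptime(value, dateformat).date()
        except ValueError:
            return value  # not a date in this format, keep it unchanged
    return value


options = {
    "null_values": ["", "N/A"],
    "true_values": ["yes"],
    "false_values": ["no"],
    "dateformat": "%d.%m.%Y",
}
raw_row = ["N/A", "yes", "no", "24.12.2020", "Lamp"]
clean_row = [coerce_value(value, **options) for value in raw_row]
```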

DbWriteStep

class DbWriteStep(model, create_new=True, id_field=None, primary_keys=None, classification=None)

Write instances to the database or update existing ones.

__init__(model, create_new=True, id_field=None, primary_keys=None, classification=None)

Create a new DB writer step.

Parameters
  • create_new (bool) – Allow creation of new records. Set to False to only allow updating existing records.

  • id_field (str) – ID field to read from the source data and set as the ObjectId.

  • model (str|type) – Model class name or concrete class.

  • primary_keys (str|list) – Attribute(s) to use when deciding whether to update or insert a document.

  • classification (xmm.models.Node) – A Node instance that will be linked as a classification class.

Arguments:

  • str id_field: Will be removed from the import dataset and used as primary key to update existing instances.

  • str model: Model name or class to save datasets as.

  • str|list primary_keys: List of primary keys to use instead of a proper object ID.

Added context:

  • dict write:
    • type model: Will be the model class used
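
The interaction of primary_keys and create_new can be sketched with a plain list standing in for the database. The write_dataset helper and its return values are hypothetical and only illustrate the update-or-insert decision described above, not the real persistence code.

```python
def write_dataset(store, dataset, primary_keys=None, create_new=True):
    """Update the record matching primary_keys, or insert when allowed.

    Returns "updated", "inserted" or "skipped".
    """
    if isinstance(primary_keys, str):
        primary_keys = [primary_keys]  # accept a single key name
    if primary_keys:
        query = {key: dataset.get(key) for key in primary_keys}
        for record in store:
            if all(record.get(k) == v for k, v in query.items()):
                record.update(dataset)
                return "updated"
    if not create_new:
        return "skipped"  # updates only, no new records
    store.append(dict(dataset))
    return "inserted"


store = []
results = [
    write_dataset(store, {"sku": "A1", "name": "Lamp"}, primary_keys="sku"),
    write_dataset(store, {"sku": "A1", "name": "Desk lamp"}, primary_keys="sku"),
    write_dataset(store, {"sku": "B2", "name": "Desk"},
                  primary_keys=["sku"], create_new=False),
]
```

With create_new=False the third dataset is dropped because no existing record carries the same sku.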

RawDbWriteStep

class RawDbWriteStep(collection_name=None, model=None, **kwargs)

Write raw records to a MongoDB collection.

It’s a LOT faster than using the default write method. After using this step, it might be necessary to rebuild the Elasticsearch index.

Warning

Updates the first matching document!

__init__(collection_name=None, model=None, **kwargs)

Create a new raw DB writer step.

Parameters
  • model (str) – Model name to use the collection of.

  • collection_name (str) – Collection to write documents to.

Key str|list primary_keys

Field(s) to use when deciding whether to update or insert a document.

Key bool create_new

If False, only allow updating existing documents. The default is to allow upserts.

Key bool add_metainfo

Add created/updated timestamps and user metadata.

Key User meta_user

Specify user to use for ‘created_by’ and ‘updated_by’ metadata.

Key bool delete_untouched

Remove all documents from collection that have not been touched at the end.

Arguments:

  • str collection_name: Collection to import documents into.

  • str|list primary_keys: When not empty, upserts are performed using the values of these keys in each document.

Added context:

  • dict write:
    • type collection: Equal to the collection_name option that was used.
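
The effect of the warning above, together with add_metainfo and delete_untouched, can be sketched with a list standing in for the MongoDB collection. The raw_write helper and the "created" and "updated" field names are hypothetical; the real step talks to MongoDB and may store its metadata differently.

```python
from datetime import datetime, timezone


def raw_write(collection, documents, primary_keys,
              add_metainfo=False, delete_untouched=False):
    """Upsert documents into a list that stands in for a collection."""
    now = datetime.now(timezone.utc)
    touched = set()  # ids of the documents written during this run
    for doc in documents:
        query = {key: doc.get(key) for key in primary_keys}
        match = next(
            (d for d in collection
             if all(d.get(k) == v for k, v in query.items())),
            None,
        )
        if match is None:
            match = dict(doc)
            if add_metainfo:
                match["created"] = now
            collection.append(match)
        else:
            match.update(doc)  # only the first match; duplicates stay as is
        if add_metainfo:
            match["updated"] = now
        touched.add(id(match))
    if delete_untouched:
        collection[:] = [d for d in collection if id(d) in touched]
    return collection


collection = [
    {"sku": "A1", "name": "old"},
    {"sku": "A1", "name": "duplicate"},
    {"sku": "Z9", "name": "stale"},
]
incoming = [{"sku": "A1", "name": "new"}, {"sku": "B2", "name": "Desk"}]

raw_write(collection, incoming, primary_keys=["sku"])
after_upsert = [dict(d) for d in collection]

raw_write(collection, incoming, primary_keys=["sku"], delete_untouched=True)
after_cleanup = [dict(d) for d in collection]
```

After the first run the duplicate A1 document is still present and unchanged. The second run removes it, along with the stale Z9 document, because neither was written during that run.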

DbDeleteStep

class DbDeleteStep(model, primary_keys=None)

Delete instances from the database.

__init__(model, primary_keys=None)

Create a new DB deleter step.

Parameters
  • model (str|type) – Model class name or concrete class.

  • primary_keys (str|list) – Attribute(s) to use for querying the documents to delete.

Arguments:

  • str model: Model name or class whose instances are deleted.

  • str|list primary_keys: List of keys to build the query.

Added context:

  • dict delete:
    • type model: Will be the model class used

RawDbDeleteStep

class RawDbDeleteStep(collection_name, primary_keys=None)

Deletes raw records from a MongoDB collection.

It’s a LOT faster than using the default delete method. After using this step, it might be necessary to rebuild the Elasticsearch index.

Warning

Deletes all matching documents!

__init__(collection_name, primary_keys=None)

Create a new raw DB deleter step.

Parameters

collection_name (str) – Collection to delete documents from.

Arguments:

  • str collection_name: Collection to delete documents from.

  • str|list primary_keys: Set the query fields to select the documents to delete.

Added context:

  • dict delete:
    • type collection: Equal to the collection_name option that was used.
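
How primary_keys select the documents to delete can be sketched with a list standing in for the collection. The build_query and delete_matching helpers are hypothetical. The sketch illustrates the warning above: every document matching the query is removed, not just the first one.

```python
def build_query(dataset, primary_keys):
    """Build an equality query from the chosen keys of one dataset."""
    if isinstance(primary_keys, str):
        primary_keys = [primary_keys]  # accept a single key name
    return {key: dataset[key] for key in primary_keys}


def delete_matching(collection, query):
    """Remove every document matching query; return how many were removed."""
    kept = [
        doc for doc in collection
        if not all(doc.get(k) == v for k, v in query.items())
    ]
    removed = len(collection) - len(kept)
    collection[:] = kept
    return removed


collection = [
    {"sku": "A1", "lang": "en"},
    {"sku": "A1", "lang": "de"},
    {"sku": "B2", "lang": "en"},
]
query = build_query({"sku": "A1", "lang": "en"}, primary_keys="sku")
removed = delete_matching(collection, query)
```

Because only sku is used as a key, both A1 documents are removed. Adding lang to primary_keys would narrow the query to a single document.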