Import steps

Steps required to import data from various sources in different formats.

ReadStep

class ReadStep(load_class, options=None)

Read a raw data source from somewhere.

This step may be configured to read a local file from the hard disk or to download a file from an external service. In either case, it should return an iterable (such as a file-like object) that an ImportStep can loop over.

ImportStep

class ImportStep(load_class, options=None)

Load raw data as a known format.

After reading in raw data with ReadStep, this data should be prepared and parsed as one of the supported formats.
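As a rough illustration of how the two steps hand data to each other, the sketch below uses plain functions instead of the real classes. The names `read_source` and `import_csv` are hypothetical, and CSV is assumed as the format only for the sake of the example; the actual `load_class` implementations may work differently.

```python
import csv
import io


def read_source(raw_text):
    """Stand-in for a ReadStep: return a file-like iterable of lines."""
    # A real reader might open a local file or download one instead.
    return io.StringIO(raw_text)


def import_csv(lines):
    """Stand-in for an ImportStep: parse the raw lines as CSV rows."""
    # csv.DictReader accepts any iterable that yields text lines.
    for row in csv.DictReader(lines):
        yield dict(row)


raw = "title,price\nChair,10\nTable,25\n"
datasets = list(import_csv(read_source(raw)))
```

The importer only relies on being able to iterate over what the reader returns, which is why a file-like object is enough.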

ValueMapStep

class ValueMapStep(model, attribute_map, defaults=None, **options)

Map values in datasets of an import pipeline to correct attributes in our model.

__init__(model, attribute_map, defaults=None, **options)

Create a new ValueMapStep instance.

Parameters

model – The str name of a model class in the system, or a ModelImpl instance, to be used as a field_map. May also be a dict that maps keys directly to fields, for testing purposes.
attribute_map (dict) –
A dict mapping model fields to import column names. You may specify multiple values using another dict as a value. Import columns may be dot-separated paths in the source.

Examples:

Import the title column from our source to the name property in our model both as English and German values:
```
{'name': {'en': 'title', 'de': 'title'}}
```
Given this multilingual source data structure:
```
{'fields': {
  'name': {
    'de': 'Deutsch',
    'en': 'English',
  },
}}
```
You may import each language to a destination explicitly using this attribute_map:
```
{'name': {'en': 'fields.name.en', 'de': 'fields.name.de'}}
```
Or import it dynamically, using all defined fields:
```
{'name': 'fields.name'}
```
Adding multiple values to a field is also supported for multivalue attributes:
```
{'things': ['fields.thing1', 'fields.thing2', 'fields.thing3']}
```
defaults (dict) – Default values to use when a value is not found via attribute_map.
kwargs –
Additional options to pass down to fields.

key null_values

List of values that map to a literal None.

key true_values

List of values that map to a literal True.

key false_values

List of values that map to a literal False.

key datetimeformat

Parse datetime values with this format.

key dateformat

Parse date values with this format.

key timeformat

Parse time values with this format.
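The attribute_map shapes described above (a dict per language, a single dot-separated path, and a list of paths) can be sketched in plain Python. This is only an illustration under assumptions: `resolve_path` and `map_values` are hypothetical helpers, not part of the documented API, and how the real step treats missing paths is not stated in this reference.

```python
def resolve_path(source, path):
    """Follow a dot-separated path through nested dicts; None if missing."""
    current = source
    for key in path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def map_values(source, attribute_map, defaults=None):
    """Build a target dict from source using the three documented shapes."""
    result = {}
    for field, spec in attribute_map.items():
        if isinstance(spec, dict):
            # Explicit mapping per language: {'en': 'path', 'de': 'path'}
            value = {lang: resolve_path(source, p) for lang, p in spec.items()}
        elif isinstance(spec, list):
            # Multivalue attribute: collect one value per path
            value = [resolve_path(source, p) for p in spec]
        else:
            # Single path; may resolve to a whole sub-dict of languages
            value = resolve_path(source, spec)
        if value is None and defaults and field in defaults:
            # Assumption: defaults apply only when nothing was found
            value = defaults[field]
        result[field] = value
    return result


source = {'fields': {'name': {'de': 'Deutsch', 'en': 'English'},
                     'thing1': 'a', 'thing2': 'b'}}
explicit = map_values(source, {'name': {'en': 'fields.name.en', 'de': 'fields.name.de'}})
dynamic = map_values(source, {'name': 'fields.name'})
multi = map_values(source, {'things': ['fields.thing1', 'fields.thing2']})
fallback = map_values(source, {'color': 'fields.color'}, defaults={'color': 'red'})
```

In this sketch the explicit and the dynamic variant produce the same result for `name`, because the single path resolves to the complete language dict.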

Added context:

None.
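To make the null_values, true_values, false_values and format options more concrete, here is a small standalone sketch. It assumes that the format options are strptime-style patterns, which this reference does not confirm, and `convert_value` and `parse_date` are hypothetical helpers rather than functions of the library.

```python
from datetime import datetime


def convert_value(value, null_values=(), true_values=(), false_values=()):
    """Replace configured raw values with the literals None, True or False."""
    if value in null_values:
        return None
    if value in true_values:
        return True
    if value in false_values:
        return False
    return value  # anything else is passed through unchanged


def parse_date(value, dateformat):
    """Parse a date string; assumes a strptime-style format pattern."""
    return datetime.strptime(value, dateformat).date()


options = {
    'null_values': ['', 'NULL', 'n/a'],
    'true_values': ['yes', 'ja'],
    'false_values': ['no', 'nein'],
    'dateformat': '%d.%m.%Y',
}

literals = {k: options[k] for k in ('null_values', 'true_values', 'false_values')}
converted = [convert_value(v, **literals) for v in ['ja', 'nein', 'n/a', 'blue']]
parsed = parse_date('24.12.2020', options['dateformat'])
```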

DbWriteStep

class DbWriteStep(model, create_new=True, id_field=None, primary_keys=None, classification=None)

Write instances to the database or update existing ones.

__init__(model, create_new=True, id_field=None, primary_keys=None, classification=None)

Create a new DB writer step.

Parameters

create_new (bool) – Allow creation of new records. Set to False to only allow updating existing records.
id_field (str) – ID field to use and set as ObjectId from the source data.
model (str|type) – Model class name or concrete class.
primary_keys (str|list) – Attribute(s) to use when deciding whether to update or insert a document.
classification (xmm.models.Node) – A Node instance that will be linked as a classification class.

Arguments:

str id_field: Will be removed from the import dataset and used as primary key to update existing instances.
str model: Model name or class to save datasets as.
str|list primary_keys: List of primary keys to use instead of a proper object ID.

Added context:

dict write:
- type model: Will be the model class used
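The interplay of primary_keys and create_new can be pictured with an in-memory list standing in for the database. This is a sketch under assumptions, not the step's implementation: `write_dataset` is a hypothetical helper, and skipping a dataset when create_new is False and nothing matches is a guess at the behaviour.

```python
def write_dataset(store, dataset, primary_keys, create_new=True):
    """Update the record matching primary_keys, or insert a new one.

    Returns 'updated', 'inserted' or 'skipped'.
    """
    # Accept a single key name as well as a list of names (str|list).
    keys = [primary_keys] if isinstance(primary_keys, str) else list(primary_keys)
    query = {key: dataset.get(key) for key in keys}
    for record in store:
        if all(record.get(k) == v for k, v in query.items()):
            record.update(dataset)
            return 'updated'
    if not create_new:
        # Assumption: unmatched datasets are ignored in update-only mode.
        return 'skipped'
    store.append(dict(dataset))
    return 'inserted'


store = [{'sku': 'A1', 'name': 'Chair'}]
first = write_dataset(store, {'sku': 'A1', 'name': 'Armchair'}, 'sku')
second = write_dataset(store, {'sku': 'B2', 'name': 'Table'}, ['sku'])
third = write_dataset(store, {'sku': 'C3', 'name': 'Lamp'}, 'sku', create_new=False)
```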

RawDbWriteStep

class RawDbWriteStep(collection_name=None, model=None, **kwargs)

Write raw records to a MongoDB collection.

It’s a LOT faster than using the default write method. After using this step, rebuilding the Elasticsearch index might be necessary.

Warning

Updates the first matching document!

__init__(collection_name=None, model=None, **kwargs)

Create a new raw DB writer step.

Parameters

model (str) – Model name to use the collection of.
collection_name (str) – Collection to write documents to.

Key str|list primary_keys

Field(s) to use when deciding whether to update or insert a document.

Key bool create_new

If False, only allow updating existing documents. The default is to allow upserts.

Key bool add_metainfo

Add created/updated timestamps and user metadata.

Key User meta_user

Specify user to use for ‘created_by’ and ‘updated_by’ metadata.

Key bool delete_untouched

Remove all documents from collection that have not been touched at the end.

Arguments:

str collection_name: Collection to import documents into.
str|list primary_keys: When not empty, upserts are performed using values from these keys in each document.

Added context:

dict write:
- type collection: Equal to the collection_name option that was used.
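Two behaviours described above are easy to overlook: only the first matching document is updated, and delete_untouched removes everything that was not written during the run. The sketch below models both with a plain list in place of a MongoDB collection; `raw_write` is a hypothetical helper and does not reflect how the step talks to the database.

```python
def raw_write(collection, documents, primary_keys, delete_untouched=False):
    """Upsert documents by primary_keys into a list-based collection."""
    touched = set()  # identities of stored documents written in this run
    for doc in documents:
        query = {key: doc.get(key) for key in primary_keys}
        # Only the FIRST matching document is picked, as the warning says.
        match = next(
            (d for d in collection
             if all(d.get(k) == v for k, v in query.items())),
            None,
        )
        if match is None:
            match = dict(doc)
            collection.append(match)
        else:
            match.update(doc)
        touched.add(id(match))
    if delete_untouched:
        # Drop every stored document that this run did not write.
        collection[:] = [d for d in collection if id(d) in touched]
    return collection


first_only = raw_write(
    [{'sku': 'A1', 'qty': 1}, {'sku': 'A1', 'qty': 2}],
    [{'sku': 'A1', 'qty': 7}],
    ['sku'],
)
pruned = raw_write(
    [{'sku': 'A1', 'qty': 1}, {'sku': 'Z9', 'qty': 5}],
    [{'sku': 'A1', 'qty': 7}, {'sku': 'B2', 'qty': 3}],
    ['sku'],
    delete_untouched=True,
)
```

In `first_only` the duplicate `A1` entry keeps its old quantity, which is the situation the warning refers to when primary_keys do not identify a document uniquely.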

DbDeleteStep

class DbDeleteStep(model, primary_keys=None)

Delete instances from the database.

__init__(model, primary_keys=None)

Create a new DB deleter step.

Parameters

model (str|type) – Model class name or concrete class.
primary_keys (str|list) – Attribute(s) to use for querying the documents to delete.

Arguments:

str model: Model name or class to delete datasets of.
str|list primary_keys: List of keys to build the query.

Added context:

dict delete:
- type model: Will be the model class used

RawDbDeleteStep

class RawDbDeleteStep(collection_name, primary_keys=None)

Deletes raw records from a MongoDB collection.

It’s a LOT faster than using the default delete method. After using this step, rebuilding the Elasticsearch index might be necessary.

Warning

Deletes all matching documents!

__init__(collection_name, primary_keys=None)

Create a new raw DB deleter step.

Parameters

collection_name (str) – Collection to delete documents from.
primary_keys (str|list) – Field(s) to use for querying the documents to delete.

Arguments:

str collection_name: Collection to delete documents from.
str|list primary_keys: Set the query fields to select the documents to delete.

Added context:

dict delete:
- type collection: Equal to the collection_name option that was used.
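Unlike the raw write step, the raw delete step removes every document that matches the query. A minimal sketch of that behaviour, again with a list in place of a MongoDB collection and with the hypothetical helpers `build_query` and `raw_delete`, might look like this:

```python
def build_query(dataset, primary_keys):
    """Pick the primary key values from a dataset to form the query."""
    # Accept a single key name as well as a list of names (str|list).
    keys = [primary_keys] if isinstance(primary_keys, str) else list(primary_keys)
    return {key: dataset[key] for key in keys}


def raw_delete(collection, dataset, primary_keys):
    """Remove ALL documents matching the query; return how many were removed."""
    query = build_query(dataset, primary_keys)
    # Keep only documents that differ from the query in at least one field.
    kept = [d for d in collection
            if any(d.get(k) != v for k, v in query.items())]
    removed = len(collection) - len(kept)
    collection[:] = kept
    return removed


collection = [
    {'sku': 'A1', 'lang': 'en'},
    {'sku': 'A1', 'lang': 'de'},
    {'sku': 'B2', 'lang': 'en'},
]
removed = raw_delete(collection, {'sku': 'A1', 'name': 'Chair'}, 'sku')
```

Note that in this sketch an empty list of primary_keys yields an empty query, which matches every document; whether the real step guards against that is not stated in this reference.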