Import steps¶
Steps required to import data from various sources in different formats.
ReadStep¶
class ReadStep(load_class, options=None)[source]¶
Read a raw data source from somewhere.
This step may be configured to read a local file from the hard disk or to download a file from an external service. In either case, it should return an iterable (such as a file-like object) that can be looped over by an ImportStep.
Arguments:
str load_class: A Reader class or class name, see Reader classes.
dict options: Options to be passed down to the reader class.
Added context:
str mode: Will be set to 'import'.
dict read:
  type reader: Will be set to the reader class.
  dict options: The options supplied to the reader class.
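The read contract described above can be sketched as follows. This is a minimal illustration, not the library's actual Reader API; the `FileReader` class and its parameters are invented for the example. The point is only that a read step hands a line iterable (a file-like object) to the next step.

```python
import io


class FileReader:
    """Illustrative reader: yields lines from a local file or an
    in-memory string, mimicking the file-like iterable a ReadStep
    is expected to return."""

    def __init__(self, path=None, text=None):
        self.path = path
        self.text = text

    def read(self):
        # Return an iterable of lines; a real reader might stream an
        # HTTP response from an external service instead.
        if self.path is not None:
            return open(self.path, encoding="utf-8")
        return io.StringIO(self.text)


reader = FileReader(text="id,name\n1,Alice\n")
lines = [line.rstrip("\n") for line in reader.read()]
```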
ImportStep¶
class ImportStep(load_class, options=None)[source]¶
Load raw data as a known format.
After reading in raw data with a ReadStep, this data should be prepared and parsed into one of the supported formats.
Arguments:
str load_class: A Loader class or class name, see Loader classes.
dict options: Options to be passed down to the loader class.
Added context:
None.
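As a hedged sketch of what a loader might do with the reader's iterable (the `load_csv` helper is hypothetical, not a real Loader class), here is CSV parsing into one dict per row, the kind of "known format" downstream steps consume:

```python
import csv
import io


def load_csv(raw_lines):
    """Illustrative loader: parse the reader's line iterable as CSV
    and return one dict per row."""
    return list(csv.DictReader(raw_lines))


# In a pipeline, raw_lines would come from the preceding read step.
rows = load_csv(io.StringIO("id,name\n1,Alice\n2,Bob\n"))
```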
ValueMapStep¶
class ValueMapStep(model, attribute_map, defaults=None, **options)[source]¶
Map values in datasets of an import pipeline to the correct attributes in our model.
__init__(model, attribute_map, defaults=None, **options)[source]¶
Create a new ValueMapStep instance.
- Parameters
model – A str name of a model class in the system, or a ModelImpl instance to be used as a field_map. May also be a dict giving a direct key-to-field map for testing purposes.
attribute_map (dict) –
A dict mapping model fields to import column names. You may specify multiple values using another dict as a value. Import columns may be dot-separated paths in the source.
Examples:
Import the title column from our source to the name property in our model, both as English and German values:
{'name': {'en': 'title', 'de': 'title'}}
Given this multilingual source data structure:
{'fields': {'name': {'de': 'Deutsch', 'en': 'English'}}}
You may import each language to a destination explicitly using this value_map:
{'name': {'en': 'fields.name.en', 'de': 'fields.name.de'}}
Or import it dynamically, using all defined fields:
{'name': 'fields.name'}
Adding multiple values to a field is also supported for multivalue attributes:
{'things': ['fields.thing1', 'fields.thing2', 'fields.thing3']}
defaults (dict) – Default values to use when a value is not found via attribute_map.
kwargs –
Additional options to pass down to fields.
- key null_values: List of values that map to a literal None.
- key true_values: List of values that map to a literal True.
- key false_values: List of values that map to a literal False.
- key datetimeformat: Parse datetime values with this format.
- key dateformat: Parse date values with this format.
- key timeformat: Parse time values with this format.
Added context:
None.
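The dot-path mapping shown in the examples above can be sketched like this. The `resolve` and `map_values` helpers are illustrative only, assuming nested-dict source records; they are not the step's actual implementation.

```python
def resolve(record, path):
    """Follow a dot-separated path like 'fields.name.en' into a nested dict."""
    value = record
    for part in path.split("."):
        value = value[part]
    return value


def map_values(record, attribute_map, defaults=None):
    """Illustrative value mapping: supports per-language dicts as values,
    dot-separated source paths, and defaults for missing values."""
    result = dict(defaults or {})
    for field, source in attribute_map.items():
        try:
            if isinstance(source, dict):
                # e.g. {'en': 'fields.name.en', 'de': 'fields.name.de'}
                result[field] = {lang: resolve(record, path)
                                 for lang, path in source.items()}
            else:
                result[field] = resolve(record, source)
        except KeyError:
            pass  # keep the default, if any


    return result


record = {'fields': {'name': {'de': 'Deutsch', 'en': 'English'}}}
mapped = map_values(
    record, {'name': {'en': 'fields.name.en', 'de': 'fields.name.de'}})
```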
DbWriteStep¶
class DbWriteStep(model, create_new=True, id_field=None, primary_keys=None, classification=None)[source]¶
Write instances to the database or update existing ones.
__init__(model, create_new=True, id_field=None, primary_keys=None, classification=None)[source]¶
Create a new DB writer step.
- Parameters
create_new (bool) – Allow creation of new records. Set to False to only allow updating existing records.
id_field (str) – ID field to use and set as ObjectId from source data.
model (str|type) – Model class name or concrete class.
primary_keys (str|list) – Attribute(s) to use when deciding whether to update or insert a document.
classification (xmm.models.Node) – A Node instance that will be linked as a classification class.
Arguments:
str id_field: Will be removed from the import dataset and used as primary key to update existing instances.
str model: Model name or class to save datasets as.
str|list primary_keys: List of primary keys to use instead of a proper object ID.
Added context:
- dict write:
  type model: Will be the model class used.
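The update-or-insert decision driven by create_new and primary_keys can be sketched against an in-memory store (a hypothetical stand-in for the database; the `write` helper is not the step's real method):

```python
def write(store, dataset, primary_keys, create_new=True):
    """Illustrative upsert: update the document matching all primary_keys,
    otherwise insert a new one -- unless create_new is False."""
    keys = [primary_keys] if isinstance(primary_keys, str) else primary_keys
    for doc in store:
        if all(doc.get(k) == dataset.get(k) for k in keys):
            doc.update(dataset)  # existing record: update in place
            return "updated"
    if create_new:
        store.append(dict(dataset))  # no match: create a new record
        return "created"
    return "skipped"  # create_new=False only updates existing records


db = [{"slug": "a", "title": "old"}]
first = write(db, {"slug": "a", "title": "new"}, "slug")
second = write(db, {"slug": "b", "title": "B"}, "slug", create_new=False)
```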
RawDbWriteStep¶
class RawDbWriteStep(collection_name=None, model=None, **kwargs)[source]¶
Write raw records to a MongoDB collection.
It’s a LOT faster than using the default write method. After using this step, rebuilding the Elasticsearch index might be necessary.
Warning
Updates the first matching document!
__init__(collection_name=None, model=None, **kwargs)[source]¶
Create a new raw DB writer step.
- Parameters
- Key str|list primary_keys: Field(s) to use when deciding whether to update or insert a document.
- Key bool create_new: If False, only allow updating existing documents; the default is to allow upserts.
- Key bool add_metainfo: Add created/updated timestamps and user metadata.
- Key User meta_user: User to use for 'created_by' and 'updated_by' metadata.
- Key bool delete_untouched: At the end, remove all documents from the collection that have not been touched.
Arguments:
str collection_name: Collection to import documents into.
str|list primary_keys: When not empty, upserts are performed using values from these keys in each document.
Added context:
- dict write:
  type collection: Equal to the collection_name option that was used.
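As a sketch of how a raw writer could build a MongoDB filter/update pair from primary_keys (the `build_upsert` helper is hypothetical; a real implementation would pass the result to something like pymongo's update_one with upsert=True), note that the filter may match several documents, which is why only the first match gets updated, as the warning above states:

```python
def build_upsert(document, primary_keys):
    """Illustrative: build the (filter, update) pair for an upsert
    keyed on the primary_keys fields of the document."""
    keys = [primary_keys] if isinstance(primary_keys, str) else primary_keys
    query = {k: document[k] for k in keys}   # filter on primary key values
    return query, {"$set": document}         # $set-style update document


query, update = build_upsert({"sku": "x1", "price": 9}, "sku")
```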
DbDeleteStep¶
Arguments:
str model: Model name or class whose datasets are deleted.
str|list primary_keys: List of keys to build the query.
Added context:
- dict delete:
  type model: Will be the model class used.
RawDbDeleteStep¶
class RawDbDeleteStep(collection_name, primary_keys=None)[source]¶
Deletes raw records from a MongoDB collection.
It’s a LOT faster than using the default delete method. After using this step, rebuilding the Elasticsearch index might be necessary.
Warning
Deletes all matching documents!
Arguments:
str collection_name: Collection to delete documents from.
str|list primary_keys: Query fields used to select the documents to delete.
Added context:
- dict delete:
  type collection: Equal to the collection_name option that was used.
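The "deletes all matching documents" behavior warned about above can be sketched against an in-memory collection (the `delete_matching` helper is illustrative only; a real step would issue something like a delete_many against MongoDB):

```python
def delete_matching(collection, dataset, primary_keys):
    """Illustrative raw delete: drop every document whose primary_keys
    values match the dataset -- all matches are removed."""
    keys = [primary_keys] if isinstance(primary_keys, str) else primary_keys
    query = {k: dataset.get(k) for k in keys}
    # Keep only documents that differ from the query in at least one key.
    collection[:] = [doc for doc in collection
                     if any(doc.get(k) != v for k, v in query.items())]
    return collection


docs = [{"src": "feed", "id": 1},
        {"src": "feed", "id": 2},
        {"src": "other", "id": 3}]
remaining = delete_matching(docs, {"src": "feed"}, "src")
```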