Soong
Soong provides a general-purpose ETL library for data migration.
This is the API reference documentation for Soong, generated from the code using Doxygen.
All components other than Record take a keyed configuration array as their single constructor argument. The component interfaces all inherit from ConfigurableComponent, and currently all concrete configurable component classes inherit from OptionsResolverComponent, which is based on Symfony OptionsResolver. Every component using this base class must implement optionDefinitions() to define the configuration options it accepts.
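As a rough illustration of this configuration pattern, the sketch below shows a component that declares its options in optionDefinitions() and receives a keyed array in its constructor. It is a simplified stand-in in plain PHP, not Soong's actual implementation: the class names SketchComponent and CsvSourceSketch, the option names, the 'required' and 'default' definition keys, and getConfigurationValue() are all assumptions made for this example. The real OptionsResolverComponent builds on Symfony OptionsResolver instead of resolving options by hand.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a configurable base class (not Soong's real one).
abstract class SketchComponent
{
    /** @var array<string, mixed> Resolved configuration values. */
    private array $configuration = [];

    /** @param array<string, mixed> $configuration Keyed configuration array. */
    public function __construct(array $configuration)
    {
        $definitions = $this->optionDefinitions();

        // Reject options the component did not declare.
        $unknown = array_diff_key($configuration, $definitions);
        if ($unknown !== []) {
            throw new InvalidArgumentException(
                'Unknown option(s): ' . implode(', ', array_keys($unknown))
            );
        }

        // Take supplied values, fall back to defaults, and insist on required options.
        foreach ($definitions as $name => $definition) {
            if (array_key_exists($name, $configuration)) {
                $this->configuration[$name] = $configuration[$name];
            } elseif ($definition['required'] ?? false) {
                throw new InvalidArgumentException("Missing required option: $name");
            } else {
                $this->configuration[$name] = $definition['default'] ?? null;
            }
        }
    }

    /**
     * Declares the accepted options. The 'required' and 'default' keys are
     * assumptions of this sketch.
     *
     * @return array<string, array<string, mixed>>
     */
    abstract protected function optionDefinitions(): array;

    public function getConfigurationValue(string $name): mixed
    {
        return $this->configuration[$name] ?? null;
    }
}

// Hypothetical concrete component with one required and one optional setting.
final class CsvSourceSketch extends SketchComponent
{
    protected function optionDefinitions(): array
    {
        return [
            'path' => ['required' => true],
            'delimiter' => ['default' => ','],
        ];
    }
}

$component = new CsvSourceSketch(['path' => 'people.csv']);
echo $component->getConfigurationValue('delimiter'), PHP_EOL; // prints ","
```

Declaring options in one place lets the base class reject misspelled or missing options at construction time, before any data is processed.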
As an ETL framework, the key components of Soong are of course:
- Extractors: extract*() methods produce iterators that deliver one record at a time as a Record instance. They accept configuration to determine where and how to access the source data, including filters (see below) to control which records to process on a given invocation. Being able to tell how many source records are available for migration is very helpful, but for some data sources this is impossible (or at least very slow), so countability is not required by Extractor. Most extractors will want to implement \Countable; a CountableExtractorBase class is provided, which should be a good starting point for most extractors.
- Transformers: A RecordTransformer class accepts a source Record and a (possibly partially populated) result Record and produces a transformed Record. A PropertyTransformer class accepts a value (usually a property from an extractor-produced record) and produces a new value.
- Loaders: Loaders accept one Record instance at a time and load the data it contains into a destination as configured. Note that not all destinations may permit deleting loaded data (e.g., a loader could be used to output a CSV file). The deletion capability (used by rollback operations) should be moved to a separate interface.
The ETL pipeline components need to communicate the data they handle with each other: extractor outputs need to pass through a series of transformers and ultimately into a loader. The canonical representation of such data would be an associative array of arbitrarily-typed values, but rather than require a specific representation it is more flexible to abstract the data.
- Records: The data is abstracted as a Record. In an ETL pipeline, an extractor outputs a Record, which feeds into a sequence of record transformers that ultimately deliver the final record to the loader.
To manage the migration process, we have:
- Tasks: […] Task rather than EtlTask can be used to incorporate these operations. […] migrate, which will:
  1. Use an Extractor instance and iterate over its data set, retrieving one source Record at a time.
  2. Create a destination Record, and execute one or more RecordTransformer instances to derive the destination record from source properties and configuration.
  3. Pass the destination Record to a Loader instance for final disposition.
Finally, we have:
- Filters: A filter accepts a Record and, based on the record's property values and its own configuration, decides whether the record should be processed further. Filters may be configured in the base configuration of an extractor (to help define the canonical source data to be migrated), or injected at run time (to, say, process a single specific record for debugging).
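To show how these pieces fit together, here is a minimal, self-contained sketch of the flow described above: an extractor yields records one at a time, filters decide which records to process, record transformers build the destination record, and a loader receives it. Every name in it (RecordSketch, extractFiltered, migrateSketch, and the closures) is hypothetical and chosen for illustration. None of it is Soong's actual API; the real interfaces are documented in the class reference.

```php
<?php

declare(strict_types=1);

// Hypothetical record abstraction: hides the underlying associative array.
final class RecordSketch
{
    /** @param array<string, mixed> $properties */
    public function __construct(private array $properties = [])
    {
    }

    public function getPropertyValue(string $name): mixed
    {
        return $this->properties[$name] ?? null;
    }

    public function setPropertyValue(string $name, mixed $value): void
    {
        $this->properties[$name] = $value;
    }

    /** @return array<string, mixed> */
    public function toArray(): array
    {
        return $this->properties;
    }
}

// Hypothetical extractor: yields one record at a time, skipping any record
// that a filter rejects.
function extractFiltered(array $rows, array $filters): Generator
{
    foreach ($rows as $row) {
        $record = new RecordSketch($row);
        foreach ($filters as $filter) {
            if (!$filter($record)) {
                continue 2; // Rejected: move on to the next source row.
            }
        }
        yield $record;
    }
}

// Hypothetical migrate operation: iterate the source, run each transformer
// against a fresh destination record, then hand the result to the loader.
function migrateSketch(iterable $source, array $transformers, callable $loader): void
{
    foreach ($source as $sourceRecord) {
        $result = new RecordSketch();
        foreach ($transformers as $transform) {
            $result = $transform($sourceRecord, $result);
        }
        $loader($result);
    }
}

$rows = [
    ['id' => 1, 'name' => 'ada'],
    ['id' => 2, 'name' => 'grace'],
];

// Run-time filter that selects a single record, e.g. for debugging.
$onlyIdTwo = fn (RecordSketch $r): bool => $r->getPropertyValue('id') === 2;

// Record transformers: each receives the source and the partial result.
$copyId = function (RecordSketch $src, RecordSketch $dst): RecordSketch {
    $dst->setPropertyValue('id', $src->getPropertyValue('id'));
    return $dst;
};
$upperName = function (RecordSketch $src, RecordSketch $dst): RecordSketch {
    $dst->setPropertyValue('display_name', strtoupper((string) $src->getPropertyValue('name')));
    return $dst;
};

// Loader stand-in: collects records in memory instead of writing them out.
$loaded = [];
$loader = function (RecordSketch $r) use (&$loaded): void {
    $loaded[] = $r->toArray();
};

migrateSketch(extractFiltered($rows, [$onlyIdTwo]), [$copyId, $upperName], $loader);
echo json_encode($loaded), PHP_EOL; // prints [{"id":2,"display_name":"GRACE"}]
```

Because the extractor is a generator, records are produced lazily, so only one source record needs to be held in memory at a time.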