Soong
Soong provides a general-purpose ETL library for data migration.
This is the API reference documentation for Soong, generated from the code using Doxygen.
All components other than Record take a keyed configuration array as their single constructor argument. The component interfaces all inherit from ConfigurableComponent, and at present all concrete configurable component classes inherit from OptionsResolverComponent, which is based on the Symfony OptionsResolver component. All components using this base class must implement optionDefinitions() to define the configuration options they accept.
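Soong itself is PHP, so the following is only a rough, language-agnostic sketch of that pattern in Python — the class and method names below are hypothetical analogues, not Soong's actual API. A component declares the options it accepts, and the base class resolves the supplied configuration against those definitions (applying defaults, rejecting unknown keys) at construction time:

```python
# Illustrative sketch only: Soong is a PHP library built on Symfony
# OptionsResolver; the names below are hypothetical Python analogues.

class OptionsResolverComponent:
    """Base class: resolves a keyed configuration array on construction."""

    def __init__(self, configuration: dict):
        definitions = self.option_definitions()  # analogous to optionDefinitions()
        unknown = set(configuration) - set(definitions)
        if unknown:
            raise ValueError(f"Unknown options: {sorted(unknown)}")
        resolved = {}
        for name, spec in definitions.items():
            if name in configuration:
                resolved[name] = configuration[name]
            elif "default" in spec:
                resolved[name] = spec["default"]  # optional: fall back to default
            else:
                raise ValueError(f"Missing required option: {name}")
        self.configuration = resolved

    def option_definitions(self) -> dict:
        return {}


class CsvExtractor(OptionsResolverComponent):
    """Hypothetical extractor declaring the options it accepts."""

    def option_definitions(self) -> dict:
        return {
            "file": {},                      # required: no default supplied
            "delimiter": {"default": ","},   # optional
        }


extractor = CsvExtractor({"file": "people.csv"})
print(extractor.configuration)  # {'file': 'people.csv', 'delimiter': ','}
```

Centralizing resolution in the base class means each concrete component only has to state its options, and misconfiguration fails fast at construction rather than mid-migration.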
As an ETL framework, the key components of Soong are of course:
Extractor: The extract*() methods produce iterators to deliver one record at a time as a Record instance. Extractors accept configuration to determine where and how to access the source data, including filters (see below) to control which records to process on a given invocation. Being able to tell how many source records are available for migration is very helpful, although on occasion a data source may make counting impossible (or at least very slow); therefore, countability is not required by Extractor. Most extractors will want to implement \Countable (a CountableExtractorBase class is provided which should be a good starting point for most extractors).

Transformer: A RecordTransformer
class accepts a source Record and a (possibly partially populated) result Record and produces a transformed Record. A PropertyTransformer class accepts a value (usually a property from an extractor-produced record) and produces a new value.

Loader: A Loader accepts one Record
instance at a time and loads the data it contains into a destination as configured. Note that not all destinations may permit deleting loaded data (e.g., a loader could be used to output a CSV file); the deletion capability (used by rollback operations) should be moved to a separate interface.

Record: The ETL pipeline components need to communicate the data they handle with each other — extractor outputs need to pass through a series of transformers and ultimately into a loader. The canonical representation of such data would be an associative array of arbitrarily-typed values, but rather than requiring a specific representation, it is more flexible to abstract the data as a Record. In the context of an ETL pipeline, an extractor outputs a Record, which feeds through a sequence of record transformers to ultimately deliver the final record to the loader.

To manage the migration process, we have:
Task: An ETL process may involve operations other than the migration itself (rollback, for example); implementing Task rather than EtlTask can be used to incorporate such operations. The central task is migrate, which will:

1. Take its Extractor instance and iterate over its data set, retrieving one source Record at a time.
2. Create a result Record, and execute one or more RecordTransformer instances to derive the destination record from source properties and configuration.
3. Pass the resulting Record to a Loader instance for final disposition.

Finally, we have:
Filter: A Filter accepts a Record and, based on the record's property values and its own configuration, decides whether the record should be further processed. Filters may be configured in the base configuration of an extractor (to help define the canonical source data to be migrated), or injected at run time (to, say, process a single specific record for debugging).
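The components above fit together as a loop driven by the migrate task: extract, filter, transform, load. The following is a compact, language-agnostic sketch of that flow in Python — every class and function name here is a hypothetical stand-in for Soong's PHP interfaces, not its actual API:

```python
# Illustrative sketch only: a minimal Python analogue of the pipeline
# described above (extract -> filter -> transform -> load).

from typing import Iterator


class Record:
    """Abstracts a keyed set of arbitrarily-typed property values."""

    def __init__(self, properties=None):
        self.properties = dict(properties or {})


class ListExtractor:
    """Countable extractor over an in-memory list (stand-in for a real source)."""

    def __init__(self, rows):
        self.rows = rows

    def extract(self) -> Iterator[Record]:
        for row in self.rows:
            yield Record(row)

    def __len__(self) -> int:  # countability is optional but helpful
        return len(self.rows)


def active_filter(record: Record) -> bool:
    """Filter analogue: decide whether a record should be further processed."""
    return record.properties.get("status") == "active"


def upper_name(value):
    """PropertyTransformer analogue: one value in, new value out."""
    return value.upper()


def name_transformer(source: Record, result: Record) -> Record:
    """RecordTransformer analogue: source + partial result -> transformed result."""
    result.properties["name"] = upper_name(source.properties["name"])
    return result


class ListLoader:
    """Loader analogue: accepts one Record at a time."""

    def __init__(self):
        self.loaded = []

    def load(self, record: Record) -> None:
        self.loaded.append(record.properties)


def migrate(extractor, filters, record_transformers, loader):
    """Task analogue: the migrate operation's extract/transform/load loop."""
    for source in extractor.extract():
        if not all(f(source) for f in filters):
            continue  # filtered out: not further processed
        result = Record()  # partially populated as transformers run
        for transform in record_transformers:
            result = transform(source, result)
        loader.load(result)


rows = [
    {"name": "ada", "status": "active"},
    {"name": "bob", "status": "retired"},
]
loader = ListLoader()
migrate(ListExtractor(rows), [active_filter], [name_transformer], loader)
print(loader.loaded)  # [{'name': 'ADA'}]
```

Note how the filter excludes the "retired" row before any transformation happens, and how the result Record is built up by the transformer chain rather than mutated in the source — mirroring the source/result Record split described for RecordTransformer above.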