Soong
Soong provides a general-purpose ETL library for data migration.
API reference

This is the API reference documentation for Soong, generated from the code using Doxygen.

All components other than Record take a keyed configuration array as their single constructor argument. The component interfaces all inherit from ConfigurableComponent, and at present every concrete configurable component class inherits from OptionsResolverComponent, which is built on the Symfony OptionsResolver component. All components using this base class must implement optionDefinitions() to define the configuration options they accept.
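The pattern above can be sketched in miniature. The class and option names below are hypothetical, and the hand-rolled defaults/required handling merely stands in for what Soong's real base class delegates to Symfony OptionsResolver:

```php
<?php
// A simplified sketch of the ConfigurableComponent pattern: a keyed
// configuration array passed to the constructor, with the accepted options
// declared by optionDefinitions(). Not Soong's actual classes.

abstract class SketchConfigurableComponent
{
    protected array $configuration;

    public function __construct(array $configuration)
    {
        // Merge in defaults (supplied values win), then check required options.
        $definitions = $this->optionDefinitions();
        $this->configuration = $configuration + ($definitions['defaults'] ?? []);
        foreach ($definitions['required'] ?? [] as $name) {
            if (!array_key_exists($name, $this->configuration)) {
                throw new InvalidArgumentException("Missing required option '$name'.");
            }
        }
    }

    // Each concrete component declares the options it accepts.
    abstract protected function optionDefinitions(): array;
}

// Hypothetical component accepting a 'file' option plus a defaulted delimiter.
class SketchCsvComponent extends SketchConfigurableComponent
{
    protected function optionDefinitions(): array
    {
        return [
            'required' => ['file'],
            'defaults' => ['delimiter' => ','],
        ];
    }

    public function getConfiguration(): array
    {
        return $this->configuration;
    }
}

$component = new SketchCsvComponent(['file' => 'people.csv']);
```

In the real base class, the same declarations also gain type checking and normalization via OptionsResolver's setAllowedTypes() and related methods.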

As an ETL framework, the key components of Soong are of course:

  • Extractors: Extractors read data from a source data store and, via extract*() methods, produce iterators that deliver one record at a time as a Record instance. They accept configuration to determine where and how to access the source data, including filters (see below) to control which records to process on a given invocation. Being able to tell how many source records are available for migration is very helpful, but some data sources make counting impossible (or at least very slow) - therefore, countability is not required by Extractor. Most extractors will want to implement \Countable; a CountableExtractorBase class is provided as a good starting point.
  • Transformers: A RecordTransformer class accepts a source Record and a (possibly partially populated) result Record and produces a transformed Record. A PropertyTransformer class accepts a value (usually a property from an extractor-produced record) and produces a new value.
  • Loaders: Loaders accept one Record instance at a time and load the data it contains into a destination as configured. Note that not all destinations permit deleting loaded data (e.g., a loader could be used to output a CSV file); the deletion capability (used by rollback operations) should be moved to a separate interface.
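The three roles can be sketched with plain PHP, using a generator as the extractor's iterator, a closure as a property transformer, and an array as the load destination. These are illustrative stand-ins, not Soong's real interfaces:

```php
<?php
// Minimal stand-ins for the three pipeline roles described above.

// Extractor: delivers one record at a time via an iterator.
function extractRows(array $source): Generator
{
    foreach ($source as $row) {
        yield $row; // each $row stands in for a Record instance
    }
}

// PropertyTransformer: derives a new value from a single property value.
$uppercase = fn(string $value): string => strtoupper($value);

// Loader: accepts one record at a time and stores it in a destination.
$destination = [];
foreach (extractRows([['name' => 'ada'], ['name' => 'grace']]) as $record) {
    $record['name'] = $uppercase($record['name']);
    $destination[] = $record;
}
```

Because the extractor is an iterator rather than a loaded array, arbitrarily large sources can be processed one record at a time.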

The ETL pipeline components need to pass the data they handle to one another - extractor output passes through a series of transformers and ultimately into a loader. The canonical representation of such data would be an associative array of arbitrarily-typed values, but rather than requiring a specific representation it is more flexible to abstract the data.

  • Record: A data record (a set of named values, which could be of any type) is represented by Record. In the context of an ETL pipeline, an extractor outputs a Record which feeds through a sequence of record transformers to ultimately deliver the final record to the loader.
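A minimal stand-in for Record might look like the class below. The method names (getProperty(), setProperty()) are hypothetical and only illustrate the role such an abstraction plays; the real interface may differ:

```php
<?php
// A sketch of a data record: a set of named, arbitrarily-typed values.
// Hypothetical API, for illustration only.

class SketchRecord
{
    public function __construct(private array $properties = []) {}

    public function getProperty(string $name): mixed
    {
        return $this->properties[$name] ?? null;
    }

    public function setProperty(string $name, mixed $value): void
    {
        $this->properties[$name] = $value;
    }

    public function toArray(): array
    {
        return $this->properties;
    }
}

// A transformer-style step: copy a source property into a result record.
$source = new SketchRecord(['id' => 1, 'name' => 'Ada']);
$result = new SketchRecord();
$result->setProperty('full_name', $source->getProperty('name'));
```

Abstracting records behind an interface like this, rather than mandating associative arrays, lets implementations wrap other structures (database rows, API payloads) without copying.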

To manage the migration process, we have:

  • Task: A named object controlling the execution of operations according to its configuration. Most tasks will be ETL tasks, designed to migrate data, but the overall migration process may require some non-ETL housekeeping tasks (like moving files around) - classes derived from Task rather than EtlTask can be used to incorporate these operations.
  • EtlTask: A Task specifically designed to perform operations on data using extractors, transformers, and loaders. The most important operation is migrate, which will:
    1. Invoke an Extractor instance and iterate over its data set, retrieving one source Record at a time.
    2. Create a destination Record, and execute one or more RecordTransformer instances to derive the destination record from source properties and configuration.
    3. Pass the destination Record to a Loader instance for final disposition.
  • TaskPipeline: Manages a list of Tasks.
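The three migrate steps above can be sketched as a loop, with plain arrays standing in for Record instances and closures standing in for the component classes. This is a conceptual outline, not the real EtlTask implementation:

```php
<?php
// The migrate operation in miniature: extract, transform, load.

// Step 1: an extractor iterating over its data set, one record at a time.
$extract = function (): Generator {
    yield ['id' => 1, 'title' => 'hello world'];
    yield ['id' => 2, 'title' => 'soong'];
};

// Step 2: a record transformer deriving the destination record
// from source properties.
$transform = function (array $source): array {
    return ['slug' => str_replace(' ', '-', $source['title'])];
};

// Step 3: a loader giving each destination record its final disposition.
$loaded = [];
$load = function (array $destination) use (&$loaded): void {
    $loaded[] = $destination;
};

foreach ($extract() as $sourceRecord) {
    $load($transform($sourceRecord));
}
```

In Soong itself these roles are filled by configured Extractor, RecordTransformer, and Loader components rather than closures, but the control flow of migrate follows this shape.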

Finally, we have:

  • KeyMap: Storage of the relationships between extracted and loaded records (based on the designated unique keys for each). This enables maintaining relationships between keyed records when the keys change during migration (as when loading into an auto-increment SQL table), as well as providing rollback and auditing capabilities. This component is optional - you may implement ETL processes without tracking the keys being processed.
  • Filter: A filter accepts a Record and, based on the record's property values and its own configuration, decides whether the record should be processed further. Filters may be configured in the base configuration of an extractor (to help define the canonical source data to be migrated), or injected at run time (to, say, process a single specific record for debugging).
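Both optional components can be sketched together: a key map as a mapping from source keys to destination keys, and a filter as a predicate applied before a record is processed. The names and the auto-increment simulation below are illustrative assumptions, not Soong's actual API:

```php
<?php
// Sketches of the two optional helpers described above.

// KeyMap stand-in: source key => destination key, as recorded when loading
// into a destination that assigns its own keys (e.g. an auto-increment table).
$keyMap = [];

// Filter stand-in: a predicate over a record's property values.
$minimumId = 10;
$filter = fn(array $record): bool => $record['id'] >= $minimumId;

$nextDestinationId = 100; // simulated auto-increment destination key
foreach ([['id' => 5], ['id' => 12], ['id' => 40]] as $record) {
    if (!$filter($record)) {
        continue; // filtered out before transformation/loading
    }
    $keyMap[$record['id']] = $nextDestinationId++;
}
```

With the source-to-destination key pairs recorded, a rollback can look up and delete exactly the destination records a task created, and lookups between related migrated records remain possible even though the keys changed.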