Soong

Soong provides a general-purpose ETL library for data migration.

API reference

This is the API reference documentation for Soong, generated from the code using Doxygen.

All components are created with the static create() method rather than with new. For example, instead of

$dataRecord = new Record();
$dataRecord->fromArray(['foo' => 1, 'bar' => 2]);
$extractor = new ArrayExtractor($configuration);

you must do

$dataRecord = Record::create(['foo' => 1, 'bar' => 2]);
$extractor = ArrayExtractor::create($configuration);

At the moment, configuration is represented as a simple keyed array. We anticipate adopting an external library to provide configuration handling services before long.
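
As an illustration, such a keyed array is simply passed to a component's create() method. The following is a minimal sketch only - the 'data' key is a hypothetical example, not necessarily part of ArrayExtractor's documented configuration schema.

// Hypothetical keyed configuration array for ArrayExtractor - the 'data' key
// is an assumption for illustration, not a documented configuration key.
$configuration = [
    'data' => [
        ['id' => 1, 'title' => 'First article'],
        ['id' => 2, 'title' => 'Second article'],
    ],
];
$extractor = ArrayExtractor::create($configuration);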

As an ETL framework, Soong's key components are of course:

  • Extractors: Extractors read data from a source data store and, via extract*() methods, produce iterators which deliver one record at a time as a DataRecord instance. They accept static configuration to determine where and how to access the source data, and runtime options to control which records to process on a given invocation. Being able to tell how many source records are available for migration is very helpful, but for some data sources counting is impossible (or at least very slow) - therefore, countability is not required by ExtractorInterface. Most extractors will want to implement \Countable; a CountableExtractorBase class is provided which should be a good starting point for most of them.
  • Transformers: A Transformer class accepts a value (usually a property from an extractor-produced record) and produces a new value; a minimal sketch follows this list.
  • Loaders: Loaders accept one DataRecord instance at a time and load the data it contains into a destination as configured. Note that not all destinations permit deleting loaded data (e.g., a loader could be used to output a CSV file) - the deletion capability (used by rollback operations) should be moved to a separate interface.
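
To make the Transformer contract concrete, here is a minimal sketch of an uppercasing transformer, following the create() convention described above. It is shown standalone so the sketch stays self-contained; in Soong it would implement the library's transformer interface, whose exact name and transform() signature are assumptions here rather than the documented contract.

// Minimal sketch only - in Soong this would implement the library's
// transformer interface; the transform() signature here is an assumption.
class UppercaseTransformer
{
    protected $configuration;

    // Components are constructed via the static create() method.
    public static function create(array $configuration = [])
    {
        return new static($configuration);
    }

    protected function __construct(array $configuration)
    {
        $this->configuration = $configuration;
    }

    // Accept a single value and produce a new value derived from it.
    public function transform($value)
    {
        return is_string($value) ? strtoupper($value) : $value;
    }
}

$upper = UppercaseTransformer::create()->transform('soong');  // 'SOONG'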

The ETL pipeline components need to pass the data they handle among themselves (extractor output passes through a series of transformers and ultimately into a loader). The canonical representation of such data would be an associative array of arbitrarily-typed values, but rather than requiring a single representation, it is more flexible to abstract the data.

  • DataProperty: Represents a value, which may be any scalar, array, or object type - including DataPropertyInterface. Implementations of DataProperty should be immutable: the value is set at construction time and may not subsequently be changed (see the sketch following this list).
  • DataRecord: A data record (a set of named DataProperty instances) is represented by DataRecordInterface. In the context of an ETL pipeline, an extractor outputs a DataRecordInterface as input to transformers, and the transformation process populates another DataRecordInterface instance one property at a time to ultimately pass to a loader.
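
The immutability constraint on DataProperty can be illustrated with a bare-bones sketch: the value is supplied exactly once, at construction, and no setter exists. The class name and the getValue() accessor below are assumptions for illustration, not the actual Soong implementation.

// Bare-bones illustration of the immutability rule - not the actual Soong
// DataProperty class; the getValue() accessor name is an assumption.
class ImmutableProperty
{
    protected $value;

    public static function create($value)
    {
        return new static($value);
    }

    protected function __construct($value)
    {
        // The value (scalar, array, or object) is fixed at construction time.
        $this->value = $value;
    }

    public function getValue()
    {
        return $this->value;
    }

    // No setter is provided, so the value cannot be changed after construction.
}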

To manage the migration process, we have:

  • Task: A named object controlling the execution of operations according to its configuration. Most tasks will be ETL tasks, designed to migrate data, but the overall migration process may require some non-ETL housekeeping tasks (like moving files around) - classes derived from Task rather than EtlTask can be used to incorporate these operations.
  • EtlTask: A Task specifically designed to perform an ETL operation in the following manner (a sketch of this loop follows the list):
    1. Invoke an Extractor instance and iterate over its data set, retrieving one source DataRecord at a time.
    2. Create a destination DataRecord, and for each property to be stored in this record, execute one or more Transformer instances to derive the destination property from source properties and configuration.
    3. Pass the destination DataRecord to a Loader instance for final disposition.
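
The following pseudocode-style sketch shows the shape of that loop. The method names extractAll(), getProperty(), and load(), and the $propertyMap structure, are assumptions for illustration; the actual signatures are defined by the interfaces documented here.

// Pseudocode-style sketch of the EtlTask loop; $extractor, $propertyMap, and
// $loader are assumed to be configured already, and the method names below
// are illustrative rather than the actual Soong API.
foreach ($extractor->extractAll() as $sourceRecord) {
    $destinationProperties = [];
    foreach ($propertyMap as $destinationName => $pipeline) {
        // Start from the configured source property...
        $value = $sourceRecord->getProperty($pipeline['source']);
        // ...and run it through one or more transformers.
        foreach ($pipeline['transformers'] as $transformer) {
            $value = $transformer->transform($value);
        }
        $destinationProperties[$destinationName] = $value;
    }
    // Package the transformed values and hand the record to the loader.
    $loader->load(Record::create($destinationProperties));
}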

Finally, we have:

  • KeyMap: Stores the relationships between extracted and loaded records (based on the designated unique keys for each). This enables maintaining relationships between keyed records when keys change during migration (as when loading into an auto-increment SQL table), as well as providing rollback and auditing capabilities. A minimal sketch of the idea follows.
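
For each migrated record, a key map stores the source key alongside the destination key it was assigned, so later lookups (for maintaining relationships, rollback, or auditing) can translate one into the other. The class and method names below are illustrative, not Soong's actual KeyMap API.

// Illustrative in-memory key map - not the actual Soong KeyMap implementation;
// the class and method names are assumptions.
class InMemoryKeyMap
{
    protected $map = [];

    // Record that the source record with key $sourceKey was loaded under
    // destination key $destinationKey (e.g., an auto-increment ID).
    public function saveKeys($sourceKey, $destinationKey)
    {
        $this->map[$sourceKey] = $destinationKey;
    }

    // Look up the destination key previously assigned to a source key, e.g.,
    // to resolve a reference between records or to roll back a loaded record.
    public function lookupDestinationKey($sourceKey)
    {
        return $this->map[$sourceKey] ?? null;
    }
}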