Welcome to Soong

Overview

Soong is a framework for building robust Extract-Transform-Load (ETL) applications for performing data migration. It is designed to be record-oriented and configuration-driven - many applications will require little or no custom PHP code, and tools can easily customize (or generate) data migration processes implemented by Soong.

The core of Soong is a set of interfaces defining a highly decoupled architecture. Each component (other than EtlTask) may be used standalone, but is primarily designed to be part of an Extract→Transform→Load pipeline for processing sets of data one record at a time. At this stage of development, the main repository also includes basic implementations sufficient to build simple ETL applications and examples - before V1.0.0 is released, most implementations will be moved to separate repositories.

Installation

With Composer installed globally on your system, at the command line enter:

composer require soong/soong

Soong comes with some default Symfony Console commands to run your task configurations, invoked from bin/soong, e.g.:

bin/soong migrate beeraccounts

To autoload Soong classes in your custom CLI or web application:

require_once 'vendor/autoload.php';

Example

A migration task is represented by an EtlTask object constructed with a configuration array, which itself contains arrays of configuration for its various components (extractors, transformers, loaders, key maps). For many migration applications, existing components will be sufficient and all you will need to do is set up the configuration.

Here is an example configuration for a migration task, represented in YAML:

# Name of the task, used to reference it in commands.
arraytosql:
    # Concrete Task class to instantiate and invoke.
    class: Soong\Task\EtlTask
    # Configuration passed to the Task class at creation time.
    configuration:
        # The KeyMap component stores the mappings from source record keys to
        # result record keys.
        key_map:
            class: Soong\KeyMap\DBAL
            # Configuration for this KeyMap class - since we're using a SQL-based
            # KeyMap, this contains DB connection info and a table name to use.
            configuration:
                connection:
                    # Replace with your test database credentials.
                    dbname: etltemp
                    user: root
                    host: 127.0.0.1
                    port: 3306
                    driver: pdo_mysql
                table: map_arraytosql
        # The Extractor which will provide the data.
        extract:
            # The specific Extractor we use accepts the source data as a keyed
            # array within its configuration.
            class: Soong\Extractor\ArrayExtractor
            configuration:
                # Within the source data, the unique key is named "id" and is an integer.
                # The KeyMap uses this information to create a map table and populate it.
                key_properties:
                    id:
                        type: integer
                # The data we're importing - an array of keyed arrays. The keys of
                # each record array are the source property names.
                data:
                    -
                        id: 1
                        sourcefoo: first record
                        bar: description of first record
                        num: 1
                    -
                        id: 5
                        sourcefoo: second record
                        bar: description of second record
                        num: 2
                        related: 0
                    -
                        id: 8
                        sourcefoo: third record
                        bar: description of third record
                        num: 38
                        related: 1
                    -
                        id: 12
                        sourcefoo: bogus
                        bar: we should skip this
                        num: -5
                # Filters can be used to narrow down the raw data. In this example, we
                # use a Select filter to skip bogus records.
                filters:
                    -
                        class: Soong\Filter\Select
                        configuration:
                            criteria:
                                -
                                    - sourcefoo
                                    - <>
                                    - bogus
        # The transformation stage passes each extracted Record through a series
        # of record transformers to build a result Record to pass to the Loader.
        transform:
            -
                # The Copy record transformer copies properties as-is from the
                # extracted Record to the result record. With no configuration,
                # it copies all properties.
                class: Soong\Transformer\Record\Copy
                configuration:
                    # Use the include option to copy specific properties.
                    include:
                        - bar
                        - num
                    # Use the exclude option to copy all properties except those
                    # specified.
                    # This 'exclude' definition has exactly the same effect as
                    # the above 'include':
#                    exclude:
#                        - id
#                        - sourcefoo
#                        - related
            -
                # This is the most important record transformer - it populates
                # each of the result properties (the keys in the property_map).
                class: Soong\Transformer\Record\PropertyMapper
                configuration:
                    property_map:
                        # The canonical form for a property mapping is for each
                        # result property name to contain an array of property
                        # transformer definitions, each specifying at least the
                        # property transformer class, and usually the name of a
                        # source property to pass to the transformer.
                        foo:
                            -
                                # This transformer simply returns its source
                                # value as-is. Think of it as being equivalent
                                # to the PHP statement
                                # $result['foo'] = $source['sourcefoo'];
                                class: Soong\Transformer\Property\Copy
                                source_property: sourcefoo
                        # The Copy transformer is so common, it's the default
                        # when we simply map a result property from a source
                        # property. In this case, we don't actually need to map
                        # these properties here because the Copy record
                        # transformer above has already done it.
#                        bar: bar
#                        num: num
                        # The related property is the ID of a related record in
                        # the same data set. The IDs are changing in this
                        # migration, so to maintain the relationship we need to
                        # rewrite ID references.
                        related:
                            -
                                # The KeyMapLookup transformer accepts a key value from the source
                                # data, looks that up in the specified KeyMap to see what the ID of
                                # the corresponding destination data record is, and returns that ID.
                                # Note: This works for our dataset where the related values only reference
                                # already-migrated keys - handling chicken-and-egg problems is not yet
                                # implemented.
                                class: Soong\Transformer\Property\KeyMapLookup
                                source_property: related
                                configuration:
                                    key_map:
                                        task_id: arraytosql
        # The DBAL Loader class loads the resulting data records into a DB table.
        load:
            class: Soong\Loader\DBAL
            configuration:
                connection:
                    # Replace with your test database credentials.
                    dbname: etltemp
                    user: root
                    host: 127.0.0.1
                    port: 3306
                    driver: pdo_mysql
                # Name of the table to populate.
                table: extractsource
                # The destination table's primary key column is "uniqueid". In
                # our scenario, it's an auto-increment column - the task will
                # retrieve the newly-created key to store in the KeyMap.
                key_properties:
                    uniqueid:
                        type: integer
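
To make the flow through this configuration concrete, here is a minimal, self-contained PHP sketch of what happens to each record. It does not use Soong itself: plain arrays stand in for records, and the functions passesCriteria() and migrate() are hypothetical names invented for this illustration. It also assumes that the destination table starts empty (so auto-increment keys start at 1) and that a missing or unmapped related value becomes null, neither of which is confirmed by the configuration above.

```php
<?php
// Illustrative only: plain PHP arrays stand in for Soong's Record, Filter,
// Transformer, KeyMap, and Loader components. None of these functions are
// Soong APIs.

// Source rows copied from the example configuration.
$source = [
    ['id' => 1, 'sourcefoo' => 'first record', 'bar' => 'description of first record', 'num' => 1],
    ['id' => 5, 'sourcefoo' => 'second record', 'bar' => 'description of second record', 'num' => 2, 'related' => 0],
    ['id' => 8, 'sourcefoo' => 'third record', 'bar' => 'description of third record', 'num' => 38, 'related' => 1],
    ['id' => 12, 'sourcefoo' => 'bogus', 'bar' => 'we should skip this', 'num' => -5],
];

/** Hypothetical stand-in for the Select filter: every criterion must hold. */
function passesCriteria(array $record, array $criteria): bool
{
    foreach ($criteria as [$property, $operator, $value]) {
        $actual = $record[$property] ?? null;
        switch ($operator) {
            case '=':
                $matches = ($actual == $value);
                break;
            case '<>':
                $matches = ($actual != $value);
                break;
            default:
                throw new InvalidArgumentException("Unsupported operator: $operator");
        }
        if (!$matches) {
            return false;
        }
    }
    return true;
}

/** Runs the extract, transform, and load steps over in-memory arrays. */
function migrate(array $source): array
{
    $keyMap = [];      // source id => destination uniqueid
    $destination = []; // destination uniqueid => loaded row
    $nextId = 1;       // simulates auto-increment, assuming an empty table

    foreach ($source as $record) {
        // Extract stage: the filter skips records whose sourcefoo is "bogus".
        if (!passesCriteria($record, [['sourcefoo', '<>', 'bogus']])) {
            continue;
        }
        // Copy record transformer with include: [bar, num].
        $result = array_intersect_key($record, array_flip(['bar', 'num']));
        // Property mapping: foo is copied from sourcefoo.
        $result['foo'] = $record['sourcefoo'];
        // Property mapping: related is rewritten through the key map.
        // Assumption: a missing or unmapped source key yields null.
        $result['related'] = isset($record['related'])
            ? ($keyMap[$record['related']] ?? null)
            : null;
        // Load stage: store the row, then remember its new key.
        $destination[$nextId] = $result;
        $keyMap[$record['id']] = $nextId;
        $nextId++;
    }
    return [$destination, $keyMap];
}

[$destination, $keyMap] = migrate($source);
```

In this sketch the key map is filled as each row is loaded, which is why a related value can only be resolved when it points at a record that has already been migrated.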

The soong/soong repo contains working examples - please see the README for details on running them.

API reference

All we really need here is the title above (for the TOC), because api.html will get swapped in to redirect to the doxygen-generated API documentation.

Contributing

Contributions are welcome and will be fully credited. There’s still a lot of refinement to be done to Soong - this is your opportunity to get involved with a new framework (and community) on the ground floor! As mentioned above, the plan is ultimately to break out components into small well-contained libraries - these will be excellent opportunities to get your feet wet maintaining your own open-source project. Mike Ryan will be happy to help mentor new contributors.

There’s plenty of work already identified in the GitLab issue queue. Feel free to browse, ask questions, and offer your own insights - or, if you have a migration itch you’d like to scratch and don’t see an existing issue, open a new one.

Working on issues

  1. If you have an issue you’d like to work on, assign it to yourself.
  2. If you haven’t already, fork the project to your account.
  3. Create a feature branch in your fork. Recommended branch name is <gitlab issue #>-<dash-separated-issue-title>
  4. Develop your solution locally. Be sure to:
    • Make sure your changes are fully tested (see below).
    • Make sure your changes are fully documented (see below).
    • Follow the PSR-2 Coding Standard. Check the code style with $ composer check-style - many issues can be automatically fixed with $ composer fix-style. The only complaints you should see from check-style are long lines in tests.
  5. Make sure each individual commit in your merge request is meaningful. If you had to make multiple intermediate commits while developing, please squash them before submitting.
  6. Commits should reference the issue number - e.g., a commit for the issue “Add community docs up front” might have the commit message “#51: Expand community documentation and move to docs directory.”
  7. On GitLab, create a merge request and submit it.

Tests

Automated tests are critical, especially when code is changing rapidly. They help ensure that changes don’t produce unexpected consequences, and give confidence that a new piece of code does what it’s expected to do. In the Soong tests directory, you’ll find existing tests laid out in parallel with the src directory. Of particular note is tests/Contracts - while interfaces can’t be tested directly (since they have no behavior of their own), we do provide base classes here which you should extend for the tests of your components. These base classes verify that your components meet the documented expectations of the interfaces, so in writing tests you can focus on the specific features added by your own code.

To run the test suite locally:

$ composer test

Documentation

  • Classes and methods are to be fully documented in comment blocks - these are used to automatically generate the API Reference section of the online documentation.
  • Add any non-trivial changes you’ve made to the CHANGELOG.
  • Review README.md and any .md files under docs to see if any changes need to be made there.

Release checklist

Process for tagging a new release:

  1. Review the changes going into the release at https://gitlab.com/soongetl/soong/compare/x.y.z...master (where x.y.z is the previous release).
  2. Make sure all significant changes are reflected in CHANGELOG.
  3. Make sure the documentation is up-to-date with all changes.
  4. Run tests and quality reports (TBD, scrutinizer-ci.com has incomplete GitLab integration) and make sure there are no errors, or regressions in coverage or quality scores.
  5. Review all issues labelled for the next release (e.g., https://gitlab.com/soongetl/soong/issues?label_name%5B%5D=0.7.0) and triage them: Are there any we should complete first? Any which should be deprioritized? Change the tags on those that remain to the next release.
  6. Review all @todo tags in the code. Create issues in GitLab for any which are still relevant, then remove all the @todo tags from the code.
  7. Add a new heading ## [0.6.0] - 2019-05-01 (replacing 0.6.0 with the new release number and 2019-05-01 with the release date) to CHANGELOG below [Unreleased]
  8. Add a link for the new release at the bottom of CHANGELOG, and update the [Unreleased] link to reflect the new release number.
  9. Make sure any changes made in the preceding steps are merged into master.
  10. Create the new tag.

Contributor Code of Conduct

Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to make participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.

Our Standards

Examples of behavior that contributes to creating a positive environment include:

  • Using welcoming and inclusive language
  • Being respectful of differing viewpoints and experiences
  • Gracefully accepting constructive criticism
  • Focusing on what is best for the community
  • Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

  • The use of sexualized language or imagery and unwelcome sexual attention or advances
  • Trolling, insulting/derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others’ private information, such as a physical or electronic address, without explicit permission
  • Other conduct which could reasonably be considered inappropriate in a professional setting

Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at soong@virtuoso-performance.com. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project’s leadership.

Attribution

This Code of Conduct is adapted from the Contributor Covenant, version 1.4, available at http://contributor-covenant.org/version/1/4

Changelog

All notable changes to this project will be documented in this file.

Updates should follow the Keep a CHANGELOG principles.

This project adheres to Semantic Versioning. For major version 0, we will increment the minor version for backward-incompatible changes.

Unreleased

Changed

  • Moved the Csv extractor and loader to soong/csv.
  • Moved the DBAL integrations to soong/dbal.
  • Moved the console command implementations to soong/console.
  • CountableExtractorBase::count() now supports caching of counts - extractors extending CountableExtractorBase should override getUncachedCount() if they have a more efficient means of retrieving the source data count than iterating over all the source data.
  • A new PropertyList interface is added, which the Extractor and Loader interfaces extend instead of defining their own getProperties() and getKeyProperties() methods.
  • The processing of each record during migration is now implemented as a pipeline through all transformers and the loader, with the new RecordPayload class (containing the source record and destination record) as the payload passed through each. As a result, RecordTransformer::transform($sourceData, $resultData) becomes RecordTransformer::__invoke($recordPayload), and Loader::load($resultData) becomes Loader::__invoke($recordPayload).
  • Property transformers are now processed as a pipeline, and implement __invoke($data) rather than transform($data).
  • The Operation interface, with concrete implementations MigrateOperation and RollbackOperation, has been added. These classes, containing the Task instance being operated on, are now the invokable stages of the task pipeline - operations are no longer methods in the Task objects.
  • TaskPipeline has been renamed to TaskCollection to better reflect what it is. For better compatibility with standard collection terminology, the getTask/getAllTasks/addTask methods have been renamed to get/getAll/add.
  • Classes with the same name as the interface they are implementing have been renamed with a “Basic” or “Simple” prefix. Specifically:
    • Soong\Data\Record -> Soong\Data\BasicRecord
    • Soong\Data\RecordFactory -> Soong\Data\BasicRecordFactory
    • Soong\Task\EtlTask -> Soong\Task\SimpleEtlTask
    • Soong\Task\Task -> Soong\Task\SimpleTask
    • Soong\Task\TaskCollection -> Soong\Task\SimpleTaskCollection
  • The League\Pipeline package is now used to run migration tasks. No API changes - new dependency in composer.json.
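
As a rough illustration of the pipeline model described in the entries above, the following self-contained PHP sketch passes a payload object through a series of invokable stages. It deliberately uses neither Soong nor League\Pipeline: SketchPayload, CopyStage, ArrayLoader, and runPipeline() are hypothetical names for this example only, and having each stage return the payload is an assumption rather than a documented part of the Soong interfaces.

```php
<?php
// Illustrative only: these classes are hypothetical stand-ins, not Soong's
// RecordPayload, RecordTransformer, or Loader, and not League\Pipeline.

/** Carries the source record and the destination record being built. */
final class SketchPayload
{
    public $source;
    public $destination = [];

    public function __construct(array $source)
    {
        $this->source = $source;
    }
}

/** Record transformer stage: copies the listed source properties. */
final class CopyStage
{
    private $include;

    public function __construct(array $include)
    {
        $this->include = $include;
    }

    public function __invoke(SketchPayload $payload): SketchPayload
    {
        foreach ($this->include as $name) {
            if (array_key_exists($name, $payload->source)) {
                $payload->destination[$name] = $payload->source[$name];
            }
        }
        return $payload;
    }
}

/** Loader stage: appends the destination record to an in-memory table. */
final class ArrayLoader
{
    public $rows = [];

    public function __invoke(SketchPayload $payload): SketchPayload
    {
        $this->rows[] = $payload->destination;
        return $payload;
    }
}

/** Passes the payload through each invokable stage in order. */
function runPipeline(array $stages, SketchPayload $payload): SketchPayload
{
    return array_reduce(
        $stages,
        function (SketchPayload $carry, callable $stage): SketchPayload {
            return $stage($carry);
        },
        $payload
    );
}

$loader = new ArrayLoader();
$stages = [new CopyStage(['bar', 'num']), $loader];
runPipeline($stages, new SketchPayload(['id' => 1, 'bar' => 'first', 'num' => 1]));
```

Because every stage has the same shape (payload in, payload out), transformers and the loader can be reordered or replaced without the caller knowing which kind of stage it is running.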

0.7.0 - 2019-06-25

Changed

  • The former Transformer interface is now named PropertyTransformer, to distinguish it from the new RecordTransformer interface.
  • Property transformer classes have been moved to the Soong\Transformer\Property namespace.
  • The structure of sample task configurations has been changed to use an array of RecordTransformer instances under the transform key.
  • Transformer construction has been moved from the EtlTask implementation to the EtlCommand implementation - the transform option for EtlTask now requires an array of RecordTransformer instances.
  • PropertyTransformer::transform() no longer specifies the types of its argument and return value. Implementations should specify their specific return type, and validate the expected argument type.
  • Record::getProperty() renamed to getPropertyValue(), setProperty() renamed to setPropertyValue().
  • Symfony Console dependency has been loosened to allow versions 3.4 through 4.x.
  • Release checklist added to CONTRIBUTING

Removed

  • The Property and PropertyFactory interfaces and implementations have been removed - all code using properties now uses values directly, leaving type-checking to PHP. property_factory configuration options have accordingly been removed.

Added

  • The RecordTransformer interface has been added.
  • The PropertyMapper record transformer class has been added.
  • The Copy record transformer class has been added.
  • The --limit option has been added to the migrate and rollback commands.
  • Transformer exceptions added.

0.6.0 - 2019-05-01

Changed

  • EtlTask now accepts its extract, key_map, and load components as object instances rather than constructing them from configuration.
  • DataProperty interface renamed to Property, and DataRecord interface renamed to Record.
  • Derivatives of ExtractorBase now must accept a record_factory configuration option, which is an instance of RecordFactory.
  • EtlTask replaced the string record_class with RecordFactory instance record_factory.

Added

  • The Filter interface has been added, to determine whether a DataRecord should be processed.
  • The Select filter has been added, allowing for filtering by comparing DataRecord properties to values using PHP comparison operators.
  • The --select option has been added to the migrate command, allowing for ad-hoc filtering of extracted data at runtime.
  • PropertyFactory and RecordFactory interfaces/classes added for creation of Property and Record instances.
  • Added basic console command tests.
  • property_factory configuration option added to EtlTask, LoaderBase.
  • ExtractorException, KeyMapException, and LoaderException classes added.
  • Unit test for Record added.

0.5.3 - 2019-04-12

Changed

  • The project is now configured to generate the API documentation using Doxygen on readthedocs - the generated docs are no longer kept in the repo.

0.5.2 - 2019-04-05

Changed

  • addTask now takes an existing Task object instead of a class and configuration.
  • Static create() methods removed from all components and constructors made public.
  • Static methods removed from Task component and moved to non-static methods on the new TaskPipeline component: addTask(), getTask(), getAllTasks().

Added

  • ConfigurableComponent interface added, and all configurable component interfaces inherit from it.
  • OptionsResolverComponent added implementing ConfigurableComponent using Symfony\OptionsResolver - this is now the base class for all configurable components. Any such components adding configuration options to their parent class must implement optionDefinitions() to define them.
  • Commands now use hassankhan/config instead of custom YAML handling - configuration now can be YAML, JSON, or XML transparently (examples provided for each).
  • TaskPipeline component for managing groups of tasks.
  • ComponentNotFound and DuplicateTask exceptions added.
  • Tests for Extractor, KeyMap, Loader, and Task components.
  • Tests for KeyMapLookup component.
  • Smoke test to make sure all provided examples keep working.

Removed

  • isCompleted method on Task - unneeded until we add dependencies.

0.4.0 - 2019-02-15

Added

  • EtlTaskInterface::getLoader()
  • Tests for Data and Transformer components.

Changed

  • DBAL and Csv implementations have been moved:
    • Soong\Csv\Extractor -> Soong\Extractor\Csv
    • Soong\Csv\Loader -> Soong\Loader\Csv
    • Soong\DBAL\Extractor -> Soong\Extractor\DBAL
    • Soong\DBAL\KeyMap -> Soong\KeyMap\DBAL
    • Soong\DBAL\Loader -> Soong\Loader\DBAL
  • Interface and Trait suffixes removed from all interfaces and traits.
  • All interfaces moved into Contracts directory.
  • All main components must now be created using Class::create() rather than new. This affects:
    • DataPropertyInterface
    • DataRecordInterface
    • ExtractorInterface
    • KeyMapInterface
    • LoaderInterface
    • TaskInterface
    • TransformerInterface
  • Explicit component documentation pages have been removed in favor of Doxygen-generated documentation.
  • Existing inline documentation has been cleaned up and expanded.

Removed

  • KeyMapFactory
  • MemoryKeyMap

0.3.0 - 2019-02-05

Added

  • Added getExtractor() and getAllTasks() to task interfaces/implementations.
  • Initial implementation of the status console command.
  • All documentation moved into this repo, will be available on readthedocs.io.
  • DataPropertyInterface::isEmpty()
  • DataRecordInterface::propertyExists()

Removed

  • Removed subtask concept from task interfaces/implementations.
  • Removed CountableExtractorInterface.

Changed

  • DataRecordInterface::setProperty() no longer accepts null values (only property objects with null values).
  • DataRecordInterface::getProperty() no longer returns null values (only property objects with null values).
  • TransformerInterface::transform() no longer accepts or returns null values (only property objects with null values).
  • The $configuration parameter to TransformerInterface::transform() has been removed - configuration should be passed in the constructor instead.
  • KeyMapBase now uses SHA256 for key hashing.
  • Added configuration key for hash algorithm.

Fixed

  • Hashing serialized keys needs to make sure values are always strings.

0.2.1 - 2019-01-24

Changed

  • Merged all the repos back into one for ease of development.

0.1.0 - 2019-01-17

Initial release on Packagist.