Feb 08 by Mike Reich

Rethinking ETL for the API age

The internet and APIs have created a data ecosystem with multiple sources of information and many potential consumers of it. Moving data between systems is no longer a simple affair. Instead, organizations face new complexities and pitfalls that are not addressed by the Extract Transform Load (ETL) process most IT departments use to migrate information.

Some New Best Practices

The following best practices are helpful when considering how to move data between systems where there are multiple sources and multiple consuming applications.

  • Information flows instead of pipelines: Information operates in ‘flows’ where inputs and outputs are flexible and happen at any point. Flows are fluid and flexible, unlike structured, point-to-point ‘pipelines’.
  • Data has multiple owners: Information flows are composed of multiple streams of data owned by different partners and vendors. Any process must accommodate multiple canonical sources for different information.
  • Use APIs to move information: By using APIs to move information around, we decouple the data from the underlying technology and vendor, and make it possible to combine information from different technologies. APIs provide a flexible, low-cost base to grow the system and meet changing needs.
  • Integrate data across systems: Information lives on multiple systems inside and outside the organization. There is tremendous value in combining multiple data sources into a single information stream.
  • Translation rather than standardization: information is stored in multiple structures and formats. Any effort to manage information should focus on translating between structures rather than trying to develop a common schema.

Rethinking the Process

Building on the lessons learned from ETL, we’ve used the following process to successfully move information between multiple connected systems and applications. There are three parts: Acquiring, Processing and Publishing.

Acquire

The first step is to acquire information from different sources. There are two approaches to accomplishing this, depending on how the information is made available by the source system.

  • Pull: information is actively acquired from different systems, through a direct connection like a DB adaptor or API interface. This is synchronous and triggered by the accepting system. This works well for data sources that are inside the organization.
  • Push: information is pushed into the system via an API endpoint. This is asynchronous and triggered by the sending system. This works well for data sources that are outside the organization.
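To make the difference between the two approaches concrete, here is a minimal sketch in Python. The `Acquirer` class, the `fetch_internal_users` function and the sample records are hypothetical names invented for illustration; in a real system the `receive` method would sit behind an API endpoint and `pull` would wrap a DB adaptor or API client.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Record = Dict[str, str]


class Acquirer:
    """Collects records from pull and push sources into one inbox."""

    def __init__(self) -> None:
        self.inbox: Deque[Record] = deque()

    def pull(self, fetch: Callable[[], List[Record]]) -> int:
        # Pull: the accepting system triggers the transfer and waits for it.
        records = fetch()
        self.inbox.extend(records)
        return len(records)

    def receive(self, record: Record) -> None:
        # Push: the sending system triggers the transfer whenever it likes.
        self.inbox.append(record)


def fetch_internal_users() -> List[Record]:
    # Hypothetical stand-in for a DB adaptor or internal API call.
    return [{"email": "ann@example.com", "name": "Ann"}]


acquirer = Acquirer()
pulled = acquirer.pull(fetch_internal_users)                   # internal source
acquirer.receive({"email": "bob@example.com", "name": "Bob"})  # external source
```

Both paths end in the same inbox, so later stages do not need to know how a record arrived.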

There are a couple of important points to make about data acquisition in the AP2 (Acquire, Process, Publish) model. First, any data acquisition technology must be fault tolerant and support asynchronous communication. Because data may be coming from APIs and web services, the system must be able to recover gracefully if something goes wrong. A missing page of data shouldn’t crash the system, but it also shouldn’t be processed as is. Second, if an error does occur, the information should be retained. This ensures that data isn’t lost and can be reprocessed once the underlying error has been addressed.
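One way to picture both points is the sketch below. It is only an illustration under assumptions: the `acquire_pages` function, the `records` field used as a completeness check, and the `flaky_source` test source are all hypothetical, and a production system would likely add retries and durable storage for the retained items.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Dict[str, object]
Retained = Dict[str, object]


def acquire_pages(
    fetch_page: Callable[[int], Page], page_numbers: List[int]
) -> Tuple[List[Page], List[Retained]]:
    """Fetch pages without crashing, and retain anything that cannot be used."""
    accepted: List[Page] = []
    retained: List[Retained] = []
    for number in page_numbers:
        payload: Optional[Page] = None
        try:
            payload = fetch_page(number)
            # Assumption: a complete page carries a list under "records".
            if not isinstance(payload.get("records"), list):
                raise ValueError("incomplete page")
            accepted.append(payload)
        except Exception as exc:  # broad on purpose: one bad page must not stop the run
            # Keep whatever we received so it can be reprocessed later.
            retained.append({"page": number, "payload": payload, "error": str(exc)})
    return accepted, retained


def flaky_source(number: int) -> Page:
    # Hypothetical source: page 2 times out, page 3 arrives incomplete.
    if number == 2:
        raise ConnectionError("timed out")
    if number == 3:
        return {"page": 3}
    return {"page": number, "records": [{"id": number}]}


accepted, retained = acquire_pages(flaky_source, [1, 2, 3])
```

Only page 1 moves on to processing; pages 2 and 3 are held back, together with the error and any payload received, until the underlying problem is fixed.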

Process

The second step is processing. In order for information to be useful to different systems, it needs to go through some processing. There are four common processing tasks:

  • Combining multiple streams: in order to successfully merge information from different sources, some sort of processing is necessary. Often, this can be basic logic, like merging user information by an email address.
  • Translating data formats: different databases and sources represent information in different ways. Often, the processing step will involve normalizing information so that it can be compared and merged regardless of the source.
  • QA information: information coming into the system may be incomplete or contain errors. A dedicated processor for addressing QA issues ensures that even data with errors can be used.
  • Integrate third party processing: processing can include integration with third party services that add value to the data. For example, as part of the processing stage, you might send address information to a geocoding service to return latitude and longitude data in order to map the information.
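The first three tasks can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two source layouts (`Email`/`FullName` and `contact_email`/`plan`), the function names and the sample rows are invented, and third party processing such as geocoding is left out.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def translate_crm(row: Record) -> Record:
    # Translate a hypothetical CRM layout into the shape used for merging.
    return {"email": row["Email"].strip().lower(), "name": row["FullName"]}


def translate_billing(row: Record) -> Record:
    # Translate a hypothetical billing layout into the same shape.
    return {"email": row["contact_email"].strip().lower(), "plan": row["plan"]}


def merge_by_email(*streams: Iterable[Record]) -> Dict[str, Record]:
    # Combine streams: records sharing an email address become one record.
    merged: Dict[str, Record] = {}
    for stream in streams:
        for record in stream:
            merged.setdefault(record["email"], {}).update(record)
    return merged


def qa(
    records: Iterable[Record], required: Tuple[str, ...]
) -> Tuple[List[Record], List[Record]]:
    # QA: flag incomplete records instead of discarding them.
    clean: List[Record] = []
    flagged: List[Record] = []
    for record in records:
        missing = [name for name in required if not record.get(name)]
        if missing:
            flagged.append({**record, "qa_missing": ",".join(missing)})
        else:
            clean.append(record)
    return clean, flagged


crm_rows = [
    {"Email": " Ann@Example.com ", "FullName": "Ann Lee"},
    {"Email": "bob@example.com", "FullName": "Bob Roy"},
]
billing_rows = [{"contact_email": "ann@example.com", "plan": "pro"}]

merged = merge_by_email(map(translate_crm, crm_rows), map(translate_billing, billing_rows))
clean, flagged = qa(merged.values(), required=("email", "name", "plan"))
```

Note that each source gets its own small translator rather than being forced into a shared schema up front, and that the flagged record stays available with a note about what is missing.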

Publish

The final stage of the AP2 workflow is Publishing. Publishing makes information available as streams of information to consuming applications. Publishing can be implemented with any number of technologies that follow two basic principles:

  • Be application/technology agnostic: in other words, assume that your published information will be used by any number of unknown applications and technologies. This will ensure that you design a publishing interface that gives you the most flexibility in the future.
  • Assume there will be multiple consumers of the information: part of the power of the AP2 model is that multiple applications can be built off of a single stream of information. To ensure this is possible, be sure to design your publishing interface so that many applications can interface with the information at the same time.
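As a rough sketch of these two principles, the Python below publishes records as JSON (readable from almost any platform) into an append-only stream that each consumer reads at its own pace. The `Stream` class and the consumer names are hypothetical; a real publishing interface would expose something like `read` over an API rather than in memory.

```python
import json
from typing import Dict, List


class Stream:
    """Append-only stream that any number of consumers can read independently."""

    def __init__(self) -> None:
        self._entries: List[str] = []

    def publish(self, record: Dict[str, object]) -> int:
        # JSON keeps the payload independent of the consumer's technology.
        self._entries.append(json.dumps(record, sort_keys=True))
        return len(self._entries) - 1  # offset of the new entry

    def read(self, since: int = 0) -> List[str]:
        # Reading never removes entries, so consumers do not affect each other.
        return self._entries[since:]


stream = Stream()
stream.publish({"email": "ann@example.com", "plan": "pro"})
stream.publish({"email": "bob@example.com", "plan": "free"})

dashboard_view = stream.read()        # one consumer reads everything
mobile_view = stream.read(since=1)    # another only wants newer entries
```

Because each consumer supplies its own starting offset, adding a new application later does not require changing the publisher.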

Connecting the pieces

The process outlined above can be implemented as a single system with three stages, or as three separate applications connected together using open-standards-based APIs. Connecting the pieces with APIs has the following advantages:

  1. Makes acquiring and publishing data easier: APIs are a powerful tool for interfacing with external systems to gather information, and are the natural mechanism for publishing information in an agnostic way. When you use APIs to connect the different pieces of the AP2 workflow together, you have already implemented the infrastructure needed to acquire and publish information.
  2. Technology agnostic and flexible: almost every technology platform has built-in support for APIs. If it can connect to the internet, it can use an API. This means that you can use the best tool for the job, rather than getting locked into a particular technology stack.

In practice, information flows may need to be fairly complicated, with multiple processing steps or data inputs. In these situations, chaining together multiple information flows (each flow is an AP2 process) makes it possible to build sophisticated information management workflows while keeping each flow decoupled and maintainable.
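Chaining can be sketched as follows, with the published output of one flow becoming the acquired input of the next. The `Flow` class and the order data are hypothetical and kept in memory for brevity; in the model described above, the hand-off between flows would happen over an API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class Flow:
    """One acquire -> process -> publish cycle."""

    acquire: Callable[[], List[Record]]
    process: Callable[[List[Record]], List[Record]]
    published: List[Record] = field(default_factory=list)

    def run(self) -> List[Record]:
        self.published = self.process(self.acquire())
        return self.published


# First flow: acquire raw orders and add a line total to each one.
orders = Flow(
    acquire=lambda: [
        {"sku": "A", "qty": 2, "price": 5.0},
        {"sku": "B", "qty": 1, "price": 3.0},
    ],
    process=lambda rows: [{**r, "total": r["qty"] * r["price"]} for r in rows],
)

# Second flow: acquire whatever the first flow published and summarize it.
summary = Flow(
    acquire=lambda: orders.published,
    process=lambda rows: [{"revenue": sum(r["total"] for r in rows)}],
)

orders.run()
result = summary.run()
```

The second flow knows nothing about how orders were acquired or processed; it only depends on what the first flow publishes, which is what keeps the chain decoupled.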

Learn More

For additional insight into APIs and emerging information management systems, contact Mike Reich at mike@seabourneinc.com. In addition, we can help you develop a detailed Project Blueprint that will guide your organization through the process of defining technology project requirements, planning infrastructure modernization strategies, deploying new systems, and the all-important budget justification.
