Ingesting Data in the Data Value Chain
The concept of the ingestion stage is simple enough, but it covers rigorous, diverse data preparation for processing.
Recent IBM Data magazine articles introduced the seven lifecycle phases in a data value chain and took a detailed look at the first phase, data discovery, or locating the data.1 The second phase, ingestion, is the focus here. Ingestion is the process of bringing data into the data processing system. This deceptively simple concept covers a large amount of the work required to prepare data for processing; we often hear that 80 percent of the work involved in data science is preparing the data for analysis.2 Ingestion processes are as diverse as the potential data sources, and they must account for a broad spectrum of data formats, schemas, volumes, and update frequencies. Broadly speaking, data ingestion processes fall into two functional categories: batch and streaming. Ingesting batch data involves importing data in discrete chunks, such as dumps of a day’s transactions. Streaming ingestion—sometimes referred to as real time, although not necessarily in the strict sense of the term—means every data item is ingested individually as it is emitted by the source, such as sensor outputs or social network messages.
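As a rough illustration of the distinction, the two categories can be sketched in a few lines of Python. The records and function names here are hypothetical, chosen purely for illustration:

```python
def ingest_batch(records):
    """Batch ingestion: land a discrete chunk, such as a day's dump, at once."""
    return list(records)


def ingest_stream(source):
    """Streaming ingestion: yield each item individually as it is emitted."""
    for event in source:
        yield event


# A hypothetical day's dump of transactions.
daily_dump = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 4.0}]

batch = ingest_batch(daily_dump)                 # arrives all at once
streamed = list(ingest_stream(iter(daily_dump)))  # arrives item by item
```

The same data flows through both paths; what differs is the delivery semantics, and therefore how soon each item is available for processing.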
Bringing data into the fold
In addition to the main division of batch versus streaming data, any platform with multiple data sources must accommodate diversity in other key factors, each of which affects the reliability and complexity of the ingestion phase:
- Data source type—for example, databases, event streams, files, log files, and web services
- Data transport protocol
- Update semantics of the incoming data—for example, append, changeset, and replace
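To make the third factor concrete, here is a hedged Python sketch of the three update semantics applied to a small keyed data set. All record shapes and function names are illustrative, not drawn from any particular tool:

```python
def apply_append(existing, incoming):
    """Append: incoming records are simply added to what is already stored."""
    return existing + incoming


def apply_changeset(existing, changes):
    """Changeset: deltas keyed by id -- a record updates, None deletes."""
    table = {rec["id"]: rec for rec in existing}
    for key, rec in changes.items():
        if rec is None:
            table.pop(key, None)
        else:
            table[key] = rec
    return list(table.values())


def apply_replace(existing, snapshot):
    """Replace: the incoming data supersedes everything already stored."""
    return list(snapshot)


current = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]

after_append = apply_append(current, [{"id": 3, "v": "c"}])
after_change = apply_changeset(current, {2: {"id": 2, "v": "B"}, 1: None})
after_replace = apply_replace(current, [{"id": 9, "v": "z"}])
```

Each semantic requires a different amount of state and care at ingestion time: append is the simplest, changesets require a keyed view of existing data, and replace requires safely swapping out the previous data set.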
When assessing requirements for ingestion, take account of other functions that are often performed as part of the process: data validation, transformation, and routing the data items to their destination. The ingestion stage is a critical point for ensuring reliability of service and data quality; administrators need to consider the robustness of their ingestion processes and how to monitor them.

Before drilling down into ingestion of batch and streaming data, comparing the ingestion stage of the data value chain to the well-established extract-transform-load (ETL) pattern is worthwhile. ETL is the process of extracting data from an operational system, transforming it, and loading it into an analytical data warehouse. As such, it is a special case of the ingest stage. Modern data platforms don’t necessarily require either the transformation or the loading into a database to happen at the point of data import.
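The validation, transformation, and routing functions mentioned above can be sketched as a toy Python pipeline. The field names, validation rule, and destination names here are invented for illustration:

```python
def validate(record):
    """Accept only records that carry an id and a non-negative amount."""
    return "id" in record and record.get("amount", 0) >= 0


def transform(record):
    """Lightweight ingest-time transformation: normalize amount to cents."""
    out = dict(record)
    out["amount_cents"] = int(round(out.pop("amount") * 100))
    return out


def route(record):
    """Pick a destination (e.g. a table or topic) based on a field value."""
    return "large_orders" if record["amount_cents"] >= 10000 else "orders"


destinations = {"orders": [], "large_orders": [], "rejected": []}

incoming = [
    {"id": 1, "amount": 5.0},
    {"amount": 2.0},            # invalid: no id
    {"id": 2, "amount": 150.0},
]

for rec in incoming:
    if not validate(rec):
        destinations["rejected"].append(rec)  # quarantine bad input
        continue
    cleaned = transform(rec)
    destinations[route(cleaned)].append(cleaned)
```

Quarantining rejected records rather than silently dropping them is what makes the ingestion stage monitorable: the rejected queue is the natural place to alert on data-quality problems.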
Ingesting batch data
Operations for ingesting batch data draw from exports of data from other systems, typically in the form of files. In the best case, the data arrives in a known format that can be processed later without further transformation. The schema-on-read nature of modern big data solutions means that decisions about transformation can be deferred past the point of ingest. That aspect notwithstanding, performing lightweight transformations on the data at ingestion time may be appropriate.

As the big data ecosystem has evolved, so has the tool support for data ingestion into Apache Hadoop frameworks. While writing custom programs to ingest data provides maximum flexibility, this tactic is often unnecessary. In one important use case, extracting data from relational sources, Apache Sqoop makes the job much simpler than custom programming. More recently, a new breed of data-wrangling tools, such as Trifacta, helps analysts understand a data source—and perform some transformations on the ingested data—without having to write any code. Such transformations may include, for example, the removal of error values or the disaggregation of compound fields. These sophisticated tools tend to be geared toward analytical use cases in which another end-user tool is then used to perform the analysis.

Several key considerations are noteworthy when ingesting data in batch form:
- The semantics of incoming data: Does the data augment or replace existing data, or is it a collection of deltas to a previous state?
- Acquiring the data: How can incoming files be reliably fetched?
- Behavior on error conditions: Can a best effort be made, or should the entire batch be abandoned?
- Denormalization: Data extracted directly from a system, especially a relational database, is likely normalized for operational purposes. Denormalizing at export and ingest time prevents downstream processors from having to understand that system’s operational logic.
- Transformation: Is the data heading directly to an analyst, or on to a system component? It may be appropriate to do more transformation on the data, such as field disaggregation, to make the analysis easier.
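As a sketch of the denormalization consideration above, the following Python fragment joins two normalized "tables" (represented here as plain dicts and lists, with a hypothetical schema) into flat, self-contained records at ingest time:

```python
# Normalized source data, as it might be exported from a relational system.
customers = {101: {"name": "Acme Corp", "region": "EMEA"}}
orders = [
    {"order_id": 1, "customer_id": 101, "total": 250.0},
    {"order_id": 2, "customer_id": 101, "total": 40.0},
]


def denormalize(orders, customers):
    """Join orders to customers so each record stands on its own."""
    for order in orders:
        cust = customers[order["customer_id"]]
        yield {
            "order_id": order["order_id"],
            "total": order["total"],
            "customer_name": cust["name"],
            "customer_region": cust["region"],
        }


flat = list(denormalize(orders, customers))
```

After this join, downstream processors can analyze each flat record without knowing that the source system stored customers and orders in separate tables.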
Ingesting streaming data
There are several key reasons to choose to ingest data as a stream. If action is to be taken based on the current state or transitions of a system, then the relevant data should be ingested as a stream. Example scenarios include financial trading, environmental monitoring, and network security. Even if online action isn’t required, streaming can be the most efficient method of ingesting constantly created data, such as server logs, into a data platform. If an organization requires aggregated data, such as running averages, to perform its analysis, computing these aggregates in real time while the full context is available is computationally cost-effective. Although storing all incoming data and recreating history offline is possible, this approach requires considerably more effort than computing the aggregates as the data arrives.

While the ingestion and processing stages are conceptually separate in the data value chain, the two are often closely linked in streaming-data ingestion. Typically, the ingestion stage is a front end to a streaming-data processing framework, such as the following common frameworks:
- Apache Storm: A scalable, fault-tolerant system for distributed real-time computation; ingestion happens through spouts, which connect Storm to sources of events.
- Apache Flume: Although less processing-oriented than Storm, Flume’s core use case is ingestion and aggregation of log data into stores, such as Apache HBase, Hadoop Distributed File System (HDFS), ElasticSearch, or Apache Cassandra. Ingestion happens through a network connection to a source emitting a known format, such as syslog, Apache Thrift, or Apache Avro.
- Apache Kafka: Newer than Storm and Flume, Kafka is a general-purpose distributed messaging system; processing streaming data is one of its possible applications.
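To illustrate the producer/consumer decoupling that a general-purpose messaging system provides, here is an in-process analogue in Python, with queue.Queue standing in for a broker topic. This is an analogy only, not Kafka's actual API, and all names are invented:

```python
import queue
import threading

topic = queue.Queue()   # stands in for a broker-managed topic
SENTINEL = object()     # signals end-of-stream in this toy example


def producer(events):
    """Producers publish events without knowing who will consume them."""
    for e in events:
        topic.put(e)
    topic.put(SENTINEL)


consumed = []


def consumer():
    """Consumers read at their own pace, decoupled from the producer."""
    while True:
        e = topic.get()
        if e is SENTINEL:
            break
        consumed.append(e)


t = threading.Thread(target=consumer)
t.start()
producer(["login", "click", "purchase"])
t.join()
```

The value of the intermediary is that producers and consumers can scale, fail, and restart independently; a real messaging system adds durability and distribution on top of this basic pattern.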
When ingesting data as a stream, factor in the following key considerations:
- Capacity and reliability: The system needs to scale to input volumes and be failure tolerant.
- Context: What additional derivative data must be computed to support use cases?
- Data volume: Although storing all incoming data is often preferable, in some cases only aggregate data should be stored because retaining everything may be cost prohibitive.
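The aggregate-data consideration above can be made concrete with a running average: an incremental mean is updated in constant time per event and retains no history, in contrast with recomputing over all stored data. The sensor readings in this Python sketch are invented:

```python
class RunningAverage:
    """Incrementally maintained mean; no per-event history is retained."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Standard incremental mean update: O(1) time, O(1) state.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


readings = [21.0, 22.5, 20.5, 23.0]  # e.g. sensor outputs arriving as a stream

avg = RunningAverage()
for r in readings:
    avg.update(r)
```

The trade-off named above is visible here: storing only `count` and `mean` is far cheaper than retaining every reading, but it forecloses any later analysis that would have needed the raw values.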
Progressing through the data value chain
The old computing aphorism—garbage in, garbage out—holds as true as ever: the ingestion stage of a data platform is critical to its operation as a whole. The reliability of data ingestion determines the overall reliability of the data platform, and the downstream use cases of the platform also place requirements on the work done at ingestion time. Any mature data platform ingests data through both batch and streaming methods.

Stay tuned for subsequent features that explore the other phases of the data value chain, including the next stage, process, which concerns the nature of the processing performed on the data before it is ultimately stored in the system. Please share any thoughts or questions in the comments.

1 “Understanding the Data Value Chain,” by Edd Dumbill, IBM Data magazine, November 2014, and “Data Discovery in the Data Value Chain,” by Edd Dumbill, IBM Data magazine, January 2015.

2 “For Big Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights,” by Steve Lohr, The New York Times, Technology, August 2014.