The Preeminence of IBM Informix TimeSeries: Part 1
Handling the large volumes of time-stamped data generated from stock transactions, smart metering, and more
As the world becomes increasingly instrumented, and the IBM vision of a Smarter Planet™ becomes reality, the value of IBM® Informix® TimeSeries is becoming clearer. With its origins in the key architectural breakthrough of Informix Dynamic Server (IDS) Version 9, Informix TimeSeries is a unique and powerful technology that is gaining recognition as the preeminent tool for dealing with vast volumes of time-stamped data.
The technology at the core of Informix TimeSeries was originally described as the “object relational model” or a “Universal Database.” Informix used the proprietary (but not very descriptive) marketing term “datablade.”
No matter what you called it, the idea was a powerful one. This technology enabled you to open up the database so that new data types could be added, and it included the ability to add new “methods” by extending the standard SQL language with operators specific to the new data types.
Because the application programming interface (API) for adding these extensions (the Datablade Developer Kit) was open to clients and partners as well as Informix developers, these groups could now add support for new data types. Dozens of interesting datablades were created, but two stood out as the most popular and appealing to clients: the extensions for dealing with time-series data and spatial data.
It makes sense that these extensions were so widely embraced. Almost every organization, whether it’s a business or government agency, has data that includes location and time.
The case for location data
Location (spatial) data is so useful to organizations that many databases today include some mechanism for handling it, and several companies have built their business on offering detailed spatial data and maps for use with those databases. For example, any time you use a website that asks for your zip code and then shows you the nearest locations of a business (such as the closest ATM), you are likely using a database that supports spatial data.
Without specialized functions for spatial data, it is still possible to keep addresses in your database, but it is difficult to answer even a simple query such as “how far is it from 26 Cherry St. to 52 Adams St. in Springfield, OR?” Addresses kept as strings without reference to a geographic database don’t carry enough information to answer this kind of query.
The need for time-series data
While the case for extending the typical relational data types with spatial ones is conceptually easy to grasp, the need for time-series data types requires a little more explanation. After all, time and date have been an integral part of database systems since the earliest days. Routine transactions in any online transaction processing (OLTP) system are all time-stamped. Billing systems are usually based on billing cycles that are delimited by dates and times. So if date and time data formats have been supported in SQL since the beginning, why do we need a special time-series data type and associated methods?
The short answer is that time-series data is a special case where some data value is changing quite rapidly over time, thus creating vast amounts of data points. A comparison between the use of traditional time elements and time-series data is useful. An average bank account might have a few transactions a day; using SQL time and date format, creating a new row for each transaction is simple and obvious. At the other extreme, consider the number of data points created by a heavily traded stock on a major exchange—for example, IBM stock averages just short of four million trades a day. The New York Stock Exchange has a 6.5-hour trading day, which works out to about 10,000 transactions per minute (or 170 transactions per second) that an investment firm might need to accommodate.
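The rates quoted above can be checked with a few lines of arithmetic. The sketch below assumes a round figure of four million trades (the text says "just short of" that) spread evenly over the 6.5-hour trading day; real trading volume is not evenly distributed, so these are averages only.

```python
# Back-of-the-envelope check of the average trade rates quoted in the text.
# Assumption: exactly 4,000,000 trades, spread evenly across the trading day.
TRADES_PER_DAY = 4_000_000
TRADING_HOURS = 6.5  # length of the trading day cited in the text

trading_minutes = TRADING_HOURS * 60            # 390 minutes
trades_per_minute = TRADES_PER_DAY / trading_minutes
trades_per_second = trades_per_minute / 60

print(f"{trades_per_minute:,.0f} trades per minute")
print(f"{trades_per_second:,.0f} trades per second")
```

Under these assumptions the averages come out slightly above 10,000 trades per minute and about 170 per second, in line with the rounded figures in the text.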
It is certainly possible to use standard SQL data types and formats to record this information. The TPC-C benchmark was introduced in the early 1990s to help customers understand the throughput capacity of the various vendors’ systems in just these types of scenarios. But what if our requirements demand much more? What if we need not only to write many transactions, but also to do it for many inputs almost simultaneously? And what if we need the data in one place so we can continuously run queries against it? Scanning dozens or hundreds of multimillion-row tables is unlikely to yield fast query results.
It was these real-world problems that Informix engineers and architects set out to address with Informix TimeSeries technology. In the case of TimeSeries, the early adopters were all high-value arbitrage traders working for large investment firms.
Developing TimeSeries technologies
The Informix team created several technologies, making use of the IDS ability to be extended with new data types and methods.
First, the Informix team created a more efficient mechanism for storage so that millions of redundant rows were not created for time-series data. This storage mechanism is the TimeSeries data type, which is designed specifically to store data points (or “ticks,” as in the ticker tape used by financial exchanges). This data type greatly reduced the number of rows in the database, and consequently its size, enabling it to hold the vast amount of trading data that arbitrage traders needed (see Figure 1). The team then added new extensions to the query language (SQL) so that simple questions could be asked without writing dozens of statements.
In addition, the Informix team needed a way to get data into the database faster than the typical methods used for OLTP systems allow. The team built the High Speed Loader to meet this mission-critical client requirement.
Figure 1. Informix TimeSeries streamlines storage by avoiding the creation of millions of redundant rows.
Understanding how time-series data is stored by Informix TimeSeries is conceptually simple, and exemplifies the thinking that was used for the other features in the product. The goal was to make it fast, efficient, and easy to understand. A database system is only as useful as the ability of developers to get information out of it. Simple conceptual models lead to simple queries and ultimately faster development and deployment of new trading strategies. This was a critical requirement for the hyper-competitive world of commodities trading.
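The storage idea can be illustrated with a small conceptual sketch. This is not the Informix implementation or its API. It only contrasts a one-row-per-tick layout, where the key is repeated on every row, with a layout that keeps one row per series and holds the ticks together in time order. The symbols `ABC` and `XYZ`, the prices, and the function name `to_series_rows` are all made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Conventional layout: one row per tick, with the symbol repeated on every row.
# All symbols and prices here are invented sample values.
tick_rows = [
    ("ABC", datetime(2012, 1, 3, 9, 30, 0), 10.00),
    ("ABC", datetime(2012, 1, 3, 9, 30, 1), 10.02),
    ("ABC", datetime(2012, 1, 3, 9, 30, 2), 10.01),
    ("XYZ", datetime(2012, 1, 3, 9, 30, 0), 25.10),
    ("XYZ", datetime(2012, 1, 3, 9, 30, 1), 25.12),
]


def to_series_rows(rows):
    """Group ticks so each symbol appears once, with its ticks in time order."""
    series = defaultdict(list)
    for symbol, stamp, price in rows:
        series[symbol].append((stamp, price))
    for ticks in series.values():
        ticks.sort()  # order each series by time stamp
    return dict(series)


series_rows = to_series_rows(tick_rows)

print(len(tick_rows), "rows in the one-row-per-tick layout")
print(len(series_rows), "rows in the one-row-per-series layout")
```

With millions of ticks per instrument, the first layout grows by one row per trade, while the second still has one row per instrument.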
A further refinement was added to enhance efficiency in cases where the time-series data arrives at regular intervals, for example, when sensor data in a laboratory is measured every tenth of a second. In this case Informix TimeSeries does not store the time stamps, because each one can be calculated from the starting time, the interval, and the data point's position in the series. For example, if the first reading was taken at 2:00:00.0, the data point at offset 251 (counting the first reading as offset 0) would have the time stamp 2:00:25.1.
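The calculation for regularly spaced data can be sketched as follows. This is a minimal illustration of the arithmetic, not Informix code. The function name `timestamp_at` and the calendar date are assumptions, and offsets are counted from zero, so the first reading is offset 0.

```python
from datetime import datetime, timedelta


def timestamp_at(origin, interval, offset):
    """Return the time stamp of the reading at a zero-based offset."""
    return origin + offset * interval


# Assumed values: readings start at 2:00:00.0 and arrive every tenth of a second.
# The date itself is arbitrary and only needed to build a datetime.
origin = datetime(2012, 1, 1, 2, 0, 0)
interval = timedelta(milliseconds=100)

for offset in (0, 250, 251):
    stamp = timestamp_at(origin, interval, offset)
    # Trim microseconds down to tenths of a second for display.
    print(offset, stamp.strftime("%H:%M:%S.%f")[:-5])
```

Only the origin and the interval need to be stored once per series. Each reading then needs only its value, since its position implies its time stamp.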
In Part 2 of this article, we’ll explore how the utilities industry has quickly adopted Informix TimeSeries as a way to accommodate the tremendous volume of time-stamped data generated by smart meters.