
The Database Revolution

Today they’re indispensable. Sixty years ago they didn’t even exist as an idea. How did we get here?

Today, almost everyone who uses information technology simply takes the vast sea of underlying data for granted. Business users may be aware that a database is involved, but have little idea of the complex architecture needed to keep their data organized, related, linked, current, consistent, and available.

Even database administrators and other professionals operate within a universe of well-established concepts and reliable functionality. They depend on properties such as atomicity, consistency, isolation, and durability (ACID), as well as a range of data management technologies and methodologies that did not exist a half century ago. This article explores how these things came to be.
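
To make one of those properties concrete, the sketch below shows atomicity at work, using the sqlite3 module built into Python. The accounts table and its balances are hypothetical, invented purely for this example: either both sides of the transfer are committed, or neither is.

```python
import sqlite3

# A hypothetical "accounts" table in an in-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # begins a transaction: commit on success, rollback on any error
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
        # If either UPDATE failed, neither would survive -- that is atomicity.
except sqlite3.Error:
    pass  # the whole transaction was rolled back as a unit

print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 30, 'bob': 120}
```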

Early days: Start with a file system

In 1958, computers had already evolved from huge tube-driven hulks into smaller, lighter, transistorized machines. They could perform most simple management functions themselves, could load and execute application programs without rebooting, and most importantly for business, they were (relatively) affordable, practical, and manageable.

However, the data revolution was still in its infancy. Computer programs wrote data to storage devices formatted according to their own internal definitions. These data collections, eventually called “files,” were normally used so that a program could generate output that would be read by the next program in a batch sequence. The last file in the sequence would be stored on tape or cards for the next time the sequence, eventually called a “job,” was executed. Because a file was written and read according to definitions within the program code, there was no external data definition, making data sharing difficult and ad hoc reporting impossible. Although business users wanted to use this data for reporting or with other newer programs, there was no effective technique for collecting and reusing data in an organized manner.

As it turns out, 1958 was an important year for data. That was the year IBM produced a file system that could store and retrieve data based on data format definitions held in the file system rather than in the program. It was called the Formatted File System (FFS), and it was developed for the IBM 704 computer. Just two years earlier, IBM had also developed the first random access disk storage device, the Random Access Method of Accounting and Control (RAMAC) 350, so there was a place to store this stuff. FFS was a critical first step toward automated database management; it spawned a series of software systems that could be regarded as the first database management systems (DBMSs).

During the next four years, the arrival of powerful, transistorized computers from several vendors as well as third-generation languages—including COBOL, ALGOL, and PL/I—resulted in a flurry of application programming, more and more application data, and a critical need to organize that data. A team of IBM developers responded with the Generalized Information Retrieval and Listing System (GIRLS) for the IBM 7090 in 1962. GIRLS improved on the FFS facilities, enabling users to collect data and easily code reports on a recurring basis. This activity could be thought of as a forerunner to data warehousing. The collection of data was referred to as a “data base” because it consisted of not one but many different record types.

The declaration of (data) independence

At the time that GIRLS was developed, a tremendous increase in computer functionality, manageability, and affordability was taking place. Computers were getting smaller and more powerful due to the development of integrated circuits (ICs) and large-scale integration (LSI). Meanwhile, data could be shared even more widely using the American Standard Code for Information Interchange (ASCII) and IBM’s Extended Binary Coded Decimal Interchange Code (EBCDIC).

However, data kept in the early FFS-based DBMSs of the day formed indexed sets of records having the same layout, and the records in one set could not be formally associated with the records in another set. In 1964, General Electric consultant Charles W. Bachman developed a way to enable the sharing of complex data under schematic control that would ensure that all the data forms necessary for the constituent applications would be preserved. He used a network orientation to build his design, and the result was a DBMS called the Integrated Data Store (IDS) that was designed to run under GE’s GECOS operating system.

Bachman formally defined each elementary item of data as an “element,” such as “customer name” or “order number.” Record types would then be defined using these elements. Records were stored in sets—not based on their type, but based on their relationship to the record that “owned” the set. So, for instance, a “customer” record might own the set of all the “order” records representing orders that customer had placed, while a “product” record might own the set of all the “order” records representing orders for that product. To find out what products a customer had ordered, a program could traverse the customer’s set of orders and, for each order record, locate the product record that owned the other set in which that same order record was a member. Records that had no owner were called “entrypoints” and were found based on their key value.
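
That owner-and-member navigation is easy to mimic in a modern language. What follows is a rough sketch in Python, not anything resembling IDS’s actual interface; the record types and set names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    kind: str  # e.g., "customer", "product", "order"
    key: str
    owned_sets: dict = field(default_factory=dict)   # set name -> member records
    memberships: dict = field(default_factory=dict)  # set name -> owner record

def connect(owner, set_name, member):
    """Make `member` part of the named set owned by `owner`."""
    owner.owned_sets.setdefault(set_name, []).append(member)
    member.memberships[set_name] = owner

# Entry-point records, located by key value.
acme = Record("customer", "ACME")
widget = Record("product", "WIDGET")
order1 = Record("order", "ORD-1")

connect(acme, "orders-by-customer", order1)    # ACME owns this order...
connect(widget, "orders-for-product", order1)  # ...and so does WIDGET, in another set

# What products has ACME ordered? Traverse the customer's set of orders,
# then hop from each order to the owner of its product set.
for order in acme.owned_sets["orders-by-customer"]:
    product = order.memberships["orders-for-product"]
    print(acme.key, "ordered", product.key)
```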

Although most DBMSs are not network oriented, they share these common features:

  • Data can be defined at an elementary level, and it is kept in typologically consistent groups that can be related to other groups.
  • Data groups can be found randomly, and associated data can be retrieved based on the way the data is defined to the database in a structure called a “schema.”

In this way, the data definitions and the rules governing how they are stored and related to each other are independent of the programs that use the data. These concepts evolved from the FFS work, but they first found organized expression in IDS.

The following year, IBM produced its own stand-alone data storage and query system, the Generalized Information System (GIS). Over the next several years, the company executed a major project with Rockwell International to build upon GIS and produce a high-volume, large-scale DBMS for the National Aeronautics and Space Administration (NASA). Unlike Bachman’s approach, which organized record types under other record types as “sets” with “owners” and “members” to form a data network, IBM and Rockwell International used a hierarchical organization of data to deliver very rapid response times in executing complex data transactions. The result was the Information Management System (IMS), which was delivered to NASA in 1969. IBM subsequently productized it for the System/360 mainframe.

The next 10 years represented the heyday of the mainframe DBMS, most of it centered on the IBM System/360 and its successor, the System/370. Network DBMSs IDMS and TOTAL, inverted-list DBMSs Adabas and Model 204, and indexed table DBMS DATACOM provided lively competition during this period. There were major DBMS offerings on other mainframes also: Honeywell, which had acquired GE’s computer business, continued to offer IDS, and Burroughs had a similar network DBMS called DMS. In the end, however, it was the IBM mainframe that became dominant.

From navigational to relational

The DBMSs of the 1970s represented powerful advances in the ability of companies to collect, reuse, and report on data, and to manage large, complex, yet well-coordinated application systems. However, these DBMSs were themselves complex, requiring considerable technical knowledge just to understand, much less design and manage. They were also of little use for ad hoc queries, because users had to know how to navigate the physical structure of the database to find the data.

But another quiet revolution was taking place—one that began with a research paper describing a simple way to collect, manage, and share data based on mathematical set theory. The paper was called “A Relational Model of Data for Large Shared Data Banks,” and its author was an IBM engineer named E. F. Codd.

Codd had been bothered by the widespread practice in the database world of treating data and its definitions haphazardly, without a systematic approach that could lead to scalable sharing of the data across diverse systems. The problem, from his point of view, was a lack of mathematical rigor in defining the data model (some database models had no rigor whatsoever). His solution was to consider ways in which data could be defined and organized according to principles derived from mathematical set theory, allowing the data to be managed using elements of predicate logic. His work, which was laced with obscure mathematical terms, at first was barely understood in the broader data management community. Once some bright light translated “tuples,” “attributes,” and “relations” into “rows,” “columns,” and “tables,” the relational concept spread like wildfire.
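
The translation is easy to see in miniature. Below is a sketch of a relation as a set of tuples over named attributes, in Python; the orders relation and its contents are hypothetical.

```python
from collections import namedtuple

# A relation is a set of tuples over named attributes -- in everyday
# terms, a table of rows and columns. The data here is made up.
Order = namedtuple("Order", ["customer", "product", "quantity"])

orders = {
    Order("ACME", "WIDGET", 10),
    Order("ACME", "GADGET", 2),
    Order("GLOBEX", "WIDGET", 5),
}

# Predicate logic over the set: select the tuples where customer = ACME,
# then keep only the product attribute (a selection followed by a
# projection, in Codd's terms).
acme_products = {t.product for t in orders if t.customer == "ACME"}
print(sorted(acme_products))  # ['GADGET', 'WIDGET']
```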

Codd’s paper touched off a flurry of research into how his approach might be implemented practically. The first fruit of this effort was a purpose-built, relationally based data handling system built for the British Geological Survey in 1973. Called G-EXEC, it was developed on an IBM System/360 by a team led by Keith Jeffery from the British Geological Survey and Elizabeth Gill from the Atlas Computer Laboratory, along with Stephen Henley and John Cubitt from the British Geological Survey. Four years later, IBM’s Jim Gray led a team of engineers to produce a prototype relational DBMS (RDBMS) called System R. (Gray went on to develop RDBMS technology at Tandem Computers and Digital Equipment Corporation, and he finished his impressive career at Microsoft.) By that time, Honeywell had produced an RDBMS for its Multics OS called the Multics Relational Data Store (MRDS), and Michael Stonebraker had launched a project at the University of California at Berkeley to produce an RDBMS, code-named Ingres. During the following decade, these projects, as well as others from a variety of companies and targeting a variety of platforms, created the now-familiar RDBMS landscape.

The start of SQL

As part of the System R project, IBM engineer Donald Chamberlin, working with colleague Raymond Boyce, developed an interactive query language called SEQUEL (a sort of acronym for Structured English Query Language) in 1974; the language was substantially revised over the next few years. Its name was changed for legal reasons to SQL (Structured Query Language), even as it was built out to become a full data manipulation and definition language.
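
A small taste of the language, run here against the sqlite3 engine built into Python (the table and its rows are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition and data manipulation in one language.
conn.execute("CREATE TABLE orders (customer TEXT, product TEXT, quantity INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("ACME", "WIDGET", 10), ("ACME", "GADGET", 2)])

# The declarative query style SEQUEL pioneered: say *what* you want,
# not how to navigate the storage structures to find it.
for row in conn.execute(
        "SELECT product, quantity FROM orders WHERE customer = 'ACME'"):
    print(row)  # ('WIDGET', 10) then ('GADGET', 2)
```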

IBM promoted SQL as a standard and developed from System R an RDBMS called SQL/DS. The success of SQL/DS and its mainframe successor, DB2, helped ensure the adoption of SQL as the standard for RDBMS data access. The distributed branch of the DB2 family has separate origins in the OS/2 operating system: IBM took the RDBMS technology developed for OS/2 and built it into a cross-platform RDBMS, which it initially (in 1997) called DB2 Universal Database (UDB). Today this RDBMS is generally known as DB2 for Linux, UNIX, and Windows (LUW).

Other DBMS technology developments

Some of the DBMS technology developments between 1970 and 2000 were not relational. Two of the engineers who had contributed to the development of GIRLS, Dick Pick and Don Nelson of TRW, developed a computer system with integrated database management capabilities based on GIRLS that came to be known as the Pick System. It had the additional advantage of supporting multivalued fields (fields that can hold more than one value in a single record). It was sold as an integrated environment on a variety of minicomputers in the mid- to late 1970s.
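
The multivalued idea is simple to sketch. The record layout below is hypothetical, and real Pick applications used the system’s own tools rather than Python, but it captures the flavor:

```python
# A multivalued field holds several values inside a single record --
# here, an invoice carries its line items directly, rather than having
# them normalized out into rows of a separate table.
invoice = {
    "invoice_no": "INV-100",
    "customer": "ACME",
    "line_items": [                      # one field, many values
        {"product": "WIDGET", "qty": 10},
        {"product": "GADGET", "qty": 2},
    ],
}

for item in invoice["line_items"]:
    print(invoice["invoice_no"], item["product"], item["qty"])
```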

In the late 1980s, object-oriented (OO) programming techniques and languages such as Smalltalk and C++ emerged, and soon there were DBMSs that could store and retrieve object attribute data seamlessly as extensions of the programming environment. These OO DBMSs had sophisticated structures that supported nesting, recursion, and all the required OO characteristics, and were brought to market by such firms as Versant, Objectivity, Object Design, GemStone, and POET Software.

Some of these developments have influenced the course of the relational DBMS, resulting in extensions to the relational model such as multivalued data support and explicit support for unstructured and semi-structured data in the database. Stonebraker pursued this capability by applying OO principles to an object-relational DBMS (ORDBMS), work that produced Illustra in 1994. Informix acquired Illustra the following year, blending ORDBMS technology into a product called Informix Universal Server and inspiring Oracle and IBM to offer ORDBMS capabilities of their own.

The turn of the 21st century saw considerable consolidation. VMark (maker of UniVerse) merged with Unidata to become Ardent Software, which was in turn acquired by Informix, along with Red Brick, a data warehouse DBMS vendor. Informix DBMS technologies were ultimately acquired by IBM in 2001. Oracle acquired in-memory DBMS vendor TimesTen in 2005 and open-source small-footprint DBMS vendor Sleepycat in 2006.

In addition to consolidation, the DBMS industry saw considerable technical advancement by all the major vendors and some new players. These technical developments included advanced clustering support and, more recently, columnar and cell-based in-memory DBMSs.

The revolution continues

Technologies underlying some of these earlier products have fueled the design and development of what may be the next big leap in DBMS technology. The key-value pairs, list-oriented structures, and other techniques that offer flexibility and scalability for workloads that do not require perfect consistency are being rediscovered and offered, often in the context of the cloud, as so-called NoSQL DBMSs. In addition, object-oriented DBMS technology is being used to deliver graph databases, which are key to managing social networks and addressing other relationship-oriented Big Data problems. In 2000, some leading analysts who watched the database industry declared that the DBMS was becoming a commodity—that the most significant innovation was in the past. How wrong they were. The database revolution has barely begun.