Big data, black gold

How oil and gas companies could store their most precious resource

Global Oil and Gas Sales Leader for Software Defined Infrastructure, IBM

Oil and gas companies face explosive growth in unstructured data, on the order of 60–80 percent year over year. To cope, many are implementing industrial-grade predictive analytics solutions that help them handle their big data, and the potential return is substantial. A successful big data implementation, one that strikes the right balance between connectivity and analysis of complex data sources, can improve several facets of this industry:

  • Analysis while managing exponential expansion of storage
  • Data availability while avoiding data loss
  • Exploration and development
  • Drilling to meet completion deadlines
  • Production and operations by reengineering processes and systems to optimize logistics and hedge risk
  • Security and safe delivery of crude, liquefied natural gas (LNG) and refined products

Legacy data storage and management

Why do oil and gas companies need next-generation analytics and storage? Why can’t their legacy storage solutions address the big data requirements of this era?

Quite simply, legacy storage solutions can no longer satisfy the big data needs of oil and gas, especially for seismic backup, archive and global distribution requirements. Redundant array of inexpensive disks (RAID) is a prime example: as disk drives grow larger, rebuilds take longer, in some cases weeks, and a further failure during a rebuild can lead to data loss. Moreover, legacy systems expose an organization to complete loss of data, particularly because backup solutions are complex, must handle massive amounts of information and do not necessarily provide the desired reliability.

A single storage management solution that supports any deployment model, whether on premises or on a public or hybrid cloud platform, is no longer optional, because more and more oil and gas workloads are moving to the cloud.

Bye-bye, RAID

A RAID scheme is based on parity, and a parity scheme tolerates only a fixed number of simultaneous drive failures: one with single parity (RAID 5), two with dual parity (RAID 6). If more drives than that fail at once, the data is not recoverable. The statistical likelihood of multiple drive failures was not an issue in the past because comparatively little data was being stored. However, as drive capacities continue to grow beyond the terabyte range and storage systems grow to hundreds of terabytes and petabytes, the simultaneous failure of multiple drives has become a realistic risk.
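The limit of parity protection can be seen in a few lines. The following is a minimal sketch of single (XOR) parity, assuming three tiny data blocks standing in for drives; the block contents and function name are illustrative only and do not represent any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data "drives" plus one parity drive.
data = [b"SEIS", b"MIC1", b"DATA"]
parity = xor_blocks(data)

# One drive lost: XOR of every surviving block (data + parity) rebuilds it.
recovered = xor_blocks([data[0], data[2], parity])

# Two drives lost: the survivors only reveal the XOR of the two missing
# blocks, which is not enough to tell either of them apart.
ambiguous = xor_blocks([data[0], parity])
```

With one failure, `recovered` equals the lost block. With two failures, `ambiguous` equals the XOR of the two lost blocks, so neither can be reconstructed on its own.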

Shifting to object storage

Businesses are moving toward petabyte-scale data storage, and object storage solutions are proving to be the right choice for balancing scale, complexity and costs. By way of their core design principles, object storage platforms can deliver unprecedented scale at reduced complexity and cost, even over the long term.

The market has more or less adopted Amazon S3 and OpenStack Swift as the Representational State Transfer (RESTful) access mechanisms for object storage. Several suppliers are aligning themselves to OpenStack in an attempt to make their object storage compatible with Swift, Cinder, Manila, Glance, and Nova.
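To make "RESTful access" concrete, here is a minimal sketch that builds (but does not send) an S3-style request to store an object. It assumes path-style addressing and a hypothetical endpoint, bucket and key; it also omits the request signing that a real S3 service requires. The `x-amz-meta-` prefix is how S3 carries user-defined metadata in headers (Swift uses `X-Object-Meta-` for the same purpose).

```python
from urllib.parse import quote
from urllib.request import Request


def put_object_request(endpoint, bucket, key, body, metadata):
    """Build an unsigned, S3-style PUT request for one object.

    Illustrative only: a real S3 endpoint also expects authentication headers.
    """
    headers = {f"x-amz-meta-{name}": value for name, value in metadata.items()}
    url = f"{endpoint}/{bucket}/{quote(key)}"  # percent-encode the object key
    return Request(url, data=body, headers=headers, method="PUT")


# Hypothetical endpoint, bucket and key, used purely for illustration.
request = put_object_request(
    "https://objects.example.com",
    "seismic",
    "surveys/block 42.segy",
    b"raw trace data",
    {"survey": "north-sea"},
)
```

The object is addressed entirely by URL and manipulated with an ordinary HTTP verb, which is what lets the same application code target any store that speaks the same API.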

Object storage also allows addressing and identification for individual objects by more than just file name and file path. It adds a unique identifier within a bucket, or across the entire system, to support much larger namespaces and eliminate name collisions. Object storage also allows for including rich, custom metadata within the object, and it explicitly separates file metadata from data to support additional capabilities. And in contrast to fixed metadata in file systems such as file name, creation date, type, and so on, object storage provides for full-function, custom, object-level metadata to achieve several benefits:

  • Capture application-specific or user-specific information for enhanced indexing
  • Support data management policies—for example, a policy to drive object movement from one storage tier to another
  • Centralize management of storage across many individual nodes and clusters
  • Optimize metadata storage (for example, encapsulated, database or key-value storage) and caching and indexing (when authoritative metadata is encapsulated with the data inside the object) independently from the data storage (for example, unstructured binary storage)
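The ideas above (a unique identifier per object, custom metadata, and policies driven by that metadata) can be sketched with a small in-memory model. The class names, the `age_days` metadata key and the tier names are hypothetical, chosen only for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict  # custom, object-level metadata
    tier: str = "hot"
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class Bucket:
    """Toy in-memory bucket: objects are addressed by ID, not by file path."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        obj = StoredObject(data, dict(metadata))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        return self._objects[object_id]

    def find(self, **criteria):
        """Return IDs of objects whose metadata matches every criterion."""
        return [
            obj.object_id
            for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]

    def apply_tier_policy(self, max_age_days, target="archive"):
        """Move objects older than max_age_days to the target tier."""
        moved = []
        for obj in self._objects.values():
            if obj.metadata.get("age_days", 0) > max_age_days:
                obj.tier = target
                moved.append(obj.object_id)
        return moved
```

Two objects with identical payloads still receive distinct identifiers, and a tiering decision needs nothing but the metadata stored with each object.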

In addition, the following features characterize some object-based file system implementations: 

  • The file system clients only contact metadata servers once when the file is opened, and then get content directly through object storage servers—in contrast to block-based file systems that require constant metadata access.
  • Data objects can be configured on a per-file basis to allow adaptive stripe width—even across multiple object storage servers—supporting optimizations in bandwidth and I/O.
  • Object-based storage devices and some software implementations—for example, Caringo Swarm—manage metadata and data at the storage device level.
  • Instead of providing a block-oriented interface that reads and writes fixed-sized blocks of data, data is organized into flexibly sized data containers called objects.
  • Each object has both data (an uninterpretable sequence of bytes) and metadata (an extensible set of attributes describing the object). Physically encapsulating the two together aids recoverability.
  • The command interface includes commands to create and delete objects, write and read bytes to and from individual objects, and to set and get attributes on objects.
  • Security mechanisms provide per-object and per-command access control.
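The command interface described in the last few points can be sketched as a small class. This is an illustrative model only, not the interface of any real object storage device; the command names, the principal names and the `AccessDenied` exception are assumptions made for the example.

```python
class AccessDenied(Exception):
    """Raised when a principal may not run a command on an object."""


class ObjectStorageDevice:
    """Toy model: flexibly sized objects, attributes, per-command access."""

    ALL_COMMANDS = {"read", "write", "get_attr", "set_attr", "delete"}

    def __init__(self):
        self._data, self._attrs, self._acl = {}, {}, {}
        self._next_id = 0

    def _check(self, principal, object_id, command):
        if command not in self._acl[object_id].get(principal, set()):
            raise AccessDenied(f"{principal} may not {command} object {object_id}")

    def create(self, owner):
        object_id, self._next_id = self._next_id, self._next_id + 1
        self._data[object_id] = bytearray()
        self._attrs[object_id] = {}
        self._acl[object_id] = {owner: set(self.ALL_COMMANDS)}
        return object_id

    def grant(self, object_id, principal, *commands):
        self._acl[object_id].setdefault(principal, set()).update(commands)

    def write(self, principal, object_id, offset, payload):
        self._check(principal, object_id, "write")
        buf = self._data[object_id]
        if len(buf) < offset:
            buf.extend(bytes(offset - len(buf)))  # zero-fill any gap
        buf[offset:offset + len(payload)] = payload

    def read(self, principal, object_id, offset, length):
        self._check(principal, object_id, "read")
        return bytes(self._data[object_id][offset:offset + length])

    def set_attr(self, principal, object_id, name, value):
        self._check(principal, object_id, "set_attr")
        self._attrs[object_id][name] = value

    def get_attr(self, principal, object_id, name):
        self._check(principal, object_id, "get_attr")
        return self._attrs[object_id][name]

    def delete(self, principal, object_id):
        self._check(principal, object_id, "delete")
        for table in (self._data, self._attrs, self._acl):
            del table[object_id]
```

Note that objects grow as bytes are written rather than being allocated in fixed-size blocks, and that every command is checked against the permissions of the specific object it targets.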

Enhancing object storage with information dispersal

Other differentiators exist within the object storage space. Object storage combined with information dispersal can eliminate the need for costly replication while still meeting big data storage objectives.

Information dispersal technology (like that offered by Cleversafe) provides a more efficient, cost-effective approach to object storage. Information dispersal algorithms expand, virtualize, transform, slice and disperse data across a network of storage nodes in various locations. The approach keeps a single instance of the data with minimal expansion, stored in hybrid fashion across both on-premises and cloud infrastructure, and maintains high levels of data integrity and availability.

Each individual "slice" of data does not contain enough information to understand the original data. In the information dispersal process, the slices are stored with extra coded data so that the system needs only a predefined subset of the slices from the dispersed storage nodes to fully retrieve all the data. Because the data is dispersed across devices, it is resilient against natural disasters and technological events such as drive failures, system crashes and network interruptions. And because only a subset of slices is needed to reconstitute the original data, the system can tolerate multiple simultaneous failures across disks, servers or networks and still serve the data in real time.
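The "any subset of slices" property can be illustrated with a small k-of-n dispersal sketch. This is not Cleversafe's algorithm, which the article does not describe in detail; it is a generic Reed-Solomon-style scheme that treats each group of k bytes as polynomial coefficients over the prime field of size 257 and stores one evaluation per slice. The 3-of-5 configuration used in the test is hypothetical, and the sketch omits the encryption step, so it demonstrates recoverability rather than the confidentiality of individual slices.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def disperse(data, k, n):
    """Split data into n slices such that any k of them rebuild it."""
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    slices = [[] for _ in range(n)]
    for start in range(0, len(padded), k):
        coeffs = padded[start:start + k]
        for i in range(n):
            x = i + 1  # slice i stores the polynomial evaluated at x = i + 1
            slices[i].append(sum(c * pow(x, j, P) for j, c in enumerate(coeffs)) % P)
    return slices


def solve_mod_p(matrix, rhs):
    """Gauss-Jordan elimination modulo P; returns the solution vector."""
    k = len(rhs)
    rows = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if rows[r][col])
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inverse = pow(rows[col][col], -1, P)  # modular inverse (Python 3.8+)
        rows[col] = [v * inverse % P for v in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(a - factor * b) % P for a, b in zip(rows[r], rows[col])]
    return [row[k] for row in rows]


def rebuild(available, k, length):
    """Rebuild the original bytes from any k slices, keyed by slice index."""
    chosen = sorted(available)[:k]
    if len(chosen) < k:
        raise ValueError("not enough slices to rebuild the data")
    vandermonde = [[pow(i + 1, j, P) for j in range(k)] for i in chosen]
    out = bytearray()
    for pos in range(len(available[chosen[0]])):
        out.extend(solve_mod_p(vandermonde, [available[i][pos] for i in chosen]))
    return bytes(out[:length])
```

In this scheme the raw storage consumed is n/k times the size of the data, so a 3-of-5 layout survives any two lost slices at roughly 1.67 times expansion, compared with 3 times for keeping three full copies.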

To store a petabyte of information in a typical RAID + Replication arrangement, you need about 5,000 TB of raw storage, or 1,350 3 TB disks. Storing the same amount of data in an object storage architecture built on information dispersal (like that of Cleversafe) takes less space: about 1,700 TB of raw storage, or 534 3 TB disks. Compared to standard RAID + Replication architectures, object storage solutions can be approximately 60 percent more efficient and 80 percent more cost effective.
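The raw-capacity figures above can be turned into expansion factors with simple arithmetic. The sketch below uses only the raw terabyte figures quoted in the text and assumes one petabyte equals 1,000 TB. Note that dividing the raw capacities by 3 TB gives disk counts different from those quoted, which presumably rest on assumptions not stated here.

```python
import math

DATA_TB = 1000  # one petabyte of stored data, taken as 1,000 TB (assumption)
DISK_TB = 3     # disk size quoted in the text


def expansion_factor(raw_tb, data_tb=DATA_TB):
    """Raw capacity consumed per unit of data stored."""
    return raw_tb / data_tb


def whole_disks(raw_tb, disk_tb=DISK_TB):
    """Number of whole disks needed to provide the raw capacity."""
    return math.ceil(raw_tb / disk_tb)


raid_raw_tb = 5000       # RAID + Replication figure from the text
dispersal_raw_tb = 1700  # information dispersal figure from the text

raw_savings = 1 - dispersal_raw_tb / raid_raw_tb

print(f"RAID + Replication: {expansion_factor(raid_raw_tb):.1f}x expansion")
print(f"Information dispersal: {expansion_factor(dispersal_raw_tb):.1f}x expansion")
print(f"Raw capacity saved: {raw_savings:.0%}")
```

On these inputs the raw capacity requirement falls by about two-thirds, which is in the same range as the roughly 60 percent efficiency gain cited.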


Deploying an advanced, clean-sheet solution

Despite this extensive discussion about object storage, RAID and other high-level concepts, we've only touched lightly on specific solutions, which might be a good note on which to conclude this discussion. IBM offers Cleversafe, a clean-sheet object storage solution acquired in late 2015. This object storage solution leverages content transformation, physical distribution and reliable retrieval. Cleversafe also offers several key benefits and features that are quite useful for oil and gas companies, many of which still rely on traditional RAID systems.


  • No RAID sets or replication schemes require management. Fewer people can manage more petabytes than ever.
  • No downtime is necessary during software upgrades or hardware refreshes, or in the face of disk, node or site failures.


  • Software defined and single-instance storage that leverages commodity hardware can dramatically reduce cost.
  • The solution scales to exabyte-plus capacity while delivering enterprise-class functions and security.
  • Cleversafe can be deployed across a continuum either on premises or off premises on the IBM SoftLayer hybrid cloud, and it can connect with the expansive IBM Storage family.
  • Various data access options are available using the Amazon S3–compatible application programming interface (API), OpenStack-compatible API and simple object API.
  • Secure multitenancy through dsNet enables data-at-rest encryption plus vaults. Multiple vaults can be created in the same dsNet to provide access and data separation.


  • Cleversafe’s multiple deployment options include single-site, multisite and geographically dispersed deployment.
  • Less raw storage means less power, cooling and floor space, which can significantly reduce total cost of ownership (TCO).
  • Cleversafe interoperates with IBM Spectrum Scale, so users can migrate files or objects to Cleversafe’s storage pool, either on premises or in the cloud.


  • Multiple exabytes are already in production.
  • Performance and/or capacity can be scaled at any time with no downtime to operations.
  • Nonstop availability means data can be accessed across the network worldwide.

Security and integrity

  • The object-based approach encrypts and disperses data across the system without making copies—helping deliver always-on expandability in a greatly reduced footprint.
  • Encryption for data at rest provides government-grade security, and with built-in key management, no single disk, node or site holds enough information to expose data in a breach.
  • Powerful encryption helps ensure security at every step, so personnel can work together on sensitive files with fast access and 100 percent reliability. Critical digital assets can be accessed, distributed and protected across multiple locations.
  • Built-in self-repair and integrity-checking features help ensure data is maintained for its lifecycle.

A call for change

Storing data and ensuring its integrity takes time, money and other resources. But companies have no reason to hew to inefficient, outdated legacy infrastructures when better options are available. Fast data growth calls for storage that can keep pace, and hybrid solutions that store data both on premises and in the cloud are the way of the future. With luck, they will change the way many oil and gas organizations choose to store their data.