Making big data lean and nimble

Staff Software Engineer, IBM

Storage format has become a very hot topic around the IT watercooler. In today’s world of big data we need a place to put the massive amounts of data coming at us from multiple sources, in unique formats and at different speeds. Ideally, we would access and query these data sources directly, rather than be forced to copy them to intermediate formats, which takes time, results in replicated files and wastes storage space in HDFS.

The flexibility of Big SQL

Leaders are looking for ways to get value from huge amounts of data, but in a way that doesn’t break the bank or slow the delivery of key business decisions to users. Fortunately, unlike competing SQL-on-Hadoop implementations that require additional proprietary metadata and additional RDBMS front-ending, Big SQL in BigInsights v3.0 can work with files in their native formats on Hadoop. This means that customers have the best of both worlds: the ability to store their data in Hadoop's native formats, but access and query the data using industry-standard ANSI SQL in addition to other Hadoop tools.

The Hadoop world offers multiple file formats, the result of extensive research. They range from plain text files to more powerful row-oriented or column-oriented binary formats (namely the Sequence file, RC file, ORC file, Parquet and Avro). The good news is that Big SQL supports them all. In fact, it is the only SQL-on-Hadoop implementation that supports both the ORC and Parquet storage formats.
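To make the row-oriented versus column-oriented distinction concrete, here is a minimal Python sketch. It is not how any of the formats above are actually encoded on disk; the sample records, field names and helper functions are hypothetical and exist only to show the idea of grouping values by record versus grouping them by field.

```python
# Hypothetical sample records; the field names are illustrative only.
FIELDS = ("id", "region", "sales")

records = [
    {"id": 1, "region": "east", "sales": 120},
    {"id": 2, "region": "west", "sales": 75},
    {"id": 3, "region": "east", "sales": 310},
]


def to_row_layout(rows):
    """Row-oriented: all fields of one record are stored together."""
    return [tuple(row[field] for field in FIELDS) for row in rows]


def to_column_layout(rows):
    """Column-oriented: all values of one field are stored together."""
    return {field: [row[field] for row in rows] for field in FIELDS}


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# Summing a single field: the row layout has to walk every record,
# while the column layout only touches the one list it needs.
total_from_rows = sum(row[FIELDS.index("sales")] for row in row_layout)
total_from_columns = sum(column_layout["sales"])

print(total_from_rows, total_from_columns)
```

Both totals are the same; the difference is how much data a query has to read to get there, which is why column-oriented formats tend to suit analytic queries that touch only a few fields.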

Big data environments deal with massive datasets and, in order to store and process them effectively, compression is a key capability that reduces costs and increases efficiency. In the Hadoop world, SNAPPY, LZO, GZIP, BZIP2 and CMX are very popular compression codecs. For each of the file formats discussed above, Big SQL in BigInsights v3.0 supports at least one of these Hadoop compression codecs.
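As a rough illustration of why compression matters for storage cost, the following Python sketch compares the size of some repetitive sample data before and after compression. It uses only the gzip and bz2 modules from the Python standard library, so it covers two of the codecs named above; Snappy, LZO and CMX are not in the standard library and are left out. The sample payload and the `compressed_sizes` helper are assumptions made for this example, and the sketch says nothing about how Big SQL or Hadoop apply these codecs internally.

```python
import bz2
import gzip


def compressed_sizes(payload: bytes) -> dict:
    """Return the byte length of the payload raw and under each codec."""
    return {
        "raw": len(payload),
        "gzip": len(gzip.compress(payload)),
        "bzip2": len(bz2.compress(payload)),
    }


# Hypothetical, highly repetitive CSV-style data; real datasets will
# compress less dramatically than this.
sample = b"2014-09-01,east,120\n" * 50_000

sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name:>5}: {size} bytes")
```

The exact ratios depend heavily on the data and the codec settings, so treat the numbers this prints as a demonstration rather than a benchmark.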

Read more of my thoughts on related subjects on HadoopDeveloperWorks and gain a deeper understanding of Hadoop storage formats and the way in which they are helping us manage big data.

Learn more