How to succeed using Avro storage format

August 12, 2014

As a big data platform, Hadoop is designed to work with very large data sets, and it does so by following a divide-and-conquer approach to data and its processing. In an ideal scenario, Hadoop prefers that data be divided into fewer but larger chunks, which can be processed efficiently with minimal overhead. Moreover, much of this data is semi-structured (for example, log files and sensor data) or unstructured (such as text from social media or news services), and the ability to represent it efficiently and effectively (with metadata, for example) has proven quite helpful for data processing.

These imperatives have led to the creation of a variety of file formats for Hadoop, including the Avro storage format. IBM InfoSphere BigInsights’s Big SQL works seamlessly with the popular Hadoop file formats, including Avro, though Avro requires some unique input and output format definitions.

What is Avro?

Avro is a storage format that keeps the data definition (the schema) together with the data itself, allowing Avro files to be read and interpreted by many different programs. All of the data is stored in a binary format, making the files more compact, and Avro even adds sync markers that help MapReduce jobs find where to split large files for more efficient processing.
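
To make this concrete, an Avro schema is simply a small JSON document that travels with the data. A minimal sketch might look like the following (the record and field names here are illustrative only):

{"type": "record",
 "name": "sensor_reading",
 "fields": [ {"name": "sensor_id", "type": "string"},
             {"name": "reading",   "type": "double"} ]}

Because every Avro data file embeds a schema like this, any reader that understands Avro can interpret the data without a separate definition.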

How to create an Avro table

When creating tables, you can specify the storage format you prefer, and Big SQL will take care of converting the input data into the specified format. When creating an Avro table, however, there are several things to look out for:

  • Only a subset of datatypes is allowed
  • Some datatypes require special mappings
  • A specific Inputformat and Outputformat are required
  • There are multiple methods to create an Avro table

Big SQL to Avro datatype mappings

Big SQL         Avro
-------         ----
int             int
bigint          long
double          double
float           double
varchar(max)    string
smallint        boolean
varchar(max)    enum

When defining an Avro table, there are four parts that are required (see how they map onto the skeleton CREATE statement after this list):

  1. Inputformat
  2. Outputformat
  3. SerDe
  4. Avro Mappings
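
Mapped onto the CREATE statement used throughout this article, the four parts line up as sketched below (this is only a skeleton; the column list and schema literal are placeholders, not working syntax):

CREATE HADOOP TABLE MY_AVRO_TABLE ( ... )                                           -- optional Big SQL column definitions
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'  -- 1. Inputformat
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'          -- 2. Outputformat
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'                     -- 3. SerDe
TBLPROPERTIES ('avro.schema.literal'='{ ... }')                                     -- 4. Avro mappings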

Note that the Big SQL table definition is not required, which means there are a couple of ways to create an Avro table:

  • You specify the Big SQL datatype and the Avro mapping
  • You specify the Avro mapping and let Big SQL handle mapping to the Big SQL datatype

For the first option, an example would be a table like this:

CREATE HADOOP TABLE AVRO_TABLE (int_col     int,
                                bint_col    bigint,
                                double_col  double,
                                float_col   float,
                                string_col  string,
                                boolean_col boolean)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
TBLPROPERTIES
('avro.schema.literal'=
 '{"type": "record", "name": "TUPLE_1",
   "fields": [ {"name": "int_col",     "type": ["null", "int"]     },
               {"name": "bint_col",    "type": ["null", "long"]    },
               {"name": "double_col",  "type": ["null", "double"]  },
               {"name": "float_col",   "type": ["null", "double"]  },
               {"name": "string_col",  "type": ["null", "string"]  },
               {"name": "boolean_col", "type": ["null", "boolean"] } ]}')

When creating a table this way, you define the columns with Big SQL and then match them up with Avro fields of the same names and the appropriate Avro mappings. This method guarantees the mappings between Avro and Big SQL.

In the second option, you don't need to specify the Big SQL portion of the CREATE statement; Big SQL will take care of the mappings from the Avro datatypes to its own.

CREATE HADOOP TABLE AVRO_TABLE
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
TBLPROPERTIES
('avro.schema.literal'=
 '{"type": "record", "name": "TUPLE_1",
   "fields": [ {"name": "int_col",     "type": ["null", "int"]     },
               {"name": "bint_col",    "type": ["null", "long"]    },
               {"name": "double_col",  "type": ["null", "double"]  },
               {"name": "float_col",   "type": ["null", "double"]  },
               {"name": "string_col",  "type": ["null", "string"]  },
               {"name": "boolean_col", "type": ["null", "boolean"] } ]}')

Following this same format, you can also do a Create Table as Select with Avro as the target storage format. To do so, you create the table as in option 2 and append a SELECT statement, as sketched below.
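
A sketch of what this might look like, reusing the option 2 pattern with a reduced two-column schema (AVRO_CTAS_TABLE and SOURCE_TABLE are hypothetical names, and SOURCE_TABLE is assumed to have columns named int_col and string_col):

CREATE HADOOP TABLE AVRO_CTAS_TABLE
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
TBLPROPERTIES
('avro.schema.literal'=
 '{"type": "record", "name": "TUPLE_1",
   "fields": [ {"name": "int_col",    "type": ["null", "int"]    },
               {"name": "string_col", "type": ["null", "string"] } ]}')
AS SELECT int_col, string_col FROM SOURCE_TABLE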

Compression formats

Avro differs from other storage formats in how it handles compression. Avro currently supports only two compression codecs, snappy and deflate, and a different Hadoop property must be used to achieve the desired compression. To enable compression, set the following properties before you load the data:

$SET HADOOP PROPERTY 'mapred.output.compress'='true';

$SET HADOOP PROPERTY 'mapred.output.compression.type'='BLOCK';

$SET HADOOP PROPERTY 'avro.output.codec'='snappy';

Note that the final property is different from the one used for other storage formats; it is what sets the compression codec for the data being loaded into the Avro table.
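
For example, to load with the deflate codec instead of snappy, change only that last property before loading. A sketch, assuming the AVRO_TABLE defined earlier, a hypothetical SOURCE_TABLE with matching columns, and a Big SQL version that supports INSERT ... SELECT into Hadoop tables:

$SET HADOOP PROPERTY 'mapred.output.compress'='true';
$SET HADOOP PROPERTY 'mapred.output.compression.type'='BLOCK';
$SET HADOOP PROPERTY 'avro.output.codec'='deflate';

INSERT INTO AVRO_TABLE SELECT * FROM SOURCE_TABLE;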

Avro can be tricky


There are several pitfalls that must be navigated, but when you follow the guidelines above, using Big SQL with the Avro file format becomes straightforward. Employ these easy-to-follow instructions and immediately start benefiting from the advantages that Avro provides, as well as from the data exploration, query, and analytical capabilities that Big SQL brings on top of Hadoop.

Even as big data continues to move into the spotlight, one can fully expect additional file formats to evolve and current file formats to become smarter and more robust. High-performance tools like Big SQL on InfoSphere BigInsights are designed to take advantage of such advances and to remain data scientist and business analyst friendly instruments for extracting value from all of an enterprise’s data rapidly and efficiently.
