10 reasons to love IBM InfoSphere BigInsights for Hadoop
IBM InfoSphere BigInsights is an industry-standard Hadoop offering that combines the best of open source software with enterprise-grade features. Let’s talk about some key features of the distribution and the top ten reasons to love it:
- InfoSphere BigInsights: 100 percent standard, open source Hadoop
A good thing about standards is that they help you avoid getting locked into a vendor’s proprietary solutions. InfoSphere BigInsights is based on open source software and includes all of the rich tools and capabilities that Hadoop users expect. Customers who need access to the latest Hadoop innovations can participate in IBM’s continuous beta program and get fast access to new open source components and IBM value-added capabilities. Enhancements have been carefully implemented to adhere to open standards (where they exist) and to give customers a choice between IBM enhancements and standard Hadoop functionality.
- Big SQL: Lightning fast, ANSI compliant, native Hadoop formats
Did I mention that SQL is a standard? While many vendors are off reinventing their own dialects of SQL and implementing them over proprietary data structures, IBM has taken a different approach. We have applied more than 30 years of database engineering and query optimization expertise to build a high-performance SQL implementation that runs natively over Hadoop data formats. Big SQL leverages open source catalogs like the Hive metastore. It also supports sophisticated query optimization, memory management and rich SQL analytic functions so that standard queries run without modification. SQL federation means that you can formulate queries that join data from multiple sources, including InfoSphere BigInsights and other IBM and third-party offerings such as Teradata, Oracle, DB2 and Netezza.
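To make the federation idea concrete, here is a minimal sketch of a single query that joins tables living in two separate databases. It is not Big SQL: it uses Python's built-in sqlite3 module and SQLite's ATTACH DATABASE as a stand-in, and the table names, column names and rows are all made up for illustration.

```python
import sqlite3

# Stand-in for federation: one connection, two separate databases.
# "main" plays the role of the Hadoop-side store, "warehouse" an external one.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

# Hypothetical tables and data, invented for this example.
conn.execute("CREATE TABLE main.clicks (customer_id INTEGER, page TEXT)")
conn.execute("CREATE TABLE warehouse.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.clicks VALUES (?, ?)",
    [(1, "/home"), (1, "/pricing"), (2, "/home")],
)
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# One standard SQL statement spans both sources.
rows = conn.execute(
    """
    SELECT c.name, COUNT(*) AS clicks
    FROM main.clicks AS k
    JOIN warehouse.customers AS c ON c.id = k.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
```

The point is only that the person writing the query does not need to know, or care, which store each table lives in.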
- BigSheets: Spreadsheet-like data access for business users
This often comes as a shock to my developer friends, but there are folks out there who are not comfortable coding mappers in Java, writing Pig scripts or developing Spark routines in Scala or Python. For business users who would prefer a spreadsheet over an IDE, BigInsights has you covered with BigSheets. BigSheets provides a familiar, web-based graphical spreadsheet interface that allows users to work with collections of structured or semi-structured data from various Hadoop and non-Hadoop sources, and represent that data in hierarchical workbooks, sheets and graphs. Users can easily explore, refine and visualize data using a variety of prebuilt line readers and built-in functions. The parallel manipulation of large underlying datasets is handled automatically.
- Big Text: Simplify text analytics and natural language processing
Building computer programs that can collect and analyze human language is hard, especially when your customers speak many languages, are scattered across the globe and interact with you across many channels. Using text analytics, however, you can better monitor and analyze customer service interactions, call-center conversations and social media activity, as well as surface new opportunities to better serve customers and improve the bottom line. InfoSphere BigInsights includes some of the same text analytics tooling popularized when IBM Watson, the Jeopardy!-winning computer, demonstrated that it was capable of answering questions posed in natural language. Rather than reinvent Watson-like capabilities from scratch, IBM customers can “stand on the shoulders” of this innovation and get to market fast with applications that incorporate advanced text analytics.
- Adaptive MapReduce: Fully compatible, four times faster
While the chatter right now is all around Spark and SQL-on-Hadoop, MapReduce remains the core framework enabling parallelism for most big data applications. A cool feature in InfoSphere BigInsights Enterprise Edition is Adaptive MapReduce, which can be optionally enabled. It maintains 100 percent compatibility with Apache MapReduce, but improves performance and scheduling agility. Leveraging advanced scheduling technologies from IBM Platform Computing, Adaptive MapReduce was shown in an audited benchmark to outperform Apache MapReduce by an average factor of four when running a large-scale social media workload. What’s even cooler are the management capabilities, like the ability to resize or change priorities for MapReduce jobs in flight, or to make jobs recoverable so that, in the event of a master host or JobTracker failure, long-running jobs can resume from where they left off.
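For readers newer to the programming model that Adaptive MapReduce stays compatible with, here is a toy, single-process sketch of the map, shuffle and reduce phases as a word count. It uses plain Python rather than the Hadoop API, and the function names and sample posts are my own invention.

```python
from collections import defaultdict

def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)

def run_job(records):
    # On a real cluster the mapper calls would run in parallel on many nodes;
    # here they simply run one after another.
    emitted = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

posts = ["big data is big", "data moves slowly"]
counts = run_job(posts)
print(counts)
```

Because mappers and reducers only see their own slice of the data, a scheduler is free to place, reprioritize or restart them independently, which is what makes a faster drop-in scheduler possible.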
- In-Hadoop Analytics: Deploy the analytics to the data
A key design principle of Hadoop is to minimize data movement by vectoring compute tasks to the nodes housing relevant data blocks. Moving analytics to data is hardly a new idea. So what are we talking about exactly? The trick is to be able to run higher-level analytic functions that parallelize easily and transparently, but respect data locality in a fashion that hides complexity from the developer. Big R is a framework for doing exactly this: pushing down R language analytics into the distributed dataset. Similar analytic functions are embedded in IBM Big SQL and other BigInsights facilities so that users can simply embed analytic functions in queries, and let BigInsights do the work.
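The locality principle can be sketched in a few lines. This is a simplified greedy illustration, not how Hadoop or BigInsights actually schedules work: the function name, node names and slot counts are hypothetical, and it assumes there are at least as many free slots as blocks.

```python
def assign_tasks(block_locations, free_slots):
    """Send each block's task to a node that already stores the block when possible.

    block_locations: dict of block id -> list of nodes holding a replica
    free_slots: dict of node -> number of tasks the node can still accept
    Returns a dict of block id -> (node, is_local).
    Assumes total free slots >= number of blocks.
    """
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    plan = {}
    for block, replicas in block_locations.items():
        # Prefer a replica holder with a free slot: no data has to move.
        local = next((n for n in replicas if slots.get(n, 0) > 0), None)
        if local is not None:
            node, is_local = local, True
        else:
            # Otherwise use the node with the most free slots; it reads the block remotely.
            node, is_local = max(slots, key=slots.get), False
        slots[node] -= 1
        plan[block] = (node, is_local)
    return plan

blocks = {"b1": ["n1"], "b2": ["n2"], "b3": ["n1"]}
plan = assign_tasks(blocks, {"n1": 1, "n2": 1, "n3": 1})
print(plan)
```

In this run, b1 and b2 are processed where they live, while b3 spills over to n3 because n1 is already busy. The promise of in-Hadoop analytics is that frameworks make this kind of placement decision for you.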
- HDFS and POSIX: A more capable enterprise file system
Remember at the outset of this post, when I said that InfoSphere BigInsights was open? By default it uses open source HDFS, just like most other Hadoop distributions. Like some other vendors, however, IBM also has a better file system; unlike them, we give customers a choice about whether to deploy it. IBM GPFS FPO is a 100 percent POSIX-compliant distributed file system that fully implements HDFS interfaces and semantics. To Hadoop applications, it looks like standard HDFS, but under the covers it leverages the highly regarded GPFS file system, widely deployed in the world’s largest supercomputing environments. The cool thing about GPFS is that you can “have your POSIX” and be HDFS compatible too. Rather than being constrained to operate on large Hadoop-style blocks only, regular applications can read and write as normal, and data is immediately available to Hadoop applications, avoiding the need to perform Hadoop-style copyFromLocal or copyToLocal operations. In the multi-step operations common in ETL processing, users can often avoid duplicate copies of data, reducing the total storage footprint required. Because metadata is distributed in GPFS, there is no need for a NameNode, which eliminates a single point of failure. With storage pools, you can manage “multi-temperature” storage: frequently accessed data stays in pools that use Hadoop’s characteristic n-way block replication for performance and availability, while less frequently used data migrates to more economical storage that uses approaches like RAID (or even robotic tape silos).
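Some back-of-the-envelope arithmetic shows why multi-temperature storage matters. The numbers below are purely illustrative assumptions, not IBM figures: 100 TB of logical data, of which 20 TB is "hot", three-way replication (the usual HDFS default), and a hypothetical RAID layout with 8 data disks and 2 parity disks for the cold tier.

```python
def raw_capacity_replicated(logical_tb, replicas=3):
    """Raw disk needed when every block is stored `replicas` times."""
    return logical_tb * replicas

def raw_capacity_parity(logical_tb, data_disks=8, parity_disks=2):
    """Raw disk needed for a parity layout with the given data-to-parity ratio."""
    return logical_tb * (data_disks + parity_disks) / data_disks

# Illustrative sizes only (assumptions, not measurements).
hot_tb, cold_tb = 20, 80

# Everything kept on replicated storage.
all_replicated = raw_capacity_replicated(hot_tb + cold_tb)

# Hot data replicated, cold data moved to a parity-protected pool.
tiered = raw_capacity_replicated(hot_tb) + raw_capacity_parity(cold_tb)

print(all_replicated, tiered)  # 300 160.0
```

Under these assumptions, tiering cuts the raw capacity needed from 300 TB to 160 TB. Your own ratio will depend on how much of your data is really hot and on the protection scheme you choose.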
- Big R: Deep R Language integration in Hadoop
As any data scientist will tell you, analytic models in R are often memory constrained. In Hadoop environments, developers can work around this by writing Map and Reduce logic to distribute R-based code, but this is inefficient, complicates coding and can lead to errors. Big R in InfoSphere BigInsights provides a comprehensive set of analytic functions, callable using familiar R language semantics, that auto-parallelize across the Hadoop cluster. What’s great about Big R is that it works with existing open source R tools and with downloadable packages from CRAN (the Comprehensive R Archive Network), available from r-project.org. Advanced machine learning technologies (based on SystemML, a declarative machine learning language developed by IBM Research) have also found their way into Big R, making advanced, parallel machine learning algorithms accessible using familiar R language syntax.
- IBM Watson Explorer: Search, explore and visualize all your data
Open source search capabilities like Apache Solr and Apache Lucene are great for searching Hadoop data sources and are included in InfoSphere BigInsights. Big data is about “all the data,” however, and data often lives in platforms other than Hadoop. IBM Watson Explorer provides secure, federated navigation and discovery across a broad range of enterprise content and data sources to maximize return on information. By delivering information to the right people, at the right time, at the right level of detail and in the right visual format, organizations can improve understanding of their operations for better, faster decisions.
- Accelerators: Get to market faster leveraging pre-written code
When it comes to big data projects, time to value matters. Gathering and storing vast amounts of data, as tough as that is, is actually the easy part; what differentiates Hadoop distributions is the available tooling for manipulation and analysis of data. InfoSphere BigInsights provides pre-packaged “accelerators” for popular use cases such as machine data analytics, social media analytics and the extraction and analysis of text. Whether you’re parsing log files for evidence of fraud, performing geospatial analysis for location-based services or building an application to measure sentiment based on Twitter feeds, IBM provides pre-written code supporting common use cases to help you build higher-quality applications faster and beat competitors to market.
To learn more about IBM InfoSphere BigInsights, download the free InfoSphere BigInsights QuickStart Edition today.
Also, be sure to check out the Big Data for Social Good Challenge (#Hadoop4good), a global hackathon where developers compete to create innovative solutions using Hadoop that solve civil and other real-world social challenges. It’s open, it’s fun and there are big prizes too.