Everyone knows everything is bigger in China, and big data is no exception. This is why it is no surprise that big data interest is gaining a lot of momentum in China.
On August 23 and 24, IBM organized a technical summit in Beijing, and one of the main tracks was about big data. In fact, this track was the most popular one of the summit. All presentations were held in Chinese by local presenters. Some of the session titles were “SQL vs. noSQL – Getting the Best of Both Worlds,” “BigInsights Programming Overview,” “How to Get Started with Your First Big Data Project,” “IBM Cognos BI and BigInsights - Business Intelligence on big data,” “Content Analytics Meet Big Data” and many more.
Prior to the summit there were other activities in support of the event. Two of them were an online challenge and an online interview with me, as I happened to be in China delivering a big data workshop to IBM business partners. Below is an edited summary of the questions that I received during the online interview.
Everybody is now talking about big data, but many people are still puzzled about what exactly it is. As a big data expert, could you please define big data in a simple and concise way?
In simple, concise terms, big data is a collection of data sets that are so big that it is hard for traditional software such as RDBMSs or statistical packages such as SPSS to handle them. Big data is hard to collect, analyze, store, search and visualize.
Can you talk about which industries and applications generate big data? Please list the top 3 industries or the top 3 application scenarios. Would your answer be different for China?
You can find big data in most industries. I will give examples in the telecommunications industry, in security, and in email processing (which could be in any industry). In the telecommunications industry, big data is generated every time a person sends a text message, makes a call, sends a picture, or performs any other activity from their mobile device. When this happens, Call Detail Records (CDRs) are generated. Pause for a moment, and think about how many people in China are using their mobile devices right now... At this very second, thousands of CDRs are being generated. In addition, telecommunications companies normally require CDRs to be sent twice as a backup. This means that the data sent is doubled. That is a lot of data!
Another area where big data is used is security. Picture the thousands of web logs that are being generated by servers. These logs need to be analyzed very quickly in order to spot suspicious behavior and detect threats.
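As a rough illustration of this kind of log analysis, here is a minimal Python sketch that flags client addresses with repeated failed requests. The log format (an IP address followed by an HTTP status code), the function name `suspicious_ips`, and the threshold are all assumptions made for this example; real web logs carry many more fields, and at big data scale this logic would run on a cluster rather than in a single process.

```python
from collections import Counter

def suspicious_ips(log_lines, threshold=3):
    """Return the IPs with at least `threshold` failed (401) requests.

    Assumes a hypothetical, simplified log format: "<ip> <status>".
    """
    failures = Counter()
    for line in log_lines:
        parts = line.split()
        # Skip malformed lines; count only unauthorized responses.
        if len(parts) == 2 and parts[1] == "401":
            failures[parts[0]] += 1
    return sorted(ip for ip, count in failures.items() if count >= threshold)

sample_log = [
    "10.0.0.5 401",
    "10.0.0.5 401",
    "10.0.0.7 200",
    "10.0.0.5 401",
    "10.0.0.7 401",
]
print(suspicious_ips(sample_log))
```

The counting step is the part that matters here: each log line is examined once, so the same idea carries over to logs that are too large to load into memory.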
A third area where you can find big data is in processing of emails. For example, if a company sets up an email address like email@example.com, then assuming this company has many customers, it will receive thousands of emails per day. The company could use big data analytics processing to understand the most common pain points, and provide fixes proactively.
I think most of the issues around the world, including China, are very similar. In the case of China, there may be more complexities due to the language; however, IBM does have software like Text Analytics which can process Chinese. In addition, in China, being such a large country with such a large population, everything is big! So maybe you can say that big data is bigger in China, so this poses even more challenges.
Which IT careers will be affected by big data and what would be the change if any?
There will be changes for developers who will have to learn how to code in a MapReduce framework. This requires a different way of thinking. Fortunately, new higher level languages like Jaql have been developed that can reduce the complexity of MapReduce. Writing programs in Jaql is a lot easier than writing a Java MapReduce program, and the Jaql program will be converted behind the scenes to MapReduce.
There will be changes for administrators such as DBAs, architects and analysts who will need to plan well for capacity in order to handle big data. Their existing skill set will not be wasted, but they will need to build new skills on top of it. For example, they will need to learn HDFS or GPFS-SNC file systems. They will also need to adapt to the Cloud Computing model which will allow them to set up big data clusters quickly, at low cost, and in a flexible way.
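To give a feel for the MapReduce way of thinking, here is a minimal single-machine sketch of the classic word count in Python. It is not Hadoop or Jaql code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this illustration, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence on an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

print(word_count(["big data is big", "data at rest"]))
```

The shift in thinking is that the programmer only writes the per-record map step and the per-key reduce step; grouping and distribution are left to the framework.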
Which products does IBM offer for big data processing? Can you please give us a brief introduction of each product? What are the top 3 advantages compared to competitors’ products?
In IBM we talk about a “big data Platform” which includes (1) traditional data processing and analysis software like DB2, InfoSphere Warehouse, Netezza, and so on; (2) software to process big and typically unstructured data (IBM InfoSphere BigInsights, and IBM InfoSphere Streams), and (3) software that integrates both traditional data and big data (InfoSphere Information Server).
The two main software products at the core of this platform for big data processing are IBM InfoSphere Streams (Streams) and IBM InfoSphere BigInsights (BigInsights). In order to explain briefly what each of these two products does, you need to understand there are two main types of big data: data in motion, and data at rest.
The analogy used for data in motion is a river or a stream. Imagine standing up in the middle of a bridge and looking down at a stream. As the water passes by you analyze what it carries in real time. Similarly, Streams software can analyze in real time the streams of data flowing through its system, and perform real time analytics processing (RTAP).
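The idea of analyzing data as it flows past, rather than storing it first, can be sketched in a few lines of Python. This is only an illustration of the concept and does not use the Streams product or its language; the function name `rolling_average` and the window size are assumptions for the example.

```python
from collections import deque

def rolling_average(stream, window_size=3):
    """Yield the average of the most recent `window_size` values.

    Each value is processed as it arrives; only the current window
    is kept in memory, never the whole stream.
    """
    window = deque(maxlen=window_size)  # old values fall out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Any iterable can stand in for the stream, including an endless generator.
for average in rolling_average([10, 20, 30, 40], window_size=2):
    print(average)
```

Because the function is a generator, a result is available after every incoming value, which mirrors the "look at the water as it passes" analogy.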
The analogy used for data at rest is an ocean. Oceans are huge and the water is not flowing. Similarly there are huge amounts of big data stored in many repositories which have never been analyzed. BigInsights software can help analyze this type of big data.
BigInsights has many advantages over competitors’ products. Three features that come to mind are: GPFS-SNC file system, Adaptive MapReduce, and BigSheets. GPFS-SNC is a robust file system that allows for a mixed workload (random and sequential processing), High Availability, security and more. Compared to HDFS, which has a single point of failure (when the NameNode crashes), GPFS-SNC guarantees no data loss.
Adaptive MapReduce allows a TaskTracker to not only work independently on its task, but also see what’s going on overall with other TaskTrackers. When it sees it can help on other tasks, it will help out, rather than just do its originally assigned work. Therefore, there will be reduced cost in starting new TaskTrackers. These two features take into account that in real life, companies have mixed workloads, and cannot have one system only for Hadoop-type workloads and another one for OLTP. BigSheets is a great feature to empower business users and management to work directly on big data without much support from their IT department.
With respect to InfoSphere Streams, the main advantage is scalability. There simply is no other product in the market that can scale to the degree that Streams can scale. Another advantage compared to competitors’ products is that Streams can handle both structured and unstructured data types, while competitors’ products can only support structured data types. A further advantage, shared with BigInsights, is the capability to perform text analytics. Streams can also be integrated with BigInsights.
By the way, BigDataUniversity.com has courses on both Text Analytics, and Streams Computing. Courses in BigDataUniversity.com are free, have hands-on labs, and can offer you a certificate of completion if you pass the course test.
Currently there is a lack of skill in big data. Can you please share what initiatives have been taken by IBM worldwide and how that pertains to Chinese users? Is it possible to invite more worldwide experts of big data to the community to share in their expertise?
Indeed there is a lack of big data skill worldwide. Jobs are growing exponentially in this area, and salaries for corresponding jobs are growing too! It is a good time to be in the big data area. To mitigate the skill shortage, IBM sponsors a community site called “Big Data University” (BigDataUniversity.com). The site officially started operations around August 2011, and in 1 year it has grown to have close to 35,000 students. Amazing! The site provides free online educational training about big data and more. Their motto is “Learn @yourpace, @yourplace, @yourtime”. I encourage you to visit the site, and try some of the free courses. Take the course tests so you get your certificate of completion. In addition, each course has hands-on labs so it’s not just theory, but you can actually try “doing big data.” Other than that, I’m sure other experts from the big data community would be happy to come to China to present the latest and greatest about big data.
You are one of the key promoters of big data in communities. We have the biggest community in China for different areas including DB2 China, Cognos China, and WebSphere China users’ communities. We would like to know if we can have some collaboration with worldwide communities such as Big Data University?
I would be delighted if we could collaborate more with your organization and communities. Actually, we are looking for enthusiastic volunteers to help us translate some of the existing courses in BigDataUniversity into Chinese. In addition, we are planning, with the help of the IBM China Development Lab, to develop courses in Chinese with English subtitles. Please contact me if you are interested in helping.
We are holding the first “IT Practice Talent” campaign in China, and there are nearly 20,000 people who have participated in the knowledge contest, of which 8499 users are taking the big data contest. Any words of wisdom you would like to say to the participants?
I understand there were other interesting areas they could have chosen to participate in, so the first thing I’d like to tell them is that choosing big data is a great choice. Everyone should embrace big data because it’s a topic that will be with us for years to come. Becoming an expert on big data now will open up your career opportunities immensely, and you may be working on very interesting projects that may have life-changing repercussions. I wish you all good luck, and don’t give up!
The second part of the “IT Practice Talent” will consist of designing an application that is innovative and uses the latest technologies. Can you provide suggestions to the participants about how to design their application using IBM big data products?
First, you need to understand what problem you are trying to solve. Will you need Real Time Analytics processing (RTAP)? If that’s the case, IBM InfoSphere Streams would be the best choice. If, on the other hand, the big data is “at rest,” then you need IBM InfoSphere BigInsights. IBM also has another product called Vivisimo Velocity which allows you to analyze data “in place.” Next, you need to understand where you can find the big data sets required for processing. Sometimes this is where the main problem is. There is a lot of data being collected, but you need to verify whether you can legally access it, and work out how to move it to GPFS or HDFS. You may also need to cleanse the data. The next step may require involving a domain expert to review what you have collected. For example, you may have collected medical information; however, you are not a doctor, so you need an expert in this area to analyze what you have. Finally, prototype your solution, and use the Cloud. Cloud computing is ideal for working in a development/test environment.
For more information about the IBM big data platform and the products within it, visit ibm.com/bigdata