I recently participated in a Twitter chat on big data strategy hosted by @IBMbigdata. In all, seven questions were asked (including one about the hot new role of chief data officer), and the conversation certainly demonstrated that each is well worth discussing again, in a little more than 140 characters.
What role does company culture play in developing and executing a successful strategy?
Company culture is, to a large extent, defined by a corporate communications strategy. That strategy should facilitate and encourage willingness to share information and to collaborate with others in the company for business gain. A company that encourages transparency makes it easy to share and collaborate on business insights produced from analyzing data (big or otherwise), and a company that can forecast what is likely to happen is far better positioned to succeed.
How does big data strategy differ from and impact existing data warehouse, BI and data management programs?
It differs in some ways, and not in others. What is the same is that big data is needed to solve business problems. It may be that these problems could not be solved before due to technology or cost limitations. It may also be that big data is needed to gain deeper insights in areas where insights already exist, such as customer intelligence. What is different: the data characteristics (the three Vs of volume, variety and velocity), the fact that it may require new skills (calling all data scientists!) and that analytical characteristics may dictate that big data analytical workloads run on platforms outside a traditional data warehouse or data mart. That can mean new technologies are needed: for example, a NoSQL column-family database for ingesting high-velocity data, Hadoop and graph DBMSs for exploratory analytics, or stream processing for analyzing data in motion. And while these new platforms may require new skills to administer and manage, the outcome is familiar in that it should provide deeper insight that adds to what is already known.
What is the strategic overlap between enterprise big data, cloud, social and mobile initiatives?
The first thing to be said is that all of these are likely to be needed for business strategy execution. Big data can be captured in the cloud (for example, in the form of sensor data) and analyzed in the cloud, given that big data platforms such as Hadoop already run on public cloud service offerings.

In addition, social data can be a data source for analysis on big data platforms. Interaction data on networks like Twitter and Facebook is of interest to marketers who want to analyze sentiment and identify social network influencers. Sentiment is also of interest to product managers who perhaps want to understand market reaction to a new product. Of course, let's not forget that a very valuable source of sentiment data is also available inside the enterprise in the form of inbound customer interactions stored in CRM customer service application databases. Social network platforms also present an opportunity to interact with customers and prospects, and so can be used for personalized outbound marketing, including targeting influencers.

Mobile is increasingly where online interaction occurs. Web server logs that capture all that mobile (and desktop) browser behaviour are a hugely popular big data source that some organizations need to analyze in real time as data streams, and also over longer time periods. GPS sensor data combined with click stream is even more powerful, representing an opportunity to use location and online behaviour together to offer up new products and services.
With so many technology options to choose from, how do you determine which components are needed in a big data strategy?
First of all, you need to understand the data and analytical characteristics of your analytical workload. This will likely push you in a specific direction with regard to technology choice. For example, the need to conduct real-time statistical or predictive analysis on streaming sensor data is likely to highlight the need for stream processing software that can scale to handle very high-velocity data. Similarly, exploratory analysis of huge corpora of text, or huge volumes of click stream data, may push you into looking at Hadoop as an analytical platform.
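To make the streaming case concrete, here is a minimal sketch in plain Python of the kind of sliding-window computation a stream processing engine performs on sensor data. The window size and alert threshold are hypothetical; a real engine distributes this logic across nodes to reach the required velocity, but the windowing idea is the same.

```python
from collections import deque

def rolling_alerts(readings, window=5, threshold=100.0):
    """Yield (reading, window_avg, alert) for each value in a stream."""
    window_buf = deque(maxlen=window)  # keeps only the last `window` values
    for value in readings:
        window_buf.append(value)
        avg = sum(window_buf) / len(window_buf)
        yield value, avg, avg > threshold

# Example: flag readings where the rolling average crosses the threshold.
stream = [90, 95, 98, 102, 140, 150, 80]
alerts = [round(avg, 1) for _, avg, hit in rolling_alerts(stream) if hit]
# alerts -> [105.0, 117.0, 114.0]
```

The same per-window logic, expressed in a stream processing framework, is what lets the platform scale it out rather than redesigning the analysis.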
The second thing to consider is where you plan to use NoSQL technology, and why. For example, Cassandra could be used to capture very high-velocity data, exploiting its core strength of high-throughput write processing. But once you have captured that data, then what? You may want to join it to structured data, such as customer data. Similarly, Hadoop can be used as:
- A staging area
- An analytical sandbox
- A data refinery for low cost ETL processing
- A data archive in data lifecycle management
You have to know where these technologies fit in your overall architecture and for what business purposes you plan to use them.
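As a hedged sketch of the "capture, then join" point above: events arrive keyed by customer ID (much as a column-family store like Cassandra would hold them), and are later joined to structured customer master records for analysis. The customers, actions and segments here are invented for illustration.

```python
# Hypothetical event data captured at high velocity, keyed by customer.
events = [
    {"customer_id": 1, "action": "view"},
    {"customer_id": 2, "action": "purchase"},
    {"customer_id": 1, "action": "purchase"},
]
# Structured customer master data, e.g. from an RDBMS or MDM hub.
customers = {1: {"name": "Acme Corp", "segment": "enterprise"},
             2: {"name": "Smith Ltd", "segment": "smb"}}

# Join each captured event to the customer record it belongs to.
enriched = [{**e, **customers[e["customer_id"]]} for e in events]

# Example analysis: count purchases by customer segment.
purchases_by_segment = {}
for row in enriched:
    if row["action"] == "purchase":
        seg = row["segment"]
        purchases_by_segment[seg] = purchases_by_segment.get(seg, 0) + 1
# purchases_by_segment -> {"smb": 1, "enterprise": 1}
```

The join step is exactly where architecture matters: you need to know in advance which platform will hold each side of that join and where the join itself will run.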
A hot new role is chief data officer. What responsibilities do you see for a CDO? How do they differ from existing roles?
I have seen the CDO role in some parts of the world more than others. This is a highly visible role, often with a strong link to analytics. I would, however, like to leave analytics aside for a second. CDOs are accountable for delivery of trusted high-quality data that has high business value and that is widely shared, and therefore has to be consistent. They play a central role in improving processes and decisions, reducing risk and in achieving compliance.
They are responsible for simplifying a data landscape that, by most accounts, is getting more complex (more data sources, more data types, more targets, on-premise and cloud applications, and more). They need to identify where data is, establish a common approach to getting it under control, improve data quality, usage and value, and make data available as a service for others to find and use. They are also accountable for measuring data quality. Finally, the programs of work they initiate need to be aligned with business strategy so that they contribute to process improvement and cost reduction (for example, through master data and reference data management), increased revenue (such as a single view of the customer and customer insight), or both.
How should data governance be addressed within a big data strategy?
Just like any other data, big data should be part of a data governance program. Sensitive data stored on big data platforms needs to be protected beyond just the file level, and access and manipulation activity associated with sensitive data needs to be recorded and monitored.

With respect to data cleansing and integration, the same enterprise information management (EIM) tools platform associated with data warehousing and master data management should, if possible, also be capable of supporting big data. This means that big data cleansing and ETL processing should be capable of being pushed down to run on big data platforms, so that the power of the platform can be exploited to get scalability at low cost: for example, using Hadoop MapReduce batch processing for ELT processing at scale. Platform-specific tools like Sqoop and Flume, which are used heavily in Hadoop environments, should integrate with that EIM platform.

With respect to data science, the sandboxes created should be governed: who is allowed to create sandboxes, who can access them, what data sources can be brought into them, and what activities are performed on that data (lineage). All of this matters, and it should be designed not to stifle innovation but to improve robustness. Any new insights produced from big data analytics should also be mapped into the company's shared business vocabulary, which is typically stored in a business glossary. In that way, new insights can be formally introduced into the vocabulary after they have been determined to be of value to the business as a whole.

There are those who advocate total freedom on big data platforms. I am not advocating curtailing the ability to explore and innovate, but at some point what goes on in this environment needs to be kept organized, tidy, secure and monitored. Also, business user self-service data cleansing and integration tools, or BI tools that support this capability, need to have data governance built in.
There should be a governance process to escalate new transformations and manipulations created by a data scientist from a personal level up to a departmental level, and from there to enterprise-wide, so that sharing and re-use of the metadata created is managed. Either that, or data scientists should be able to invoke data services built for them that deliver the trusted data they need.
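The escalation process described above can be sketched as a tiny registry that promotes a transformation one scope at a time, never skipping a level. The class name, scope names and transformation name are all hypothetical, intended only to show the shape of the workflow.

```python
# Scopes a transformation can move through, in order, never skipping one.
SCOPES = ["personal", "departmental", "enterprise"]

class TransformationRegistry:
    """Tracks the governance scope of each registered transformation."""

    def __init__(self):
        self._scope = {}  # transformation name -> current scope

    def register(self, name):
        # New transformations always start at personal scope.
        self._scope[name] = "personal"

    def promote(self, name):
        # Escalate exactly one level; enterprise-wide is the ceiling.
        idx = SCOPES.index(self._scope[name])
        if idx + 1 >= len(SCOPES):
            raise ValueError(f"{name} is already enterprise-wide")
        self._scope[name] = SCOPES[idx + 1]
        return self._scope[name]

registry = TransformationRegistry()
registry.register("clean_postcodes")
registry.promote("clean_postcodes")          # personal -> departmental
scope = registry.promote("clean_postcodes")  # departmental -> enterprise
```

In practice this bookkeeping lives in the EIM platform's metadata repository rather than application code, but the one-level-at-a-time promotion rule is the governance point.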
What pitfalls should people be mindful of when developing a big data and analytics strategy?
There are several pitfalls to avoid. These include:
- Investing in big data technology without first having identified candidate business use cases. This is not a technology ego trip; there has to be a valid business reason why you need it. Also ask yourself what you could do with such a technology. ("What if we could use in-memory technology to do real-time anti-money laundering detection?")
- Building analytical stovepipes with little or no integration and no use of common technologies across platforms. An example is a Hadoop project, a graph DBMS project, a streaming analytics project, a data-mining project on an analytical RDBMS and traditional data warehousing, all done in stand-alone fashion with different data management technologies and no integration: the cost will be higher than needed, skills will be spread more thinly and there will be little re-use. It is important to recognise the value of integrating these technologies to build a stronger analytical ecosystem.
- Selecting NoSQL technologies without first understanding their strengths and weaknesses, and where they fit within your overall architecture. It is also important to recognise that with many NoSQL technologies there are no standards. Unlike relational DBMSs, which have SQL as a standard, NoSQL DBMS APIs are often 100 percent proprietary, which of course means 100 percent lock-in. You also need to understand how they can be used to extend your capability.
- Taking the data to the analytics instead of taking the analytics to the data. Exploit in-database and in-Hadoop analytics so that processing runs where the data lives.
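The last pitfall is easy to see in miniature. In this hedged sketch, SQLite stands in for any analytical RDBMS and the sales table is invented: the aggregation is pushed into the database engine, so only one small result row per group travels to the client, rather than every raw row.

```python
import sqlite3

# In-memory database standing in for an analytical RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 250.0), ("west", 75.0)])

# Take the analytics to the data: the engine does the aggregation and
# returns one row per region.
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
# totals -> {"east": 350.0, "west": 75.0}

# The anti-pattern would be SELECT * followed by summing in application
# code, shipping every raw row across the network first.
conn.close()
```

At big data scale the difference between shipping raw rows and shipping aggregates is precisely what in-database and in-Hadoop analytics exist to avoid.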
Read more about #BigDataMgmt here on the Hub and join us each Wednesday at noon (EST) as we delve into a new and exciting #BigDataMgmt topic.