Selecting your first big data project

Solution CTO, IBM

In this blog, we will cover how to select, staff and plan your first big data project. Our recommendations are based on many years of experience that we have had working with a wide variety of customers in several industries. We won’t focus on specific technologies. Instead, we will examine the organizational dynamics and lessons learned from how these projects go in real life, inside existing, often very busy IT infrastructures.

Every customer is different, so please take these as general guidelines rather than hard and fast rules. Please note that this is not focused on normal project management. We assume that you have adequate project management discipline in place, and we’re simply going to look at the dynamics that can be applied to your big data project.

Know what your compelling drivers are

The first, most obvious question is “Why do this at all?” There should be a compelling use case, a competitive driver, cost driver, or some other issue that has been identified where the application of big data technologies is in the critical path to solving the problem. Typical drivers include the information type (for example, under-utilized structured information sources), or the volume of information (retention of IP logs), but in any case, you need to identify exactly why you are pursuing this path.

One of the most important things you should look for is a compelling ROI (return on investment). That is to say, find something for which you can put a value on the cost of the problem before you plan a solution.

When calculating the problem’s cost, remember to add the cost of the effort you will put into solving it, both from a technology and labor point of view. You can then compare that to the “after” state to determine the overall value of the project. Now, it is important to understand that the first phase of a project may not, in and of itself, be ROI positive. But what is important is for you to have a line of sight through the completion of the entire use case of the project. It may be the second or third application of the technology that flips the switch to a positive ROI.
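To make the line-of-sight idea concrete, here is a minimal sketch of a cumulative ROI calculation across project phases. The phase names, the dollar figures, and the `Phase`, `cumulative_roi` and `first_positive_phase` names are all hypothetical and chosen only for illustration. They do not come from any client engagement.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Phase:
    """One phase of the project. All figures are hypothetical."""
    name: str
    technology_cost: float       # hardware, software and services for this phase
    labor_cost: float            # staff effort for this phase
    problem_cost_removed: float  # value of the problem this phase eliminates


def cumulative_roi(phases: List[Phase]) -> List[float]:
    """ROI after each phase: (benefit to date - cost to date) / cost to date."""
    results = []
    total_cost = 0.0
    total_benefit = 0.0
    for phase in phases:
        total_cost += phase.technology_cost + phase.labor_cost
        total_benefit += phase.problem_cost_removed
        if total_cost == 0:
            raise ValueError("a plan needs a non-zero cost to compute ROI")
        results.append((total_benefit - total_cost) / total_cost)
    return results


def first_positive_phase(phases: List[Phase]) -> Optional[str]:
    """Name of the first phase at which cumulative ROI turns positive, if any."""
    for phase, roi in zip(phases, cumulative_roi(phases)):
        if roi > 0:
            return phase.name
    return None


# Illustrative plan: the first phase carries most of the setup cost.
plan = [
    Phase("phase 1", 300_000, 200_000, 350_000),
    Phase("phase 2", 50_000, 100_000, 350_000),
    Phase("phase 3", 100_000, 150_000, 900_000),
]

for phase, roi in zip(plan, cumulative_roi(plan)):
    print(f"{phase.name}: cumulative ROI {roi:+.1%}")
print("ROI turns positive at:", first_positive_phase(plan))
```

With these made-up numbers the first phase is ROI negative on its own and the plan turns positive at the second phase, which is the kind of path you would want stakeholders to agree on before starting.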

A good example of such a line of sight comes from one of our clients, who uses the natural language analytics capabilities of InfoSphere BigInsights to understand email correspondence. Through the BigInsights analysis, our client can identify problems in customer satisfaction before they manifest in an unhappy customer leaving the firm.

To begin this project, our client selected a subset of its data to analyze; in this case, the data represented just one region of the country. Once we analyzed the data from that region, it did, in fact, show a positive ROI. However, we didn't base the model on achieving positive ROI from a single region. The project’s ROI plan was based on running all of the email across the client’s entire U.S. footprint through the system, not just one division. Staking out a path to ROI and getting agreement on it keeps everyone focused on what is practical.

Please note we’re not saying that pure experimentation – and going after R&D projects first as a way of understanding new technologies – is not a valid approach. What we are saying is it is important to differentiate between experimentation and your first “business” project. Do not confuse how you learn with how you implement something that ultimately has to go into production.

Select your people before the technology

This may seem a little counterintuitive, but we've learned that selecting the people who are sponsoring and staffing the project is actually a more important predictor of success than the technology. While we spend time up front making sure that we get the technology right (and this is where having a broad portfolio of technology options is a great help), I’ll know going in what team members from our side I am going to assign to the project. You should as well, since in my experience personnel selection is the biggest variable in the project’s success – even, or maybe especially, when dealing in emerging technologies.

Simply stated, it is not a best practice to put people in the critical path of your first project if they are overly vested in existing approaches and technology, or downright hostile toward new technologies.

Make sure you staff with resources who expect the project will be doing things differently than the way you currently do them, especially in comparison to how relational database and warehouse projects work. Make sure the people involved understand that there will be different outcomes as a result of using this new technology. Your people need to understand that both the outcomes and how the data are exercised to get to those outcomes will differ from what they have done before.

Remember, this is not about replacing existing databases, but rather about augmenting them with different and new ways of doing things. While these projects are inherently complementary to what you're currently doing, and the new technologies should have been selected based on them being additive (rather than replacing anything core in your existing topology), it's important to make sure that people first understand the variability and nature of all the elements.

Don’t introduce too many new skill variables

Your first big data project is not the right time to concurrently develop Linux or Java skill sets in the team. If you have been appropriately selective with the people you have assigned (as we discussed above), you also need to be level-headed about how much you are throwing at them. Big data technologies will have scripting and flow considerations. For example, sometimes it is easier or required for performance reasons to express a routine in Java. Does your team have access to people with those skills? Are you okay with the performance profile if you “only” use visual tools and have the jobs be translated into MapReduce for you?

If you have people who possess the skills you need, find a way for them to join the project. It shouldn’t be hard to find people intrigued with exercising these big data technologies. We understand your staff will be busy, but “time slicing” your way to success here is not a best practice. If you do “time slice,” delays are the usual outcome, rather than outright project failure, but this is still an important consideration.

Finally, when you make your staffing plan, assign resources to address security; select people with experience in scripting languages; and add team members who can tackle base Linux and networking, especially your firewalls if you are – as you should – moving information across systems.

What does the production version look like?

One of the best practices we share with clients is a takeoff on the old adage of “practice like you are going to play.”

Simply stated, this means run your project in a test environment that is representative of how the production environment is going to look. This is a best practice for a number of reasons: to help you understand the performance attributes of the system, how to scale and size, how to determine the needed resources, etc. But it's also important to make sure that you understand all of the variables involved.

For example, if you are going to run a production flow where the big data technology is part of a larger end-to-end flow, then you need to make sure all of these touch points are exercised during the project.

It seems obvious, but this is a practical concern. When your production big data project is one step in a larger pipeline of workflow activities, make sure the work you do in the project includes a flow that is representative of the data acquisition and processing steps, and that the output is then delivered to, and accepted by, whatever system it ultimately feeds.

You want to make sure that you're loading the system and pulling from it in a way that is a proxy for your actual data acquisition. This is important for several reasons; for example, to ensure you have the ability to handle any metadata or information lineage issues.
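As a rough illustration of exercising every touch point, the sketch below runs a tiny batch through three stand-in steps and records a lineage entry at each one. Everything here is an assumption made for illustration: `acquire`, `analyze` and `deliver` are hypothetical placeholders for your real upstream source, big data job and downstream system, and the record fields are invented.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Records = List[dict]
Step = Tuple[str, Callable[[Records], Records]]


@dataclass
class Batch:
    records: Records
    lineage: List[str] = field(default_factory=list)  # one entry per step run


def run_pilot_flow(batch: Batch, steps: List[Step]) -> Batch:
    """Run each step in order and note the step name and record count."""
    for name, step in steps:
        batch.records = step(batch.records)
        batch.lineage.append(f"{name}:{len(batch.records)}")
    return batch


def acquire(_: Records) -> Records:
    # Hypothetical stand-in for pulling from the real upstream system.
    return [
        {"id": 1, "text": "late delivery again"},
        {"id": 2, "text": "thanks, all good"},
    ]


def analyze(records: Records) -> Records:
    # Hypothetical stand-in for the big data job: flag likely complaints.
    return [dict(r, flagged="late" in r["text"]) for r in records]


def deliver(records: Records) -> Records:
    # Hypothetical stand-in for the downstream system's acceptance check.
    required = {"id", "flagged"}
    rejected = [r for r in records if not required.issubset(r)]
    if rejected:
        raise ValueError(f"downstream rejected {len(rejected)} record(s)")
    return records


result = run_pilot_flow(
    Batch(records=[]),
    [("acquire", acquire), ("analyze", analyze), ("deliver", deliver)],
)
print(result.lineage)
```

The point of the sketch is the shape, not the logic: the pilot touches the upstream source, the analysis step and the downstream consumer in one run, and lineage is captured along the way rather than bolted on afterward.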

Neglecting considerations like metadata management, security and governance up front is not a best practice. You can’t assume they will magically be solved after the fact.

Make sure you have access to the data

It seems obvious, but the provisioning time for delivery of the data is always longer than you expect it to be. Try to aim for the data to be available concurrent with the hardware environment.

Until we started pushing hard on this due to the delays we were encountering, this was a common issue in most of our early projects. Interestingly, this will happen even in environments where the data is not considered sensitive.

Oftentimes the root cause is that the information is coming at a non-trivial scale, or from a non-trivial number of places. The latter can be a source of surprisingly long delays. For example, in our work to instrument parts of the business that are under-instrumented, such as when dealing closely with customers, we are working with scattered and diverse information sources – there is no one place to go to acquire the information (hence, the problem we are working to solve).

By carefully thinking through in advance the full upstream and downstream flow of your data, including how and when it will feed into and be pulled out of your big data technology, you will be much better prepared for a successful deployment.

Boil a bathtub

There’s an old saying, “You should boil a bathtub before you try to boil an ocean.” Humor aside, this is actually a very important point, and one that people sometimes lose sight of. Because big data technologies offer profoundly new ways of doing things, we often see customers who are starry-eyed about very big ideas.

Big, transformative ideas are important to your business, and they should play a key role in your strategic plan for what you want to accomplish over the next 5 to 10 years. But those ambitious ideas are not the right place to start. Your initial objective should be manageable and should have a clear, direct line-of-sight ROI. I covered this point earlier.

Some of this can be accomplished just by practicing good project management discipline, but it's important to note that any journey starts with the initial steps of being specific about the objectives upfront.

Good project management aside, establishing an initial set of attainable objectives is especially important with big data technologies because of the organizational dynamics we see in people who have inflated expectations. The technology can open many doors and address previously unsolvable challenges, but it cannot overcome hyper-inflated expectations.

Before you start, get a clear, agreed-to objective that is fairly modest so the team can gain experience while having a strong likelihood of success.

Plan for success

This next situation is what we would call a “high-quality problem,” but it is still potentially a problem. It is common that once the project is rolling and you have initial results, the project sponsors are going to ask you to put it into production faster than you might expect – or want. To ensure that this does not catch you by surprise, be prepared with a plan that describes how you would scale the project in a way that does not blow your ROI case.

The line of business constituents you are supporting will become very impatient. They will want to use – and expand – the project as quickly as possible so they can take advantage of the insight that these projects often deliver. Plan for it. This is one of the reasons why you want to pull/touch/push from your other key systems during the pilot.

Keep your original set of agreed-to objectives close at hand and remind people of them as needed. Remember, you have to be able to boil a bathtub before you can boil an ocean.
