Moving to Hadoop? Understand counting basics

November 22, 2013

I’ve encountered many customers who are keen to ditch their data warehouse and use Hadoop to store the organization’s data as a cost-saving measure and to provide more flexibility to the business. However, an oft-overlooked consequence of eliminating the data warehouse is that analysts are now exposed to data in its operational or transactional state.

Setting aside the obvious challenges of data cleansing, joining and validation, double counting is a common mistake when analyzing operational data or data that hasn’t been normalized; in other words, the kind of data dumped from an operational application directly into Hadoop.

Even accountants and those with the training to navigate this problem can be challenged by double counting, particularly when the data originates from disparate sources and when the business context requires interpretation.

How serious is double counting? In a world being measured more and more, contracts that reference metrics as part of a guarantee or implied promise are forcing corporations to assess their liability and risk. A February 2012 Federal Energy Regulatory Commission ruling upheld PJM’s assertion that EnerNOC double counted its customers’ participation in certain demand response programs. The dispute lasted months, battering EnerNOC’s stock price and leading to the resignations of EnerNOC’s COO and CFO.

What is double counting?

Double counting describes results that are overstated because a transaction is referenced more than once. It does not imply that duplicate records exist.

Curiously, I couldn’t find a decent definition for double counting outside of GDP and economic calculations. Feel free to suggest improvements on mine!

In my experience, double counting typically occurs in the following situations:

  • Data values are repeated in a column: for example, the same customer number is found in multiple rows
  • Multiple identifiers for the same “noun” or entity exist: for example, one customer is identified by multiple customer numbers, possibly due to different applications, a system migration or a merger
  • A contextual element or business rule is overlooked or not represented in the data
  • A many-to-many or recursive relationship exists in the data: for example, a customer has offices in France and Germany, so the number of customers in France is 1 and in Germany is also 1, but what freaks people out is that the number of customers for Europe is still only 1
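The many-to-many case from the last bullet can be sketched in a few lines. This is a minimal illustration with a hypothetical customer ("ACME") and office table, not data from any real system: summing the per-country counts overstates the regional total, while deduplicating on the customer identifier gives the correct answer.

```python
# Hypothetical office table: one customer, two country offices.
offices = [
    {"customer": "ACME", "country": "France"},
    {"customer": "ACME", "country": "Germany"},
]

# Distinct customers per country: ACME appears once in each.
by_country = {}
for row in offices:
    by_country.setdefault(row["country"], set()).add(row["customer"])

customers_france = len(by_country["France"])    # 1
customers_germany = len(by_country["Germany"])  # 1

# Naively summing the per-country counts double counts ACME.
naive_europe = customers_france + customers_germany  # 2 (wrong)

# The regional count must deduplicate: ACME is still one customer.
customers_europe = len({row["customer"] for row in offices})  # 1 (right)
```

The same logic applies in SQL: a plain COUNT over a joined table counts rows, while COUNT(DISTINCT customer) counts entities.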


Let’s consider an example where the budget for commissions was accrued at 5 percent of sales. The organization has a sales manager responsible for North American (NA) sales and another for OEM sales, which sells product BLK-40-5.

A typical approach would be to calculate the commissions as follows:

  • NA Sales Manager = 5 percent of sales in North America
  • OEM Sales Manager = 5 percent of sales where product code is BLK-40-5

Using a row from an invoice table, we see that product BLK-40-5 was sold in the USA for $1,275.00 (150 units at $8.50 each), resulting in that transaction being referenced in both calculations.
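To make the overlap concrete, here is a small sketch of the commission accrual using the invoice row above. The field names and the rule conditions are my own illustrative assumptions; the point is that one transaction satisfies both rules, so 5 percent is accrued twice against the same sale.

```python
# The invoice row from the example: BLK-40-5 sold in the USA,
# 150 units at $8.50 each.
invoice = {"product": "BLK-40-5", "region": "NA", "units": 150, "unit_price": 8.50}

sale_amount = invoice["units"] * invoice["unit_price"]  # $1,275.00

# NA Sales Manager: 5 percent of sales in North America.
na_commission = 0.05 * sale_amount if invoice["region"] == "NA" else 0.0

# OEM Sales Manager: 5 percent of sales where the product is BLK-40-5.
oem_commission = 0.05 * sale_amount if invoice["product"] == "BLK-40-5" else 0.0

# The same transaction is referenced by both rules.
total_accrued = na_commission + oem_commission  # $127.50, i.e., 10% of the sale
```

Against a budget accrued at 5 percent of sales, this one row alone consumes twice its budgeted commission.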


How long do you think the budget will last if commissions are paid out using this approach? Clearly, either the calculation or the budget needs to change.

What to do?

I speak from experience when I say double counting is an easy error to make. I’ve seen this mistake made by experienced business analysts and IT professionals, and on data ranging from 10 rows of an Excel spreadsheet to terabytes of SAP data. The usual reason is the data contains redundancy and dependencies, problems that require a good grasp of the business and data management skills to address.

So, when favoring Hadoop over your data warehouse, consider the following:

  • Identify who uses analytics within your organization and assess their analytical skills (for example, Excel user versus programmer). Orient the data sources to skill level.
  • Consider how transactional-level data will impact the effort needed to validate results. My experience is that the time to validate a report follows a logarithmic curve as the number of disparate sources or complex data relationships being referenced increases.
  • A transition plan might be to shift innovative or one-off analysis to Hadoop and reduce the data warehouse to support only frequently asked questions.
  • For new sources of data such as social media data, use Hadoop for experimentation and value assessment. Consider moving high value insights to Hive or the data warehouse for broader consumption.
  • When combining data from disparate sources, consider complementary technologies such as master data management (MDM) or entity analytics to improve confidence in the results and enhance repeatability.

For more information about double counting, check out my article on the IBM Business Analytics Blog, "The Rookie Analyst Series: Lessons in Double Counting."