Turning Up the Heat on Open Data
An IBM team mashes up public domain data and creates a model to analyze New York City heating problems
A team of IBM big data practitioners recently looked at service log data from the New York City 311 (NYC 311) agency compiled from actual complaints reported by city residents.1 In particular, the team wanted to use open data from public domain sources to build a model that could help predict potential heating problems. The initial study looked at data for the period between November 2012 and February 2014. Among the complaint categories for problems reported by New York City residents, heating was clearly the most prevalent, with nearly two and one-half times more complaints than the category with the second highest number of complaints in the NYC 311 data repository.
In addition, the NYC 311 data reflects that during winter months the number of heating problem complaints rose sharply, with daily complaints ranging between 2,000 and 6,000 incidents within short time frames on several days (see Figure 1).
Figure 1. Volume of heating complaints by New York City residents
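A daily series like the one plotted in Figure 1 can be derived by filtering the complaint records to the heating category and counting them per calendar day. The following is a minimal Python sketch, not the team's actual tooling (the study used InfoSphere BigInsights). The field names `created_date` and `complaint_type`, the category label `HEATING`, and the sample rows are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def daily_heating_counts(records, complaint_type="HEATING"):
    """Count complaints of one type per calendar day.

    `records` is an iterable of dicts with the assumed keys
    'created_date' (YYYY-MM-DD HH:MM:SS) and 'complaint_type'.
    """
    counts = Counter()
    for rec in records:
        # Compare case-insensitively so "Heating" and "HEATING" match.
        if rec["complaint_type"].strip().upper() != complaint_type:
            continue
        day = datetime.strptime(rec["created_date"], "%Y-%m-%d %H:%M:%S").date()
        counts[day] += 1
    # Return the days in chronological order.
    return dict(sorted(counts.items()))


# Made-up rows for illustration only; these are not real NYC 311 records.
sample = [
    {"created_date": "2014-01-07 08:15:00", "complaint_type": "HEATING"},
    {"created_date": "2014-01-07 21:40:00", "complaint_type": "Heating"},
    {"created_date": "2014-01-07 09:00:00", "complaint_type": "NOISE"},
    {"created_date": "2014-01-08 06:30:00", "complaint_type": "HEATING"},
]
daily = daily_heating_counts(sample)
```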
Given the number of complaints and the importance of having adequate heating during harsh winter days, the team decided to delve into an analysis of New York City heating complaints. It used data sets from different public domains that could reveal various potential factors contributing to heating problems. The team had two primary goals. One goal was to establish a process for cross-domain data analysis and modeling that helps to investigate heating and other similar types of problems big cities experience. The other goal was to use that process to gain an initial insight into heating problems in New York City.
Apart from weather conditions, heating problems in buildings can typically be associated with the size of the building, the age of the building, the type of heating fuel used, infrequent maintenance of the heating equipment, demographics of the inhabitants, and so on. In this study of New York City heating problems, the IBM team selected the first three (building size, building age, and heating fuel type) because these characteristics tend to be the most common among factors contributing to heating problems. Relevant data was collected from the various public domain data sets and merged into expanded data sets that covered these three characteristics.
The NYC 311 data was obtained from the NYC Open Data portal2 along with housing information from American Community Survey (ACS) data for 2008 to 2012 from the United States Census Bureau3 and NYC PLUTO release 13v2 data.4 On the technology side, several components of the IBM® InfoSphere® BigInsights™ platform, including the InfoSphere BigSheets, Big SQL, and Big R tools, were used to perform the analysis and modeling.
Applying models for data at different granularities
Data sets from different sources are rarely available at the same level of granularity. Rolling them up to a common level to arrive at useful data sets is therefore a key step in an effective modeling process. To analyze the New York City heating problem, models were constructed at two different granularity levels: individual buildings and zip codes (see Figure 2).
Figure 2. Two models of data analysis for two levels of granularity
The output from the first model, built on individual building-level characteristics, identified the top five possible influencers that contribute to the risk of heating problems. Among them, the number one influencer was building size. As a next step, this factor was rolled up to the zip code level by calculating the percentage of large-size buildings in each zip code within the Manhattan borough. That data point was then merged with the zip code–level heating source data to create the second model at the zip code level.
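The roll-up and merge step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the team's implementation: the 50,000-square-foot cutoff for a "large" building, the field names, and all sample values are hypothetical, since the article does not define them.

```python
from collections import defaultdict

# Hypothetical cutoff; the study does not state how "large-size" was defined.
LARGE_BUILDING_SQFT = 50_000


def pct_large_by_zip(buildings, threshold=LARGE_BUILDING_SQFT):
    """Roll building-level rows up to the percentage of large buildings per zip."""
    totals = defaultdict(int)
    large = defaultdict(int)
    for b in buildings:
        totals[b["zip"]] += 1
        if b["floor_area_sqft"] >= threshold:
            large[b["zip"]] += 1
    return {z: 100.0 * large[z] / totals[z] for z in totals}


def merge_zip_features(pct_large, heating_by_zip):
    """Inner join on zip code: keep only zips present in both data sets."""
    merged = {}
    for z, pct in pct_large.items():
        if z in heating_by_zip:
            merged[z] = {"pct_large_buildings": pct, **heating_by_zip[z]}
    return merged


# Made-up building rows and heating-source shares, for illustration only.
buildings = [
    {"zip": "10001", "floor_area_sqft": 60_000},
    {"zip": "10001", "floor_area_sqft": 20_000},
    {"zip": "10002", "floor_area_sqft": 80_000},
    {"zip": "10002", "floor_area_sqft": 90_000},
    {"zip": "10002", "floor_area_sqft": 55_000},
    {"zip": "10002", "floor_area_sqft": 10_000},
    {"zip": "10003", "floor_area_sqft": 5_000},
]
heating_by_zip = {
    "10001": {"pct_electric_heat": 12.0, "pct_gas_heat": 60.0},
    "10002": {"pct_electric_heat": 8.0, "pct_gas_heat": 70.0},
}

pct_large = pct_large_by_zip(buildings)
zip_features = merge_zip_features(pct_large, heating_by_zip)
```

An inner join is used here so that a zip code missing from either source (10003 in the sample) is dropped rather than carried into the zip code-level model with incomplete features. Whether the study handled missing zip codes this way is not stated.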
The output from the second model can be represented in a variable analysis plot (see Figure 3). It indicates the relative importance of the top five factors from the second model. The x-axis values indicate the relative rate of influence of the corresponding variable—in other words, the higher the value, the higher the influence of the corresponding variable. For example, the result shows that electricity as the energy source for heating has the highest influence on heating problems. Interestingly, the size of the building space still appears within the top five influencing factors.
Figure 3. Top five factors influencing potential heating problems in Manhattan
Predicting risk at the zip code level
The second model was further used to create a scoring mechanism for predicting the risk of heating problems at the zip code level for Manhattan zip codes (see Figure 4). A lighter shade of green implies less risk of encountering a heating problem than a darker shade, which implies a high risk of experiencing a heating problem.
Figure 4. Risk of potential heating problems in Manhattan mapped across zip codes
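One way to turn zip code-level risk scores into the map shades described above is to bin the scores into a fixed number of levels. The sketch below uses equal-width bins over the observed score range. The binning rule, the number of shades, and the sample scores are assumptions for illustration; the article does not describe how the scores were computed or colored.

```python
def shade_levels(scores, n_shades=5):
    """Map each zip code's risk score to a shade index.

    Index 0 is the lightest shade (lowest risk) and n_shades - 1 is the
    darkest (highest risk). Bins are equal-width over the observed range.
    """
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    levels = {}
    for zip_code, score in scores.items():
        if span == 0:
            # All scores are identical, so every zip gets the lightest shade.
            levels[zip_code] = 0
        else:
            # Clamp so the maximum score falls in the last bin.
            idx = int((score - lo) / span * n_shades)
            levels[zip_code] = min(idx, n_shades - 1)
    return levels


# Made-up scores for illustration only; these are not results from the study.
scores = {"10001": 0.10, "10002": 0.50, "10003": 0.90}
levels = shade_levels(scores)
```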
In this study, a key goal was to establish a data analytics and modeling process that shows how public domain open data and other data sources can be used to address key problems, such as heating problems, in metropolitan cities such as New York City. This work can be extended to incorporate other variables, such as demographics, economic conditions, and brand of heating equipment, that may improve the model’s accuracy.
Extending the work to additional data sets, however, requires a comprehensive big data infrastructure, with the necessary software and hardware, as well as deep domain insight. Still, the process outlined in this study and its outcomes can serve as a template that provides initial insight into the highly discussed heating problems in New York City. A similar model can be developed for other big cities in North America and in other countries that face heating problems or other challenges of city life, provided the necessary data is available at comparable levels of granularity.
1 “Keeping the Trains Running On Time,” by Sourav Mazumder, Matthew Riemer, and Boris Vishnevsky, IBM Data magazine, June 2014.
2 NYC OpenData portal.
3 American Community Survey, United States Census Bureau, data and documentation.
4 PLUTO Release 13v2 tax lot data, Department of Planning, City of New York.