Keeping the Trains Running On Time

IBM data scientists exploit open and social media data to investigate approaches for smarter city operations

In 2012, the New York City Council enacted Local Law 11 of 2012, which encouraged and enabled New York City’s agencies and departments to make their data available online and share it using open standards. As a result, many data sets managed by various city agencies and departments are available to companies, organizations, and individual application developers (see figure). These entities can use the data sets for research or to develop applications that make city operations more transparent and easier than ever for residents and visitors to engage with.

Figure: Available and planned data sets for New York City agencies and departments


In September 2013, the New York City Department of Information Technology and Telecommunications (DoITT) submitted the New York City (NYC) OpenData Plan. It outlined which controlling agencies would submit data sets that meet the legal definition of a public data set (see the sidebar, “Providing a framework for open information management”).

As increasing numbers of data sets from multiple New York City agencies became available, a significant number of applications—many of them mobile applications—were developed and released. These applications operated with various data sets published by New York City agencies. As of this writing, the NYC OpenData portal lists more than 1,100 data sets.1

Opening insight into city operations

An IBM team of data and information scientists, who also live in New York City, engaged in a series of studies that used the open data provided by the city’s agencies and departments. For example, the team recently augmented survey data from the NYC 311 agency with related social media data to examine residents’ heating complaints in particular.2 The team further demonstrated how this information could be used to delve deeper into a study of the city’s potential heating problems.3


Providing a framework for open information management
About a year after New York City enacted Local Law 11 of 2012, the Executive Office of the President of the United States published a similar policy in May 2013. This policy, known as “Open Data Policy-Managing Information as an Asset,” was detailed in a memorandum to the heads of executive departments and agencies.*

It sets forth a framework that helps “institutionalize the principles of effective information management at each stage of the information’s life cycle to promote interoperability and openness.” The memorandum further stipulates that agencies must create or collect information in a manner that supports downstream information processing and dissemination. That requirement includes using machine-readable and open formats, data standards, and metadata for data creation and collection activities.

* “Memorandum for the Heads of Executive Departments and Agencies,” Executive Office of the President, Office of Management and Budget, M-13-13, Washington DC, May 2013.

Another effort made use of open data to look into the timely operation of the New York City public transit system, particularly the city’s subways. The subway system was of keen interest to the IBM team because its members were highly dependent on it for their own commuting.

The team had also closely followed the evolution of NYC OpenData along with crowdsourcing applications that were being developed. Among many other applications, the team’s favorite was one that dynamically visualized a target subway schedule for all the city’s subway lines. The team members found the application to be quite useful, and it helped them catch just the right trains from Manhattan to Brooklyn for their scheduled meetings with New York City agencies. While using the application, team members could tell if the subway schedule was changing because of holidays or unanticipated events that occur from time to time with the subway system. In those cases, the visualization of the subway schedule wouldn’t correspond to its actual operation.

The team also noticed that many applications that use the NYC OpenData portal data sets focus on or leverage a single data set published by a particular agency representing a single New York City domain. Such agencies include the Metropolitan Transportation Authority (MTA) and the city’s police department (NYPD), department of education (DOE), Health and Hospitals Corporation (HHC), department of sanitation (DSNY), and so on. In many cases, these individual applications did not attempt to explore, correlate, or derive insight from cross-domain data. An example of working with cross-domain data sets might involve looking for opportunities to report on or analyze data published by seemingly unrelated agencies such as the MTA and DOE or the DOE and public safety.

Similarly, many applications did not attempt to consider and correlate New York City–based social media information from channels such as Twitter or Facebook. However, from a common-sense, real-life perspective, correlating seemingly unrelated data sets and combining them with appropriate analysis of public sentiment data can reveal relationships that no single data set shows on its own.

Many events and activities are reflected in data that is available from different domains and published by different, interdependent agencies and departments. For example, many children in the New York City public school system take subways to and from schools. On-time operation of the public transit system, therefore, can directly affect school attendance.
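As a rough illustration of what correlating two such data sets could look like, the following Python sketch joins daily subway delay figures with daily school attendance figures by date and computes a simple correlation coefficient. It is only a sketch under stated assumptions: the variable names, the record layout, and every number in it are invented for illustration and do not come from the NYC OpenData portal or from the IBM team's work.

```python
from datetime import date
from math import sqrt

# Hypothetical daily records keyed by date. All values are invented
# for illustration; they are not real MTA or DOE figures.
subway_delay_minutes = {
    date(2014, 3, 3): 4.0,
    date(2014, 3, 4): 12.5,
    date(2014, 3, 5): 6.0,
    date(2014, 3, 6): 20.0,
    date(2014, 3, 7): 3.5,   # no matching attendance record; dropped by the join
}
school_attendance_pct = {
    date(2014, 3, 3): 93.1,
    date(2014, 3, 4): 90.4,
    date(2014, 3, 5): 92.0,
    date(2014, 3, 6): 88.7,
    date(2014, 3, 10): 92.8,  # no matching transit record; dropped by the join
}


def join_by_date(left, right):
    """Inner-join two {date: value} mappings on the dates they share."""
    shared_dates = sorted(left.keys() & right.keys())
    return [(d, left[d], right[d]) for d in shared_dates]


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


joined = join_by_date(subway_delay_minutes, school_attendance_pct)
delays = [row[1] for row in joined]
attendance = [row[2] for row in joined]
r = pearson(delays, attendance)
print(f"{len(joined)} shared days, correlation r = {r:.2f}")
```

A strongly negative coefficient on real data would only suggest that delays and attendance move in opposite directions. It would not show that one causes the other, and a serious study would need far more days and controls for factors such as weather and holidays.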

Taking a deeper dive into city life

These kinds of city agency and department interdependencies and the volumes of open data from them that are now available have inspired the IBM team to investigate creative ways to put this data to use. The team was able to dig deeply into specific issues that impact New York City residents, share their findings, and describe the unique and nonobvious insight that can be derived while working with these valuable data resources from multiple domains.

Equally compelling is how these data sets can be enhanced by additional insights derived from New York City–related social media interactions on Twitter and Facebook. Look for upcoming articles in IBM Data magazine to learn more about the team’s approach, development, and conclusions based on working with New York City open data augmented by data gleaned from social media interaction.

Please share any thoughts or questions in the comments.

1 NYC OpenData portal.
2 “Turning Up the Heat on Open Data,” by Sourav Mazumder, IBM Data magazine, June 2014.
3 “Enhancing Survey Data with Related Social Media,” by Matthew Riemer, IBM Data magazine, June 2014.
