# Modeling Trains with SPSS

## How sequential analysis can predict the future, and why DB2 might be smarter than you think

By their nature, trains invite imagination. They conjure visions of exotic travel and long-distance commerce. But trains are also large, complicated, and expensive to repair when they break unexpectedly. For a railway operator, the ability to predict a failure before it happens would be tremendously valuable.

Recently, an IBM team worked with a European train manufacturer to develop exactly this capability, using IBM SPSS Modeler and IBM DB2 to analyze data generated by the trains. This article will show how to use SPSS to identify patterns in data. It also will show how IBM DB2 made an unexpected contribution to the project—handling a set of moderately complex mathematical calculations directly within SQL.

## Predictions from statistics

On a modern train, more than 1,000 different types of mechanical and electrical events are recorded, including operational events such as “opening door” or “train is braking,” warning events such as “line voltage frequency out of range” or “compression is low in compressor X,” and failure events such as “pantograph is out of order” or “inverter lockout.” More than 2,000 events are collected on each train every day.

IBM proposed to use sequential analysis to determine whether any of these events or event sequences indicate that a more severe problem is about to occur. Sequential analysis uses statistical modeling to detect patterns in data. For example, it can show that if particular grocery store customers purchase baby formula and diapers one day, they are highly likely to purchase stewed fruit (baby food) at some point in the future.

## Sequential analysis of events

The IBM team started with the raw event data for 2010: a 20-million-row data set containing the event time, event code, and the GPS location of the event. The team used the sequence node algorithm in IBM SPSS Modeler to analyze the train data. The sequence analysis is a two-stage process for sequential pattern mining.* First, the algorithm mines the data for frequent sequences and computes an adjacency lattice. The user can then explore the lattice by using specified criteria, such as support and confidence bounds, or by placing restrictions on the antecedent sequence.

Using the IBM SPSS Modeler workbench (see Figure 1), the team defined the input table in a DB2 data source node, added a Select node to remove data that occurred while the trains were in maintenance depots, and defined the input columns in the sequence node algorithm. The team also set the time period for which the sequences were to be studied, specifying that only events occurring in a two-day timeframe were to be considered. Figure 1. The ID field indicates which train is related to a sequence of events, the Time field describes the sequence order, and the Content field contains the code of the event that has occurred.

The SPSS Modeler rule sets interface shows the rules found by SPSS after the analysis has been completed (see Figure 2). The highlighted rule displays an antecedent sequence (21382, 29203) followed by a consequent sequence (in this case a single event: 18232), with a support rating of 55.7 percent and a confidence rating of 93.1 percent. This means that during 2010, this antecedent sequence appeared in 55.7 percent of trains and was followed by the consequent sequence 93.1 percent of the time. Figure 2. The rule sets interface, showing the results of the analysis. The sequence node found 12,358 rules.

Once the rules have been calculated, the data can be explored using the SPSS Modeler filtering tools (see Figure 3). According to the data, event code 2201 followed event code 2003 in 76.9 percent of the cases and event code 2001 in 84.6 percent of cases. Because event code 2201 indicates a failure that makes the train stop and leads to delays, this information can be used to optimize maintenance planning: if event code 2001 appears, the device related to this code should be inspected as soon as possible to avoid a failure. Figure 3: A filter applied for event code 2201. One rule shows that 87.2 percent of the time, event code 2201 was preceded by two occurrences of event code 2001.

## Is the train at the station?

As mentioned earlier, the team needed to eliminate data that was recorded while the trains were in a depot for maintenance; the service process triggered sensors in ways that did not reflect the actual behavior of the train systems. The best way to detect events that occurred during maintenance was to use the well-known point-in-polygon algorithm (see Figures 4 and 5) to calculate whether the train’s position at the time an event was recorded corresponded with the location of the maintenance depots. The manufacturer had an existing table with the depot polygons already defined. Figure 4. The solution to the point-in-polygon problem is to draw a horizontal line starting at the point in question. If the point is within the polygon, the line will intersect the edges of the polygon an odd number of times. Figure 5. If the horizontal line drawn is aligned with a polygon vertex, then only the segments that are below the line are counted.

Initially, the team thought that the calculation would be performed in a Visual Basic or Java application. But faced with processing 20 million events, the team quickly determined that these solutions would not be fast enough. Instead, they implemented the algorithm in SQL directly in IBM DB2. (Note: IBM DB2 Spatial Extender provides a native function to find whether a given geometry falls within the boundaries of another geometry.)

## Implementing point-in-polygon in SQL

To express the point-in-polygon algorithm in SQL, the team first built a pair of functions to transform longitude and latitude coordinates into mathematical X and Y planar coordinates:

• X (longitude) implemented as GetX SQL function: CASE WHEN LonDeg <= 180 THEN -1 * (LonDeg + (DOUBLE(LonMin) + DOUBLE(LonSec)/100)/60) ELSE (360 - LonDeg - (DOUBLE(LonMin) + DOUBLE(LonSec)/100)/60) END
• Y (latitude) implemented as GetY SQL function: CASE WHEN LatDeg <= 90 THEN -1 * (LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60) ELSE ((LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60)-90) END

Next, the team built a table of the segments of the polygon. They created a POLYSEG table with the polygon segment coordinates X1,Y1 and X2,Y2, keeping track of the segment order. They used a CASE WHEN … THEN ELSE END structure to detect whether a given point was actually the first point in the table. If it was, then it became the endpoint for the final segment. Otherwise, the code moved on to the next point:

SELECT p1.DepotID,p1.DepotName,p1.CoordinateRank, p2.GPSOrder,
GetX(p1.LonDeg, p1.LonMin,p1.LonSec), GetY(p1.LatDeg, p1.LatMin,p1.LatSec),
FROM POLYGONS p1,POLYGONS p2
WHERE p1. DepotID =p2. DepotID
AND p2.GPSOrder=CASE WHEN p1.CoordinateRank =(SELECT MAX(p3. CoordinateRank) FROM POLYGONS p3 WHERE p3. DepotID =p1. DepotID) THEN 1 ELSE p1.DepotID +1 END;

To detect the number of intersections to the right from a point that is not aligned with a vertex, start with the formula of the coordinate where the horizontal line intersects a segment (Xi):
Xi=X1+(Y-Y1)*(X2-X1)/(Y2-Y1)
For a given horizontal line that intersects the segment, use X<=Xi, or:
X<=X1+(Y-Y1)*(X2-X1)/(Y2-Y1)
To avoid any division by zero or overflow, transform all division into multiplication and check for the sign (horizontal lines with Y2=Y1 do not count in the algorithm):
If (Y2-Y1)>0 then X*(Y2-Y1)<=X1*(Y2-Y1)+ (Y-Y1)*(X2-X1)
If (Y2-Y1)<0 then X*(Y2-Y1)>=X1*(Y2-Y1)+ (Y-Y1)*(X2-X1)
The following resulting WHERE clause counts a segment that is intersected:

WHERE Y2<>Y1 AND (((Y2-Y1)*(TEMP.Y-Y1)>0 AND TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)<=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1)
OR ((Y2-Y1)*(TEMP.Y-Y1)<0 AND TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)>=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1)) AND (TEMP.Y>Y1 OR TEMP.Y>Y2) AND
(TEMP.Y<Y1 OR TEMP.Y<Y2)

To handle the points on the line aligned with a segment vertex, clauses that count only the segments that are lower than the line are necessary:

OR (TEMP.X<TEMP.X1 AND TEMP.Y=Y1 AND Y1>Y2)
OR (TEMP.X<TEMP.X2 AND TEMP.Y=Y2 AND Y2>Y1)

## Removing events from maintenance depots

To actually count the intersections, identify the points that have an odd number of intersections, and then filter out those points using GROUP BY and HAVING clauses. Notice that COUNT is an integer. In an odd number division, it does not retain the decimal part; this is why it can be used to test for an odd number.

GROUP BY EVENT.ID, P.DepotName HAVING COUNT(*)-2*(COUNT(*)/2)=1)

The team also needed to assemble everything into a table and update the events table with the depot location name or null. The following is the full SQL:

CREATE OR REPLACE PROCEDURE UpdateGPSLocation
BEGIN
DECLARE SQLCODE INTEGER;
DECLARE nbrows INTEGER DEFAULT 0;
DECLARE batchsize INTEGER DEFAULT 10000;
DECLARE t1 TIMESTAMP;
DECLARE t2 TIMESTAMP;
SET T1=CURRENT TIMESTAMP;
loop: LOOP
CALL DBMS_OUTPUT.PUT_LINE('Updating '||nbrows||' rows...');
UPDATE EVENTS
SET X = 1,
Location = (SELECT P.LOCATIONNAME
FROM POLYSEGS P,
TABLE(values(CASE WHEN LonDeg <= 180 THEN -1 * (LonDeg + (DOUBLE(LonMin) + DOUBLE(LonSec)/100)/60)
ELSE (360 - LonDeg - (DOUBLE(LonMin) + DOUBLE(LonSec)/100)/60) END,
CASE WHEN LatDeg <= 90 THEN -1 * (LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60)
ELSE ((LatDeg + (DOUBLE(LatMin) + DOUBLE(LatSec)/100)/60)-90) END
)) as TEMP(X,Y)
WHERE (Y2<>Y1 AND (((Y2-Y1)*(TEMP.Y-Y1)>0 AND TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)<=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1)
OR ((Y2-Y1)*(TEMP.Y-Y1)<0 AND TEMP.X*(Y2-Y1)*(TEMP.Y-Y1)>=X1*(Y2-Y1)*(TEMP.Y-Y1)+X2-X1))
AND (TEMP.Y>Y1 OR TEMP.Y>Y2)
AND (TEMP.Y<Y1 OR TEMP.Y<Y2))
OR (TEMP.X<TEMP.X1 AND TEMP.Y=Y1 AND Y1>Y2) OR (TEMP.X<TEMP.X2 AND TEMP.Y=Y2 AND Y2>Y1)
GROUP BY TMSEVENTS.ID, P.LOCATIONNAME
HAVING COUNT(*)-2*(COUNT(*)/2)=1)
WHERE ID IN (SELECT T.ID FROM EVENTS as T
WHERE T.X <> 1 OR T.X IS NULL
AND T.LatMin IS NOT NULL
FETCH FIRST 10000 ROWS ONLY);
IF SQLCODE = 100 THEN
LEAVE loop;
END IF;
COMMIT;
SET nbrows=nbrows + batchsize;
END LOOP;
SET T1=CURRENT TIMESTAMP;
CALL DBMS_OUTPUT.PUT_LINE('T1: '||T1||' T2:'|| T2);
END
@
CALL UpdateGPSLocation

The final code updates the 20-million-row table in less than one hour, identifying the events that happen in a maintenance depot. Because this update is performed on a monthly basis, this performance is acceptable. Additional optimization with appropriate indexes and EXPLAIN command usage could be used to increase performance.

## Keep the trains running

The team demonstrated that IBM SPSS Modeler provides the capability to automate failure predictions in real time by calculating confidences on the fly and generating alerts when needed. In addition, DB2 provides advanced SQL features to help handle complex queries, including geospatial data and complex computations. To see the real-world benefits, consider the money saved by reducing the need for emergency repairs. Or think of the lessons learned that can be used to make the next generation of trains more reliable. Or perhaps, simply contemplate the railway traveler: watching the landscape smoothly unfurl while the train rushes quietly to the next exciting destination.

* "Mining Sequential Patterns, by Rakesh Agrawal and Ramakrishnan Srikant, San Jose, California, IBM, 1995.