The Program, the Framework, and the Platform: Part 1

Processing data in motion can be complex and requires a decision: programs, framework, or platform

Executive IT Specialist, Competitive and Product Strategy, IBM Analytics, IBM

At this point, I trust that readers of the Data in Motion column are convinced that processing data in motion is essential to organizations in their quest to maintain a business advantage. But isn’t processing data in motion the same thing as writing a program that reads data, does something with it, and returns a result in the appropriate form such as a database update or the generation of an alarm? The simple answer is “yes,” but processing data in motion gets complicated very quickly, as will soon be revealed. Ultimately, developers should understand why they need to use a platform to implement their solution.

Program flexibility

Suppose a solution is implemented in Java. From a programming point of view, Java is a good choice because its standard library comes with a large number of classes that can be leveraged to do the work. Take a simple scenario: reading a data source, processing it, and writing it to a database.

The naïve solution is to write a program in which the processing is sequential: data is read from the source, processed, and then written to the database. This process can work well for a low volume of data, but as soon as the rate of data input reaches a certain level, performance problems emerge. Reading the input, processing the data, and writing to the database each take a different amount of time, and in a sequential program the slowest step holds back the other two. Letting each step run at its own pace leads to a more complex design (see Figure 1).
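The sequential approach can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name SequentialPipeline, the process method, and the in-memory list standing in for the database are all hypothetical and do not come from any particular product.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SequentialPipeline {
    // Hypothetical processing step: trims and upper-cases a record.
    static String process(String record) {
        return record.trim().toUpperCase();
    }

    // Handles one record at a time: read, then process, then write.
    // The "database" is a plain list standing in for a real sink (assumption).
    static List<String> run(Iterable<String> source) {
        List<String> database = new ArrayList<>();
        for (String record : source) {        // read
            String result = process(record);  // process
            database.add(result);             // write
        }
        return database;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(" a ", "b", " c")));
    }
}
```

Nothing here overlaps: while a record is being written, no new record is read, so the total time per record is the sum of all three steps.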

 
Figure 1. Multithreaded program requiring information queuing between threads

 
In this scenario, the program is multithreaded, with a different number of threads assigned to each processing stage and queues passing data from one stage to the next. If the processing gets increasingly complicated, the multithreaded program may have to be scrapped and split into multiple processes to take advantage of a cluster of machines. That split adds another type of communication: process to process, across machines.
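To make the queuing between threads concrete, here is a minimal sketch that uses java.util.concurrent.BlockingQueue to connect the stages. It assumes one thread per stage for brevity, whereas a real program would likely size each stage differently. The class name QueuedPipeline, the END sentinel, and the list standing in for the database are hypothetical choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueuedPipeline {
    // Sentinel marking end of input. It is compared by identity on purpose,
    // so a data record with the same text is never mistaken for it.
    private static final String END = new String("<END>");

    static List<String> run(List<String> input) {
        // Bounded queues: a fast stage blocks instead of flooding a slow one.
        BlockingQueue<String> toProcess = new ArrayBlockingQueue<>(16);
        BlockingQueue<String> toWrite = new ArrayBlockingQueue<>(16);
        List<String> database = new ArrayList<>(); // stand-in for a real sink

        Thread reader = new Thread(() -> {
            try {
                for (String record : input) toProcess.put(record);
                toProcess.put(END);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread processor = new Thread(() -> {
            try {
                String record;
                while ((record = toProcess.take()) != END) {
                    toWrite.put(record.toUpperCase()); // hypothetical processing
                }
                toWrite.put(END);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread writer = new Thread(() -> {
            try {
                String record;
                while ((record = toWrite.take()) != END) database.add(record);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        reader.start(); processor.start(); writer.start();
        try {
            reader.join(); processor.join(); writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return database; // only the writer thread touched it before join()
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "b", "c")));
    }
}
```

Even this small version needs a shutdown convention, bounded buffers, and interrupt handling, none of which is solution code. Running several threads per stage would add more: one sentinel per consumer and a decision on whether record order still matters.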

For flexibility on where to run each program, consider using a directory service, accessed for example through Lightweight Directory Access Protocol (LDAP), where each program registers, and adding logic for each program to wait until the other programs it depends on are available. This approach is starting to look like a lot of extra work beyond writing the solution code, so a framework that eliminates the extra work is worth considering.

Process framework underpinning

A framework can be defined as a basic structure underlying a system. In the current example, a framework is needed for distributed programming. With it, the focus can be on writing the processing and not on how the different parts communicate with each other.

A framework provides some classes and interfaces to make programs fit within it. These classes and interfaces are the mandatory parts, but nothing else is available. This situation reminds me of when C++ compilers first emerged on the scene. They shipped without class libraries, which had to be either bought or written. Having no class libraries may be hard for some to imagine. Picture an object-oriented programming language that first requires writing a String class.

Even if the framework is implemented in Java, there is still a lot of work to do. Reading a comma-delimited file requires using multiple classes such as FileReader and BufferedReader. Developers have to handle exceptions, loop over reading the file line by line, divide the line using the separator, and map the resulting values to the output attributes. And these tasks have to be carried out in a generic way that offers some flexibility to handle a different number of output attributes and to be able to use different separators such as the pipe character instead of a comma.
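As a rough idea of what that work looks like, the following sketch reads a delimited source with BufferedReader, takes the separator as a parameter, and maps each line to a caller-supplied list of attribute names. The class name DelimitedReader and its method signatures are hypothetical, and the sketch deliberately ignores quoted fields and escaped separators, which a production reader would have to handle.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class DelimitedReader {
    // Reads every line, splits it on the separator, and maps the values
    // to the given attribute names in order. Missing values become null.
    static List<Map<String, String>> read(Reader source, String separator,
                                          String... attributes) {
        List<Map<String, String>> rows = new ArrayList<>();
        // String.split takes a regex, so a separator such as "|" must be quoted.
        String regex = Pattern.quote(separator);
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;             // skip blank lines
                String[] values = line.split(regex, -1);  // keep trailing empties
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < attributes.length; i++) {
                    row.put(attributes[i], i < values.length ? values[i] : null);
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    // Convenience wrapper for files, using FileReader as mentioned above.
    static List<Map<String, String>> readFile(String path, String separator,
                                              String... attributes) {
        try {
            return read(new FileReader(path), separator, attributes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader sample = new StringReader("1|ann\n2|bob\n");
        System.out.println(read(sample, "|", "id", "name"));
    }
}
```

The separator and the attribute list are the two points of flexibility called for here. Everything else (exception handling, the read loop, the mapping) is plumbing that has to be written before any actual processing begins.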

If reading from a database is necessary, the appropriate code must be written using the Java Database Connectivity (JDBC) application programming interface (API). If it is a message queue, another API needs to be learned and used. And other types of input require other APIs.

Deep dive perspective

Part 2 of this series probes a bit deeper into the programming environment including developer tools, topologies, and platforms. Please share any thoughts or questions in the comments.