
The Program, the Framework, and the Platform: Part 2

Programs, frameworks, and platforms get the job done, and development tools enhance developer productivity

Executive IT Specialist, Competitive and Product Strategy, IBM Analytics, IBM

Part 1 of this series discusses using a program to implement data-in-motion processing. It also argues that this approach should be dismissed in most cases because once the volume of data gets high enough, additional time is required to maintain a distributed infrastructure that does the core work. In addition, Part 1 presents the programming aspect of using a framework—assuming a Java framework—and shows everything required for building a solution from scratch.

To be fair, communities supporting some frameworks do provide a few pre-built modules with different levels of documentation—though the documentation is usually minimal. Still, reviewing the code is a best practice to make sure it does what we want it to do and at the expected quality. Then, of course, the code must be maintained.

Processing, tools, and platforms

After the data is in the program, what about the processing? Sure, the power of the Java language and its classes is available, but many generic capabilities still need to be implemented. These operations include filtering, controlling the flow from multiple sources, joining, and aggregating data from multiple inputs. Joining and aggregating may also force the implementation of some sort of windowing mechanism to accommodate different data rates and to determine the number of tuples or the time interval to which the operation applies. This mechanism is not trivial to implement.
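To give a sense of the work involved, here is a minimal sketch of the simplest form of windowing: a count-based tumbling window that filters tuples and emits one average per full window. The class and method names (TumblingCountWindow, accept, WindowDemo) are hypothetical and do not come from any particular framework. Time-based windows, sliding windows, and joins across inputs with different data rates would all need considerably more code than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical names throughout; not taken from any real streaming framework.
final class TumblingCountWindow {
    private final int size;                     // number of tuples per window
    private final Predicate<Double> filter;     // tuples failing this test are dropped
    private final Consumer<Double> downstream;  // receives one average per full window
    private final List<Double> buffer = new ArrayList<>();

    TumblingCountWindow(int size, Predicate<Double> filter, Consumer<Double> downstream) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
        this.filter = filter;
        this.downstream = downstream;
    }

    // Called once per incoming tuple.
    void accept(double value) {
        if (!filter.test(value)) {
            return;                             // filtering step
        }
        buffer.add(value);
        if (buffer.size() == size) {            // window is full: aggregate and reset
            double sum = 0;
            for (double v : buffer) {
                sum += v;
            }
            downstream.accept(sum / size);
            buffer.clear();
        }
    }
}

final class WindowDemo {
    // Feeds a fixed input through a window of 3 tuples that ignores negative values.
    static List<Double> run() {
        List<Double> averages = new ArrayList<>();
        TumblingCountWindow window = new TumblingCountWindow(3, v -> v >= 0, averages::add);
        double[] input = {1, 2, -5, 3, 10, 20, 30, 7};
        for (double v : input) {
            window.accept(v);
        }
        return averages;                        // the trailing 7 stays buffered, not emitted
    }

    public static void main(String[] args) {
        System.out.println(run());              // prints [2.0, 20.0]
    }
}
```

Even this small example leaves open questions that a real implementation has to answer, such as what to do with a partially filled window when the stream ends.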

What about development tools? Sure, a generic integrated development environment (IDE) can be used to help with programming tasks. However, it would not include specific capabilities for the distributed programming paradigm, such as wizards to create new operators. Another example is putting together a topology. A topology is the processing flow from one operator to another (see Figure 2). Any complex solution may end up having dozens of operators. Having many operators is an advantage: each operator is one processing component, so multiple operators can take advantage of multiple processors in one or more machines. The following pseudo code shows what a small topology could look like:

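// Pseudo code: Graph, ReaderAdapter, Operator1, and Operator2 are illustrative names.
Graph job = new Graph();

// Source adapter; the trailing number is its level of parallelism.
job.setAdapter("reader", new ReaderAdapter(), 7);
// work1 runs with a parallelism of 4 and receives tuples from "reader".
job.setOperator("work1", new Operator1(), 4).shuffleGrouping("reader");
// work2 runs with a parallelism of 2 and receives tuples from "work1".
job.setOperator("work2", new Operator2(), 2).shuffleGrouping("work1");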
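For readers who want something runnable, the same shape can be approximated on a single machine with plain Java thread pools. This is only a sketch under simplifying assumptions: MiniTopology is a hypothetical name, the reader is a single loop rather than an adapter with its own parallelism, and the two operators do trivial arithmetic. It is not how any particular framework implements a topology, and it does nothing about distribution across machines.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical single-process stand-in for the reader -> work1 -> work2 flow.
final class MiniTopology {

    // Sends the tuples 1..count through both stages and returns the sum of their squares.
    static long run(int count) {
        ExecutorService work1 = Executors.newFixedThreadPool(4); // parallelism of 4
        ExecutorService work2 = Executors.newFixedThreadPool(2); // parallelism of 2
        AtomicLong total = new AtomicLong();

        // "reader": a simple loop that hands each tuple to work1.
        for (int i = 1; i <= count; i++) {
            final long tuple = i;
            work1.execute(() -> {
                long squared = tuple * tuple;                     // work1: transform
                work2.execute(() -> total.addAndGet(squared));    // work2: aggregate
            });
        }

        try {
            // Drain work1 first so that every tuple has reached work2 before work2 stops.
            work1.shutdown();
            if (!work1.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("work1 did not finish in time");
            }
            work2.shutdown();
            if (!work2.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("work2 did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for operators", e);
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(run(10)); // prints 385
    }
}
```

Note how much of the code deals with starting and stopping the stages in the right order rather than with the processing itself.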

 

Figure 2. A topology graphical representation

 
This graphical representation of a topology shows how the operators are connected and how each has a level of parallelism attached to it. Topologies represented in a graphical editor, such as the editor provided by IBM® InfoSphere® Streams software for stream computing, can become quite complex (see Figure 3). Some topologies can even be viewed as operators within other topologies. The capabilities of a proper IDE greatly enhance programmer productivity.

 

Figure 3. InfoSphere Streams graphical editor view of a topology

 
Together with the distributed processing framework, these tools can be highly beneficial to developers, and they make a justifiable case for using a platform instead of a framework. A platform should include the following components:

  • An IDE
  • Pre-built operators to help simplify common processing tasks
  • A runtime environment with capabilities to start and stop topologies and relocate operators as needed
  • Easy-to-use tooling for monitoring and management
  • Proper documentation
  • Packaging that makes it easy to install and deploy

Developing solutions should be easy, but the ease of maintaining them is just as important. Having a graphical editor is a helpful part of the solution. Another part is the ability to modularize topologies to keep them manageable.

Then there are the performance characteristics of topologies to consider. Performance tuning cannot be done blindly because the bottleneck is often not where it is expected. Having a comprehensive set of tools to monitor production systems is vital, but profiling the execution is equally important.
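Without platform tooling, even basic profiling has to be built by hand. The sketch below shows one way this might look: a wrapper that accumulates the time spent in each operator so that the slowest one can be identified. All names (OperatorProfiler, timed, slowest, ProfilerDemo) are hypothetical, the "enrich" operator uses a short sleep as a stand-in for slow work, and the profiler is not thread-safe, so it only suits a single-threaded test run.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical hand-rolled profiler; not thread-safe.
final class OperatorProfiler {
    private final Map<String, Long> nanosByOperator = new LinkedHashMap<>();

    // Wraps an operator so that every call adds its elapsed time to a per-name total.
    <T> UnaryOperator<T> timed(String name, UnaryOperator<T> operator) {
        nanosByOperator.put(name, 0L);
        return input -> {
            long start = System.nanoTime();
            try {
                return operator.apply(input);
            } finally {
                nanosByOperator.merge(name, System.nanoTime() - start, Long::sum);
            }
        };
    }

    // Name of the operator with the largest accumulated time, or null if none is registered.
    String slowest() {
        String worst = null;
        long max = -1;
        for (Map.Entry<String, Long> entry : nanosByOperator.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                worst = entry.getKey();
            }
        }
        return worst;
    }
}

final class ProfilerDemo {
    // Runs 20 tuples through a cheap operator and a deliberately slow one.
    static String run() {
        OperatorProfiler profiler = new OperatorProfiler();
        UnaryOperator<String> parse = profiler.timed("parse", String::trim);
        UnaryOperator<String> enrich = profiler.timed("enrich", s -> {
            pause(2);                           // stand-in for a slow lookup
            return s + "!";
        });
        for (int i = 0; i < 20; i++) {
            enrich.apply(parse.apply(" tuple " + i + " "));
        }
        return profiler.slowest();
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("Slowest operator: " + run());
    }
}
```

A measurement like this covers only one process. Profiling operators that run in parallel across several machines is a much larger task.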

Programs, frameworks, or platforms

For many organizations, the question comes down to whether they prefer building a solution from scratch or focusing on solving their business problems. It also comes down to whether they want to deploy a platform that gives them added agility. An agile platform offers the distributed processing and tooling to enable highly efficient development, maintenance, and management of business solutions. These benefits result in shorter timelines and quicker responses to the changing needs of the business.

Was this article series helpful? Please share any thoughts or questions in the comments.