Early Warnings, Fast Fixes
Maximize mission-critical uptime with existing tools
There was a time when the term “24x7” applied more to convenience stores than to databases and transactional systems. Data management professionals still needed to ensure that their systems were reliable and responsive, but there was always a window for batch programs, ETL jobs, and ongoing maintenance. Not anymore. The Internet has put off-line off-limits, especially for the mission-critical databases whose availability global companies have come to rely on around the clock.
Fortunately, database monitoring tools have evolved accordingly, but they’re often designed more for reactive tuning and troubleshooting than for proactively preventing outages. Monitoring availability, on the other hand, means placing your production systems under the unrelenting gaze of intelligent agents that can spot problems while they’re still minor enough to be fixed without incurring an outage.
Here’s the exciting news: if you have even a moderately sized IT infrastructure, you probably already have the tools to augment your database monitoring strategy with an early warning system that’s flexible, lightweight, and surprisingly affordable.
The Internet giveth
Even while the web ran off with your maintenance window and left you with a high-volume, transactional headache, it was also kind enough to lay the groundwork for the solution. The emergence of the web posed new and immediate challenges, even to large, mature IT shops that were already managing other 24x7 applications. Early generations of HTTP server software were notoriously fragile and insecure, while other less-popular programs had even shakier reputations. Caretakers at even a modest site could expect curveballs, and woe betide those who were thrown one at a critical moment in the day, quarter, or holiday shopping season.
IT shops of all sizes responded to the challenge by adopting network management software such as Nagios, IBM Tivoli NetView, Big Brother, and others. These applications helped them understand what “normal” looked like on their 24x7 systems. Out of the box (or .tar file), the products focused primarily on the status of network devices and a handful of basic server resources, which was hardly complete coverage. However, the programs offered custom monitoring APIs that enabled savvy administrators to extend the reach of their network management software well beyond Layer 3, ultimately polling and measuring thousands of service points.
Keeping everything running was motivation enough for comprehensive monitoring during the wild and woolly days of Web 1.0. But today’s age of regulatory compliance, service level agreements, and rightsized virtual environments has made proactive monitoring an absolute must-have technology. In response, companies created a wealth of compelling IT monitoring solutions, each with different strengths, specialties, and licensing options. By now, your operations team has most likely chosen a product and configured it to poll your company’s production servers and network devices. The tools are waiting for you; all you need to do is walk over and pick them up.
Wiring up your databases into a network monitoring console is not as difficult as it sounds. Some platforms streamline the process by offering dedicated monitoring agents for IBM database servers, either in the base product or as a separate plug-in. Even then, you’ll probably want to create a few custom monitors to watch for specific conditions. Extending an otherwise generic monitoring platform to watch over your databases may sound like a daunting task, but the learning curve isn’t that steep for DBAs, and the additional coverage is bound to deliver immediate savings.
How immediate? Think days, not months. Within a week of deploying a custom service check consisting of barely a dozen lines of UNIX shell script code, operations technicians at a large international organization were able to detect an infrequent but serious anomaly in one of their databases and avert an outage that would have disrupted the company’s operations for several hours across three continents.
Active polling without passwords
One criticism of active monitoring is that it typically requires the monitoring server to store passwords for all the remote accounts accessed by the service checks, but a secure and convenient work-around has existed for many years. Through SSH key–based authentication, individual accounts on Linux and UNIX servers can be configured to allow remote sign-ons through the use of preregistered encryption key pairs instead of traditional password authentication. Applying this pattern to active monitoring involves a task-specific private key that is stored on the central monitoring server and a corresponding public key that is registered into the authorized list of keys for a specific service account (the restricted monitoring user on the database server, for example).
When it’s time for the central monitoring server to refresh the status of a particular service, the service check program uses the private key to open an encrypted SSH connection to the remote system and issues the commands to obtain the status information. Unlike password authentication, which limits a user to having just one password at a time, SSH key–based authentication allows each user account to manage multiple authorized keys to produce highly granular authentication policies that can differ by task, client machine, or other criteria.
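As a rough sketch of the pattern described above, the following Python helper assembles the ssh command line that an active service check might run from the central monitoring server. The host name, account name, key path, and remote command are hypothetical placeholders. The `-i` option selects the private key file, and `-o BatchMode=yes` tells the OpenSSH client to fail rather than prompt for a password.

```python
import shlex


def build_ssh_check(host, user, key_path, remote_command):
    """Return the argv list for a key-authenticated, non-interactive SSH call."""
    return [
        "ssh",
        "-i", key_path,          # task-specific private key on the monitoring server
        "-o", "BatchMode=yes",   # never prompt for a password; fail instead
        f"{user}@{host}",
        remote_command,
    ]


# Hypothetical values, for illustration only.
argv = build_ssh_check(
    host="dbserver01.example.com",
    user="monitor",
    key_path="/etc/monitoring/keys/db_check_key",
    remote_command="/opt/monitoring/bin/check_db_connections",
)
print(shlex.join(argv))
```

A real check would hand the list to `subprocess.run` and interpret the output. On the database server, the matching public key's entry in the account's authorized_keys file can carry options such as `command="..."`, which restricts that key to a single program.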
Preparing your data systems
As a data management professional, your knowledge of what to look for in each database server and its associated middleware is an essential part of an effective monitoring plan. Without your insight, the default monitoring checks available may not reveal much beyond the status of the server’s network connection and TCP service port.
Before assessing which of your service checks will or won’t require custom programming, draw up a wish list of all the indicators that would warrant dedicated attention in an ideal monitoring environment, along with the DBA commands necessary to capture them. Also consider OS indicators, such as the size of the diagnostic message log and the presence of recently created core-dump files, as well as the following issues.
With your currently deployed systems in mind, estimate the alert boundaries for gauge-type indicators such as current connections and storage available.
Specify acceptable per-minute rates for performance counters that are always increasing, such as lock-wait time, rollbacks, and cache overflows.
Decide how often each service check will run; it could be every minute or every 5 or 10 minutes, depending on the importance of the resource and how taxing the service check is.
Rank and prioritize your monitoring wish list items by their potential to disrupt production if left unchecked.
Regardless of how your service checks are implemented, have a detailed plan of attack ready before jumping into the monitoring suite.
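The alert boundaries and per-minute rates described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the thresholds and sample values are invented, and the numeric status codes follow the exit-code convention used by Nagios-style plugins (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
OK, WARNING, CRITICAL = 0, 1, 2  # exit codes used by Nagios-style plugins


def classify_gauge(value, warn, crit):
    """Map a gauge reading (for example, current connections) to a status code."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def per_minute_rate(prev_value, prev_time, curr_value, curr_time):
    """Convert two samples of an ever-increasing counter into a per-minute rate.

    Times are in seconds. Returns None if the counter went backwards
    (for example, after a restart) or if no time has elapsed.
    """
    elapsed = curr_time - prev_time
    delta = curr_value - prev_value
    if elapsed <= 0 or delta < 0:
        return None
    return delta * 60.0 / elapsed


# Hypothetical numbers: 180 connections against warn=150, crit=200.
print(classify_gauge(180, warn=150, crit=200))  # 1 (WARNING)

# A rollback counter rose from 1,000 to 1,030 over a 5-minute polling interval.
print(per_minute_rate(1000, 0, 1030, 300))  # 6.0
```

For indicators where low values are the problem, such as storage available, the comparisons would be reversed. The computed rate can then be passed through the same kind of threshold test as a gauge.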
Active or passive
Next, determine if your monitoring checks need to be active or passive (from the monitoring server’s perspective). An active monitor executes on the central monitoring server and checks remote resources by polling them across the network. (For more information, see the sidebar, “Active polling without passwords.”) Passive monitors run directly on the production servers and transmit their results to the monitoring server. Monitoring platforms traditionally favor active service checks, but many also support passive monitoring over a variety of network protocols.
It’s quite common to mix active and passive service checks when monitoring databases and sophisticated business applications. Some implementations embrace passive monitoring for databases because it can be easier to write service checks that execute locally on the database server without needing to authenticate. Passive monitoring also means that the database server doesn’t need to allow inbound connections from the monitoring server, which is often located in a less-secure network zone. You may have a preference for one approach or the other, but don’t be disappointed if that decision has already been made by your network manager and operations team.
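To make the passive approach concrete, here is a minimal Python sketch of how a locally executed check might package its result for Nagios, one of the platforms named earlier. Nagios accepts passive results as PROCESS_SERVICE_CHECK_RESULT external command lines; other platforms use different mechanisms. The host and service names are hypothetical, and delivering the line to the monitoring server (for example, through its command file or a relay such as NSCA) is outside the scope of the sketch.

```python
import time


def format_passive_result(host, service, return_code, output, timestamp=None):
    """Build a Nagios PROCESS_SERVICE_CHECK_RESULT external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    # The command must fit on one line, so flatten any newlines in the output.
    output = output.replace("\n", " ")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}"
    )


# Hypothetical host and service names, with a fixed timestamp for repeatability.
line = format_passive_result(
    "db01", "DB Connections", 0, "OK - 42 connections", timestamp=1700000000
)
print(line)
```

The host and service names must match the definitions that already exist on the monitoring server, or the result will be ignored.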
When it comes to developing custom service checks, simpler is better. After all, each one is just a wrapper around one or two database commands, followed by some text formatting and possibly a bit of math. Stick with programming languages that are bundled with the base OS, such as bash, ksh, or Perl on Linux and UNIX, and Microsoft Windows PowerShell or Microsoft VBScript on Windows servers. Once you’ve written a few scripts, look for opportunities to make them even smaller and tighter by relocating commonly duplicated code routines to a reusable library.
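The wrapper pattern described above (run a command, pull out a number, format a status line) might look something like the following. Python is used here purely for illustration; the same logic fits comfortably in a shell or Perl script. The sample text is made up and only stands in for the output of a database command, and the status line follows the Nagios plugin convention of a human-readable message followed by performance data after a `|` character.

```python
import re

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def extract_metric(command_output, label):
    """Find 'label = number' in text output and return the number, or None."""
    pattern = re.escape(label) + r"\s*=\s*(\d+)"
    match = re.search(pattern, command_output)
    return int(match.group(1)) if match else None


def check_connections(command_output, warn, crit):
    """Return (exit_code, status_line) for a connection-count check."""
    value = extract_metric(command_output, "Applications connected currently")
    if value is None:
        return 3, "UNKNOWN - metric not found in command output"
    code = 2 if value >= crit else 1 if value >= warn else 0
    # Text after the '|' is performance data in label=value;warn;crit form.
    line = (
        f"{STATUS_NAMES[code]} - {value} connections"
        f" | connections={value};{warn};{crit}"
    )
    return code, line


# Made-up sample text standing in for the output of a database command.
sample = """
Database name                      = SAMPLE
Applications connected currently   = 42
"""
code, line = check_connections(sample, warn=150, crit=200)
print(line)
```

A deployed check would obtain the text by running the real command, print the status line, and exit with the numeric code. Helpers such as `extract_metric` are natural candidates for the shared library mentioned above.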
Although it’s always worthwhile to write code with some unknown maintenance programmer in mind, it becomes even more important when writing service checks, since they’ll require immediate adjustment when something goes wrong (for example, when a vendor software update suddenly breaks their testing logic). Your company’s collection of custom monitoring scripts may be small, but the scripts are still important and well worth the modest overhead of managing them with a version control system such as Apache Subversion, IBM Rational ClearCase, or Git. If you’re monitoring any resources with passive checks, the checkout and update features provided by version control programs offer a streamlined, consistent deployment process across multiple servers.
If the monitoring platform supports the concept of dependencies between resources, take the time to accurately define these hierarchies in order to reduce the amount of downstream chatter that follows a significant problem. For example, if a primary network switch fails, the servers connected to it will also be affected, but you probably won’t want to wade through dozens of alerts confirming that those systems and applications are indeed unreachable while the switch is down.
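Monitoring platforms that support dependencies handle this suppression internally, but the underlying idea is simple enough to sketch. The following Python function is an illustrative model only, with hypothetical resource names: it reports a failed resource only when nothing it depends on, directly or indirectly, has also failed.

```python
def filter_alerts(down, parents):
    """Return the failed resources whose ancestors are all still up.

    down    -- set of resource names currently failing their checks
    parents -- dict mapping each resource to the resource it depends on
    """
    root_causes = []
    for name in sorted(down):
        parent = parents.get(name)
        suppressed = False
        seen = set()  # guards against accidental loops in the hierarchy
        # Walk up the dependency chain looking for a failed ancestor.
        while parent is not None and parent not in seen:
            if parent in down:
                suppressed = True
                break
            seen.add(parent)
            parent = parents.get(parent)
        if not suppressed:
            root_causes.append(name)
    return root_causes


# Hypothetical hierarchy: two servers behind one switch, a database on one server.
parents = {"db01": "switch-a", "app01": "switch-a", "db-instance": "db01"}

# The switch fails and everything behind it becomes unreachable.
print(filter_alerts({"switch-a", "db01", "app01", "db-instance"}, parents))
# ['switch-a']
```

If only the database instance fails while its server and switch stay up, the same function returns `['db-instance']`, so genuine database problems still surface.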
Real-time reporting at Akbank
When you hear about 24x7 operations at a financial institution, you naturally think transactions: processing an account deposit or credit-card sale at midnight. But financial institutions like Akbank operate around the clock, and they need accurate, up-to-date information to provide all kinds of services. For example, the call center is always open, and when a customer calls with a problem, it must be solved immediately. Call center staff also need instant access to customer records and reports to offer appropriate products when selling opportunities present themselves. And at any time of day or night, customer satisfaction and complaint reports must be available.
Real-time information is so critical that Akbank, one of the leading banks in Turkey, used to run reports directly on its operational systems (primarily DB2 for z/OS) whenever possible, but only under tight restrictions. “Transactions and batch jobs always took priority over reports; the system automatically canceled any report that took longer than 15 minutes; and no report requests were accepted at all during peak times,” says Banu Ekiz, business intelligence applications vice president at Akbank.
These measures were needed despite having an extensive and sophisticated data warehousing and reporting system: “We run IBM Cognos and SAP Business Objects against about 15 data marts,” says Ekiz. But it could take a day or more for new data from the operational systems to be made available. Maintaining and updating the system to match the needs of the business is also a complex, time-consuming endeavor.
In July 2010, that started to change. Akbank created a real-time operational data store running on the Netezza appliance, using Informatica PowerCenter and PowerExchange to capture data in real time from the operational systems. “Today we have approximately 1,000 tables in the real-time data store, and we’re capturing 4,000 changes per second,” says Ekiz. “Business users see data 15 minutes after it’s created in the operational systems; the average run time of a report is 80 seconds; and the reports are not competing for resources on the operational systems. Plus, over the last year the system has been available 99.99 percent of the time.”
This is just the first phase of a far-reaching plan. “The next step is to migrate our data warehouse platform and datamarts to the Netezza system,” says Ekiz. “We have also started a data governance program. All of these initiatives will combine to help us answer crucial business questions as quickly as possible.”
Achieving a culture of availability
“Rudy’s Rutabaga Rule: Once you eliminate your number one problem, number two gets a promotion.” –Gerald M. Weinberg
Although reducing the frequency and duration of unplanned outages through IT monitoring is an admirable outcome, your efforts don’t have to stop there. After a few rounds of defining monitors to address your most pressing problems and gaining additional insight into your system, move on to capture key performance indicators from your servers and applications. Over time, measurements of throughput, response time, and resource utilization will reveal trends that can be used to plan upgrades, drive consolidation efforts, and minimize costly over-licensing.
On the business side, many application databases can be monitored with simple, inexpensive SQL statements that provide recent counts of customer sign-ups, received orders, gross revenue, and other essential metrics to produce a veritable dashboard of compelling information. Create role-specific web accounts on your monitoring platform, compose a custom view of status indicators and performance trends that are relevant to each role, and you may find formerly contentious departments acting more like a user community once they share a common repository of availability data.
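As a hedged illustration of how inexpensive such a business check can be, the sketch below counts recent orders and sums their revenue with a single SQL statement. The `orders` table, its columns, and the sample rows are hypothetical, and an in-memory SQLite database stands in for the production database so that the example is self-contained.

```python
import sqlite3
from datetime import datetime, timedelta

# In-memory SQLite database standing in for a production application database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, created_at TEXT)"
)

now = datetime(2024, 1, 15, 12, 0, 0)  # fixed "current time" for repeatability
rows = [
    (19.99, now - timedelta(minutes=10)),
    (5.00, now - timedelta(minutes=45)),
    (42.50, now - timedelta(hours=3)),  # outside the one-hour window
]
conn.executemany(
    "INSERT INTO orders (amount, created_at) VALUES (?, ?)",
    [(amount, ts.isoformat(sep=" ")) for amount, ts in rows],
)


def recent_order_stats(conn, since):
    """Return (order_count, gross_revenue) for orders created at or after `since`."""
    cutoff = since.isoformat(sep=" ")
    count, revenue = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount), 0) "
        "FROM orders WHERE created_at >= ?",
        (cutoff,),
    ).fetchone()
    return count, revenue


count, revenue = recent_order_stats(conn, now - timedelta(hours=1))
print(f"orders={count} revenue={revenue:.2f}")
```

Published as performance data on each polling cycle, figures like these give the monitoring platform what it needs to chart business activity alongside system health. A sudden drop to zero orders can be as telling as a failed connection test.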
All this may sound overly optimistic, but there is no denying that a web-accessible source providing timely, accurate statistics of your company’s mission-critical systems—as well as improved availability overall—will attract more advocates to the work you do.
Major financial firm switches from Oracle to DB2 in 12 hours
If you are going to talk to companies in the financial industry about high availability, you need to remove three words from your vocabulary: “if,” “but,” and “except.” For example, a global financial services company permits a 12-hour quarterly maintenance window for its online retail banking services. This is plenty of time for ordinary operations, but what about a complete platform migration?
When the organization—one of the top 25 financial services companies in the world—moved its online banking system from Oracle to IBM DB2 9.7, the company faced a serious availability challenge, but not the one that you might expect. Because DB2 9.7 provides native support for commonly used Oracle-specific features, the migration team was able to transition the application without significant recoding. The IBM Data Movement Tool automated the process of actually moving the production data—three terabytes in more than 300 dynamic tables—from Oracle to DB2. The only problem was the timetable.
“With the migration toolkit, moving the data from Oracle to DB2 was the easy part,” says Frank Fillmore, principal and founder of The Fillmore Group, a DB2 technical consulting firm. “But we were going to need five days for the batch unload/reload and to check and validate the data. There’s no way that the system could be down that long.”
Working with a team from IBM Lab Services, Fillmore recommended using the Q Replication feature of IBM InfoSphere Replication Server. Typically deployed to provide high availability and load balancing, Q Replication captures and stores database changes in message queues. Before the migration began, IBM and The Fillmore Group installed Q Replication, which captured changes made to the Oracle source database that was still up and running. The team then tested extensively, documenting everything from the volume of transactions to the time needed to update the target database. “Once the migration and validation were complete, we opened the queues and let the changes flow into the DB2 database,” says Fillmore. “When the two databases were in sync, we flipped the switch and cut the transaction stream over.”
The team’s creative solution paid off with a seamless transition. On the production cutover weekend, the team limited access to the online banking application, performed the necessary technical operations validation, and then re-opened the application to the public—all within the standard 12-hour maintenance window. “The project went very smoothly,” says Fillmore. “We’ll almost certainly use the Q Replication strategy on future Oracle-to-DB2 migrations.”