Traveling To The Cloud – Predict

As I stated in my previous post, traditional monitoring approaches focusing on named systems no longer make sense. In an agile cloud environment the system name does not matter, and in turn neither do the performance values from that single system.

In such a situation the prediction approach also has to change. The data flowing into IBM Operations Analytics – Predictive Insights should no longer identify a single system or a single instance of a resource. It should represent the sum of resources or the average usage value. So let us review a few simple examples:

While we monitor the key performance metrics of each system instance with our monitoring agents, such as

  • Disk I/O per second

  • Memory Usage in Megabytes

  • Network Packets sent and received

  • CPU percentage used

we feed the following values into our prediction tool (absolute values are summed across the cloud, while percentages are averaged, since a percentage summed over a varying number of OS images has no meaning):

  • SUM(Disk I/O per second) across all used OS images

  • SUM(Memory Usage in Megabytes) across all used OS images

  • SUM(Network Packets sent and received) across all used OS images

  • AVG(CPU percentage used) across all used OS images

IBM Monitoring stores historical data in the Tivoli Data Warehouse. A traditional system setup might directly leverage the data stored in the warehouse to feed the prediction tool. With the elastic cloud approach we should add some new views to the database that provide the summarized data described above.

To ensure that a single operating system instance isn’t overloaded, traditional resource monitoring still has to be deployed to each cloud participant. Distribution lists from IBM Monitoring help to do this automatically.

These lists of systems are also important for keeping the views introduced for the prediction efficient.

The following table is required in the WAREHOUS database; it represents the distribution list known from IBM Monitoring.
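A minimal sketch of such a table, assuming the distribution list simply pairs a cloud name with the managed system names of its current members (all object and column names here are hypothetical):

-- One row per cloud member; kept in sync with the IBM Monitoring
-- distribution list
CREATE TABLE ITMUSER.CLOUD_MEMBERS (
    CLOUDNAME  VARCHAR(64)  NOT NULL, -- name of the cloud
    SYSTEMNAME VARCHAR(128) NOT NULL, -- managed system name of the OS image
    PRIMARY KEY (CLOUDNAME, SYSTEMNAME)
);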

Based on this table we can create views like the one below; with such a view we are able to feed data regarding the disk usage into the IBM Operations Analytics – Predictive Insights tool.
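A sketch of such a view, assuming a hypothetical per-system warehouse table "Disk_IO_History" written by the OS agent, joined with the CLOUD_MEMBERS table from above (the metric column names are illustrative, not the agent’s real attribute names):

-- Aggregate the per-system disk metrics into one set of values
-- per cloud and sample time
CREATE VIEW ITMUSER.CLOUD_DISK_IO ("CloudName", "TimeFrame",
    "AllReadRequestPerSecond", "AllWriteRequestPerSecond",
    "AvgWaitTimeSec", "AllReadBytesPerSec", "AllWriteBytesPerSec") AS
SELECT cm.CLOUDNAME,
       -- render the sample time as a Candle Time Stamp (see below)
       RTRIM(CHAR(YEAR(d."Sample_Time") - 1900))
         || SUBSTR(CHAR(d."Sample_Time"),  6, 2)
         || SUBSTR(CHAR(d."Sample_Time"),  9, 2)
         || SUBSTR(CHAR(d."Sample_Time"), 12, 2)
         || SUBSTR(CHAR(d."Sample_Time"), 15, 2)
         || SUBSTR(CHAR(d."Sample_Time"), 18, 2)
         || '000',
       SUM(d."Read_Requests_Per_Sec"),  -- summed across all members
       SUM(d."Write_Requests_Per_Sec"),
       AVG(d."Wait_Time_Sec"),          -- averaged, as wait time is not additive
       SUM(d."Read_Bytes_Per_Sec"),
       SUM(d."Write_Bytes_Per_Sec")
FROM "Disk_IO_History" d
JOIN ITMUSER.CLOUD_MEMBERS cm
  ON d."System_Name" = cm.SYSTEMNAME
GROUP BY cm.CLOUDNAME, d."Sample_Time";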

The column “CloudName” identifies which records belong to which stream, and the “TimeFrame” column serves as the time dimension.

Five streams result from the view above:

  • AllReadRequestPerSecond

  • AllWriteRequestPerSecond

  • AvgWaitTimeSec

  • AllReadBytesPerSec

  • AllWriteBytesPerSec

All streams are generated for each single instance of “CloudName”.

In the Predictive Insights Modeling Tool the view is selectable (as a table), so that the generation of the data model is straightforward.

An SQL expression in the view makes sure that “TimeFrame” is delivered as a Candle Time Stamp, the timestamp format known to the IBM Operations Analytics – Predictive Insights tool.
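A sketch of that expression, assuming the source column "Sample_Time" is a native DB2 TIMESTAMP; CHAR() of a timestamp yields 'YYYY-MM-DD-HH.MM.SS.mmmmmm', so the individual pieces can be cut out with SUBSTR:

-- Candle Time Stamp: CYYMMDDHHMMSSmmm (16 characters),
-- where CYY is the year offset from 1900, e.g. 114 for 2014
RTRIM(CHAR(YEAR("Sample_Time") - 1900)) -- CYY
  || SUBSTR(CHAR("Sample_Time"),  6, 2) -- month
  || SUBSTR(CHAR("Sample_Time"),  9, 2) -- day
  || SUBSTR(CHAR("Sample_Time"), 12, 2) -- hour
  || SUBSTR(CHAR("Sample_Time"), 15, 2) -- minute
  || SUBSTR(CHAR("Sample_Time"), 18, 2) -- second
  || '000' AS "TimeFrame"               -- milliseconds, zero-filled

If the source table already carries the warehouse’s WRITETIME column, which is stored in Candle format, it can simply be passed through.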

This sample shows how a data model for the cloud might look.

As more and more systems move to the cloud and IT workloads are served with ever greater agility, the monitoring approach has to become more agile as well. The view of which key performance metrics matter has to change, too. But as you can see, the data is there; we only have to change the perspective a little.

So what is your approach? What requirements do you see arising while moving your monitoring and prediction tools to the cloud?

Follow me on Twitter @DetlefWolf, or drop me a discussion point below to continue the conversation.

In my next blog I will share a few ideas how to automate the implementation of IT monitoring in a cloud environment.

IT Service Management – Traveling To The Cloud

More and more customers are moving to cloud architectures to fulfill the fluctuating resource requirements of their IT. Traditional monitoring approaches that check the availability of a single system or resource instance make only limited sense in this new era. Resources are provisioned and removed on demand and have no long-term life span.

It no longer matters whether a named system exists or not; it is about the service and the pieces that implement it. The number of systems will vary in accordance with the workload covered by the service. In some cases the service itself may disappear when it is not permanently required. The key metric is the response time the service consumers experience. But how can we assure this key performance metric at the highest level without being hit by an unpredicted slowdown or outage?

We need a common monitoring tool watching the key performance metrics at the resource level and frequently checking the availability of these resources, such as:

  • Disk

  • Memory

  • Network

  • CPU

Application containers will also be handled like resources, e.g.:

  • Java Heap

  • Servlet Container

  • Bean Container

  • Messaging Bus

Resources from database systems, messaging engines and so on are monitored as well. With IBM Monitoring we have a useful and easy-to-handle tool, available on premises and in the cloud.

With the data gathered by the monitoring tool, we can now feed a predictive insights tool. As described in a previous post, monitoring is the enabler for prediction. Prediction is a key success factor in cloud environments: it is essential for understanding the long-term behavior of an application in such an environment.

The promise of the cloud is that an application has almost unlimited resources. If we are getting short on resources, we simply add more. But how can we detect that the application is behaving suspiciously, with every added resource immediately eaten up by the workload? Does the consumption correlate with the number of transactions, the number of users or other metrics? Or is it a misbehaving application?

We need to correlate different metrics. But are we able to keep track of all possible dependencies? Are we aware of all these correlations?

IBM Operations Analytics Predictive Insights will help you in this area. Based on statistical models, it discovers mathematical relationships between metrics. No human intervention is needed to achieve this result. The only requirement is that the metrics are provided as streams at a regular interval.

After the learning process is finished, the tool will send events on unexpected behavior, covering univariate and multivariate threshold violations.

For example, you have three metrics:

  • Number of requests

  • Response time

  • Number of OS images handling the workload

A rising number of OS images wouldn’t be detected by a simple threshold on a single resource, which is all the traditional monitoring solution covers.

Neither the response time nor the number of users shows an anomaly, and the correlation between these two data streams also remains inconspicuous. However, adding the number of OS images reveals an anomaly in relation to the other values: if requests and response time stay flat while the number of OS images keeps growing, the workload handled per image is silently collapsing. This could lead to a situation where all available resources are eaten up (even cloud resources are limited, because we cannot afford unlimited ones). In this situation our resource monitor would send out an alarm at a much later point in time.

For example, first the OS agent would report a high CPU usage; then the response time delivered to the end users would reach a predefined limit. The time between the first resource event and the point at which the user’s service level agreement metric (response time) is violated is too short to react.

With IBM Operations Analytics Predictive Insights we gain time to react.

So what is your impression? Did you also identify correlations to watch out for after analyzing the reason for a major outage and the way to avoid this outage?

Follow me on Twitter @DetlefWolf, or drop me a discussion point below to continue the conversation.

In my next blog I will start a discussion about which values make sense to feed into a prediction tool.

WebSphere Monitoring

Over the last weeks I have seen an increasing demand for WebSphere Application Server (WAS) monitoring. This article summarizes the available solutions, created over the last years on top of IBM’s monitoring solution, SmartCloud Application Performance Management (SCAPM).

The SCAPM portfolio comprises almost all of IBM’s monitoring capabilities under the umbrella of IBM Tivoli Monitoring (ITM). ITCAM for Applications contains the WAS monitoring agents.

The documentation of the WAS monitoring solution may be found on the IBM Knowledge Center.

Additionally, I’ve created two add-ons for the WebSphere monitoring. The situation package provides a set of sample monitoring rules covering the requirements most often seen in the field.

To get a comprehensive overview of all WebSphere Application Server instances monitored in your environment, this navigator view might help.

For deep-dive analysis, the data collector can be connected directly to the ITCAM Managing Server to enable transaction debugging and detailed WebSphere environment analysis.

WebSphere monitoring is only one discipline within the SCAPM portfolio. Other areas of application performance management are covered as well, including transaction tracking, HTTP response time measurement and robotic monitoring.

SCAPI: Preparing your system — Software Packages

Before you can install SmartCloud Analytics – Predictive Insights (SCAPI), you have to meet the software prerequisites on the Red Hat Enterprise Linux system that hosts the SCAPI Data Server components. Currently only RHEL 6 64-bit is supported.

The documentation names the requirements in several locations of the installation guide.

I’m using the following command stack to make sure that all software packages are installed:

  • yum -y install libstdc++.i686
  • yum -y install *libstdc++-33*.i*86
  • yum -y install openmotif22*.*86
  • yum -y install pam.i686
  • yum -y install libXpm.i686
  • yum -y install libXtst.i686
  • yum -y install freetype.i686
  • yum -y install libmcpp.i686
  • yum -y install libXdmcp.i686
  • yum -y install libxkbfile.i686
  • yum -y install libpciaccess.i686
  • yum -y install libXxf86misc
  • yum -y install libXm.so.4
  • yum -y install ksh*
  • yum -y install libstdc++.*
  • yum -y install *libstdc++-33*
  • yum -y install openmotif22*
  • yum -y install compat-glibc
  • yum -y install pam
  • yum -y install libXpm
  • yum -y install libXtst
  • yum -y install freetype
  • yum -y install xorg-x11-xinit
  • yum -y install Xorg
  • yum -y install firefox
  • yum -y install openmotif
  • yum -y install atlas
  • yum -y install compat-libgfortran-41
  • yum -y install blas
  • yum -y install lapack
  • yum -y install dapl
  • yum -y install sg3_utils
  • yum -y install libstdc++.so.6
  • yum -y install libstdc++.so.5
  • yum -y install java-1.7*-openjdk.x86_64
    Java is required to run the prerequisite checker delivered with the IBM InfoSphere Streams software.
    The packages below are installed because the InfoSphere Streams checker requires them.
  • yum -y install libcurl-devel.i686
  • yum -y install libcurl-devel.x86_64
  • yum -y install fuse-curlftpfs.x86_64
  • yum -y install libcurl.i686
  • yum -y install libcurl.x86_64
  • yum -y install perl-Time-HiRes
  • yum -y install perl-XML-Simple*
  • yum -y install gcc-c++

This command stack includes only those packages that are provided by the Red Hat Satellite server.

After you have installed all the packages above, you have to add the provided RPM package as documented in the installation manual.

I’ve used the following command:

#
# Install provided InfoSphere RPM Prerequisite
rpm -Uvh <streams_unpack_folder>/rpm/*.rpm

Having all these packages installed allows you to install all SCAPI software components.

Is IT monitoring out of style?

This blog post was also published on Service Management 360 on 09-Jul-2014.

A few weeks ago I read a blog entry written by Vinay Rajagopal on Service Management 360 with the headline “Still configuring thresholds to detect IT problems? Don’t just detect, predict!” I was wondering what that new big data approach will imply and what it means to my profession focusing on IT monitoring. Is IT monitoring old style now?

The IT service management discipline today is really a big data business. We have to take a lot of data into consideration if we want to understand the health of IT services. In today’s modern application architectures, with their multitier processing layers and the requirement that everything be available all the time and perform at an acceptable level, IT management becomes a challenge that often ends in critical situations.

The “old” approach of monitoring a single resource or a dedicated response time of a single transaction no longer seems to be the way to succeed. However, it is still essential to perform IT monitoring, for multiple reasons:

  1. IT monitoring helps to gather performance and availability data as well as log data from all involved systems.

    This data may be used to understand and learn the “normal” behavior. Understanding this “normal behavior” is essential to predict upcoming situations and to send out alerts earlier.

    The more data we gather from different sources, the better our prediction accuracy gets.

    With this early detection mechanism in place, fed by the IT monitoring from so many different data sources, operations teams gain enough time before the real outage takes place to avoid it.

  2. IT monitoring can help to identify very slow-growing misbehavior.

    Gathering large amounts of data does not guarantee that all misbehavior can be identified. If the response time of a transaction server system increases over a long period of time and all other monitored metrics evolve accordingly, an anomaly detection system will fail: there are no anomalies. Because growing workload is nothing unexpected and the growth takes place over a long period, only explicit thresholds will help. This is classical IT monitoring.

  3. IT monitoring helps subject matter experts to understand their silos.

    Yes, we should no longer think in silos, but for good system performance it is essential to have a good understanding of key performance metrics in the different disciplines, like operating systems, databases and middleware layers. IT monitoring gives the experts the required detailed insight and enables the teams to adjust performance tasks as required.

So the conclusion is simple: monitoring is a prerequisite for successful predictive analysis. Without monitoring you won’t have the data required to make the right decisions, whether manually or automatically, as described with IBM SmartCloud Analytics – Predictive Insights.

Prediction based on big data approaches is a great enhancement for IT monitoring and enables IT operations teams to identify system anomalies much earlier, and thus to start their response in time.

IBM SmartCloud Application Performance Management offers a suite of products to cover most monitoring requirements and gather the required data for predictive analysis.

So what is your impression? Is monitoring yesterday’s discipline?

Follow me on Twitter @DetlefWolf, or drop me a discussion point below to continue the conversation.