SAP Blog: My Adventures in Machine Learning 4

Ranking the systems by workload


Today I’ll start with a trivial question: how busy are the BW systems compared to each other? This is a preparation for my machine learning steps in the next section. I am already collecting many metrics, so I have enough data to evaluate. For this question I focus on the 26 metrics that I consider most helpful:

IP_INFOPAKIDS, IP_RECORDS, DTP_REQUESTS, DTP_DATAPAKS, DTP_RECORDS, DTP_GB, Q_USERS, Q_QUERIES, Q_NAVSTEPS, PC_CHAINS, PC_RUNS, BTC_JOBS, DB_BLOCK_CHANGES, LOG_GB, SELECTS, INSERTS, UPDATES, DELETES, CALLS, DV_NUMROWS, DV_DSO_GB, DV_IC_GB, DV_WARM_GB, BIA_USED, AL_SUCCESS, USERA

They should correlate very well with the desired workload: the higher these numbers, the higher the (useful) workload on the system. I omitted ambiguous metrics such as IP_SECONDS, because a high total runtime for InfoPackages could mean either:

  • a lot of data was being loaded from the source systems
  • little data was loaded from the source systems, but in a very inefficient way

I assume that no unnecessary batch jobs are scheduled and that users only execute useful queries, etc. The actual values of the metrics are normalized by their overall average. For each system, the normalized averages of all 26 metrics are summed up. A score of 26 would mean a system shows a very average overall workload. This also means that all 26 metrics have the same weight, which is a reasonable way to keep things simple.
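As an illustration, here is a minimal sketch of this scoring, assuming the collected metrics sit in a pandas DataFrame with one row per SID and day; the column names come from the list above, while the DataFrame layout and pandas itself are my assumptions, not necessarily how the data is actually stored.

import pandas as pd

# The 26 metrics listed above.
METRICS = [
    "IP_INFOPAKIDS", "IP_RECORDS", "DTP_REQUESTS", "DTP_DATAPAKS",
    "DTP_RECORDS", "DTP_GB", "Q_USERS", "Q_QUERIES", "Q_NAVSTEPS",
    "PC_CHAINS", "PC_RUNS", "BTC_JOBS", "DB_BLOCK_CHANGES", "LOG_GB",
    "SELECTS", "INSERTS", "UPDATES", "DELETES", "CALLS", "DV_NUMROWS",
    "DV_DSO_GB", "DV_IC_GB", "DV_WARM_GB", "BIA_USED", "AL_SUCCESS", "USERA",
]

def workload_scores(df: pd.DataFrame) -> pd.Series:
    """df: one row per SID and day, with a 'SID' column plus the 26 metrics."""
    # Normalize every metric by its overall average, so all metrics carry
    # the same weight; a value of 1.0 means "exactly average".
    normalized = df[METRICS] / df[METRICS].mean()
    # Average per system, then sum the 26 normalized averages:
    # a score of 26 corresponds to a perfectly average system.
    per_sid = normalized.join(df["SID"]).groupby("SID").mean()
    return per_sid.sum(axis=1).sort_values(ascending=False)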

Finally I get this overview:

SID    Score
PO7    59.04
POA    52.99
PO3    50.86
PO6    46.42
PH1    39.87
PO1    34.96
POC    24.41
PH2    23.98
POB    20.55
PO2    16.73
PO5     8.93
PO8     7.64
PO4     4.80
POD     1.74
PO9     1.38

Table 1: Workload Scores

System POC is pretty much average; the highest workload is on PO7. The BWonHANA systems PH1 and PH2 are somewhere in the middle.



Picture 1: Bar chart of the workload scores

From my personal experience, I would classify the systems PO5, PO8, PO4, POD and PO9 as low workload systems. Maybe the systems POC, PH2, POB and PO2 could be classified as medium workload systems, since they also seem to have a similar overall workload. Now let’s see whether classification algorithms confirm my opinion.

Classifying systems by workload


My overall goal is to detect unusual BW workload, so a classification algorithm should help me with anomaly detection. I’ll start with standard classification algorithms that simply take all available performance data, without knowing which SID the workload belongs to, and then cluster the data intelligently. Ideally each of the 15 SIDs will get its own cluster.



KMEANS


The first algorithm I used for clustering is KMEANS. This algorithm needs the number of desired clusters as an input. I used the silhouette score to identify the ideal number of clusters. The best fit was achieved with 9 clusters and a silhouette score of 0.45:

SID    Samples (days) in its cluster
PH1    67
PH2    67
PO1    71
PO2    71
PO3    71
PO4    71
PO5    70
PO6    71
PO7    70
PO8    71
PO9    71
POA    71
POB    71
POC    71
POD    71

Table 2: KMEANS optimal clustering

The numbers represent how many workload samples (days) were classified into the given cluster for each SID. The low workload systems PO4, PO5, PO8, PO9 and POD are lumped together in one group. However, there is no medium workload group. Systems PO2, POA and POB seem to have a similar workload and are therefore put into a single group. When I check what happens with a suboptimal clustering of 2, 3 or 4 clusters, I see that system PO7 is the first to be granted a cluster of its own, followed by PO1. These systems seem to be the easiest for KMEANS to identify by their workload.

A very valuable output of the KMEANS algorithm is the set of centroids for these 9 clusters. Each centroid gives me the data points of a typical workload for its cluster. These centroids are similar to, but more helpful than, a simple average of the collected data for a given SID. I need to test whether the distance of a data point (= the workload of a specific calendar day) to its centroid can be used for anomaly detection.
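For reference, a minimal sketch of how this step could look, assuming scikit-learn (my assumption for the library) and a scaled feature matrix X with one row per SID and day:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_kmeans(X, k_range=range(2, 16)):
    # Fit KMEANS for each candidate k and keep the model with the
    # highest silhouette score (0.45 at k=9 in my case).
    fits = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X) for k in k_range}
    scores = {k: silhouette_score(X, fit.labels_) for k, fit in fits.items()}
    best_k = max(scores, key=scores.get)
    return best_k, fits[best_k]

def centroid_distances(X, km):
    # Distance of every sample (one day of workload) to its own centroid,
    # as a candidate anomaly score: the larger, the more unusual the day.
    return np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)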

DBSCAN


The second algorithm doesn’t need to be told how many clusters it should generate, but it needs the allowed distance and the number of neighbors that constitute a cluster as input. Again I used the silhouette score to identify the ideal clustering. This time the silhouette score was only 0.41, so slightly worse. Probably due to the small amount of data, many small clusters were detected; with more data they would likely merge into fewer, larger clusters:

SID    Samples (days) per cluster
PH1    2, 65
PH2    5, 62
PO1    4, 41, 17, 6, 4
PO2    71
PO3    2, 39, 30
PO4    71
PO5    70
PO6    2, 69
PO7    1, 69
PO8    71
PO9    71
POA    17, 34, 9, 7, 4
POB    5, 40, 26
POC    40, 21, 8
POD    71

Table 3: DBSCAN optimal clustering

DBSCAN creates a cluster containing the outliers or anomalies, and it creates several clusters for a system if the workload is very diverse. For example, PO1, PO3 and POA get several clusters each. Maybe with more data collected they would merge into larger clusters. Remember, KMEANS confidently put all PO1 samples into one group. Again the low workload systems PO4, PO5, PO8, PO9 and POD are grouped into a single cluster. And PO2, POB and POC seem similar for DBSCAN, but not as similar as for KMEANS.
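A sketch of the DBSCAN step under the same assumptions (scikit-learn, a scaled matrix X); the eps and min_samples values here are only placeholders that would be tuned via the silhouette score as described above:

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def dbscan_clusters(X, eps=2.0, min_samples=5):
    # eps = allowed distance, min_samples = neighbors needed to form a cluster.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    noise = labels == -1                 # label -1 is the outlier/"noise" cluster
    # The silhouette score is computed on the clustered points only.
    score = silhouette_score(X[~noise], labels[~noise])
    return labels, score, noise.sum()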

Gaussian Mixtures


As a third algorithm I checked Gaussian Mixtures. They assume an ellipsoidal data distribution similar to KMEANS, so let’s see if the output is comparable. I couldn’t compute a silhouette score here, so I had to identify the ideal clustering myself. I chose this one:

SID    Samples (days) per cluster
PH1    67
PH2    67
PO1    71
PO2    71
PO3    40, 30
PO4    66, 5
PO5    4, 66
PO6    71
PO7    70
PO8    5, 66
PO9    70, 1
POA    14, 57
POB    71
POC    71
POD    71

Table 4: Gaussian Mixtures ideal clustering

This time the “low workload” systems PO4, PO5, PO8, PO9 and POD are distributed over two clusters. The other similar systems PO2, POB and POC each get their own cluster, however. To detect anomalies, I could examine the rare cases (the small sample counts in table 4), or, similar to KMEANS, the Gaussian Mixtures can calculate a probability that a sample belongs to a certain cluster. Then I just need to define a threshold probability below which samples are treated as outliers.
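A sketch of this probability-based outlier rule, again assuming scikit-learn; the 11 components match table 4, while the 5% threshold is just an example value:

from sklearn.mixture import GaussianMixture

def gmm_outliers(X, n_components=11, threshold=0.05):
    gm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    # Probability that each sample belongs to its most likely cluster.
    max_prob = gm.predict_proba(X).max(axis=1)
    # Samples that don't fit any cluster well are flagged as outliers.
    return max_prob < threshold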



Next Steps


In summary, I was not happy with the classification algorithms so far. They were easy to implement and ran fast, but the results feel somewhat dumb. Of course, obvious differences are spotted, but I wouldn’t need these algorithms for spotting the obvious. Small random differences can have a large impact on the clustering, and I assume that this would still be true if I had 5 times as much data available. I am sure people with real experience in machine learning are reading this. If anyone could share some tips, that would be highly appreciated.

So my next step is to train a neural network for anomaly detection as a quite different approach. Hopefully a neural network can detect and learn subtle patterns in the workload data. After that, I will examine in detail the anomalies detected by the various approaches to see which one provides the best results.
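One possible shape of that neural network approach is a small autoencoder whose reconstruction error serves as the anomaly score; Keras and all layer sizes below are my assumptions, not a final design:

import numpy as np
from tensorflow import keras

def build_autoencoder(n_features=26):
    # Compress the 26 metrics into a small bottleneck and reconstruct them;
    # days that reconstruct badly are candidates for anomalies.
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(12, activation="relu"),
        keras.layers.Dense(4, activation="relu"),      # bottleneck
        keras.layers.Dense(12, activation="relu"),
        keras.layers.Dense(n_features, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Usage sketch: train on scaled "normal" days, then score all days.
# model.fit(X_train, X_train, epochs=100, batch_size=16)
# errors = np.mean((model.predict(X) - X) ** 2, axis=1)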
