SAP Blog: My Adventures in Machine Learning 4

Ranking the systems by workload


Today I’ll start with a trivial question: how busy are the BW systems compared to each other? This is a preparation for my machine learning steps in the next section. I am already collecting many metrics, so I have enough data to evaluate. For this question I focus on the 26 metrics that I consider most helpful:

IP_INFOPAKIDS, IP_RECORDS, DTP_REQUESTS, DTP_DATAPAKS, DTP_RECORDS, DTP_GB, Q_USERS, Q_QUERIES, Q_NAVSTEPS, PC_CHAINS, PC_RUNS, BTC_JOBS, DB_BLOCK_CHANGES, LOG_GB, SELECTS, INSERTS, UPDATES, DELETES, CALLS, DV_NUMROWS, DV_DSO_GB, DV_IC_GB, DV_WARM_GB, BIA_USED, AL_SUCCESS, USERA

They should correlate very well with the desired workload: the higher these numbers, the higher the (useful) workload on the system. I omitted ambiguous metrics such as IP_SECONDS, because a high total runtime for InfoPackages could mean either:

  • a lot of data was being loaded from the source systems
  • little data was loaded from the source systems, but in a very inefficient way

I assume that no unnecessary batch jobs are scheduled and that users only execute useful queries, etc. The actual values of the metrics are normalized by their overall average. For each system, the normalized averages of all 26 metrics are summed up. A score of 26 would mean a system shows a very average overall workload. This also means that all 26 metrics have the same weight, which is a reasonable way to keep things simple.
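As an illustration, here is a minimal sketch of this scoring, assuming the collected metrics sit in a pandas DataFrame with one row per SID and day; the column names come from the list above, while the DataFrame layout and pandas itself are my assumptions, not necessarily how the data is actually stored.

import pandas as pd

# The 26 metrics listed above.
METRICS = [
    "IP_INFOPAKIDS", "IP_RECORDS", "DTP_REQUESTS", "DTP_DATAPAKS",
    "DTP_RECORDS", "DTP_GB", "Q_USERS", "Q_QUERIES", "Q_NAVSTEPS",
    "PC_CHAINS", "PC_RUNS", "BTC_JOBS", "DB_BLOCK_CHANGES", "LOG_GB",
    "SELECTS", "INSERTS", "UPDATES", "DELETES", "CALLS", "DV_NUMROWS",
    "DV_DSO_GB", "DV_IC_GB", "DV_WARM_GB", "BIA_USED", "AL_SUCCESS", "USERA",
]

def workload_scores(df: pd.DataFrame) -> pd.Series:
    """df: one row per SID and day, with a 'SID' column plus the 26 metrics."""
    # Normalize every metric by its overall average, so all metrics carry
    # the same weight; a value of 1.0 means "exactly average".
    normalized = df[METRICS] / df[METRICS].mean()
    # Average per system, then sum the 26 normalized averages:
    # a score of 26 corresponds to a perfectly average system.
    per_sid = normalized.join(df["SID"]).groupby("SID").mean()
    return per_sid.sum(axis=1).sort_values(ascending=False)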

Finally I get this overview:

SID    Score
PO7    59.04
POA    52.99
PO3    50.86
PO6    46.42
PH1    39.87
PO1    34.96
POC    24.41
PH2    23.98
POB    20.55
PO2    16.73
PO5     8.93
PO8     7.64
PO4     4.80
POD     1.74
PO9     1.38

Table 1: Workload Scores

System POC is pretty much average; the highest workload is on PO7. The BWonHANA systems PH1 and PH2 are somewhere in the middle.



Picture 1: Bar chart of the workload scores

From my personal experience, I would classify the systems PO5, PO8, PO4, POD and PO9 as low workload systems. Maybe the systems POC, PH2, POB and PO2 could be classified as medium workload systems, since they also seem to have a similar overall workload. Now let’s see whether classification algorithms confirm my opinion.

Classifying systems by workload


My overall goal is to detect unusual BW workload, so a classification algorithm should help me with anomaly detection. I’ll start with standard classification algorithms that simply take all available performance data, without knowing which SID the workload belongs to, and then cluster the data intelligently. Ideally each of the 15 SIDs will get its own cluster.



KMEANS


The first algorithm I used for clustering is KMEANS. This algorithm needs the number of desired clusters as an input. I used the silhouette score to identify the ideal number of clusters. The best fit was achieved with 9 clusters and a silhouette score of 0.45:

SID    Samples (days) in its cluster
PH1    67
PH2    67
PO1    71
PO2    71
PO3    71
PO4    71
PO5    70
PO6    71
PO7    70
PO8    71
PO9    71
POA    71
POB    71
POC    71
POD    71

Table 2: KMEANS optimal clustering

The numbers represent how many workload samples (days) were classified into the given cluster for each SID. The low workload systems PO4, PO5, PO8, PO9 and POD are lumped together in one group. However, there is no medium workload group. Systems PO2, POA and POB seem to have a similar workload and are therefore put into a single group. When I check what happens with a suboptimal clustering of 2, 3 or 4 clusters, I see that system PO7 is the first to be granted a cluster of its own, followed by PO1. These systems seem to be the easiest for KMEANS to identify by their workload.

A very valuable output of the KMEANS algorithm is the set of centroids for these 9 clusters. Each centroid gives me the data points of a typical workload for its cluster. These centroids are similar to, but more helpful than, a simple average of the collected data for a given SID. I need to test whether the distance of a data point (= the workload of a specific calendar day) to its centroid can be used for anomaly detection.
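For reference, a minimal sketch of how this step could look, assuming scikit-learn (my assumption for the library) and a scaled feature matrix X with one row per SID and day:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_kmeans(X, k_range=range(2, 16)):
    # Fit KMEANS for each candidate k and keep the model with the
    # highest silhouette score (0.45 at k=9 in my case).
    fits = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X) for k in k_range}
    scores = {k: silhouette_score(X, fit.labels_) for k, fit in fits.items()}
    best_k = max(scores, key=scores.get)
    return best_k, fits[best_k]

def centroid_distances(X, km):
    # Distance of every sample (one day of workload) to its own centroid,
    # as a candidate anomaly score: the larger, the more unusual the day.
    return np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)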

DBSCAN


The second algorithm doesn’t need to be told how many clusters it should generate, but it needs the allowed distance and the number of neighbors that constitute a cluster as input. Again I used the silhouette score to identify the ideal clustering. This time the silhouette score was only 0.41, so slightly worse. Probably due to the small amount of data, many small clusters were detected; with more data they would likely merge into fewer, larger clusters:

SID    Samples (days) per cluster
PH1    2, 65
PH2    5, 62
PO1    4, 41, 17, 6, 4
PO2    71
PO3    2, 39, 30
PO4    71
PO5    70
PO6    2, 69
PO7    1, 69
PO8    71
PO9    71
POA    17, 34, 9, 7, 4
POB    5, 40, 26
POC    40, 21, 8
POD    71

Table 3: DBSCAN optimal clustering

DBSCAN creates a cluster containing the outliers or anomalies, and it creates several clusters for a system if the workload is very diverse. For example, PO1, PO3 and POA get several clusters each. Maybe with more data collected they would merge into larger clusters. Remember, KMEANS confidently put all PO1 samples into one group. Again the low workload systems PO4, PO5, PO8, PO9 and POD are grouped into a single cluster. And PO2, POB and POC seem similar for DBSCAN, but not as similar as for KMEANS.
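A sketch of the DBSCAN step under the same assumptions (scikit-learn, a scaled matrix X); the eps and min_samples values here are only placeholders that would be tuned via the silhouette score as described above:

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def dbscan_clusters(X, eps=2.0, min_samples=5):
    # eps = allowed distance, min_samples = neighbors needed to form a cluster.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    noise = labels == -1                 # label -1 is the outlier/"noise" cluster
    # The silhouette score is computed on the clustered points only.
    score = silhouette_score(X[~noise], labels[~noise])
    return labels, score, noise.sum()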

Gaussian Mixtures


As a third algorithm I checked Gaussian Mixtures. They assume an ellipsoidal data distribution similar to KMEANS, so let’s see if the output is comparable. I couldn’t compute a silhouette score here, so I had to identify the ideal clustering myself. I chose this one:

SID    Samples (days) per cluster
PH1    67
PH2    67
PO1    71
PO2    71
PO3    40, 30
PO4    66, 5
PO5    4, 66
PO6    71
PO7    70
PO8    5, 66
PO9    70, 1
POA    14, 57
POB    71
POC    71
POD    71

Table 4: Gaussian Mixtures ideal clustering

This time the “low workload” systems PO4, PO5, PO8, PO9 and POD are distributed over two clusters. The other similar systems PO2, POB and POC each get their own cluster, however. To detect anomalies, I could examine the rare cases (the small sample counts in table 4), or, similar to KMEANS, the Gaussian Mixtures can calculate a probability that a sample belongs to a certain cluster. Then I just need to define a threshold probability below which samples are treated as outliers.
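A sketch of this probability-based outlier rule, again assuming scikit-learn; the 11 components match table 4, while the 5% threshold is just an example value:

from sklearn.mixture import GaussianMixture

def gmm_outliers(X, n_components=11, threshold=0.05):
    gm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    # Probability that each sample belongs to its most likely cluster.
    max_prob = gm.predict_proba(X).max(axis=1)
    # Samples that don't fit any cluster well are flagged as outliers.
    return max_prob < threshold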



Next Steps


In summary, I was not happy with the classification algorithms so far. They were easy to implement and ran fast, but the results feel somewhat dumb. Of course, obvious differences are spotted, but I wouldn’t need these algorithms for spotting the obvious. Small random differences can have a large impact on the clustering, and I assume that this would still be true if I had 5 times as much data available. I am sure people with real experience in machine learning are reading this. If anyone could share some tips, that would be highly appreciated.

So my next step is to train a neural network for anomaly detection as a quite different approach. Hopefully a neural network can detect and learn subtle patterns in the workload data. After that, I will examine in detail the anomalies detected by the various approaches to see which one provides the best results.
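One possible shape of that neural network approach is a small autoencoder whose reconstruction error serves as the anomaly score; Keras and all layer sizes below are my assumptions, not a final design:

import numpy as np
from tensorflow import keras

def build_autoencoder(n_features=26):
    # Compress the 26 metrics into a small bottleneck and reconstruct them;
    # days that reconstruct badly are candidates for anomalies.
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(12, activation="relu"),
        keras.layers.Dense(4, activation="relu"),      # bottleneck
        keras.layers.Dense(12, activation="relu"),
        keras.layers.Dense(n_features, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Usage sketch: train on scaled "normal" days, then score all days.
# model.fit(X_train, X_train, epochs=100, batch_size=16)
# errors = np.mean((model.predict(X) - X) ** 2, axis=1)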
