Revolutionizing System Testing With AI and ML
This article explores leveraging K-Means clustering, PCA, and anomaly detection for large-scale digital transformation with AI and ML.
The digital transformation of businesses involves the adoption of digital technologies to change the way companies operate and deliver value to their customers. This can include the use of cloud computing, artificial intelligence, big data analytics, the Internet of Things (IoT), and other digital tools.
One of the significant challenges that come with digital transformation is ensuring that software systems remain reliable and secure. This is where software testing comes in. As software systems become more complex, testing becomes more critical than ever. It helps to ensure that software is functioning as expected, that bugs and vulnerabilities are identified and addressed, and that the software meets user needs and expectations.
However, testing in the context of digital transformation can be particularly challenging due to the complexity and scale of the systems involved. Testers must identify test cases that adequately cover all possible scenarios without wasting time testing redundant or insignificant scenarios. They must also account for the lack of documentation often found in legacy systems, which can make it challenging to understand the system and its dependencies.
Additionally, digital transformation often involves migrating legacy systems to the cloud. This can require changes to the architecture, which can introduce new bugs or vulnerabilities. This makes it challenging to identify and fix issues during testing, as they may not be immediately apparent.
To overcome these challenges, organizations can adopt new testing strategies and tools. For example, leveraging machine learning algorithms such as K-Means and Anomaly Detection can help to identify test cases for complex domains in large digital transformation projects. K-Means can cluster data points into distinct groups based on their similarity, allowing testers to identify patterns and commonalities across different scenarios. Anomaly Detection can identify data points that deviate significantly from the rest of the data, helping testers to pinpoint potential bugs or vulnerabilities.
To leverage K-Means and Anomaly Detection algorithms for identifying test cases, we can follow these steps:
Step 1: Data Collection
The first step is to collect data related to the digital transformation project. The team can draw on existing data sources, for example by exporting historical records from the relevant database tables.
import pandas as pd
data = pd.read_json('./historical_data_from_database.json')
Step 2: Data Preprocessing
The collected data must be preprocessed to ensure that it is consistent, complete, and relevant. The data must be cleaned by removing any irrelevant or redundant data. The data must also be transformed into a suitable format for further processing. The below code first removes unnecessary columns and then replaces all null and empty string values.
import numpy as np
filtered_table_columns = ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6', 'column_7']
data = data[filtered_table_columns]
# Replace empty strings and missing values with a placeholder
data = data.replace({'': '9999', np.nan: '9999'})
Since machine learning algorithms need numeric values, we need to convert all non-numeric values to numeric ones. To achieve this, we will create a dictionary, tag_to_idx, for each column and replace the values in the data with their corresponding dictionary index.
# Translate all string or non-numeric values to numeric values.
tags = {}
tag_to_idx = {}
idx_to_tag = {}
features_with_non_numeric_values = ['column_1', 'column_2', 'column_4', 'column_6', 'column_7']
for item in features_with_non_numeric_values:
    # Distinct values of the column, in order of first appearance
    tags[item] = list(data[item].unique())
    tag_to_idx[item] = {tag: idx for idx, tag in enumerate(tags[item])}
    idx_to_tag[item] = {idx: tag for tag, idx in tag_to_idx[item].items()}
    # Replace every value with its integer index
    data[item] = data[item].map(tag_to_idx[item])
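As a quick sanity check of this encoding scheme, the sketch below applies the same dictionary approach to a small made-up DataFrame and decodes it again. The column names and values are hypothetical and stand in for your own tables.

```python
import pandas as pd

# Hypothetical toy data; the real column names come from your own tables.
toy = pd.DataFrame({
    "status": ["open", "closed", "open", "9999"],
    "region": ["EU", "US", "EU", "APAC"],
})

tag_to_idx = {}
idx_to_tag = {}
for column in ["status", "region"]:
    # Assign each distinct value an integer in order of first appearance.
    tag_to_idx[column] = {tag: idx for idx, tag in enumerate(toy[column].unique())}
    idx_to_tag[column] = {idx: tag for tag, idx in tag_to_idx[column].items()}
    toy[column] = toy[column].map(tag_to_idx[column])

# Decode the integer codes back to the original labels.
decoded = toy.copy()
for column in ["status", "region"]:
    decoded[column] = decoded[column].map(idx_to_tag[column])

print(toy)
print(decoded)
```

Keep in mind that integer codes imply an ordering the categories may not actually have; whether that matters for distance-based methods such as K-Means depends on the data.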
Step 3: Feature Extraction
Feature extraction is the process of deriving, from the preprocessed data, the features that will be used for clustering and anomaly detection. These features can be extracted using techniques such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD).
Principal Component Analysis (PCA) is a powerful statistical technique that is widely used for reducing the dimensionality of data while preserving its most important features or patterns. The technique is commonly applied in data preprocessing and feature extraction before applying machine learning algorithms.
PCA transforms the original variables of the data into a new set of uncorrelated variables called principal components. These components are ranked according to decreasing variance, where the first principal component has the highest variance, and each subsequent component has the greatest possible variance, given that it is orthogonal to the previous components.
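The two properties described above (components ranked by decreasing variance, and components that are mutually uncorrelated) can be checked on synthetic data. This is only an illustrative sketch; the data is randomly generated and not taken from the project.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: three features, two of which are strongly correlated.
base = rng.normal(size=(500, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),
    2.0 * base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

pca = PCA().fit(X)
components = pca.transform(X)

# Share of variance explained by each component, in decreasing order.
print(pca.explained_variance_ratio_)
# Covariance between the transformed columns; off-diagonal entries are ~0.
print(np.round(np.cov(components, rowvar=False), 6))
```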
Before applying PCA, the data is first normalized using mean normalization. This involves calculating the mean of each feature and subtracting it from the original data, resulting in each feature having a zero mean. Since different features can have vastly different scales, they are also scaled to have a similar range of values. This ensures that each feature is equally important in the analysis, regardless of its initial scale.
The below code explains how we can perform normalization on the data.
# Normalize data
data_normalized = data.copy()
data_normalized = (data_normalized - data_normalized.mean()) / data_normalized.std()
# Drop columns that became all-NaN after normalization (constant columns have zero standard deviation)
data_normalized = data_normalized.dropna(axis=1, how='all')
data_normalized.head()
After normalization, every feature has zero mean and unit standard deviation, so all features are on a comparable scale.
The diagram below illustrates original three-dimensional vectors and their normalized counterparts. Notably, “feature 3” has a range of 0 to 1000, while “feature 1” ranges from 0 to 10. Following normalization, each feature is scaled to a comparable range of values, specifically within the range of -1.5 to 1.5.
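To make the effect of normalization concrete, here is a minimal sketch on two made-up features with very different ranges. The feature names and values are hypothetical.

```python
import pandas as pd

# Hypothetical features on very different scales.
toy = pd.DataFrame({
    "feature_1": [0.0, 2.5, 5.0, 7.5, 10.0],
    "feature_3": [0.0, 250.0, 500.0, 750.0, 1000.0],
})

# Z-score normalization: subtract the column mean, divide by the column std.
toy_normalized = (toy - toy.mean()) / toy.std()

print(toy_normalized)
print(toy_normalized.mean().round(6).tolist())
print(toy_normalized.std().round(6).tolist())
```

Because both columns are evenly spaced, they become identical after normalization even though their original ranges differ by a factor of 100.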
PCA is particularly useful for reducing the dimensionality of data by combining highly correlated features into a smaller number of principal components that account for most of the variance in the data. This makes it easier to visualize and analyze the data and can improve the performance of machine learning algorithms by reducing the risk of overfitting and increasing computational efficiency.
Here's a sample diagram to show how PCA can reduce the dimensionality of 3D data to 2D:
Before reducing dimensions, we need to decide how many dimensions to keep when the data has many features.
Deciding the number of components to use in PCA is an important step in the process. One common approach is to use the "elbow" method, which involves calculating the explained variance for each component and plotting it against the number of components. The idea is to select the number of components at the "elbow" of the plot, which is the point where the additional components start to add less explanatory power.
The below code can be used to determine the elbow index in the normalized data to select the optimal dimension to be used for PCA.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Fit PCA on the normalized data, keeping all components
pca = PCA().fit(data_normalized)
explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)
component_counts = np.arange(1, len(explained) + 1)
plt.plot(component_counts, cumulative)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
# Heuristic elbow: the component at which the drop in explained variance
# flattens the most (largest second difference). Needs at least 3 components;
# always confirm the choice against the plot.
elbow_index = int(np.argmax(np.diff(explained, n=2))) + 2
# Plot the elbow point and display its index value
plt.plot(elbow_index, cumulative[elbow_index - 1], marker='o', markersize=12,
         label=f'Elbow Point ({elbow_index} components)', color='red')
plt.axvline(x=elbow_index, color='r', linestyle='--')
plt.legend()
plt.show()
Set the n_components value to the value of elbow_index.
# Perform PCA
data_pca = PCA(n_components=elbow_index).fit_transform(data_normalized)
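As a cross-check on the elbow choice, one option is to ask scikit-learn for the smallest number of components that reaches a target share of the variance. The sketch below uses synthetic data as a stand-in for data_normalized, and the 95% target is an assumption chosen for illustration rather than a value from the project.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic stand-in for data_normalized: 10 features driven by 3 latent factors.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# A float n_components asks scikit-learn to keep the fewest components
# whose cumulative explained variance exceeds that fraction.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)
print(pca_95.n_components_)
print(pca_95.explained_variance_ratio_.sum())
```

If this number and the elbow index disagree widely, that is a hint to inspect the variance plot by eye before settling on a value.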
Step 4: K-Means Clustering
K-Means clustering is a commonly used unsupervised learning algorithm for data clustering. It is a partitioning technique that divides a dataset into K clusters, where K is a pre-defined number of clusters. The algorithm works by first randomly initializing K centroids (cluster centers) and assigning each data point to its nearest centroid. Then, it iteratively updates the centroids by computing the mean of all the data points in each cluster and reassigning the data points to the nearest centroid based on the updated centroids. The iterations continue until the centroids no longer change significantly or a maximum number of iterations is reached. The final output is the K clusters, each containing the data points that are closest to its centroid. We will use the clustering algorithm to create clusters of data to be used for test cases.
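The assignment and update steps described above can be sketched in plain NumPy. This is a simplified illustration of the algorithm, not the scikit-learn implementation; the function name kmeans_sketch and the synthetic points are hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Plain K-Means (Lloyd's algorithm); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(42)
# Two synthetic blobs of 50 points each.
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(points, k=2)
print(centroids)
```

Because the result depends on the random initialization, library implementations typically run several initializations and keep the best one.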
The number of clusters can be determined using techniques such as the Elbow Method or the Silhouette Method. The below code plots the within-cluster sum of squares against the number of clusters so that the elbow point, and with it a suitable number of clusters for K-Means on data_pca, can be identified visually.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Define a list of the number of clusters to try
num_clusters = range(1, 20)
# Calculate the within-cluster sum of squares for each number of clusters
wcss = []
for n in num_clusters:
    kmeans = KMeans(n_clusters=n, n_init=10, random_state=42)
    kmeans.fit(data_pca)
    wcss.append(kmeans.inertia_)
# Plot the within-cluster sum of squares as a function of the number of clusters
plt.plot(num_clusters, wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster Sum of Squares')
# Heuristic elbow: the cluster count at which the decrease in WCSS flattens
# the most (largest second difference); always confirm it against the plot
elbow_index = int(np.argmax(np.diff(wcss, n=2))) + 2
plt.plot(elbow_index, wcss[elbow_index - 1], marker='o', markersize=12,
         label='Elbow Point ({})'.format(elbow_index), color='red')
plt.legend()
plt.savefig('elbow_plot.png')
plt.show()
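The Silhouette Method mentioned earlier offers a second opinion on the number of clusters. The sketch below is one possible way to apply it; it runs on synthetic blobs standing in for data_pca, and the candidate range of 2 to 7 clusters is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for data_pca with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=[[-10, -10], [0, 0], [10, 10]],
                  cluster_std=0.5, random_state=0)

# The silhouette score needs at least two clusters.
candidate_ks = range(2, 8)
scores = {}
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the cluster count with the highest average silhouette score.
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

A score close to 1 means points sit well inside their own cluster; values near 0 suggest overlapping clusters.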
The below code can be used to perform clustering based on the determined elbow index.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Create a KMeans clustering model with the number of clusters found above
kmeans = KMeans(n_clusters=elbow_index, n_init=10, random_state=42)
# Fit the model and get the predicted cluster for each data point
clusters = kmeans.fit_predict(data_pca)
centroids = kmeans.cluster_centers_
# Add the predicted clusters as a new column to the original dataframe
data['cluster'] = clusters
# Scatter plot of the transformed data on the first two principal components
plt.scatter(data_pca[:, 0], data_pca[:, 1], c=clusters, cmap='rainbow')
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.title('K-Means Clustering')
plt.savefig('clusters_plot.png')
plt.show()
The below code can be used to look at counts per cluster or items in a specific cluster.
# Print counts per cluster
print(data['cluster'].value_counts())
# Extract items from a specific cluster (13 is an example; use a label that exists in your data)
cluster_13 = data[data['cluster'] == 13].copy()
# Convert the encoded values back to the original tags
for item in features_with_non_numeric_values:
    cluster_13[item] = cluster_13[item].map(idx_to_tag[item])
cluster_13
Step 5: Anomaly Detection
Anomaly Detection can be performed to identify the test cases that deviate significantly from the rest of the data. These test cases can be flagged for further investigation and testing.
In anomaly detection, we typically have only one class of data that represents normal behavior. Thus, we cannot split our data into a training and validation set in a traditional sense, as we need all normal behavior data to train the model. However, we can still create a validation set using the following methods:
- Time-based validation: We can split our data into a training set and a validation set based on time. For example, we can train our model on data from the first 90% of the time period and validate it on the last 10%.
- Random sampling: We can randomly select a portion of our data (e.g., 10-20%) to use as a validation set and use the remaining data for training.
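The time-based option in the list above can be sketched as follows. The DataFrame and its created_at column are hypothetical; substitute whichever timestamp your records carry.

```python
import numpy as np
import pandas as pd

# Hypothetical records with a timestamp column; the column name is an assumption.
records = pd.DataFrame({
    "created_at": pd.date_range("2023-01-01", periods=100, freq="D"),
    "value": np.arange(100),
})

# Sort chronologically, then cut at the 90% mark.
records = records.sort_values("created_at").reset_index(drop=True)
split_point = int(len(records) * 0.9)
train_time = records.iloc[:split_point]
validation_time = records.iloc[split_point:]

print(len(train_time), len(validation_time))
```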
It is important to note that in anomaly detection, the performance of the model is often evaluated using metrics such as precision, recall, and F1 score on the validation set. However, the results should always be interpreted with caution, as the distribution of anomalies in the validation set may not be representative of the distribution of anomalies in real-world scenarios.
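If some validation records have been labeled as known anomalies, these metrics can be computed with scikit-learn. The labels and predictions below are invented for illustration; the article's dataset does not necessarily include such labels.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical validation labels: 1 marks a known anomaly, 0 a normal record.
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
# Hypothetical detector output for the same records.
y_pred = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# Precision: share of flagged records that are real anomalies.
precision = precision_score(y_true, y_pred)
# Recall: share of real anomalies that were flagged.
recall = recall_score(y_true, y_pred)
# F1: harmonic mean of precision and recall.
f1 = f1_score(y_true, y_pred)
print(precision, recall, f1)
```

Here two of the three flagged records are real anomalies and two of the three real anomalies are flagged, so all three metrics come out at two thirds.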
You can use the below code to generate the training, test, and validation sets.
from sklearn.model_selection import train_test_split
# Hold out 20% of the data for testing
data_pca_training, data_pca_test = train_test_split(data_pca, test_size=0.2, random_state=2018)
# Hold out 20% of the remaining data for validation
data_pca_training, data_pca_validation = train_test_split(data_pca_training, test_size=0.2, random_state=2018)
print('training set size {} \n validation set size {} \n testing set size {}'.format(len(data_pca_training), len(data_pca_validation), len(data_pca_test)))
The below code trains an auto-encoder to detect anomalies in a dataset. An auto-encoder is an unsupervised learning algorithm that learns to reconstruct its input. In this case, the input is the PCA-reduced dataset, and the output is the reconstruction of the input. The auto-encoder is trained to minimize the mean squared error between the input and its reconstruction.
Reconstruction error refers to the difference between the input data and the output data generated by the auto-encoder model. The auto-encoder is trained to learn an efficient representation of the input data, and the reconstruction error measures how well the auto-encoder can reconstruct the original input data from this learned representation.
The reconstruction error is typically measured using a loss function, such as mean squared error (MSE), which measures the average squared difference between the input and output data. A lower reconstruction error indicates that the auto-encoder is better at reconstructing the input data, while a higher reconstruction error indicates that the auto-encoder is not as good at reconstructing the input data.
In anomaly detection, we can use the reconstruction error as a measure of how different an input data point is from the normal or expected data distribution. Data points with high reconstruction error values are more likely to be anomalies or outliers in the data.
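To illustrate the idea of reconstruction error without training a neural network, the sketch below swaps in PCA as the compress-and-reconstruct model. This is a deliberate simplification and not the auto-encoder used in this article; the data is synthetic, with one outlier planted at the end.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Synthetic "normal" records lying close to a line in 3-D space.
t = rng.normal(size=(200, 1))
normal = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))
# One hypothetical outlier placed far from that line.
outlier = np.array([[5.0, -5.0, 5.0]])
X = np.vstack([normal, outlier])

# Compress to one dimension and reconstruct (PCA stands in for the auto-encoder).
pca = PCA(n_components=1).fit(normal)
reconstructions = pca.inverse_transform(pca.transform(X))

# Per-record reconstruction error: mean squared difference across features.
mse = np.mean(np.square(X - reconstructions), axis=1)
print(mse[:3], mse[-1])
print(int(np.argmax(mse)))
```

Records that follow the learned structure reconstruct almost perfectly, while the planted outlier cannot be represented in the compressed space and ends up with by far the largest error.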
After training, the reconstruction error for each input data point is calculated, and anomalies are identified as data points whose reconstruction error is above a certain threshold. Finally, the code selects the identified anomalies from the original dataset and replaces any non-numeric feature values with their original labels.
The below code shows how the anomaly detection algorithm can be applied to the data.
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping
# Define the shape of the input data
input_shape = (data_pca.shape[1],)
# Define the encoder and decoder layers
encoder_layer1 = Dense(64, activation='relu')
encoder_layer2 = Dense(32, activation='relu')
encoder_layer3 = Dense(16, activation='relu')
decoder_layer1 = Dense(32, activation='relu')
decoder_layer2 = Dense(64, activation='relu')
# Linear output: PCA values can be negative or larger than 1, which a sigmoid cannot reproduce
decoder_layer3 = Dense(input_shape[0], activation='linear')
# Define the autoencoder model
inputs = Input(shape=input_shape)
encoder = encoder_layer3(encoder_layer2(encoder_layer1(inputs)))
decoder = decoder_layer3(decoder_layer2(decoder_layer1(encoder)))
autoencoder = Model(inputs=inputs, outputs=decoder)
# Compile the model
autoencoder.compile(optimizer='adam', loss='mse')
# Train the autoencoder, stopping when the validation loss stops improving
early_stop = EarlyStopping(patience=20, verbose=1)
# In this example we fit on all of data_pca rather than the training and validation sets.
history = autoencoder.fit(data_pca, data_pca, validation_data=(data_pca, data_pca), epochs=500, batch_size=10, callbacks=[early_stop])
# Get reconstruction errors (one mean squared error per record)
reconstructions = autoencoder.predict(data_pca)
mse = np.mean(np.square(data_pca - reconstructions), axis=1)
threshold = np.mean(mse) + 5 * np.std(mse)
# Find anomalies: positions of records whose error exceeds the threshold
anomalies = np.where(mse > threshold)[0]
anomaly_samples = data.iloc[anomalies].copy()
# Replace encoded values with their corresponding tags
for item in features_with_non_numeric_values:
    anomaly_samples[item] = anomaly_samples[item].map(idx_to_tag[item])
# Return the anomaly samples
anomaly_samples
The threshold is set to np.mean(mse) + 5 * np.std(mse), which means that anomalies are defined as those data points whose reconstruction error (the mean squared error between the input and its reconstructed output) exceeds the mean by more than five standard deviations. This is a more conservative threshold than two or three standard deviations: it flags fewer records, but those it does flag are less likely to be false positives.
If the reconstruction errors followed a normal distribution, about 95.45% of values would lie within two standard deviations of the mean, about 99.73% within three, and about 99.99994% within five. Reconstruction errors are non-negative and often right-skewed, so these percentages should be treated as rough guides rather than exact confidence levels.
The choice of threshold ultimately depends on the specific application and the acceptable trade-off between the number of false positives (normal data classified as anomalies) and false negatives (anomalies not detected).
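The percentages associated with each multiple of the standard deviation can be reproduced with the error function. The sketch below assumes a normal distribution, which reconstruction errors will only approximate, and also reports the one-sided tail, since the threshold only flags errors above the mean.

```python
import math

coverage = {}
upper_tail = {}
for k in (2, 3, 5):
    # Share of a normal distribution within k standard deviations of the mean.
    coverage[k] = math.erf(k / math.sqrt(2))
    # Share expected above mean + k * std (one-sided), under the same assumption.
    upper_tail[k] = (1.0 - coverage[k]) / 2.0
    print(k, coverage[k], upper_tail[k])
```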
The below code can be used to plot a graph to identify the anomalies corresponding to n-deviations away from the mean and their respective confidence level.
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import erf
# calculate mean and standard deviation
mean_mse = np.mean(mse)
std_mse = np.std(mse)
# define thresholds based on number of standard deviations away from the mean
threshold_2std = mean_mse + 2 * std_mse
threshold_3std = mean_mse + 3 * std_mse
threshold_4std = mean_mse + 4 * std_mse
threshold_5std = mean_mse + 5 * std_mse
confidences = [100*erf(i/np.sqrt(2)) for i in range(2, 6)]
# create a scatter plot of the reconstruction error vs sample index
plt.scatter(range(len(mse)), mse)
# highlight the anomalies
plt.scatter(anomalies, mse[anomalies], color='red')
# add threshold lines
plt.axhline(y=threshold_2std, color='green', linestyle='--', label='2 std')
plt.text(0.02, threshold_2std + 0.2, f"2σ ({confidences[0]:.2f}% confidence)")
plt.axhline(y=threshold_3std, color='orange', linestyle='--', label='3 std')
plt.text(0.02, threshold_3std + 0.2, f"3σ ({confidences[1]:.2f}% confidence)")
plt.axhline(y=threshold_4std, color='blue', linestyle='--', label='4 std')
plt.text(0.02, threshold_4std + 0.2, f"4σ ({confidences[2]:.2f}% confidence)")
plt.axhline(y=threshold_5std, color='purple', linestyle='--', label='5 std')
plt.text(0.02, threshold_5std + 0.2, f"5σ ({confidences[3]:.2f}% confidence)")
# add labels and title
plt.xlabel('Record Index')
plt.ylabel('Reconstruction Error (MSE)')
plt.title('Anomaly Detection Results')
# display the plot
plt.show()
Step 6: Test Case Selection
Finally, based on the results of the K-Means clustering and Anomaly Detection, the test cases can be selected for testing the complex domain in the digital transformation project. A certain percentage of test cases can be selected from each cluster to ensure that all possible scenarios are covered. Additionally, the test cases flagged as anomalies can be given higher priority for testing.
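One possible way to turn this selection rule into code is sketched below. The 10% sampling fraction, the priority labels, and the toy DataFrame are all assumptions made for illustration; in practice the cluster column and anomaly positions would come from the earlier steps.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Hypothetical clustered records; in the article these come from data['cluster'].
records = pd.DataFrame({
    "record_id": np.arange(200),
    "cluster": rng.integers(0, 4, size=200),
})
# Hypothetical row positions flagged by the anomaly detection step.
anomaly_positions = [5, 42, 117]

# Take 10% of each cluster (at least one record) as representative test cases.
sample_fraction = 0.10
sampled_parts = []
for _, group in records.groupby("cluster"):
    n_take = max(1, int(round(len(group) * sample_fraction)))
    sampled_parts.append(group.sample(n=n_take, random_state=0))
sampled = pd.concat(sampled_parts).assign(priority="normal")

# Anomalies go first with high priority; drop any duplicates that were also sampled.
anomalies = records.iloc[anomaly_positions].assign(priority="high")
test_cases = (
    pd.concat([anomalies, sampled])
    .drop_duplicates(subset="record_id", keep="first")
    .reset_index(drop=True)
)
print(test_cases["priority"].value_counts())
```

Sampling at least one record per cluster keeps small clusters represented, which matters when rare scenarios are the ones most likely to hide defects.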
References
- A study conducted by researchers at IBM reported a 30% reduction in defect density by using clustering to identify areas of high risk in software development projects (source: IBM Research Report "Reducing Development Time and Improving Quality Through Partitioning and Prioritizing Test Suites").
- A case study by a software development company reported a 50% reduction in defect density by using anomaly detection to identify and address issues in a web application (source: Altar.io Case Study "Anomaly Detection in Web Applications").
- A research paper published in the Journal of Systems and Software reported a 25% reduction in defect density by using clustering to identify and address issues in software development projects (source: Journal of Systems and Software, Volume 82, Issue 11, November 2009, Pages 1783-1795).
Conclusion
Unfortunately, there is no clear-cut solution to testing complex legacy systems that are being modernized for the cloud. It is a challenging and time-consuming task that requires significant planning and coordination between developers and testers. However, leveraging modern technologies such as machine learning algorithms like K-Means and Anomaly Detection can help identify test cases that cover the most critical scenarios while reducing the testing effort. It is also essential to have clear communication and documentation of the system's functionality, dependencies, and architecture to ensure that all aspects are covered in the testing process. Ultimately, testing legacy system modernization to the cloud requires a holistic approach that considers the system's complexity and the unique challenges associated with it.