Neural Anomaly Detection Using Keras -- Visual Studio Magazine

Neural Anomaly Detection Using Keras

An advantage of using a neural technique compared to a standard clustering technique is that neural techniques can handle non-numeric data by encoding that data.

By James McCaffrey
03/04/2019

Anomaly detection, also called outlier detection, is the process of finding rare items in a dataset. Examples include finding fraudulent login events and fake news items.

Take a look at the demo program in Figure 1. The demo examines a 1,000-item subset of the well-known MNIST (modified National Institute of Standards and Technology) dataset. Each item is a 28 x 28 grayscale image (784 pixels) of a handwritten digit from "0'" to "9".

**[Click on image for larger view.]** **Figure 1.** Anomaly Detection on the MNIST Dataset

The demo program creates and trains a 784-100-50-100-784 deep neural autoencoder using the Keras library. An autoencoder is a neural network that learns to predict its input. After training, the demo scans through the 1,000 images and finds the one image which is most anomalous, where most anomalous means highest reconstruction error. The most anomalous digit is a "2" that looks like it could be a "9" digit or a script lowercase "a" character.

This article assumes you have intermediate or better programming skill with a C-family language and a basic familiarity with machine learning but doesn't assume you know anything about autoencoders. All the demo code and the data file are presented in this article. They're also available in the download that accompanies this article. All normal error checking has been removed to keep the main ideas as clear as possible.

Installing Keras
Keras is a code library that provides a relatively easy-to-use Python language interface to the relatively difficult-to-use TensorFlow library. Installing Keras involves two main steps. First you install Python and several required auxiliary packages such as NumPy and SciPy. Then you install TensorFlow and Keras as add-on Python packages.

Although it's possible to install Python and the packages required to run Keras separately, it's much better to install a Python distribution, which is a collection containing the base Python interpreter and additional packages that are compatible with each other. For my demo, I installed the Anaconda3 5.2.0 distribution (which contains Python 3.6.5).

After installing Anaconda, I used the pip utility to install TensorFlow 1.11.0, and then I used pip to install Keras 2.2.4. The demo program has no significant Python or Keras dependencies, so any relatively recent versions will work. Note that when you install TensorFlow, you get an embedded version of Keras, but most of my colleagues and I prefer to use separate TensorFlow and Keras packages.

The Demo Program
The structure of demo program, with a few minor edits to save space, is presented in Listing 1. I indent with two spaces rather than the usual four spaces to save space. Note that Python uses the "\" character for line continuation. I used Notepad to edit my program. Most of my colleagues prefer a more sophisticated editor, but I like the enforced simplicity of Notepad.

Listing 1: The Anomaly Detection Demo Program Structure

# autoencoder_anom_mnist.py
# anomaly detection for the MNIST digits

import numpy as np
import keras as K
import matplotlib.pyplot as plt
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'  # suppress CPU msg

def display(raw_data_x, raw_data_y, idx):
  label = raw_data_y[idx]  # like '5'
  print("digit = ", str(label), "\n")
  pixels = np.array(raw_data_x[idx])  # target row of pixels
  pixels = pixels.reshape((28,28))
  plt.rcParams['toolbar'] = 'None' 
  plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
  plt.show()  

# -----------------------------------------------------------

class MyLogger(K.callbacks.Callback):
  def __init__(self, n):
    self.n = n   # print loss every n epochs

  def on_epoch_end(self, epoch, logs={}):
    if epoch % self.n == 0:
      curr_loss =logs.get('loss')
      print("epoch = %4d loss = %0.6f" % (epoch, curr_loss))

def main():
  # 0. get started
  print("Begin MNIST anomaly detect using an autoencoder ")
  np.random.seed(1)  # not used

  # 1. load data into memory
  # 2. define autoencoder model
  # 3. compile model
  # 4. train model
  # 5. save model
  # 6. use model to find anomaly

if __name__=="__main__":
  main()

The demo program starts by importing the NumPy, Keras, OS and Matplotlib packages. Notice you don't need to explicitly import TensorFlow. The OS package is used just to suppress an annoying startup message. The Matplotlib package is used to visually display the most anomalous digit that's found by the model.

Loading Data into Memory
Working with the raw MNIST data is a bit difficult because it's saved in a binary and proprietary format. I wrote a utility program to extract the first 1,000 items from the 60,000 training items. I saved the data as mnist_keras_1000.txt in a Data subdirectory.

The resulting data looks like:

7 ** 0 255 67 . . 123
2 ** 113 28 0 . . 206
. . .

Each line represents one digit. The first value on each line is the digit. The second value is an arbitrary double asterisk sequence just for readability. The next 28 x 28 = 784 values are grayscale pixel values between 0 and 255. All values are separated by a single blank space character. Figure 2 shows the data item at index [1] in the data file, which is a typical "2" digit.

**[Click on image for larger view.]** **Figure 2.** A Typical MNIST Digit

The data is loaded into memory with these statements:

# 1. load data
print("Loading Keras version MNIST data into memory \n")
data_file = ".\\Data\\mnist_keras_1000.txt"
data_x = np.loadtxt(data_file, delimiter=" ",
  usecols=range(2,786), dtype=np.float32)
labels = np.loadtxt(data_file, delimiter=" ",
  usecols=[0], dtype=np.float32)
norm_x = data_x / 255

Notice the digit-label is in column 0 and the 784 pixel values are in columns 2 to 785. After all 1,000 images are loaded into memory, a normalized version of the data is created by dividing each pixel value by 255 so that the normalized pixel values are all between 0.0 and 1.0.

Defining the Autoencoder Model
The demo program creates and compiles a deep autoencoder model with the code shown in Listing 2. The number of input values and output values (784) are determined by the data but the number of hidden layers and the number of nodes in each hidden layer, are free parameters (also called hyperparameters) that must be determined by trial and error.

Similarly, the weight initialization algorithm (Glorot uniform) and the hidden layer activation function (tanh) and the output layer activation function (tanh) are hyperparameters. Because all input and output values are between 0.0 and 1.0 for this problem, logistic sigmoid is a strong alternative for output activation.

The model is compiled using the Adam optimization algorithm which usually gives better results than the simpler stochastic gradient descent algorithm.

Listing 2: Creating and Compiling the Autoencoder Model

# 2. define autoencoder model
print("Creating a 784-100-50-100-784 autoencoder ")
my_init = K.initializers.glorot_uniform(seed=1)
autoenc = K.models.Sequential()
autoenc.add(K.layers.Dense(input_dim=784, units=100, 
  activation='tanh', kernel_initializer=my_init))
autoenc.add(K.layers.Dense(units=50, 
  activation='tanh', kernel_initializer=my_init))
autoenc.add(K.layers.Dense(units=100, 
  activation='tanh', kernel_initializer=my_init)) 
autoenc.add(K.layers.Dense(units=784,
  activation='tanh', kernel_initializer=my_init)) 

# 3. compile model
simple_adam = K.optimizers.adam()  
autoenc.compile(loss='mean_squared_error',
  optimizer=simple_adam)

The model uses mean squared error for the training loss function because an autoencoder is essentially a regression problem rather than a classification problem,

Training and Evaluating the Autoencoder Model
The model is trained with these statements:

# 4. train model
print("Starting training")
max_epochs = 400
my_logger = MyLogger(n=100)
h = autoenc.fit(norm_x, norm_x, batch_size=40, 
  epochs=max_epochs, verbose=0, callbacks=[my_logger])
print("Training complete")

# 5. save model
# TODO

The maximum number of training epochs is a hyperparameter. The program-defined MyLogger object displays the current training loss mean squared error every 100 epochs so you can monitor progress. The batch size, 40, is another hyperparameter. Notice that the fit() function uses the norm_x data for both input and output values. The label data is used only for visualization.

After training, you'll usually want to save the model but that's a bit outside the scope of this article. The Keras documentation has several good examples that show how to save a trained model.

When working with autoencoders, in most situations (including this example) there's no inherent definition of model accuracy. You must determine how close computed output values must be to the associated input values in order to be counted as a correct prediction, and then write a program-defined function to compute your metric.

Using the Autoencoder Model to Find Anomalous Data
After autoencoder model has been trained, the idea is to find data items that are difficult to correctly predict, or equivalently, difficult to reconstruct. The demo code scans through all 1,000 data items and calculates the squared difference between the normalized input values and the computed output values like this:

# 6. find most anomalous data item
N = len(norm_x)
max_se = 0.0; max_ix = 0
predicteds = autoenc.predict(norm_x)
for i in range(N):
  diff = norm_x[i] - predicteds[i]
  curr_se = np.sum(diff * diff)
  if curr_se > max_se:
    max_se = curr_se; max_ix = i

The maximum squared error is calculated and the index of the associated image is saved. An alternative to finding the single item with the largest reconstruction error is to save all squared errors, sort them and return the top-n items where the value of n will depend on the particular problem you're investigating.

After the most anomalous data item has been found, it is shown using the program-defined display() function:

raw_data_x = data_x.astype(np.int)
raw_data_y = labels.astype(np.int)
print("Most anomalous digit is at index ", max_ix)
display(raw_data_x, raw_data_y, max_ix)

The pixel and label values are converted from type float32 to int mostly as a matter of principle because the Matplotlib imshow() function can accept either data type.

Wrapping Up
Anomaly detection using a deep neural autoencoder is not a well-known technique. An advantage of using a neural technique compared to a standard clustering technique is that neural techniques can handle non-numeric data by encoding that data. Most clustering techniques depend on a distance measure which means the source data must be strictly numeric.

A related, but also little-explored, technique for anomaly detection is to create an autoencoder for the dataset under investigation. Then, instead of using reconstruction error to find anomalous data, you can cluster the data because the inner-most hidden layer nodes hold a strictly numeric representation of each data item. After clustering, you can look for clusters that have very few data items, or look for data items within clusters that are most distant from their cluster centroid.

Get Code Download

About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].

Printable Format

comments powered by Disqus

Featured

As Agentic AI Explodes, Microsoft Announces MS365 Copilot Agent Debugging

Microsoft announced agent debugging functionality for Microsoft 365 Copilot directly from the AI tool itself, no Visual Studio 2022 or Visual Studio Code needed.
Creating Business Applications Using Blazor

Expert Blazor programmer Michael Washington' will present an upcoming developer education session on building high-performance business applications using Blazor, focusing on core concepts, integration with .NET, and best practices for development.
GitHub Celebrates Microsoft's 50th by 'Vibe Coding with Copilot'

GitHub chose Microsoft's 50th anniversary to highlight a bevy of Copilot enhancements that further the practice of "vibe coding," where AI does all the drudgery according to human supervision.
AI Coding Assistants Encroach on Copilot's Special GitHub Relationship

Microsoft had a great thing going when it had GitHub Copilot all to itself in Visual Studio and Visual Studio Code thanks to its ownership of GitHub, but that's eroding.
VS Code v1.99 Is All About Copilot Chat AI, Including Agent Mode

Agent Mode provides an autonomous editing experience where Copilot plans and executes tasks to fulfill requests. It determines relevant files, applies code changes, suggests terminal commands, and iterates to resolve issues, all while keeping users in control to review and confirm actions.

Subscribe on YouTube

.NET Insight

Email Address*Country*

Please type the letters/numbers you see above.

Upcoming Training Events

0 AM

VSLive! 4-Day Hands-On Training Seminar: Hands-on with Blazor
May 5-8, 2025

Cybersecurity & Ransomware Live! VirtCon 2025
May 13-15, 2025

VSLive! 4-Hour In-Depth Workshop: Deep Dive into ASP.NET Core Razor Pages
May 29, 2025

VSLive! 3-Day Hands-On Training Seminar: Master Modern JavaScript: Unlock the Full Potential of Your Code
June 2-4, 2025

VSLive! 2-Day Hands-On Training Seminar: Asynchronous and Parallel Programming in C#
June 24-25, 2025

VSLive! 4-Day Hands-On Training Seminar: Immersive .NET Full Stack Training: 4-Day Hands-On Experience
July 15-18, 2025

Visual Studio Live! @ Microsoft HQ
August 4-8, 2025

Visual Studio Live! San Diego
September 8-12, 2025

Live! 360 2-Day Hands-On Seminar: Swimming in the Lakes of Microsoft Fabric and AI – A Hands-on Experience
September 18-19, 2025

VSLive! 2-Day Hands-On Training Seminar: Hands-On with .NET Web Development in 2025
October 7-8, 2025

Live! 360 Orlando
November 16-21, 2025

Artificial Intelligence Live! Orlando
November 16-21, 2025

Cloud & Containers Live! Orlando
November 16-21, 2025

Cybersecurity & Ransomware Live! Orlando
November 16-21, 2025

Data Platform Live! Orlando
November 16-21, 2025

Visual Studio Live! Orlando
November 16-21, 2025

VSLive! 4-Day Hands-On Training Seminar: Immersive .NET Full Stack Training: 4-Day Hands-On Experience
December 16-19, 2025

Free Webcasts

> More Webcasts