Anomaly Detection in Python

Published Aug 1, 2020

I used this system in two projects:

FTX Move Analyzer, a web app to analyze historical data on the MOVE, a Bitcoin derivative (https://github.com/Wally869/FTX_MoveAnalyzer)
MidiSplitter, an algorithm to split a MIDI file into independent subsegments (https://github.com/Wally869/MidiSplitter)

When analyzing data, it is often useful to be able to detect outliers in a dataset. This can be used as a preprocessing step, to spot potential erroneous inputs, or it can be a part of the analysis itself.

In this article I share a quick and dirty way to perform anomaly detection on a dataset. It is very easy to implement, yet I have found it to be useful and even sufficient in many cases.

Transform your Data to an Appropriate Domain of Definition

Raw data is often unsuited to analysis and needs to be preprocessed.

For financial data, as in my FTX Move Analyzer project, I transformed prices into percentage returns. For MidiSplitter, I first went through the MIDI messages and converted tick time (the time delta in ticks between consecutive messages) to absolute time (basically cumulative time). I then deleted the messages showing a time differential of 0 (which usually means notes belonging to a chord), as well as messages that do not signify the playing of a note. Finally, I transformed the absolute times back to delta times in ticks.
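As a rough illustration of both transformations, here is a minimal sketch using only the standard library. The function names and the `is_note` flags are hypothetical (they are not taken from either project), and the MIDI part is one possible reading of the steps described above: it keeps note messages only and treats notes sharing the same absolute time as a single chord onset.

```python
from itertools import accumulate


def percentage_returns(prices):
    """Percentage change between each pair of consecutive prices."""
    return [(curr - prev) / prev * 100.0 for prev, curr in zip(prices, prices[1:])]


def note_onset_deltas(tick_deltas, is_note):
    """Delta ticks -> absolute ticks -> filtered onsets -> delta ticks.

    tick_deltas: delta time in ticks carried by each message.
    is_note: hypothetical flags, True when the message plays a note.
    """
    # Cumulative sum turns per-message deltas into absolute times.
    absolute = list(accumulate(tick_deltas))

    onsets = []
    for time, note in zip(absolute, is_note):
        # Skip non-note messages, and notes that share the time of the
        # previously kept note (assumed to belong to the same chord).
        if note and (not onsets or time != onsets[-1]):
            onsets.append(time)

    # Back to delta times between the remaining onsets.
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]
```

Going through absolute time matters here: the ticks carried by a deleted message are not lost, they end up in the delta of the next message that is kept.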

Perform the Anomaly Detection

Once your data has been preprocessed appropriately, you can perform the analysis.

My method is simple:

Compute the Median of the values
Multiply the Median by a given factor (usually between 3 and 4)
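The two steps above can be sketched in a few lines. The function names and the default factor of 3.5 are illustrative choices within the 3 to 4 range mentioned above. The sketch assumes non-negative values such as time deltas; for signed data such as returns, you would probably want to work on absolute values first, which is an assumption on my part.

```python
from statistics import median


def median_threshold(values, factor=3.5):
    """Median of the values, multiplied by the chosen factor."""
    return factor * median(values)


def flag_anomalies(values, factor=3.5):
    """True for each value that exceeds the multiplied median."""
    threshold = median_threshold(values, factor)
    return [value > threshold for value in values]
```

For example, with `values = [1, 2, 2, 3, 2, 10]` the median is 2, the threshold is 7, and only the last value is flagged.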

That’s it! You can also apply this method over a rolling window if you wish.
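A rolling variant could look like the sketch below. The window size, the choice of a trailing window that includes the current value, and the decision to flag nothing until the window is full are all assumptions made for illustration.

```python
from statistics import median


def rolling_flag_anomalies(values, window=20, factor=3.5):
    """Flag values exceeding factor * median of the trailing window."""
    flags = []
    for i, value in enumerate(values):
        if i + 1 < window:
            # Not enough history yet: do not flag anything.
            flags.append(False)
            continue
        # Trailing window of `window` values ending at the current one.
        recent = values[i + 1 - window : i + 1]
        flags.append(value > factor * median(recent))
    return flags
```

Compared to a single global median, this lets the threshold follow slow changes in the typical level of the data.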

Values above the multiplied median can be considered anomalous, or as signaling a switch in regime. In the case of MidiSplitter, I used that value as the threshold and created a new segment whenever the time delta between two messages exceeded it.
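To make the splitting idea concrete, here is a minimal sketch, not the actual MidiSplitter code. It assumes `deltas[i]` is the time delta between event `i` and event `i + 1`, and returns the event indices grouped by segment; `split_on_gaps` is a hypothetical name.

```python
from statistics import median


def split_on_gaps(deltas, factor=3.5):
    """Group event indices into segments, splitting on unusually long gaps."""
    threshold = factor * median(deltas)

    # Event 0 always opens the first segment.
    segments = [[0]]
    for index, delta in enumerate(deltas, start=1):
        if delta > threshold:
            # Gap before this event is anomalous: start a new segment.
            segments.append([index])
        else:
            segments[-1].append(index)
    return segments
```

With `deltas = [100, 120, 110, 900, 100, 105]`, the median is 107.5 and the threshold is 376.25, so the gap of 900 separates events 0 to 3 from events 4 to 6.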

I encourage you to try it out with my MidiSplitter algorithm: the auditory and visual samples are a good demonstration of how effective this method is.

#statistics #algorithm #anomaly detection #big data #analysis #python #learn #timeseries #finance #preprocessing data #numpy #pandas