The anomaly detection problem is set up here as a binary classification task, where each observation \(x\) is classified as anomalous (class 1
) or non-anomalous (class 0
). We assume that anomalies are much less frequent than normal observations.
The scoring method is as follows. Given an observation \(x\) consisting of features \(\{x_1, ..., x_n\}\), compute the probability density \(p(x)\) of the features. If the \(p(x) < \epsilon\), where \(\epsilon\) is a tuned parameter, the observation is classified as anomalous.
In the univariate case, each of the features \(x_i\) is assumed to be normally distributed with mean \(\mu_i\) and variance \(\sigma_i^2\): \(x_i \sim \mathcal{N}(\mu_i,\sigma_i^2)\).
For a variable with mean \(\mu\) and variance \(\sigma^2\), the probability density is given by:
\[p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
The univariate normal distribution is parametrized by mean \(\mu\) and variance \(\sigma^2\). For a data set \(\{x_1,x_2, ..., x_m\}\), the parameters are estimated using maximum likelihood:
\[\mu = \frac{1}{m}\sum_{i=1}^{m}x_i\] \[\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu)^2\]
For the univariate case, the joint probability density is calculated under the assumptions that features are independent of each other, and each feature is normally distributed. For a set of observations \(\{x_1,x_2, ..., x_m\}\) where each observation \(x_i \in R^n\), the joint probability then is a product of the individual densities:
\[p(x) = \prod_{j=1}^{n}p(x_j;\mu_j,\sigma_j^2)\]
Using the above assumptions and definitions, the algorithm is implemented as follows:
anode
, for final evaluation.step 2
.step 2
and \(\epsilon\) from step 3
.step 2
and \(\epsilon\) from step 3
.It is possible to extend the above algorithm by using the multivariate version of the normal distribution. In this case, the features are not assumed to be independent, and \(p\) is not a product of independent densities.
For parameters \(\mu \in R^n\) (vector of means) and \(\Sigma \in R^{n * n}\) (covariance matrix), the multivariate probability density function is given by:
\[p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}e^{(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu))},\] Where \(|\Sigma|\) is the determinant of \(\Sigma\).
Unlike the univariate version, the multivariate pdf approach automatically captures correlations between features. However, the latter approach is computationally more expensive due to having to compute the inverse of \(\Sigma\). The univariate algorithm works even when the training set size is small. For the multivariate version, \(\Sigma\) is non-invertible if the number of observations is smaller than the number of features or the data contains features that are linearly dependent on others.
Create a small data set and train the algorithm using formula notation and matrix notation:
Machine Learning Course - the package is based on anomaly detection lectures from Andrew Ng’s Machine Learning course.