Isolation-based Anomaly Detection

Anomalies are data points that are few and different. As a result of these properties, we show that anomalies are susceptible to a mechanism called isolation. This paper proposes a method called Isolation Forest (iForest), which detects anomalies purely based on the concept of isolation, without employing any distance or density measure, and is thus fundamentally different from all existing methods.

As a result, iForest is able to exploit subsampling (i) to achieve a low linear time complexity and a small memory requirement, and (ii) to deal with the effects of swamping and masking effectively. Our empirical evaluation shows that iForest outperforms ORCA, one-class SVM, LOF and Random Forests in terms of AUC and processing time, and that it is robust against masking and swamping effects. iForest also works well in high-dimensional problems containing a large number of irrelevant attributes, and when anomalies are not available in the training sample.

1. INTRODUCTION

Anomalies are data patterns that have different data characteristics from normal instances. The ability to detect anomalies has significant relevance, and anomalies often provide critical and actionable information in various application domains. For example, anomalies in credit card transactions could signify fraudulent use of credit cards. An anomalous spot in an astronomy image could indicate the discovery of a new star. An unusual computer network traffic pattern could signify unauthorised access. These applications demand anomaly detection algorithms with high detection accuracy and fast execution.

Most existing anomaly detection approaches, including classification-based methods, Replicator Neural Networks (RNN), one-class SVM and clustering-based methods, construct a profile of normal instances, then identify anomalies as those that do not conform to the normal profile. Their anomaly detection abilities are usually a 'side-effect' or by-product of an algorithm originally designed for a purpose other than anomaly detection (such as classification or clustering). This leads to two major drawbacks: (i) these approaches are not optimized to detect anomalies; as a consequence, they often under-perform, resulting in too many false alarms (normal instances identified as anomalies) or too few anomalies being detected; (ii) many existing methods are constrained to low-dimensional data and small data sizes because of the legacy of their original algorithms.

This paper proposes a different approach that detects anomalies by isolating instances, without relying on any distance or density measure. To achieve this, our proposed method takes advantage of two quantitative properties of anomalies: (i) they are the minority, consisting of few instances, and (ii) they have attribute-values that are very different from those of normal instances. In other words, anomalies are 'few and different', which makes them more susceptible to a mechanism we call isolation. Isolation can be implemented by any means that separates instances. We opt to use a binary tree structure called an isolation tree (iTree), which can be constructed effectively to isolate instances. Because of this susceptibility to isolation, anomalies are more likely to be isolated closer to the root of an iTree, whereas normal points are more likely to be isolated at the deeper end of an iTree. This forms the basis of our method to detect anomalies. Although this is a very simple mechanism, we show in this paper that it is both effective and efficient in detecting anomalies.

The proposed method, called Isolation Forest (iForest), builds an ensemble of iTrees for a given data set; anomalies are those instances which have short average path lengths on the iTrees. There are two training parameters and one evaluation parameter in this method: the training parameters are the number of trees to build and the subsampling size; the evaluation parameter is the tree height limit during evaluation. We show that iForest's detection accuracy converges quickly with a very small number of trees; that it only requires a small subsampling size to achieve high detection accuracy with high efficiency; and that different height limits can be used to cater for anomaly clusters of different densities.
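As an illustrative sketch of how these two training parameters are set in practice, scikit-learn ships an implementation of this algorithm; the names below (IsolationForest, n_estimators for the number of trees, max_samples for the subsampling size) are scikit-learn's, not this paper's:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(0, 1, size=(500, 2))   # dense cluster of normal points
X_anom = rng.uniform(5, 6, size=(10, 2))     # few, different points far away
X = np.vstack([X_normal, X_anom])

# n_estimators = number of trees; max_samples = subsampling size
clf = IsolationForest(n_estimators=100, max_samples=256, random_state=42)
clf.fit(X)
scores = clf.score_samples(X)  # lower score = shorter average path = more anomalous
```

On this toy data the ten scattered points receive markedly lower scores than the clustered ones, reflecting their shorter average path lengths.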

2. ISOLATION AND ISOLATION TREES

In this paper, the term isolation means 'separating an instance from the rest of the instances'. In general, an isolation-based method measures individual instances' susceptibility to isolation, and anomalies are those that have the highest susceptibility. To realize the idea of isolation, we turn to a data structure that naturally isolates data. In randomly generated binary trees where instances are recursively partitioned, these trees produce noticeably shorter paths for anomalies since (a) in the regions occupied by anomalies, fewer anomalies result in a smaller number of partitions, and hence shorter paths in a tree structure, and (b) instances with distinguishable attribute-values are more likely to be separated early in the partitioning process. Hence, when a forest of random trees collectively produces shorter path lengths for some particular points, those points are highly likely to be anomalies.

Definition: Isolation Tree. Let T be a node of an isolation tree. T is either an external node with no child, or an internal node with one test and exactly two daughter nodes (T_l, T_r). A test at node T consists of an attribute q and a split value p such that the test q < p determines the traversal of a data point to either T_l or T_r.

Let X = {x_1, ..., x_n} be the given data set of a d-variate distribution. A sample of ψ instances X' ⊂ X is used to build an isolation tree (iTree). We recursively divide X' by randomly selecting an attribute q and a split value p, until either: (i) the node has only one instance or (ii) all data at the node have the same values. An iTree is a proper binary tree, where each node in the tree has exactly zero or two daughter nodes. Assuming all instances are distinct, each instance is isolated to an external node when an iTree is fully grown, in which case the number of external nodes is ψ and the number of internal nodes is ψ - 1; the total number of nodes of an iTree is 2ψ - 1; and thus the memory requirement is bounded and only grows linearly with ψ.
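A minimal sketch of this recursive construction (Python is assumed; the function name and the nested-dict node representation are illustrative choices, not from the paper):

```python
import random

def build_itree(X, height, height_limit):
    """Recursively partition the sample X by random attribute/split choices.

    External nodes are reached when the height limit is hit, when only one
    instance remains, or when all remaining values are identical."""
    if height >= height_limit or len(X) <= 1:
        return {"size": len(X)}                  # external node
    q = random.randrange(len(X[0]))              # randomly selected attribute
    values = [x[q] for x in X]
    lo, hi = min(values), max(values)
    if lo == hi:                                 # all data have the same value
        return {"size": len(X)}
    p = random.uniform(lo, hi)                   # randomly selected split value
    return {"q": q, "p": p,                      # internal node with one test
            "left":  build_itree([x for x in X if x[q] < p], height + 1, height_limit),
            "right": build_itree([x for x in X if x[q] >= p], height + 1, height_limit)}
```

With ψ distinct instances and no effective height limit, the fully grown tree has ψ external nodes, matching the node counts above.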

The task of anomaly detection is to provide a ranking that reflects the degree of anomaly. Using iTrees, the way to detect anomalies is to sort data points according to their average path lengths; anomalies are the points that are ranked at the top of the list. We define path length as follows:

Definition: Path Length h(x) of a point x is measured by the number of edges x traverses in an iTree from the root node until the traversal is terminated at an external node.

We employ path length as a measure of the degree of susceptibility to isolation:

  • short path length means high susceptibility to isolation,
  • long path length means low susceptibility to isolation.
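These two cases can be made concrete with a small path-length routine over a hand-built tree (the nested-dict encoding, with q/p tests at internal nodes and size-only external nodes, is our illustrative choice):

```python
def path_length(x, node, depth=0):
    """Count the edges from the root to the external node that x falls into."""
    if "q" not in node:                      # external node: traversal ends
        return depth
    if x[node["q"]] < node["p"]:
        return path_length(x, node["left"], depth + 1)
    return path_length(x, node["right"], depth + 1)

# A toy iTree over a single attribute (index 0):
tree = {"q": 0, "p": 5.0,
        "left": {"size": 1},                 # one point split off immediately
        "right": {"q": 0, "p": 8.0,
                  "left": {"size": 3},
                  "right": {"size": 3}}}

print(path_length([1.0], tree))  # 1 edge: isolated near the root -> anomalous
print(path_length([9.0], tree))  # 2 edges: deeper in the tree -> more normal
```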

3. ISOLATION, DENSITY AND DISTANCE MEASURES

In this paper, we assert that path-length-based isolation is more appropriate for the task of anomaly detection than the basic density and distance measures.

Using basic density measures, the assumption is that 'normal points occur in dense regions, while anomalies occur in sparse regions'. Using basic distance measures, the basic assumption is that 'a normal point is close to its neighbours and an anomaly is far from its neighbours'.

There are violations to these assumptions, e.g., high density and short distance do not always imply normal instances; likewise low density and long distance do not always imply anomalies. When density or distance is measured in a local context, which is often the case, points with high density or short distance could be anomalies in the global context of the entire data set. However, there is no ambiguity in path-length-based isolation and we demonstrate that in the following three paragraphs.

In density-based anomaly detection, anomalies are defined to be data points in regions of low density. Density is commonly measured as (a) the reciprocal of the average distance to the k-nearest neighbours (the inverse distance) and (b) the count of points within a given fixed radius.
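Minimal sketches of these two density estimates (the helper names are ours; Python 3.8+ is assumed for math.dist):

```python
import math

def knn_density(points, x, k):
    """(a) Reciprocal of the average distance to the k nearest neighbours."""
    dists = sorted(math.dist(x, p) for p in points if p != x)
    return k / sum(dists[:k])

def radius_density(points, x, r):
    """(b) Count of points within a fixed radius r of x."""
    return sum(1 for p in points if p != x and math.dist(x, p) <= r)
```

Under either estimate, a scattered anomaly sitting far from the clustered points receives a low density.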

In distance-based anomaly detection, anomalies are defined to be data points which are distant from all other points. Two common ways to define a distance-based anomaly score are (i) the distance to the nearest neighbour and (ii) the average distance to the k-nearest neighbours. One of the weaknesses of these density and distance measures is their inability to handle data sets with regions of different densities. Also, for these methods to detect dense anomaly clusters, k has to be larger than the size of the largest anomaly cluster. This creates a search problem: finding an appropriate k to use. Note that a large k increases the computation substantially.
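The two distance-based scores can be sketched the same way (again, helper names are ours and Python 3.8+ is assumed):

```python
import math

def nn_distance(points, x):
    """(i) Distance to the single nearest neighbour; larger = more anomalous."""
    return min(math.dist(x, p) for p in points if p != x)

def avg_knn_distance(points, x, k):
    """(ii) Average distance to the k nearest neighbours; larger = more anomalous."""
    dists = sorted(math.dist(x, p) for p in points if p != x)
    return sum(dists[:k]) / k
```

Note how a dense anomaly cluster of size m only stands out under score (ii) once k > m, which is exactly the search problem described above.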

On the surface, the function of an isolation measure is similar to a density measure or a distance measure, i.e., isolation ranks scattered outlying points higher than normal points. However, we find that path-length-based isolation behaves differently from a density or distance measure on data with different distributions. Path length is able to address the case of dense anomaly clusters by giving the isolated dense points shorter path lengths. The main reason for this is that path length is grown in an adaptive context, in which the context of each partitioning is different: from the first partition (the root node) in the context of the entire data set, to the last partition (the leaf node) in the context of the local data points. However, density (k-nn) and distance (k-nn) only concern the k nearest neighbours (the local context) and fail to take the context of the entire data set into consideration.

In summary, we have compared three fundamental approaches to detect anomalies: isolation, density and distance. We find that the isolation measure (path length) is able to detect both clustered and scattered anomalies, whereas both distance and density measures can only detect scattered anomalies. While there are many ways to enhance the basic distance and density measures, the isolation measure is better because no further 'adjustment' to the basic measure is required to detect both clustered and scattered anomalies.
