Towards Outlier Detection For Scattered Data and Mixed Attribute Data
Abstract
Detecting outliers that are grossly different from or
inconsistent with the rest of the dataset is a major challenge in
real-world knowledge discovery and data mining (KDD)
applications.
The research work in this thesis starts with a critical review of
the latest and most popular methodologies in the outlier
detection area. Based on a series of performance evaluations of
these algorithms, two major issues in outlier detection, namely
the scattered data problem and the mixed attribute problem, are
identified and then addressed by the novel approaches proposed in
this thesis.
Based on our review and evaluation, we found that existing
outlier detection methods are ineffective for many real-world
scattered datasets, due to the implicit data patterns within
these sparse datasets. In order to address this issue, we
define a novel Local Distance-based Outlier Factor (LDOF) to
measure the outlierness of objects in scattered datasets. LDOF
uses the location of an object relative to its neighbours to
determine the degree to which the object deviates from its
neighbourhood. The characteristics of LDOF are theoretically
analysed, including LDOF's lower bound, false-detection
probabilities, as well as its parameter range tolerance. In order
to facilitate parameter settings in real-world applications, we
employ a top-n technique in the proposed outlier detection
approach, where only the objects with the highest LDOF values are
regarded as outliers. Compared to conventional approaches (such
as top-n KNN and top-n LOF), our method, top-n LDOF, proved more
effective at detecting outliers in scattered data. Parameter
setting for LDOF is also more practical for real-world
applications, since its performance is relatively stable over a
large range of parameter values, as illustrated by experimental
results on both real-world and synthetic datasets.
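The abstract does not state the LDOF formula. As a rough illustration only, the sketch below assumes the definition commonly given for LDOF in the published literature: the average distance from an object to its k nearest neighbours, divided by the average pairwise distance among those neighbours. The function names `ldof_scores` and `top_n_ldof`, their parameters, and the handling of the degenerate case are hypothetical choices made for this example, not the thesis implementation.

```python
import numpy as np


def ldof_scores(X, k):
    """Return an LDOF-style score per row of X (assumed definition, see lead-in)."""
    X = np.asarray(X, dtype=float)
    n_points = X.shape[0]
    if not 2 <= k < n_points:
        raise ValueError("k must satisfy 2 <= k < number of points")

    # Full pairwise Euclidean distance matrix (fine for small datasets).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    scores = np.empty(n_points)
    for p in range(n_points):
        order = np.argsort(dist[p])
        nbrs = order[order != p][:k]          # k nearest neighbours, excluding p

        # Average distance from p to its k nearest neighbours.
        d_knn = dist[p, nbrs].mean()

        # Average pairwise distance among the neighbours themselves
        # (the diagonal is zero, so divide by the k*(k-1) ordered pairs).
        d_inner = dist[np.ix_(nbrs, nbrs)].sum() / (k * (k - 1))

        # Coincident neighbours give d_inner == 0; reporting inf here is an
        # arbitrary choice of this sketch.
        scores[p] = d_knn / d_inner if d_inner > 0 else np.inf
    return scores


def top_n_ldof(X, k, n):
    """Indices of the n objects with the highest scores, highest first."""
    scores = ldof_scores(X, k)
    return np.argsort(scores)[::-1][:n]
```

For instance, on a 5-by-5 integer grid with one extra point placed far away at (20, 20), `top_n_ldof(X, k=4, n=1)` selects the far point. Only `k` and `n` need to be chosen; no score threshold is required, which reflects the top-n idea described above.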
Secondly, for the mixed attribute problem, traditional outlier
detection methods often fail to identify outliers effectively,
because they lack mechanisms to consider the interactions among
the various types of attributes that may exist in real-world
datasets. To address this issue in mixed attribute
datasets, we propose a novel Pattern based Outlier Detection
approach (POD). A pattern in this thesis is defined as a
mathematical representation that describes the majority of the
observations in datasets and captures the interactions among
different types of attributes. POD is designed so that the more
an object deviates from these patterns, the higher its outlier
factor. We use logistic regression to learn
patterns and then formulate the outlier factor in mixed attribute
datasets. For datasets in which outliers are randomly distributed
among normal data objects, distance-based methods such as LOF and
KNN are not effective. In contrast, because the outlierness
definition proposed in POD integrates numeric and categorical
attributes into a unified definition, the numeric attributes do
not represent the final outlierness directly but contribute their
anomaly through the categorical
attributes. Therefore, POD can offer a considerable performance
improvement over those traditional methods. A series of
experiments shows that the performance improvement of POD is
statistically significant compared to several classic outlier
detection methods. However, POD sometimes shows lower detection
precision on some mixed attribute datasets, because it makes the
strong assumption that the observed mixed attribute dataset is
linearly separable in any subspace. This limitation stems from
the linear classifier, logistic regression, used in the POD
algorithm.