Automatic Discovery of Abnormal Values in Large Textual Databases

Christen, Peter; Gayler, Ross W.; Tran, Khoi-Nguyen; Fisher, Jeffrey; Vatsalan, Dinusha

doi:10.1145/2889311

Description

Textual databases are ubiquitous in many application domains. Examples of textual data range from names and addresses of customers to social media posts and bibliographic records. With online services, individuals are increasingly required to enter their personal details, for example when purchasing products online or registering for government services, while many social network and e-commerce sites allow users to post short comments. Many online sites leave open the possibility for people to enter unintended or malicious abnormal values, such as names with errors, bogus values, profane comments, or random character sequences. In other applications, such as online bibliographic databases or comparative online shopping sites, databases are increasingly populated in (semi-)automatic ways through Web crawls. This practice can result in low-quality data being added automatically into a database. In this article, we develop three techniques to automatically discover abnormal (unexpected or unusual) values in large textual databases. Following recent work in categorical outlier detection, our assumption is that "normal" values are those that occur frequently in a database, while an individual abnormal value is rare. Our techniques are unsupervised and address the challenge of discovering abnormal values as an outlier detection problem. Our first technique is a basic but efficient technique based on q-gram sets, the second is based on a probabilistic language model, and the third employs morphological word features to train a one-class support vector machine classifier. Our aim is to investigate and develop techniques that are fast, efficient, and automatic. The output of our techniques can help in the development of rule-based data cleaning and information extraction systems, or be used as training data for further supervised data cleaning procedures.
We evaluate our techniques on four large real-world datasets from different domains: two US voter registration databases containing personal details, the 2013 KDD Cup dataset of bibliographic records, and the SNAP Memetracker dataset of phrases from social networking sites. Our results show that our techniques can efficiently and automatically discover abnormal textual values, allowing an organization to conduct efficient data exploration and improve the quality of its textual databases without requiring explicit training data.
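
The abstract's working assumption (frequent values are "normal", rare ones are candidates for abnormality) can be illustrated with a small q-gram sketch. This is not the authors' algorithm, whose details are not given on this page: the function names `qgrams` and `abnormality_scores`, the padding characters, the choice of q = 2, and the scoring formula are all assumptions made purely for illustration.

```python
from collections import Counter


def qgrams(value, q=2):
    """Return the set of q-grams of a string (hypothetical helper).

    The string is lower-cased and padded with '^' and '$' so that
    unusual beginnings and endings also produce distinctive q-grams.
    """
    padded = "^" * (q - 1) + value.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def abnormality_scores(values, q=2):
    """Score each distinct value by how rare its q-grams are.

    Assumed scoring rule (illustrative only): the mean, over the value's
    q-grams, of 1 minus the fraction of database values containing that
    q-gram. Scores near 1 mean the value is built from rare q-grams.
    """
    # Count, for every q-gram, how many values in the database contain it.
    doc_freq = Counter()
    for v in values:
        doc_freq.update(qgrams(v, q))

    n = len(values)
    scores = {}
    for v in set(values):
        grams = qgrams(v, q)
        if not grams:  # e.g. an empty string with q = 1
            scores[v] = 0.0
            continue
        scores[v] = sum(1 - doc_freq[g] / n for g in grams) / len(grams)
    return scores


if __name__ == "__main__":
    surnames = ["smith", "smith", "smyth", "smithe", "smit", "xqzkv"]
    ranked = sorted(abnormality_scores(surnames).items(),
                    key=lambda item: item[1], reverse=True)
    for value, score in ranked:
        print(f"{score:.3f}  {value}")
```

In this toy list the random character sequence receives the highest score because none of its q-grams occur in any other value. The scoring is unsupervised, in keeping with the abstract, but a real deployment would need a threshold or ranking cut-off chosen for the data at hand.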

dc.contributor.author	Christen, Peter
dc.contributor.author	Gayler, Ross W.
dc.contributor.author	Tran, Khoi-Nguyen
dc.contributor.author	Fisher, Jeffrey
dc.contributor.author	Vatsalan, Dinusha
dc.date.accessioned	2016-09-12T05:18:16Z
dc.date.available	2016-09-12T05:18:16Z
dc.identifier.issn	1936-1955
dc.identifier.uri	http://hdl.handle.net/1885/108730
dc.description.sponsorship	This work is funded by the Australian Research Council (ARC), Veda, and Funnelback Pty. Ltd., under Linkage Project LP100200079.
dc.publisher	Association for Computing Machinery
dc.rights	© 2016 ACM. http://www.sherpa.ac.uk/romeo/issn/1936-1955/..."Author's pre-prints can be deposited on public repositories as long as accompanied by ACM copyright notice upon transfer of copyright" from SHERPA/RoMEO site (as at 13/09/16).
dc.source	Journal of Data and Information Quality
dc.subject	Data quality
dc.subject	One-class classifier
dc.subject	Out-of-vocabulary
dc.subject	Outlier detection
dc.subject	Probabilistic language model
dc.subject	String databases
dc.subject	Support vector machine
dc.subject	Word features
dc.title	Automatic Discovery of Abnormal Values in Large Textual Databases
dc.type	Journal article
local.identifier.citationvolume	7
dc.date.issued	2016
local.publisher.url	http://www.acm.org/
local.type.status	Submitted Version
local.contributor.affiliation	Christen, P., College of Engineering & Computer Science, The Australian National University
local.contributor.affiliation	Tran, K-N, The Australian National University
local.contributor.affiliation	Fisher, J., The Australian National University
local.contributor.affiliation	Vatsalan, D., The Australian National University
dc.relation	http://purl.org/au-research/grants/arc/LP100200079
local.bibliographicCitation.issue	1-2
local.bibliographicCitation.startpage	1
local.bibliographicCitation.lastpage	31
local.identifier.doi	10.1145/2889311
dcterms.accessRights	Open Access
Collections	ANU Research Publications

Download

File	Description	Size	Format	Image
01_Christen_Automatic_Discovery_2016.pdf		560.57 kB	Adobe PDF
