Detecting vandalism on Wikipedia across multiple languages
Date
2015
Authors
Tran, Khoi-Nguyen Dao
Abstract
Vandalism, the malicious modification or editing of articles, is a serious problem
for free and open access online encyclopedias such as Wikipedia. Over the 13-year
lifetime of Wikipedia, editors have identified and repaired vandalism in 1.6% of more
than 500 million revisions of over 9 million English articles, but smaller, manually
inspected research samples suggest that vandalism may appear in 7% to 11% of
all revisions of English Wikipedia articles. The persistent threat of vandalism has led
to the development of automated programs (bots) and editing assistance programs
to help editors detect and repair vandalism. Research applying machine learning
techniques to vandalism detection has shown significant improvements in
detection rates across a wider variety of vandalism. However, the focus
of research is often only on the English Wikipedia, which has led us to develop a
novel research area of cross-language vandalism detection (CLVD).
CLVD provides a solution to detecting vandalism across several languages through
the development of language-independent machine learning models. These models
can identify undetected vandalism cases across languages that may have insufficient
identified cases to build learning models. The two main challenges of CLVD are (1)
identifying language-independent features of vandalism that are common to multiple
languages, and (2) extensibility of vandalism detection models trained in one
language to other languages without significant loss in detection rate. In addition,
other important challenges of vandalism detection are (3) high detection rate of a variety
of known vandalism types, (4) scalability to the size of Wikipedia in the number
of revisions, and (5) ability to incorporate and generate multiple types of data that
characterise vandalism.
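As a rough illustration of challenges (1) and (2), the sketch below trains a classifier on revisions from one language edition and applies it unchanged to another. It is a minimal sketch under stated assumptions, not the method developed in the thesis: the `Revision` fields, the feature scaling, the nearest-centroid classifier, and the toy data are all hypothetical and chosen only because none of them depend on the language of the article text.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Revision:
    # Hypothetical metadata fields; none depend on the article's language.
    anonymous: bool        # edit made without a registered account
    size_delta: int        # change in article size, in bytes
    comment_length: int    # length of the edit summary, in characters


def features(rev):
    """Map a revision to a language-independent feature vector in [0, 1]."""
    return [
        1.0 if rev.anonymous else 0.0,
        min(abs(rev.size_delta), 5000) / 5000.0,
        min(rev.comment_length, 200) / 200.0,
    ]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(column) for column in zip(*vectors)]


def train(labelled):
    """Return (vandalism centroid, regular centroid) from (Revision, bool) pairs."""
    vandal = [features(r) for r, is_vandalism in labelled if is_vandalism]
    regular = [features(r) for r, is_vandalism in labelled if not is_vandalism]
    return centroid(vandal), centroid(regular)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(model, rev):
    """Flag a revision as vandalism if it lies closer to the vandalism centroid."""
    vandal_centroid, regular_centroid = model
    x = features(rev)
    return squared_distance(x, vandal_centroid) < squared_distance(x, regular_centroid)


# Toy data, invented for illustration only.
english_train = [
    (Revision(anonymous=True, size_delta=-4200, comment_length=0), True),
    (Revision(anonymous=True, size_delta=35, comment_length=0), True),
    (Revision(anonymous=False, size_delta=120, comment_length=42), False),
    (Revision(anonymous=False, size_delta=900, comment_length=80), False),
]
german_test = [
    Revision(anonymous=True, size_delta=-3800, comment_length=0),
    Revision(anonymous=False, size_delta=250, comment_length=60),
]

# Train on one language edition, classify revisions from another.
model = train(english_train)
predictions = [predict(model, rev) for rev in german_test]
```

Because the feature vector never inspects the article text, the same `model` can be applied to any language edition; whether detection rates survive that transfer is the question challenge (2) poses.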
In this thesis, we present our research into CLVD on Wikipedia, where we identify
gaps and problems in existing vandalism detection techniques. To begin our thesis,
we introduce the problem of vandalism on Wikipedia with motivating examples, and
then present a review of the literature. From this review, we identify and address the
following research gaps. First, we propose techniques for summarising the user activity
of articles and comparing the knowledge coverage of articles across languages.
Second, we investigate CLVD using the metadata of article revisions together with
article views to learn vandalism models and classify incoming revisions. Third, we
propose new text features that are more suitable for CLVD than text features from
the literature. Fourth, we propose a novel context-aware vandalism detection technique
for sneaky types of vandalism that may not be detectable through constructing
features. Finally, to show that our techniques of detecting malicious activities are not
limited to Wikipedia, we apply our feature sets to detecting malicious attachments
and URLs in spam emails. Overall, our ultimate aim is to build the next generation
of vandalism detection bots that can learn and detect vandalism from multiple
languages and extend their usefulness to other language editions of Wikipedia.
Keywords
Wikipedia, vandalism, sneaky vandalism, detection, cross-language learning, machine learning, feature engineering, metadata, text, context-aware, bots, users, editors, English, German, Spanish, French, Russian, spam emails, malicious, attachments, URLs
Type
Thesis (PhD)