Detecting vandalism on Wikipedia across multiple languages

dc.contributor.author: Tran, Khoi-Nguyen Dao
dc.date.accessioned: 2015-07-27T01:47:32Z
dc.date.available: 2015-07-27T01:47:32Z
dc.date.issued: 2015
dc.description.abstract: Vandalism, the malicious modification or editing of articles, is a serious problem for free and open access online encyclopedias such as Wikipedia. Over the 13-year lifetime of Wikipedia, editors have identified and repaired vandalism in 1.6% of more than 500 million revisions of over 9 million English articles, but smaller, manually inspected sets of revisions used in research show that vandalism may appear in 7% to 11% of all revisions of English Wikipedia articles. The persistent threat of vandalism has led to the development of automated programs (bots) and editing assistance programs to help editors detect and repair vandalism. Research into improving vandalism detection through the application of machine learning techniques has shown significant improvements in detection rates across a wider variety of vandalism. However, the focus of research is often only on the English Wikipedia, which has led us to develop a novel research area of cross-language vandalism detection (CLVD). CLVD provides a solution to detecting vandalism across several languages through the development of language-independent machine learning models. These models can identify undetected vandalism cases in languages that may have too few identified cases to build learning models. The two main challenges of CLVD are (1) identifying language-independent features of vandalism that are common to multiple languages, and (2) extending vandalism detection models trained in one language to other languages without significant loss in detection rate. In addition, other important challenges of vandalism detection are (3) a high detection rate for a variety of known vandalism types, (4) scalability to the size of Wikipedia in the number of revisions, and (5) the ability to incorporate and generate multiple types of data that characterise vandalism.
In this thesis, we present our research into CLVD on Wikipedia, where we identify gaps and problems in existing vandalism detection techniques. To begin our thesis, we introduce the problem of vandalism on Wikipedia with motivating examples, and then present a review of the literature. From this review, we identify and address the following research gaps. First, we propose techniques for summarising the user activity of articles and comparing the knowledge coverage of articles across languages. Second, we investigate CLVD using the metadata of article revisions together with article views to learn vandalism models and classify incoming revisions. Third, we propose new text features that are more suitable for CLVD than text features from the literature. Fourth, we propose a novel context-aware vandalism detection technique for sneaky types of vandalism that may not be detectable through constructing features. Finally, to show that our techniques for detecting malicious activities are not limited to Wikipedia, we apply our feature sets to detecting malicious attachments and URLs in spam emails. Overall, our ultimate aim is to build the next generation of vandalism detection bots that can learn and detect vandalism from multiple languages and extend their usefulness to other language editions of Wikipedia. [en_AU]
dc.identifier.other: b37327884
dc.identifier.uri: http://hdl.handle.net/1885/14453
dc.language.iso: en [en_AU]
dc.subject: Wikipedia [en_AU]
dc.subject: vandalism [en_AU]
dc.subject: sneaky vandalism [en_AU]
dc.subject: detection [en_AU]
dc.subject: cross-language learning [en_AU]
dc.subject: machine learning [en_AU]
dc.subject: feature engineering [en_AU]
dc.subject: metadata [en_AU]
dc.subject: text [en_AU]
dc.subject: context-aware [en_AU]
dc.subject: bots [en_AU]
dc.subject: users [en_AU]
dc.subject: editors [en_AU]
dc.subject: English [en_AU]
dc.subject: German [en_AU]
dc.subject: Spanish [en_AU]
dc.subject: French [en_AU]
dc.subject: Russian [en_AU]
dc.subject: spam emails [en_AU]
dc.subject: malicious [en_AU]
dc.subject: attachments [en_AU]
dc.subject: URLs [en_AU]
dc.title: Detecting vandalism on Wikipedia across multiple languages [en_AU]
dc.type: Thesis (PhD) [en_AU]
dcterms.valid: 2015 [en_AU]
local.contributor.affiliation: Research School of Computer Science, The Australian National University [en_AU]
local.contributor.authoremail: kndtran@gmail.com [en_AU]
local.contributor.supervisor: Christen, Peter
local.contributor.supervisorcontact: peter.christen@anu.edu.au [en_AU]
local.identifier.doi: 10.25911/5d70eeb78a592
local.mintdoi: mint
local.type.degree: Doctor of Philosophy (PhD) [en_AU]

Downloads

Original bundle

Name: Tran Thesis 2015.pdf
Size: 3.05 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 884 B
Format: Item-specific license agreed upon to submission