
Levenshtein distances fail to identify language relationships accurately

Greenhill, Simon

Description

The Levenshtein distance is a simple distance metric derived from the number of edit operations needed to transform one string into another. This metric has received recent attention as a means of automatically classifying languages into genealogical subgroups. In this article I test the performance of the Levenshtein distance for classifying languages by subsampling three language subsets from a large database of Austronesian languages.
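To make the metric described above concrete, here is a minimal Python sketch of the textbook dynamic-programming computation of the Levenshtein distance. It assumes unit cost for each insertion, deletion, and substitution; the function name `levenshtein` and the example strings are illustrative only and are not taken from the article, which may apply additional steps not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`.

    Assumption: every edit operation costs 1 (the textbook definition).
    """
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the first i-1 characters of `a`
    # and the first j characters of `b`; row 0 is 0, 1, ..., len(b).
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        # Turning the first i characters of `a` into "" takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current

    return previous[-1]


# Illustrative strings only, not data from the article.
print(levenshtein("kitten", "sitting"))  # 3
```

The sketch compares raw character strings, so two spellings of the same sound count as different characters.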

dc.contributor.author: Greenhill, Simon
dc.date.accessioned: 2015-12-13T22:42:55Z
dc.identifier.issn: 0891-2017
dc.identifier.uri: http://hdl.handle.net/1885/78964
dc.description.abstract: The Levenshtein distance is a simple distance metric derived from the number of edit operations needed to transform one string into another. This metric has received recent attention as a means of automatically classifying languages into genealogical subgroups. In this article I test the performance of the Levenshtein distance for classifying languages by subsampling three language subsets from a large database of Austronesian languages. Comparing the classification proposed by the Levenshtein distance to that of the comparative method shows that the Levenshtein classification is correct only 40% of the time. Standardizing the orthography increases the performance, but only to a maximum of 65% accuracy within language subgroups. The accuracy of the Levenshtein classification decreases rapidly with phylogenetic distance, failing to discriminate homology and chance similarity across distantly related languages. This poor performance suggests the need for more linguistically nuanced methods for automated language classification tasks.
dc.publisher: MIT Press
dc.rights: Author/s retain copyright
dc.source: Computational Linguistics
dc.title: Levenshtein distances fail to identify language relationships accurately
dc.type: Journal article
local.description.notes: Imported from ARIES
local.identifier.citationvolume: 37
dc.date.issued: 2011
local.identifier.absfor: 200402 - Computational Linguistics
local.identifier.ariespublication: f5625xPUB7508
local.type.status: Published Version
local.contributor.affiliation: Greenhill, Simon, College of Asia and the Pacific, ANU
local.bibliographicCitation.issue: 4
local.bibliographicCitation.startpage: 689
local.bibliographicCitation.lastpage: 698
dc.date.updated: 2015-12-11T10:09:37Z
local.identifier.scopusID: 2-s2.0-82255182443
local.identifier.thomsonID: 000298118200003
dcterms.accessRights: Open Access
Collections: ANU Research Publications

Download

File: 01_Greenhill_Levenshtein_distances_fail_to_2011.pdf
Size: 131.38 kB
Format: Adobe PDF


Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.

Updated: 19 May 2020 / Responsible Officer: University Librarian / Page Contact: Library Systems & Web Coordinator