Documenting geographically and contextually diverse language data sources

McMillan-Major, Angelina; De Toni, Francesco; Alyafeai, Zaid; Biderman, Stella; Chen, Kimbo; Dupont, Gérard; Elsahar, Hady; Fikri Aji, Alham; Ilić, Suzana

Date

2024

Authors

McMillan-Major, Angelina

De Toni, Francesco

Alyafeai, Zaid

Biderman, Stella

Chen, Kimbo

Dupont, Gérard

Elsahar, Hady

Fikri Aji, Alham

Ilić, Suzana

Abstract

Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLMs). This quantitative approach has resulted in concerns about the rights of data subjects represented in data collections. This concern is exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.

URI

https://hdl.handle.net/1885/733805254

Collections

ANU Research Publications

Source

NEJLT: Northern European Journal of Language Technology

Type

Journal article

Entity type

Publication

DOI

10.3384/nejlt.2000-1533.2024.5217
