Visualisation and Software to Communicate Data Preprocessing Decisions

Authors

Lucchesi, Lydia

Abstract

This thesis is concerned with the communication of data preprocessing. Data preprocessing is a crucial intermediate stage in quantitative data analysis. During this stage, data practitioners decide how to resolve dataset issues and how to transform, clean, and format the dataset(s). It can be a challenging stage, full of decisions that have the potential to influence analytical outcomes. Yet data preprocessing is often treated as behind-the-scenes work and overlooked in research dissemination. This discrepancy between the practice and the presentation of data analytics limits the replication, interpretation, and use of research outputs. This work makes several contributions to advance the communication of data preprocessing decisions. The first contribution is a new operational view of data preprocessing. It demarcates data preprocessing within the broader data pipeline and avoids the need to list the wide variety of tasks that data preprocessing can encompass. The two most central contributions are Smallset Timelines and smallsets. The Smallset Timeline is a static and compact visualisation that documents the sequence of decisions in a preprocessing pipeline; it is composed of small data snapshots of different preprocessing steps. The smallsets software builds a Smallset Timeline from a user's data preprocessing script, which contains structured comments with snapshot instructions. Together, Smallset Timelines and smallsets are designed to support the production of accessible data preprocessing documentation for research dissemination. The final two contributions are four case studies and a focus group study. The former demonstrate the use of smallsets across a range of research problems that rely on diverse data sources (e.g., citizen science data and home loan data). The latter is a formal evaluation of smallsets, which gathered feedback from prospective users on the software's utility and usability, as well as data on their experiences with preprocessing communication more broadly.

Type

Thesis (PhD)

Restricted until

2024-06-04
