AstroMLab 1: Who wins astronomy jeopardy!?

dc.contributor.author: Ting, Y. S.
dc.contributor.author: Nguyen, T. D.
dc.contributor.author: Ghosal, T.
dc.contributor.author: Pan, R.
dc.contributor.author: Arora, H.
dc.contributor.author: Sun, Z.
dc.contributor.author: de Haan, T.
dc.contributor.author: Ramachandra, N.
dc.contributor.author: Wells, A.
dc.contributor.author: Madireddy, S.
dc.contributor.author: Accomazzi, A.
dc.date.accessioned: 2025-05-23T09:25:40Z
dc.date.available: 2025-05-23T09:25:40Z
dc.date.issued: 2025
dc.description.abstract: We present a comprehensive evaluation of proprietary and open-weights large language models using the first astronomy-specific benchmarking dataset. This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics, covering a broad range of astrophysical topics. Our analysis examines model performance across various astronomical subfields and assesses response calibration, which is crucial for potential deployment in research environments. Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy. For proprietary models, we observe that the cost of achieving a similar score on this benchmark has fallen consistently every 3 to 12 months. Open-weights models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models. We identify performance variations across topics, with non-English-focused models generally struggling more with questions on exoplanets, stellar astrophysics, and instrumentation. These challenges likely stem from less abundant training data, limited historical context, and rapid recent developments in these areas. This pattern is observed across both open-weights and proprietary models, with regional dependencies evident, highlighting the impact of training-data diversity on model performance in specialized scientific domains. Top-performing models demonstrate well-calibrated confidence, with correlations above 0.9 between confidence and correctness, though they tend to be slightly underconfident. The development of fast, low-cost inference for open-weights models presents new opportunities for affordable deployment in astronomy. The rapid progress observed suggests that LLM-driven research in astronomy may become feasible in the near future.
dc.description.sponsorship: This research was conducted using resources and services provided by the National Computational Infrastructure (NCI), Australia, which receives support from the Australian Government, and the Oak Ridge Leadership Computing Facility Frontier Nodes, United States, which is a DOE Office of Science User Facility at the Oak Ridge National Laboratory supported by the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. We are also grateful for support from Microsoft's Accelerating Foundation Models Research (AFMR) program, United States, which played a crucial role in enabling this benchmarking work. The work at Argonne National Laboratory was supported by the U.S. Department of Energy, Office of High Energy Physics and Advanced Scientific Computing Research, through the SciDAC-RAPIDS2 institute. Argonne National Laboratory is a U.S. Department of Energy Office of Science Laboratory operated by UChicago Argonne LLC, United States, under contract no. DE-AC02-06CH11357. The views expressed herein do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
dc.description.status: Peer-reviewed
dc.format.extent: 29
dc.identifier.issn: 2213-1337
dc.identifier.scopus: 85210381998
dc.identifier.uri: http://www.scopus.com/inward/record.url?scp=85210381998&partnerID=8YFLogxK
dc.identifier.uri: https://hdl.handle.net/1885/733751962
dc.language.iso: en
dc.rights: © 2024 The Author(s)
dc.source: Astronomy and Computing
dc.subject: Astronomy
dc.subject: Benchmarking
dc.subject: Large Language Models
dc.subject: Question Answering
dc.subject: Scientific Knowledge Assessment
dc.title: AstroMLab 1: Who wins astronomy jeopardy!?
dc.type: Journal article
dspace.entity.type: Publication
local.contributor.affiliation: Ting, Y. S.; School of Computing, ANU College of Systems and Society, The Australian National University
local.contributor.affiliation: Nguyen, T. D.; University of Pennsylvania
local.contributor.affiliation: Ghosal, T.; Oak Ridge National Laboratory
local.contributor.affiliation: Pan, R.; Hong Kong University of Science and Technology
local.contributor.affiliation: Arora, H.; Indian Institute of Technology Patna
local.contributor.affiliation: Sun, Z.; Tsinghua University
local.contributor.affiliation: de Haan, T.; High Energy Accelerator Research Organization, Institute of Particle and Nuclear Physics
local.contributor.affiliation: Ramachandra, N.; Argonne National Laboratory
local.contributor.affiliation: Wells, A.; Argonne National Laboratory
local.contributor.affiliation: Madireddy, S.; Argonne National Laboratory
local.contributor.affiliation: Accomazzi, A.; Harvard-Smithsonian Center for Astrophysics
local.identifier.citationvolume: 51
local.identifier.doi: 10.1016/j.ascom.2024.100893
local.identifier.pure: f881fb9f-ea5e-4091-a995-a0b097ef116b
local.identifier.url: https://www.scopus.com/pages/publications/85210381998
local.type.status: Published