dynamic and time-sensitive information. 4) High-quality annotated reasoning and fine-tuning data: used for model fine-tuning and reasoning optimization.
The training of the AstroOne model employs a two-stage methodology.
The first stage is Continual Pre-training (CPT), aimed at integrating diversified astronomical
corpora to establish a specialized knowledge foundation. Data for this stage is sourced from
the public RedPajama [4] astronomical subset (2.61 million records, ~4.31 GB), the astronomical subset of Dolma [5], made available under the Open Data Commons Attribution License (1.73 million records, ~2.2 GB), and internally curated datasets of openly accessible, properly licensed astronomical papers (32,000 records, ~4.9 GB).
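For illustration, the sketch below shows how a CPT mixture of this composition could be assembled with the Hugging Face datasets library. The file paths and the size-proportional interleaving are assumptions made for the example, not the pipeline actually used for AstroOne.

```python
# Sketch: assembling a CPT mixture of the composition described above.
# Paths are illustrative; the filtered astronomy subsets of RedPajama and
# Dolma are assumed to have been prepared locally as JSONL files.
from datasets import load_dataset, interleave_datasets

redpajama_astro = load_dataset("json", data_files="redpajama_astro.jsonl", split="train")      # ~2.61M records
dolma_astro = load_dataset("json", data_files="dolma_astro.jsonl", split="train")              # ~1.73M records
curated_papers = load_dataset("json", data_files="curated_astro_papers.jsonl", split="train")  # ~32k records

sources = [redpajama_astro, dolma_astro, curated_papers]
total = sum(len(d) for d in sources)
probs = [len(d) / total for d in sources]

# Interleave the sources in proportion to their size to form the CPT corpus.
cpt_corpus = interleave_datasets(sources, probabilities=probs, seed=42)
```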
The second stage is Supervised Fine-tuning (SFT), designed to enhance domain-specific
instruction comprehension and execution capabilities by integrating expert-annotated
astronomical question-answer pairs and inference-focused data. SFT data originates from
public datasets, including curated subsets of open-community resources such as ShareGPT_Vicuna_unfiltered [6] and MetaMathQA [7] (covering diverse instructional tasks and yielding 3.50 million general instruction records) and multiple permissively licensed datasets from the Hugging Face open-source platform (yielding 360,000 general reasoning records), supplemented by internally curated datasets derived from properly licensed arXiv papers published prior to March 12, 2025 (yielding 24,000 astronomical reasoning records) [5].
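A comparable sketch for the SFT stage, again with illustrative file paths and field names, shows how the three source pools could be merged and rendered with the base model's chat template before a standard supervised fine-tuning run; it is not the released AstroOne pipeline.

```python
# Sketch: building the SFT mixture from the three source pools listed above
# and rendering each QA pair with the base model's chat template.
from datasets import load_dataset, concatenate_datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B-Instruct")

general_instr = load_dataset("json", data_files="general_instructions.jsonl", split="train")  # ~3.50M records
general_reason = load_dataset("json", data_files="general_reasoning.jsonl", split="train")    # ~360k records
astro_reason = load_dataset("json", data_files="astro_reasoning.jsonl", split="train")        # ~24k records

sft_data = concatenate_datasets([general_instr, general_reason, astro_reason]).shuffle(seed=42)

def to_text(example):
    # Assumes each record carries "question"/"answer" fields (field names are illustrative).
    messages = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

sft_data = sft_data.map(to_text)
# The rendered "text" column can then be fed to an ordinary supervised
# fine-tuning loop (e.g., trl's SFTTrainer) with the standard causal LM loss.
```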
Given the emphasis on fairness and bias mitigation, we have taken several targeted measures
to ensure balanced dataset coverage. The current version primarily includes English and
Chinese data, with plans to incorporate German, Spanish, French, and Portuguese in the next
iteration. Specifically, ahead of the IAUS (International Astronomical Union Symposium) 397
conference in Greece, we evaluated AstroOne’s performance on Greek through a dedicated set
of assessment tasks, achieving results comparable to English. All data is sourced from publicly
available global astronomy platforms, and analysis confirms comprehensive geographic
representation, including contributions from researchers across Asia, North America, Africa,
South America, and Oceania. While the dataset is focused on astronomical sciences, we have
also evaluated the model’s general capabilities, confirming that its performance remains
consistent with the base model (Qwen2.5-72B [8]) and shows no evidence of bias toward or against other scientific fields.
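As an illustration of such a consistency check, the sketch below scores AstroOne and its base model on the same set of general question-answer items. The model identifiers, evaluation file, and the simple containment metric are assumptions for the example, not the evaluation protocol actually used.

```python
# Sketch: comparing a fine-tuned model and its base model on a shared
# general-capability evaluation set (paths and model IDs are illustrative).
import json
from transformers import pipeline

MODELS = {
    "base": "Qwen/Qwen2.5-72B-Instruct",
    "astroone": "path/to/AstroOne",  # hypothetical local checkpoint
}

with open("general_eval_set.jsonl") as f:  # one {"question", "answer"} record per line
    items = [json.loads(line) for line in f]

for name, model_id in MODELS.items():
    generator = pipeline("text-generation", model=model_id, device_map="auto")
    correct = 0
    for item in items:
        out = generator(item["question"], max_new_tokens=64, do_sample=False)[0]["generated_text"]
        correct += item["answer"].lower() in out.lower()  # crude containment-based scoring
    print(f"{name}: {correct / len(items):.2%} accuracy on {len(items)} items")
```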
AstroOne has the following capabilities:
(1) AstroOne was evaluated on two benchmark datasets. From the following charts, we can
see that AstroOne significantly outperforms its base model Qwen2.5 in both astronomy knowledge and reasoning. It even surpasses GPT-4o [9] in astronomy-specific tasks,
[4] Weber, M., Fu, D., Anthony, Q., Oren, Y., Adams, S., Alexandrov, A., ... & Zhang, C. (2024). RedPajama: an open dataset for training large language models. arXiv preprint arXiv:2411.12372.
[5] Soldaini, L., Kinney, R., Bhagia, A., Schwenk, D., Atkinson, D., Authur, R., ... & Hajishirzi, H. (2024). Dolma: Data and tools for generating and inspecting OLMo pre-training data. arXiv preprint arXiv:2402.00159.
[6] https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
[7] https://huggingface.co/datasets/meta-math/MetaMathQA
[8] A 72-billion-parameter instruction-tuned model from Qwen, optimized for instruction following and advanced reasoning, https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
[9] OpenAI's flagship multimodal model with advanced reasoning and broad domain coverage, https://platform.openai.com/docs/models/gpt-4o

