dynamic and time-sensitive information. 4) High-quality annotated reasoning and fine-tuning data, used for model fine-tuning and reasoning optimization.

                      The training of the AstroOne model employs a two-stage methodology.

The first stage is Continual Pre-training (CPT), aimed at integrating diversified astronomical corpora to establish a specialized knowledge foundation. Data for this stage is sourced from the public RedPajama[4] astronomical subset (2.61 million records, ~4.31 GB), the Dolma[5] astronomical subset, which is made available under the Open Data Commons Attribution License (1.73 million records, ~2.2 GB), and internally curated datasets of openly accessible astronomical papers with proper licenses (32,000 records, ~4.9 GB).
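
As an illustration of the CPT corpus assembly described above, the following Python sketch interleaves the three sources in proportion to their record counts. The file names, JSONL record format, and proportional sampling scheme are assumptions made purely for illustration; they are not taken from the AstroOne implementation.

# Minimal sketch of a CPT data mixture, assuming each source has been
# exported to a local JSONL file whose records carry a "text" field.
# File names and sampling weights are hypothetical.
import json
import random
from pathlib import Path

SOURCES = {
    "redpajama_astro.jsonl": 2_610_000,   # RedPajama astronomical subset
    "dolma_astro.jsonl": 1_730_000,       # Dolma astronomical subset (ODC-By)
    "curated_papers.jsonl": 32_000,       # internally curated open-access papers
}

def iter_records(path):
    """Yield the raw text of each JSONL record in one source file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]

def sample_cpt_mixture(num_records, seed=0):
    """Draw a training mix with sources weighted by their record counts."""
    rng = random.Random(seed)
    names = list(SOURCES)
    weights = [SOURCES[name] for name in names]
    streams = {name: iter_records(name) for name in names}
    for _ in range(num_records):
        name = rng.choices(names, weights=weights, k=1)[0]
        try:
            yield next(streams[name])
        except StopIteration:
            streams[name] = iter_records(name)  # restart an exhausted source
            yield next(streams[name])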

The second stage is Supervised Fine-tuning (SFT), designed to enhance domain-specific instruction comprehension and execution capabilities by integrating expert-annotated astronomical question-answer pairs and inference-focused data. SFT data originates from public datasets, including curated subsets of open-community resources such as ShareGPT_Vicuna_unfiltered[6] and MetaMathQA[7] (covering diverse instructional tasks and yielding 3.50 million general instruction records) and multiple HuggingFace open-source datasets with permissive licenses (yielding 360,000 general reasoning records), supplemented by internally curated datasets derived from arXiv papers with proper licenses published prior to March 12, 2025 (yielding 24,000 astronomical reasoning records).[5]
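
Under similar assumptions, the sketch below shows how the three SFT groups named above could be normalized into a single pool of prompt-response pairs before fine-tuning. The file paths and field names are placeholders, not the actual AstroOne pipeline.

# Illustrative merge of the SFT sources into one prompt/response pool.
# Paths and JSON keys are hypothetical placeholders.
import json
from pathlib import Path

def load_pairs(path, prompt_key, answer_key):
    """Read a JSONL export and normalize each record to a prompt/response pair."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            yield {"prompt": row[prompt_key], "response": row[answer_key]}

def build_sft_pool():
    """Concatenate the three SFT groups described in the text."""
    pool = []
    # ~3.50M general instruction records (ShareGPT_Vicuna_unfiltered, MetaMathQA)
    pool += load_pairs("general_instructions.jsonl", "instruction", "output")
    # ~360k general reasoning records from permissively licensed HuggingFace datasets
    pool += load_pairs("hf_reasoning.jsonl", "question", "answer")
    # ~24k astronomical reasoning records curated from arXiv papers
    pool += load_pairs("astro_arxiv_reasoning.jsonl", "question", "answer")
    return pool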

Given the emphasis on fairness and bias mitigation, we have taken several targeted measures to ensure balanced dataset coverage. The current version primarily includes English and Chinese data, with plans to incorporate German, Spanish, French, and Portuguese in the next iteration. Specifically, ahead of the IAUS (International Astronomical Union Symposium) 397 conference in Greece, we evaluated AstroOne's performance on Greek through a dedicated set of assessment tasks, achieving results comparable to English. All data is sourced from publicly available global astronomy platforms, and analysis confirms comprehensive geographic representation, including contributions from researchers across Asia, North America, Africa, South America, and Oceania. While the dataset is focused on astronomical sciences, we have also evaluated the model's general capabilities, confirming that its performance remains consistent with the base model (Qwen2.5-72B[8]) and shows no evidence of bias toward or against other scientific fields.
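
One simple way to express this consistency check is to score the fine-tuned model and its base model on the same general-purpose benchmark and require the gap to stay within a small tolerance. The sketch below uses placeholder model and benchmark interfaces and is not the evaluation harness actually used for AstroOne.

# Hedged sketch of a regression check against the base model.
# `model` is any callable mapping a question string to an answer string;
# `benchmark` is a list of (question, reference_answer) pairs. Both are
# hypothetical interfaces chosen for illustration.

def accuracy(model, benchmark):
    """Fraction of benchmark questions answered exactly."""
    correct = sum(1 for question, answer in benchmark
                  if model(question).strip() == answer.strip())
    return correct / len(benchmark)

def no_regression(tuned_model, base_model, benchmark, tolerance=0.02):
    """True if the tuned model stays within `tolerance` of the base model."""
    return accuracy(tuned_model, benchmark) >= accuracy(base_model, benchmark) - tolerance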



                      AstroOne has the following capabilities:

(1)  AstroOne was evaluated on two benchmark datasets. From the following charts, we can see that AstroOne significantly outperforms its base model Qwen 2.5 in both astronomy knowledge and reasoning. It even surpasses GPT-4o[9] in astronomy-specific tasks,


[4]  Weber, M., Fu, D., Anthony, Q., Oren, Y., Adams, S., Alexandrov, A., ... & Zhang, C. (2024). RedPajama: an open dataset for training large language models. arXiv preprint, arXiv:2411.12372.
[5]  Soldaini, L., Kinney, R., Bhagia, A., Schwenk, D., Atkinson, D., Authur, R., ... & Hajishirzi, H. (2024). Dolma: Data and tools for generating and inspecting OLMo pre-training data. arXiv preprint, arXiv:2402.00159.
[6]  https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
[7]  https://huggingface.co/datasets/meta-math/MetaMathQA
[8]  A 72-billion-parameter instruction-tuned model from Qwen, optimized for instruction following and advanced reasoning, https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
[9]  OpenAI's flagship multimodal model with advanced reasoning and broad domain coverage, https://platform.openai.com/docs/models/gpt-4o


