AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets

We explore the potential of enhancing LLM performance in astronomy-focused question-answering through targeted, continual pre-training. By employing a compact 7B-parameter LLaMA-2 model and focusing exclusively on a curated set of astronomy corpora -- comprising abstracts, introductions, and conclusions -- we achieve notable improvements in specialized topic comprehension. While general LLMs like GPT-4 excel in broader question-answering scenarios due to superior reasoning capabilities, our findings suggest that continual pre-training with limited resources can still enhance model performance on specialized topics. Additionally, we present an extension of AstroLLaMA: the fine-tuning of the 7B LLaMA model on a domain-specific conversational dataset, culminating in the release of the chat-enabled AstroLLaMA for community use. Comprehensive quantitative benchmarking is currently in progress and will be detailed in an upcoming full paper. The model, AstroLLaMA-Chat, is now available at https://huggingface.co/universeTBD, providing the first open-source conversational AI tool tailored for the astronomy community.

Large Language Models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks, covering both general and specialized domains, as evidenced by models like GPT and LLaMA (Radford et al. 2019; Brown et al. 2020; Touvron et al. 2023a,b). Despite their impressive achievements, these models face notable challenges in highly specialized fields such as astronomy, particularly in keeping abreast of the latest field developments. This limitation arises from two primary factors: firstly, LLMs' propensity to align with general concepts restricts their capacity for providing detailed, nuanced responses in question-answering scenarios; secondly, infrequent updates to their training datasets result in a delay in assimilating recent astronomical advancements.
Building upon our earlier initiative, AstroLLaMA (Nguyen et al. 2023), the pioneering LLM tailored for astronomy and trained on over 300,000 arXiv paper abstracts using the LLaMA-2-7b model (Touvron et al. 2023b), we identified that while AstroLLaMA excelled in abstract completion, its performance on question-answering tasks remained wanting. To address this, we introduce AstroLLaMA-Chat, an advanced version of AstroLLaMA. This new iteration broadens the training scope to include introductions and conclusions of papers alongside abstracts, as these sections are often rich in pivotal information for question-answering tasks. We began by downloading all papers up to July 2023, including all files accompanying each arXiv submission. The data was then refined for operability, retaining only files with the ".tex" suffix. Through a multi-stage process built on comprehensive regex matching, we extracted the targeted sections. Given the diverse LaTeX formatting standards, approximately 90% of the samples survived this processing. Subsequently, we removed specific formatting patterns, comments, and superfluous symbols such as stray newlines to ensure the readability of the training data.
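As an illustration, the extraction and cleaning steps can be sketched as follows. The exact patterns used in our pipeline are not reproduced here, so the regexes and helper names below are illustrative assumptions rather than the actual implementation:

```python
import re

# Hypothetical sketch of the section-extraction step; the real pipeline's
# regexes are more comprehensive and handle many more LaTeX conventions.
ABSTRACT_RE = re.compile(r"\\begin\{abstract\}(.*?)\\end\{abstract\}", re.DOTALL)
SECTION_RE = re.compile(
    r"\\section\*?\{(Introduction|Conclusions?)\}(.*?)"
    r"(?=\\section|\\end\{document\}|\Z)",
    re.DOTALL | re.IGNORECASE,
)

def clean(text: str) -> str:
    """Strip comments, common formatting commands, and superfluous newlines."""
    text = re.sub(r"(?<!\\)%.*", "", text)                     # LaTeX comments
    text = re.sub(r"\\(label|cite[tp]?)\{[^}]*\}", "", text)   # labels/citations
    return re.sub(r"\s+", " ", text).strip()                   # collapse whitespace

def extract_sections(tex: str) -> dict:
    """Return the abstract, introduction, and conclusion of one .tex file."""
    out = {}
    m = ABSTRACT_RE.search(tex)
    if m:
        out["abstract"] = clean(m.group(1))
    for name, body in SECTION_RE.findall(tex):
        out[name.lower()] = clean(body)
    return out
```

In practice, the roughly 10% of files whose formatting defeats such patterns are simply dropped, which is consistent with the retention rate quoted above.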
Further, we have fine-tuned AstroLLaMA-Chat on a domain-specific dialogue dataset. To generate question-answer pairs, we engaged GPT-4 (OpenAI 2023) to formulate pertinent questions from paragraphs within 300,000 arXiv papers, with GPT-4 also tasked with answering these questions by retrieving context-relevant information. This approach facilitated the extraction and conversational structuring of the dataset's knowledge, laying the groundwork for training a conversational bot. We created 10,356 samples from the abstracts of the aforementioned arXiv papers and integrated additional open-source datasets. The training involved a diverse mix of datasets, including the LIMA dataset (Zhou et al. 2023), 10,000 samples from Open Orca (Lian et al. 2023; Mukherjee et al. 2023; Longpre et al. 2023), and 10,000 samples from UltraChat (Ding et al. 2023).
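A minimal sketch of this two-pass generation follows. The prompt templates and the `ask_llm` callable are illustrative assumptions (in practice `ask_llm` would wrap the GPT-4 chat API); the actual prompts we used are not reproduced here:

```python
# Hypothetical sketch of QA-pair generation from a paper paragraph.
# Pass 1 asks the model for a question; pass 2 asks it to answer that
# question using only the paragraph as context.
QUESTION_PROMPT = (
    "Read the following paragraph from an astronomy paper and write one "
    "pertinent question that it can answer.\n\nParagraph:\n{paragraph}"
)
ANSWER_PROMPT = (
    "Using only the paragraph below as context, answer the question.\n\n"
    "Paragraph:\n{paragraph}\n\nQuestion:\n{question}"
)

def make_qa_pair(paragraph: str, ask_llm) -> dict:
    """ask_llm is any callable mapping a prompt string to a completion
    string, e.g. a thin wrapper around the GPT-4 chat endpoint."""
    question = ask_llm(QUESTION_PROMPT.format(paragraph=paragraph))
    answer = ask_llm(ANSWER_PROMPT.format(paragraph=paragraph, question=question))
    # One single-turn conversation sample for chat fine-tuning.
    return {"conversations": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}
```

Each resulting sample is a single-turn conversation, which is then mixed with the LIMA, Open Orca, and UltraChat samples for the chat fine-tuning stage.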

TRAINING
We executed fine-tuning on the LLaMA-2 models using the LMFlow LLM-training framework (Diao et al. 2023), incorporating advanced techniques such as Flash Attention (Dao et al. 2022; Dao 2023), ZeRO optimization (Rajbhandari et al. 2020), and long-context techniques (Chen et al. 2023). This approach led to a significant efficiency gain: LMFlow reduced the time cost of training AstroLLaMA (Nguyen et al. 2023) by approximately fivefold. When training AstroLLaMA-Chat, despite the expanded dataset, LMFlow required only about twelve A100 GPU days. Here the domain-specific training takes ≥ 90% of the time, and the downstream chat training takes only ∼10 GPU hours. Our hyperparameters for domain-specific training included a peak learning rate of η = 10^-5, a batch size of 32, a maximum token length of 2048, a warmup ratio of 0.03, two gradient accumulation steps, and the use of the fp16 format; the downstream chat training shares a similar set of hyperparameters, except for a learning rate of η = 2 × 10^-5 and a batch size of 64. We opted for a cosine learning rate schedule (Loshchilov & Hutter 2016), as empirical evidence suggested it enhanced model performance for our specific tasks. While we are releasing the 7b conversational model, our team has also trained a more substantial 70b version. The training for this larger model is considerably more resource-intensive, requiring approximately 160 A100 GPU days per epoch. We plan to release this 70b version in an upcoming full paper.
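For concreteness, a linear-warmup plus cosine-decay schedule of the kind described can be sketched as follows; this is an illustrative implementation under the hyperparameters quoted above, not LMFlow's actual scheduler:

```python
import math

# Illustrative linear-warmup + cosine-decay learning-rate schedule.
# Defaults mirror the domain-specific training settings quoted in the text
# (peak lr 1e-5, warmup ratio 0.03); the exact LMFlow scheduler may differ.
def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_ratio=0.03, min_lr=0.0):
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The schedule rises linearly for the first 3% of steps, peaks at η, and then decays smoothly toward zero by the final step.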

DISCUSSION
A question naturally arises in the era of versatile and powerful large language models: is there merit to developing specialized chatbots? Our findings indicate that general-purpose models such as GPT-4 and, to some extent, LLaMA-2 demonstrate robust reasoning and a good general understanding of astronomy. This suggests that, with strategic prompting and engineering, existing large language models can serve as effective tools in this domain.
However, the primary objective of our research is to demonstrate that continual pre-training, even with a relatively modest model such as the 7b AstroLLaMA, can yield competitive and, in certain specific cases, superior performance. Our experiments reveal that while AstroLLaMA-Chat may not consistently outperform GPT-4 and LLaMA-2 in general astronomy-related Q&A, it performs better on highly specialized topics. These include intricate areas such as the dimensionality of elemental abundance space, differential line-by-line spectroscopic analysis, and recent developments in astronomy, such as the Large Magellanic Cloud (LMC) wake in the Milky Way's stellar halo or cosmological parity violation. In these niche areas, AstroLLaMA tends to provide more accurate answers than GPT-4 and LLaMA-2, albeit still with limitations in alignment and a propensity for more hallucination. We aim to address the limitation of multi-turn conversations by enhancing our model in the future, incorporating additional training data and implementing alignment techniques.
In addition to these specialized topics, AstroLLaMA-Chat, akin to what we have shown in AstroLLaMA-1 (Nguyen et al. 2023), shows a marginal edge in completing abstracts in astronomy articles, a feature now extended to introductions and conclusions. By contrast, LLaMA-2 occasionally deviates from its assigned tasks and is prone to errors. GPT-4, while sometimes providing informative responses, often generates overly lengthy answers that may not align well with the conventional format of a journal article in astronomy. For instance, given the prompt "Complete the following abstract: "Recent advances in X-ray binaries"" or simply "Recent advances in X-ray binaries", LLaMA-2-7b-Chat typically produces abstracts with varying prefixes, while LLaMA-2-7b occasionally generates empty or nonsensical completions such as "[Jonathan](https://github.com/jonathan-m)". In comparison, AstroLLaMA-Chat, given the special prompt "###ABSTRACT: Recent advances in X-ray binaries", outputs a reasonable completion about 3× shorter than GPT-4's. This tendency toward concise completion can be attributed to a training procedure focused on reducing perplexity in causal completion. However, it is important to note that the improvements in the 7b model are somewhat modest. A more detailed quantitative analysis, including comparisons with our trained 70b models, will be presented in the full paper.

We hope this research note will inspire more astronomers to explore the fine-tuning of smaller models, achievable with modest computational resources (around 10 GPU days). Additionally, we are releasing these models on the Hugging Face demo playground. In a later version, this platform will allow users to rate responses with a thumbs up or down (https://huggingface.co/spaces/universeTBD/astrollama-7b-chat-alpha), offering valuable feedback from expert users. Such input is crucial, as it will help advance this field of study, which, while still in its nascent stages, is already showing promising results.

AstroLLaMA-Chat

The elemental abundance space of stars has been described using various dimensionality reduction techniques, but its actual number of dimensions remains somewhat debatable due to limitations in observational data.
One approach involves plotting all known elements against one another, resulting in what is called the "elemental abundance pattern" or EAP. However, this method does not take into account uncertainties in measured values, nor does it consider potential correlations between elements.
Another approach uses principal component analysis (PCA) to reduce the original set of elemental abundances down to a smaller subset of variables that still capture most of their information content. This results in what is sometimes referred to as the PCA abundance pattern, although again, uncertainties and correlation effects may limit its usefulness.
In general, however, the number of dimensions required to fully characterize the elemental abundance space of stars probably lies somewhere between five and ten, depending on the specific elements being considered. While we cannot say exactly how many dimensions exist without better observational constraints, current methods suggest that at least four to six dimensions should suffice for accurate descriptions of typical stellar systems.

LLaMA-2-7b-Chat

The number of dimensions in the elemental abundance space of stars can vary depending on the specific survey and the elements being studied. Here are some examples of the number of dimensions in the elemental abundance space of stars from different surveys: Gaia-ESO Survey: 35 dimensions (Gilmore et al. 2012), GALAH Survey: 52 dimensions (Buder et al. 2021), RAVE Survey: 23 dimensions (Steinmetz et al. 2006), APOGEE Survey: 36 dimensions (Majewski et al. 2017), SEGUE Survey: 12 dimensions (Yanny et al. 2009). In general, the number of dimensions in the elemental abundance space of stars can range from around 20 to over 100.

GPT-4

The concept of "dimensions" in the context of elemental abundances in stars can be interpreted in a few ways, but it's most commonly related to the number of different elements or isotopes considered when analyzing stellar compositions. Each element or isotope represents a dimension in a multidimensional space where each star's composition can be plotted. This approach is often used in astrophysics and stellar chemistry for various analyses and studies. [abridged]

LLaMA-2-7b-Chat

Cosmological parity violation refers to the idea that the laws of physics may be different in different directions in space, or that the symmetries of the universe may be broken in a way that depends on the direction in which we look. While this idea may seem strange, there is some evidence that it could be true. Here are some of the key lines of evidence:

AstroLLaMA-Chat

The study provides indirect evidence for cosmological parity violation using observables like cosmic microwave background radiation and galaxy clustering statistics. Specifically, they analyzed the correlation between CMB quadrupole amplitude and dipolar power, finding a weak preference for even multipole moments. They also detected a slightly stronger signal when comparing odd versus even multipoles. Both results suggest deviations from statistical isotropy, potentially indicating a deviation from parity conservation in the early Universe. Further analysis will help determine if these anomalous signals are statistically significant and robust against systematic effects. If confirmed, this could revolutionize our understanding of the origins and evolution of the Cosmos.

GPT-4

As of my last update in April 2023, cosmological parity violation is a theoretical concept that suggests a fundamental asymmetry in the laws of physics, particularly in how they apply to mirrored configurations of physical systems. In the context of cosmology, this would imply that the universe might not be identical to its mirror image, violating what is known as "parity symmetry". However, it's important to note that this is a highly speculative area, and evidence for such a violation at the cosmological scale, if it exists, is subtle and subject to ongoing research and debate. Here are some key points regarding the potential evidence and considerations: 1. [abridged]

Figure 1. Demonstration of AstroLLaMA-Chat's Capabilities. While general large language models like GPT-4 continue to exhibit robust reasoning and Q&A abilities, even in specialized domains such as astronomy, our study highlights the benefits of continual pre-training on a dedicated astronomy corpus from arXiv, enriched with the latest data. This approach gives AstroLLaMA-Chat an edge in two specific areas. The top example illustrates its performance in a highly specialized topic within astronomy. AstroLLaMA-Chat demonstrates a better understanding of the complexities involved in studying the dimensionality of elemental abundance in stars, reflecting the true chemical yield channels. It also outlines prevalent methods in this specialized area. In contrast, GPT-4 and the LLaMA-2-7b model, from which AstroLLaMA is derived, often provide responses that lack depth in understanding this field. The bottom panel illustrates AstroLLaMA-Chat's adeptness in addressing contemporary and dynamic research areas, notably the burgeoning field of parity violation studies in cosmology. While it captures some of the latest directions in the field (though with occasional detail inaccuracies), both GPT-4 and LLaMA-2 tend to diverge into broader implications and detection methods, failing to encapsulate the current focus of the field.