Toward Machine-learning-based Metastudies: Applications to Cosmological Parameters

We develop a new model for automatic extraction of reported measurement values from the astrophysical literature, utilizing modern natural language processing techniques. We use this model to extract measurements present in the abstracts of the approximately 248,000 astrophysics articles from the arXiv repository, yielding a database containing over 231,000 astrophysical numerical measurements. Furthermore, we present an online interface (Numerical Atlas) to allow users to query and explore this database, based on parameter names and symbolic representations, and download the resulting data sets for their own research uses. To illustrate potential use cases, we then collect values for nine different cosmological parameters using this tool. From these results, we can clearly observe the historical trends in the reported values of these quantities over the past two decades and see the impacts of landmark publications on our understanding of cosmology.


Introduction
There is currently an unprecedented level of availability of scientific literature and knowledge, made possible by the internet and the open-science spirit of many in the community. In addition, new publications are being added to these repositories at a remarkable rate. While this availability is highly beneficial to the wider community, the sheer number of publications does cause issues for academics wishing to survey the literature on particular topics. Due to the technical nature of the domain, keyword search queries and other common content-retrieval algorithms (such as those used by NASA ADS and the arXiv search interface) are often insufficient for identifying useful collections of documents. Moreover, if one is searching not just for particular articles but for specific data contained within those articles (such as numerical measurements, as concerns us here), the problem is compounded. Not only do we have the task of identifying the relevant papers, but also of reading and cataloging the data we are interested in. For example, many researchers are regularly interested in metastudies on the values of specific parameters, where an understanding of the current consensus is required, such as for use in simulations or experimental calculations. When done manually, these endeavors are often time-consuming and can be prone to clerical errors and human bias (Kerzendorf 2017).
This, therefore, is a task that would benefit from the support of automated approaches, both to free up research time from manual data collection and bookkeeping, and to broaden the horizons of our search, since machines do not become bored after reading the thousandth paper and have no unconscious bias toward popular articles. Such a search algorithm could be prerun over the entire backlog of available literature, allowing for fast search-time queries by users, and then be automatically kept up to date as new publications are released.
However, even within the presumably well-structured texts found in scientific literature, we still see a vast array of linguistic creativity from the authors of the papers we read. This presents a multitude of interesting challenges when performing computations on the texts. In particular, more rudimentary algorithms based on heuristics and hand-coded rules often fall short due to the high potential for variation in the patterns one is attempting to capture.
In our previous work (Crossland et al. 2020), we utilized these simpler strategies, in the form of pattern matching and keyword search, to identify numerical measurements in the astrophysical literature. However, balancing the scope and selectivity of hand-written regular expressions for the large quantity of writing styles seen in the literature is a difficult process, and resulted in large amounts of noise in the results. This in turn required additional hand-tuned filtering steps. These many steps of processing led to gaps in the patterns we were able to capture, and the rule-based nature of the process meant the algorithm was brittle in the face of irregular writing styles.
This problem of variability in free text is well studied in the field of natural language processing, the area of machine learning concerned with the processing of human language, and more modern techniques have been developed in recent years to better capture and understand the complex patterns found in language. Lately, this process has been heavily influenced by the development of artificial neural networks, especially in the case of recurrent neural networks, and as such neural techniques for natural language processing have become commonplace in recent years. Models such as BERT/RoBERTa (Devlin et al. 2018; Liu et al. 2019) and GPT-3 (Brown et al. 2020) are excellent examples of the successes that may be seen from this trend.
Our goal here, then, is to leverage some of these statistical techniques for our problem of numerical measurement extraction from astrophysical literature. This will allow us to overcome some of the shortcomings found in our previous approach and extend its successes to more complex instances of measurement reporting. In particular, our earlier attempt was severely limited by the requirement that the parameter names and symbols be prespecified and atomic (i.e., known to the user and of a singular, rigid form). These requirements simply cannot be met for many real-world parameters, as there is either no single agreed-upon name, perhaps because the entity represents some complex definition (e.g., σ_8), or no agreed-upon symbol. Worse, the symbol may be overloaded with several unrelated meanings (e.g., β). In other cases, the user only has access to an incomplete list of names and symbols, and this limits the recall of their search. In cases such as these, a more contextually aware technique is required, in order to leverage the information available in the text itself.
The final goal of this project is to produce a system that will allow researchers to quickly and easily search the available corpus of literature for instances of measurements of a particular parameter. Our initial investigations toward this goal focused on simple measurement extraction of a single parameter with a well-defined name and symbol (the Hubble constant, H_0). In this work, we extend this using statistical techniques to a general search for any parameters contained in the text. This means that the "search" aspect of utilizing the model is moved to a pipeline postprocessing step, rather than a user query-time step, which greatly improves efficiency for the user, in addition to providing theoretical advantages for the model structure.
In the following sections, we discuss the steps involved in producing these new models, beginning with a brief description of the data we are utilizing and the preprocessing pipeline that converts it into an appropriate format (see Section 2). For this project, we must also create training and evaluation data sets for our task as, to the best of our knowledge, none currently exist. This will involve the construction of a hand-annotated training data set created from examples of astrophysical literature, a process that is discussed in Section 3.
Using this training data, we train artificial neural-network models to perform the named-entity recognition and relation classification tasks for our problem. This will involve identifying spans in the text relating to physical parameters, their mathematical symbols, reported measurements and other numerical data, and so on, and then linking these together such that numerical measurements can be connected to the physical parameters they represent. The architectures and training of these models are discussed in Section 4.
These trained models are applied to the entire arXiv data set and the outputs used to create a searchable database of numerical measurements that can be easily queried to extract measurements of a given parameter, as well as other useful information regarding the reporting of such measurements (e.g., confidence limits, constraint values, and associated objects). This database will be made available to the community via an online interface, available at https://numericalatlas.doc.ic.ac.uk; a copy of the underlying SQLite3 database has been deposited to Zenodo at doi:10.5281/zenodo.8025930. A schematic diagram of the project outline is shown in Figure 1.
Finally, Section 5 focuses on comparing the new statistical approach with the previous rule-based approach from Crossland et al. (2020), showing how it performs equally well on the simple tasks that the rule-based model excelled at and how it surpasses the rule-based approach in more complex situations. We also present a set of example use cases of the result set, focusing on extracting values of various cosmological parameters, demonstrating the various search parameters that may be employed. Using these results, we discuss some of the trends and features that are observed in the community's understanding of these quantities over the last few decades and how they may relate to particular events and publications during that time. This will show the utility of the model for scientists wishing to quickly gather numerical information relating to a measurable physical quantity for various kinds of analyses.

Data
The data set for this project is taken from astrophysics publications from the arXiv, an open-access repository for scientific literature, maintained by Cornell Tech. Publications on the arXiv may be stored in a variety of formats, with the most common being LaTeX source files (91% of all submitted articles). As such, we have chosen to utilize the structured nature of the LaTeX files to allow us to process the documents into well-formatted text appropriate for machine-learning tasks.
In order to process these source files into a more usable format, we have utilized the preprocessing pipeline described in Crossland et al. (2020). Article source files are processed using the LaTeXML program, created by the National Institute of Standards and Technology, into a single XML document, which improves the usability of the data for computational purposes. The text is then tokenized and sentence split, with a purpose-built tokenizer for LaTeX math environments. Using this corpus, we can easily create textual data samples to a variety of specifications for our machine-learning models, based on the content of section headings (e.g., "Results" or "Conclusions"), document components (e.g., abstract), and so on. It should be noted that comments in the LaTeX documents are not included in the output text from this process.
Our current data set consists of all arXiv papers published up until 2020 September, corresponding to 1.6 million articles. Of these, approximately 265,000 have the astrophysics tag ("astro-ph"), and our pipeline can successfully extract over 248,000 formatted articles from this set (corresponding to a success rate of 94%). Failure cases are generally found in older articles, often due to the source files being written in TeX rather than LaTeX. This coverage of the available articles is considered to be sufficient for the purposes of this task, and it shall be assumed in this work that the 94% of processed articles are statistically similar to the remaining data, in terms of their content and linguistic style.
We have also utilized the data set compiled by Croft & Dailey (2011), which comprises 638 values of 8 cosmological parameters from 468 papers. These papers are used as a curated set of example literature for our task, both for analysis and as a component of the annotation effort described in Section 3.
We have made one further assumption in the use of this data: that any paper whose goal is to report some numerical measurement as a finding of the publication will report said measurement in the paper abstract. This is not always the case, especially for publications concerning the determination of numerical quantities for a set of objects (stellar parameters for some large sample of stars, for instance). However, based on our investigations of the Croft & Dailey (2011) data set, we find this to be a reasonable working assumption and use it for the majority of this work. Specifically, this makes the creation of a manually annotated training corpus a more tractable proposition. It is, however, noted that there are distinct linguistic differences between article abstracts and main bodies, and generalizing the models trained on this data to entire papers will be the subject of future work.

Annotation of Astrophysics Abstracts
For our machine-learning tasks, we require data to train and evaluate our models: examples that show the mapping between input data and desired output. Therefore, the next step in our data processing is to produce hand-annotated samples that demonstrate the information we wish our models to extract (annotated article abstracts in our case).
In natural language processing, there are many kinds of annotation that may be produced; here, we are interested in Entity, Relation, and Attribute annotations.
An Entity annotation is one where we select a span of text from our document and assign some label to that span. For example, in the sentence, "... for the Hubble constant at the present epoch ...," we could select the span "Hubble constant" and assign it the label ParameterName.
A Relation annotation is where we have two Entity annotations and we declare the existence of some semantic relationship between them. For example, in the sentence, "Using the Hubble constant, H_0, under the assumption ...," we could create a Relation between the Entities "Hubble constant" (ParameterName) and "H_0" (ParameterSymbol) and assign it the label Name (the labels used in this project are discussed below). Relation annotations may be constrained by the Entity types they may connect. For example, a Name Relation may only exist between a ParameterName Entity and a ParameterSymbol Entity (generally, these constraints are not symmetric, meaning that most Relations are directional).
Finally, an Attribute annotation is one that modifies an Entity, by assigning another label to it. For example, in the sentence, "Using a value of 0.3 from the literature ...," we could assign a LiteratureValue Attribute to the Entity "0.3" (MeasuredValue). As for Relations, Attributes may be constrained by the type of Entity they can be assigned to. For example, a LiteratureValue Attribute can only be placed on a MeasuredValue or Constraint Entity.
Now that we have our annotation types, we create a schema which describes the Entity, Relation, and Attribute labels we have available for our annotation project, and the constraints that exist for them. We are interested in measurement extraction from astrophysical literature, and so require labels that reflect that domain: Entity labels for measurements, parameters, objects, and definitions are all appropriate. Likewise, for Relations, we must be able to define which names and symbols relate to which measurements, which parameters are properties of which objects, and so on. A complete list of the annotations used in this project may be found in Tables 1, 2, and 3, along with any constraints that exist on them. Detailed descriptions of each may be found in Appendix A.
This schema is not intended to represent an exhaustive list of the various semantic entities that may be relevant to this problem or domain. A compromise has been struck between completeness and practicality, as human annotators must apply this schema when annotating training data, and a very detailed schema becomes impractical once there are too many annotation types and combinations to remember. As such, we have chosen to focus on the most important Entities and Relations for our task, favoring broader definitions over an increased number of labels in certain cases (e.g., ObjectName labels, where we could easily have multiple labels for different kinds of physical entities).
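The three annotation types and their constraints can be illustrated with a small data-structure sketch. This is only an illustration under stated assumptions, not the schema files used in the project: the class names, the `relation_allowed` and `attribute_allowed` helpers, the direction chosen for each Relation, and the `Property` entry are hypothetical, and the label names are limited to those mentioned in the text.

```python
from dataclasses import dataclass

# Allowed (source label, target label) pairs for each Relation label.
# The Name pairing follows the text; its direction, and the Property
# entry as a whole, are assumptions made for this sketch.
RELATION_CONSTRAINTS = {
    "Name": {("ParameterName", "ParameterSymbol")},
    "Property": {("ObjectName", "ParameterName")},
}

# Entity labels that each Attribute label may be attached to.
ATTRIBUTE_CONSTRAINTS = {
    "LiteratureValue": {"MeasuredValue", "Constraint"},
}


@dataclass(frozen=True)
class Entity:
    label: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str


@dataclass(frozen=True)
class Relation:
    label: str
    source: Entity
    target: Entity


def make_entity(sentence, span_text, label):
    """Build an Entity from the first occurrence of span_text in sentence."""
    start = sentence.index(span_text)
    return Entity(label, start, start + len(span_text), span_text)


def relation_allowed(relation):
    """Check a Relation against the directional label constraints."""
    allowed = RELATION_CONSTRAINTS.get(relation.label, set())
    return (relation.source.label, relation.target.label) in allowed


def attribute_allowed(attribute_label, entity):
    """Check whether an Attribute label may be placed on an Entity."""
    return entity.label in ATTRIBUTE_CONSTRAINTS.get(attribute_label, set())


sentence = "Using the Hubble constant, H_0, with a value of 0.3 from the literature"
name = make_entity(sentence, "Hubble constant", "ParameterName")
symbol = make_entity(sentence, "H_0", "ParameterSymbol")
value = make_entity(sentence, "0.3", "MeasuredValue")

forward = Relation("Name", name, symbol)
reverse = Relation("Name", symbol, name)
```

Because the constraints are stored as ordered pairs, `relation_allowed(forward)` holds while `relation_allowed(reverse)` does not, which mirrors the statement that most Relations are directional.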

Annotation Process
Using this schema and a team of annotators, we have annotated 600 article abstracts, with each abstract being annotated by three annotators. For this process, we utilized the brat rapid annotation tool (Stenetorp & Pyysalo 2012). The resulting sets of annotations have then been combined such that each abstract has a single, consensus annotation set, and it is this consensus data that will be utilized as training data by our machine-learning models. The steps taken in this process are detailed below.
First, we select a set of papers to be annotated from the available corpus. As a starting point, we choose the 305 papers contained in the Croft & Dailey (2011) data set (the subset that successfully pass through our preprocessing pipeline, as discussed in Section 2). These serve as examples of the papers reporting measurements of cosmological parameters that we wish to identify in our test cases, as in Section 5. To round out this selection of papers, we score the available papers from the arXiv data set according to an estimate of the number of measurements used in the paper abstracts (for this estimate, we use a regular expression to identify candidate measurement strings in the text, as described in Section 4.2 in Crossland et al. 2020). We then filter these measurements to remove noise, notably by requiring that the measurement patterns contain uncertainties. Due to the prevalence of dimensionless quantities in cosmology, we also reject papers that only contain measurements with concrete units, such that the distribution of these papers will be closer to that of the Croft & Dailey (2011) data set. We then randomly sample papers with a nonzero estimated number of measurements in their abstracts to produce a final set of papers for annotation. For this annotation work, we sampled 300 abstracts from the approximately 35,000 papers with nonzero estimated measurement counts, to complement the ∼300 available papers from the Croft & Dailey (2011) data set. This sample size reflects the limitations of this project in terms of available time and resources.
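The scoring and sampling step can be sketched as follows. The regular expression here is a hypothetical stand-in that only recognizes a number followed by a symmetric uncertainty; the pattern actually used is the one described in Crossland et al. (2020) and is not reproduced here. The function names and the seed are likewise assumptions, and the additional rejection of papers containing only measurements with concrete units is omitted.

```python
import random
import re

# Hypothetical pattern: a number followed by a symmetric uncertainty,
# e.g. "72 +/- 8", "0.27 ± 0.04", or "0.27 \pm 0.04".
MEASUREMENT_WITH_UNCERTAINTY = re.compile(
    r"\d+(?:\.\d+)?\s*(?:\+/-|±|\\pm)\s*\d+(?:\.\d+)?"
)


def estimate_measurement_count(abstract):
    """Estimate how many measurements with uncertainties an abstract reports."""
    return len(MEASUREMENT_WITH_UNCERTAINTY.findall(abstract))


def sample_for_annotation(abstracts, n_samples, seed=0):
    """Randomly sample paper ids whose abstracts have a nonzero estimate.

    abstracts maps a paper id to its abstract text.
    """
    candidates = sorted(
        paper_id
        for paper_id, text in abstracts.items()
        if estimate_measurement_count(text) > 0
    )
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_samples, len(candidates)))


abstracts = {
    "a": "We find H_0 = 72 +/- 8 km/s/Mpc and Omega_m = 0.27 ± 0.04.",
    "b": "We review models of dark energy and their phenomenology.",
    "c": r"The scale length is 2.5 \pm 0.3 kpc.",
}
selected = sample_for_annotation(abstracts, n_samples=5)
```

Fixing the seed keeps the selection reproducible, which is convenient when papers are released to annotators in batches.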
It should be noted, therefore, that this set of papers is heavily biased toward cosmological measurements, and this will have an impact on the efficacy of the model in identifying measurements in other areas. However, we should also note that the randomly sampled papers are not constrained by arXiv subject tag (beyond simply the astrophysics tag, "astro-ph"), and so are selected from a range of subject areas within astrophysics. This bias was chosen due to the target test case for this work being cosmological parameters (see Section 5.2).
For the annotation project itself, we recruited seven astrophysics PhD students and presented them with a set of example annotated documents based on the schema outlined above. The selected papers were then released in batches of 100, evenly divided between the Croft & Dailey (2011) and randomly sampled papers, over the course of several months. The annotators were paid for their time at their standard rate, allowing for an average of 5 minutes per abstract. The papers were allocated such that each was annotated by three separate annotators.
Each round of annotations was conducted in two stages: first, the annotators were asked to work independently on their sample, and second, once these initial annotations were complete, they were made available to all annotators, who were then asked to compare their annotations with the others and bring the annotations for each paper into better alignment. However, it should be noted that it was not a requirement that the annotators ensure their annotations be in perfect agreement, meaning that the final data set still contains some discrepancies between individual annotation attempts. These repeated annotations were then consolidated into single annotation sets, representing the consensus of the annotators.
Our final data set contains 572 paper abstracts, after accounting for papers that were unsuitable, contained no useful annotations, or were found to be incorrectly formatted. The final division is 300 abstracts from the Croft & Dailey (2011) data set and 272 abstracts randomly sampled using the algorithm above.

Annotation: Caveats
There were a few issues encountered during the annotation process that should be noted. First, the ParameterName annotation causes some issues with agreement between annotators. This is to be expected, as the exact span of a parameter's name can be difficult to define. Some examples of this would be "mean baryon density of the Universe," "total mass of three massive neutrinos," or "mass-weighted Galactic disk scale length" (all examples taken from our annotated documents). In these instances, there is a more compact span that could approximate the "name" in question ("baryon density," "total mass," "scale length"), but it does not accurately capture the full intended context. We can, of course, generally extend this reasoning in both directions arbitrarily far (right down to single words and up to full sentences, or even paragraphs, of explanation), but this is often impractical. Deciding on the exact compromise is difficult, and this leads to different annotators selecting slightly different spans for many instances of ParameterName annotations. The alignment stage of our annotation strategy alleviates this disagreement somewhat, but it serves to show that this Entity has a lot of linguistic ambiguity. Indeed, we shall see in Section 4 that our models struggle to achieve higher scores when recognizing these labels, reflecting a combination of these disagreements between annotators carrying over to the data set and the inherent linguistic ambiguity in the boundaries of these phrases.
We also see issues with the ObjectName annotations. In some instances, this is closely related to the problems with ParameterName boundaries. For example, the phrase, "mass-weighted Galactic disk scale length" could be annotated as a single ParameterName or as the ParameterName "scale length," which is in turn a Property (Relation) of the ObjectName "Galactic disk." If the phrase had been written "Milky Way disk scale length," this breakdown into ObjectName and ParameterName would perhaps be more appealing, but the use of an adjective ("Galactic") coupled with a self-contained phrase ("disk scale length") may give the annotator pause. Context is also important in many of these situations, as reference to a simulated object rather than an observed one may bias the annotator away from using an ObjectName label, and so on.
It should be noted, however, that the combination of annotator discussion and our consensus algorithm (see Appendix B) goes a long way toward alleviating the observed disagreements. They are discussed here to illustrate the problem cases presented by the data, as well as the problems we will encounter during model training.

Tasks
We have chosen to formulate the overall task of finding measurements in free text as two subtasks, which are both well documented in the natural language processing domain: named-entity recognition (Nadeau & Sekine 2007) and relation classification (Pawar et al. 2017). In contrast to our approach in Crossland et al. (2020), we approach these tasks using artificial neural-network techniques, as has become standard practice in natural language processing in recent years, rather than the heuristic approach taken before. This can give the models more flexibility and scope, allowing for a broader investigation of the data available in the literature (however, as we shall see, this cannot always overcome the inherent difficulty of the task, as seen with our neural Relation models). Additionally, we have a simpler classification task for predicting Attributes. Other than the inclusion of a recurrent neural network to deal with the variable-length sequences involved, this will be formulated as a traditional classification problem.
The PyTorch package (Paszke et al. 2019) has been used for the implementation of these models throughout this work.

Named-entity Recognition
In named-entity recognition tasks we consider the text as a series of individual tokens, which may be words, numbers, punctuation marks, or other self-contained collections of characters (without whitespace). The task is then to find subsequences of these tokens that correspond to named entities. In general tasks, this may be place or person names (often consisting of multiple tokens, for example, "Hubble Space Telescope"), or any other sequence of tokens that together refer to some single entity. For example, the Entity "effective temperature" (with label ParameterName) is comprised of the tokens "effective" and "temperature," whereas the Entity "H _ { 0 }" (ParameterSymbol) consists of the tokens "H," "_," "{," "0," and "}." Named-entity recognition is distinct from the task of assigning labels to individual tokens, such as labeling words as "verb," "noun," "adjective," etc. in a sentence, a task generally referred to as part-of-speech tagging. The named entities we are considering in this work are those listed in Table 1.
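To make the notion of a token concrete, the following is a minimal tokenizer sketch that reproduces the two examples above. It is a hypothetical stand-in, not the purpose-built tokenizer used in the preprocessing pipeline; the pattern and the `tokenize` name are assumptions for illustration only.

```python
import re

# Alternatives are tried left to right at each position.
TOKEN_PATTERN = re.compile(
    r"\\[A-Za-z]+"     # LaTeX commands such as \Omega
    r"|\d+(?:\.\d+)?"  # integers and decimals such as 0.71
    r"|[A-Za-z]+"      # alphabetic words
    r"|\S"             # any other single non-whitespace character
)


def tokenize(text):
    """Split text into word, number, LaTeX-command, and symbol tokens."""
    return TOKEN_PATTERN.findall(text)


name_tokens = tokenize("effective temperature")
symbol_tokens = tokenize("H_{0}")
value_tokens = tokenize(r"\Omega_m = 0.27")
```

Here `symbol_tokens` contains the five tokens "H," "_," "{," "0," and "}," so a single ParameterSymbol Entity spans several tokens, which is why the task is framed over token subsequences rather than individual tokens.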
A common practice in named-entity recognition tasks is to classify tokens according to the beginning-inside-outside (BIO) format (Ramshaw & Marcus 1995), where each token is designated as either a "beginning" token (corresponding to a particular label, e.g., "〈B-ParameterName〉"), an "inside" token (again corresponding to a particular label, e.g., "〈I-ParameterName〉"), or an outside token (not belonging to any label, e.g., "〈O〉"). An example sentence showing this labeling is given in Table 4.
Hence, for a set of N Entity names, we have a possible 2N + 1 BIO labels ("begin" and "inside" for each Entity name, and one "outside" label). This, therefore, is the number of output classes for our machine-learning models.
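The BIO encoding and the 2N + 1 label count can be checked with a short sketch. The function names and the span representation (token indices with an exclusive end) are assumptions for illustration; the example sentence is constructed here and is not taken from the annotated data.

```python
def bio_labels(entity_names):
    """Return the 2N + 1 BIO labels for N Entity names."""
    labels = ["O"]
    for name in entity_names:
        labels += [f"B-{name}", f"I-{name}"]
    return labels


def spans_to_bio(tokens, spans):
    """Convert non-overlapping Entity spans into one BIO tag per token.

    Each span is (start, end, label) in token indices, with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["the", "Hubble", "constant", ",", "H", "_", "{", "0", "}"]
spans = [(1, 3, "ParameterName"), (4, 9, "ParameterSymbol")]
tags = spans_to_bio(tokens, spans)
labels = bio_labels(["ParameterName", "ParameterSymbol"])
```

With two Entity names, `labels` has five entries, and every token in `tokens` receives exactly one of them.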
The BIO format has some drawbacks in the general case, notably that it cannot express nested or overlapping annotations, but as we have specified that our Entity annotations will be nonoverlapping, we will not encounter that problem here.

Relation Classification
Relation classification is the subject of much active research in the field of natural language processing. Many of the recent works in this field have involved data sets comprised of single-sentence samples, where each sample either contains one of a set of possible Relations or no Relation at all (e.g., Hendrickx et al. 2010). However, we cannot easily break our data down into these atomic relational chunks, as we have many sentences that contain multiple Relations, and many long-distance Relations (where the Entities are not contained in the same sentence, and may even be several sentences apart in the text). Therefore, here we are considering the task of relation classification between labeled Entities in free text. The exact formulation of this problem is treated differently for the models described below, and so will be discussed in following sections.

Performance Metrics
In the sections below, the following performance metrics are used: precision, recall, and F1 score. Here, we consider a testing data set, containing a number of samples. Some portion of these samples are considered "relevant": we desire that these samples be identified as positive by the model. Relevant samples identified as positive are "true positive" (tp), and those identified as negative are "false negative" (fn). Conversely, nonrelevant samples identified as positive are "false positive" (fp), and those identified as negative are "true negative" (tn).
Precision is defined as the number of tp results, as a fraction of the total number of retrieved samples (all samples with a positive prediction, i.e., the number of tp and fp) in the tested data set:
$$\mathrm{precision} = \frac{tp}{tp + fp}.$$
This is a measure of how relevant the retrieved samples are.
Recall is defined as the number of tp results, as a fraction of the relevant samples (i.e., the total number of tp and fn samples):
$$\mathrm{recall} = \frac{tp}{tp + fn}.$$
This is a measure of how many of the relevant items were identified.
The F1 score is the harmonic mean of the precision and recall:
$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
The F1 score can have a value between 0 and 1, with 1 representing perfect precision and recall. While these metrics are primarily for binary problems, they can be generalized to the multiclass case by microaveraging (biased by class frequency) the scores; this strategy is used in this work.
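As a worked illustration of these definitions, the sketch below computes the three metrics from raw counts and microaverages them over token-level labels. Scoring per token and treating the outside label "O" as the negative class are assumptions of this sketch, not a statement of how the evaluation in this work is implemented; the function names are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts (0.0 if undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def micro_average(gold, predicted, negative_label="O"):
    """Pool tp, fp, and fn over all classes, then compute the scores once."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p != negative_label and p == g:
            tp += 1
            continue
        if p != negative_label:
            fp += 1  # a positive prediction that does not match the gold label
        if g != negative_label:
            fn += 1  # a relevant sample that was missed or mislabeled
    return precision_recall_f1(tp, fp, fn)


gold = ["B-ParameterName", "I-ParameterName", "O", "B-ParameterSymbol"]
pred = ["B-ParameterName", "O", "B-ParameterSymbol", "B-ParameterSymbol"]
micro_p, micro_r, micro_f1 = micro_average(gold, pred)
```

Because the counts are pooled before dividing, frequent classes contribute more to the microaveraged scores than rare ones, which is the class-frequency bias noted above.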

Featurization
All of the models we use in this project require a mechanism for converting tokens into a numerical vector representation, often referred to as an embedding. These embeddings may then be used in mathematical operations, such as the matrix operations that underlie all neural-network layers. There are many algorithms and models currently in use for this purpose, such as Word2Vec (Mikolov et al. 2013), GloVe (Pennington et al. 2014), or BERT/RoBERTa (Devlin et al. 2018; Liu et al. 2019). We have chosen to use Word2Vec, as it comprises a class of models that are well documented and can be retrained locally if a large corpus is available (such as our arXiv data set). Word2Vec operates by creating an embedding space (vector space), where each token in the vocabulary is assigned a separate vector representation. The Word2Vec model is then trained such that "similar" words have similar embeddings, i.e., appear close to each other in the embedding space. Word2Vec is a very powerful technique, as the resulting models produce embedding spaces where tokens are clustered semantically and in a structured manner, such that both direction and position have semantic meaning (Mikolov et al. 2013).
One downside of Word2Vec is that tokens are defined solely by their character strings. This means that, for example, the words "play" as in "theatrical production" and "play" as in "play a sport" only have one embedding, despite having separate meanings, and the Word2Vec algorithm must encode both possible meanings into a single representation. More recent approaches in natural language processing have utilized contextual word embeddings (e.g., BERT), where the surrounding tokens are taken into account when constructing an individual token's embedding, but these come with a significant runtime and memory cost.
For this project, we trained a set of Word2Vec embeddings on the entire arXiv astrophysics corpus (see Section 2), and these embeddings will be used for all of the models discussed below. For efficiency reasons, these embeddings are held fixed while the models are trained. However, this can impose limitations on any model using the embeddings (especially shallower networks), and so each model also performs an initial projection of the vectors. This is done with a simple matrix multiplication with a square matrix, which is itself a trainable part of the model. This increases the model capacity with regard to the fixed input embeddings, while maintaining the efficiency of pretrained embeddings.
While the Word2Vec token embeddings provide an excellent basis, they do fall short under certain circumstances. A notable instance of this is in the case of rare tokens, i.e., specific sequences of characters that occur infrequently. As the Word2Vec algorithm requires a minimum number of occurrences before a token is included in the vocabulary, rare tokens are often referred to as "out of vocabulary," and are replaced with a default embedding. As our Word2Vec model was trained specifically on astrophysical literature, we are less concerned with out-of-vocabulary technical language, but instead are concerned with numerical strings.
To a human reader, the difference between the strings "0.70" and "0.71" is minor, as we interpret the value in its numerical sense. However, the Word2Vec algorithm is not designed to leverage the numerical nature of the strings, as they are considered only as a string of characters. While Word2Vec does indeed organize numerical strings in a structured manner, due to their usages in text, this is only sufficient for common numerical strings ("1," "15," "100," and so on). In our scientific context, important numerical values (especially measurement values) are likely to be rare character sequences. As such, Word2Vec may encounter issues dealing with these tokens (Thawani et al. 2021).
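The out-of-vocabulary effect on numerical strings can be shown with a toy vocabulary builder. This is a sketch of the minimum-count thresholding idea only, not of Word2Vec training itself; the `<unk>` token name, the function names, and the threshold of 5 are assumptions chosen for illustration.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the default out-of-vocabulary entry


def build_vocabulary(token_stream, min_count):
    """Map each token seen at least min_count times to an integer index."""
    counts = Counter(token_stream)
    vocabulary = {UNK: 0}
    for token, count in counts.items():
        if count >= min_count:
            vocabulary[token] = len(vocabulary)
    return vocabulary


def lookup(vocabulary, token):
    """Return the index of a token, falling back to the default entry."""
    return vocabulary.get(token, vocabulary[UNK])


# A toy corpus: "100" is common, while "0.70" and "0.71" each occur once.
corpus_tokens = ["H", "=", "100"] * 5 + ["0.70", "0.71"]
vocabulary = build_vocabulary(corpus_tokens, min_count=5)

index_common = lookup(vocabulary, "100")
index_rare_a = lookup(vocabulary, "0.70")
index_rare_b = lookup(vocabulary, "0.71")
```

Both rare values collapse onto the same default index, so a model that sees only the word-level embedding cannot distinguish "0.70" from "0.71," or either of them from any other rare token.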
In order to alleviate this problem and generally increase the capacity of our Entity models, we have also created versions of the above models that utilize boosted token embeddings.For these models, the embedding for each token is constructed by concatenating the Word2Vec embedding with the output of a trainable character-level neural-network encoder (akin to Seo et al. 2016).
This encoder is a simple single-layer bidirectional long short-term memory (LSTM) network, which is passed over a word matrix, created by concatenating trainable character embeddings.Hence, for a word W of length w, with character embeddings of dimensionality c, each word may be represented by a w × c matrix.The hidden state of the Bi-LSTM at the final time step is used as a fixed-length character-based word embedding for W.
Therefore, for these boosted models, each (projected) Word2Vec word embedding is concatenated with the character-based word embedding before being supplied to the model.Training signal is allowed to backpropagate into the character encoder during training, allowing the model to learn to fill in the information gaps in the Word2Vec embeddings, while still having the power of the Word2Vec algorithm to fall back on.
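As a rough illustration of how a boosted embedding is assembled, the following sketch projects a fixed Word2Vec vector with a square matrix and concatenates the result with a character-based vector. The function names and the toy values are assumptions made for illustration; in the models described here the projection matrix is trained and the character vector comes from the Bi-LSTM encoder, neither of which is reproduced in this sketch.

```python
def project(matrix, vector):
    """Multiply a square projection matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def boosted_embedding(word_vector, projection, char_vector):
    """Concatenate the projected Word2Vec vector with a character-based vector.

    `char_vector` stands in for the final hidden state of the character
    encoder; here it is simply supplied by the caller (an assumption).
    """
    return project(projection, word_vector) + list(char_vector)


# Toy example: 2-dimensional word vector, identity projection,
# 1-dimensional character vector.
identity = [[1.0, 0.0], [0.0, 1.0]]
embedding = boosted_embedding([1.0, 2.0], identity, [0.5])
```

The output dimensionality is the sum of the Word2Vec and character-encoder dimensionalities, so downstream layers must be sized accordingly.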

Data Usage
When training the following models, we use a holdout data set comprised of all the annotations contained in a subset of the article abstracts from our annotated data set (here, we use a data split of 60%-20%-20% for the training, development, and holdout testing data sets). This means that the training data for the Entity and Relation models come from the same set of papers, which are distinct from the set of papers used as a holdout testing set. This is done to prevent contamination of the validation results.

Entity Models
To begin, we examine the named-entity recognition models we have created: a feed-forward neural network, and a recurrent neural network using LSTM layers. Here, we are experimenting with multiple model architectures to give us insights into the complexity of the problem, and aid in interpreting model performance (as the different architectures emphasize different kinds of information from the text).
It should be noted that, due to the relative sparsity of Entities in the texts, for all the models here we shall be combining MeasuredValue and Constraint Entities for the purposes of token prediction. This improves the model performance on the named-entity recognition task, and the Constraint annotations can be recovered by using the Attribute model to predict the presence of constraints (i.e., any MeasuredValue Entity for which a LowerBound or UpperBound Attribute is predicted can be assumed to be a Constraint annotation).

Feed-forward Model
Our first model uses a multilayer perceptron (MLP) neural network to predict BIO labels for each token in a document. This architecture is a natural baseline for experiments with neural models. We step through the document token by token (starting from the beginning), considering each token's word embedding, concatenated with the embeddings of the tokens in a fixed-width window (forward and backward) around the current token, to predict the label for that token. A fixed-length history of previous output predictions is maintained (whose length is equal to the window width), which is also used as input in each prediction step.
A schematic diagram of this model is shown in Figure 2. For a model with a window width, w, we concatenate the token embeddings of the 2w + 1 tokens in the current window (w to either side, plus the current token) along with the previous w outputs (each a 2N + 1 vector representing the BIO Entity labels, normalized using the softmax function) to produce our input. The prediction history is initialized using a trainable vector parameter, and zero-padding is used to account for the window width (as we begin at the first token, not the wth token). The input is then passed through an MLP network, using ReLU activations (Nair & Hinton 2010), to produce our token label prediction. The exact number of layers and neurons in the MLP network is determined via grid search (using a search range of 1-3 layers and neuron counts of 256-1024, with no dropout, and window sizes of 3-15), with the results for the best-performing model shown below.
It should be noted that this model is not a recurrent neural network, despite utilizing the outputs from previous tokens, as the training signal is not allowed to backpropagate between token steps. However, the use of the output label "memory" was found to greatly improve the model performance.
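The construction of the input vector for a single prediction step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_input` is hypothetical, token embeddings are plain lists of floats, and the prediction history is passed in by the caller (in the model it is initialized from a trainable vector, which is replaced here by a uniform placeholder).

```python
def build_input(embeddings, history, index, width):
    """Build the MLP input for the token at `index`.

    embeddings: list of token embeddings (each a list of floats, equal length)
    history:    previous output vectors, each of length 2N + 1
    width:      window width w
    """
    dim = len(embeddings[0])
    features = []
    # 2w + 1 token embeddings: w to either side, plus the current token.
    for pos in range(index - width, index + width + 1):
        if 0 <= pos < len(embeddings):
            features.extend(embeddings[pos])
        else:
            features.extend([0.0] * dim)  # zero-padding beyond the document edges
    # The w most recent label predictions.
    for past in history[-width:]:
        features.extend(past)
    return features


# Toy example: 5 tokens, embedding size 4, window width 2, N = 3 Entity classes.
n_labels = 2 * 3 + 1
toy_embeddings = [[float(i + 1)] * 4 for i in range(5)]
toy_history = [[1.0 / n_labels] * n_labels for _ in range(2)]  # placeholder initial history
first_input = build_input(toy_embeddings, toy_history, index=0, width=2)
```

With embedding size d, the input length is (2w + 1)d + w(2N + 1), which for the toy values above is 5 × 4 + 2 × 7 = 34.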

LSTM Model
Our second model uses an LSTM (Hochreiter & Schmidhuber 1997) architecture followed by a dense output layer. A schematic diagram of the architecture is shown in Figure 3. We chose a bidirectional (Schuster & Paliwal 1997) LSTM model in this case, as information will need to propagate in both directions through the text (for example, it is important if a number is followed by a " ± " sign, as well as whether it is preceded by an equals sign). The exact number of layers and cells in the LSTM network is determined by grid search (with a search range of 1-3 layers and 64-1024 hidden features for the LSTM units, and a dropout value of 0.2), with the best model performance given below.
For this model, the bidirectional LSTM units are passed along the document, and the sequential output from the LSTM (corresponding to each token) is then sent through a dense output layer, giving the desired 2N + 1 output nodes for each time step.
The LSTM units should allow the model to capture longer-distance dependencies between words and phrases, as it is not limited by a fixed-length window. This should create smoother predictions across tokens, as models without any contextual awareness tend to produce very fractured prediction sequences, where many Entities are incomplete and split due to individual missing tokens.

Entity Model Results
A grid search was performed over the hyperparameters for both models, with model performance judged using the F1 score and strict Entity overlap (the proportion of Entities that are exactly predicted by the model, i.e., with no missing or additional tokens) on the holdout test data set. The highest-performing models for both proposed architectures were then selected, and their performance statistics are shown in Tables 5 and 6.
We see that the two models show comparable performance on this task, with the LSTM model proving slightly more effective overall. This suggests that the linguistic markers required to determine the nature of a token are predominantly local, as the LSTM's capacity to examine longer-distance dependencies does not have a particularly large impact on model performance. Indeed, the top-performing feed-forward model uses a window length of only three tokens. However, on [...]
We also note that both models struggle particularly with ParameterName and ObjectName tokens. In the case of ObjectName tokens, this may be explained by the relative sparsity of these Entities in the training data. The difficulties with ParameterName labels, however, are suspected to be due to the intrinsic difficulty of separating these tokens from general physical discussion, as well as the ambiguity in the start and end points of these Entities, as shown by the disagreements experienced between annotators during the creation of the training data (see Section 3.2). As seen in Table 6, the models struggle more with recall for these ParameterName labels (although the precision is also noticeably lower than for other classes), suggesting that the model predictions represent a more conservative view of what constitutes a parameter name. As such, usage of the outputs for search purposes should emphasize parameter symbols to have the best results.
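The strict Entity overlap metric can be sketched as below. This is an illustrative reading rather than the evaluation code used for the tables: Entities are assumed to be represented as (start, end, label) tuples, and the proportion is assumed to be taken over the annotated (gold) Entities.

```python
def strict_overlap(gold, predicted):
    """Fraction of gold Entities reproduced exactly by the predictions.

    Each Entity is a (start_token, end_token, label) tuple; a prediction only
    counts if its boundaries and label both match, i.e., no missing or
    additional tokens.
    """
    if not gold:
        return 0.0
    predicted_set = set(predicted)
    return sum(1 for entity in gold if entity in predicted_set) / len(gold)


# Toy example: the second prediction is one token too long, so it does not count.
gold_entities = [(0, 2, "ParameterName"), (4, 6, "MeasuredValue")]
predicted_entities = [(0, 2, "ParameterName"), (4, 7, "MeasuredValue")]
score = strict_overlap(gold_entities, predicted_entities)
```

Because a single boundary error discards the whole Entity, this metric is considerably harsher than token-level F1.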

Relation Models
For our relation classification task, we have created two models: a neural-network model that considers the two Entities and the span that exists between them (along with a windowed region outside) to classify the Relation that may exist between them; and a rule-based model, which does not use any neural-network techniques, but instead relies on hand-coded heuristics. This rule-based approach was not possible previously, as we did not have access to the token-level predictions from the Entity model that are the basis for the heuristics. We are experimenting with both approaches in order to best explore the possible benefits of the neural model against the interpretability of the rule-based model, to better contextualize model performance.

Neural Relation Model
For this model, we consider each potential pair of Entities separately, also considering both possible directions of the Relation (as most Relations are directional, and so A → B ≠ B → A in most cases). For every pairing of Entities, E m and E n (where m < n), we have certain obvious information available: the tokens comprising each Entity span, the labels of these Entities, the tokens of the span between the two Entities, and the labels of the tokens in that span. Additionally, we will use an outer window around the two Entities (i.e., a fixed-length span of tokens that lie outside the Entities and their connecting span) as input into the model, along with any Entity labels that may apply. With this, we have five spans of tokens (akin to Hashimoto et al. 2015). To account for possible directionality of the Relation, we also include a bit indicating whether the Relation runs from the earlier to the later Entity, or vice versa, and evaluate the available spans twice, with differing values for this "direction bit." The output of the model is an (N + 1)-dimensional vector, where N is the number of Relation labels we are considering (plus one for a "none" label).
We now encounter a problem in that there is no predetermined fixed length for the Entity and connecting spans; they can have any number of tokens (even zero, in the case of the connecting span). As such, we require a way of converting these variable-length token matrices (produced by concatenating the token embeddings) into fixed-length representations. We have chosen to use an LSTM for this purpose, where the fixed-length representation is the hidden state of the LSTM at the final time step (token). Other approaches were experimented with, notably the strategy of taking minimum, maximum, and mean values along the time axis (i.e., the document length) to produce fixed-length summary vectors. However, this approach suffers with long-distance dependencies, and was outperformed by the LSTM summarization.
A schematic diagram of this model is shown in Figure 4. The token embeddings in each of the five spans are concatenated with their BIO token labels, and each span is passed through the same bidirectional LSTM network. The hidden state of the LSTM at the final time step is used as a fixed-length representation of the span, and these five vectors are concatenated, along with the direction bit and the Entity labels for the start and end points of the proposed Relation (as one-hot encodings), and passed through a final dense output layer.
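The enumeration of classification candidates described above can be sketched in a few lines. The function name and the tuple layout are assumptions for illustration; Entities are taken to be already sorted by their position in the text.

```python
from itertools import combinations


def candidate_pairs(entities):
    """Yield (earlier, later, direction_bit) for every Entity pair.

    Each unordered pair (E_m, E_n) with m < n is emitted twice, once per
    value of the direction bit, so n Entities give n * (n - 1) candidates.
    """
    for earlier, later in combinations(entities, 2):
        for direction in (0, 1):
            yield earlier, later, direction


# Toy example with three Entities (labels only, for brevity).
toy_entities = ["ParameterSymbol", "MeasuredValue", "ConfidenceLimit"]
candidates = list(candidate_pairs(toy_entities))
```

Since the number of candidates grows quadratically with the number of Entities while true Relations are comparatively few, most candidates carry the "none" label.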
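For comparison, the pooling alternative mentioned above (minimum, maximum, and mean along the time axis) can be sketched as follows. The function name and the handling of empty spans (a zero vector) are assumptions for illustration; the LSTM summarization that was ultimately used is not reproduced here.

```python
def pool_summary(span, dim):
    """Summarize a variable-length span as a fixed-length vector.

    span: list of token vectors, each of length `dim` (may be empty)
    Returns the per-dimension minimum, maximum, and mean, concatenated,
    giving a vector of length 3 * dim regardless of the span length.
    """
    if not span:
        return [0.0] * (3 * dim)  # assumed convention for an empty connecting span
    columns = list(zip(*span))  # one tuple of values per embedding dimension
    minimum = [min(col) for col in columns]
    maximum = [max(col) for col in columns]
    mean = [sum(col) / len(col) for col in columns]
    return minimum + maximum + mean


# Toy example: a span of two 2-dimensional token vectors.
summary = pool_summary([[1.0, 4.0], [3.0, 2.0]], dim=2)
```

Such a summary discards token order entirely, which is consistent with the observation that it handles long-distance dependencies poorly.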
As with our Entity models, we use a trainable projection matrix to increase the model's capacity, and we zero-pad the document to account for the windowed area.

Rules-based Model
It is also useful to produce a rule-based model as a baseline for this relation classification task, in order to determine if a more complex trained model is justified, as certain tasks are sufficiently tractable to be solved by much simpler models. Hence this model is hand-crafted from observations of the available data to produce a robust set of rules that can predict Relations between labeled Entities in a document (as opposed to the trained statistical models previously discussed). By creating a heuristic model such as this, we allow ourselves to determine a baseline for model performance based on human intuition and knowledge of the domain. Without such a baseline, we have no way of contextualizing model performance against a more easily interpretable algorithm. This model uses two primary approaches: searching for patterns in the text between the two Entities (only practical for very short-distance Relations), and using the patterns of Entities within sentences (ignoring individual tokens) to propose Relations that may exist between them.
For example, for examining the text between Entities, if we have a ParameterSymbol annotation that is followed by a MeasuredValue annotation, and the span of text between these two Entities is "=" (ignoring any whitespace that may exist between them), then we can safely assume that the measurement is related to the symbol by a Measurement Relation. There are other obvious connecting strings, such as "sim" or "approx," and similar strings for other Entity type pairings (e.g., "<" and ">" for ParameterSymbol and Constraint Entities).
However, this is insufficient for more complex sentences. For example, an author reporting multiple possible values for a physical parameter (e.g., dependent on different physical assumptions) may write a sentence of the form: "If we make assumption X, we find a value for A of 1.5, yet including assumption Y we find a value of 2.0." We observe that this pattern of ParameterSymbol followed by multiple MeasuredValue Entities is quite common, and so we can search for sentences that contain this pattern of Entity annotations, without needing to consider the constituent tokens (i.e., ignoring the textual content, and using only the order of Entities in the sentence). Similarly, a sentence that contains multiple measurements will often have a single ConfidenceLimit Entity after all the values have been stated. Hence, we assume that any sequence of MeasuredValue Entities followed by a single ConfidenceLimit Entity can be linked such that each MeasuredValue is connected by a Confidence Relation to that ConfidenceLimit.
A full list of the rules and patterns used for this model may be found in Appendix C.
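Two of the heuristics described above can be sketched as follows. This is an illustrative approximation only, not the rule set of Appendix C: the function names, the (label, start, end) Entity representation with character offsets, and the exact connector set are all assumptions.

```python
CONNECTORS = {"=", "sim", "approx"}  # connecting strings named in the text


def measurement_by_connector(text, entities):
    """Link a ParameterSymbol to the MeasuredValue that directly follows it
    when the text between them (ignoring whitespace) is a known connector.

    entities: list of (label, start, end) tuples in document order.
    Returns (relation, symbol_index, value_index) tuples.
    """
    relations = []
    for idx in range(len(entities) - 1):
        first, second = entities[idx], entities[idx + 1]
        if first[0] == "ParameterSymbol" and second[0] == "MeasuredValue":
            between = text[first[2]:second[1]].strip()
            if between in CONNECTORS:
                relations.append(("Measurement", idx, idx + 1))
    return relations


def confidence_by_pattern(labels):
    """Link every MeasuredValue in an uninterrupted run to the single
    ConfidenceLimit that follows the run, using only the order of labels."""
    relations, run = [], []
    for idx, label in enumerate(labels):
        if label == "MeasuredValue":
            run.append(idx)
        elif label == "ConfidenceLimit":
            relations.extend(("Confidence", value_idx, idx) for value_idx in run)
            run = []
        else:
            run = []  # any other Entity breaks the pattern
    return relations


toy_text = "H_0 = 70"
toy_entities = [("ParameterSymbol", 0, 3), ("MeasuredValue", 6, 8)]
connector_links = measurement_by_connector(toy_text, toy_entities)
pattern_links = confidence_by_pattern(
    ["ParameterSymbol", "MeasuredValue", "MeasuredValue", "ConfidenceLimit"]
)
```

The first function inspects the text between Entities, while the second ignores the text entirely and works only on the sequence of Entity labels within a sentence, mirroring the two approaches described above.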

Relation Model Results
Table 7 shows the results of the top-performing model from our model search (performed as a grid search over model hyperparameters), along with the corresponding performance from our rule-based model. Model performance was again judged using the global F1 score calculated on the holdout test data.
Figure 4 (caption continued). Here, t n and y n indicate the token embedding and Entity label prediction for the nth token, respectively, D is the direction bit indicating the direction of the Relation in the text, R is the Relation prediction for this Entity pair and direction, and w is the window width. The two Entities in question run from tokens i to j, and from tokens k to l. The h t−1 notations indicate that it is the hidden state from the final time step that is used as the output from the LSTM nodes.
The best-performing neural model had an F1 score of 0.976, compared with 0.977 for the rule-based model. However, the similarity of these results is misleading, due to the heavy class imbalance in favor of the "none" label (due to the large number of possible Entity pairings). If we examine the per-class performance metrics for the models, we can see that the neural model suffers significantly in comparison to the rule-based approach, only achieving superior performance for the Measurement Relation. From observation, we find that the neural model struggles significantly with anything but the shortest Relations, where the Entities are very close to one another in the text, separated by only a few tokens. However, the rule-based model shows good performance across the desired Relation labels, and so we shall be utilizing this model for our final processing.
For the rule-based approach, the biggest issue remains the Property Relation. This Relation is by far the most long-distance, often covering nearly the entire span of the text. As we are dealing with article abstracts here, it is common to have an object referenced at the beginning of the text, often in the first sentence (e.g., "We examine the supernova SN 1998bu..."), followed by a description of the experimental approach, and then finally a concluding sentence stating the final result ("We find a peak luminosity of..."). This long-distance nature negates much of the sentence-level pattern matching we have leveraged for the rule-based approach. Additionally, if multiple celestial objects are mentioned in this way, or with some other oblique reference later in the text, it can be hard to distinguish which measurement belongs to which object using simple patterns. As such, the required simplifying assumptions produce a very low quality of predictions for the Property Relation.
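The effect of class imbalance on a global score can be illustrated with a small synthetic example. The labels and counts below are invented purely for illustration and are not taken from Table 7; the global F1 is assumed here to be a micro-average, which for single-label predictions over all classes is equal to the accuracy.

```python
def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, and false-negative counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def per_class_f1(gold, predicted):
    """F1 score for each label appearing in the gold or predicted labels."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        scores[label] = f1(tp, fp, fn)
    return scores


def micro_f1(gold, predicted):
    """Micro-averaged F1 over all classes (equals accuracy in this setting)."""
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


# Synthetic data: 97 "none" pairs and 3 "Property" pairs; the classifier
# predicts "none" for everything.
toy_gold = ["none"] * 97 + ["Property"] * 3
toy_predicted = ["none"] * 100
overall = micro_f1(toy_gold, toy_predicted)
by_class = per_class_f1(toy_gold, toy_predicted)
```

In this toy case the global score is 0.97 even though the F1 score for the "Property" class is zero, which is why the per-class metrics are the more informative comparison.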

Attribute Models
For predicting Attributes, we are considering only one model architecture, due to the relative simplicity of the problem. For this model, we are only predicting Attributes relating to Constraint values (LowerBound and UpperBound), due to the relative sparsity of the other Attribute labels in the training set, and so we only consider MeasuredValue Entities when making predictions (as Constraint values are not directly predicted, but rather inferred from the presence of Attributes). For example, in "...finding x < 0.5 for...," we would assign an UpperBound Attribute label to "0.5." It should be noted that here each MeasuredValue Entity is considered as an individual classification task.
A schematic diagram of this model is shown in Figure 5. For this model, we examine the tokens of the Entity itself, along with a fixed-width window around the Entity in question in both directions, and utilize a bidirectional LSTM layer to process these sequences of tokens. As for our Relation model, we use an LSTM to account for the variable-length sequences we will encounter. The LSTM is used despite the fact that only the Entity token sequence is of variable length (both window sequences are fixed in length), as training on all sequences increases the training signal through the LSTM cells. As before, the Word2Vec embeddings are projected using a trainable projection matrix, and the predicted Entity label for each token is concatenated onto this projected embedding. The concatenated LSTM outputs (hidden state at final time step) are then passed through a densely connected layer, producing the final output.
Using this model, we achieve the results shown in Table 8, using a grid search over model hyperparameters. These correspond to an overall F1 score of 0.98. These results are considered to be of a reasonable quality to be used in our final pipeline.

Postprocessing
With our models trained, we combine their outputs and utilize them for prediction. For a given abstract, we first predict the presence of Entities in the text, by converting the token-level BIO Entity predictions into full Entity spans. This is done by simply identifying contiguous spans of tokens with the same predicted class, using "begin" tokens to identify the start of such sequences in cases where there are no separating Outside tokens. If no "begin" token is present, the first "inside" token is assumed to be the beginning of the Entity. Next, any MeasuredValue Entities are evaluated using the Attribute model to determine if they should be annotated with UpperBound or LowerBound Attributes, or simply left as MeasuredValue annotations. If the Attribute model returns an appropriate prediction, the MeasuredValue label is changed to a Constraint label, with the appropriate bound Attribute. Finally, the Relation model is used to predict the presence of any Relations between the predicted Entities.
However, as is generally the case when dealing with natural language, the prediction outputs are not always as clean as we would desire, especially in this context, where the textual entities we are searching for may be highly structured and brittle against minor errors (missing braces, for example). As such, postprocessing steps are applied to the predictions before they are stored in a database, to remove obvious noise and false positives. Here, we are dealing only with simple and glaring errors, rather than attempting to solve more subtle issues.
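The conversion from token-level BIO predictions to Entity spans, and the subsequent relabeling of constrained values, can be sketched as follows. The label string format ("B-Class", "I-Class", "O"), the function names, and the span representation are assumptions for illustration rather than the pipeline's actual interfaces.

```python
def bio_to_spans(labels):
    """Convert BIO token labels into (start, end_exclusive, class) spans.

    A new span starts at a "B-" token, or at an "I-" token whose class differs
    from the current span (including the case where no "B-" token was seen).
    """
    spans = []
    start, current = None, None
    for idx, label in enumerate(labels):
        if label == "O":
            if current is not None:
                spans.append((start, idx, current))
            start, current = None, None
            continue
        prefix, _, cls = label.partition("-")
        if prefix == "B" or cls != current:
            if current is not None:
                spans.append((start, idx, current))
            start, current = idx, cls
    if current is not None:
        spans.append((start, len(labels), current))
    return spans


def relabel_constraint(cls, attribute):
    """Turn a MeasuredValue into a Constraint when a bound Attribute is predicted."""
    if cls == "MeasuredValue" and attribute in ("UpperBound", "LowerBound"):
        return "Constraint"
    return cls


toy_labels = [
    "O", "B-ParameterSymbol", "I-ParameterSymbol",
    "B-MeasuredValue", "I-MeasuredValue", "O", "I-ObjectName",
]
toy_spans = bio_to_spans(toy_labels)
```

In the toy sequence, the "B-MeasuredValue" token separates two adjacent Entities without an intervening Outside token, and the final "I-ObjectName" token opens a span despite the missing "begin" token.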
Full details of the postprocessing steps applied to Entity and Relation annotations may be found in Appendix D. We note that no postprocessing steps are applied to Attribute annotations (other than the Entity label replacement discussed previously).

Results
In this section, we demonstrate a series of search queries on the model predictions for a variety of cosmological parameters. These will serve as examples of the kind of data sets that may be produced from these outputs.

Comparison with Rules-based Model
To begin our analysis of the processed neural model predictions, we compare the results to those of our previous work in Crossland et al. (2020), which utilized a rule-based approach for identifying measurements based on a list of query strings. This previous work focused on extracting measurements of the Hubble constant, H 0, chosen for this parameter's well-defined name and symbol as well as the use of a commonly accepted standard unit for the quantity (km s −1 Mpc −1). The simplicity of the parameter identifiers was essentially a requirement of the approach, given that exact string matching was used in the algorithm. Our new approach should be capable of distinguishing all of the measurement patterns already identified in the rule-based approach, while also extending beyond these rigid (and hand-coded) patterns to encompass a more diverse range of writing styles.
For the rule-based model, we use the data from Figure 4 in Crossland et al. (2020), which used the following keyword strings for the search:
1. Hubble constant.
2. Hubble parameter.
3. H 0: written "H_0," "H_{0}," "H_o," "H_{o}," "H_circ," or "H_{circ}."
For the neural model, we use a database of measurements created from the outputs of the final trained models from Section 4, and we use the same keyword strings to extract measurement instances (it should be noted that the symbol normalization discussed in Section 4.8 will make some of the above symbol strings degenerate). This produces data sets as follows: 2228 data points for the rule-based model, and 872 data points for the neural model.
After this initial search, both data sets have some additional constraints placed on them:
1. We require that the measurement have units compatible with km s −1 Mpc −1. This leaves 584 and 578 data points for the rule-based and neural models, respectively.
2. We require that the measurement have a stated uncertainty or (in the case of the neural model) be a constraint value. This has the effect of reducing noise in the result set as well as removing assumed or literature values, which are often reported without an accompanying uncertainty.
This leaves us with the following data sets: 299 samples from the rule-based model and 314 samples from the neural model, all with the correct units and a provided uncertainty or bound. The outputs of the models are displayed as time series (by publication date) in Figure 6.
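The proportion of initially identified data points that survive both selection cuts follows directly from the counts quoted above. The short sketch below reproduces that arithmetic; the function and variable names are chosen here for illustration.

```python
def survival_percent(initial, final):
    """Percentage of initially returned data points remaining after all cuts."""
    return 100.0 * final / initial


# Counts as stated in the text: (initial search, after both cuts).
rule_based_survival = survival_percent(2228, 299)
neural_survival = survival_percent(872, 314)
```

Rounded to the nearest percent, these evaluate to 13% for the rule-based model and 36% for the neural model.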
From the effects of these cuts on the number of returned data points, we observe the following: The neural model is far more selective when identifying potential measurements in the text, finding far fewer potential spans initially. However, the identified spans are shown to be more grammatically relevant to the query phrases ("Hubble constant," "H 0," etc.), given that a higher proportion survive our selection cuts using our existing knowledge of the Hubble constant (i.e., unit and required uncertainty): 13% for the rule-based model versus 36% for the neural model.
Figure 5. A schematic diagram of the Attribute model. The Bi-LSTM nodes here refer to the same LSTM network, passed over each span of tokens individually. Here, t n and y n indicate the token embedding and model output for the nth token, respectively, i and j index the start and end tokens for the Entity in question, and A is the Attribute label prediction for that Entity. The h t−1 notations indicate that it is the hidden state from the final time step that is used as the output from the LSTM nodes.
With these data collected and cross-referenced, we find an overlap of 261 samples, with 39 samples identified by the rule-based model that the neural model did not recover, and likewise 53 samples that only the neural model found. Most interesting out of these samples are the instances where only one model identifies a measurement, as they highlight gaps in the models' comprehension. To investigate this further, the textual spans for both data sets have been manually examined, and the following recurring failure states are noted (the examples reference those found in Table 9):
1. As seen in Crossland et al. (2020), the rule-based model fails on a number of trivial cases, for instance the presence of additional, unrelated numbers in the text (as in Example 1), or more verbose language causing separation of keyword and measurement (as the model selects the closest measurement in the same sentence by character distance). We note that, in Example 1, the final ParameterSymbol will be incorrectly identified as "H (z", which cannot be parsed into a complete symbol without postprocessing corrections. Many of these cases can be caught by the neural model; however, long-distance and multisentence Relations continue to pose a problem for both models.
2. The rule-based model cannot distinguish stand-alone symbols from symbols as part of a larger span (indeed, no distinction is made in the keyword list between names and symbols at all). As such, it may misidentify instances of symbol search strings inside compound symbols, such as in Example 2. The neural model, however, looks at all tokens in context, and is not limited to a fixed set of symbols, and so it will (ideally) identify the whole symbol span, as in Example 2 where it correctly identifies the full symbol (and therefore does not return the MeasuredValue for our Hubble constant search, which specifies "H _ { 0 }" rather than "H _ { 0 } ^ { −1 }").
3. Stray LaTeX macros or other typographical anomalies can cause the regular expression patterns used by the rule-based model to miss potential measurements in the text, as for the failure in Example 3 for the rule-based model, where the unit string has been missed (and so the measurement is incomplete). The neural model, however, is more robust to these LaTeX irregularities, and it successfully annotated Example 3.
4. However, the neural model does stumble on certain styles of measurement reporting, most notably on brackets ("()") present in the middle of both measurements and symbols, as in Example 4. This confusion is understandable, given that brackets often denote the beginning or end of an Entity annotation, and hence we can expect the model to be biased toward classifying bracket tokens as None tokens (i.e., not belonging to any class), or transition from a run of tokens of one Entity type to another. This can either cause an Entity annotation to be incomplete, missing important tokens at the beginning and/or end, or split into multiple such incomplete annotations. For instance, in Examples 4 (Neural Model) and 5 the MeasuredValue Entity spans should be single MeasuredValue annotations, but they have been incorrectly identified as two separate spans due to the "(or" and "(random" tokens being labeled None by the model. This means that, while having the correct token labels, the two Entity spans cannot represent the actual value of the measurement.
5. A notable point of failure for the neural results is the manner in which symbols are currently matched in the database: namely by using an exact match against the normalized symbol string (see Section 4.8). This leads to accurate annotations being ignored in our query in cases where a slight variation on the standard symbol has been used. An example of this can be seen in Example 6, where the symbol "H _ { 0 } { (EPM ) }" has been correctly classified (as the bracketed portion was presumably intended as part of the symbol by the author, here describing a methodology for the measurement), but does not exactly match the query string "H _ { 0 }."
6. Finally, the neural model suffers more broadly from uncertain classification of tokens at the beginning and end of Entities, commonly resulting in one or two missing or added tokens. Especially in the case of braces, where incomplete braces present a nontrivial postprocessing issue, this can have a serious impact on parsing of symbols and measurements. This is especially true for symbols, where braces can imply sophisticated mathematical relations in composed symbols. This can be seen in Example 1, where the ParameterSymbol text contains unbalanced braces (as the numerical value has been incorrectly labeled as a MeasuredValue).
From these observations and the results of our comparison of the model outputs, we conclude that the neural model is capable of catching the large majority of cases covered by the rule-based model, and it has the capacity to distinguish far more complex linguistic and typographical patterns than the rigid rule-based approach by considering token context. However, manual examination of the model outputs shows that the neural model also suffers from incorrect classification of Entities, resulting in problems similar to those seen in Crossland et al. (2020). As such, we have not yet moved beyond the requirement for some prior knowledge from the user to filter and refine the search results.

Discussion
From the collected data shown in Figure 6, we may also note the presence of certain trends in the measurement values of the Hubble constant.Of particular interest is the spike in reported measurements over the last few years, which could not be seen in the data set used previously.This is thought to be due to the Note.The rule-based model does not use an annotation schema, and so the identified spans have been simply labeled "Keyword" or "Measurement," whereas the examples from the neural model use the annotation labels from Section 3. high-profile tension that has arisen in recent years between the early-and late-Universe determinations of the Hubble constant.Measurements based on the early Universe, notably measurements from the CMB by the Planck mission (Planck Collaboration et al. 2020), give a consistently lower value for H 0 , approximately 67 km s −1 Mpc −1 .Whereas late-Universe measurements, generally using standard candles such as Cepheids and Type Ia supernovae (and other, more novel objects, such as miras, masers, and lensing objects), lead to values slightly above 70 km s −1 Mpc −1 -these measurements also having become more prevalent lately, with the release of data from projects such as the Gaia mission (Gaia Collaboration et al. 2016).
Over the last decade, the measurement uncertainties on values for the Hubble constant have been decreasing (as can be seen in Figure 6), and with the publication of the results from the Planck Collaboration et al. (2020), the >3σ tension between these two epochs has become the topic of much debate (Riess 2020).In our results here, we may see this narrative unfold, from the decreasing uncertainties through to the explosion in the number of reported measurements after 2018 (see the time-axis histograms in Figure 6).This tension may be clearly seen in our model outputs7 from the two distinct peaks in the distribution of H 0 values in Figure 6 (see vertical axis histograms).
In order to better visualize the changing understanding of H 0 , we have used the extreme-deconvolution (XD) algorithm (Bovy et al. 2011) to fit Gaussian mixture models on overlapping 5 yr bins of the search results, as shown in Figure 7.This algorithm uses the stated uncertainties of the measurements in the fit, giving a better representation of the consensus value in the considered period.The Akaike information criterion (Akaike 1974) is used to determine the optimal number of components for the mixture models.From these fits, we clearly observe the decreasing measurement uncertainty in H 0 over time, followed by the bifurcation in the distributions after the Planck results.
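The model-selection step can be sketched as follows, using the standard definition AIC = 2k − 2 ln L, where k is the number of free parameters and L is the maximized likelihood. The parameter count assumes a one-dimensional Gaussian mixture (one mean and one variance per component, plus the free mixture weights), and the log-likelihood values are invented purely for illustration; the XD fit itself is not reproduced here.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: lower values indicate a preferred model."""
    return 2 * n_parameters - 2 * log_likelihood


def n_params_1d_mixture(n_components):
    """Free parameters of a 1D Gaussian mixture: means, variances, and weights
    (the weights sum to one, so only n_components - 1 of them are free)."""
    return 3 * n_components - 1


def best_n_components(log_likelihoods):
    """Pick the number of components with the lowest AIC.

    log_likelihoods: dict mapping number of components to the maximized
    log-likelihood of the corresponding fit.
    """
    return min(
        log_likelihoods,
        key=lambda k: aic(log_likelihoods[k], n_params_1d_mixture(k)),
    )


# Hypothetical maximized log-likelihoods for fits with 1, 2, and 3 components.
toy_fits = {1: -120.0, 2: -110.0, 3: -109.5}
chosen = best_n_components(toy_fits)
```

In this toy case the third component improves the likelihood only marginally, so the penalty term favors the two-component model.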
We also see additional interesting features, such as the sudden increase in high-uncertainty values reported during this recent spike in popularity. Examination of these papers shows that this is due to various novel experimental techniques being explored in order to resolve the tension, such as gamma-ray burst supernovae (Cano 2018), active galactic nuclei (Turner & Shabala 2019; Wang et al. 2020), luminous red galaxies (Sridhar et al. 2020), and lensing objects (Birrer et al. 2020; Denzel et al. 2021). There is also a notable number of uses of gravitational-wave signals (Fishbach et al. 2019; Hotokezaka et al. 2019; Soares-Santos et al. 2019; Howlett & Davis 2020; Nicolaou et al. 2020; Palmese et al. 2020; Vasylyev & Filippenko 2020) to determine values for the Hubble constant. We can see that, in addition to the raw numerical values returned by our search (which can aid in more advanced statistical analysis, as in Press 1997), there are rich possibilities with these data for analysis of uptake of ideas and techniques within the astrophysics community.

Application to Other Cosmological Parameters
Having shown that our new model can perform well compared to our baseline on a well-structured case, we move on to more challenging examples. We note from our examination of the Hubble constant that filtering our result set by a known unit is a very effective way of identifying incorrect samples (especially for the Hubble constant, which has a rather specific common expression for its dimensionality, as opposed to something more generic, e.g., kelvin or kiloparsec). However, there are many interesting quantities with more common units or, indeed, dimensionless quantities.
Notably, the dimensionality filtering for the Hubble constant had far less impact on the result set from the neural model, with a drop-off in samples of only 33% for this step, in comparison to 74% for the rule-based model. This suggests, as noted previously, that the neural model is already far more selective when identifying measurement spans in the text, and hence relies less on postprocessing to identify candidate measurements.
Furthermore, the availability of both a common, well-defined name and symbol for the Hubble constant is a special case in the scientific literature, and we must extend beyond this if we hope to produce a useful tool for the community.
We shall now present some test cases that emphasize this more challenging regime. For this, we have chosen a set of cosmological parameters, as they are quantities of interest in the scientific community with a relatively large number of reported measurements, and they exhibit the challenging features mentioned above. We use the following list of parameters as case studies:
1. Ω M , the ratio of the present matter density to the critical density;
2. Ω Λ , the cosmological constant as a fraction of the critical density at the current epoch;
3. σ 8 , the amplitude of mass fluctuations;
4. Ω b h 2 , the baryon density parameter;
5. n s , the primordial spectral index;
6. ∑m ν , the sum of neutrino masses;
7. Ω k , the present curvature energy density; and
8. w 0 , the equation of state parameter for dark energy.
Fiducial values for each of these parameters may be found in Table 10; these values correspond to those reported by Planck Collaboration et al. (2020).
These parameters present a variety of interesting challenges to our models.
Many of the parameters in question lack a well-defined name. This is not to say that they do not have established naming conventions, but rather that these conventions present greater challenges than a moniker such as "the Hubble constant." For example, the word "curvature" is relatively generic in the context of astrophysics at large, but in the context of a cosmology paper it could reasonably be used with very little other explanatory information to refer to Ω k . So, something more specific would be required. Yet the phrase "spatial curvature of the Universe" has large potential for linguistic variability. This means we require our model to be able to identify grammatically significant sequences of tokens in the text, rather than simple name phrases like "Hubble constant."
Additionally, many of the symbols for these parameters are commonly found in compound expressions, which proved a major stumbling block for our initial keyword search. For example, we wish to be able to distinguish between the Hubble age expressed as "$H_0^{-1}$" and the Hubble constant expressed as "$H_0$," or between expressions such as "$\Omega_M$" and "$\Omega_m h^2$." For this, once again, we must not only find the tokens of interest but also take account of their context in the sentence.
Finally, the majority of these parameters are dimensionless. This presented a major hurdle to our previous approaches for identifying measurements, as filtering candidate spans by stated units was an important step in reducing noise in the result set.
With these parameters, we will show both the power of our model and also the utility of our framework and how it may be used to intelligently search through the collected data to find sets of measurements relating to a certain physical quantity.
From this query, we find 1408 candidate measurements. Examination of the measurements and their associated names and symbols shows some false positives: for example, "baryonic mass density parameter" and "amplitude parameter of the matter density fluctuations" are incorrectly identified by our inclusion-based string matching (for ParameterNames). However, the large majority of cases display sensible name/symbol combinations. The mean value of parsed measurements is 0.385, with a median value of 0.3. If we now add the stipulation that measurements must provide an uncertainty to be included in the result set, we find 449 values with a mean value of 0.297 and a median of 0.28. A plot of these identified measurements (uncertainty required), by publication date, is shown in Figure 8, along with Gaussian mixture models fitted using the XD algorithm (in the same manner as the H 0 plots). The figure shows a clear peak in the measurement distribution at a value of approximately 0.3, as expected from the known history of Ω M , and demonstrates the varying trend in the community's measurements of the parameter over the last two decades. It should be noted that there is no distinction made in the search query or the plot between measurements that assume a spatially flat Universe and those that do not.
For comparison, we have also plotted the results of this same query using the rule-based model in Figure 8. While the same general trends are observed in both plots, there is a broader distribution of outliers visible in the rule-based results. This is clearly visible in the fitted distributions, which are much more confined for the neural results. We also note that the neural model produces a smaller number of results overall (449 for the neural model versus 645 for the rule-based model), along with a mean value closer to the expected result (0.297 for the neural model versus 0.357 for the rule-based model). This further shows that the neural model has better intrinsic selectivity than the rule-based model, without the need for filtering based on dimensionality.
Figure 8 demonstrates the community's understanding of Ω M over the last two decades. The most decisive event appears to be the publication of the WMAP results from the First (Spergel et al. 2003) and Three-Year (Spergel et al. 2007) data releases. The years following these landmark papers see a much more confined region for the proposed values of Ω M than the preceding years. This is especially true throughout the majority of 2004, where the publications present values with tighter constraints than in the surrounding years. Considering that these publications utilize different data sources and techniques, including combinations of supernova and X-ray observations (Zhu et al. 2004; Zhu & Alcaniz 2005), large-scale structure with supernovae data (Odman et al. 2004), Chandra observations of clusters (Allen et al. 2004), combining the integrated Sachs-Wolfe effect and supernovae data (Gaztañaga et al. 2006), and SDSS data (Abazajian et al. 2005), yet still find observations in such tight agreement, it is possible we are seeing a period of confirmation bias here. After the WMAP Three-Year data release, however, we see a period of relatively stable values and constraints on the value of Ω M , which exhibits a slight trend toward increasing values over time. An exception to this is the 2014-16 period, where a number of observations with much larger uncertainties may be seen. The use of lensing data appears to be a contributing factor to these measurements (Collett & Auger 2014; Jiménez-Vicente et al. 2015; Liu et al. 2015; Caminha et al. 2016), in addition to the innovative use of SDSS results, including the Alcock-Paczyński test with cosmic voids (Mao et al. 2017) and utilizing H II regions as standard candles (Wei et al. 2016). Following this period, we once more see a return to a relatively stable understanding of the quantity, yet with more variation between reported measurement values (as shown by the fitted distributions), with a trend toward a slightly higher value over time, starting from the ∼0.281 WMAP value (Hinshaw et al. 2013).
As a complement to our previous example, we examine the cosmological constant as a fraction of the critical density, Ω Λ . Here, we use the following search parameters:
1. Symbol: "Omega _ { Lambda }"
2. Unit: Dimensionless.
3. Value range: 0 ≤ x ≤ 1.
We find 421 results, with a mean value of 0.592 and a median of 0.7. Without requiring uncertainties, we find that more than half of the returned values are assumed values for the parameter (generally without provided uncertainties) clustered at the values 0.0 and 0.7. The usage of these assumed values appears to drop off after 2004 for 0.0, and after 2007 for 0.7. Requiring uncertainties, we find 88 values with a mean of 0.713 and a median of 0.712. A time-series plot of these measurements is shown in Figure 9 (again, no distinction is made in the search query between measurements reported assuming a spatially flat Universe and otherwise).
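A query of this kind (symbol, dimensionless unit, value range, and an optional requirement on uncertainties) can be sketched as a simple filter followed by summary statistics. The record layout, the function name, the inclusive range bounds, and the sample values below are all hypothetical; they illustrate the filtering logic only and do not reproduce the database behind our results.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class Measurement:
    symbol: str                   # normalized LaTeX-style symbol string
    value: float
    uncertainty: Optional[float]  # None when no uncertainty was reported
    unit: Optional[str]           # None for dimensionless quantities


def query(records, symbol, low, high, require_uncertainty=False):
    """Select dimensionless measurements of `symbol` with low <= value <= high."""
    hits = [
        r for r in records
        if r.symbol == symbol and r.unit is None and low <= r.value <= high
    ]
    if require_uncertainty:
        hits = [r for r in hits if r.uncertainty is not None]
    return hits


# Made-up records, for illustration only.
records = [
    Measurement("Omega _ { Lambda }", 0.7, None, None),    # assumed value
    Measurement("Omega _ { Lambda }", 0.0, None, None),    # assumed value
    Measurement("Omega _ { Lambda }", 0.72, 0.03, None),
    Measurement("Omega _ { Lambda }", 0.68, 0.02, None),
    Measurement("Omega _ { Lambda }", 1.4, 0.2, None),     # outside the range
    Measurement("H _ { 0 }", 70.0, 2.0, "km s^-1 Mpc^-1"),  # different symbol
]

all_hits = query(records, "Omega _ { Lambda }", 0.0, 1.0)
with_unc = query(records, "Omega _ { Lambda }", 0.0, 1.0, require_uncertainty=True)
mean_with_unc = mean(r.value for r in with_unc)
median_with_unc = median(r.value for r in with_unc)
```

Requiring an uncertainty removes the assumed values at 0.0 and 0.7 in this toy data set, which mirrors the effect seen in the real result set.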
Here, we also see trends in the community's understanding of this value. A particularly striking change is the drop-off in values reported as upper or lower limits (i.e., constraints) on Ω Λ (e.g., "Ω Λ > 0.5"), in favor of central values with uncertainties (e.g., "0.7 ± 0.1"), coinciding with the WMAP Three-Year Data Release (Spergel et al. 2007). It would appear that the influence of the WMAP data led to an acceptance of better constraints among the community, and hence a shift away from reporting Ω Λ as a constraint. Additionally, we once again see an increase in measurement uncertainties during the 2014-16 period. The publications in question make use of galaxy cluster and quasar observations (Mantz et al. 2014; Risaliti & Lusso 2015; Caminha et al. 2016; Bonvin et al. 2017), galaxy halo models (Conselice et al. 2014), and gamma-ray bursts (Wang et al. 2016). Given the timing of these publications, it is quite possible that this additional debate around the value is related to the release of the Planck 2015 results (Ade et al. 2016), possibly both in preparation (or anticipation) and in response.
There is also an interesting value reported by Ostriker & Steinhardt (1995), an early exploration of dark energy cosmology models using observational constraints. This publication appears to be several years ahead of the Nobel prize measurement of Ω Λ (Perlmutter et al. 1998; Schmidt et al. 1998), and it has perhaps not received a proportional amount of attention.
For this query, we find 410 samples, with a mean of 0.828 and median of 0.803. Requiring uncertainties, we have 235 samples, with a mean of 0.808 and median of 0.802. There is little consensus among the result set on a ParameterName string for this quantity, which is unsurprising, given the high linguistic variability seen for this parameter's name. A plot of the collected measurements is seen in Figure 10. Here, we see a clear convergence over time to a value of ∼0.8, as expected from the current understanding of the value of σ 8 , with seemingly minimal tension across the years. A slight downward trend in the value of σ 8 is observed since around 2005. Additionally, there is a clear drop-off in the number of reported measurements over the years since 2010. This is possibly due to an uptake in the use of S 8 (given by $S_8 = \sigma_8 (\Omega_M / 0.3)^{0.5}$) over σ 8 in the literature.
This results in 86 measurements with provided uncertainties, with a mean of 0.0215 and a median of 0.022, as shown in Figure 11. There is a clear consensus reached around 2003, possibly due to the WMAP publication in that year. This result demonstrates that the model can identify compound symbols in the text (i.e., parameter symbols composed of more than one syntactic component).
Here, experimentation was required to find a clean result set, because the symbol "n" (as is sometimes used for primordial spectral index) is far too common to be of use in discriminating the desired measurements from other parameters. Using a simpler name for the parameter also led to a more productive search (as many instances in cosmology papers only state "spectral index," rather than "primordial spectral index"). This search resulted in 100 measurements with provided uncertainties, with a mean of 0.972 and a median of 0.967. The plot for this result set is shown in Figure 12.
A notable feature of this plot is the large number of constraint values at 1.0. Many of these are erroneous or misleading; for example, many are simply expressing very general statements about assumed cosmologies. However, if we examine the trend of central value measurements, we may note some interesting features. First, we note that values with n s > 1 are not seen after the start of 2003 (except a trio of values around 2015, which are incorrectly identified and are in fact measurements of other physical quantities), coinciding with the WMAP 1 Year Data Release (Spergel et al. 2003). By the publication of the WMAP 3 Year Data Release (Spergel et al. 2007), we see a much more cohesive set of results being reported (both in terms of value range and reported uncertainties), and the spread of values continues to narrow through to the present. While there appears to be a shift in uncertainty range during the 2013-16 period, many of these results are erroneous (generally "spectral index" measurements relating to other physical quantities), with the few correctly identified measurements either being discussions of different inflation models (Takahashi 2013; Meerburg 2014) or using some new technique for probing the cosmology (e.g., Chantavat et al. 2016, using cosmic voids).
These results are seen in Figure 13. Here, we see the utility of distinguishing MeasuredValue and Constraint annotations, as this is a quantity that is generally expressed as a constraint rather than a central value. However, it also presents another interesting challenge with regard to inference: there is an implied lower bound (i.e., zero) on the measurements that is not explicitly stated. This is a natural assumption for a physicist reading the document, but one that relies on additional knowledge. As our future goals include automating aspects of the analysis phase as well as data collection, it is worth noting that these unspoken bounds must be taken into consideration.
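One way to make such an unspoken bound explicit at analysis time is to convert each one-sided constraint into an interval, supplying the physical limit as prior knowledge. The sketch below is illustrative only: the class, the function, and the numerical value are hypothetical, and only the UpperBound and LowerBound labels are taken from our annotation schema.

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    value: float
    bound: str  # "UpperBound" or "LowerBound", following the Attribute labels


def to_interval(constraint, physical_min=None, physical_max=None):
    """Convert a one-sided constraint into a (low, high) interval.

    physical_min and physical_max carry the prior knowledge that the text
    leaves unstated (e.g., a mass cannot be negative). None means "unbounded".
    """
    if constraint.bound == "UpperBound":
        low, high = physical_min, constraint.value
    elif constraint.bound == "LowerBound":
        low, high = constraint.value, physical_max
    else:
        raise ValueError(f"unknown bound type: {constraint.bound}")
    if low is not None and high is not None and low > high:
        raise ValueError("constraint conflicts with the physical limits")
    return (low, high)


# Hypothetical upper limit on the sum of neutrino masses (value made up).
mnu = Constraint(value=0.5, bound="UpperBound")
as_stated = to_interval(mnu)                     # only what the text says
with_prior = to_interval(mnu, physical_min=0.0)  # adds the implied bound
```

The difference between `as_stated` and `with_prior` is exactly the information that a reader supplies implicitly and that an automated analysis would need to be given.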
We may also note from the plot the decided shift in the upper bound on ∑m ν occurring at the start of 2015. This is presumably the influence of the publication of the Planck 2015 results (Ade et al. 2016), which reported a lower value than had been previously accepted. However, we may also see that a trend toward lower values had been in progress since approximately 2010.
This results in 40 measurements with provided uncertainties, with a mean of -1.05 and a median of -1.05. Here, we struggle with ParameterName annotations, most likely due to a combination of the linguistic variability of this quantity's name and the manner in which it is often reported (either simply as w 0 or cryptically as "the equation of state parameter" or similar). This makes it difficult to be certain that we have identified the correct values, beyond utilizing some prior knowledge for the value range, considering the probability that the symbol "w _ { 0 }" may well be used in other contexts for different physical quantities. Nevertheless, the values collected by our search show a reasonable grouping, and the specialized nature of this parameter leads to a result set small enough to be easily examined manually. Plots of these results are shown in Figure 14.
There is a clear discontinuity in the plot after 2015, and examination of the papers following this shift suggests that this is due to new data from the Planck 2015 results (Ade et al. 2016) and the SDSS Data Release 12 (Alam et al. 2015), as can be seen in Morandi & Sun (2016), Moresco et al. (2016), Trashorras et al. (2016), and Chuang et al. (2017). Additionally, there are several values reported over the years at approximately −1.4, which are found to be the result of investigations into different dark energy models (Movahed & Rahvar 2006; Ebrahimi et al. 2018) and cosmological measurements from GRBs (Izzo et al. 2015).

Conclusion
We have presented our investigations into utilizing artificial neural-network models for extracting numerical astrophysics measurements from astrophysical literature. We have successfully trained neural models for named-entity recognition and Entity Attribute labeling tasks in this domain, and designed a rule-based approach for Relation classification based on the outputs of these neural models. The predictions from these models have been processed and structured to allow for searching based on a variety of criteria, such as parameter name or symbol, dimensionality, value range, and so on. During this process, we have created a hand-annotated training data set for these tasks, based on paper abstracts from the arXiv repository.
We have compared the results from these new models to those of the model from our previous work (Crossland et al. 2020) and determined that there is significant overlap between the two result sets for our simple case study (the Hubble constant, H 0 ), showing that the new models have maintained the capabilities of the previous rule-based approach for simple cases. We then went on to show that the new models can be applied to a much broader range of scenarios, with a variety of different complexities, such as dimensionless quantities, symbols that commonly occur in compound expressions (such as Ω m occurring in Ω m h 2 ), or quantities with complex linguistic names (see σ 8 ). We have shown that, in these cases, with only a small amount of prior knowledge being leveraged in the search, a useful result set can be obtained, providing an excellent basis for further manual investigation or statistical analysis. The database framework ensures very fast access to the model outputs, with each of the example queries requiring only seconds of compute time, allowing for quick iterations of search parameters in order to arrive at the desired result set.
Our results have been made available via an online interface, allowing users to search for parameters of interest with a variety of search criteria. Users will be able to engage with search results in an interactive manner and download full result sets for their own experimentation and analysis. This interface, Numerical Atlas, can be found at https://numericalatlas.doc.ic.ac.uk. However, the numerical data are only one aspect of the model results. With the possibility of combining additional data from paper citations and references (e.g., from arXiv or NASA ADS), examining common naming conventions for symbols for use in other search environments, or finding common dimensions for a given parameter, there are many possibilities for examining the sociology and practices of the astrophysics community with these data.
With the extension of the capabilities of our model come some additional complexities. First, there is still a large amount of noise present in the results, due to the intrinsic complexities of dealing with text. As we are now using neural models, these failure states appear less predictable to a human observer, in comparison to the output of rule-based models. Refining these models and the pre- and postprocessing steps used in our pipeline will be an ongoing task involving the collection of additional training data and exploration of other potential model architectures and pipelines.
However, a bigger challenge than failure cases is dealing with the large variation seen in successfully extracted measurements, especially where parameter names and symbols are concerned. Our current strategy has involved extracting "parameter names" as single atomic entities. However, this is not a complete representation of the "name." For example, "Galactic radius" and "radius of the Galaxy" are, to an astronomer, clearly referencing the same physical quantity. However, this kind of entity normalization is a nontrivial task for machines. Currently, we are relying on simple inclusion-based string matching, but this has many drawbacks: in the above case, the only word shared between both forms ("radius") is far too common to be sufficiently discriminative for a large-scale search. The ability to automatically determine if two written names reference the same physical quantity (referred to as Entity Linking in the field of natural language processing) would be a great boost to the practical utility of our search tool. More than this, such an analysis of the textual names would lead to more refined information on the nature of the parameters, as many names in scientific literature are grammatically descriptive (not all, of course; there are plenty of "Proper Noun constants" to be found). For example, a grammatical breakdown of a name such as "star formation rate" provides additional insight into the nature of the quantity: it is a rate of some kind, relating to stars and their formation. With this breakdown, we could now search for parameters relating to stellar phenomena, and "star formation rate" would be included in our listing. Naturally, this is a simplistic case, but the ability to search for parameters at a more abstract level would have many benefits.
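The limitation of inclusion-based matching can be seen in a few lines of code. The sketch below assumes that "inclusion" means a case-insensitive substring test, which is a simplification; the function names, the query strings, and the name "Einstein radius" are chosen here for illustration only.

```python
def normalize(text):
    """Lowercase and collapse whitespace so matching ignores case and spacing."""
    return " ".join(text.lower().split())


def inclusion_match(query, name):
    """Inclusion-based matching: the query must appear as a substring of the name."""
    return normalize(query) in normalize(name)


names = [
    "Galactic radius",
    "radius of the Galaxy",
    "Einstein radius",  # hypothetical unrelated quantity sharing the word "radius"
    "amplitude parameter of the matter density fluctuations",
]

# A specific query misses the reworded form of the same quantity...
specific = [n for n in names if inclusion_match("galactic radius", n)]
# ...while the only shared word matches unrelated quantities as well.
generic = [n for n in names if inclusion_match("radius", n)]
# A false positive of the kind noted for the matter density search
# (the query string "matter density" is an assumption).
false_positive = inclusion_match("matter density", names[3])
```

Neither query recovers both forms of the name without also admitting unrelated parameters, which is the gap that Entity Linking would be expected to close.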
Beyond additional processing of information we are currently collecting, there is also still much scope for collecting additional contingent information. The most important, perhaps, is the collection of experimental methodology. This task is complicated by the fact that it is generally a summarization task, where a "Methodology" section must be read and condensed down into a more compact description (ideally, one comprehensible to a human as well as the machine). In many cases, there is no discrete method name provided at all (by the text itself, or indeed the community), and it is also possible that a paper is reporting a unique or ground-breaking experimental technique for which no term has yet been coined. There are certain subdomains where a finite set of experimental techniques is available and well documented, but this is not the general case and hence a more general approach must be found.
should only be used for MeasuredValue annotations that provide an uncertainty, but can be used for any Constraint annotation.
4. Property: This Relation indicates that a measurement (MeasuredValue or Constraint) or parameter (ParameterName or ParameterSymbol) is a direct property of an object specified by an ObjectName annotation. This generally means that the parameter is a physical characteristic of the object ("mass," "radius," etc.), or that it represents some important property associated with the object (e.g., "star formation rate" of a galaxy).
5. Equivalence: This Relation indicates that two ObjectName annotations (with different textual contents) relate to the same physical object.
6. Contains: This Relation indicates that one object contains another object. This could be used for subcomponents of a system (e.g., members of a binary star system) or objects that reside within a larger object (e.g., stars within a galaxy).
7. Uncertainty: This Relation exists to connect MeasuredValue or Constraint annotations to a SeparatedUncertainty annotation, indicating that the uncertainty is directly related to the measurement. This should only be used where the value and uncertainty share the same dimensions and require no additional manipulation to be used together.
8. Defined: This Relation indicates that a Definition annotation contains a mathematical definition for another Entity. This is often of the form "y = mx + c," but could be more verbose (e.g., "alpha, which is defined to be ...").
And finally, for the Attribute annotations (as summarized in Table 3):
1. Incorrect: This Attribute is applied to measurement annotations that are stated to be incorrect by the author (regardless of whether the author's determination is true).
2. AcceptedValue: This Attribute indicates that a given measurement annotation is stated as final, or ultimately accepted, by the author, as may occur in cases where several possible numerical values are provided based on different assumptions.
3. FromLiterature: This Attribute indicates that a measurement is not the work of the author, but instead quoted from some literature source.
4. UpperBound: This Attribute indicates that a Constraint annotation represents an upper bound on a quantity.
5. LowerBound: This Attribute indicates that a Constraint annotation represents a lower bound on a quantity.

Appendix B Consensus Annotation Algorithm
For the collection of annotated paper abstracts to be used as training data for machine-learning purposes, we must consolidate the repeated sets of annotations for each abstract (see Section 3) into a single annotation set for that particular piece of text. This should be done in such a way that we preserve the largest amount of information from the annotators, while also taking account of ambiguity and guarding against human error. There is not necessarily a canonical approach to take for this problem, and so we have chosen the following method. For each abstract, D, with a set of annotations, S, consisting of Entities, E, Relations, R, and Attributes, A, we group the Entities into overlapping groups. Each of these groups can be in one of several states: full agreement, partial agreement, or disagreement. In the case of full agreement, all annotators have exactly the same Entity annotations (both the span of the annotation and its label), and this annotation is accepted into the consensus annotation set. For partial agreement, more than half the annotators (two in our case) must have the same annotation, and this is also considered a consensus annotation. For the disagreement case, there are many possible situations: the annotators may all have different overlapping spans with the same label, or they may have selected different labels for the same span, multiple sets of partially overlapping spans, or some combination thereof. It is also possible that a single annotation span for one annotator may be multiple spans for another, or that only one annotator identified a certain span as containing an Entity, and other such combinations of labeling. One of these cases is resolved by the consensus algorithm in the following way: if more than half the annotators have overlapping annotation spans with the same label that do not intersect with any other spans (i.e., we are not in a case where one annotator has a single span and another multiple spans in the same region), then a consensus Entity is created from the overlap of the annotated spans and assigned the appropriate label (these substitutions are tracked for the purposes of consensus Relations; see below). For all other cases, the annotation is simply rejected from the consensus.
Next, we consider Relations. First, we filter the candidate Relations by whether their start and end Entities are in the consensus set; if not, the Relation is rejected. The remaining Relations are then grouped together by their start and end Entities, and the same process of identification as full agreement, partial agreement, or disagreement is performed. However, for Relations, the possible combinations of agreement and disagreement are less complex. A simple majority (two, in this case) voting system is sufficient to determine inclusion in the consensus set.
Finally, Attribute annotations are also filtered by subject Entity inclusion in the consensus set, and agreement is determined by voting, as for Relations.
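The consensus procedure described in this appendix can be sketched as follows. This is a simplified reading, not the implementation used to build the training set: the tuple layouts and function names are hypothetical, each annotator is assumed to vote at most once per Relation, and Relations are assumed to already reference consensus Entities (i.e., the span substitutions mentioned above have been applied).

```python
from collections import Counter


def consensus_entities(annotations, n_annotators):
    """annotations: list of (annotator, start, end, label), with end exclusive."""
    # Sweep over spans sorted by position to build groups of overlapping spans.
    ordered = sorted(annotations, key=lambda a: (a[1], a[2]))
    groups, current, current_end = [], [], None
    for ann in ordered:
        if current and ann[1] < current_end:
            current.append(ann)
            current_end = max(current_end, ann[2])
        else:
            if current:
                groups.append(current)
            current, current_end = [ann], ann[2]
    if current:
        groups.append(current)

    consensus = []
    for group in groups:
        # Full or partial agreement: identical span and label from a majority.
        counts = Counter((s, e, lab) for _, s, e, lab in group)
        (entity, votes), = counts.most_common(1)
        if votes * 2 > n_annotators:
            consensus.append(entity)
            continue
        # Resolvable disagreement: a majority of annotators, one span each,
        # same label, nothing else in the region. Keep the common overlap.
        annotators = [a for a, _, _, _ in group]
        labels = {lab for _, _, _, lab in group}
        one_span_each = len(set(annotators)) == len(annotators)
        if one_span_each and len(labels) == 1 and len(group) * 2 > n_annotators:
            start = max(s for _, s, _, _ in group)
            end = min(e for _, _, e, _ in group)
            if start < end:
                consensus.append((start, end, labels.pop()))
        # All other cases are rejected from the consensus.
    return consensus


def consensus_relations(relations, entities, n_annotators):
    """relations: list of (annotator, start_entity, end_entity, label)."""
    accepted = set(entities)
    votes = Counter(
        (src, dst, lab)
        for _, src, dst, lab in relations
        if src in accepted and dst in accepted  # both ends must be in the consensus
    )
    return [rel for rel, n in votes.items() if n * 2 > n_annotators]


# Toy example with three annotators; spans are character offsets.
annotations = [
    (1, 0, 15, "ObjectName"), (2, 0, 15, "ObjectName"), (3, 0, 15, "ObjectName"),
    (1, 20, 23, "ParameterName"), (2, 20, 23, "ParameterName"),
    (1, 30, 40, "MeasuredValue"), (2, 32, 42, "MeasuredValue"),
    (3, 31, 40, "MeasuredValue"),
    (1, 50, 55, "ParameterSymbol"),
]
entities = consensus_entities(annotations, n_annotators=3)

obj = (0, 15, "ObjectName")
name = (20, 23, "ParameterName")
value = (32, 40, "MeasuredValue")
lone = (50, 55, "ParameterSymbol")
relations = [
    (1, name, obj, "Property"), (2, name, obj, "Property"),
    (1, value, obj, "Property"),
    (3, lone, obj, "Property"),
]
kept = consensus_relations(relations, entities, n_annotators=3)
```

Attribute annotations would be handled in the same way as Relations, with the subject Entity checked against the consensus set before voting.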
LaTeX strings into a recursive tree structure representing the components of the symbol: for example, individual characters, sub- and superscript symbols, functions (e.g., "f(x)"), bracketed expressions (respecting bracket type), binary operator expressions (e.g., "a + b"), and so on. These data structures may then be serialized into a standard string format, obeying LaTeX style conventions. There are many cases where this parsing fails, either because the symbol represents some typographic edge case or because the span identified by the model is incomplete. Currently, no attempt is made to alter the span of the Entity in question, and failed parsing attempts result in the original string also being used to represent the normalized case for search purposes. This normalized string may then be used for queries based on mathematical symbols, with the query symbol string also being passed through this parsing algorithm to ensure that it is in line with the expected style conventions. For example, braces ("{}") are included in all cases of ambiguity (i.e., "H _ 0" becomes "H _ { 0 }"), and mathematical (as opposed to LaTeX typesetting) braces use their simplest form (i.e., "right(" becomes "(") to improve readability for the user.
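The brace convention can be illustrated with a much simpler stand-in for the recursive parser. The function below is a hypothetical token-level sketch that handles only bare sub- and superscripts on whitespace-separated tokens; like the behavior described above, it falls back to the original string when the symbol cannot be handled.

```python
def normalize_symbol(symbol):
    """Wrap bare sub/superscript arguments in braces: "H _ 0" -> "H _ { 0 }".

    Token-level simplification of the parsing step; it assumes tokens are
    separated by whitespace and does not build a tree.
    """
    tokens = symbol.split()
    out = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        out.append(token)
        if token in ("_", "^"):
            if i + 1 >= len(tokens):
                return symbol  # incomplete span: keep the original string
            following = tokens[i + 1]
            if following != "{":
                out.extend(["{", following, "}"])
                i += 1  # the script argument has been consumed
        i += 1
    return " ".join(out)


# Stored symbols and query symbols go through the same normalization,
# so differently typed forms of one symbol compare equal.
same = normalize_symbol("H _ 0") == normalize_symbol("H _ { 0 }")
compound = normalize_symbol("Omega _ m h ^ 2")
```

Because the comparison is made on the whole normalized string, a compound expression such as "Omega _ { m } h ^ { 2 }" is not returned for a query on "Omega _ { m }" alone.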
For the Relation annotations, as with the consensus data set (see Appendix B), we add any transitive or implied Relations into the annotation set for the document. This is especially crucial at this stage, as having all implied Relations be present in the annotations makes search-time operations more efficient by removing the need for further inferencing at a later stage.
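As an illustration of adding implied Relations ahead of search time, the sketch below computes the symmetric and transitive closure of a set of Equivalence pairs with a union-find structure. Treating Equivalence as the transitive Relation is an assumption made for this example, as are the function name and the object names used.

```python
def add_implied_equivalences(pairs):
    """Return every ordered pair implied by the given Equivalence pairs."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each equivalence class.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)

    # Emit all ordered pairs within each class.
    closed = set()
    for members in groups.values():
        for a in members:
            for b in members:
                if a != b:
                    closed.add((a, b))
    return closed


# Two annotated Equivalence Relations imply a third (and the reverses).
closed = add_implied_equivalences([("M31", "Andromeda"), ("Andromeda", "NGC 224")])
```

With the closure stored alongside the annotations, a search for measurements of one object name can return those attached to any equivalent name with a direct lookup.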

Figure 1 .
Figure1.Schematic overview of the project.Using a hand-annotated sample of papers from the arXiv repository, we train a collection of models for measurement extraction and then perform this data extraction on all the astrophysics paper abstracts in arXiv.These results are then made available via an online interface for interactive user queries.

Figure 2 .
Figure 2. A schematic diagram of the feed-forward Entity model, where t n indicates the nth token embedding, y n indicates the model output for the nth token, and w is the window width.

Figure 3 .
Figure 3.A schematic diagram of the LSTM Entity model.Here, the Bi-LSTM node is the same node in both cases, evaluated forward and backward across the text token sequence.Here, t n and y n indicate the token embedding and model output for the nth token, respectively.

Figure 4 .
Figure 4.A schematic diagram of the neural Relation model.The Bi-LSTM nodes shown here refer to the same LSTM network, which is used for each of the spans.Here, t n and y n indicate the token embedding and Entity label prediction for the nth token, respectively, D is the direction bit indicating the direction of the Relation in the text, R is the Relation prediction for this Entity pair and direction, and w is the window width.The two Entities in question run from tokens i to j, and from tokens k to l.The h t−1 notations indicate that it is the hidden state from the final time step that is used as the output from the LSTM nodes.

Figure 6 .
Figure 6.Comparison of search results for the Hubble constant, H 0 , from the rule-based (a) and neural models (b).In addition to the measurements provided as central values with stated uncertainties (i.e., "x ± y," shown as blue circles with error bars), the neural model figure also shows values given in the source text as constraints (i.e., H 0 < x, or similar, shown as green arrows for lower bounds and orange arrows for upper bounds.)

Figure 7. Time series of the search results for $H_0$, showing reported value against publication date (blue), along with the mean (red points) and dispersion (error bars) of the fitted Gaussian distributions for overlapping 5 yr periods.

Figure 8. Comparison of search results for the rule-based (a) and neural (b) models for the cosmological matter density, $\Omega_M$. Both plots show only measurements that report a central value and an uncertainty, shown in blue (the neural model results also contain constraint measurements, but these have been omitted for clarity). The mean (red points) and dispersion (error bars) of the Gaussian mixture models fitted on overlapping 5 yr bins are also shown.

Figure 9. Time series of the search results for $\Omega_\Lambda$, showing reported value and publication date.

Figure 10. Time series of the search results for $\sigma_8$, showing reported value and publication date.

Figure 11. Time series of the search results for $\Omega_b h^2$, showing reported value and publication date.

Figure 12. Time series of the search results for $n_s$, showing reported value and publication date.

Figure 13. Time series of the search results for $\sum m_\nu$, showing reported value and publication date.

Figure 14. Time series of the search results for $w_0$, showing reported value and publication date.

Table 1
Entity Annotation Types in the Annotation Schema

Table 2
Relation Annotation Types in the Annotation Schema, Showing Any Constraints on the Start and End Entity Types for the Listed Relations

Table 4
Example BIO-labeled Tokenized Sentence

Table 5
Summary Metrics on the Test Set, for the Best-performing Entity Models from the Grid Search

On balance, we have chosen the LSTM model to be used for our final processing steps. A significant reason for this choice is the observed ability of the LSTM model to produce smoother output predictions, in which fewer mid-Entity tokens are missed, resulting in less fragmented Entity predictions.

Table 6
Per-label Performance Metrics on the Test Set, for the Best-performing Entity Models from the Grid Search
Note. Here, FF refers to the feed-forward model.

Table 7
Per-label Performance Metrics for the Neural and Rule-based Relation Models

Table 8
Per-label Performance Metrics for the Top-performing Attribute Model from the Model Search

Table 9
Example Annotations from the Rule-based Model ("Keyword") and Neural Model

Table 10
Fiducial Values of the Cosmological Constants Taken from Planck Collaboration et al. (2020) for Comparison with Model Results

to the ∼0.315 value reported by Planck Collaboration et al. (2020).