Online meta-learned gradient norms for active learning in science and technology

Acquisition of scientific data can be expensive and time-consuming. Active learning reduces cost and time by guiding the selection of scientific experiments. Autonomous, automatic identification of the most essential samples to annotate can also help to mitigate human bias. Previous research has demonstrated that the unlabelled samples causing the largest gradient norms of neural network models can promote active learning in classification. However, gradient norm estimation in regression is non-trivial because the continuous one-dimensional output of regression differs significantly from classification. In this study, we propose a new active learning method that uses meta-learning to estimate the gradient norm of each unlabelled sample in regression. Specifically, we use a separate model as a selector that learns from previous active learning results and predicts the gradient norms of unlabelled samples. In each active learning iteration, we estimate the gradient norms of the unlabelled samples and select those with the largest values to annotate. Our method is evaluated on six regression data sets from various domains, including costly scientific data.


Introduction
The application of machine learning (ML) in science and technology is increasingly influential. However, the success of ML intrinsically relies on high-quality and extensive labelled data, which can be prohibitively expensive to obtain in scientific fields such as medical science [1], chemistry [2], and material science [3]. Data labels in this context are numeric or categorical properties that an ML model is trained to predict. Obtaining those labels, called annotation, might require domain-expert categorisation, an experiment, a measurement, or a simulation [4][5][6]. Active learning (AL) is an ML method that applies when we have a large pool of fully characterised feature data and a far smaller amount of annotated label data. AL directs annotation by automatically selecting a subset from the unlabelled pool in an iterative way [7]. An acquisition score, calculated for all unlabelled data samples, directs the AL annotation. AL approaches are highly applicable to domains that prioritise reducing the cost associated with the overall number of experiments conducted, the annotation budget [8], and improving the generalisation performance of the model [9]. The pool-based AL pipeline is shown in figure 1.
AL is usually studied in the context of classification problems [10][11][12][13][14][15][16], while AL methods for regression problems are underdeveloped [17]. The difference between regression and classification here is non-trivial, so many existing classification-designed AL methods are inapplicable to regression. For example, a well-established classification AL acquisition score uses the predictive probability of unlabelled samples obtained from the classifier to estimate gradients and explicitly consider prediction confidence [10,18]. However, this result cannot be generalised to regression because the continuous output of regression carries no probability information and might span a large interval, making it intractable to estimate gradients of unlabelled samples.

Figure 1. Pool-based AL pipeline. Annotated data with known labels is shown as filled (black) bins, and unlabelled feature data is shown as unfilled bins. AL is initialised with a large pool of unlabelled data (right) and a small set of labelled data (left). Oracles represent experiments, measurements, or simulations performed to obtain numerical or categorical labels when annotating data. The acquisition strategy (selector) ranks all unlabelled data. The base model is the selected predictive model. A single unlabelled data sample or batch of samples is selected for annotation in each AL iteration.
However, regression AL remains strongly motivated in the science and technology domain, such as in material science [19,20] and chemistry [21,22]. While a typical classification task may require annotation by a domain expert, regression tasks commonly require laboratory experiments or computational simulations that are highly expensive in terms of time or resources [4,23,24]; regression AL can therefore help to accelerate scientific research, for example in drug discovery [8] or electronic structure calculations [25,26]. Existing AL regression methods are predominantly based on the traditional model-driven mechanism. Model-driven AL pertains to methods formulated using human-defined statistical acquisition functions [27]. For example, a classical and widely used model-driven AL regression method based on uncertainty is query-by-committee (QBC), which queries the candidate sample with the largest disagreement (e.g. variance) among the different members of the committee [28]. Another regression AL acquisition approach is to prioritise data diversity [29], such as greedy sampling [30]. Different diversity metrics for greedy sampling have been defined in the input feature space (GSx), the output label space (GSy), and both (iGS) [30]. Expected model change maximisation is another approach that prioritises querying samples that could have the most significant influence on the underlying ML model [31].
For a real-world user, selecting an appropriate AL method for a specific application is challenging: substantial knowledge of models' assumptions and implementations is required to use these methods effectively [27]. Model-driven AL methods are based on different manually designed metrics, e.g. uncertainty and diversity, and can yield vastly different results across data domains [27,32,33]. An error in AL implementation could lead to a biased model or slower training than random sampling, and instability in AL method performance may lead to disastrous consequences for downstream tasks. For instance, unstable model results might lead to misguided decisions and wasted resources [34].
A newer AL paradigm is the data-driven rather than model-driven approach. This shift in perspective holds promise for optimising AL for regression tasks. Data-driven AL implements selection strategies derived from data, typically related to the data structure and the parameters of the models [27]. Data-driven AL can avoid the above pitfalls by using metrics learned from the data and applying automatically learned features to design the AL acquisition that guides data selection. Such methods can be agnostic to whether the task is regression or classification, and they also offer more possible solutions for regression tasks; for instance, acquisition functions designed for classification can be applied to regression by learning from data. Thus, data-driven AL can reduce the dependence on domain experts and subsequently reduce bias. However, research into designing appropriate data-driven architectures and mechanisms is challenging and still in the early stages [27]. Several data-driven AL methods have been validated on classification problems [35,36].
One emerging data-driven approach is meta-active learning (MAL) [36,37]. In addition to the usual AL model training at each iteration, MAL preserves meta-knowledge about the AL training itself to improve performance (known as meta-learning). Meta-learning has the advantage of adapting rapidly to new learning tasks with fewer training samples, which aligns well with the scarcity of labelled data in the early stages of AL [27]. Consequently, meta-learning is often used to enhance AL methods. The leading MAL method is learning active learning (LAL) [36], which converts the AL selection itself into a regression problem and predicts the reduction in generalisation error for each unlabelled sample. The generalisation error is defined as the error when dealing with previously unseen data. The meta-features are handcrafted model states and data parameters accumulated by offline learning, which is applied to a labelled data set before AL commences. In practice, offline MAL processes therefore require a large amount of labelled or synthetic data to be initiated, making them undesirable for real-world annotation tasks. The feasibility and generalisability of current MAL methods are severely restricted in cases where prior conditions or knowledge may be limited or incomplete, for example in material science [3,38] and biomedical science [39,40], where data sets are smaller due to resource or technical constraints.
Building upon prior research [36], we propose a new online meta-active learning (OMAL) method to address these limitations. The online strategy is motivated by and designed for regression tasks in scientific data domains. Our acquisition strategy is based on previous classification research, which selects samples with the largest gradient norms [10]. In contrast, our method capitalises on the direct relationship between the acquisition scores and the meta-features, which handles diverse data domains with greater flexibility and adaptability. By estimating sample gradient norms from accumulated meta-knowledge, which contains information on parameter gradients (from multi-layer perceptrons) and sample features, our method greatly decreases human intervention, especially when compared to acquisition functions with hand-selected or manually tuned hyperparameters. Our online learning process can use meta-knowledge accumulated from the current data set with only a few initially labelled samples, thereby enhancing the scalability of OMAL. We evaluate our OMAL method on six regression data sets and compare it to benchmark methods, finding that our approach has competitive AL performance on several data sets.
In this contribution, we first propose a novel online data-driven AL approach designed specifically for scientific applications with an assumed high label query cost, smaller data size, minimised initialisation cost, and a regression ML model. Experiments are performed on regression problems, but our method can also be applied to classification problems in the future. Second, our method does not need offline learning: the meta-knowledge is accumulated from the current data domain and does not need to be collected from another labelled data source. Third, experiments demonstrate the feasibility of OMAL, making it a potential alternative solution for AL regression.

Active learning
In pool-based AL for regression, for a set of indexed points P, we have complete feature information and incomplete label information. Here, the labels are continuous numerical variables rather than the distinct categorical labels used in the more common AL for classification. We divide the index set P into unlabelled P_U and labelled P_L sets, where the set X of known feature data vectors x (in d-dimensional feature space R^d) is defined for all samples, while the numerical data labels y ∈ R are known only for the labelled set P_L. We refer to the complete data set D = (X, y) as the known information for all samples. AL uses a data selection algorithm to strategically label a subset from the unlabelled pool, thereby reducing the overall annotation cost and improving the efficiency of the labelling process [7]. The acquisition function (also known as the selector [27]) of the chosen query strategy ranks all members of the unlabelled data set and selects one (sequential mode) or more (batch mode), for which the numerical label(s) y_i are acquired. The selected sample(s) are moved from the unlabelled set to the labelled set. We describe the chosen regression model for the ML task as the base model f_θ with model parameters θ.
The pipeline of AL is shown in figure 1 and iterates through the following steps. If the labelled set is initially empty, an initialisation strategy is typically required to acquire numerical labels for a small subset of samples. The ML base model is then trained at each AL iteration using the available labelled data. A selection strategy chooses sample(s), and numerical labels are acquired for those samples by performing annotation experiments or measurements (the oracle). Next, the data set, labelled index, and unlabelled index are updated with this new information. These steps are repeated until a stopping criterion is reached or the annotation budget is exhausted.
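To make the loop concrete, the pipeline above can be sketched in code. This is a minimal, hypothetical sketch: the one-dimensional pool, the `oracle` stand-in, and the nearest-labelled-sample distance used as an acquisition score are toy choices for illustration only, not the paper's base model or selector.

```python
import random

def active_learning_loop(pool_x, oracle, n_init=3, batch_size=2, budget=6, seed=0):
    """Generic pool-based AL loop: initialise, then iteratively select and annotate."""
    rng = random.Random(seed)
    unlabelled = set(range(len(pool_x)))
    labelled = {}  # index -> acquired label

    # Initialisation: randomly annotate a small subset of the pool.
    for i in rng.sample(sorted(unlabelled), n_init):
        labelled[i] = oracle(pool_x[i])
        unlabelled.discard(i)

    # Iterate until the annotation budget is exhausted or the pool is empty.
    while len(labelled) - n_init < budget and unlabelled:
        # Toy acquisition score: distance to the nearest labelled sample
        # (a simple diversity heuristic standing in for any real selector).
        def score(i):
            return min(abs(pool_x[i] - pool_x[j]) for j in labelled)

        batch = sorted(unlabelled, key=score, reverse=True)[:batch_size]
        for i in batch:  # the oracle annotates the selected batch
            labelled[i] = oracle(pool_x[i])
            unlabelled.discard(i)
        # (Here the base model would be retrained on the enlarged labelled set.)
    return labelled, unlabelled

pool = [0.0, 0.1, 0.2, 1.0, 1.1, 2.0, 2.1, 3.0, 3.1, 4.0]
labelled, unlabelled = active_learning_loop(pool, oracle=lambda x: 2 * x)
```

In a real deployment, `oracle` would be an experiment, measurement, or simulation, and `score` would be replaced by the chosen acquisition function.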

Meta-learning
Meta-learning is a 'learning how to learn' process, which aims to develop models that can rapidly generalise to new unseen tasks by leveraging the prior knowledge from other known tasks [27,41]. A classic example is generalising an image classifier trained to recognise dogs to different object classifications. The meta-learning approach learns underlying generalisable knowledge from multiple predictive tasks regardless of their specific details and allows better generalisation of a model across numerous tasks [42]. Formally, meta-learning is a bi-level optimisation problem, where the internal optimisation (related to the predictive task) depends on the restriction of the outer optimisation (related to the meta-learning) [27,42,43]. Meta-learning is characterised as the 'learning ability' acquired from this outer optimisation process, enabling rapid adaptation to a new task or environment using fewer training samples [42]. These characteristics make meta-learning highly suited to an AL process, especially during the early stages when few training samples are available. However, significant modifications to the way meta-learning tasks are divided must be made to adapt this method to an AL pipeline.
For meta-learning, we consider ML tasks defined by a data set D and a loss function L. The meta-learning training data set D is a superset of the D_k for K different tasks indexed by the integer k. Meta-training requires each D_k to be divided into training sets (also named support sets) and test sets (also called query sets). We define the previously unseen meta-test task as a new set indexed as K + 1. First, meta-training is defined as the process of optimising the ML model f_ψ to acquire meta-knowledge in terms of the optimised meta-parameters ψ from the training tasks [27], achieved through maximum likelihood estimation,

ψ* = arg max_ψ log p(ψ | D^train),

where p is the probability function.
Optimising the outer algorithm can then be regarded as 'learning to learn' from the meta set. Based on the parameters ψ learned during the meta-training phase, meta-learning aims to generalise these learning skills to the new unknown task. Analogously to the meta-training objective, the meta-testing on the new task is optimised by

φ_{K+1} = arg max_φ log p(φ | ψ, D^train_{K+1}),

where φ_{K+1} are the updated meta-parameters with respect to the new task and the meta-parameters ψ. With the trained parameters φ_{K+1}, the meta-learner can be evaluated on the corresponding test set of the new task, D^test_{K+1}, to make predictions.

Meta-active learning
It is established that the characteristics of meta-learning align with the goals of AL [27,42]. Fundamentally, both AL and meta-learning aim to improve the generalisation of a model using less labelled data. Each AL iteration can be framed as a new meta-learning task (K + 1, K + 2, . . .). The base ML model is retrained and improved using the updated labelled data in each AL iteration, gradually enabling meta-learning to encode knowledge from the base model.
MAL is then formulated to use meta-learning as the selector for the AL selections. There are, however, some intrinsic differences between MAL and a typical meta-learning problem [27]. Firstly, there are no distinct tasks in AL, but rather a constant data pool with a varying amount of known labels and the same predictive task. Secondly, the samples selected in AL iterations are added to the same training set for the predictive task, which differs from the usual meta-learning task partitioning. Applying meta-learning to AL therefore needs to consider (i) how to formulate the meta-training set appropriately, (ii) how to minimise the annotation cost for the meta-test set, and (iii) how to define a method that both directs the meta-learning and evaluates the importance of the unlabelled samples.
Based on the above considerations, we regard the AL selection as a regression problem in our MAL method, following the previous method [36]. The meta-features are the input, and the AL acquisition function score is the output of this regression. Learning the relation between the meta-features and the acquisition function score is the optimisation process for meta-learning. We have two functional models: one, called the base model, is for the predictive task (for example, image classification); the other, a separate model named the selector, is for meta-learning.
Our approach takes the following steps: (i) Randomly select and label a small amount of initial data; use the labelled data to initialise the base model and accumulate the meta set, and update the unlabelled pool by removing the selected samples. (ii) Train the selector with the accumulated meta set and use it to predict the AL acquisition score for each candidate instance in the unlabelled pool. (iii) Select a batch of instances with the highest acquisition scores, query their numerical labels, and sequentially add them to the training set; update the unlabelled pool by removing the selected samples. (iv) Use the sequentially updated training set to retrain the base model and accumulate the meta set at each AL iteration. (v) Repeat steps (ii)-(iv) in each AL iteration.
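The steps above can be sketched end to end. This is a toy sketch: the constant 'model state', the stored meta set, and the 1-nearest-neighbour regressor are hypothetical stand-ins for the paper's MLP-ensemble gradient-norm state and random forest selector, chosen only to keep the example self-contained.

```python
def one_nn_predict(meta_set, query):
    """Toy selector: return the meta-target of the nearest stored meta-feature vector."""
    sigma, delta = min(meta_set,
                       key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query)))
    return delta

def omal_iteration(meta_set, unlabelled_features, model_state, batch_size):
    """One OMAL-style iteration: score every unlabelled sample, pick the top batch."""
    scores = {}
    for idx, x in unlabelled_features.items():
        sigma = tuple(model_state) + tuple(x)          # meta-features: state ++ features
        scores[idx] = one_nn_predict(meta_set, sigma)  # predicted gradient norm
    return sorted(scores, key=scores.get, reverse=True)[:batch_size]

# Meta set accumulated in earlier iterations: (meta-features, observed gradient norm).
state = [0.5, 0.2]                       # toy stand-in for the layer-wise norm state
meta_set = [
    ((0.5, 0.2, 0.0, 0.0), 0.1),
    ((0.5, 0.2, 1.0, 1.0), 0.9),
]
pool = {0: (0.1, 0.1), 1: (0.9, 1.1), 2: (0.0, 0.2)}
selected = omal_iteration(meta_set, pool, state, batch_size=1)
```

Sample 1 resembles the stored meta-feature with the large observed gradient norm, so it is predicted to have the largest norm and is selected for annotation.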
The meta-features of the meta (meta-training) set are designed based on the base model parameters and the sample features. The optimisation target (selector output) is the gradient norm of each unlabelled sample. All this information can be obtained from the base model in each AL iteration. In our method, we avoid using a meta-target that is related to the meta-test set (e.g. the loss reduction each training sample can bring on the meta-test set [36]) in order to save the annotation cost of labelling a meta-test set. Furthermore, our experimental and ablation studies show that the meta-feature and meta-target designs are reasonable and can enhance AL selection.

Methodology
Our method applies meta-learning to learn the AL selection strategy from the specific domain with limited labelled data. The key to the design of the model state is how to narrow the domain of the meta-feature variables to improve the generalisation, representation, and speed of the meta-learning process. We define the model state as the gradient norms of the parameters (weights and biases) in each layer of the multi-layer perceptron, characterising the continuous parameter changes in the neural network base model. Weights and biases play different roles in the learning process, and we represent them separately in the model state to capture their different impacts on model learning.
We integrated the data features with the specific model state to represent the learning state, which is associated with model influence as indicated by the gradient drop. Our MAL can directly predict the gradient norm of each unlabelled sample without relying on the predictive probability of unlabelled data for estimation [10]. As a result, it is suitable for regression tasks.

Meta set design
The trained neural network f_θ with Z hidden layers is denoted as

f_θ(x) = g_{Z+1}(g_Z(· · · g_1(x))),

where g_i = a(w_i x + b_i) for i = 1, . . ., Z; w_i and b_i are the weights and bias of the i-th layer (not including the input layer), and a is the activation function. The output layer g_{Z+1} = w_{Z+1} g_Z + b_{Z+1} uses an identity function in the regression model. For a neural network with this structure, we accumulate meta-data, which characterises the model states and data parameters, to define our meta-features σ. During meta-learning initialisation and the AL iterations t, when samples are selected and sequentially added to the labelled set D_L, we retrain the model and extract the gradient norms of each sample. We use mini-batch gradient descent to train the model. The time state is defined in terms of each subsequent model retraining as τ = 1, 2, 3, . . .. For each retraining, the labelled training data is divided into mini-batches; the size of the mini-batch is a hyperparameter. The gradient is updated sequentially by training with each of these mini-batches. In state τ, we use the gradient of the last trained mini-batch M_τ with respect to the model parameters θ to characterise the latest model state; M_τ is used to update θ to obtain the trained model in state τ. The averaged loss L_τ(θ) (mean squared error) with respect to the mini-batch M_τ is

L_τ(θ) = (1/|M_τ|) Σ_{(x_m, y_m) ∈ M_τ} (f_θ(x_m) − y_m)²,

where θ represents all the weights and biases of the model f_θ (with H, the count of parameters) and |M_τ| refers to the size of M_τ. The gradient with respect to all parameters is represented as

∇_θ L_τ(θ) = (∂L_τ(θ)/∂θ_1, . . ., ∂L_τ(θ)/∂θ_H).

We use ∇_{w_i} L_τ(θ) to represent the gradient matrix with respect to the weight matrix of the i-th layer. We calculate the Frobenius norm of the gradient with respect to the weight parameters in each layer, denoted ∥∇_{w_i} L_τ(θ)∥ for i = 1, . . ., Z + 1. Similarly, the gradient norm with respect to the bias parameters is ∥∇_{b_i} L_τ(θ)∥. The model state in state τ collects these layer-wise norms,

S_τ = (∥∇_{w_1} L_τ(θ)∥, ∥∇_{b_1} L_τ(θ)∥, . . ., ∥∇_{w_{Z+1}} L_τ(θ)∥, ∥∇_{b_{Z+1}} L_τ(θ)∥),

where S_τ ∈ R^{2(Z+1)}. To reduce randomness in the neural network states, we use an ensemble model as the base model f_ρ with C ensemble members,

f_ρ = {f_{θ_1}, . . ., f_{θ_C}}.

Each member of f_ρ is constructed with the same configuration, but the parameters θ are initialised differently, and each member is optimised with D_L. The model state is the ensemble average of the states of these members, denoted

S̄_τ = (1/C) Σ_{c=1}^{C} S_τ^{(c)},

where S̄_τ ∈ R^{2(Z+1)}. The data parameters are defined as the feature vector of the sample, x_i ∈ R^d. We design the meta-features σ related to the sample x_i as the concatenation of the averaged model state S̄_τ with the data parameters (the feature vector of x_i),

σ_i^τ = S̄_τ ⊕ x_i,

where ⊕ refers to the concatenation operation and σ_i^τ ∈ R^{2(Z+1)+d}. By adding the sample x_i to D_L and retraining the model, we calculate the gradient norm with respect to all model parameters of the sample (x_i, y_i) at the (τ + 1) state as the meta-target. Correspondingly, the averaged gradient norm among the ensemble members is denoted

δ_i^{τ+1} = (1/C) Σ_{c=1}^{C} ∥∇_θ L(θ_c)∥,

where δ_i^{τ+1} ∈ R, and θ here comprises all of the updated model parameters in the state (τ + 1). We save the pair of meta-features and meta-target (σ_i^τ, δ_i^{τ+1}) to the meta set (Σ, ∆), which is used for training the selector R. In each AL iteration, we use a batch of queried samples to accumulate the meta set; each queried sample is only calculated once for the meta-features and meta-target. The meta set accumulation process is described in algorithm 1, where warm-start training refers to model parameters being initialised from the previously trained states rather than randomly.
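To make the construction of σ concrete, here is a minimal sketch for a single-hidden-layer network (Z = 1). Central finite differences stand in for backpropagation purely to keep the example self-contained, and all parameter and batch values are arbitrary; this is not the paper's ensemble MLP.

```python
import math

def forward(params, x):
    """Tiny MLP, Z = 1: one tanh hidden layer, identity output (regression)."""
    (w1, b1), (w2, b2) = params
    h = [math.tanh(sum(wij * xj for wij, xj in zip(wi, x)) + bi)
         for wi, bi in zip(w1, b1)]
    return sum(w2j * hj for w2j, hj in zip(w2[0], h)) + b2[0]

def loss(params, batch):
    """Averaged mean squared error over a mini-batch M_tau."""
    return sum((forward(params, x) - y) ** 2 for x, y in batch) / len(batch)

def layer_grad_norms(params, batch, eps=1e-6):
    """Frobenius norm of the loss gradient per layer, weights and biases separately."""
    norms = []
    for w, b in params:
        for block in (w, [b]):          # weight matrix, then bias vector as one row
            sq = 0.0
            for row in block:
                for j in range(len(row)):
                    row[j] += eps
                    up = loss(params, batch)
                    row[j] -= 2 * eps
                    down = loss(params, batch)
                    row[j] += eps       # restore the parameter
                    sq += ((up - down) / (2 * eps)) ** 2
            norms.append(math.sqrt(sq))
    return norms                        # length 2 * (Z + 1)

params = [([[0.3, -0.2], [0.1, 0.4]], [0.0, 0.1]),   # hidden layer: w1, b1
          ([[0.5, -0.5]], [0.2])]                    # output layer: w2, b2
batch = [((1.0, 2.0), 0.7), ((0.5, -1.0), -0.3)]     # last trained mini-batch M_tau

state = layer_grad_norms(params, batch)              # S_tau in R^{2(Z+1)}
sigma = tuple(state) + (0.8, -0.4)                   # meta-features: S_tau ++ x_i
```

With Z = 1 and d = 2, the state has 2(Z+1) = 4 entries and σ has 2(Z+1)+d = 6; averaging such states over ensemble members would give S̄_τ.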

Algorithm 1 (key step): add the selected sample (x_i, y_i) to D_L, then retrain the base model f_ρ using D_L with the warm-start strategy.

Active learning strategy
In each AL iteration, we use the accumulated meta set to train the selector R. When querying samples, the meta-features of each unlabelled sample x_i in the t-th AL iteration are defined as σ_i^t, composed of the current model state S̄_t concatenated with the feature vector x_i. σ_i^t is used by the selector R to predict the gradient norm δ_i of x_i. Subsequently, we select a batch of samples with the largest predicted gradient norms to annotate. The AL acquisition function guided by meta-learning is shown in algorithm 2, which loops over every unlabelled sample x_i ∈ D_U.

Meta-active learning pipeline
The pipeline of our OMAL method is shown in figure 2, and the pseudo-code is shown as algorithm 3. Our method has two stages in the initialisation step: the first stage initialises the base model f_ρ, and the second stage initialises the meta set. The samples for both stages are selected randomly so that they are uniformly spread over the sample space. During the second stage, the selected labelled samples are added to the training set one at a time. As each sample is incorporated into the training set, the model is retrained using a warm-start approach, allowing for the accumulation of meta-features and their corresponding meta-targets (algorithm 1). When the selected batch of candidate samples is labelled in the AL iterations, the labelled samples are added to the training set sequentially, and the meta set is accumulated simultaneously.
There are two models in our method. One is the base model f_ρ, which provides the model state information during the AL iterations and is related to the predictive task. The second model, the selector R, predicts the gradient norm of each unlabelled sample by learning from the meta set. The meta set provides correlation information between meta-features and meta-targets, which are collected from previous active learning iterations. To ensure that the base model provides sufficient state information, we used a multi-layer perceptron, which offers abundant trainable parameters to capture the intermediate state of each layer. The selector must be able to learn well and avoid overfitting when the amount of data is small; in addition, as an AL selector, it should not have too many hyperparameters to fine-tune. Based on these considerations, we chose the random forest model [44] as the selector, trained using the meta set (Σ, ∆), where Σ is the meta-feature set and ∆ is the meta-target set.


Data sets
We employed six public data sets to validate our method: Diabetes [45], AutoMPG [46], Silver Nanoparticle [47], NO2 [48], Red Wine [49], and Graphene Oxide [50]. The detailed information after data processing, including the number of total samples (Data Size), feature dimensions (Feature Size), number of initial samples, AL batch size (the number of selected samples in each AL iteration), and the regression label, is shown in table 1. We dropped some unnecessary features, for example the instance ID, and we also dropped feature columns that contain null values. We applied one-hot encoding to convert categorical features to numerical features. We randomly split each data set into 80% of the data as the unlabelled pool and 20% as the labelled test set for evaluation. We standardised the features by scaling to unit variance and removing the mean.
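The standardisation and the random 80/20 pool-test split described above can be sketched as follows (a minimal sketch using the population standard deviation; the helper names and toy values are illustrative, not the paper's Scikit-learn pipeline):

```python
import random

def standardise(column):
    """Scale a feature column to zero mean and unit variance."""
    m = sum(column) / len(column)
    sd = (sum((v - m) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - m) / sd for v in column]

def pool_test_split(n_samples, test_frac=0.2, seed=0):
    """Random split into an unlabelled pool (80%) and a held-out test set (20%)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = int(round(test_frac * n_samples))
    return idx[n_test:], idx[:n_test]  # pool indices, test indices

z = standardise([1.0, 2.0, 3.0, 4.0])
pool_idx, test_idx = pool_test_split(10)
```

In practice the same transformation would be fitted on the pool and applied to the test set; Scikit-learn's `StandardScaler` performs the equivalent scaling.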

Comparison methods
We selected related work as comparison methods.
• Random Sampling: the baseline for our method.
• QBC: a classical uncertainty-based method [28]. It first bootstraps samples with replacement from the labelled set to train each committee member. Each AL iteration selects the sample with the maximum disagreement (variance) among the committee members. In our experiment, we transitioned the method from sequential to batch-mode AL by selecting a batch of unlabelled samples with the maximum disagreement among the committee members.
• Greedy Sampling (Greedy): improved greedy sampling (iGS) [30] aims to select samples with the largest diversity in both the feature space and the output space. In our experiment, we transitioned the method from sequential to batch-mode AL by selecting a batch of unlabelled samples with the largest minimum distance to the labelled set. For a fair comparison of the AL acquisition function, we use the same random initialisation as the other methods.

• Largest Cluster Maximum Distance (LCMD): described as greedy distance maximisation in the largest cluster [51]. It uses the Neural Tangent Kernel (NTK) [52] as a base kernel for transformation and has been shown to improve the state of the art in batch-mode deep AL for regression.
• OMAL: the method proposed herein, shown in algorithm 3.
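Two of the comparison scores above are simple enough to sketch directly. The sketch below shows the QBC disagreement (variance of committee predictions) and one common iGS-style formulation (minimum over labelled samples of the product of feature-space and output-space distances, here in one dimension); the numeric values are toy data, not the paper's experiments.

```python
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def qbc_scores(committee_preds):
    """QBC: disagreement = variance of the committee's predictions per sample."""
    return [variance(preds) for preds in committee_preds]

def igs_scores(unlabelled_x, pred_y, labelled_x, labelled_y):
    """iGS-style score: min over labelled samples of (feature dist * label dist)."""
    return [min(abs(x - xl) * abs(y - yl)
                for xl, yl in zip(labelled_x, labelled_y))
            for x, y in zip(unlabelled_x, pred_y)]

# Two committee members, each predicting for three unlabelled samples.
committee = [[1.0, 1.1, 0.9], [2.0, 0.5, 3.5]]
per_sample = list(zip(*committee))          # transpose to per-sample predictions
qbc = qbc_scores(per_sample)
best_qbc = max(range(len(qbc)), key=qbc.__getitem__)

igs = igs_scores([0.2, 1.5, 3.0], [0.4, 1.4, 2.9], [0.0, 1.0], [0.1, 1.2])
best_igs = max(range(len(igs)), key=igs.__getitem__)
```

Both strategies then annotate the highest-scoring sample(s); batch-mode variants, as used in the experiments, simply take the top-scoring batch.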

Experimental setup
We used the same 20 random splits of each data set for all the comparison methods. We used PyTorch [53], Skorch [54], and Scikit-learn [55] to implement our method. The base model for our method is an ensemble of 3 fully connected neural networks (NNs), each a standard multi-layer perceptron with hidden layer sizes l_1 = 50, l_2 = 30, l_3 = 15. Each ensemble member has the same structure. The activation function is the exponential linear unit [56] because it can prevent issues such as vanishing and exploding gradients, providing non-zero gradient norms for the meta selector. The optimiser is Adam [57], and the mini-batch size is 200. If the training set has fewer than 200 samples, the full batch is used for training in the early AL iterations of the smaller data sets. The loss function is the mean squared error (MSE), and the initial learning rate is 1 × 10^-3 with dynamic learning-rate reduction. The selector for our method is a random forest with 1000 trees. The base models for LCMD, Greedy, and QBC are fully connected NNs with the same architecture as our method; LCMD and greedy sampling use a single NN model, and QBC uses an ensemble model with 3 members, as in our method. The hyper-parameters of LCMD are the same as in their method [51], using LCMD-TP mode with a finite-width NTK transformation k_grad→sketch(512). For LCMD, we applied our NN model, implemented with PyTorch linear layers, to their provided function, as shown in the example in their code repository.
AL experts may determine the number of samples used for initialisation based on their experience in real applications or on the availability of existing labelled samples. Using a small initial set reduces budget waste when samples are selected randomly at the beginning; however, this may leave the model unable to provide reliable predictions for AL methods, which presents a significant challenge in real-world scenarios. Addressing this use case while working with data sets of varying sizes, we used a very small number of labelled samples, 3% of the unlabelled pool, as a unified setting for the initialisation process (a total of two stages). This equates to between 9 and 39 labelled samples, as summarised in table 1. For each trial, the same randomly selected initialisation samples were used for each AL method. In our approach, we set m′ = 5 for initialising the ensemble model in Initialisation Stage 1; m′′ is the remainder of the initial samples, which are used in Initialisation Stage 2 to initialise the meta set. The number of AL iterations T for all data sets is set to 25.
To compare methods fairly, we used the same random forest model with 100 trees after each AL selection iteration in each experiment to train on the updated labelled set, evaluate the model performance on the test sets, and plot the diagrams (figures 3, 4, 5 and 6). The purpose of the base model in OMAL is to provide the meta set with continuous learning states during the AL iterations. To save training time and obtain continuous model states, we used warm-start training for the base model. The base model is used for guiding sampling rather than for optimal predictive results, so warm-start training is preferable to cold-start for obtaining the continuous learning states. For our OMAL method, the maximum number of training epochs is 30 for all experiments. We performed a random internal split and used 20% of the labelled set as the validation set for early stopping; training stops when the validation results do not improve within 5 epochs. For the other AL comparison methods, we applied the more general cold-start training, with a maximum of 100 training epochs for all experiments and early stopping after 20 epochs.
In addition, we conducted two sets of extended experiments for the AutoMPG and NO2 data sets because of the suboptimal performance of our method with just 3% of the unlabelled pool as initial samples. Here, we instead used 10% of the unlabelled pool for the initialisation process. In our approach, we still set m′ = 5 for initialising the ensemble model; m′′ is the remainder of the initial samples, used to initialise the meta set.

Performance evaluation
We use different metrics to evaluate the experimental results. For each data set, we use the root mean squared error (RMSE) and the coefficient of determination (R²) on the test set after each AL iteration for plotting. For k samples in total in the test set, with y_i the ground-truth label of sample x_i, ŷ_i the predicted label value, and ȳ the averaged label value, RMSE and R² are denoted as

RMSE = sqrt((1/k) Σ_{i=1}^{k} (y_i − ŷ_i)²),  R² = 1 − Σ_{i=1}^{k} (y_i − ŷ_i)² / Σ_{i=1}^{k} (y_i − ȳ)².

We plot the number of queried samples and the percentage of queried samples in the unlabelled pool as the two axes in figures 3 and 4. Figure 3 starts from around 20% of the unlabelled pool to show more detailed comparisons among AL methods; the data sets differ slightly from the exact 20% because of the different numbers of initial samples and AL batch sizes. The budget usually determines the stopping criterion for AL in practice, and here we demonstrate 25 AL iterations for all data sets; thus, the stopping point lies between around 70% and 80% across data sets. Figure 3 shows the averaged R², and figure 4 shows the averaged RMSE over the 20 experiments. The RMSE in figure 4 is plotted with the standard error σ/√20 as shaded areas, where σ is the standard deviation. The extended experiments for the AutoMPG and NO2 data sets are shown in figure 5.
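The two metrics above can be computed directly from their definitions (a minimal sketch with toy values; in practice Scikit-learn's `mean_squared_error` and `r2_score` compute the same quantities):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over the k test samples."""
    k = len(y_true)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / k)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]        # toy ground-truth test labels
yhat = [1.1, 1.9, 3.2, 3.8]     # toy model predictions
```

A perfect predictor gives RMSE = 0 and R² = 1; R² can be negative when the model performs worse than predicting the mean label.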

Experimental results analysis
Figure 3 shows the first performance comparisons among AL methods across the different data sets. This figure highlights the difficulty of choosing the best-performing AL method in advance. Excluding our method, greedy sampling outperforms the state-of-the-art LCMD method on the Diabetes and Silver Nanoparticle data sets, while LCMD shows the strongest performance on the Red Wine, AutoMPG and Graphene Oxide data sets. QBC almost matches the LCMD performance on Graphene Oxide, exceeding greedy sampling. Our OMAL method approximately meets or exceeds the top-performing comparison method (whether greedy sampling or LCMD) on the Diabetes, Red Wine and Silver Nanoparticle data sets.
On the NO2 data set, no method significantly exceeds random sampling. Our OMAL method performs worse than random sampling on NO2 and on par with random sampling on the AutoMPG and Graphene Oxide data sets. OMAL has a severe cold-start problem on the NO2 data set, but once approximately 50% of the unlabelled pool is labelled, its performance starts to approach that of the other methods. This indicates that the extremely small initialisation sets (9 samples for AutoMPG and 12 for NO2) fall below a lower bound for acceptable OMAL initialisation. AutoMPG is the smallest data set and consequently has the smallest initialisation set. As summarised in table 1, these two data sets are both small and have the fewest features (7). An absolute lower bound on the initialisation set should therefore be considered when applying OMAL to similar data sets.
We therefore increased the number of initially labelled samples from 3% of the unlabelled pool to 10% for the AutoMPG and NO2 data sets, as shown in figure 5. This is an increase from 9 (12) to 31 (40) initially labelled samples for AutoMPG (NO2), but it remains a very small initialisation budget in absolute terms. We first compare OMAL and LCMD in these experiments. The top two diagrams of figure 5 demonstrate that with 10% of the unlabelled pool as initialisation, the performance of OMAL is close to that of LCMD, which shows the best performance on the AutoMPG data set and performs similarly to the other methods on the NO2 data set in figure 3. OMAL improves more than LCMD when the initial labelled data increases from 3% to 10%. Taking the updated initialisation setting into account, we can summarise the best-performing method(s) for each data set, based on figures 3 and 4 and the bottom two diagrams of figure 5. The Diabetes data set has the most noise (very wide error bars in figure 4(a)), but OMAL and greedy sampling display relatively stronger performance, while QBC matches them once more than around 40% of the pool is labelled. OMAL and LCMD show the strongest performance for the Red Wine data set, while for Silver Nanoparticles, OMAL and greedy sampling do. In figure 5, all methods perform similarly for the NO2 data set, while for AutoMPG, the remaining methods outperform random sampling but are otherwise comparable. Finally, LCMD and QBC have the strongest performance for the Graphene Oxide data. It is thus very difficult to determine a priori the best AL approach for these challenging small data sets. Summarising the six data sets, our OMAL method meets or outperforms all methods on 5/6 data sets, with worse performance on the Graphene Oxide data. For comparison, LCMD meets or exceeds the performance of all other methods on 4/6 data sets, greedy sampling on 4/6 and QBC on 4/6.

Ablation study
We also conducted an ablation study to verify the relationships between the meta-features and the sample gradient norms. First, we changed our experiments to query the samples with the smallest gradient norms (Inverse Gradient Norms in figure 6). The results show significantly impaired model training for all data sets, strongly supporting the accuracy of our selector's predictions: the model cannot learn effectively when trained on these samples, and the results are much worse than random sampling. In addition, we removed the model states from the meta-features (Removed Model States in figure 6), using only the data feature vectors of unlabelled samples to predict the sample gradient norms. Our data sets were chosen to cover a broad range of feature dimensionalities, given that the data features are included in our meta-features. When a data set has more feature dimensions, the gap between the OMAL results and those of the experiments with removed model states becomes smaller. The dimensionality of the Silver Nanoparticle and Graphene Oxide data sets is relatively higher than that of the other data sets, so the gaps between the OMAL and ablation results are less pronounced.
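The Inverse Gradient Norms ablation simply flips the acquisition rule. A minimal sketch of both directions (assuming a one-dimensional array of predicted gradient norms; the function name is ours):

```python
import numpy as np

def select_by_gradient_norm(pred_norms, batch_size, largest=True):
    """Return the indices of the batch_size samples with the largest
    (normal acquisition) or smallest (inverse ablation) predicted
    gradient norms."""
    order = np.argsort(pred_norms)  # indices sorted by ascending norm
    return order[-batch_size:] if largest else order[:batch_size]

norms = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
top = select_by_gradient_norm(norms, 2, largest=True)      # normal rule
bottom = select_by_gradient_norm(norms, 2, largest=False)  # inverse ablation
```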

Visualization analysis
The numerical label distribution of the experimental results and the prediction for each data sample are also valuable for better understanding the method. Because of the page limit, we only show the random and OMAL results for the Diabetes data set in figure 7; queried label distribution plots for the other data sets can be found in the Supplementary data.
The results in figure 7 are all from the 20th AL iteration. The top two diagrams show the numerical label distribution of the samples selected by the random and OMAL methods in the same single experiment on the Diabetes data set. The predictions in the bottom two diagrams were generated by the random forest with 100 trees that we used for plotting figures 3 and 4: we first trained the random forest model on the whole unlabelled pool of the single experiment and then made predictions for each training sample. Queried samples are plotted as purple dots and unselected samples in the unlabelled pool as grey dots. The results demonstrate that the random sampling distribution is consistent with the original distribution in the unlabelled pool. In the subset queried by OMAL, however, the label distribution becomes more balanced, increasing the proportion of sparsely populated regions and decreasing that of densely populated regions. From the two bottom diagrams, we can observe that, compared with random sampling, OMAL queried more samples in the area where samples are sparsely distributed (the top right) and fewer in the area where samples are densely distributed (the bottom left).

Discussion
To the best of our knowledge, no MAL method was previously designed for small data without offline learning, because the lack of information for meta-learning is challenging. Compared with the LAL method [36], our method does not need offline learning to accumulate the meta set from other labelled data sources and relies only on the current data set domain. Our method does not need to generate synthetic data sets either. The online learning mode is more flexible: it can stop collecting the meta set and switch to an online prediction process once the selector has achieved good predictive performance. We then only need the well-trained selector to predict the gradient norm of each unlabelled sample. In addition, our method can be applied to both regression and classification problems without substantial changes. LAL, in contrast, is difficult to apply to regression problems because its handcrafted meta-features are designed to represent classification problems in a general way, for example via the proportion of each class in the labelled set; it is hard to capture such uniform patterns from synthetic data in regression.
The experimental results show that the selector can learn from our designed meta set. The selector can make accurate predictions even with only a handful or a few dozen meta instances. This shows that our meta set design, with its general feature representation for MAL, helps to narrow the problem domain. The limitation is that when the data features have considerably higher dimensions (for example, the Graphene Oxide data set), the selector may find it more difficult to learn the AL acquisition. This may be addressed by feature engineering and is a topic for further research.
Compared with methods that rely on model performance for prediction in AL iterations, our method remains applicable when the initial data is insufficient. With extremely small initial data (3% of the unlabelled pool) on small-scale data sets, our method maintains competitive performance in some cases. In the suboptimal cases, increasing the initial labelled data to 10% of the unlabelled pool brings the performance of our method close to that of the other AL methods. Moreover, our method tends to query a more balanced subset, which changes the original data distribution in a way that benefits model learning. Our method may also suit scenarios where real-time data is collected and processed continuously. MAL can make more accurate and efficient decisions in such scenarios based on the accumulated meta-knowledge, and our method can provide up-to-date decisions by continuously learning from new data in real time. Compared with other model-driven AL methods, it can quickly adapt to changing data distributions and patterns by selecting a subset from the accumulated meta set.

Conclusion
We designed a novel and flexible online MAL method suitable for the smaller data sets typically seen in science and technology, which can be adapted to situations with very few initially labelled samples. We combined the gradient norms of the hidden-layer parameters of an NN with the feature vectors of the data samples as meta-learning features to predict the gradient norm of each unlabelled sample, and we select the samples with the largest predicted gradient norms for annotation. Experimental results demonstrate the feasibility of our method across different data sets. In practical scientific experiments, when the data is not large and the annotation cost is high, our method can be used as a general AL method to guide experimental design. We therefore conclude that our method will be useful to scientists who need to limit their data generation to only the most essential samples while avoiding selection bias.
This method opens several directions for future work, such as designing feature extraction or feature selection methods for meta-learning, applying the approach to other network structures (CNNs, RNNs, etc) to handle different tasks, and setting conditions for switching between the online and offline learning modes to maximise computing efficiency on larger data sets.

Figure 1.
Figure 1. Pool-based AL pipeline. Annotated data with known labels is shown as filled (black) bins, and unlabelled feature data is shown as unfilled bins. AL is initialised with a large pool of unlabelled data (right) and a small set of labelled data (left). Oracles represent experiments, measurements or simulations performed to obtain numerical or categorical labels when annotating data. The acquisition strategy (selector) ranks all unlabelled data. The base model is the selected predictive model. A single unlabelled data sample or a batch of samples is selected for annotation in each AL iteration.

Algorithm 2.
AL acquisition function.

Figure 2.
Figure 2. The pipeline of our online meta-active learning method (OMAL). Labelled data is represented by filled bins, unlabelled data by unfilled bins. The yellow arrow indicates the flow of meta data, the grey arrows indicate the AL query process, and the purple arrows indicate the predictive model's data flow. The selector can be any applicable model, and the base model is the chosen predictive ML model.

Figure 3.
Figure 3. Experimental results of averaged R² on the test set. The horizontal dashed line represents the results when the full unlabelled pool is labelled as the training set. The top horizontal axis refers to the percentage of the labelled set in the unlabelled pool. The bottom horizontal axis refers to the number of labelled samples.

Figure 4.
Figure 4. Experimental results of averaged RMSE with standard error on the test set. The horizontal dashed line represents the results when the full unlabelled pool is labelled as the training set. The top horizontal axis refers to the percentage of the labelled set in the unlabelled pool. The bottom horizontal axis refers to the number of labelled samples.

Figure 5.
Figure 5. The top two diagrams are the extended experiment results of averaged R² on the test set, comparing the LCMD and OMAL methods initialised with 10% versus 3% of the unlabelled pool; the x-axis refers to the AL iterations. The bottom two diagrams are the extended experiment results of averaged RMSE with standard error on the test set, initialised with 10% of the unlabelled pool; the two horizontal axes have the same meaning as in figures 3 and 4. The horizontal dashed line in all diagrams represents the results when the full unlabelled pool is labelled as the training set.

Figure 6.
Figure 6. The ablation study results, showing averaged R² with standard error on the test set. The results for the AutoMPG and NO2 data sets are initialised with 10% of the unlabelled pool, the same as in the extended experiments. Results for the other data sets are initialised with 3% of the unlabelled pool.

Figure 7.
Figure 7. Analysis results of the total queried samples at the 20th AL iteration. The top two diagrams show the numerical label distribution of the samples queried by Random and OMAL on the Diabetes data set. The bottom two diagrams show the true values and predictions of the queried and unselected samples for Random and OMAL on the Diabetes data set.
Algorithm 3. Using selector R to predict the meta-target δ_i.