Machine Learning: Science and Technology is a multidisciplinary open access journal that bridges the application of machine learning across the sciences with advances in machine learning methods and theory as motivated by physical insights.
Most read
Ross Irwin et al 2022 Mach. Learn.: Sci. Technol. 3 015022
Transformer models coupled with the simplified molecular-input line-entry system (SMILES) have recently proven to be a powerful combination for solving challenges in cheminformatics. These models, however, are often developed specifically for a single application and can be very resource-intensive to train. In this work we present the Chemformer model, a Transformer-based model which can be quickly applied to both sequence-to-sequence and discriminative cheminformatics tasks. Additionally, we show that self-supervised pre-training can improve performance and significantly speed up convergence on downstream tasks. On direct synthesis and retrosynthesis prediction benchmark datasets we publish state-of-the-art results for top-1 accuracy. We also improve on existing approaches for a molecular optimisation task and show that Chemformer can optimise on multiple discriminative tasks simultaneously. Models, datasets and code will be made available after publication.
Ivan S Novikov et al 2021 Mach. Learn.: Sci. Technol. 2 025002
The subject of this paper is the technology (the 'how') of constructing machine-learning interatomic potentials, rather than science (the 'what' and 'why') of atomistic simulations using machine-learning potentials. Namely, we illustrate how to construct moment tensor potentials using active learning as implemented in the MLIP package, focusing on the efficient ways to automatically sample configurations for the training set, how expanding the training set changes the error of predictions, how to set up ab initio calculations in a cost-effective manner, etc. The MLIP package (short for Machine-Learning Interatomic Potentials) is available at https://mlip.skoltech.ru/download/.
Mario Krenn et al 2020 Mach. Learn.: Sci. Technol. 1 045024
The discovery of novel materials and functional molecules can help to solve some of society's most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel pharmaceutical drug candidates. Traditionally, matter engineering, generally denoted as inverse design, was based heavily on human intuition and high-throughput virtual screening. The last few years have seen the emergence of significant interest in computer-inspired designs based on evolutionary or deep learning methods. The major challenge here is that the standard string-based molecular representation, SMILES, shows substantial weaknesses in that task because large fractions of strings do not correspond to valid molecules. Here, we solve this problem at a fundamental level and introduce SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules which is 100% robust. Every SELFIES string corresponds to a valid molecule, and SELFIES can represent every molecule. SELFIES can be directly applied in arbitrary machine learning models without the adaptation of the models; each of the generated molecule candidates is valid. In our experiments, the model's internal memory stores two orders of magnitude more diverse molecules than a similar test with SMILES. Furthermore, as all molecules are valid, it allows for explanation and interpretation of the internal working of the generative models.
Moritz Hoffmann et al 2022 Mach. Learn.: Sci. Technol. 3 015009
Generation and analysis of time-series data is relevant to many quantitative fields ranging from economics to fluid mechanics. In the physical sciences, structures such as metastable and coherent sets, slow relaxation processes, collective variables, dominant transition pathways or manifolds and channels of probability flow can be of great importance for understanding and characterizing the kinetic, thermodynamic and mechanistic properties of the system. Deeptime is a general purpose Python library offering various tools to estimate dynamical models based on time-series data including conventional linear learning methods, such as Markov state models (MSMs), Hidden Markov Models and Koopman models, as well as kernel and deep learning approaches such as VAMPnets and deep MSMs. The library is largely compatible with scikit-learn, having a range of Estimator classes for these different models, but in contrast to scikit-learn also provides deep Model classes, e.g. in the case of an MSM, which provide a multitude of analysis methods to compute interesting thermodynamic, kinetic and dynamical quantities, such as free energies, relaxation times and transition paths. The library is designed for ease of use but also easily maintainable and extensible code. In this paper we introduce the main features and structure of the deeptime software. Deeptime can be found under https://deeptime-ml.github.io/.
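To make the Markov state model idea in the abstract above concrete, the following is a minimal sketch of the simplest possible estimator: count transitions in a discretised trajectory at a fixed lag time and row-normalise the counts. It is not the deeptime API; the function name `estimate_transition_matrix` and the toy trajectory are assumptions made for illustration only, and the library's own estimators may handle reversibility, connectivity and statistics quite differently.

```python
from collections import Counter


def estimate_transition_matrix(dtraj, n_states, lag=1):
    """Row-normalised transition counts of a discrete trajectory at the given lag.

    Hypothetical helper for illustration; not part of deeptime.
    """
    # Count every observed jump (state at time t, state at time t + lag).
    counts = Counter(zip(dtraj[:-lag], dtraj[lag:]))
    matrix = []
    for i in range(n_states):
        row = [counts[(i, j)] for j in range(n_states)]
        total = sum(row)
        # A state with no observed outgoing transitions gets an all-zero row
        # instead of triggering a division by zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy two-state trajectory (made-up data).
dtraj = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
T = estimate_transition_matrix(dtraj, n_states=2)
for row in T:
    print(row)
```

Each row of `T` sums to one whenever the corresponding state has at least one outgoing transition, so the rows can be read as conditional jump probabilities at the chosen lag.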
Jan Weinreich et al 2023 Mach. Learn.: Sci. Technol. 4 025017
Large machine learning (ML) models with improved predictions have become widely available in the chemical sciences. Unfortunately, these models do not protect the privacy necessary within commercial settings, prohibiting the use of potentially extremely valuable data by others. Encrypting the prediction process can solve this problem by double-blind model evaluation and prohibits the extraction of training or query data. However, contemporary ML models based on fully homomorphic encryption or federated learning are either too expensive for practical use or have to trade higher speed for weaker security. We have implemented secure and computationally feasible encrypted ML models using oblivious transfer, enabling secure predictions of molecular quantum properties across chemical compound space. However, we find that encrypted predictions using kernel ridge regression models are a million times more expensive than without encryption. This demonstrates a dire need for a compact ML model architecture, including molecular representation and kernel matrix size, that minimizes model evaluation costs.
Philippe Schwaller et al 2021 Mach. Learn.: Sci. Technol. 2 015016
Artificial intelligence is driving one of the most important revolutions in organic chemistry. Multiple platforms, including tools for reaction prediction and synthesis planning based on machine learning, have successfully become part of the organic chemists' daily laboratory, assisting in domain-specific synthetic problems. Unlike reaction prediction and retrosynthetic models, the prediction of reaction yields has received less attention in spite of the enormous potential of accurately predicting reaction conversion rates. Reaction yield models, describing the percentage of the reactants converted to the desired products, could guide chemists and help them select high-yielding reactions and score synthesis routes, reducing the number of attempts. So far, yield predictions have been predominantly performed for high-throughput experiments using a categorical (one-hot) encoding of reactants, concatenated molecular fingerprints, or computed chemical descriptors. Here, we extend the application of natural language processing architectures to predict reaction properties given a text-based representation of the reaction, using an encoder transformer model combined with a regression layer. We demonstrate outstanding prediction performance on two high-throughput experiment reaction sets. An analysis of the yields reported in the open-source USPTO data set shows that their distribution differs depending on the mass scale, limiting the data set's applicability in reaction yield predictions.
Jonathan Shlomi et al 2021 Mach. Learn.: Sci. Technol. 2 021001
Particle physics is a branch of science aiming at discovering the fundamental laws of matter and forces. Graph neural networks are trainable functions which operate on graphs—sets of elements and their pairwise relations—and are a central method within the broader field of geometric deep learning. They are very expressive and have demonstrated superior performance to other classical deep learning approaches in a variety of domains. The data in particle physics are often represented by sets and graphs and as such, graph neural networks offer key advantages. Here we review various applications of graph neural networks in particle physics, including different graph constructions, model architectures and learning objectives, as well as key open problems in particle physics for which graph neural networks are promising.
BL DeCost et al 2020 Mach. Learn.: Sci. Technol. 1 033001
Recently there has been an ever-increasing trend in the use of machine learning (ML) and artificial intelligence (AI) methods by the materials science, condensed matter physics, and chemistry communities. This perspective article identifies key scientific, technical, and social opportunities that the materials community must prioritize to consistently develop and leverage Scientific AI (SciAI) to provide a credible path towards the advancement of current materials-limited technologies. Here we highlight the intersections of these opportunities with a series of proposed paths forward. The opportunities are roughly sorted from scientific/technical (e.g. development of robust, physically meaningful multiscale material representations) to social (e.g. promoting an AI-ready workforce). The proposed paths forward range from developing new infrastructure and capabilities to deploying them in industry and academia. We provide a brief introduction to AI in materials science and engineering, followed by detailed discussions of each of the opportunities and paths forward.
Yongcheng Ding et al 2023 Mach. Learn.: Sci. Technol. 4 025020
The exotic nature of quantum mechanics differentiates machine learning applications in the quantum realm from classical ones. Stream learning is a powerful approach that can be applied to extract knowledge continuously from quantum systems in a wide range of tasks. In this paper, we propose a deep reinforcement learning method that uses streaming data from a continuously measured qubit in the presence of detuning, dephasing, and relaxation. The model receives streaming quantum information for learning and decision-making, providing instant feedback on the quantum system. We also explore the agent's adaptability to other quantum noise patterns through transfer learning. Our protocol offers insights into closed-loop quantum control, potentially advancing the development of quantum technologies.
Stefano Mensa et al 2023 Mach. Learn.: Sci. Technol. 4 015023
Machine learning for ligand-based virtual screening (LB-VS) is an important in-silico tool for discovering new drugs in a faster and cost-effective manner, especially for emerging diseases such as COVID-19. In this paper, we propose a general-purpose framework combining a classical Support Vector Classifier algorithm with quantum kernel estimation for LB-VS on real-world databases, and we argue in favor of its prospective quantum advantage. Indeed, we heuristically prove that our quantum integrated workflow can, at least in some relevant instances, provide a tangible advantage compared to state-of-the-art classical algorithms operating on the same datasets, showing strong dependence on target and features selection method. Finally, we test our algorithm on IBM Quantum processors using ADRB2 and COVID-19 datasets, showing that hardware simulations provide results in line with the predicted performances and can surpass classical equivalents.
Latest articles
Mingtao Xia et al 2023 Mach. Learn.: Sci. Technol. 4 025024
Solving analytically intractable partial differential equations (PDEs) that involve at least one variable defined on an unbounded domain arises in numerous physical applications. Accurately solving unbounded domain PDEs requires efficient numerical methods that can resolve the dependence of the PDE on the unbounded variable over at least several orders of magnitude. We propose a solution to such problems by combining two classes of numerical methods: (i) adaptive spectral methods and (ii) physics-informed neural networks (PINNs). The numerical approach that we develop takes advantage of the ability of PINNs to easily implement high-order numerical schemes to efficiently solve PDEs and extrapolate numerical solutions at any point in space and time. We then show how recently introduced adaptive techniques for spectral methods can be integrated into PINN-based PDE solvers to obtain numerical solutions of unbounded domain problems that cannot be efficiently approximated by standard PINNs. Through a number of examples, we demonstrate the advantages of the proposed spectrally adapted PINNs in solving PDEs and estimating model parameters from noisy observations in unbounded domains.
Calin-Andrei Pantis-Simut et al 2023 Mach. Learn.: Sci. Technol. 4 025023
Accurate and efficient tools for calculating the ground state properties of interacting quantum systems are essential in the design of nanoelectronic devices. The exact diagonalization method fully accounts for the Coulomb interaction beyond mean field approximations and it is regarded as the gold-standard for few electron systems. However, by increasing the number of instances to be solved, the computational costs become prohibitive and new approaches based on machine learning techniques can provide a significant reduction in computational time and resources, maintaining a reasonable accuracy. Here, we employ pix2pix, a general-purpose image-to-image translation method based on conditional generative adversarial network (cGAN), for predicting ground state densities from randomly generated confinement potentials. Other mappings were also investigated, like potentials to non-interacting densities and the translation from non-interacting to interacting densities. The architecture of the cGAN was optimized with respect to the internal parameters of the generator and discriminator. Moreover, the inverse problem of finding the confinement potential given the interacting density can also be approached by the pix2pix mapping, which is an important step in finding near-optimal solutions for confinement potentials.
Sergei V Kalinin et al 2023 Mach. Learn.: Sci. Technol. 4 023001
We posit that microscopy offers an ideal real-world experimental environment for the development and deployment of active Bayesian and reinforcement learning methods. Indeed, the tremendous progress achieved by machine learning (ML) and artificial intelligence over the last decade has been largely achieved via the utilization of static data sets, from the paradigmatic MNIST to the bespoke corpora of text and image data used to train large models such as GPT3, DALL·E and others. However, it is now recognized that continuous, minute improvements to state-of-the-art do not necessarily translate to advances in real-world applications. We argue that a promising pathway for the development of ML methods is via the route of domain-specific deployable algorithms in areas such as electron and scanning probe microscopy and chemical imaging. This will both benefit fundamental physical studies and serve as a test bed for more complex autonomous systems such as robotics and manufacturing. Favorable environment characteristics of scanning and electron microscopy include low risk, extensive availability of domain-specific priors and rewards, relatively small effects of exogenous variables, and often the presence of both upstream first principles as well as downstream learnable physical models for both statics and dynamics. Recent developments in programmable interfaces, edge computing, and access to application programming interfaces (APIs) facilitating microscope control all render the deployment of ML codes on operational microscopes straightforward. We discuss these considerations and hope that these arguments will lead to the creation of a novel set of development targets for the ML community, accelerating both real-world ML applications and scientific progress.
Shawn G Rosofsky et al 2023 Mach. Learn.: Sci. Technol. 4 025022
We present a critical analysis of physics-informed neural operators (PINOs) to solve partial differential equations (PDEs) that are ubiquitous in the study and modeling of physics phenomena using carefully curated datasets. Further, we provide a benchmarking suite which can be used to evaluate PINOs in solving such problems. We first demonstrate that our methods reproduce the accuracy and performance of other neural operators published elsewhere in the literature to learn the 1D wave equation and the 1D Burgers equation. Thereafter, we apply our PINOs to learn new types of equations, including the 2D Burgers equation in the scalar, inviscid and vector types. Finally, we show that our approach is also applicable to learn the physics of the 2D linear and nonlinear shallow water equations, which involve three coupled PDEs. We release our artificial intelligence surrogates and scientific software to produce initial data and boundary conditions to study a broad range of physically motivated scenarios. We provide the source code, an interactive website to visualize the predictions of our PINOs, and a tutorial for their use at the Data and Learning Hub for Science.
Agughasi Victor Ikechukwu and Murali S 2023 Mach. Learn.: Sci. Technol. 4 025021
Automatic identification of salient features in large medical datasets, particularly in chest x-ray (CXR) images, is a crucial research area. Accurately detecting critical findings such as emphysema, pneumothorax, and chronic bronchitis can aid radiologists in prioritizing time-sensitive cases and screening for abnormalities. However, traditional deep neural network approaches often require bounding box annotations, which can be time-consuming and challenging to obtain. This study proposes an explainable ensemble learning approach, CX-Net, for lung segmentation and diagnosing lung disorders using CXR images. We compare four state-of-the-art convolutional neural network models, including feature pyramid network, U-Net, LinkNet, and a customized U-Net model with ImageNet feature extraction, data augmentation, and dropout regularizations. All models are trained on the Montgomery and VinDr-CXR datasets with and without segmented ground-truth masks. To achieve model explainability, we integrate SHapley Additive exPlanations (SHAP) and gradient-weighted class activation mapping (Grad-CAM) techniques, which enable a better understanding of the decision-making process and provide visual explanations of critical regions within the CXR images. By employing ensembling, our outlier-resistant CX-Net achieves superior performance in lung segmentation, with Jaccard overlap similarity of 0.992, Dice coefficients of 0.994, precision of 0.993, recall of 0.980, and accuracy of 0.976. The proposed approach demonstrates strong generalization capabilities on the VinDr-CXR dataset and is the first study to use these datasets for semantic lung segmentation with semi-supervised localization. In conclusion, this paper presents an explainable ensemble learning approach for lung segmentation and diagnosing lung disorders using CXR images. Extensive experimental results show that our method efficiently and accurately extracts regions of interest in CXR images from publicly available datasets, indicating its potential for integration into clinical decision support systems. Furthermore, incorporating SHAP and Grad-CAM techniques further enhances the interpretability and trustworthiness of the AI-driven diagnostic system.
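For readers unfamiliar with the overlap metrics quoted in the abstract above, here is a minimal sketch of how the Jaccard index and the Dice coefficient are conventionally computed for a pair of binary segmentation masks. The function name `overlap_scores`, the flat-list mask format, the empty-mask convention and the toy masks are all assumptions for illustration; the paper's own evaluation code is not shown in the abstract and may differ.

```python
def overlap_scores(pred, truth):
    """Return (jaccard, dice) for two equally sized binary masks given as flat 0/1 lists.

    Hypothetical helper for illustration only.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_area = sum(1 for p in pred if p)
    truth_area = sum(1 for t in truth if t)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks are empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    jaccard = intersection / union                       # |A ∩ B| / |A ∪ B|
    dice = 2 * intersection / (pred_area + truth_area)   # 2|A ∩ B| / (|A| + |B|)
    return jaccard, dice


# Toy 6-pixel masks (made-up data): 2 pixels overlap, 4 pixels in the union.
pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
jaccard, dice = overlap_scores(pred, truth)
print(jaccard, dice)
```

For a single pair of masks the two scores are tied together by dice = 2 * jaccard / (1 + jaccard), so the Dice coefficient is never smaller than the Jaccard index.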
Review articles
James Stokes et al 2023 Mach. Learn.: Sci. Technol. 4 021001
This article aims to summarize recent and ongoing efforts to simulate continuous-variable quantum systems using flow-based variational quantum Monte Carlo techniques, focusing for pedagogical purposes on the example of bosons in the field amplitude (quadrature) basis. Particular emphasis is placed on the variational real- and imaginary-time evolution problems, carefully reviewing the stochastic estimation of the time-dependent variational principles and their relationship with information geometry. Some practical instructions are provided to guide the implementation of a PyTorch code. The review is intended to be accessible to researchers interested in machine learning and quantum information science.
Bahram Jalali et al 2022 Mach. Learn.: Sci. Technol. 3 041001
The phenomenal success of physics in explaining nature and engineering machines is predicated on low dimensional deterministic models that accurately describe a wide range of natural phenomena. Physics provides computational rules that govern physical systems and the interactions of the constituents therein. Led by deep neural networks, artificial intelligence (AI) has introduced an alternate data-driven computational framework, with astonishing performance in domains that do not lend themselves to deterministic models such as image classification and speech recognition. These gains, however, come at the expense of predictions that are inconsistent with the physical world as well as computational complexity, with the latter placing AI on a collision course with the expected end of the semiconductor scaling known as Moore's Law. This paper argues how an emerging symbiosis of physics and AI can overcome such formidable challenges, thereby not only extending AI's spectacular rise but also transforming the direction of engineering and physical science.
April M Miksch et al 2021 Mach. Learn.: Sci. Technol. 2 031001
Recent advances in machine-learning interatomic potentials have enabled the efficient modeling of complex atomistic systems with an accuracy that is comparable to that of conventional quantum-mechanics based methods. At the same time, the construction of new machine-learning potentials can seem a daunting task, as it involves data-science techniques that are not yet common in chemistry and materials science. Here, we provide a tutorial-style overview of strategies and best practices for the construction of artificial neural network (ANN) potentials. We illustrate the most important aspects of (a) data collection, (b) model selection, (c) training and validation, and (d) testing and refinement of ANN potentials on the basis of practical examples. Current research in the areas of active learning and delta learning is also discussed in the context of ANN potentials. This tutorial review aims at equipping computational chemists and materials scientists with the required background knowledge for ANN potential construction and application, with the intention to accelerate the adoption of the method, so that it can facilitate exciting research that would otherwise be challenging with conventional strategies.
Wen Guan et al 2021 Mach. Learn.: Sci. Technol. 2 011003
Machine learning has been used in high energy physics (HEP) for a long time, primarily at the analysis level with supervised classification. Quantum computing was postulated in the early 1980s as a way to perform computations that would not be tractable with a classical computer. With the advent of noisy intermediate-scale quantum computing devices, more quantum algorithms are being developed with the aim of exploiting the capacity of the hardware for machine learning applications. An interesting question is whether there are ways to apply quantum machine learning to HEP. This paper reviews the first generation of ideas that use quantum machine learning on problems in HEP and provides an outlook on future applications.
Jeffrey M Ede 2021 Mach. Learn.: Sci. Technol. 2 011004
Deep learning is transforming most areas of science and technology, including electron microscopy. This review paper offers a practical perspective aimed at developers with limited familiarity. For context, we review popular applications of deep learning in electron microscopy. Next, we discuss the hardware and software needed to get started with deep learning and interface with electron microscopes. We then review neural network components, popular architectures, and their optimization. Finally, we discuss future directions of deep learning in electron microscopy.
Accepted manuscripts
Bacon et al
Broadband frequency output of gravitational-wave detectors is a non-stationary and non-Gaussian time series data stream dominated by noise populated by local disturbances and transient artifacts, which evolve on the same timescale as the gravitational-wave signals and may corrupt the astrophysical information. We study a denoising algorithm dedicated to exposing the astrophysical signals by employing a convolutional neural network in the encoder-decoder configuration, i.e. we apply the denoising procedure to coalescing binary black hole signals in the publicly available LIGO O1 time series strain data. The denoising convolutional autoencoder neural network is trained on a dataset of simulated astrophysical signals injected into the real detector's noise and a dataset of detector noise artifacts ("glitches"), and its fidelity is tested on real gravitational-wave events from O1 and O2 LIGO-Virgo observing runs.
Lin et al
Cell type identification using single-cell RNA sequencing (scRNA-seq) data is critical for understanding disease mechanisms and drug discovery. Cell clustering analysis has been widely studied in health research for rare tumor cell detection. In this study, we propose a Gaussian mixture model-based variational graph autoencoder on scRNA-seq data (scGMM-VGAE) that integrates a statistical clustering model to a deep learning algorithm to significantly improve the cell clustering performance. This model feeds a cell-cell graph adjacency matrix and a gene feature matrix into a graph variational autoencoder (VGAE) to generate latent data. These data are then used for cell clustering by the Gaussian mixture model (GMM) module. To optimize the algorithm, a designed loss function is derived by combining parameter estimates from the GMM and VGAE. We test the proposed method on four publicly available and three simulated datasets which contain many biological and technical zeros. The scGMM-VGAE outperforms four selected baseline methods on three evaluation metrics in cell clustering. By successfully incorporating GMM into deep learning VGAE on scRNA-seq data, the proposed method shows higher accuracy in cell clustering on scRNA-seq data. This improvement has a significant impact on detecting rare cell types in health research. All source codes used in this study can be found at https://github.com/ericlin1230/scGMM-VGAE.
Clarté et al
Being able to reliably assess not only the accuracy but also the uncertainty of models' predictions is an important endeavour in modern machine learning. Even if the model generating the data and labels is known, computing the intrinsic uncertainty after learning the model from a limited number of samples amounts to sampling the corresponding posterior probability measure. Such sampling is computationally challenging in high-dimensional problems, and theoretical results on heuristic uncertainty estimators in high dimensions are thus scarce. In this manuscript, we characterise uncertainty for learning from a limited number of samples of high-dimensional Gaussian input data and labels generated by the probit model. In this setting, the Bayesian uncertainty (i.e. the posterior marginals) can be asymptotically obtained by the approximate message passing algorithm, bypassing the canonical but costly Monte Carlo sampling of the posterior. We then provide a closed-form formula for the joint statistics between the logistic classifier, the uncertainty of the statistically optimal Bayesian classifier and the ground-truth probit uncertainty. The formula allows us to investigate calibration of the logistic classifier learning from a limited number of samples. We discuss how over-confidence can be mitigated by appropriately regularising.
Frising et al
We show how conditional generative neural networks can be used to efficiently find nanophotonic devices with desired properties, also known as inverse photonic design. Machine learning has emerged as a promising approach to overcome limitations imposed by the dimensionality and topology of the parameter space. Importantly, traditional optimization routines assume an invertible mapping between the design parameters and response. However, different designs may have comparable or even identical performance, confusing the optimization algorithm when performing inverse design. Our generative modeling approach provides the full distribution of possible solutions to the inverse design problem, including multiple solutions. We compare a commonly used conditional Variational Autoencoder (cVAE) and a conditional Invertible Neural Network (cINN) on a proof-of-principle nanophotonic problem, consisting of tailoring the transmission spectrum through a metallic film milled by subwavelength indentations. We show how cINNs have superior flexibility compared to cVAEs when dealing with multimodal device distributions.
Rajput et al
An artificial intelligence (AI) model's performance is strongly influenced by the input features. Therefore, it is vital to find the optimal feature set. This is even more crucial for the survival prediction of the glioblastoma multiforme (GBM) type of brain tumor. In this study, we identify the best feature set for predicting the survival days (SD) of GBM patients that outranks the state-of-the-art methodologies currently in use. The proposed approach is an end-to-end AI model. This model first segments tumors from healthy brain parts in patients' MRI images, extracts features from the segmented results, performs feature selection, and makes predictions about patients' survival days based on the features selected. The extracted features are primarily shape-based, location-based, and radiomics-based features. Patient metadata is also included as a feature. The methods used for selecting features include recursive feature elimination (RFE), permutation importance (PI), and finding the correlation between the features. Finally, we examined feature behavior at local (single sample) and global (all the samples) levels. In this study, we find that out of 1265 extracted features, only 29 dominant features play a crucial role in predicting patients' survival days (SD). Furthermore, we find explanations of these features using post-hoc interpretability methods to validate the model's robust prediction. Finally, we analysed the behavioural impact of the top six features on survival prediction, and the findings drawn from the explanations were coherent with medical facts. We find that after the age of 50 years, the likelihood of survival of a patient deteriorates, and survival after 80 years is scarce. For location-based features, the SD is lower if the tumor location is in the central or back part of the brain. The results show an overall 33% improvement in the accuracy of SD prediction compared to the top-performing methods of the BraTS-2020 challenge.
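As an illustration of one of the feature-selection methods named in the abstract above, the following is a minimal sketch of permutation importance: shuffle one feature column at a time and record how much a fixed model's score drops. Everything here is assumed for illustration, including the function names, the use of classification accuracy as the score, the toy data and the stand-in `predict` function; the abstract does not describe the authors' implementation, model or scoring metric.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only this column, leaving every other feature untouched.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (made up): the label depends only on feature 0; feature 1 is noise.
X = [[0, 5], [1, 3], [0, 8], [1, 1], [0, 2], [1, 7], [0, 4], [1, 6]]
y = [row[0] for row in X]


def predict(row):
    """Stand-in for a fitted model that happens to use feature 0 only."""
    return row[0]


scores = permutation_importance(predict, X, y)
print(scores)
```

A feature the model ignores scores exactly zero under this scheme, while shuffling a feature the model relies on lowers the score; ranking features by that drop is what allows a large extracted set to be pruned to a small dominant subset.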