Application of machine learning in MOFs for gas adsorption and separation

Metal–organic frameworks (MOFs), with high specific surface area, permanent porosity and extreme modifiability, had great potential for gas storage and separation applications. Considering the theoretically nearly infinite variety of MOFs, it was difficult but necessary to achieve high-throughput computational screening (HTCS) of high-performance MOFs for specific applications. Machine learning (ML) was a field of computer science, one research direction of which was the effective use of information in a big data environment, focusing on obtaining hidden, valid and understandable knowledge from huge amounts of data; it had been widely used in materials research. This paper first briefly introduced the MOF databases and related ML algorithms, then reviewed in detail the research progress on ML-based HTCS of MOFs for gas storage and separation according to four classes of descriptors (geometrical, chemical, topological and energy-based), and finally presented a related outlook. This paper aimed to deepen readers’ understanding of ML-based MOF research, and to provide some inspiration and help for related research.


Introduction
Carbon dioxide and light hydrocarbons were important raw materials for energy gas and chemical products in the modern petrochemical industry, and held a very important strategic position. In the petrochemical industry, gas products in the form of mixtures could not be used directly for industrial applications, and needed to be further separated and purified to obtain high-quality products for subsequent research and application [1,2].
The traditional gas separation method was mainly cryogenic separation, which separated the hydrocarbon gases with higher boiling points during gradual cooling; however, this method required a large rectification tower with high energy consumption, which greatly increased the cost of separation [3][4][5][6]. In addition, low-temperature distillation [7] required multiple trays to achieve high reflux rates, which was extremely costly for industry. The solvent required for solvent-based extraction and adsorption [8][9][10] had to be consumed and regenerated, giving a poor separation effect and increasing energy consumption. As new porous materials, metal-organic frameworks (MOFs) offered high porosity [11,12], unique pore structures [4,13,14] and modifiable surfaces [15,16], which gave them huge adsorption and loading capacities.
MOFs presented the following advantages in the field of gas separation. (1) Specific functional groups can be fixed on the structure of MOFs through pre/post-synthetic modification, giving specific active sites on the surface of MOFs to improve affinity, thereby improving the adsorption capacity and selectivity for the target gas, and achieving the separation and purification of mixed gases [17][18][19]. (2) The pore size of MOFs can be precisely adjusted, and the interaction between the inner wall of the pore and the target gas can be selectively enhanced to achieve the adsorption and separation of the target gas [20]. (3) The synthesized MOFs can be modified to further subdivide the pores, so that the MOFs formed an interpenetrating structure with enhanced skeleton stability and interaction with the target gas [21]. For example, He et al reported a functionalized MOF of methylphthalamine for efficient separation of CH₄ from C₂H₂/CH₄ and CO₂/CH₄ mixtures. Compared with the traditional MOF, the C₂H₂/CH₄ and CO₂/CH₄ selectivities increased by 48.0% and 28.3%, respectively, which showed a better gas adsorption and separation effect [22].
Although the flexible self-assembly of MOFs gave them many advantages in properties and structure, it was almost impossible to explore MOFs exhaustively by traditional methods [23][24][25]. The emergence of computational materials science had largely overcome the time-consuming, costly and labor-intensive nature of traditional experimental analysis of MOFs [26][27][28][29][30][31]. It did not rely on high-end testing technology, expensive experimental equipment and frequent experimental verification, and it could study spatial scales from the micron level down to nanostructures, atomic images and even the electronic level [32,33]. In a computer simulation environment, theoretical and experimental research could be combined from macro to micro levels to study the properties of MOFs and to explore the evolution of materials under extreme conditions. These characteristics made computational materials science applicable not only to some extremely difficult experiments, but also to general experimental research as a theoretical guide and a means of verification [34].
The selection of the best-performing MOFs for specific application scenarios was a key step in conducting relevant research. In recent years, considerable progress had been made in research on MOFs for gas storage and separation [34][35][36] using density functional theory (DFT) [37] and grand canonical Monte Carlo (GCMC) simulation methods [38] to carry out high-throughput computational screening (HTCS). With the continuous development of hardware, improvement of computational methods, and increasing accuracy of force field parameters, DFT and GCMC simulations can accurately evaluate the performance of a single MOF in a few hours at most, compared to experiments that can take days or even weeks [39]. However, even so, given the structural diversity of MOFs, HTCS based on molecular simulations still required a high time cost and a significant workload [40].
The introduction of machine learning (ML) was an effective way to reduce the time cost of HTCS [41]. ML was a subfield of artificial intelligence that learned from the large amount of data available to determine known or even unknown physical laws, and then made decisions based on autonomous analysis [42,43]; it was suitable for solving complex problems involving a large number of nonlinear processes [44], and highly promising for the design, synthesis and characterization of materials [45]. This paper mainly reviewed the research progress on ML-based HTCS of MOFs for gas storage and separation, and presented future directions in this area from different perspectives: (1) construction of diverse databases based on transfer learning; (2) construction of unsupervised and reinforcement learning models with low sensitivity to datasets; (3) descriptor processing methods based on feature engineering.

MOFs databases
High-quality data sets were a prerequisite for accurate HTCS, and were even more important than algorithms [46]. MOF databases can be divided into two types based on the data sources: (1) hMOFs (hypothetical MOFs) databases, based on the idea of 'reticular chemistry' and consisting of possible MOFs designed by computer; (2) eMOFs (experimental MOFs) databases, consisting of experimentally synthesized MOFs. Table 1 displayed the MOF datasets employed in this study.
In 2011, based on the idea of 'reticular synthesis' [56], Wilmer et al [27] established the first hMOFs database from a library of 102 building blocks; it included 137953 possible MOFs constructed by self-assembly, together with the simulated pore volume, specific surface area, pore size distribution, and CH₄ uptake. Using similar algorithms, other hMOFs databases were also established for multiple research purposes, with different types and numbers of secondary building units (SBUs) and functional groups [47][48][49][50][51].
The Cambridge Structural Database (CSD) was a complete archive of all published organic, organometallic and metal-organic molecular structures whose unit cell parameters, atomic coordinates and refinement parameters had been determined experimentally by either single-crystal x-ray or neutron diffraction studies [57]. The CSD was the main data source of eMOFs databases and was widely used in computational screening studies of MOF materials for gas adsorption and separation. However, the crystal structures in the CSD may be problematic, for example through the inclusion of solvent molecules, disordered structures, and missing H atoms [52], which significantly affected the accuracy of HTCS.
Watanabe et al [52] selected over 30000 metal-organic components from the CSD and finally obtained 1163 MOFs by calculating interatomic distances, removing solvents, and deleting disordered atoms and structures, and screened 359 MOFs for CO₂/N₂ separation by pore diameter analysis. Goldsmith et al [31] adopted a similar approach to screen 22700 MOFs with clear structural parameters and H₂ uptake from 550000 CSD structures, which can be directly used for computational screening. However, the interpenetrating structures and charge-balancing ions of MOFs were not considered within this database. The above databases cannot be updated in real time with the continuous expansion of the CSD, and a great workload was required to establish them.
Chung et al [53] used a series of algorithms to screen over 20000 3D MOF structures from the CSD, and performed operations such as categorizing structures, retaining charge-balancing ions, removing solvent and manually editing structures to finally obtain a database of 4764 MOFs with full atomic coordinates, known as CoRE 2014, which was also the first publicly shared MOF database. Fanourgakis et al [58] performed GCMC simulations of the MOFs in CoRE 2014 to accurately determine CH₄ uptake at different temperatures (270, 298 K) and different pressures (1, 5.8, 65 bar). Notably, as an upgraded version of CoRE 2014, CoRE 2019 (figure 1) included accurate structural information [59].
The above databases were originally built with a focus on gas storage and separation, and included only 3D structures with appropriate pore-limiting diameters and pore window sizes. To overcome this limitation, based on automated database construction and intelligent data mining methods, Moghadam et al [55] proposed seven search criteria and built a CSD non-disordered MOF subset containing 69666 1D, 2D and 3D MOFs and MOF-like structures, oriented to broader applications. Currently, both CoRE 2019 and the CSD MOF subset had been embedded in the CSD system, which can be updated in real time and searched as entities independent of the main database [57]. In addition to the common eMOFs and hMOFs databases, there were also QMOF databases for calculating the quantum chemical properties of MOFs [60,61].
However, even if the relevant databases were continuously updated, some errors were still unavoidable. Chen et al [62] found that 4901 MOFs in the CoRE 2019 database had isolated or overlapping atoms. In addition, although the procedures for building different eMOFs databases were similar, differences in the handling of charge-balancing ions and solvent removal also affected the final results [63,64]. Daglar et al [65] studied the CH₄, H₂ and CO₂ uptake and CH₄/H₂ separation performance of thousands of commonly used MOFs in CoRE 2019 and the CSD MOF subset; the results showed that the best-performing materials for gas separation depended strongly on the database used, especially under low-pressure conditions. Moosavi et al [66] analyzed multiple eMOFs and hMOFs databases and found that the distributions of their geometrical properties were considerably different from each other, and that the differences between databases also affected the order of importance of the descriptors. Also, MOFs in reality were always defective [67,68], while those in the databases were ideal and perfect.

ML algorithms
According to the well-known 'No Free Lunch' theorem in ML, no algorithm was suitable for solving all problems. The algorithms involved in ML can be divided into regression, clustering, classification and association [69], and the choice of algorithm mainly depended on the properties of the materials to be calculated and the characteristics of the dataset, so it was necessary to choose an appropriate algorithm for a specific problem. Table 2 summarized the ML algorithms involved in this article, with their advantages and disadvantages.
Owing to the huge developments in data storage capabilities and artificial intelligence technology, data-driven models based on ML algorithms can be easily constructed and had drawn considerable recent research interest in many areas [101]. Machine learning can be divided into supervised learning, unsupervised learning and reinforcement learning. In particular, supervised learning algorithms aimed to capture the complex mathematical description of the relationship between inputs and outputs [102], which was consistent with the idea of finding the relationship between descriptors and target properties; thus a major current focus in the MOF field was how to apply supervised learning algorithms to optimize the combination of input descriptors. The performance of supervised learning was highly susceptible to variances and errors in property data, which significantly limited its applicability in predicting complex structure-property relationships. Unlike supervised learning, unsupervised learning made it easy to filter out candidates for further computation from low-quality datasets without labels or properties, significantly reducing computational costs [103,104]. However, the application of unsupervised learning to discover new MOFs with enhanced gas separation properties had rarely been explored.

ML-based HTCS of MOFs for gas storage and separation
In the HTCS process of MOFs based on ML, ensuring correlation between the input features (descriptors) and the output properties was a prerequisite for higher prediction accuracy [105]. From different perspectives, descriptors can be classified into four categories: geometrical, chemical, topological and energy-based descriptors [30,106], as shown in figure 2. Accordingly, this paper summarized the progress of ML-based HTCS of MOFs for gas storage and separation for each category in turn. Tables A1 to A4 summarized the databases, algorithms and predicted properties of the relevant studies in this paper.

Performance prediction based on geometrical descriptors
The geometrical features of MOFs, including single crystal density (ρcrys), pore volume (PV), gravimetric/volumetric surface area (GSA/VSA), void fraction (VF), largest cavity diameter (LCD), pore-limiting diameter (PLD) and dominant pore diameter (DPD), significantly affected the interaction between the adsorbed molecules and the framework, and can reflect the gas storage capacity of MOFs to a certain extent. Generally speaking, as ideal input descriptors, geometrical features can be obtained directly from the databases.
To explore the CH₄ storage capacity of MOFs, Fernandez et al [107] applied multiple ML algorithms such as MLR, DT and SVMs, and reported the first large-scale quantitative structure-property relationship (QSPR) analysis of MOFs based on geometrical descriptors, as shown in figure 3. The results showed that SVMs presented better prediction accuracy, with R² values of 0.851 and 0.941 at 298 K under 35 and 100 bar, while VF and DPD were the two features that had the greatest influence on CH₄ uptake. Following this, Fernandez et al [108] employed a variety of linear and nonlinear classification algorithms to predict the CO₂ and N₂ uptake of MOFs in the hMOFs database, and the RF algorithm performed the best. Considering that deliverable capacity and selectivity were the two more important properties of MOFs for practical applications, Aghaji et al [47] used DT and SVM algorithms to calculate and screen the CO₂ deliverable capacity and CO₂/CH₄ selectivity of 324500 MOFs. The optimal geometrical parameters were determined by the DT algorithm: CO₂/CH₄ selectivity > 10 required VF < 0.27 and pore diameter < 6.6 Å; CO₂ deliverable capacity > 4 mmol g⁻¹ required pore diameter < 8.5 Å and GSA > 2300 m² g⁻¹. Qiao et al [135] used the RF algorithm to assess the permeability and selectivity of MOF membranes for 15 gas mixtures and found that PLD was the most important geometrical descriptor for the gas separation process, followed by VF and LCD. Dureckova et al [124] built two GBTR models using a topologically diverse hMOFs database to filter out MOFs applicable to precombustion carbon capture; one of them used purely geometrical descriptors and demonstrated similar prediction accuracy to Aghaji et al [47], with R² of 0.886 for CO₂ deliverable capacity (313 K, 1-40 bar) and 0.818 for CO₂/H₂ selectivity (313 K, 40 bar).
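As an illustration of how the decision-tree thresholds reported by Aghaji et al [47] can serve as a fast pre-screening filter, the following minimal Python sketch applies the two rules quoted above to a handful of structures. The `MOF` record, the field names and the three example entries are hypothetical and only serve to show the mechanics; they are not taken from any database.

```python
from dataclasses import dataclass


@dataclass
class MOF:
    """Hypothetical record holding three geometrical descriptors."""
    name: str
    void_fraction: float       # VF, dimensionless
    pore_diameter: float       # in angstrom
    grav_surface_area: float   # GSA, in m^2 per g


def passes_selectivity_rule(m: MOF) -> bool:
    # Rule quoted in the text: CO2/CH4 selectivity > 10 required
    # VF < 0.27 and pore diameter < 6.6 angstrom.
    return m.void_fraction < 0.27 and m.pore_diameter < 6.6


def passes_capacity_rule(m: MOF) -> bool:
    # Rule quoted in the text: CO2 deliverable capacity > 4 mmol/g required
    # pore diameter < 8.5 angstrom and GSA > 2300 m^2/g.
    return m.pore_diameter < 8.5 and m.grav_surface_area > 2300.0


# Made-up example structures (not real MOFs).
candidates = [
    MOF("hypo-A", 0.20, 5.0, 2500.0),
    MOF("hypo-B", 0.35, 6.0, 3000.0),
    MOF("hypo-C", 0.25, 7.5, 1200.0),
]

selective = [m.name for m in candidates if passes_selectivity_rule(m)]
high_capacity = [m.name for m in candidates if passes_capacity_rule(m)]
print(selective, high_capacity)
```

Such threshold rules are necessary conditions read from a fitted tree, so in practice they would only narrow the pool before more expensive simulations are run.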
Low energy density was a bottleneck restricting the large-scale application of hydrogen energy as a low-carbon fuel. To speed up the process of searching for hydrogen storage materials, Ahmed et al [109]    minimum number and optimal combination of features. From the results, ERT achieved the best accuracy with R² of 0.967-0.997, and the most important features that determined the H₂ uptake of MOFs were PV (for gravimetric capacity) and VF (for volumetric capacity). 8282 hMOFs with ρcrys < 0.31 g cm⁻³, GSA > 5300 m² g⁻¹, VF = 0.90 and PV > 3.3 cm³ g⁻¹ were considered to have the potential to create a new record of H₂ uptake.
Table A1 summarized the detailed data of predicted performance using purely geometrical descriptors, including studies both mentioned and not mentioned above. Although the geometrical descriptors performed well in some examples, accuracy was lacking in most cases, especially for dipolar or quadrupolar molecules such as CO, H₂S, CO₂ and N₂, because the electrostatic force between the adsorbed molecules and the framework cannot be reflected through geometrical descriptors, which was an important reason for the low prediction accuracy.
To improve the prediction accuracy of ML models based on purely geometrical descriptors, one method was to explore the optimal combination of descriptors for specific application scenarios, and the other was algorithm optimization. For example, previous studies had shown that nonlinear algorithms tended to perform better than linear algorithms, which can be attributed to the inability of linear algorithms to capture the nonlinear dependence of output characteristics on input characteristics [109]. Algorithms should be adapted to the needs, rather than being generic.
Geometrical descriptors cannot adequately capture material characteristics [119,136]. Limited in accuracy but easily accessible, geometrical descriptors were often used in combination with other, more complex descriptors, where they can help to improve prediction accuracy.

Performance prediction based on chemical descriptors
The diversity of metal ions (clusters), organic ligands, and functional groups led to the rich chemical information of MOFs. Classification and quantitative extraction of chemical information, combined with geometrical descriptors, resulted in a more accurate reflection of the uniqueness of the material [30]. Chemical descriptors can be broadly classified into two categories: (1) features that characterized the interaction between the adsorbate and the framework, such as electronegativity, dipole/quadrupole moment, polarizability, heat of adsorption (Qst), etc; (2) features that quantitatively characterized the chemical composition of MOFs, such as the species, number or number density of atoms or functional groups.
Qst was one of the most important indicators of the adsorption capacity of an adsorbent [137]. Performing GCMC simulations of 122835 hMOFs, Gómez-Gualdrón et al [138] found that the optimal Qst should be 10.5-14.5 kJ mol⁻¹ to meet the deliverable capacity requirement of 315 cm³(STP) cm⁻³. Amrouche et al [116] first developed the GCMC-QSPR method, which considered the dipole and quadrupole moments of organic ligands, the dipole moments and boiling points of adsorbates, the number of functional groups, and the pore mean curvature, to establish a linear relationship that can predict the isosteric Qst of zeolitic imidazolate frameworks (ZIFs) for 11 gases, including CH₄ and CO₂, with a mean absolute error (MAE) of 5.7 kJ mol⁻¹ and a mean absolute percentage error (MAPE) of 24.5%. Sezginel et al [119] introduced Qst as a chemical descriptor, combined with geometrical descriptors, to establish a multivariate linear relationship, as shown in figure 5, successfully predicting the CH₄ uptake of 45 MOFs with R² of 0.92 at 298 K and 35 bar, compared with an R² below 0.9 for the purely geometrical descriptors.
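The accuracy measures quoted throughout this review (R², MAE and MAPE) follow their standard definitions. A minimal sketch of how they can be computed is given below; the reference and predicted values are made-up numbers used purely for illustration, not data from any cited study.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute deviation, in the units of the property."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """MAPE in percent; assumes no reference value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (e.g. heats of adsorption in kJ/mol).
reference = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

mae = mean_absolute_error(reference, predicted)
mape = mean_absolute_percentage_error(reference, predicted)
r2 = r_squared(reference, predicted)
print(mae, mape, r2)
```

MAE keeps the physical units of the target, whereas MAPE and R² are dimensionless, which is why studies on different properties are usually compared through the latter two.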
A critical issue for ML-based material property prediction was how to construct multi-scale features for crystal structures from a chemical perspective, so that crystalline materials were described numerically in a complete and accurate way. Building the atomic property weighted radial distribution function (AP-RDF) was a feasible way to address this issue. In chemoinformatics, the radial distribution function (RDF) can be used to encode the 3D structural features of molecules [139][140][141][142]; on this basis, atomic properties can be introduced to obtain the AP-RDF, which enabled the encoding of chemical information at the atomic scale, with the expression given in equation (1).

$$\mathrm{RDF}^{P}(R) = f \sum_{i,j}^{\text{all atom pairs}} P_i P_j \, e^{-B (r_{ij} - R)^2} \qquad (1)$$
where r_ij was the minimum image convention distance of the atom pair (i, j), P_i and P_j were the values of the chosen atomic property for the two atoms, R was the distance at which the function was evaluated, B was a smoothing parameter, and f was simply a scaling or normalization factor. Considering chemical properties such as electronegativity, polarizability and van der Waals volume, Fernandez et al [118] introduced AP-RDF for the first time in the HTCS process of MOFs to establish an SVM AP-RDF model, which yielded R² of 0.69-0.83 for CO₂, N₂ and CH₄ uptake under different pressure conditions, compared with R² of 0.46-0.63 with purely geometrical descriptors. Furthermore, Fernandez et al [120] introduced ML classifiers to pre-screen the CO₂ uptake of 324500 MOFs, thereby significantly reducing the workload required for large-scale screening. Taking the screening process at 0.15 bar as an example, with the classifier, 945 of the 1000 best-performing MOFs can be selected by performing computation-intensive simulations on only 10% of the data. In order to eliminate the influence of the number of atoms on AP-RDF scores, Dureckova et al [124] developed GBTR models with standardized AP-RDF and six geometrical descriptors (figure 6), showing the optimal R² of 0.945 for CO₂ deliverable capacity and 0.873 for CO₂/H₂ selectivity; only 4000 MOFs needed to be calculated to recover 999 of the 1000 optimal MOFs, demonstrating a very high computational efficiency.
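To make the AP-RDF descriptor of equation (1) concrete, the following Python sketch evaluates the sum over atom pairs at a list of distances R. It is a minimal illustration under stated assumptions rather than the implementation used in the cited works: the unit cell is assumed to be orthorhombic so that the minimum image convention reduces to a per-axis wrap, each pair is counted once, and the default values of the smoothing parameter and scaling factor are arbitrary placeholders.

```python
import math


def minimum_image_distance(a, b, box):
    """Distance between two atoms under periodic boundaries.

    Assumes an orthorhombic cell with edge lengths `box` (a simplification).
    """
    d2 = 0.0
    for xa, xb, length in zip(a, b, box):
        delta = xa - xb
        delta -= length * round(delta / length)  # wrap to the nearest image
        d2 += delta * delta
    return math.sqrt(d2)


def ap_rdf(coords, props, box, radii, smoothing=10.0, scale=1.0):
    """Evaluate f * sum_{i<j} P_i P_j exp(-B (r_ij - R)^2) for each R in radii."""
    scores = []
    for R in radii:
        total = 0.0
        for i in range(len(coords)):
            for j in range(i + 1, len(coords)):
                r_ij = minimum_image_distance(coords[i], coords[j], box)
                total += props[i] * props[j] * math.exp(-smoothing * (r_ij - R) ** 2)
        scores.append(scale * total)
    return scores


# Toy cell: two atoms 9 apart along x in a 10 x 10 x 10 box, so their
# nearest periodic images are only 1 apart. Property values are made up.
coords = [(0.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
props = [2.0, 3.0]
box = (10.0, 10.0, 10.0)
scores = ap_rdf(coords, props, box, radii=[1.0, 2.0])
print(scores)
```

The resulting vector of scores, one per distance R and per atomic property, is what would be fed to the regression or classification model as the chemical descriptor.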
The interaction between MOFs and adsorbate molecules, besides being related to their relative positions, also depended to a large extent on the type and number of exposed metal sites, organic ligands, and modified functional groups [1,2]. However, a molecular structure was not a numerical value or a collection of numerical values, and must be transformed into a numerical form to be used as input variables. Borboudakis et al [121] predicted the H₂ and CO₂ adsorption performance of 100 MOFs by encoding organic ligands, metal clusters and functional groups as binary parameters based on the presence or absence of key geometrical features in MOFs, but the accuracy was poor because of the limited size of the training set.
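The presence/absence encoding described above can be sketched in a few lines of Python. The vocabulary of building blocks and the two example compositions below are hypothetical labels chosen for illustration; they are not the feature set used by Borboudakis et al [121].

```python
def encode_presence(components, vocabulary):
    """Return a 0/1 vector marking which vocabulary items occur in a structure."""
    present = set(components)
    return [1 if item in present else 0 for item in vocabulary]


# Hypothetical vocabulary: metal clusters, linkers and functional groups.
vocabulary = ["Zn4O", "Cu2-paddlewheel", "BDC", "BTC", "-NH2", "-NO2"]

# Two illustrative structures described by the components they contain.
structure_a = ["Zn4O", "BDC", "-NH2"]
structure_b = ["Cu2-paddlewheel", "BTC"]

vector_a = encode_presence(structure_a, vocabulary)
vector_b = encode_presence(structure_b, vocabulary)
print(vector_a, vector_b)
```

Because every structure maps to a vector of the same length, such encodings can be passed directly to standard ML algorithms, although they carry no information about how many times a component occurs or where it sits in the framework.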
Considering that metal atoms significantly affected the adsorption properties of MOFs [143], Pardakhti et al [113] introduced standardized atomic species numbers and metallic percentage, and compared the prediction results of DT, PR, SVM, RF and other models for CH₄ uptake. Among them, the RF algorithm performed best, with R² of 0.98 for CH₄ gravimetric uptake, and its computational speed was several orders of magnitude faster than molecular simulations. Using similar descriptors, the results of Keskin et al [122] showed that geometrical descriptors exhibited superior accuracy in predicting the deliverable capacity of CH₄ using the ANN algorithm. The numbers of different elemental atoms or functional groups per unit volume were also used as descriptors and achieved satisfactory results [48,49,110] (see table A2 for detailed data; figure 7 displayed one of the descriptors).
Unlike metals (clusters), organic ligands were more complex in structure and consisted of more atoms, so more precise description methods should be developed in addition to concepts such as ligand species and number density. To this end, Gurnani et al [115] developed the linker SMILES extractor (LSE) program for the rapid extraction of SMILES strings of the organic ligands in a given MOF, and combined them with various geometrical properties and chemical properties of metal clusters. As shown in figure 8, the model exhibited greater generalizability and outstanding accuracy for CH₄ gravimetric and volumetric uptake, with R² as high as 1, which exceeded that of Wu et al [48] and Fanourgakis et al [110]. The superior predictive performance can be attributed to two points: SMILES strings can not only represent rich properties of organic ligands, but also help to identify and remove chemically invalid MOFs, improving the quality of the input data.
Crystals in the eMOFs database tended to be defect-free, while defective MOFs often had unexpected properties, especially for UiO-66 [144][145][146]. Wu et al [125] constructed 425 UiO-66-Ds structures with different missing-linker ratios and short-range orders (figure 9), and assessed their adsorption and separation properties and mechanical stability by including the concentration and distribution of various defects alongside various geometrical and chemical descriptors. The deliverable capacity of C₂H₄ and C₂H₆, the C₂H₄/C₂H₆ selectivity at 0.1 bar, and the moduli can be accurately reproduced by a simple logistic regression (LR) model, which provided a useful data-driven analysis method for the development of defect engineering in MOFs.
Chemical descriptors were second-order descriptors, which were more complex and computationally expensive than geometrical descriptors. However, because they were able to encode the physicochemical properties of molecules, chemical descriptors more directly reflected the interactions between adsorbate molecules and the framework, leading to higher prediction performance.

Performance prediction based on topological descriptors
Data sets (including point clouds, matrices, graphs, functions, etc) had shape [147], which can be strongly characterized by topology [148]. Moreover, the findings of Colón et al [11] showed that the gas storage properties of MOFs were strongly correlated with the topology, as shown in figure 10. Therefore, in recent years, some attempts had been made to introduce topological features as descriptors in material property prediction studies.
Traditional geometrical descriptors captured only some of the structural features and made it difficult to precisely identify crystalline materials with similar pore structures [149], for which some computational techniques and mathematical quantification methods for structural features came into being [149,150]. Topological data analysis (TDA) was a technique that combined computational topology and data science, and analyzed topological features of data based on persistent homology. Compared to principal component analysis (PCA), its analysis process did not cause loss of information and was considered stable for missing and noisy samples [151,152], so it had been widely used in material property prediction studies [153,154].
Lee et al [129] developed a TDA-based identification method to quantify the similarity of pore structures. The method took the points of a point cloud characterizing the pore structure as sphere centers and continuously increased the radius to construct a filtered Vietoris-Rips complex, whose shape can be characterized by 0D, 1D, and 2D homology classes. Persistence barcodes characterizing the overall shape of the pore structure can be acquired by monitoring the variation of the homology classes. In the 0D homology classes of figure 11, the eight long intervals of zeolite DON represented a pore structure consisting of eight disjoint components, while the pore structure of zeolite PCOD8331112 was connected, and the 1D and 2D homology classes characterized the shape of the cavities. Based on such analysis, Lee et al [129] searched for similar structures in 41498 MOFs using 20 MOFs with optimal CH₄ storage capacity as a benchmark, and showed that 85% of the MOFs with similar pore structures exhibited excellent deliverable capacity of more than 150 (v STP/v), even with different chemical compositions. In addition, it was found that MOFs with different topologies should be designed with different Qst to improve CH₄ uptake. Furthermore, Lee et al [130] extracted the 10 most important features from the barcodes as descriptors (see table A3) and developed a KRR model, which not only presented good predictions for the CH₄ deliverable capacity but also accurately predicted the geometrical properties, with R² of 0.88, 0.88, 0.69, 0.98 and 0.91 for DC, LCD, ρcrys, ASA and PV, respectively. Krishnapriyan et al [131] built a combined RF model with topological features, including persistent homology and word embedding of chemical elements for capturing the chemical information, to predict CO₂/CH₄ adsorption uptakes and Henry coefficients of MOFs in hMOF, BW [155] and CoRE 2019. Compared to purely geometrical or topological descriptors, the combination of topology with word embedding greatly improved the prediction accuracy.
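The 0D part of a persistence barcode can be illustrated without specialised software, because in a Vietoris-Rips filtration every point is born at scale zero and a connected component dies when it merges with another one. The following Python sketch computes these intervals with a union-find structure over the sorted pairwise distances. It is a simplified illustration: it uses the convention that an edge appears when the filtration scale reaches the pairwise distance (some texts use half that distance as a radius), it does not compute 1D or 2D classes, and the point cloud is a made-up example rather than a pore structure.

```python
import math
from itertools import combinations


def zero_dim_barcode(points):
    """Return the 0D persistence intervals (birth, death) of a point cloud."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by the scale at which they enter the complex.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:           # two components merge: one of them dies
            parent[root_i] = root_j
            bars.append((0.0, length))
    bars.append((0.0, math.inf))       # the final component never dies
    return bars


# Made-up point cloud with two well-separated clusters.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
bars = zero_dim_barcode(cloud)
long_bars = [bar for bar in bars if bar[1] - bar[0] > 5.0]
print(bars, len(long_bars))
```

In this toy example the two long intervals correspond to the two separated clusters, which mirrors how long 0D intervals were read as disjoint pore components in the discussion of figure 11. The threshold of 5.0 used to call an interval long is an arbitrary choice for this example.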
In addition, it was also an acceptable attempt to regard topological blueprints directly as descriptors. Anderson et al [50] first adopted a DT model to demonstrate that topology was an important parameter affecting the CO₂ capture ability of functionalized MOFs, and then developed a GBM model by introducing the topological blueprints of functionalized MOFs on the basis of geometrical and chemical descriptors, to predict the CO₂ uptake and the CO₂/N₂ and CO₂/H₂ selectivity under different conditions (see table A3).
Kim et al [156] developed a constructor (figure 12) that can construct MOFs with arbitrary topologies and building blocks without predefined symmetry information, used it to construct a total of 247 trillion MOFs containing 1775 topologies, encoded the topologies and building blocks as fixed-length vectors to serve as descriptors, and applied a self-developed multispecies genetic algorithm with fitness approximation (MSGA-FA) to predict the CH₄ deliverable capacity. A total of 964 MOFs with working capacity over 200 cm³ cm⁻³ were identified, of which 96 exceeded 208 cm³ cm⁻³.
In addition to applications in gas adsorption and separation, topological descriptors also had great advantages in the prediction of MOF mechanical properties. Moghadam et al [157] significantly increased the R² from 0.696 to 0.979 with the introduction of topology when predicting the bulk modulus of MOFs.
Restricted by the carried information, geometrical descriptors can only reflect some features of the pore structure and lack an effective information combination method.Topological descriptors, on the other hand, can theoretically carry all the information to characterize the pore structure, which can make up for the shortcomings of geometrical descriptors very well.However, only a small number of frequently used topologies can be found from the synthesized MOFs [158], implying limited structural variation [111].

Performance prediction based on energy-based descriptors
The nature of adsorption was a spontaneous tendency to reduce the surface free energy by decreasing the surface tension, so different adsorption processes can essentially be explained by the energy change during adsorption [159]. Considering that the binding energy of hydrogen may be related to the electron density of the MOF framework, Choi et al [133] performed DFT simulations to obtain the iso-value surface area of the electrostatic potential (ESP) of 10 MOFs, from which three descriptors were extracted as independent variables and subjected to MLR against experimental H₂ uptake, which showed decent correlation with R² of 0.91 even with such a small data set. The work of Thornton et al [16] also presented good accuracy, with R² of 0.88 for the volumetric deliverable capacity of H₂, by introducing adsorption energy as a descriptor.
Snurr et al [134] divided the MOF structure into a grid, as shown in figure 13, calculated the H2-MOF interaction energy at each grid point, and used the histogram of these energies as a descriptor, converting the 3D energy landscape into 1D energy distribution information. LASSO was then applied to predict the H2 deliverable capacity of 50000 experimentally synthesized MOFs, which yielded satisfactory results, with an RMSE of no more than 3 g·L⁻¹ and a computational speed more than 3 orders of magnitude faster than molecular simulations; the H2 deliverable capacity of MFU-4l reached 47 g·L⁻¹.
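The conversion of a 3D energy landscape into a 1D descriptor can be illustrated with a minimal sketch. It assumes that the interaction energies at the grid points are already available as a flat list in kJ·mol⁻¹; the function name, the bin range and the number of bins are illustrative choices and are not taken from [134].

```python
def energy_histogram(energies, e_min=-10.0, e_max=0.0, n_bins=10):
    """Convert grid-point interaction energies into a normalized histogram.

    Energies below e_min fall into the first bin and energies at or above
    e_max (e.g. repulsive positions) fall into the last bin. The bin range
    and count are assumptions made for this sketch.
    """
    width = (e_max - e_min) / n_bins
    counts = [0] * n_bins
    for e in energies:
        if e >= e_max:
            idx = n_bins - 1
        elif e < e_min:
            idx = 0
        else:
            idx = min(int((e - e_min) / width), n_bins - 1)
        counts[idx] += 1
    total = len(energies)
    # Fractions rather than raw counts, so grids of different size are comparable.
    return [c / total for c in counts]


# Toy energies (kJ/mol) at eight hypothetical grid points of one structure.
toy_energies = [-9.5, -7.2, -7.9, -3.1, -0.4, 2.0, -12.0, -5.0]
descriptor = energy_histogram(toy_energies, e_min=-10.0, e_max=0.0, n_bins=5)
print(descriptor)
```

Each structure is thus represented by a fixed-length vector of bin fractions, which could then serve as the input of a sparse linear model such as LASSO.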
Fanourgakis et al [58] considered that the probability of a probe atom being adsorbed at a given position in the framework depended on its interaction energy, which can be characterized by the Boltzmann factor. The average Boltzmann factor (equation (2)) was calculated and used as a descriptor to predict the CH4 adsorption properties of 4700 MOFs. The prediction accuracy was significantly improved compared with purely geometrical descriptors, especially at low pressure.
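As a hedged illustration of such a descriptor, the sketch below averages exp(−E/RT) over probe energies sampled at different positions in a framework. The exact form used in [58] may differ; the function name, the energy units (kJ·mol⁻¹) and the default temperature are assumptions made here.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # molar gas constant in kJ/(mol*K)


def average_boltzmann_factor(energies_kj_mol, temperature_k=298.0):
    """Average of exp(-E / (R*T)) over sampled probe positions.

    energies_kj_mol: probe-framework interaction energies, one per position.
    Strongly repulsive positions (large positive E) contribute almost zero,
    attractive positions (negative E) contribute more than one.
    """
    beta = 1.0 / (R_KJ_PER_MOL_K * temperature_k)
    factors = [math.exp(-beta * e) for e in energies_kj_mol]
    return sum(factors) / len(factors)


# Toy example: two attractive sites, one neutral site, one overlapping position.
print(average_boltzmann_factor([-8.0, -5.0, 0.0, 1.0e6]))
```

Because attractive sites are weighted exponentially, the average is dominated by the strongest binding positions, which is consistent with the improvement reported at low pressure.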
To further improve the accuracy, Fanourgakis et al [160] employed a self-consistent method. The 100 MOFs predicted to perform best were repeatedly added to a training set that initially consisted of 100 MOFs, and the self-consistency process was complete once all of the 100 predicted top performers were already included in the training set. In this way, the majority of the best-performing MOFs can be identified from a small amount of data. However, a single descriptor cannot provide enough information for the ML process, and for H2, H2S, CO2, etc, electrostatic interactions need to be considered in addition to van der Waals forces. Therefore, Fanourgakis et al [112] defined three types of probe atoms (figure 14): Vprobes, which had only van der Waals interactions with the framework, and Qprobes and Dprobes, which carried one or two charges, respectively, in addition to van der Waals interactions. This improvement achieved better prediction accuracy for the uptake of CH4, H2, H2S and CO2.
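The self-consistent procedure of [160] described above can be sketched as a simple loop. This is only an illustration under stated assumptions: a one-descriptor least-squares line stands in for the actual ML model, `simulate` is a hypothetical callback representing the expensive simulation, and all names are invented for this example.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (stand-in for the real ML model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def self_consistent_screen(descriptors, simulate, n_init=100, n_top=100, seed=0):
    """Grow the training set until the predicted top performers are all in it."""
    rng = random.Random(seed)
    all_ids = list(range(len(descriptors)))
    train = set(rng.sample(all_ids, n_init))
    labels = {i: simulate(i) for i in train}
    while True:
        ids = sorted(train)
        a, b = fit_line([descriptors[i] for i in ids], [labels[i] for i in ids])
        ranked = sorted(all_ids, key=lambda i: a * descriptors[i] + b, reverse=True)
        top = ranked[:n_top]
        new = [i for i in top if i not in train]
        if not new:  # self-consistency reached
            return top, train
        for i in new:  # run the expensive calculation only for the newcomers
            labels[i] = simulate(i)
            train.add(i)


# Toy data: 200 hypothetical structures with a linear descriptor-property relation.
toy_descriptors = [i / 100 for i in range(200)]
top, train = self_consistent_screen(
    toy_descriptors, lambda i: 2.0 * toy_descriptors[i] + 1.0, n_init=10, n_top=10
)
print(sorted(top), len(train))
```

The loop always terminates because the training set grows in every iteration that does not satisfy the stopping criterion, and only the structures that enter the training set require a simulation.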
Although the selection of suitable descriptors was in general an essential step in ML-based material screening, some 'descriptor-free' attempts were also made. For example, Yao et al [161] developed a descriptor-free model based on deep neural networks to discover MOFs for CO2/CH4 and CO2/N2 separation, which introduced the idea of unsupervised learning in data processing.

Conclusions and perspectives
In recent years, the application of ML methods to MOFs had just started to emerge and had been favored by researchers for its unique advantages. It was foreseeable that ML will become a popular tool in MOF research. However, the application of ML to MOFs was still at an exploratory stage, and we believed that improvements can be made in three aspects: databases, algorithms and descriptors.

Databases
Numerous MOF databases had been created, each recording different properties for its own research purpose, which greatly reduced the transferability of data between databases. Since the composition and structure of metal nodes can only be known experimentally, the hMOFs database was large but hardly diverse, focusing mostly on specific sub-classes of MOFs. Compared with the hMOFs database, the eMOFs database was small but rich in information and patterns of metal nodes. Transfer learning was the ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks. Extensive knowledge about metal nodes can be extracted from the eMOFs database through transfer learning and applied to the construction of the hMOFs database, thus greatly enriching its diversity. In addition, different procedures for handling charge-balancing ions and solvent molecules can also significantly affect the final results. As related research deepens, it was particularly important to propose and develop a standard, widely accepted procedure for building MOF databases.

Algorithms
(1) Supervised learning. Input features that were irrelevant to the target value simply increased the data dimension while providing little valid information about the system of interest, and therefore led to the so-called curse of dimensionality [162]. Moreover, serious over-fitting may easily occur when some features are linearly correlated with each other [163]. Using proper feature selection techniques to identify key descriptors was therefore vital for improving the application of ML algorithms in the MOF field. (2) Unsupervised learning. Supervised learning algorithms were currently used for the screening of MOFs, and their biggest shortcoming was that prediction performance was highly sensitive to the quality of the dataset. Unlike supervised learning, which had to accurately predict the target properties of all materials, unsupervised learning aimed to group similar materials, alleviating the issue of poor data quality and greatly reducing the dependence of prediction performance on the dataset. The application of unsupervised learning to the discovery and screening of high-performance MOFs was highly promising. (3) Reinforcement learning. Reinforcement learning was devoted to maximizing reward through trial and error. Instead of being pre-trained on a dataset, models can learn from scratch [163], which not only eliminated the effect of data quality but also expanded the search space and avoided missing structures that were not in the dataset but performed better.
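The feature selection issue raised under supervised learning can be illustrated with a minimal filter that drops descriptors that are irrelevant to the target or linearly redundant with a descriptor already kept. This is one simple option among many, not a method taken from the cited works; the thresholds, descriptor names and toy values are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_features(features, target, min_relevance=0.1, max_redundancy=0.95):
    """Keep relevant descriptors and skip those redundant with a kept one.

    features: dict mapping descriptor name to a list of values.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    kept = []
    for name in sorted(features, key=lambda n: relevance[n], reverse=True):
        if relevance[name] < min_relevance:
            continue  # irrelevant to the target: only adds dimensions
        if any(abs(pearson(features[name], features[k])) > max_redundancy for k in kept):
            continue  # linearly correlated with a descriptor already kept
        kept.append(name)
    return kept


# Toy data for six hypothetical MOFs.
toy_target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_features = {
    "pore_volume": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "pore_volume_doubled": [2.2, 3.8, 6.4, 7.8, 10.2, 12.0],
    "noise": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],
}
kept = select_features(toy_features, toy_target)
print(kept)
```

In this toy case only one of the two linearly dependent descriptors survives and the uncorrelated one is discarded, which reduces the dimension without losing information about the target.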

Descriptors
The development of new descriptors had been one of the hottest tasks. However, breakthroughs will be increasingly difficult to achieve as research progresses. Alongside the development of new descriptors, feature engineering should also be carried out on existing descriptors, which may yield unexpected performance gains, mainly in the following aspects. (1) Feature continuity. High dimensionality and discrete variables made the design of MOFs quite restrictive [161]. To improve the coverage of the model, unsupervised learning can be performed to map the discrete input features into continuous vectors, and supervised learning algorithms can then be employed to learn the relationship between the feature vectors and the physical properties of the material in the latent space, further improving the representativeness of the features and the optimization capability of the model. (2) Feature transformation and reorganization. Based on the original features, by introducing a priori knowledge and transforming the form of feature expressions, new features [114, 117, 123, 126-128, 132, 164, 165] that were more closely related to the target properties can be constructed and a dual mechanism- and data-driven model can be built, which not only reduced the computational cost but also improved the prediction accuracy.

DC: Deliverable capacity; UG: Gravimetric deliverable capacity, g·g⁻¹; UV: Volumetric deliverable capacity, cm³(STP)·cm⁻³; PS: Pressure swing; TPS: Temperature-pressure swing; IC: Interpenetration capacity; NIF: Number of interpenetration framework.
a Unless otherwise specified, the accuracy refers to the coefficient of determination, R²; the same below.
b hMOFs refer to the databases established by Wilmer et al [27] or established in the same way; the same below.

Figure 1. HTS procedure used on the CoRE-MOF-2019 database to identify top-performing materials for CO2 capture under wet flue gas conditions. Reprinted with permission from [59]. Copyright (2023) American Chemical Society.

Figure 2. Classification of descriptors used in MOF related ML studies.
combined 19 existing databases with a total of 918734 MOFs, of which 98695 MOFs with H2 uptake data were divided into training and test sets of 74201 and 24674 MOFs, as shown in figure 4. A performance comparison of 14 ML algorithms was performed, and 508 ERT models covering all geometrical feature combinations were established to determine the

Figure 4. Performance of the ERT algorithm with respect to GCMC calculations for predicting H2 deliverable capacities in MOFs. Reprinted from [109], Copyright (2021), with permission from Elsevier.

Figure 6. Heatmaps of (a) GCMC-calculated and (b) GBTR-predicted CO2 deliverable capacity plotted against CO2/H2 selectivity for the test set containing 35840 MOFs. The colors of the heatmaps correspond to the number of MOFs, where red is high and blue is low. Reprinted with permission from [124]. Copyright (2019) American Chemical Society.

Figure 7. Descriptors constituting the MOF fingerprint. (a) Textural properties. (b) Seventeen chemical motifs, for which their respective number density in each MOF was calculated. Reprinted with permission from [49]. Copyright (2020) American Chemical Society.

Figure 10. Hydrogen cryo-adsorption in MOFs. Reproduced from [11] with permission from the Royal Society of Chemistry.

Figure 12. Diagram of computational screening of trillions of metal-organic frameworks for high-performance methane storage. Reprinted with permission from [156]. Copyright (2021) American Chemical Society.

Figure 13. Overall workflow for feature extraction and model calculations. Reproduced from [134] with permission from the Royal Society of Chemistry.

μ_OL: Dipole moment of organic linker, C·m; |Q_OL|: Mean quadrupole moment of organic linkers, b; H: Pore mean curvature, Å⁻¹; n_fg: Number of functional groups; |μ_g|: Dipole moment of adsorbed gas, C·m; T_b: Atmospheric boiling temperature of adsorbed gas, K; COP_H: Ratio between energy usage of the compressor and the amount of useful heat extracted from the condenser for a heat pump; PSSD: Pore size standard deviation, Å; MA: Metal angle, deg; MC: Metal charge, e; MDM: Metal dipole moment, e·Å; ρ_AS: Surface atom density, m⁻²; FFV: Fractional free volume; TDU: Total degree of unsaturation; DUC: Degree of unsaturation per carbon; fp_ip: Degree of interpenetration; fp_linker: Code of organic linker; Dif: Largest included sphere along free path; VPOV: Volumetric pore volume, m²·cm⁻³; GPOV: Gravimetric pore volume, m²·g⁻¹; RACs: Revised autocorrelations.

Table 1. MOF datasets employed in this study.

Table 2. ML algorithms involved in this article.
Algorithm | Abbreviation | Advantages | Disadvantages
Multilinear regression [70-72] | MLR | Simple and widely used, with high precision | Multicollinearity problems, lack of causal inference, normal distribution hypothesis
k-Nearest neighbor [73-75] | kNN | High precision, insensitive to outliers, no data input assumptions | High time and space complexity
Kernel ridge regression [76, 77] | KRR | Able to process nonlinear functions, only affected by feature quantities | Suitable for small sample data
Decision tree [78-80]

Table A1. Summaries of geometrical descriptors to predict MOF properties.