Pipelines for automating compliance-based elimination and extension (PACE2): a systematic framework for high-throughput biomolecular materials simulation workflows

The formation of biomolecular materials via dynamical interfacial processes, such as self-assembly and fusion, for diverse compositions and external conditions can be efficiently probed using ensemble Molecular Dynamics (MD). However, this approach requires many simulations when investigating a large composition phase space. In addition, it is difficult to predict whether each simulation will yield biomolecular materials with the desired properties or outcomes, and how long each simulation will run. These difficulties can be overcome by rules-based management systems, including intermittent inspection, variable sampling, and premature termination or extension of the individual MD simulations. Automating such a management system can significantly improve runtime efficiency and reduce the burden of organizing large ensembles of MD simulations. To this end, a computational framework, the Pipelines for Automating Compliance-based Elimination and Extension (PACE2), is proposed for high-throughput ensemble biomolecular materials simulations. The PACE2 framework encompasses Candidate pipelines, where each pipeline includes temporally separated simulation and analysis tasks. When an MD simulation is completed, an analysis task is triggered, which evaluates the MD trajectory for compliance. Compliant simulations are extended to the next MD phase with a suitable sample rate to allow additional, detailed analysis. Non-compliant simulations are eliminated, and their computational resources are reallocated or released. The framework is designed to run on local desktop computers and high-performance computing resources. Preliminary scientific results enabled by the use of the PACE2 framework are presented, which demonstrate its potential and validate its function. In the future, the framework will be extended to address generalized workflows and to investigate composition-structure-property relations for other classes of materials.


Introduction
The formation of biomolecular materials via dynamical processes such as self-assembly or fusion of interfaces is key to many subdisciplines in life and biomedical sciences [1]. The characteristics of the materials depend on the chemistry and concentration of the constituent biomolecules (namely, the composition of the materials). Whereas experimental approaches can identify relationships between the composition and characteristics of biomolecular materials, they can be expensive and time-consuming. Experiments also have limited spatiotemporal resolution to elucidate dynamical molecular processes and mechanisms underlying the formation of materials. A large array of Molecular Dynamics (MD) simulations is better suited to investigating these processes and mechanisms under various compositions and external conditions. This approach is an example of 'ensemble MD' [2].
Investigations of the formation of biomolecular materials using ensemble MD pose two main challenges: (1) many independent ensembles are necessary to adequately probe the composition phase space [3], and (2) it is difficult to predict how much simulation time will be needed for the desired materials to form [4]. Actual simulation runtimes are of interest because they provide a measure of the computational cost of those simulations. State-of-the-art MD simulations are frequently run on high-performance computing (HPC) clusters, thus requiring a significant financial investment. Hence, a computational framework that addresses these challenges and manages the efficient use of computational resources is critically needed.
An attentive computational scientist will strive to manage the length of each candidate simulation in the ensemble by adopting several management strategies. These management strategies include: (1) intermittent inspection of simulations for signs of the formation of the desired biomolecular material; (2) sparse sampling of simulations before they show evidence of formation and frequent sampling thereafter; (3) premature termination of simulations that evolve contrary to the reference analytical calculations; and (4) extension of simulations which terminate without clear results [4].
With these strategies, the computational scientist may decrease the duration of unproductive simulations and therefore reduce the average cost per candidate simulation. Implementing such strategies involves significant organization on the part of the scientist, with considerable time dedicated to overseeing both software development and experiments. Furthermore, strategies (1) through (4) can be implemented as a rules-based management routine. If candidate simulations can be automatically assessed for conformity using a user-defined rule, then the elimination or extension of candidates can be adaptively managed by software. Hence, the long-term management of an ensemble of MD simulations can be replaced by a decision loop with programmed analysis and conditional extension or termination of individual MD simulations.
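Such a decision loop can be sketched in a few lines of Python. The function names and the compliance rule below are hypothetical placeholders for a real MD engine and a user-defined analysis, not PACE2 code:

```python
def run_md_increment(state, n_steps):
    """Placeholder for one incremental MD phase; a real implementation
    would invoke an MD engine and return the trajectory location."""
    return {"steps": state["steps"] + n_steps, "trajectory": "traj.xtc"}

def is_compliant(state):
    """Placeholder user-defined rule: inspect the trajectory for signs
    of the desired assembly (here, an arbitrary step-count criterion)."""
    return state["steps"] < 15000

def manage_candidate(max_cycles=4, steps_per_cycle=5000):
    """Decision loop implementing strategies (1)-(4): run an increment,
    inspect intermittently, terminate prematurely if non-compliant,
    otherwise extend up to a maximum number of cycles."""
    state = {"steps": 0, "trajectory": None}
    for cycle in range(max_cycles):
        state = run_md_increment(state, steps_per_cycle)  # one MD phase
        if not is_compliant(state):                       # analysis phase
            return ("eliminated", state["steps"])         # premature termination
    return ("completed", state["steps"])                  # survived all cycles
```

With the placeholder rule above, a candidate is inspected after every 5000-step increment and eliminated once the rule fails.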
To that end, a computational framework for ensembles of biomolecular materials simulation pipelines, the Pipelines for Automating Compliance-based Elimination and Extension (PACE2), is proposed. Specifically, the computational framework encompasses Candidate pipelines (see figure 1). The notion of a Candidate pipeline is straightforward: a pipeline carries sequential simulation and analysis tasks, and a control flow decision is made based on the signal received from the analysis task. In other words, the first MD simulation runs for a fixed number of timesteps and triggers an analysis task. The analysis task operates on the resultant MD trajectory to evaluate a user-defined compliance metric. Compliant simulations are extended to the next MD phase, where the system can evolve. Non-compliant simulations are eliminated so their computational resources may be allocated elsewhere. Resource allocation is especially important due to the limited availability of HPC resources. Multiple pipelines can also be run in parallel to process multiple candidates concurrently. A reference for the PACE2 source code and examples is provided in the Data Availability statement.
In this study, the implementation of the PACE2 framework as a general-purpose design for the management of MD ensembles is discussed. Two application case studies are illustrated in which PACE2 is leveraged to automate the generation of a phase diagram for biomolecular materials encompassing chemical species of interest. Elements of the software design that will be leveraged in future enhancements to support model development and ensemble simulations are emphasized. Furthermore, the framework has the potential to be integrated with other computational methods and analysis tools to enable the identification of composition/sequence-structure-property relations in other classes of materials, such as soft polymeric materials, metals, ceramics, and glasses.

Related work
A workflow-driven approach to problem solving in the domain of molecular simulations is not new. Several frameworks exist that provide the capability to separate tasks temporally and provide dependency resolution between them. Examples include general-purpose workflow tools such as DASK [5], as well as problem- and domain-specific tools such as BioExcel [6,7] for bioinformatics. In the domain of molecular simulations, packages such as HTMD [8], AiiDa [9], FireWorks [10], and MISPR [11] address the specialized requirement for a high-throughput system for running simulation workloads. Another interesting approach is seen in HTP_MD [12], which addresses the problem of high-volume data movement between different endpoints by moving any runtimes to a cloud server; it acts more akin to a database querying system with a fully featured GUI than a high-throughput runtime system for molecular simulation workloads. These workflow systems are useful and powerful tools that serve specific use cases, but there are some limitations: DASK is extremely flexible but puts the burden of workflow design on the user, thereby increasing development overhead. BioExcel serves a specific set of use cases for the bioinformatics and drug discovery community. HTMD is limited to the functionality provided by the developers (such as support for only specific types of force fields). AiiDa, while extremely flexible, relies on a plugin ecosystem to ensure compatibility with a wide range of molecular simulation use cases. FireWorks is mainly targeted towards quantum chemistry, and MISPR serves a specialized use case of combined density functional theory-classical MD simulations and is unable to support a broad range of MD engines or custom analysis scripts. Finally, HTP_MD appears to provide very limited customizability.
The above examples broadly point to the fact that many computational chemistry applications would be well served by a workflow tool that allows users to specify their own analytical toolchain and integrate it into a custom workflow, rather than be constrained to the analysis tools commonly packaged with MD engines [13,14]. Further, there is the general limitation of such packages being designed and written around the binary file formats of their native MD kernels (such as NAMD) and analysis packages (such as VMD) [15]. A kernel-agnostic framework would significantly lower the barrier to entry and encourage workflow-based research in the molecular science and materials research fields.

Method
We now describe the process adopted to develop PACE2. In this section we discuss the motivation and the initial design, followed by the choice of architecture and other strategic elements of implementation and the pseudocode.

Design
A large ensemble of molecular simulations may be necessary to achieve statistically meaningful results in many applications of biomolecular material simulations, such as evaluating the most likely self-assembled morphologies of a given chemical system. By varying simulation parameters, a phase map may be developed showing the possible system characteristics for a range of experimental conditions. When using ensemble MD methods to investigate a phase space for self-assembled nanostructures, structural analysis or visualization of the simulated system can be used to verify that the model is functioning as the experimenter intends [3]. Often, the results of such analysis indicate that an expensive task has accomplished its objective earlier than expected or that the process has evolved in a pathological way that will provide no useful results. In either of these cases, any further computation may be considered a waste of computational resources.
Solving large ensemble problems via a high-throughput simulation framework can be set up in the form of shell scripts that individually assign resources to all the simulations in question and resolve the necessary dependencies. This approach does not scale well when many independent simulation tasks must be configured (see Appendix Table A1) [3,4]. Further, driver scripts and other code usually must be largely rewritten when ported from one use case to another or across different computing platforms (e.g., different batch system, compiler, libraries, or hardware).
There is, therefore, the necessity to address the following challenges: (i) the lack of portability and expression capability of shell scripts and (ii) the resolution of the complex dependencies between the various simulation and analysis tasks of the ensemble system. PACE2 addresses the first of these by providing configurable software that addresses this specific family of use cases. The second challenge is addressed by introducing the notion of a task graph [16-18]: a set of tasks to be performed and the set of dependencies between those tasks. A workflow is expressed as a task graph, where these tasks are scheduled and their dependencies resolved to ensure complete execution.
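As a minimal illustration, a task graph for one candidate can be expressed as a mapping from each task to the set of tasks it depends on, and a valid execution order recovered with a topological sort. The task names here are hypothetical, not the identifiers PACE2 uses internally:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical task graph for one candidate over two cycles: each key
# names a task and maps to the set of tasks it depends on.
task_graph = {
    "pre_md_0":   set(),            # preprocessing has no dependencies
    "md_0":       {"pre_md_0"},     # first MD increment
    "analysis_0": {"md_0"},         # compliance check on the trajectory
    "md_1":       {"analysis_0"},   # extension cycle, scheduled only if compliant
    "analysis_1": {"md_1"},
}

# Resolve dependencies into an execution order in which every task
# appears after all of its prerequisites.
order = list(TopologicalSorter(task_graph).static_order())
```

Scheduling tasks in `order` guarantees that each analysis block only runs once its MD increment has completed, which is the property the workflow engine must enforce.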
For example, a user may need to run MD simulations (i.e., tasks) with multiple instrumental variables which start and stop at arbitrary points, dump simulation states to disk, or use these dumped states to set up and run new simulations (i.e., resolve dependencies). This points to the need for a workflow to be adaptive, i.e., able to modify itself during runtime in response to specific events that occur during simulations. Examples of such events include: failed simulations that require re-running, ensemble members that no longer need to be sampled and must be terminated, or ensembles that can safely terminate because they have reached a dynamically evaluated 'desirable end-state.' Adaptivity allows for optimal use of compute time and resources, removing the need for human intervention and reducing the scope for human error.
PACE2 is a package designed to enable automated termination or extension of MD tasks. In a PACE2 workflow, MD tasks are executed incrementally to invest just enough computation to bring a task to the end of its useful life. Between computation tasks, PACE2 workflows include regular analysis blocks in which incremental task outputs are evaluated to produce a signal that will initiate the next increment or not.
The tasks that are suited to this design use various application programs, such as MD solver kernels. Therefore, the specific programs that are executed in a PACE2 computation or analysis block are defined by the user, i.e., PACE2 is uncoupled from any particular suite of MD tools. That allows an experienced user to implement the incremental compute-analyze workflow pattern with any software of choice. The user defines the compute resources and task commands for a PACE2 workflow in brief configuration files. Task input files are staged in input directories. Because MD workflows are often executed on remote compute resources, execution of the PACE2 software is via the UNIX command line. PACE2 leverages the PILOT abstraction [19], in which the execution of individual tasks and analysis blocks is accomplished by a service running on the compute resource. This saves the user from the potentially significant overhead of logging in and resubmitting candidate tasks to the job scheduler.
PACE2 is implemented using RADICAL-Cybertools (RCT). RCT is a suite of middleware tools designed to enable the execution of applications with heterogeneous tasks in a scalable, flexible, and extensible manner on a wide range of HPC platforms. RCT consists of three main components: RADICAL-SAGA (Simple API for Grid Applications), RADICAL-Pilot, and RADICAL-Ensemble Toolkit, hereafter just EnTK. Together, RCT allow coding and executing applications with large ensembles of simulation-analysis tasks on HPC resources [20]; managing resource acquisition, allocation, and runtime details [21]; resolving data and compute task dependencies; and facilitating management of the runtime environment at the task level [18]. RCT allows an application written against their APIs to perform scalable, synchronous, or asynchronous execution of large ensembles of heterogeneous tasks in any arbitrarily complex dependency order. PACE2 is a framework built using the EnTK API [22,23] to express workflow applications as sets of computational pipelines, each composed of an arbitrary number of stages, each with an arbitrary number of tasks. It is used for simulating and investigating biomolecular materials systems. A predefined number of different systems, such as different molecular chemistries or compositions at different physical conditions (temperatures, pressures, concentrations, or pH), are prepared and specified as Candidates in a Candidate Pool. A Candidate Pool is a set of prepared data, e.g., input structures, topologies, or runtime parameters. In practice, it is implemented as a sandbox that contains all the necessary input files. After describing a Candidate Pool, PACE2 allows users to define k pipelines and, at execution time, PACE2 uses RCT to simultaneously spawn the MD simulations of the first k Candidate systems. Based on available resources, typically each pipeline is assigned to a subset of an overall resource allocation, e.g., one node out of several nodes requested on an HPC platform. Pipelines may be handled in one of two ways: Elimination or Extension. In the Elimination mode, the biomolecular materials systems are screened for the desired characteristics and either allowed to propagate to the high-resolution MD phase or terminated based on the screening analysis signal. In the Extension mode, different protocols are applied sequentially to the biomolecular materials system, depending on the end state of each phase. This entire set of pipelines may then be replicated for statistical significance.
Figure 1 shows a schematic representation of the PACE2 workflow. The blue blocks represent the MD simulations associated with the Candidates, whereas the green and red blocks show the analysis phase. The green blocks represent analysis phases that have returned a positive (Boolean '1') signal, and the red blocks, a negative (Boolean '0') signal. The analysis can, in principle, be any function of the files present in the pipelines, from an explicit determination of simulation parameters to using statistical or analytical models to make decisions.

Architecture and implementation
The PACE2 design is object-oriented and has two main classes: Candidate and CandidateManager. The PACE2 top-level driver script uses both classes to invoke all PACE2 functionalities. Using these two classes, PACE2 can carry out automated compliance-based elimination and extension via EnTK.
At an abstract level, an instance of a Candidate class represents a simulation candidate containing a set of phases: a pre-MD phase, an MD phase, and an analysis phase. One iteration of these three phases is generally referred to as a 'cycle'. The pre-MD phase specifies any preprocessing required prior to MD, such as compiling parameter files into an executable binary file for simulation. Simulations are run during the MD phase to compute the trajectories associated with the system dynamics. The analysis phase computes a compliance metric for the simulations to facilitate the modification of the Candidate task graphs. The Candidate class does not inherently run each of these phases but specifies their behaviors and interactions. PACE2 implements phases using the EnTK API [22,23], based on configuration files prepared by the user. PACE2 configuration files contain a set of parameters to instruct RCT to manage resource acquisition, task scheduling, and data staging. In this way, the user can focus on constructing MD commands rather than writing code to manage the execution of the workflow application on HPC resources. EnTK's Stage and Task classes are not directly exposed to the PACE2 user. At the RCT level, each phase is composed of EnTK Stages, which themselves are composed of Tasks, but at the PACE2 level, users see only the phase abstraction to hide the complexities related to translating the user-view of the application into an abstract and concrete workflow.
Each unique simulation candidate (i.e., Candidate object) is an ordered list of phases and their corresponding Stages. To facilitate the elimination and extension of Candidates, the Candidates invoke the function extend_pipeline(), a method of the class CandidateManager, to check for the compliance of each simulation Candidate, appending additional cycles or terminating the pipelines. The Candidate specifications, i.e., all runtime details required to run a Candidate to completion, are provided in a Python dictionary, and they explicitly specify the following: (i) the simulation basename; (ii) the total number of Candidates; (iii) the number of cores per pipeline; (iv) the executable for the MD simulation engine; (v) an executable for any necessary pre-processing; (vi) the arguments for the MD simulation; (vii) the executable for the analysis; (viii) the arguments for the analysis; and (ix) the maximum number of cycles, as a threshold to prevent infinite cycling. After successfully defining the task graph of each candidate through instances of the Candidate class, the top-level PACE2 driver script creates a CandidateManager object, which manages resource allocation and execution of the candidates.
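A sketch of such a specification dictionary is shown below. The key names are illustrative rather than the exact PACE2 schema, but the nine entries mirror items (i)-(ix); the executables and arguments reuse the gmx/mdrun and python3/ML_classifier.py examples appearing later in the text:

```python
# Illustrative Candidate specification dictionary (hypothetical key names).
candidate_spec = {
    "basename": "ff_fnf_coassembly",            # (i)   simulation basename
    "n_candidates": 10,                          # (ii)  total number of Candidates
    "cores_per_pipeline": 64,                    # (iii) cores per pipeline
    "md_executable": "gmx",                      # (iv)  MD simulation engine
    "pre_executable": "gmx",                     # (v)   pre-processing executable
    "md_arguments": ["mdrun", "-deffnm", "md"],  # (vi)  MD arguments
    "analysis_executable": "python3",            # (vii) analysis executable
    "analysis_arguments": ["ML_classifier.py"],  # (viii) analysis arguments
    "max_cycles": 4,                             # (ix)  cap to prevent infinite cycling
}
```

A driver script would pass one such dictionary per Candidate when constructing the task graph.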
The object-oriented design also allows abstraction of the details of the kernel code that runs the MD simulations or analysis. PACE2 provides complete kernel agnosticism, since tasks employ MD and analysis kernels via a set of pointers. These pointers and their relevant command line inputs are defined by the user as a part of the PACE2 configuration files. Hence, the MD phase can run any MD engine of the user's choice. Furthermore, the framework does not necessitate that the kernel be an MD engine: engines associated with other simulation techniques, such as Monte Carlo, may be employed to pursue specific scientific problems. Similarly, the analysis kernel could be any standalone script or pre-compiled code that already resides on the target resource's storage disk.
The flexibility of PACE2 can be seen in the execution map in figure 2. The simulation and analysis programs that the user workload runs are defined separately from the main body of the PACE2 code. This allows for the flexible expression of custom workflows for molecular simulations. The simulation kernel that generates molecular information (such as trajectories or structures) could be an MD engine, such as the one used here, or some other method, such as Monte Carlo, molecular mechanics, or quantum mechanical methods. The analysis kernel is likewise user-specified and inserted into the pipeline to allow custom analysis methods to trigger adaptive decisions regarding the treatment of the data generated by the simulation engine. In figure 2, light lines indicate the points in execution when user inputs are accessed by PACE2, and dotted lines indicate where the user workload and PACE2 business logic are executed. User-defined simulation programs (executable, e.g., gmx; kernel, e.g., mdrun) and analysis programs (executable, e.g., python3; kernel, e.g., ML_classifier.py) are staged in the compute environment.
An important step in determining the task graph is dependency resolution. Not all tasks that are known to reside on the task graph will have their dependencies met a priori, and these dependencies must be resolved. For example, an MD task that is known to exist in a future cycle may not yet have its input files generated. Once these files are generated in the prior cycle, they must be staged into an appropriate location accessible by the following cycle. PACE2 makes this determination using the analysis kernel and communicates the resolved dependencies to EnTK, which then creates symbolic links to the appropriate files at runtime.
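The staging step can be illustrated with a small sketch. The directory layout and file names below are hypothetical; EnTK performs the equivalent linking internally at runtime:

```python
import os
import tempfile

def stage_outputs(prev_sandbox, next_sandbox, filenames):
    """Resolve a data dependency between cycles by symbolically linking
    the outputs of the previous cycle into the next cycle's sandbox."""
    os.makedirs(next_sandbox, exist_ok=True)
    for name in filenames:
        src = os.path.join(prev_sandbox, name)
        dst = os.path.join(next_sandbox, name)
        if not os.path.lexists(dst):
            os.symlink(src, dst)

# Usage sketch: link a checkpoint produced in cycle 0 into cycle 1.
root = tempfile.mkdtemp()
cycle0 = os.path.join(root, "cycle0")
cycle1 = os.path.join(root, "cycle1")
os.makedirs(cycle0)
open(os.path.join(cycle0, "md.cpt"), "w").close()  # stand-in for an MD output
stage_outputs(cycle0, cycle1, ["md.cpt"])
```

Linking rather than copying avoids duplicating large trajectory files between cycle sandboxes.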
The majority of the PACE2 logic is implemented in the Candidate class. The class consists of methods to create the Candidate container (i.e., an EnTK pipeline object) and methods to populate the pipeline with the MD and analysis phases. The class also provides a function to perform the necessary decision making to extend or eliminate the candidate. The CandidateManager class creates as many sandboxes as are necessary to host the candidates of interest and manages the spawning and execution of each candidate. Finally, the top-level driver script acts as the launcher for the PACE2 application, allowing it to collect user input, spawn the CandidateManager object, and terminate the application when all candidates have completed execution. Appendix Table A2 details the various components of the PACE2 implementation. These components interact as laid out in Table A2 to enable the necessary execution containerization and dependency resolution.

Results and discussion
The PACE2 implementation is used to examine the extensive phase diagram of two biomolecular materials formed via dynamic interfacial phenomena, namely self-assembly and membrane-to-vesicle transition. The first biomolecular material is formed by self-assembling two distinct peptide sequences. The molecular composition of these materials dictates their morphology. The second biomolecular material is formed via the fusion of a membrane encompassing a mixture of phospholipids and dendron-grafted amphiphiles. The molecular composition of these materials can yield stable or unstable vesicles. The application of PACE2 to the two materials systems is discussed in two case studies.
For each use case, the PACE2 framework has been tested on both local desktop computing (i.e., local hosts) and external HPC (external hosts) resources to demonstrate the portability of the framework. The tests on the local hosts for each Candidate do not run MD simulations of the entire dynamical process but use the final structures from prior studies [3,4] to run short MD simulations whose outputs are tested by suitable analysis routines. This was done for the sake of computational efficiency; the simulations were not being tested for their outcomes but were being used as workloads to test the Elimination/Extension capabilities of PACE2. In contrast, the tests on the external hosts for each Candidate ran MD simulations resolving entire dynamical processes whose outputs are tested by suitable analyses.

Case 1: bi-component peptide coassembly
The morphological diversity of a biomolecular material with two peptide sequences is vast. The sequence of each peptide dictates its monomeric structure, which in turn impacts the morphology of the equilibrium self-assembled nanostructure. This system is explored as a use case for the PACE2 framework: a bicomponent system of coassembling peptides. The peptides in question are diphenylalanine (FF) and phenylalanine-asparagine-phenylalanine (FNF), a system studied previously and reported to exhibit a vast morphological diversity [3] (see figure 3).
A system such as the one described in this case study is particularly amenable to the PACE2 framework. It involves a large ensemble of simulations and requires the ability to discern simulations that yield desirable morphologies from those that do not, without requiring explicit user intervention. A convolutional neural network (CNN) driven [24] image recognition kernel is adopted to discriminate between the three possible nanostructures of FF-FNF peptide systems. This approach treats the task as an image classification problem, for which CNN-driven image recognition has been shown to be particularly effective [25]. The technique involves repeated convolution and pooling operations for filtration and extraction of features, followed by 'flattening' of the resulting feature maps. The output is a single string that reports whether the image initially input into the network was a nanotube, vesicle, or lamella.
The model was trained using 300 images each of vesicles, lamellae, and nanotubes, and validated using 100 images of each morphology. Figure 6 shows the plots for the training and testing of the model. After 7 epochs, the accuracy of the prediction in the training sets approaches ∼1 in five independent training runs (figures 6(a)-(e)) [28]. Figure 6(f) shows the final accuracy of the model, defined as the number of correct predictions divided by the total number of predictions. Applying the model to test data that it has not been trained on typically yields accuracies above 97%, as seen in figure 6(f).
Four molecular compositions were selected from the phase space reported in [3] (see table 1). For each composition, an ensemble of 10 seeds was prepared as PACE2 Candidates. A CNN was trained in the manner described above and employed as an analysis kernel. Each Candidate was run for two cycles of MD simulations and nanostructure appraisal by the CNN. Each MD run comprised 2 × 10^7 iterations at a time step of 25 fs. Simulations were executed through PACE2 on the PSC Bridges-2 cluster [29].
The analysis kernel was used to assess which nanostructure type was present at the end of 4 × 10^7 MD iterations. The most common nanostructure identified by analysis in each ensemble was compared with [3] (see table 1). The conflicting results relative to [3] arise because 2 × 10^7 iterations are insufficient for the structure to traverse the assembly pathway described by [3], in which the aggregates first establish a sufficient curvature and then fuse their free edges to form a nanotube. In real-world use cases similar to this example, this issue can be avoided entirely by utilizing PACE2's ability to independently invoke pipeline termination or extension. A user may choose to extend the pipeline until a desired structure is observed, subject to a maximum limit, after which the pipeline may be eliminated.

Table 1. Counts of nanostructures resolved in ensembles at select compositions of the system. Data from [3] (Reference [3] columns) and the present PACE2 study (Present study columns). EV: elongated vesicle, NT: nanotube, BL: bilayer. For each system, the table entry corresponding to the most commonly observed morphology is indicated in bold.
It must be clarified at this point that PACE2, in this example, does not modify the statistical outcomes of the ensemble in a thermodynamic sense. It merely filters out undesirable simulations in a hypothetical use case that requires the generation of multiple independent trajectories of a given nanostructure. In other words, the ensemble is not being 'disturbed' or skewed artificially in any way. However, such thermodynamically informed ensemble modifications are possible with the framework, provided they are accounted for in the analysis kernel.

Case 2: membrane-to-vesicle transition
The formation of vesicles via different self-assembly pathways has been investigated by theory [30-33] and experiments [34]. In particular, the bicelle-to-vesicle transition has been extensively studied as one of the possible mechanisms for self-assembly [35]. Here, a disk-like membrane called a bicelle (figure 7(C)) fuses its edges to form a vesicle (figure 7(F)). Hence, this phenomenon is also termed 'membrane-to-vesicle transition'. In a previous study [4], a phase space for dendronized vesicles is reported (figure 8). These vesicles consist of polyamidoamine dendron-grafted amphiphiles (PDAs) and dipalmitoyl-sn-glycero-3-phosphocholine (DPPC) lipids. The phase space of these vesicles is explored as a function of PDA generation and concentration. The green region in figure 8 represents the design conditions that generate stable vesicles. On the other hand, the red region represents design conditions that do not generate stable vesicles. Since membrane-to-vesicle transition simulations consume significant computing resources, the PACE2 framework can terminate simulations that do not generate stable vesicles.
To generate dendronized vesicles, the process of membrane-to-vesicle transition begins with an equilibrated bilayer consisting of PDAs and DPPC lipids. The equilibrated bilayer (figure 7(A)) is placed at the center of a large simulation box encompassing water, with the edges of the bilayer exposed to water. This configuration generates unfavorable interactions between the hydrophobic moieties in the bilayer and water. The bilayer minimizes these interactions by transitioning to a disk-like structure (namely, a bicelle) and partially fusing its edges (figure 7(C)). Next, the bicelle further minimizes the unfavorable interactions by transitioning to a bowl-like structure (see figure 7(E)), en route to forming a stable vesicle (figure 7(F)).
At high concentrations of PDAs, the membrane-to-vesicle transition process does not yield stable vesicles, i.e., the membranes do not fuse their free edges to form a vesicle. The PACE2 framework is used to eliminate these simulations from an ensemble of simulations. In this way, computational resources are optimally employed to simulate only membranes that are able to successfully form stable vesicles. As in case 1, a method to discern simulations that yield desirable morphologies is required. This is facilitated by a k-means clustering code that can determine whether the membrane has fused or not.
The k-means clustering algorithm [36, 37] is used to divide the membrane into two clusters of equal size (red dashed lines in figure 9). The k-means code provides the Cartesian coordinates of the center of mass of each cluster (green circles in figure 9). The Euclidean distance between the centers of mass of the two clusters is the characteristic length of the structure. The characteristic length of flat membranes in this class of systems is between 15 and 19 nm [4]. On the other hand, the characteristic length of a spherical vesicle is less than 11 nm [4]. Hence, the characteristic length is used to distinguish between desirable simulations (membranes fusing to yield stable vesicles) and undesirable ones (membranes that do not fuse their edges to form stable vesicles).
The k-means clustering code was employed as the analysis kernel in the PACE2 framework (see figure 1(A)). To validate the adaptive functionality on local hosts, membrane-to-vesicle transition simulations for ten Candidates (with a specific generation and concentration of PDAs) were conducted using the PACE2 framework (see table 2). The analysis kernel was set up to determine the characteristic length of the structure. As in case 1, the starting structures were run within the PACE2 framework for 5000 timesteps, and the output structures were passed to the analysis kernel. If the characteristic length of the structure was less than or equal to 11 nm, an extension signal was triggered, resulting in another simulation spanning 5000 timesteps. If the characteristic length of the structure was more than 11 nm, the pipeline was terminated.
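The analysis kernel described above can be sketched as follows. This is an illustrative reconstruction, not the actual PACE2 kernel: a minimal two-cluster Lloyd's iteration is written out in NumPy so the example is self-contained (in practice a library implementation such as scikit-learn's KMeans would serve the same role), and the function names are hypothetical. The 11 nm threshold and the extend/eliminate signals follow the text.

```python
import numpy as np

COMPLIANCE_THRESHOLD_NM = 11.0  # vesicle threshold reported in [4]

def two_means(coords, n_iter=50):
    """Partition bead coordinates into two clusters via Lloyd's algorithm.
    Farthest-point initialization keeps the sketch deterministic."""
    coords = np.asarray(coords, dtype=float)
    far = np.argmax(np.linalg.norm(coords - coords[0], axis=1))
    centers = np.stack([coords[0], coords[far]])
    for _ in range(n_iter):
        # Assign each bead to its nearest center, then recompute centers.
        dists = np.linalg.norm(coords[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = coords[labels == k].mean(axis=0)
    return centers

def characteristic_length(coords):
    """Euclidean distance (nm) between the two cluster centers of mass."""
    c0, c1 = two_means(coords)
    return float(np.linalg.norm(c0 - c1))

def compliance_signal(coords):
    """'extend' if the structure is en route to a vesicle (<= 11 nm),
    'eliminate' otherwise."""
    return ("extend" if characteristic_length(coords) <= COMPLIANCE_THRESHOLD_NM
            else "eliminate")
```

A flat membrane (characteristic length of 15-19 nm) would return 'eliminate', while a closing structure (below 11 nm) would return 'extend'.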
Computation for this use case was performed on Bridges-2, an HPC resource located at the Pittsburgh Supercomputing Center. The prerequisite for operating on Bridges-2 is to install the GROMACS and Python executables in the desired format. The PACE2 simulation configuration files (table A2) point to the location of these executables on Bridges-2. Each Candidate was simulated for 500 000 timesteps prior to analysis. Depending on the signal returned by the analysis kernel, the simulation was either extended or terminated.
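A configuration of this kind might resemble the following sketch. The keys shown here are hypothetical and do not reproduce the actual PACE2 schema; only the executable/kernel split (e.g., gmx with mdrun, python3 with an analysis script) and the 500 000-timestep cycle length come from the text.

```python
import json

# Illustrative resource configuration; the keys are hypothetical and do
# not reproduce the actual PACE2 configuration schema.
config = {
    "resource": "psc.bridges2",
    "md_executable": "/path/to/gromacs/bin/gmx",
    "md_kernel": "mdrun",
    "analysis_executable": "/path/to/python3",
    "analysis_kernel": "kmeans_classifier.py",
    "timesteps_per_cycle": 500000,
}
print(json.dumps(config, indent=2))
```

The essential point is that the paths resolve on the target resource, so the same workflow definition can be retargeted from a local host to Bridges-2 by editing only this file.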
Table 2. Simulation setup to validate the pipeline extension/elimination protocol for case 2. The setup tested the extension behavior of the pipelines as dictated by the characteristic length detected by the k-means kernel, which determines whether the membrane has fused its free edges to form a vesicle or not. The pipeline was tested on a local host and PSC Bridges-2 [29]. For both resources, all runs appropriately extended or eliminated the pipelines, in full agreement with [4].

This demonstrates that the PACE2 workflow can be used to launch long simulations on remote computing resources. The two case studies described above use custom analysis scripts written in Python, showcasing the flexibility available for workflow design. However, both case studies also use GROMACS [13] as the MD engine. To demonstrate flexibility with respect to the choice of MD engine, three additional examples were prepared using LAMMPS [38], Amber [39], and no MD engine (a dummy example), respectively. LAMMPS and Amber were chosen for their popularity but also for their distinct communities of practice: one is frequently used in examinations of nanomaterials and the other in biomolecular simulations. Proof-of-concept workflows were constructed that cope with each MD engine's unique input/output file structure, resulting in successful pipeline management and program outputs. The dummy example executes only arbitrary Unix commands and produces no MD output or analysis. It therefore constitutes the PACE2 minimum viable product, as well as a lightweight test case for users.
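The spirit of the dummy example can be sketched as follows, assuming a minimal stand-in pipeline in which an arbitrary Unix command replaces the MD phase and its exit code replaces the analysis signal. The function names are illustrative, not part of PACE2.

```python
import subprocess

def run_dummy_cycle(command):
    """Run one stand-in 'simulation' command; a zero exit code plays the
    role of a 'compliant' analysis signal."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0

def dummy_pipeline(commands):
    """Extend the pipeline while cycles are compliant; eliminate the
    Candidate on the first non-compliant cycle."""
    for cycle, cmd in enumerate(commands, start=1):
        if not run_dummy_cycle(cmd):
            return f"eliminated at cycle {cycle}"
    return f"extended through all {len(commands)} cycles"
```

For example, `dummy_pipeline([["true"], ["false"]])` would run two cycles and eliminate at the second, exercising the pipeline logic without any MD engine installed.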

Limitations and caveats
PACE2 provides a scalable and user-extensible toolkit to perform adaptive ensemble simulation with any simulation kernel of choice and fully custom analysis tools. However, one limitation must be made explicit. As alluded to earlier in this manuscript, certain types of problems cannot be formulated as an ensemble of independent simulations. PACE2 runs pipelines that are completely decoupled from each other. Therefore, while it is possible to run workflows with some minimal or rudimentary interaction between Candidates, PACE2 currently lacks a mechanism for complex communication or explicit synchronization between Candidates. This is not an uncommon use case; a classic example is Replica Exchange: a large number of replicas (analogous to PACE2 Candidates) run concurrently, but also maintain a very high frequency of communication across the ensemble by exchanging molecular configurations or temperatures. Broadly speaking, any implementation of enhanced or thermodynamically informed sampling that requires high-frequency communication is not currently supported in PACE2. Still, problems that can be formulated as a large ensemble of non- or minimally-interacting concurrent simulations are well suited to exploit the capabilities of PACE2, provided interactions are accounted for in the analysis kernel.
A second caveat concerns the individual test cases. The peptide assembly example uses a rudimentary CNN trained on images of various nanostructures taken from a fixed perspective. Similarly, the membrane closing example utilizes a simple k-means approach. The choices of a fixed perspective and k-means are made here to simplify the treatment of the data while meaningfully testing the robustness of the software. There are a number of ways to improve the performance of the predictive models, such as training them on images from multiple perspectives or combining image recognition with characteristic dimension measurements. However, more advanced machine learning techniques that could improve the performance of the CNN are beyond the scope of this work.

Conclusion
Computational investigations of the dynamical processes underlying the formation of biomolecular assemblies under a variety of compositions and external conditions can be efficiently performed using ensemble MD simulations. This approach faces two challenges: the large number of independent ensembles required to investigate the composition phase space, and the difficulty in predicting the formation of the material and the duration of each simulation. These challenges can be addressed by management strategies which include intermittent inspection, variable sampling, premature termination, and extension of individual MD simulations. Furthermore, the management strategies can be automated to facilitate the long-term management of a large ensemble of MD simulations.
To this end, a high-throughput, workflow-based computational framework for biomolecular materials simulations is developed and its applications are discussed. The computational framework, PACE2, consists of a set of Candidate pipelines, each comprising an arbitrary number of cycles, i.e., Simulation-Analysis loops. The pipelines can be replicated for concurrent processing of multiple Candidates. The capabilities of the PACE2 framework are demonstrated through two use cases: peptide coassembly and membrane-to-vesicle transition. The peptide coassembly case demonstrates the operation of the framework in its 'Elimination' protocol, i.e., stopping undesirable simulations and extending desirable ones for a limited number of cycles. In contrast, the membrane-to-vesicle transition case demonstrates the 'Extension' protocol, i.e., extending a simulation for multiple cycles until a desired end state is achieved. These two protocols are not mutually exclusive: it is possible to design a workload that combines them in any arbitrary manner, such as eliminating unfavorable pipelines while extending favorable ones until the desired end state is achieved. The operability of the framework has been demonstrated on a local machine as well as an HPC platform.
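The combined Elimination/Extension behavior of a single Candidate pipeline can be summarized in a short sketch. This is an abstraction of the control flow only, with hypothetical function names; `simulate` stands in for one MD phase and `analyze` for any user-supplied analysis kernel.

```python
def run_candidate(simulate, analyze, max_cycles=10):
    """Drive one Candidate through Simulation-Analysis cycles.
    `simulate` advances the state by one MD phase; `analyze` returns
    'compliant' (keep extending), 'converged' (desired end state
    reached), or 'non-compliant' (eliminate the Candidate)."""
    state = None
    for cycle in range(1, max_cycles + 1):
        state = simulate(state)            # one MD phase
        verdict = analyze(state)           # one analysis phase
        if verdict == "non-compliant":
            return ("eliminated", cycle)   # Elimination protocol
        if verdict == "converged":
            return ("completed", cycle)    # Extension protocol end state
    return ("exhausted", max_cycles)
```

Replicating this loop across concurrent pipelines, with freed resources handed to the next Candidate in the pool, gives the overall PACE2 execution pattern.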
In the future, the PACE2 framework will be extended to address more generalized workflow systems. Further, scalability studies will be performed to evaluate performance on large HPC platforms spanning several thousand nodes, as well as to quantify the compute time saved compared to running simulation studies without the framework. While this data is not yet directly available for PACE2, it is reasonable to assert that PACE2 can scale to tens of thousands of concurrent MD simulations due to its use of the RCT suite [21]. In addition, other computational methods and analysis tools relevant to the identification of sequence/composition-structure-property relations in diverse classes of materials, such as polymeric materials, composites, alloys, ceramics and glasses, can potentially be integrated into the framework.

Appendix. Terminology
Table A1. Definitions of terms used in the manuscript, including reference definitions for relevant EnTK terms [22, 23].
Task [EnTK]: An abstraction of a computational task that contains information regarding an executable, its software environment and its data dependencies. Note this is an EnTK entity.

Figure 1. (A) Execution pipeline schematic. A pipeline is a division of the available compute resources devoted to running a single Candidate at a time. Candidates are constructed in alternating cycles of simulation and analysis. Periodic analysis is used to inform PACE2 whether a Candidate should be extended or eliminated. (B) PACE2 schematic. Candidates set up a priori are drawn sequentially from the Candidate Pool and simulated in concurrent pipelines. Each pipeline performs an MD phase (blue) followed by an analysis phase. If the analysis task returns a 'compliant' signal, the pipeline extends into an additional MD phase (shown in green). A 'non-compliant' signal eliminates the Candidate by terminating the pipeline (shown in red) and spawning a new pipeline with the next Candidate.

Figure 2. PACE2 execution map. Candidate directories and config JSON files are provided by the user in the client environment. Light lines indicate the point in execution when user inputs are accessed by PACE2. Dotted lines indicate where the user workload and PACE2 business logic are executed. User-defined simulation programs (executable, e.g., gmx, and kernel, e.g., mdrun) and analysis programs (executable, e.g., python3, and kernel, e.g., ML_classifier.py) are staged in the compute environment.

Figure 3. Phase map of bicomponent peptide coassembly of FF and FNF as a function of relative tripeptide and total peptide concentrations.

Figure 4. Images of a vesicle, lamellae and nanotube (left to right) generated by the MD engine (bottom) and the pre-processing module of the neural network (top).

Figure 5. Setup of the CNN kernel. A GROMACS .gro file is first used to generate an image that is then input into the CNN, which generates an output string and writes it to a file.

Figure 6. Training and validation of the CNN model. Five independent training runs were performed. (a) shows the training accuracy at the end of each epoch. (b) and (c) show the loss function and accuracy for the validation set, respectively. (d) and (e) show fine-grained training metrics (accuracy and loss function, respectively), plotted during model training through each epoch instead of only at the end of each epoch. (f) shows the final accuracy derived from running the trained model after 10 epochs on the test split data.
Figures 6(a) and (b) show the training accuracy and loss function after every epoch. A loss function is a metric that quantifies the 'penalty' for an incorrect prediction by the model; this is the function that is minimized during training. Figure 6(c) shows the accuracy on the validation data after every epoch of training. Figures 6(d) and (e) show a fine-grained representation of the training accuracy and loss function during the training process through each epoch rather than only at the end of each epoch. The loss function value remains somewhat high until about 7 epochs, after which it begins to drop, reaching very low values (10⁻³ and below) at 10 epochs. Therefore, ten epochs is considered reasonable for good feature extraction. Extending the training further risks overfitting.
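The manuscript does not specify which loss function the CNN uses; as an illustration of the 'penalty' behavior described above, the categorical cross-entropy, a common choice for image classifiers, is sketched below. The function name and probability values are purely illustrative.

```python
import math

def cross_entropy(predicted_probs, true_index):
    """Categorical cross-entropy for a single sample: the penalty is
    -log of the probability assigned to the correct class, so a
    confident correct prediction incurs a loss near zero and an
    uncertain one a large loss."""
    return -math.log(predicted_probs[true_index])

# A late-epoch, confident prediction vs. an early-epoch, uncertain one.
confident = cross_entropy([0.999, 0.0005, 0.0005], 0)
uncertain = cross_entropy([0.34, 0.33, 0.33], 0)
```

Minimizing such a loss drives the model toward assigning high probability to the correct morphology class, which is consistent with the drop to values of 10⁻³ and below seen late in training.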

Figure 7. (A)-(F) show the transition of a bilayer (A) to a vesicle (F). Blue beads are DPPC lipids. PDAs are not shown for clarity.

Figure 8. Phase map of dendronized vesicles generated via membrane-to-vesicle transition, plotted as a function of PDA generation and relative concentration. Reproduced from [4] with permission from the Royal Society of Chemistry.

Figure 9. Images of a flat membrane (A) and a spherical vesicle (B).The red dashed lines divide the assemblies in (A) and (B) into two regions of equal size.The green circles represent the center of mass of the clusters.
Pipeline [EnTK]: A sequence of Stages, which can be dynamically ordered and bound to a reserved compute resource. Pipelines are managed by PACE2 and are not explicitly invoked by the user. Note this is an EnTK entity.
Sandbox [EnTK]: A directory created by EnTK which contains all of the input files required for task execution as well as all task outputs. Sandboxes are managed by EnTK and not explicitly invoked by PACE2 or the user. Note this is an EnTK entity. (RADICAL-EnTK 2023)
Stage [EnTK]: A set of independent tasks to be executed concurrently. Stages are managed by PACE2 and are not explicitly invoked by the user. Note this is an EnTK entity.

Table A2. Functions implemented in PACE2 to enable remote execution of task graphs. Bold subheadings indicate which program file contains each function.