Cyberinfrastructure for sustainability sciences

Meeting the United Nations' Sustainable Development Goals (SDGs) calls for an integrative scientific approach, combining expertise, data, models and tools across many disciplines to address sustainability challenges at various spatial and temporal scales. This holistic approach, while necessary, exacerbates the big data and computational challenges already faced by researchers. Many challenges in sustainability research can be tackled by harnessing the power of advanced cyberinfrastructure (CI). The objective of this paper is to highlight the key components and technologies of CI necessary for meeting the data and computational needs of the SDG research community. An overview of the CI ecosystem in the United States is provided with a specific focus on the investments made by academic institutions, government agencies and industry at national, regional, and local levels. Despite these investments, this paper identifies barriers to the adoption of CI in sustainability research that include, but are not limited to, access to support structures; recruitment, retention and nurturing of an agile workforce; and lack of local infrastructure. Relevant CI components such as data, software, computational resources, and human-centered advances are discussed to explore how to resolve these barriers. The paper highlights multiple challenges in pursuing the SDGs based on the outcomes of several expert meetings. These include multi-scale integration of data and domain-specific models, availability and usability of data, uncertainty quantification, mismatch between the spatiotemporal scales at which decisions are made and those at which information is generated from scientific analysis, and scientific reproducibility. We discuss ongoing and future research for bridging CI and the SDGs to address these challenges.


Introduction
Many sustainable development goals (SDGs) are interconnected. Addressing the sustainability challenges to meet the SDGs calls for integrated approaches. For example, the food security challenge cannot be resolved without addressing water-related issues, which in turn are tied to the challenges associated with climate change. Solving sustainability challenges involves actions at many levels, ranging from decisions and actions at grassroots levels to changing policies and enforcing regulations at government levels, which are impacted by individual-level actions and vice versa. Decisions about pursuing SDGs increasingly rely on data and scientific models. This is truer today than ever, as data are collected pervasively about many dimensions of the Earth and our society. For example, data-driven precision agriculture is revolutionizing farming practices by combining climate information with detailed geospatial data. Similarly, improving the flood resilience of cities involves heterogeneous data sources collected at multiple geospatial scales, ranging from field sensors to satellite remote sensing images covering the entire globe. This digital transformation is permeating all sustainable development dimensions, which necessitates translating big data into actionable information at appropriate spatial and temporal scales to support optimal decision making. The challenges of the multi-faceted, multi-scale nature of research and decision making in addressing one or more SDGs are further compounded by the complexity of handling big data and related tools. Such challenges can be addressed by harnessing the power of advanced cyberinfrastructure (CI) to innovate integrated approaches for holistically understanding sustainable development through digital representations and computational capabilities.
The International Network of Networks for Global-Local-Global Analysis of Systems Sustainability (GLASSNET) aims to link global communities of researchers working towards achieving water- and land-related SDGs by creating a collaborative infrastructure across disciplines. The GLASSNET community members met in a series of 2021 summer workshops (https://mygeohub.org/groups/glassnet/learning-hub/workshops/) on data integration, cross-scale sustainability analysis, linking science and policy in global-local-global (GLG) analysis of systems sustainability, and CI, and at the 2022 GLASS Conference on 'Managing the Global Commons: Sustainable agriculture and use of the world's land and water resources in the 21st Century' (https://mygeohub.org/groups/glassnet/calendar/glassconf2022). They identified several challenges that can be addressed by advances in CI. We argue that a wealth of data is available from multiple disciplines, and that CI is needed to use these data effectively. Additionally, the SDGs cannot be addressed by looking at just one goal or without integrative scientific approaches, and for these integrative approaches, CI itself needs to be integrated and broadly accessible as well. The objective of this paper is to highlight the key components and technologies of CI necessary for meeting the data and computational needs of the SDG research community. We start with an overview of the CI ecosystem in the United States with a specific emphasis on the investments made by academic institutions, government agencies, and industry at national, regional, and local levels. Despite these investments, this paper identifies barriers to the adoption of CI in sustainability research that include, but are not limited to, access to support structures; recruitment, retention and nurturing of an agile workforce; and lack of local infrastructure. Next, we discuss the relevant CI components such as data, software, computational resources, and human-centered CI advances and explore how to lower the barriers. We follow up with a discussion on multiple challenges in pursuing SDGs based on the outcomes of several expert meetings and how harnessing CI advances can bridge the gaps between SDG efforts and rapidly evolving technologies.

The CI ecosystem: core components and opportunities
The concept of CI was coined in a 2003 Blue Ribbon Panel report to the National Science Foundation (NSF) Directorate for Computer and Information Science and Engineering [Atkins2003]. In analogy to the infrastructure, such as roads, bridges, rail lines, power grids and telephony networks, that underlies an industrial economy, the report used the word cyberinfrastructure to refer to the collective of advanced computing systems, data and information management, and high-performance networks that powers 21st-century science and engineering research and education. Today, advanced CI comprises not only hardware systems but also the software that links all the components and makes the system useful and usable, as well as the human expertise that operates the resources and helps researchers utilize them [Stewart2010].
CI is permeating all areas of science and society and has become a key enabler of scientific discoveries and engineering innovations, as highlighted in the 2021 NASEM report on Global Change Research and Opportunities for 2022-2031 [NAS2021]. Significant investments have been made by NSF and other programs in a wide range of advanced CI resources and services, as indicated in the outer blocks of figure 1. Overall, the various CI components have evolved and improved significantly over the last few decades due to the tremendous advances in computing, networking and storage hardware, and the increased accessibility of software and proliferation of applications. All of these are beginning to be integrated into an ecosystem of advanced capabilities. However, substantial barriers to broader use, to the 'democratization of access', still exist. For example, traditionally underserved and under-resourced domains and institutions lack secure connections, adequate local infrastructure, and trained support staff to effectively use advanced CI resources [Parashar2022]. Many recent funding programs emphasize bridging the gaps between advanced CI capabilities and tools that are usable by domain science researchers to study complex research questions, and the development of the next-generation CI workforce (see, e.g. the NSF Office of Advanced Cyberinfrastructure, www.nsf.gov/funding/programs.jsp?org=OAC). We believe that such developments are creating significant opportunities for advancing science, democratizing access, and broadening participation, as discussed in the following sections.
Reaching the SDGs cannot be accomplished without integrative scientific approaches. Similarly, the components of CI, including data, software and analytics, modeling, artificial intelligence (AI), systems and people, also need to be integrated and accessible to meet these challenges (figure 1). The rest of this section describes the key CI components and technologies with the aim of highlighting the opportunities arising from innovative CI advances and the overall computational ecosystem that will help transform multidisciplinary, multiscale GLG SDG research to address grand challenge problems.

Data, metadata, and the FAIR principles
Data is central to the research process, so it is no surprise that data sits at the center of many of the challenges in advancing sustainability science. Barriers to the access and reuse of data hamper open science. Datasets are often captured and shared without sufficient metadata and context to enable another researcher to find, properly understand, and utilize the data, in particular across multiple disciplines and across scales from global to local to global and back again. Data repositories must offer robust solutions for expressing and exposing metadata that describe the datasets they provide access to, both for users and user-agents, as well as application programming interfaces, standard protocols, and service endpoints for accessing and performing operations on datasets.
The United Nations Educational, Scientific and Cultural Organization (UNESCO) recommendation on open science provides an international framework for policy and practice to outline a common definition, shared values, principles and standards for open science at the international level, and it proposes a set of actions conducive to a fair and equitable operationalization of open science for all at the individual, institutional, national, regional and international levels [UNESCO2021]. It prominently includes the treatment of research data in its definition of open scientific knowledge and states that data should be as openly accessible as possible, made available in a timely fashion, in both human- and machine-readable and actionable formats. It also acknowledges the need for data governance, in particular, to respect the rights of indigenous peoples and local communities through mechanisms such as the CARE principles, which address considerations of collective benefit, ownership, authority to control, responsibility, and ethics [Carroll2020]. UNESCO advocates for open access to data from publicly funded research, continued innovation to broaden and deepen data capabilities for diverse stakeholders, and the importance of proper curation and long-term stewardship of data.
Progress towards solving many of these challenges can be accomplished by making data FAIR: findable, accessible, interoperable, and reusable [Wilkinson2016]. Not to be conflated with open access, the FAIR principles encourage data to be as open as possible and as closed as necessary. Researchers, policy makers, and other stakeholders have difficulty finding existing data for their models and analyses, which often results in redundant data collection and generation. In some cases, related data can be located but are not available because of authentication and authorization requirements, the implementation of non-standard protocols for accessing and transferring the data, or a lack of stewardship resulting in the data no longer existing. Interoperability of data is key in sustainability science because the problems are typically addressed using interdisciplinary and multidisciplinary approaches, with each discipline using its own distinct vocabularies and formats. Machine interfaces must balance complexity with ease of use, a tension that can, in part, be addressed by supporting multiple protocols. Lastly, when researchers find datasets that relate to their inquiry, the datasets are often poorly documented, constrained by license restrictions, and lack the provenance necessary to understand how the data were managed and whether the data can be trusted for reuse. Applying these principles to data for use by machines (in addition to humans) exponentially increases both the challenges and the benefits of FAIR adoption.
It is notable that the FAIR principles apply to both data and metadata. Improving the quality and precision of metadata may represent low-hanging fruit for sustainability research and for enabling data to be FAIR for GLG analyses. In particular, creating more robust, normalized metadata for geographic location, time, and the subjects and keywords related to the research may provide the greatest impact for the least investment. An exemplar is the United Nations' Open SDG Hub, which classifies data related to each goal, describes each with geospatial and temporal references, and relates data for each goal to targets and indicators towards reaching the goal as well as related publications, policies, and events. In terms of the data lifecycle, primary considerations include the capture of sufficient context when researchers deposit their datasets in a repository and the functionality of the repository in expressing and exposing metadata for both users and user-agents. One challenge in enabling multiscale metadata is incentivizing researchers to invest time in providing additional description and context. Ideally, CI providers can build functionality that accomplishes this into their tools and platforms to reduce the effort required of the researcher.
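As a concrete, minimal illustration of the kind of normalized metadata discussed above, the sketch below builds a schema.org-style Dataset record with explicit geospatial and temporal coverage. All names, identifiers, and values are hypothetical placeholders rather than an actual repository record.

```python
import json

# Illustrative schema.org/Dataset metadata record; every value below is a
# hypothetical placeholder, not a real dataset or DOI.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "County-level irrigation water withdrawals (illustrative)",
    "description": "Example record showing normalized location, time, "
                   "and subject metadata for a sustainability dataset.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["SDG 6", "irrigation", "water withdrawals"],
    # Normalized geographic coverage as a WGS84 bounding box
    # (south latitude, west longitude, north latitude, east longitude).
    "spatialCoverage": {
        "@type": "Place",
        "geo": {"@type": "GeoShape", "box": "36.0 -102.0 40.0 -94.5"},
    },
    # Normalized temporal coverage as an ISO 8601 interval.
    "temporalCoverage": "2010-01-01/2020-12-31",
    "identifier": "https://doi.org/10.xxxx/placeholder",
}

print(json.dumps(record, indent=2))
```

Records of this kind can be embedded in dataset landing pages, where they are harvestable by both human users and user-agents such as dataset search services.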

Data analytics
Tackling SDGs is increasingly dependent on complex and massive data streams. One common type of such data streams is geospatial data (i.e. data with geographic and spatial components), which permeates broad scientific and societal realms and requires major advances in data analytics based on critical thinking about the complex interactions among various dimensions of sustainability across spatial and temporal scales [Wang2016]. Geospatial software plays a critical role in examining and understanding such interactions and has been widely developed and used by numerous communities to transform geospatial data into valuable insights and scientific knowledge. The growing benefits and importance of geospatial software are driven by tremendous needs in numerous fields such as agriculture, ecology, environmental engineering and sciences, human-environment and geographical sciences, geosciences, national security, public health, and social sciences, to name a few, and these are reflected by a massive digital geospatial industry [Vandewalle2021]. A variety of modalities of data analysis are embedded in geospatial software, for example, AI and machine learning (ML) [Xu2018], spatial statistics [Anselin2012], spatial optimization [Lin2015], spatial simulation modeling [Tang2011], and spatial network analysis [Chen2014].
Spatiotemporal and multilevel variations are central to many investigations of sustainability problems. One challenge is that available datasets often exist at incongruent units of geography and disparate time scales. As such, there can be a mismatch between how researchers may theorize and what data are actually available at a given scale. A relevant example is the study of the effects of neighborhoods and their characteristics on individuals' outcomes [Sampson2002]. Despite the importance of micro-level variability to the understanding of social cohesion, inequality, the effect of educational outcomes on future success, or air pollution effects on asthma, typically used datasets are often not available at the neighborhood or finer levels [Sampson2012]. This data unavailability helps protect privacy, but readily downloadable data from major social surveys typically do not include georeferenced information at fine levels. These data sources, along with other relevant data such as environmental or sensor data or those from various commercial sources, may report data at the state, county, tract, block group, or address level, and at different times. Integrating heterogeneous geospatial data requires addressing inconsistent levels of data detail, incompatible formats, and differing levels of uncertainty and completeness [Gong1994]. Furthermore, data measurement errors can accumulate or balance out in data analyses at different spatial and temporal scales as well as in data transformation and data reduction using analytical models [Hu2017]. Holistic research on integrating socio-environmental data across multiple scales will facilitate groundbreaking analyses by providing a reproducible approach to using CI and geospatial software to fuse and analyze data from multiple sources and scales, facilitating the examination of small areas. This will advance the field's ability to examine geospatial characteristics at appropriate levels of spatiotemporal resolution, mitigating the risk of true associations being obscured by relying on data from overly broad geospatial levels. It is important to note that digital geospatial representations can be limited due to inadequate granularity or quality of measurements in diverse sustainability dimensions, such as in coastal and marine contexts. This limitation must be explicitly addressed in sustainability studies, especially those concerning broad social and environmental implications.
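One common technique for harmonizing data reported at incongruent units of geography is area-weighted areal interpolation. The sketch below, written with the geopandas library, assumes two polygon layers in a shared projected coordinate reference system; the layer and column names are hypothetical, and a real analysis would also need to account for the uncertainty this transformation introduces.

```python
import geopandas as gpd
import pandas as pd

def areal_interpolate(source: gpd.GeoDataFrame, target: gpd.GeoDataFrame,
                      value_col: str, target_id: str) -> pd.DataFrame:
    """Area-weighted transfer of an extensive variable (e.g. population
    counts) from source zones (e.g. census tracts) onto incongruent
    target zones (e.g. neighborhoods)."""
    src = source[[value_col, "geometry"]].copy()
    src["_src_area"] = src.geometry.area
    # Intersect the two zonal systems; attributes of both layers are kept.
    pieces = gpd.overlay(src, target[[target_id, "geometry"]],
                         how="intersection")
    # Each piece receives a share of the source value proportional to the
    # fraction of its source zone's area that it covers.
    share = pieces.geometry.area / pieces["_src_area"]
    pieces[value_col] = pieces[value_col] * share
    return pieces.groupby(target_id, as_index=False)[value_col].sum()

# Hypothetical usage: tracts carry population counts; neighborhoods receive
# area-weighted estimates.
# tracts = gpd.read_file("tracts.gpkg")
# hoods = gpd.read_file("neighborhoods.gpkg")
# pop_by_hood = areal_interpolate(tracts, hoods, "population", "hood_id")
```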

Modeling
Addressing sustainability requires the use of data and simulation tools from multiple disciplines including climate, water, ecology, agriculture, social sciences and economics, among others. While there are some systems-based approaches and models that use data from different disciplines to address broad questions related to sustainability, most often domain-specific simulation tools are needed to address issues at multiple spatial and temporal scales. The data and results associated with such simulations can then be used as inputs to simulations or analyses in another domain (also known as soft coupling). For example, streamflow results from a hydrologic model can be used in an environmental model for computing sediment or pollutant loads, which in turn can be used in an ecological model for assessing ecosystem services. Thus, accessibility and interoperability of information, including data and analysis tools, among multiple domains is critical for addressing sustainability issues.
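A minimal sketch of such soft coupling is shown below: the output file of an upstream hydrologic model is transformed into the input expected by a downstream sediment model. The file names, column names, and rating-curve coefficients are hypothetical placeholders.

```python
import pandas as pd

# Soft coupling sketch: streamflow produced by a hydrologic model becomes
# the driver of a simple sediment-load calculation. All file names, column
# names, and coefficients below are illustrative placeholders.
streamflow = pd.read_csv("hydrologic_model_output.csv",
                         parse_dates=["date"])  # columns: date, q_cms

def sediment_load(q_cms: pd.Series, a: float = 0.05,
                  b: float = 1.8) -> pd.Series:
    """Sediment rating curve, load = a * Q**b (tonnes per day)."""
    return a * q_cms ** b

streamflow["sediment_tpd"] = sediment_load(streamflow["q_cms"])
# Persist in a documented format that the downstream ecological model reads.
streamflow.to_csv("sediment_model_input.csv", index=False)
```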
CI is playing a key role in enabling the flow of information among different disciplines across multiple platforms. Specifically, CI is enabling publication and sharing of data and models as web resources [Tarboton2014, Rajib2016, Kalyanam2019, Khandelwal2022]. While some of these developments are domain specific, e.g. HydroShare for water, they enable interoperability with other systems such that a researcher from a different domain can access the needed information, process it and create new information and knowledge. For example, SWATShare, a platform for publishing and sharing Soil and Water Assessment Tool (SWAT) models, can fetch a SWAT model from HydroShare, use its visualization capabilities for displaying results, and let users create a new instance of the model to answer a new question. A user can potentially validate the model's performance against observed data by accessing this information from the United States Geological Survey's National Water Information System, or calibrate the model in SWATShare by accessing high performance computing (HPC) resources, such as those of the Extreme Science and Engineering Discovery Environment (XSEDE), on the web. This SWAT example shows how CI is enabling reproducible and reusable research for sustainability science, and the overall framework and workflow can be replicated for any model. CI is also enabling access to Earth observations, performing all the pre-processing on the fly and providing results in a fraction of the time without using any storage or computing power on the user's computer.
Solving SDG problems also requires multidisciplinary collaboration and coupling of models from different domains that simulate different components of the Earth and socio-economic systems. In addition to the common data and computation challenges associated with running any individual model, there exist many conceptual and technical challenges in coupling models across domains, ranging from model paradigm differences and domain knowledge gaps across disciplines to platform dependence and the time-consuming effort to harmonize and exchange data across modeling teams. Barriers due to institutional boundaries also often hinder access to collaborators' models and data. Efforts are underway to address these challenges, and advanced CI is demonstrating its potential to not only accelerate the effort in model crosslinking but also make this process more reproducible and interoperable. While most of the research work related to model coupling stops at the phase of one-way, offline coupling, researchers in two projects, INFEWS (Innovations at the Nexus of Food, Energy, and Water Systems) and the DOE-funded PCHES (Program on Coupled Human and Earth Systems), partnered with CI professionals in developing a collaborative container-based modeling infrastructure called C3F [Woo2022] and applied it in coupling a water balance model and the SIMPLE-G model [Baldos2020] to understand how economically driven changes in agricultural production may impact sustainable water use. In this system, the researchers independently packaged their models, along with the data processing code that converts the output of one model to the input format of another, into Singularity containers and collaboratively explored, created, and executed the coupled modeling workflows using XSEDE HPC resources.
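To convey the flavor of such a container-based workflow, the sketch below chains two containerized models through a containerized converter. The image names, command-line options, and paths are hypothetical placeholders; the actual C3F infrastructure additionally provides collaborative workflow composition and execution on HPC resources.

```python
import subprocess

# One-way coupled workflow sketch: each model and the intermediate data
# converter run as separate Singularity containers. Image names, options,
# and paths are hypothetical placeholders.
steps = [
    ["singularity", "run", "water_balance.sif",
     "--output", "work/water_out.nc"],
    ["singularity", "run", "converter.sif",
     "--input", "work/water_out.nc", "--output", "work/simpleg_in.csv"],
    ["singularity", "run", "simple_g.sif",
     "--input", "work/simpleg_in.csv", "--output", "work/results.csv"],
]
for cmd in steps:
    subprocess.run(cmd, check=True)  # stop the chain if any stage fails
```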

AI
The increasing availability of data in all sectors of the economy and governance offers the potential to greatly improve policy and decision making and incentivize actions at local levels. But realizing this potential will require effective harnessing of AI/ML, which have already revolutionized many fields such as commerce, entertainment, and transportation.
Specifically, AI/ML can play a critical role in overcoming many of the CI barriers to addressing SDGs. First, AI/ML (along with geospatial analytics) can make data available globally and equitably to serve as input drivers for models. For example, remote sensing data from Earth observing satellites can be used to create virtual gauges that can be used to parameterize hydrology models in parts of the world where no streamflow observations are available [Gil2021]. Second, computationally efficient ML emulators of complex process-based models can enable fast evaluation of many scenarios for policy planning and decision making [Reichstein2019]. Third, ML methods can leverage observations from a collection of localized regions to build models that provide high quality predictions on a global scale. For example, an ML model trained using data from a set of observed hydrological basins in the widely used CAMELS data set is able to greatly outperform individually calibrated state-of-the-art process-guided hydrological models [Kratzert2019, Li2022], even in basins where observations are available to parameterize complex process-based models. The power of ML becomes more evident in ungauged scenarios (e.g. basins where observations are not available), as meta-transfer learning approaches allow transfer of models from gauged regions to unobserved regions [Kratzert2019b, Willard2021a].
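The sketch below illustrates the basic pattern of transferring a model trained on gauged basins to ungauged ones. It substitutes a random forest over static basin attributes for the LSTM-based approaches used in the cited studies; the file and column names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Train on basins with observations, predict for basins without any.
# CAMELS-style studies use LSTMs over daily forcings; a random forest over
# static attributes stands in here. Files and columns are hypothetical.
gauged = pd.read_csv("gauged_basins.csv")      # attributes + observed runoff
ungauged = pd.read_csv("ungauged_basins.csv")  # attributes only

features = ["area_km2", "mean_precip", "aridity", "forest_frac"]
model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(gauged[features], gauged["mean_annual_runoff"])

# Transfer: predictions for basins with no streamflow observations.
ungauged["pred_runoff"] = model.predict(ungauged[features])
```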
Addressing grand societal challenges also requires new innovations to address the well-known limitations of black-box ML methods in the context of scientific problems: (i) the need for a large amount of observations for effective training of state-of-the-art ML models; (ii) the inability to handle out-of-sample scenarios (e.g. where the training data are not representative of what would be encountered in test scenarios); and (iii) the inability to effectively represent nonlinear dynamics of bio-physical processes that are evolving and interacting at multiple spatial and temporal scales. These innovations are being pursued in the emerging field of knowledge-guided ML (KGML). For example, these methods can incorporate physical laws (e.g. conservation of energy, mass balance) in the loss function that is used to guide the training of the ML model, make use of the underlying relationships between various physical processes in the design of ML architectures, and even use the output of imperfect physical models along with actual observations for training ML models. These KGML techniques are fundamentally more powerful than standard black-box ML approaches and traditional mechanistic process-based models used by the scientific community to address environmental problems, as they can leverage the latest advances in ML without ignoring the scientific knowledge accumulated over decades and centuries [Karpatne2017, Willard2021b, Karpatne2022]. Already these techniques have led to novel data sets (e.g. the ReaLSAT global data set of water body dynamics [Khandelwal2022]) and greatly improved predictive models (e.g. for monitoring the quality of freshwater lakes, streamflow modeling in hydrological basins, and GHG emissions from agriculture) [Read2019, Jia2019, Hanson2020, Jia2021, Liu2022, Ghosh2022]. For example, in the context of modeling surface temperature dynamics of lakes, Read et al [Read2019] show that KGML techniques are able to generalize in out-of-sample scenarios much better than traditional process-based models as well as black-box ML models. Despite these early successes, research in KGML methods is still in its early stages and there is much work to be done, especially in areas such as causal discovery, uncertainty quantification, and effective modeling of nonlinear dynamics in physical and environmental systems [Karpatne2022].
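As a concrete example of the first KGML idea above, incorporating physical laws into the training loss, the sketch below follows the spirit of the physics-guided lake temperature work: predicted temperature profiles are penalized when they imply density inversions (denser water sitting above lighter water). This is a minimal PyTorch sketch, not the exact formulation of the cited papers.

```python
import torch

def water_density(t_c: torch.Tensor) -> torch.Tensor:
    """Density of fresh water (kg m^-3) from temperature (deg C), using a
    standard empirical approximation."""
    return 1000.0 * (1.0 - (t_c + 288.9414) * (t_c - 3.9863) ** 2
                     / (508929.2 * (t_c + 68.12963)))

def kgml_loss(pred_t: torch.Tensor, obs_t: torch.Tensor,
              lam: float = 1.0) -> torch.Tensor:
    """Supervised error plus a penalty for physically implausible profiles.
    pred_t has shape (batch, depths), ordered from surface to bottom;
    density should not decrease with depth."""
    mse = torch.mean((pred_t - obs_t) ** 2)
    rho = water_density(pred_t)
    # Positive wherever a shallower layer is denser than the one below it.
    violation = torch.relu(rho[:, :-1] - rho[:, 1:])
    return mse + lam * violation.mean()
```

Because the physics term requires no labels, it can also be evaluated on unlabeled inputs, which is one way such methods reduce the amount of observations needed for training.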

Human-centered CI
In the context of scientific practice, ethics typically involves consideration of moral questions that arise in research, publication, data collection and analysis, and other professional activities, all governed by a set of moral principles or codes of behavior [Proctor1998]. Ethics in science includes everything from human subjects to cultures of inclusion, and intersects particularly with the field of sustainability, in which a preponderance of diverse types of data are geolocated or otherwise associated with personally identifiable information [Nelson2022]. As CI has become indispensable in many scientific fields, how to make CI equally accessible and usable for solving sustainability challenges represents a new ethical challenge beyond the ethics of human subjects and privacy. Specific examples of this challenge include how to assure scientific reproducibility for CI-based data-driven analytics; how to mitigate algorithmic bias baked in by assumptions made in algorithmic designs and parameter choices; and how to accommodate different stakeholders and users who might not be able to take advantage of the same CI capabilities without serious and rigorous work on human-centered user interfaces adaptive to diverse backgrounds and needs. From these examples, it is evident that the magnitude of the challenge is large, especially affecting the viability of harnessing CI and big data to tackle sustainability challenges, simply because any effective solutions to sustainable development must be ethical, equitable, and just.
Part of the solution to this ethical challenge lies in education and workforce development. The power and capabilities of CI as articulated here cannot be fully utilized unless the skills and expertise needed to exploit its full potential are imparted to students, researchers and working professionals. Ideally, CI training should be integrated into the science curriculum at different levels, from lower undergraduate to graduate studies, but providing such training faces several challenges. First, there is a steep learning curve for instructors to become familiar with the technology and then keep up with continuous changes. Second, given all other commitments, there is not enough time for researchers and educators to develop training materials and/or incorporate such training into an existing curriculum. Third, many institutions, especially two- and four-year colleges, do not have the necessary infrastructure, e.g. access to HPC resources, to provide CI training. To address these challenges and to make CI training more accessible, there are some recent community efforts to develop CI curricula or training materials for sustainability-related disciplines in science, engineering and agriculture. These include the FAIR CyberTraining for climate and water and the IGUIDE Education and Workforce Development activities, among others. To enable broader adoption, the training materials should be adapted to the needs of different audiences, including students, researchers and professionals at appropriate levels; be easily available; and be offered in formal (credit courses or certificates at universities), semi-formal (non-credit certificates, continuing education units from professional organizations) and informal (self-paced online training) modes. The ongoing curriculum development includes the creation of modules that anyone can take, with or without prerequisites, in a self-paced mode. In a more formal mode, one can follow an instructor-led structured sequence of modules to earn credit that could lead to a badge or certificate.

CI resources
As key computing and data technologies have become a foundational enabler of all scientific research over the past several decades, the overall CI ecosystem has been evolving toward greater accessibility, usability, and interoperability. The US CI landscape has changed significantly, from a focus on high-performance parallel computing systems in the 1980s to today's array of diverse system architectures and storage, complemented by an ever-expanding software ecosystem addressing researcher needs from data acquisition, curation, and management; modeling and simulation; to data analytics, visualization, virtual reality, and ML training and inference.
The core of the NSF innovative CI portfolio balances the extreme-scale computing of leadership-class resources, which only a small number of applications can efficiently use to leverage the full capability of the system, with capacity systems that support the full range of computational applications, including many so-called 'long tail' usages that center on data of multiple dimensions and scales. The latter group, including many domains of the GLASS community, represents most of the applications currently running on national and campus computing resources. The portfolio also balances production-quality operational resources that researchers rely on in their daily activities with novel systems that experiment with and explore the potential benefits of newer technologies for research problems. For example, Anvil and Delta, two of the newest NSF systems, provide significant capacity in central processing unit (CPU) computing (one billion hours per year) and graphics processing unit (GPU) computing (eight hundred GPUs); Neocortex explores the latest innovations in wafer-scale chip design, packing trillions of transistors and close to one million cores onto a single chip, with the aim of accelerating deep learning training, one of the major challenges in developing deep learning models. Both Anvil and ACES explore composable infrastructures to effectively support the increasingly complex workflows and applications that do not fit neatly into the classic 'batch' computing typical of HPC systems. Composable infrastructures provide highly customizable, on-demand provisioning of resources such as CPU, GPU, storage, and network according to workload requirements. These new resources aim at meeting the growing demands from domain science researchers for interactive ML and data analytics, hosting databases, web services, and lab notebooks, to name a few, and for sharing data and interactive applications to improve the accessibility and reusability of their research.
However, raw computational power cannot be fully harnessed for research without services and support, as noted earlier. These national-level advanced CI resources have been freely accessible to US researchers through the XSEDE Federation, which is now transitioning to ACCESS, a virtual organization funded by NSF that coordinates the allocation, user support, training and outreach of the NSF CI resources. This provides a mature set of people and system interfaces to integrate with the resource providers distributed around the country. These user-oriented services lower the barriers to access; examples include a user portal for requesting access with an expedited process for granting start-up and education allocation requests, a common helpdesk that dispatches user inquiries to resource providers, a central place for advertising and registering for training events conducted by resource providers, and an online repository of training materials. Many resource providers are beginning to support the ever-expanding needs for interactive computing and web access to HPC (versus the classic HPC access of batch computing and Linux commands). Anvil, for example, supports interactive computing provisioning as well as web interfaces such as Jupyter Notebook, RStudio, ThinLinc remote desktop, and job submission through the Open OnDemand portal [Hudak2018], which have been proven to shorten the learning curve for users new to HPC. Domain-focused science gateways, i.e. web portals built for easy access to large CI resources through domain applications, are fast-evolving CI platforms that help broaden access and support team science.

Discussion and conclusion
Accomplishing global sustainability requires harnessing the power of big data and advanced computing to integrate data and modeling across multiple spatial, temporal, and organizational scales. This raises significant challenges spanning data, modeling and computation, policy and governance, disciplines, and organizations. As mentioned in the Introduction, the GLASSNET community members met in a series of workshops in 2021 and 2022 and identified several challenges, which we believe can be addressed by the CI advances discussed in this paper.
First, GLG analysis involves downscaling global conditions or policies to individual decision makers, and then aggregating the cumulative effects of these individual decisions at micro scales up to meso and macro scales. However, this cross-scale integration is challenged by 'the missing middle': across almost all dimensions, meso-scale representation and the interactions of meso-scale processes and policies with local- and global-scale processes and policies are lacking. To gain a better understanding of these cross-scale interactions (figure 2), we need to bridge disciplinary knowledge gaps in translating data and models across scales and develop new frameworks and methods for the 'just right' data and model to represent heterogeneity at a given scale. These approaches must also enable FAIR data management and model analysis while supporting comparative analyses that can generate useful information for decision making.
Second, improving the GLG science-policy linkages calls for highly detailed modeling of both environmental and human processes and the integration of physical, environmental, economic, and behavioral process models and data. Geospatial technologies such as GPS and remote sensing have given us tremendous abilities to observe environmental systems at highly disaggregated spatial and temporal scales, but this does not necessarily translate into a better understanding of the underlying processes or explanation of observed spatial-temporal patterns and nonlinearities. Doing so requires the integration of process models and data, e.g. via ML approaches guided by domain knowledge that can account for high degrees of spatial and temporal heterogeneity, and advanced CI methods to efficiently process dynamic data and run model simulations, including those that can combine detailed spatiotemporal analytics with supercomputing capabilities through cyberGIS support. On the other hand, big data on human and social processes are often subject to a myriad of biases, e.g. due to non-representative data collection or biases in data processing. New methods are needed to address these biases, including technical solutions to overcome biases in data collection and processing and integration of human behavioral and social theories with ML (akin to the KGML approaches mentioned above) that can incorporate the needed spatial and temporal heterogeneity at a given scale and account for the cumulative effects of heterogeneous behaviors across scales.
Third, accounting for uncertainty, both in modeled outputs and as a property of complex human-environmental systems, remains a critical challenge. Quantitative models that generate numeric projections can be misleading without also quantifying levels of uncertainty and the conditions that make model projections more or less uncertain. It is equally important to understand how uncertainties compound across space and time and via interactions among multiple environmental processes, and how these uncertainties impact human choices. Ensemble modeling approaches that use different models to generate projections provide a means for accounting for structural or conceptual model uncertainty. Accounting for the various types and sources of uncertainty requires advanced CI to quantify and model the propagation of uncertainty and its effects through complex coupled human-environmental systems. The consideration of uncertainty propagation across various models requires holistic system thinking and integrating diverse uncertainty evaluation methods that are often computationally intensive. Such computational intensity cannot be fully resolved without advanced CI. When considering models for scenario planning, advanced model simulation methods are needed to explore the full range of potential future scenarios, including 'black swan' events that may be less likely, but nonetheless important for policy consideration.
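A minimal sketch of the ensemble idea follows: several models, represented here by synthetic stand-in trajectories, project the same quantity, and the spread across members summarizes structural (between-model) uncertainty. The numbers are purely illustrative.

```python
import numpy as np

def summarize_ensemble(projections: np.ndarray) -> dict:
    """projections has shape (n_models, n_years); the spread across the
    first axis reflects structural (between-model) uncertainty."""
    return {
        "median": np.median(projections, axis=0),
        "p05": np.percentile(projections, 5, axis=0),
        "p95": np.percentile(projections, 95, axis=0),
    }

# Three stand-in "models" projecting, e.g., a water stress index to 2050.
years = np.arange(2025, 2051)
ensemble = np.stack([
    0.80 + 0.010 * (years - 2025),   # model A
    0.80 + 0.020 * (years - 2025),   # model B
    0.85 + 0.015 * (years - 2025),   # model C
])
bands = summarize_ensemble(ensemble)  # report median with a 5-95% band
```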
Another obstacle in translating science findings into policy making is the time it takes to effectively translate data analysis and modeling into real-time management decisions. It may take many months or even several years to develop an environmental process model that is sufficiently specified and detailed to generate predictions and inform policy making. Developing useful models of coupled human-environmental systems, which is often necessary for assessing the implications of policies, takes even longer. As a result, the information gets to decision makers too slowly. Advanced CI is needed to reduce the time spent on data wrangling and model coupling and to expedite computational workflows to support timely decision making.
Finally, data and model sharing, reproducibility of data-intensive and computational scientific work, access to HPC and networking capabilities, and education, training and workforce development are common cross-cutting challenges in many fields that need to be tackled in a holistic way to achieve the global sustainability goals. Broadening access to the CI ecosystem is essential to democratizing science and ensuring every researcher has fair and equitable access to CI resources that support their work. As the needs for and opportunities from CI grow and broaden, eliminating the barriers noted above is essential. This requires strategic investments not only in a broad set of CI resources but also in support structures and services, and in a strong pipeline of CI professional expertise to ensure a broadly accessible and integrated CI ecosystem. The development and sustainability of this ecosystem require critical thinking about its environmental impacts. For example, advanced CI resources tend to consume massive amounts of electricity and thus often carry a significant carbon footprint. In this context, a key question is what the right balance between CI and environmental sustainability should be. We argue that future research is much needed to better understand and optimize this balance.

Data availability statement
No new data were created or analysed in this study.