A boreal forest model benchmarking dataset for North America: a case study with the Canadian Land Surface Scheme Including Biogeochemical Cycles (CLASSIC)

Climate change is rapidly altering composition, structure, and functioning of the boreal biome, across North America often broadly categorized into ecoregions. The resulting complex changes in different ecoregions present a challenge for efforts to accurately simulate carbon dioxide (CO2) and energy exchanges between boreal forests and the atmosphere with terrestrial ecosystem models (TEMs). Eddy covariance measurements provide valuable information for evaluating the performance of TEMs and guiding their development. Here, we compiled a boreal forest model benchmarking dataset for North America by harmonizing eddy covariance and supporting measurements from eight black spruce (Picea mariana)-dominated, mature forest stands. The eight forest stands, located in six boreal ecoregions of North America, differ in stand characteristics, disturbance history, climate, permafrost conditions and soil properties. By compiling various data streams, the benchmarking dataset comprises data to parameterize, force, and evaluate TEMs. Specifically, it includes half-hourly, gap-filled meteorological forcing data, ancillary data essential for model parameterization, and half-hourly, gap-filled or partitioned component flux data on CO2 (net ecosystem production, gross primary production [GPP], and ecosystem respiration [ER]) and energy (latent [LE] and sensible heat [H]) and their daily aggregates screened based on half-hourly gap-filling quality criteria. We present a case study with the Canadian Land Surface Scheme Including Biogeochemical Cycles (CLASSIC) to: (1) demonstrate the utility of our dataset to benchmark TEMs and (2) provide guidance for model development and refinement. Model skill was evaluated using several statistical metrics and further examined through the flux responses to their environmental controls. Our results suggest that CLASSIC tended to overestimate GPP and ER among all stands. Model performance regarding the energy fluxes (i.e., LE and H) varied greatly among the stands and exhibited a moderate correlation with latitude. We identified strong relationships between simulated fluxes and their environmental controls except for H, thus highlighting current strengths and limitations of CLASSIC.


Introduction
The boreal biome is an integral component of the Earth's climate system, exchanging energy and matter (e.g., carbon and water) with the atmosphere (Bonan 2008). A large portion of the boreal biome is underlain by permafrost (perennially frozen ground; Gruber 2012), which increases in areal extent from south to north, from isolated (⩽10%), sporadic (>10%-50%) and discontinuous (>50%-90%), to continuous permafrost (>90%). In North America, forests occupy about 60% of the boreal biome along with lake, river, and wetland ecosystems (Brandt 2009). The presence of permafrost exerts a critical control over many physical, ecological, and biogeochemical processes governing boreal forest composition, structure, and functioning (Schuur and Mack 2018). Climate change over the Arctic-boreal region has led to unprecedented changes in boreal forests caused, for example, by amplified warming and associated permafrost thaw (Schuur and Abbott 2011) and an intensification of disturbance regimes (Foster et al 2022). Consequences of these changes include a potential release of large quantities of greenhouse gases to the atmosphere due to increased microbial activity in organic-rich soils and peats that are characteristic for large swaths of boreal forests (Koven et al 2011, Schuur et al 2015. However, how amplified climate warming changes boreal forests at the regional scale depends greatly on the complex interplay of permafrost, soils, vegetation, and natural and anthropogenic disturbance regimes (Kurz et al 2013, Gauthier et al 2015. The landscape diversity across the boreal biome has been broadly classified into distinctive ecoregions (e.g., Boreal Plain and Taiga Plain; United States Environmental Protection Agency Level II Ecoregions of North America). Understanding boreal forests' complex responses to climate change across ecoregions is important to evaluate the current and future role of boreal forests in the climate system (Nazarbakhsh et al 2020, Walker et al 2020.
Terrestrial ecosystem models (TEMs), including those integrated as the land component in Earth system models, are crucial tools to estimate boreal forests' responses to climate warming (Fisher et al 2018, Birch et al 2021. Recent decades have seen a rapid development in TEM skill to simulate carbon, water and energy fluxes and balances of terrestrial ecosystems (Fisher et al 2014, Bonan andDoney 2018). However, uncertainties in simulations remain large (Huntzinger et al 2012. In particular, the Arctic-boreal region has been regarded as a large source of uncertainty in simulations of the Earth's climate system (Fisher et al 2018, Braghiere et al 2023. Model benchmarking datasets harmonizing different data streams to parameterize, force, and evaluate TEMs can help with efforts addressing uncertainties through more comprehensive evaluation of model strengths and limitations (Fisher et al 2018, Stofferahn et al 2019. Data collection efforts in North America's boreal biome were launched decades ago and have increased consistently since then. Prominent examples include the Boreal Ecosystem-Atmosphere Study (BOREAS) conducted in the 1990s (Sellers et al 1995), and the Fluxnet-Canada Research Network (FCRN) and the Canadian Carbon Program (CCP) in the 2000s (Coursolle et al 2012). BOREAS and FCRN/CCP included and focused on towerbased carbon dioxide (CO 2 ) and energy flux measurements made with the eddy covariance technique (Baldocchi and Meyers 1998). Numerous eddy covariance towers have been established and operated over different boreal ecoregions of North America (e.g., Helbig et al 2020), some through BOREAS and FCRN/CCP, but also by individual site investigators (Bergeron et al 2007, Euskirchen et al 2014, Ueyama et al 2014, Ikawa et al 2015, Helbig et al 2017. Many of the flux-tower teams contribute their eddy covariance and supporting measurements to network initiatives such as AmeriFlux and FLUXNET (e.g., Pastorello et al 2020). However, boreal forests are not well represented in those networking initiatives or previously published model benchmarking datasets (Williams et al 2009, Ukkola et al 2021. Several challenges have been encountered by modellers attempting to use eddy covariance and supporting measurements. Examples include gaps in the meteorological forcing data and lack of convenient access to ancillary data to adequately parameterize soil, permafrost, and vegetation characteristics (e.g., near-surface organic soil depth, soil texture, and percentage covers of vegetation types). Eddy covariance networks have been pivotal to provide TEM communities with access to curated datasets (Boden et al 2013, Pastorello et al 2020. The FLUXNET2015 dataset is a prominent example that provides meteorological data, gap-filled using a global reanalysis dataset. The AmeriFlux Biological, Ancillary, Disturbance, and Metadata (BADM) comprise ancillary data describing soils and vegetation for each tower site. However, up to now, FLUXNET2015 contains only a limited number of sites representative of North America's boreal forests. The AmeriFlux BADM is incomplete at many sites, lacking some important data required to parameterize, for example, 'percentage covers of plants in understory and ground cover' (Turetsky et al 2012), and 'permeable soil depth' (Lawrence et al 2008). Sustained efforts are essential to provide the modelling communities with consistently gap-filled and harmonized model benchmarking datasets for North America's boreal forests. These efforts require close collaboration between modelling and measurement communities (Fisher et al 2018).
In this study, we describe a boreal forest model benchmarking dataset comprising various data streams including eddy covariance and supporting measurements collected at eight mature boreal forest stands, located in six boreal ecoregions of the North America (Level II Ecoregions of North America; table S1, figure S1). First, we describe the curation of the dataset comprising harmonized and gap-filled: (1) half-hourly CO 2 and energy fluxes and their daily aggregates, (2) half-hourly meteorological data, and (3) ancillary data describing soil, permafrost, and vegetation characteristics. Second, we use the dataset to evaluate the performance of the Canadian Land Surface Scheme Including Biogeochemical Cycles (CLASSIC).

Study sites
The study sites are eight boreal forest stands representative of a broad range of stand characteristics, disturbance history, climate, permafrost conditions and soil properties (figure 1, tables S1 and S2). All eight sites are mature boreal forest stands (>70 years old), dominated by black spruce (Picea mariana). Some forest stands also contain small percentages of tamarack (Larix laricina) and/or jack pine (Pinus banksiana). Permafrost extent varies among sites following a south-to-north gradient. Located near the southern limit of permafrost, the southern three sites (⩽60 • N), CA-Obs, CA-Qfo and CA-Man, are permafrost-free, closed forest stands with approximately 9-to 14 m-tall black spruce trees (Bergeron et al 2007;BADM). Four more northern sites (>60 • N), CA-SMC, US-BZS, US-Uaf, and US-Prr, are in the discontinuous permafrost zone, described as relatively open forest stands with shorter stature black spruce trees and more abundant understory and ground cover vegetation compared to their southern counterparts (Euskirchen et al 2014, Ueyama et al 2015, El-Amine et al 2022. One site, CA-HPC, is in the continuous permafrost zone and representative of the forest-tundra ecotone (Callaghan et al 2002), characterized by scattered, stunted, mostly black spruce, trees over continuous understory and ground cover vegetation (Martin et al 2022).

Eddy covariance measurements
We compiled and harmonized available half-hourly eddy covariance flux measurements from multiple sources (table S1). The fluxes include net ecosystem CO 2 exchange (NEE), its two partitioned component fluxes, gross primary production (GPP) and ecosystem respiration (ER), and latent (LE) and sensible heat (H). Data were available through FLUXNET2015 or were requested from site investigators. FLUXNET2015 provides NEE filtered for lowturbulence conditions using two thresholds for friction velocity (u * ) (Pastorello et al 2020): a variable u * threshold and a constant u * threshold (CUT). We chose the latter, 'NEE_CUT_REF' , partitioned into GPP and ER following both nighttime-based (Reichstein et al 2005;'GPP_NT_CUT_REF' and 'RECO_NT_CUT_REF') and daytime-based methods (Lasslop et al 2010;'GPP_DT_CUT_REF' and 'RECO_DT_CUT_REF'). This choice assumes negligible changes in forest stand structures over the measurement periods and in the geometric characteristics of the instrumental set-ups at each site (e.g., height of the eddy covariance systems above the canopy). Marginal distribution sampling (MDS) following Reichstein et al (2005)  Daily flux aggregates include the sums of CO 2 fluxes (net ecosystem production [NEP = −NEE], GPP and ER in g C m −2 d −1 ) and the averages of energy fluxes (LE and H in W m −2 ), computed using half-hourly gap-filled (NEE, LE and H) and partitioned fluxes (GPP, ER). Daily GPP and ER from nighttime-based and daytime-based flux partitioning methods were averaged to account for the uncertainty in flux partitioning methods. Uncertainties arising from data collection and postprocessing, including gap-filling (Richardson and Hollinger 2007), propagate into temporal flux aggregates (e.g., daily and monthly) (Soloway et al 2017) and might hamper model evaluation (Williams et al 2009). Thus, daily flux aggregates based on less than 80% high-quality (i.e., MDS quality flag 0 ['measured'] or 1 ['good quality gapfill']) half-hourly data (i.e., 39 out of 48 half-hours) were excluded. Similarly, daily aggregates computed from half-hourly fluxes not gap-filled with MDS (US-Uaf) were screened by excluding aggregates for which less than 40% highquality (i.e., measured) half-hourly data (i.e., 20 out of 48 half-hours) were available. Growing season start, end, and length for each site-year were delineated using daily GPP and a double-logistic function fitted to the daily GPP time series (Gonsamo et al 2013). The first maximum and the last minimum of the third derivative of the fitted function were used to determine the growing season start and end, respectively. The growing season length was defined as the total days from the growing season start to end (figure S2).

Meteorological measurements
Half-hourly meteorological measurements include downwelling shortwave radiation (SW, W m −2 ), downwelling longwave radiation (LW, W m −2 ), total precipitation (sums of rain-and snowfall [as snow water equivalent], mm), air temperature (TA, • C), vapour pressure deficit (VPD, hPa), wind speed (m s −1 ), and atmospheric pressure (PA, kPa). Specific humidity (Q, kg kg −1 ) was estimated using TA, PA, and VPD variables (Monteith and Unsworth 2013). FLUXNET2015 provides gap-filled meteorological data on the basis of downscaled ERA-Interim reanalysis data (Vuichard andPapale 2015, Pastorello et al 2020). Based on the partially gap-filled data from site investigators (table S1), we applied a downscaled, bias-adjusted and temporally disaggregated global reanalysis dataset, GSWP3-W5E5-ERA5 (Melton and Arora 2016, Cannon 2018, Lange 2019, Meyer et al 2021, to fill any remaining gaps in all variables and replace the medium-to-poor quality gap-filled data (quality flag of 2 and 3) in the variables that were gap-filled using MDS.

Ancillary data
Ancillary data describing soil, permafrost, and vegetation characteristics including disturbance history are essential to adequately represent ecosystems in TEMs through parameterization (Lawrence et al 2008, Launiainen et al 2015. Near-surface soil data include soil texture (e.g., sandy loam) and the proportion of sand, silt, and clay (%) comprising the mineral fraction, in addition to soil organic layer thickness (cm) and organic carbon content (g [100 g] −1 ), soil permeable depth (m) and drainage class (table 1). For three sites, CA-Obs, CA-Man, and CA-Qfo, most of the data were available through AmeriFlux BADM (table S1). For the remaining five sites, most of these data were obtained from the literature or from site investigators (unpublished data). In addition, soil data were extracted from the World Soil Information Service (Batjes et al 2020) when no published or unpublished data were available. A circular buffer of 20 km radius was defined around each site to select Table 1. Soil, permafrost, and vegetation characteristics. Site names refer to AmeriFlux ID (AMF-ID) and are ordered by latitude, from south (CA-Qfo) to north (CA-HPC; figure 1). Soil and permafrost characteristics are near-surface soil organic layer thickness (SOL) and organic carbon content (SOC), soil permeable depth (SPD), and soil drainage class (SDC), and active layer thicknesses (ALT), respectively. Vegetation is overstory tree crown closure (OCC), stand height (SH), stand age (SA), and percentage cover of plant functional types in overstory and understory (PFT-OU) and ground cover (PFT-G WoSIS soil profiles closest to each site. From these, the soil profile with the shortest radial distance to, and the most similar in near-surface organic layer thickness compared to the respective site was selected to extract missing soil data. Soil water drainage is a key property controlling the 'wetness' of the land surface, and is closely associated with plant composition and distribution, soil carbon storage, and disturbance history (Wickland et al 2010). To provide consistent data on soil water drainage for each site, soil drainage class from the Soil Survey Geographic database and the Soil Landscapes of Canada database was extracted for the sites in Alaska and in Canada, respectively (tables 1 and S1). Soil permeable depth was extracted as 'soil and sedimentary deposit thickness' from a global 1 km gridded soil thickness dataset (Pelletier et al 2016). Consistent soil permeable depth for each site was reported as the mean estimate of the pixel containing the respective site and its eight neighbouring pixels. The seasonal freeze-thaw cycle in the active layer regulates soil moisture and thermal conditions for plant roots and microbes in permafrost (Schuur and Mack 2018). Active layer thickness, commonly year-round maximum thaw depth describing growing-season permafrost conditions (Fisher et al 2016), was compiled based on estimates made by site investigators.
Vegetation data include the percentage covers (%) of plant functional types (PFTs) in overstory, understory and ground cover vegetation, stand age (years) and stand height (m; table 1). Data sources were published in-situ field measurements, e.g., Kobayashi et al (2013) and Euskirchen et al (2014), and site investigator assessments (unpublished data, table S1). Overstory and understory plants were aggregated into seven commonly used PFTs (Groenendijk et al 2011, Rogers et al 2021 based on the information on common life-forms, e.g., growth form (herbs, shrubs, and trees), leaf (needleleaf and broadleaf), and leaf status (evergreen and deciduous). Ground cover composition distinguishes Sphagnum, brown moss, feather moss, and lichen (Turetsky et al 2012).

Modelling protocol
We used our model benchmarking dataset to evaluate the performance of the Canadian Land Surface Scheme Including Biogeochemical Cycles (CLASSIC; Melton et al 2020). We demonstrated the usefulness of our dataset to benchmark TEMs by examining biases and uncertainties in simulated CO 2 and energy fluxes. CLASSIC is the successor to the coupled model framework of the Canadian Land Surface Scheme (CLASS; Verseghy 2017) and the Canadian Terrestrial Ecosystem Model (CTEM; Melton and Arora 2016). In CLASSIC, CTEM simulates biogeochemistry including the photosynthetic CO 2 uptake and respiratory CO 2 loss based on five carbon pools in litter, soils, leaves, stems, and roots (Arora and Boer 2005). CLASS solves the model physics including land surface energy and water fluxes using vegetation structural properties (e.g., leaf area index) simulated by CTEM (Verseghy 2017). The two models, CLASS and CTEM, have been under continuous development since they were introduced in the 1980s and 2000s, respectively. CLASSIC has recently been extensively benchmarked at several sites and tested at global scale (Seiler et al 2021).

Site parameterization
Vegetation in CLASSIC is represented by nine default PFTs including five forest (evergreen needleleaf forest, deciduous needleleaf forest, evergreen broadleaf forest, deciduous cold broadleaf forest and deciduous dry broadleaf forest), two grass (C 3 and C 4 grass), and two crop (C 3 and C 4 crop) PFTs (Melton and Arora 2016). Recently, a version with three additional PFTs (evergreen broadleaf shrub, deciduous broadleaf shrub and sedge) was developed for simulations in Low-Arctic tundra (Meyer et al 2021).
Our stand-level case study used a CLASSIC version with a total of 12 PFTs. We parameterized CLASSIC according to the compiled soil and vegetation data (section 2.4 and table S3). Over each site, the soil profile was stratified into 20 layers with a total ground column depth of 61.4 m (Melton et al 2019). Soil layers below the soil permeable depth were simulated as hydrologically inactive bedrock. The organic layers were assumed to be peat and assigned to be fibric for the first layer, hemic for the second and third layers, and sapric for any lower layers, using the thermal and hydraulic parameters described in Letts et al (2000). The permeable soil layers below soil horizon B were all assumed to be parent material (i.e., soil horizon C) by using the properties of soil horizon C. We used the default parameterization for PFTs, e.g., visible and near-infrared albedo and rooting depth Arora 2016, Meyer et al 2021).
The gap-filled meteorological data were used to force CLASSIC. We applied a spin-up procedure by repeating the forcing data until the model reached an equilibrium, which is defined by the soil carbon pools varying by less than 0.1% compared to the last loop. CLASSIC simulations were performed by forcing the model one more time after the spin-up.

Performance evaluation
We used a suite of statistical metrics to evaluate CLASSIC performance for NEP, GPP, ER, LE and H: bias, coefficient of determination (R 2 ), Pearson correlation coefficient (r; Taylor diagrams), and root mean square error (RMSE) (Taylor 2001, Tramontana et al 2016. The performance evaluation was based on screened daily fluxes and restricted for GPP to the delineated growing seasons (as defined in section 2.2).
Model performance was evaluated among sites and ecoregions, respectively. Model performance was also investigated by examining the responses of the modelled fluxes to different environmental controls including TA, SW and VPD (Schaefer et al 2012, Tramontana et al 2016. The locally estimated scatterplot smoothing (LOESS), a nonparametric regression function based on local polynomial fitting (Cleveland and Loader 1996), was applied to fit response curves for the modelled and observed fluxes, respectively. We analysed the shape characteristics of LOESS curves, e.g., slope, transition, and summit, to identify limitations and strengths of CLASSIC simulations (Schaefer et al 2012, Rogers et al 2017.

Model benchmarking dataset
The model benchmarking dataset comprises mature boreal forest stands dominated by black spruce, one of the most widespread tree species in the boreal forests of North America (Kurz et al 2013). The stands are located in six Level II Ecoregions of North America: Boreal Plain (CA-Obs), Softwood Shield (CA-Qfo and CA-Man), Taiga Plain (CA-SMC), Alaska Boreal Interior (US-BZS and US-Uaf), Boreal Cordillera (US-Prr) and Southern Arctic (CA-HPC) (figure S1; table S1). The dataset contains a total of 68 site-years distributed broadly over the 'boreal forest' and 'woodland/shrubland' classes defined by the Whittaker biomes (Whittaker 1970; figure 2(a)). Specifically, the sites span numerous gradients including, for example, wetness (East-West) in the Softwood Shield from CA-Qfo to CA-Man, and TA and permafrost extent (South-North) from the Boreal Plain (CA-Obs), though Taiga Plain (CA-SMC), to Southern Arctic (CA-HPC) (table S2). Other important variations across the sites include soil properties, overstory tree crown coverage, and understory and ground cover composition (table 1).
Growing season start, end, and length varied across ecoregions ( figure 2(b)). The growing season lengths in the Boreal Plain and Softwood Shield (CA-Obs, CA-Qfo, and CA-Man) tended to be longer than other ecoregions because of both an earlier start and a later end. In the Southern Arctic (CA-HPC) the growing seasons were substantially shorter. As an indicator of the forest-stand photosynthetic activity (Gonsamo et al 2013), the estimated growing seasons were closely linked to abiotic and biotic factors, such as dominant plant species, climate, soil freeze-thaw seasonality, and photoperiod (Bauerle et al 2012, Barichivich et al 2013.

Flux data screening
Data with high uncertainty in daily aggregates were effectively excluded by our screening procedures (section 2.2). For example, at US-Prr, the screening excluded most of the data in early to mid-2010, late 2012 and mid-2013 (figure S3). During those periods, numerous longer gaps (e.g., weeks to seasons) in half-hourly data existed due to, e.g., instrument failures and insufficient quality. Gap-filling uncertainty tends to increase with gap length (Richardson and Hollinger 2007), and the associated uncertainty might propagate into temporal aggregates (e.g., daily and monthly), increasing the uncertainty of model validations and then predictions (Williams et al 2009).

CLASSIC performance
The performance of CLASSIC was evaluated for sites ( figure 3) and ecoregions (figure 4). Simulated daily GPP, ER, and LE were well correlated with observations (r ⩾ 0.78), suggesting that the model well captured seasonal flux variations (Schaefer et al 2012), although the three fluxes tended to be overestimated. NEP was generally well simulated (e.g., low RMSE), but the Pearson correlations varied greatly among sites and ecoregions (r = 0.55-0.89).
Model performance varied greatly over the months for all fluxes (figures S4 and S5). Similar seasonal patterns were found among sites and ecoregions, e.g., high RMSE (all fluxes) and high R 2 (LE and H) in summer (from June to August). The high summer RMSE provided a major contribution to the overall model performance. This could be explained by the fact that model-observation discrepancies for CO 2 and energy fluxes tend to be higher with higher fluxes in summer (Kuppel et al 2012, Birch et al 2021. The monthly biases indicate that the model overestimated the seasonal amplitudes in NEP among sites except for US-Prr. Model performance variations among sites and ecoregions were remarkable. Comparing among ecoregions, Boreal Cordillera (i.e., US-Prr) was the best simulated with the lowest RMSE for all fluxes (figure 4). A clear latitudinal variation was shown in model performance of energy fluxes. Boreal Plain and Softwood Shield tended to have a higher RMSE in LE than northern ecoregions. The seasonal bias of H had distinct differences from Boreal Plain and Softwood Shield, though Taiga Plain, to other northern ecoregions (figure S5). Strong negative summer biases of H were found in the Boreal Plain and Softwood Shield.
The model generally performed well in simulating the seasonality of GPP, ER, LE, and H across siteyears (high r) but was challenged to simulate the yearto-year variations in the magnitudes of these fluxes (large variations in RMSE) (figure S6). The model tended to overestimate annual values of GPP, ER and LE (figure S7), consistent with the performance evaluated at daily scales. Small year-to-year variations in RMSE were found for NEP, while model performance of the seasonality in NEP tended to be low (low r; figure S6). Regarding annual values of NEP, model performance varied greatly across site-years, depending on simulated GPP and ER ( figure S7). However, the overall quality of model performance regarding  annual values was certainly influenced by the uncertainty in gap-filled fluxes (Soloway et al 2017).

Flux responses to environmental controls
The response of modelled NEP to TA generally captured magnitudes and variations in observations ( figure 5(a)). However, CLASSIC underestimated NEP between −5 • C and 5 • C and overestimated NEP between 15 • C and 25 • C. These mismatches are consistent with the amplified NEP seasonality simulated by CLASSIC. For example, the underestimated NEP response from −5 • C to 5 • C was primarily due to the overestimated temperature response of ER at the same TA ( figure 5(b)), which is possibly related to model parameters or formulations for autotrophic and/or heterotrophic respiration, e.g., Q 10 (Tuomi et al 2008, Piao et al 2010, or biases in simulated carbon pools (Schmid et al 2006) and soil thermal dynamics (Exbrayat et al 2013). The NEP overestimations from 15 • C to 25 • C, in contrast, were difficult to interpret because various responses of GPP and ER to environmental controls could possibly be involved,   (Cleveland and Loader 1996). The estimates were based on daily observations and simulations from all sites. The GPP comparison was limited to growing seasons (section 2.2). (a) Net ecosystem production (NEP) to air temperature (TA), (b) ecosystem respiration (ER) to TA, (c) gross primary production (GPP) to TA, (d) GPP to downwelling shortwave radiation (SW), (e) GPP to vapour pressure deficit (VPD), (f) latent heat (LE) to TA, (g) LE to SW, (h) LE to VPD, (i) sensible heat (H) to TA and (j) H to SW. e.g., GPP or ER to TA andGPP to TA, SW or VPD (Yuan et al 2008, Schaefer et al 2012).
The model closely matched observed GPP responses to TA and SW when TA is less than 5 • C (figure 5(c)) and SW is less than 150 W m −2 (figure 5(d)). However, clear overestimations in the modelled responses occurred when TA and SW increase, which explains the large biases of GPP during summer months and may likely be a major contributor to the NEP overestimations from 15 • C to 25 • C. Possible reasons for the GPP overestimations may include biases in photosynthetic parameters, in particular the maximum photosynthetic carboxylation rate (V cmax ), which determines modelled photosynthetic capacity under light saturation and controls the GPP responses to SW and TA (Rogers et al 2017). Although there were overestimations, the model still captured variations in the response of GPP to TA (e.g., the optimal temperature of GPP).
VPD reflects the moisture status of the atmosphere (Monteith and Unsworth 2013). The observed GPP increased with increasing VPD and reached a maximum at around 15 hPa (figure 5(e)). A main cause of this response might be that transpiration generally does not continually increase in proportion to VPD because high VPD (i.e., a dry atmosphere) leads to reduced stomatal conductance and photosynthesis (Schaefer et al 2012). CLASSIC overestimated GPP response to VPD, especially at >10 hPa, indicating that it simulated a smaller atmospheric humidity stress compared to the observations when accounting for stomatal conductance impacts on photosynthesis. Because photosynthesis and transpiration are tightly coupled through a stomatal conductance model in CLASSIC (Verseghy 2017), the GPP responses to environmental controls are generally closely associated with the responses of LE, e.g., LE and GPP to VPD (figures 5(e) and (h)).
In contrast to LE, CLASSIC closely matched observed responses of H to TA and SW (figures 5(i) and (j)). An important reason might be that H was not directly coupled to the stomatal conductance in CLASSIC (Verseghy 2017). However, the close matches were also possibly due to the contrasting performance in H among sites, where negative biases at some sites (e.g., CA-Qfo and CA-Obs) were offset by positive biases at others (e.g., US-Uaf and US-Prr; figure S5).

Limitations and perspectives
Eddy covariance and supporting measurements are affected by uncertainties which may hamper model parameterization and evaluation attempts (Williams et al 2009). Although we carefully screened the flux data, errors from data collection and processing may remain (Richardson et al 2006, Soloway et al 2017, in particular, GPP and ER, which were not directly measured but derived from measured NEE (Schaefer et al 2012). The uncertainty of LE and H may come from energy balance closure (Pastorello et al 2020). The compiled data did not include corrected LE and H products. CLASSIC enforces energy balance closure (Verseghy 2017). Correcting energy balance closure would help further examine model performance (Ukkola et al 2021).
Carefully isolating the most important model parameters or formulations for characterizing model performance is a challenge due to equifinality in model parameterization (Bonan et al 2011). Several possible reasons may be involved in CLASSIC model limitations. First, our analysis is based on ecosystem-level fluxes, but plant species as well as their aggregations by PFTs generally differ greatly in their responses to environment (Lloyd andBunn 2007, Pearson et al 2013). In models, parameters and/or formulations generally vary with PFTs (Melton and Arora 2016), interacting with each other in ecosystem-level simulations. Second, the impacts of moss and lichen on the ground are currently not simulated in CLASSIC, even though ground cover vegetation can contribute substantially to ecosystemlevel CO 2 and energy fluxes of boreal forest stands (Heijmans et al 2004, Gaumont-Guay et al 2014. The absence of moss and lichen ground cover in CLASSIC might be compensated by the parameterization of overstory and understory PFTs, but biases could still be introduced because moss and lichen have environmental responses distinctly different from overstory and understory plants (Nilsson andWardle 2005, Turetsky et al 2012). Third, the soil physics component of TEMs, e.g., soil thermal dynamics and soil moisture, is important for modelling CO 2 and energy fluxes as well, especially in the forest stands underlain by permafrost (Ju et al 2006, Melton et al 2019, Andresen et al 2020. Uncertainty in the soil physics component has been identified as an important contributor to uncertainty in modelling CO 2 and energy fluxes in boreal forests (Ju et al 2006, Jiang et al 2016. Despite the above limitations, our simulations with CLASSIC demonstrate the capacity of our model benchmarking dataset to parameterize, force, and evaluate TEMs. Our dataset provides numerous new opportunities for understanding and addressing model shortcomings and associated uncertainties in North America's boreal forests. In addition, our dataset has the potential to be a powerful tool for further model-data fusion attempts. For example, a parameter optimization (e.g., on V cmax ) could be conducted using our dataset to improve GPP performance (Kuppel et al 2014). Other observations at the studied boreal forest stands, e.g., ground CO 2 fluxes by soil chambers, can be easily combined with our dataset to constrain TEMs with both ecosystem-level and ground fluxes (Purdy et al 2016). Our dataset can be combined with satellite-derived datasets, e.g., leaf area index and GPP, for model development (Seiler et al 2021).
Our dataset can be utilized to study CO 2 and energy fluxes of black spruce-dominated, mature boreal forests, e.g., the magnitudes and links to the variability of environmental variables (Bergeron et al 2007), and for flux upscaling through, for example, machine learning (Jung et al 2020).

Conclusions
We compiled and harmonized eddy covariance and supporting measurements made at eight sites in a model benchmarking dataset for North America's boreal forests. We present a case study of the dataset to initialize, force, and evaluate CLASSIC. This case study allowed us to identify strengths and limitations of CLASSIC regarding its skill to reproduce CO 2 and energy fluxes across boreal forest stands. Based on the estimated responses of CO 2 and energy fluxes to their environmental controls, our case study highlights possible avenues for the improvement of CLASSIC through, e.g., V cmax parameterization. While model structures and parameters differ widely among TEMs, our analysis for CLASSIC provides an example of how similar studies could be performed with other TEMs to evaluate and improve their performance for North America's boreal forests.

Data availability statements
The data that support the findings of this study are openly available at the following URL/DOI: https:// doi.org/10.5281/zenodo.7266010. Variable descriptions and units are provided in table 1 (ancillary data) and table S5 (meteorological and flux data).