The MUSE Ultra Deep Field (MUDF). V. Characterizing the Mass–Metallicity Relation for Low-mass Galaxies at z ∼ 1–2

Using more than 100 galaxies in the MUSE Ultra Deep Field with spectroscopy from the Hubble Space Telescope’s (HST) Wide Field Camera 3 and the Very Large Telescope’s Multi Unit Spectroscopic Explorer, we extend the gas-phase mass–metallicity relation (MZR) at z ≈ 1–2 down to stellar masses of M ⋆ ≈ 107.5 M ⊙. The sample reaches 6 times lower in stellar mass and star formation rate (SFR) than previous HST studies at these redshifts, and we find that galaxy metallicities decrease to log(O/H) + 12 ≈ 7.8 ± 0.1 (15% solar) at log(M ⋆/M ⊙) ≈ 7.5, without evidence of a turnover in the shape of the MZR at low masses. We validate our strong-line metallicities using the direct method for sources with [O iii] λ4363 and [O iii] λ1666 detections, and find excellent agreement between the techniques. The [O iii] λ1666-based metallicities double existing measurements with a signal-to-noise ratio ≥ 5 for unlensed sources at z > 1, validating the strong-line calibrations up to z ∼ 2.5. We confirm that the MZR resides ∼0.3 dex lower in metallicity than local galaxies and is consistent with the fundamental metallicity relation if the low-mass slope varies with SFR. At lower redshifts (z ∼ 0.5) our sample reaches ∼0.5 dex lower in SFR than current calibrations and we find enhanced metallicities that are consistent with extrapolating the MZR to lower SFRs. Finally, we detect only an ∼0.1 dex difference in the metallicities of galaxies in groups versus isolated environments. These results are based on robust calibrations and reach the lowest masses and SFRs that are accessible with HST, providing a critical foundation for studies with the Webb and Roman Space Telescopes.


INTRODUCTION
A critical goal of extragalactic astronomy is to understand the cosmic baryon cycle, which describes how heavy elements Corresponding author: Mitchell Revalski mrevalski@stsci.eduflow through galaxies over time (Péroux & Howk 2020).This includes the synthesis of heavy elements in massive stars, outflows of gas driven into the circumgalactic medium (CGM) by stellar winds, and the recycling of enriched gas into new stars via inflows from the intergalactic medium (Tumlinson et al. 2017).These complex processes produce an observational signature in the form of the mass-metallicity relation (MZR), which is a strong correlation between the stellar masses of galaxies and their gas-phase and stellar metallicities.We hereafter refer to metallicity as the gas-phase abundance of oxygen and adopt a solar value of 12 + log(O/H) ⊙ = 8.69 ± 0.05 (Asplund et al. 2009).Characterizing the evolution of this relationship as a function of stellar mass and redshift can tightly constrain galaxy evolution models, as the shape and normalization of the MZR encapsulates how heavy elements have been produced and retained over time despite stellar feedback, gas inflows, and gas outflows (Bassini et al. 2024).
The MZR has been well-characterized in the local Universe, and was originally investigated as a function of galaxy luminosity due to difficulties in deriving accurate stellar masses (Lequeux et al. 1979).This was overcome with advances in spectral energy distribution (SED) modeling and extended to statistically significant samples using the Sloan Digital Sky Survey (Tremonti et al. 2004).Studies have now precisely characterized the MZR over the stellar mass range log(M ⋆ /M ⊙ ) ≈ 6 -12 at z ≈ 0.1 and find that metallicity is nearly constant with stellar mass at log(M ⋆ /M ⊙ ) > 10, followed by a turnover and a power law decrease in metallicity at lower stellar masses (Andrews & Martini 2013;Berg et al. 2012;Zahid et al. 2014;Somerville & Davé 2015;Ly et al. 2016;Curti et al. 2020;McQuinn et al. 2020;Cullen et al. 2021;Sanders et al. 2021;Pharo et al. 2023;He et al. 2024).
The dispersion in the local MZR is ∼0.1 dex, which is remarkably small given the complex processes that govern the production, inflow, and outflow of gas within galaxies.The dispersion is further reduced when star formation rate (SFR) is included as a parameter.Specifically, galaxies with high SFRs have lower metallicities as compared to galaxies of the same stellar mass with low SFRs.At low masses, outflows driven by high SFRs may decrease the metallicity, while at the high-mass end the metallicity saturates due to a balance between self-enrichment and dilution from inflows (Dayal et al. 2013).Other factors include the dependence of SFR on the H I gas density (Lagos et al. 2016;Bothwell et al. 2013), the rate at which stellar-driven feedback expels a significant portion of the metals, and whether inflows of pristine gas have diluted the gas reservoirs and triggered star-formation (Greener et al. 2022;Yang et al. 2022;Langan et al. 2023).Mannucci et al. (2010) demonstrated that galaxies reside on a three-dimensional plane defined by stellar mass, metallicity, and SFR, known as the fundamental metallicity relation (FMR; Ellison et al. 2008;Lara-López et al. 2010;Lilly et al. 2013).While still actively investigated, studies suggest that this relationship does not evolve with redshift up to at least z ≈ 3, which suggests that similar physical mechanisms may regulate the flow of gas through galaxies over the majority of cosmic time (Andrews & Martini 2013;Dayal et al. 2013;Hunt et al. 2016;Gao et al. 2018;Sanders et al. 2018;Cresci et al. 2019;Kumari et al. 2021;Curti et al. 2023a).
Despite a lack of redshift evolution in the FMR, the MZR does exhibit a modest dependence on environment.Numerous studies find an enhancement in the gas-phase metallicity of galaxies in denser environments.Specifically, satellite galaxies surrounding more massive central galaxies in groups and clusters show a small but statistically significant increase in metallicity of ∼0.1 dex (Mouhcine et al. 2007;Cooper et al. 2008;Ellison et al. 2009;Hughes et al. 2013;Kulas et al. 2013;Peng & Maiolino 2014;Pilyugin et al. 2017;Wang et al. 2022).This enhancement is in agreement with simulations, which find comparable or larger enhancements (Wang et al. 2023).This modest environmental dependence may be attributed to galaxy dynamics as satellite galaxies pass through the hot halos of their group (Maiolino & Mannucci 2019).
At redshifts of z ∼ 1 -3, the MZR has been characterized down to stellar masses of log(M ⋆ /M ⊙ ) ≈ 9 (Maiolino et al. 2008;Henry et al. 2013;Kacprzak et al. 2016;Ly et al. 2016;Onodera et al. 2016;Suzuki et al. 2017;Wang et al. 2020;Sanders et al. 2021).Studies beyond the local Universe were previously limited to higher mass galaxies due to the difficulty of observing optical emission lines at z > 1 from the ground that are required for metallicities (Clarke et al. 2023), combined with the intrinsic faintness of low-mass galaxies.However, the numerous population of low-mass galaxies may play the most important role in enriching the CGM, as metals can escape the weak gravitational potentials of their galaxies (Dekel & Silk 1986;D'Odorico et al. 2016;Carniani et al. 2023;Sharda et al. 2023; see also Baker & Maiolino 2023 that suggest galaxy stellar mass mainly drives the relation).
Significant progress has been made in this area with large slitless spectroscopic surveys using the near-infrared capabilities of the Wide Field Camera 3 (WFC3) onboard the Hubble Space Telescope (HST).The G102 and G140 grism dispersing elements provide continuous wavelength coverage from 0.8 -1.7 µm at low spectral resolution, enabling the characterization of rest-frame optical diagnostic emission lines up to z ≈ 3 (e.g.Atek et al. 2010;Brammer et al. 2012;Momcheva et al. 2016;Pharo et al. 2019;Mowla et al. 2022;Papovich et al. 2022;Estrada-Carpenter et al. 2023;Revalski et al. 2023;Simons et al. 2021Simons et al. , 2023;;Stephenson et al. 2023).Recently, the MZR was expanded by Henry et al. (2021) down to the lowest masses characterized at these redshifts, reaching log(M ⋆ /M ⊙ ) ≈ 8.3 at z ≈ 1 -2.Using a sample of 1056 star-forming galaxies, they found that the gas-phase metallicities are lower by 0.3 dex at z ≈ 2 as compared to the local Universe, and confirmed the presence of a fundamental metallicity relation at these higher redshifts of z ≈ 1 -2.
In following the advancements made by Henry et al. (2021), we aim to characterize the MZR at lower stellar masses in both field and group environments.This will allow for a consistent comparison with the properties of star-forming galaxies at the peak of cosmic star formation to understand the build up of heavy elements over time.However, it is difficult to study low mass galaxies at higher redshifts without extremely sensitive imaging and spectroscopy.For this reason, we utilize recent observations of the MUSE Ultra Deep Field (MUDF) in order to characterize galaxies that are six times lower in stellar mass and SFR than earlier studies with HST at these redshifts.
The MUDF is a unique ≈ 2 ′ × 2 ′ legacy field because it contains two closely-spaced quasars at z ≈ 3.22 that allow for galaxies observed in emission to be compared with their surrounding gas in the CGM via high resolution absorption line spectroscopy of the quasars.This field has been observed for 142 hours using the Very Large Telescope's Multi Unit Spectroscopic Explorer (VLT MUSE; ESO PID 1100.A−0528, PI: M. Fumagalli; Lusso et al. 2019;Fossati et al. 2019) In Revalski et al. (2023), we described the acquisition and custom-calibration of the HST imaging and spectroscopy, and provided photometric and morphological catalogs for sources in the field.In this work, we describe our spectral fitting procedure, provide a public emission line catalog, and characterize the MZR down to the lowest stellar masses ever investigated for a sample of galaxies using HST at z ≈ 1 -2.The content is organized as follows: In §2 we discuss the observations and our spectral fitting procedure.In §3 we describe our methodology for deriving gas-phase metallicities of individual galaxies, as well as those stacked in discrete mass intervals.In §4 we present our MZR results and explore the effects of SFR and environment.We discuss the implications of these results in §5 and present our conclusions in §6.We adopt the AB magnitude system (m AB , Oke & Gunn 1983) in this study, and assume a standard cosmology with Ω M ≈ 0.3, Ω Λ ≈ 0.7, and h ≈ 0.7 (Planck Collaboration et al. 2020).

Survey Overview
The survey design and data reduction for the MUSE observations (ESO PID 1100.A−0528, PI: M. Fumagalli) are described in Fossati et al. (2019), and consist of 142 hours of integral field unit spectroscopy in the optical (4600 − 9350 Å) with a spectral resolution of R ≈ 2000 − 4000.These observations are publicly available, and calibrated data products can be retrieved using the following DOI: 10.18727/archive/84.
The observing strategy and custom-calibration for the HST imaging and grism spectroscopy are detailed extensively in Revalski et al. (2023), and we provide a summary here for completeness.We obtained 90 orbits of WFC3/IR G141 grism spectroscopy (1.0 − 1.7 µm) and F140W imaging in HST Cycle 26 (PID 15637, PI: M. Rafelski & M. Fumagalli), resulting in ultra-deep photometry and spectroscopy for ∼1,500 sources with a spectral resolution of R ≈ 150.We also utilize F336W photometry from our follow-up program (PID 15968, PI: M. Fossati), together with archival WFPC2 F702W and F450W imaging (PID 6631, PI: P. Francis) that provide five-filter photometry of the field for detailed SED modeling.
The default data files are available from the Mikulski Archive for Space Telescopes (MAST): 10.17909/q67p-ym16, together with our custom-calibrated data products as High Level Science Products (HLSPs) at: 10.17909/81fp-2g44 and https://archive.stsci.edu/hlsp/mudf.In Revalski et al. (2023) we provided the photometric catalog for the field, and in this work we describe our spectral fitting procedure and publicly release the spectral emission line catalog.In the context of this study, the MUSE and HST spectra provide matched optical and near-infrared spectroscopy of ∼1,500 galaxies from z ≈ 0 -6 that enable a wide variety of science investigations.

Spectral Fitting
The process of simultaneously fitting ground-based and space-based spectroscopy from 4600 Å to 1.70 µm for thousands of sources presents several distinct challenges.We started with the emission line identification and fitting routine described in Henry et al. (2021) and modified it for these data to measure the redshifts and emission line fluxes of detected galaxies.First, we perform a continuous wavelet transform (CWT) on each spectrum using the CWT feature within the SciPy (v1.2.1) signal package to identify sources with emission lines.We required a minimum integrated line detection of 4σ, corresponding to 2.89σ over three continuous wavelength elements.We found that this process robustly identifies emission line sources while rejecting spectra with spurious noise and recovers 97% of sources that were identified by visual inspection or through other automated line finding routines.
We examined high signal-to-noise (S/N) spectra of the sample with strong emission lines at both low and high redshifts to identify the full range of emission features in the data, with examples shown in Figure 1.We identified 32 emission lines with sufficient strength to be measured in the spectra of multiple galaxies, and they are listed in Table 1.We adopt vacuum wavelengths in units of Angstroms ( Å) throughout the analysis, but refer to lines in the text using labels common in the literature that are based on air wavelengths (e.g.There is a spectral coverage gap between MUSE and WFC3 from 9353 -10020 Å, while the vertical gray mask is due to the MUSE AO notch filter.These are examples of the highest S/N spectra available for the sample, which are suitable for individual metallicity measurements.Several features are notable, including the blueshifted P Cygni absorption profile of C IV in the lower source and a weak residual skyline near [O III] λ4363 in the upper source.The line fitting constraints allow for the effects of spurious artifacts to be minimized and flagged in the resulting fits.so are omitted from the catalog.These include Hδ λ4102, [Ne III] λλ3869, 3968, Hϵ λ3971, and weaker Balmer lines.The examples in Figure 1 highlight several key aspects of the observations.First, there is a spectral coverage gap between MUSE and WFC3 from 9353 -10020 Å.There is an additional small gap from 5758 -6008 Å in MUSE due to the adaptive optics (AO) NaD laser notch blocking filter.These regions are masked in the spectral fitting, and are shown as gray regions in Figure 1.In addition, the spectral dispersion of MUSE is ∼17 times larger than WFC3, making the lines appear more narrow and peaked as compared to those in the WFC3 spectra.The low amplitude spikes in the MUSE spectra that do not correspond to emission lines are generally weak residual skylines that are difficult to model in the near-infrared.Overall, there are matched spectra for ∼1,500 galaxies.
Next, we fit the spectra for each object with detected emission lines.As briefly described in Revalski et al. (2023), a cubic spline is first fit to the continuum of the WFC3 and MUSE spectra simultaneously, making an initial guess of the source redshift based on the strongest emission line.We can then easily change the guess to other emission lines, adjust the wavelength regions used in the continuum fit, mask contaminated regions, and then choose to reject the fit or save the results for each spectral fit to an emission line catalog.We place constraints on the model parameters to converge on a successful fit and implement physical limits.First, the widths  of emission lines in WFC3 with a dispersion of 21.5 Å pixel −1 , relative to those in MUSE with a dispersion of 1.25 Å pixel −1 , are initially fixed at a ratio of 17.2.However, sources that are physically extended in the grism observations have larger line spread functions, and we found that allowing this ratio to increase by up to a factor of two for the most extended galaxies produced excellent fits.We thus constrained the relative widths by this factor when fitting the full sample of galaxies.
Next, we implement limits on the relative ratios of emission line doublets based on atomic physics and Cloudy photoionization models (Ferland et al. 2017).Specifically, doublets with the same upper energy level and closely spaced lower levels have relative intensities that are constant.In addition, doublets with closely spaced upper levels and the same lower energy state that have different transition probabilities (or critical densities) are sensitive to the electron density of the gas.In these cases, we use a grid of Cloudy models over a broad range of 9 dex in density and ionization from Revalski et al. (2018) to constrain the doublet ratios between their low and high density limits.These grids were generated for a hydrogen column density of N H = 10 21.5 cm −2 without dust.A lack of dust in the models is suitable for low mass galaxies (Shapley et al. 2022(Shapley et al. , 2023)), but generally has a negligible effect on the relative ratios of the same ionic species.The ratios of emission lines with fixed relative intensities, as well as those constrained to specific ranges, are provided in Table 2.
In addition, all of the emission lines must yield a consistent redshift, but modest deviations are allowed for specific pairs of lines due to small differences in wavelength calibration between the two spectrographs, as well as radiative transfer effects that can result in offsets between emission lines of different ions.We impose the following constraints on eight groups of lines, requiring that each group of lines have the same redshift: 1) Lyα λ1216 and N V λλ1239, 1243; 2)  Leitherer et al. (2011).The transition terms and ionization potentials required to create the emission lines are collated from Draine (2011), with the later quoted from the NIST Atomic Spectra Database (Kramida et al. 2022).The critical densities of the forbidden emission lines are from Draine (2011, Tables 18.1 -18.2), while those for the semi-forbidden lines are from Wei (1988), except for Si III] sourced from PyNeb (v1.1.17;Luridiana et al. 2015).The quoted critical densities depend strongly on the adopted atomic data, with some values in the literature differing by up to factors of three.
1.40 NOTE-The relative flux ratios of emission line doublets that were constrained to exact values (upper rows) or specific ranges (lower rows) in the spectral fits.The fixed ratios are based on atomic transition probabilities, while those with ranges are based on Cloudy photoionization models (see text).In the case of variable ratios, the initial value used in the fitting process is listed and typically represents the low density limit.These sets of lines are allowed to vary from one another by up to δz = 0.00334 (1000 km s −1 ), which produces visually excellent fits for 98% of the galaxies, with lines for 83% of galaxies offset by ≤ 500 km s −1 (see Figure 11 in Revalski et al. 2023).In the remaining few objects, using twice this limit produced visually acceptable fits.Given the different spectral resolutions, calibration systematics between ground and space-based observations, and radiative transfer effects on lines of different ionic species, this agreement is excellent.
Furthermore, emission lines close in wavelength are often blended in the low dispersion WFC3 grism observations, requiring additional constraints for successful fitting.Specifically, the Hα λ6565 and [N II] λλ6550, 6585 emission lines are fully blended and cannot be deconvolved (Henry et al. 2021).In general, the [N II] emission lines are weak in starforming galaxies and following the model of Henry et al. (2021), we fix the flux of [N II] λ6585 to 10% of Hα in the fitting procedure when these lines are in the grism (z > 0.52).We note that this constraint does not apply to the metallicity analysis discussed later, where the contribution from [N II] is self-consistently forward modeled to determine metallicities in a Bayesian framework.This constraint corresponds to 13% of the total flux including both [N II] lines.This overall contribution is in good agreement with the values presented in Table 2 of Erb et al. (2006) for moderate mass galaxies, where the [N II] contribution ranges from 6 -22%, increasing with galaxy stellar mass.When these lines are observed in the MUSE spectra, their fluxes are free to vary independently within the constraints listed in Table 2.This degree of freedom allows us to confirm the validity of the constraint at higher redshifts, and we used 30 lower mass galaxies in MUSE with [N II] detections at S/N > 5 to confirm that, on average, the [N II] doublet contributes 12% of the combined flux with Hα.
Finally, the Lyα λ1216 emission line is often asymmetric, displaying a strong wing at longer wavelengths due to radiative transfer effects (Verhamme et al. 2006;Dijkstra et al. 2007;Laursen 2010;Childs & Stanway 2018;Hu et al. 2023).We adopt a simple approach (Hu et al. 2010) for the continuum-detected Lyα emitters and fit a truncated half-Gaussian with the same centroid as the main Gaussian component to account for this wing emission.While directly integrating under the line profile may produce a more precise estimate of the total Lyα flux, this parametric approach allows the full spectrum to be fit self-consistently.In addition, Lyα nebulae are known to have larger extents than the general nebular emission lines as emission diffuses outwards to larger radii such that an aperture of fixed size will miss a substantial portion of the emission (Leclercq et al. 2017;Fossati et al. 2019;Urrutia et al. 2019).Considering these factors, we fit the Lyα line in our spectra so it will not affect the continuum model and adjacent emission lines, but will provide the measured fluxes in a subsequent publication using dedicated apertures to properly extract the extended Lyα emission.
In many cases, the Lyα, Mg II, and C IV lines show interesting systematic offsets relative to the other emission lines, which is expected due to resonant scattering (Henry et al. 2018;Berg et al. 2019).The C IV lines specifically are created by nebular emission, stellar winds, and are also seen in absorption in the ISM.These lines are susceptible to complex optical depth effects where photons near the systemic velocity encountering high optical depth are scattered into the wings of the lines where the optical depth is lower.As such, we recommend specialized fitting of these UV emission lines beyond that completed in this study for users interested in the detailed physics, complex line profiles, and precise fluxes of these UV diagnostic lines.In cases where strong absorption causes deviations from the line ratio constraints listed in Table 2 (e.g.C IV in Figure 1), the line flux measurements are generally flagged as unreliable in the spectral catalog (see Table 3).
Using this framework, we successfully derived spectroscopic redshifts for 419 sources that were published in the publicly-available source catalog (Revalski et al. 2023) 1 .Those redshifts are duplicated in the current emission line catalog presented here for completeness.By default the redshift measurements are based on Hα λ6565, and when the well-resolved [O II] λλ3727, 3730 doublet is available in MUSE (z < 1.5) we apply the o2_3727_dz offset value in the catalog (see Table 3) to take advantage of MUSE's superior spectral resolution and obtain the most precise redshifts.The fitting procedure does not include a model for any underlying stellar absorption around the Balmer emission lines, which we account for later in our metallicity calculations.
In Table 3, we list the columns that are included in the emission line catalog.These include the object ID, redshift, sky coordinates, magnitude, size measurements, spectral fit parameters, and their associated uncertainties.These are followed by the integrated line flux and error, both in units of erg s −1 cm −2 , the observed equivalent width (EW) in Å (which is negative when the continuum is not detected), and a contamination flag to indicate the quality of the fit for each emission line.These flags have values of 0 -5, with the flag for an excellent fit being 0. A slightly contaminated fit has a flag of 1, and a poor fit that should not be used has a flag of 5. A value of 3 indicates a non-ideal continuum fit such that the fluxes may be suspect, and flags are added in a bit-wise manner.Only measurements with a flag of 0, or a flag of 1 after visual inspection, should be used for science investigations to ensure the cleanest samples free from fitting and data artifacts.We note that we also fit the spectra of the two primary quasars (IDs 1535 and 20405) to determine their redshifts, but their line fluxes are flagged and should not be used for investigations.These quasar spectra contain complex broad and narrow emission and absorption features, as well as asymmetric line profiles and large centroid shifts that require detailed modeling that is beyond the scope of this study.
Finally, the individual and integrated fluxes of closely spaced doublet lines are both saved in the catalog.As examples, the [O II] λλ3727, 3730 and [S II] λλ6716, 6731 emission line doublets are resolved in MUSE, which allows their individual components to be used for density diagnostics.However, these lines are fully blended in the WFC3 data and so only the total line flux should be used.In general, the individual components and their sums can be used when present in the MUSE data, and only the summed flux should be used when the lines are present in WFC3 at higher redshift (e.g.Hα + [N II] and [S II] at z > 0.5, and [O II] at z > 1.7).3. ANALYSIS

Stellar Population Synthesis Models
We require a robust stellar mass and star-formation rate for each galaxy to investigate their location in the MZR and FMR, which we calculate using stellar population synthesis (SPS) models.The stellar mass and star-formation rate estimates are obtained by simultaneously fitting the multiwavelength photometry and the MUSE spectra (when available) with the Monte Carlo Spectro-Photometric Fitter (MC-SPF) code (Fossati et al. 2018).The details of the fitting procedure are given in Fossati et al. (2019) and Fossati et al. (in preparation).In summary, we start by fitting the data with Bruzual & Charlot (2003) models obtained with a Chabrier (2003) initial mass function (IMF) and an exponentially declining star-formation history (SFH).The models include emission lines using line ratios from Byler et al. (2017) that are scaled to the Lyman continuum luminosity of the stellar templates.We also include dust attenuation with a double Calzetti et al. (2000) attenuation law where the flux from stars older than 10 Myr are attenuated with a curve that is normalised by a free parameter (A V ) in the fitting procedure, while the same curve is normalised by 2.27 ×A V for stars younger than 10 Myr.While SFRs can also be estimated directly from the Balmer emission lines, Hα is blended with [N II] in the grism, it falls outside of our wavelength coverage at higher redshifts (z > 1.6), and the Balmer lines require a correction for stellar absorption, so adopting the SPS model parameters allows us to self-consistently model galaxies across all redshifts.
While it is common to adopt solar metallicity for the stellar component, this assumption is likely not appropriate for all of our galaxies that span a large dynamic range in redshift and stellar mass.We use our multi-filter photometry and high S/N spectroscopy to model the metallicity of the stars in discrete metallicity bins.The Bruzual & Charlot (2003) models provide stellar templates at 100% solar, 40% solar and 20% solar metallicity.We first calculate the stellar masses using stellar and emission line templates at solar metallicity.Next, we refine the SPS metallicity in one of two ways.First, for sources with an individual gas-phase metallicity measurement, we use the closest stellar template metallicity value.For all other sources, we compare the mass of the target with our stacked MZRs in the appropriate redshift bin (above and below z = 1), and recalculate the SPS models using a lower stellar metallicity if the object is in the low mass regime.We iteratively evaluate the template metallicity based on the new value of the stellar mass until all galaxies are assigned into one of the three broad metallicity bins provided by the Bruzual & Charlot (2003) models (100%, 40%, or 20% solar metallicity).The effect on the derived masses is smallest at low masses and increasingly important for higher mass galaxies.We only utilize stellar masses and SFRs obtained with this procedure.

Sample Selection
We minimally require spectra with detections of the [O II] λλ3727, 3730, Hβ λ4862, and [O III] λλ4959, 5007 emission lines to calculate metallicities (see §3.4).An ideal survey would have very high S/N spectra such that we could derive metallicities for each galaxy.However, in practice it is necessary to implement S/N criteria on the emission lines for individual sources, which can bias the sample towards galaxies with higher SFRs that produce stronger emission lines.This can bias the measurements of lines near the S/N limit, as the larger number of weakly emitting sources below the threshold have a higher chance of being scattered into the sample than strong line emitters being scattered out of the sample due to the measurement uncertainties (Eddington 1913).
Alternatively, spectra that share common properties can be stacked together to create high S/N composites (Henry et al. 2013;Wang et al. 2018).This reduces noise and leads to better constraints on the average emission line properties of galaxies, but this can also bias the resulting measurements if all of the galaxies do not share similar properties (Andrews & Martini 2013).Importantly, by stacking spectra we reduce the measurement scatter in the MZR and robustly determine the metallicities of galaxies down to lower masses than is possible for individual sources.By utilizing the very sensitive spectroscopy available for the MUDF, we minimize biases due to SFR selection by reaching SFRs of ∼1 M ⊙ yr −1 for stellar masses of M ⋆ ≲ 10 8 M ⊙ at z ≈ 1 -2.Using these observations, we investigate the average MZR for composite spectra produced by stacking galaxies with similar masses, as well as for individual galaxies with very high S/N spectroscopy.
We use our emission line catalog to select star-forming galaxies from the 419 sources with spectroscopic redshifts.First, we require that the [O III] λλ4959, 5007 emission line doublet is detected at S/N ≥ 3, which results in 249 sources.Next, we use the Mass-Excitation (MEx) diagram developed by Juneau et al. (2011Juneau et al. ( , 2014) ) to remove active galactic nuclei (AGN) from the sample.This diagnostic is similar to traditional Baldwin-Phillips-Terlevich (BPT) diagrams (Baldwin et al. 1981;Veilleux & Osterbrock 1987;Kewley et al. 2001), except the redder of the two emission line ratios (e.g.[N II]/Hα) is replaced with stellar mass (M ⋆ ), enabling it to be used at higher redshifts and lower spectral resolution.
We show the MEx diagram for our sample in Figure 2, where the demarcation lines separating stellar-ionized and AGN-ionized emission line sources at z ≈ 0 from Juneau et al. (2014) are also shown shifted by 0.75 dex to higher masses for use at z ≈ 2. As described by Coil et al. (2015), this redshift evolution occurs because galaxies at higher redshifts have lower metallicities, resulting in higher [O III]/Hβ ratios at fixed stellar mass.Without this correction, the high mass z ≥ 1 star-forming galaxies that reside between these two regions would be incorrectly flagged as AGN.This  2014), and are also shown shifted by 0.75 dex to higher mass for use at z ≈ 2 (dashed lines).The area between the gray and black lines represents a composite region with contributions from AGN and star-formation.The z < 1 sources (circles) should be compared with the dotted lines, while the z ≥ 1 sources (stars) are compared with the dashed lines.This diagnostic identifies 20 sources as AGN that we exclude from the metallicity analysis.
diagnostic uses only the λ5007 component of the [O III] doublet, and including the λ4959 line, which is always 1/3 the strength, would require shifting the demarcation lines to higher [O III]/Hβ by log(1 + 1/3) ≈ 0.125 dex.This diagnostic identifies 20 sources as AGN that we exclude from the analysis because the abundance methods are calibrated for stellar-ionized gas.In addition, we take advantage of the rest-frame UV coverage at higher redshifts to select sources where the He II λ1640, C III] λ1907, 1909, and O III] λ1666 emission lines are detected at S/N ≥ 4 to construct the C3He2-O3He2 ionization diagram discussed in Mingozzi et al. (2023) and Feltre et al. (2016).There are 11 sources meeting these criteria, and all are consistent with star-forming galaxies.Finally, the XMM-Newton sources reported by Lusso et al. (2023) yield no additional AGN beyond those identified in the MEx diagram, and we omit four more sources that have insufficient photometry required to calculate accurate stellar mass models.
These selection criteria result in 225 sources at z ≤ 2.39, which is the highest redshift at which we can detect [O III] λ5007 in WFC3 at its maximum wavelength limit of 1.70 µm.There are also four additional limiting redshift windows: We divide this subsample into six mass bins from log(M ⋆ /M ⊙ ) = 6.5 -11.0, with each bin having a width of 0.5 dex in stellar mass, except the highest and lowest mass bins that span ∼1 dex to encapsulate enough galaxies for useful statistics.Similarly for the low redshift sample, we divide the galaxies into six mass bins from log(M ⋆ /M ⊙ ) = 6.5 -11.0, with each bin having a width of 0.5 dex in stellar mass, except the highest bin that encapsulates all galaxies at log(M ⋆ /M ⊙ ) > 9.5.Next, we create a median stack of the spectra in each bin following the procedures of Henry et al. (2013).First, the spectra are continuum-subtracted using the results from our spectral fits, normalized to their [O III] fluxes, and de-redshifted to rest wavelengths.A median spectrum is then calculated for each mass bin, and the uncertainties are determined using bootstrap resampling (Henry et al. 2021).In this process, we resample the MUSE spectra to the spectral resolution of WFC3 by smoothing the MUSE spectra using a Gaussian kernel with a width equal to the WFC3 spectral resolution, and then apply a flux-conserving resampler (Carnall 2017) to match the spectrum to the WFC3 wavelength grid.
We then determine the emission line fluxes using a Gaussian fitting procedure that is nearly identical to that described in §2.2, but that does not constrain the relative contribution of These galaxies are divided into a low redshift sample of 77 galaxies shown in orange, and a high redshift sample of 120 galaxies shown in blue.
ponents.Unlike Henry et al. (2021), we find that we do not require both narrow and broad Gaussian components to fit our spectral stacks due to the smaller number of sources that result in overall narrower lines with lower S/N in the faint, extended wings.The spectral stacks for the low and high redshift samples are shown as functions of mass in Figures 4  and 5, respectively.The weak He I λ5877 and λ6680 lines were also fit, but are not marked in the figures for clarity.
Our second goal is to characterize the metallicities of individual galaxies over all redshifts for comparison with the properties of their surrounding gas viewed in absorption along the quasar sightlines (Beckett et al. 2024, submitted).While our spectral stacks only require a S/N ≥ 5 in the [O III] emission line, sources near this limit have insufficient S/N to calculate metallicities individually.We therefore selected the 90 galaxies across all redshifts with S/N ≥ 10 in the [O III] emission line for individual analysis.Using the strong [O III] line for selection with this high S/N limit generally ensures detections of the metallicity sensitive lines while minimizing Eddington bias (Eddington 1913).In order to avoid significant SFR selection biases, we confirmed that only requiring S/N ≥ 3 in the Hβ emission line would produce a nearly identical sample, sharing 84 out of 90 of the same targets.

Metallicity Methods
The most abundant metal produced by stellar processes is oxygen, which serves as an excellent tracer of gas metallicity because it displays strong optical emission lines and is less depleted by dust than many other elements (Stasińska et al. 2012).The specific technique used to calculate the abundance of oxygen, and how that technique is calibrated, can substantially change the derived metallicity.The most common techniques for determining gas-phase metallicities are the direct method based on comparing temperature sensitive auroral emission lines with nebular lines from the same element and ionization state2 , and the empirical strong-line method using bright nebular emission lines (Maiolino & Mannucci 2019).The latter can be calibrated using photoionization models (e.g.Kobulnicky & Kewley 2004), or empirically from electron temperature diagnostics (e.g.Pettini & Pagel 2004). 3he direct method employs temperature sensitive line ratios to derive the electron temperature and corresponding emissivity for each ionization state of an element.The most frequently used ratios for deriving the abundances of O + and O +2 are [O II] λλ7321,7332/λλ3727,3730 and [O III] λ4363/λ5007.An ionization correction is assumed for higher states that cannot be observed, but the amount of gas in the O +3 and more highly ionized gas is negligible in star-forming galaxies with softer SEDs than AGN (Berg et al. 2021).The abundance in each ionization state is then summed to derive the total oxygen abundance.This technique has the advantage that the weak auroral lines remain strong at lower metallicities as oxygen becomes a more important avenue for cooling the gas.However, the auroral lines are typically ∼10 -100 times weaker than the nebular lines they are compared to, so they can be difficult to detect observationally (Yin et al. 2007).Spectroscopic Stacks by Mass (z 0.2 0.9) Figure 4.The stacked and flux-normalized spectra for galaxies at z ≈ 0.2 -0.9 in six bins of stellar mass with emission lines labeled.The total stack of all galaxies is shown in black, followed by stacks in order of decreasing stellar mass.The stacks are shown with different colors for clarity, with uncertainties represented by the gray shaded regions.The stellar mass range for each stack is shown above the spectrum and additional properties are given in Table 4.The fluxes at longer wavelengths are noisy due to the redshift distribution of targets over the adaptive optics notch filter wavelength gap (∼5740 -6000 Å), which yields insufficient spectral coverage to determine robust fluxes in that spectral range.The strong-line method overcomes this limitation by only using the ratios of bright nebular emission lines in the spectra, but they must be empirically calibrated using sources where the direct method can also be applied, or by comparison with theoretical photoionization models.While this technique can be applied to large spectroscopic surveys with more limited S/N, this method has systematic uncertainties tied to the as-sumptions of the direct method or photoionization models used to calibrate the strong lines.Studies have found discrepancies as large as ∼0.7 dex between these two techniques when calibrating the strong-lines based on photoionization models (Kewley & Ellison 2008;Stasińska et al. 2012); however, strong-line diagnostics anchored by the direct method agree with recombination line metallicities at the ∼0.2 dex level (Maiolino & Mannucci 2019;Henry et al. 2021).Critically, when multiple strong emission line ratios are employed, degeneracies between the metallicity, ionization parameter, gas density, and other properties are reduced, allowing for robust metallicity determinations (Maiolino & Mannucci 2019).

Bayesian Method
Considering the factors discussed in §3.4,we adopt the methodology of Henry et al. (2021) that employs the direct method-based strong-line calibrations of Curti et al. (2017) to derive metallicites, and compare with the direct method for sources with auroral line detections.The justification for this choice of calibrations, their applicability at higher redshifts, and potential biases introduced by the emission line selection criteria are detailed comprehensively in Henry et al. (2021).We also note that our adopted value of 12 + log(O/H) ⊙ = 8.69 ± 0.05 for the solar abundance of oxygen from Asplund et al. (2009) Curti et al. (2017) to calculate the most probabilistic metallicity.This process has the advantage that it selfconsistently models the extinction from dust, stellar absorption of the Balmer lines, and contamination of Hγ by [O III] λ4363 to marginalize over poorly-constrained parameters and derive realistic uncertainties on the metallicity estimates.First, in the redshift ranges with coverage of Hα and Hβ, we use the observed ratio with a Calzetti et al. (2000) extinction curve to determine the dust extinction assuming an intrinsic Balmer decrement of Hα/Hβ = 2.86.These measurements for each mass range can be used at adjacent redshifts without coverage of both lines to place appropriate priors on the extinction.The sample of Henry et al. (2021) included significantly more galaxies that are distributed randomly across the sky, and we therefore adopt identical extinction priors that have been matched to the closest mass bin for the high redshift sample.
Next, we account for Balmer emission line (Hα, Hβ, Hγ, Hδ) stellar absorption by using a measure of the emission line equivalent width (EW) to reduce the model fluxes.The Bayesian framework allows for a range of stellar absorption based on both continuous and instantaneous burst models as described in Appendix C of Henry et al. (2021).Henry et al. (2021) also investigated measuring emission line EWs directly from WFC3 grism spectra as well as broad-band photometry.While estimates from the spectra are the most appropriate in principle, this process is sensitive to how the background is subtracted and fails for emission line sources where the continuum is not detected.We follow the procedure of Henry et al. (2021) and use our broad-band HST photometry to estimate EWs by subtracting line contamination and performing a linear interpolation across the filters to estimate the continuum flux at the wavelength of each emission line.At low redshifts we use the F336W and F702W for this purpose, while at higher redshifts we use the F125W and F140W photometry.We show a comparison of the spectroscopic and photometric EWs in Figure 6, which demonstrates that the two estimates agree to within a systematic uncertainty of ≈ 3%, but the photometric technique is able to successfully derive EWs for all galaxies, including sources without spectroscopic continuum detections.The observed Balmer line fluxes are compared to the distribution of model values that have been reduced by the ratio of the emission line EW to the stellar absorption EW.
Finally, the emission line fluxes are compared with the Curti et al. ( 2017) calibrations to determine the most probabilistic Figure 6.A comparison of the equivalent widths (EWs) measured from the spectroscopic and imaging observations.The median systematic offset between the two measures is only 3%.The main panel is shown on a logarithmic scale with a blue dashed unity line flanked by factor of three boundaries in dotted gray.The inset shows a larger parameter space using a symmetrical log (symlog) scale to highlight the sources without spectroscopic continuum detections (red points).While small changes in EW do not strongly affect the results, accurate order of magnitude estimates for all sources are important for deriving accurate metallicites.We adopt the broad-band photometric EWs to correct the Balmer emission line fluxes for stellar absorption.
value of the metallicity for each galaxy.This is completed using a Bayesian framework that considers extinction from dust, stellar absorption of the Balmer lines, contamination of Hγ by [O III] λ4363, and the gas-phase metallicity using priors bounded over a reasonable range of parameter space.
In most cases the priors have flat distributions except when informed by emission line diagnostics, with the extinction being a key example.Henry et al. (2021) explored the effects of correcting for reddening before or after spectral stacking and found that the derived metallicities agreed to within a few percent, indicating that the reddening treatment does not systematically bias the metallicities.The details of this process are documented in the appendices of Henry et al. (2021) and we refer to that study for a comprehensive explanation.

Direct Method
In order to confirm the validity of the abundances that we derive using the strong-line method, we use the direct method to calculate metallicities for galaxies with auroral line detections.We use our emission line catalog to identify six galaxies where the [O III] λ4363 emission line is detected at S/N ≥ 4 (discarding IDs 1069 and 1023 due to blending), which is only possible using the MUSE data due to blending of [O III] λ4363 and Hγ in the WFC3 spectra.In addition, we select six galaxies where the [O III] λ1666 emission line is detected at S/N ≥ 4 in MUSE.While this line is rarely detected beyond the local Universe except in gravitationally lensed sources (Sanders et al. 2020;Citro et al. 2023), it provides a similar temperature diagnostic to [O III] λ4363 in the rest-frame UV (Mingozzi et al. 2022), which is covered by MUSE at z ≈ 2.
We visually inspect each source and remove those with S/N ≤ 5 that display noisy line profiles.We then use PyNeb (v1.1.17;Luridiana et al. 2015) to convert the [O III] λ4363 / λ5007 and λ1666 / λ5007 ratios to electron temperatures, and calculate the ionic abundances of [O III] and [O II].We measure the [O III] gas temperature directly with the auroral lines, and use the T e (O + ) -T e (O ++ ) relation of Campbell et al. (1986) to derive the [O II] gas temperature.We assume an electron density of n e = 250 cm −3 , which is typical of star-forming galaxies at these redshifts (Sanders et al. 2018).We note that the metallicities are robust across a wide range of densities from n e = 100 −1000 cm −3 (Steidel et al. 2014).
We adopt a Galactic extinction curve (Savage & Mathis 1979) to correct the line ratios for reddening using the Hγ/Hβ and Hα/Hβ ratios for the low redshift sources and conservatively adopt the lower of the two extinctions.At high redshifts, we only have wavelength coverage of the Hγ/Hβ ratio, which is contaminated by [O III] λ4363.We use the observed Hγ/Hβ ratios to determine a lower limit on the reddening, as well as assuming Hγ is contaminated by 20% from III] λ4363 for an upper limit on the reddening.The later only lowers the metallicity by > 0.1 dex for one source, and so we correct the line ratios using the observed Hγ/Hβ ratios.
Finally, the uncertainties for our strong-line metallicities are propagated from the Bayesian analysis, while those for the direct method are the quadrature sum of the uncertainties in the emission line fluxes that are used in the line ratios for the direct method.Specifically, [O III] λ4363 / λ5007 and λ1666 / λ5007 used to derive electron temperatures, as well as

Validating Strong-line Calibrations with the T e Method
In Figure 7, we compare the metallicities derived using our strong-line methodology based on the Curti et al. (2017) calibrations to the metallicities determined from the direct T e method using PyNeb.The blue circles represent [O III] λ4363 measurements at z ≈ 0.7, while the orange circles are [O III] λ1666 measurements at z ≈ 2.3.Overall, there is excellent agreement for all of the sources at both low and high redshift with a systematic offset of only 0.03 dex and a dispersion of 0.17 dex between the two methods.Importantly, this agreement is seen for the lowest metallicity sources where the [O III] λ4363 line is detected at S/N > 10.The remaining [O III] λ4363 sources have weaker detections that may drive the scatter.The comparison of the six sources at z ≈ 2.3 represents a fundamental result, as there are only a few [O III] λ1666 metallicity determinations at these redshifts for unlensed sources (Sanders et al. 2020).Recently, Llerena et al. (2023) reported 21 [O III] λ1666 detections at z ≈ 3, with just five having S/N ≥ 5.These detections required VUDS and VANDELS spectroscopy with ∼20 − 80 hour integration times, highlighting the difficulty of detecting this auroral line at high redshift even with very deep observations.We approximately double existing samples at z > 1 with detections at S/N ≥ 5 for unlensed systems, and the results show the validity of using these strong-line calibrations up to z ≈ 2.5.
The fact that the strong-line and direct-method results agree for the lower redshift [O III] λ4363 sources where all of the emission lines are in MUSE, as well as for the [O III] λ1666 sources where [O III] λ5007 is in the HST grism, gives confidence in the inter-flux calibration between the instruments.Finally, the derived extinction for these sources is low, and consistent with zero within the uncertainties in several cases, which is expected for low-mass sources and mitigates concerns about the choice of reddening correction for the widely spaced [O III] λ5007 and λ1666 emission lines.5.

The Mass-Metallicity Relation
We present the gas-phase mass-metallicity relation (MZR) at z ≈ 1 -2 for our sample in Figure 8.The stacked spectra reach stellar masses of M ⋆ ≈ 10 7.5 M ⊙ at SFRs of ∼ 0.3 M ⊙ yr −1 , which is approximately six times lower in stellar mass, and an order of magnitude lower in SFR, than earlier studies using HST in this redshift range (Sanders et al. 2018(Sanders et al. , 2021;;Henry et al. 2021).This is also ∼1.5 dex lower in mass than the majority of previous studies that were limited to log(M ⋆ /M ⊙ ) > 9.0 at these redshifts before JWST.We find that the MZR decreases to log(O/H) + 12 ≈ 7.8 ± 0.1 (15% solar) at log(M ⋆ /M ⊙ ) ≈ 7.5, without evidence of a turnover or flattening in the shape of the MZR at the lowest masses.The shape of the MZR across all masses is consistent with the local relations derived for SDSS galaxies (Andrews & Martini 2013;Curti et al. 2020), except shifted to lower metallicities.We confirm that the MZR resides ∼0.3 dex lower in metallicity than local galaxies, but note that the magnitude of the offset is a function of both metallicity and SFR, as discussed in §4.3.
In Figure 8 we also display the results of Henry et al. ( 2021) and Sanders et al. (2018Sanders et al. ( , 2021) ) at similar redshifts.We have used published emission line fluxes to recalibrate the Sanders et al. (2018Sanders et al. ( , 2021) ) results to the Curti et al. (2017) calibration for an equal comparison across studies.Overall, we see excellent agreement between the studies at high masses, with some deviations between our result and that of Henry et al. (2021) at lower masses, which are primarily due to the different average SFRs of the samples.These results are also very consistent with the recent study by He et al. (2024) at similar redshifts.
We used the LMFIT package (Newville et al. 2023) to simultaneously fit our results with those of Henry et al. (2021), as we employ equivalent methodologies spanning a large range in SFR.We fit the MZR using the same approach as Curti et al. (2020), who characterize the MZR using a functional form similar to a Lomax (or Pareto Type II) distribution.Specifically, where Z 0 is the maximum metallicity, M 0 is the turnover mass, γ is the low mass slope, and β is the width of the turnover.In general, M 0 , γ, and β are functions of SFR.The best-fit line and the corresponding parameter values are shown in Figure 8, which are nearly identical to that of Curti et al. (2020) after shifting it by 0.3 dex to lower metallicities to account for redshift evolution.

The Mass-Metallicity-SFR Relation
It can be misleading to compare the MZRs from different studies unless they are derived for galaxies with similar SFRs.As discussed in §1, galaxies of a specific mass show lower metallicities at higher SFRs (e.g.see Figures 6 and 9 in Curti et al. 2020).This dependence on SFR defines a threedimensional plane of Mass-Metallicity-SFR (M-Z-SFR) that is often termed the Fundamental Metallicity Relation (FMR; Mannucci et al. 2010).We can account for this dependence on SFR using two different approaches.Commonly in the literature, the low mass slope (γ) is fixed and the 3D plane can be reduced to a 2D projection of least scatter by scaling the stellar masses using a fraction of the SFR, which is parameterized by µ = log(M ⋆ ) -αlog(SFR), where α quantifies the degree to which metallicity and SFR are correlated.Studies have found various values for α due to differences in galaxy selection criteria and metallicity calibrations (Andrews & Martini 2013;Curti et al. 2020).This projection effectively aligns MZRs with different SFRs to a common location on the MZR diagram.In the case of SDSS galaxies, the range of γ values is sufficiently modest that this approach removes nearly all of the spread in the MZR introduced by SFR.
However, at low redshifts our results reside well above the local MZR relations of Andrews & Martini (2013) and Curti et al. (2020).This offset is expected as our sample reaches much lower SFRs than the average for SDSS galaxies.Thus, we use an alternative approach and account for the correlation between the low mass slope of the MZR (γ) and the SFR   Curti et al. 2020 (shifted 0.3 dex) Figure 8.The mass-metallicity relation (MZR) for galaxies at z ≈ 1.1 -2.4, with the results for our stacked spectra shown with blue squares.The MZR decreases to log(O/H) + 12 ≈ 7.8 ± 0.1 (15% solar) at log(M⋆/M⊙) ≈ 7.5, without any evidence of a turnover or flattening in the shape of the MZR at low stellar masses.The overall shape of the MZR is consistent with the local relations of Andrews & Martini (2013) and Curti et al. (2020) when these relations are shifted by ∼0.3 dex to lower metallicities as shown by the gray line.This shift is primarily driven by galaxies having lower metallicities at higher redshifts, but also has a secondary dependence on SFR.As compared with Henry et al. (2021), our stacks reach > 0.5 dex lower in SFR so the trending of our results to higher metallicities is expected and consistent with the M-Z-SFR relation.We have used published emission line fluxes to calibrate the results of Sanders et al. (2018Sanders et al. ( , 2021) )  by accounting for γ as a function of SFR.While γ is now a variable parameter, this approach eliminates the need to determine a projection of least scatter, which can convolve the uncertainties in mass and SFR in a complex manner.As shown in Figure 9, we use the values provided in Table 5 of Curti et al. (2020) to extrapolate γ to the lower SFRs covered by our observations, with extrapolations shown for log(SFR) = −1.25 (γ = 0.18) and −1.75 (γ = 0.13) M ⊙ yr −1 .
We show the M-Z-SFR relation for our low and high redshift samples in Figure 10, with the results for individual galaxies color-coded based on their SFRs and the extrapolated relations shown for the low redshift sample.The uncertainties in the masses are derived from the SPS models added in quadrature with a ±0.1 dex systematic model uncertainty, while those in the abundances are from the Bayesian model added in quadrature with a ±0.02 dex uncertainty from the model grid resolution.Overall, we find excellent agreement between our results and the extrapolations, which agree within the uncertainties to better than one standard deviation.These results indicate that the local MZR is valid to at least ∼1 dex lower in SFR than current calibrations based on SDSS.
At both low and high redshifts the results for the individual galaxies are consistent with those derived for the stacked spectra, with measurements scattered above and below the stacked results.This indicates that the stacks are not biased in metallicity as compared to the individual galaxies.However, the stacks do reach lower SFRs than measurements for the individual sources at high redshift due to S/N limitations.There is also no noticeable trend in SFR for sources above and below the stacked results within a given mass bin, suggesting that the sample is complete over the ranges where both individual  and stacked metallicities can be derived.However, there are an insufficient number of sources in the high redshift sample to detect such a trend if it is weakly present.The results shown in Figures 8 and 10 are tabulated in Tables 4 and 6, respectively, for easier comparison with other MZR studies.
As seen in Figure 10, our low redshift sample resides well above the local MZR.This result is expected, as the SDSS calibrations adopt a constant slope (γ = 0.31) for the low mass end of the relation, which neglects the correlation between the low mass slope and the SFR.This is appropriate for the range of masses, metallicities, and SFRs of local SDSS galaxies, and also safely prevents extrapolation errors.However, our deep observations reach log(SFR) ≈ −2.5, which is significantly lower than where the existing calibration is constrained.Without accounting for this dependence of γ on the SFR, galaxies can artificially appear to reside above or below the FMR.These results indicate that expanding calibrations to larger samples at lower masses and SFRs with JWST and Roman may require incorporating the correlation between γ and SFR, which avoids correlated errors between mass and SFR that occur when using the projection of least scatter.

The Role of Environment
Finally, we investigate the influence of galaxy environment on the metallicities of individual galaxies.We present the differences between the measured metallicities for individual sources and the values predicted by the MZR using our best fit value of γ for each galaxy in order to control for SFR in Figure 11, with the points color-coded based on whether the galaxies reside in isolated (black) or group (gold) environments.We used our group catalog developed in Dutta et al. (2023) to separate galaxies by environment, where the catalog was obtained by running a Friends-of-Friends (FoF) algorithm on the galaxy sample with reliable redshifts using linking lengths of 400 kpc in the transverse spatial direction and 400 km s −1 in the line-of-sight velocity direction.
In the low redshift sample, we find that the majority of galaxies are in group environments, which is expected given the presence of a large group at z ≈ 0.68 (Fossati et al. 2019).In the high redshift sample, half of the sources with individual metallicities are in group environments located at z ≈ 2.25 and z ≈ 2.38.We calculated the mean and median metallicities for the isolated and group galaxies and find that the metallicities of galaxies in group environments are equal to isolated sources to within 0.1 dex for the low redshift sample, and 0.2 dex for the high redshift sample.However, this slight enhancement of metallicity in group galaxies is not statistically significant according to the KS test, due to the small sample sizes and scatter in the individual metallicity measurements.The relatively small scatter surrounding zero indicates that the variable γ approach accurately captures the location of each galaxy on the M-Z-SFR plane.

DISCUSSION
The successful launch of JWST (Gardner et al. 2006(Gardner et al. , 2023) ) provides an opportunity to study galaxies at lower masses and SFRs and higher redshifts than previously possible.This creates a timely need to validate existing metallicity calibrations, and to extend them over larger ranges of physical parameters.Our results using the direct method indicate that the existing strong-line calibrations from Curti et al. (2020) are valid up to at least z ≈ 2.5 across a broad range of masses and SFRs.
However, recent studies with JWST suggest that common strong-line calibrations may not be valid at z > 3. Specifically, Laseter et al. (2023) compared the metallicities of 10 sources at z ≈ 2 − 9.5 and found that strong-line calibrations failed to match direct method measurements.Curti et al. (2023a) expanded this investigation to 146 star-forming galaxies and found evidence that the slope of the MZR may begin to flatten at lower masses, and that galaxies increasingly diverge from the FMR beyond z ≈ 3. Evidence of a redshift evolution in the MZR and FMR and a lack of consistency of local strong-line calibrations with the direct method are supported by several recent JWST studies (Li et al. 2023;Nakajima et al. 2023).
The results of these studies have important implications for the evolution of the MZR and FMR with mass, SFR, and redshift.Historically, the low mass regime of the MZR has been represented as a decreasing power-law.However, the relationship must turnover at some critical mass and approach a metallicity floor, otherwise the lowest mass galaxies would consist of only hydrogen and helium.Studies have been searching for this flattening of the MZR, with evidence at the lowest masses observed for local galaxies (Hirschauer et al. 2022).The results of Curti et al. (2023a) and others suggest that we are beginning to observe a flattening in the low-mass slope of the MZR across a range of redshifts at z > 3.
Interestingly, we do not see evidence of a flattening or turnover in our results (Figure 8).The overall shape is very consistent with local MZR relations after they are shifted by ∼0.3 dex to lower metallicities to account for redshift evolution.The results for our lowest mass stack in the high redshift sample reside almost exactly between the Andrews & Martini (2013) and Curti et al. (2020) relations that have been extrapolated to log(M ⋆ /M ⊙ ) ≈ 7.5.The lowest mass stack also has the lowest SFR, suggesting that galaxies with higher SFRs at this mass would have even lower metallicity.Nonetheless, we do not see evidence for a flattening or turnover in the MZR at these masses, SFRs, and redshifts.Given the importance of our lowest mass stack in the high redshift sample, we further investigated the effects of having a small number of targets within the stack.In addition to the boot-strapped uncertainties, Offset from -Dependent MZR (z 0.2 0.9) Individual Galaxies (isolated) Individual Galaxies (groups) Offset from -Dependent MZR (z 1.1 2.4) Individual Galaxies (isolated) Individual Galaxies (groups) Figure 11.The offset from the MZR for individual galaxies at z ≈ 0.2 -0.9 (left) and z ≈ 1.1 -2.4 (right) using the best-fit γ-values that were extrapolated based on the SFR of each source (see Figure 9).The results for sources with [O III] S/N ≥ 10 are color-coded based on whether they are isolated galaxies (black), or reside in a group environment (gold).The best-fit γ values encompass γ = 0.1 -0.3 for the low redshift sample, and γ = 0.3 -0.4 at high redshifts.The small dispersion of sources centered on zero indicates that the γ-dependent MZR accurately encapsulates the large range of observed SFRs.Overall, the offset from the MZR for galaxies in groups are the same as isolated galaxies to within 0.1 dex (left) and 0.2 dex (right) in metallicity.The errors are propagated from those in the metallicities and the uncertainty in extrapolating γ to lower SFRs.
we confirmed the robustness of this measurement by adding one, two, and three of the next highest mass targets into the stack.While the average mass of the bin shifted minutely, the derived metallicities agree to within 0.04 dex, which is the size of our Bayesian grid interval, in all three cases.
However, assuming constant γ, we see evidence that our results deviate from the local MZR.We observe that our low redshift sample resides above it by ∼0.2 dex, while the high redshift results are consistent with it.In these cases, the assumption that the low mass slope (γ) is constant with SFR is no longer valid.Accounting for the low mass slope as a function of SFR brings our results into agreement with the local MZR, without including any direct redshift evolution.
These results provide a crucial constraint on the MZRs and FMRs at low stellar masses, a regime in which modern hydrodynamical simulations are currently limited.Simulation suites such as SIMBA (Davé et al. 2019), IllustrisTNG (Vogelsberger et al. 2014;Genel et al. 2014;Springel et al. 2018), FIRE (Hopkins et al. 2014(Hopkins et al. , 2018;;Muratov et al. 2015), and EAGLE (Schaye et al. 2015) require a compromise between the cosmological volume that is probed and mass resolution, such that low mass systems can only be studied in smaller volumes.With advancements in computation, the MZR and FMR have now been extended below ∼10 8.5 M ⊙ (e.g., Ma et al. 2016;Torrey et al. 2018Torrey et al. , 2019;;Davé et al. 2019;Feldmann et al. 2023).These works have successfully reproduced the broad slope and normalization of these relations across a range of redshifts; however, the low-mass end of the MZR and FMR require further investigation in observations, simulations, and models (Ucci et al. 2023).Hence, this work can serve as benchmark for future observations and simulations.
As the archive of deep, high-redshift observations with JWST grows, and the quality of the data calibrations steadily improve, emission line calibrations are being expanded to a vastly increased parameter space in a robust, selfconsistent manner.Progress is already being made in this area with revised calibrations proposed by Sanders et al. (2023), Hirschmann et al. (2023), and additional studies (Nakajima et al. 2022;Garg et al. 2023;Yang et al. 2023).This rapid release of results related to the metallicities of galaxies across cosmic time should provide a consensus on these issues that was not possible before JWST (e.g.Atek et al. 2023;Bagley et al. 2023;Curti et al. 2023b;Heintz et al. 2023;Langeroodi et al. 2022;Langeroodi & Hjorth 2023;Maseda et al. 2023;Matthee et al. 2023;Shapley et al. 2023;Sun et al. 2023;Trump et al. 2023).These investigations are crucial, as our results approach the low metallicity limit of the current strongline calibrations.Critically, our direct method metallicities for the [O III] λ1666 emitters validates the use of the existing Curti et al. (2017) strong-line calibrations up to z ≈ 2.5.

CONCLUSIONS
We have used ultra-deep imaging and spectroscopy of the MUSE Ultra Deep Field to characterize the gas-phase metallicities of low mass galaxies up to z ≈ 2.3 during the peak of cosmic star-formation.Our conclusions are the following: 1. We find excellent agreement between the metallicities derived from the Curti et al. (2017) strong-line calibrations and the direct method for the 12 sources with detections of the [O III] λ4363 (z ≈ 0.7) and [O III] λ1666 (z ≈ 2.3) auroral lines.The results of these techniques agree to within a factor of two (0.3 dex) within the uncertainties for all galaxies, which validates the use of these strong-line calibrations up to z ≈ 2.5.
The six [O III] λ1666 detections with S/N > 5 nearly doubles the number previously reported at z > 2 for unlensed galaxies, providing an anchor for these strongline calibrations at the peak of cosmic star-formation.
2. We find that the MZR at z ≈ 1 -2 decreases to log(O/H) + 12 ≈ 7.8 ± 0.1 (15% solar) at log(M ⋆ /M ⊙ ) ≈ 7.5, without evidence of a turnover or flattening at low stellar masses.The shape of the MZR is consistent with the local relations of Andrews & Martini (2013) and Curti et al. (2020) when they are shifted by ∼0.3 dex to lower metallicities.This shift is primarily driven by galaxies having lower metallicities at higher redshifts, but also has a secondary dependence on SFR.These results extend the MZR to six times lower in stellar mass and SFR than earlier studies with HST in this redshift range.
3. At z ≈ 0.2 -0.9 our ultra-deep observations reach SFRs that are ∼1 dex lower than current calibrations of the MZR from the local Universe.We extrapolate the MZR of Curti et al. (2020) to lower SFRs and find a close agreement with the observations.These findings extend the valid parameter space of the MZR to encompass a range that spans nearly 4 dex in SFR, which will be beneficial for current and future studies utilizing JWST.
4. Our sample at z ≈ 0.2 -0.9 reaches significantly lower SFRs than SDSS galaxies that are well modeled by an FMR that adopts a constant slope (γ) for the low mass end of the relationship (Curti et al. 2020).Our results align if we relax this constraint and adopt a shallower slope that is consistent with extrapolating to lower SFRs.At z ≈ 1 -2 an MZR with constant γ matches our results well, encompassing higher SFR galaxies similar to SDSS.This indicates that future calibrations of the MZR could incorporate γ as a function of SFR as an alternative approach to the projection of least scatter.
5. We examined the properties of galaxies in different environments and find at most a tentative ∼0.2 dex enhancement in the metallicities of galaxies in groups versus isolated environments.However, the result is not statistically significant because there are a small number of galaxies with individual metallicity measurements between the environments that are also in the same mass range, which is required for a consistent comparison.
In closing, the metallicities of the lowest mass galaxies in our sample approach the low metallicity limits of many current strong-line calibrations.These results together with recent studies utilizing JWST highlight the current need to make direct method measurements at lower stellar masses and SFRs for determining direct metallicities that can be used to extend strong-line calibrations for samples without auroral emission line measurements.These observations reach the lowest masses and SFRs that are accessible to HST, and provide a key benchmark based on established calibrations for comparison with future studies utilizing JWST and Roman.

Figure 1 .
Figure1.The results produced by the fitting routine for two sources, with the MUSE spectra on the left and the WFC3 grism on the right.The data are shown in black, with Gaussian fits in red, and the derived centroids with blue vertical lines.The inset panels zoom-in on emission lines in MUSE in order of wavelength.The element and ionization state and rest wavelength for each feature are shown in the top center of each inset.There is a spectral coverage gap between MUSE and WFC3 from 9353 -10020 Å, while the vertical gray mask is due to the MUSE AO notch filter.These are examples of the highest S/N spectra available for the sample, which are suitable for individual metallicity measurements.Several features are notable, including the blueshifted P Cygni absorption profile of C IV in the lower source and a weak residual skyline near [O III] λ4363 in the upper source.The line fitting constraints allow for the effects of spurious artifacts to be minimized and flagged in the resulting fits.

Figure 1 .
Figure1.continued.The same as above for two sources with modest S/N spectra that are representative of the typical data quality for the sample.
The [Si III], [C III], [O II], and [S II] doublets are all sensitive to the electron density of the gas.λ4687,Hβ λ4862, and [O III] λλ4959, 5007; 4) [O I] λλ6300, 6363, [N II] λλ6550, 6585, Hα λ6565, and [S II] λλ6716, 6731; 5) [S III] λλ9071, 9533, and He I λ10832.The remaining three constraints are for C IV λλ1548, 1551, Mg II λλ2796, 2804, and [O II] λλ3727, 3730, which are independent of the other lines except for when Mg II or [O II] are in the grism spectral range at higher redshifts, in which cases they are fixed to the other emission lines seen in the grism.

(
202-205) he10830 ⋆ flux, error, ew obs, contam NOTE-A list of the columns provided in the emission line catalog.Entries include the object ID, redshift, sky coordinates, magnitude, size measurements, spectral fit parameters, and their associated uncertainties.The δz values are offsets relative to the redshift of Hα.These are followed by entries for each emission line with the flux and flux error in units of erg s −1 cm −2 , observed equivalent width in Å, and a quality flag (see text).The individual and total fluxes of closely spaced doublets are saved in the catalog so the component fluxes can be used when the lines are resolved in MUSE, and the integrated fluxes can be used when lines are blended in WFC3.The fluxes of non-detected lines are recorded with a value of -1.

Figure 2 .
Figure 2. The mass-excitation (MEx) diagram for 243 galaxies where the [O III] λ5007 Å emission line is detected at S/N ≥ 3. Sources at z < 1 are shown with circles, while sources at z ≥ 1 are shown with stars, with points color-coded based on their SFR from SED modeling.The demarcation lines separating stellar-ionized and AGN-ionized emission line sources at z ≈ 0 (dotted lines) are from Juneau et al. (2014), and are also shown shifted by 0.75 dex to higher mass for use at z ≈ 2 (dashed lines).The area between the gray and black lines represents a composite region with contributions from AGN and star-formation.The z < 1 sources (circles) should be compared with the dotted lines, while the z ≥ 1 sources (stars) are compared with the dashed lines.This diagnostic identifies 20 sources as AGN that we exclude from the metallicity analysis.
[O II] is blueward of the MUSE spectral range at z < 0.23 and falls in the adaptive optics NaD laser notch blocking filter at z = 0.54 -0.61; [O III] and/or Hβ reside in the wavelength gap between MUSE and WFC3 (9353 -10020 Å) at z = 0.87 -1.06; and [O II] resides in the wavelength gap at z = 1.51 -1.69.These redshift constraints yield 197 sources at z = 0.23 -2.39 with wavelength coverage of [O II], [O III], and Hβ.Of these, 115 have non-zero fluxes of all three lines that are required for individual metallicity measurements.The properties of these 197 galaxies are summarized in Figure 3. 3.3.Subsample Construction We use the sources with wavelength coverage of [O II], [O III], and Hβ to construct subsamples for analysis.Specifically, from the full sample described above we select 72 galaxies at z = 0.23 -0.87 and 109 galaxies at z = 1.00 -2.39 with S/N ≥ 5 in [O III] λλ4959, 5007.We refer to these as our low and high redshift samples, respectively.In the low redshift sample, [O II], [O III], and Hβ are observed in the MUSE spectra, while for the high redshift sample [O III] and Hβ are redshifted into WFC3.Primarily, we are interested in characterizing the MZR at z ≈ 1 -2 over the same mass range that has been studied in the local Universe.The galaxies in our high redshift sample have high quality spectra with a median S/N ≈ 17 for the [O III] doublet and S/N ≈ 2.5 for Hβ.Our requirment of detecting [O III] at S/N ≥ 5 ensures that only sources with robust redshifts are selected and they will contribute meaningful information to the spectral stacks.We apply the o3_5007_dz offset from our catalog so the redshifts are based on the strong [O III] λ5007 emission line.

Figure 3 .
Figure 3.The redshifts, stellar masses, SFRs, and extinctions for the 197 sources with coverage of the [O II], [O III], and Hβ emission lines.These galaxies are divided into a low redshift sample of 77 galaxies shown in orange, and a high redshift sample of 120 galaxies shown in blue.

Figure 5 .
Figure5.The stacked and flux-normalized spectra for galaxies at z ≈ 1.1 -2.4 in six bins of stellar mass with emission lines labeled.The total stack of all galaxies is shown in black, followed by stacks of decreasing stellar mass.Stacks are shown with different colors for clarity, with uncertainties represented by the gray shaded regions.The stellar mass range for each stack is shown above the spectrum with additional properties listed in Table 4.The increase of [O III]/[O II] and [O III]/Hα with decreasing stellar mass indicate changes in metallicity and ionization state.
is identical to the value used in the Curti et al. (2017) calibrations, and matches other widely-used studies (Allende Prieto et al. 2001) for an easier cross-comparison of results.The Bayesian methodology developed by Henry et al. (2021) uses the R 2 ≡ log([O II]/Hβ), R 3 ≡ log([O III]/Hβ), and O 32 ≡ log([O III]/[O II]) strong-line calibrations presented in Photometric EW (Å) [O III]/Hβ and [O II]/Hβ that are used to determine the ionic fractions.Considering the weakness of the auroral lines, and the need to measure fluxes for [O II], [O III], and Hβ, the comparison of these techniques is only possible for six sources at z ≈ 0.68 -0.78 using [O III] λ4363, and for six additional sources at z ≈ 2.25 -2.32 using [O III] λ1666.The spectra for these 12 sources are shown in the Appendix.

Figure 7 .
Figure 7.A comparison of the gas-phase oxygen abundances derived using our Bayesian framework with the Curti et al. (2017) strongline calibrations, versus those found using the direct Te method with PyNeb.The solid unity line is flanked by factor of two dashed boundaries (± 0.30 dex).The results from these techniques agree to within a factor of two within the uncertainties for all sources, with the lowest metallicity sources at log(O/H) + 12 ≈ 7.7 having robust [O III] λ4363 detections at S/N > 10.The results for [O III] λ1666 validate the use of these strong-line calibrations up to z ≈ 2.5.The data shown in this figure are available in the Appendix in Table5.
(O/H) = Z 0 / × (1 + (M/M 0 ) ) [Z 0 = 8.493, = 0.26, = 1.08,M 0 = 10.00]MassMetallicity Relation (z 1.1 2.4) Figure8.The mass-metallicity relation (MZR) for galaxies at z ≈ 1.1 -2.4, with the results for our stacked spectra shown with blue squares.The MZR decreases to log(O/H) + 12 ≈ 7.8 ± 0.1 (15% solar) at log(M⋆/M⊙) ≈ 7.5, without any evidence of a turnover or flattening in the shape of the MZR at low stellar masses.The overall shape of the MZR is consistent with the local relations ofAndrews & Martini (2013) andCurti et al. (2020) when these relations are shifted by ∼0.3 dex to lower metallicities as shown by the gray line.This shift is primarily driven by galaxies having lower metallicities at higher redshifts, but also has a secondary dependence on SFR.As compared withHenry et al. (2021), our stacks reach > 0.5 dex lower in SFR so the trending of our results to higher metallicities is expected and consistent with the M-Z-SFR relation.We have used published emission line fluxes to calibrate the results ofSanders et al. (2018Sanders et al. ( , 2021) ) to the Curti et al. (2017) relation for equal comparison across studies.We use the same methodology as Henry et al. (2021) and thus derive a best fit model for our combined results shown with the dashed blue-red line.The equation from Curti et al. (2020) and our best fit model parameters are shown in the lower-right of the figure.

Figure 9 .
Figure 9.The dependence of the low-mass slope (γ) in Equation 1 on the SFR of the galaxy.The values and uncertainties from Table 5 of Curti et al. (2020) are shown with black circles, while the extrapolated values for log(SFR) = −1.25 M⊙ yr −1 (γ = 0.18) and −1.75 M⊙ yr −1 (γ = 0.13) are shown as blue squares.The shaded region represents the error-weighted uncertainty on the slope (0.11) and intercept (0.32) for the line of best fit, shown by the solid back line.
Figure 10.The mass-metallicity relation (MZR) for galaxies at z ≈ 0.2 -0.9 (left) and z ≈ 1.1 -2.4 (right).The results for stacked spectra are shown with joined blue squares color-coded by their median SFRs, while the points for individual galaxies with [O III] S/N ≥ 10 are shown as circles color-coded by their model-based SFRs.In the left panel, we show the local MZR from Curti et al. (2020) with a black line, with values for specific SFRs color-coded.The solid sections of each line are from Curti et al. (2020), while the dashed sections are extrapolations to lower masses or higher SFRs, and the dotted sections are extrapolated across both mass and SFR.The shaded region shows the uncertainty in the extrapolation of γ = 0.18 for log(SFR) = −1.25 M⊙ yr −1 .In the right panel, our results are consistent with the local MZR after it is shifted by ∼0.3 dex to lower metallicities to account for redshift evolution and different SFRs.The vertical offset from the local MZR in the left panel is driven by the lower SFRs of the stacks as compared to SDSS galaxies, while the offset in the right panel is primarily due to redshift evolution.
[O III] λ5007 at 5008.240 Å).This table includes the element and ionization state, vacuum rest wavelength, energy level transition, creation ionization potential, and critical density for each emission line.While several additional weak lines are detected in a few sources, each emission line adds significant complexity and computational time in the fitting process and

Table 1 .
Galaxy Emission Lines

Table 3 .
Emission Line Catalog Columns