ABYSS. II. Identification of Young Stars in Optical SDSS Spectra and Their Properties

We developed a tool that measures equivalent widths of various lines in low-resolution optical spectra, and applied it to stellar spectra obtained as part of the SDSS-V and LAMOST programs. These lines, such as Li I, which directly indicates stellar youth, or optical H I and Ca II, which in emission indicate activity associated with stellar youth, are commonly seen in YSOs. We observe several notable differences in the properties of these lines between YSOs and the field stars. Using these data, we devise a set of criteria through which it is possible to confirm the youth of stars that have been observed by the ABYSS program, as well as to identify likely young stars that have serendipitously been observed by other programs. We examine the decrement of H lines seen in emission in CTTSs, and estimate the properties of the accretion stream that is responsible for the production of these lines. Finally, we examine the evolution of Li I as a function of age, and characterize the scatter in its abundance that appears to be intrinsic in young M dwarfs.


Introduction
In its fifth iteration, the Sloan Digital Sky Survey (SDSS-V) is set to obtain optical and near-IR spectra of several million stars through the Milky Way Mapper (MWM) program (Almeida et al. 2023). This will be achieved with both the BOSS (Baryonic Oscillation Spectroscopic Survey) and APOGEE (APO Galactic Evolution Experiment) spectrographs, using a state-of-the-art fiber robotic positioner (Pogge et al. 2020) that allows the simultaneous observation of up to 500 targets. As part of the SDSS-V MWM program, the APOGEE & BOSS Young Star Survey (ABYSS) is expected to produce multi-epoch spectra of >100,000 photometrically identified young star candidates with ages <30 Myr (Kounkel et al. 2023b, hereafter Paper I).
To date, single-epoch spectra of several tens of thousands of such candidates have been obtained, allowing us to perform an initial characterization of their properties. Because the initial selection of the ABYSS targets was not free of contamination, it is necessary to use spectroscopic criteria to separate older field stars from the bona fide young stellar objects (YSOs), to enable any follow-up work. We take advantage of the fact that optical spectra of young stars typically exhibit a number of unique features (e.g., Li I absorption, Hα emission, and inflated radii) that are not present in the field stars, allowing us to perform their identification.
The Li I 6708 Å absorption line has particular importance. Li I is present in the ISM, and as such, it is included in the chemical composition of YSOs as they form. However, Li is easily destroyed: as soon as the internal temperature of a star reaches 3 × 10^6 K, it is rapidly processed in nuclear reactions (Clayton 1983). In higher-mass stars with radiative envelopes, a trace fraction of Li I will persist near the photosphere, regardless of the internal temperature. However, when fully or mostly convective stars reach the age where they have sufficiently contracted to reach that temperature in the core/convective boundary layer, Li I becomes depleted everywhere within their envelopes. Thus, Li I will not be detected in low-mass field stars, and its detection can act as an unambiguous confirmation of stellar youth (e.g., Briceno et al. 1997; Jeffries et al. 2014; Gutiérrez Albarrán et al. 2020), although some care needs to be applied with regard to solar-type stars (Žerjal et al. 2019). Additionally, at its strongest, Li I absorption typically has equivalent widths of only 0.5 Å, and this feature is found near several Fe I lines. Thus, care needs to be taken in order to minimize confusion and contamination.
YSOs with dusty disks often exhibit strong and wide emission lines from the accretion stream shock and outflows. The strongest of those lines in the optical spectra is Hα, often presenting equivalent widths (EqW) of tens to hundreds of Å (White & Basri 2003). Particularly energetic shocks are capable of exciting other transitions of H (Kwan & Fischer 2011; Wilson et al. 2022; Campbell et al. 2023), as well as a number of other elements, including various O, Fe, He, S, N, and others (Hamann & Persson 1992a, 1992b, 1992c; Hamann 1994; Baldovin-Saavedra et al. 2012; Ballabio et al. 2023). Young stars exhibiting accretion are known as Classical T Tauri Stars (CTTSs). Because disks tend to be short-lived, CTTSs are not ubiquitous among all young stars. Broad emission lines can help to cleanly separate YSOs from the field stars, although a few rare types of more evolved stars (e.g., cataclysmic variables) could also exhibit them (Echevarria 1988). Additionally, in some rare cases, Hα in CTTSs can be so strong that it may confuse the data processing pipelines in large surveys such as SDSS, which could erroneously mask such a feature as a cosmic ray.
Young stars also tend to be 10^2-10^4 times more magnetically active than the field stars (e.g., Skumanich 1972; Feigelson et al. 2007; Kounkel et al. 2022). This activity produces a number of emission lines, most notably in Hα and other H lines, as well as Ca (Vaughan et al. 1978; White & Basri 2003; Briceño et al. 2019), although they are significantly weaker and narrower than the lines dominated by accretion. Young stars with emission lines driven primarily by activity are known as Weak Lined T Tauri Stars (WTTSs). Activity-driven emission is significantly longer lived, however: it is able to persist for several Gyr in late M dwarfs, and for several hundred Myr in early M dwarfs (West et al. 2008; Newton et al. 2017). In contrast, any emission in G and K dwarfs is a clearer indication of stellar youth.
In addition to the above features that are primarily observed in the optical regime, at a sufficiently high signal-to-noise ratio (S/N), log g can also be used as a reliable tracer in many young stars. Low-mass pre-main-sequence (PMS) stars are still inflated relative to their main-sequence counterparts; as such, they have a systematically lower log g in comparison to the field stars. For high-mass stars, it is the inverse: because OB (and to a lesser extent, A) stars evolve so rapidly, catching them on the main sequence is more likely than not to signify their youth in comparison to those stars that have already evolved into giants. Although there has been difficulty in accurately calibrating log g measurements of young stars in the past, recent efforts to process SDSS spectra (both APOGEE and BOSS) now allow log g of low-mass YSOs to be an independent estimate of age (Olney et al. 2020; Sprague et al. 2022).
Each individual tracer may provide a strong indication of stellar youth, and for some sources, in particular if the scope of the study is limited to, e.g., an individual star-forming region, simple criteria are capable of producing clean samples. However, any individual criterion is significantly easier to apply to low-mass KM dwarfs than to OBAFG stars, and stars with an age of a few Myr are easier to identify than those with an age of 20-30 Myr.
Dealing with the data from a large all-sky spectroscopic survey presents its own challenges. Given that YSOs tend to be somewhat rare, any selection criteria have to be carefully adjusted, as even a small fraction of the field stars in the parameter space similar to that of the young stars can overwhelm the sample, while any cuts that are too strict may exclude a significant fraction of bona fide YSOs that could have been easily identifiable in a smaller study. As such, using a combination of different criteria is advantageous for developing a cleaner but still comprehensive sample. At the same time, however, it is necessary to efficiently and carefully extract relevant spectral features in a manner that could be applied to all of the stars regardless of their mass, evolutionary status, or the unique properties that a star may exhibit.
In this paper, we aim to develop a pipeline for measuring equivalent widths of various youth-sensitive lines in low-resolution optical spectra, such as those produced with BOSS and LAMOST. Using these data, we develop a classifier for identification of young stars, and we study the evolution of the properties of these lines as a function of age.

Data
BOSS is a low-resolution optical spectrograph with R ∼ 1800 and a typical pixel scale of ∼1 Å, covering a wavelength range of 3600-10400 Å (Smee et al. 2013). Twin instruments are mounted at two observatories, Apache Point Observatory (APO; Gunn et al. 2006; Blanton et al. 2017) and Las Campanas Observatory (LCO; Bowen & Vaughan 1973), to ensure complete coverage of the entire sky. Each spectrograph is capable of observing up to 500 spectra within a 3° and a 2° field of view, respectively. In prior iterations of SDSS, BOSS was primarily used to observe extragalactic targets, obtaining spectra of only a few thousand stars in total (Abdurro'uf et al. 2022), although it did include a couple of fields rich in PMS stars (Suárez et al. 2017). Since the transition to SDSS-V in 2021, it has observed well over 400,000 stars, including more than 17,000 stars targeted by ABYSS.
LAMOST (Large Sky Area Multi-Object Fiber Spectroscopic Telescope) is a spectrograph very similar to BOSS in terms of its resolution, although it does have a slightly narrower wavelength range of 3700-9000 Å, and it can observe up to 4000 stars simultaneously (Yan et al. 2022). Since beginning its operations in 2011, it has obtained spectra of more than 10 million stars. So far, there have been only a few studies utilizing LAMOST spectra of young stars (Liu et al. 2021; Wang et al. 2022; Hernández et al. 2023; Lin et al. 2023); however, LAMOST DR8 contains more than 13,000 serendipitously observed stars that are included in ABYSS SDSS targeting (most of which, to date, are yet to be observed with BOSS). Given the similarity between BOSS and LAMOST, it is beneficial to include these data in the analysis.
Fundamental stellar parameters for the data from both of these spectrographs have been extracted using BOSS Net (Sizemore et al. 2024), a neural net that is capable of predicting self-consistent T_eff and log g (with respective precision of 0.008 and 0.1 dex at S/N ∼ 15) values that are calibrated to theoretical models for stars of all types, including PMS stars, main-sequence sources, brown dwarfs, red giants, subdwarfs, and white dwarfs, as well as OB stars. Additionally, it provides an independent estimate of stellar radial velocities (RV). This is important because RVs that are included in LAMOST data releases have a number of artefacts, namely systematic offsets on the order of up to 10 km s^-1 that vary depending on sky position and spectral type. Accurate RVs are crucial for accurate centroiding of spectral lines, especially those that are weak and narrow.

Measuring Line Widths
Here, we describe the development of the LineForest pipeline, which extracts a number of youth-sensitive features from the optical spectra.

Initial Measurements
Different lines present different challenges to the accurate measurement of EqW. Some lines are quite strong, such as Hα, with absolute widths (AbW) that can vary with spectral type; depending on the range over which the line is integrated, the EqW would either be underestimated, or it could be contaminated by the flux well outside of the line. On the other hand, Li I is a relatively weak absorption line, typically not exceeding an EqW of 0.5 Å (Briceno et al. 1997). A number of lines such as Fe I and CN can often blend together with Li I when its strength decreases to <0.2 Å. One of the methods through which Li I EqW has been measured in a large census in the past was through identifying the closest matching spectrum of an evolved star and subtracting it out such that only Li I remains (Žerjal et al. 2019); however, while it is possible to do this in high-resolution and high-S/N spectra, such an approach creates a challenge for BOSS, in particular because >70% of visits have S/N < 30. We attempted a number of other approaches (Gaussian profile fitting, automated continuum determination, automated width estimation), all of which had some degree of success in a subset of stars, but none could be generalized across all sources even within a single class.
We found that the most reliable approach to measuring EqW was through manually defining each line in each spectrum, which is an adaptive approach that allows measurement of the width of the entire line, and it enables skipping sources that do not present a feature in either emission or absorption. This task is not feasible for all stars observed in a large survey, due to a significant person-power requirement. However, doing it on a subset of stars can create a vetted set of training labels for data-driven machine-learning applications, which can then be packaged into an automated pipeline.
For this purpose, we selected a random set of ∼3500 spectra, among which about 1000 sources had been targeted by ABYSS, and the remaining ones were chosen to create a representative subset of targets from other programs. Those other programs include OB stars, brown dwarfs, white dwarfs, cataclysmic variables, and red giants, featuring ∼250 stars in each class. For all those sources, we have measured line properties of the 23 most prominent transitions, while another 29 lines were measured in a subset of 350 spectra to enable the model to learn the shape of the continuum at these transitions. This represents a total of 52 lines (of which half are Balmer or Paschen H series lines that fall in the BOSS wavelength range). The features are listed in Table 1.
Using a custom-built interactive code previously used in Campbell et al. (2023), we recorded, for each line in each source, the wavelength and flux (λ_1, F_1) near the start of the line as well as those near the end of the line (λ_2, F_2). This set of points defines the continuum under which the spectrum was integrated to estimate EqW (Figure 1). By convention, absorption lines are defined with positive EqW, and emission lines are defined with negative EqW. The difference λ_2 − λ_1 corresponds to AbW. In cases where no apparent line was observed, both EqW and AbW were recorded as 0. In cases where significant blending was suspected, depending on the exact profile, the line was either recorded as 0, or the continuum was drawn over the portion of the line most likely to be associated with the given element. While such an approach is imperfect, the neural net is capable of interpolating through these measurements to achieve a more general solution.
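The measurement described above can be illustrated with a minimal sketch. It assumes a straight-line continuum through the two marked points and trapezoidal integration of 1 − F/F_c; the function name and the toy spectrum are hypothetical and are not taken from the interactive code itself.

```python
def equivalent_width(wave, flux, lam1, f1, lam2, f2):
    """Return (EqW, AbW) for a line marked by (lam1, f1) and (lam2, f2).

    Assumption: the continuum F_c is the straight line through the two
    marked points, and EqW is the trapezoidal integral of 1 - F/F_c.
    EqW > 0 for absorption, EqW < 0 for emission.
    """
    slope = (f2 - f1) / (lam2 - lam1)
    # Normalized line depth at every pixel inside the marked interval.
    pts = [(w, 1.0 - f / (f1 + slope * (w - lam1)))
           for w, f in zip(wave, flux) if lam1 <= w <= lam2]
    eqw = 0.0
    for (w0, d0), (w1, d1) in zip(pts, pts[1:]):
        eqw += 0.5 * (d0 + d1) * (w1 - w0)
    return eqw, lam2 - lam1


# Toy spectrum on a 1 Å grid: flat continuum with a single-pixel dip.
wave = [6700.0 + i for i in range(17)]
absorption = [0.6 if w == 6708.0 else 1.0 for w in wave]
emission = [3.0 if w == 6708.0 else 1.0 for w in wave]
```

With the toy arrays above, the absorption case gives a positive EqW and the emission case a negative one, matching the sign convention in the text.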
We do note that the measured line properties do not take into account various physical effects that may alter them (for example, veiling due to accretion), and model fitting is required in order to characterize them (e.g., Fang et al. 2020, 2021). We defer full analysis of the effects of veiling in young stars to future works in the series.
Low-resolution spectra such as BOSS may present a challenge in identifying some weaker lines if the radial velocity is very uncertain, in particular for sources with very low S/N. As a consequence of this, if during manual measurements a confident identification was not possible, nothing would be recorded for it, thus attempting to condition the neural net to disregard it.
However, it should be noted that, in some cases, some lines may be systematically misattributed. For example, He transitions are most commonly observed in OB-type stars, although they can also be seen in emission in low-mass PMS stars as a result of certain processes related to accretion. In a number of low-mass main-sequence stars, however, the Fe I 6678 Å absorption feature has a wavelength almost identical to that of the He I 6678 Å line. It is unlikely that the lines are blended in single stars, as they are prominent at completely different T_eff regimes; however, for the sake of self-consistency, all of the likely misattributed measurements were preserved in the sample.
In some cases, there is considerable blending; for example, Pa 13, Pa 15, and Pa 16 are located extremely close to the lines of the Ca II triplet. Significant confusion is also present between the Ca H and Hε lines. Some effort was made to define the continuum in such a way as to minimize the contribution from the neighboring strong features; in particular, in young low-mass stars, calcium lines are in emission, with only weak Pa lines. Nevertheless, it was not always possible to separate the lines cleanly, and thus in some of the approaches described below, there are cases with partial confusion among those lines. Reasonable judgment should be exercised in using the catalog for the appropriate T_eff ranges for each line, and referring back to the spectra of the sources is suggested for unexpected features.

Model Architecture and Training
We categorize the studied lines into two groups: one contains lines that are typically broad, and the other contains lines that are typically narrow. Broad lines (which include most of the H lines, as well as Ca H & K) are evaluated over a ±200 Å window centered on the Doppler-corrected line. Narrow lines (all of the remaining elements, as well as high-order lines in the Balmer series, in particular those found close to the edge of the wavelength range of BOSS) are evaluated over a ±50 Å window. We note that the width of the "broad" lines is driven by the maximum width seen in these lines across the entire sample, such as in white dwarfs. Although many lines that are typically narrow can be significantly broadened to widths of hundreds of km s^-1 through accretion or winds (e.g., Banzatti et al. 2019; Sicilia-Aguilar et al. 2020), even in their most extreme cases, they can easily fit within the 50 Å window.
The flux in these windows is linearly reinterpolated onto uniformly spaced 128-element arrays with respective pixel scales of ∼3.15 and ∼0.79 Å. Also, in order to normalize the flux in these windows, we transform it into log space.
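As a rough illustration of this preprocessing step, the sketch below resamples a spectrum onto a 128-point grid spanning the window and takes the logarithm of the flux. The base-10 logarithm, the small positive flux floor, and the function name are assumptions made for the example; the input spectrum is also assumed to cover the full window. Note that 128 points spanning 400 Å and 100 Å give spacings of 400/127 ≈ 3.15 Å and 100/127 ≈ 0.79 Å, consistent with the quoted pixel scales.

```python
import math


def resample_window(wave, flux, center, half_width, n=128):
    """Linearly interpolate flux onto n uniform points in center ± half_width.

    Returns (grid, log10_flux, pixel_scale). Assumptions: `wave` is sorted
    and covers the window; base-10 log with a small positive floor.
    """
    step = 2.0 * half_width / (n - 1)
    grid = [center - half_width + i * step for i in range(n)]
    log_flux = []
    j = 0
    for g in grid:
        # Advance to the input interval [wave[j], wave[j + 1]] containing g.
        while j < len(wave) - 2 and wave[j + 1] < g:
            j += 1
        t = (g - wave[j]) / (wave[j + 1] - wave[j])
        f = flux[j] + t * (flux[j + 1] - flux[j])
        log_flux.append(math.log10(max(f, 1e-10)))  # floor is an assumption
    return grid, log_flux, step
```

Calling `resample_window(wave, flux, 6562.8, 200.0)` would correspond to a broad-line window, and a half-width of 50.0 to a narrow-line window.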
The lines can vary significantly in strength. As such, to normalize their properties, we also take the log of the absolute value of EqW and AbW. However, while the AbW are always positive, EqW can be either positive or negative. In order to preserve this information, we add the sign of EqW to the output: the sign is set to +1 if EqW > 0, −1 if EqW < 0, and it is left at 0 if the line is undetected.
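The label encoding described above can be sketched as follows. The use of a base-10 logarithm and of `None` as a placeholder for the masked values of undetected lines are assumptions made for the illustration; the helper names are hypothetical.

```python
import math


def encode_label(eqw, abw):
    """Map (EqW, AbW) to (log10|EqW|, log10 AbW, sign).

    sign is +1 for absorption, -1 for emission, 0 for an undetected line.
    For undetected lines the log terms are undefined, so None is used as
    a placeholder (an assumption of this sketch).
    """
    if eqw == 0 or abw == 0:
        return None, None, 0
    sign = 1 if eqw > 0 else -1
    return math.log10(abs(eqw)), math.log10(abw), sign


def decode_label(log_eqw, log_abw, sign):
    """Invert encode_label, restoring the signed EqW and the AbW."""
    if sign == 0:
        return 0.0, 0.0
    return sign * 10.0 ** log_eqw, 10.0 ** log_abw
```

For example, an emission line with EqW = −100 Å and AbW = 10 Å encodes to (2.0, 1.0, −1), and decoding restores the original signed values.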
Notes to Table 1. (a) All wavelengths are given in air, typically taken from Kramida et al. (2023). (b) Scatter between the manual measurements and the predictions; reported only if the test set has >10 spectra with EqW > 0.2 Å. (c) E: line is most prominent in early-type stars; L: line is most prominent in late-type stars.
The model was constructed in TensorFlow, using a convolutional neural network (CNN) architecture. The data are passed through 6 convolutional layers, with a convolutional kernel of 8 pixels; each layer breaks the data into an increasingly larger number of filters, from 8 to 32. Each convolutional layer is followed by a maxpooling layer to reduce the dimensionality of the data, and a tanh activation function. The outputs of the first six layers are flattened, and then they are passed through 3 fully connected layers with 128, 256, and 128 neurons with ReLU activation functions, after which the network predicts the EqW, AbW, and EqW sign (which is also used as a classifier for the detection). In training, we used mean square error loss, masking it for EqW and AbW in sources where the line is undetected. Two models are trained, one for all of the broad lines, and one for all of the narrow lines (which refers solely to the adopted window size and not the properties of the lines themselves). Doing so allows the CNN to better develop the ability to recognize the center of the window and to extract its properties. These models, however, cannot be used as a general tool for extracting any lines not listed in Table 1, as the unfamiliar continuum can skew the weights. For example, the model trained on broad lines is generally able to reproduce EqW measurements of the narrow lines and vice versa (with the outputs scaled according to their input pixel scales), but there are some systematic offsets, along with a large number of false positives and false negatives, as well as a significantly larger scatter than when the dedicated model is used.
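The masking of the loss for undetected lines can be illustrated with a small framework-free sketch. The exact normalization used in training is not specified in the text, so the averaging over all unmasked terms below is an assumption, as is the function name.

```python
def masked_mse(pred, target):
    """Mean squared error over (log_eqw, log_abw, sign) triples.

    The sign term always contributes; the EqW and AbW terms contribute
    only where the target line is detected (target sign != 0).
    Assumption: the loss is averaged over all unmasked terms.
    """
    total, count = 0.0, 0
    for (p_eqw, p_abw, p_sign), (t_eqw, t_abw, t_sign) in zip(pred, target):
        total += (p_sign - t_sign) ** 2
        count += 1
        if t_sign != 0:
            total += (p_eqw - t_eqw) ** 2 + (p_abw - t_abw) ** 2
            count += 2
    return total / count
```

With this masking, whatever EqW and AbW values the network outputs for a line labeled as undetected do not affect the loss; only its predicted sign is penalized.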

Testing
The labeled data were split 80:20 into the train and test sets. We evaluate the performance of the models on the test set, and consider a line to be "detected" in a given spectrum if the prediction for its EqW sign is >0.5 or <−0.5. Overall, across the sample, the typical root-mean-square scatter in both EqW and AbW measurements is 0.09 dex. For each transition, we examine the number of true positives (tp), false positives (fp), and false negatives (fn), and we evaluate both the precision (tp/(tp+fp)) and recall (tp/(tp+fn)). They are recorded in Table 1.
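The detection criterion and the two metrics can be written out as a short sketch. It assumes that a prediction counts as a true positive whenever a line is both predicted and labeled as detected, regardless of whether the predicted sign matches; the text does not state how sign mismatches are treated. The function name and toy values are hypothetical.

```python
def detection_metrics(pred_sign, true_sign, threshold=0.5):
    """Count tp/fp/fn for one transition and return precision and recall.

    A line is 'detected' if |predicted sign| > threshold; it is real if
    the manual label has a non-zero sign. Assumption: sign mismatches
    between prediction and label are not penalized here.
    """
    tp = fp = fn = 0
    for p, t in zip(pred_sign, true_sign):
        detected = abs(p) > threshold
        real = t != 0
        if detected and real:
            tp += 1
        elif detected:
            fp += 1
        elif real:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return tp, fp, fn, precision, recall


# Toy example: five spectra for a single transition.
toy_pred = [0.9, -0.8, 0.2, 0.7, 0.1]
toy_true = [1, -1, 1, 0, 0]
```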
The model performs extremely well with strong lines (e.g., Hα, with precision and recall of 98% and 96%, respectively). In Li I, precision is relatively high, at 88%, but its recall is somewhat lower, at 64%; in these cases, the fp and fn sources are typically those that are borderline detections, with EqW < 0.05 Å. Some transitions do have quite low precision and recall. For instance, the He II 4685.7 Å line has 8% precision and 38% recall; however, this transition is extremely rare, recorded for only 29 out of ∼3500 sources in the train set, of which only 8 are in the test sample. Thus, the ratios for this line are heavily affected by small-number statistics. When we evaluate the line on the full set of stars observed by BOSS (Section 3.1.4), we find that when it is seen in absorption in cool stars, it is likely to be confused with other nearby features; in stars with T_eff > 30,000 K, however, the absorption line measurements seem to be more robust, as do the rare cases when it is seen in emission, in particular for the ABYSS targets.

Evaluation
The resulting model is made available in Kounkel et al. (2023c). We apply LineForest to all of the stellar sources observed by BOSS to date, including all of the legacy data observed prior to SDSS-V. However, we note that the legacy SDSS data archive does not provide RV measurements as robust as those of SDSS-V. This is mainly due to poor wavelength calibration, which was only identified at the beginning of the current iteration of the survey. As such, measurements of weak and narrow lines from legacy spectra may be of lesser quality than in other data sets.
All of the lines were chosen to ensure they are detectable within the wavelength range of BOSS. LAMOST, however, has a narrower coverage; thus, we only record the properties of the lines in the range of Pa11 to Hε, inclusive.
We estimate the uncertainties in all of the measurements using a technique established in Olney et al. (2020), through computing several different realizations of a line by scattering the flux randomly by the reported uncertainties, passing each realization through the model, and averaging the predictions. In total, 100 iterations were made, and 1σ errors were computed from the 16th, 50th, and 84th percentiles.
The line measurements are reported only for the sources in which the line is detected confidently in at least 30 of these iterations; this prevents the model from picking up very noisy features. Sources with significant noise are also likely to have very uncertain RV measurements, thus making it difficult to identify weak lines in the first place. For example, 65% of YSOs have σ_RV < 5 km s^-1. On the other hand, among YSOs with a confident Li I detection, 75% have σ_RV < 5 km s^-1, and thus Li I measurements do indeed appear to be suppressed in noisy data.
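The resampling procedure can be sketched as below. Here `model` is a hypothetical callable standing in for the trained network, returning an EqW and a sign for a given flux array; `toy_line_model` is a deliberately crude placeholder and not LineForest. Computing the percentiles only over the iterations with a detection, and returning the percentiles rather than a mean, are assumptions of this sketch.

```python
import random


def percentile(sorted_vals, q):
    """Percentile (0-100) of a pre-sorted list, with linear interpolation."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def monte_carlo_eqw(flux, flux_err, model, n_iter=100, min_detections=30, seed=0):
    """Scatter the flux by its errors n_iter times and summarize the EqW.

    Returns (p16, p50, p84), or None if the line is detected in fewer
    than min_detections realizations.
    """
    rng = random.Random(seed)
    detected = []
    for _ in range(n_iter):
        noisy = [rng.gauss(f, e) for f, e in zip(flux, flux_err)]
        eqw, sign = model(noisy)
        if abs(sign) > 0.5:
            detected.append(eqw)
    if len(detected) < min_detections:
        return None
    detected.sort()
    return tuple(percentile(detected, q) for q in (16, 50, 84))


def toy_line_model(flux):
    """Placeholder model: summed depth below unit continuum on a 1 Å grid."""
    depth = sum(1.0 - f for f in flux)
    return depth, (1.0 if depth > 0.1 else 0.0)
```

With the placeholder model, a clear absorption feature yields three ordered percentiles, while a featureless noisy spectrum yields `None` because too few realizations register a detection.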

YSO Classifier and Sample Definition
Figure 1. An example of spectral lines (Hα and Li I) of a PMS star in BOSS spectra. The red line is centered on the feature, and the blue line is the manually defined continuum over which the line is integrated to measure EqW.
The lines that have been measured with LineForest can be used to improve the identification of PMS stars. In order to utilize all of the features, we built a fully connected neural net using TensorFlow. We include T_eff, log g, all of the EqW and AbW, as well as six fluxes, from Gaia (G, G_BP, and G_RP) and from 2MASS (J, H, and K), for a total of 164 inputs.
The network consists of three layers with 256, 512, and 1024 neurons connected using ReLU, followed by a dropout layer with a rate of 0.5, distilling everything to a layer with a single output with a sigmoid activation function that returns the probability of a source being on the pre-main sequence.
The initial training set consisted of all of the ABYSS sources that have been observed to date, both with BOSS and with LAMOST, along with all of the SDSS-V BOSS sources, as well as a random subset of ∼1.4 million LAMOST field stars. Eighty percent of the sample was used in training, and the remaining 20% was used for testing. Because YSOs are relatively rare in the total training set, accuracy is an imperfect metric. Rather, we used an F1 score (the harmonic mean of the precision and recall) to evaluate the performance of the model.
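To see why accuracy is a poor metric for a rare class, consider the sketch below. The counts are invented purely for illustration and do not come from the actual training set.

```python
def classification_scores(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Invented counts: 1000 YSOs among 100,000 stars, of which the classifier
# recovers only 100 while also flagging 100 field stars.
toy_scores = classification_scores(tp=100, fp=100, fn=900, tn=98900)
```

In this invented case the accuracy is 0.99 even though 90% of the YSOs are missed, whereas the F1 score is only about 0.17, which reflects the poor recovery of the rare class.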
While the targeting from ABYSS is rather comprehensive, it does include a significant fraction of more evolved field stars. Additionally, ABYSS may have missed some YSOs that have nonetheless been serendipitously targeted by other programs. This mislabeling affects the quality of the model. To compensate for this, we reevaluate some of the labels.
First, we excluded ABYSS targets that are found at high galactic latitudes, i.e., |b| > 30°. We similarly excluded sources with RVs larger than what is typically found for the clustered sources at a given distance, sources that have log g < 3.4 or log g > 5.1, cool magnetically inactive stars with very low Hα emission, all of the sources without a Li I measurement, and finally, K dwarfs with insufficiently high Li I absorption. Also, we included all of the sources with Hα emission that is consistent with originating from a protoplanetary disk, as well as K dwarfs with Li I absorption. By training a model using this sample, we were then able to examine the outputs, in particular for sources that the model identified as high-confidence YSOs but were originally labeled as field stars. In that sample, we interactively identified the sources that are consistent with being associated with young clusters and star-forming regions that appeared to stand out as overdensities in the plane of the sky, verifying that they are also clustered in proper motion and parallax phase space. These sources were then relabeled as YSOs, and the model was retrained. This procedure was repeated until the number of "discovered" YSOs was negligibly small.
The final model was trained on a sample of 34,407 YSOs (down from 57,036 ABYSS targets). It achieves an F1 score of 0.633 on the test sample, with precision of 0.752 and recall of 0.543 at a probability cutoff of 50%. A higher probability cutoff improves precision, although we note that, in that case, many of the "false positives" may actually be bona fide YSOs found in areas that are not strongly clustered in our sample. The bulk of the sources that are missing are found at higher T_eff, where they are difficult to recover using traditional techniques as well. Nonetheless, we do note that, somewhat surprisingly, we are able to classify YSOs across the whole T_eff range, and we are able to autonomously recover many of the more distant populations that lack low-mass stars in our sample (e.g., Cygnus X). Although the bulk of the information necessary for the classification is carried by just T_eff, log g, and EqW from Hα and Li I, other features do still improve the recovery of sources at these higher T_eff.
In total, we are able to identify, to date, about 17,900 stars observed with SDSS-V BOSS as YSOs with probability >0.5, of which about 11,100 have a probability >0.8. Similarly, there are 9371 stars in LAMOST DR8 with probability >0.5, of which 6105 stars have probability >0.8. Finally, there are 285 and 225 such stars, respectively, in the legacy SDSS data. Unsurprisingly, most of the sources are concentrated along the galactic plane and the Gould's Belt, with only a few sources found at higher galactic latitudes (Figure 2).

Ages
To examine the evolution of the sample (Section 5), it is necessary first to determine the ages of the stars. For this purpose, we use Sagitta (McBride et al. 2021), a neural net that has two components. It first classifies sources into those that can be identified as belonging to the pre-main sequence based on their photometry (G, BP, RP, J, H, and K), parallax, and average extinction toward a given line of sight. Then, using the same input features, it estimates photometric ages for the individual PMS stars. It has been trained on the average ages within a subcluster in which a given star is found, and at the moment this pipeline offers the most stable performance with respect to T_eff (i.e., producing self-consistent ages for G, K, and M-type stars in a given population), in comparison to the more traditional techniques that rely on theoretical isochrones (Kounkel et al. 2023a).
Nonetheless, Sagitta is only reliable for the stars that are classified as PMS. Higher-mass stars quickly reach the main sequence, and as such, their photometry stops being a reliable indicator of their ages in comparison to the lower-mass stars of the same age. For this reason, in the total sample of stars with spectroscopic YSO probability >0.5, there are a number of sources whose ages must be derived through other means in order to avoid biases in the overall distribution (Figure 3).
To fill in this gap, we only retain ages for those stars with Sagitta-derived PMS probability of >0.5. We then calculate the 3D distance from a lower-probability source to the closest higher-probability source, and assign the age of that closest high-probability source if the separation is <30 pc. We have experimented with different cutoffs, allowing us to observe the effects on the analysis presented later in the paper, in order to achieve an appropriate trade-off in terms of completeness at higher T_eff while maintaining robustness of the sample. Although this separation is considerable, it does incorporate within it a typical uncertainty in the parallax, and it is also permissive for the inclusion of the more diffuse populations. On the other hand, in denser regions, the closest neighbor is usually well under that limit.
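The age substitution can be sketched as a nearest-neighbor lookup. The dictionary keys, the function name, and the brute-force search below are assumptions made for a self-contained illustration; for a catalog of this size a spatial index such as a k-d tree would be more practical.

```python
import math


def substitute_ages(stars, max_sep_pc=30.0, min_prob=0.5):
    """Assign ages using the nearest high-probability PMS neighbor.

    `stars` is a list of dicts with hypothetical keys 'xyz' (3D position
    in pc), 'pms_prob' (Sagitta PMS probability), and 'age'. Stars with
    pms_prob > min_prob keep their own age; others inherit the age of the
    closest such star if it lies within max_sep_pc, else get None.
    """
    anchors = [s for s in stars if s["pms_prob"] > min_prob]
    ages = []
    for s in stars:
        if s["pms_prob"] > min_prob:
            ages.append(s["age"])
            continue
        # Brute-force nearest neighbor among the high-probability sources.
        best = min(anchors, key=lambda a: math.dist(a["xyz"], s["xyz"]),
                   default=None)
        if best is not None and math.dist(best["xyz"], s["xyz"]) < max_sep_pc:
            ages.append(best["age"])
        else:
            ages.append(None)
    return ages
```

Sources left with `None` in this sketch are those with no high-probability neighbor within the adopted separation.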
Such substitution has been done for 26% of the full sample, with the bulk of it dominated by hotter stars. Among sources with T_eff > 5000 K, 74% of sources have their ages substituted; on the other hand, this is the case for only 12% of cooler stars, most of which have age >10 Myr.

White & Basri (2003) have developed a traditional method to select CTTSs based on Hα emission strength as a function of spectral type, employing four flat cuts. The simplicity of its implementation becomes a downside when dealing with a very large sample, as the transition between these cuts becomes very pronounced. Briceño et al. (2019) have improved on this approach by developing a much more continuous selection, but their approach relies strongly on spectral types of the stars. Commonly used transformations between spectral type and T_eff exist (e.g., Pecaut & Mamajek 2013), but because we do have direct measurements of T_eff available, it is preferable to use a selection criterion that does not rely on such a conversion.
After examining the distribution of Hα EqW as a function of T_eff, we notice a clear concentration of sources with weak emission lines (i.e., WTTSs). On the other hand, among those sources with stronger emission lines (i.e., CTTSs), there is a much more significant scatter (Figure 4). Using this overdensity of WTTSs, we define a cut

Results
We present LineForest measurements for 23,903 stars identified as YSO candidates in LAMOST spectra with probability >0.1 in Table 2. Outputs for all of the BOSS spectra (including non-YSOs) will be incorporated in subsequent SDSS data releases.

General Line Properties
We examine the overall differences in the measured properties of the lines between stars identified as young and the significantly more evolved field stars (Figure 5).
As expected, Li I shows an obvious signature of evolution. In our sample, the YSO candidate distribution clearly peaks near 0.45 Å at T eff ∼ 4000 K, while only a trace amount of Li I is observed at this T eff in the field stars, largely within the errors. While there is some variance in the field-star distribution with T eff, it is not significant. On the other hand, in YSOs, Li I abundance significantly decreases toward the cooler end, because these stars become fully convective, resulting in a higher internal temperature that is available for Li I depletion. Similarly, toward the hotter end, the internal temperature of the star itself increases significantly, even though the convective envelope shrinks. The overall shape of the Li I distribution in young stars is consistent with what has been observed in the past (e.g., Jeffries et al. 2014, 2023). We discuss the evolution of Li I in greater detail in Section 5.
Hα is another notable line. It is commonly seen in emission in cool stars, and in absorption in stars with T eff > 6500 K, reaching a maximum EqW at ∼10,000 K. There is a difference in the EqW in early-type stars, with younger stars typically having stronger absorption lines; this difference is most likely driven by different log g distributions (Sizemore L. et al. 2024, submitted). Young high-mass stars are expected to be on the main sequence; on the other hand, because high-mass stars have short lifetimes, older stars rapidly evolve away from it. Among late-type stars, the difference is much more significant. This is apparent not just in CTTSs (which show Hα emission lines far in excess of what is observed in the field), but in WTTSs as well. Evolved M dwarfs in the field can be divided into those that are magnetically inactive (with Hα EqW ∼ 0 Å) and those that are active, where Hα has stronger emission (Newton et al. 2017). Both are present in the SDSS-V sample. Active M dwarfs are usually considered to be relatively young, with ages younger than a few Gyr (West et al. 2008). Here, we are able to observe a significant evolution between these active M dwarfs and WTTSs, with Hα emission lines being 1.5-2 times stronger in WTTSs than in somewhat older field stars. This is an indication of the magnetic activity decreasing in strength as stars age.
Behavior similar to that of Hα is observed in other H lines as well, both in the Balmer and in the Paschen series, though higher-energy lines are less likely to be seen in cooler stars. Similar trends can also be observed in Ca lines: they also show evolution of log g among early-type stars (although whether YSOs have stronger or weaker lines usually depends on the precise type of transition), and they also separate late-type stars based on their age. The Ca triplet usually consists of narrow lines that are seen in emission only in some CTTSs. Although these lines are primarily seen in absorption, low-mass YSOs tend to have weaker Ca triplet lines. This effect is also likely to be driven primarily by magnetic activity filling in the lines (but not to the degree of creating emission lines), as these lines are magnetically sensitive, and the excess flux in them typically correlates with emission in Ca H & K (Martin et al. 2017). However, veiling could also have some influence on these lines. Some differences are observed between the Ca triplet and the Ca H & K lines. The latter two tend to have very broad absorption lines in which a narrow emission component may emerge. This results in a discontinuity at EqW = 0 Å as the pipeline transitions from measuring the full absorption line to only measuring its emission portion. Low-mass YSOs are commonly seen with Ca H & K in emission; this is only rarely the case in the field stars (in which case, they are most likely to be younger than a few 100 Myr; Cunningham et al. 2020).
CTTSs produce emission lines from many different transitions. Often (e.g., in He I or Fe II lines), this occurs in cool stars with T eff < 4500 K that present very strong Hα emission (EqW < −60 Å). Some transitions (e.g., O II or N II) can produce emission lines across all T eff. In early-type stars, these may be a more reliable indicator of accretion than Hα.

Hydrogen Decrements in CTTSs
While CTTSs are defined using the strength of Hα, it is not uncommon to see emission in other H lines as well. Of 6444 CTTSs, 3085 show emission in Hβ, 1966 in Hγ, 1215 in Hδ, 1021 in Hε, etc. When multiple lines are available, because each of them has a different excitation temperature, it is possible to probe the physical conditions such as temperature (Temp) and density (log n) of the emitting gas (i.e., the accretion stream). This was previously done in Campbell et al. (2023) using Brackett lines observed with APOGEE, by comparing the ratios of equivalent widths of these lines to the models from Kwan & Fischer (2011).
We attempted to replicate this experiment using the Balmer lines (Figure 6). In CTTSs, we isolated all of the H lines that appeared to be in emission in each source. We then compared them to the models from Kwan & Fischer (2011). Given that some of the H lines are likely to be optically thick (and thus their EqW would not scale accurately with the abundances), we examined different subsets of these lines, normalizing all EqWs by the strongest line in the subset.
Because of the likely optical thickness of the strongest lines, we discarded Hα and Hβ from the analysis; the higher-order lines, on the other hand, produced a more self-consistent fit. Thus, we limited the sample only to the sources that have Hγ and at least two higher-energy lines to construct a decrement. Similarly, we attempted to examine the decrements in the Paschen series, but while the results were broadly consistent with the Balmer series fit, there was significant uncertainty in the fit. Therefore, the Paschen lines were excluded from the analysis.
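A minimal sketch of such a decrement comparison is given below, assuming a simple chi-square match between normalized EqW ratios and a precomputed grid. The grid values here are placeholders invented for illustration and are not taken from Kwan & Fischer (2011); the line labels (`n5` for Hγ, `n6` for Hδ, `n8`), the ratio uncertainties, and the function names are likewise hypothetical.

```python
def normalize(eqws):
    """Normalize EqWs by the strongest (largest |EqW|) line in the subset."""
    reference = max(eqws, key=lambda line: abs(eqws[line]))
    ratios = {line: w / eqws[reference] for line, w in eqws.items()}
    return ratios, reference

def best_model(ratios, ratio_errors, reference, model_grid):
    """Return the (Temp, log n) grid key with the lowest chi-square."""
    best_key, best_chi2 = None, float("inf")
    for key, model in model_grid.items():
        chi2 = sum(
            ((ratios[line] - model[line]) / ratio_errors[line]) ** 2
            for line in ratios
            if line != reference  # the reference ratio is 1 by construction
        )
        if chi2 < best_chi2:
            best_key, best_chi2 = key, chi2
    return best_key, best_chi2

# Placeholder grid: (Temp in K, log n) -> ratios relative to Hgamma (n_u = 5).
model_grid = {
    (8000, 11.0): {"n5": 1.0, "n6": 0.70, "n8": 0.40},
    (10000, 12.0): {"n5": 1.0, "n6": 0.85, "n8": 0.65},
}

# Emission EqWs are negative in this convention; their ratios are positive.
observed = {"n5": -10.0, "n6": -7.2, "n8": -4.1}
ratio_errors = {"n5": 0.05, "n6": 0.05, "n8": 0.05}

ratios, reference = normalize(observed)
best, chi2 = best_model(ratios, ratio_errors, reference, model_grid)
```

With a real model grid, the shape of the chi-square surface around the minimum would also give a sense of the uncertainty in Temp and log n.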
Through the model comparison, we identified the Temp and log n of the gas. We note that we have not corrected EqWs for veiling or extinction. The results here should therefore be considered preliminary, and they are presented only for the sources with a clearly defined decrement. A fuller estimation of these parameters is deferred to subsequent papers in the series.
We find that the Temp of the accretion shock is typically ∼6000-10,000 K, and that the density is ∼10^11 to 10^11.5 cm^−3 for the sources with the most confident measurements. There is no strong correlation with the properties of the star itself, such as T eff, log g, age, or any other. For comparison, Campbell et al. (2023) find a much larger range in Temp, from 4000 to 16,000 K (with the sources with hotter Temp preferentially being Be stars), and density ranging from ∼10^12 cm^−3 at 5000 K to ∼10^11 cm^−3 at 12,500 K. Over the range of the overlapping Temp, there is relative agreement in the derived log n.

Discussion: Li I Evolution With Age
To assess the reliability of the measurements, in Figure 7 we compare our measurements of Li I with the measurements reported by Binks et al. (2022) from Gaia-ESO spectra. Gaia-ESO spectra have significantly higher resolution (R ∼ 15,000) than BOSS or LAMOST spectra, making these measurements significantly more robust. Indeed, the comparison does show significant scatter, but this scatter is mostly reproduced by our reported uncertainties. Furthermore, there is an excellent correlation between the two sets of measurements, in particular at T eff 4500 K, although Gaia-ESO measurements appear to be systematically larger by ∼0.05 Å, likely due to the differences in the continuum definition.
In young stars, Li I is known to deplete precipitously as they get older. Recently, Jeffries et al. (2023) developed an empirical model of Li I depletion using Gaia-ESO measurements of 52 open clusters with ages ranging from 2 Myr to 6 Gyr. While it serves as a good picture of the overall global evolution of Li I, the sampling at the younger ages is somewhat poor, and no attempt was made to distinguish between clusters younger than 10 Myr. Because our sample has more than an order of magnitude more stars younger than 25 Myr, we reexamine the trend at these ages.
To do this, we separate the sources into 0.1 dex bins both in age and in T eff, and we find the median, as well as the 16th and 84th percentiles of the distribution to check the scatter (Figure 8). We also find the typical uncertainty in the Li I measurement of the sources in the bin (to evaluate the significance of the scatter), as well as the weighted average error for the median itself. Because the external comparison shows our uncertainties to be robust, a sufficiently large statistical sample allows the mean of the distribution to be determined with improved precision, even when individual errors are considerable.
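The binning step can be sketched as follows. This is an illustrative stdlib-only version, assuming each source is given as a hypothetical tuple `(age in yr, T eff in K, Li I EqW in Å)` and that percentiles are linearly interpolated; the actual implementation may differ in both respects.

```python
import math
from collections import defaultdict

def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in 0..100) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def bin_stats(rows, width=0.1):
    """Group sources into bins of `width` dex in log10(age) and log10(Teff).

    Returns {(age_bin, teff_bin): (median, p16, p84, n)}, where the bin
    indices are floor(log10(value) / width).
    """
    bins = defaultdict(list)
    for age, teff, li in rows:
        key = (
            math.floor(math.log10(age) / width),
            math.floor(math.log10(teff) / width),
        )
        bins[key].append(li)
    stats = {}
    for key, vals in bins.items():
        vals.sort()
        stats[key] = (
            percentile(vals, 50),
            percentile(vals, 16),
            percentile(vals, 84),
            len(vals),
        )
    return stats

# Toy input: three sources near 2 Myr and 4000 K, one near 15 Myr.
rows = [
    (2.0e6, 4000, 0.50),
    (2.1e6, 4100, 0.40),
    (2.2e6, 3990, 0.60),
    (1.5e7, 4000, 0.30),
]
stats = bin_stats(rows)
```

Half of the difference between the 84th and 16th percentiles in a bin gives a scatter that can be compared directly with the typical measurement uncertainty in that bin.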
The relation from Jeffries et al. (2023) appears to be somewhat inconsistent in describing our Li I measurements of these young stars. In part, this is due to the systematics in the assumptions of age and T eff, as well as the aforementioned systematic offsets in our determination of Li I, but it can also be attributed to the sparse young sample they used. Instead, we fit a polynomial in t and T (Equation (2)), where t is log10 age in years, T is log10 T eff in K, and the coefficients are listed in Table 3. This relationship is valid for 3200 < T eff < 6000 K, and for ages <30 Myr. We find that, at all ages, at T eff > 4500 K, the scatter is statistically reproducible by the measurement uncertainties, but at cooler T eff, the scatter becomes almost two times larger than the uncertainties (Figure 9). This supports the conclusions of Binks et al. (2022) that, at low T eff, Li I abundance in the chromosphere is influenced by the variability in the spot coverage of stars (as opposed to erroneous age measurements that would lead to an unrealistic age spread in a given cluster).
Comparing Li I at the neighboring age bins, we find that there is a significant depletion from ∼1.5 to 3 Myr, after which it stalls until 10 Myr, and beyond that point the abundances once again continue to decline (Figure 10). This is apparent both in the fitted expression for estimating the average Li I EqW and in the medians computed from the data directly. It is not entirely clear what causes this stalling.

Conclusions
We present a set of measurements of equivalent widths of youth-sensitive lines in optical BOSS and LAMOST spectra, including lines such as Hα and Li I, as well as a number of others. These measurements were taken using a newly developed data-driven pipeline, LineForest, that was trained on manual measurements of EqWs of ∼3500 stars. Although there is a significant emphasis on using these data for characterizing young stars, LineForest is capable of measuring EqWs of these lines in all stellar spectra.
Combining EqWs and AbWs, as well as T eff, log g, and colors, we developed a classifier that identifies stars younger than a few tens of Myr. This classifier is most effective in identifying late-type YSOs; however, it is capable of confirming youth in the bulk of early-type stars, as well as some of the solar-type stars.
We observe a number of differences in the line properties of YSOs and the more evolved field stars. In particular, activity-sensitive lines (such as the lines of Ca and the lines of H) tend to show stronger emission (or weaker absorption) in WTTSs than in magnetically active field dwarfs. This is consistent with the decrease in magnetic activity strength as stars become older. We also observe emission lines that are attributable to accretion across a number of elements, some of which are only seen in late-type stars, while others are prominent in stars of all masses.
We examine the decrement of the H lines in the Balmer series for CTTSs, and use these data to estimate the properties of the accretion stream. While the results are preliminary, due to the further need to correct the lines for processes such as veiling or extinction, we do find the Temp and log n of the streams to be somewhat consistent with previous studies.
Finally, we examine the evolution of Li I as a function of age. We find that Li I abundance decreases strongly in the first couple of Myr, but then it stalls until the age of 10 Myr, after which point it continues to deplete. We also find that late-type stars appear to exhibit significant scatter in Li I EqWs regardless of the age of the star, and that Li I abundances may be strongly influenced by the variability in the magnetic activity.

Figure 3. Top: Kiel diagram of the sample, color coded by the photometrically derived probability of a star being on the PMS using Sagitta (McBride et al. 2021). Bottom: Adopted ages for the stars in the sample; for the sources with PMS probability <0.5, the assumed age is that of the closest high-probability neighbor.

Figure 4. Distribution of Hα as a function of temperature for the YSOs in our sample. The red line shows the adopted separation between CTTSs and WTTSs. The yellow lines are the ranges for the selection from White & Basri (2003).

Figure 5. EqW measurements in BOSS spectra for some of the lines used in this study. The colored background (color coded by the density of points) represents field stars of different evolutionary stages. Black contours are the sources identified as YSOs.

Figure 6. Left: An example of the decrement of EqW of H lines in the Balmer series, normalized by EqW of Hγ (n u = 5). The black line shows the best-fitted model from Kwan & Fischer (2011). Hε is likely contaminated by Ca H. Right: Distribution of the derived Temp and log n models for the sources with cleanly defined decrements in Balmer lines. Points are color coded by the uncertainty in log n. We note that the most confident measurements tend to have log n > 11.

Figure 7. Comparison of Li I measurements reported here to those from Gaia-ESO spectra from Binks et al. (2022), color coded by the T eff of the star. The black line shows the best fit between the two.

Figure 8. Li I distribution as a function of T eff across different age bins. Yellow dots show the raw data, black squares show the median Li I EqW in 0.1 dex T eff bins, and the black error bars show the 16-84 percentile scatter. The gray error bar shows the typical uncertainty in the individual measurements. The red dashed line shows the relationship of Li I as a function of T eff and age from Jeffries et al. (2023), and the blue solid line shows the polynomial fit presented in Equation (2).

Table 1
Set of Lines Used in This Study

Table 3
Fitted Coefficients for Li I Estimation

Figure 9. Ratio between the typical scatter in Li I in each age and T eff bin, and the typical uncertainty in Li I measurements in that bin. The black line shows the fitted average across all age ranges.

Figure 10. Evolution of Li I as a function of age, as characterized by Equation (2).