Non-parametric projections of the net-income distribution for all U.S. states for the Shared Socioeconomic Pathways

Income distributions are a growing area of interest in the examination of equity impacts brought on by climate change and the responses to it. Such impacts are especially important at subnational levels, but projections of income distributions at these levels are scarce. Here, we project U.S. state-level income distributions for the Shared Socioeconomic Pathways (SSPs). We apply a non-parametric approach, specifically a recently developed principal components algorithm, to generate net income distributions by decile for the 50 U.S. states and the District of Columbia. We produce these projections to 2100 for three SSP scenarios in combination with varying projections of GDP per capita to represent a wide range of possible futures and uncertainties. In generating these scenarios, we also produced tax-adjusted historical income distributions for U.S. states, which we used to validate model performance. Our method thus produces income distributions by decile for each state, reflecting the variability in state income, population, and tax regimes. Our net income projections by decile can be used in both emissions- and impact-related research to understand distributional effects at various income levels and identify economically vulnerable populations.


Introduction
Income is a key driver of consumption, vulnerability, and resilience of populations to external stressors such as climate change. The distribution of income within a region, therefore, plays an important role in identifying potentially vulnerable groups and understanding variations in demand from these groups. Additionally, consumption is a large determinant of future climate outcomes, with increased consumption potentially driving emissions upward and producing more pronounced climate impacts (Van Ruijven et al 2014). For these reasons, there is a need for income distributions to be incorporated into integrated assessment models (IAMs) and impacts, adaptation, and vulnerability (IAV) research to better account for heterogeneity in the population, as well as to maintain consistency with the Shared Socioeconomic Pathways (SSPs) that are widely used in the field of climate research (SI section 1). This work seeks to project income distributions for use in IAM and IAV research to better understand the distributional impacts of climate change.
Previous work on projecting income distributions has predominantly focused on the global or national scale, with few studies covering the sub-national scale. For example, several studies have projected country-level income distributions to examine inequality and income changes in response to climate outcomes (Hertel et al 2010, Hallegatte and Rozenberg 2017, Jafino et al 2020, Narayan et al 2023b). Other studies focus more on the effect of emission mitigation policies on inequality (Campagnolo and Davide 2019, Fujimori et al 2020, Soergel et al 2021), finding that climate policies have the potential to negatively impact lower income groups if no effort is made to provide financial support to these groups. Oswald et al (2020) examine the effect of income level on energy consumption and, consequently, emissions across income groups at national scales, finding that higher income groups have higher energy footprints. Several other country-level metrics were produced in the literature to project measures of income and inequality (Murakami and Yamagata 2019, Rao et al 2019, Murakami et al 2021, CBO 2022, Dubina et al 2022), but not the income distribution itself.
At the sub-national scale, specifically for the U.S., Sampedro et al (2022) model income distribution at the level of U.S. states, while Liu et al (2019) model income inequality at the same scale. Both sources conclude that increased inequality leads to lower emissions in the long run as a smaller percentage of higher income population results in less overall energy use in the United States, though Sampedro et al (2022) only discuss this in the context of emissions from residential energy demand. Rausch et al (2011) explore the distributional impacts of emission policies at the U.S. state level. Wear and Prestemon (2019) produce spatially explicit joint projections of income and population at the U.S. county level consistent with the SSPs, observing spatial convergence in income. However, they focus on income-driven migration rather than income inequality. For some individual states (Kim and Bai 2018, Washington State OFM 2022), only projections of income are available but not the income distribution. U.S.-specific inequality metrics at a sub-national scale, such as the Gini, were constructed by Frank (2009) and Piketty and Saez (2003) using income data from the Internal Revenue Service (IRS), but only for historical years. We have thus identified a gap in the literature: internally consistent long-term projections of income distribution by state, which this work seeks to fill.
In our approach, we build on the methodology recently developed by Narayan et al (2023b) to produce their set of global projections of national income distributions by decile under the SSPs. Narayan et al (2023b) utilized a non-parametric approach, specifically a principal component analysis (PCA) algorithm, to produce their projections. Their method is an improvement upon previous projections of the income distribution that used the Gini coefficient combined with an assumed lognormal distribution. Their national-level model is further explained in the methods section of this paper.
We apply an adjusted version of the PCA-based approach outlined in Narayan et al (2023b) to produce projections for all 50 U.S. states, plus the District of Columbia (DC), annually over the 21st century for a set of alternative scenarios (SSPs). Our method accounts for the redistributive impact of federal and state taxes in the U.S. and produces a net per capita income distribution. We examine the results of our projections using specific state examples, displaying how our projections vary across states, SSPs, and time. We also discuss the limitations and implications of the produced datasets, including opportunities for future work.

Data and methods
In this section, we discuss the data and application of the PCA method for our income distribution projections. We begin by describing the processing and transformation of historical data needed for model validation and initialization. Subsequently, we provide detail on our application of the PCA and the derivation of the income shares by decile. Quintiles and deciles, as referred to hereafter, represent 1/5th and 1/10th of the population, respectively. We conclude with the construction of the absolute values of net income distribution projections across deciles, using GDP per capita as a proxy. The input data used is summarized in table 1. A visual representation of our data and methods can be found in figure 1.

Constructing the historical net-income distribution at the state level
We first required data on income distributions to initialize and validate our model. Due to issues in availability and consistency, data had to be synthesized from a variety of sources. Data are harmonized to a single equivalence scale (for converting between household and individual income), with income being defined as per capita household income (calculated as household income divided by the number of household members).
The PCA method we employ is estimated on global net income data by Narayan et al (2023b). Therefore, net income data at the state level is required to initialize the modified model in this study. The American Community Survey (ACS), a widely used dataset to analyze income distributions at the state level, only provides gross income shares at the state level. There have been attempts to reconstruct income distributions using IRS income data adjusted for taxes (Frank 2009) [...] specifically 2015. Our tax adjustments are described in more detail in SI section 2. We chose only to account for direct income taxes, rather than both direct and indirect taxes, for several reasons. Firstly, direct taxes have a more pronounced redistributive impact (Bastagli et al 2012). Secondly, indirect taxes are specific to municipalities and, therefore, inconsistent across a state. Finally, indirect taxes are based on level of consumption, making them more challenging to estimate as consumption levels can vary widely based on prevalent prices. We show the redistributive impact of taxes across income groups by comparing the gross and the net versions of our income distributions across states historically (SI figure 5).
Compiled historical gross and net income quintile data are used to derive state-level Gini coefficients annually from 2011 to 2015. These historical net income Gini coefficients, combined with our PCA algorithm (as described in the section below), allow us to reconstruct historical net income deciles at the state level (figure 1). The historical net income decile distribution dataset is subsequently used for validating model performance and for the initialization of decile income projections (figure 2). Again, we only use net income data in our PCA method (Narayan et al 2023b), although we produce novel datasets for both gross and net income historical Gini coefficients for user convenience.
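As a rough illustration of how a Gini coefficient can be derived from quintile shares, the sketch below integrates a piecewise-linear Lorenz curve with the trapezoidal rule. This is an assumption made for illustration: the estimator actually used is not specified in the text, the function name `gini_from_shares` is hypothetical, and the example shares are invented rather than taken from the ACS data. Grouped-data estimates of this kind tend to understate the true Gini because they ignore inequality within each group.

```python
from typing import Sequence


def gini_from_shares(shares: Sequence[float]) -> float:
    """Approximate Gini coefficient from income shares of equal-population
    groups (e.g. quintiles), ordered from poorest to richest.

    Assumption: the Lorenz curve is linear within each group, so the area
    under it is a sum of trapezoids.
    """
    if not shares:
        raise ValueError("shares must not be empty")
    total = sum(shares)
    n = len(shares)
    cumulative = 0.0
    area_under_lorenz = 0.0
    for share in shares:
        previous = cumulative
        cumulative += share / total  # normalize so the shares sum to one
        # Trapezoid of width 1/n between two consecutive Lorenz points.
        area_under_lorenz += (previous + cumulative) / (2.0 * n)
    return 1.0 - 2.0 * area_under_lorenz


# Hypothetical net income quintile shares for one state and year.
example_quintiles = [0.04, 0.09, 0.15, 0.23, 0.49]
example_gini = gini_from_shares(example_quintiles)  # approximately 0.416
```

Applying the same function to gross and net quintile shares for a state would give a simple view of how much the tax adjustment lowers measured inequality.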

Description of our projection model
Narayan et al (2023b) utilized a PCA-based method to produce income distributions by decile on a global scale. This method is summarized below:
1. First, they performed a PCA on a global synthesized dataset of income deciles (Narayan et al 2023a) for 171 countries over the period 1968-2018.
2. Through this PCA, they found that 98% of the variation across income deciles across country-years could be explained by two principal components (PC1, PC2). These principal components are ten-point functions.
3. They found that the Gini coefficient was highly correlated with the coefficient of PC1. Labor share of GDP, the lagged value of the 9th decile (income share), and the lagged Palma ratio (10th decile income share divided by the sum of the first four deciles) were found to explain the coefficient of PC2 well.
4. Therefore, they projected the coefficient of PC1 using other projections of the Gini coefficient, and projected the coefficient of PC2 using the labor share of GDP, the lagged value of the 9th decile, and the lagged value of the Palma ratio. These coefficients were then multiplied by the ten-point PCs to derive deciles for each country, year, and SSP.
In utilizing this approach at the state level, we apply the same PCA method to state-level Gini projections. Specifically, decile values at the state level are derived using equation (1):

$$D_{s,t} = a_{s,t}\,\mathrm{PC1} + b_{s,t}\,\mathrm{PC2} \tag{1}$$

where the subscript s is the state and t is the year, a and b are the coefficients of the two principal components (PC1, PC2), and D is a vector of income shares for each of the ten deciles. PC1 and PC2 are ten-point principal components derived using the PCA conducted by Narayan et al (2023b) (see SI section 3 for values of PC1 and PC2). The coefficients of PC1 (a) and PC2 (b) are derived using equations (2) and (3), where the intercept and slope parameters are estimated in Narayan et al (2023b).
In our analysis, the coefficient of the first component (a) is projected on the basis of projected state-level Gini coefficients, as given in equation (2):

$$a_{s,t} = -11.4815 + 29.71708 \times \mathrm{GINI}_{s,t} \tag{2}$$

where the subscript s is the state, t is the year, and GINI is the projected Gini coefficient for a state and year. There are no current projections of the Gini for U.S. states under the SSPs. However, Rao et al (2019) produced projections of national Gini coefficients consistent with the five SSPs that vary in their assumptions about future inequality. We use their Gini projections for the U.S. for SSP2 (middle of the road), SSP3 (high inequality), and SSP5 (low inequality) through 2100 to obtain national-level Gini growth rates. We apply the national growth rates from Rao et al (2019) equally to the state base year Gini coefficients (in 2015) derived in the previous section to produce state-level Gini projections.
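The two steps described above (scaling each state's base-year Gini by the national trajectory, then mapping the Gini to the PC1 coefficient with equation (2)) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the national Gini path is an invented placeholder rather than the Rao et al (2019) data, and "applying national growth rates" is interpreted here as multiplying the state base-year value by the ratio of the national Gini in year t to the national Gini in the base year.

```python
from typing import Dict


def pc1_coefficient(gini: float) -> float:
    """Coefficient of PC1 from a Gini coefficient, per equation (2)."""
    return -11.4815 + 29.71708 * gini


def project_state_gini(
    state_base_gini: float,
    national_gini_path: Dict[int, float],
    base_year: int = 2015,
) -> Dict[int, float]:
    """Scale a state's base-year Gini by cumulative national Gini growth.

    Assumption: every state follows the same proportional change as the
    national Gini relative to the base year.
    """
    national_base = national_gini_path[base_year]
    return {
        year: state_base_gini * national_value / national_base
        for year, national_value in national_gini_path.items()
    }


# Placeholder national path (NOT the Rao et al 2019 values).
national_path = {2015: 0.40, 2050: 0.44, 2100: 0.46}
state_path = project_state_gini(0.45, national_path)
a_path = {year: pc1_coefficient(g) for year, g in state_path.items()}
```

Because the same ratio is applied to every state, differences in projected inequality across states come entirely from differences in their base-year Gini values.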
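To project the coefficient for the second component (b), we use the labor share of GDP at the national level together with lagged state-level values of the Palma ratio and the income share of the 9th decile (these lagged values are recomputed in each time step), as given in equation (3):

$$b_{s,t} = \beta_0 + \beta_1\,\mathrm{LabShareGDP} + \beta_2\,D^{9}_{s,t-1} + \beta_3\,\mathrm{Palma}_{s,t-1} \tag{3}$$

where the subscript s is the state and t is the year, LabShareGDP is the labor share of GDP for the U.S. in 2015, $D^{9}_{s,t-1}$ and $\mathrm{Palma}_{s,t-1}$ are the lagged 9th decile income share and the lagged Palma ratio, and $\beta_0$ to $\beta_3$ denote the intercept and slope parameters estimated in Narayan et al (2023b). In the equation above, labor share of GDP is kept constant across time using the value reported for the U.S. in 2015 from the Penn World Tables (version 10) (Feenstra et al 2015). Ideally, the labor share of GDP within a state would be used, but this data was unavailable at the state level.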
We use equation (1) in combination with projections of coefficients a and b to project income shares for each decile in each state under all five SSPs (SI figure 7). However, we only use projections for three SSPs in the following per capita based analysis (SSP2, SSP3, SSP5) since they are the only ones for which corresponding state-level population projections are available (Jiang et al 2020).
We use all parameter values (PC values, regression coefficients) in the equations described above from Narayan et al (2023b) and did not retrain the model on state-level data. This is because the state-level data is only available for gross income quintiles, not deciles, and for only 5 years (2011-2015). Also, we validated the performance of our model using our reconstructed data on net income quintiles (figure 2) to ensure a reasonable fit at the state level. Figure 2 shows that the model fits historical data reasonably well for all quintiles, despite slight underprediction of the fifth quintile and slight overprediction of the second quintile. This result also supports the adequacy of our use of national-level specifications (i.e., Gini growth rates, labor share of GDP) across all states. Additionally, we compare our model to the standard lognormal approach and find that our PCA-based approach fits the data better, especially for higher quintiles (SI section 5).

State absolute net income distribution projections by decile
To produce projections of absolute income levels in future years, we combined our projections of income distributions with projections of GDP per capita at the state level, using GDP per capita as a proxy for net income. Specifically, we combined the projected income shares with projections of population and GDP per capita for each state. State-level population projections were gathered from Jiang et al (2020), who provide updated, state-level projections for the U.S. relative to the original national SSP population projections. We were limited in this analysis to the SSPs for which they produced projections: SSP2, SSP3, and SSP5. Projected state population growth rates are applied to historical population data (U.S. Census Bureau 2021) to obtain our population projections (see SI figure 10 for the population projections methodology).
Note that there were no previous projections of GDP per capita at the state level for the SSPs. Hence, state GDP per capita projections were synthesized using several sources. First, in SSP2, which we consider to be our baseline scenario, the state GDP per capita projections were sourced from the reference scenario in the Global Change Analysis Model (GCAM)-USA (Binsted et al 2022). These projections were produced using state GDP per capita growth rates from a U.S. subregion-specific energy scenario produced annually by AEO (2019) (shown in SI figure 11). To obtain state GDP per capita projections for SSP3 and SSP5, we scaled our baseline GDP per capita projections based on national-level GDP per capita projections for the SSPs from the SSP Database (Riahi et al 2017).
We calculated sets of scaling factors based on three models used to produce national GDP projections for the SSPs: OECD (Dellink et al 2017), IIASA (Crespo Cuaresma 2017), and PIK (Leimbach et al 2017), and applied these scaling factors to our baseline GDP per capita projections. These models make different assumptions regarding the drivers of economic growth, resulting in different national GDP growth rates and varying outcomes for each SSP. We created separate GDP per capita projections using each model to reflect the variability of economic assumptions available in the data and represent uncertainty more fully. State GDP projections were then produced by combining state population projections with state GDP per capita projections for the respective SSPs.
Once state GDP and population projections were derived, these projections could be combined with the projected income shares produced by our approach. This allowed us to produce projections of state GDP per capita distributions by decile under three SSPs from 2020 to 2100. The projections are further analyzed below.
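One plausible reading of the scaling step is sketched below. The exact definition of the scaling factors is not given in the text, so this is an assumption: each factor is taken to be the ratio of national GDP per capita under the target SSP to national GDP per capita under SSP2, for the same economic model and year. All function names and numbers are hypothetical.

```python
from typing import Dict

Series = Dict[int, float]  # year -> value


def scaling_factors(national_target: Series, national_reference: Series) -> Series:
    """Ratio of national GDP per capita in the target SSP to the reference SSP.

    Assumption: the reference is SSP2 from the same economic model.
    """
    return {
        year: national_target[year] / national_reference[year]
        for year in national_target
    }


def scale_state_gdp_per_capita(state_baseline: Series, factors: Series) -> Series:
    """Apply year-specific scaling factors to a state's baseline (SSP2) path."""
    return {year: state_baseline[year] * factors[year] for year in state_baseline}


def state_gdp(gdp_per_capita: Series, population: Series) -> Series:
    """Total state GDP as GDP per capita times population."""
    return {year: gdp_per_capita[year] * population[year] for year in gdp_per_capita}


# Invented placeholder values, not taken from the SSP Database or GCAM-USA.
national_ssp5 = {2050: 120_000.0}
national_ssp2 = {2050: 100_000.0}
state_ssp2 = {2050: 80_000.0}
state_population = {2050: 5_000_000.0}

factors = scaling_factors(national_ssp5, national_ssp2)
state_ssp5 = scale_state_gdp_per_capita(state_ssp2, factors)
state_ssp5_gdp = state_gdp(state_ssp5, state_population)
```

Repeating the calculation with national series from each of the three economic models would yield one state GDP per capita path per SSP-model pair.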
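The combination of income shares with GDP per capita reduces to simple arithmetic. Because each decile holds one tenth of the population, a decile with income share w has per capita income equal to w × GDP / (population / 10), that is, 10 × w × GDP per capita. The sketch below illustrates this under that equal-population assumption; the function name and the example shares are hypothetical.

```python
from typing import List, Sequence


def decile_income_per_capita(
    shares: Sequence[float], gdp_per_capita: float
) -> List[float]:
    """Per capita income of each decile from income shares and GDP per capita.

    Assumption: each decile contains exactly one tenth of the population,
    so decile income per capita = 10 * share * GDP per capita.
    """
    if len(shares) != 10:
        raise ValueError("expected ten decile shares")
    total = sum(shares)  # renormalize in case shares do not sum exactly to one
    return [10.0 * (share / total) * gdp_per_capita for share in shares]


# Invented decile shares (d1..d10) and GDP per capita for illustration.
example_shares = [0.02, 0.03, 0.05, 0.06, 0.07, 0.09, 0.10, 0.13, 0.17, 0.28]
example_incomes = decile_income_per_capita(example_shares, 60_000.0)
```

By construction, the average of the ten decile values equals the state's GDP per capita, which is a useful consistency check on any projected distribution.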

Results
Our method produces income distributions by decile for each state, reflecting the variability in state income, population, and tax regimes. We see a wide range of absolute income per capita across states for each decile (figures 3 and 4). Looking at the outcomes for midcentury (2050) in our baseline scenario (SSP2), we highlight states of interest (California, Florida, New York, West Virginia, and Wisconsin) alongside DC and the national-level income distribution (figure 4). Outcomes over time and for other states can be found in SI section 9.
As seen in figures 3 and 4, California and New York have some of the highest state GDPs and GDP per capita values in the nation. Both states have among the highest incomes as well, especially for the uppermost deciles (figure 3). However, New York also has the highest Palma ratio in the U.S. across scenarios, indicating a high level of income inequality (figure 8) in spite of high overall incomes. Florida and West Virginia have some of the lowest incomes, especially for the lowermost deciles (figure 3). Wisconsin is middle of the road in terms of state GDP per capita. DC is found to have very high income in every decile compared to the rest of the states, likely because DC is a small urban area and therefore very different in terms of its income distribution.
Understanding the income distribution in terms of both income shares and GDP per capita allows us to compare the relative wealth of income groups (deciles) within a state, as well as compare income groups across states. In figure 5, we see that New York, with its high inequality, has an income share for a middle income group (d4) that is smaller than the d4 value in both West Virginia and Wisconsin, and an upper decile (d10) income share that is larger than the d10 in those states. However, when we compare the d4 values in terms of absolute income, New York has a higher mean GDP per capita, reflecting the differences in state income.
Our income distributions for the three SSPs (SSP2: middle of the road, SSP3: high inequality, SSP5: low inequality) are based on three different models of national income growth (IIASA, OECD, PIK). Users can pick which set of projections fits their needs given the growth assumptions of a particular SSP-model pair. For example, for those interested in maximizing the range of income between the highest- and lowest-income SSPs, we recommend using SSP3 from the PIK model and SSP5 from the OECD model, since SSP3-PIK produces the lowest national income over time and SSP5-OECD produces the highest (SI figure 12). As observed in figure 6, this combination of models provides the greatest difference between the SSPs in a given timestep. It represents uncertainty not only in the future development pathway, as represented by the SSP, but also in the functioning of the economy, as represented by the model. Note that the differences across SSP-model combinations only apply to absolute income projections and do not affect income shares, as the models reflect differences in GDP per capita growth assumptions.
Alternatively, users may wish to represent uncertainty only in the SSP for a given assumption about the model of economic growth, in which case choosing alternative projections across SSPs from the same model would be preferable. We provide projections for all SSP-model combinations to allow users to make their own choices but focus our results in the main text on the SSP3-PIK/SSP5-OECD range (more on the range of outcomes across economic models and use cases can be found in the supplementary information, SI section 8). We see a variety of outcomes for each decile in a state under the different SSPs, as can be observed in figure 7, influenced by both the income growth and the inequality changes depicted in the SSP narratives.
The data we produce can be used to derive other inequality metrics that can provide further information about inequality in a state. The Palma ratio, a metric of extreme inequality, is constructed by dividing the richest decile's share of income (d10) by the sum of the poorest four deciles' shares of income (d1 + d2 + d3 + d4). A higher Palma ratio indicates greater levels of inequality in a state, especially at the ends of the distribution. Since we have income shares for all five SSPs, we were able to derive the Palma ratios for each scenario for all states in any given year (figure 8). Figure 8 shows that SSP3 and SSP4 are substantially more unequal than SSP2, with inequality especially high in New York, Connecticut, Massachusetts, and New Jersey. The difference in inequality between states in our scenarios reflects differences in base year inequality across states that persist over time. This is because the same national (U.S.) growth rate from Rao et al (2019) (differentiated by SSP scenario) is applied to all states when it comes to inequality in future years. Note that there is not much difference in inequality outcomes between SSP2 and SSP5 in the U.S. This is consistent with the SSP narratives, as developed countries like the United States do not see large improvements in inequality under SSP5.
As the results in this section display, our method produces projections quantitatively consistent with the SSP narratives and demonstrates the variance in outcomes across states over time.
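The Palma ratio as defined above can be computed directly from a vector of decile shares. The following is a minimal sketch; the function name `palma_ratio` and the example shares are hypothetical and not drawn from the published dataset.

```python
from typing import Sequence


def palma_ratio(decile_shares: Sequence[float]) -> float:
    """Palma ratio: share of the richest decile (d10) divided by the
    combined share of the poorest four deciles (d1 + d2 + d3 + d4).

    Shares must be ordered from poorest (d1) to richest (d10).
    """
    if len(decile_shares) != 10:
        raise ValueError("expected ten decile shares")
    bottom_four = sum(decile_shares[:4])
    if bottom_four <= 0:
        raise ValueError("bottom four deciles must have a positive share")
    return decile_shares[9] / bottom_four


# Invented decile shares (d1..d10) for illustration only.
example_deciles = [0.02, 0.03, 0.05, 0.06, 0.07, 0.09, 0.10, 0.13, 0.17, 0.28]
example_palma = palma_ratio(example_deciles)  # 0.28 / 0.16, approximately 1.75
```

A perfectly equal distribution (every decile holding 10% of income) gives a Palma ratio of 0.25, so values well above that indicate income concentrated at the top.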

Discussion
Our approach produces income distributions for 50 states and DC that vary across time, states, and SSPs, while being consistent with SSP narratives. These projections were originally developed for use in GCAM-USA, a state-level model of energy-water-land dynamics that exists within the larger framework of the GCAM (JGCRI 2023; Binsted et al 2022). The incorporation of income distributions, such as those produced here, in IAMs allows for the representation of multiple consumers that have heterogeneous consumption patterns and responses. These datasets could be employed in research such as that of Sampedro et al (2022), with our finer scale data (deciles rather than quintiles) providing more information about inequality trends and consumer responses, especially at the tails of the distribution where changes in income share matter most. This could improve the modeling of climate impacts on various consumer groups and the impact of consumption patterns on climate outcomes. We also note that these datasets are being employed in ongoing work examining the effect of decarbonization on residential energy consumption across income deciles in the U.S. Income is closely related to the affordability of residential energy services, influencing the residential energy security faced by consumers across various income brackets. These datasets not only enable the modeling of how these impacts are distributed among different income levels, but also facilitate the analysis of the distributional impacts over time and across different states. In addition, sub-national income distributions provide information on location-based (in this case, state-specific) vulnerabilities (Van Ruijven et al 2014). This information can be used to understand deeper inequality trends, identify economically vulnerable populations, and discuss poverty implications (SI section 10).
We chose to produce multiple sets of absolute income projections using three different SSPs under three different socioeconomic projection models (Crespo Cuaresma 2017, Dellink et al 2017, Leimbach et al 2017) to capture a wide range of alternative futures. Each model considers the same drivers of GDP per capita growth but varies the extent to which each driver impacts growth. Producing a variety of datasets spanning multiple models broadens the applications of our work, as users can select the datasets that best fit their needs (SI section 8).
In addition to our projections, we produce pre- and post-tax historical datasets of state income distribution by quintile. These datasets are novel in their own right and highlight the importance of accounting for taxes when constructing an income distribution. This information could be useful for state-level decision-makers when evaluating the current tax regime or possible future alternatives. Although we only account for direct taxes, the effect of indirect taxes (e.g. on consumption) may be worth exploring in future analyses, data permitting.
We note the projections' current limitations and suggest improvements for future iterations. For example, due to lack of state-level data, some national assumptions had to be applied uniformly at the state level, including for labor share of GDP and rates of change in Gini coefficients. This means that our projections do not fully account for changing patterns of inequality across states or over time. However, it is worth noting that although we use equal growth rates across states, the resulting inequality still differs due to states' different starting inequality levels. The increased availability of state-level data and projections, particularly state-specific Gini growth rates and labor share of GDP, would improve our projections in capturing diverse pathways for each state. Additionally, we assume that states maintain a single tax regime over time and do not account for deviations in tax structure from historical years. Varying tax assumptions could be built into the analysis as a scenario assumption in later research. Moreover, a greater timespan of historical state tax data would increase the robustness of the projections, as currently, we only have four years of data to use for model validation.
We reiterate that the parameters for the PCA method we employ were not re-estimated on state-level data as we did not have the necessary data at the state level (i.e. long time series of decile income distributions, tax adjustments, labor share of GDP, Gini coefficient, etc.). State-level income data that was available was limited to 2011-2015 gross income quintiles, not deciles. If we did have state-level income decile data (along with federal and state tax data) for a larger range of years, we would be able to improve the fit and confidence of the model results at the decile level. Given limited data, we had to assume that our national-level model specifications hold true at the state level. We have ensured that the method provides reasonable distributions for states, but that does not change the fact that the model itself is estimated on cross-country data. We also compared our model results to a lognormal estimation (which is a popular method in the literature) to determine if the PCA provided more reasonable estimates. Future work could take on the substantial task of developing a longer time series of historical state-level income distribution data and using it in the PCA, potentially improving the fit of the model.
Future work could also add the ability to jointly project income and demographic characteristics, which we feel would be a valuable extension. For example, joint projections of age, consumption, and income would allow researchers to better understand the drivers of inequality and to project vulnerable populations according to multiple dimensions. Similarly, joint projections of race/ethnicity and income would allow for better identification of disadvantaged communities in studies of equity related to environmental change. Furthermore, if within-decile distributions could be derived, it would give researchers a more complete picture of income inequality and socioeconomic trends.
Since these projections were created under the SSPs, it is important to note that the SSPs do not include feedbacks from climate policy on the income projections themselves. These projections are meant to be employed as exogenous socioeconomic conditions in IAMs. Sub-national storylines consistent with global SSP narratives would be of interest to account for inter- and intra-state dynamics and provide greater granularity to enhance the utility of projections like these for state stakeholders (Absar and Preston 2015). Sub-national storylines could describe development patterns for different regions of the U.S. that are different from but consistent with each other (i.e. internally consistent across regions). As an example, if the southwest were assumed to have high immigration due to regional economic development and labor demand, other regions should see outmigration and less comparative advantage in the job market. Such storylines would provide a basis for regionally varying assumptions about future demographic, economic, and social factors that affect inequality, and provide a basis in our work for assumptions of Gini coefficients and labor shares of income that vary across states.

Data availability statements
Any code to reproduce results presented here can be made available upon reasonable request. A majority of the code is available on GitHub (https://github.com/JGCRI/pridr/blob/main/Code/GODEEEP_workflow.R).
The data that support the findings of this study are openly available at the following URL/DOI: https://doi.org/10.5281/zenodo.7227128.

Figure 1. Input data and produced output using a PCA-based approach. Blue boxes here represent datasets or exogenous projections. Orange boxes represent projections that were generated as a part of this analysis. The upper box represents processing of historical data and the lower box shows processing for generating future projections. The red star indicates usage of the PCA algorithm developed by Narayan et al (2023b).

Figure 2. Predicted income shares from PCA algorithm vs observed historical income shares by quintile. Each dot is a quintile value in a state in a given year. X axis shows net income quintiles as recalculated by combining ACS gross income data with our tax adjustments. Y axis shows net income predicted (deciles were predicted and aggregated to quintiles for the sake of the comparison).

Figure 3. Histogram of GDP per capita across all states and DC for GCAM-USA baseline scenario (SSP2) in 2050 for two income groups namely decile 1 and decile 10.Histogram shows number of states in a given bin.Bins that contain states of interest are identified.

Figure 4. Income distributions for all 50 states, plus DC and the U.S. national level income distribution in 2050 for GCAM-USA baseline scenario (SSP2).Close up of lower deciles included for better view of data.The black line represents the income distribution for the U.S. as a whole (all states aggregated to the national distribution).

Figure 5. Income distribution in terms of income shares and absolute income for highlighted states in 2050 under GCAM-USA baseline scenario.

Figure 6. GDP per capita distribution for New York in 2050 for SSP3 and SSP5 under various economic growth models.

Figure 7. GDP per capita distribution for New York and West Virginia in 2050 under SSP2 (GCAM-USA baseline), SSP3 (PIK), and SSP5 (OECD).See results for all other states in the SI section 9.

Table 1. Description of all input data used in data transformations and projections. Data is summarized by type, spatial scale, and temporal scale. Projected data is projected for all five SSPs unless noted otherwise.
Labor share of GDP (Feenstra et al 2015): Used in conjunction with state-level lagged value of the Palma ratio and lagged value of the 9th decile's income share to project the coefficient of PC2.
a Only have projections for SSP2, SSP3, and SSP5.
b GDP in SSP3 and SSP5 is produced for multiple economic models (OECD, PIK, and IIASA).