P-index: A Generalized Statistical Aggregation Metric

Data aggregation serves as a crucial technique in data analysis, particularly in the context of data dimensionality reduction. The most common approach involves employing statistical aggregation metrics such as mean, median, and others to characterize a dataset. However, different statistical aggregation metrics may yield disparate results, making the discrimination and selection of suitable metrics for describing data a research topic. Addressing this, the paper proposes a generalized statistical aggregation metric, categorizing several commonly used metrics under specific scenarios. This model offers a novel perspective for choosing among different statistical aggregation metrics and assists researchers in developing a deeper understanding of common statistical aggregation metrics. The introduction of this generalized metric provides a flexible and comprehensive method for considering multiple statistical measures simultaneously, enabling a more accurate grasp of the dataset’s essence and offering reliable insights for decision-making. The study contributes a new viewpoint to the selection and application of statistical aggregation metrics, advancing discussions in multi-objective optimization and data interpretability.


Introduction
Statistical aggregation is a process in data analysis where multiple individual data points are combined or summarized to derive meaningful insights and trends [1][2]. This technique is employed to simplify complex datasets, making them more manageable and interpretable. Statistical aggregation involves the use of various mathematical and statistical methods to condense information while preserving key characteristics of the original data.
A statistical aggregation metric is a quantitative measure used to summarize and express various statistical aspects of a dataset after aggregation [3][4][5]. This metric provides a consolidated view of the data by condensing multiple data points into a single numerical value, offering insights into the overall characteristics or trends of the aggregated information.
Several common approaches exist for assessing statistical metrics. For instance, the median can mitigate the impact of extreme values, while the mean is suitable for Gaussian distributions [6][7]. Clearly, each statistical metric has its strengths and weaknesses. Comparing the characteristics of various metrics provides qualitative criteria that aid the selection of appropriate metrics in different scenarios. However, such qualitative assessments have limited scope. Specifically, we know that the median is unaffected by extreme values while the mean is influenced by them, but this qualitative analysis, though informative, does not quantify how sensitive a metric is to extreme values. Therefore, this paper quantifies the sensitivity of statistical metrics to deviations from the data values, resulting in a parameterized metric known as the p-index. This quantification expands the range of available statistical metrics from a finite set to an infinite parameterized family, thereby increasing the scenarios in which these metrics apply.
The definition and mathematical model of the p-index are described in Section 2. On this basis, we prove that four common statistical metrics (mode, median, mean, and midrange) are equivalent to the 0-index, 1-index, 2-index, and ∞-index, respectively, the last being the limit as $p$ approaches infinity. Subsequently, an analysis of the performance of the p-index and a discussion of its application are presented. Finally, the conclusion and prospects for future work are outlined in Section 3.

P-index
This section abstracts the process of "selecting statistical metrics" into a mathematical model, transforming the issue of metric selection into an optimization problem. Furthermore, by constraining the form of this optimization problem, the definition of the p-index is derived. The mode, median, mean, and midrange are then proven to be special cases of this index. Finally, an analysis and discussion of this metric are conducted.

Modeling and Definition
We start our modeling by formalizing an assertion about data aggregation choices that is usually stated in natural language: "In scenarios where one needs to avoid being influenced by extreme values, the median is more suitable than the mean." We can translate this into mathematical language as follows. Let the original data before aggregation be $X = \{x_1, x_2, \ldots, x_n\}$, let $m_1$ denote the median and $m_2$ the mean, and let $S(m; X)$ denote the scenario in which the aggregation metric under the condition of "not desiring the influence of extreme values" is $m$. Using the partial order relation $\succ$ to represent the notion of "more suitable", we have $S(m_1; X) \succ S(m_2; X)$.
Assuming that the fit of the index to the scenario is quantifiable, i.e., that $S(m; X)$ returns a real number and $\succ$ is the usual $>$ on the real numbers, the index selection problem can be modeled as

$$m^* = \arg\max_{m} S(m; X). \tag{1}$$

Let $S^*$ be the maximum value of $S(m; X)$ and define the loss $L(m; X) = S^* - S(m; X)$; then we have

$$m^* = \arg\min_{m} L(m; X). \tag{2}$$

If there exist $m_1$ and $m_2$ such that $|m_1 - x| < |m_2 - x|$ for every $x \in X$, then it is reasonable to regard $m_1$ as a more appropriate aggregation index than $m_2$. This condition leads to the following two assertions, which we state without proof. First, a reasonable index should lie between the minimum and maximum values of the data $X$. Second, for any $x_i$, $L(m; X)$ should be an increasing function of $|m - x_i|$.

Define a monotonic function of $|m - x|$ as

$$f_p(|m - x|) = |m - x|^p, \tag{3}$$

where $p \ge 0$. Based on this function, define the p-loss function as

$$L_p(m; X) = \sum_{i=1}^{n} |m - x_i|^p. \tag{4}$$

Substituting the p-loss function into Equation (2), we obtain the definition of the p-index:

$$m_p = \arg\min_{m} \sum_{i=1}^{n} |m - x_i|^p. \tag{5}$$

Solving Equation (5) is often difficult, especially when $n$ is large or $p$ is not an integer. However, there are a few special cases where the minimizer of Equation (4) is simple and happens to correspond to a common statistical aggregation metric; these are developed in the next subsection.
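The definition can also be explored numerically. The following is a minimal sketch, assuming the p-loss is the sum of $|m - x_i|^p$ with the convention $0^0 = 0$; the helper names `p_loss` and `p_index` are hypothetical and chosen only for this illustration. It uses a ternary search for $p \ge 1$ (where the loss is convex) and a scan over the data points for $0 < p < 1$ (where the loss is concave between neighboring data points).

```python
from collections import Counter


def p_loss(m, data, p):
    """Sum of |m - x|**p over the data, using the convention 0**0 = 0."""
    return sum(abs(m - x) ** p for x in data if x != m)


def p_index(data, p, iterations=200):
    """Approximate minimizer of p_loss over m (hypothetical helper, a sketch)."""
    if p == 0:
        # The loss counts the values different from m, so a mode minimizes it.
        return Counter(data).most_common(1)[0][0]
    if p < 1:
        # The loss is concave between neighboring data points,
        # so its minimum is attained at one of the data points.
        return min(set(data), key=lambda m: p_loss(m, data, p))
    # For p >= 1 the loss is convex: ternary search on [min, max].
    lo, hi = min(data), max(data)
    for _ in range(iterations):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if p_loss(a, data, p) < p_loss(b, data, p):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2


sample = [1, 2, 2, 3, 10]  # illustrative data, not from the paper
for p in (0, 0.5, 1, 2, 3):
    print(p, p_index(sample, p))
```

A fixed iteration count is used instead of a tolerance so that the loop always terminates, whatever the magnitude of the data.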

A Few Special Cases
This subsection presents four special values of $p$ for which Equation (5) is relatively easy to solve.

Case for $p = 0$.
When $p = 0$, with the convention $0^0 = 0$, Equation (4) returns the number of values in $X$ that are not equal to $m$, so minimizing it is equivalent to finding the value that occurs most often in $X$. That is, the 0-index of a set of data is the mode.

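Case for $p = 1$.
When $p = 1$, Equation (4) becomes $\sum_{i=1}^{n} |m - x_i|$, which is convex and piecewise linear in $m$. Wherever $m$ differs from every $x_i$, its derivative equals the number of data points below $m$ minus the number above $m$, so the minimum is attained where these two counts balance, that is, at the median. Hence the 1-index of a set of data is its median.

Case for $p = 2$.
When $p = 2$, Equation (4) becomes $\sum_{i=1}^{n} (m - x_i)^2$. Setting its derivative $2 \sum_{i=1}^{n} (m - x_i)$ to zero gives $m = \frac{1}{n} \sum_{i=1}^{n} x_i$, and since the second derivative $2n$ is positive, this is the minimum. This proves that the 2-index of a set of data is its mean value.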
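As a small worked check of the median and mean cases, take the illustrative data set $X = \{1, 2, 6\}$ (chosen for this example only), whose median is 2 and whose mean is 3, and compare the loss $\sum_i |m - x_i|^p$ at these two candidates:

```latex
\begin{aligned}
p = 1:&\quad \sum_i |2 - x_i| = 1 + 0 + 4 = 5 \;<\; \sum_i |3 - x_i| = 2 + 1 + 3 = 6,\\
p = 2:&\quad \sum_i (3 - x_i)^2 = 4 + 1 + 9 = 14 \;<\; \sum_i (2 - x_i)^2 = 1 + 0 + 16 = 17.
\end{aligned}
```

The median gives the smaller loss for $p = 1$, and the mean gives the smaller loss for $p = 2$, as the two cases predict.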

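Case for $p \to \infty$.
To handle the case where $p$ tends to infinity, we first introduce a lemma that will be used later.

Lemma 1. Let $d_1, d_2, \ldots, d_n$ be non-negative real numbers, let $D$ be the maximum of them, and let $p > 0$. Then

$$\lim_{p \to \infty} \left( \sum_{i=1}^{n} d_i^p \right)^{1/p} = D.$$

Proof. If $D = 0$ the claim is trivial, so assume $D > 0$. Dividing the left-hand side by $D$ and moving this factor into each term of the sum, we get

$$\lim_{p \to \infty} \left( \sum_{i=1}^{n} \left( \frac{d_i}{D} \right)^p \right)^{1/p}.$$

Since $D$ is the maximum, a finite number of the ratios $d_i / D$ equal 1 (and their $p$-th powers remain 1), while the others are less than 1 (and their $p$-th powers tend to 0). The sum therefore stays between 1 and $n$, and thus

$$\lim_{p \to \infty} \left( \sum_{i=1}^{n} \left( \frac{d_i}{D} \right)^p \right)^{1/p} = 1,$$

which gives the claim after multiplying by $D$.

On the basis of Lemma 1, we can prove the following theorem.

Theorem 1. Let $a$ and $b$ be the minimum and maximum values of $X$, respectively. Provided that not all elements of $X$ are equal, we have

$$\lim_{p \to \infty} \arg\min_{m} \sum_{i=1}^{n} |m - x_i|^p = \frac{a + b}{2}. \tag{6}$$

Proof. Minimizing the sum is the same as minimizing its $1/p$-th power, and by Lemma 1 that power tends to $\max_i |m - x_i|$. Hence, to prove that Equation (6) holds, it is only necessary to prove that

$$\arg\min_{m} \max_{1 \le i \le n} |m - x_i| = \frac{a + b}{2}. \tag{7}$$

For any $m$, the largest deviation is attained at one of the two extremes, so $\max_i |m - x_i| = \max(|m - a|, |m - b|)$, which is smallest when $m - a = b - m$, that is, when $m = \frac{a + b}{2}$. In summary, Equation (7) is proved.

Theorem 1 covers the case where the elements of $X$ are not all equal. If instead all elements of $X$ are equal, then $a = b = x_1$, and so $m = x_1 = \frac{a + b}{2}$ also holds. This shows that the p-index equals the midrange as $p$ tends to infinity.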
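The limit argument can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the function names `p_norm` and `max_deviation`, the sample data, and the grid resolution are all choices made here. It first shows the $p$-th root of a sum of $p$-th powers approaching the largest term, as in Lemma 1, and then checks on a grid that the point minimizing the largest deviation is the midrange.

```python
def p_norm(values, p):
    """(sum of v**p) ** (1/p), scaled by the maximum to avoid overflow."""
    big = max(values)
    if big == 0:
        return 0.0
    return big * sum((v / big) ** p for v in values) ** (1.0 / p)


def max_deviation(m, data):
    """Largest absolute deviation of the data from the candidate m."""
    return max(abs(m - x) for x in data)


data = [1, 2, 2, 3, 10]                    # illustrative data
midrange = (min(data) + max(data)) / 2     # (1 + 10) / 2 = 5.5

# Lemma 1: for a fixed m (here m = 4, an arbitrary choice), the p-norm
# of the deviations approaches the largest deviation as p grows.
deviations = [abs(4 - x) for x in data]
for p in (1, 4, 16, 64, 256):
    print(p, p_norm(deviations, p), max(deviations))

# Minimax step: search a fine grid on [min, max] for the point
# with the smallest largest deviation and compare it with the midrange.
lo, hi = min(data), max(data)
grid = [lo + k * (hi - lo) / 9000 for k in range(9001)]
best = min(grid, key=lambda m: max_deviation(m, data))
print(best, midrange)
```

Dividing by the maximum before raising to the power $p$ mirrors the scaling step in the proof and keeps the intermediate values within floating-point range.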
The above four special cases are thus shown to correspond to the mode, the median, the mean, and the midrange, respectively. This result motivates the p-index: it explains these special cases from at least one angle and gives a basis for justifying each of them in a given scenario (provided the scenario can be represented by the corresponding loss function). Moreover, it provides a deeper mathematical understanding as well as a wider choice of indexes, which we discuss in the next subsection.

Analysis and Discussion
This section has presented the p-index and proved that the mode, median, mean, and midrange are the 0-index, 1-index, 2-index, and ∞-index, respectively. These results can be analyzed and discussed both theoretically and practically.
First, on the theoretical side, the p-index provides a mathematical explanation of the rationality of the mode, median, mean, and midrange by proving that they are optimal solutions under the corresponding loss functions. In addition, through these functions, we can clearly sort out the relationship between these metrics and their meanings. Specifically, from Equation (3), it can be concluded that the larger $p$ is, the greater the impact of data that deviate far from the central metric; that is, the value of $p$ represents the extent to which the corresponding metric is affected by deviation. The mode is one extreme and is unaffected by the size of deviations: neither a very large change in the maximum value nor the addition of two distinct large numbers will generally affect the mode. The midrange is the other extreme: it depends only on the two values that produce the largest deviations and has nothing to do with the non-extreme values. The median and mean lie somewhere in between.
Second, on the practical side, we can discuss the application of the p-index in two cases. Ideally, if the loss function $L$ can be obtained or fitted experimentally, then the corresponding index can be determined directly. Under the assumption that the loss function conforms to Equation (3), we can determine a reasonable $p$ by fitting the data. In real scenarios, it may be difficult to quantitatively estimate the effect of an indicator, but the p-index can still serve as a reference that places several common indicators on one scale and supports general suggestions. For example, if one wants an indicator that is more sensitive to deviations than the mean, one can set $p = 3$; if one wants an indicator between the median and the mean, one can set $p = 1.5$. This greatly increases the range of options for the indicator.
To summarize, p-indexes have mathematical support in theory and corresponding application scenarios in practice, so they form a family of indexes worth further study and wider use.
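The suggested choices of 1.5 and 3 can be tried on a small sample. This is a hedged sketch, assuming the p-loss is the sum of $|m - x_i|^p$: for $p > 1$ the derivative of that loss is continuous and increasing in $m$, so its root can be found by bisection. The function name `p_index_bisect` and the right-skewed sample are assumptions made for this illustration.

```python
from statistics import mean, median


def p_index_bisect(data, p, iterations=100):
    """Minimizer of sum |m - x|**p for p > 1, via bisection on the derivative."""
    def slope(m):
        # Derivative of the loss with respect to m, up to the constant factor p.
        return sum((1 if m > x else -1) * abs(m - x) ** (p - 1) for x in data)

    lo, hi = min(data), max(data)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if slope(mid) > 0:
            hi = mid      # the minimizer lies to the left of mid
        else:
            lo = mid      # the minimizer lies to the right of mid
    return (lo + hi) / 2


data = [1, 2, 3, 4, 100]  # illustrative sample with one large outlier
print("median   ", median(data))
print("1.5-index", p_index_bisect(data, 1.5))
print("mean     ", mean(data))
print("3-index  ", p_index_bisect(data, 3))
print("midrange ", (min(data) + max(data)) / 2)
```

For this particular skewed sample the 1.5-index falls between the median and the mean, and the 3-index falls between the mean and the midrange. For symmetric data all of these values coincide, so the ordering should not be read as a general guarantee.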

Conclusion
This work introduces a generalized statistical aggregation metric, the p-index, and demonstrates that four common statistical aggregation metrics (the mode, median, mean, and midrange) are special cases of the p-index.