Quasi-optimal weights: a versatile tool of data analysis

In this first part of the two-part account of the enabling technologies behind a successful completion of the Troitsk-nu-mass experiment, the parameter estimation method of quasi-optimal weights is reviewed. In regard to statistical quality it is on a par with the maximum likelihood method, but it exceeds the latter in analytical transparency, flexibility, scope of applicability and numerical robustness. It also couples perfectly with the optimal jet definition and thus provides a comprehensive framework for data processing in particle physics and beyond.


Introduction
The currently best ν mass bound was published by the Troitsk-ν-mass experiment [1]. This happened after the experiment had been plagued by the so-called "Troitsk anomaly" for about 10 years. The anomaly went away after a reanalysis that was made possible by two enabling technologies: the statistical method of quasi-optimal weights for parameter estimation [2] and the Oberon technologies for software development. The latter are discussed in a companion talk [3]. The present talk reviews the former.
It should be noted that the two technologies per se did not solve any specific physics problem, but provided a flexible and efficient framework to experiment with and implement specific improvements of the analysis in a real-life context under real-life limitations. Everything could in theory have been done with maximum likelihood and Fortran or even C++, but at a cost that proved prohibitive in that context. A proper technique is about minimizing effort while attaining the best result; it is rooted in how well one understands the basic principles of what one works with.

The method
A basic problem of mathematical statistics is, given a parameterized probability distribution $\pi(P; M)$ and a sample of independent events $P_1, \dots, P_N$, to estimate the unknown parameter $M$. The popular least squares method does not provide full control; in fact, this project was prompted in the Fall of 2005 by a doubt, expressed by the leader of the Troitsk-ν-mass experiment, the late Vladimir Lobashev, as to whether the Poisson-distributed background was treated correctly by least squares.
The simple and transparent method of moments is universally regarded as inferior and mostly neglected in data processing applications. Among the few exceptions are [4] and [5]. Both turn out to be special cases of the method of quasi-optimal weights.
The maximum likelihood method, if applicable, yields the best possible estimate, obtained by maximizing the likelihood function:
$$M_{\rm est} = \arg\max_M \sum_i \ln \pi(P_i; M).$$
The estimate saturates the fundamental Fisher-Fréchet-Rao-Cramér bound:
$$\mathrm{Var}\, M_{\rm est} \gtrsim \Bigl[ N \bigl\langle \bigl( \partial_M \ln \pi \bigr)^2 \bigr\rangle \Bigr]^{-1}.$$
Note that the maximum can be found by solving
$$\sum_i \partial_M \ln \pi(P_i; M) = 0.$$
But what if $\pi(P; M)$ is unknown as a formula? This is often the case in high energy physics, where only a Monte Carlo event generator may be available but not an explicit expression for the probability distribution (this is due to a very high dimensionality of the event space). On top of that, HEP data processing involves a heavy dose of chiropractic (event selection, jet algorithms, etc.).
Consider the method of generalized moments. For any weight ("generalized moment") $f(P)$, its mean value is a function of the unknown parameter:
$$\langle f \rangle(M) = \int dP\, f(P)\, \pi(P; M).$$
On the experimental side, one can estimate this mean as follows:
$$\bar f = \frac{1}{N} \sum_i f(P_i).$$
One then estimates the parameter by solving
$$\langle f \rangle(M_{\rm est}) = \bar f.$$
This is easily visualized: the curve $\langle f \rangle(M)$ is intersected with the horizontal line at the level $\bar f$. Note that both error estimates are easily evaluated:
$$\mathrm{Var}\, M_{\rm est} \approx \frac{\mathrm{Var}\, f}{N \bigl[\partial_M \langle f \rangle\bigr]^2}, \qquad \mathrm{Var}\, f = \langle f^2 \rangle - \langle f \rangle^2.$$
Originally, Pearson (1894) dealt with a one-dimensional event space and used the ordinary moments
$$f_n(P) = P^n,$$
which yielded suboptimal estimates compared with ML.
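The whole scheme fits in a few lines of code. The following is a minimal sketch of our own (the exponential density and the trivial weight are assumed purely for illustration, not taken from the talk):

```python
import random
import statistics

# Toy model (assumed for illustration): exponential density
# pi(P; M) = exp(-P/M)/M on [0, inf), with the simple weight f(P) = P,
# whose theoretical mean is <f>(M) = M.
random.seed(1)
M_true = 2.0
events = [random.expovariate(1.0 / M_true) for _ in range(20000)]

f_bar = statistics.fmean(events)   # experimental estimate of <f>
M_est = f_bar                      # solving <f>(M) = f_bar is trivial here

# Error estimate: Var M ~ Var f / (N * (d<f>/dM)^2); here d<f>/dM = 1.
var_M = statistics.variance(events) / len(events)
print(M_est, var_M ** 0.5)
```

For a less convenient weight the equation $\langle f \rangle(M) = \bar f$ would be solved numerically, but the structure of the computation stays the same.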
However, the method is analytically transparent, unlike either of the alternatives, and there are deep mathematical reasons to consider weights. Indeed, an alternative to the set-theoretic notion of a general function is one based on weighted values rather than point values of the function. In this alternative scheme the basic notions are chosen to be functions/mappings (with compositions, morphisms, categories, etc.) rather than (sub)sets (with the corresponding set operations such as intersections, etc.). An accessible review of these matters is given in [6]. It should be emphasized that such a functional view of mathematical analysis is as comprehensive as the set-theoretic one. But the heuristics of the two schemes are entirely different: weighted values are an artificial device in the set-theoretic view, whereas in the functional view they are a natural starting point.
In the conventional view, a function is a correspondence between point values, $x \mapsto g(x)$. This, however, is fully meaningful only for continuous $g(x)$: there is no natural way to define the value at a discontinuity. A more satisfactory way is to define a function via its weighted values
$$\int dx\, f(x)\, g(x),$$
where the $f$ are the so-called test functions, usually chosen to be smooth (or at least continuous). Such a definition allows one to recover the function values at the points of continuity, while the problem of values at discontinuities is bypassed altogether.
In particular, and immediately relevant to approximation methods, instead of comparing point values one now compares weighted/smeared values. This leads directly to the powerful Bubnov-Galerkin method, very widely used in engineering and other numerical applications (see e.g. [7]).
The functional viewpoint singles out the method of generalized moments as a fundamental starting point for any thinking about parameter estimation. So, let us look at it more closely, putting aside the mantra of its inferiority.
One has a simple and explicit expression for the error estimate. Let us minimize it with respect to the weight. The minimum equation yields the optimal weight
$$f_{\rm opt}(P) = \partial_M \ln \pi(P; M).$$
This point turns out to be a true minimum, with Fisher's information as the minimum value. This exercise was performed in [8]. However, the optimal weight depends on $M$, and its actual value $M_0$ is unknown. An iterative procedure then naturally suggests itself: evaluate the weight at the current approximation of $M$, solve the estimating equation for a new $M$, and repeat. Given that the minimum of the variance in the space of weights is quadratic, there is a freedom in the choice of quasi-optimal weights, and the method is rather less demanding in regard to knowledge of the underlying probability density than the ML method.
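The iteration can be made concrete with a toy model of our own (a normal density with known width, assumed here because every step has a closed form); the fixed point coincides with the ML estimate:

```python
import random
import statistics

# Assumed toy model: normal density pi(P; M) with known sigma.
# Optimal weight at the current point M_k: f(P) = (P - M_k) / sigma^2.
# Its theoretical mean is <f>(M) = (M - M_k) / sigma^2, so solving
# <f>(M) = f_bar gives the update M_{k+1} = M_k + sigma^2 * f_bar.
random.seed(3)
sigma, M_true = 1.5, 10.0
events = [random.gauss(M_true, sigma) for _ in range(10000)]

M = 0.0                            # deliberately poor starting point
for _ in range(5):
    f_bar = statistics.fmean((p - M) / sigma**2 for p in events)
    M += sigma**2 * f_bar          # this model is linear in M, so the
                                   # iteration converges in a single step
print(M)                           # coincides with the sample mean (= ML)
```

For a nonlinear model the update is no longer exact, but a few iterations normally suffice, and at the fixed point the estimate coincides with the ML one.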
That the iterative procedure is equivalent to ML is seen from the fact that $\langle f_{\rm opt} \rangle(M) = 0$, so that the estimating equation reduces to the ML equation $\sum_i \partial_M \ln \pi(P_i; M) = 0$. (This assumes that the optimal weight and the averaging are taken at the same $M$.) The iterative procedure is similar to the optimizations involved in both the ML and least squares methods. It appears that the method of quasi-optimal weights reveals and efficiently exploits an analytical structure that has been there all along but remained eclipsed by the glory of maximum likelihood. (One might also say that it fell victim to the personal rivalry between Sir Ronald Fisher and Karl Pearson.)

Example 1
Some special cases of quasi-optimal weighting have been known for a while. The case of linear dependence of the probability distribution on the parameter, $\pi(P) = \pi_0(P) + g\, \pi_1(P)$, with the weight $f(P) = \pi_1(P)/\pi_0(P)$, has been known since 1992 [4]. This has been used in a number of weak-signal searches. It is closely related to the criterion for hypothesis testing based on the ratio $\pi_1(P)/\pi_0(P)$, which is locally most powerful near $g = 0$.
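A sketch of our own for this case (the concrete densities below are assumed purely for illustration): take $\pi_0 = 1$ on $[0,1]$ and $\pi_1(P) = P - 1/2$, so the weight is $f = \pi_1/\pi_0 = P - 1/2$ and $\langle f \rangle(g) = g/12$:

```python
import random
import statistics

# Assumed toy densities on [0,1]: pi0(P) = 1, pi1(P) = P - 1/2, so
# pi(P; g) = 1 + g*(P - 1/2), and the weight f(P) = pi1/pi0 = P - 1/2
# has theoretical mean <f>(g) = g/12.
random.seed(2)
g_true = 0.6

def sample():
    # rejection sampling from pi(P; g_true); 1.3 majorizes the density
    while True:
        p = random.random()
        if 1.3 * random.random() <= 1.0 + g_true * (p - 0.5):
            return p

events = [sample() for _ in range(50000)]
f_bar = statistics.fmean(p - 0.5 for p in events)
g_est = 12.0 * f_bar               # solve <f>(g) = g/12 = f_bar
print(g_est)
```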
Another special case was constructed by Jörg Pretz within the context of the g-2 experiments [5].

Example 2
Consider the Cauchy/Breit-Wigner distribution,
$$\pi(P; M) \propto \frac{1}{(P-M)^2 + \Gamma^2}.$$
It has no mean, yet quasi-optimal weights work just fine to determine $M$: the optimal weight is
$$f_{\rm opt}(P) \propto \frac{P-M}{(P-M)^2 + \Gamma^2}.$$
In this simple example one can see the various options for an approximate choice of the quasi-optimal weight which, it should be emphasized, need not be chosen to coincide exactly with the analytical formula: any shape roughly following the optimal one would yield a valid estimate for $M$.
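A numerical sketch of our own of this freedom (the parameters and the crude weight shape are assumptions for illustration): both the exact optimal shape and a clipped-linear caricature of it recover $M$, because any odd weight centered at $M$ has zero mean under the symmetric density:

```python
import math
import random
import statistics

# Assumed toy setup: Cauchy/Breit-Wigner events with center M_true and
# width Gamma.  For any odd shape w, <w(P - M)> = 0 by symmetry, so we
# solve mean_i w(P_i - M) = 0 for M by bisection.
random.seed(7)
M_true, Gamma = 3.0, 0.5
events = [M_true + Gamma * math.tan(math.pi * (random.random() - 0.5))
          for _ in range(50000)]

def estimate(w, lo=0.0, hi=6.0):
    for _ in range(60):            # bisection on mean_i w(P_i - M) = 0
        mid = 0.5 * (lo + hi)
        if statistics.fmean(w(p - mid) for p in events) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

w_opt = lambda x: x / (x * x + Gamma * Gamma)     # exact optimal shape
w_lin = lambda x: max(-1.0, min(1.0, x / Gamma))  # crude approximation
print(estimate(w_opt), estimate(w_lin))
```

The crude shape costs only a modest increase in variance, in line with the quadratic flatness of the variance minimum noted above.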
The two optimal weights for the two parameters $M$ and $\Gamma$ are as follows:
$$f_M(P) = \frac{2(P-M)}{(P-M)^2 + \Gamma^2}, \qquad f_\Gamma(P) = \frac{1}{\Gamma} - \frac{2\Gamma}{(P-M)^2 + \Gamma^2}.$$

Example 3
Now consider the Poisson distribution with an integer $n$ in place of the real $P$:
$$\pi(n; \mu) = e^{-\mu} \frac{\mu^n}{n!}, \qquad f_{\rm opt}(n) = \partial_\mu \ln \pi(n; \mu) = \frac{n}{\mu} - 1.$$
Using methods devised for the normal distribution to estimate the Poisson parameter $\mu$ amounts to replacing this with the corresponding family of normal-distribution weights. The difference from the purely Poisson weight is indicative of a bias introduced thereby. Numerically, for $\mu \sim 0.15$-$1.5$, the effect is 10-15%, which is large enough to be of concern.
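The size of such a bias is easy to probe numerically. The following sketch of our own contrasts the Poisson-weight estimate with a normal-based least-squares one (the variance guess $\sigma_i^2 = \max(n_i, 1)$ is one common least-squares convention, assumed here; the talk does not specify which normal-based recipe was tested):

```python
import math
import random
import statistics

# Assumed toy comparison: counts n_i ~ Poisson(mu_true) at small mu.
random.seed(11)
mu_true, N = 1.0, 20000

def poisson(mu):
    # Knuth's multiplicative method, adequate for small mu
    limit, k, prod = math.exp(-mu), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

counts = [poisson(mu_true) for _ in range(N)]

# Quasi-optimal Poisson weight f(n) = n/mu - 1 leads to mu_est = mean(n),
# which is unbiased.
mu_poisson = statistics.fmean(counts)

# Least squares with the common variance guess sigma_i^2 = max(n_i, 1)
# underweights large counts and is biased low at small mu.
w = [1.0 / max(n, 1) for n in counts]
mu_ls = sum(wi * n for wi, n in zip(w, counts)) / sum(w)
print(mu_poisson, mu_ls)
```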

Example 4
Consider how continuous weights compare with the event selection cuts widely used in HEP experiments. Consider a one-dimensional event space $[0,1]$ and a probability density linear in the parameter, and compare the variances of the estimates obtained with the linear weight and with a simple cut. The ratio of the two variances is 3, so the ratio of the corresponding sigmas is about 1.7. In other words, replacing the cut with a continuous weight in this case converts a 3σ signal into a 5σ one.

Example 5
Our last example is the observation of the top quark in the all-hadronic channel by D0, as described in [10]. A continuous weight ("an average jet count") was devised empirically and proved to be "particularly powerful" in a multivariate analysis with thirteen variables.
Even if a formula such as

The case of a non-uniform event sample
The method can be extended to the case of non-uniform samples [2].
The optimal weight is then taken per subsample: for an event from the $i$-th subsample with density $\pi_i(P; M)$ one takes $f_i(P) = \partial_M \ln \pi_i(P; M)$, with the explicit formula for the variance given by the sum of the subsamples' Fisher informations,
$$\mathrm{Var}\, M_{\rm est} \approx \Bigl[ \sum_i \bigl\langle \bigl( \partial_M \ln \pi_i \bigr)^2 \bigr\rangle \Bigr]^{-1}.$$
All this is straightforward to code.

Notes on usage in the Troitsk-ν-mass experiment
A number of Monte Carlo tests showed that the method is indeed asymptotically equivalent to ML (a fact that is quite obvious from the analytical formulae of the method).
Programming implementation of the method is quite straightforward, although some flexibility of the implementation language is desirable (the language actually used was Oberon/Component Pascal, a minimalistic, strictly type-safe direct descendant of Pascal; see the companion talk for details [3]).
Numerically, one has to solve a system of equations, which is a simpler task than an optimum search (for a fast Newton-type algorithm, one only needs first derivatives of the mapping involved in the former case, whereas the optimum search requires second derivatives).
The method is less sensitive than, say, MINUIT to the "narrow valleys" situation (although this difficulty cannot be eliminated altogether), as well as to non-smoothness of the functions involved.
One can easily accommodate knowledge about the nature of the probability distributions involved (Poisson etc.).
With poor statistics, the resulting multidimensional systems of equations may fail to have a solution; in practice this was not a problem (the mismatches that were encountered were small enough not to affect the results), but some theoretical research may be called for. Similarly, Student-type correction factors for confidence intervals may be required in the poor-statistics case (this problem, however, is a general one and not limited to quasi-optimal weights; in this regard the present method is neither better nor worse than anything else).
The same conceptual framework earlier yielded the so-called Optimal Jet Definition [11], which was the first to demonstrate an O(N) behavior with respect to the number of particles in the final state [12]. The OJD remains the only jet definition derived from first principles in a systematic scientific fashion, without ad hoc constructions. The fact that it meshes well with quasi-optimal weights bodes well for complicated experimental situations such as the above-mentioned one encountered by D0 in the all-jets channel.
Lastly, the fundamental nature of the argument behind OJD is underscored by the fact that a very similar derivation yielded an efficient multi-dimensional density modelling algorithm [13].
To summarize, the general and flexible method of quasi-optimal weights (together with the closely related optimal jet definition, the general multi-dimensional density modelling algorithm, etc.) offers a rather comprehensive unified framework that handles a wide array of HEP data processing problems in a simple and systematic manner, and helps to streamline the motley of sometimes ad hoc methods currently in use.