
PHOTOMETRIC REDSHIFTS FOR QUASARS IN MULTI-BAND SURVEYS

M. Brescia, S. Cavuoti, R. D'Abrusco, G. Longo, and A. Mercurio

Published 2013 July 17. © 2013. The American Astronomical Society. All rights reserved.

Citation: M. Brescia et al. 2013 ApJ 772 140. DOI: 10.1088/0004-637X/772/2/140

ABSTRACT

The Multi Layer Perceptron with Quasi Newton Algorithm (MLPQNA) is a machine learning method that can be used to cope with regression and classification problems on complex and massive data sets. In this paper, we give a formal description of the method and present the results of its application to the evaluation of photometric redshifts for quasars. The data set used for the experiment was obtained by merging four different surveys (Sloan Digital Sky Survey, GALEX, UKIDSS, and WISE), thus covering a wide range of wavelengths from the UV to the mid-infrared. The method is able (1) to achieve a very high accuracy, (2) to drastically reduce the number of outliers and catastrophic objects, and (3) to discriminate among parameters (or features) on the basis of their significance, so that the number of features used for training and analysis can be optimized in order to reduce both the computational demands and the effects of degeneracy. The best experiment, which makes use of a selected combination of parameters drawn from the four surveys, leads, in terms of Δznorm (i.e., (zspec − zphot)/(1 + zspec)), to an average of Δznorm = 0.004, a standard deviation of σ = 0.069, and a median absolute deviation, MAD = 0.02, over the whole redshift range (i.e., zspec ⩽ 3.6) defined by the four-survey cross-matched spectroscopic sample. The fraction of catastrophic outliers, i.e., of objects with photo-z deviating by more than 2σ from the spectroscopic value, is <3%, leading to σ = 0.035 after their removal, over the same redshift range. The method is made available to the community through the DAMEWARE Web application.


1. INTRODUCTION

Photometric redshifts (hereafter photo-z) provide an estimate of the redshift of sources obtained using photometry instead of spectroscopy. They are in fact driven (1) by the shape of the broadband continuum of the object's spectral emission and (2) by a limited number of strong spectral features (i.e., the 4000 Å break, the Lyα forest, and the Lyman limit), which are still recognizable after the integration of the spectral energy distribution (SED) sampled by the filters' transmission functions.

At the cost of lower accuracy, photo-z offer several advantages with respect to their spectroscopic counterparts: (1) being derived from intermediate/broadband imaging, photo-z are much more effective in terms of observing time; (2) they may allow one to probe objects much fainter than the spectroscopic flux limit; and (3) under specific conditions, they allow one to correct some biases, such as those encountered at high redshift where, as it has been noted (Fernandez-Soto et al. 2001), spectroscopy is pushed to its limits both by the low signal-to-noise ratio (S/N) in the spectra and by the fact that in many cases, even when a good S/N is achieved, the lack of features in the observed spectral range may undermine the estimation of a trustworthy redshift (Lanzetta et al. 1998).

The latter aspect becomes crucial when photometric redshift methods are applied to quasars (QSOs) and, in particular, to the construction and characterization of the large, complete samples which are required by modern cosmology. In fact, quasar samples have always been, and still are, constructed either by compiling lists of more or less serendipitous discoveries obtained with different techniques and selection criteria (Veron-Cetty & Veron 2000), or via a two-step process where the first step consists of identifying QSO candidates from multi-wavelength surveys, and the second requires the spectroscopic validation of the candidates. In practice, due to the large amount of observing time required by spectroscopy, the latter step is usually optimized by applying the spectroscopic validation procedure to just a more or less significant subsample of the candidates and then extrapolating the resulting statistics to the whole sample. Modern surveys are usually so deep and extensive that the number of candidates rapidly becomes too large to be handled with this approach. On the other hand, modern multi-wavelength digital surveys also provide such a wealth of information (multi-band high-accuracy photometry) that it becomes feasible to approximate the SED of objects over a quite large range of redshifts (Richards et al. 2001a, 2001b; Budavari et al. 2001; Wolf et al. 2004), thus minimizing the need for spectroscopic follow-up.

In the last few years, it has in fact been demonstrated that, given accurate enough photometry and a significant wavelength coverage, it is possible to obtain samples of photometrically selected quasars matching the low contamination and high completeness (D'Abrusco et al. 2009; Bovy et al. 2012) required by many fields of modern cosmology. The relevance of these photometric samples will keep increasing in the near future, when the new generation of deeper and more accurate surveys will allow us to access larger and more complete samples of QSOs. These photometric samples are in fact already being used for a variety of applications such as the measurement of the integrated Sachs–Wolfe effect (Giannantonio et al. 2008), the cosmic magnification bias (Scranton et al. 2005), and the clustering of quasars on large (Myers et al. 2006) and small (Hennawi et al. 2006) scales, to name just a few. Since both candidate selection and photometric redshift estimates are performed on the same data (colors in many bands), it is also apparent that for the same samples, photometric data alone should carry enough information to characterize the SED in an almost univocal way, and therefore also to derive accurate estimates of photometric redshifts (D'Abrusco et al. 2009; Laurino et al. 2011; Bovy et al. 2012).

It goes without saying that the utility of the photometric samples goes hand in hand with the development of photo-z methods capable of providing accurate enough estimates of the redshifts.

In this paper, we use a new empirical method, named Multi Layer Perceptron with Quasi Newton Algorithm (MLPQNA), and apply it to the evaluation of photometric redshifts of quasars. In Section 2, we discuss the data sets used for the experiments, and in Section 3 we present both a detailed description of the MLPQNA method and the statistical indicators used throughout the paper. We wish to stress that the lack of a common agreement on such indicators is among the main obstacles in comparing the performances of different methods. In Section 4, we describe the experiments performed in order to select the best combination of input parameters, bands, and network topology. The results of these experiments are summarized and discussed in Section 5, where we also present the final performances of the best experiments. Finally, we compare our results with those available in the literature and draw some general conclusions.

A short Appendix provides the reader with the math behind the Quasi Newton Algorithm (QNA).

2. THE DATA SET

The sample of quasars used in the experiments described in this paper is based on the spectroscopically selected quasars from the Sloan Digital Sky Survey (SDSS) DR7 database (table Star of the SDSS database). According to the spectroscopic classification index (index SP or specClass) provided in the SDSS-DR7 release (Schneider et al. 2010), we selected quasars for which a reliable measure of the spectroscopic redshifts (with zConf > 0.90) is available.

We then cross-matched the SDSS sample quasars identified as point sources with clean measured photometry in all filters (ugriz), with the latest versions of the data sets from Galaxy Evolution Explorer (GALEX; Martin et al. 2005), UKIDSS (Lawrence et al. 2007), and WISE (Wright et al. 2010). These three surveys observed large fractions of the sky in the ultraviolet, near-infrared, and middle-infrared spectral intervals, respectively. After the cross-matching, we obtained a series of multi-band catalogs, defined as follows.

SDSS (DR7) (Aihara et al. 2011) has observed ∼1.4 × 10^4 deg^2 of the sky in five bands (ugriz) covering the [3551, 8931] Å wavelength range. Photometric SDSS observations reach the limiting magnitude of 22.2 in the r band (95% completeness for point sources; Abazajian et al. 2009).

GALEX (DR6/7) (Martin et al. 2005) is a two-band survey (nuv, fuv, for near and far ultraviolet, respectively) covering the [1300, 3000] Å wavelength interval. The GALEX photometric survey has observed the whole sky down to its near-ultraviolet limiting magnitude.

UKIDSS (DR9) (Lawrence et al. 2007) has been designed to be the SDSS infrared counterpart and covers ∼7000 deg^2 of the sky in the YJHK near-infrared bands, covering the ∼0.9–2.4 μm spectral range down to the limiting magnitude K = 18.3. The Large Area Survey has imaged ∼4000 deg^2 (overlapping with the SDSS), with the additional Y band down to a limiting magnitude of 20.5.

The WISE mission (Wright et al. 2010) has observed the entire sky in the mid-infrared spectral interval at 3.4, 4.6, 12, and 22 μm, with angular resolutions of 6.1, 6.4, 6.5, and 12.0 arcsec in the four bands, achieving 5σ point source sensitivities of 0.08, 0.11, 1, and 6 mJy in unconfused regions on the ecliptic, respectively. The astrometric accuracy of WISE is ∼0.50, 0.26, 0.26, and 1.4 arcsec for the four WISE bands, respectively.

The transmission curves of all filters related to the four surveys are shown in Figure 1. All these surveys present a large common overlap region and overall good astrometry with comparable astrometric accuracy. In order to cross-match the catalogs, we used a maximum radius r = 1.5 arcsec to associate the optical quasars with counterparts in each of the three catalogs. Afterward, we rejected all sources with one or more missing data in any of their photometric parameters. In this context, the term missing data refers to undefined numerical values corresponding to either undetected sources or contaminated magnitude measurements. This last step is crucial in empirical methods, since the presence of missing data might affect their generalization capabilities (Marlin 2008).
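As an illustration, the cross-match and missing-data rejection described above can be sketched with astropy and pandas. This is a minimal sketch, not the pipeline actually used: the column names (ra, dec) and the sentinel values marking undefined magnitudes are assumptions.

```python
import numpy as np
import pandas as pd
from astropy import units as u
from astropy.coordinates import SkyCoord

def cross_match(sdss, other, radius_arcsec=1.5, sentinels=(-9999.0, 9999.0)):
    """Match each SDSS quasar to its nearest counterpart in `other`
    within `radius_arcsec` (1.5 arcsec in the text), then drop any
    source with missing photometry in any band."""
    c1 = SkyCoord(ra=sdss['ra'].values * u.deg, dec=sdss['dec'].values * u.deg)
    c2 = SkyCoord(ra=other['ra'].values * u.deg, dec=other['dec'].values * u.deg)
    idx, sep, _ = c1.match_to_catalog_sky(c2)
    good = sep < radius_arcsec * u.arcsec
    matched = pd.concat(
        [sdss[good].reset_index(drop=True),
         other.iloc[idx[good]].reset_index(drop=True).add_prefix('x_')],
        axis=1)
    # Undefined values flag undetected or contaminated measurements.
    return matched.replace(list(sentinels), np.nan).dropna()
```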


Figure 1. Transmission curves for all filters in the four surveys considered.


The resulting numbers of objects in the data sets used for the experiments are the following:

  1. SDSS: ∼1.1 × 10^5;
  2. SDSS ∩ GALEX: ∼4.5 × 10^4;
  3. SDSS ∩ UKIDSS: ∼3.1 × 10^4;
  4. SDSS ∩ GALEX ∩ UKIDSS: ∼1.5 × 10^4;
  5. SDSS ∩ GALEX ∩ UKIDSS ∩ WISE: ∼1.4 × 10^4.

An additional data set was produced by decimating the final four-survey cross-matched catalog. This data set was used to perform the preliminary feature selection or pruning phase (see Section 4.2) and consisted of ∼3.8 × 10^3 objects, each observed in 15 bands (4 UKIDSS, 2 GALEX, 5 SDSS, and 4 WISE) and with accurate spectroscopic redshift estimates. The decimation was needed to reduce the computational time required to perform the large number of experiments described in what follows. For some bands there were multiple measurements (i.e., magnitudes measured according to different definitions), and therefore we were left with a total of 43 different features.

Finally, when producing training and test sets, we made sure that they had compatible spectroscopic redshift distributions (see Figure 2).


Figure 2. Histograms of the spectroscopic redshift distribution in the five survey cross-matched samples as derived from the SDSS spectroscopic data. (a) SDSS, (b) SDSS+GALEX, (c) SDSS+UKIDSS, (d) SDSS+GALEX+UKIDSS, and (e) SDSS+GALEX+UKIDSS+WISE. The gray dotted line is the training sample. The black line is the test sample.


3. THE METHOD

This section is dedicated to the description of the machine learning method used for the experiments. All mathematical details are reported in the Appendix.

3.1. Multi Layer Perceptron

From a technical point of view, the MLPQNA method is a multi layer perceptron (MLP; Bishop 2006) neural network (NN) trained by a learning rule based on the QNA; in other words, and as synthesized in the acronym, MLPQNA differs from more traditional MLP implementations in the way the optimal solution of the regression problem is found. In previous papers, most of the characteristics of the method have been described in the contexts of both classification (Brescia et al. 2012) and regression (Cavuoti et al. 2012b).

According to Bishop (2006), feed-forward NNs (in their various implementations) provide a general framework for representing nonlinear functional mappings between a set of input variables (also called features) and a set of output variables (the targets). The training of an NN can in fact be seen as the search for the function which minimizes the errors of the predicted values with respect to the true values available for a small but significant subsample of objects in the same parameter space (PS). This subset is also called the training set or knowledge base (KB). The final performances of a specific NN depend on many factors, such as the topology, the way the minimum of the error function is searched and found, the way errors are computed, as well as the intrinsic quality of the training data.

The formal description of a feed-forward NN with two computational layers is given in Equation (1):

y_k(x; w) = g( Σ_{j=0}^{M} w_kj^(2) g( Σ_{i=0}^{d} w_ji^(1) x_i ) ),     (1)

where the bias terms are absorbed into the sums as the i = 0 and j = 0 contributions.

Equation (1) can be better understood by using a graph like the one shown in Figure 3. The input layer (x_i) is made of a number of neurons (also known as perceptrons) equal to the number of input variables (d); the output layer, on the other hand, will have as many neurons as the output variables (k).


Figure 3. Scheme of a Multi Layer Perceptron general architecture for two input variables, one hidden layer with three neurons and one output value.


In the general case, the network may have an arbitrary number of hidden layers (in the depicted case there is just one hidden layer with three neurons), each of which can be formed by an arbitrary number of neurons (M). In a fully connected feed-forward network, each node of a layer is connected to all the nodes in the adjacent layers. Each connection is represented by an adaptive weight that can be regarded as the strength of the synaptic connection between neurons k and j, while the response of each perceptron to the inputs is represented by a nonlinear function g, referred to as the activation function. All the above characteristics, the topology of the network and the weight matrix of its connections, define a specific implementation and are usually called a model.
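As a concrete illustration of Equation (1) and of the architecture in Figure 3 (d = 2 inputs, one hidden layer with M = 3 neurons, k = 1 output), a minimal numpy sketch of the feed-forward mapping follows; the choice of tanh as the activation function g and the identity output are assumptions, not necessarily those of the DAMEWARE implementation.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2, g=np.tanh):
    """Feed-forward pass of a two-computational-layer MLP:
    hidden response g(W1 x + b1), then linear output W2 h + b2."""
    h = g(W1 @ x + b1)   # hidden layer activations (M values)
    return W2 @ h + b2   # k output values (the photo-z for k = 1)

rng = np.random.default_rng(0)
d, M, k = 2, 3, 1                          # the Figure 3 topology
W1, b1 = rng.normal(size=(M, d)), np.zeros(M)
W2, b2 = rng.normal(size=(k, M)), np.zeros(k)
print(mlp_forward(np.array([0.3, -1.2]), W1, b1, W2, b2))
```

Training then amounts to adjusting (W1, b1, W2, b2), i.e., the weight matrix of the model, so as to minimize the error over the training set.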

The model, however, is only part of the story. In fact, in order to find the model that best fits the data in a specific problem, one has to provide the network with a set of examples, i.e., objects for which the final output is known by independent means. These data, already defined as the training set or KB, are used by the network to find the optimal model.

In our implementation, we chose the QNA as the learning rule; it differs from the Newton Algorithm in how the Hessian of the error function is computed. Newtonian models are variable metric methods used to find local maxima and minima of functions (Davidon 1968) and, in the case of MLPs, they can be used to find the stationary (i.e., zero gradient) point of the learning function. A complete mathematical description of the MLP with the QNA model is reported in the Appendix.

The model has been made available to the community through the DAta Mining & Exploration Web Application REsource (DAMEWARE; Cavuoti et al. 2012a).

3.2. The Implementation of MLPQNA

In this work, we use our implementation of the QNA based on the limited-memory BFGS (L-BFGS; Byrd et al. 1994), where BFGS is the acronym composed of the names of the four inventors (Broyden 1970; Fletcher 1970; Goldfarb 1970; Shanno 1970).

To summarize, the algorithm for MLP with QNA is the following. Let us consider a generic MLP with w(t) being the weight vector at time (t).

  1. Initialize all weights w(0) with small random values (typically normalized in [−1, 1]), set the constant error tolerance ε, and set t = 0;
  2. present the whole training set to the network and calculate E(w(t)), the error function for the current weight configuration;
  3. if t = 0, then d(t) = −∇E(t);
  4. otherwise d(t) = −∇E(t − 1) + Ap + Bν, where p = w(t + 1) − w(t) and ν = g(t + 1) − g(t);
  5. calculate w(t + 1) = w(t) − αd(t), where α is obtained by a line search equation (see Equation (A6) in the Appendix);
  6. calculate A and B for the next iteration, as reported in Equation (A19);
  7. if E(w(t + 1)) > ε, then set t = t + 1 and go to (2); otherwise STOP.
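The loop above translates almost line by line into code. The following is a schematic, hedged sketch: E, grad_E, line_search, and update_AB stand in for the error function, its gradient, the line search of Equation (A6), and the A, B update of Equation (A19), none of which are spelled out here, and the sign conventions follow the usual descent form.

```python
import numpy as np

def train_qna(w, E, grad_E, line_search, update_AB, eps=1e-4, max_it=8000):
    """Schematic QNA training loop following steps 1-7 above.
    E(w): error over the training set; grad_E(w): its gradient;
    line_search(E, w, d): step length along d (Eq. (A6));
    update_AB(p, v, g): the scalars of Eq. (A19)."""
    w_prev, g_prev = None, None
    for t in range(max_it):
        g = grad_E(w)                       # step 2
        if t == 0:
            d = -g                          # step 3
        else:
            p, v = w - w_prev, g - g_prev   # step 4
            A, B = update_AB(p, v, g)       # step 6 (Eq. (A19))
            d = -g + A * p + B * v
        alpha = line_search(E, w, d)        # Eq. (A6)
        w_prev, g_prev = w, g
        w = w + alpha * d                   # step 5 (descent convention)
        if E(w) <= eps:                     # step 7
            return w
    return w
```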

As is known, all line search methods, which search for the minimum error by exploring the error function surface, are likely to get stuck in a local minimum, and many solutions to this problem have been proposed (Floudas & Jongen 2005). In order to optimize the convergence of the gradient descent analysis (GDA; see the Appendix), Newton's method uses the information on the second-order derivatives. By having information on the second derivatives, QNA is more effective in avoiding local minima of the error function and more accurate in following the trend of the error function, thus revealing a natural capability to find the absolute minimum error of the optimization problem (Shanno 1990).

In the L-BFGS version of the algorithm, in the case of high dimensionality (i.e., input data with many parameters), the amount of memory required to store the full Hessian becomes too large, along with the machine time required to process it. Therefore, instead of using the complete series of gradient values to generate the Hessian, a smaller number of recent values can be used to approximate it.
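The limited-memory idea can be illustrated with the standard L-BFGS two-loop recursion, which builds the product of the approximate inverse Hessian with the gradient from only the m most recent pairs (p, ν), never forming the matrix itself. This is a generic sketch of the textbook algorithm, not the DAMEWARE code.

```python
import numpy as np

def lbfgs_direction(g, pairs):
    """Two-loop recursion: approximate -H^{-1} g from the stored
    pairs (p, v), with p = w(t+1) - w(t) and v = g(t+1) - g(t),
    ordered from oldest to newest."""
    q = g.copy()
    cache = []
    for p, v in reversed(pairs):            # newest to oldest
        rho = 1.0 / (v @ p)
        a = rho * (p @ q)
        cache.append((a, rho, p, v))
        q -= a * v
    if pairs:                               # scale by an initial Hessian guess
        p, v = pairs[-1]
        q *= (p @ v) / (v @ v)
    for a, rho, p, v in reversed(cache):    # oldest to newest
        b = rho * (v @ q)
        q += (a - b) * p
    return -q                               # descent direction
```

With m of order 10, the memory cost drops from |w|^2 to roughly 2m|w|, which is what makes the method viable for networks with many weights.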

Note that even if the convergence, measured in number of iterations, slows down, the overall performance may increase. This statement might seem paradoxical but, although the convergence is measured by the number of iterations, the performance depends on the number of processor time units spent calculating the result.

Related to the computational cost, there is also the strategy adopted in terms of stopping criteria for the method. As is known, the process of adjusting the weights based on the gradients is repeated until a minimum is reached. In practice, one has to decide the stopping condition of the algorithm. Among the possible criteria, the algorithm could be terminated when (1) the Hessian approximation error becomes sufficiently small (by definition, the gradient will be zero at a minimum), (2) a fixed maximum number of iterations is reached, or (3) a cross-validation criterion is met.

The cross validation can be used to monitor the generalization performance during training and to terminate the algorithm when there is no more improvement. Statistically significant results are obtained by trying multiple independent data partitions and then averaging the performances. There are several variants of cross validation methods (Sylvain & Celisse 2010). In particular, we have chosen the k-fold cross validation, which is particularly suited in the presence of a scarcity of known data samples (Geisser 1975). The mechanism, of which the leave-one-out method is the special case k = N, is quite simple: the training set of N samples is divided into k subsets (k > 1), the model is trained on k − 1 subsets and validated by testing it on the left-out subset, and the procedure is iterated, leaving out a different subset each time, with the mean-squared error averaged over all cycles.
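For reference, the k-fold loop just described can be written in a few lines; a minimal sketch using scikit-learn's KFold, where model is any regressor exposing fit and predict (the MLPQNA itself is only available through DAMEWARE).

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_mse(model, X, y, k=10, seed=0):
    """Train on k-1 folds, validate on the left-out fold, and
    average the mean-squared error over the k cycles."""
    errors = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True,
                                    random_state=seed).split(X):
        model.fit(X[train_idx], y[train_idx])
        residuals = model.predict(X[val_idx]) - y[val_idx]
        errors.append(np.mean(residuals ** 2))
    return np.mean(errors)
```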

Regarding the MLP topology, a significant contribution came from a seminal paper by Bengio & LeCun (2007). In fact, they re-analyzed the implications of the Haykin pseudotheorem (Haykin 1998), proving that complex problems in which the mapping function is highly nonlinear and the local density of data in the PS is very variable are better matched by deep networks with more than one hidden computational layer.

3.3. Statistical Indicators

In order to evaluate and reciprocally compare the experiments described in the next section, we adopted the following definitions:

bias = mean(x),  σ = standard deviation of x,  MAD = median(|x|),  NMAD = 1.48 × median(|x|),  rms = sqrt(mean(x^2)),

where MAD is the median absolute deviation, NMAD is the normalized MAD, and rms is the root mean square. The term x in all of the above expressions may be either Δz, defined as

Δz = zspec − zphot,

or the normalized residuals Δznorm, defined as

Δznorm = (zspec − zphot)/(1 + zspec).
4. THE EXPERIMENTS

Our approach is based on machine learning methods, and therefore it needs to be as automatic as possible in order to optimize the decisional support to the user (in this case the astronomer). Therefore, the results of the individual experiments as well as their comparison with others have to be evaluated in a consistent and objective manner through a homogeneous set of statistical indicators.

In what follows, we shall discuss the general experiment work flow and the outcome of the experiment phases.

4.1. The Knowledge Base and Model Setup

For machine-learning-supervised methods, it is common practice to use the available KB to obtain at least three disjoint subsets for every experiment: one (the training set) for training purposes, i.e., to train the method so that it acquires the hidden correlation among the input features which is needed to perform the regression; a second one (the validation set) to check the training, in particular against loss of generalization capabilities (a phenomenon also known as overfitting); and a third one (the test set) to evaluate the overall performances of the model. As a rule of thumb for machine learning methods, these sets should be populated with, respectively, 60%, 20%, and 20% of the objects in the KB (Kearns 1996). In our case, however, we reduced the training+validation fraction from 80% to 60%, driven by past experience of the very accurate regression capabilities of the model even with smaller KBs (Brescia et al. 2012; Cavuoti et al. 2012b); this implicitly gave us the possibility to verify the prediction performance on a larger test set, as well as a faster execution of the training phase. Furthermore, in order to ensure a proper coverage of the KB in the PS, the objects were divided among the three data sets by random extraction, and this process was iterated several times to minimize the possible biases induced by fluctuations in the coverage of the PS, namely, small differences in the redshift distributions of the training and test samples used in the experiments.

The first two criteria used to decide the stopping condition of the algorithm, as mentioned at the end of Section 3.2, are mainly sensitive to the choice of specific parameters and may lead to poor results if the parameters are improperly set. The cross validation does not suffer from such a drawback; it can avoid overfitting the data and is able to improve the generalization performance of the model. However, when compared to the traditional training procedures, the cross validation is much more computationally expensive. Therefore, by exploiting the cross validation technique (see Section 3.2), training and validation were indeed performed together using ∼60% of the objects as a training + validation set, and the remaining ∼40% as a test set.

The automatized process of the cross validation was done by performing 10 different training runs with the following procedure: (1) the training/validation set was split into 10 random subsets, each one composed of 10% of the data set; (2) at each training run, 90% of the data set was used for training and the excluded 10% for validation.

As remarked in Section 3.2, the k-fold cross validation is able to avoid overfitting on the training set (Bishop 2006) with an increase of the execution time estimable around k − 1 times the total number of runs (Cavuoti et al. 2012b).

In terms of the internal parameter setup of the MLPQNA, we need to consider the following topological parameters.

  1. Input layer: a variable number of neurons, corresponding to the pruned number of survey parameters used in each experiment, up to a maximum of 43 nodes (all available features).
  2. First hidden layer: a variable number of hidden neurons, equal to 2N + 1 as a rule of thumb, where N is the number of input neurons (features in the data set).
  3. Second hidden layer: a variable number of hidden neurons, ranging from 0 (to ignore the second layer) to N − 1.
  4. Output layer: one neuron, returning the reconstructed photo-z value.

For the QNA learning rule, we heuristically fixed the following values as the best parameters for the final experiments:

  1. step: 0.0001 (one of the two stopping criteria: the algorithm stops if the approximation error step size is less than this value; a step value of zero means using the parameter MaxIt as the unique stopping criterion);
  2. res: 40 (number of restarts of the Hessian approximation from random positions, performed at each iteration);
  3. dec: 0.1 (regularization factor for weight decay: the term dec × ||network weights||^2 is added to the error function, where network weights is the vector of all weights in the network, whose size depends directly on the total number of neurons; when properly chosen, this greatly improves the generalization performances of the network);
  4. MaxIt: 8000 (maximum number of iterations of the Hessian approximation; if zero, the step parameter is used as the stopping criterion);
  5. CV (k): 10 (k-fold cross validation, with k = 10);
  6. error evaluation: mean square error (between target and network output).

Using these parameters, we obtained the statistical results reported in Section 4.4.

4.2. Selection of Features

As is known, supervised machine learning models are powerful methods for learning the hidden correlation between input and output features from training data. Of course, their generalization and prediction capabilities strongly depend on the intrinsic quality of data (S/N), the level of correlation among different features, and the amount of missing data present in the data set (Ripley 1996). It is obvious that some, possibly many, of the 43 parameters listed in Table 1 may not be independent and that their number needs to be reduced in order to speed up the computation (which scales badly with the number of features). This is a common problem in data mining and there is a wide body of literature on how to optimize the selection of features that are most relevant for a given purpose (Lindeberg 1998; Guyon & Elisseeff 2003, 2006; Brescia 2012). This process is usually called feature selection or pruning, and consists of finding a compromise between the number of features (and therefore the computational time) and the required accuracy of the final results. In order to do so, we extracted from the main catalog several subsets containing different groups of variables (features). Each one of these subsets was then analyzed separately in specific runs of the method (runs that in the data mining community are usually called experiments), in order to allow for comparison and evaluation. We wish to stress that our main concern was not only to disentangle which bands carry the most information but also, for a given band, which type of measurements (e.g., point-spread function (PSF), Petrosian, or isophotal magnitude) are more effective.

Table 1. Summary of the Data Extracted from the Databases of the Four Surveys and Merged to Form Our Final Catalog

Survey Bands Name of Feature Synthetic Description
GALEX nuv, fuv mag, mag_iso Near and Far UV total and isophotal mags
    mag_Aper_1 mag_Aper_2 mag_Aper_3 phot. through 3, 4.5, and 7.5 arcsec apertures
    mag_auto and kron_radius magnitudes and Kron radius in units of A or B
SDSS u, g, r, i, z psfMag PSF fitting magnitude in the u, g, r, i, z bands
UKIDSS Y, J, H, K PsfMag PSF fitting magnitude in the Y, J, H, K bands
    AperMag3, AperMag4, AperMag6 aperture photometry through 2, 2.8, and 5.7 arcsec
      circular apertures in each band
    HallMag, PetroMag Calibrated magnitude within circular
      aperture r_hall and Petrosian magnitude
      in the Y, J, H, K bands
WISE W1, W2, W3, W4 W1mpro, W2mpro, W3mpro, W4mpro W1: 3.4 μm and 6.1 arcsec angular resolution;
      W2: 4.6 μm and 6.4 arcsec angular resolution;
      W3: 12 μm and 6.5 arcsec angular resolution;
      W4: 22 μm and 12 arcsec angular resolution.
      Magnitudes measured with profile-fitting photometry,
      or the 95% confidence brightness upper limit if the flux
      measurement has S/N < 2
SDSS ... zspec Spectroscopic redshift

Notes. Even though most names of the parameters are self-explanatory, we remind the reader that the various psfMag are magnitudes derived by integrating fluxes over the best-fitting point-spread function. The aperture sizes refer to the radii.


We performed a series of regression experiments to evaluate the performances obtained by the pruning of photometric quantities on the small data set described in Section 2. The pruning experiments consisted of several combinations of surveys and their features:

  1. a full feature experiment, to be used as a benchmark for all the other experiments;
  2. some service experiments, used to select the best combination of input features in order to eliminate redundancies in the flux measurements (i.e., Petrosian magnitudes against isophotal magnitudes);
  3. three-survey experiments for all possible combinations of three (out of four) surveys;
  4. two-survey experiments with all possible combinations of two (out of four) surveys;
  5. single-survey experiments.
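The combinations enumerated above translate into a simple loop over survey subsets. A schematic sketch follows; the feature groupings are illustrative (they do not enumerate all 43 features of Table 1), and run_experiment stands in for the whole train/test pipeline.

```python
from itertools import combinations

# Illustrative per-survey feature groups, not the full 43-feature list.
SURVEY_FEATURES = {
    'GALEX':  ['nuv_mag_iso', 'fuv_mag_iso'],
    'SDSS':   ['psfMag_u', 'psfMag_g', 'psfMag_r', 'psfMag_i', 'psfMag_z'],
    'UKIDSS': ['HallMag_Y', 'HallMag_J', 'HallMag_H', 'HallMag_K'],
    'WISE':   ['W1mpro', 'W2mpro', 'W3mpro', 'W4mpro'],
}

def pruning_experiments(run_experiment):
    """Run one regression experiment per survey combination
    (four, three, two, and one survey) and collect the statistics;
    run_experiment(features) returns, e.g., (bias, sigma)."""
    results = {}
    surveys = list(SURVEY_FEATURES)
    for n in range(len(surveys), 0, -1):
        for combo in combinations(surveys, n):
            features = sum((SURVEY_FEATURES[s] for s in combo), [])
            results[combo] = run_experiment(features)
    return results
```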

The output of the experiments consisted of lists of photometric redshift estimates for all objects in the KB. All pruning experiments were performed using ∼3000 objects in the training set and ∼800 in the test set. In Table 2, we list the outcome of the experiments for the feature selection. Both bias(Δz) and σ(Δz) were computed using the objects in the test set alone. As can be seen, among the various types of magnitudes available for GALEX and UKIDSS, the best combination is obtained using the isophotal magnitudes for GALEX and the calibrated magnitudes (HallMag) for UKIDSS.

Table 2. Experiments for the Feature Selection Phase

GALEX SDSS UKIDSS WISE Bias(Δz) σ(Δz)
Service experiments
X X X X 0.0033 0.174
X1, 2 X X6 X −0.0001 0.152
X3 X X6 X −0.0016 0.165
X1 X X6 X 0.0054 0.151
X2 X X6 X −0.0026 0.151
X4, 5 X X6 X −0.0008 0.152
X1, 2, 3 X X6 X 0.0041 0.163
X2, 3 X X6 X −0.0033 0.155
    X6, 7   −0.0091 0.299
    X7   0.0111 0.465
    X6   −0.0081 0.294
Four survey experiment
X2 X X6 X −0.0026 0.151
Three survey experiment
X2 X X6   −0.0046 0.152
X2 X   X 0.0025 0.162
  X X6 X −0.0032 0.179
X2   X6 X 0.0110 0.203
Two survey experiment
    X6 X 0.0045 0.236
X2     X 0.0175 0.288
  X X6   −0.0027 0.210
  X   X −0.0039 0.197
X2 X     −0.0055 0.240
X2   X6   0.0133 0.238
One survey experiment
      X 0.0165 0.297
  X     −0.0162 0.338
X1, 2       0.0550 0.419
    X6   −0.0081 0.294

Notes. Columns 1–4: surveys used for the experiment, where the superscript index indicates the magnitudes used: (1) mag; (2) mag_iso; (3) magnitudes through 3, 4.5, and 7.5 arcsec apertures (mag_Aper_1, 2, and 3); (4) mag_auto; (5) kron_radius; (6) HallMag; (7) PetroMag. A cross in a column means that the survey corresponding to that column was used for the experiment.


Therefore, at the end of the pruning phase, the best combination of features turned out to be the five SDSS psfMag, the two isophotal magnitudes of GALEX, the four HallMag for UKIDSS, and the four magnitudes for WISE.

4.3. Magnitudes versus Colors

Once the most significant features had been identified, we had to check which types of flux combinations were more effective, in terms of magnitudes or related colors. Experiments were performed on all five cross-matched data sets listed in Section 2.

As could be expected, the optimal combination turned out to be always the mixed one, i.e., the one including colors and one reference magnitude for each of the included surveys (r for SDSS, nuv for GALEX, K for UKIDSS, and W4 for WISE). From the data mining point of view, this is rather surprising since the amount of information should not change by applying linear combinations between features. However, from the physical point of view, this can be easily understood by noticing that even though colors are derived as a subtraction of magnitudes, the content of information is quite different, since an ordering relationship is implicitly assumed, thus increasing the amount of information in the final output (gradients instead of fluxes). The additional reference magnitude instead removes the degeneracy in the luminosity class for a specific galaxy type.
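The "mixed" representation is straightforward to build: adjacent-band colors plus one reference magnitude per survey. A hedged sketch for the SDSS block follows; the band ordering and the choice of r as reference follow the text, while the array layout is an assumption.

```python
import numpy as np

def mixed_features(mags, bands, reference):
    """Colors as adjacent-band differences (gradients of the SED)
    plus one reference magnitude to break the luminosity degeneracy.
    `mags` is an (N, len(bands)) array of magnitudes."""
    colors = mags[:, :-1] - mags[:, 1:]            # e.g., u-g, g-r, r-i, i-z
    ref = mags[:, [bands.index(reference)]]        # e.g., the r magnitude
    return np.hstack([colors, ref])

# Usage: 10 fake ugriz sources, giving 4 colors + r magnitude each
sdss = 17.0 + 5.0 * np.random.rand(10, 5)
X_sdss = mixed_features(sdss, list('ugriz'), 'r')  # shape (10, 5)
```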

4.4. MLPQNA Network Topology

The final check was about the hierarchical complexity of the network in terms of the number of internal layers, whether shallow or deep according to the definitions in Bengio & LeCun (2007), where deep is referred to as a feed-forward network with more than one hidden layer. The above-quoted cross-matched data sets were therefore processed through both a three-layer (input + hidden + output) and a four-layer (input + two hidden layers + output) network. In all cases, the four-layer network performed significantly better, thus confirming the performance enhancement with deep networks in the case of particularly complex nonlinear regression cases, i.e., in the case of highly multi-variate distributions of the input PS.

The experiments with the best results have been obtained using a four-layer network trained on the mixed (colors + reference magnitudes) data sets, and their statistics are reported in Tables 3–6.

Table 3. Comparison Among the Performances of the Different References

Exp Bias(Δz) σ(Δz) MAD(Δz) rms(Δz)
SDSS
MLPQNA 0.007 0.25 0.102 0.26
Bovy et al. ... 0.46 ... ...
Laurino et al. 0.210 0.28 0.110 0.35
Ball et al. ... 0.35 ... ...
Richards et al. ... 0.52 ... ...
SDSS + GALEX
MLPQNA 0.003 0.21 0.060 0.22
Bovy et al. ... 0.26 ... ...
Laurino et al. 0.13 0.21 0.061 0.25
Ball et al. ... 0.23 ... ...
Richards et al. ... 0.37 ... ...
SDSS + UKIDSS
MLPQNA 0.001 0.25 0.066 0.26
Bovy et al. ... 0.28 ... ...
SDSS + GALEX + UKIDSS
MLPQNA 0.0009 0.18 0.043 0.19
Bovy et al. ... 0.21 ... ...
SDSS + GALEX + UKIDSS + WISE
MLPQNA 0.002 0.15 0.040 0.15

Notes. MLPQNA is related to our experiments, based on a four-layer network, trained on the mixed (colors + reference magnitudes) data sets. In some cases, the comparison references are not reported due to the missing statistics. Column 1: reference; Columns 2–5, respectively: bias, standard deviation, MAD, and rms, calculated on Δz = zspec − zphot and related to the test sets. For the definition of the parameters and for discussion see the text.


Table 4. Comparison Among the Performances of the Different References

Exp Bias(Δznorm) σ(Δznorm) MAD(Δznorm) rms(Δznorm) NMAD(Δznorm)
SDSS
MLPQNA 0.032 0.15 0.039 0.17 0.058
Laurino et al. 0.095 0.16 0.041 0.19 ...
Ball et al. 0.095 0.18 ... ... ...
Richards et al. 0.115 0.28 ... ... ...
SDSS + GALEX
MLPQNA 0.012 0.11 0.029 0.11 0.043
Laurino et al. 0.058 0.29 0.029 0.11 ...
Ball et al. 0.06 0.12 ... ... ...
Richards et al. 0.071 0.18 ... ... ...
SDSS + UKIDSS
MLPQNA 0.008 0.11 0.027 0.11 0.040
SDSS + GALEX + UKIDSS
MLPQNA 0.005 0.087 0.022 0.088 0.032
SDSS + GALEX + UKIDSS + WISE
MLPQNA 0.004 0.069 0.020 0.069 0.029

Notes. MLPQNA is related to our experiments, based on a four-layer network, trained on the mixed (colors + reference magnitudes) data sets. In some cases, the comparison references are not reported, due to the missing statistics. Column 1: reference; Columns 2–6, respectively: bias, standard deviation, MAD, rms, and NMAD calculated on Δznorm = (zspec − zphot)/(1 + zspec) and related to the test sets. For the definition of the parameters and for discussion see the text.


Table 5. Comparison in Terms of Outliers Percentages Among the Different References

Exp Outliers (|Δz|) Outliers (|Δznorm|)
  >2σ(Δz) >4σ(Δz) >2σ(Δznorm) >4σ(Δznorm)
SDSS
MLPQNA 7.68 0.38 6.53 1.24
Bovy et al.   0.51    
SDSS + GALEX
MLPQNA 4.88 1.61 4.57 1.37
Bovy et al.   1.86    
SDSS + UKIDSS
MLPQNA 4.00 1.73 3.82 1.38
Bovy et al.   1.92    
SDSS + GALEX + UKIDSS
MLPQNA 2.86 1.47 3.05 0.23
Bovy et al.   1.13    
SDSS + GALEX + UKIDSS + WISE
MLPQNA 2.57 0.87 2.88 0.91

Notes. In some cases, the comparison references are not reported, due to the missing statistics. Column 1: reference; Columns 2 and 3 are the fractions of outliers at different σ based on Δz = zspec − zphot; Columns 4 and 5 are the fractions of outliers at different σ based on Δznorm = (zspec − zphot)/(1 + zspec). Column 4 reports our catastrophic outliers, defined as |Δznorm| > 2σ(Δznorm).


Table 6. Catastrophic Outliers Evaluation and Comparison Between the Residual σcleanznorm) and NMAD(Δznorm)

Exp No. of Obj. σ(Δznorm) % Catas. Outliers σclean(Δznorm) NMAD(Δznorm)
SDSS 41431 0.15 6.53 0.062 0.058
SDSS + GALEX 17876 0.11 4.57 0.045 0.043
SDSS+UKIDSS 12438 0.11 3.82 0.041 0.040
SDSS+GALEX+UKIDSS 5836 0.087 3.05 0.040 0.032
SDSS+GALEX+UKIDSS+WISE 5716 0.069 2.88 0.035 0.029

Notes. The reported number of objects for each cross-matched catalog refers to the test sets only. Catastrophic outliers are defined as objects where |Δznorm| > 2σ(Δznorm). The standard deviation σcleanznorm) is calculated after removing the catastrophic outliers, i.e., on the data sample for which |Δznorm| ⩽ 2σ(Δznorm).


5. DISCUSSION AND CONCLUSIONS

In 2002, we began to explore the usage of MLPs for the evaluation of photo-z both for normal galaxies and quasars (Tagliaferri et al. 2002). Several years later, D'Abrusco et al. (2007) used a combination of two MLPs to correct for the degeneracy introduced by the inhomogeneities in the KB. Then, Laurino et al. (2011) demonstrated that the subtleties in the mapping function could be more easily captured using the so-called Weak Gated Experts (WGE) method, a hierarchical combination of MLPs each specialized in a specific partition of the PS, whose individual outputs were combined by an additional MLP.

Furthermore, Bengio & LeCun (2007) published a seminal paper that in some respects disproved the Haykin pseudotheorem (Haykin 1998), pointing out that problems with a large amount of distribution irregularities in the PS are better treated by what they defined as deep networks, i.e., networks with more than one computational (hidden) layer. In this paper, we exploited the Bengio & LeCun (2007) findings by using the supervised machine-learning-based method MLPQNA to evaluate photometric redshifts of quasars using multi-band data obtained from the cross-matching of the GALEX, SDSS, UKIDSS, and WISE surveys.

In Tables 3–5, we compare our best results to those presented by other authors (Ball et al. 2008; Richards et al. 2009; Laurino et al. 2011; Bovy et al. 2012), in terms of a homogeneous set of statistical indicators, defined in Section 3.3. Unfortunately, the whole set of indicators was not available for all bibliographical sources and in several cases we could only use a few quantities. Results are listed according to the combinations of surveys used in the experiment.

The best experiment, which makes use of a selected combination of parameters drawn from the four cross-matched surveys, leads to a bias = 0.004 and a median absolute deviation, MAD = 0.02. The fraction of catastrophic outliers, i.e., of objects with photo-z deviating more than 2σ from the spectroscopic value is <3%, leading to σ(Δznorm) = 0.035 after their removal (as reported in Table 6). The larger the number of surveys (bands) used, the more accurate the results are. This result, which might seem evident, is not obvious at all, since the higher amount of information carried by the additional bands implies also a strong decrease in the number of objects that are contained in the training set and should therefore cause a decrease in the generalization performances of the network.

This result, together with the fact that MLPQNA also performs well with small KBs (Cavuoti et al. 2012b), seems particularly interesting, since it has far reaching implications on ongoing and future surveys: a better photometric coverage is much more relevant than an increase of spectroscopic templates in the KB.

Concerning the performance evaluation in terms of photometric redshift reconstruction, all statistical results reported throughout this paper refer to test data sets only. In fact, it is a good practice to evaluate the results on data (i.e., the test set) that have never been presented to the network during the training and/or validation phases. The use of test plus training data might introduce an obvious positive systematic bias which could mask reality.

More generally, empirical methods, such as MLPQNA, have the advantage that the training set is made up of real sky objects; hence, they do not suffer from the uncertainty of having accurate templates. In this sense, any empirical method intrinsically includes effects such as the filter bandpass and flux calibrations. On the other hand, as discussed in depth by Collister & Lahav (2004), one of the main drawbacks of these methods is the difficulty in extrapolating to regions of the input PS that are not well sampled by the training data. Therefore, the efficiency of empirical methods degrades for objects at fainter magnitudes than those included in the training set, as this would require an extrapolation capability for data having properties, such as redshift and photometry, not included in the learned sample. Another strong requirement of such methods is thus that the training set must be large enough to properly cover the PS in terms of colors, magnitudes, object types, and redshift. When this is the case, the calibrations and corresponding uncertainties are well known, and only limited extrapolations beyond the observed locus in color–magnitude space are required. In conclusion, under the conditions described above about the consistency of the training set, a realistic way to measure photometric uncertainties is to compare the photometric redshift estimates with spectroscopic measures in the test samples.

As can be seen in Tables 3–5, in all cases MLPQNA obtains very relevant results. Only in the SDSS+GALEX case do the non-normalized quantities (i.e., those referring to the error Δz = zspec − zphot) show a substantial agreement between our results and those by Laurino et al. (2011). The better performances of MLPQNA in the normalized indicators (i.e., those referring to the error Δznorm = (zspec − zphot)/(1 + zspec)) are a consequence of the better performance of the MLPQNA method in terms of the fraction of catastrophic outliers.

We wish to stress that both our four-layer MLPQNA and the WGE method discussed in Laurino et al. (2011) take advantage of a substantial improvement in complexity with respect to the traditional three-layer MLP networks used in the literature, and therefore deal better with the complexity of the multi-color PS. Average statistical indicators such as bias and standard deviation, however, provide only part of the information that allows us to correctly evaluate the performances of a method and, for instance, they provide only very little evidence of the systematic trends that are observed as a sudden increase in the residuals spread over specific regions of the redshift space (Laurino et al. 2011). In the worst cases, these regions correspond to degeneracies in the PS and, as could be expected, the relevance of such degeneracies decreases for an increasing number of bands.

As far as the analysis of the catastrophic outliers is concerned, according to Mobasher et al. (2007), the parameter D95 ≡ Δ95/(1 + zphot) enables the identification of outliers in photometric redshifts derived through SED fitting methods (usually evaluated through numerical simulations based on mock catalogs). Under the hypothesis that the redshift error Δznorm = (zspec − zphot)/(1 + zspec) is Gaussian, the catastrophic redshift error limit would be constrained by the width of the redshift probability distribution, corresponding to the 95% confidence interval, i.e., with Δ95 = 2σ(Δznorm). In our case, however, photo-z are empirical, i.e., not based on any specific fitting model, and it is preferable to use the standard deviation value σ(Δznorm) derived from the photometric cross-matched samples, although it could overestimate the theoretical Gaussian σ, due to the residual spectroscopic uncertainty as well as to the method training error. Therefore, we consider as catastrophic outliers the objects with |Δznorm| > 2σ(Δznorm). It is also important to note that for empirical methods it is useful to analyze the correlation between NMAD(Δznorm) = 1.48 × median(|Δznorm|) and the standard deviation σclean(Δznorm) calculated on the data sample for which |Δznorm| ⩽ 2σ(Δznorm); the quantity NMAD should in fact be comparable to the value of σclean.
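The outlier selection and the σclean versus NMAD comparison of Table 6 reduce to a few lines of numpy; a small sketch.

```python
import numpy as np

def catastrophic_outlier_stats(z_spec, z_phot):
    """Flag |Dznorm| > 2 sigma(Dznorm) objects as catastrophic
    outliers, then compare sigma_clean with NMAD (cf. Table 6)."""
    dz_norm = (z_spec - z_phot) / (1.0 + z_spec)
    sigma = np.std(dz_norm)
    outlier = np.abs(dz_norm) > 2.0 * sigma
    return {
        'fraction_outliers': float(outlier.mean()),
        'sigma_clean': np.std(dz_norm[~outlier]),
        'NMAD': 1.48 * np.median(np.abs(dz_norm)),
    }
```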

As is shown in Table 6, in our data the σcleanznorm) is always slightly larger than the corresponding NMAD(Δznorm), which is exactly what is expected due to the overestimate induced by the above considerations (see also Figures 4 and 5).


Figure 4. Δznorm distributions for all five cross-matched test data sets. The lines refer, respectively, to SDSS (gray), SDSS+GALEX (blue), SDSS+UKIDSS (green), SDSS+GALEX+UKIDSS (red), and SDSS+GALEX+UKIDSS+WISE (black).


Figure 5. Scatter plots (zspec vs. zphot); (a) SDSS, (b) SDSS+GALEX, (c) SDSS+UKIDSS, (d) SDSS+GALEX+UKIDSS, and (e) SDSS+GALEX+UKIDSS+WISE. All diagrams refer to results on test sets. Gray points are catastrophic outliers (defined in Table 5). The red line is the dot-to-dot straight line passing through photometric and spectroscopic redshift limits in the available knowledge base.


Finally, we would like to stress that the difficulties encountered by us and by other teams in comparing different methods, especially in light of the crucial role that photo-z play in the scientific exploitation of present and future large surveys (cf. The Dark Energy Survey Collaboration 2005; Chambers 2011; Refregier et al. 2010), confirm that it would be desirable to re-propose an upgraded version of the extremely useful PHAT contest (Hildebrandt et al. 2010; Cavuoti et al. 2012b), which allowed a direct, effective, and non-ambiguous comparison of different methods applied to the same data sets and evaluated through the same set of statistical indicators. This new contest should be applied to a much larger data set, with a more practical selection of photometric bands, and should also take into account other parameters such as the scalability and robustness of the algorithms, as well as the degeneracy characterization.

The authors thank the anonymous referee for comments and suggestions that helped to improve the paper, and the whole DAMEWARE team for many useful discussions. The authors also acknowledge the financial support of Project F.A.R.O., third call by the University Federico II of Naples, and of the PRIN-MIUR 2011 for the Euclid Mission. G.L. thanks Professor G. S. Djorgovski and the whole Department of Astronomy at the California Institute of Technology in Pasadena for their hospitality. A.M. and M.B. acknowledge the financial support of PRIN-INAF 2010, Architecture and Tomography of Galaxy Clusters. R. D'A. acknowledges the financial support of the US Virtual Astronomical Observatory, which is sponsored by the National Science Foundation and the National Aeronautics and Space Administration.

APPENDIX: THE QUASI NEWTON LEARNING RULE

Most Newton methods use the Hessian of the function to find the stationary point of a quadratic form. It needs to be stressed, however, that the Hessian of a function is not always available and in many cases is far too complex to be computed in an analytical way. More often, it is easier to compute the function gradient, which can be used to approximate the Hessian via N consequent gradient calculations. In order to better understand why QNAs are so powerful, it is convenient to start from the classical and quite common Gradient Descent Algorithm (GDA) used for back propagation (Bishop 2006). In GDA, the direction of each updating step for the MLP weights is derived from the error descent gradient, while the length of the step is determined from the learning rate. In the case of particularly complex problems, this method is inaccurate and ineffective, and therefore may get stuck in local minima. A more effective approach is to move toward the negative direction of the gradient (line search direction) not by a fixed step, but by moving toward the minimum of the function along that direction. This can be achieved by first deriving the descent gradient and then by analyzing it with the variation of the learning rate (Brescia 2012). Let us suppose that at step t, the current weight vector is w(t), and let us consider a search direction d(t) = −∇E(t). If we select the parameter λ in order to minimize E(λ) = E(w(t) + λd(t)), then the new weight vector can be expressed as

w(t + 1) = w(t) + λd(t),     (A1)

and the problem of line search becomes a one-dimensional minimization problem that can be solved in many different ways. Simple variants are (1) to vary λ by small intervals, evaluate the error function at each new position, and stop when the error begins to increase; or (2) to use the parabolic search for a minimum, computing the parabolic curve crossing pre-defined learning rate points. The minimum of the parabolic curve is a good approximation of the minimum of E(λ), and it can be derived by means of the parabolic curve which crosses the fixed points with the lowest error values.

Another approach instead makes use of trust-region strategies, which minimize the error function by iteratively growing or contracting a region over which a quadratic model function best approximates the error function. In this sense, this technique can be considered as dual to line search, since it tries to find the best size of the region by fixing the step size (while the line search strategy always chooses the step direction before selecting the step size; Celis et al. 1985). All these approaches, however, rely on the assumption that the optimal search direction is given at each step by the negative gradient: an assumption that not only is not always true, but can also lead to seriously wrong convergence. In fact, if the minimization is done along the negative gradient direction, then the subsequent search direction (the new gradient) will be orthogonal to the previous one: note that when the line search finds the minimum, it is

∂E(w(t) + λd(t))/∂λ = 0,     (A2)

and hence

g(t + 1)^T d(t) = 0,     (A3)

where g ≡ ∇E. The iteration of the process therefore leads to oscillations of the error function which slow down the convergence process. The method implemented here relies on selecting the subsequent directions so that the gradient component parallel to the previous search direction remains unchanged at each step. Suppose that we have already minimized with respect to the direction d(t), starting from the point w(t) and reaching the point w(t + 1), where Equation (A3) becomes

g(w(t + 1) + λd(t + 1))^T d(t) = 0,     (A4)

by choosing d(t + 1) so as to preserve the gradient component parallel to d(t) equal to zero. It is possible to build a sequence of directions d in such a way that each direction is conjugated to the previous one on the dimension |w| of the search space (this is known as the conjugate gradient method; Golub & Ye 1999). In the presence of a squared error function, the updated weight algorithm is

w(t + 1) = w(t) + α(t)d(t),     (A5)

with

α(t) = −(d(t)^T g(t)) / (d(t)^T H d(t)).     (A6)

Furthermore, d can be obtained for the first time via the negative gradient, and in the subsequent iterations as a linear combination of the current gradient and of the previous search directions:

d(t + 1) = −g(t + 1) + β(t)d(t),     (A7)

with

β(t) = (g(t + 1)^T H d(t)) / (d(t)^T H d(t)).     (A8)

This algorithm finds the minimum of a square error function at most in |w| steps, but at a high computational cost, since in order to determine the values of α and β it makes use of that Hessian matrix H which, as we already mentioned, is very demanding in terms of computing time: a fact that puts serious constraints on the application of this family of methods to medium/large data sets. Excellent approximations for the coefficients α and β can, however, be obtained from analytical expressions that do not use the Hessian matrix explicitly. For instance, β can be calculated through any one of the following expressions (respectively, Hestenes & Stiefel 1952; Fletcher & Reeves 1964; Polak & Ribiere 1969):

β_HS(t) = (g(t + 1)^T (g(t + 1) − g(t))) / (d(t)^T (g(t + 1) − g(t))),     (A9)

β_FR(t) = (g(t + 1)^T g(t + 1)) / (g(t)^T g(t)),     (A10)

β_PR(t) = (g(t + 1)^T (g(t + 1) − g(t))) / (g(t)^T g(t)).     (A11)

These expressions are all equivalent if the error function is square-typed, otherwise they assume different values. Typically, the Polak–Ribiere equation obtains better results because, if the algorithm is slow and subsequent gradients are quite alike between them, its equation produces values of β such that the search direction tends to assume the negative gradient direction (Vetterling & Flannery 1992).

Concerning the parameter α, its value can be obtained by using the line search method directly. The method of conjugate gradients reduces the number of steps needed to minimize the error up to a maximum of |w|, because there can be at most |w| conjugate directions in a |w|-dimensional space. In practice, however, the algorithm is slower because, during the learning process, the conjugacy of the search directions tends to deteriorate. In order to avoid the deterioration, it is useful to restart the algorithm after |w| steps by resetting the search direction to the negative gradient direction.
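Putting together the update of Equations (A5) and (A7), the Polak–Ribiere coefficient of Equation (A11), and the restart every |w| steps gives the following compact sketch; line_search and grad_E are left abstract.

```python
import numpy as np

def conjugate_gradient(w, grad_E, line_search, n_iter=1000):
    """Polak-Ribiere conjugate gradient with a restart every |w|
    steps, where the conjugacy of the directions has deteriorated.
    line_search(w, d) returns the step length alpha along d."""
    g = grad_E(w)
    d = -g
    for t in range(n_iter):
        alpha = line_search(w, d)
        w = w + alpha * d                          # Eq. (A5)
        g_new = grad_E(w)
        if (t + 1) % w.size == 0:
            d = -g_new                             # periodic restart
        else:
            beta = g_new @ (g_new - g) / (g @ g)   # Eq. (A11)
            d = -g_new + beta * d                  # Eq. (A7)
        g = g_new
    return w
```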

By using a local square approximation of the error function, we can obtain an expression for the minimum position. The gradient in every point w is in fact given by

∇E(w) = H × (w − w*),     (A12)

where w* corresponds to the minimum of the error function, which satisfies the condition

w* = w − H^(−1) × ∇E(w).     (A13)

The vector −H^(−1) × ∇E is known as the Newton direction and it is the base for a variety of optimization strategies, such as, for instance, the QNA, which instead of calculating the H matrix and then its inverse, uses a series of intermediate steps of lower computational cost to generate a sequence of matrices that are more and more accurate approximations of H^(−1). From the Newton formula (Equation (A13)) we note that the weight vectors on steps t and t + 1 are correlated with the corresponding gradients by the formula

w(t + 1) − w(t) = H^(−1) × (g(t + 1) − g(t)),     (A14)

which is known as the Quasi Newton Condition. The approximation G of H^(−1) is therefore built in order to satisfy this condition. The (BFGS) formula for G is

G(t + 1) = G(t) + (p p^T)/(p^T ν) − (G(t) ν ν^T G(t))/(ν^T G(t) ν) + (ν^T G(t) ν) u u^T,     (A15)

where the vectors are

p = w(t + 1) − w(t),  ν = g(t + 1) − g(t),  u = p/(p^T ν) − G(t)ν/(ν^T G(t) ν).     (A16)

Using the identity matrix to initialize the procedure is equivalent to considering, step by step, the direction of the negative gradient while, at each subsequent step, the direction −Gg is definitely a descent direction. The above expression could carry the search out of the interval of validity for the squared approximation. The solution is hence to use the line search to find the minimum of the function along the search direction. By using such a system, the weight updating expression (Equation (A5)) can be formulated as follows:

w(t + 1) = w(t) − α(t) G(t) g(t),     (A17)

where α is obtained by the line search.

One of the main advantages of QNA, compared with conjugate gradients, is that the line search does not require the calculation of α with high precision, because it is not a critical parameter. Unfortunately, however, it again requires a large amount of memory to calculate the matrix G (|w| × |w|) for large |w|. One way to reduce the required memory is to replace at each step the matrix G with the unit (identity) matrix. With such replacement and after multiplying by g (the current gradient), we obtain

d(t) = −g(t) + Ap + Bν.     (A18)

Note that if the line search returns exact values, then the above equation produces mutually conjugate directions. A and B are scalar values defined as

A = −(1 + (ν^T ν)/(p^T ν)) × (p^T g(t))/(p^T ν) + (ν^T g(t))/(p^T ν),  B = (p^T g(t))/(p^T ν).     (A19)