Perspective

A skeptical view on the Hirsch index and its predictive power*

Published 3 September 2018 © 2018 IOP Publishing Ltd
Focus issue: Quantum Optics and Beyond - in honour of Wolfgang Schleich. Citation: Michael Schreiber 2018 Phys. Scr. 93 102501. DOI: 10.1088/1402-4896/aad959


Abstract

The h-index was proposed by Hirsch in 2005. It is now often used as a measure of scientific performance of individual researchers. Assuming that it has predictive power, the index is even considered in academic appointment processes and for the allocation of research resources. But this is questionable, because the development of the index is dominated by citations to rather old publications. This inert behavior is demonstrated for the citation record of W P Schleich. Other variants of the h-index are analyzed, in particular a timely h-index ht which takes into account only the publications and citations of the previous t years. Finally, some alternative measures of scientific performance are discussed.


1. Introduction

Nowadays many scientists feel uneasy due to a growing pressure to quantify their scientific performance in various ways. Such pressure is exercised not only by the administration but also by politics and the general public, but frequently even by the peers especially in academic appointment processes or for the allocation of research resources. This is somewhat understandable, because quantitative measures appear to be an easy way to determine whether the tax money is spent in a reasonable way. However, it is difficult to decide which way is reasonable, not only regarding the spending of the money but also regarding the measuring.

One fashionable approach is bibliometrics, in particular analyzing the number of publications and citations, especially in terms of the h-index proposed by Hirsch [1]. This index is given by the largest number h of a scientist's publications which received h or more citations each:

$h=\max \{r\,|\,c(r)\geqslant r\}$    (1)

Here c(r) denotes the number of citations to the paper at rank r, after the papers have been sorted according to decreasing c(r). Thus it appears to be easy to determine h, which is probably one reason why the index became popular, not only among non-scientists. It is also famous among scientists, although many researchers consider it rather infamous, especially but not only if they have obtained a relatively low index value.
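For illustration, equation (1) can be evaluated directly from a list of per-paper citation counts; the following minimal Python sketch is one possible implementation:

def h_index(citations):
    """Largest h such that h papers have at least h citations each (equation (1))."""
    c = sorted(citations, reverse=True)      # c(r): citations of the paper at rank r
    h = 0
    for r, cr in enumerate(c, start=1):
        if cr >= r:                          # data point on or above the diagonal c(r) = r
            h = r
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3, 0]))          # -> 4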

Hirsch considered it an advantage of the index that it combines the dimension of quantity, namely the number of publications, with the dimension of quality, as expressed by the citation frequency. But it is certainly not obvious whether the number of citations reflects the quality of a manuscript. Sometimes wrong papers are heavily discussed in the literature and thus obtain a considerable number of citations. It is not easy to eliminate such faulty publications. On the other hand, highly cited review articles do not present new research results. Thus although their quality could be very high, they do not reflect research performance.

From a methodological point of view such a mixture of different dimensions into one indicator is questionable in principle. And in the present case, the mixture is not even as unique as it appears. Only at first sight does the definition in equation (1) appear to be parameter free. However, one could demand that hq publications have received q · hq or more citations each [2]. The arbitrary prefactor q yields an infinite number of generalized indices hq. It has been shown [3] that already moderate values of q can change the ranking of scientists considerably compared with the original h-index. For highly cited researchers a larger value of q = 10 was suggested; this so-called w index has the advantage that the resulting values are much smaller, so that the precision problem discussed in the next section is significantly smaller, too [4].
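The generalization amounts to a one-line change in the sketch above: hq merely tightens the condition to q · hq citations, with q = 1 recovering h and q = 10 yielding the w index.

def h_q_index(citations, q=1.0):
    """Largest number h_q of papers with at least q*h_q citations each [2]."""
    c = sorted(citations, reverse=True)
    hq = 0
    for r, cr in enumerate(c, start=1):
        if cr >= q * r:
            hq = r
        else:
            break
    return hq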

I find it interesting to observe that counting citations in order to make economic decisions has a long history. In order to measure the desirability of purchasing a particular journal, the citations in volume 48 of J. Am. Chem. Soc. from 1926 were analyzed and found to be distributed very unevenly among 247 different journals in chemistry and physics over the time span of 55 years from 1871 up to 1925 [5]. So citation analysis with the purpose of allocating or not allocating funds is nothing new. It has just become less cumbersome with the advent of electronic databases. But as discussed below, this does not automatically mean improved accuracy. In any case these databases help to 'measure what is measurable and make measurable what is not so' [6] in the spirit of Galileo Galilei.

The following analysis concentrates on various variants of the h-index with which I have experience. In a recent review [7] other variants and also other citation impact indicators are comprehensively discussed. Besides traditional bibliometrics, which are based on citations, altmetrics [8] capture other facets of the impact of research, e.g. views and downloads of web items, discussions in science blogs, on Twitter, Facebook, and other social media, as well as bookmarking of the respective web pages. In this way altmetrics provide complementary perspectives on a researcher's impact. They also provide a very fast measure, reflecting early attention. However, the question arises whether online popularity is correlated with scientific value.

While the number of received citations is certainly an interesting quantity, it would be more informative to know the relevance of the citing papers. Assuming that this relevance is correlated with the number of citations obtained by the citing paper, one can apply the PageRank algorithm which is used by Google search to determine the relevance of web sites [9]. In effect this means that the whole network of authors and publications is taken into account. The mentioned relevance need not be restricted to the impact of the citing paper, but could be extended to the impact of the citing author(s). The suggested π-index [9] is robust against self-citations and other means which increase citation counts disproportionately. It is, however, computationally rather involved, so that one cannot compute it oneself; it needs to be implemented in a large database which comprises a sufficiently complete citation network.
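The underlying idea can be illustrated with a toy example (this is not the π-index itself, just a PageRank run on a small invented citation graph, using the networkx library):

import networkx as nx

G = nx.DiGraph()
# an edge A -> B means "paper A cites paper B" (toy data, not real publications)
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("C", "B")])

relevance = nx.pagerank(G, alpha=0.85)       # weight of each paper in the citation network
for paper, score in sorted(relevance.items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 3))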

The Eigenfactor score was developed as a superior alternative to the journal impact factor for measuring the importance of a scientific journal, weighting citations by the ranks of their journals. The algorithm computes the eigenvector centrality of each node in the network and can also be applied at the level of individual authors in a citation network [10]. As for the π-index, one needs a large, comprehensive database to determine and analyze the network.

The investigation of bibliographic networks is nothing new, but it has become more popular in recent years due to advanced visualization techniques [11] which complement traditional network metrics like degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality [12]. For example, it was shown that the network metrics of co-authorship networks positively impact the citation performance [12, 13]. Besides co-authorship, co-citations (papers cited by the same source) and bibliographic coupling (sources citing the same paper) can also be used for the construction of a network. Most common is the direct citation network. An example is presented below in section 10, visualizing the quantum optics network.

2. The precision problem

The determination of the h-index turns out to be not as straightforward as one might imagine. The most serious problem is the integrity of the citation data. In many cases it is difficult to eliminate homographs, i.e. to exclude papers which have not been authored by the investigated researcher. In my own case there are several other scientists with the same name and the same initials; at least two of them have published papers with titles which are not obviously different from my own research interests.

A distinction by address is usually possible, but it is sometimes complicated by variations in the (abbreviated) name of the institute and/or university. A colleague at a Max-Planck-Institute needed only 4 years to accumulate 7 different versions of his institute's name in the Web of Science (WoS) records. Thus a discrimination by address could lead to the unwanted exclusion of publications, especially if one does not have a CV available, so that some affiliations may be missed.

Similar problems arise when one tries to discriminate by research field. In my own case, selecting only physics and maybe chemistry would mean neglecting my hobbyhorse, bibliometrics, which contributes more and more to my h-index value.

I came across another especially difficult case when I searched in the WoS for Pierre-Gilles deGennes where one has to combine the records for deGennes PG, deGennes P, de Gennes PG, Gennes PG, and Gennes PGD. Similarly, the transliteration from other alphabets is not unique and has also changed over the years, so that it is rather difficult to obtain the complete citation record of scientists from, e.g., China, Iran, Japan, Russia.

Failing author disambiguation does not only inflate the h-index value if the determination is not done with sufficient care. It can also lead to gratuitous citations, if somebody intends to quote a famous established scientist but erroneously cites a homonymous author instead.

A solution to this so-called author ambiguity problem would be a ResearcherID as provided by Clarivate Analytics, which integrates with the WoS database. Another, non-proprietary, persistent digital identifier is promoted by ORCID (Open Researcher and Contributor ID). But although the number of ORCID IDs has been increasing significantly (4.3 million in January 2018) since its introduction in 2012, many researchers have not yet registered. Moreover, it is tedious for the individual researcher to add the ID to all of his/her old publications. Consequently, at present an accurate citation database still requires comparing the publications in the citation record with a publication list of the author. The value of h which is automatically calculated in the WoS is usually too large, sometimes by only a few index points, but sometimes by a factor of two or more.

The h-index which is provided by Google Scholar suffers from similar problems. Moreover, it also takes into account various versions of the same paper, e.g. the journal paper and the preprint archive version, as citable and as citing items. 'Stray citations' lead to duplicate records [14]. In general, Google Scholar lacks a quality control process and also includes citing blogs and magazine articles [14]. This leads to a drastic overestimation of the index. Google Scholar also includes technical reports and other gray literature, which is questionable because these items have not been refereed. One advantage of Google Scholar is that books are also taken into consideration. For a comprehensive citation analysis it would be fair to include books, which are likely to have a significant impact. For example, Wolfgang Schleich's book [15] is by far the most-cited item in his publication list with 1840 citations, in comparison with 399 citations to his most cited journal publication (according to Google Scholar as of 20th July 2018), with the latter also on top of the WoS list but with only 298 citations.

Another frequently used database is SCOPUS, which in my view has the same high standards and therefore a reputation as high as that of the WoS. However, depending on the scientific field, different journals, more or fewer proceedings, and more or fewer books are evaluated in comparison with the WoS. Thus the h-index values are not comparable between those two databases, and one should consistently use one or the other. For the above-mentioned most cited journal publication SCOPUS lists 328 citations. The coverage of the three databases has recently been compared across disciplines [14].

3. A case study

For the present investigation, I downloaded the complete citation record of Wolfgang P Schleich from the WoS into a spreadsheet in December 2017 (which means that some citations from 2017 are missing in the dataset). The author ambiguity problem is not so big in this case, but one has to consider not only WP Schleich but also W Schleich. A comparison with his publication list, which can be obtained from the WWW, showed that only a few papers in the WoS list, e.g. on breast cancer and brain cell biology, had to be discarded. Thereby some care had to be exercised, because some of his publications in the WoS records are missing from his WWW list. The subsequent analysis comprises 272 publications with altogether 7362 citations.

The citation record of Schleich is visualized in figure 1. According to Hirsch [1] and using equation (1), it is easy to determine h = 43, which is the largest rank r for which the data point lies on or above the diagonal c(r) = r. If one interpolates c(x) between r and r + 1 linearly, then one can determine a generalized index $\tilde{h}$ from $\tilde{h}=c(\tilde{h})$. In other words, the intersection of the interpolating lines in figure 1 with the diagonal determines $\tilde{h}=43.5.$ Rounding the rational number $\tilde{h}$ down to the next lower integer value yields the original h-index.
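This interpolation can be sketched in a few lines, reusing the h_index() function from the introduction and intersecting the line through (h, c(h)) and (h + 1, c(h + 1)) with the diagonal:

def h_interpolated(citations):
    """Interpolated index h-tilde with h-tilde = c(h-tilde), c(x) linear between ranks."""
    c = sorted(citations, reverse=True)
    h = h_index(citations)                    # ordinary h-index as defined in section 1
    if h == 0 or h == len(c):
        return float(h)
    c_h, c_next = c[h - 1], c[h]              # c(h) and c(h+1)
    # solve c(h) + (x - h) * (c(h+1) - c(h)) = x for x in [h, h+1]
    return (c_h - h * (c_next - c_h)) / (1.0 - (c_next - c_h))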


Figure 1. Citation record of Wolfgang P Schleich. The papers are ranked according to the number of citations (×). Also given is the average number of citations (+) up to rank r. The straight solid line reflects the diagonal c(r) = r, the broken line indicates c(r) = 2 · r.


As mentioned in the introduction, one can generalize the h-index by choosing a prefactor q and thus put the emphasis on more or less highly cited papers. In figure 1 the case q = 2 is illustrated, yielding h2 = 32. For q = 10 we get h10 = 11. But such relatively small values make it difficult to distinguish different researchers. Differences of a few index points should not be taken as an indication that one scientist is better than another.

In general, the index values should be interpreted with discretion. There are prominent scientists with low index values. One should be cautious and better adhere to the principle of antidiagnostics, which has been formulated already in 1997, long before the introduction of the h-index, namely that 'in scientometrics, numerical indicators can reliably suggest only eminence but never worthlessness' [16].

Hirsch has suggested [1] that a successful scientist would be characterized by an index value of about the number of years since his or her first publication. For Schleich this would mean 35 years and should be compared with h = 43. Thus he is well above the quality threshold suggested by Hirsch.

4. Averaging the number of citations

The citation record is usually strongly skewed, as demonstrated in figure 1. The skewness can be quantified: Schleich's 5% most cited papers obtained 36% of all the 7362 citations. 80% of the citations were received by the top 29% of the papers which means that the distribution does not quite reach Pareto's principle, the 80–20 rule, but it comes close to that.

Obviously the long tail of lowly cited publications is not taken into consideration for the evaluation of the h-index. This is usually not a problem, but many researchers consider it unfair that a lot of citations to the papers in the h-core, i.e. the set of h-defining papers, are not taken into account. These are called excess citations. For example, in figure 1, c(1) − h = 497 − 43 = 454 citations to the first paper are not relevant for the h-index. This shortcoming was remedied in the g-index [17], which can be defined in analogy to the h-index by requiring that the average number $\bar{c}$ of citations to the g most cited papers is larger than or equal to g [18]. Of course, the respective average values show a much smoother behavior than c(r), as can be seen in figure 1. Naturally, the result g = 76 is larger than h. If one interpolates c(x) as above, one can find the interpolated index $\widetilde{g}=\bar{c}(\widetilde{g})$, which turns out to be a real number. Again the intersection with the diagonal as shown in figure 1 gives the result graphically, in this case $\tilde{g}=76.3.$ An advantage of $\tilde{g}$ is that every additional citation to a paper in the g-core yields an increase of the index, albeit a small one. I think that this is an attractive feature.

The averaging procedure here is a mathematical formality in order to take the excess citations to the core papers into account in a reasonable way [17]. In fact, for the h-index the h · h citations in the lower left corner of figure 1 are relevant. All the excess citations above this square do not count. In contrast, for the g-index, the complete area below the c(r) curve is integrated up to $\tilde{g}$ and this area is equal to ${\tilde{g}}^{2}.$ In other words, the g-core comprises at least ${g}^{2}$, but less than ${(g+1)}^{2}$ citations. Formally, this is equivalent to the above definition which utilizes the average $\bar{c}.$
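A corresponding sketch of the g-index, using the equivalent cumulative formulation (here g is capped at the number of papers, a simplification that suffices for the present dataset):

def g_index(citations):
    """Largest g such that the g most cited papers received at least g*g citations in total."""
    c = sorted(citations, reverse=True)
    total, g = 0, 0
    for r, cr in enumerate(c, start=1):
        total += cr                           # cumulative citations of the r most cited papers
        if total >= r * r:                    # equivalently: average total/r >= r
            g = r
        else:
            break
    return g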

In principle the use of averages for strongly skewed distributions violates the most basic scientific standards. Thus one should not consider the average number of citations for the complete dataset. This severe criticism also applies to the widely used and acclaimed impact factor: the impact factor for a given year is calculated as the average number of citations in that year to the papers published in the previous two years. Such an average is not a good measure [19], because the citation distributions of journals are also strongly skewed. As an example, for Physica Scripta my WoS search (in March 2018) yielded 1297 citations in 2017 to 843 articles from the years 2015 and 2016. This means an impact factor of 1.54 (note that I have only considered the document type 'article', so that this value differs from the official WoS value). But only 262 (i.e. 31%) of those papers had received 2 or more citations in 2017. Similarly, for Science I found 56651 citations in 2017 to 1505 articles from 2015 and 2016. Thus the average gives an impact factor of 37.6 (again restricted to 'articles'), but only 372 articles received 38 or more citations in 2017. The remaining 75% had fewer citations and thus decreased the average. In conclusion, it would have been better for the journal's impact factor if those papers had not been published in Science.
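The arithmetic behind such impact-factor values is nothing more than a mean over a skewed distribution; a toy sketch with invented citation counts illustrates the point:

def impact_factor(citations_per_item):
    """Mean citations in the census year to items from the two preceding years,
    together with the share of items that reach or exceed that mean."""
    n = len(citations_per_item)
    mean = sum(citations_per_item) / n
    at_or_above = sum(1 for c in citations_per_item if c >= mean)
    return mean, at_or_above / n

# hypothetical, strongly skewed journal: a few blockbusters, many lowly cited papers
mean, share = impact_factor([120, 40, 10, 3, 2, 1, 1, 0, 0, 0])
print(round(mean, 1), "impact factor;", round(100 * share), "% of papers at or above it")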

This argument also dramatically shows that the impact factor of a journal should never be used to judge the quality of a single paper in that journal. Mathematically one can express this as a low correlation between the impact factor and the number of citations to an individual paper in a journal. A strong correlation can only be derived between impact factor and the number of uncited papers in a journal.

5. Other Hirsch-type indices

The hype about the Hirsch index produced a plethora of variants [20-22], so that probably every letter in the alphabet has been used at least once for a new bibliometric index. Advantages or disadvantages are often discussed with a bias, driven by personal taste or possibly by the desire of the proposing author to come off better than competing colleagues.

One modification takes the number of coauthors into account. However, it is difficult to treat them in a fair way, because the contributions of the individual authors to a manuscript are difficult to quantify [23]. Sharing the fame equally among the authors is a solution, which is of course imperfect. It could be done by fractionalizing the citation counts, which means attributing c(r)/a(r) citations to each of the a(r) authors [24]. But for the determination of the respective fractionalized index hf this means that the papers have to be rearranged, sorting them by the fractionalized citation counts. Moreover, it appears unreasonable that highly cited papers with many authors may afterwards have dropped out of the core.

Therefore it is a better solution to fractionalize the paper count, which means attributing only a fraction 1/a(r) of a paper to each author [25, 26]. The respective modification hm can have a strong effect on the ranking [25]. This modification has been implemented in the 'publish or perish' software [27]. Another advantage of hm is that it enables a straightforward aggregation of the results, e.g. for all researchers in an institute. It should be noted, however, that any fractionalization appears to be unfair towards publications from large collaborations, for which either the citation count or the paper count would be marginalized due to the large number of authors. But is it appropriate to take a paper with more than 100 authors fully into account more than 100 times?
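A sketch of hm along these lines, with papers ranked by citations but each contributing only 1/a(r) to the effective rank:

def h_m_index(papers):
    """papers: list of (citations, number_of_authors) pairs for one author."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    r_eff, hm = 0.0, 0.0
    for citations, n_authors in ranked:
        r_eff += 1.0 / n_authors              # fractional paper count up to this rank
        if citations >= r_eff:
            hm = r_eff
        else:
            break
    return hm

print(h_m_index([(50, 2), (20, 4), (7, 1), (3, 3)]))   # -> about 2.08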

Another question is how to treat self-citations, because they do not reflect the impact of a publication. Of course, some self-citations are useful, if one does not want to repeat previous material. There are also indolent self-citations, because it is easier to refer to one's own publications, which one usually knows better than other people's work. And then there are the unnecessary self-citations which are based on the observation that one has to cite one's own publications because no one else does so. In any case, it seems appropriate to discard the self-citations from the database. However, this is not so easy.

The WoS citation record reports a number of citations without self-citations. But this is misleading, because only self-citations within the investigated dataset are excluded. For example, if I analyze just the citation record of my most cited paper in bibliometrics [26], then the WoS gives 70 citations and finds no self-citations. However, among the 70 citing papers one can easily find 6 of my own publications, which means 6 self-citations. But even if I start with a complete citation record of my publications, then only my own (i.e. direct) self-citations would be counted. So-called indirect self-citations by my coauthors would be overlooked, but they can sometimes be numerous, depending on how enthusiastically a coauthor cites the collaborative work. Consequently, it is difficult to determine all self-citations. Moreover, even the direct self-citations are counted in the WoS only globally for the investigated sample, but not for each paper.

For a previous case study I excluded the direct and indirect self-citations for every single paper from the citation record by checking every coauthor's citations [28]. This is of course tedious work, so it cannot be expected to be performed regularly. But in principle all self-citations should be excluded from the citation record, because they can strongly change the ranking [28], in contrast to Hirsch's conjecture [1] that self-citations would not have a big influence on the index.
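The cleaning step can be sketched under the simplifying assumption that any citing paper sharing at least one author with the cited paper counts as a (direct or indirect) self-citation; the data layout is of my own choosing:

def sharpened_citation_counts(papers):
    """papers: list of dicts with 'authors' (set of names) and
    'citing_authors' (list of author sets, one per citing paper)."""
    cleaned = []
    for paper in papers:
        external = [citers for citers in paper["citing_authors"]
                    if not (citers & paper["authors"])]    # no shared author at all
        cleaned.append(len(external))
    return cleaned    # feed these counts into h_index() or g_index() to obtain h_s or g_s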

If one combines the index hm which is modified for multi-author papers with the index hs which is sharpened by exclusion of all self-citations, then one obtains the index hms [29]. Similarly, a modified sharpened gms can be constructed, which in my view would be the most appropriate version within the huge and still growing zoo of Hirsch-type indices.

One can already observe that self-citations are sometimes aimed at enhancing an author's h-index, when the author surgically cites those of his own papers whose citation count lies just below his h-index. This is an easy and obvious way to adjust one's behavior to the introduction of h as a new measure; as the saying goes: 'whatever you count, you get more of.' However, this is not only an easy way to game the h-index, it is also easy to detect. A more elaborate way to engineer the h-index would be collusive citations via citation cartels. These are more difficult to find and to distinguish from collaborative citations. But it can be done [9, 30]. In collaborations people can easily agree on reciprocal citation behavior. Such malpractice could also become one-sided, if group leaders exert pressure on younger researchers or senior faculty on junior faculty for citations [31].

Another topical problem concerns coercive citations in a more general setting, for example self-citations by referees requesting their own work to be cited in a reviewed manuscript [32]. Even editors have been found [33] to coerce citations for their personal benefit, increasing their citation count and their h-index in this way. On a broader scale coercive citations have been used for editorial glorification, namely when editors request or even extort that their journal is (more often) cited in a submitted manuscript, with the aim of enhancing journal metrics like the journal impact factor [30]. This clearly unethical behavior has become so severe, with more than 20% of the investigated journals found guilty [34], that a voluntary Code of Conduct [35] was proposed and adopted by a group of journal editors. It is worth noting that the journal impact factor can also be engineered by editorials with frequent citations to previous editorials, by publishing a lot of correspondence, and by publishing a manuscript as a letter to the editor. These items are not counted as citable items for the journal impact factor, but the citations therein do count, thus enhancing this measure [36].

A more elaborate way of fractionalizing coauthor impact is based on the investigation of complete citation networks, as mentioned in the introduction. For example, the π-index was shown [9] to reduce the citation impact of multi-authored papers compared with the h-index, as well as to devalue self-citations and to reduce the impact of quantity-oriented publication habits in contrast to quality-oriented strategies. In consequence, the π-index appears to be a much fairer way of evaluating researchers, and I would call it better, if only it were not so difficult to determine.

6. The predictability of the Hirsch index evolution

Hirsch [37] showed a high correlation between the h-index values after 12 years and after 24 years of the career of researchers. He concluded that the index has predictive power. Likewise, with a complicated fit using 18 parameters the future h-index was rather accurately predicted for several years [38]. Even a much smaller set of 5 parameters leads to a relatively good fit. However, such a fit can easily suggest an accuracy of the prediction which has no meaning, as illustrated by the warning that 'with four parameters I can fit an elephant and with five (sic!) I can make him wiggle his trunk' which was attributed to von Neumann by Enrico Fermi [39]. In any case, the transferability of the h-index fit to other samples was questioned [40], also for different career-age cohorts [41].

What one can show is that the h-index is indeed a good predictor of itself, because its evolution is usually dominated for several years by further citations to previous publications rather than by new scientific achievements [42]. This is visualized in figure 2, where the time evolution of the h-index for Schleich's publications is shown. It is compared with the fictitious evolution obtained under the assumption that he had refrained from publishing in and after a certain year r. For example, for r = 1993 the index hr would have increased like h until 1997, and even 2 years later it would have been smaller by only 2 index points. If he had stopped publishing in 1997, there would have been no change in the next 3 years and only a small change of one index point in the subsequent 4 years; the deviation from the actual h-index would have stayed within 2 points until 2010, i.e. for 13 years. Even in 2017 the value hr = 38 would be only 5 points below the actual index and thus still well above the above-mentioned threshold of one index point for each year since the first publication, as indicated by the dotted line in figure 2.
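The fictitious curves hr(y) of figure 2 can in principle be reproduced from the per-paper citation years; a sketch, reusing h_index() from the introduction and an assumed simple data layout:

def h_up_to_year(papers, y, last_pub_year=None):
    """papers: list of (publication_year, [years_of_received_citations]) pairs.
    Returns h(y); with last_pub_year = r it returns the fictitious h_r(y)."""
    counts = [sum(1 for cy in cites if cy <= y)
              for pub_year, cites in papers
              if last_pub_year is None or pub_year <= last_pub_year]
    return h_index(counts)

# e.g. actual index versus the 'retired in 2001' scenario:
# h_up_to_year(papers, 2017), h_up_to_year(papers, 2017, last_pub_year=2001)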


Figure 2. Time evolution of the h-index for Wolfgang P Schleich's publications (top line). Additionally, the dependence hr(y) is shown for selected years (see legend), starting with the year r, which is assumed to be the year in which the author retired and/or refrained from publishing ever after. The dots reflect a steady increase of one index point per year since 1982, i.e. the publication year of the oldest publication in the dataset.


And finally, if we assume that he had retired in 2001 and refrained from publishing ever since (admittedly a very unlikely hypothesis in Schleich's case), his Hirsch index would not have suffered until 2010 and would have been only one index point lower in 2015 and two index points lower in 2017. Consequently, looking at the h-index one cannot decide whether a scientist has been active recently or not. This inertia of the h-index is the reason why the index is a good predictor of itself. But this should not be confounded with an indicator for future scientific performance.

As demonstrated by the small deviations in figure 2, the h-index does not reflect recent impact. This is not surprising, because it becomes more and more difficult for recent publications to contribute to the h-index once the index value is already relatively large. Thus the index is more likely to result from rather old publications, i.e. it depends on past achievements. This means that even if somebody does not perform as expected, the evaluators who measure the performance in terms of the h-index some years later would not be disappointed. One might consider this an advantage for the lazy researcher. There may be people who do not even care, because they are more interested in the increasing measure than in the factual performance.

7. The timed h-index restricted to a citation window

There is a simple way to circumvent this difficulty: one can leave out the early publications from the analysis. This can be easily done in the WoS by specifying the time range of the investigation. That allows one to measure the citation impact in recent years. It also provides a straightforward way to detect how more and more papers drop out of the current h-core when the publication activity is restricted to a smaller and smaller time span, i.e. more recent years only.

While this study may give interesting insights into the activity of a single researcher, for the purpose of comparison it is advisable to select a certain time span. A short period of 3 years [43] may not be representative, because in many cases it takes some time for a publication to be recognized and cited. Also, delays in publications of the citing articles may occur. In my view [44] a time period of 5 years [45] is more reasonable. For Schleich the respective timed index ht yields a value (according to the WoS in March 2018) of 11 in 2017 considering the citation record in 2017 and the previous t = 5 years.

It is instructive to analyze this timed index also with a sliding time window. Respective results are displayed in figure 3. Selecting the evaluation year y, one can easily restrict the WoS search to y and the previous 5 years. It is also straightforward to utilize the spreadsheet with the downloaded citation data to determine ht. The results in figure 3 show a rather constant high performance with prominent peaks around 1997 and 2000. Even in the valley between 2003 and 2007 the values are relatively high. After that the activity increased once more, or at least the research again had more impact in terms of citations.
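The sliding-window index ht(y) used for figure 3 can be sketched analogously; here the window is taken as the years y − t to y, which is one possible reading of the definition:

def h_t_index(papers, y, t=5):
    """Timed index: only papers published in [y - t, y] are ranked and only
    citations received in [y - t, y] are counted."""
    counts = [sum(1 for cy in cites if y - t <= cy <= y)
              for pub_year, cites in papers
              if y - t <= pub_year <= y]
    return h_index(counts)        # or h_interpolated(), as used for figure 3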


Figure 3. Time evolution of the timed h-index ht(y) for Wolfgang P Schleich's publications in the year y, taking only publications from the previous t = 5 years into account. Note that interpolated values (as described in section 3) are used, giving more detail for these relatively small values.


8. A different indicator

Although I have discussed above several indicators based on the h-index, I do not want to give the impression that I am convinced that these are really meaningful for performance evaluations. In my view they are better than the original version, but my general skepticism remains. An alternative is to compare a specific citation record with reference datasets for different fields or subfields and for the different publication years. Thus age- and field-normalized impact scores are determined by the respective percentile of the corresponding reference set to which the publication belongs [46].

In recent years one such bibliometric indicator in particular has become 'most important ... because it is the most stable impact indicator and cannot be strongly influenced by a single highly cited article' [47]. This PP10% indicator is used in the Leiden ranking covering the 900 largest universities in the world [48]. It reflects the proportion of the publications of a university that belong to the top-10% most frequently cited in comparison with other similar publications.

This indicator can also be applied to the publications of an individual researcher, determining how many of his publications belong to the top 10%. To be specific, I have compared the number of citations of each of Schleich's publications with the citation distribution of all papers published in the same journal in the same year.

For the following evaluation I excluded the publications from 2016 and 2017, because the small citation numbers are susceptible to strong statistical fluctuations, and one citation more or less can drastically influence the result. The thus restricted dataset comprises 259 publications.

The rank of each paper in its journal volume is then expressed in percentiles, and the number of papers which fall into the top 10 percentiles is counted, 36 in this case. So 36/259 = 13.9% of his publications belong to the top-10% class. This means that Schleich has a considerably larger proportion of top-10% publications than the expected value of 10%, which would correspond to about 26 papers.

Similar indicators are in use for the top-1%, top-5%, top-50% publications. In figure 4 the respective results of PPm% are visualized for all percentiles m. One can see that Schleich's papers are nearly always (with the exception of m = 3) better than the expected value given by the diagonal. In particular, PP1% = 1.5%, PP5% = 8.1%, PP50% = 55.6%.
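The percentile bookkeeping behind PPm% can be sketched as follows, assuming that each paper is a member of its reference set (same journal, same year) and using the median-rank convention discussed below:

def percentile_of(citations, reference_counts):
    """Percentile of a paper within its reference set, using the median rank
    of the papers that share its citation count (the paper itself is assumed
    to be part of the reference set)."""
    ref = sorted(reference_counts, reverse=True)
    first = next(i for i, c in enumerate(ref) if c <= citations)          # start of the tie block
    last = len(ref) - 1 - next(i for i, c in enumerate(reversed(ref)) if c >= citations)
    median_rank = (first + last) / 2 + 1
    return 100.0 * median_rank / len(ref)

def pp_share(papers_with_refs, m=10):
    """papers_with_refs: list of (citations, reference_counts) pairs; returns PPm% in percent."""
    top = sum(1 for c, ref in papers_with_refs if percentile_of(c, ref) <= m)
    return 100.0 * top / len(papers_with_refs)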


Figure 4. Proportion of Wolfgang P Schleich's publications which belong to the most cited m% in the reference sets, i.e. published in the respective journal in the same year. The diagonal reflects the expected value of m%.


The difference PPm% − m% is plotted in figure 5. However, in the determination of the rank (and thus the percentile) of a certain paper in comparison with its reference dataset, an ambiguity often occurs, because several publications in the same journal in the same year obtained the same number of citations. E.g., Schleich's recent paper [49] attracted 28 citations (20 in 2016, thus contributing very positively to the impact factor of the journal). But a ranking of the 2664 papers of Phys. Rev. Lett. from 2015 showed that all the papers from rank 400 to 420 received the same number of citations. So the question arises which of these ranks should be taken for comparison. Fortunately, in this particular example there is no effect on PPm%, because all those 21 ranks belong to the 16th percentile class (which includes ranks 400 to 426, because 15% · 2664 = 399.6 and 16% · 2664 = 426.2).


Figure 5. Difference between the actual proportion from figure 4 and the expected value. The difference is determined from the respective median ranks in each of the reference datasets (full line), as well as from the lowest ranks (+) and the highest ranks (×).


For the data in figure 4 I always used the median rank of the respective range in the reference sets. If the percentile of the lowest (highest) rank in each interval were used for comparison, a better (worse) performance of Schleich would be derived. But the aggregated values of PPm% are not influenced very much for small percentiles, as indicated by the symbols in figure 5. For larger percentiles, which correspond to smaller citation counts, the effect is stronger, because small citation counts occur more frequently in the reference datasets, so that the ambiguity becomes larger. In this case it would be better not to determine the appropriate percentile by the median rank but rather to attribute the paper fractionally to 2 or more percentile classes [50].
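A sketch of this fractional attribution, assuming percentile classes of 1% width and tied papers occupying a contiguous block of ranks:

def fractional_class_weights(first_rank, last_rank, n_ref):
    """Distribute one paper evenly over the 1%-wide percentile classes covered
    by the tied ranks first_rank..last_rank in a reference set of size n_ref."""
    counts = {}
    n_tied = last_rank - first_rank + 1
    for rank in range(first_rank, last_rank + 1):
        cls = -(-100 * rank // n_ref)                  # ceiling of 100*rank/n_ref
        counts[cls] = counts.get(cls, 0) + 1
    return {cls: n / n_tied for cls, n in counts.items()}

# the 21 tied ranks 400..420 of the example above all fall into the 16th class:
print(fractional_class_weights(400, 420, 2664))        # -> {16: 1.0}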

9. A completely different performance indicator

Besides citation analysis, nowadays a large number of different indicators is in frequent use, e.g., the amount of obtained and/or spent external funds, the number of PhD students, etc. The latter is a kind of vote with one's feet, because the candidates select the professor to whom they go for their research. However, their selection criteria are debatable and the number depends strongly on the size of the student body which differs drastically between different universities.

But there is a similar vote, which admittedly can be evaluated only in Germany. The Alexander von Humboldt foundation grants a large number of research fellowships and various awards of very high reputation to excellent international scientists each year. These provide the data for the highly recognized Humboldt ranking of universities, which depends on the number of Humboldtians hosted in the previous 5 years. The University of Ulm hosted 26 fellows and 4 award winners according to the 2017 ranking, covering the time span from 2012 until 2016. This yielded rank 41 in Germany (or rank 46 if weighted by the size of the institutions). 'Because Humboldtians choose their own hosts and make their own decisions based on the hosts' academic performance and international visibility ... (it) is an important indicator for international contacts and reputation' [51].

In comparison one can appraise the esteem in which Schleich is held: he hosted altogether 11 fellows and 20 award winners of the Humboldt foundation during his entire career. Note also that the ratio of awardees to 'simple' fellows is exceptionally high. Additionally, Schleich was instrumental in bringing a prestigious Alexander von Humboldt professorship to the University of Ulm. These numbers indicate his high reputation, because he was selected as a host by so many scientists whose excellence was corroborated by the highly restrictive selection process of the Humboldt foundation.

10. Bibliographic network

The Humboldt hosts and guests constitute a unique network. As mentioned in the introduction, bibliometric networks are more commonly studied. As an example, I have determined the direct citation network in the field of quantum optics. For this purpose a WoS search was performed for the topic 'quantum optics' on 18 July 2018, yielding 15354 results. Of these, the 6207 items with at least 10 citations were downloaded. This threshold was chosen in order to restrict the subsequent analysis to highly-cited papers. The same threshold is applied by Google Scholar to determine the highly cited publications (yielding the i10-index). The dataset was complemented by 93 likewise highly-cited publications of WP Schleich (in addition to his 10 highly-cited quantum optics papers). Altogether this procedure yielded 14878 authors, who were included in the direct citation network created with the freely available VOSviewer software [11].

The result is displayed in figure 6 showing 6 clusters which are, however, somewhat overlapping and intertwined. Nevertheless, the main players in the field can easily be identified. A more detailed analysis shows that there are 8803 links in this network with a total link strength of 39 350. But these numbers depend strongly on the number of authors who are included in the clustering. In the present case the 200 best-linked authors have been taken into account. Among these authors the maximum number of documents is 50 (except for WP Schleich), the maximum number of citations is 8602, and the maximum link strength is 2147.


Figure 6. Direct citation network between the authors identified in the WoS by the topic 'quantum optics', supplemented by Wolfgang P Schleich's publications. The default settings of the VOSviewer software were applied, except that the minimum number of documents of an author was set to 1 (instead of 5) and the number of authors selected by their link strength was reduced to 200 (from the default 500). A thesaurus file was used to correct the author names with some missing second initial, and to erase a small separate cluster of 9 authors and a single author with only one connection to the rest of the network. In addition, the colors have been adjusted.


The software package also allows an overlay visualization, which by default provides information about the average publication years. Not surprisingly, the items around Schleich, Ralph, Leuchs, and Zeilinger were on average older, while the youngest items appeared on the fringe of the network.

11. Summary and conclusions

In the present investigation I have demonstrated the inert behavior of the h-index. This inertia might have been expected, but discussions with many colleagues have told me that most people are not aware of this behavior. However, if one wants to employ the h-index, then one should know that it bears testimony to significant past achievements. Of course, in evaluations the old publications and their impact could be an important aspect. But, as always, care should be taken when foretelling the future, including the future scientific performance of a researcher. This also applies when relying on the predictability of the h-index, even though the h-index is a good predictor of its own evolution.

The same problem pertains to many, if not all, bibliometric indicators. Many improvements of the h-index have been suggested. In the above discussion I have mentioned several of them, but in view of the large number of respective indicators I had to make a strong selection, and the resulting choice is certainly dictated by personal taste. My preference is Egghe's g-index, modified by fractional counting of papers for multiple-author publications and sharpened by the exclusion of all (direct and indirect) self-citations. In any case, if you do not like a ranking, you should just wait for the next one or suggest your own, as Hirsch did.

One should be aware, however, that the advantage of the g-index, namely that excess citations are taken into account, can also be a disadvantage: reviews usually attract a lot of citations, but it could be considered unfair that these strongly enhance the g-index, because reviews usually do not reflect new research. The same problem arises with abundantly cited research reports on topical political issues. In the h-index determination such a highly-cited review or report would contribute at most one point.

Self-citations are a sensitive issue. Some are necessary and/or useful. Some reflect bad manners, when they are used to game the h-index or other metrics. Coerced citations by referees and journal editors are unethical. But on the other hand missing self-citations can also be unethical, e.g. when an author does not cite previous own work in order to hide salami slicing publishing tactics or even duplicate manuscript submission.

To quantify the citation impact of recent publications the timed h-index was discussed, reflecting the h-index evolution for a sliding time window. This gives interesting insights into the significance of recent achievements and therefore allows for a better comparison of different candidates, considering another performance dimension. One disadvantage is that so-called sleeping beauties, i.e. papers which escaped the notice of the scientific community for several years and became heavily cited later, do not leave any trace in the evolution of ht at all. But this is only an example of the observation (attributed to Albert Einstein) that 'not everything that counts can be counted. And not everything that can be counted counts' [52].

One purpose of the above discussion was to explain the difficulties in obtaining a correct dataset for analysis. There are a lot of pitfalls which are likely to corrupt the database and thus the analysis. And well before the advent of the h-index, some methodological problems in ranking scientists by citation analysis already led to the conclusion that 'the necessary caution can not in principle be exercised by science administrators' [53]. 'It is worth emphasizing that administrators and PRTs (faculty promotion, renewal, tenure committees) can be disturbingly naive about the nuances and shortcomings of citation-based metrics' [31]. But also when bibliometric measures are evaluated by peers, it is at least necessary that they become aware of all the difficulties in establishing the database and the problems in evaluating it. In this context it is rather obvious that different citation cultures can be observed and need to be compensated by some kind of field normalization. The same applies to different age cohorts. The PP10% indicator appears to be a good solution to these problems. With this approach a relatively fair evaluation could lead to a rather well-deserved judgment. Nevertheless, 'uncertainties make the concerted use of citation analysis and peer evaluation inevitable' [53]. In other words, one-dimensional indicators are not sufficient [8], and one should always remember that 'a good decision is based on knowledge not on numbers', as the Greek philosopher Plato knew already 2400 years ago [54].

Footnotes

  • Dedicated to Wolfgang P Schleich on the occasion of his 60th birthday.
