
Using residual heat maps to visualise Benford's multi-digit law


Published 30 November 2021 © 2021 The Author(s). Published on behalf of the European Physical Society by IOP Publishing Ltd
Citation: Benjamin Hull et al 2022 Eur. J. Phys. 43 015803. DOI: 10.1088/1361-6404/ac3671


Abstract

It has been known for more than a century that, counter to one's intuition, the frequency of occurrence of the first significant digit in a very large number of numerical data sets is nonuniformly distributed. This result is encapsulated in Benford's law, which states that the first (and higher) digits follow a logarithmic distribution. An interesting consequence of the counterintuitive nature of Benford's law is that manipulation of data sets can lead to a change in compliance with the expected distribution—an insight that is exploited in forensic accountancy and financial fraud detection. In this investigation we have applied a Benford analysis to the distribution of price paid data for house prices in England and Wales pre- and post-2014. A residual heat map analysis offers a visually attractive method for identifying interesting features, and two distinct patterns of human intervention are revealed: (i) selling property at values just beneath a tax threshold, and (ii) psychological pricing, with a particular bias for the final digit to be 0 or 5. There was a change in legislation in 2014 to soften tax thresholds, and the influence of this change on house price paid data was clearly evident.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

1.1. Benford's first digit law

Benford's first-digit law—also known as the Newcomb–Benford law, or the law of anomalous numbers—describes the counterintuitive phenomenon that the probability of the digits 1, 2, ..., 9 occurring in the first index of a number drawn from many real-world data sets is not uniform [1]. Newcomb gave the first mathematical statement of the first-digit law after observing that the first pages of logarithmic tables wear out faster than the last ones [2]. Over half a century later, Benford analysed 20 datasets, including the values of physical constants and population data, for conformity with Newcomb's findings. In addition to first-digit analysis, Benford's law can be extended to the second index, or any combination of indexes, in a number, such as the first-second digit Benford law [1].

Extensive numerical data sets exist that display a non-uniform distribution of the first digit. Consider, for instance, the Fibonacci numbers, which appear in the growth patterns of sunflowers, pinecones and other plants and flowers [3]. The Fibonacci numbers are more likely to start with one than with any other digit, with nine the least likely digit to appear; indeed, the Fibonacci numbers are said to follow Benford's law [4], which quantifies this property. Numerous data sets that arise in different scientific disciplines have been found—to differing quantitative extents—to conform with Benford's law. Physical phenomena include alpha decay [5]; astrophysics [6]; atomic line strengths [7]; biophysics [8]; extra-solar planets [9]; geophysics [10]; particle physics [11]; quantum critical phenomena [12]; spectroscopy [13] and end-of-chapter exercises in physics textbooks [14]. An extensive database of articles, books and other resources related to Benford's law can be found at [15].

1.2. Benford's law and fraud detection

The counterintuitive nature of Benford's law has an interesting consequence: human intervention in, or manipulation of, data sets can lead to a change in compliance with the first (or multi) digit law. In other words, many real-world data sets conform with Benford's law, except when the data are adjusted. This insight is used in forensic accountancy [16], and as a filter for financial and election fraud [17]. The central idea is that it should be easy to spot financial fraud, as human intervention is unlikely to be conducted in a Benford-compliant way. Obviously, deviations from the expected Benford distribution do not necessarily have a nefarious origin; however, they provide a useful flag for further investigation [16]. Consequently, there has been much interest in using Benford analysis for detecting anomalies in financial datasets [18–20].

In 1988 Carslaw [21] analysed the frequency of second digits occurring in reported earnings and losses for New Zealand based firms, concluding that there is a tendency to overstate earnings and understate losses. This study gave evidence of goal-oriented behaviour in financial statement reporting. These results were reproduced for firms operating in the United States and the United Kingdom by Thomas [22] and Van Caneghem [23], respectively. Tilden and Janes [24] showed that corporate accounting data conform with Benford's law under stable economic conditions. Mehta [25] applied Benford's law, along with other forensic accounting techniques, to the financially fraudulent statements of Toshiba during the 2008–2015 period, concluding that Benford's law was a useful technique for detecting irregularities in reported accounting data in this instance. Previous studies of fraud using Benford's law rarely, if ever, use multi-digit Benford analysis, despite it having been shown that when financial manipulation occurs there is a tendency to round up profits in order to meet company goals [21], precisely the kind of feature that a multi-digit Benford analysis excels at revealing.

1.3. Benford's law and evidence of human intervention

One of the greatest difficulties in looking for evidence of human intervention in financial data sets is the lack of a good control group. In this work we analyse price paid data as an example where a change of behaviour is expected. In England and Wales stamp duty tax is paid when purchasing a house, with different rates applied within certain price bands. There was a significant change in the way the tax was calculated in December 2014 (the higher rate becoming valid only for the excess amount above a threshold, rather than applicable to the whole amount); therefore, the data sets of prices paid for house sales pre and post 2014 are expected to be very different.

The simplest form of Benford's law is a conformity test for the first digit. A correction to the law that takes into account the finite range of the data has been used. The strength, and weakness, of using a Benford filter is its simplicity; it is very easy to apply to a large dataset. However, as noted above, one has to be very careful in interpreting the (lack of) conformity of a data set with Benford's law. In this work we use multi-digit tests, and introduce a finite-range formula. We use heat maps of normalised residuals (the difference between the data set and expectation); this retains the ease of use of Benford's law, while also providing an easily visualised inspection method for evidence of human intervention.

In this work, we apply residual heat map analysis to house price paid data, subject to stamp duty, in England and Wales pre and post 2014, showing clear evidence of a change in the pattern of human intervention. From a university-level teaching perspective we emphasise that the techniques discussed and utilised in this paper are not restricted to financial datasets; indeed, they can be applied to a broad range of data-analysis problems. It can be enlightening for physics students to see techniques they master during their studies (chi-squared analysis; hypothesis compliance testing; quantitative analysis of differences between data and models) applied to a diverse range of real-world phenomena.

The rest of the paper is organised as follows: in section 2 we illustrate the use of the finite-range first-digit law, discuss the multi-digit version, and introduce a finite-range correction to the latter. In section 3 we present an analysis of price paid data, showing in particular the power of the normalised residual heat map to visualise trends in the data set. Finally, in section 4 we present our conclusions.

2. Benford distributions

In this section we look at the mathematical formulae for Benford's law, in both its first-digit and multi-digit forms. The term index is used to refer to a position in a number; at each index there is an integer 0, 1, 2, ..., 9. Indexes begin at the first significant digit in a number. For example, for the integer 6301, the first index D1 references the first position, with a digit of 6; the second index D2 references the second position, with a digit of 3; and similarly for the third and fourth indexes. In this case we would say that D1 = 6, D2 = 3, ..., or (D1, D2, D3, D4) = (6, 3, 0, 1). By assumption D1 is non-zero.
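To make the indexing concrete, here is a minimal Python sketch (the helper name index_digits is ours, not the paper's):

def index_digits(x: int):
    """Return the digits of a positive integer by index, e.g. 6301 -> (6, 3, 0, 1)."""
    return tuple(int(c) for c in str(x))

print(index_digits(6301))  # (6, 3, 0, 1): D1 = 6, D2 = 3, D3 = 0, D4 = 1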

2.1. First-digit distribution

Benford-compliant sets conform with a logarithmic distribution, with each successive digit appearing with a lower frequency in the first index. Newcomb derived the following formula describing the probability that the integer d1 ∈ {1, 2, ..., 9} occurs in the first index D1:

$P(d_1) = \log\left(1 + \frac{1}{d_1}\right).$  (1)

Note that all logarithms in this paper are to base 10, unless otherwise stated. Therefore, according to Benford's law, the first digit of numbers appearing in Benford-compliant sets is more likely to be 1 (about 30%) than 2 (about 18%), and so forth down to 9 (about 5%). There has been much discussion in the literature as to why such a large number of data sets should exhibit this logarithmic behaviour (see footnote 2); however, our interest in this work lies not in the fundamental reason behind the compliance (or not) of a particular dataset—there do not exist a priori criteria to know when a group of data should or should not fulfil the law [26]—but rather in what can be learned by studying changes in compliance.
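Equation (1) is simple to evaluate; the following short Python sketch (our own illustration, not from the original analysis) tabulates the nine first-digit probabilities:

import math

def benford_first_digit(d1: int) -> float:
    """Benford probability of first digit d1, equation (1)."""
    return math.log10(1 + 1 / d1)

for d in range(1, 10):
    print(d, round(benford_first_digit(d), 3))
# 1 0.301, 2 0.176, ..., 9 0.046; the nine probabilities sum to exactly 1.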

2.2. Finite-range version of the first-digit formula

Sambridge, Tkalčić and Arroucau [27] formulated the finite-range version of Benford's first-digit law. Consider datasets in the range $U = [a \times 10^\alpha, b \times 10^\beta]$, with $a, b \in [1, 10)$ and $\alpha, \beta \in \mathbb{N}$, $\beta > \alpha$. This form of Benford's law is useful when the upper and lower bounds of a dataset are known.

The probability of a number in the range U having a first digit d1 ∈ {1, 2, ..., 9} is given by the modified probability distribution function:

$P(d_1) = \frac{1}{\lambda_c}\left[(\beta - \alpha - 1)\log\left(1 + \frac{1}{d_1}\right) + \lambda_a + \lambda_b\right],$  (2)

where,

$\lambda_a = \begin{cases} \log\left(1 + \frac{1}{d_1}\right), & d_1 > a_1 \\ \log\left(\frac{a_1 + 1}{a}\right), & d_1 = a_1 \\ 0, & d_1 < a_1 \end{cases} \qquad \lambda_b = \begin{cases} \log\left(1 + \frac{1}{d_1}\right), & d_1 < b_1 \\ \log\left(\frac{b}{b_1}\right), & d_1 = b_1 \\ 0, & d_1 > b_1 \end{cases} \qquad \lambda_c = (\beta - \alpha) + \log\left(\frac{b}{a}\right).$

Here a1 and b1 are the first digits of a and b, respectively. The terms λa and λb in equation (2) are boundary terms accounting for data in the ranges [a × 10^α, 10^(α+1)] and [10^β, b × 10^β], respectively. The first term in the bracket is Benford's law applied to the range [10^(α+1), 10^β]. λc is a normalisation factor covering the entire finite range, which ensures that all probabilities sum to one. A larger difference between α and β—the dynamic range of the data set—results in a better fit when applied to a Benford-compliant set. Indeed, we recover Benford's law when a = 1, b → 10, or in the limit β − α → ∞.
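For readers who wish to experiment, a direct implementation of equation (2) as reconstructed above might read as follows (a sketch; the function name and the test range are our assumptions):

import math

def finite_range_first_digit(d1, a, alpha, b, beta):
    """First-digit probability on [a*10^alpha, b*10^beta], equation (2)."""
    log = math.log10
    a1, b1 = int(a), int(b)                  # first digits of a and b
    if d1 > a1:
        lam_a = log(1 + 1 / d1)
    elif d1 == a1:
        lam_a = log((d1 + 1) / a)
    else:
        lam_a = 0.0
    if d1 < b1:
        lam_b = log(1 + 1 / d1)
    elif d1 == b1:
        lam_b = log(b / d1)
    else:
        lam_b = 0.0
    lam_c = (beta - alpha) + log(b / a)      # normalisation over the full range
    return ((beta - alpha - 1) * log(1 + 1 / d1) + lam_a + lam_b) / lam_c

# The nine probabilities sum to one for any valid range:
print(sum(finite_range_first_digit(d, 2.5, 0, 7.3, 4) for d in range(1, 10)))  # 1.0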

2.3. Quantification of compliance—normalised residuals

It is easy to glance at a histogram and tell whether the shape matches a Benford curve; indeed, such a simple approach is often adopted in popular accounts of Benford's law and in some of the published literature. However, statistical tests are required to quantify the fit, and perhaps provide insight not apparent to the naked eye (see footnote 3).

There are many possible tests and metrics for conformity of a data set with Benford's law that have been discussed in the literature. Each of these has its own strengths and weaknesses, and unfortunately there is no single metric that has been universally adopted. Our motivation in this work was to find a simple metric that can both be calculated easily and form the basis for a visual representation of the agreement (or not) of a data set with Benford's law; as we shall demonstrate below, this is particularly useful for the multi-digit law. Therefore, the test statistics used in this work are the widely used χ2 and reduced χ2 [28].

The χ2 statistic for discrete data takes the form [28]

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i},$  (3)

where Ei is the occurrence expected by the probability distribution, and Oi is the observed occurrence. In deciding how likely a value for χ2 is to have occurred, the reduced χ2 statistic, ${\chi }_{\nu }^{2}$, is useful:

$\chi_\nu^2 = \frac{\chi^2}{\nu},$  (4)

for data with ν degrees of freedom. A value of ${\chi }_{\nu }^{2}\approx 1$ implies a good match between the parent and sample distribution [28].

Note that the expected value Ei for having a first digit d1 ∈ {1, 2, ..., 9} is the product of the number of data points, N, with the probability from Benford's law, equation (2).

The χ2 statistic can be thought of as the sum of the squares of the normalised residuals. The residual is defined as $O_i - E_i$, i.e. the difference between what we observe and what we expect; a normalised residual is obtained by dividing this value by its error bar, which is $\sqrt{E_i}$ for Poisson counting statistics with discrete data [28]. Therefore, for a data set that is expected to be compliant with Benford's law, we would expect approximately two thirds of the normalised residuals to be less than 1 in magnitude.

We illustrate some of the points made above in figure 1, where we consider one of the data sets from Benford's original paper, and the set of the first 1000 Fibonacci numbers (which are known to follow Benford's law [29]).
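The Fibonacci analysis of figure 1(b) can be reproduced in a few lines of Python (a sketch; we assume ν = 9 − 1 = 8 degrees of freedom, which may differ from the convention used in the paper):

import math
from collections import Counter

fib, a, b = [], 1, 1
for _ in range(1000):                      # first 1000 Fibonacci numbers
    fib.append(a)
    a, b = b, a + b

observed = Counter(int(str(x)[0]) for x in fib)
N, chi2 = len(fib), 0.0
for d in range(1, 10):
    E = N * math.log10(1 + 1 / d)          # expected count from equation (1)
    r = (observed[d] - E) / math.sqrt(E)   # normalised residual
    chi2 += r ** 2                         # equation (3)
print("reduced chi^2:", chi2 / 8)          # equation (4); small, as in figure 1(b)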


Figure 1. Histograms plotting (a) 'atomic wt.' data from Benford's paper [1] with 91 data points and (b) first-digit data from the first 1000 Fibonacci numbers. The expected occurrences, Ei, according to Benford's law are plotted as crosses, with error bars equal to the Poisson noise, $\sqrt{E_i}$. The observed occurrences, Oi, are shown for each digit as bars. Digits for which the data lie within one standard error of the value predicted by the Benford model are highlighted in green. Under the histograms the normalised residuals, $(O_i - E_i)/\sqrt{E_i}$, are plotted. Both data sets evidently display the first-digit phenomenon, with a far higher proportion of occurrences of 1 as the first digit than of any other number. However, for (a) the value of $\chi_\nu^2 = 2.25$ indicates poor conformity of atomic weights with Benford's law, whereas $\chi_\nu^2 = 0.029$ for (b), as the Fibonacci numbers have a log-uniform distribution.


2.4. Multi-digit distributions

The result of equation (1) can be generalised: the non-trivial positive integer $n = \sum_{i=0}^{m} 10^i a_i$ appears at the beginning of a number $N = \sum_{i=0}^{M} 10^i b_i$ with probability

$P(n) = \log\left(1 + \frac{1}{n}\right),$  (5)

where $a_i, b_i \in \{0, 1, \ldots, 9\}$ and $a_m, b_M \neq 0$, with the decimal expansion of n having at most as many terms as that of N, that is, $m \leqslant M$.

A useful case of equation (5) is the first-second digit law which states that the probability of d1 ∈ {1, 2, ..., 9} occurring at the first index D1 and d2 ∈ {0, 1, ..., 9} occurring at the second index D2 is,

$P(d_1, d_2) = \log\left(1 + \frac{1}{10 d_1 + d_2}\right).$  (6)

Note that 10d1 + d2 is the two-digit number with d1 in the first index and d2 in the second. Summing over all possible values of d1 gives the probability of d2 appearing in the second index D2:

$P(d_2) = \sum_{d_1 = 1}^{9} \log\left(1 + \frac{1}{10 d_1 + d_2}\right).$  (7)
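Both quantities are simple to compute; for instance (our own illustration):

import math

def benford_pair(d1: int, d2: int) -> float:
    """First-second digit probability, equation (6)."""
    return math.log10(1 + 1 / (10 * d1 + d2))

def benford_second(d2: int) -> float:
    """Marginal second-digit probability, equation (7)."""
    return sum(benford_pair(d1, d2) for d1 in range(1, 10))

print(round(benford_second(0), 3))  # 0.120: zero is the most likely second digit
print(round(benford_second(9), 3))  # 0.085: nine is the least likely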

2.5. Finite-range version of the multi-digit formula

Here we present a finite-range version of Benford's multi-digit formula; the derivation of these formulae can be found in appendix A. Consider a Benford set B in the range $[a \times 10^\alpha, b \times 10^\beta]$, with integers $\beta > \alpha$ and $a, b \in [1, 10)$. Then the probability of the n-digit integer $D = \sum_{i=1}^{n} d_i \times 10^{n-i} = d_1 \times 10^{n-1} + d_2 \times 10^{n-2} + \cdots + d_n$ appearing at the beginning of an element $x \in B$, $P_D(n)$, is given by

$P_D(n) = \frac{1}{\lambda_c}\left[(\beta - \alpha - 1)\log\left(1 + \frac{1}{D}\right) + \lambda_a + \lambda_b\right],$  (8)

where

$\lambda_a = \begin{cases} \log\left(1 + \frac{1}{D}\right), & D > a_n \\ \log\left(\frac{D + 1}{a \times 10^{n-1}}\right), & D = a_n \\ 0, & D < a_n \end{cases}$  (9)

and

$\lambda_b = \begin{cases} \log\left(1 + \frac{1}{D}\right), & D < b_n \\ \log\left(\frac{b \times 10^{n-1}}{D}\right), & D = b_n \\ 0, & D > b_n \end{cases}$  (10)

$\lambda_c = (\beta - \alpha) + \log\left(\frac{b}{a}\right).$  (11)

Here an and bn are the first n digits of a and b, respectively; if a or b has fewer than n digits, all subsequent digits are taken to be zero. Note that di ∈ {0, 1, ..., 9} and d1 ∈ {1, 2, ..., 9}.
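A sketch implementing equations (8)–(11) as reconstructed above (function and variable names are ours):

import math

def finite_range_multi_digit(D, a, alpha, b, beta):
    """Probability of the leading n-digit string D on [a*10^alpha, b*10^beta]."""
    log = math.log10
    n = len(str(D))
    a_n = int(a * 10 ** (n - 1))   # first n digits of a, zero-padded if needed
    b_n = int(b * 10 ** (n - 1))   # first n digits of b
    if D > a_n:
        lam_a = log(1 + 1 / D)
    elif D == a_n:
        lam_a = log((D + 1) / (a * 10 ** (n - 1)))
    else:
        lam_a = 0.0
    if D < b_n:
        lam_b = log(1 + 1 / D)
    elif D == b_n:
        lam_b = log(b * 10 ** (n - 1) / D)
    else:
        lam_b = 0.0
    lam_c = (beta - alpha) + log(b / a)
    return ((beta - alpha - 1) * log(1 + 1 / D) + lam_a + lam_b) / lam_c

# All 90 first-second digit probabilities sum to one (range as in figure 2):
print(sum(finite_range_multi_digit(D, 4.3, 0, 4.3, 9) for D in range(10, 100)))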

Figure 2 illustrates the concepts introduced in this section. We use the (a) conventional, equation (5), and (b) finite-range, equation (8), first-second digit Benford's law to analyse a geometric series (which is known to be Benford compliant [29]). We plot the normalised residuals as a heat map for all 90 possible first-second digit combinations. The geometric series analysed had an upper bound of $\approx 4.3 \times 10^9$. A sudden shift from positive residuals to negative residuals appears immediately after the (d1, d2) = (4, 3) point; the finite-range law, on the other hand, does not exhibit a sharp feature in its residuals, indicative of a more uniform conformity.
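The following is a sketch of how such a residual heat map can be generated (our illustration; the growth ratio is an assumption, chosen so that the largest of the 2000 terms is approximately 4.3 × 10^9):

import math
import numpy as np
import matplotlib.pyplot as plt

N, ratio = 2000, 1.01116                     # max term ~ 4.3e9 (assumed ratio)
series = [ratio ** k for k in range(N)]

counts = np.zeros((9, 10))
for x in series:
    s = f"{x:.6e}"                           # e.g. '4.312000e+09'
    counts[int(s[0]) - 1, int(s[2])] += 1    # first and second significant digits

expected = np.array([[N * math.log10(1 + 1 / (10 * d1 + d2))
                      for d2 in range(10)] for d1 in range(1, 10)])
residuals = (counts - expected) / np.sqrt(expected)   # normalised residuals

plt.imshow(residuals, cmap="coolwarm", vmin=-6, vmax=6)
plt.xlabel("second digit $d_2$")
plt.ylabel("first digit $d_1$")
plt.yticks(range(9), [str(d) for d in range(1, 10)])
plt.colorbar(label="normalised residual")
plt.show()

Swapping the expected counts for those of the finite-range law, as in the sketch above, produces the panel (b) analogue.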


Figure 2. First-second digit normalised residuals of a geometric series with 2000 data points compared to (a) the regular first-second digit Benford's law and (b) the finite-range first-second digit Benford's law. The geometric series' upper bound is $\approx 4.3 \times 10^9$, and a sudden shift from positive to negative residuals appears just after the (d1, d2) = (4, 3) point; the finite-range law, on the other hand, does not feature a sharp turning point in its residuals.


3. House price data

We now move on to apply the ideas and formulae developed above to a specific worked example—house price paid data from England and Wales [30]. We shall consider the data in two distinct sets, pre and post 2014, for reasons outlined below.

3.1. Stamp duty tax

A notable factor that may influence house price values is stamp duty tax—a levy that applies to property transactions in England and Wales and is payable to HM Revenue and Customs. The tax rate depends on the price of the property, i.e. the tax bracket into which the property falls. The tax rates and brackets pre and post December 2014 are outlined in table 1.

Table 1. Threshold values and stamp duty tax rates for residential properties valued under one million pounds sold in England and Wales [31]. Pre December 2014 the rate was payable on the total value of the property (the firm rate) whereas post December 2014 the rate was payable only on the portion of the value of the property in each tax bracket (the relaxed rate). Note that some of the tax brackets change between the two rates.

Tax rate after Dec. 2014 (relaxed)              Tax rate pre Dec. 2014 (firm)
Tax bracket             Tax rate (%)            Tax bracket             Tax rate (%)
£0–£125 000             0                       £0–£125 000             0 a
£125 001–£250 000       2                       £125 001–£250 000       2 a
£250 001–£925 000       5                       £250 001–£500 000       3 a
£925 001–£1.0m          10                      £500 001–£1.0m          4 a

a Payable on the total property price once the limit is breached.

3.2. Psychological pricing

It has been noted previously [32–37] that human intervention in setting commodity prices leads to certain preferred values—this is at odds with Benford's law, and we therefore expect our Benford multi-digit filter to reveal this psychological pricing behaviour. There is a tendency to price houses at given reference points. Namely, we observe that property prices tend to be set with zero or five in the second and third digits; i.e. a property is more likely to be priced at £250k or £255k than at £252 500. This is similar to the cognitive reference points in commodity pricing described by Carslaw [21], in that there is a psychological tendency for properties to be valued at these price points. We shall also demonstrate that there is a tendency to price houses just below psychological price points, in agreement with the pricing behaviour described by Rosch [38]; i.e. there is a human perception that £99k is significantly cheaper than £100k.

3.3. Price paid data pre and post 2014

Figure 3 shows the results of testing for conformity between Benford's law and house price paid data for 2013 and 2014. Subfigures (a) and (b) show the first and second digit tests, respectively. Each of these tests indicates nonconformity with a Benford distribution at the 0.01 significance level according to the χ2 test statistics. This conclusion comes as no surprise, as it is evident that all the first and second digits deviate from expectation by more than one normalised error. For the first digit test, the distribution is skewed towards one and away from nine, while the second digit test shows that five is the most likely digit to appear in the second index.


Figure 3. Subfigures (a) and (b) show the expected Benford occurrence in the first and second indexes, respectively, and the observed occurrence for price paid data for houses in England and Wales in the years 2013 and 2014, with 4.1 × 10^5 data points. Subfigures (c) and (d) are heat maps showing the difference between the observed and expected Benford occurrence for the first-second and second-third index pairs, respectively. The heat maps are normalised with respect to the standard error. Red colouration corresponds to a higher than expected occurrence and blue a lower than expected occurrence. In (a) we note a clear tendency for properties to be priced with one or two in the first index. Moreover, properties tend to be valued with five in the second digit, which can be seen in the middle column corresponding to a second digit of five in (c) and in the normalised residual at the digit five in (b). (d) shows that properties are priced most frequently with zero or five in the third index, as evidenced by the orange columns corresponding to third digits of zero and five. The most common second-third digit pairs are (0, 0) and (5, 0).


Similarly, figure 4 shows a Benford analysis for price paid data from 2015 and 2016. Again, subfigures (a) and (b) show the first and second digit tests, respectively. As before, we observe nonconformity at the 0.01 significance level. All data points deviate from expectation by more than one normalised residual for both the first and second digit tests. The distribution of first digits is skewed towards one, and five deviates the most from the expected occurrence of second digits.


Figure 4. Subfigures (a) and (b) show the expected Benford occurrence in the first and second indexes, respectively, and the observed occurrence for price paid data for houses in England and Wales in 2015 and 2016, with 5.0 × 105 data points. Subfigures (c) and (d) are heat maps showing the difference between the observed and expected Benford occurrence for the first-second and second-third index pairs, respectively. The heat maps are normalised with respect to the standard error. Red colouration corresponds to a higher than expected occurrence, and blue a lower than expected occurrence. Similar to figure 3, in (a) there is clear evidence suggesting houses are priced with one or two in the first index, as indicated by a high concentration of red values in the first two normalised residuals. Once again, houses tend to be valued with five in the second digit, which can be seen in the middle column corresponding to a second digit of five in (c) and the normalised residual at the digit five in (b). As in 2013, (d) shows that houses are frequently priced with zero or five in the third index, as evidenced by the red columns corresponding to third digits of zero and five. Again the most common second-third digit pairs are (0, 0) and (5, 0), emphasising the financial and psychological factors affecting property valuations.


From a purely traditional Benford analysis standpoint, it is difficult to distinguish the two distributions. Both are highly nonconforming and, upon casual inspection, appear to show the same trends. However, the heat maps in subfigures (c) and (d) in figures 3 and 4 reveal subtle variations between the two.
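The digit-pair occurrence matrices behind these heat maps can be assembled directly from the price column (a sketch with toy prices; the real analysis uses the full price paid CSV [30]):

import numpy as np

def digit_pair_counts(prices, i, j):
    """Occurrence matrix for the (i-th, j-th) index pair (1-based, i < j)."""
    counts = np.zeros((10, 10))
    for p in prices:
        s = str(int(p))
        if len(s) >= j:                        # skip prices with too few digits
            counts[int(s[i - 1]), int(s[j - 1])] += 1
    return counts

prices = [249_950, 250_000, 125_000, 182_500]  # toy data, not the real PPD
print(digit_pair_counts(prices, 2, 3))         # second-third digit pairs

Dividing the difference between such a matrix and the expected Benford counts by the Poisson error, as in the sketch of section 2.5, yields the normalised-residual heat maps shown in subfigures (c) and (d).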

Subfigures (c) show the normalised deviation of the first and second digits from a Benford distribution. For the 2013 data, shown in figure 3(c), we expect properties to be valued just below the tax threshold values shown in table 1. Indeed, there are spikes at the first-second digit pairs (D1, D2) = (1, 2) and (2, 4), which correspond to valuations just below the threshold values of £125 001 and £250 001. We observe a similar, yet less pronounced, result in figure 4(c). This is most likely due to the relaxed tax rates, as there is less motivation to fix prices below threshold values (see table 1).

Not all the first-second digit values (1, 2) occur in the range £120k–£125k. Indeed, for the 2013 data, 52.4% of all data points in the range £120k–£130k lie in the range £120k–£125k, compared with 48.5% for the 2015 data. This suggests that there is more of a tendency to value properties under the threshold value for valuations made under the firm tax rates compared to the relaxed rates.

We observe pricing at the reference points £120k and £125k in the range £120k–£125k. Indeed, 17.7% and 18.4% of properties in this price range are priced at £120k in 2013 and 2015, respectively. Moreover, 31.0% and 23.4% are priced at £125k in 2013 and 2015, respectively, showing a clear tendency for valuations to be made with zero or five in the third digit. This psychological bias is shown in figures 3(d) and 4(d), which show the normalised deviation of second and third digit occurrences from Benford's law. Indeed, the second-third digit pairs (5, 0) and (0, 0) display the most deviation, showing the significance of this property-pricing factor.

There is also a tendency to price houses just below psychological price points, in agreement with the pricing behaviour described by Rosch [38]. In particular, in figures 3(d) and 4(d) we see spikes at the second-third digit pairs (4, 9) and (9, 9). Such price points correspond to prices set just below the psychological prices with five and zero in the second digit. This is done to give the impression of a price significantly lower than the property's actual value. That is, a house priced at £249k appears to be significantly cheaper than if it were priced at £250k. This is a well-known technique in sales and is used when marketing properties.

Figure 5 shows histograms of the distribution of price paid data in the years (a) 2013 and (b) 2015. The bins begin at £1 with a width of ten thousand pounds, except around tax thresholds, where they have a width of five thousand pounds. Such a choice allows us to examine properties priced at £125k and £250k.
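One possible realisation of this binning (a sketch; the exact edges used in the paper may differ):

import numpy as np

prices = [123_456, 249_950, 250_000, 182_500]  # toy data, not the real PPD
edges = []
for lo in range(1, 1_000_002, 10_000):         # £10k-wide bins starting at £1
    edges.append(lo)
    if lo in (120_001, 240_001):               # decades just below the thresholds
        edges.append(lo + 5_000)               # split into £5k-wide halves
counts, _ = np.histogram(prices, bins=edges)
print(counts.sum())                            # 4: every toy price falls in a bin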


Figure 5. Histograms showing the distribution of price paid data for houses in England and Wales in the years 2013 (3.8 × 10^5 data points) and 2015 (4.6 × 10^5 data points). Bars shown in red represent prices near the tax threshold values £125 001 and £250 001. The legend shows the mean $\bar{x}$ and standard deviation σ of house prices, as well as the number of house prices analysed, N. There is a clear pattern of houses being priced below these tax thresholds, corresponding to spikes in the histograms just below these values and deficits immediately after. This trend is most pronounced in subfigure (a), under the firm tax rate in 2013, suggesting valuations are made with tax minimisation taken into consideration. Similar features hold in 2015 under the relaxed tax rate; however, they are less pronounced.


In 2013 there was a clear pattern of pricing properties to avoid higher taxation, particularly at the £250 001 threshold, above which a 3% tax rate was incurred on the entire property price. Figure 3(c) shows the same features: there is a higher count of the first-second digits (1, 2) than expected, corresponding to the £125 001 threshold, and an increased count of (2, 4), corresponding to the £250 001 threshold. Similar, yet less prominent, features hold in 2015, primarily due to the relaxed tax rates. Once again figure 4(c) displays the same trends, with spikes in the first-second digits (1, 2) and (2, 4); however, these features are less significant when compared to figure 3(c).

We note that, with the exception of the behaviour at the tax thresholds, the histogram appears to follow a smooth distribution. We emphasise again that the exact form of this distribution is neither particularly interesting nor relevant for this investigation; it is the pattern of deviations from the smooth distribution that is analysed here.

Although there is a clear trend of valuing properties below tax threshold values, such valuations may also be influenced by psychological bias. That is, properties may be valued at £125k or £250k since there is the aforementioned bias towards pricing properties with a zero or five in the third index. Although there is a clear motivation to set prices at these values, maximising the profit made on the sale while minimising the tax paid, it is unclear whether this is done deliberately in every case or whether the behaviour can instead be attributed to unconscious bias; there is no mechanism for checking this within the scope of this investigation.

The results of the first-digit analyses for both the 2013 and 2015 price paid data can be explained using figure 5. For example, in 2015, 62% of data points lie in the range £100k–£299k, as inspection of figure 5 confirms. Thus we would expect deviations from Benford's law due to this concentration of data points close to the mean (£212k). A similar argument holds for the 2013 price paid data.

4. Conclusions

In this investigation we have applied a Benford analysis to the distribution of price paid data for house prices in England and Wales pre- and post-2014. Neither distribution conforms with Benford's law, and on face value the two appear to display roughly the same trends. Although a traditional Benford analysis reveals this nonconformity, it offers little insight into why this should be the case. A residual heat map analysis for both first-second and second-third digits allowed further insight to be gained. Two examples of human intervention in price paid data were revealed: (i) selling property at values just below a tax threshold, to reduce the amount of tax paid; and (ii) psychological pricing, with a particular bias for the final digit to be 0 or 5. There was a change in legislation in 2014 to soften the presence of tax thresholds, and the influence of this change on house price paid data was clearly evident.

We note that income tax in the UK also has different rates within different brackets. However, it is significantly easier to obtain the data set for price paid data for property than it is for income tax, as individual tax returns are subject to privacy and confidentiality issues [39]. Nevertheless, a multi-digit heat map Benford analysis of income tax data could be an interesting area of future investigation.

The techniques utilised in this work are simple to apply and efficient to use. The residual heat map analysis in particular offers a visually attractive method for identifying interesting features in a large data set, such as human intervention. However, we finish with a note of caution: while these techniques provide a fast method of analysing massive data sets for anomalous behaviour, as Goodman points out in the context of financial fraud detection [40], Benford's law is not a hypothesis test, and no individual should be accused of fraud based solely on evidence from Benford analysis.

Acknowledgments

We thank Steven Wrathmall for useful discussions and advice on the manuscript, and Martin Ward and Charlotte Wojcik for suggesting the topic of house price stamp duty as a rich source of data for a Benford analysis.

Appendix A. Finite-range law proof

We will assume the data have a log-uniform distribution, or equivalently a probability density $P(x) \propto 1/x$; this ensures that the underlying set is Benford [27]. Write $a_n$ as $a_n = \sum_{i=1}^{n} a_i \times 10^{n-i} = a_1 \times 10^{n-1} + a_2 \times 10^{n-2} + \cdots + a_n$, and similarly for $b_n$.

We will consider three subsets of B defined by L := [a × 10α , 10α+1], M := [10α+1, 10β ] and R := [10β , b × 10β ]. Note that M can be empty if β = α + 1 but one of L and R will be non-empty.

Firstly, we can determine the normalisation constant, λc, by integrating the probability density $P(x) = x^{-1}$ over the entire range of the set:

$\int_{a \times 10^\alpha}^{b \times 10^\beta} \frac{\mathrm{d}x}{x} = \ln\left(\frac{b \times 10^\beta}{a \times 10^\alpha}\right) = \ln(10)\left[(\beta - \alpha) + \log\left(\frac{b}{a}\right)\right] = \ln(10)\,\lambda_c.$

M = [10^(α+1), 10^β]: define $M^{\alpha+i} := [10^{\alpha+i}, 10^{\alpha+i+1}]$ and the set $M_D^{\alpha+i}$ as the subset of $M^{\alpha+i}$ with D as the first n digits. Then,

$M_D^{\alpha+i} = [D \times 10^{\alpha+i+1-n},\ (D+1) \times 10^{\alpha+i+1-n}).$

The integral of the density function over all $M_D = \bigcup_{i=1}^{\beta-\alpha-1} M_D^{\alpha+i}$ is

$\int_{M_D} \frac{\mathrm{d}x}{x} = \sum_{i=1}^{\beta-\alpha-1} \ln\left(\frac{D+1}{D}\right) = \ln(10)\,(\beta - \alpha - 1)\log\left(1 + \frac{1}{D}\right).$

L := [a × 10^α, 10^(α+1)]: the set of elements in L with D as the first n digits is given by

$L_D = [\max(a \times 10^\alpha,\ D \times 10^{\alpha+1-n}),\ (D+1) \times 10^{\alpha+1-n})$ for $D \geqslant a_n$, and $L_D = \emptyset$ for $D < a_n$.

We then integrate the density function over this set to obtain the coefficient λa:

$\int_{L_D} \frac{\mathrm{d}x}{x} = \ln(10)\,\lambda_a,$

with λa as given in equation (9).

R := [10^β, b × 10^β]: the set of elements in R with D as the first n digits is given by

$R_D = [D \times 10^{\beta+1-n},\ \min(b \times 10^\beta,\ (D+1) \times 10^{\beta+1-n}))$ for $D \leqslant b_n$, and $R_D = \emptyset$ for $D > b_n$.

We then integrate the density function over this set to obtain the coefficient λb:

$\int_{R_D} \frac{\mathrm{d}x}{x} = \ln(10)\,\lambda_b,$

with λb as given in equation (10).

Summing each of these integrals over L, M and R and dividing by the normalisation constant, λc , gives the probability of finding an element in the Benford set in the range [a × 10α , b × 10β ] with the first n digits being D.

$P_D(n) = \frac{1}{\lambda_c}\left[(\beta - \alpha - 1)\log\left(1 + \frac{1}{D}\right) + \lambda_a + \lambda_b\right].$  (12)

Note that the common factor of ln(10) in each of the integrals above cancels against the ln(10) contained in λc, which gives the desired form, equation (8).
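The result can be checked numerically by sampling a log-uniform set and comparing digit frequencies with equation (12) (our sketch; the range parameters are arbitrary, and finite_range_multi_digit is the function from the sketch in section 2.5):

import math
import random

a, alpha, b, beta = 2.5, 0, 7.3, 6
lo, hi = math.log10(a) + alpha, math.log10(b) + beta
sample = [10 ** random.uniform(lo, hi) for _ in range(200_000)]  # Benford set

D = 37                                   # leading first-second digit string
target = str(D)
hits = sum(f"{x:.6e}".replace(".", "")[:len(target)] == target for x in sample)
print(hits / len(sample))                              # empirical frequency
print(finite_range_multi_digit(D, a, alpha, b, beta))  # equation (12)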

Footnotes

2 We would like to thank S J Clark for his insight that 'life is logarithmic'.

3 Note that Benford's original paper did not use goodness-of-fit parameters to quantify conformity. Instead, Benford speculated on the 'probable error' on the data collected.
