Information retrieval and structural complexity of legal trees

We introduce a model for the retrieval of information hidden in legal texts. These are typically organised in a hierarchical (tree) structure, which a reader interested in a given provision needs to explore down to the ‘deepest’ level (articles, clauses, …). We assess the structural complexity of legal trees by computing the mean first-passage time a random reader takes to retrieve information planted in the leaves. The reader is assumed to skim through the content of a legal text based on their interests/keywords, and be drawn towards the sought information based on keywords affinity, i.e. how well the Chapters/Section headers of the hierarchy seem to match the informational content of the leaves. Using randomly generated keyword patterns, we investigate the effect of two main features of the text—the horizontal and vertical coherence—on the searching time, and consider ways to validate our results using real legal texts. We obtain numerical and analytical results, the latter based on a mean-field approximation on the level of patterns, which lead to an explicit expression for the complexity of legal trees as a function of the structural parameters of the model.


I. INTRODUCTION
What is the maximum number of tenants that a UK landlord may let a property of a given size to without incurring penalties? Faced with a legal question of this nature, a layperson would naturally resort to consulting the institutional repository www.legislation.gov.uk, where a quick keyword search (say, "overcrowding") would return, as the most recent reference, the Housing Act 1985. The requested information will eventually be found in articles 325 et sqq., which can be located by following the path "UK Public General Acts", "Housing Act 1985", "Part X: Overcrowding", "Definition of Overcrowding", "324 Definition of Overcrowding", through several part and section headers of the act.
Arguably, a definition of how "complex" a piece of legislation is should reflect how fast and reliably information hidden in its text can be retrieved. The concept of "legal complexity", and quantitative measures thereof in one of its very many incarnations, has been considered by legal scholars and, to a lesser extent, by scientists in recent years (see next section), so far without reaching a satisfactory and widely accepted consensus on the best framework to use. In this paper, we develop a realistic model for the search process of information hidden in a legal text, organised in a hierarchical fashion, by a "typical" reader who needs to extract a precise answer out of a potentially messy structure of semantic dependencies.
We consider a tree structure that mimics the organisation of a typical act of Parliament, with primary focus on the usual structure of legislative bills in the United Kingdom. To standardise labels, we always use the capitalised terms "Act", "Part", "Chapter", "Section", "Article", and "Paragraph" for ease of reference. If applicable, the list can be extended by adding "sub"-items, e.g. Sub-Sections between Sections and Articles. For the sake of clarity, references to items of this manuscript will be made in lower case and abbreviated (e.g. "sec."). Each item in the hierarchy is identified by a header or set of keywords, which ideally reflect the general content of the corresponding sub-tree: for example, the tree in fig. 1 represents a selection of the Housing Act 1985, with some of the nodes labelled by their textual content. Higher up in the hierarchy, the textual content mainly consists of short titles, containing only a small number of keywords (e.g. "Housing" and "Occupation"), while lower nodes often contain full sentences. The Chapter "Definition of Overcrowding" of Part X "Overcrowding" in fig. 1 contains the text [1]:

324 Definition of overcrowding.
A dwelling is overcrowded for the purposes of this Part when the number of persons sleeping in the dwelling is such as to contravene-
(a) the standard specified in section 325 (the room standard), or
(b) the standard specified in section 326 (the space standard).
Based on the headers/keywords, and assuming they reflect the content underneath, the reader will be more or less inclined to follow a certain path rather than another in their search for a piece of information, planted in one of the leaves of the tree. For the sake of simplicity, we do not consider more complicated network structures with cycles and long-range connections, determined for instance by cross-references or internal amendments.
We characterise the complexity of a legal tree in terms of the time a random reader takes on average to reach the sought information by hopping through the nodes of the tree, guided by the headers' keywords. The hopping probabilities will reflect the search strategy of the reader and will be defined in sec. III. The observable we will therefore focus on, and give analytical estimates of, is the mean first-passage time (MFPT) to reach the target starting from the root. Leveraging the MFPTs of random readers of "legal trees", we will be able to formulate a closed-form expression for their structural complexity in terms of the network parameters, and draw some real-life policy implications for the drafting of legal texts.
To put our work into context, we review the related literature on the complexity of legal systems in sec. II. In sec. III we provide the main definitions of our model for a "random reader" hopping through sections of a legal tree according to suitably defined probabilistic rules. Sec. IV covers the statistical properties of this model, with relevant aspects of the theory of MFPTs reviewed in sec. V. Our main results are shown in sec. VI: these are based on a mean-field, or rather "mean-text", approximation for the MFPT between the starting point and a target node of our random reader model. This approximation serves as the definition of our "complexity function" for legal trees. We compare analytical and numerical results in sec. VII, and summarise our findings in sec. VIII. Technical derivations of our results are shown in appendices A and D.

II. RELATED LITERATURE
In this section, we present a non-exhaustive overview of the literature related to our work, which lies at the intersection between legal network science, legal complexity, and search models.

A. Legal Network Science
In the seminal works [2][3][4], the authors apply tools from the area of complex networks to examine the US legal system quantitatively, in particular the structure and content of the US Code and the US Supreme Court citation network. Recent studies focus on the time-evolution of legal texts in the US and Germany, using a clustering algorithm [5]; similarly, [6] correlates changes in Korean constitutional law to societal changes. Several references within [5] treat the time-evolution of national and supra-national legal corpora from a network perspective.
Several authors have noted important analogies between software and legal systems; for instance, [7] analyses the US Code in terms borrowed from software engineering. More recently, [8] has expanded on the analogy, with a focus on symptoms and markers (called smells) of software that is likely to become problematic in the future, e.g. through long reference trees or duplicate phrases.
Further studies base their analysis on topic modelling, a family of machine learning algorithms that extract "topics" from a given text (first introduced in [9]). For instance, [10] presents case studies of the network of SCOTUS opinions, tracking proxies for the change in topic proportions over time. The conceptual bridge to law search is built in [11,12], by connecting law search to the prediction of the relevance of certain documents based on their content and topological properties within the citation network.

B. Legal Complexity
Some works take a rather system-wide approach, discussing complexity as it relates to the connection between the world, the lawmaker, the legal practitioner, and the citizen that is subject to legal rules. Much of the work in this area is descriptive, such as the influential articles [13][14][15], relating the institutional challenges and "locations" of uncertainty in the system. Some authors, e.g. [16,17], make an attempt at general quantification, for instance via the entropy associated with the uncertainty of a legal rule, or by drawing connections between network properties and standards of legal practice [18].

C. (Randomised) Search Models
The searching behaviour of agents in a diverse range of systems has been the subject of intensive study for many decades. Following similar developments in economics [19,20] and biology [21,22], for instance, the understanding of law searches has been gathering traction recently [11,12,23].
In these works, searching is framed as an optimal stopping problem: upon sampling a resource (information, trading opportunities, food) a number of times in one location, how does the searcher decide when to stop and change location? Our approach will be different in that our searcher has sufficient information to know exactly when to stop. The modelling of such problems in terms of random walks (on networks or grids, say) has proven successful in many cases; see for instance [24] and references therein for search strategies involving resets of the random walker to its starting position. Another similar class of problems concerns moving and hiding targets and optimal strategies for a random-walk searcher [25].
In [11,12,23] the approach is based on a joint empirical analysis of the structure and contents of legal text networks, by means of network-based topic modelling, a method largely developed in [9,26,27] to extract a set of topics given a sufficiently large body of text, and to assign one or more topic labels to each section. For instance, this analysis can be based on the movements of a random walker in the textual landscape, studying in which regions the walker tends to sojourn for longer periods of time, as well as the overlap between such regions. The authors demonstrate that this analysis may be useful to predict citations in US legal opinions and statutory law. Moreover, the authors propose a law-search model based on their findings on link prediction, and compare it with human law-searchers. Further references in this area can be found in [12,23].
Other lines of research have been more interested in the particular behaviour of an information-seeker. Important foundations of this field are laid out in [28], leading to further studies in various contexts. In particular, [29] (and references therein) examines the information-seeking behaviour of legal professionals, which relies markedly on informal sources instead of primary literature. Comprehensive reviews of the area are given in [30] and [31].

III. DEFINITION OF THE MODEL
We consider a finite tree of N nodes, in which every node stands for an "item" in the law as described in sec. I, and two nodes are connected as parent and daughter if one contains the other (e.g. a Section within a Chapter, or an Article within a Section). There are no limitations on the exact shape of the tree. For simplicity of the analysis, however, we consider c-ary trees, i.e. trees in which a designated root, r, has degree c and every other node is either a leaf or has degree c + 1.
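As an illustration, such c-ary trees can be generated programmatically; the following sketch (the function name and representation are ours, not part of the model) builds the adjacency structure used in the numerical experiments described later, with N = (c^(h+1) - 1)/(c - 1) nodes for height h.

```python
from collections import defaultdict

def cary_tree(c, h):
    """Adjacency list of a rooted c-ary tree of height h.

    Node 0 is the root; the root has degree c and every other
    internal node has degree c + 1, as in the model."""
    adj = defaultdict(list)
    level = [0]          # nodes of the current generation
    next_id = 1
    for _ in range(h):
        new_level = []
        for u in level:
            for _ in range(c):
                adj[u].append(next_id)      # parent -> child
                adj[next_id].append(u)      # child -> parent
                new_level.append(next_id)
                next_id += 1
        level = new_level
    return dict(adj), next_id               # adjacency and N

adj, N = cary_tree(c=3, h=4)                # N = 121 for c = 3, h = 4
```

For the tree used in the simulations (c = 3, h = 4) this yields N = 121 nodes, of which 81 are leaves.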
We model the textual content of every node v in terms of a binary string of length L, which we will refer to as a pattern. We denote patterns as ξ^v = (ξ^v_1, ..., ξ^v_L), where ξ^v_i ∈ {0, 1} encodes the presence (1) or absence (0), in the textual content of node v, of keyword i from a predetermined glossary of L keywords (which is typically defined by the user). A reduced glossary for the example in fig. 1 may be the list {"slum", "demolition", "clearance", "overcrowding", "room", "space", "responsibilities", "occupation", "escape"}, which would lead to the assignment of patterns shown in fig. 2.
We assume that a reader is interested in the information hidden in a particular leaf of the tree, which we call the target t. We model the search process of the reader as a random walker that moves along the links of the tree, starting from the root. We assume that the reader is more likely to step onto a node when the text associated with it has a higher semantic similarity with the sought (target) information. Hence, we assume that when on a node, the walker will step to one of the neighbouring nodes v with a probability that depends on the semantic similarity between node v and the target node, and does not depend on the starting node. The semantic dissimilarity of two nodes is measured as the Hamming distance of their patterns, which equals the number of bits on which the two patterns disagree. We will refer to this distance as the "pattern-distance". It is not to be confused with the "edge-distance" between the corresponding nodes on the graph, defined as the number of edges constituting the shortest path between them.

FIG. 2: Patterns for the (shortened) glossary {"slum", "demolition", "clearance", "overcrowding", "room", "space", "responsibilities", "occupation", "escape"} of length L = 9. Every node is assigned a binary pattern of length L, whose bits represent the presence or absence of the corresponding keyword. The boldface bits stand for keywords specific to the Part in which their node is located. Dots are used to omit some of the all-0 patterns.

As the probability to step onto a node v does not depend on the node the walker is stepping from, we can define, for any link pointing to v, a weight w_v that is higher, the higher the semantic similarity between the patterns in v and t. Using these non-negative weights, we can define a matrix of transition probabilities between two nodes u and v as

W_uv = w_v / Σ_{k ∈ ∂u} w_k,    for v ∈ ∂u,     (3)

where ∂u is the neighbourhood of node u.
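To make the transition rule concrete, here is a minimal sketch in Python. The specific weight function w_v = exp(-β d(ξ^v, ξ^t)) is an illustrative assumption of ours (any non-negative weight decreasing in the pattern-distance to the target fits the description above), and the function names are ours as well.

```python
import numpy as np

def hamming(x, y):
    """Pattern-distance: number of bits on which two patterns disagree."""
    return int(np.sum(x != y))

def transition_matrix(adj, patterns, target, beta=1.0):
    """Random-walk transition matrix on a tree.

    The weight of any edge pointing into node v depends only on the
    pattern-distance between v and the target t; here we use the
    illustrative choice w_v = exp(-beta * d(xi_v, xi_t))."""
    N = len(adj)
    w = np.array([np.exp(-beta * hamming(patterns[v], patterns[target]))
                  for v in range(N)])
    W = np.zeros((N, N))
    for u in range(N):
        nb = adj[u]                      # neighbourhood of u
        W[u, nb] = w[nb] / w[nb].sum()   # normalise over the neighbourhood
    return W
```

By construction every row sums to one, and from any node the walker is biased towards neighbours whose patterns are closer to the target's.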
We will characterise the complexity of the search process in terms of the average number of steps taken until the target is first found. Our object of study will be the dependence of this quantity on the way patterns are assigned to nodes, as well as on the properties of the tree itself.
We will assume a stochastic set-up, where patterns are regarded as quenched random binary vectors, with statistics controlled by two tunable parameters, which we call tightness and overlap, representing the vertical and horizontal coherence of the legal text, respectively.
In particular, we will assume that the components of the root pattern are independent random variables with fixed expected value a. All patterns on the lower levels are generated from their parent node via a Markov process, according to which the entries of the patterns in the child node are mutated at a given rate with respect to those of the parent node.
The tightness τ is defined as a decreasing function of the mutation rate, such that tighter sets of patterns are generated by lower rates of mutation. Additionally, assuming that each Part covers a unique topic with corresponding specific keywords, we define the overlap (denoted 2∆), quantifying the number of keywords that are expected to be shared by two successive Parts.
In later sections we will study the impact of τ and ∆ on the complexity of the defined search process. We will first quantify their role in the statistics of pattern-distances (sec. IV); then we will study their role in the average number of steps taken by the walker to first reach the target node where the information of interest is hidden (secs. VI and VII).
In the remainder of this section we provide details about the Markov process used to generate patterns, which reflect the hierarchy of the tree and have the desired properties of tightness and overlap.
We will denote the vertices with lexicographical labels, such that the c descendants of the root will be denoted by a single index μ_1 ∈ {1, ..., c}; the μ_2-th descendant of the μ_1-th descendant of the root will be denoted by the two indices μ_1, μ_2; and a node v in the k-th generation will be denoted by k indices μ_1, ..., μ_k, where μ_j ∈ {1, ..., c} for all 1 ≤ j ≤ k (see fig. 4 for a schematic representation of the genealogy of patterns). We denote by ξ the textual pattern at the root. For the root pattern, we assume the entries to be drawn from a factorised distribution with

P(ξ_i = 1) = a,    P(ξ_i = 0) = 1 − a,

so for all i = 1, ..., L the expectation is ⟨ξ_i⟩ = a. For the first level of the hierarchy (i.e. Part-level), we assume that the patterns are generated from the root pattern with distribution

P(ξ^μ | ξ) = Π_{i=1}^L R^μ_i(ξ^μ_i | ξ_i).

Here, the R^μ_i(ξ^μ_i | ξ_i), for ξ^μ_i, ξ_i ∈ {0, 1}, form the entries of the 2 × 2 transition matrix of the Markov process generating the Part-pattern entry ξ^μ_i from the root. By the law of total probability, ξ^μ_i obeys the marginal probability

P(ξ^μ_i) = Σ_{ξ_i ∈ {0,1}} R^μ_i(ξ^μ_i | ξ_i) P(ξ_i).

Fixing the entries of R^μ_i in terms of a parameter Γ ∈ [0, 1] that determines the rate of mutation from ξ_i to ξ^μ_i, we take the marginal expectations a^μ_i to satisfy some constraints, motivated by the idea that different Parts treat individual topics. If a given keyword is highly related to the topic of some Part, it will have a high probability to appear in that Part. If each keyword is related to one topic only, it will appear with high probability in the Part treating that topic, and with low probability in all other Parts. This situation is represented in the top panel of fig. 3. However, we allow for a degree of topic similarity between two successive Parts μ and μ + 1, realised by a subset of size 2∆ of keywords that appear with high probability in μ as well as in μ + 1. This situation for 2∆ = 6 is shown in the bottom panel of fig. 3, representing the appearance of topic-specific keywords in the Parts. In particular, we introduce ∆ as a parameter that controls the number of Part-specific keywords shared by neighbouring Parts μ, μ + 1, i.e. as the Part overlap, and set a^μ_i = a_h on the (2∆-enlarged) block of keywords specific to Part μ, and a^μ_i = a_l elsewhere (eq. (9)), with ∆ ∈ [0, (L/c)(c − 1)/2] and periodic boundaries, i.e. a^μ_i = a^μ_{kL+i} for all k ∈ Z (see fig. 3 for a schematic representation). Moreover, we assume 0 < a_l < a_h < 1 (the indices l and h stand for "low" and "high"). However, since the elements of R^μ_i are probabilities, one can derive stricter bounds, valid for all μ = 1, ..., c and j = 1, ..., L. For our purposes it will be useful to parametrise a_l in terms of some β_l satisfying 0 < β_l < a, so as to enforce a_l < a_h. The set of indices i such that a^μ_i = a_h we refer to as the a_h-domain of μ, and the complementary set as the a_l-domain.
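The generative process above can be sketched in a few lines. Since we do not reproduce the exact transition matrices R here, the code uses a simple stand-in mutation rule (our assumption, not the model's precise parametrisation): each bit is copied from the parent with probability 1 − Γ and resampled from the child's marginal a^μ_i with probability Γ, which reproduces the limits of identical patterns at Γ = 0 and independent patterns at Γ = 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def root_pattern(L, a):
    """Root bits are i.i.d. Bernoulli(a), so <xi_i> = a."""
    return (rng.random(L) < a).astype(int)

def mutate(parent, marginals, gamma):
    """Child pattern from its parent: each bit is copied with
    probability 1 - gamma, and resampled from its marginal with
    probability gamma (illustrative stand-in for R and Q)."""
    resample = rng.random(parent.size) < gamma
    fresh = (rng.random(parent.size) < marginals).astype(int)
    return np.where(resample, fresh, parent)

# Part-level marginals: a_h on the Part's keyword block, a_l elsewhere
L, c, a, a_h, a_l = 12, 3, 0.7, 0.9, 0.1
xi_root = root_pattern(L, a)
marg = np.full(L, a_l)
marg[0:L // c] = a_h            # a_h-domain of Part mu = 1
xi_part = mutate(xi_root, marg, gamma=0.3)
```

Iterating `mutate` down the tree, with the marginals held fixed within a branch, produces one realisation of the quenched pattern ensemble.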
We now extend the above prescription to generate patterns at the lower levels of the hierarchy. Consider the pattern ξ^{μ_1...μ_k} on level k, generated from its parent with transition matrices Q^{μ_1...μ_k}_i, so that the marginal probabilities can be written in terms of the marginals of the parent pattern. For simplicity, we assume that the marginal expectation a^{μ_1...μ_k}_i equals that of its Part-level ancestor, i.e. either a_h or a_l according to eq. (9). The parameter Γ ∈ [0, 1] entering Q^{μ_1...μ_k}_i controls the level of noise in the patterns below Part-level. The relation between patterns in terms of the transition-matrix families R and Q is summarised in fig. 4.
FIG. 4: Pattern hierarchy and its relations. The i-th bit of each pattern is generated from the parent bit by a two-state Markov process, the transition matrix of which is indicated next to the edge between the two patterns.

With all model parameters defined, in the next section we study the statistical properties of the distances between patterns sitting on different nodes of the network, as these will determine the kinetics of the random walk and, in particular, the complexity of the search process that the random walk is meant to model.

IV. EXPECTED PATTERN-DISTANCES ALONG AND ACROSS BRANCHES
In this section we provide analytical expressions, in terms of the model control parameters τ and ∆, for the expected values of two classes of pattern-distances, namely between (1) adjacent Part-level patterns ξ^μ and ξ^{μ+1}, and (2) any Part-level pattern ξ^{μ_1} and the leaf patterns of the same Part, ξ^{μ_1...μ_h}. For clarity of presentation, we state here the main results and present their derivations in app. A.

A. Overlap: Distance between neighbouring patterns
Let ξ^μ, ξ^{μ+1} be two child patterns of the root pattern ξ, with marginal expectations as described by eq. (9). With d being the Hamming distance, eq. (1), we are interested in the properties of the Part-level distance d_{μ,μ+1} := d(ξ^μ, ξ^{μ+1}) as we vary ∆. App. A shows that the expectation d(∆) := ⟨d_{μ,μ+1}⟩ of the pattern-distance over the distribution of patterns admits an explicit expression, which is decreasing in ∆ in the physical range of parameters β_l < a identified after eq. (11). Hence, the parameter ∆ controls the "topical overlap" between adjacent Parts, as described in the previous section.

B. Tightness: Distance between ancestor and descendant patterns
Having examined the "horizontal" variation of patterns in the previous section, we now turn to the "vertical" pattern-distance; that is to say, the expected pattern-distance between the "Part" node at the top of a certain branch and any leaf in the same branch. To this end, let ξ^{μ_1} be any Part-level pattern, and let ξ^{μ_1...μ_{h−1}} be the pattern of any leaf in the same branch, i.e. descending from μ_1. Appendix A derives an explicit expression, eq. (18), for the expected pattern-distance ⟨d(ξ^{μ_1}, ξ^{μ_1...μ_{h−1}})⟩. In fig. 6 we compare eq. (18) to the numerical average of d(ξ^{μ_1}, ξ^{μ_1...μ_{h−1}}), observing excellent agreement between the two.
In contrast to the case of ∆, the corresponding pattern-distance does not depend linearly on Γ. However, as it strictly decreases as Γ increases, we see that Γ acts as expected of a mutation rate: the higher the mutation rate, the higher the distance between Part- and leaf-level patterns. Eq. (16) shows that upon setting Γ = 1, the ξ^{μ_1}_i's and the ξ^{μ_1...μ_{h−1}}_i's become independent. This is the state of least tightness, and also the maximum of ⟨d(ξ^{μ_1}, ξ^{μ_1...μ_{h−1}})⟩.

FIG. 6: Numerical average of the pattern-distance on the x-axis, and the predicted expectation value according to eq. (18) on the y-axis.
Similarly, Γ = 0 enforces ⟨d(ξ^{μ_1}, ξ^{μ_1...μ_{h−1}})⟩ = 0, representing the highest tightness. We define the tightness τ as a monotonically decreasing function of Γ, given in eq. (20). As discussed in sec. III, this parameter controls the (expected) similarity between an ancestor-descendant pair of patterns, just as Γ controls the level of mutation from the former to the latter. Tuning τ (as opposed to Γ) allows us to achieve a better resolution in simulations for Γ → 1.

V. BRIEF REVIEW OF MEAN FIRST-PASSAGE TIMES
The mean first-passage time (MFPT) m_ij of a random walker is the expected time the walker takes to first hit node j, having started from node i; the expectation is taken over many random walks on a fixed instance of the graph. MFPTs can be derived as the unique solutions of the recurrence equations

m_ij = W_ij + Σ_{k≠j} W_ik (1 + m_kj),     (21)

where W_ik denotes the transition probability of the walker from node i to node k. The first term in eq. (21) accounts for the walker hopping from i to j directly (which occurs with probability W_ij), while the second term accounts for the walker hopping to any other node k first and starting a first-passage process from there (at the next time step).
Eq. (21) can be rearranged into [32]

m_j = (I_{N−1} − W_j)^{−1} 1_{N−1},     (22)

where I_{N−1} and 1_{N−1} are the unit matrix and the all-one vector of size N − 1, respectively, and m_j is the vector of MFPTs to node j, whose element (m_j)_i corresponds to the MFPT from source node i to j, for i ∈ {1, ..., N} \ {j}. Moreover, in eq. (22), W_j is the matrix obtained by removing the j-th row and column from W. In this paper, eq. (22) will be used to provide a numerical benchmark against which to compare our analytical estimates. Further information on methods and applications of MFPTs can be found in [33] and the excellent review [32] (as well as references in the latter).
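Eq. (22) translates directly into a few lines of linear algebra; the sketch below (function name ours) is the kind of numerical benchmark routine used throughout.

```python
import numpy as np

def mfpt_to(W, j):
    """MFPTs to node j from all other nodes, via eq. (22):
    m_j = (I - W_j)^(-1) 1, with W_j the matrix W with the j-th
    row and column removed."""
    N = W.shape[0]
    keep = [i for i in range(N) if i != j]
    Wj = W[np.ix_(keep, keep)]
    m = np.linalg.solve(np.eye(N - 1) - Wj, np.ones(N - 1))
    return dict(zip(keep, m))

# Unbiased walk on the path 0 - 1 - 2: solving the recurrence by hand
# gives m_02 = 4 and m_12 = 3 steps.
W = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
m = mfpt_to(W, 2)
```

Solving the linear system is exact up to floating-point error, and avoids simulating many random walks.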
For tree-graphs, which we consider in this manuscript, and generally for graphs in which two sets of nodes exist that are connected by a single edge, one can find more explicit formulae for MFPTs. As we have shown elsewhere [34], if we can coarse-grain the network into clusters 0, ..., H that lie on a line, and every cluster K hangs from the one-dimensional chain only via a single node v_K, as shown in fig. 7, the MFPT m_{v_0,v_H} is given by a closed formula, eq. (23), in terms of the stationary probability vector π of the transition matrix W, i.e. the unique solution of the eigenvector equation π^T W = π^T, and of the stationary probabilities Π_K = Σ_{k∈K} π_k of the clusters K. Fig. 8 exemplifies how the clusters can be defined if the network is a tree.
We finish this review with a useful note on how π can be derived explicitly, without solving the eigenvector problem π^T W = π^T directly, when the graph is a tree. Let u be any node, and let t_u be the collection of directed edges pointing towards u. A directed edge (v → w) is pointing towards u if the distance between w and u is less than the distance between v and u. In that case, (v → w) ∈ t_u, and otherwise (w → v) ∈ t_u, noticing that one of the two has to be the case if the full graph is a tree. It can then be shown that π_u is proportional to the product of the weights W_vw over the edges of t_u (see e.g. [35]),

π_u = (1/Z) Π_{(v→w) ∈ t_u} W_vw.     (24)

Here, the normalisation Z is the sum of such products over all nodes of the graph, which ensures that π is normalised.
As an application of eq. (24), consider a random walk on the tree in fig. 9 with a general transition matrix W. To find π, we would have to solve the linear system of six equations π^T W = π^T, which is elementary but laborious. Instead, we can apply eq. (24), which tells us that each entry π_u is proportional to the product of the weights of the directed edges pointing towards u, with the prefactor 1/Z ensuring that π is normalised.
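For a small tree, eq. (24) can be checked against the defining eigenvector equation. The following sketch (our own illustration, on a three-node path rather than the six-node tree of fig. 9) computes π by the edge-product rule and verifies stationarity.

```python
import numpy as np
from collections import deque

def stationary_tree(W, adj):
    """Stationary distribution on a tree via eq. (24): pi_u is
    proportional to the product of W[v, w] over directed edges
    (v -> w) pointing towards u. A BFS rooted at u yields, for
    each v != u, the neighbour par[v] one step closer to u."""
    N = W.shape[0]
    pi = np.empty(N)
    for u in range(N):
        par = {u: None}
        queue = deque([u])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in par:
                    par[y] = x
                    queue.append(y)
        prod = 1.0
        for v in range(N):
            if v != u:
                prod *= W[v, par[v]]   # weight of the edge v -> par[v]
        pi[u] = prod
    return pi / pi.sum()               # division by Z normalises pi

# Path 0 - 1 - 2 with an unbiased walker
adj = {0: [1], 1: [0, 2], 2: [1]}
W = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
pi = stationary_tree(W, adj)           # [0.25, 0.5, 0.25]
```

The result satisfies π^T W = π^T, as required, without ever solving the eigenvector problem.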
FIG. 9: A tree on six vertices to demonstrate the use of eq. (24).

VI. MEAN-FIELD APPROXIMATION OF MFPTs
In this section, we present an expression for the complexity of the Act represented by the model introduced in sec. III. For the definition of the complexity, recall that every assignment of patterns defines a transition matrix W for the random walker via eq. (3). Denoting by m_rt(W) the MFPT from root r to target t given such a transition matrix, we define the complexity as the average of m_rt(W) over the distribution of patterns,

C := ⟨m_rt(W)⟩.     (28)

The quantity C does not depend on any particular realisation of patterns; it reflects the "higher-level" properties encoded in the model parameters and the tree. However, evaluating the expectation in eq. (28) analytically, based on the formulae in eq. (22) or eq. (23), is a formidable task. We avoid this difficulty by introducing the mean-field approximation

C_MF := m_rt(⟨W⟩) ≈ C,     (29)

i.e. we calculate m_rt for the random walker subject to the averaged transition matrix ⟨W⟩, with the average taken over the pattern distribution. Since W is always a stochastic matrix, so is ⟨W⟩, and it does indeed define a random walker on the tree. Eq. (29) is an approximation because m_rt is a non-linear function of W, as can be seen from eq. (22). We dub the left-hand quantity in eq. (29), C_MF, the approximate complexity.
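The gap between C and C_MF is a Jensen-type effect: because m_rt is a non-linear function of the transition probabilities, averaging W before computing the MFPT need not equal averaging the MFPTs themselves. A toy sketch (our own illustration: a root with two leaves, where random edge weights stand in for random patterns) makes the effect visible; in this example m_rt is convex in the root's hopping probability, so the mean-field value systematically underestimates the quenched average, matching the behaviour reported in sec. VII.

```python
import numpy as np

rng = np.random.default_rng(1)

def mfpt(W, r, t):
    """MFPT from r to t via eq. (22)."""
    N = W.shape[0]
    keep = [i for i in range(N) if i != t]
    Wt = W[np.ix_(keep, keep)]
    m = np.linalg.solve(np.eye(N - 1) - Wt, np.ones(N - 1))
    return m[keep.index(r)]

def walk_matrix(weights):
    """Walker on the star with root 0 and leaves 1, 2; from the root
    it steps to leaf v with probability proportional to weights[v]."""
    W = np.zeros((3, 3))
    W[0, 1:] = weights / weights.sum()
    W[1, 0] = W[2, 0] = 1.0
    return W

# Quenched average C = <m_rt(W)> vs mean-field C_MF = m_rt(<W>)
Ws = [walk_matrix(rng.random(2) + 0.1) for _ in range(500)]
C = np.mean([mfpt(W, 0, 2) for W in Ws])
C_mf = mfpt(np.mean(Ws, axis=0), 0, 2)
# Here m_rt = (1 + p) / (1 - p) with p the probability of the wrong
# leaf, which is convex in p, hence C_mf < C by Jensen's inequality.
```

The same computation on the full legal-tree ensemble is what the comparison of C_MF and C in the next sections carries out.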
C_MF is an explicit (though complicated) function of all model parameters, though we are mostly interested in its dependence on ∆ (defined in eq. (9)), τ (defined in eq. (20)) and a (defined in eq. (5)). Furthermore, recall the parameters h and c of the tree itself, being the height of the tree and the number of children of a (non-leaf) node, respectively. The derivation of C_MF is deferred to app. D; in the following we only summarise the results. Due to the cumbersome nature of the expressions, they will also be made available as Python code.
We shall now verify our expression for C_MF by comparing it to a numerical estimate. In these simulations, we fixed a tree with c = 3 and h = 4, as well as the parameters L = 48, a = 0.7, β_l = 0.07 and Γ = 0.3. For each given pair of values for ∆ and τ, we estimate ⟨W⟩ with the following procedure: generate a set of patterns, and subsequently a transition matrix W as described in sec. III; repeat 100 times and take the average of the resulting matrices, denoted ⟨W⟩_emp. The value of m_rt(⟨W⟩_emp), calculated numerically using eq. (22), is our benchmark for C_MF in eq. (30). The two values of C_MF are plotted against each other in fig. 10. We see from the figure that the agreement is excellent, in spite of the fact that the derivation in app. D uses two approximations (eqs. (C6) and (D8)) to obtain explicit expressions for the entries of ⟨W⟩.
In the next section, we proceed by testing the goodness of the approximation eq. (29) numerically.

FIG. 10: C_MF as per eq. (30) vs C_MF calculated using eq. (22) with an empirical transition matrix ⟨W⟩_emp averaged over 100 realisations for each set of parameters. Each datapoint refers to a pair (∆, τ), grouped by colour and symbol according to τ. For all points, we fixed a = 0.7 and Γ = 0.3.

VII. SIMULATIONS AND OBSERVATIONS
This section contains numerical validations of the approximation in eq. (29), obtained by comparing C_MF to C, the latter computed numerically as an average over quenched MFPTs m_rt(W). We then proceed to consider the behaviour of C and C_MF as we vary the model parameters.
For given values of the parameters a, ∆ and τ, we calculate C(a, τ, ∆) by repeating the following steps 500 times: generate a set of patterns, and subsequently the transition matrix W according to eq. (3); record the value m_rt(W) calculated numerically using eq. (22).
The average of these values is approximately (due to the finite size of the sample) equal to C(a, τ, ∆). In these simulations, we fix a tree with c = 3 and h = 4, as well as the parameters L = 48, β_l = 0.07 and Γ = 0. In fig. 11, we directly compare C_MF as per eq. (30) to C in a scatterplot for fixed a.
The standard deviation of m_rt(W) with respect to variations in W is indicated by error bars. The plot confirms, for all parameters considered, that eq. (29) leads to a systematic underestimation of C, while accurately reflecting the correct trend.
Next, we analyse the dependence of C_MF on the different parameters of our model. Fig. 12 plots C_MF and C as functions of a = ⟨ξ_j⟩, the expectation of the root-level bits ξ_j. The dashed lines represent the mean-field approximation C_MF in eq. (30), the symbols the value of C as obtained at the beginning of this section. Fig. 12 confirms that C_MF tracks C faithfully, also for varying a.
For small values of τ, i.e. little vertical coherence between patterns, the complexity C_MF first shows a slight increase as a function of a (approximately in the interval 0 < a < 0.2), before a more pronounced decrease to about 1/3 of its maximum (for a > 0.2). As τ approaches 1, the curves for C_MF first become notably flatter and lower, which is as expected, since higher vertical coherence is more likely to put the reader on the right track faster. This effect is only seen up to τ = 0.8, as too high a coherence (attained as τ → 1) forces all patterns within the same Part to be equal, which does not help the reader navigate at all.

FIG. 12: C_MF (dashed lines) and C (symbols) as functions of a, the latter averaged over 500 realisations of W.
Fig. 13 shows C_MF as dashed lines and C as symbols, as functions of ∆. The panels and the different curves per panel correspond to different values of a and τ, respectively. We make the same observation as above about the systematic underestimation incurred by eq. (29), although, in addition to τ, the offset also seems to decrease as a → 1.
C_MF is largely constant in ∆ for ∆ ≤ 8. Beyond this value, C begins to increase with ∆, except for the lowest tested value a = 0.05; C increases by about a factor of 2 for ∆ > 8.
Strikingly, C_MF and C show a slight decrease up to ∆ ≤ 8 for a = 0.8, which is contrary to our intuition that a higher overlap between adjacent Parts should increase the complexity, as it leads to more initial missteps of the random walker. However, the observed decrease is minor compared to the increase exhibited at higher ∆. The fact that this increase in complexity is less significant for higher τ is again in line with our expectation that a more "vertically coherent" text should be overall easier to navigate. We note that the range of values of C over ∆ is smaller than that over a, shown in fig. 12, showing that ∆ has a lower influence.
We deduce from fig. 13 that, in order to reduce C, ∆ should not be chosen too high.
Further simulations suggest that, depending on the coordination number c of the tree, the complexity can also rise if ∆ is chosen too low. This means that C has a local minimum in ∆, which represents the optimal keyword overlap between adjacent Parts. To conclude the analysis of fig. 14, we summarise that C may be minimised by choosing an appropriately high value of τ, which should however not be too close to 1. This means that one should allow the keywords within one Part to vary slightly, to avoid keyword patterns that are either almost identical or approximately independent.
Figs. 12, 13 and 14 suggest that increasing ∆, or decreasing τ or a, results in a higher rate of mistakes made by the walker, and thus in a higher searching time. Since ∆ has its primary effect on the root (Act) level, it is dominated by τ, which controls noise on all levels (bar the Act level). Since a also affects the values of a_h and a_l, it has an effect on all levels as well; accordingly, we observe that varying a and τ have comparable effects on C, and dominate variations over ∆. From here we conclude that the priority should be on maximising the tightness τ and the keyword frequency a to reduce the complexity of the modelled ensemble of Acts.
This section shows that C MF faithfully reproduces the trends of C for varying τ, ∆ and a. This entails a significant benefit, because it allows us to optimise the parameters of the model with respect to C MF without the need for costly simulations. As a consequence, one can imagine optimising a real legal text by estimating its parameter values within our model and tweaking the text and layout to minimise C MF.
Here we have not considered the effects of the parameters c, h, L and Γ. The former two deserve a word of caution: the number of nodes increases as c^(h+1), and m_rt for the regular random walker as hc^(h+1). Therefore, without appropriate rescaling, the values of C and C MF do not allow for the comparison of graphs of different size.
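The size caveat is easy to make concrete. As a sketch, assume a regular c-ary tree with levels 0 (the Act) through h (the leaves); the node count then grows geometrically in h, which is why C and C MF need rescaling before trees of different (c, h) can be compared. The level convention here is an assumption of this sketch, not necessarily the paper's:

```python
# Node count of a regular c-ary tree with levels 0..h.  The total is
# (c**(h+1) - 1) / (c - 1), i.e. of order c**(h+1), matching the
# growth stated in the text.
def n_nodes(c, h):
    return sum(c**k for k in range(h + 1))

for c, h in [(3, 2), (3, 4), (5, 3)]:
    print(c, h, n_nodes(c, h))
```

Raising h by one multiplies the node count by roughly c, so comparing raw MFPTs across sizes conflates structural complexity with sheer volume.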

VIII. CONCLUSIONS AND OUTLOOK
We presented a quantitative theory of informational complexity of legal trees by analysing a random walker model for the retrieval of information planted in the leaves of a legal tree.
The model assumes that the reader proceeds by keyword affinity, so that they are drawn towards nodes whose content looks similar to the target information. The searched text is generated randomly, with two main parameters controlling its horizontal and vertical coherence. Our analysis and numerical simulations show that these properties of the text have the desired effect on the random walker: with high vertical coherence, the content of the leaves is well reflected in the top items (Parts) of the text, and the reader finds the target more quickly. High horizontal coherence, on the other hand, means that different Parts are difficult to tell apart, leading to more initial errors by the reader.
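The mechanism can be sketched in a few lines. The toy tree, the patterns and the specific affinity weight 1/(1 + Hamming distance) below are all invented for illustration and are not the paper's exact definitions; the point is only that keyword affinity biases the walk towards the correct Part:

```python
import random

def hamming(p, q):
    """Number of differing bits between two keyword patterns."""
    return sum(a != b for a, b in zip(p, q))

def search_steps(tree, patterns, target, rng):
    """Steps a keyword-affinity walker needs from the root (node 0) to the target leaf."""
    node, steps = 0, 0
    while node != target:
        nbrs = tree[node]
        # Hop probability proportional to affinity with the target pattern.
        weights = [1 / (1 + hamming(patterns[n], patterns[target])) for n in nbrs]
        node = rng.choices(nbrs, weights=weights)[0]
        steps += 1
    return steps

# Root 0 with Parts 1 and 2; leaves 3-6.  The patterns are "vertically
# coherent": the target leaf 3 shares keywords with its ancestor Part 1.
tree = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5, 6], 3: [1], 4: [1], 5: [2], 6: [2]}
patterns = {0: (1, 1, 0, 0), 1: (1, 1, 1, 0), 2: (0, 0, 0, 1),
            3: (1, 1, 1, 1), 4: (1, 0, 1, 0), 5: (0, 0, 1, 1), 6: (0, 1, 0, 1)}

rng = random.Random(0)
mean_steps = sum(search_steps(tree, patterns, 3, rng) for _ in range(2000)) / 2000
print(round(mean_steps, 2))
```

For this particular toy example, solving the MFPT equations gives about 7.3 steps for the affinity-biased walker versus 18 for the unbiased one, illustrating how vertical coherence speeds up retrieval.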
As a measure of complexity, we propose the MFPT of the random reader from the root of the tree to the predefined target information; it gives an intuitive account of how difficult it is for a typical reader to navigate the legal text by following only local information. MFPTs have similarly been employed with success to assess the heterogeneity and transport properties of social and other complex networks [36].
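The MFPT itself is the solution of a linear system: with the target made absorbing, m(v) = 1 + Σ_u P(v→u) m(u) and m(target) = 0. A self-contained sketch (pure-Python Gaussian elimination, and an unbiased walk on a path rather than the paper's weighted tree, so that the classical answer n² is available as a check):

```python
# MFPT of an unbiased walk on the path 0-1-...-n with absorbing target n.
# The MFPT from node 0 is classically known to equal n**2.
def mfpt_path(n):
    # Unknowns m(0..n-1); build A m = b with b = 1 (one step per move).
    A = [[0.0] * n for _ in range(n)]
    b = [1.0] * n
    for v in range(n):
        A[v][v] = 1.0
        nbrs = [u for u in (v - 1, v + 1) if 0 <= u <= n]
        for u in nbrs:
            if u < n:               # the target n contributes m = 0
                A[v][u] -= 1.0 / len(nbrs)
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            b[r] -= f * b[col]
            for k in range(col, n):
                A[r][k] -= f * A[col][k]
    m = [0.0] * n
    for r in range(n - 1, -1, -1):
        m[r] = (b[r] - sum(A[r][k] * m[k] for k in range(r + 1, n))) / A[r][r]
    return m[0]

print(round(mfpt_path(4), 6))  # prints 16.0 (= 4**2)
```

The same linear-system structure underlies the tree computation; only the transition probabilities (the pattern-dependent weights W) change.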
So far, we have limited our analysis to trees, for which we were able to compute our complexity measure analytically using simple approximations. However, other topologies play an important role in real-life legal networks. A direct generalisation of the present work would be the inclusion of cross-references, which can potentially lead to detours or act as shortcuts. In fact, studies of European civil law have found that these legal systems can exhibit small-world properties [37]. In other studies, the more general directed acyclic graphs are used to represent citation networks, e.g. the network of precedents in the US [2] or the total citation network in a system of statute law [3,5]. Recently, [8] elaborated on the similarity between legal and software systems, drawing on best practices for the latter to propose improvements to the former. The framework developed in the present paper may prove useful in providing a more quantitative ground on which to assess the methods in these lines of research as well.
Moreover, our model definitions rely on a number of assumptions about how keywords are distributed over the text. Firstly, the definition of overlap assumes that the Parts of an Act are ordered in such a way that consecutive pairs realise the maximal overlap in that Act, and that this overlap is the same for all adjacent pairs. Secondly, we assume that below the Part level, the marginal distribution of every keyword is fixed within each Part, which might be unrealistic for "deep" laws with many levels. Relaxing these assumptions, introduced for the sake of computational simplicity, may render the model even more realistic and general.
We have modelled the reader as a Markovian random walker, that is to say, a "memoryless" one. To replicate the behaviour of a real reader more closely, more general types of walks (e.g. self-avoiding walks [38]) might be appropriate.
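The contrast with a self-avoiding walker is particularly stark on a tree: the only way back up passes through the already-visited parent, so a self-avoiding walker descends monotonically and hits some leaf (not necessarily the target) after exactly h steps. A small sketch with an invented tree builder:

```python
import random

def build_tree(c, h):
    """Adjacency list and node depths of a regular c-ary tree with levels 0..h (root = 0)."""
    adj, depth = {0: []}, {0: 0}
    frontier, next_id = [0], 1
    for level in range(h):
        new_frontier = []
        for parent in frontier:
            for _ in range(c):
                adj[parent].append(next_id)
                adj[next_id] = [parent]
                depth[next_id] = level + 1
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return adj, depth

def self_avoiding_walk(adj, rng):
    """Unbiased self-avoiding walk from the root; returns the visited path."""
    path, visited = [0], {0}
    while True:
        options = [u for u in adj[path[-1]] if u not in visited]
        if not options:          # only happens at a leaf on a tree
            return path
        nxt = rng.choice(options)
        visited.add(nxt)
        path.append(nxt)

adj, depth = build_tree(3, 4)
path = self_avoiding_walk(adj, random.Random(1))
# The parent is always visited, so the walk descends straight to a leaf:
# the path length equals h for every realisation.
print(len(path) - 1, depth[path[-1]])  # prints: 4 4
```

This suggests that memory mainly suppresses the costly back-and-forth excursions; a real reader presumably sits somewhere between the two extremes.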
Finally, on the side of our analysis, it should be possible to refine the approximation in eq. (29). Perhaps more of the information contained in the pattern-dependent m_rt(W) can be exploited by carrying the analysis beyond its mean to higher moments.
Previous research indicates that glossaries (lists of keywords) may be extracted using natural language processing [39] (in particular, topic models [23]). To make our model applicable in real-life scenarios, one should devise a way to estimate the horizontal and vertical coherence of an existing legal document with a tree-like backbone once a glossary has been extracted.
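As a crude, purely illustrative stand-in for that extraction step (a topic model such as LDA would be the more principled choice, as noted above), one can score the words of one section by how much more frequent they are there than in the rest of the document. The function name, stop-word list and text snippets below are all invented:

```python
from collections import Counter

def glossary(section_text, other_text, k=3):
    """Top-k words over-represented in one section relative to the rest (toy heuristic)."""
    stop = {"the", "of", "a", "in", "to", "and", "or", "any", "by"}
    sec = Counter(w for w in section_text.lower().split() if w not in stop)
    rest = Counter(w for w in other_text.lower().split() if w not in stop)
    # Score: section frequency, damped by frequency elsewhere.
    scored = {w: n / (1 + rest[w]) for w, n in sec.items()}
    return [w for w, _ in sorted(scored.items(), key=lambda t: -t[1])[:k]]

part_x = "overcrowding definition of overcrowding the room standard the space standard"
rest = "introductory provisions provision of housing accommodation housing the room"
print(glossary(part_x, rest))
```

Once such per-node glossaries exist, the overlap ∆ and tightness τ could in principle be fitted by comparing glossaries across siblings and across levels, respectively.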
We can derive three broad and intuitive lessons from the results in sec. VII for reducing the "complexity" of a legal tree. (i) Keywords at the lowest levels should be reflected at higher levels, i.e. a legal text should be "tightly" formulated. Yet it is possible to make it overly tight, which happens when all text items look too similar to each other; this situation is equivalent to giving the reader no clues at all. (ii) Parts should be well separated by their keywords (and hence by topic); some keyword overlap is acceptable, as long as sufficiently many Part-specific keywords remain to guide the reader. (iii) Text at higher levels should not be too sparse. If a high-level entry contains only a small number of keywords (such as a short headline), little information about its subordinate items can be conveyed (except by interpretation, e.g. through association of keywords and related words). A higher keyword frequency at the top levels saves the reader time-costly detours into wrong Parts.
are functions of ∆ (cf. fig. 3), where L is the number of bits per pattern, c is the number of Parts, and c̄ = L/c. The reason for the presence of different cases is that the inequality

The probabilities of {ξ^μ_j = ξ^{μ+1}_j} can be derived using the law of total probability and the conditional independence of the Part patterns given the root pattern ξ, where the factors in the square brackets are given by the elements of the R_j's in eq. (8) for each of the combinations of a_h and a_l. In fact, by definition of R_j, we have

With the decomposition of eq. (A5), and using that ⟨ξ_j⟩ = P{ξ_j = 1} = a for all j, this produces the marginal probability

Considering now all L_hh bits j for which ⟨ξ^μ_j⟩ = ⟨ξ^{μ+1}_j⟩ = a_h, the above expression reduces to 2(1 − a)(1 − Γ)Γ, since a_h = (1 − a)Γ + a (given in eq. (12)) sets the other two summands to zero. The sum of the distances |ξ^μ_j − ξ^{μ+1}_j| for these bits then forms a binomial random variable with "success" probability 2(1 − a)(1 − Γ)Γ. Similarly, we can treat the other bits in two groups of sizes L_lh and L_ll, respectively, as described at the beginning of this section.
For this purpose, it is useful to recall from eq. (11) that a_l − (1 − a)Γ = β_l. The total distance d_{μ,μ+1} is then given by a sum of binomial random variables, expressed in terms of β_l for conciseness.
The expected pattern distance between neighbours is readily calculated using the linearity of ⟨·⟩, the above characterisation of d_{μ,μ+1} and the expectation of binomial random variables, eq. (A2). These ingredients give the result reported for d = ⟨d_{μ,μ+1}⟩ in eq. (17). We can see that d is a decreasing function of ∆, with maximum and minimum

We now examine the expected distance of two patterns ξ^μ, ζ, where the former represents Part μ and the latter is a descendant of the former, at distance k. We are particularly interested in the situation where ζ is a leaf-level pattern, i.e. the edge distance between the two is k = h − 1. To determine ⟨d(ζ, ξ^μ)⟩, we need to know the probabilities of the events {ξ^μ_j = ζ_j}, which we calculate from the k-th power of Q defined in eq. (16), where a^μ_j is the marginal expectation ⟨ξ^μ_j⟩ = a^μ_j. As a^μ_j can take the values a_l and a_h, as defined in eqs. (11) and (12), the j-th bits are different with probability

For the full pattern distance composed of all bits, we have to take both possible values, a_l and a_h, of a^μ_j into account. Again, the full pattern distance is a sum of two independent binomial random variables with expectation given by eq. (A2). For k = h − 1, we obtain the result given in the main text in eq. (18).
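The "sum of binomial random variables" mechanism is easy to verify numerically in a simplified setting where the two patterns have fully independent bits (in the paper, the bits are independent only conditionally on the root pattern, so this checks the mechanism rather than eq. (17) itself):

```python
import random

def expected_distance(ps, qs):
    """E[Hamming distance] for independent Bernoulli bits with means ps[j], qs[j]."""
    # Per-bit disagreement probability: p(1-q) + (1-p)q; linearity sums them.
    return sum(p * (1 - q) + (1 - p) * q for p, q in zip(ps, qs))

def sampled_distance(ps, qs, n, rng):
    total = 0
    for _ in range(n):
        total += sum((rng.random() < p) != (rng.random() < q)
                     for p, q in zip(ps, qs))
    return total / n

rng = random.Random(0)
ps = [0.8] * 6 + [0.2] * 6   # "high" and "low" expectation bit groups
qs = [0.8] * 3 + [0.2] * 9
theory = expected_distance(ps, qs)
estimate = sampled_distance(ps, qs, 20000, rng)
print(round(theory, 3), round(estimate, 3))
```

Each group of bits sharing the same (p, q) pair contributes a binomial random variable, exactly as in the L_hh / L_lh / L_ll decomposition above.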
Various examples of labelled paths are given in fig. 15. In all cases, the sum of the first four indices equals the length of the path P.
We anticipate that our notation will not be well defined for most directed paths in the tree, but it will be well defined for those paths relevant to the analysis in this manuscript.
Moreover, the above definition does not identify paths uniquely; for instance, the label (k, 0, 0, 0; 1) applies to all paths of length k not leaving the target Part, with the additional constraint of being directed away from t. However, due to the constraint on the path direction, the pattern distances between the start and end nodes of each path are identically distributed, and the distribution is determined by (k, 0, 0, 0; 1). We can now use these labels to refer to classes of nodes as well. Given a class of paths, the probability of the event {ξ^v_j = ξ^t_j} given d_u can be calculated by appealing to Bayes' rule. Bayes' rule states that the conditional probability of the event A given an event B with P{B} ≠ 0 obeys

consequently, we can write the probability of {ξ^v_j = ξ^t_j} given d_u as

and we proceed by calculating the terms on the right-hand side individually.
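As a minimal numeric sanity check of this Bayes'-rule step, consider a single parent bit x with P{x = 1} = a and a child bit y produced by a 2×2 transition matrix; the numbers below are arbitrary, not the paper's R matrices:

```python
a = 0.7                          # P{x = 1} (arbitrary)
Q = {(1, 1): 0.9, (0, 1): 0.1,   # P{y | x = 1}
     (1, 0): 0.2, (0, 0): 0.8}   # P{y | x = 0}

# Joint law P{x, y} and marginal P{y}.
joint = {(x, y): (a if x else 1 - a) * Q[(y, x)] for x in (0, 1) for y in (0, 1)}
py = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}

# Bayes' rule: P{x | y} = P{y | x} P{x} / P{y}.
up = {(x, y): Q[(y, x)] * (a if x else 1 - a) / py[y] for x in (0, 1) for y in (0, 1)}

# Consistency: the "upward" matrix reproduces the same joint law.
for x in (0, 1):
    for y in (0, 1):
        assert abs(up[(x, y)] * py[y] - joint[(x, y)]) < 1e-12
print({k: round(v, 4) for k, v in up.items()})
```

The same inversion, applied bit-wise, yields the upward transition matrices R↑ used below.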
Clearly, the pattern distance d_u to the target is a sum of binomial random variables whose statistics depend on ∆. If ∆ = ∆_max, all bits of the pattern ξ^u have the same expectation ⟨ξ^u_j⟩ = a_h, and are therefore identically distributed. As the bits are independent, d_u is a binomial random variable with "success" probability

In contrast, ∆ < ∆_max implies that a^u_j = ⟨ξ^u_j⟩, and hence P{ξ^t_j = ξ^u_j}, depends on j and on the Part containing u, as prescribed by eq. (9). Therefore, instead of eq. (C4) we consider

In order to extend eq. (C3) to general ∆, we make the simplifying assumption that the bits of a given pattern are identically distributed, with probabilities averaged over all bits of that pattern. That is, we make the approximation

where μ and ν are the Parts containing u and v, respectively, and O^{μν}_∆(a_x, a_y) is the fraction of indices j such that a^μ_j = a_x while a^ν_j = a_y. By the definition of ∆ in sec. III, and with the help of fig. 3, we can write these fractions as

Notice that the definition of the f_∆'s is consistent with eq. (C4), because O^{μν}_{∆max}(a_h, a_h) = 1. Under the assumptions of eq. (C6), the fraction in eq. (C2) can be written as

using the expression for the binomial PMF in eq. (A1).
To calculate the conditional probability P{d_u = y | ξ^v_j = ξ^t_j} in eq. (C2), we can use the fact that bits are independent, and consider the pattern distance d_u(j) of u that disregards bit j. This allows us to split the event {d_u = y} into the cases ξ^u_j = ξ^t_j and ξ^u_j ≠ ξ^t_j, respectively:

In this expression, the marginal probabilities of d_u(j) are given by the binomial PMF, eq. (A1), with n replaced by L − 1 and p = f^{tu}_∆, whereas {ξ^u_j ≠ ξ^t_j | ξ^v_j = ξ^t_j} is the same event as {ξ^u_j ≠ ξ^v_j | ξ^v_j = ξ^t_j}, which has probability 1 − f^{uv}_∆. We can now combine eqs. (C8) and (C9) into eq. (C2) to obtain

Due to the assumption of eq. (C6), this expression is independent of j, which implies that d_v | d_u is a binomial random variable as well, with "success" probability as in eq. (C10) and L trials. Therefore, we can calculate the conditional expectation

To finish this calculation, we have to find the g's defined in eq. (C5), which determine the probabilities f^u_∆, f^v_∆ and f^{uv}_∆ through their definition in eq. (C6). For this, the label notation introduced in app. B will be useful.
We consider a pattern ξ^u at some node u in Part μ, and a path P ∼ (k, l, m, n; ν) starting from u and ending in v. The transition probability from the j-th bit of pattern ξ^u to the j-th bit of the pattern at the end of the path P is given by

The Q's are as defined in eq. (16), with powers as in eq. (A13). The matrix R^ν_j is defined as in eq. (8), depending only on the Part index of the label P, and only if ∆ < ∆_max; if ∆ = ∆_max, the R^ν_j are equal for all j and ν. Moreover, R↑ is the family of transition matrices with elements R↑^μ_j(ξ_j | ξ^μ_j) = P^μ_j(ξ_j | ξ^μ_j) for ξ^μ_j, ξ_j ∈ {0, 1}, which are given by Bayes' rule, eq. (C1). Consequently, the elements of R↑^μ_j read

Similarly, (Q↑^μ_j)^k is the matrix with elements P^v_j(ξ^μ_j | ξ^u_j) for ξ^u_j, ξ^μ_j ∈ {0, 1}. However, due to our stipulation that a^u_j depends only on the Part μ in which v is located (see eq. (15)), a quick calculation reveals that Q↑^u_j = Q^u_j. Thus the probabilities g_P(a^μ_j, a^ν_j) = P{ξ^u_j ≠ ξ^v_j} are given by

g_{(k,l,m,n;ν)}(a^μ_j, a^ν_j) := g_{uv}(a^μ_j, a^ν_j) = a^μ_j Q^ν_j(0|1) + (1 − a^μ_j) Q^ν_j(1|0),   (C14)

where the dependence on a^ν_j is implicit in Q^ν_j. In the following we report the relevant values of g by explicitly expanding eq. (C14) in terms of the matrix elements given in eqs. (8), (C13) and (A13).
There are essentially five different cases to consider for eq. (C14), each corresponding to P being one of the paths (k, 0, 0, 0; 1), (k, 1, 0, 0; 1), (k, 1, 1, m; ν), (0, 0, 1, m; ν) or (0, 0, 0, m; ν). By direct calculation starting from eq. (C14), we find the probabilities

g_{(k,0,0,0;1)}(x, y) = 2x(1 − x)(1 − (1 − Γ)^k),
g_{(k,1,0,0;1)}(x, y) = a + x − 2ax + 2(1 − a)(1 − Γ)^k (Γ − x),
g_{(k,1,1,m;ν)}(x, y) =

This can be done by temporarily assuming that the pattern distance d_{v_1} = d(ξ^{v_1}, ξ^t) of the target-wards neighbour is given. Then we can first calculate the conditional expectations

Moreover, under the assumption of eq. (C6), d_{v_1} is a binomial random variable with parameters p = f^{tv_1}_∆ and n = L, in the notation of eq. (A1). Hence, the expectation of [d_{v_1} + 1]^{−1} is known to be [40]

Therefore, we can write W_{vv_1}/W_{vv_i} explicitly by substituting eqs. (D2) and (D3) into eq. (D1), where f^{tv_1}_∆ and f^{tv_i}_∆ are given by the probabilities described in eq. (C15), in terms of the paths connecting v_1 and v_i to t, respectively; f^{v_1 v_i}_∆ is given by f^P_∆, with P the labelled path connecting v_1 to v_i over v. In the notation of eq. (D1), we thus have

Having derived an expression for the ratios ε_i = W_{vv_1}/W_{vv_i}, we now use these to compute the averages ⟨W_{vv_i}⟩ that we were originally interested in, by venturing the approximation

for all i > 1 in the neighbourhood of v (see fig. 17). Now all expected weights ⟨W_{vv_i}⟩ in the neighbourhood of v are approximately determined by the system of equations

The unique solution of this linear system is given by

with Z the normalising constant.
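The inverse-moment identity invoked for eq. (D3) states that for X ~ Binomial(n, p), ⟨1/(X + 1)⟩ = (1 − (1 − p)^{n+1}) / ((n + 1)p). It can be verified by direct enumeration of the binomial PMF:

```python
from math import comb

def inv_moment_exact(n, p):
    """E[1/(X+1)] for X ~ Binomial(n, p), by summing the PMF term by term."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) / (k + 1) for k in range(n + 1))

def inv_moment_formula(n, p):
    """Closed form used in eq. (D3)."""
    return (1 - (1 - p)**(n + 1)) / ((n + 1) * p)

for n, p in [(10, 0.3), (25, 0.7), (40, 0.05)]:
    assert abs(inv_moment_exact(n, p) - inv_moment_formula(n, p)) < 1e-12
print(round(inv_moment_formula(10, 0.3), 6))
```

The identity follows from the absorption relation C(n, k)/(k + 1) = C(n + 1, k + 1)/(n + 1), which telescopes the PMF sum into 1 − (1 − p)^{n+1}.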
Important specialisations of this formula are those for which (i) all f^{tv_i}'s and f^{v_i v_1}'s for i > 1 are equal, i.e. when all v_i (i > 1) have the same path label P_i relative to the target, and (ii) all v_i's for 1 < i < c + 1 are labelled by the same P_i, but v_{c+1} = r, which has its own unique label. Case (i) applies unless v is the root or the Part-level node μ = 1. Case (ii) applies if v is the Part node μ = 1.
In the first case, all ⟨W_{vv_i}⟩'s and ε_i's for i > 1 have to be equal, which produces

In the second case, we have to distinguish ⟨W_{vv_i}⟩ for 1 < i < c + 1 from ⟨W_{vv_{c+1}}⟩, with the result

FIG. 1: Example from an excerpt of the UK Housing Act 1985, c. 68, to be found at https://www.legislation.gov.uk/ukpga/1985/68/contents. The nodes of the tree represent structural items of the text, such as the Act itself, its Parts, Sections, etc., and two items are linked if one is contained in the other. The dashed edges label the ideal path of a reader researching the question

FIG. 3: A schematic drawing of patterns with L = 12 and c = 3, with boxes marking bits with high expectation ⟨ξ^μ_i⟩ = a_h. Top: ∆ = 0, i.e. specific keywords of different Parts do not overlap. Bottom: ∆ = 3, i.e. two neighbouring Parts have 2∆ = 6 specific keywords in common. For ∆ = 0 there are c̄(c − 1) keywords that are generic for any Part; therefore the maximum value for ∆ is c̄(c − 1)/2 = 4.
Fig. 5 compares eq. (17) to the numerical average of the distances d_{μ,μ+1}. The agreement is excellent, confirming the accuracy of our calculation.

FIG. 5: Scatter plot of the expected distance ⟨d⟩ on the horizontal axis, compared to eq. (17) on the vertical axis. ⟨d⟩ is sampled in simulations with the parameters shown at the top.

FIG. 6: Mean pattern distance across h = 4 levels, simulated with the shown parameters on the

FIG. 11: Complexity C_MF obtained via sec. VI vs C, the average MFPT ⟨m_rt(W)⟩, where the average is taken over the distribution of patterns. For each realisation of W, m_rt(W) was computed using eq. (22). Error bars represent one standard deviation of m_rt(W). Each data point refers to a pair (∆, τ), grouped by colour and symbol according to τ. For all points, we fixed a = 0.7 and Γ = 0.3.

FIG. 12: Complexity C_MF as a function of a for different values of τ and ∆. The points show the numerical average for C on the right-hand side of eq. (29), taken over 500 realisations of W.

Fig. 12 suggests the following conclusion: for fixed, low values of τ, the complexity can be minimised by increasing a as much as possible. Since a represents the keyword density of the root pattern, this means that the Title of the represented Act should reference as

FIG. 14: Complexity C_MF as a function of τ for different values of ∆ and a. The points show the numerical average for C on the right-hand side of eq. (29), taken over 500 realisations of W.

Fig. 14 presents C_MF as a function of τ as dashed lines, with different curves and panels representing different values of ∆ and a, respectively. Different symbols are used to represent the values of C = ⟨m_rt(W)⟩. The figure shows that C_MF has a pronounced local minimum in τ between 0.8 and 0.9 for all values of ∆ and a tested. At τ = 1, the random walker becomes diffusive whenever inside a Part, because all patterns within a given Part are identical. Therefore it makes

FIG. 15: Labels for different paths of length 3 oriented away from t, shown by arrows. The numbers of densely dashed and densely dotted arrows give the first and fourth coordinates, respectively. The numbers of thinner, loosely dashed and loosely dotted arrows give the second and third coordinates, respectively. The latter two are always associated with the root, so the third and fourth entries are both either 0 or 1. Note that the first four coordinates always sum to the path length.

FIG. 17: Neighbourhood of the node v, separated into target-ward, root-ward and all other neighbours. The W's denote the weights of edges pointing away from v. If v is the root, there is no root-ward neighbour v_{c+1}, and if v is a leaf, there is only the neighbour v_{c+1}.