Genome entropy and network centrality contrast exploration and exploitation in evolution of foodborne pathogens

Sheryl L Chang; Carl J E Suster; Rebecca J Rockett; Adam J Svahn; Oliver M Cliff; Alicia Arnott; Qinning Wang; Rady Kim; Basel Suliman; Mailie Gall; Tania C Sorrell; Vitali Sintchenko; Mikhail Prokopenko

doi:10.1088/1478-3975/acd899

1. Introduction

Environmental reservoirs of foodborne pathogens harbour diverse microbial populations that evolve within their ecological niches. New strains associated with increased virulence or antimicrobial resistance pose a particular risk to human health. An understanding of the evolutionary dynamics of endemic foodborne pathogens is therefore important to enable effective management of threats to public health, food safety, and trade. To this end, new methods are required for identifying evolutionary phases and bottlenecks, and abrupt transitions in the dynamics that signal emerging threats.

Salmonella enterica causes a significant foodborne disease burden globally [1]. In Australia, S. enterica subspecies enterica serovar Typhimurium (STM) is responsible for most locally-acquired foodborne infections [2–5]. In New South Wales (NSW), STM surveillance includes routine whole genome sequencing (WGS) of bacterial isolates, which has identified genomic diversity in the endemic population including a multi-drug-resistant monophasic strain [6, 7]. The evolutionary dynamics of endemic strains can be modelled as a trade-off between an exploration phase where genetic diversity increases to explore the local fitness landscape, and an exploitation phase where a successful variant disseminates in its niche [8–10]. Pathogen surveillance targets human disease, and the majority of available genomic data is collected from clinical cases of salmonellosis. This renders visible only part of the STM population which infects human hosts (e.g. through contaminated food), whereas the dynamics of populations in reservoirs such as poultry flocks are largely invisible. Furthermore, traditional evolutionary tools such as phylogenetic analysis are limited in that they can only analyse bacterial sequences which are common to all isolates. However bacteria often have extensive accessory genomes, i.e. genes present only in a subset of strains. These accessory genes are not necessary to generate a fully functioning bacterial cell, but often encode virulence factors and mechanisms that provide antibiotic resistance. The accessory genome can also be rapidly mobilised and exchanged between and within bacterial populations therefore providing a rapid mechanism for genetic diversification [11]. The shell genome here is defined as the set of genes present in 15%–95% of isolates.

Previous studies identified an exploration-exploitation distinction in a space formed by two dimensions: structure (network centrality) and function (prevalence), using an STM surveillance dataset spanning a nine-year period in NSW [12, 13]. This disease predated routine genomic surveillance, employing multiple-locus variable-number tandem-repeats analysis (MLVA) instead of WGS. While MLVA often provides sufficient discriminatory power to support outbreak investigations [14, 15], WGS data offers a much higher resolution and the possibility to quantitatively verify the suggested distinction between exploration and exploitation [16, 17]. In this study, we aimed to apply network-theoretic and information-theoretic methods in order to trace STM evolutionary pathways and identify their salient phases and transitions, using a WGS dataset of 3939 isolates collected between 2016 and 2021 and sequenced by NSW Health Pathology.

We first applied an established network-theoretic approach [13, 18] to derive genotype networks for the new WGS dataset. This allowed us to identify a clear partition of pathogens in the constructed centrality-prevalence space. We then demonstrated, by computing the corresponding shell genome entropy, that pathogens with the low-to-mid centrality have a higher genetic diversity (identifying these with the exploration phase), while pathogens with the mid-to-high centrality have a lower genetic diversity (identifying these with the exploitation phase). A transition between these phases was quantified with the Fisher information in terms of the network centrality. Continuing with network analysis, we computed the probability density along evolutionary paths, contrasting the paths with increasing and decreasing connectivity. This analysis demonstrated that pathogens exploring the evolutionary search-space (i.e. with low-to-mid centrality) may move towards the exploitation phase (i.e. change their network centrality towards higher values) while increasing their prevalence and producing outbreaks. Importantly, however, we identified an evolutionary bottleneck in the observed population emerging in response to the epidemic containment measures taken by Health Protection NSW at that time.

2. Methods

2.1. Bacterial isolates and bioinformatic analysis

A total of 3939 STM isolates collected between 2016 and 2021 and sequenced at the Institute of Clinical Pathology and Medical Research (ICPMR), NSW Health Pathology were included in the analysis. Sequencing was performed as described in a previous study [18]. Demultiplexed sequencing reads with ${\gt}1\times 10^7$ reads per isolate were trimmed [19], based on a minimum quality read score of 20, and then assembled de novo using SPAdes (version 3.13.0 [20]). The quality of assemblies was assessed with Quast (version 5.0.2 [21]); only assemblies with ${\lt}200$ contigs and N50 values of ${\gt}$ 50 000 bp were included in further analysis. Contigs were annotated with predicted genes using Prokka (version 1.13.3 [22]). The bioinformatic tool, Roary (the pangenome pipeline; version 3.12.0 [23], with default parameters) was used to partition the gene content of all isolates. The software defines the core genes as genes present in 99%–100% of isolates, soft core (in 95%–99% of isolates), shell (in 15%–95% of isolates) and cloud gene content (in 0%–15% of isolates). Single nucleotide polymorphisms (SNPs) were identified from the 3792 383 base pairs defined as the core genome, using SNP-sites [24] without recombination masking [25]. The SNP distance G_ij between isolates i and j is defined as the number of core SNPs by which they differ. The median value of G_ij across all 3939 STM isolates was 383 (range 0–2051).

2.2. Undirected genotype network

We first derived a complete undirected network in which nodes represent individual isolates and edges represent genetic distances. Specifically, following an earlier study [18], the edge weight between adjacent nodes i and j in the complete undirected network was set to the pairwise SNP distance, G_ij. We then used the derived undirected network to construct a 2-dimensional space in terms of two quantities measured for each node i: the closeness centrality l_i, and the prevalence defined as the number of close neighbours within genetic distance $G_{\textrm{max}} = 20$ from node i. The constructed centrality-prevalence space relates a structural property of the pathogen (its network centrality) to the functional property (its neighbourhood prevalence). The latter can be interpreted as the pathogen's fitness capturing the average spread and virulence across its genetic neighbourhood [13]. Following prior studies [12, 13], we use the normalised node closeness centrality, computed as the inverse sum of the length of the shortest paths between the node of interest and other nodes, as described below.

2.2.1. Closeness centrality

Node closeness centrality captures the relative importance of node i to other nodes in the network by distance proximity. It is calculated as the inverse average shortest distance from node i to other nodes:

$\begin{equation} l_i=\frac{N}{\sum_{j=1, i \neq j}^N d_{i,j}} \end{equation} \tag{ 1 }$

where N is the number of nodes in the network, $d_{i,j}$ is the length of the shortest path between any pair of nodes i and j, defined as the sum of the edge weights $G_{k_1 k_m}$ , where $k_1 = i$ and $k_m = j$ , on the path with $(m-1)$ edges. If m = 2, then $d_{ij} = G_{ij}$ .

If we take inverse of l_i, $\lambda_i = 1 / l_i$ , and average λ_i across the entire network, the result yields the average path length L of the network:

$\begin{align} L =\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1, i \neq j}^{N} d_{i,j} \end{align} \tag{ 2 }$

$\begin{align} \kern-4.4pt = \frac{1}{N-1} \sum_{i=1}^{N} \left(\frac{1}{N}\sum_{j=1, i \neq j}^{N}d_{i,j}\right) \end{align} \tag{ 3 }$

$\begin{align} & =\frac{1}{N-1} \sum_{i=1}^{N} \frac{1}{l_i} = \frac{1}{N-1} \sum_{i=1}^{N} \lambda_i. \end{align} \tag{ 4 }$

In this study, we employ the average length of the shortest paths to measure how individual nodes contribute to the average genetic variation in the population. The relationship between the closeness centrality and the average path length suggests that the closeness centrality is a more suitable characteristic of genetic diversity of the population on average than alternative centrality measures, such as betweenness centrality [26] or eigenvector centrality [27].

2.2.2. Thresholded undirected network

Using the undirected network, we then constructed a thresholded undirected network (figure 1) where the edges are only inferred between genetically similar isolates, within the genetic distance threshold $G_{\textrm{max}} = 20$ . When comparing Salmonella core genomes, distances of 0–5 SNPs are highly indicative of STM isolates collected from the same point source outbreak—if they are collected within a three month period—whereas distances of more than 20 SNPs suggest that isolates are epidemiologically unrelated [28–30]. The choice of $G_{\textrm{max}} = 20$ therefore produces edges between pairs of plausibly-related isolates while excluding edges where the number of core SNPs would imply the divergence of strains occurred well outside of the study period. In our previous study of Salmonella Enteritidis genotype networks [18], we showed that the network analysis was robust to choices of G_max between 10 and 30.

**Figure 1.** The thresholded undirected SNP sub-networks produced for the complete SNP network using SNP distance threshold $G_{\textrm{max}} = 20$ . Total number of nodes $N = 3,939$ , total number of edges $M = 336,522$ . Singletons (193 isolates) are removed. Total number of components: 173 (excluding singletons). See figure A1(a) for the degree distribution. Node colours: (blue) low-to-mid centrality with low prevalence; (cyan) low-to-mid centrality with high prevalence; (orange) mid-to-high centrality with low prevalence; (yellow) mid-to-high centrality with high prevalence. Colour coding is in accordance with figure 5. See section 3.1 for more details.
Download figure:
Standard image High-resolution image

We highlighted nodes by applying a centrality threshold of $3.3\,\times\,10^{-3}$ : nodes with centrality below the threshold are coloured blue or cyan and nodes with a higher centrality are coloured orange or yellow, as shown in figure 1. This centrality threshold is determined by an observed partition of isolates in the centrality/prevalence space (see section 3.1 for more details). The centrality-prevalence space is used for identifying patterns and phases in the evolutionary dynamics of the pathogens.

2.3. Directed genotype network

We also derived a directed SNP sub-network that captures both genetic and temporal proximity, with a directed edge only inferred between isolates if they are (i) genetically close ( $G_{\textrm{max}} = 20$ ), and (ii) collected within a fixed time window ( $T_{\textrm{max}} = 30$ days), with the edge direction pointing from the predecessor to the successor. In cases where the isolates are collected on the same day, two directional edges are inferred. We chose the collection date as a proxy to the date of infection because foodborne salmonellosis is characterised by acute symptoms with the onset periods varying between 6 hours to 6 days. In addition, the time window, T_max, is an approximation of the median shedding period commonly associated with foodborne salmonellosis [31]. A visual representation of the directed SNP sub-network is shown in figure 2.

**Figure 2.** The thresholded directed SNP sub-networks derived using SNP distance threshold $G_{\textrm{max}} = 20$ and time window $T_{\textrm{max}} = 30$ days. The inset shows a zoomed-in view of one of the network components. Total number of nodes $N = 3,939$ , total number of edges $M = 55,440$ . Singletons (824 isolates) are removed. Total number of components: 338 (excluding singletons). Node colours: (blue) low-to-mid centrality with low prevalence; (cyan) low-to-mid centrality with high prevalence; (orange) mid-to-high centrality with low prevalence; (yellow) mid-to-high centrality with high prevalence. Colour coding is in accordance to figure 5. See section 3.1 for more details.
Download figure:
Standard image High-resolution image

2.4. Probability density of directed paths

A directed path is formed by connecting directed edges and may be considered as an approximation of pathogen adaptations over time. Previous studies of directed MLVA networks [13] suggested a distinction between paths leading to 'successful' or 'unsuccessful' adaptation, by associating these types with the increasing or decreasing connectivity along the path. In this work, we applied this distinction to the directed network derived using the WGS dataset. In order to trace genetic changes over time, we identified $38,698$ sufficiently long (m > 5) directed paths formed by connecting a sequence of directed edges. For each path we compared changes of the neighbourhood prevalence between the start and end nodes, distinguishing positive (successful adaptation) and negative (unsuccessful adaptation) changes. Some examples of successful and unsuccessful paths are shown in figure 3.

**Figure 3.** 'Successful' and 'unsuccessful' paths in the directed network. For the shown successful path, the end node has a higher neighbourhood prevalence than the start node (i.e. gaining neighbours with prevalence increasing from 144 to 387). The end node of the shown unsuccessful path, on the other hand, has a lower neighbourhood prevalence than the start node (i.e. losing neighbours with the prevalence decreasing from 88 to 74). The directed network shown on the right is the one depicted in figure 2.
Download figure:
Standard image High-resolution image

These directed paths were projected onto the centrality-prevalence space for the complete undirected SNP network, and for each point (i.e. each node) the prevalence changes were averaged across all paths originating from this node. For each point in the centrality-prevalence space (i.e. for each node), we quantify the average changes in prevalence along both successful and unsuccessful paths originating from this node. The resultant estimated probability density of directed paths distinguishes between the pathogens which tend to develop towards a higher or lower neighbourhood prevalence [13]. Thus, the probability density captures the expected changes in prevalence during a subsequent pathogen adaptation (measured via the likelihood of either expanding or contracting genetic neighbourhood).

2.5. Shell genome entropy

The pangenome comprises the entire set of $13,555$ genes from all $3,939$ isolates. After excluding the core and soft core genome (genes present in greater than 95% of isolates) and the cloud genome (genes present in fewer than 15% of isolates), the remaining shell genome consisted of 713 genes. The high-resolution WGS dataset enabled us to further explore the evolutionary pathways by looking at the genetic composition of different groups of isolates (e.g. isolates in the low-to-mid and mid-to-high centrality groups). We computed normalised entropy of the shell genome for each group.

The shell genome entropy is computed using gene presence/absence data over a fixed number of temporally ordered isolates (n = 10, note that collection date is not uniformly spaced). See figure 4 for a schematic of the shell genome entropy computation. The normalised Shannon entropy, S, is then calculated as follows [32]:

$\begin{equation} S =\frac{H}{H_{\textrm{max}}}=\frac{-\sum^{Q}_{i=1} p_i \log p_i }{H_{\textrm{max}}} \end{equation} \tag{ 5 }$

where Q is the number of genes, p_i is the probability of the gene i being present in the considered group of isolates, and $H_{\textrm{max}} = {\ln}(Q)$ the maximum entropy.

**Figure 4.** A schematic of the shell genome entropy computation defined in equation (5). Note that the shell genome data shown in step (B) is a subset of the shell genome data for the full population shown in figure B1.
Download figure:
Standard image High-resolution image

2.6. Fisher information in the centrality space

In order to verify that the two proposed types of the evolutionary dynamics (exploration and exploitation) can be characterised as phases, separated by a properly defined phase transition, we considered the pathogen adaptation as a thermodynamic process developing in response to changes in the control parameter defined as the node closeness centrality.

We employed the Fisher information to capture the sensitivity of the shell genome probability distribution to changes in the closeness centrality. In general, Fisher information [33] measures the amount of information that an observable random variable $\mathcal{X}$ contains about an unknown parameter θ, given the probability function $p(x|\theta)$ over $x \in \mathcal{X}$ with respect to parameter θ:

$\begin{align} F[\theta] & = \mathrm{E} \left[ \left. \left (\frac{\partial}{\partial \theta} \log p(x|\theta) \right) ^2 \right| \,\theta \right] \end{align} \tag{ 6 }$

$\begin{align} & = \int_{x} \left (\frac{\partial}{\partial \theta}\log p(x|\theta) \right)^2 p(x|\theta) \ \mathrm{d} x \end{align} \tag{ 7 }$

where $\mathrm{E} [\cdot | \theta]$ denotes the conditional expectation with respect to θ.

There is a well-established association between phase transitions and the Fisher information [34–41]: the Fisher information diverges at critical points, being proportional to the rate of change of the corresponding order parameter [37]. In the context of evolutionary dynamics of pathogens, the system possibly undergoing a transition is a pathogen which may change its adaptation behaviour from explorative to exploitative in response to changes in its node centrality (i.e. control parameter). In general, the node centrality may be expected to change in response to the underlying evolutionary dynamics. Hence, in this study, we consider changes in the node centrality as a process corresponding to varying the control parameter. On the other hand, the prevalence may correspond to the order parameter characterising the outcome of the evolutionary process. The advantage of using Fisher information in identifying possible phase transitions is that the computation of the corresponding order parameter is not required—instead, this model-independent method seeks specific critical points in the centrality space at which the Fisher information peaks.

We computed normalised Fisher information using gene presence/absence data over the isolates ordered by centrality with a fixed centrality increment of $1 \times 10^{-4}$ . Following [42–44], the normalised discrete Fisher information is calculated as follows:

$\begin{equation} F[\theta]=F_0 \sum^{R-1}_{i=1} [\sqrt{p_{i+1}}-\sqrt{p_i}]^2 \end{equation} \tag{ 8 }$

where R is the number of centrality increments (steps), and F₀ is a normalisation constant:

$\begin{align} F_0= \left\{\!\!\! \begin{array}{l@{}l} 1 & \textrm{if} \ p_i^*\!=\!1 \textrm{ for } i^*\!=\!1 \textrm{ or } i^*\!=\!R, \textrm{ and } p_i \neq 0 \ \forall i \neq i^* \\ \frac{1}{2} & \textrm{otherwise.}\end{array}\right. \end{align} \tag{ 9 }$

3. Results

3.1. SNP centrality-prevalence space

The constructed centrality-prevalence space (figure 5) shows a clear structure with at least two groups separated by their closeness centrality values: the isolates with low-to-mid centrality ( $\leqslant 3.3 \times 10^{-3}$ , shown in blue/cyan), and the isolates with mid-to-high centrality ( ${\gt}3.3 \times 10^-3$ , shown in yellow/orange). We draw an additional distinction by separating the isolates by neighbourhood prevalence (set at 100 isolates marked by the dashed line), above which isolates are considered high prevalence.

Figure 5 shows that (i) isolates in the same centrality group (low-to-mid, shown in blue; or mid-to-high, shown in orange) typically belong to the same network component, and (ii) isolates with high neighbourhood prevalence (shown in yellow and cyan) usually cluster together within the network components with some peripheral nodes from the same centrality group.

We also related the neighbourhood prevalence and the local clustering coefficient. This allowed us to investigate the role of clusters and hubs which are known to contribute to epidemic propagation [45]. However, the partition of isolates observed in the clustering-prevalence space (see figure A1(b) in appendix A) is not as structured as the one in the centrality-prevalence space.

3.2. Shell genome entropy and exploration-exploitation distinction

Previous studies distinguished between the explorative and exploitative pathways in the STM genetic space and argued that a mid-to-high centrality region contains higher-risk STM pathogens [12, 13]. These studies relied on coarse-grained MLVA datasets, and these results have not been verified with high-resolution WGS data. In this work, in order to associate the two groups separated in the centrality-prevalence space (figure 5) with the exploration-exploitation distinction, we compared the shell genome entropy between the groups.

Figure 6(a) shows that the isolates with low-to-mid centrality tend to have a higher normalised shell genome entropy, thus demonstrating a greater genetic diversity, while the normalised shell genome entropy for the isolates with mid-to-high centrality tends to be smaller, showing a reduced genetic diversity (figure 6(b)). This difference confirms the exploration-exploitation distinction: the isolates with low-to-mid centrality tend to diversify, while the isolates with mid-to-high centrality are more stable in their shell genome. Figure B1 in appendix B illustrates the corresponding shell genome data.

3.3. Fisher information and evolutionary dynamics

We also examined a possible transition between two phases (exploration and exploitation) in the centrality space by tracing the normalised Fisher information computed for all isolates ordered by their centrality values. Figure 7 shows that a small change in the centrality value from $2.6 \times 10^{-3}$ to $2.7 \times 10^{-3}$ results in an abrupt spike in the Fisher information. This indicates that the probability distribution of genes in the isolates around this centrality value is most sensitive to the centrality change. In thermodynamic terms, when the closeness centrality is considered as a control parameter, the observed spike in the Fisher information indicates that there is a phase transition between the lower and higher centrality regions. We interpret the resulting evolutionary dynamics as exploration and exploitation respectively. The critical regime is observed towards the end of exploration phase before the population self-organises into more exploitative dynamics.

**Figure 7.** Normalised shell genome Fisher information (713 genes) in the centrality space. The normalised Fisher information has two peaks: (i) a large peak for the isolates around critical centrality value between $2.6 \times 10^{-3}$ and $2.7 \times 10^{-3}$ , and (ii) a smaller peak for isolates with a high centrality greater than $4 \times 10^{-3}$ . The centrality increment is $1 \times 10^{-4}$ .
Download figure:
Standard image High-resolution image

It is also worth noting that many isolates in the critical centrality range ( $2.6 \times 10^{-3}$ to $2.7 \times 10^{-3}$ ) are associated with an outbreak cluster with MLVA profile 5-17-9-13-490. This is an MLVA profile with a substantially different genetic composition relative to the rest of the population. However, the Fisher information peak indicating a critical regime in the centrality space remains robust even when the outbreak isolates are excluded. This is shown in appendix D, by deriving the network for the truncated population (figure D3). We also found that the Fisher information peak is not sensitive to the removal of all isolates in the mid-centrality—high prevalence cluster (see appendix D). Robustness of the Fisher information peak indicates that the critical regime does not appear due to specific outbreaks emerging during the exploration phase. Instead it points to a phase transition in the overall evolutionary dynamics.

Interestingly, the pathogens within the medium centrality range exhibit features of both exploration and exploitation. This is revealed by comparing the centrality-partitions shown in figures 5, D1 and D2, with the partition becoming less sharp as the network is disturbed by the population truncation. Such mixing behaviours are often observed at critical regimes, with phase separation splitting a homogeneous mix into distinct phases (e.g. one homogeneous liquid separates into two different liquids [46]).

We also verified that the observed peak in Fisher information computed for the shell genome (using the gene presence/absence data) cannot be solely attributed to the population structure. In other words, there are no distinct sub-populations (defined in terms of SNP distance across the entire genome) which may have been partitioned by the critical centrality regime, as shown in appendix D (figure D4).

At the high-centrality region, there is a secondary peak observed when the centrality exceeds $4 \times 10^{-3}$ (figure 7). The second peak may indicate that the system is sensitive because it encountered another change in evolutionary dynamics, possibly an adaptation bottleneck (see the following section). We tested the robustness of the two observed peaks by using different centrality increments (steps) varying from a smaller ( $5 \times 10^{-5}$ ) to a larger step ( $2 \times 10^{-4}$ ), as shown in figure C1. While the peak near the critical value remains evident across different increments, the secondary peak in the high centrality range appears more sensitive to the increment value, possibly indicating its transient nature.

3.4. Probability density of directed paths

We then studied the conjecture that the isolates with mid-to-high centrality can be associated with the higher risk STM genotypes. Figure 8 shows two regions in the centrality-prevalence space: a dense 'expansion' region with a majority of nodes starting successful paths (shown in red), and a smaller 'bottleneck' region with nodes starting unsuccessful paths (shown in blue). Both regions are characterised by high centrality and high prevalence. It is also evident that isolates in the low-to-mid centrality region do not initiate sufficiently long paths, showing minimal neighbourhood prevalence changes.

Earlier studies based on MLVA data covering a nine-year period reported that while the average prevalence increases, the centrality of adapting pathogens initially increases (from low to high values) and then decreases (from high to medium values) [13]. Our results show a partial agreement with this finding: the increasing neighbourhood prevalence is observed only in the group of isolates with increasing centrality, from medium to high values. Importantly, our study shows that the isolates with the highest centrality encountered a bottleneck (shown in blue in figure 8), a region forming an emergent 'barrier' for successful paths and slowing evolutionary dynamics. In other words, the paths originating within the bottleneck region did not yield increased neighbourhood prevalence, in contrast to previous studies [13] which reported that a minority of paths may continue through the bottleneck region and attain higher prevalence while decreasing centrality. In our case, the observed bottleneck is related to the decline in the total numbers of salmonellosis cases reported between 2015 and 2019, which has been attributed to the national Foodborne Illness Reduction Strategy [2, 47] which used to contain the STM outbreaks in NSW. The reduction in salmonellosis notifications may also be attributed to increased attention to hand and surface hygiene during the COVID-19 pandemic [2, 48]. The emergence of the bottleneck region explains the secondary peak in the Fisher information, shown in figure 7.

4. Discussion

We investigated the evolution of a high burden foodborne pathogen by applying network methods and information theory, using a WGS dataset of STM isolates collected over five years (2016 to 2021) in NSW, Australia. We considered a relationship between structural network features (closeness centrality) and functional features of the evolving STM pathogens (their neighbourhood prevalence). The analysis revealed a clear structure in the centrality-prevalence space, and we associated the observed groups with two evolutionary dynamics: exploration and exploitation.

We verified this conjecture by computing the Shannon entropy and Fisher information of the corresponding shell genomes, showing that the isolates with low-to-mid centrality are characterised with higher entropy (thus, exploring the evolutionary landscape) and thus demonstrating the plasticity of the shell genome, and conversely, the isolates with mid-to-high centrality exhibited lower entropy (thus, exploiting their environment) and show less diversity within the shell genome. Furthermore, we considered the pathogen evolution as a thermodynamic process and showed that a small change in the parameter of interest, node centrality, results in sudden change in the system sensitivity, measured by the Fisher information. The centrality range between the value maximising Fisher information, $2.7 \times 10^{-3}$ , and the centrality partition point of $3.3 \times 10^{-3}$ can be considered as a critical regime that separates two phases of evolutionary dynamics. This provides a further argument for a possible phase transition driven by changes in the shell genome where short-term evolutionary events are likely to occur.

In addition, analysis of the path probability density in the centrality-prevalence space showed two regions of interest: an 'expansion' region where the successful paths tend to originate from, and a 'bottleneck' region giving rise to unsuccessful paths.

These results demonstrate two important findings at the WGS level. Firstly, there is a clear distinction between explorative and exploitative phases of pathogen evolutionary dynamics. Secondly, there is an evolutionary pathway along which pathogens exploit the search-space by increasing their centrality, before encountering a bottleneck. These findings suggest that the population of pathogens may be self-organising in the centrality-prevalence space, guided by both evolutionary and epidemic containment constraints. We observed these features despite the relatively short duration of five years covered by the WGS dataset, and the likely impact on foodborne infections of COVID-19-related public health measures [2, 48].

This study investigated the evolutionary dynamics in STM population at a high resolution offered by WGS data. This analysis confirmed that a structure-function relationship, projected on the space formed by closeness centrality and pathogen prevalence can reveal salient evolutionary dynamics. In doing so, we used a novel approach for identifying and quantifying distinct phases with the shell genome entropy and Fisher information. We believe that the approach combining network-theoretic and information-theoretic analysis will be of general benefit to studies of evolutionary dynamics, particularly in identifying patterns and phases of pathogen evolution.

Acknowledgment

This research was supported by the Australian Government through the Australian Research Council's Discovery Projects funding scheme (Project DP200103005). R J R is supported by an NHMRC Emerging Leadership Investigator Grant (GNT 2018222). We gratefully acknowledge the NSW Health Pathology Enteric Reference Laboratory for access to historical Salmonella Typhimurium collections and the NSW Health Pathology Microbial Genomics Reference Laboratory for access to the whole genome sequences. We are grateful for the use and support of the high performance computing cluster provided by the Sydney Informatics Hub, a Core Research Facility of The University of Sydney.

Data availability statement

The data cannot be made publicly available upon publication because they are owned by a third party and the terms of use prevent public distribution. The data that support the findings of this study are available upon reasonable request from the authors.

Appendix A: Network properties

Figure A1(a) shows the degree distribution of the constructed thresholded undirected SNP sub-network where a large number of isolates (e.g. singletons and pairs) are disconnected from the highly connected isolates in large network components (figure 1). We also note that the complete undirected SNP network is extremely dense—with the local clustering coefficient varying between 0.994 to 0.999, as shown in figure A1(b)—and does not show a clear partition in the clustering-prevalence space.

**Figure A1.** Network properties. (a) Degree distribution for the thresholded undirected SNP sub-network (see figure 1 for the network representation). (b) Clustering-prevalence space for the complete undirected SNP network, in contrast with figure 5 shown in main text. The neighbourhood prevalence of an isolate is measured by the number of neighbours within $G_{\text{max}} = 20$ .
Download figure:
Standard image High-resolution image

Appendix B: Shell genome data

Figure B1 illustrates gene presence/absence data in the shell genome, across two groups separated by the centrality-partition point.

**Figure B1.** Visual representation of the shell genome gene presence/absence data for the two groups identified in figure 5. Isolates are temporally ordered by time of the sample collection. Black block: gene presence, white block: gene absence. (a) Low-to-mid centrality group ( $\leqslant 3.3 \times 10^{-3}$ ) consisting of 2197 isolates sampled between 1 October 2016 and 06 May 2021, and (b) Mid-to-high centrality group ( ${\gt}3.3 \times 10^{-3}$ ) consisting of 1742 isolates sampled between 2 October 2016 to 05 May 2021.
Download figure:
Standard image High-resolution image

Appendix C: Critical centrality: sensitivity to increments

To investigate sensitivity of the Fisher information to centrality increments, we varied the centrality increment across three values: $5 \times 10^{-5}$ , $1 \times 10^{-4}$ and $2 \times 10^{-4}$ . Figure C1 shows that the identified critical centrality value is not sensitive to changes in the increment.

**Figure C1.** Normalised Fisher information in the centrality space, computed with different steps: (a) centrality increment of $5 \times 10^{-5}$ , (b) centrality increment of $1 \times 10^{-4}$ (identical to figure 7), and (c) centrality increment of $2 \times 10^{-4}$ .
Download figure:
Standard image High-resolution image

**Figure C1.** Normalised Fisher information in the centrality space, computed with different steps: (a) centrality increment of $5 \times 10^{-5}$ , (b) centrality increment of $1 \times 10^{-4}$ (identical to figure 7), and (c) centrality increment of $2 \times 10^{-4}$ .
Download figure:
Standard image High-resolution image

Appendix D: Fisher information: alternative explanations

To verify that a phase transition in the centrality space does not appear due to a specific outbreak, we identified 212 isolates that are close neighbours (i.e. within $G_{\textrm{max}} = 20$ ) to the outbreak MLVA profile 5-17-9-13-490. Then we computed the Fisher information (as detailed in section 2) and identified the phase transition for a truncated population (3727 isolates) in which the 212 outbreak isolates are removed.

The re-constructed centrality-prevalence space is shown in figure D1, and the centrality value $3.3 \times 10^{-3}$ continues to partition two centrality regions. The Fisher information of the truncated population is computed for the corresponding normalised shell genome (711 genes). Figure D3 (orange line and circles) shows that although the entire centrality range slightly changed for the truncated population without the outbreak isolates, the predominant spike in Fisher information is still observed near the conjectured critical regime.

Another sensitivity analysis was performed by removing isolates in the mid-centrality—high prevalence cluster (481 isolates). We re-constructed the centrality-prevalence (figure D2) and re-computed the Fisher information (figure D3 of the normalised shell genome (681 genes) for this truncated population (3458 isolates). Figure D3 (gray line and diamonds) shows that the predominant peak in Fisher information is robust with only a slight shift of the centrality regime.

We also verified that the observed peak in Fisher information cannot be simply attributed to presence of (two) distinct sub-populations which happen to be split at the critical centrality. Taking the node with the highest centrality (i.e. isolate 16-1157-0215, node centrality $4.1 \times 10^{-3}$ ) as the reference node, we traced its SNP distance (across the entire genome) from all other nodes. Figure D4 shows that, as the centrality reduces from the reference node, the corresponding SNP distance increases relatively smoothly across the entire population despite having a small number (i.e. 40) of outliers (figure D4, marked with orange circle outline). Here, we fit a linear model between the node centrality and the distance to reference isolate and identify outliers as nodes whose SNP distance from the reference isolate is 20% higher than the corresponding linear fit. The analysis of the centrality-prevalence space (figure D5) and the normalised Fisher information (figure D6) for the truncated population verified that the critical regime in the centrality space is still observed despite the removal of outliers.

**Figure D1.** Centrality-prevalence space for the complete SNP network derived for the truncated population without 212 outbreak isolates. The neighbourhood prevalence (shown in log scale) of an isolate is measured by the number of neighbours within $G_{\textrm{max}} = 20$ . The dashed line (marked at 100 isolates), separates lower and higher neighbourhood prevalence. See section 2 for more details on the construction of centrality-prevalence space.
Download figure:
Standard image High-resolution image

**Figure D2.** Centrality-prevalence space for the complete SNP network derived for the truncated population without 481 isolates in the mid-centrality—high prevalence cluster. The neighbourhood prevalence (shown in log scale) of an isolate is measured by the number of neighbours within $G_{\text{max}} = 20$ . The dashed line (marked at 100 isolates), separates lower and higher neighbourhood prevalence. See section 2 for more details on the construction of centrality-prevalence space.
Download figure:
Standard image High-resolution image

**Figure D3.** Normalised Fisher information in the centrality space. Black line and crosses: full population (identical to the plot shown in figure 7). Gray line and diamonds: truncated population without 481 isolates in the mid-centrality—high prevalence cluster. Orange line and circles: truncated population without 212 outbreak isolates. Centrality increment is $1 \times 10^{-4}$ .
Download figure:
Standard image High-resolution image

**Figure D3.** Normalised Fisher information in the centrality space. Black line and crosses: full population (identical to the plot shown in figure 7). Gray line and diamonds: truncated population without 481 isolates in the mid-centrality—high prevalence cluster. Orange line and circles: truncated population without 212 outbreak isolates. Centrality increment is $1 \times 10^{-4}$ .
Download figure:
Standard image High-resolution image

**Figure D4.** Centrality versus SNP distance between reference node, isolate 16-1157-0215, and all other nodes in the network. A total number of 40 nodes are considered 'outliers' (marked by orange outline) determined by linear fitting as shown in inset. Goodness of fit: $R^2 = 0.9218$ , adjusted $R^2 = 0.9217$ .
Download figure:
Standard image High-resolution image

**Figure D4.** Centrality versus SNP distance between reference node, isolate 16-1157-0215, and all other nodes in the network. A total number of 40 nodes are considered 'outliers' (marked by orange outline) determined by linear fitting as shown in inset. Goodness of fit: $R^2 = 0.9218$ , adjusted $R^2 = 0.9217$ .
Download figure:
Standard image High-resolution image

**Figure D5.** Centrality-prevalence space for the complete SNP network derived for the truncated population without 40 outlier isolates. The neighbourhood prevalence (shown in log scale) of an isolate is measured by the number of neighbours within $G_{\textrm{max}} = 20$ . The dashed line (marked at 100 isolates), separates lower and higher neighbourhood prevalence. See section 2 for more details on the construction of centrality-prevalence space.
Download figure:
Standard image High-resolution image

**Figure D6.** Normalised Fisher information in the centrality space. Black line and crosses: full population (identical to the plot shown in figure 7). Gray line and squares: truncated population without 40 outlier isolates. Centrality increment is $1 \times 10^{-4}$ .
Download figure:
Standard image High-resolution image

These observations demonstrate that the phase transition between the exploration and exploitation phases is not sensitive to the inclusion of the outbreak isolates or SNP distance outliers. In other words, it does not appear due to presence of distinct sub-populations, being a systemic feature of the overall evolutionary dynamics.

Genome entropy and network centrality contrast exploration and exploitation in evolution of foodborne pathogens

Article metrics

Submit

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

2. Methods