Machine learning-based prediction of Q-voter model in complex networks

In this article, we consider machine learning algorithms to accurately predict two variables associated with the Q-voter model in complex networks, i.e. (i) the consensus time and (ii) the frequency of opinion changes. Leveraging nine topological measures of the underlying networks, we verify that the clustering coefficient (C) and information centrality emerge as the most important predictors for these outcomes. Notably, the machine learning algorithms demonstrate accuracy across three distinct initialization methods of the Q-voter model, including random selection and the involvement of high- and low-degree agents with positive opinions. By unraveling the intricate interplay between network structure and dynamics, this research sheds light on the underlying mechanisms responsible for polarization effects and other dynamic patterns in social systems. Adopting a holistic approach that comprehends the complexity of network systems, this study offers insights into the intricate dynamics associated with polarization effects and paves the way for investigating the structure and dynamics of complex systems through modern methods of machine learning.


Introduction
Interactions among the components of a complex system have given rise to properties not present in its isolated parts [1].For instance, the collective behavior of ants in a colony provides a compelling illustration of emergence.While individually following simple rules, ants exhibit complex behaviors such as efficient food foraging, elaborate nest construction, and coordinated defense [2].Such an emergence phenomenon significantly extends beyond the natural world, since it also manifests within our society through intricate interactions among agents, groups, and institutions.
A substantial consequence of emergence is social polarization, according to which agents develop increasingly extreme opinions and display diminished tolerance for opposing viewpoints, ultimately leading to societal divisions.Numerous studies have associated the phenomenon with negative outcomes in political contexts, as seen in the recent elections in both Brazil and the United States [3,4,5,6].In Brazil, heightened polarization culminated in a significant event on January 8, 2023, when key institutions in Brasília, the capital of Brazil, were invaded.This event was the result of escalating tensions stemming from polarized political discourse.The Supreme Federal Court, the National Congress building, and the Presidential Palace were among the targeted institutions.Similarly, the United States also faced its own challenges associated with polarization.A notable incident occurred on January 6, 2021, when a crowd stormed the United States Capitol in an attempt to overturn the results of the presidential election.Therefore, the causes and effects of polarization in social networks must be comprehended so that effective communication strategies and social interventions that mitigate its detrimental impact can be designed [7,8].
Towards a deeper understanding of social polarization, various mathematical models have been developed [9] and the most sophisticated ones have recently considered the dynamics of interactions between agents and their underlying structure.Indeed, consensus models must be simulated in complex networks to be more realistic, since the network topology heavily influences both their dynamics and the final result of consensus generation [10].
Several models, including the Ising model, the Sznajd model, the voter model, the naming game, the bounded confidence model, and the Q-voter model, address complex phenomena stemming from interactions among individuals in social and physical contexts and are adapted for complex networks.Researchers employ these models to identify conditions fostering consensus emergence and network features facilitating the process.Simulations within complex networks are crucial to achieving a more accurate portrayal of consensus formation.The intricate network topology significantly influences model dynamics and consequently impacts consensus generation outcomes [11].The Ising model, originating from physics, focuses on material magnetization by representing the magnetic orientation of spins in a three-dimensional lattice.The interaction between neighboring spins aims to minimize the system's energy, leading to phenomena like the Ising phase transition [12,13,14,15].In contrast, the Sznajd model explores how similar opinions can influence others.The premise is that people with coinciding opinions are more likely to persuade others, leading to the formation of opinion clusters [16,17,18,19].Meanwhile, the voter model simplifies decision-making in a population, where individuals adopt the majority opinion of their neighbors, illustrating how social influences can drive convergence towards dominant opinions or polarization [20,9].The naming game addresses language evolution, where individuals attempt to communicate and reach a consensus on names for concepts, balancing communicative efficiency and linguistic diversity [21,9,22,23].Bounded confidence explores how opinions change through social interactions, assuming people update their opinions only when the difference from others' opinions falls within a specific limit [24,25].Finally, the Q-voter model offers an approach to simulate collective decisions within groups of individuals.In this model, each agent adopts the opinion of one of its randomly selected Q-neighbors.These Q-neighbors represent a subset of neighboring agents, and the extent of this subset, denoted by Q, significantly influences the dynamics of opinion diffusion.This model investigates how connectivity and information exchange between individuals impact consensus formation.By studying how opinions spread through the Q-voter framework, researchers gain insights into the emergence of consensus, polarization, or the coexistence of diverse viewpoints within a population [9,26,27].In [28], researchers investigate the impact of polarization in the three-state Q-voter model, considering limited confidence and noise.By incorporating these factors, the study reveals how agent interactions lead to the formation of groups with divergent opinions, complicating the convergence to a single opinion.Similarly, [29] examines the role of anticonformity and limited confidence in the Q-voter model.This study demonstrates that anticonformity amplifies polarization and emphasizes the coexistence of groups with similar yet distinct opinions, especially when limited confidence is present.Furthermore, [30] introduces a mathematical model that examines the effects of conformity and anticonformity on opinion polarization.This study investigates a similar opinion dynamics model based on the Q-voter, analyzing how the interplay between these behaviors influences the formation of groups with divergent opinions.These collective studies substantially contribute to a deeper understanding of the underlying dynamics of opinion polarization in social contexts.
Empirical investigations have consistently provided compelling evidence that different network topologies exhibit varying degrees of polarization and consensus formation [31,32,33,34,35].For example, recent studies have shown the adoption of the Q-voter model within modular networks can result in highly polarized public opinions [31].On the other hand, in scale-free networks, highly connected agents can expedite the process of consensus formation while potentially amplifying extreme polarization [32].Furthermore, studies have delved into the influence of network clustering, degree distribution, and other network properties on the dynamics of consensus formation [9], highlighting the crucial role of network topology in the development of realistic models for comprehending consensus formation within complex networks.By considering the intricate interplay between network structure and opinion dynamics, researchers can attain a more comprehensive understanding of the factors that shape the emergence of consensus and polarization.
Given the significant influence of network topology on the emergence of consensus, an essential question is whether it is feasible to develop a machine learning model that can forecast dynamic variables based on network properties.Such an inquiry has been widely explored in various fields, including the prediction of both epidemics in human contact networks [36,37] and synchronization in coupled oscillators [38,37].The investigations have not only demonstrated the possibility of forecasting the behavior of dynamic systems from the network topology but also underscored the importance of comprehending the relationship between network structure and dynamics in those systems as in the recent article by Brooks and Porter [39].Both our study and the one by Brooks and Porter share a common focus on complex phenomena within social networks.We employ interdisciplinary approaches that integrate complex network theory, system dynamics, and machine learning.Both studies acknowledge the pivotal role of network structure in shaping social dynamics and investigate opinion dynamics within social networks, although with different emphases.While both studies delve into opinion dynamics, our research primarily centers on utilizing machine learning to predict variables based on the Q-voter model.In contrast, Brooks and Porter's research delves into how media exposure influences ideological content within social networks.This distinction underscores the significance of media in their study, while our research places a strong emphasis on network structure and agent interactions.
The application of machine learning algorithms in the study for the prediction of consensus time and frequency of opinion changes in the Q-voter model offers several advantages.Machine learning promotes the capture of complex patterns, learning from historical data, and adaptation to evolving dynamics; it is a powerful tool for uncovering intricate relationships and enhancing predictive accuracy.Moreover, its use in the context of Q-voter represents a novel approach, pushing the boundaries of traditional analysis and providing new insights into the mechanisms driving opinion dynamics in complex social systems.The consensus is a significant metric that indicates the level of agreement among agents in a network.Conversely, the frequency of opinion changes reflects a network's ability to maintain its beliefs and showcases the level of volatility in the system.Understanding and mitigating the effects of polarization in complex network systems is of utmost importance, as it can significantly impact both the consensus formation process and the stability of opinions within a network.Both metrics play a crucial role in the comprehension of the behavior of social systems and offer insights into the factors contributing to stability or instability within such systems [40].
This study provides valuable insights into the intricate relationship between network structure and social dynamics, highlighting the potential of complex network measures for analyzing dynamic systems.Additionally, it demonstrates the effectiveness of complex network structures in accurately predicting the consensus time and frequency of opinion changes within the Q-voter model using machine learning algorithms.The significance of each network feature in these predictions was evaluated, revealing the clustering coefficient (C) and information centrality (IC) as the most influential measures for predicting these outcomes.Furthermore, the robustness of these predictions was tested using three distinct initialization methods in the Q-voter model, specifically assessing the model's behavior when initialized with high degree, low degree, and a random selection of agents with positive opinions.
The article is organized as follows: Section 2 is divided into four parts.The first part introduces the simulated Q-voter model, Subsection 2.2 describes the investigated networks, Subsection 2.3 explains the network measurements, and Subsection 2.4 presents the machine learning algorithms used for prediction.Section 3 provides the results, and Section 4 is dedicated to relevant observations and conclusions.

Stochastic simulation of Q-voter model
In the context of the Q-voter model, a group of Q agents (Q-voters) influences the opinion of a single agent.This interaction determines the number of neighbors considered by an agent for decision-making, as dictated by the parameter Q.This model is particularly interesting for studies of social dynamics since it captures the impact of group influence, conformity, and social reinforcement on opinion dynamics.Furthermore, it exhibits a rich phase-transition behavior, depending on the value of Q and network topology, leading to various outcomes such as consensus, fragmentation, and coexistence of opinions [41,42,43,44,45].Introduced in [46], its applicability extends to all integer-values of Q >0, meaning that Q can encompass a range of values greater than 0. Furthermore, by setting Q=1, we directly return to the standard voter model.Within this framework, the possibility of repetition is considered, implying that a specific neighbor can be selected multiple times.Thus, when Q is greater than the number of neighbors (the degree of a node), the opinion of the same neighbor will be taken into account more than once.
Consider a network of N voters (also known as agents, nodes, spins, or individuals).Each is defined by a single dynamical binary variable s(x, t) = j, where j = +1 or j = −1, x = 1, ..., N , and t represents time.From a social standpoint, s(x, t) represents a two-point psychometric scale (yes/no, agree/disagree) opinion of an agent placed at node x at time t on a particular subject.
The initial fraction of agents with positive opinions (p+) is fixed at the beginning of the simulation and randomly distributed to the network nodes.Parameter ϵ represents the probability of an agent x acting independently of their neighbors, indicating their unwillingness to yield to group pressure.Consequently, (1 − ϵ) represents conformity, influencing the likelihood of an agent adopting the majority opinion of her/his Q neighbors.Note that the individual opinion of the selected agent x is not taken into account in the probability of opinion change or retention in the dynamics.Table 1 shows the fixed parameters of Q-voter, including the number of nodes in complex networks (N = 1, 000), probability of an agent acting independently (ϵ = 0.01), an initial fraction of agents with positive opinions (p+ = 0.20), and the number of neighbors (Q = 2).The value of β represents the probability of an agent changing their opinion to the opposite when there is no consensus among their neighbors.
The parameters were fixed toward establishing a consistent baseline for our machine learning-based prediction of consensus time and frequency of opinion changes.Consensus time is the relaxation time of a finite-size system needed to approach a stationary state.By keeping them constant, our exploration can focus on the impact of other variables and a more thorough analysis of our machine learning model's predictive performance regarding the desired outcomes can be conducted.The initial percentage of agents selected was modified to have a positive opinion in three ways: through random agent selection, and by selecting high-and low-degree agents.Algorithm 1 exemplifies the stochastic simulation, followed by an illustration in Figure 1 of the model.In other words, all agents have a binary opinion, represented here by the colors red and blue.Suppose an agent has a red opinion; then, their opinion can be altered based on the following social response: the probability of nonconformity, i.e., reluctance to yield to group pressure, with a probability ϵ, of changing their opinion.Alternatively, conformity (1-ϵ) represents the probability of behaving like their neighbors.If the neighbors share a consensus, meaning they all have the same opinion, the agent will switch to the blue color or remain in the red color.However, if there is no consensus among the neighbors, with a probability β, the agent will switch to the blue color, and with a probability of 1-β, they will maintain their opinion.

Networks
Nine complex network measures were examined, as discussed in Subsection 2.3.The analysis involved eight distinct topological structures, including Erdős-Rényi [48], Barabási-Albert linear [49], Barabási-Albert nonlinear with α = 0.5 and α = 1.5 [50], Lancichinetti-Fortunato-Radicchi (LFR) graphs [51], Watts-Strogatz [52], Waxman [53], and path graph [54].The Erdős-Rényi network model is generated by randomly adding connections between nodes with a uniform probability.In contrast, the non-linear Algorithm 1 Q-voter model algorithm 1: Initialize a complex network of size N representing the agents 2: Assign each agent a binary variable, s(x, t) with x ∈ [1, N ] at time t, whose values +1 or -1 representing two opposing opinions (j = +1 or j = −1) 3: for each time step t do 4: Randomly select an agent x 5: Randomly choose Q neighbors of agent x (allowing for repetition) if all Q neighbors have the same state then

Absence of Consensus
Figure 1.Illustration of the Q-voter model: All agents have a binary opinion, represented here by the colors red and blue.Suppose an agent has a red opinion; then, their opinion can be altered based on the following social response: the probability of non-conformity, i.e., reluctance to yield to group pressure, with a probability ϵ of changing their opinion.Alternatively, there is conformity (1-ϵ), which represents the probability of acting like their neighbors.If the neighbors have a consensus, meaning they all share the same opinion, the agent will switch to the blue color or remain in the red color.However, if there is no consensus among the neighbors, with a probability β, the agent will switch to the blue color, and with a probability of 1-β, they will maintain their opinion.The figure was created by the authors and is based on [47].
Barabási-Albert model is constructed iteratively, incorporating preferential attachment of new nodes to existing ones through a non-linear function that considers the node's connections.The LFR model is widely employed for creating networks with realistic community structures, assigning nodes to communities based on degree and community size distributions, and establishing connections that consider both intra-and intercommunity links.The Watts-Strogatz model introduces the concept of small-world networks by randomly rewiring a portion of links in a regular lattice.A path graph is a specific type of graph in graph theory that consists of a linear sequence of connected nodes, where each node is linked to the next node in the sequence by a single edge.This creates a structure resembling a straight line of nodes, and it is often used as a simple representation of an ordered sequence of elements or events.A path graph is created by defining the nodes in the desired order and connecting them sequentially with edges.Lastly, the Waxman model takes into account geographic proximity and node attractiveness to determine the formation of connections, considering both physical distances and random appeal.For each of the mentioned networks, Appendix A provides details of the Python functions used, and 100 unique instances were generated, with each network consisting of N = 1, 000 nodes and an average degree ranging from 9 to 10, that is, the available dataset comprises 800 instances of complex networks (denoted as i).

Network Measurements
To effectively capture and explain the dataset's predominant variability, a visual representation of the principal component analysis (PCA) plot is provided in Appendix C.This PCA plot serves as a powerful tool for gaining insights into the underlying patterns and structures within the dataset.Subsequently, the Q-voter model was simulated in each of these structures to measure the time taken to reach consensus (Y i ) and the total number of opinion changes that occurred in the model (C i ).It is hypothesized that both Y i and C i can be predicted using a feature vector derived from the network structure, denoted as X i = X i1 , X i2 , . . ., X ik , where X ik represents the k-th measure extracted from network i.The subsequent explanation primarily focuses on the prediction of Y i , although the same process is applied to the prediction of C i .Therefore, the learning model is defined by The goal is to infer the function f () that relates Y i to the network measures.Estimating Y i is treated as a regression problem, where δ represents a random error term independent of X i , following a normal distribution with a mean of zero and a standard deviation of σ.While feature selection and model comparison algorithms can be used to identify components of X i that contribute to predicting Y i , this study employed conventional network measures which are presented in Table 2.
The first measure utilized in this study was the clustering coefficient (C), a local measure, which quantifies the extent to which nodes in a network tend to form tightly connected clusters.It assesses the likelihood of two neighbors of a node being connected, reflecting the local clustering patterns within the network [52].Closeness centrality (CLC), another local measure, was employed to calculate the proximity of a node to all other nodes in the network.It reflects the average distance between a node and all other nodes, indicating the efficiency of information or resource flow within the local neighborhood of a node [55].Betweenness centrality (BC) is a measure that identifies nodes acting as critical intermediaries in the network.BC quantifies the extent to which a node lies on the shortest paths between other pairs of nodes, thus indicating its influence over the flow of information or resources within its vicinity [56].The shortest path length (SPL) measures the minimum number of edges required to traverse between two nodes in the network, providing insights into network connectivity and the efficiency of information or resource transfer within local regions of the network [57].Degree Pearson correlation coefficient (PC) examines the correlation between the degrees of connected nodes, capturing the tendency of nodes with similar degrees to connect and indicating the presence of assortativity or disassortativity within the network [58].Information centrality (IC) assesses the importance of a node based on its ability to control the flow of information in the network, considering the number of shortest paths that pass through the node [59].Subgraph centrality (SC) measures the importance of a node within its local subgraph by considering the closed walks that pass through the node, capturing its influence within specific network neighborhoods [60].Approximate Current Flow Betweenness Centrality (AC) quantifies the extent to which a node controls the flow of electric current in the network, considering the current paths between all pairs of [61].Finally, Eigenvector centrality (EC) determines a node's importance based on its neighboring nodes' centrality, assigning higher importance to nodes connected to other important nodes and capturing the concept of influence [62].Such measures, collectively used here, provide valuable insights into complex network structures, connectivity, efficiency, influence, and community organization [63].Details and equations for each of the mentioned measures, along with the Python functions used, are provided in Appendix B.

Machine learning algorithms
The machine learning algorithms utilized are the least absolute shrinkage and selection operator (LASSO), multi-layer perceptron regressor (MLP), random forest (RF), and extreme gradient boosting (XGBoost).Among the several techniques used to improve the proposed machine learning algorithms, nested cross-validation, shuffle, and grid search are highlighted.The former is a multi-round cross-validation procedure adopted in machine learning for model selection and performance assessment [64].It is a more rigorous model selection and performance evaluation than traditional cross-validation since it reduces the risk of overfitting and provides a more accurate estimate of the model's performance on unseen data [65].Its main idea is the existence of an outer loop, which divides the data into training and test sets, and an inner one, which uses cross-validation to determine optimal values for the model's hyperparameters.Shuffle was employed during nested cross-validation to avoid possible biases in the selection of training and testing data, ensuring the model learned in a balanced way throughout the range of data.Finally, grid search searched for the best model hyperparameters by systematically exploring different combinations of possible values for them.The set of techniques used significantly contributed to the development of a more robust and accurate model.A 5-fold outer shuffle cross-validation and a 5-fold inner crossvalidation were also adopted, following similar approaches described in previous studies [66].During the inner folds, a grid search hyperparameter optimization was performed -specific details can be found in Table D1 in the Appendix D. The coefficient of determination, R2, is a metric used to measure how well a regression model fits the data [67].However, when we add more predictors to the model, R2 can increase even if these new predictors don't really help explain the variation in the dependent variable.To address this, we use the R2 adjusted, which considers the number of predictors and penalizes the inclusion of irrelevant ones.This adjustment gives us a more accurate evaluation of how well our model predicts the outcome.In simpler terms, we prefer R2 adjusted over R2 because it prevents values from being artificially inflated by including unnecessary predictors.This ensures a more reliable assessment of our model's performance.
The schematic in Figure 2 provides an overview of the comprehensive process outlined in this article, which encompasses several steps: a) Generation of complex networks: We generate the eight types of networks under study.b 1 ) Calculation of topological measures: In this step, we compute the nine topological measures for all the previously generated complex networks.b 2 ) Implementation of the Q-voter model: In this stage, we implement the Q-voter model on each of the complex networks using three distinct initialization methods represented by colored circles: high-degree (purple), low-degree (green), and random selection (orange).This analysis is performed for both Y i (consensus time) and C i (frequency of opinion changes).c) Creation of the dataset: A dataset is constructed containing information from all generated networks.Each row represents a specific network, and the columns contain topological measure calculations.
The dataset also includes values for initialization methods (high-degree, low-degree, and random selection) for both Y i (consensus time) and C i (frequency of opinion changes).d) Application of machine learning algorithms: Based on the collected information, machine learning algorithms are used to conduct further analyses and extract significant insights and summary statistics from the generated data.

Results
Figure 3 presents boxplots illustrating four machine learning algorithms: LASSO (light brown box), RF (pink box), XGBoost (blue box), and MLP (yellow box) for predicting Y i .It is worth noting that RF (box 2, pink) and XGBoost (box 3, blue) exhibit the tallest boxes, indicating their tendency to yield higher average adjusted R2 values compared to the other algorithms.Furthermore, LASSO, RF, and XGBoost consistently produce the best results across all initialization methods, including high degree, low degree, and random selection.These three algorithms were selected for further analysis to predict C i , and the results are presented in Figure 4. Remarkably, LASSO (box 1, light brown) and RF (box 2, pink) emerge as the tallest boxes, suggesting their inclination to yield higher average adjusted R2 values compared to XGBoost.For this reason, we chose the RF algorithm, which stood out as the best in both figures, to illustrate the following figures (Figure 5 and Figure 6).
In Figures 5-A and B, we refer to the variables Y i and C i , respectively, and illustrate the relationship between predicted values (ŷ) on the y-axis and their corresponding original values (y) on the x-axis.Each point in the plot represents a specific data instance, where the x-coordinate indicates the actual value, and the y-coordinate represents the predicted value.The red dotted line represents a linear regression model, which provides an approximation of the overall trend in the data, aiding in the visualization of our model's predictive performance.For Y i (Figure 5-(A)), we calculated Pearson's correlation coefficients, resulting in values of 0.998 for high-degree initialization (purple dots), 0.991 for low-degree initialization (green dots), and 0.990 for random selection (orange dots).Additionally, we computed the adjusted R2 values, which were 0.996, 0.982, and 0.968, respectively, for the same initialization methods.For C i (Figure 5-(B)), we also calculated Pearson's correlation coefficients, yielding values of 0.999 for high-degree initialization, 0.991 for low-degree initialization, and 0.991 for random selection.The adjusted R2 values were 0.997, 0.983, and 0.945, respectively.These results underscore the correlations observed between the original and predicted values for both Y i and C i , regardless of the initialization method used.
The RF algorithm assessed the input variables (network features) in our model.It evaluates the significance of variables by observing the improvement they provide to the model when incorporated into decision trees.The prioritization of network features based on their average importance across different initialization methods, as depicted in Figure 6, provides valuable insights into their predictive capabilities.In this analysis, the features were ranked according to their average importance, considering The process involves several key steps: a) Generation of complex networks: In this initial step, we create complex networks for analysis.In this illustrative example, we generate four networks labeled as i = 1, i = 2, i = 3, and i = 4, each consisting of a total of 10 nodes.It's worth noting that in our article, we generate a set of 800 complex networks.b 1 ) Calculation of topological measures: In this step, we compute various topological measures for all the previously generated complex networks.However, for the sake of simplification in this illustration, we focus on a single measure: Betweenness Centrality (BC).We apply this calculation to one of the four networks, specifically network i = 4. b 2 ) Implementation of the Q-voter model: In this stage, we implement the Q-voter model on each of the complex networks using three distinct initialization methods represented by colored circles: high-degree (purple), low-degree (green), and random selection (orange).This analysis is performed for both Y i (consensus time) and C i (frequency of opinion changes).For the sake of simplification, we select only network i = 4 to illustrate this process.c) Creation of the dataset: In this step, we construct a dataset that contains information from all the generated networks.Each row of the table represents a specific network, and the columns contain the calculations of topological measures for these complex networks.Additionally, we include the corresponding values for initialization methods (high-degree, low-degree, and random selection) regarding Y i and C i .For illustration purposes, we present information only for network i = 4, including BC and Y i .However, in the full article, our table encompass 800 rows and 15 columns, comprising nine topological measures, along with three variations of initialization methods for Y i and C i .d) Application of machine learning algorithms: Finally, based on the gathered information, we apply machine learning algorithms to conduct further analyses and obtain significant insights and summary statistics from the data generated in the previous steps.
three initialization methods: high degree (purple bar), low degree (green bar), and random selection (orange bar).Upon analyzing the bar chart (Figure 6), it becomes apparent that network features with higher average importance occupy the top positions.Notably, when attempting to predict Y i , the clustering coefficient (C) emerges as the most significant measure (Figure 6-A).This indicates that the network's structure, particularly the formation of cohesive groups, plays a crucial role in the speed of consensus attainment within the Q-voter model.In terms of C i , information centrality (IC) stands out as the most relevant network measure (Figure 6-B).This suggests that the dissemination and influence of information within the network play a fundamental role in the dynamics of opinion changes.These network measures play vital roles in predicting different aspects of the Q-voter model.These inferences underscore the significance of different network aspects concerning the various phenomena under study.While the C focuses on consensus formation, IC pertains to opinion changes.These findings offer valuable insights for comprehending and forecasting the behavior of voter models in broader contexts.In contrast, measures such as eigenvector centrality (EC), Degree Pearson correlation coefficient (PC), and subgraph centrality (SC) do not exhibit significant predictive capabilities in these scenarios.Also, note that, individually, the CLC (represented by the purple bar in Figure 6-A) becomes more relevant in networks initialized with a high degree of connectivity, while the AC (indicated by the orange bar in Figure 6-A) is more significant in the randomly initialized networks.CLC gains importance when the dynamics of the Qvoter model are initiated by selecting nodes with a higher degree, as it measures how easily a node can communicate or influence other nodes in the network.When starting the dynamics with high-degree nodes, these high-degree nodes can have a substantial influence on the spread of opinions, and CLC can capture this capacity for influence.Similarly, the significance of the AC centrality measure when initiating the dynamics of the Q-voter model by selecting nodes randomly may be related to the definition of this centrality measure and the dynamics of opinion propagation in the Q-voter model on a network.AC is a measure that reflects the efficiency with which a node can transmit information or influence others in the network.When the dynamics of the Q-voter model are initiated randomly, there is no initial preference for high-degree or low-degree nodes.Therefore, it is crucial to identify nodes that can effectively facilitate the spread of opinions throughout the network, and AC can highlight nodes playing an important role in this regard.
Finally, the learning curve was calculated specifically for the two best results achieved using the high-degree initialization method for Y i (adjusted R2 = 0.996) and C i (adjusted R2 = 0.997).By manipulating the size of the training set, the learning curve offers valuable insights into the model's predictive capabilities [68].This approach provides the advantage of understanding how the model's performance improves as more training instances are used, focusing on the most promising initialization methods.

Conclusions
In this article, we predicted dynamic variables associated with Q-voter models based on network properties.We verified that the prediction is very accurate and determined which features most contribute to the emergence of polarization.Mainly, we show that the clustering coefficient and information centrality are the most important measures to quantify these patterns of connections.Moreover, variations in the initialization method, to start the dynamic of the Q-voter model with a positive opinion, were performed to predict consensus of the time (Y i ) and frequency of opinion changes (C i ).
Initially, agents were randomly selected, following the original method of the Q-voter model.Subsequently, agents with the highest degree were identified and selected to investigate their potential for strongly influencing the overall opinion dynamics due to their extensive connections.Lastly, agents with the lowest degree of connectivity were considered initiators of the dynamics to explore the potential impact of less influential agents on opinion evolution.Although modifications in the initialization methods of positive opinions affect the results, their impact is relatively small.Indeed, subsequent interactions and information exchange among agents tend to overshadow the influence of the initially selected agents, leading to a consensus of opinions and a limited long-term impact of the initial agent selection.Nonetheless, the exploration of the role of both highly connected and less connected agents provided valuable insights into the complex dynamics of opinion formation and consensus emergence within the Q-voter model.We found that, regardless of the initialization method used to start the Q-voter model, the initial influence of the selected agents tends to decrease over time.This occurs because, as agents interact and exchange information, their opinions are influenced by others.Over time, opinions begin to converge towards a consensus, and the initial influence of randomly selected, high-or low-connectivity agents becomes equivalent Illustration showing the relationship between their corresponding original values (y) and predicted values (ŷ) for (A) Time and (B) Frequency regarding the selection of agents with high degree (purple dots), low degree (green dots), and random selection (orange dots) for the initiation of dynamics.This analysis was conducted using the RF algorithm.This analysis was conducted using the RF algorithm.
since there is not a significantly superior initialization method over the others; all of them yield equally good results.When we say that the absence of influential agents contributes to a more efficient consensus, we mean that the absence of agents with disproportionate influence in the network means that each agent plays a similar role in shaping the collective opinion.This is important because polarization often occurs when a few extremely influential agents have a disproportionate impact on others' opinions.In the article [69], the authors investigate the influence of highly connected individuals in opinion dynamics.Their research illustrates that a small number of highly connected individuals can significantly influence the polarization of opinions within a network.Furthermore, Sunstein's book 'Republic: Divided Democracy in the Age of Social Media' [70] provides insights into the role of online platforms and highly influential users in shaping public discourse, potentially leading to polarization.
If all agents have similar influence, it is less likely that a few highly influential agents dominate the conversation and pull the collective opinion to opposite extremes. .The examination of the most crucial features, which are determined based on the average importance of complex network measures, was conducted to predict both (A) Y i and (B) C i using various initialization methods.These methods encompassed the selection of agents with the highest degree (purple bars), lowest degree (green bars), and random selection (orange bars) to initiate the dynamics.Notably, the clustering coefficient (C) and information centrality (IC) consistently emerged as the two most significant measures in both scenarios.This analysis was carried out employing the RF algorithm.
Therefore, the absence of highly influential agents can contribute to a more balanced and less polarized decision-making process.
Expanding our methodology to explore the variance prediction within the Qvoter model can provide further insights into the factors that contribute to diverse outcomes in social dynamics.Future work in this direction will contribute to a more comprehensive understanding of the complex nature of polarization and its potential implications.By leveraging machine learning algorithms and complex network features, this study can advance research in the field of complex systems and pave the way for future investigations on the dynamics of polarization in various social contexts.Overall, the combination of machine learning algorithms and complex network analysis has the potential to revolutionize our comprehension of social systems, leading to a deeper understanding of human behavior and the development of strategies that promote positive societal outcomes.
-n: The number of nodes in the generated network.In the example, 1000 nodes were created.-m: The number of outgoing edges generated for each node or a list containing the number of outgoing edges for each node explicitly.In the example, each node has 5 outgoing edges.-outpref : A boolean value that determines whether the out-degree of a node affects its citation probability.In the example, it is set to True.-directed: A boolean value that determines whether the generated network is directed.In the example, it is set to False, indicating that the network is undirected.-power: The power constant of the nonlinear model.In the example, the value is 1.0, representing the linear model.-zero appeal: The attractiveness of nodes with degree zero.In the example, it is set to 1. -implementation: The algorithm used to generate the network.In the example, it is set to psumtree, which uses a partial prefix-sum tree.-start from: If provided and not None, this parameter uses another network as a starting point for the preferential attachment model.In the example, no starting network is specified (None).
Note that to generate the Barabási networks in a non-linear manner, we modified the power parameter to 0.5 and later to 1.5.
• LFR (Lancichinetti-Fortunato-Radicchi Benchmark): We generated LFR networks using the LFR benchmark graph function [73].A table following this one provides information on how this network was generated.Parameter Descriptions:

Parameter
-n: Number of nodes in the created network.
τ 1 : Power law exponent for the degree distribution of the created network.This value must be strictly greater than one.τ 2 : Power law exponent for the community size distribution in the created network.This value must be strictly greater than one.µ: Fraction of inter-community edges incident to each node.This value must be in the interval [ Parameter Descriptions: -n: The number of nodes.
-k: Each node is joined with its k nearest neighbors in a ring topology.
-p: The probability of rewiring each edge.-n: Number of nodes.
-L: The maximum distance between nodes is set to be the maximum distance between any pair of nodes.-domain: Domain size, given as a tuple of the form (x min, y min, x max, y max).-metric: Euclidean distance metric is used.
-seed (integer, random state, or None): Indicator of random number generation state (default is None).
• Path: We used the nx.path graph function from the NetworkX library to generate this network [76].Note that in our code available on GitHub here for generating networks, we have added specific lines of code for the path graph to ensure that the average degree falls within the range of 9 to 10, aligning with the characteristics of the other networks generated.

Parameter Value n 1000
Table A6.Parameters for the Path network model.
Parameter Descriptions: -n: Number of nodes.

Appendix B. Network Measurement Details
within a network.The mathematical formula for calculating C of a node v in a graph is as follows: where: • C(v) is the local clustering coefficient of node v.
• E(v) is the number of edges between the direct neighbors of v (i.e., the triangles that include node v).
• k v is the degree of node v, which is the number of direct neighbors it has.
The transitivity local undirected(mode="zero") is a Python function commonly employed in network analysis using the Igraph library.This function calculates the C for individual nodes within a graph.It operates in "zero" mode, which specifically considers triangles in the network that share exactly one node with the node being analyzed.The output of this function is a data structure, typically a list or a similar container, containing the C corresponding to each node in the graph.Finally, we calculate the mean to get a final value.

Appendix B.2. Closeness Centrality (CLC)
Local closeness centrality (CLC) is a network analysis metric that measures how close a node is to all the other nodes in its local neighborhood within a graph.It quantifies how quickly information can spread from a specific node to its neighboring nodes.Nodes with higher CLC are considered to be more central within their local environment, as they can reach other nodes more efficiently.The mathematical formula for the CLC of a node v is as follows: where: • CLC(v) is the local closeness centrality of node v.
• d(v, u) represents the shortest path distance between nodes v and u in the graph.The in the denominator calculates the sum of the shortest path distances from node v to all other nodes u in its local neighborhood.
The closeness centrality(normalized=True) function is a commonly used Python function in network analysis using the Igraph library.This function calculates CLC measures for each node in a graph.When we use normalized=True, it indicates that we want the CLC values to be normalized.In other words, the values are adjusted to be within the range of 0 to 1, making these measures comparable across different graphs, regardless of the network's size or scale.Finally, by calculating the average of these normalized measures, we obtain a representative value of the average closeness centrality in the network, which is useful for assessing the communication efficiency of nodes within their respective local environments.terms of node degrees, indicating whether there is a tendency for assortativity (positive correlation) or disassortativity (negative correlation) in the network's connectivity.The formula for calculating the PC is given by: where: • P C(v) is the Pearson Correlation Coefficient.
• x i and y i are the degrees of the nodes.
• x and ŷ are the means of the node degrees.
In Python, we can calculate the Pearson Correlation Coefficient for Degrees using libraries such as NetworkX.The functions degree pearson correlation coefficient() in NetworkX can be used to calculate this measure on a graph represented by the respective libraries.
The result will inform us about the nature of the network's connectivity about node degrees, which is useful for network analysis and characterization.

Appendix B.6. Information centrality (IC)
Information centrality (IC) is a network metric used to assess the importance of nodes in a graph in terms of how they facilitate the flow of information or communication within the network.This metric is based on the idea that some nodes may act as critical points for the efficient dissemination of information in a network.Information centrality measures the amount of information a node is capable of controlling or transmitting to other nodes in the network.The mathematical formula for IC is defined as: where: • IC(v) is the information centrality of node v.

•
represents the sum over all nodes u different from v.
• d(v, u) is the geodesic distance between nodes u and v, i.e., the length of the shortest path between them.
This formula calculates the information centrality of a node by summing the inverses of the geodesic distances between the node in question v and all other nodes u in the graph.The shorter the path between v and u, the greater the contribution of node u to the information centrality of v. Therefore, nodes that are closer to v will have a higher contribution to its information centrality.
In Python, we can use the information centrality() function from NetworkX to calculate the IC for the nodes in a graph.The function returns a dictionary where the keys are the nodes in the graph, and the values are the corresponding information centrality scores.This allows us to identify the most critical nodes in the network in terms of their ability to influence the flow of information.
The underlying idea is that nodes connected to other important nodes are themselves important.Therefore, eigenvector centrality takes into account not only the number of connections a node has but also the importance of the nodes to which it is connected.The mathematical formula to calculate the EC of a node in a graph is defined by the following equation: where: • EC(v) is the eigenvector centrality of node v.
• λ is the eigenvalue associated with the largest eigenvalue of the adjacency matrix of the graph.
• represents the sum over all nodes u connected to node v.
• w(u, v) is the weight of the edge between nodes u and v.
• C(u) is the eigenvector centrality of node u.
The eigenvector centrality() function is part of the Igraph library in Python, used to calculate eigenvector centrality in a graph.EC is a measure that assesses the importance of nodes in a graph based on their connections, taking into account the importance of the nodes to which they are connected.The EC values are not scaled, meaning they reflect the raw measure of importance for each node in the graph.To obtain a single centrality measure for the entire graph, it's common to calculate the average of the centrality values for all nodes.
The Python code used to generate the Q-voter model, as well as the complex networks and measures of complex networks, is available for access at [77].

Appendix C. Principal Component Analysis (PCA)
The analysis of cumulative explained variance provides valuable insights into the dimensionality reduction achieved by the PCA algorithm.The plot of cumulative explained variance illustrates the amount of information retained as the number of principal components increases (Figure C1).This information helps determine the minimum number of principal components required to capture a significant portion of the original data's variability, considering the dataset with 800 rows and 9 columns.This analysis is crucial for making decisions regarding the dimensionality reduction process, in the context of changing network topologies every 100 rows.
On the other hand, the plot of the reduced data using the principal components visually represents the transformed dataset in a lower-dimensional space (Figure C1).By visualizing the data in this reduced space, which is particularly important in the case of high-dimensional data with 9 complex network measures, a better understanding of its structure and potential patterns or clusters that may exist is gained.These plots play a vital role in validating the effectiveness of the PCA algorithm in capturing the most relevant features of the data while reducing its dimensionality, considering the complexity and diversity of the network measures across different network topologies.
Additionally, the proximity of data points in the reduced space reflects the similarity between the models, allowing for the identification of clusters or groupings within each network topology and across different topologies.This further aids in understanding the relationships and similarities among different instances in the dataset, facilitating comparative analysis and identification of common characteristics or trends.Overall, these plots provide valuable insights into the data, aiding in analysis, interpretation, and model comparison, particularly in the context of complex networks with multiple measures and changing topologies.

7 :
agent x takes the value of the Q neighbors

Figure 2 .
Figure 2. Schematic Overview of the process outlined in this article.The process involves several key steps: a) Generation of complex networks: In this initial step, we create complex networks for analysis.In this illustrative example, we generate four networks labeled as i = 1, i = 2, i = 3, and i = 4, each consisting of a total of 10 nodes.It's worth noting that in our article, we generate a set of 800 complex networks.b 1 ) Calculation of topological measures: In this step, we compute various topological measures for all the previously generated complex networks.However, for the sake of simplification in this illustration, we focus on a single measure: Betweenness Centrality (BC).We apply this calculation to one of the four networks, specifically network i = 4. b 2 ) Implementation of the Q-voter model: In this stage, we implement the Q-voter model on each of the complex networks using three distinct initialization methods represented by colored circles: high-degree (purple), low-degree (green), and random selection (orange).This analysis is performed for both Y i (consensus time) and C i (frequency of opinion changes).For the sake of simplification, we select only network i = 4 to illustrate this process.c) Creation of the dataset: In this step, we construct a dataset that contains information from all the generated networks.Each row of the table represents a specific network, and the columns contain the calculations of topological measures for these complex networks.Additionally, we include the corresponding values for initialization methods (high-degree, low-degree, and random selection) regarding Y i and C i .For illustration purposes, we present information only for network i = 4, including BC and Y i .However, in the full article, our table encompass 800 rows and 15 columns, comprising nine topological measures, along with three variations of initialization methods for Y i and C i .d) Application of machine learning algorithms: Finally, based on the gathered information, we apply machine learning algorithms to conduct further analyses and obtain significant insights and summary statistics from the data generated in the previous steps.

FrequencyFigure 4 .
Figure 4.Each boxplot represents the distribution of adjusted R2 values for the corresponding machine learning algorithm (LASSO, RF, and XGBoost), considering different initialization methods (high degree, low degree, and random selection) to predict C i .Box 1, which corresponds to the LASSO algorithm, is the highest.This indicates that, on average, the adjusted R2 values for the LASSO algorithm are higher compared to the other algorithms (RF and XGBoost) considered.

Figure 5 .
Figure 5. Illustration showing the relationship between their corresponding original values (y) and predicted values (ŷ) for (A) Time and (B) Frequency regarding the selection of agents with high degree (purple dots), low degree (green dots), and random selection (orange dots) for the initiation of dynamics.This analysis was conducted using the RF algorithm.This analysis was conducted using the RF algorithm.

Figure 6
Figure 6.The examination of the most crucial features, which are determined based on the average importance of complex network measures, was conducted to predict both (A) Y i and (B) C i using various initialization methods.These methods encompassed the selection of agents with the highest degree (purple bars), lowest degree (green bars), and random selection (orange bars) to initiate the dynamics.Notably, the clustering coefficient (C) and information centrality (IC) consistently emerged as the two most significant measures in both scenarios.This analysis was carried out employing the RF algorithm.

Figure C1 .Figure D1 .
Figure C1.The top figure illustrates the Cumulative Explained Variance in PCA (Principal Component Analysis) Analysis.This plot showcases the cumulative amount of variance in the data explained by each principal component, while the subsequent figure displays the Reduced Data Plot using Principal Components.The reduced data is represented in a lower-dimensional space defined by the principal components, allowing for a simplified representation of the original data while preserving its underlying structure.These figures provide insights into the data used to feed our machine-learning prediction models and demonstrate the effectiveness of PCA in reducing the dimensionality of the input data.

Table 1 .
Q-voter model parameters with default values.

Table 2 .
Measures of complex networks used here.
The findings depicted in Figures D1 indicate that the complete database is not indispensable for achieving the highest level of validation accuracy.Surprisingly, even with a mere 200 training instances, the model demonstrated exceptional performance.These results emphasize that a relatively smaller training set can still yield satisfactory results.Each boxplot represents the distribution of adjusted R2 values for the corresponding machine learning algorithms (LASSO, RF, XGBoost, and MLP), considering different initialization methods (high degree, low degree, and random selection) to predict Y i .Among the algorithms, box 2 and box 3 correspond to the RF and XGBoost algorithms, respectively, and show the highest adjusted R2 values.This indicates that, on average, the RF and XGBoost algorithms outperform the other algorithms (LASSO and MLP) in terms of predictive accuracy.

Table A3 .
Parameters for the LFR Benchmark network model.

Table A4 .
0, 1].-average degree: Desired average degree of nodes in the created network.This value must be in the interval [0, n].-min degree: Minimum degree of nodes in the created graph.This value must be in the interval [0, n].-max degree: Maximum degree of nodes in the created network.If not specified, this is set to n, the total number of nodes in the network.-min community: Minimum size of communities in the network.If not specified, this is set to min degree.-max community: Maximum size of communities in the network.If not specified, this is set to n, the total number of nodes in the network.-tol: Tolerance when comparing floats, specifically when comparing average degree values.-max iters (int): The maximum number of iterations to attempt in order to create community sizes, degree distribution, and community affiliations.-seed(integer, random state, or None -default): An indicator of the random number generation state.• Watts-Strogatz: We used the nx.watts strogatz graph from the NetworkX library to generate a Watts-Strogatz network [74].The following table contains information about the values of each parameter of this network.Parameters for the Watts-Strogatz network model.

• Waxman :
[75]sed the nx.waxman graph function from the NetworkX library to generate a Waxman network[75].

Table A5 .
Parameters for the Waxman network model.