Extracting macroscopic quantities in crowd behaviour with deep learning

Abnormal behaviours in crowded populations can pose significant threats to public safety, and the occurrence of such anomalies often corresponds to changes in macroscopic quantities of the complex system. The automatic extraction and prediction of macroscopic quantities in pedestrian collective behaviour is therefore important. In this study, we generated pedestrian evacuation data through simulation, and calculated the average kinetic energy, entropy and order parameter of the system based on principles of statistical physics. These macroscopic quantities characterize the changes in crowd behaviour patterns over time and can also assist in detecting abnormalities. Subsequently, we designed deep convolutional neural networks (CNNs) to estimate these macroscopic quantities directly from frame-by-frame image data. Finally, a convolutional auto-encoder (CAE) model is trained to learn the underlying physics in an unsupervised manner. The results indicate that deep learning methods can directly extract macroscopic information from crowd dynamics, aiding the analysis of collective behaviour.


Introduction
Collective behaviours under extreme circumstances, e.g., crowd congestion during emergencies, are often associated with abnormal behaviours such as clogging and stampedes, and have garnered significant attention [1][2][3]. Research on collective patterns and individual actions in such scenarios is crucial. Moreover, these studies are instrumental in understanding the connection between individual decision-making and collective behaviours in extreme conditions [4,5]. To this end, a large body of research over the past several decades has explored collective behaviours through simulations and experimental approaches [1][2][3][6][7][8].
From a statistical physics perspective, one can explore social systems with more quantitative approaches [9,10]. Physical quantities such as energy and entropy can help build an understanding of the macroscopic behaviour of pedestrians [11][12][13][14][15][16][17]. For instance, the position and velocity of each pedestrian and the interaction between different individuals can be used to construct the energy function [13]. An interaction energy potential function can be used to represent the interaction between a pedestrian and the other pedestrians nearby [12]. The change in energy can be used to distinguish the normal and abnormal behaviour of a group of people. Additionally, crowd motion can be treated as a physical system, and for an open system, energy and information can be exchanged with the outside [18]. Entropy, a metric for a system's disorder, serves as a pivotal tool for detecting collective movements in pedestrian dynamics [14]. Besides the conventional velocity entropy [15,16], the position and orientation of pedestrians have also been used to compute entropy quantities [17]. While these physics-inspired macroscopic quantities offer insights into crowd behaviour analysis, their practical application in real-world scenarios is limited. This limitation stems from the reliance on detailed individual velocity or positional information, which has to be extracted from image-type data.
Over recent years, advancements in artificial intelligence (AI) have led to significant breakthroughs across scientific domains [19]. This development has enabled the collection of vast amounts of data and the application of cutting-edge machine learning (ML) techniques to social systems, e.g., criminal networks [20][21][22] and evacuation behaviours [23][24][25][26][27][28]. Deep learning (DL), a subset of ML, provides a unique tool to discern underlying patterns in intricate data. Consequently, integrating DL algorithms with spatio-temporal dynamics presents a compelling avenue for exploring human behaviours. While most current studies emphasize the role of DL in formulating evacuation strategies based on data [29][30][31], another application involves training deep neural networks (DNNs) on simulated datasets. These trained networks can then be applied to real-world datasets to assess actual scenarios or detect concealed patterns, a methodology validated in fields such as epidemiology [32,33]. As illustrated in figure 1, our work incorporates DL into the microscopic model, aiming to extract macroscopic quantities from image-type spatio-temporal maps. With this method, one can build a direct connection between image-type data and the macroscopic quantities for crowd video surveillance.
In this paper, we employ replicator dynamics to simulate evacuation scenarios, integrating both bounded rational behaviour and rational decision-making [4][34][35][36]. Utilizing the microscopic model outlined in [4], we implement a CNN model to derive macroscopic quantities that characterize crowd behaviour. Section 2 begins with an explanation of the microscopic model, i.e., the cellular automaton simulations; it then introduces macroscopic quantities that capture the crowd behaviours in evacuation, and concludes by outlining the details of preparing the dataset for training and testing. In section 3, we first present the abnormal behaviours observed in simulations alongside their corresponding macroscopic quantities. We then demonstrate the feasibility of using deep learning to extract quantities such as kinetic energy, entropy, and order parameter. We further examine the generalization capability of this approach and discuss potential challenges in applying it to real-world scenarios. A convolutional auto-encoder (CAE) model is developed to reconstruct the image-type data, and the extracted latent variables are compared to the corresponding macroscopic quantities. The correlation results demonstrate that the underlying physics can be learned well by the CAE. Our findings and discussions are summarized in section 4.

Cellular automaton modeling evacuation behaviour
Microscopic models, including the social force model, cellular automaton (CA), and magnetic field force model, provide intricate insights into individual behaviours [37][38][39]. In evacuation situations, individuals often exhibit behaviours that deviate from rationality, resulting in a spectrum of behavioural patterns [40,41]. This concept, termed bounded rationality, is shaped by various elements such as processing a range of information, which involves both environmental aspects and the actions of surrounding individuals [4,42,43]. Our research utilizes the CA model as a framework to simulate and analyze typical behaviours in evacuation scenarios within confined spaces [4].
In this study, we use an L × L cell grid to represent the space, with each cell being either empty, occupied by a pedestrian, or part of a wall. The CA model employs the Moore neighbourhood, with transition matrices P(i, t) guiding pedestrian movement, where i ∈ [1, …, N] and t ∈ [0, …, T], with N the initial total population and T the maximum evolution time. At each time step, a pedestrian can either move or stay stationary. The escape process is simulated through synchronous updates of the cellular automaton, with specified exit locations. For a detailed explanation of the grid structure, neighbour labelling, and movement rules, see [4].
The transition probability matrix, denoted as P(i, t), is defined by the equation:

$$P_{m,n}(i, t) = \frac{R_{m,n}(i, t)\, B_{m,n}(i, t)}{\sum_{m,n} R_{m,n}(i, t)\, B_{m,n}(i, t)}$$
In this formulation, R(i, t) and B(i, t) are matrices that assign weights to rational and bounded rational behaviours, respectively. Each matrix has dimension 3 × 3; its nine elements correspond to the neighbouring cells and the cell of individual i itself. These matrices are specific to each individual i and are updated at every time step t.
The matrix $R_{m,n}(i, t)$ is defined through the product:

$$R_{m,n}(i, t) = D_{m,n}(i, t)\, E_{m,n}(i, t)$$

This implies that for a neighbouring position (m, n) around individual i at time t, $D_{m,n}(i, t)$ is set to 1 if the position is unoccupied, and to ε otherwise. Similarly, $E_{m,n}(i, t)$ is assigned the value α when (m, n) points towards an exit, with all other directions assigned a minimal value of ε (set to $10^{-10}$ in our simulations) to prevent mathematical divergence. The parameter α quantifies the exit's attractiveness to individuals seeking egress, or the significance of exit location information within the model. As demonstrated in [4], an increase in the parameter α leads to a reduction in escape time, which ultimately reaches a saturation point at both the individual and systemic levels. This behaviour suggests that α serves as a critical metric for assessing optimality within evacuation behaviours. Consequently, we designate α as the 'rational parameter' within the context of this CA model. Determining the optimality of path-selection decisions in crowd dynamics thus involves the extraction of the rational parameter α. For simplicity, we assume $B_{m,n}(i, t) = 1$ in subsequent simulations to highlight rational decision-making processes. Further discussions on the implications of the rational parameter and bounded rational behaviours can be found in [4,5].
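The weighting rules described above can be illustrated with a short sketch. It is only a minimal illustration under stated assumptions, not the simulation code of [4]: the function name `transition_matrix`, the (row, column) indexing of the 3 × 3 neighbourhood, the choice of the three rightmost cells as exit-pointing directions, and the normalization of the weights to unit sum are all assumptions made here for the example.

```python
EPS = 1e-10  # minimal weight epsilon, the value quoted in the text


def transition_matrix(occupied, exit_dirs, alpha, eps=EPS):
    """Return a 3x3 matrix of move probabilities for one pedestrian.

    occupied  : 3x3 nested list of bools, True where a cell is blocked
    exit_dirs : set of (m, n) index pairs assumed to point towards the exit
    alpha     : rational parameter
    """
    weights = []
    for m in range(3):
        row = []
        for n in range(3):
            d = eps if occupied[m][n] else 1.0         # D: 1 if free, eps otherwise
            e = alpha if (m, n) in exit_dirs else eps  # E: alpha towards the exit
            b = 1.0                                    # B = 1, as in the simulations
            row.append(d * e * b)
        weights.append(row)
    # Assumed normalization: divide by the sum so the nine entries add up to 1.
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


# Hypothetical example: exit to the right, no blocked neighbours, alpha = 10.
free = [[False] * 3 for _ in range(3)]
rightmost = {(0, 2), (1, 2), (2, 2)}
P = transition_matrix(free, rightmost, alpha=10.0)
```

In this hypothetical configuration almost all probability mass is shared equally by the three exit-pointing cells; blocking one of them shifts its share to the remaining two, which is the mechanism behind the congestion discussed later.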

Physics-inspired macroscopic quantities
In the following paragraphs, we present three macroscopic quantities inspired by statistical physics. These quantities are built from the microscopic states, relying predominantly on the individual velocity $v_i$. This velocity is a vector calculated from the positional change of each individual across two sequential time steps.

Average Kinetic Energy. Within statistical physics [18], the average kinetic energy of a system is expressed as

$$\bar{E}_k = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} m_i v_i^2$$

This formula accounts for all N particles in the system, where $m_i$ and $v_i$ represent the mass and velocity of each particle, respectively. In our case, we adapt this concept to pedestrian dynamics and evaluate the corresponding average at each time step t over the $N_t$ pedestrians present in the system at that time step.
Entropy of Velocity. In statistical mechanics [18], the entropy of a system can be calculated from its microstates using the formula

$$S = -k_B \sum_i p_i \ln p_i$$

where $k_B$ represents the Boltzmann constant, and $p_i$ denotes the probability of micro-state i. For our study, we define the entropy of velocity as

$$S_v(t) = -\sum_{r=1}^{R} p_r(t) \ln p_r(t)$$

where r indexes the orientation of the velocity, with R being 9 in our context. The term $p_r(t)$ is the proportion of individuals with orientation r at time step t, calculated as $p_r(t) = n_r(t)/N_t$, where $n_r(t)$ is the number of individuals choosing orientation r and $N_t$ is the total population at time step t. It should be noted that this definition is consistent with previous works [14][15][16]. In what follows, we refer to the entropy of velocity simply as the entropy.
Order Parameter of Velocity. The concept of an order parameter is routinely used to characterize phase transitions in equilibrium systems [18]. Vicsek et al [44] introduced the order parameter to describe the kinetic phase transition in self-driven particles, from disordered states to fully ordered states and vice versa. In our study, we employ a modified definition of the order parameter as presented in [15], expressed in terms of $v_{i,x}$ and $v_{i,y}$, the velocities of individual i along the longitudinal and vertical directions, respectively. In what follows, we refer to the order parameter of velocity simply as the order parameter.
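To make the three quantities concrete, the sketch below computes them from a list of per-pedestrian velocity vectors at one time step. It is a minimal sketch under stated assumptions rather than the authors' code: the function names are hypothetical, unit mass is assumed for the kinetic energy, each distinct (vx, vy) pair on the Moore neighbourhood is treated as one of the nine orientations, and the order parameter shown is a generic Vicsek-style normalized mean velocity, which is not necessarily the modified definition of [15].

```python
import math
from collections import Counter


def average_kinetic_energy(velocities):
    """Mean of |v|^2 / 2 over pedestrians (unit mass is an assumption)."""
    if not velocities:
        return 0.0
    return sum(0.5 * (vx * vx + vy * vy) for vx, vy in velocities) / len(velocities)


def velocity_entropy(velocities):
    """Entropy -sum_r p_r ln p_r over the observed velocity orientations."""
    if not velocities:
        return 0.0
    counts = Counter(velocities)  # each distinct (vx, vy) is one orientation r
    n_t = len(velocities)
    return -sum((c / n_t) * math.log(c / n_t) for c in counts.values())


def vicsek_order_parameter(velocities):
    """Vicsek-style |sum of v| / sum of |v| (assumed form, not taken from [15])."""
    sx = sum(vx for vx, _ in velocities)
    sy = sum(vy for _, vy in velocities)
    norm = sum(math.hypot(vx, vy) for vx, vy in velocities)
    return math.hypot(sx, sy) / norm if norm > 0 else 0.0


# Hypothetical example: three pedestrians heading towards an exit on the right.
towards_exit = [(1, 0), (1, 1), (1, -1)]
s_three = velocity_entropy(towards_exit)  # equals ln 3 for three equal shares
```

With all nine orientations equally populated, `velocity_entropy` returns ln 9, the upper bound for R = 9; a fully aligned crowd gives zero entropy and an order parameter of 1.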
Figure 2 contains a series of evacuation maps generated from the simulation described in section 2.1 at sequential time steps, specifically t = 0, t = 10, t = 20, t = 40 and t = 100. Accompanying each map are the values of three quantities: $\bar{E}_k$ (average kinetic energy), $S_v$ (entropy), and $O_v$ (order parameter). These values are reported at each time step, providing insight into how the evacuation process evolves. More quantitative discussions can be found in section 3.1.

Data-set preparation and neural networks
The data set prepared for training the neural networks was generated from the CA model and comprises a total of 100,000 images. It covers 100 different initial populations $\rho_0$ uniformly ranging from 0.1 to 0.5, with 1,000 images generated for each. For each initial population, there are 10 different values of the rational parameter, α ∈ {1, 2, …, 9, 10}, and 100 frames of evolution, t ∈ [1, 100], for each value. Each image is one snapshot of the evacuation process on a square grid with a side length of 28, so each image has 784 pixels. Each pixel is either 0 or 1, where 0 represents empty space and 1 represents an individual present at that spot.
In our study, we divided the data into 10 groups based on varying α, with each group containing 10,000 images. Depending on the selected number of frames, datasets of different sizes can be compiled to serve as input for the model. Our training and test datasets were allocated in an 80% to 20% ratio, respectively. The labels for training were chosen to be the three quantities corresponding to a continuous sequence of n frames, with each quantity averaged over the frames (for a discussion of this, see section 3.2).
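The construction of n-frame inputs with frame-averaged labels can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `make_windows`, the stride of 1 between consecutive windows, and the tuple layout of the labels are assumptions, since the text does not specify how the windows are drawn.

```python
def make_windows(frame_labels, window, stride=1):
    """Slide a window over one simulation run and average the labels.

    frame_labels : list of (E_k, S_v, O_v) tuples, one per frame
    window       : number of consecutive frames per input sample (n)
    stride       : step between window starts (assumed, not given in the text)
    Returns a list of (start_index, averaged_label) pairs.
    """
    samples = []
    for start in range(0, len(frame_labels) - window + 1, stride):
        chunk = frame_labels[start:start + window]
        averaged = tuple(sum(column) / window for column in zip(*chunk))
        samples.append((start, averaged))
    return samples


# Dataset bookkeeping taken from the text: 100 densities x 10 alpha x 100 frames.
N_DENSITIES, N_ALPHA, N_FRAMES = 100, 10, 100
total_images = N_DENSITIES * N_ALPHA * N_FRAMES   # 100,000 images
pixels_per_image = 28 * 28                        # 784 pixels

# Toy labels for one run: frame t carries the label (t, 2t, 0).
toy_run = [(float(t), 2.0 * t, 0.0) for t in range(N_FRAMES)]
windows = make_windows(toy_run, window=5)
windows_per_group = len(windows) * N_DENSITIES    # windows in one alpha group
test_per_group = windows_per_group // 5           # 20% test split
```

Under the assumed stride of 1, a 100-frame run with a window of 5 yields 96 windows, so one α group of 100 runs gives 9,600 samples and a 20% split gives 1,920 test samples. This agrees with the number of testing samples quoted in the captions of figures 6 and 7, although the source does not state that the count was obtained this way.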
In this study, we employed a convolutional neural network (CNN) model in the TensorFlow-Keras framework, tailored for a regression task on image-type data. A visualization of the CNN model is shown in figure 3. The architecture begins with a convolutional layer featuring 32 filters of 3 × 3 kernel size, coupled with the ReLU activation function, which processes input data of shape (L, L, n). A max pooling layer with a 3 × 3 pool size follows, reducing the spatial dimensions and thereby the parameter count and computational load, which also helps to mitigate overfitting. A similar configuration of a convolutional layer followed by max pooling is repeated, further refining the feature extraction. After the convolutional layers, a flatten layer converts the 2D outputs into a 1D vector, as required by the ensuing dense layer. This dense layer comprises 32 neurons, also with the ReLU activation function. A dropout layer with a rate of 0.5 randomly nullifies outputs from the dense layer during training to prevent overfitting. The architecture ends in a linear output layer with 3 units, corresponding to the 3 continuous target variables, i.e., average kinetic energy, entropy and order parameter in our case. In total, the CNN model contains 14,947 trainable parameters.
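As a consistency check on the architecture described above, the trainable-parameter count can be reproduced by hand. The sketch below is plain arithmetic rather than a Keras model, and several details are assumptions not stated in the text: an input window of n = 5 frames, 32 filters in the second convolutional layer, 'valid' padding with stride 1, and pooling with a stride equal to the pool size. The helper names are hypothetical.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel convolution."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (inputs + 1) * units


def conv_out(size, kernel):
    """Spatial size after a 'valid' convolution with stride 1 (assumed)."""
    return size - kernel + 1


def pool_out(size, pool):
    """Spatial size after non-overlapping max pooling (assumed)."""
    return size // pool


L, n, filters = 28, 5, 32          # n = 5 frames is an assumption

size = conv_out(L, 3)              # 28 -> 26
size = pool_out(size, 3)           # 26 -> 8
size = conv_out(size, 3)           # 8 -> 6
size = pool_out(size, 3)           # 6 -> 2
flat = size * size * filters       # 2 * 2 * 32 = 128 features after flatten

total = (
    conv2d_params(3, n, filters)          # first Conv2D
    + conv2d_params(3, filters, filters)  # second Conv2D
    + dense_params(flat, 32)              # Dense(32)
    + dense_params(32, 3)                 # linear output with 3 units
)
```

Under these assumptions the four layers contribute 1,472, 9,248, 4,128 and 99 parameters, which sums to the 14,947 quoted in the text. Pooling, flatten and dropout layers carry no trainable parameters.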
For model optimization, we used the Adam optimizer, a variant of the classical stochastic gradient descent algorithm [45], known for its efficacy in iteratively updating network weights based on training data. The mean squared error (MSE) was adopted as the loss function, a standard choice for regression tasks, which quantifies the average of the squared errors between predicted and true values. Additionally, the model evaluation metrics included the mean absolute error (MAE), the MSE and R-squared ($R^2$), which evaluate the average absolute deviation, the squared deviation and the goodness of fit, respectively.
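For reference, the three evaluation metrics can be written out explicitly. The sketch below uses the standard textbook definitions in plain Python; the function names are hypothetical, and in practice library implementations would normally be used instead.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot.

    Assumes y_true is not constant, otherwise SS_tot is zero.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, purely illustrative.
truth = [1.0, 2.0, 3.0, 4.0]
good = [1.5, 2.0, 3.0, 3.5]
bad = [4.0, 3.0, 2.0, 1.0]

r2_good = r_squared(truth, good)
r2_bad = r_squared(truth, bad)   # negative: worse than predicting the mean
```

Note that $R^2$ is not bounded below by zero: a model that predicts worse than the constant mean of the labels yields a negative value, as in the second toy example.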

Abnormal behaviour
Figure 4 illustrates how the three macroscopic quantities (average kinetic energy, entropy and order parameter) evolve over a range of time steps in one sample simulation. The upper panel depicts the average kinetic energy, showing how it fluctuates with time under different α conditions, with a distinct decrease in the first 20 time steps for large α. The entropy panel then reveals the temporal evolution of system disorder for the same set of α values, highlighting the increase in the first 20 time steps for large α. Lastly, the bottom panel shows the variation of the order parameter, quantifying the change of the collective behaviour from order to disorder.
The three macroscopic quantities reveal an emergent abnormal behaviour around the 20th time step for larger α. This phenomenon is characterized by individuals moving towards the exit with distinct orientations during the initial 20 time steps, particularly when α is large, indicating the significance of exit information as described in [4]. In our simulations, a larger value of this parameter means that the exit information weighs more heavily on each individual's next movement decision; a smaller value means that its influence is weaker. A larger α makes a large number of pedestrians more inclined to flock to the exit at the beginning of the escape, so that congestion occurs rapidly, which results in lower average kinetic energy, higher velocity entropy, and a lower order parameter. In contrast, without the guidance of the exit information, pedestrian actions at the beginning of the escape are nearly random, leaving these three physical quantities almost unchanged.
From a physics perspective, in the large-α scenario, the increase of velocity entropy arises with the emergence of blockages. Initially, individuals predominantly choose the three rightmost directions, aiming for the nearest exit, as their preferred paths, leading to an initial entropy larger than $S_v|_{\rm ini} \simeq 1.10$, derived from $-3 \times \frac{1}{3} \ln \frac{1}{3} = \ln 3$. However, as clogging occurs, the paths for escape become restricted, forcing many individuals into random movement. With all nine possible directions becoming equally probable, the limiting value of the entropy in this random walk scenario, $S_v|_{\rm rand}$, reaches $-9 \times \frac{1}{9} \ln \frac{1}{9} = \ln 9 \simeq 2.20$. This transition from an initial orderly state to one characterized by random movement and blockages exemplifies an inherent increase in entropy.
In figure 2, this phenomenon is also visualised. Due to the limited size of the exit, congestion near the exit happens around the 20th time step, concurrently leading to a reduction in movement speed. Lower entropy and higher order parameter values characterize ordered behaviour, as noted in [15]. Therefore, these macroscopic quantities aptly describe the collective behaviour in evacuation scenarios, enabling the direct detection of abnormal behaviours.

Extracting macroscopic quantities
In this section, we present the performance of CNN models in extracting macroscopic quantities from image-type data. As depicted in figure 4, the fluctuations of the quantities are not negligible, which makes it necessary to use frame-averaged values as new training labels. For the remainder of this section, the term 'window' refers to the number of frames used as input data and over which the corresponding labels are averaged. For training, we chose the data set with α ≥ 5 and trained for 50 epochs. Both the training loss and the validation loss converge in this set-up.
In table 1, the effect of varying window sizes on test accuracy is quantitatively evaluated using several metrics. Performance is shown for three distinct window sizes: 1, 5, and 10. For each window size, the corresponding values of MSE, MAE, and $R^2$ for average kinetic energy ($\bar{E}_k$), entropy ($S_v$), and order parameter ($O_v$) are provided. Notably, as the window size increases from 1 to 10, there is a consistent decrease in both MSE (from 0.0043 to 0.0006) and MAE (from 0.0429 to 0.0164), indicating improved accuracy. Similarly, the $R^2$ values increase with larger window sizes, suggesting that larger windows provide a better fit to the data. Specifically, the $R^2$ values increase from 0.213 to 0.644 for $\bar{E}_k$, from 0.617 to 0.900 for $S_v$, and from 0.343 to 0.810 for $O_v$. Among the three quantities, the prediction of the entropy achieves the best performance, which is helpful for detecting abnormal behaviours because the entropy is a more sensitive indicator than the others [15].
Figure 5 presents a comparison of model predictions against true values for the three macroscopic quantities over a specified range of time steps (1 to 95). The well-trained model using 5-frame images as input predicts the frame-averaged quantities correctly. Although fluctuations are inevitable, the qualitative tendencies of the three observables can be extracted directly by the well-trained CNN model. The starting point of the abnormal behaviour (around the 20th time step) can be precisely predicted. Overall, the figure offers an intuitive evaluation of the model's capability to capture complex collective behaviours, which is consistent with the existing physics insights.
It should be noted that there are discrepancies between the CNN predictions and the true physical quantities, notably in the average kinetic energy. These deviations may stem from limitations in our training dataset or from the intrinsic difficulty of predicting time-dependent physical quantities. Although the model reliably captures the crucial patterns, as evidenced by the statistical metrics, accurate predictions for each individual time step are not guaranteed. This challenge is not unique to our study but is a common limitation in time-series forecasting tasks [46]. Our primary focus lies in the model's ability to predict overarching trends rather than discrete, time-specific outcomes. This capability is particularly valuable for characterizing and understanding clogging behaviour in evacuation scenarios.

Scenario generalization
Given that the model was trained exclusively on data with α values greater than or equal to 5, the two histograms in figures 6 and 7 show the generalization ability of the trained model across diverse scenarios. Figure 6 shows prediction errors, including those beyond the training range (α from 1 to 4). The bar graph displays the mean squared error (MSE) for average kinetic energy, entropy, and order parameter (OP) at varying α values. Higher MSE values at lower α indicate decreased prediction accuracy in these scenarios, whereas the lower MSE values within the trained α range indicate better model performance on the testing data. This is understandable, since lower α values correspond to a more disordered scenario in which individuals lack sufficient exit information, a situation that is rare in the large-α scenario.
Figure 7 shows the R-squared values, indicating the model performance in scenarios with α values of 1, 2, 3, and 4, which are not included in the training data. For each α, three bars represent the three quantities; negative values were replaced with zeros, and such cases point to deteriorated predictive performance. Conversely, when α values are within the trained range of 5 and above, there is a marked increase in R-squared values, indicating the model's proficiency in fitting the training labels. In addition, the fit for entropy is better than for the other two quantities. To summarize, the two histograms not only reflect how the model performs within its training boundaries, but also reveal its capacity to generalize to untrained scenarios, providing a thorough picture of the model's overall effectiveness.

Latent variables
In this section, we explore the capability of deep convolutional neural networks (CNNs) to learn the underlying physics of crowd behaviour through unsupervised learning, employing a convolutional auto-encoder (CAE) model. The CAE has two components, an encoder and a decoder, as figure 8 shows. The encoder uses convolutional and max-pooling layers to progressively downsample the input image-type data and extract its crucial features, reducing it to a set of compressed features, alternatively referred to as latent variables. The architecture involves sequentially decreasing filter counts (32, 16, 8) with a corresponding reduction of the spatial dimensions. Subsequently, the decoder reverses this process through convolutional layers and upsampling, progressively restoring the image to its original dimensions, with the aim of reconstructing the input image with high fidelity. The model, compiled with the Adam optimizer and the binary cross-entropy loss function, is tailored for reconstructing 5-frame image sequences. For training, we choose the same training data set as above and set the batch size to 16. After 10 epochs, training converges and the reconstruction loss becomes stable. We then extracted the latent variables predicted from the testing data set. To evaluate whether the well-trained CAE model can learn the physics quantities, we calculated the Pearson correlation coefficient (PCC), $R_P$. The PCC is defined as

$$R_P = \frac{\sum_j (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_j (x_j - \bar{x})^2}\, \sqrt{\sum_j (y_j - \bar{y})^2}}$$

where x and y represent the extracted latent variables and the physics-inspired macroscopic variables, respectively, the index j runs over the samples of the testing data set, and $\bar{x}$, $\bar{y}$ are their average values on the testing data set. It takes a value in [−1, 1] and measures the linear correlation between two sets. Because the encoder outputs a (2, 2, 1) compressed feature map, we first flatten all features into four latent variables. The PCC is then calculated to quantify the correlation between the extracted latent variables and the three macroscopic quantities, i.e., average kinetic energy, entropy and order parameter.
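The correlation analysis between latent variables and macroscopic quantities can be sketched in a few lines. This is a minimal sketch using the standard definition of the Pearson coefficient; the function names `pearson` and `correlation_table`, the dictionary layout of the inputs, and the toy numbers are assumptions for illustration and do not come from the paper's data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_table(latents, quantities):
    """Correlate every latent variable with every macroscopic quantity.

    latents    : dict name -> list of values over the test samples
    quantities : dict name -> list of values over the same samples
    Returns a dict keyed by (latent_name, quantity_name).
    """
    return {
        (ln, qn): pearson(lv, qv)
        for ln, lv in latents.items()
        for qn, qv in quantities.items()
    }


# Toy data, purely illustrative: one latent variable and two quantities.
latents = {"z1": [1.0, 2.0, 3.0, 4.0]}
quantities = {"S_v": [2.0, 4.0, 6.0, 8.0], "E_k": [8.0, 6.0, 4.0, 2.0]}
table = correlation_table(latents, quantities)
```

In this toy case the latent variable is perfectly correlated with one quantity and perfectly anti-correlated with the other, mirroring the sign pattern one would look for in a correlation map of four latent variables against three quantities.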
Figure 9 indicates that $S_v$ is positively correlated with all four latent variables, and its correlation coefficient reaches 0.88 with latent variable 1. The other two physics quantities show a clearly negative correlation with the latent variables. These results mean that the well-trained CAE can produce meaningful physics variables in an unsupervised manner. The high absolute values of $R_P$ validate the ability of the CNN models to discern and represent the underlying macroscopic physics in crowd behaviours.

Conclusions
In conclusion, our study combines replicator dynamics with deep learning to offer a novel perspective on crowd behaviour analysis, particularly in evacuation scenarios. By integrating concepts from bounded rational behaviour and rational decision-making as outlined in prior research [4][34][35][36], we successfully simulate pedestrian evacuation and subsequently employ convolutional neural networks (CNNs) to extract crucial macroscopic quantities in both supervised and unsupervised settings. Beginning with cellular automaton (CA) model simulations, our research proceeds to define and quantify macroscopic quantities that reflect crowd behaviours during evacuations.
We introduce three physics-inspired macroscopic quantities (average kinetic energy, entropy, and the order parameter) to characterize collective behaviours. Our simulations reveal notable fluctuations and overarching trends in the clogging phenomenon, and the results underscore the efficiency of the deep learning methodology in extracting quantitative insights from image-based data representations. This study demonstrates the capability of deep learning to extract macroscopic information directly from the dynamics of crowds, thereby proposing a data-driven approach to collective behaviour analysis. The combination of microscopic modelling with deep learning provides a framework for exploring collective behaviours, with promising applications in enhancing public safety.
Furthermore, we discuss the performance of scenario generalization and the ability to extract macroscopic physics in an unsupervised manner. Incorporating deep learning with physics-inspired quantities into crowd behaviour video surveillance presents a promising yet challenging frontier. The integration of these quantities can enhance model adaptability to diverse real-world scenarios, which is crucial for accurate and efficient real-time processing [47,48]. This approach also aids in addressing the scalability challenges posed by zero-shot and few-shot learning methods, which are essential for recognizing novel behaviours in dynamic environments [49]. Customizing models to specific crowd dynamics, while leveraging the synergy between deep learning and physics-based insights, could significantly refine surveillance systems, making them more effective and adaptable for varied urban settings [50]. Some of these ideas will be explored further in future work.

Figure 1. Flowchart of learning macroscopic quantities from spatio-temporal maps generated by a microscopic model. The left panel shows evacuation maps simulated from a microscopic model. The bottom panel represents a generic neural network model, which is specified as a deep CNN model in our case. The upper panel indicates macroscopic quantities, e.g., entropy, energy and order parameter.

Figure 2. An evacuation process simulated by the CA model with parameters L = 28, α = 10, $\rho_0$ = 0.39, where $\rho_0 = N/L^2$. The orange squares depict individuals while the white spaces signify empty areas. The black solid line constrains behaviours within a closed room and the only exit is located at the far right. The values listed at the bottom correspond to different quantities across varying time steps.

Figure 3. The CNN model begins with an input layer that processes 2D image-type data, followed by two Conv2D layers. Each Conv2D layer uses 3 × 3 kernels and the ReLU activation function. After each Conv2D layer, MaxPooling is applied, and Dropout is used to mitigate overfitting. A Flatten operation bridges the last Conv2D layer and the output, converting 2D data into a 1D array for the final Dense layer, which is a fully-connected layer producing the output values.

Figure 4. The figure illustrates the variation of macroscopic quantities in the context of evacuation dynamics, as modulated by the rational parameter α. Displayed across three panels, it captures the average kinetic energy (upper panel), entropy (middle panel), and order parameter (bottom panel), under diverse conditions. This depiction is based on a cellular automaton (CA) simulation with parameters L = 28 and $\rho_0$ = 0.4, serving as a representative case study.

Figure 5. Predicted macroscopic quantities derived from a specific sample are presented, utilizing a cellular automata (CA) simulation framework.The parameters for this simulation are set as follows: lattice size L = 28, rational parameter α = 10, initial density ρ 0 = 0.4, with an observation window of 5.This configuration elucidates the dynamics and emergent properties under these defined conditions.

Figure 6.Histograms showing the mean squared error (MSE) for predicting the energy, entropy, and order parameter within the trained model were generated using 1920 testing samples.

Figure 7. Histograms showing the R-squared for predicting the energy, entropy, and order parameter within the trained model were generated using 1920 testing samples.

Figure 8. CNN-AutoEncoder. The upper panel is the encoder and the bottom panel is the decoder.

Figure 9. The Pearson correlation coefficient is employed to analyze the relationship between the extracted latent variables and three macroscopic quantities, $S_v$, $\bar{E}_k$ and $O_v$. A red coloration signifies a positive correlation, whereas a blue coloration denotes a negative correlation.

Table 1. Test accuracy on different average windows.