Simple mechanism rules the dynamics of volleyball

In volleyball games, we define a rally as the succession of events observed since the ball is served until one of the two teams on the court scores the point. In this process, athletes evolve in response to physical and information constraints, spanning several spatiotemporal scales and interplaying co-adaptively with the environment. Aiming to study the emergence of complexity in this system, we carried out a study focused on three steps: data collection, data analysis, and modeling. First, we collected data from 20 high-level professional volleyball games. Then we conducted a data-driven analysis from where we identified fundamental insights that we used to define a parsimonious stochastic model for the dynamics of the game. On these bases, we show that it is possible to give a closed-form expression for the probability that the players perform n hits in a rally using only two stochastic variables. Our results fully agree with the empirical observations and represent a new advance in the comprehension of team-sports competition complexity and dynamics.


Introduction
The engaging nature of the complexity related to team sport competition has historically captured the attention of academics [1,2,3,4,5,6,7,8,9,10,11,12]. In recent years, the interest in studying this area has increased. Promoted by the demands of the thriving global sports market, the research community is now seeing the study of the phenomena related to these systems as challenges to face in the context of a hyper-competitive twentyone-century industry [13]. However, most of the papers available in the literature use empirical data to perform statistical analysis aiming to generate key performance indicators [14,15,16,17,18,19]. This approach can be handy for coaches using data-driven based training systems, but it usually does not provide a deeper understanding of the phenomena.
From an alternative perspective, our view focuses on use empirical data to uncovering the mechanisms that explain the emergence of global observables; based on a multidisciplinary framework that connects physics with complexity in the social sciences [20]. In previous works, we have successfully studied an modeled several aspects of the game of football [21,22]. In this paper, we report a strikingly simple mechanism that governs the dynamics of volleyball.
To place the complexity of this sport in context, let us briefly discuss the basic concepts.
In volleyball games, two teams of six players, separated by a high net, try to score points by grounding the ball on the other's team court. In possession of the ball, teams may hit the ball up to three times, but players must not hit the ball twice consecutively. During a rally, namely the time between the serve and the end of the play, the teams compete for the point using specially trained hits to control the ball, set, and attack. Blocks, likewise, are trained actions to try to stop or alter an opponent's attack, but unlike the others, the hit involved in blocks does not count as an "official" hit. It means that a player who hits the ball by blocking can consecutively touch the ball again without infringing the rules. On the other hand, we want to point out that the emergence of complexity in the game is strictly related to players' ball control and accuracy. Note that if one of the players is not able to handle the ball, their team misses the point; if one of the players cannot pass the ball accurately, it increases the probability that the teammates can not control the ball, and if one of the players cannot attack accurately, it increases the probability that the opponents can handle the ball and eventually obtain the point.
In order to study this complex dynamic and unveil the mechanisms behind it, we proceed as follows. First we visualized and collected data from 20 high-level international games.
Then we performed a data-driven analysis to get insight into the underlying mechanisms of the game. With this, we proposed a parsimonious model and developed an analytical approach to obtain a closed-form expression able to capture in remarkable good approximation one of the most relevant observables of the dynamics: the probability that the players perform n hits in a rally, P n . A variable that is key to characterizing the system complexity, and it is also related to teams performances [23,24], it is involved in self-regulated phenomena [25], may affect the motor activities during a top-level match [26], etc.

Data collection
To collect the data used in this work, we coded a visualization program that allows extracting information from video records of volleyball games. The program is designed for a trained operator to record the most relevant events observed during rallies. The information gathered at each event include, 1. An identification number for the player that hit the ball in the event.
2. The time of the event referred to the beginning of the game.
3. The position of the player that hit the ball. In two dimensions, referred to the court's floor. 4. The type of hit performed: pass, set, attack, block, etc.
We visualized 10 games of the 2018 FIVB Men's Volleyball Nations League and 10 games of the 2019 FIVB Men's Volleyball Challenger Cup, collecting the information of 3302 rallies in total. The visualization program and an anonymized version of the collected data used for this work can be found in [27]. For further information on the visualized games and details of the visualization process, please c.f. Supplementary Material section S1.

Data analysis
We analyzed the collected data to understand why the players succeed or fail when they intend to score a point. In Fig. 1 we show heatmaps indicating the probability that the players hit the ball in particular zones of the court. Panels zones. Notice that, in the following, we will use these references to discuss the results.
As a first observation, we can see that, in successful possessions, it is highly probable that the players hit the ball in their natural action zones: the first hit (reception) in zones 5-6-1, the second one (set up) around 2-3, and the attack in the zones 4-3-2 (at the front) and 6-1 (at the back). In unsuccessful possessions, we can see that the players have to move out of the actions zones to perform the hit. For instance, in panel (b), we can see that the probability of the players performing the 1st hit around zone 2 rises. Note, that it is not tactical convenient because this is the zone that the setter uses to perform the 2nd hit, and the tight presence of other players may hinder them and produces a fail or cause a reduction of accuracy. If the 1st hit is performed inaccurately, it produces what we see in panel (d), where the setter frequently has to go to the attackers' zones to handle the ball causing the same hinder effect as in the previous case. In this context, the attackers' options are limited. In extremes cases, they will not attack. They will pass an easy ball to the opponents just to keep playing. Therefore, in these cases, the players' performance diminishes. In conclusion, when the players have to move out of their action zones to handle the ball, it increases the probability of both missing the ball and hitting the ball imprecise. In the following, we will use these observations to define our model.

Model
We have observed that, during rallies, when the players have to move out of their action zone to perform a hit, the probability of missing the ball increases, and the precision decreases. In the light of these observations, to model the rallies' dynamic, we introduce two stochastic parameters, p and q, as follows, With probability 1 − q, likewise, the opposite occurs in both situations.
Following these rules, the players will move the ball around the court n times until one of them misses a hit, ending the rally.
5 Analytical approach to obtain P n We now focus on developing an analytical approach to obtain the probability distribution that the players perform n hits during rallies, P n . First, we introduce two approximation: (i) we will suppose that the teams always try to use the three reglementary hits to score the point. In other words, they do not pass the ball to the other team at the first or second hit. Considering that these events are rare in the dataset (< 0.1% of the observed cases), we understand that this is a reasonable approximation; and (ii) we will not consider the blocks as hits. By rule, when a team blocks an attack, they still have the three hits to control, set and attack. Therefore, in practice, the effects of blocks can be absorbed as inefficiencies in the attack.
Let us start calculating the cases P 1 , P 2 and P 3 . We obtain, In eq. (1a), the probability q indicates that the service is performed outside the action zone of the player in charge of taking the first hit (a difficult service), and the probability (1 − p) indicates that this player cannot perform the hit. Consequently, in the rally only one hit is performed: the service. In eq. (1b), q indicates a difficult service, p indicates that the player can perform the second hit, (1 − q) indicates the pass is performed outside the action zone of the player in charge of taking the second hit (setter), and (1 − p) indicates that the later cannot perform the third hit. Consequently, in the rally two hits are performed: the service, and the first hit. With a similar analysis, we can obtain the probability of performing three hits, P 3 , that we show in eq. (1c).
To obtain P 4 and further values of n is useful to define the transition probabilities linked to the teams' chances of performing the three hits that define a complete possession, where P dd is the probability that a team receives a difficult ball and, after the three hits, delivers to the other team a difficult one, P de is the probability that a team receives a difficult ball and delivers an easy one, P ed the probability that a team receives an easy ball and delivers a difficult one, and P ee the probability that a team receives and delivers an easy ball. In this frame, P 4 can be written as follows, where the first term is the probability that a team receives a difficult ball, handles it, delivers a difficult ball, and the other team cannot achieve the first hit. The second term is the probability that a team receives an easy ball, delivers a difficult one, and the opponent team cannot achieve the first hit.
To calculate P n for higher values of n, we use a binary decision tree. As an example, let us focus on calculating P 7 . In Fig. 2 we exhibit a three levels tree where the nodes indicate the number of hits performed and the edges the transition probabilities to the next level.
Note, the level of the tree, l, is related to the number of performed hits by the expression n = 3l − 2. If we define the probability D 3 as, obtained from the sum of all the paths leading to the third level leaves indicating that the team in possession is delivering a difficult ball (see circles in Fig. 2), then, to obtain P 7 we just multiply for (1 − p), which indicates the team receiving the attack cannot perform the 8th hit. Therefore, To calculate the probability for values of n between two levels of the tree, for instance P 8 and P 9 , we use Eq. (4) as follows, As the reader may be noted, the calculation of D l is useful to obtain P n . Therefore in the following, we focus on calculating D l ∀ l ∈ N. We can write the probability of "achieving" a level l of the tree, A l , as, where probability E l is the sum of all the paths leading to the l th level leaves which transition probabilities indicate that the team in possession is delivering an easy ball.
(2)], we obtain the following system of mutually recursive linear sequences, Then, the roots of the characteristic polynomial linked to the 2 × 2 matrix of system (9), can be used to express the general solution of D l as, where constants a and b can be found by using Eqs. (10) and (11), and the expressions for In the light of the described above, we have formally found the exact solution for P n .
In the following set of equations we summarize our result, with n = 3l − 2, and l ∈ N.
In Fig. 3 we show the probability distribution P n . The extracted from the data is colored in red and the theoretical calculated with equations (13)  Regarding the obtained values of p and q, it is necessary to highlight that in a real games, each player on the court should have a different pair of values p and q. This is because distinct players should perform differently. Moreover, these pairs could depend on the set because players vary their performance during the game. However, with these results, we show that the simplification proposed in our model allows us to represent each player as a single average player. In this context, since we construct the empirical probability P n with the results of several matches, teams, and players, it is expected that throughout the minimization process, we obtain values close to 0.5.
The reader might note that the shape of the curves follows a zig-zag pattern with the peaks placed at the values n = 1, 4, 7, 10, ..., which are the number of hits related to teams' complete ball possessions. We can explain this non-trivial behavior by analyzing the natural dynamics of the game. If a team can control an attack in the first hit, they will use the rest of the reglementary hits to set the ball and attack back. As we have previously mentioned, it is unlikely they attack at the first or second hit. Because of this, it is more probable to find in the dataset cases, for instance, with n = 7 than cases with n = 6.
We want to highlight that despite the complex dynamics of the game, the analytical approach based on our parsimonious stochastic model succeeds in capturing the non-trivial behavior of this relevant observable.

Multi-agent simulations
We now apply the rules of our model to guide the dynamics of a minimalist 1-D self- propelled agents system thought of to emulate volleyball rallies. We aim to confirm, with this agent-based model, the analytical results obtained for probability distribution P n .
Moreover, we show that it is possible to combine the analytical results with the data to give an approach to capture other empirical global observables related to spatiotemporal variables. The agent model is based on the following elements, 1. Teams and players. In this parsimonious system, there are two teams with two players (agents) by team.
2. The court. We represent the game's court as a 1-D array with four sites. The 1st and the 2nd sites shape the first team side, and the 3rd and the 4th sites shape the second team side. The net is placed between sites 2 and 3, delimiting the teams' sides. Players are allowed to occupy any site of their zones. They can also overlap, but they cannot invade the rival's side.
3. The teammates' roles. To emulate the player's tactical dependency in volleyball teams, we propose that one of the two players manages the 1st and 3rd hit (defender/ attacker), and the other manages the 2nd hit (setter).
4. Initial conditions. The player's sites at the beginning of the rally are randomly set.
The player at the 1st site serves the ball.

5.
Dynamics. If the agents have to change their current place to hit the ball, then the parameters p and q give the probability of performing the hit and achieving precision, respectively (see section Model). Achieving precision in this context means, in the 1st hit, sending the ball to the partner place, and in the 3rd hit, sending the ball to the site where the rival in charge of taking the 1st hit is not occupying. The rally ends when one of the players misses a hit.
We define T as the rallies' total time and R as the length of the projection of the ball trajectory on the court's floor. These two variables can be empirically measured from the collected data by simply adding the succession of temporal and spatial intervals, ∆t and ∆r, observed between rallies' events. To obtain T and R from simulations we propose the following, 6. Global observables T and R. At every step of the dynamics, the players hitting the ball draws from the empirical joint distribution P(∆t,∆r), shown in Fig. 4 (a), a pair of values (∆t i , ∆r i ) that we will link to the ith event. Then, at the end of the rallies, we compute T = n i ∆t i and R = n i ∆r i , where n is the total number of events, to obtain the simulated values of these observables.
Notice we sample ∆t and ∆r from the joint distribution because these variables are correlated. This is evidenced in Fig. 4 (a), where we can see that the multimodal behavior of P (∆r) and P (∆t) results in a non-trivial joint distribution with several local maxima.
With the model defined, we performed 10 5 rally simulations setting the parameters p and q with the values calculated in section 5 . In Fig. 4 panels (b) to (e), we compare the outcomes with the empirical data. Additionally, in panel (b), we plot the theoretical curve given by the Eq. (13). In the following, we discuss these results. In panel (b), we show the probability distribution P (n). We can see that it agrees with the theoretical curve, n. This deviation may be related to the emergence of some type of complexity in the actual dynamics that our simple model cannot capture. For instance, sometimes during the game the play becomes unstable, and the players hit the ball several times in reduced space until one team is able to control the ball. Notice we cannot capture this kind of "burstiness effect" through our simple model. According to the stated rules, at every step, we have to randomly draw a pair (∆t, ∆r) which implies following a memoryless process.
In this frame, since we have a joint probability distribution that is bimodal in the variable ∆r, the probability of obtaining a long sequence of pairs with small values of ∆r is low.
Consequently, for a given value of n, the calculated value of R in the model could tend to be slightly higher than in the empirical case, as we observe in Fig. 4 panel (e).

Conclusions
Our investigation focused on studying the dynamics of volleyball's rallies. We proposed a framework based on games visualization, collection of relevant information, and through data-driven analysis aiming to obtain insights to define the rules of a mathematical model able to capture the underlying dynamics in volleyball games. We found that the players are more likely to fail and become imprecise if they have to move out of their actions zones to perform a hit. With this in mind, we proposed a model based on two stochastic parameters: p and q, where the first is the probability of performing the hit, and the latter is the probability of achieving precision. Then we calculated a closed-form expression for the probability that players perform n hits in the rally, P n , that agree remarkably well with the empirical observations. Therefore, we understand that we have uncovered two stochastic variables able to partially generate the level of complexity observed in the volleyball dynamics. In this regard, we consider that this work represents a new step towards a broad understanding of volleyball games as complex adaptive systems. Moreover, the collected data that we make publicly available with this work is, as far as we know, the largest open collection of volleyball-logs ever released, an invaluable resource for the research community that opens the door for further research in this area. Finally, we want to point out that our findings provide new knowledge that should be actively taken into account for sports scientists and coaches. Since it has been shown that the length of rallies may affects the team behavior through many indirect variables [23,24,25,26], our results can be handy for designing new efficient data-driven training systems aiming to enhance the performance in competitive scenarios.

Data availability statement
The data that support the findings of this study are publicly available in [27]. Plots at the right (colored in red) when they miss the point. The inset in panel (e) shows the six zones of the court used as reference. We use the information presented in these plots to highlight the lack of performance that the players exhibit when moving out of their action zones to perform a hit. Figure 2: With transition probabilities P dd , P de , P ed , and P ee , it is possible to evaluate the probability that the players perform n hits in a rally, P n , by using a decision tree. In this plot we show the decision tree used to calculate P 7 . Nodes represent the number of hits performed in complete ball possession and edges transition probabilities. Circles show the end of the paths used for the calculation of probability D 3 [see main text, Eq. (4)]. Figure 3: The probability that the players perform n hits during rallies, P n . Colored in red with circles marks, the empirical case. Colored in black with plus symbols, the analytical results. We can observe that the analytical approach captures well the non-trivial behavior of the empirical distribution. The similarity can be quantified by calculating the Jensen-Shannon divergence between the curves. From this procedure, we obtain D JS = 0.009, which indicates a high similarity. Figure 4: In these plots, we summarize the results obtained from the multi-agent simulations. Panel (a) shows the joint probability distribution that we used to draw, at every simulation step, a pair (∆t i , ∆r i ). Panel (b) exhibits the probability distribution P (n). Here we show the empirical observations (DS), the theoretical calculation (TH), and the obtained from the agent-based model (MO). Panel (c) shows the probability distribution P (T ) obtained from the dataset (DS) and from the model's outcomes. Panel (d) exhibits the probability distribution P (R) obtained from the dataset (DS) and from the model's outcomes. Panel (e) shows the relation T vs. n and R vs. n, a comparison between the empirical case (DS) and the result of the model (MO).