Learning from Survey Propagation: a Neural Network for MAX-E-$3$-SAT

Many natural optimization problems are NP-hard, which implies that they are probably hard to solve exactly in the worst-case. However, it suffices to get reasonably good solutions for all (or even most) instances in practice. This paper presents a new algorithm for computing approximate solutions in ${\Theta(N})$ for the Maximum Exact 3-Satisfiability (MAX-E-$3$-SAT) problem by using deep learning methodology. This methodology allows us to create a learning algorithm able to fix Boolean variables by using local information obtained by the Survey Propagation algorithm. By performing an accurate analysis, on random CNF instances of the MAX-E-$3$-SAT with several Boolean variables, we show that this new algorithm, avoiding any decimation strategy, can build assignments better than a random one, even if the convergence of the messages is not found. Although this algorithm is not competitive with state-of-the-art Maximum Satisfiability (MAX-SAT) solvers, it can solve substantially larger and more complicated problems than it ever saw during training.


Introduction
The Boolean Satisfiability (SAT) Problem [1,2] is the issue of finding an assignment that satisfies a given Boolean formula. A Boolean formula is any operation made with Boolean variables, where each variable can take the value T RUE or F ALSE, {1, 0} respectively. For example, a CNF (conjunctive normal form) [3] formula is a conjunction of one or more clauses, where a clause is a disjunction of literals. A CNF formula is satisfiable if and only if there is a configuration of the Boolean variables that simultaneously satisfy all the clauses.
In our work, N defines the number of Boolean variables and M the number of clauses so that the CNF formula has the following form: where l c is the size of clause, i.e. the number of literals in clause c for 1 ≤ c ≤ M, and p ci is a literal, thus a proposal variable x i or its negation x i , for 1 ≤ i ≤ N.
The maximization problem associated with SAT is called MAX-SAT. In this case, a solver tries to satisfy the maximum number of clauses given a CNF formula [4,5]. If a CNF formula has in each clause at least k literals, then the problem is called MAX-k-SAT. If there are exactly k literals for each clause in a CNF formula, then the problem is named MAX-E-k-SAT [6].
The MAX-SAT is of considerable interest not only from the theoretical side but also for applications. For instance, many mathematical logic and artificial intelligence issues can be expressed in the form of satisfiability or some of its variants, like constraint satisfaction. Examples are in probabilistic inference [7], data analysis [8], Maximum Clique and Maximum Independent Set [9,10,11], software analysis [12], reasoning over bio networks and Bayesian network structure learning [13], minimization of visibly pushdown automata [14], compressive sensing [15], community detection [16] and much more [17,18]. For instance, in physics, the MAX-SAT problem is used for providing a provable periodically constrained ground state of a complex lattice [19]. Physicists also study the global landscape structure of the MAX-SAT problem for understanding phase transitions that appear into the solution space [20], trying to connect them to computational complexity limits. An interesting example of this research was developed in [21,22]. In these papers, continuous-time deterministic systems based on ordinary differential equations were proposed as SAT and MAX-SAT solvers. These works are based on the observation that the continuous-time deterministic systems have dynamics attracted by fixed points, identifying solutions with minimum energy.
From the theoretical point of view, the MAX-SAT problem is studied for giving optimal inapproximability results. Inapproximability results help to understand the computational complexity of hard problems [6,23]. Many natural optimization problems, indeed, are NP-hard. This implies that they are probably hard to solve exactly in the worst-case. The worst-case complexity measures the maximum amount of resources that an algorithm requires, given an input of arbitrary size [24].
However, it suffices to get reasonably good solutions for all (or even most) instances in practice. Examples of these results have been studied since 1973 when Johnson [23] analyzed the worst-case behavior of simple, polynomial-time, random algorithms for finding approximate solutions to various combinatorial optimization problems. He measured the worst solution value ratio, which can be reached by an algorithm, to the optimal one. For a maximization problem, an algorithm is a ρ-approximation algorithm, ρ < 1, if it produces a solution whose objective value is at least ρ · OP T where OP T is the global optimum, for each instance. A similar definition applies to minimization problems. The approximation algorithms' important property relates the size of the solution produced directly to a lower bound on the optimal solution. Instead of telling us how well we might do, they will tell us about the worst-case, i.e., how badly we might perform [25].
Following this research topic, many computer scientists have proven rigorous results over optimal inapproximability. The first result proving hardness for the problem we are discussing here was obtained in the fundamental paper by Arora et al. [26]. He established the PCP theorem. The theorem states that every decision problem in the NP complexity class has probabilistically checkable proofs, where a verifier reads only a constant number of bits and uses logarithmic random of bits. Significant results were obtained successively by Bellare, Sudan, and others in [27,28,29,30,31], but the most famous was given by Håstad in 1997 [6]. He proved optimal inapproximability results, up to an arbitrary ǫ > 0, for MAX-E-k-SAT for k ≥ 3, by maximizing the number of satisfied linear equations in an over-determined system of linear equations modulo a prime p. More precisely, the author showed that for the MAX-E-3-SAT, no approximate algorithm could outperform the random assignment threshold, which is set to be 7/8 the optimal one, unless P = NP . He also stated that the MAX-E-4-SAT is not approximable beyond the random assignment threshold on satisfiable instances. The random assignment threshold is set to be 15/16 the optimal one, unless P = NP . These results are only valid for approximate algorithms.
In contrast to this kind of algorithms, heuristics [32] can find better approximate solutions. However, their worst-case performance can be tough to analyze and, therefore, heuristic methods may be considered approximate and not accurate algorithms. Although they have this negative reputation, heuristics are the only viable option for various optimization problems that need to be routinely solved in real-world applications.
Heuristic algorithms for solving MAX-SAT problems can be roughly classified into two main categories. The first category of algorithms searches for a solution by performing a biased random walk in the space of configurations. Instead, the second one tries to build a solution assigning variables, according to some estimated marginals. MaxWalkSAT, focused Metropolis search or diffusion Monte Carlo algorithms, genetic algorithms with local search, and many other SAT-Solvers belong to the former category [33,34,35,36,37,38,39,40,41,42,43,44,45]. In contrast, in the second category, we find algorithms deriving from the class of message passing algorithms. Examples are Warning Propagation (WP), Belief Propagation Guided Decimation (BPGD), Survey Inspired Decimation (SID), Backtracking Survey Propagation (BSP), and SPy, a generalization of Survey Propagation (SP) algorithm [46,47,48,49,50,51,52]. The last algorithms use a decimation move for building a solution to the problem. This decimation procedure modifies the underlying graph's structure and makes these algorithms hard to be analytically analyzable.
This paper introduces a new heuristic-learning algorithm that collects local information from the Survey Propagation algorithm, avoids any decimation strategy, and fixes the local variables by using deep learning methodology [53,54,55].
The deep learning methodology is part of a broader family of machine learning methods based on artificial neural networks. It has been applied in many fields, from computer science to physics and economics, with many useful applications. Deep learning uses multiple layers to extract higher-level features from the raw input. Here, the deep learning methodology extracts patterns from instances of MAX-E-3-SAT and their solutions for learning how to fix a single Boolean variable.
In combinatorial optimization, deep learning methodology has been used for solving SAT problems or problems related to graphical models. For instance, Selsam et al. [56] obtained exciting results for SAT problems. More precisely, the authors present a message passing neural network that learns to solve SAT problems after only being trained as a classifier to predict satisfiability. Instead, for problems related to graphical models, Dai et al. [57] present a learning algorithm for graph problems using a unique combination of reinforcement learning and graph embedding. Many other results have been reached during this period, and we refer to an interesting survey on machine learning for combinatorial optimization of Bengio et al. [58] and references therein. Regarding the MAX-SAT Problem, instead, an exciting work based on statistical learning theory has been presented recently in [59]. They introduce a novel setting for learning combinatorial optimization problems from contextual examples.
These works deal with small instances because the neural networks are fed with the whole instance of the problem. In this paper, in contrast, we present, for the first time, as far as we know, a linear algorithm that takes information locally on the graph and uses a deep neural network for assigning a single variable. This strategy allows us to analyze CNF formulae composed of 10 6 variables, larger and more difficult than the neural network ever saw during training.
The motivation that guides us in simplifying heuristic methods is the following: although making a heuristic more complicated does not necessarily make it better in the worst-case, maybe making it simpler does not necessarily make it worse in the worstcase. The worst-case scenario for heuristic message passing algorithms is due to the fact that a full convergence of the messages is not found. This can appear even for a single message. When such a scenario appears, we cannot say anything about the instance we are looking at. We lose all the local information that is correctly obtained by the message passing procedure.
For this reason, we deal with SP equations. Although they are not the best for the MAX-SAT problem, in contrast to the SP-y equations, they are simpler and much more suitable for meeting the worst-case scenario. They may make this heuristic-learning algorithm analytically analyzable. Indeed, decimation and backtracking moves are avoided. This paper is divided into the following sections: the first one recalls the MAX-E-3-SAT problem and its factor graph representation; the second one recalls the Survey Propagation algorithm; the third one introduces the deep neural network and also presents the numerical analysis. We conclude our manuscript with a discussion on the future research directions that this new method gives rise to.

The MAX-E-3-SAT Problem and its Factor Graph Representation
As explained in the Section 1, the MAX-SAT is the maximization problem associated with the SAT problem. It is asked to find an assignment of the Boolean variables such that the maximum number of clauses, in a CNF formula, is satisfied. The maximization version of the k-SAT problem is called MAX-E-k-SAT, in this case, each clause contains exactly k literals. With k = 3 a 3-SAT problem with N = 9 variables and M = 4 clauses is of the form: Clearly an assignment that satisfies all the clauses is Sol 3−SAT : (x 1 = 1, x 2 = 0, x 3 = 0, x 4 = 1, x 5 = 0, x 6 = 1, x 7 = 0, x 8 = 1, x 9 = 0). The MAX-E-3-SAT looks for an approximate solution of the problem in (2). For example, a solution of the MAX-E-3-SAT, outputted by a random algorithm, could be Sol M AX−E−3−SAT : ( satisfy all the clauses in (2), but it is an approximate solution where just a clause is unsatisfied, i.e. (x 1 ∨ x 2 ∨ x 3 ) = F ALSE. For a general instance of the MAX-E-3-SAT problem that contains N variables and M clauses, it is easy to see that a random assignment satisfies each clause with probability 7/8. Hence, if there are M clauses, it is not hard to find an assignment that satisfies 7M/8 clauses. Since we can never satisfy more than all the clauses this gives a 7/8-approximation algorithm [23,6]. In this paper, as stated in the Section 1, we are interested in heuristic methods that use message passing procedure. For this reason, we recall the factor graph representation of satisfiability problems. The SAT problem, and thus the MAX-SAT, is represented as a factor graph where clauses are identified as functional nodes and variables as variable nodes. A factor graph is a bipartite graph representing the factorization of a function.
In Figure 1, it is shown a cartoon of the factor graph associated with a 3-SAT (MAX-E-3-SAT) problem. The variables, i, j, l and k, are variable nodes (circles) while clauses, a, b and c, are functional nodes (squares). Each variable enters a clause as a literal, e.g., p ci , if and only if it is connected with an edge. Dashed edges identify literals where variables are negated, while full edges identify literals where variables are not negated.
With the symbol ∂ a , we define the set of variables nodes that are connected with the functional node a, i.e., the literals of clause a. In contrast, with the symbol ∂ i , we define the set of functional nodes connected with the variable node i, i.e., the set of clauses where the literal indexed i appears. The cardinality of the set ∂ i is the degree of a variable node i, i.e., the number of links connected to a circle, and is defined with n i . The set ∂ i is also composed of two other sets, namely ∂ + i that contains the functional nodes where the variable node i appears not negated, and ∂ − i that contains the functional nodes where the variable node i appears negated. Obviously, the relation We also defined two more quantities, n ± i , for the number of dashed and full edges of a variable node i. More precisely, n + i defines the cardinality of the set we define the set of functional nodes containing the variable node i, excluding the functional node a itself, satisfied (respectively not satisfied) when the variable x i is assigned to satisfy clause a. In other words, if the variable x i is not negated in the clause a, then the ∂ + ia is the set of functional nodes containing the variable node i, excluding the functional node a itself, where the variable node i is connected with a full edge, thus where the variable x i appears not negated, while ∂ − ia is the set of functional nodes containing the variable node i, where the variable node is connected with dashed edges, thus where the variable x i appears negated. In contrast, if the variable x i is negated in the clause a, then the ∂ + ia is the set of functional nodes containing the variable node i, excluding the functional node a itself, where the variable node i is connected with a dashed edge, thus where the variable x i appears negated, while ∂ − ia is the set of functional nodes containing the variable node i, where the variable node is connected with full edges, thus where the variable x i appears not negated.

The Survey Propagation Algorithm
The Survey Propagation algorithm (SP) is a heuristic message passing algorithm. A detailed description of the Survey Propagation algorithm can be found in [47,48,49,60], here we recall it naively. Mezard, Parisi, and Zecchina developed it in [47] from the assumption of one-step replica symmetry breaking and the cavity method of spin glasses. SP has been applied to different combinatorial optimization problems, like random K-SAT, MAX-E-k-SAT, q-coloring, Maximum Independent Set, etc. [47,48,49,50,51,61,62], always showing the best performance for solving these problems. It works on a factor graph underlying the CNF formula. For N → ∞, SP is conjectured to work better and better because it runs over locally-tree like factor graphs, and cycles into the graph are at least O(log N).
Broadly speaking, SP exchanges messages between variables and clauses for guessing the value that each variable needs to be set. More precisely, a message of SP, called a survey, passed from one function node a to a variable node i (connected by an edge) is a real number η a→i ∈ [0, 1]. Under the assumption that SP runs over a treelike factor graph, the messages have a full probabilistic interpretation. In particular, the message η a→i corresponds to the probability that the clause a sends a warning to variable i, telling which value the variable i should adopt to satisfy itself [63,48].
The updating rules of a single message η a→i are presented in Algorithm 1.
Algorithm 1: Subroutine SP-UPDATE( η a→i ) Input: set of all messages arriving onto each variable node j ∈ ∂ a \ i Output: new value for the message η a→i . for j ∈ ∂ a \ i do ja is empty, the corresponding product takes value 1 by definition; end SP is a local algorithm that extracts information on the underlying graph of a CNF formula. As Input, it takes a CNF formula of a Boolean Satisfiability Problem, and it performs a message passing procedure to obtain convergence of the messages. More precisely, we are given a random initialization of all messages, and at each iteration, each message is updated following the SP-UPDATE rule described in Algorithm 1. SP runs until all messages would satisfy a convergence criterion. This convergence criterion is defined as a small number ǫ such that the iteration is halted at the first time t * when no message has changed by more than ǫ over the last iteration. If this convergence criterion is not satisfied after t max iterations, SP stops and returns a failure output. Once a convergence of all messages η a→i is found, SP's goal is to minimize the number of violated clauses. For doing that, a new strategy for fixing the value of the variables must be introduced. This strategy is called decimation and transforms the SP into the Survey Inspired Decimation Algorithm (SID). For using decimation, however, one needs to compute the SP marginals for each variable i: where: The SP marginal S + i (S − i ) tells the probability that the variable i must be forced to take the value x i = 1(x i = 0), conditional on the fact that it does not receive a contradictory message, while S 0 i provides the information that the variable i is not forced to take a particular value.
Once all the SP marginals have been computed, the decimation strategy can be applied. Decimating a variable node i means fixing the variable to T RUE or F ALSE depending on the SP marginals, removing all satisfied functional nodes and the variable node i from the factor graph, and removing all the literals into the clauses that have not been satisfied by the fixing. However, how to choose the variable node i to decimate? The answer is simple, just selecting a variable with the maximum bias Decimated the variable node i, the SID iteratively runs the SP algorithm and uses decimation again. The decimation procedure continues till one of these three different outcomes appears: (i) a contradiction is found, then SID returns exit failure; (ii) SP does not find a convergence, then SID returns exit failure; (iii) all the messages converge to a trivial fixed point, i.e., all the messages are equal to 0, in this case, SID calls WalkSAT, which solves the residual formula and builds the complete solution of the problem.
The SID has extremely low complexity. Each SP iteration requires O(N) operations, which yields O(Nt max ), where t max is the maximum time allowed for finding a convergence, i.e. a big constant. In the implementation described above, the SID has a computational complexity of O(t max N 2 log N), where the N log N comes from the sorting of the biases. This can be reduced to O(Nt max (log N) 2 ) by noticing that fixing a single variable does not affect the SP messages significantly. Consequently, SP can be called every Nδ decimation step by fixing a fraction of variables at each decimation step. The efficiency of SID can be improved by introducing a backtracking strategy or a reinforcement strategy. We refer to [64,65,49,66], and references therein for a complete explanation of these strategies.

The Neural Network and a new heuristic-learining algorithm
This manuscript aims to present a new heuristic-learning algorithm that can find an assignment for a set of Boolean variables that maximizes the number of satisfied clauses of a given CNF formula. Although this new heuristic-learning algorithm does not reach state-of-the-art algorithms for the MAX-SAT problem, it can solve problems that are substantially larger and more difficult than it ever saw during training. The code was developed in C++ using mlpack, a fast and flexible C++ machine learning library [67]. The experiments were performed on a cluster with 128 cores and 512 GB of RAM. The code, the training data, and the test data can be downloaded from [68].

Empirical analysis of SP equations
Before presenting the whole algorithm, we start to analyze the SP algorithm. It is known that the SP algorithm collects information locally and by using equations in (3)  possible clauses [33,48].
We choose random 3-SAT instances for two reasons. The first one is just for the sake of simplicity. We use the same instances to analyze the new algorithm's performance in approximating the solutions of the MAX-E-3-SAT problem associated with them. The second one, instead, is given by the fact that many theoretical results are well known. For example, it is known that the SAT-UNSAT threshold for the random 3-SAT is at α s = 4.267 [69,70,71]. The SAT-UNSAT threshold defines two regions sharply: for N → ∞, the region before the threshold contains all the instances of random k-SAT problems that have at least an assignment that satisfies all the clauses (SAT region), while beyond the threshold, no assignment that satisfies all the clauses exists (UNSAT region). It is also known that when α = α U ≈ 4.36, the SP equations do not have a unique solution. This fact is not of direct importance for the random 3-SAT problem because we are beyond the SAT-UNSAT threshold. No exact solution exists, i.e., not all the clauses of an instance can be satisfied simultaneously. However, for the MAX-E-3-SAT problem, this point is interesting. Indeed, from there, we expect that SP equations will not converge, and therefore the worst-case scenario for the SP algorithm appears.
We start analyzing the empirical convergence of the SP algorithm (initialized with uniformly random messages) as a function of the clause density α, for different values of N. We fix, as described in the previous section, the value of t max to 1024 and ǫ to 10 −2 .
In Figure 2, we plot, for random 3-SAT, the fraction of instances that did not converge, ν, (left panel), and the number of iterations t that SP needs to make for reaching a convergence of all messages (right panel), as a function of the clause density α. The analysis shows that the SP algorithm returns a failure output for random 3-SAT at α conv 3−SAT = 4.355 because a full convergence is not found, in agreement with the results obtained by Mezard and Montanari in [60]. In both cases, we observe a step function form of the fraction of instances that did not converge for N → ∞. For N → ∞, therefore, SP always converges before the specific value of the clause density α conv 3−SAT , because the solutions of the SP equations are unique, while does not converge beyond the α conv 3−SAT , because the SP equations have many solutions [72]. This property shows that no algorithm, based on SP equations presented in Algorithm 1, can build any solution beyond α U = α conv 3−SAT . Analyzing the average fraction of messages that do not converge, η, as a function of α, however, it seems that beyond the convergence threshold α conv 3−SAT , a fraction of messages always converges. More precisely, a fraction of converging messages, which is almost ∼ 20% of the messages in each random 3-SAT instance (see Figure 3, left panel), exists. This fact inspires us to look at the average error of convergence. We define the average error of convergence as the quantity ǫ such that: where k = 3, because we have 3 messages for each clause, n conv s is the number of messages that converge in the s-th instance over the M analyzed, and In the case where n conv s = kM, we define that In the right panel of Figure 3, we plot the quantity ǫ as a function of the clause density α. For N → ∞, the probability of finding a full convergence in the region α ∈ [4.200, 4.355) is equal to one because SP equations run over a factor graph that is locally tree-like. When N is finite and small, for instance, N = 10 4 , the property of having a tree-like structure is not always preserved, and, therefore, we can meet before the threshold α conv 3−SAT , instances of the random 3-SAT problem that are not able to find a full convergence of the messages. This property allows us to understand the worst-case scenario of the SP algorithm in this region. Therefore, we analyzed a set of instances, with cardinality 2 10 3 , with N = 10 4 , and we looked at the instances where the average error of convergence was not trivial, i.e. different from 0. The plot shows that in the region α ∈ Error bars are standard deviations. Right: The figure shows the normalized histogram of the conditional probability returned by the neural network y( x(i), θ * ) that a variable i must be set to T RU E or not. In green, we plot the correct assignments, i.e. the value of y( x(i), θ * ) which allows us to correctly fix the variable i to T RU E or F ALSE, while in red we plot the value of y( x(i), θ * ) which makes us guessing a wrong assignment. Surprisingly, only the ∼ 20% of variables is fixed in the wrong way.
while beyond the threshold α conv 3−SAT the quantity ǫ grows linearly with α. This fact suggests that some information can also be extracted by those messages, although it is not completely correct. However, if we use the local information obtained by those messages, could we find an assignment of the Boolean variables better than a random one? To answer this question, we create a simple neural network described in the next subsection.

The Neural Network and Numerical Analysis
Algorithms based on the theory of deep learning have become essential in a wide variety of scientific disciplines. Deep learning is a class of machine learning algorithms that uses multiple layers to extract higher-level features from the raw input. In this manuscript, the deep learning methodology extracts patterns for fixing a single Boolean variable by building a function learned by solutions coming from Survey Inspired Decimation Algorithm. Deep learning theory is based on Artificial Neural Networks (ANN), a series of functional transformations. These functional transformations can be obtained by fixing a set of basis functions in advance and allowing them to be adaptive during training.
In our case, the ANN is a feed-forward neural network that is trained as a classifier to predict the conditional probability that a variable i must be set to T RUE or not, given a piece of local information expressed into a vector of input data x(i). This vector i.e., t * < t max . In the region α ∈ [4.355, 4.620], the average was performed only on the instances of random 3-SAT problem that did not converge, i.e., t * = t max . Right: The plot displays the quantity 1 − ρ as a function of the clause density α for instances of the random MAX-E-3-SAT problem that did not converge. We analyzed 2 10 3 instances for N = 10 4 (red points), and we averaged only on the instances that after t * = t max did not find a convergence for all the messages. The vertical coral line identifies the SAT-UNSAT threshold at α s = 4.267. The vertical magenta line identifies the SP unique solution threshold at α u = 4.355. Error bars are standard deviations.
x(i) has 4 dimensions and it is composed by i , under the assumption that the factor graph is locally tree-like, may be interpreted as the probability that the variable i does not receive warnings from the set of clauses where it appears negated (n − ) or not negated (n + ). The components of the vector x(i) could be interpreted as the features used for feeding a deep neural network in the general framework of machine learning.
The deep neural network N N ( x(i), θ) has five layers. The input and output layers have 4 and 1 neuron respectively, where sigmoidal activation function acts element-wise on each neuron, i.e. σ(a) = (1 + exp(−a)) −1 . The hidden layers are sigmoidal layers composed by 40 neurons each. The total set of parameters to be optimized is defined with θ. We define as loss function the following cross-entropy error function: where T is the target variable, and N is the total batch size. In our case, the target variables T i are the variables into a satisfiable assignment of a random 3-SAT problem. We chose the cross-entropy error function as a loss function because we are dealing with a classification problem. Classification is the problem of identifying to which of a set of categories a new observation belongs, based on a training set of data containing observations whose category membership is known. In our case, the training set is composed by vectors x(i) and targets that are Boolean variables. We can assume that each of these targets has a Bernoulli distribution. Considering them composed by independent observations, the loss function that arises naturally by taking the negative log-likelihood is the cross-entropy error function. The output y( x(i), θ) can be interpreted as the conditional probability p(x i = 1| x(i)), with p(x i = 0| x(i)) given by 1 − y( x(i), θ), where θ is the set of parameters that has to be optimized.
For optimizing these parameters, we need to train our neural network. For doing that, we solved 400 random 3-SAT problems at α = 4.2 and N = 10 4 using SID (see Section 3). For each of this 400 instances we stored one solution and the x(i) for each variable i. In other words, for each instance random 3-SAT we have 10 4 vectors x(i), and at each of these vectors is associated the target variable T i , which is the Boolean variable associated with the satisfiable assignment. The training of the neural network was performed by giving a batch of 20 random vectors x(i) and the respective target variables T i , without replacement, to the neural network. The optimal assignment of θ, i.e., θ * , to which the right-hand side of equation (7) vanishes, can be found by running an SGD algorithm.
For our simulations, we used the default SGD (Adam [73]) given by mlpack [67]. For testing the performance of the deep neural network, we calculated the accuracy in computing the conditional probability p(x i = 1| x(i)) on a validation data set. The validation data set was obtained by solving 36 random 3-SAT problems at α = 4.23 and N = 10 4 using SID. We define the accuracy of the neural network as: where, M is the total number of test solutions, i.e., the 36 solutions of random 3-SAT problems at α = 4.23 and N = 10 4 ; N H(a, b) is the normalized Hamming distance between two strings a and b with the same length; S s is the exact solution obtained by SID; Y s is the approximate solution obtained from the deep neural network. As the reader can see, we tested the deep neural network on solutions with a different value of α to which the neural network was trained. It was possible because we assumed that the local information obtained by SP equations should be independent of the clause density α.
In the left panel of Figure 4, we present the accuracy (AC) of the neural network on the validation data set of 36 10 4 elements as a function of the number of training steps, i.e., the number of times that we called the SGD for optimizing the set of parameters θ. The neural network starts by giving a random assignment to the Boolean variables, AC ∼ 50%, and after 100 training steps, it learns how to assign the Boolean variables. We also tested the accuracy of the neural network on a validation data set of 17 10 4 elements coming from 17 solutions of 3-SAT at α = 4.24 and N = 10 4 , obtaining the same accuracy. However, as shown on the right panel of Figure 4, the approximation of the conditional probability y( x(i), θ * ) completely fails for ∼ 20% of the variables. This failure is not bad for our purpose. Indeed, we are not interested in building a solution that satisfies all the clauses. Still, we are interested in finding an approximation of a solution that minimizes the number of unsatisfied clauses into an instance of MAX-E-3-SAT.
The whole algorithm, which we name DeepSP, is presented in Algorithm 2. For speeding up the algorithm, we introduced the convergence criterion explained in Section 3. The computational complexity of DeepSP algorithm is, therefore, Θ(N). Indeed, the maximum number of operation that it takes for outputting a result, after the training procedure for optimizing the parameters θ of the deep neural network, is ∼ t max kM + O(| θ| 2 )N, where | θ| << N and M = αN. Moreover, once the deep neural network's training procedure is performed, one can save the parameters' value and upload them instead of re-training the neural network each time. We also release the parameters' value, which can be downloaded from [68].
For performing an analysis on the performance of this heuristic-learning algorithm, we need, therefore, to check the ratio of the number of satisfied clauses to the total number of clauses, i.e., ρ. This result is described in Figure 5. Both plots describe the quantity of 1 − ρ as a function of α. In each plot, having 1 − ρ on the y-axis, the top end of the plot coincides with the estimate of the random assignment threshold, i.e., 1−ρ rand = 1−7/8 = 0.125, to provide an immediate indication of the performance of the DeepSP. In the left panel of the Figure 5, in the region α ∈ [4.200, 4.355) the average of 1 − ρ is performed only on the instances of random 3-SAT problem that converged, i.e. t * < t max . In other words, we are looking at the average-case performance of the heuristic-learning algorithm.
The behavior of 1 − ρ is constant, showing, therefore, that the DeepSP algorithm can find approximate solutions such that only 1.46% of clauses are unsatisfied by the assignment found, more precisely (1 − ρ) = 0.0146 ± 0.0002. In the region α ∈ [4.355, 4.620] the average was performed only on the instances of random 3-SAT problem that did not converge, i.e., t * = t max . This is obvious because no convergence is possible beyond α conv 3−SAT . The behavior of 1 − ρ, in this case, is not constant anymore, but it is linear with the clause density α.
In the right panel, we plot, instead, the worst-case scenario for the SP equations, i.e., only the instances that did not find a convergence of all messages, i.e., the time t * = t max and ǫ = 0. For showing the worst-case scenario, we run 2 10 3 the heuristiclearning algorithm on instances random 3-SAT with N = 10 4 , and we analyzed only those where a full convergence of the messages was absent. We observe that the local nature of the SP algorithm helps us to find an approximate solution much better than the one outputted by the Johnson algorithm [23]. In the region α ∈ [4.200, 4.355) the average error of the convergence, i.e ǫ , in the right panel of Figure 3 is bounded and the DeepSP seems to follow the same behavior. This behavior shows that the solutions we found using DeepSP are not affected by the loss of convergence. In contrast, in the region α ∈ [4.360, 4.620] the algorithm, following the behavior of the average error of convergence ǫ defined in (5), is affected to the linear growth, and, therefore, the behavior of the quantity 1 − ρ grows linearly with α. To be more qualitatively, we computed the sample Pearson correlation between two sets of variables: where n is the sample size, x i , y i are the individual sample points indexed with i, and µ x = 1 n n i=1 x i the sample mean (and analogously for µ y ), between the set of data 1 − ρ and ǫ . The sample Pearson correlation is equal to r corr = 0.9959, confirming that the two quantities are dependent on each other.
As stated in Section 1, DeepSP is not competitive with state-of-the-art of MAX-SAT solvers. For showing this, we compared our results with the results obtained by two different established methods: MaxWalkSat [33], and SP-y [50]. MaxWalkSat searches for a solution by performing a biased random walk in the solution space. At the same time, SP-y is a message passing algorithm that tries to build a solution assigning variables according to some estimated marginals.
We start with comparing DeepSP and MaxWalkSat, by analyzing the performance of the two algorithms for instances of MAX-E-3-SAT composed by N = 10 6 variables. In Table 1, we show the values of 1 − ρ, the fraction of unsatisfied clauses, at three different values of α, i.e. 4.2, 4.3, 4.5. These three points are in three distinct regions of the solution space. The first one, i.e., α = 4.2, is in the region where solutions always exist. The second one, i.e., α = 4.3, is in the region where no solution exists, but SP equations always converge. Instead, the third one is in the region where no solution exists, and SP equations do not converge.
The fraction of unsatisfied clauses obtained by MaxWalkSat is strongly dependent on the cutoff parameter chosen. The cutoff parameter in MaxWalkSat identifies the Table 1. MaxWalkSat results at N = 10 6 for different values of α and different cutoffs.
We stop increasing the cutoff as soon as we find that MaxWalkSat results are better than the one found by DeepSP. number of flips performed by the algorithm for leading to the greatest decrease in the total number of unsatisfied clauses. We chose the values 10 4 , 10 5 , 10 6 , 10 7 , 10 8 for our comparison. When the cutoff is smaller than or equal to 10 7 , the performance of DeepSP is better than the performance of MaxWalkSat in satisfying the maximum number of clauses. However, from a cutoff of 10 8 , or bigger, the MaxWalkSat performance is the best. When the cutoff increases, the time for searching for an optimal solution also increases. This situation makes the MaxWalkSat really hard to analyze analytically. In other words, we do not know when N → ∞ how big the cutoff should be. It is just a parameter that is chosen a priori. In contrast, DeepSP overcomes this issue. No parameter increases its efficiency in finding an optimal approximate solution.
SP-y is a message passing algorithm ideated for minimizing the number of violated clauses in the MAX-E-k-SAT. It takes as INPUT a Boolean formula F in conjunctive normal form and outputs a simplified Boolean formula F ′ in conjunctive normal form and a partial truth-value assignment for the variables. If F ′ = ∅ is given to a heuristic MAX-SAT Solver (as MaxWalkSat, Simulated Annealing) for building the complete assignment of F . It uses decimation and backtracking strategies, which means that iteratively fixes and un-fixes variables for building up a partial (or complete) solution of F according to some estimated marginals.
Its performance is extraordinary. For α = 4.24 and N = 10 5 , SP-y can find a completely satisfiable assignment of the formula F , or, in its worst performance, a value of (1 − ρ) ≈ 10 −5 . Above the SAT − UNSAT threshold, for example, when α = 4.29, the best performance of the algorithm, on a single sample of a MAX-E-3-SAT instance, reached a value of (1 − ρ) ≈ 2 10 −4 . These performances can be obtained only when backtracking and decimation moves are performed, implying a very long run-time. Indeed, the SP-y algorithm has a computational complexity of order O(N 2 ), making the established method unfeasible for huge values of N.
Without any decimation or backtracking moves, as in DeepSP, the algorithm has the same performances of the heuristic MAX-SAT Solver used (indeed F = F ′ ), and the performance of the algorithm is strongly dependent on the parameters of the heuristic.
For concluding the analysis of the algorithm, we looked at the performance of the algorithm beyond α = 4.620. In the region α ∈ (4.620, +∞), we meet a point at α = 4.67, for N → ∞, where the SP equations return at least one message η a→i = 1, also if n i > 1. When such an issue happens, numerical instability into the SP equations appears. For avoiding this issue, therefore, we introduce a simple strategy, i.e. unit propagation strategy, on the variable i where at least one message η a→i = 1 appears in the set of functional nodes associated with it, and we fix such a variable by using the rule: i is T RUE if n + i > n − i , F ALSE otherwise. In Figure 6 we present the fraction of variables where unit propagation strategy was used for fixing the value of the Boolean variable i in a sample of MAX-E-3-SAT instances, i.e., ω, as a function of α. At α = 4.67, for N = 10 6 , we pass from a region where DeepSP uses the neural network to a region where the unit propagation rule fixes all the variables. Beyond the threshold at α = 4.67 we can claim, also if numerically we meet the random assignment threshold ρ rand = 7/8 at α = 10 for each value of N analyzed, that the worst-case performance of the DeepSP is equal to the random assignment outputted by the Johnson algorithm [23].

Conclusion
This paper has presented a new heuristic-learning algorithm, namely DeepSP algorithm, that finds approximate solutions for the MAX-E-3-SAT problem. This algorithm runs SP equations on the random factor graph associated with the MAX-E-3-SAT problem. It gives the local information computed to a neural network N N ( x(i), θ * ). The set of parameters θ are optimized following a supervised learning approach ( by using target values obtained by SID on a sample of random 3-SAT problems) and outputs an assignment. We have displayed an accurate analysis to explain the algorithm's average and worst-case behavior as a function of the clause density α. We have started with presenting the limits of the SP equations and the neural network's performance in learning and inferring the conditional probability that a variable i should take to T RUE or F ALSE. Then, we have shown that this algorithm can find approximate solutions that outperform the random assignment threshold value in the region where the SP equations do not present any numerical instability, and we have identified its algorithmic threshold, which is the ultimate limit of the algorithm, at α a = 4.67. Moreover, we have observed that the algorithm's output is strongly related to the average error of convergence ǫ that the SP equations commit if they do not find a unique set of fixed points. Although this algorithm is not competitive with state-of-the-art MAX-SAT solvers, it can solve substantially larger and more difficult problems than it ever saw during training.
As future research directions, we propose to analyze the performance of the algorithm on MAX-E-4-SAT problem and verify if a Belief Propagation algorithm version could perform as well as our DeepSP on that particular problem. We also suggest using SP-y equations, instead of SP equations, for improving the performance of this new heuristic-learning algorithm. Moreover, we suggest analyzing with a probabilistic approach the properties of the maximum error of convergence ǫ max that the SP equations can perform on random instances of the MAX-E-k-SAT problem. This maximum error should be related to the algorithmic performance, as numerically shown in our analysis.