Last week I was invited in Cambridge, UK, to participate to a panel at the FMCAD conference (Formal Methods in Computer-Aided Design). The subject: “Model Checking in the Cloud”. With another four people, we discussed the questions laid out by the panel moderator:
- How can model checking leverage the advantages of distributed and multi-core systems in the cloud?
- What are possible solutions beyond an “embarrassingly parallel” approach of running a single property per core?
- Is there a specific subset of properties that might be more suitable to this form of analysis?
- What issues need to be addressed for design houses to adopt this technology and will the current license model of EDA tools change to adapt to the new requirements?
These were my slides.
We had a lively discussion about cloud and verification. Of course I am a strong promoter of cloud-based solution for compute-intensive tasks like functional verification. In that very case though, the conversation focused on model checking –not logic simulation or equivalence checking.
My take is that there is no “embarrassingly parallel” approach to model checking. The naïve approach that consists of running a single property per core is dubious, because (1) you cannot do load balancing with this strategy –each property takes a very different time to be resolved, and (2) a single property can easily require years for a single core to tackle.
This means back to the white board to rethink the problem: how can we leverage a distributed computing system readily available in the cloud –a network of machines, each with its own cores, RAM, and disk— for model checking?
Model checking is about verifying whether a sequential system satisfies some timing properties. These properties are usually expressed in some formal language like LTL (Linear Temporal Logic) or CTL (Computation Tree Logic). For instance we can check that “if some event occurs, then some property is later always true”; or “some property eventually comes true”; or “some property holds infinitely often in the future”.
One major component used in model checking is state space exploration: visit the states that can be reached from some initial states with a finite state machine. Another major, practical component, is bounded model checking: instead of looking at the behavior of a system infinitely in the future, only consider a finite window in the future. This motivated my looking at three pillars of model checking: explicit state exploration, implicit state exploration, and bounded model checking.
Explicit State Exploration
This simply consists of explicitly generating every successors state of a given state, then the successors of the successors, etc. The key point is recognizing which states have been already visited. This problem is essentially memory limited. So in the context of cloud computing, this boils down to:
- How to efficiently partition the state space so that each node in the network can explore its state space?
- How to implement a very large distributed hash table for the visited states? We are talking Terabytes or Petabytes of data.
My conclusion: the technology for very large distributed hash table already exists. It would be interesting to see how such a large distributed system could help tackling industrial-scale model checking problems like those encountered in chip design.
Implicit State Exploration
State exploration can also be done implicitly by using a data structure such as BDDs (Binary Decision Diagrams). In essence, we use a data structure to represent the characteristic function of a set of states, as opposed to an explicit collection of states. This leads to very different algorithms that can dramatically outperform explicit state exploration.
My conclusion: the problem is still memory limited, but some attendees made a good case that it is rather time limited. The problem of state space partitioning remains, and there are arguably more sophisticated solutions here. One key is to distribute the BDDs on multiple nodes of a network. This is definitely a challenge, a problem worthy of more research.
Bounded Model Checking
This consists of unrolling the system under verification k times, k being the width of the temporal widow we consider, and build a propositional formula expressing the property on that window. We then check whether that formula is satisfiable or not (a SAT problem).
There are distributed SAT solvers, but scaling them up is non-trivial, as the nodes of the network need to share information. This is still a promising avenue of research.
“Don’t fight the exponential”
During the live exchanges between the panel and the audience, one notable objection was that model checking deals with vast state space (read: growing exponentially with the parameters of the system), and that there was no point in “fighting the exponential”. There was no point in more computing power. Instead, as was argued by some attendees, we should be “smarter” about verification: better models, better abstractions, better heuristics.
Of course we should look for “smarter” verification, with automated abstraction and more efficient algorithms. But we should not underestimate what we can achieve with raw computing power. What made Deep Blue the world chess champion, beating the then reigning human champion Kasparov back in 1997, was its very large compute power, smartly pieced together. What make Google Translate useable for basic translation are not decades of research in natural languages and semantics, but gigantic Markov chains built from Petabytes of data gathered by web crawlers. This is the same principle that made IBM’s Watson the winner of the quiz show Jeopardy! against the two best human players in 2011. These systems are not inherently smart. They rely on huge computing power resulting from intelligently assembling large amount of hardware, which processes very large amount of data.
To those arguing that more computing power is not a big help for problems that are exponential, I agree. However if the problem has a worst-case exponential complexity but exhibits a polynomial behavior in many practical cases, then throwing more computing power does help. In particular, BDD and SAT solvers, and therefore model checking, can benefit from large compute networks.
The cloud makes possible building compute grids made of hundreds or thousands of nodes. This is now up to the research community to leverage these previously unseen, cheap computing capacities to address ever more complex model checking problems. This however requires rethinking classical algorithms and recast them on very large distributed systems. And this will require being smart.