A program is deterministic, or repeatable, if it produces the very same output when given the same input no matter how many times it is run.
Refining this definition, we should consider whether a program produces the same result on any platform (32 and 64 bits machines, running Windows, Mac OS, Linux, Solaris, etc). Or whether the program is insensitive to the form of its inputs. For example, the problem of generating the shortest route to visit all the capitals of Europe should not depend on how the map of Europe is entered, nor it should depend on which language is used to name the capitals.
Determinism is obviously very desirable. For the user, a non-deterministic program can be confusing and frustrating. For the developer, a non-deterministic program is extremely hard to test and debug, since bugs and specific configuration cannot be easily reproduced.
Repeatability looks like a given for most applications. For instance, if we add two numbers in a spreadsheet, we expect the same result no matter how many times we perform this operation and regardless of the platform we run on (PC, Mac, etc). Or if we run a spell checker several times, we expect it to flag the very same errors.
But it is not that obvious for more complex applications. This is especially true when there are multiple solutions to a problem, or when heuristics are used to produce a result –because an exact solution is too computationally expensive. For example, it is not uncommon to see slightly different outcomes when running the same EDA synthesis or P&R tool on the same input several times.
Even more, a user would like to see the same result when only minor changes are applied to the input. For instances, running a P&R tool on two netlists that differ only by the names of their cells should produce exactly the same result –a P&R tools should produce a result that only depend on the netlist structure. But experience shows that industrial synthesis and P&R tool does not meet that requirement. Closest to software, it is not uncommon to generate slightly different object codes with gcc by changing the names of a few variables.
Among the causes of non-deterministic response, we can distinguish the following types:
- A random number generator
- Reading an uninitialized data
- A race condition on concurrent threads
- An unordered iteration that is assumed ordered
- A total order that depends on memory address
- A total order that depends on time stamps
- A total order that depends on a non-canonical labeling
There are a lot of applications that use stochastic processes (e.g., simulated annealing, genetic algorithms, Monte-Carlo simulations), but that we would like to be repeatable. Using a pseudorandom number generator with a known seed makes possible to reproduce the same long sequence of seemingly random numbers over and over again.
Note that some applications (e.g., gaming, cryptography, statistical sampling) require a non-deterministic behavior. In that case the seed of the random number generator must be an always-changing value, for example the host’s current time.
There are more deliberate efforts to produce true random values by relying on natural, chaotic events. For instance Lavarand produces random numbers by hashing the frames of a video stream of lava lamps. HotBits generates random bits by timing successive pairs of radioactive decays detected by a Geiger-Müller tube interfaced to a computer. Random.org uses variations in the amplitude of atmospheric noise recorded with a normal radio.
Initialized data may not exist in languages that have systematic default values and no memory management control, as opposed to high performance languages like C/C++.
Finding and fixing this kind of issues is relatively simple. For instance, tools like Purify and Valgrind can report when a C/C++ code reads arbitrary values in memory. To use Purify’s terminology, such errors are UMR (Uninitialized Memory Read), ABR (Array Bound Read, i.e., dereferencing an array outside of its bounds), and FMR (Free Memory Read). These defects all consist in reading some random value in memory. The code below illustrates some of these errors.
Note that an ABW (Array Bound Write, i.e., writing outside of an array’s bounds), FMW (Free Memory Write), FNH (Freeing Non-Heap memory) and FUM (Freeing Unallocated Memory), although severe bugs also reported by dynamic analysis tools, are not an original source of non-determinism: they consistently reproduce the same bug.
Thread races are difficult to detect, and fixing them can be very costly. A typical example is when one thread writes a value at some address, and another thread reads the value at that address. Depending on which thread access the address first, the outcome of the program will be different. Two threads performing a non-atomic write at the same address simultaneously results in some unpredictable value.
One can use a mutex to prevent conflicting read/write for non-atomic operations. But racing threads (e.g., who reads/writes first) must be resolved with synchronization, which can be quite complicated. Moreover it can hurt performances.
Iterating data with some random order can make a program non repeatable. This pattern is often encountered, and is easy to fix.
For example an algorithm produces a result via a visitor that assumes a total order. The developer uses an incorrect visitor, which enumerates data in a random order, usually depending on the memory allocation of the data container. E.g., instead of using a std::set as a container, the developer uses a std::hash_set (or a tr1::unordered_set instead of a tr1::ordered_set). Forcing a total order on the data fixes the problem.
Note that the fix may be incomplete if it simply transforms a type (4) non-determinism into a type (5) or (6) non-determinism, which we discuss below.
This type of non-determinism is extremely common. For instance, a developer uses a tr1::ordered_set as a container of pointers, and feels that the visitor is deterministic. It is indeed deterministic, but only w.r.t. the memory addresses allocated to the data, which depend on factors out of the application’s control.
One way of addressing the problem is to force a specific memory addressing scheme, but that requires a very fine control of the memory allocator and is therefore complicated. A more common way consists in assigning a unique ID to an object at the time of its creation. The ID can be a 32 bit unsigned integer that is incremented for every new object. A total ordering, independent from the objects’ actual memory addresses, is then obtained from the IDs. It is a simple solution, as long as one can afford the extra 4 bytes for every object. ID-based sorting with no memory penalty can be obtained using custom memory allocators.
Note though that the total ID-based ordering described above is exactly the order of creation of the objects. It is no different from an order that depends on time stamps. Therefore two equal sets of inputs that only differ in the order will be visited in a different order, which can lead to different results. This leads us to the type (7) of non-determinism.
Type (7) non-determinism often goes unrecognized, or is simply ignored. The idea is that as long as the same input is given to a program (but possibly in a different order or form), the output should be the same. If the input can somehow be normalized to a form that captures the notion of “same input”, then the program can be made insensitive to the format of the input. That is of course assuming that the normalization process run time penalty is not too high.
This normalization process is better defined as canonization. Formally, let O be a set of objects, and let EQ be an equivalence relation that captures the notion of “same” on these objects. A function Canon maps an object onto its canonical form, and is such that for any two objects o1 and o2, o1 and o2 are the same (i.e., o1 EQ o2) if and only if Canon(o1) = Canon(o2).
For instance, a set of integers can be represented by a number of containers (a list, an array, a hash set, a binary tree, etc). A canonical form can simply consist in sorting the integers. Two sets that are equal because they contain the very same integers, but that are initially given in different orders and forms, will end up in the same canonical form. Since sorting is an O(n log n) algorithm, this is an efficient canonization.
Canonization can be much more costly. A Boolean function can be represented in many ways, e.g., with a truth table, a Conjunctive Normal Form (CNF), a Disjunctive normal form (DNF), a decision diagram, etc (see below). Boolean function canonization is at least NP-hard, since it solves the satisfiability problem (SAT). In practice this means that Boolean function canonization algorithms have an exponential complexity.
Canonization can also be elusive. Let us consider the problem of drawing a graph in some aesthetic way (e.g., such that the nodes are evenly distributed and such that there is a minimum number of crossing edges). One would like the graph to be drawn the very same way, regardless of its representation (adjacency list or adjacency matrix), and regardless of the order the adjacency information is given. For instances the three graphs below, although looking different, are exactly the same, and can be drawn without any edge crossing as shown on the right side.
Graph canonization is also known as graph labeling. It is at least as hard as graph isomorphism, one of these rare problems that are in NP but that are not known to be NP-complete or in P. Although all existing graph canonization algorithm have an exponential worst-case complexity, it is believed that graph canonization can be done in polynomial time.
The most common cause for non-determinism is related to some unreliable data order. The ultimate solution to make a program insensitive to the form of its input is to canonize its input as a pre-processing step. This proves to be a challenging and costly task in some cases. Whenever possible, canonization (or some imperfect normalization) goes a long way to make the application consistently repeatable.