## The Model

Why does biological evolution work? And, for that matter, why does machine learning work? Both are examples of adaptive processes that surprise us with what they manage to achieve. So what’s the essence of what’s going on? I’m going to concentrate here on biological evolution, though much of what I’ll discuss is also relevant to machine learning—but I’ll plan to explore that in more detail elsewhere.

OK, so what is an appropriate minimal model for biology? My core idea here is to think of biological organisms as computational systems that develop by following simple underlying rules. These underlying rules in effect correspond to the genotype of the organism; the result of running them is in effect its phenotype. Cellular automata provide a convenient example of this kind of setup. Here’s an example involving cells with 3 possible colors; the rules are shown on the left, and the behavior they generate is shown on the right:

*Note: Click any diagram to get Wolfram Language code to reproduce it.*

We’re starting from a single red cell, and we see that from this “seed” a structure is grown—which in this case dies out after 51 steps. And in a sense it’s already remarkable that we can generate a structure that neither goes on forever nor dies out quickly—but instead manages to live (in this case) for exactly 51 steps.

But let’s say we start from the trivial (“null”) rule that makes any pattern die out immediately. Can we end up “adaptively evolving” to the rule above? Imagine making a sequence of randomly chosen “point mutations”—each changing just one outcome in the rule, as in:

Then suppose that at each step—in a minimal analog of natural selection—we “accept” any mutation that makes the lifetime longer (though not infinite), or at least the same as before, and we reject any mutation that makes the lifetime shorter, or infinite. It turns out that with this procedure we can indeed “adaptively evolve” to the rule above (where here we’re showing only “waypoints” of progressively greater lifetime):
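The whole procedure is simple enough to sketch in a few lines of code. Here’s a minimal Python version (the diagrams in this piece are generated with Wolfram Language; the function names and the 500-step cutoff standing in for “infinite” are choices of this sketch, not part of the model):

```python
import random

K = 3  # 3 colors, nearest-neighbor (r = 1)

def step(cells, rule):
    # Pad with background (white = 0) so the pattern can grow one cell per side
    padded = [0, 0] + cells + [0, 0]
    return [rule[tuple(padded[i:i + 3])] for i in range(len(padded) - 2)]

def lifetime(rule, max_steps=500):
    """Steps until the pattern dies out; None if it survives past max_steps
    (which this sketch treats as 'infinite' for fitness purposes)."""
    cells = [1]  # single nonzero "seed" cell
    for t in range(1, max_steps + 1):
        cells = step(cells, rule)
        if not any(cells):
            return t
    return None

# The "null" rule: every case maps to white, so any pattern dies immediately
null_rule = {(a, b, c): 0 for a in range(K)
             for b in range(K) for c in range(K)}

def point_mutation(rule):
    """Change one case (never the all-white case) to a different outcome."""
    new = dict(rule)
    case = random.choice([c for c in new if c != (0, 0, 0)])
    new[case] = random.choice([v for v in range(K) if v != new[case]])
    return new

def adapt(n_mutations=1000, seed=0):
    """Accept a mutation if its lifetime is finite and at least the current one."""
    random.seed(seed)
    rule, best = null_rule, 1
    for _ in range(n_mutations):
        candidate = point_mutation(rule)
        lt = lifetime(candidate)
        if lt is not None and lt >= best:
            rule, best = candidate, lt
    return rule, best
```

Running `adapt` with different seeds gives different sequences of accepted rules, mirroring the different mutation histories discussed below.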

Different sequences of random mutations give different sequences of rules. But the remarkable fact is that in almost all cases it’s possible to “make progress”—and routinely reach rules that give long-lived patterns (here with lifetimes 107, 162 and 723) with elaborate morphological structure:

Is it “obvious” that our simple process of adaptive evolution will be able to successfully “wrangle” things to achieve this? No. But the fact that it can seems to be at the heart of why biological evolution manages to work.

Looking at the sequences of pictures above we see that there are often in effect “different mechanisms” for producing long lifetimes that emerge in different sequences of rules. Typically we first see the mechanism in simple form, then as the adaptive process continues, the mechanism gets progressively more developed, elaborated and built on—not unlike what we often appear to see in the fossil record of biological evolution.

But let’s drill down and look in a little more detail at what’s happening in the simple model we’re using. In the 3-color nearest-neighbor (*k* = 3, *r* = 1) cellular automata we’re considering, there are 26 (= 3^{3} – 1) relevant cases in the rule (there’d be 27 if we didn’t insist that the all-white neighborhood maps to white). “Point mutations” affect a single case, changing it to one of two (= 3 – 1) possible alternative outcomes—so that there are altogether 52 (= 26 × 2) possible distinct “point mutations” that can be made to a given rule.
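These counts are easy to verify directly; a quick Python check (a sketch, not code from the article):

```python
from itertools import product

K = 3
# All neighborhood cases except all-white, whose outcome is fixed
relevant = [c for c in product(range(K), repeat=3) if c != (0, 0, 0)]
assert len(relevant) == 26            # 3^3 - 1 relevant cases
assert len(relevant) * (K - 1) == 52  # each case has K - 1 alternative outcomes
```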

For example, starting from the rule

the results of possible single point mutations are:

And even with such point mutations there’s usually considerable diversity in the behavior they generate:

In quite a few cases the pattern generated is exactly the same as the one for the original rule. In other cases it dies out more quickly—or it doesn’t die out at all (either becoming periodic, or growing forever). And in this particular example, in just one case it achieves “higher fitness”, surviving longer.

If we make a sequence of random mutations, many will produce shorter-lived or infinite lifetime (“tumor”) patterns, and these we’ll reject (or, in biological terms, we can imagine they’re “selected out”):

But still there can be many “neutral mutations” that don’t change the final pattern (or at least give a pattern of the same length). And at first we might think that these don’t achieve anything. But actually they’re critical in allowing single point mutations to build up to larger mutations that can eventually give longer-lived patterns:

Tracing our whole adaptive evolution process above, the total number of point mutations involved in getting from one (increasingly long-lived) “fitness waypoint” to another is:

Here are the underlying rules associated with these fitness waypoints (where the numbers count cumulative “accepted mutations”, ignoring ones that go “back and forth”):

One way to get a sense of what’s going on is to take the whole sequence of (“accepted”) rules in the adaptive evolution process, and plot them in a dimension-reduced rendering of the (26-dimensional) rule space:

There are periods when there’s a lot of “wandering around” going on, with many mutations needed to “make progress”. And there are other periods when things go much faster, and fewer mutations are needed.

As another way to see what’s going on, we can plot the maximum lifetime achieved so far against the total number of mutation steps made:

We see plateaus (including an extremely long one) in which “no progress” is made, punctuated by sometimes-quite-large, sudden changes, often brought on by just a single mutation.

If we include “rejected mutations” we see that there’s a lot of activity going on even in the plateaus; it just doesn’t manage to make progress (one can think of each red dot that lies below a plateau as being like a mutation—or an organism—that “doesn’t make it”, and is selected out):

It’s worth noting that there can be multiple different (“phenotype”) patterns that occur across a plateau. Here’s what one sees in the particular example we’re considering:

But even between these “phenotypically different” cases, there can be many “genotypically different” rules. And in a sense this isn’t surprising, because usually only parts of the underlying rule are “coding”; other parts are “noncoding”, in the sense that they’re not sampled during the generation of the pattern from that rule.

And for example this highlights for each “fitness waypoint rule” which cells make use of a “fresh” case in the rule that hasn’t so far been sampled during the generation of the pattern:

And we see that even in the last rule shown here, only 18 of the 26 relevant cases in the rule are actually ever sampled during the generation of the pattern (from the particular, single-red-cell initial condition used). So this means that 8 cases in the rule are “undetermined” from the phenotype, implying that there are 3^{8} = 6561 possible genotypes (i.e. rules) that will give the same result.
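One can make the coding/noncoding distinction concrete by recording which rule cases actually get sampled during the generation of a pattern. Here’s a minimal Python sketch (the seed cell and the helper name are choices of mine):

```python
from itertools import product

def used_cases(rule, max_steps=200):
    """Run the pattern from a single seed cell and record which rule cases
    (the 'coding' part of the genotype) are actually sampled along the way."""
    cells, used = [1], set()
    for _ in range(max_steps):
        padded = [0, 0] + cells + [0, 0]
        new_cells = []
        for i in range(len(padded) - 2):
            case = tuple(padded[i:i + 3])
            used.add(case)
            new_cells.append(rule[case])
        cells = new_cells
        if not any(cells):
            break
    return used

K = 3
null_rule = {c: 0 for c in product(range(K), repeat=3)}

# The null rule's one step before dying samples just 3 of the 26 relevant
# cases, so every rule agreeing on those 3 cases gives the same phenotype
assert used_cases(null_rule) == {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
```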

So far we’ve mostly been talking about one particular random sequence of mutations. But what happens if we look at many possible such sequences? Here’s how the longest lifetime (or, in effect, “fitness”) increases for 100 different sequences of random mutations:

And what’s perhaps most notable here is that it seems as if these adaptive processes indeed don’t “get stuck”. It may take a while (with the result that there are long plateaus) but these pictures suggest that eventually “adaptive evolution will find a way”, and one will get to rules that show longer lifetimes—as the progressive development of the distribution of lifetimes reflects:

## The Multiway Graph of All Possible Mutation Histories

In what we’ve done so far we’ve always been discussing particular paths of adaptive evolution, determined by particular sequences of random mutations. But a powerful way to get a more global view of the process of adaptive evolution is to look—in the spirit of our Physics Project, the ruliad, etc.—not just at individual paths of adaptive evolution, but instead at the multiway graph of all possible paths. (And in making a correspondence with biology, multiway graphs give us a way to talk about adaptive evolution not just of individual sequences of organisms, but also populations.)

To start our discussion, let’s consider not the 3-color cellular automata of the previous section, but instead (nearest-neighbor) 2-color cellular automata—for which there are just 128 possible relevant rules. How are these rules related by point mutations? We can construct a graph of every possible way that one rule from this set can be transformed to another by a single point mutation:

If we imagine 5-bit rather than 7-bit rules, there are only 16 relevant ones, and we can readily see that the graph of possible mutations has the form of a Boolean hypercube:
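Constructing such a mutation graph is straightforward: connect rule bit-vectors at Hamming distance 1. A Python sketch, using the 16-rule case whose graph is the hypercube just shown (helper name is mine):

```python
from itertools import product

def mutation_graph(n_bits):
    """Nodes are rule bit-vectors; edges join rules differing by one point
    mutation. For 2-color rules this is an n-dimensional Boolean hypercube."""
    nodes = list(product([0, 1], repeat=n_bits))
    edges = [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
             if sum(a != b for a, b in zip(u, v)) == 1]
    return nodes, edges

# 4 free bits, as for the 16 relevant 5-bit rules above
nodes, edges = mutation_graph(4)
assert len(nodes) == 16
assert len(edges) == 4 * 16 // 2  # each of the 16 nodes has 4 neighbors
```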

Let’s say we start from the “null rule”. Then we enumerate the rules obtained by a single point mutation (and therefore directly connected to the null rule in the graph above)—and we see what behavior they produce, say from the initial condition ……:

Some of these rules we can view as “making progress”, in the sense that they yield patterns with longer lifetimes (not impressively longer, just 2 rather than 1). But other rules “make no progress” or generate patterns that “live forever”. Keeping only mutations that don’t lead to shorter or infinite lifetimes, we can construct a multiway graph that shows all possible mutation paths:

Although this is a very small graph (with just 15 rules appearing), we can already see hints of some important phenomena. There are “fitness-neutral” mutations that can “go both ways”. But there are also plenty of mutations that only “go one way”—because the other way they would decrease fitness. And a notable feature of the graph is that once one’s “committed” to a particular part of the graph, one often can’t reach a different one—suggesting an analogy to the existence of distinct branches in the tree of life.

Moving beyond 2-color, nearest-neighbor (*k* = 2, *r* = 1) cellular automata, we can consider *k* = 2, *r* = 3/2 ones. A typical such cellular automaton is:

For *k* = 2, *r* = 1 there were a total of 128 (= 2^{2^{3} – 1}) relevant rules. For *k* = 2, *r* = 3/2, there are a total of 32,768 (= 2^{2^{4} – 1}). Starting with the null rule, and again using the same initial condition, we can once more follow paths of adaptive evolution:

And here is the beginning of the multiway graph for *k* = 2, *r* = 3/2 rules—showing rules reached by up to two mutations starting from the null rule:

This graph contains many examples of “fitness-neutral sets”—rules that have the same fitness and that can be transformed into each other by mutations. A few examples of such fitness-neutral sets:

In the first case here, the “morphology of the phenotypic patterns” is the same for all “genotypic rules” in the fitness-neutral set. But in the other cases there are multiple morphologies within a single fitness-neutral set.

If we included all individual rules we’d get a complete *k* = 2, *r* = 3/2 multiway graph with a total of 1884 nodes. But if we just include one representative from every fitness-neutral set, we get a more manageable multiway graph, with a total of 86 nodes:

Keeping only one representative from pairs of patterns that are related by left-right symmetry, we get a still-simpler graph, now with a total of 49 nodes:

There’s quite a lot of structure in this graph, with both divergence and convergence of possible paths. But overall, there’s a certain sense that different sections of the graph separate into distinct branches in which adaptive evolution in effect “pursues different ideas” about how to increase fitness (i.e. lifetime of patterns).

We can think of fitness-neutral sets as representing a certain kind of equivalence class of rules. There’s quite a range of possible structures to these sets—from ones with a single element to ones with many elements but few distinct morphologies, to ones with different morphologies for every element:

What about larger spaces of rules? For *k* = 2, *r* = 2 there are altogether about 2 billion (= 2^{2^{5} – 1}) relevant rules. But if we choose to look only at left-right symmetric ones, this number is reduced to 524,288 (= 2^{19}). Here are some examples of sequences of rules produced by adaptive evolution in this case, starting from the null rule, and allowing only mutations that preserve symmetry (and now using a single black cell as the initial condition):
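The count of symmetric rules follows from grouping the 2^{5} = 32 possible neighborhoods into left-right symmetric classes; a quick Python check (a sketch, not code from the article):

```python
from itertools import product

# Canonical representative of each left-right symmetric class of 5-cell
# neighborhoods; a symmetric rule assigns one outcome per class
classes = {min(t, t[::-1]) for t in product([0, 1], repeat=5)}
assert len(classes) == 20
# One class (the all-white neighborhood) has its outcome fixed, leaving
# 19 free binary choices:
assert 2 ** (len(classes) - 1) == 524288
```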

Once again we can identify fitness-neutral sets—though this time, in the vast majority of cases, the patterns generated by all members of a given set are the same:

Reducing out fitness-neutral sets, we can then compute the complete (transitively reduced) multiway graph for symmetric *k* = 2, *r* = 2 rules (containing a total of 60 nodes):

By reducing out fitness-neutral sets, we’re creating a multiway graph in which every edge represents a mutation that “makes progress” in increasing fitness. But actual paths of adaptive evolution based on random sequences of mutations can do any amount of “rattling around” within fitness-neutral sets—not to mention “trying” mutations that decrease fitness—before reaching mutations that “make progress”. So this means that even though the reduced multiway graph we’ve drawn suggests that the maximum number of steps (i.e. mutations) needed to adaptively evolve from the null rule to any other is 9, it can actually take any number of steps because of the “rattling around” within fitness-neutral sets.

Here’s an example of a sequence of accepted mutations in a particular adaptive evolution process—with the mutations that “make progress” highlighted, and numbers indicating rejected mutations:

We can see “rattling around” in a fitness-neutral set, with a cycle of morphologies being generated. But while this represents one way to reach the final pattern, there are also plenty of others, potentially involving many fewer mutations. And indeed one can determine from the multiway graph that an absolutely shortest path is:

This involves the sequence of rules:

We’re starting from the null rule, and at each step making a single point mutation (though because of symmetry two bits can sometimes be changed). The first few mutations don’t end up changing the “phenotypic behavior”. But after a while, enough mutations (here 6) have built up that we get morphologically different behavior. And after just 3 more mutations, we end up with our final pattern.

Our original random sequence of mutations gets to the same result, but in a much more tortuous way, doing a total of 169 mutations which often cancel each other out:

In drawing a multiway graph, we’re defining what evolutionary paths are possible. But what about probabilities? If we assume that every point mutation is equally likely, we can in effect “analyze the flow” in the multiway graph, and determine the ultimate probability that each rule will be reached (with higher probabilities here shown redder):
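In code, this kind of flow analysis is just probability propagation down a directed acyclic graph. Here’s a minimal Python sketch on a toy graph (the equal-splitting rule comes from the assumption above; requiring nodes to be labeled in topological order is a simplification of this sketch):

```python
from collections import defaultdict

def reach_probabilities(edges, root):
    """Split probability equally among outgoing mutations at each node and
    propagate it from the root; assumes nodes are topologically labeled."""
    out = defaultdict(list)
    nodes = {root}
    for a, b in edges:
        out[a].append(b)
        nodes |= {a, b}
    prob = {n: 0.0 for n in nodes}
    prob[root] = 1.0
    for n in sorted(nodes):
        for m in out[n]:
            prob[m] += prob[n] / len(out[n])
    return prob

# Toy diamond-shaped multiway graph: two branches that reconverge
p = reach_probabilities([(0, 1), (0, 2), (1, 3), (2, 3)], root=0)
assert abs(p[1] - 0.5) < 1e-12 and abs(p[3] - 1.0) < 1e-12
```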

## The Fitness Landscape

Multiway graphs give a very global view of adaptive evolution. But in understanding the process of adaptive evolution, it’s also often useful to think somewhat more locally. We can imagine that all the possible rules are laid out in a certain space, and that adaptive evolution is trying to find appropriate paths in this space. Potentially we can suppose that there’s a “fitness landscape” defined in the space, and that adaptive evolution is trying to follow a path that progressively ascends to higher peaks of fitness.

Let’s consider again the very first example we gave above—of adaptive evolution in the space of 3-color cellular automata. At each step in this adaptive evolution, there are 52 possible point mutations that can be made to the rule. And one can think of each of these mutations as corresponding to making an “elementary move” in a different direction in the (26-dimensional) space of rules.

Here’s a visual representation of what’s going on, based on the particular path of adaptive evolution from our very first example above:

What we’re showing here is in effect the sequence of “decisions” that are being made to get us from one “fitness waypoint” to another. Different possible mutations are represented by different radial directions, with the length of each line being proportional to the fitness achieved by doing that mutation. At each step the gray disk represents the previous fitness. And what we see is that many possible mutations lead to lower fitness outcomes, shown “within the disk”. But there are at least some mutations that have higher fitness, and “escape the disk”.

In the multiway graph, we’d trace every mutation that leads to higher fitness. But for a particular path of adaptive evolution as we’ve discussed it so far, we imagine we always just pick at random one mutation from this set—as indicated here by a red line. (Later we’ll discuss different strategies.)

Our radial icons can be thought of as giving a representation of the “local derivative” at each point in the space of rules, with longer lines corresponding to directions with larger slopes “up the fitness landscape”.

But what happens if we want to “knit together” these local derivatives to form a picture of the whole space? Needless to say, it’s complicated. And as a first example, consider *k* = 2, *r* = 1 cellular automaton rules.

There are a total of 128 relevant such rules, that (as we discussed above) can be thought of as connected by point mutations to form a 7-dimensional Boolean hypercube. As also discussed above, of all 128 relevant rules, only 15 appear in adaptive evolution processes (the others are in effect never selected because they represent lower fitness). But now we can ask where these rules lie on the whole hypercube:

Each node here represents a rule, with the size of the highlighted nodes indicating their corresponding fitness (computed from lifetime with initial condition ……). The node shown in green corresponds to the null rule.

Rendering this in 3D, with fitness shown as height, we get what we can consider a “fitness landscape”:

And now we can think of our adaptive evolution as proceeding along paths that never go to nodes with lower height on this landscape.

We get a more filled-in “fitness landscape” when we look at *k* = 2, *r* = 3/2 rules (here with initial condition ……):

Adaptive evolution must trace out a “never-go-down” path on this landscape:

Along this path, we can make “derivative” pictures like the ones above to represent “local topography” around each point—indicating which of the possible upwards-on-the-landscape directions is taken:

The rule space over which our “fitness landscape” is defined is ultimately discrete and effectively very high-dimensional (15-dimensional for *k* = 2, *r* = 3/2 rules)—and it’s quite challenging to produce an interpretable visualization of it in 3D. We’d like it if we could lay out our rendering of the rule space so that rules which differ just by one mutation are a fixed (“elementary”) 2D distance apart. In general this won’t be possible, but we’re trying to at least approximate this by finding a good layout for the underlying “mutation graph”.

Using this layout we can in principle make a “fitness landscape surface” by interpolating between discrete points. It’s not clear how meaningful this is, but it’s perhaps useful in engaging our spatial intuition:

We can try machine learning and dimension reduction, operating on the set of “rule vectors” (i.e. outcome lists) that won’t be rejected in our adaptive evolution process—and the results are perhaps slightly better:

By the way, if we use this dimension reduction for rule space, here’s how the behavior of rules lays out:

And here, for comparison, is a feature space plot based on the visual appearance of these patterns:

## The Whole Space: Exhaustive Search vs. Adaptive Evolution

In adaptive evolution, we start, say, from the null rule and then make random mutations to try to reach rules with progressively larger fitness. But what about just exhaustively searching the complete space of possible rules? The number of rules rapidly becomes unmanageably big—but some cases are definitely accessible:

For example, there are just 524,288 symmetric *k* = 2, *r* = 2 rules—of which 77,624 generate patterns with finite lifetimes. Ultimately, though, there are just 77 distinct phenotypic patterns that appear—with varying lifetimes and varying multiplicity (where at least in this case the multiplicity is always associated with “unused bits” in the rule):

How do these exhaustive results compare with what’s generated in the multiway graph of adaptive evolutions? They’re almost the same, but for the addition of the two extra cases

which are generated by rules of the form (where the gray entries don’t matter):

Why don’t such rules ever appear in our adaptive evolution? The reason is that there isn’t a chain of point mutations starting from the null rule that can reach these rules without going through rules that would be rejected by our adaptive evolution process. If we draw a multiway graph that includes every possible “acceptable” rule, then we’ll see a separate part in the graph, with its own root, that contains rules that can’t be reached by our adaptive evolution from the null rule:

So now if we look at all (symmetric *k* = 2, *r* = 2) rules, here’s the distribution of lifetimes we get:

The maximum, as seen above, is 65. The overall distribution roughly follows a power law, with an exponent around –3:
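One can estimate such an exponent with a simple log-log fit on the observed lifetime counts. Here’s a rough Python sketch (least squares on binned counts is a crude estimator, used here only to illustrate the idea):

```python
import math
from collections import Counter

def powerlaw_exponent(lifetimes):
    """Least-squares slope of log(count) vs log(lifetime): a crude estimate
    of the exponent of an approximately power-law distribution."""
    counts = Counter(lifetimes)
    xs = [math.log(t) for t in counts]
    ys = [math.log(c) for c in counts.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Synthetic data with counts ~ t^-3 should recover an exponent near -3
data = [t for t in range(1, 6) for _ in range(round(1000 * t ** -3))]
assert -3.2 < powerlaw_exponent(data) < -2.8
```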

As we saw above, not all rules make use of all their bits (i.e. outcomes) in generating phenotypic patterns. But what we see is that the larger the lifetime achieved, the more bits tend to be needed:

And in a sense this isn’t surprising: as we’ll discuss later, we can expect to need “more bits in the program” to specify more elaborate behavior—or, in particular, behavior that embodies a larger number for lifetime.

So what about general (i.e. not necessarily symmetric) *k* = 2, *r* = 2 rules? There are 2^{31} (= 2^{2^{5} – 1}) ≅ 2 billion relevant rules, but it’s still feasible to search them exhaustively—and here’s the distribution of lifetimes we find:

Again it’s roughly a power law, but now with exponent around –3.5:

Here are the actual 100 patterns produced that have the longest lifetimes (in all asymmetric cases there are also rules giving left-right flipped patterns):

It’s interesting to see here a variety of “qualitatively different ideas” being used by different rules. Some (like the one with lifetime 151) we might somehow imagine could have been constructed specifically “for the purpose” of having their particular, long lifetime. But others (like the one with lifetime 308) somehow seem more “coincidental”—behaving in an apparently random way, and then just “happening to die out” after a certain number of steps.

Since we found these rules by exhaustive search, we know they’re the only possible ones with such long lifetimes (at least with *k* = 2, *r* = 2). So then we can infer that the ornate structures we see are in some sense necessary to achieve the objective of, say, having a finite lifetime of more than 100 steps. So that means that if we go through a process of adaptive evolution and achieve a lifetime above 100 steps, we see a complex pattern of behavior not because of “complicated choices in our process of adaptive evolution”, but rather because to achieve such a lifetime one has no choice but to use such a complex pattern. Or, in other words, the complexity we see is a reflection of “computational necessity”, not historical accidents of adaptive evolution.

Note also that (as we’ll discuss in more detail below) there are certain behaviors we can get, and others that we cannot. So, for example, there is a rule that gives lifetime 308, but none that gives lifetime 300. (Though, yes, if we used more complicated initial conditions or a more complicated family of rules we could find such a rule.)

Much as we saw in the symmetric *k* = 2, *r* = 2 case above, almost any long lifetimes require using all the available bits in the rule:

But, needless to say, there’s an exception—a pair of rules with lifetime 84 where the outcome for one particular case doesn’t matter:

But, OK, so can these long-lifetime rules be reached by single-mutation adaptive evolution from the null rule? Rather than trying to construct the whole multiway graph for the general *k* = 2, *r* = 2 case, we can construct an inverse multiway graph, starting from a long-lifetime rule and tracing backwards the chains of acceptable mutations that could lead to it:

And what we see is that at least in this case such a procedure never reaches the null rule. The “furthest” it gets is to lifetime-2 rules, and among these rules the closest to the null rule are:

But it turns out that there’s no way to reach these 2-bit rules by a single point mutation from any of the 26 1-bit rules that aren’t rejected by our adaptive evolution process. And in fact this isn’t just an issue for this particular long-lifetime rule; it’s something quite general among *k* = 2 rules. And indeed, constructing the “forward” multiway graph starting from the null rule, we find we can only ever reach lifetime-1 rules.

Ultimately this is a particular feature of rules with just 2 colors—and it’s specific to starting with something like the null rule that has lifetime 1—but it’s an illustration of the fact that there can even be large swaths of rule space that can’t be reached by adaptive evolution with point mutations.

What about symmetric *k* = 2, *r* = 2 rules? Well, to maintain symmetry we have to deal with mutations that change not just one but two bits. And this turns out to mean that (except in the cases we discovered above) the inverse multiway system starting from long-lifetime rules always successfully reaches the null rule:

There’s something else to notice here, however. Looking at this graph, we see that there’s a way to get with just one 2-bit mutation from a lifetime-1 to a lifetime-65 rule:

We didn’t see this in our multiway graph above because we had applied transitive reduction to it. But if we don’t do that, we find that a few large lifetime jumps are possible—as we can see on this plot of possible lifetimes before and after a single point mutation:
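Transitive reduction itself is straightforward to sketch: drop any edge of the DAG whose endpoints are also joined by a longer path. A minimal Python version (the helper names are mine):

```python
from collections import defaultdict

def transitive_reduction(edges):
    """Drop every edge of a DAG that is implied by a longer path, as in the
    reduced multiway graphs above."""
    out = defaultdict(set)
    for a, b in edges:
        out[a].add(b)

    def reachable_without(a, b):
        # Is b reachable from a without using the direct edge a -> b?
        stack, seen = [x for x in out[a] if x != b], set()
        while stack:
            n = stack.pop()
            if n == b:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(out[n])
        return False

    return [(a, b) for a, b in edges if not reachable_without(a, b)]

# The "shortcut" 0 -> 2 is implied by 0 -> 1 -> 2, so reduction removes it
assert transitive_reduction([(0, 1), (1, 2), (0, 2)]) == [(0, 1), (1, 2)]
```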

Going beyond *k* = 2, *r* = 2 rules, we can consider symmetric *k* = 3, *r* = 1 rules, of which there are 3^{17}, or about 129 million. The distribution of lifetimes in this case is

which again roughly fits a power law, again with exponent around –3.5:

But now the maximum lifetime found is not just 308, but 2194:

Once again, there are some different “ideas” on display, with a few curious examples of convergence—such as the rules we see with lifetimes 989 and 990 (as well as 1068 and 1069), which give essentially the same patterns after just exchanging colors and adding one “prefatory” step.

What about general *k* = 3, *r* = 1 rules? There are too many to easily search exhaustively. But directed random sampling reveals plenty of long-lifetime examples, such as:

And now the tail of very long lifetimes extends further, for example with:

It’s a little easier to see what the lifetime-10863 rule does if one visualizes it in sections (and adjusts colors to get more contrast):

Sampling 100 steps out of every 2000 (as well as at the very end), we see elaborate alternation between periodic and seemingly random behavior—but none of it gives any obvious clue of the remarkable fact that after 10863 steps the whole pattern will die out:

## The Issue of Undecidability

As our example criterion for the “fitness” of cellular automaton rules, we’ve used the lifetimes of the patterns they generate—always assuming that if the patterns don’t terminate at all they should be considered to have fitness zero.

But how can we tell if a pattern is going to terminate? In the previous section, for example, we saw patterns that live a very long time—but do eventually terminate.

Here are some examples of the first 100 steps of patterns generated by a few *k* = 3, *r* = 1 symmetric rules:

What will happen with these patterns? We know from what we see here that none of them have lifetimes less than 100 steps. But what would allow us to say more? In a few cases we can see that the patterns are periodic, or have obvious repeating structures, which means they’ll never terminate. But in the other cases there’s no obvious way to predict what will happen. Explicitly running the rules for another 100 steps we discover some more outcomes:

Going to 500 steps there are some surprises. Rule (a) becomes periodic after 388 steps; rules (o) and (v) terminate after 265 and 377 steps, respectively:

But is there a way to systematically say what will happen “in the end” with all the remaining rules? The answer is that in general there is not; it’s something that must be considered undecidable by any finite computation.

Given how comparatively simple the cellular automaton rules we’re considering are, we might have assumed that with all our sophisticated mathematical and computational methods we’d always be able to “jump ahead of them”—and figure out their outcome without the computational effort of explicitly running each step.

But the Principle of Computational Equivalence suggests that pretty much whenever the behavior of these rules isn’t obviously simple, it will in effect be of equal computational sophistication to any other system, and in particular to any methods that we might use to predict it. And the result is the phenomenon of computational irreducibility that implies that in many systems—presumably including most of the cellular automata here—there isn’t any way to figure out their outcome much more efficiently than by explicitly tracing each of their steps. So this means that to know what will happen “in the end”—after an infinite number of steps—might take an unlimited amount of computational effort. Or, in other words, it must be considered effectively undecidable by any finite computation.

As a practical matter we might look at the observed distribution of lifetimes for a particular type of cellular automaton, and become pretty confident that there won’t be longer finite lifetimes for that type of cellular automaton. But for the *k* = 3, *r* = 1 rules from the previous section, we might have been fairly confident that a few thousand steps was the longest lifetime that would ever occur—until we discovered the 10,863-step example.

So let’s say we run a particular rule for 10,000 steps and it hasn’t died out. How can we tell if it never will? Well, we have to construct a proof of some kind. And that’s easy to do if we can see that the pattern becomes, say, completely periodic. But in general, computational irreducibility implies we won’t be able to do it. Might there, though, still be special cases where we can? In effect, those would have to correspond to “pockets of computational reducibility” where we manage to find a compressed description of the cellular automaton behavior.
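In practice, then, the best one can do is run a rule and watch for the special cases that can be decided. Here’s a Python sketch of such a classifier for the cellular automata used here (the background-trimming trick and the step cap are choices of this sketch):

```python
from itertools import product

def classify(rule, max_steps=1000):
    """Run from a single seed cell. Return ('dies', t); or ('periodic', t)
    when a previously seen configuration repeats (a proof of non-termination,
    since the background is white and the rule is translation-invariant);
    or ('unknown', max_steps) -- which, in general, is the best we can do."""
    cells, seen = (1,), set()
    for t in range(1, max_steps + 1):
        padded = (0, 0) + cells + (0, 0)
        cells = tuple(rule[padded[i:i + 3]] for i in range(len(padded) - 2))
        if not any(cells):
            return ('dies', t)
        # Trim background so shifted copies of the same pattern match
        trimmed = cells
        while trimmed and trimmed[0] == 0:
            trimmed = trimmed[1:]
        while trimmed and trimmed[-1] == 0:
            trimmed = trimmed[:-1]
        if trimmed in seen:
            return ('periodic', t)
        seen.add(trimmed)
    return ('unknown', max_steps)

K = 3
null_rule = {c: 0 for c in product(range(K), repeat=3)}
persist = {(a, b, c): b for a, b, c in product(range(K), repeat=3)}
assert classify(null_rule) == ('dies', 1)
assert classify(persist) == ('periodic', 2)
```

The `'unknown'` outcome is exactly where computational irreducibility bites: no finite cap on `max_steps` can resolve every case.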

There are cases like this where there isn’t strict periodicity, but where in the end there’s basically repetitive behavior (here with period 480):

And there are cases of nested behavior, which is never periodic, but is nevertheless simple enough to be predictable:

But there are always surprises. Like this example—which eventually resolves to have period 6, but only after 7129 steps:

So what does all this mean for our adaptive evolution process? It implies that in principle we could miss a very long finite lifetime for a particular rule, assuming it to be infinite. In a biological analogy we might have a genome that seems to lead to unbounded perhaps-tumor-like growth—but where actually the growth in the end “unexpectedly” stops.

## Computation Theoretic Perspectives and Busy Beavers

What we’re asking about the dying out of patterns in cellular automata is directly analogous to the classic halting problem for Turing machines, or the termination problem for term rewriting, Post tag systems, etc. And in looking for cellular automata that have the longest-lived patterns, we’re studying a cellular automaton analog of the so-called busy beaver problem for Turing machines.

We can summarize the results we’ve found so far (all for single-cell initial conditions):

The profiles (i.e. widths of nonzero cells) for the patterns generated by these rules are

and the “integrals” of these curves are what give the “areas” in the table above.

For the reasons described in the previous section, we can only be certain that we’ve found lower bounds on the actual maximum lifetime—though except in the last few cases listed it seems very likely that we do in fact have the maximum lifetime.

It’s somewhat sobering, though, to compare with known results for maximum (“busy beaver”) lifetimes for Turing machines (where now *s* is the number of Turing machine states, the Turing machines are started from blank tapes, and they are taken to “halt” when they reach a particular halt state):

Sufficiently small Turing machines can have only modest lifetimes. But even slightly bigger Turing machines can have vastly larger lifetimes. And in fact it’s a consequence of the undecidability of the halting problem for Turing machines that the maximum lifetime grows with the size of the Turing machine faster than any computable function (i.e. any function that can be computed in finite time by a Turing machine, or whose value can be proved by a finite proof in a finite axiom system).

But, OK, the maximum lifetime increases with the “size of the rule” for a Turing machine, or a cellular automaton. But what defines the “size of a rule”? Presumably it should be roughly the number of independent bits needed to specify the rule (which we can also think of as an approximate measure of its “information content”)—or something like log_{2} of the number of possible rules of its type.
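For cellular automata this count is easy to make concrete. A rule table has k^(2r+1) entries, each carrying log₂ k bits. Here is a small Python sketch (my own illustration, not from the article):

```python
import math

def rule_size(k, r):
    """Size of the (k, r) cellular automaton rule space: number of rule-table
    entries, total number of rules, and bits needed to specify one rule."""
    cases = k ** (2 * r + 1)            # number of distinct neighborhoods
    return cases, k ** cases, cases * math.log2(k)

print(rule_size(2, 2))   # 32 entries, 2^32 rules, 32 bits
print(rule_size(3, 1))   # 27 entries, about 7.6 * 10^12 rules, ~42.8 bits
```

The k = 4, r = 1 case gives 64 entries and about 3.4 × 10^38 rules, matching the rule-space sizes quoted later in the article.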

At the outset, we might imagine that all 2^{32} *k* = 2, *r* = 2 rules would need 32 bits to specify them. But as we discussed above, in some cases some of the bits in the rule don’t matter when it comes to determining the patterns they produce. And what we see is that the more bits that matter (and so have to be specified), the longer the lifetimes that are possible:

So far we’ve only been discussing cellular automata with single-cell initial conditions. But if we use more complicated initial conditions what we’re effectively doing is adding more information content into the system—with the result that maximum lifetimes can potentially get larger. And as an example, here are possible lifetimes for *k* = 2, *r* = 1 rules with a sequence of possible initial conditions:

## Probabilistic Approximations?

Cellular automata are at their core deterministic systems: given a particular cellular automaton rule and a particular initial condition, every aspect of the behavior that is generated is completely determined. But is there any way that we can approximate this behavior by some probabilistic model? Or might we at least usefully be able to use such a model if we look at the aggregate properties of large numbers of different rules?

One hint along these lines comes from the power-law distributions we found above for the frequencies of different possible lifetimes for cellular automata of given types. And we might wonder whether such distributions—and perhaps even their exponents—could be found from some probabilistic model.

One possible approach is to approximate a cellular automaton by a probabilistic process—say one in which a cell becomes black with probability *p* if it or either of its neighbors were black on the step before. Here are some examples of what can happen with this (“directed percolation”) setup:

The behavior varies greatly with *p*; for small *p* everything dies out, while for large *p* it fills in:

And indeed the final density—starting from random initial conditions—has a sharp (phase) transition at around *p* = 0.54 as one varies *p*:

If instead one starts from a single initial black cell one sees a slightly different transition:

One can also plot the probabilities for different “survival times” or “lifetimes” for the pattern:

And right around the transition the distribution of lifetimes follows a power law—that’s roughly τ^{–1} (which happens to be what one gets from a mean field theory estimate).
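The probabilistic model just described is straightforward to simulate. The sketch below (a Python stand-in for the article's own code) tracks the set of black cells and returns the lifetime of the pattern grown from a single seed:

```python
import random

def dp_lifetime(p, max_steps, seed=0):
    """Directed-percolation-style model: a cell is black on the next step
    with probability p if it or either of its neighbors is black now.
    Returns the step at which the pattern dies, or None if it survives."""
    rng = random.Random(seed)
    cells = {0}                               # position of the single seed
    for t in range(1, max_steps + 1):
        candidates = {x + d for x in cells for d in (-1, 0, 1)}
        cells = {x for x in candidates if rng.random() < p}
        if not cells:
            return t
    return None

# For small p the pattern almost always dies out quickly;
# for large p it almost always fills in and survives
print([dp_lifetime(0.3, 500, seed=s) for s in range(5)])
print(dp_lifetime(0.9, 500))
```

Collecting `dp_lifetime` over many seeds at a fixed p near the transition is what gives the lifetime distributions discussed above.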

So how does this relate to cellular automata? Let’s say we have a *k* = 2 rule, and we suppose that the colors of cells can be approximated as somehow random. Then we might suppose that the patterns we get could be like in our probabilistic model. And a potential source for the value of *p* to use would be the fraction of cases in the rule that give a black cell as output.

Plotting the lifetimes for *k* = 2, *r* = 2 rules against these fractions, we see that the longest lifetimes do occur when a little under half the outcomes are black (though notice this is also where the binomial distribution implies the largest number of rules are concentrated):

If we don’t try thinking about the details of cellular automaton evolution, but instead just consider the boundaries of the finite-lifetime patterns we generate, we can imagine approximating these boundaries (say for symmetric rules) by random walks, with a collision of the walks corresponding to the pattern dying out:

The standard theory of random walks then tells us that the probability that the walks first collide after τ steps (i.e. that the pattern has lifetime exactly τ) is proportional to τ^{–3/2} for large τ: a power law, though not immediately one of the same ones that we’ve observed for our cellular automaton lifetimes.
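This random-walk estimate can be checked numerically. The sketch below (my own Python illustration, not from the article) evolves the exact probability distribution of the gap between two colliding ±1 walks, and measures the power-law tail of the first-collision time:

```python
import math

def collision_time_distribution(max_t):
    """Exact distribution of the first collision time of two independent
    +/-1 random walks started 2 apart. Their gap performs a lazy walk:
    -2 with probability 1/4, +2 with probability 1/4, unchanged with 1/2."""
    probs = {2: 1.0}                      # distribution over the current gap
    dist = []
    for _ in range(max_t):
        new, absorbed = {}, 0.0
        for g, pr in probs.items():
            for step, w in ((-2, 0.25), (0, 0.5), (2, 0.25)):
                ng = g + step
                if ng == 0:
                    absorbed += pr * w    # walks have just collided
                else:
                    new[ng] = new.get(ng, 0.0) + pr * w
        dist.append(absorbed)
        probs = new
    return dist

dist = collision_time_distribution(400)
# log-log slope of the tail between t = 200 and t = 400
slope = (math.log(dist[399]) - math.log(dist[199])) / (math.log(400) - math.log(200))
print(slope)   # close to the predicted -3/2
```

Because the gap distribution is evolved exactly rather than sampled, the measured slope is deterministic and converges to –3/2 as t grows.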

## Other Adaptive Evolution Strategies

In what we’ve done so far, we’ve always taken each step of our process of adaptive evolution to pick an outcome of equal or greater fitness. But what if we adopt a “more impatient” procedure in which at each step we insist on an outcome that has strictly greater fitness?

For *k* = 2 it’s simply not possible with this procedure (at least with a null initial condition) to “escape” the null rule; everything that can be reached with 1 mutation still has lifetime 1. With *k* = 3 the situation is different:

But we’re assuming here that we have to reach greater fitness with just one mutation. What if we allow two mutations at a time? Well, then we can “make progress”. And here’s the multiway graph in this case for symmetric *k* = 2, *r* = 2 rules:

We don’t reach as many phenotypic patterns as by using single mutations and allowing “fitness-neutral moves”, but the ones we do reach, we reach much more quickly, without any “back and forth” in fitness-neutral spaces.

If we allow up to 3 mutations, we get still further:

And indeed we seem to get a pretty good representative sampling of “what’s out there” in this rule space, even though we reach only 37 rules, compared to the 77,624 (albeit with many duplicated phenotypic patterns) from our standard approach allowing neutral moves.

For *k* = 3, *r* = 1 symmetric rules, single mutations can get only 2 steps:

But now if we allow up to 2 mutations, we can go much further—and the fact that we now don’t have to deal with neutral moves means we can explicitly construct at least the first few steps of the multiway graph in this case:

We can go further if at each step we just pick a random higher-fitness rule reached with two or fewer mutations:

The adaptive evolution histories we just showed can be generated in effect by randomly trying a series of possibilities at each step, then picking the first one that exhibits increased fitness. Another approach is to use what amounts to “local exhaustive search”: at each step, look at results from all possible mutations, and pick one that gives the largest fitness. At least in smaller rule spaces, it’s common that there will be several results with the same fitness, and as an example we’ll just pick among these at random:

One might think that this approach would in effect always optimize the adaptive evolution process. But in practice its systematic character can end up making it get stuck, in some sense repeatedly “trying to do the same thing” even if it “isn’t working”.
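The contrast between these strategies can be sketched abstractly. In the Python sketch below (my own illustration, using a hypothetical stand-in fitness function rather than an actual cellular automaton lifetime), 'random' tries one random point mutation per step, while 'greedy' does local exhaustive search and picks at random among the maximal-fitness mutations:

```python
import random

def adaptive_evolve(fitness, genome, k, steps, strategy, seed=0):
    """Adaptive evolution over length-n, k-ary genomes with two strategies:
    'random': one random point mutation per step, kept unless fitness drops;
    'greedy': score all single point mutations, pick at random among the best."""
    rng = random.Random(seed)
    f = fitness(genome)
    for _ in range(steps):
        if strategy == 'random':
            g = list(genome)
            g[rng.randrange(len(g))] = rng.randrange(k)
            cands = [tuple(g)]
        else:
            muts = [genome[:i] + (v,) + genome[i + 1:]
                    for i in range(len(genome)) for v in range(k)]
            top = max(fitness(g) for g in muts)
            cands = [g for g in muts if fitness(g) == top]
        g = rng.choice(cands)
        if fitness(g) >= f:               # reject strictly lower fitness
            genome, f = g, fitness(g)
    return f, genome

# Hypothetical toy fitness: a deterministic but "bumpy" landscape
def toy_fitness(g):
    return sum((i + 1) * v for i, v in enumerate(g)) % 97

print(adaptive_evolve(toy_fitness, (0,) * 8, 3, 200, 'random'))
print(adaptive_evolve(toy_fitness, (0,) * 8, 3, 50, 'greedy'))
```

By construction fitness never decreases along either trajectory; the greedy strategy's tendency to cycle among equal-fitness candidates is exactly the kind of "getting stuck" described above.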

Something of an opposite approach involves loosening our criteria for which paths can be chosen—and for example allowing paths that temporarily reduce fitness, say by one step of lifetime:

In effect here we’re allowing less-than-maximally-fit organisms to survive. And we can represent the overall structure of what’s happening by a multiway graph—which now includes “backtracking” to lower fitnesses:

But although the details are different, in the end it doesn’t seem as if allowing this kind of backtracking has any dramatic effect. Somehow the basic phenomena around the process of adaptive evolution are strong enough that most of the details of how the adaptive evolution is done don’t ultimately matter much.

## An Aside: Sexual Reproduction

In everything we’ve done so far, we’ve been making mutations only to individual rules. But there’s another mechanism that exists in many biological organisms: sexual reproduction, in which in effect a pair of rules (i.e. genomes) get mixed to produce a new rule. As a simple model of the crossover that typically happens with actual genomes, we can take two rules, and splice together the beginning of one with the end of the other:
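As a concrete sketch (in Python rather than the Wolfram Language, treating a rule simply as a tuple of outcomes), crossover here is just a splice:

```python
def crossover(rule_a, rule_b, point):
    """Minimal model of genetic crossover: splice the first `point`
    outcomes of one rule table onto the rest of the other."""
    return rule_a[:point] + rule_b[point:]

# k = 3, r = 1 rule tables have 3^3 = 27 outcome entries
a = (0, 1, 2) * 9
b = (2, 2, 0) * 9
child = crossover(a, b, 10)
print(child[:12])
```

Varying `point` over all 27 positions gives the many possible recombinations of a single pair of rules mentioned below.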

In general there will be many ways to combine pairs of rules like this. In a direct analogy to our Physics Project, we can represent such “recombinations” as “events” that take two rules and produce one:

The analog of our multiway graph for all possible paths of adaptive evolution by mutations is now what we call in our Physics Project a token-event graph:

In dealing just with mutations we were able to take a single rule and progressively modify it. Now we always have to work with a “population” of rules, combining them two at a time to generate new rules. We can represent conceivable combinations among one set of rules as follows:

There are at this point many different choices we could make about how to set up our model. The particular approach we’ll use selects just *n* of the *n*(*n* – 1)/2 possible combinations:

Then for each of these selected combinations we attempt a crossover, keeping those “children” (drawn here between their parents) that are not rejected as a result of having lower fitness:

Finally, to “maintain our gene pool”, we carry forward parents selected at random, so that we still end up with *n* rules. (And, yes, even though we’ve attempted to make this whole procedure as clean as possible, it’s still a mess—which seems to be inevitable, and which has, as we’ll discuss below, bedeviled computational studies of evolution in the past.)

OK, so what happens when we apply this procedure, say to *k* = 3, *r* = 1 rules? We’ll pick 4 rules at random as our initial population (and, yes, two happen to produce the same pattern):

Then in a sequence of steps we’ll successively pick various combinations:

And here are the distinct “phenotype patterns” produced in this process (note that even though there can be multiple copies of the same phenotype pattern, the underlying genotype rules are always distinct):

As a final form of summarization we can just plot the successive fitnesses of the patterns we generate (with the size of each dot reflecting the number of times a particular fitness occurs):

In this case we reach a steady state after 9 steps. The larger the population the longer the adaptive evolution will typically keep going. Here are a couple of examples with population 10, showing all the patterns obtained:

Showing in each case only the longest-lifetime rule found so far we get:

The results aren’t obviously different from what we were finding with mutation alone—even though now we’ve got a much more complicated model, with a whole population of rules rather than a single rule. (One obvious difference, though, is that here we can end up with overall cycles of populations of rules, whereas in the pure-mutation case that can only happen among fitness-neutral rules.)

Here are some additional examples—now obtained after 500 steps with population 25

and with population 50:

And so far as one can tell, even here there are no substantial differences from what we saw with mutation. Certainly there are detailed features introduced by sexual reproduction and crossover, but for our purposes in understanding the big picture of what’s happening in adaptive evolution it seems sufficient to do as we have done so far, and consider only mutation.

## An Even More Minimal Model

By investigating adaptive evolution in cellular automata we’re already making dramatic simplifications relative, say, to actual biology. But in the effort to understand the essence of phenomena we see, it’s helpful to go even further—and instead of thinking about computational rules and their behavior, just think about vertices on a “mutation graph”, each assigned a certain fitness.

As an example, we can set up a 2D grid, assigning each point a certain random fitness:

And then, starting from a minimum-fitness point, we can follow the same kind of adaptive evolution procedure as above, at each step going to a neighboring point with an equal or greater fitness:

Typically we don’t manage to go far before we get stuck, though with the uniform distribution of fitness values used here, we still usually end on a fairly large fitness value.
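Here is a minimal Python sketch of this procedure (my own stand-in): a random fitness is assigned to each grid point, and the walk moves to a randomly chosen neighbor of equal or greater fitness until none exists (or a step cap is hit):

```python
import random

def hill_climb(grid, start, max_steps=10000, seed=0):
    """Adaptive walk on a grid of fitness values: repeatedly move to a random
    neighboring point of equal or greater fitness; stop at a local maximum."""
    rng = random.Random(seed)
    h, w = len(grid), len(grid[0])
    (x, y), path = start, [start]
    while len(path) < max_steps:
        nbrs = [(x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < h and 0 <= y + dy < w]
        uphill = [p for p in nbrs if grid[p[0]][p[1]] >= grid[x][y]]
        if not uphill:
            break                        # stuck: every neighbor is less fit
        x, y = rng.choice(uphill)
        path.append((x, y))
    return path

rng = random.Random(1)
random_grid = [[rng.random() for _ in range(20)] for _ in range(20)]
path = hill_climb(random_grid, (0, 0))
print(len(path), random_grid[path[-1][0]][path[-1][1]])
```

With a "uniform hill" such as `grid[i][j] = i + j` the walk climbs straight to the top corner; with independent random fitnesses it typically stops after only a handful of moves, just as described above.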

We can summarize the possible paths we can take by the multiway graph:

In our cellular automaton rule space—and, for that matter, in biology—neighboring points don’t just have independent random fitnesses; instead, the fitnesses are determined by a definite computational procedure. So as a simple approximation, we can just take the fitness of each point to be a particular function of its graph coordinates. If the function forms something like a “uniform hill”, then the adaptive evolution procedure will just climb it:

But as soon as the function has “systematic bumpiness” there’s a tremendous tendency to quickly get stuck:

And if there’s some “unexpected spot of high fitness”, adaptive evolution typically won’t find it (and it certainly won’t if it’s surrounded by a lower-fitness “moat”):

So what happens if we increase the dimensionality of the “mutation space” in which we’re operating? Basically it becomes easier to find a path that increases fitness:

And we can see this, for example, if we look at Boolean hypercubes in increasing numbers of dimensions:

But ultimately this relies on the fact that in the neighborhood reachable by mutations from a given point, there’ll be a “sufficiently random” collection of fitness values that it’ll (likely) be possible to find a “direction” that’s “going up” in fitness. Yet this alone won’t in general be enough, because we also need it to be the case that there’s enough regularity in the fitness landscape that we can systematically navigate it to find its maximum—and that the maximum is not somehow “unexpected and isolated”.
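One way to quantify the dimensionality effect is to note that with independent continuous random fitnesses, a point is stuck only if it beats all n of its neighbors, which happens with probability 1/(n + 1) (the point must be the largest of n + 1 i.i.d. values). A quick Monte Carlo check (my own sketch, not from the article):

```python
import random

def stuck_probability(n, trials, seed=0):
    """Estimate the chance that a point with i.i.d. random fitness beats all
    n of its neighbors, i.e. that adaptive evolution is stuck there."""
    rng = random.Random(seed)
    stuck = 0
    for _ in range(trials):
        here = rng.random()
        if all(rng.random() < here for _ in range(n)):
            stuck += 1
    return stuck / trials

for n in (2, 5, 20):
    print(n, round(stuck_probability(n, 20000), 3), 1 / (n + 1))
```

So on a Boolean n-cube with random fitnesses, the fraction of immediately stuck points falls like 1/(n + 1) as the dimension grows, which is the "more directions to choose from" effect in its simplest form.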

## What Can Adaptive Evolution Achieve?

We’ve seen that adaptive evolution can be surprisingly successful at finding cellular automata that produce patterns with long but finite lifetimes. But what about other types of “traits”? What can (and cannot) adaptive evolution ultimately manage to do?

For example, what if we’re trying to find cellular automata whose patterns don’t just live “as long as possible” but instead die after a specific number of steps? It’s clear that within any finite set of rules (say with particular *k* and *r*) there’ll only be a limited collection of possible lifetimes. For symmetric *k* = 2, *r* = 2 rules, for example, the possible lifetimes are:

But as soon as we’re dealing even with *k* = 3, *r* = 1 symmetric rules it’s already in principle possible to get every lifetime up to 100. But what about adaptive evolution? How well does it do at reaching rules with all those lifetimes? Let’s say we do single point mutation as before, but now we “accept” a mutation if it leads not specifically to a larger finite lifetime, but to a lifetime that is closer in absolute magnitude to some desired lifetime. (Strictly, and importantly, in both cases we also allow “fitness-neutral” mutations that leave the lifetime the same.)

Here are examples of what happens if we try to adaptively evolve to get lifetime exactly 50 in *k* = 3, *r* = 1 rules:

It gets close—and sometimes it overshoots—but, at least in these particular examples, it never quite makes it. Here’s what we see if we look at the lifetimes achieved with 100 different random sequences of mutations:

Basically they mostly get stuck at lifetimes close to 50, but not exactly 50. It’s not that *k* = 3, *r* = 1 rules with lifetime exactly 50 don’t exist:

It’s just that our adaptive evolution process usually gets stuck before it reaches rules like these. Even though there’s usually enough “room to maneuver” in *k* = 3, *r* = 1 rule space to get to generally longer lifetimes, there’s not enough to specifically get to lifetime 50.

But what about *k* = 4, *r* = 1 rule space? There are now not 10^{12} but about 10^{38} possible rules. And in this rule space it becomes quite routine to be able to reach lifetime 50 through adaptive evolution:

It can sometimes take a while, but most of the time in this rule space it’s possible to get exactly to lifetime 50:

What happens with other “lifetime goals”? Even symmetric *k* = 3, *r* = 1 rules can achieve many lifetime values:

Indeed, the first “missing” values are 129, 132, 139, etc. And, for example, many multiples of 50 can be achieved:

But it becomes increasingly difficult for adaptive evolution to reach these specific goals. Increasing the size of the rule space always seems to help; so for example with *k* = 4, *r* = 1, if one’s aiming for lifetime 100, the actual distribution of lifetimes reached is:

In general the distribution gets broader as the lifetime sought gets larger:

We saw above that across the whole space of, say, *k* = 4, *r* = 1 rules, the frequency of progressively larger lifetimes falls roughly according to a power law. So this means that the fractional region in rule space that achieves a given lifetime gets progressively smaller—with the result that typically the paths followed by adaptive evolution are progressively more likely to get stuck before they reach it.

OK, so what about other kinds of objectives? Say ones more related to the morphologies of patterns? As a simple example, let’s consider the objective of maximizing the “widths” of finite-lifetime patterns. We can try to achieve this by adaptive evolution in which we reject any mutations that lead to decreased width (where “width” is defined as the maximum horizontal extent of the pattern). And once again this process manages to “discover” all sorts of “mechanisms” for achieving larger widths (here each pattern is labeled by its height—i.e. lifetime—and width):

There are certain structural constraints here. For example, the width can’t be too large relative to the height—because if it’s too large, patterns tend to grow forever.

But what if we specifically try to select for maximal “pattern aspect ratio” (i.e. ratio of width to height)? In essentially every case so far, adaptive evolution has in effect “invented many different mechanisms” to achieve whatever objective we’ve defined. But here it turns out we essentially see “the same idea” being used over and over again—presumably because this is the only way to achieve our objective given the overall structure of how the underlying rules we’re using work:

What if we ask something more specific? Like, say, that the aspect ratio be as close to 3 as possible. Much of the time the “solution” that adaptive evolution finds is correct, if trivial:

But sometimes it finds another solution—and often a surprisingly elaborate and complicated one:

How about if our goal is an aspect ratio of π ≈ 3.14? It turns out adaptive evolution can still do rather well here even just with the symmetric *k* = 3, *r* = 1 rules that we’re using:

We can also ask about properties of the “inside” of the pattern. For example, we can ask to maximize the lengths of uniform runs of nonwhite cells in the center column of the pattern. And, once again, adaptive evolution can successfully lead us to rules (like these random examples) where this is large:

We can go on and get still more detailed, say asking about runs of particular lengths, or the presence or number of particular subpatterns. And eventually—just like when we asked for too long a lifetime—we’ll find that the cases we’re looking for are “too sparse”, and adaptive evolution (at least in a given rule space) won’t be able to find them, even if exhaustive search could still identify at least a few examples.

But just what kinds of objectives (or fitness functions) can be handled how well by adaptive evolution, operating for example on the “raw material” of cellular automata? It’s an important question—an analog of which is also central to the investigation of machine learning. But as of now we don’t really have the tools to address it. It’s somehow reminiscent of asking what kinds of functions can be approximated how well by different methods or basis functions. But it’s more complicated. Solving it, though, would tell us a lot about the “reach” of adaptive evolution processes, not only for biology but also for machine learning.

## What It Means for What’s Going On in Biology

How do biological organisms manage to be the way they are, with all their complex and seemingly clever solutions to such a wide range of challenges? Is it just natural selection that does it, or is there in effect more going on? And if “natural selection does it”, how does it actually manage it?

From the point of view of traditional engineering what we see in biology is often very surprising, and much more complex and “clever” than we’d imagine ever being able to create ourselves. But is the secret of biology in a sense just natural selection? Well, actually, there’s often an analog of natural selection going on even in engineering, as different designs get tried and only some get selected. But at least in traditional engineering a key feature is that one always tries to come up with designs where one can foresee their consequences.

But biology is different. Mutations to genomes just happen, without any notion that their consequences can be foreseen. But still one might assume that—when guided by natural selection—the results wouldn’t be too different from what we’d get in traditional engineering.

But there’s a crucial piece of intuition missing here. And it has to do with how randomly chosen programs behave. We might have assumed (based on our typical experience with programs we explicitly construct for particular purposes) that at least a simple random program wouldn’t ever do anything terribly interesting or complicated.

But the surprising discovery I made in the early 1980s is that this isn’t true. And instead, it’s a ubiquitous phenomenon that in the computational universe of possible programs, one can get immense complexity even from very simple programs. So this means that as mutation operates on a genome, it’s essentially inevitable that it’ll end up sampling programs that show highly complex behavior. At the outset, one might have imagined that such complexity could only be achieved by careful design and would inevitably be at best rare. But the surprising fact is that—because of how things fundamentally work in the computational universe—it’s instead easy to get.

But what does complexity have to do with creating “successful organisms”? To create a “successful organism” that can prosper in a particular environment there fundamentally has to be some way to get to a genome that will “solve the necessary problems”. And this is where natural selection comes in. But the fact that it can work is something that’s not at all obvious.

There are really two issues. The first is whether a program (i.e. genome) even exists that will “solve the necessary problems”. And the second is whether such a program can be found by a “thread” of adaptive evolution that goes only through intermediate states that are “fit enough” to survive. As it turns out, both these issues are related to the same fundamental features of computation—which are also responsible for the ubiquitous appearance of complexity.

Given some underlying framework—like cellular automata, or like the basic apparatus of life—is there some rule that can be implemented in that framework that will achieve some particular (computational) objective? The Principle of Computational Equivalence says that generically the answer will be yes. In effect, given almost any “underlying hardware”, it’ll ultimately be possible to come up with “software” (i.e. a rule) that achieves almost any (“physically possible”) given objective—like growing an organism of at least some kind that can survive in a particular environment. But how can we actually find a rule that achieves this?

In principle we could do exhaustive search. But that will be exponentially difficult—and in all but toy cases will be utterly infeasible in practice. So what about adaptive evolution? Well, that’s the big question. And what we’ve seen here is that—rather surprisingly—simple mutation and selection (i.e. the mechanisms of natural selection) very often provide a dramatic shortcut for finding rules that do what we want.

So why is this? In effect, adaptive evolution is finding a path through rule space that gets to where we want to go. But the surprising part is that it’s managing to do this one step at a time. It’s just trying random variations (i.e. mutations) and as soon as it finds one that’s not a “step down in fitness”, it’ll “take it”, and keep going. At the outset it’s certainly not obvious that this will work. In particular, it could be that at some point there just won’t be any “way forward”. All “directions” will lead only to lower fitness, and in effect the adaptive evolution will get stuck.

But the key observation from the experiments in our simple model here is that this typically doesn’t happen. And there seem to be basically two things going on. The first is that rule space is in effect very high-dimensional. So that means there are “many directions to choose from” in trying to find one that will allow one to “take a step forward”. But on its own this isn’t enough. Because there could be correlations between these directions that would mean that if one’s blocked in one direction one would inevitably be blocked in all others.

So why doesn’t this happen? Well, it seems to be the result of the fundamental computational phenomenon of computational irreducibility. A traditional view based on experience with mathematical science had been that if one knew the underlying rule for a system then this would immediately let one predict what the system would do. But what became clear from my explorations in the 1980s and 1990s is that in the computational universe this generically isn’t true. And instead, that the only way one can systematically find out what most computational systems will do is explicitly to run their rules, step by step, doing in effect the same irreducible amount of computational work that they do.

So if one’s just presented with behavior from the system one won’t be in a position to “decode it” and “see its simple origins”. Unless one’s capable of doing as much computational work as the system itself, one will just have to consider what it’s doing as (more or less) “random”. And indeed this seems to be at the root of many important phenomena, such as the Second Law of thermodynamics. And I also suspect it’s at the root of the effectiveness of adaptive evolution, notably in biology.

Because what computational irreducibility implies is that around every point in rule space there’ll be a certain “effective randomness” to fitnesses one sees. And if there are many dimensions to rule space that means it’s overwhelmingly likely that there’ll be “paths to success” in some directions from that point.

But will the adaptive evolution find them? We’ve assumed that there are a series of mutations to the rule, all happening “at random”. And the point is that if there are *n* elements in the rule, then after some fraction of *n* mutations we should find our “success direction”. (If we were doing exhaustive search, we’d instead have to try about *k*^{*n*} possible rules.)

At the outset it might seem conceivable that the sequence of mutations could somehow “cleverly probe” the structure of rule space, “knowing” what directions would or would not be successful. But the whole point is that going from a rule (i.e. genotype) to its behavior (i.e. phenotype) is generically a computationally irreducible process. So assuming that mutations are generated in a computationally bounded way it’s inevitable that they can’t “break computational irreducibility” and so will “experience” the fitness landscape in rule space as “effectively random”.

OK, but what about “achieving the characteristics an organism needs”? What seems to be critical is that these characteristics are in a sense computationally simple. We want an organism to live long enough, or be tall enough, or whatever. It’s not that we need the organism to perform some specific computationally irreducible task. Yes, there are all sorts of computationally irreducible processes happening in the actual development and behavior of an organism. But as far as biological evolution is concerned all that matters is ultimately some computationally simple measure of fitness. It’s as if biological evolution is—in the sense of my recent observer theory—a computationally bounded observer of underlying computationally irreducible processes.

And to the observer what emerges is the “simple law” of biological evolution, and the idea that, yes, it is possible just by natural selection to successfully generate all sorts of characteristics.

There are all sorts of consequences of this for thinking about biology. For example, in thinking about where complexity in biology “comes from”. Is it “generated by natural selection”, perhaps reflecting the complicated sequence of historical accidents embodied in the particular collection of mutations that occurred? Or is it from somewhere else?

In the picture we’ve developed here it’s basically from somewhere else—because it’s essentially a reflection of computational irreducibility. Having said that, we should remember that the very possibility of being able to have organisms with such a wide range of different forms and functions is a consequence of the universal computational character of their underlying setup, which in turn is closely tied to computational irreducibility.

And it’s in effect because natural selection is so coarse in its operation that it does not somehow avoid the ubiquitous computational irreducibility that exists in rule space, with the result that when we “look inside” biological systems we tend to see computational irreducibility and the complexity associated with it.

Something that we’ve seen over and over again here is that, yes, adaptive evolution manages to “solve a problem”. But its solution looks very complex to us. There might be some “simple engineering solution”—involving, say, a very regular pattern of behavior. But that’s not what adaptive evolution finds; instead it finds something that to us is typically very surprising—very often an “unexpectedly clever” solution in which lots of pieces fit together just right, in a way that our usual “understand-what’s-going-on” engineering practices would never let us invent.

We might not have expected anything like this to emerge from the simple process of adaptive evolution. But—as the models we’ve studied here highlight—it seems to be an inevitable formal consequence of core features of computational systems. And as soon as we recognize that biological systems can be viewed as computational, then it also becomes something inevitable for them—and something we can view as in a sense formally derivable for them.

At the outset we might not have been able to say “what matters” in the emergence of complexity in biology. But from the models we’ve studied, and the arguments we’ve made, we seem to have quite firmly established that it’s a fundamentally computational phenomenon, that relies only on certain general computational features of biological systems, and doesn’t depend on their particular detailed components and structure.

But in the end, how “generic” is the complexity that comes out of adaptive evolution? In other words, if we were to pick programs, say completely at random, how different would the complexity they produce be from the complexity we see in programs that have been adaptively evolved “for a purpose”? The answer isn’t clear—though to know it would provide important foundational input for theoretical biology.

One has the general impression that computational irreducibility is a strong enough phenomenon that it’s the “dominant force” that determines behavior and produces complexity. But there’s still usually something a bit different about the patterns we see from rules we’ve found by adaptive evolution, compared to rules we pick at random. Often there seems to be a certain additional level of “apparent mechanism”. The details still look complicated and in some ways quite random, but there seems to be a kind of “overall orchestration” to what’s going on.

And whenever we can identify such regularities it’s a sign of some kind of computational reducibility. There’s still plenty of computational irreducibility at work. But “high fitness” rules that we find through adaptive evolution typically seem to exhibit traces of their specialness—that manifests in at least a certain amount of computational reducibility.

Whenever we manage to come up with a “narrative explanation” or a “natural law” for something, it’s a sign that we’ve found a pocket of computational reducibility. If we say that a cellular automaton manages to live long because it generates certain robust geometric patterns—or, for that matter, that an organism lives long because it proofreads its DNA—we’re giving a narrative that’s based on computational reducibility.

And indeed whenever we can successfully identify a “mechanism” in our cellular automaton behavior, we’re in effect seeing computational reducibility. But what can we say about the aggregate of a whole collection of mechanisms?

In a different context I’ve discussed the concept of a “mechanoidal phase”, distinguished, say, from solids and liquids by the presence of a “bulk orchestration” of underlying components. It’s something closely related to class 4 behavior. And it’s interesting to note that if we look, for example, at the rules we found from adaptive evolution at the end of the previous section, their evolution from random initial conditions mostly shows characteristic class 4 behavior:

In other words, adaptive evolution is potentially bringing us to “characteristically special” places in rule space—perhaps suggesting that there’s something “characteristically special” about the kind of structures that are produced in biological systems. And if we could find a way to make general statements about that “characteristic specialness” it would potentially lead us to a framework for constructing a new broad formal theory in biology.

## Correspondence with Biological Phenomena

The models we’ve studied here are extremely simple in their basic construction. And at some level it’s remarkable that—without for example including any biophysics or biochemistry—they can get anywhere at all in capturing features of biological systems and biological evolution.

In a sense this is ultimately a reflection of the fundamentally computational character of biology—and the generality of computational phenomena. But it’s very striking that even the patterns of cellular automaton behavior we see look very “lifelike and organic”.

In actual biology even the shortest genomes are vastly longer than the tiny cellular automaton rules we’ve considered. But even by the time we’re looking at the length-27 “genomic sequences” in *k* = 3, *r* = 1 cellular automata, there are already about 7.6 trillion possible sequences, which seems to be enough to see many core “combinatorially driven” biology-like phenomena.
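
The combinatorics behind that count are simple enough to check (the sketch below is in Python, rather than the Wolfram Language used for the diagrams here):

```python
# A k = 3, r = 1 cellular automaton rule assigns one of k colors to each
# possible (2r+1)-cell neighborhood—so the rule is a length-27 "genomic
# sequence" over 3 symbols, and the number of possible rules is 3^27.
k, r = 3, 1
neighborhoods = k ** (2 * r + 1)   # 3^3 = 27 rule cases
rules = k ** neighborhoods         # 3^27 possible "genomes"

print(neighborhoods)  # 27
print(rules)          # 7625597484987, i.e. about 7.6 trillion
```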

The running of a cellular automaton rule might also at first seem far away from the actual processes that create biological organisms—involving as they do things like the construction of proteins and the formation of elaborate functional and spatial structures. But there are more analogies than one might at first imagine. For example, it’s common for only particular cases in the cellular automaton rule to be used in a given region of the pattern that’s formed, much as particular genes are typically turned on in different tissues in biological organisms.

And, for example, the “geometrical restriction” to a simple 1D array of “cells” doesn’t seem to matter much as soon as there’s sophisticated computation going on; we still get lots of structures that are actually surprisingly reminiscent of typical patterns of biological growth.

One of the defining features of biological organisms is their capability for self-reproduction. And indeed if it wasn’t for this kind of “copying” there wouldn’t be anything like adaptive evolution to discuss. Our models don’t attempt to derive self-reproduction; they just introduce it as something built into the models.

And although we’ve considered several variants, we’re basically also just building into our models the idea of mutations. And what we find is that it seems as if single point mutations made one at a time are enough to capture basic features of adaptive evolution.

We’ve also primarily considered what amounts to a single lineage—in which there’s just a single rule (or genome) at a given step. We do mutations, and we mostly “implement natural selection” just by keeping only rules that lead to patterns whose fitness is no less than what we had before.

If we had a whole population of rules it probably wouldn’t be so significant, but in the simple setup we’re using, it turns out to be important that we don’t reject “fitness-neutral mutations”. And indeed we’ve seen many examples where the system wanders around fitness-neutral regions of rule space before finally “discovering” some “innovation” that allows it to increase fitness. The way our models are set up, that “wandering” always involves changes in the “genotype”—but usually at most minor changes in the phenotype. So it’s very typical to see long periods of “apparent equilibrium” in which the phenotype changes rather little, followed by a “jump” to a new fitness level and a rather different phenotype.

And this observation seems quite aligned with the phenomenon of “punctuated equilibrium” often reported in the fossil record of actual biological evolution.
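
The single-lineage scheme described above is, in outline, just an accept/reject loop. Here is a minimal Python sketch—with a deliberately artificial stand-in fitness function and genome encoding (not the cellular automaton lifetime setup from the text), chosen so that the fitness-neutral “wandering” is easy to see:

```python
import random

def adaptive_evolution(genome, fitness, mutate, steps, seed=0):
    """Single-lineage adaptive evolution: accept any mutation whose fitness
    is no less than the current one—so fitness-neutral mutations are kept,
    letting the lineage wander in genome space at constant fitness."""
    rng = random.Random(seed)
    f = fitness(genome)
    trajectory = [f]
    for _ in range(steps):
        candidate = mutate(genome, rng)
        fc = fitness(candidate)
        if fc >= f:                      # keep improving *and* neutral mutations
            genome, f = candidate, fc
        trajectory.append(f)
    return genome, trajectory

# Toy stand-in: only the first half of the genome affects fitness, so point
# mutations in the second half are exactly the "fitness-neutral" changes
# discussed above—genotype changes with essentially no phenotype change.
def toy_fitness(g):
    return sum(g[: len(g) // 2])

def point_mutation(g, rng):
    g = list(g)
    g[rng.randrange(len(g))] = rng.randrange(3)
    return g

genome, traj = adaptive_evolution([0] * 20, toy_fitness, point_mutation, 500)
assert all(a <= b for a, b in zip(traj, traj[1:]))   # fitness never decreases
```

Plotting `traj` for runs like this shows the characteristic shape: long flat stretches (during which the genome is still changing neutrally) punctuated by upward jumps.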

Another key feature of biological organisms and biological evolution is the formation of distinct species, as well as distinct phyla, etc. And indeed we ubiquitously see something that seems to be directly analogous in our multiway graphs of all possible paths of adaptive evolution. Typically we see distinct branches forming, based on what seem like “different mechanisms” for achieving fitness.

No doubt in actual biology there are all sorts of detailed phenomena related to reproductive or spatial isolation. But in our models the core phenomenon that seems to lead to the analog of “branching in the tree of life” is the existence of “distinctly different computational mechanisms” in different parts of rule space. It’s worth noting that at least with our finite rule spaces, branches can die out, with no “successor species” appearing in the multiway graph.

And indeed looking at the actual patterns produced by rules in different parts of the multiway graph it’s easy to imagine morphologically based taxonomic classifications—that would be somewhat, though not perfectly, aligned with the phylogenetic tree defined by actual rule mutations. (At a morphological level we quite often see some level of “convergent evolution” in our multiway graphs; in small examples we sometimes also see actual “genomic convergence”—which will typically be astronomically rare in actual biological systems.)

One of the remarkable features of our models is that they allow quite global investigation of the “overall history” of adaptive evolution. In many of the simple cases we’ve discussed, the rule space we’re using is small enough that in a comparatively modest number of mutation steps we get to the “highest fitness we can reach”. But (as the examples we saw in Turing machines suggest) expanding the size of the rules we’re using even just a little can be expected to be sufficient to allow us to get astronomically further.

And the further we go, the more “mechanisms” will be “invented”. It’s an inevitable feature of systems that involve computational irreducibility that there are new and unpredictable things that will go on showing up forever—along with new pockets of computational reducibility. So even after a few billion years—and the trillion generations and 10^40 or so organisms that have ever lived—there’s still infinitely further for biological evolution to go, and more and more branches to be initiated in the tree of life, involving more and more “new mechanisms”.

I suppose one might imagine that at some point biological organisms would reach a “maximum fitness”, and go no further. But even in our simple model with fitness measured in terms of pattern lifetime, there’ll be no upper limit on fitness; given any particular lifetime, it’s a feature of the fundamental theory of computation that there’ll always be a program that will yield a larger lifetime. Still, one might think, at some point enough is enough: the giraffe’s neck is long enough, etc. But if nothing else, competition between organisms will always drive things forward: yes, a particular lineage of organisms achieved a certain fitness, but then another lineage can come along and get to that fitness too, forcing the first lineage to go even further so as not to lose out.

Of course, in our simple model we’re not explicitly accounting for interactions with other organisms—or for detailed properties of the environment, as well as countless other effects. And no doubt there are many biological phenomena that depend on these effects. But the key point here is that even without explicitly accounting for any of these effects, our simple model still seems to capture many core features of biological evolution. Biological evolution—and, indeed, adaptive evolution in general—is, it seems, fundamentally a computational phenomenon that robustly emerges quite independent of the details of systems.

In the past few years our Physics Project has given strong evidence that the foundations of physics are fundamentally computational—with the core laws of physics arising as inevitable consequences of the way observers like us “parse” the ruliad of all possible computational processes. And what we’ve seen here now suggests that there’s a remarkable commonality between the foundations of physics and biology. Both are anchored in computational irreducibility. And both sample slices of computational reducibility. Physics because that’s what observers like us do to get descriptions of the world that fit in our finite minds. Biology because that’s what biological evolution does in order to achieve the “coarse objectives” set by natural selection.

The intuition of physics tends to be that there are ultimately simple models for things, whereas in biology there’s a certain sense that everything is always almost infinitely complicated, with a new effect to consider at every turn. But presumably that’s in large part because what we study in biology tends to quickly come face to face with computational irreducibility—whereas in physics we’ve been able to find things to study that avoid this. But now the commonality in foundations between physics and biology suggests that there should also be in biology the kind of structure we have in physics—complete with general laws that allow us to make useful, broad statements. And perhaps the simple model I’ve presented here can help lead us there—and in the end help build up a new paradigm for thinking about biology in a fundamentally theoretical way.

## Historical Notes

There’s a long—if circuitous—history to the things I’m discussing here. Basic notions of heredity—particularly for humans—were already widely recognized in antiquity. Plant breeding was practiced from the earliest days of agriculture, but it wasn’t until the late 1700s that any kind of systematic selective breeding of animals began to be commonplace. Then in 1859 Charles Darwin described the idea of “natural selection” whereby competition of organisms in their natural environment could act like artificial selection, and, he posited, would over long periods lead to the development of new species. He ended his *Origin of Species* with the claim that:

> … from the war of nature … the production of the higher animals, directly follows. … and whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.

What he appears to have thought is that there would somehow follow from natural selection a general law—like the law of gravity—that would lead to the evolution of progressively more complex organisms, culminating in the “higher animals”. But absent the kind of model I’m discussing here, nothing in the later development of traditional evolutionary theory really successfully supported this—or was able to give much analysis of it.

Right around the same time as Darwin’s *Origin of Species*, Gregor Mendel began to identify simple probabilistic laws of inheritance—and when his work was rediscovered at the beginning of the 1900s it was used to develop mathematical models of the frequencies of genetic traits in populations of organisms, with key contributions to what became the field of population genetics being made in the 1920s and 1930s by J. B. S. Haldane, R. A. Fisher and Sewall Wright, who came up with the concept of a “fitness landscape”.

On a quite separate track there had been efforts ever since antiquity to classify and understand the growth and form of biological organisms, sometimes by analogy to physical or mathematical ideas—and by the 1930s it seemed fairly clear that chemical messengers were somehow involved in the control of growth processes. But the mathematical methods used for example in population genetics basically only handled discrete traits (or simple numerical ones accessible to biometry), and didn’t really have anything to say about something like the development of complexity in the forms of biological organisms.

The 1940s saw the introduction of what amounted to electrical-engineering-inspired approaches to biology, often under the banner of cybernetics. Idealized neural networks were introduced by Warren McCulloch and Walter Pitts in 1943, and soon the idea emerged (notably in the work of Donald Hebb in 1949) that learning in such systems could occur through a kind of adaptive evolution process. And by the time practical electronic computing began to emerge in the 1950s there was widespread belief that ideas from biology—including evolution—would be useful as an inspiration for what could be done. Often what would now just be described as adaptive algorithms were couched in biological evolution terms. And even when iterative methods were used for optimization (say in industrial production or engineering design) they were sometimes presented as being grounded in biological evolution.

Meanwhile, by the 1960s, there began to be what amounted to Monte Carlo simulations of population-genetics-style evolutionary processes. A particularly elaborate example was work by Nils Barricelli on what he called “numeric evolution” in which a fairly complicated numerical-cellular-automaton-like “competition between organisms” program with “randomness” injected from details of data layout in computer memory showed what he claimed to be biological-evolution-like phenomena (such as symbiosis and parasitism).

In a different direction there was an attempt—notably by John von Neumann—to “mathematicize the foundations of biology” leading by the late 1950s to what we’d now call 2D cellular automata “engineered” in complicated ways to show phenomena like self-reproduction. The followup to this was mostly early-theoretical-computer-science work, with no particular connection to biology, and no serious mention of adaptive evolution. When the Game of Life was introduced in 1970 it was widely noted as “doing lifelike things”, but essentially no scientific work was done in this direction. By the 1970s, though, L-systems and fractals had introduced the idea of recursive tree-like or nested structures that could be generated by simple algorithms and rendered by computer graphics—and seemed to give forms close to some seen in biology. My own work on 1D cellular automata (starting in 1981) focused on systematic scientific investigation of simple programs and what they do—with the surprising conclusion that even very simple programs can produce highly complex behavior. But while I saw this as informing the generation of complexity in things like the growth of biological organisms, I didn’t at the time (as I’ll describe below) end up seriously exploring any adaptive evolution angles.

Still another thread of development concerned applying biological-like evolution not just to parameters but to operations in programs. And for example in 1958 Richard Friedberg at IBM tried making random changes to instructions in machine-code programs, but didn’t manage to get this to do much. (Twenty years later, superoptimizers in practical compilers did begin to successfully use such techniques.) Then in the 1960s, John Holland (who had at first studied learning in neural nets, and was then influenced by Arthur Burks who had worked on cellular automata with von Neumann) suggested representing what amounted to programs by simple strings of symbols that could readily be modified like genomic sequences. The typical idea was to interpret the symbols as computational operations—and to assign a “fitness” based on the outcome of those operations. A “genetic algorithm” could then be set up by having a population of strings that was adaptively evolved. Through the 1970s and 1980s occasional practical successes were reported with this approach, particularly in optimization and data classification—with much being made of the importance of sexual-reproduction-inspired crossover operations. (Something that began to be used in the 1980s was the much simpler approach of simulated annealing—that involves randomly changing values rather than programs.)

By the beginning of the 1980s the idea had also emerged of adaptively modifying the structure of mathematical expressions—and of symbolic expressions representing programs. There were notable applications in computer graphics (e.g. by Karl Sims) as well as to things like the 1984 *Core War* “game” involving competition between programs in a virtual machine. In the 1990s John Koza was instrumental in developing the idea of “genetic programming”, notably as a way to “automatically create inventions”, for example in areas like circuit and antenna design. And indeed to this day scattered applications of these methods continue to pop up, particularly in geometrical and mechanical design.

From the very beginning there’d been controversy around Darwin’s ideas about evolution. First, there was the issue of conflict with religious accounts of creation. But there were also—often vigorous—disagreements within the scientific community about the interpretation of the fossil record and about how large-scale evolution was really supposed to operate. A notable issue—still very active in the 1980s—was the relationship between the “freedom of evolution” and the constraints imposed by the actual dynamics of growth in organisms (and interactions between organisms). And despite much insistence that the only reasonable “scientific” (as opposed to religious) point of view was that “natural selection is all there is”, there were nagging mysteries that suggested there must be other forces at work.

Building on the possibilities of computer experimentation (as well as things like my work on cellular automata) there emerged in the mid-1980s, particularly through the efforts of Chris Langton, a focus on investigating computational models of “artificial life”. This resulted in all sorts of simulations of ecosystems, etc. that did produce a variety of evolution-related phenomena known from field biology—but typically the models were far too complex in their structure for it to be possible to extract fundamental conclusions from them. Still, there continued to be specific, simpler experiments. For example, in 1986, for his book *The Blind Watchmaker*, Richard Dawkins made pictures of what he called “biomorphs”, produced by adaptively adjusting parameters for a simple tree-growth algorithm based on the overall shapes generated.

In the 1980s, stimulated by my work, there were various isolated studies of “rule evolution” in cellular automata (as well as art and museum exhibits based on this), and in the 1990s there was more systematic work—notably by Jim Crutchfield and Melanie Mitchell—on using genetic algorithms to try to evolve cellular automaton rules to solve tasks like density classification. (Around this time “evolutionary computation” also began to emerge as a general term covering genetic algorithms and other usually-biology-inspired adaptive computational methods.)

Meanwhile, accelerating in the 1990s, there was great progress in understanding actual molecular mechanisms in biology, and in figuring out how genetic and developmental processes work. But even as huge amounts of data accumulated, enthusiasm for large-scale “theories of biology” (that might for example address the production of complexity in biological evolution) seemed to wane. (The discipline of systems biology attempted to develop specific, usually mathematical, models for biological systems—but there never emerged much in the way of overarching theoretical principles, except perhaps, somewhat specifically, in areas like immunology and neuroscience.)

One significant exception in terms of fundamental theory was Greg Chaitin’s concept from around 2010 of “metabiology”: an effort (see below) to use ideas from the theory of computation to understand very general features of the evolution of programs and relate them to biological evolution.

Starting in the 1950s another strand of development (sometimes viewed as a practical branch of artificial intelligence) involved the idea of “machine learning”. Genetic algorithms were one of half a dozen common approaches. Another was based on artificial neural nets. For decades machine learning languished as a somewhat esoteric field, dominated by engineering solutions that would occasionally deliver specific application results. But then in 2011 there was unexpectedly dramatic success in using neural nets for image identification, followed in subsequent years by successes in other areas, and culminating in 2022 with the arrival of large language models and ChatGPT.

What hadn’t been anticipated was that the behavior of neural nets can change a lot if they’re given sufficiently huge amounts of training. But there isn’t any good understanding of just why this is so, and just how successful neural nets can be at what kinds of tasks. Ever since the 1940s it has been recognized that there are relations between biological evolution and learning in neural nets. And having now seen the impressive things neural nets can do, it seems worthwhile to look again at what happens in biological evolution—and to try to understand why it works, not least as a prelude to understanding more about neural nets and machine learning.

## Personal Notes

It’s strange to say, but most of what I’ve done here I should really have done forty years ago. And I almost did. Except that I didn’t try quite the right experiments. And I didn’t have the intuition to think that it was worth trying more.

Forty years later, I have new intuition, particularly informed by experience with modern machine learning. But even now, what made possible what I’ve done here was a chance experiment done for a somewhat different purpose.

Back in 1981 I had become very interested in the question of how complexity arises in the natural world, and I was trying to come up with models that might capture this. Meanwhile, I had just finished Version 1.0 of SMP, the forerunner to Mathematica and the Wolfram Language—and I was wondering how one might generalize its pattern-matching paradigm to “general AI”.

As it happened, right around that time, neural nets gained some (temporary) popularity. And seeing them as potentially relevant to both my topics I started simulating them and trying to see what kind of general theory I could develop about them. But I found them frustrating to work with. There seemed to be too many parameters and details to get any clear conclusions. And, at a practical level, I couldn’t get them to do anything particularly useful.

I decided that for my science question I needed to come up with something much simpler. And as a kind of minimal merger of spin systems and neural nets I ended up inventing cellular automata (only later did I discover that versions of them had been invented several times before).

As soon as I started doing experiments on them, I discovered that cellular automata were a window into an amazing new scientific world—that I have continued to explore in one way or another ever since. My key methodology, at least at first, was just to enumerate the simplest possible cellular automaton rules, and see what they did. The diversity—and complexity—of their behavior was remarkable. But the simplicity of the rules meant that the details of “successive rules” were usually fairly different—and while there were common themes in their overall behavior, there didn’t seem to be any particular structure to “rule space”. (Occasionally, though, particularly in finding examples for exposition, I would look at slightly more complicated and “multicolored” rules, and I certainly anecdotally noticed that rules with nearby rule numbers often had definite similarities in their behavior.)

It so happened that around the time I started publishing about cellular automata in 1983 there was a fair amount of ambient interest in theoretical biology. And (perhaps in part because of the “cellular” in “cellular automata”) I was often invited to theoretical biology conferences. People would sometimes ask about adaptation in cellular automata, and I would usually just emphasize what individual cellular automata could do, without any adaptation, and what significance it might have for the development of organisms.

But in 1985 I was going to a conference (at Los Alamos) on “Evolution, Games and Learning” and I decided I should take a look at the relation of these topics to cellular automata. But, too quickly, I segued away from investigating adaptation to trying to see what kind of pattern matching and other operations cellular automata could be explicitly set up to do:

Many aspects of this paper still seem quite modern (and in fact should probably be investigated more now!). But—even though I absolutely had the tools to do it—I simply failed at that time to explore what I’ve now explored here.

Back in 1984 Problem 7 in my “Twenty Problems in the Theory of Cellular Automata” was “How is different behavior distributed in the space of cellular automaton rules?” And over the years I’d occasionally think about “cellular automaton rule space”, wondering, for example, what kind of geometry it might have, particularly in the continuum limit of infinitely large rules.

By the latter half of the 1980s “theoretical biology” conferences had segued to “artificial life” ones. And when I went to such conferences I was often frustrated. People would show me simulations that seemed to have far too many parameters to ever be able to conclude much. People would also often claim that natural selection was a “very simple theory”, but as soon as it was “implemented” there’d be all kinds of issues—and choices to be made—about population sizes, fitness cutoffs, interactions between organisms, and so on. And the end result was usually a descent into some kind of very specific simulation without obvious robust implications.

(In the mid-1980s I put a fair amount of effort into developing both the content and the organization of a new direction in science that I called “complex systems research”. My emphasis was on systems—like cellular automata—that had definite simple rules but highly complex behavior. Gradually, though, “complexity” started to be a popular general buzzword, and—I suspect partly to distinguish themselves from my efforts—some people started emphasizing that they weren’t just studying complex systems, they were studying complex adaptive systems. But all too often this seemed mostly to provide an excuse to dilute the clarity of what could be studied—and I was sufficiently put off that I paid very little attention.)

By the mid-1990s, I was in the middle of writing *A New Kind of Science*, and I wanted to use biology as an example application of my methodology and discoveries in the computational universe. In a section entitled “Fundamental Issues in Biology” I argued (as I have here) that computational irreducibility is a fundamentally stronger force than natural selection, and that when we see complexity in biology it’s most likely of “computational origin” rather than being “sculpted” by natural selection. And as part of that discussion, I included a picture of the “often-somewhat-gradual changes” in behavior that one sees with successive 1-bit changes in a *k* = 3, *r* = 1 rule:

This wasn’t done adaptively; it was basically just looking along a “random straight line” in rule space. And indeed both here and in most of the book, I was concerned with what systems like cellular automata “naturally do”, not what they can be constructed (or adaptively evolved) to do. I did give “constructions” of how cellular automata can perform particular computational tasks (like generating primes), and, somewhat obscurely, in a section on “Intelligence in the Universe” I explored finding *k* = 3, *r* = 1 rules that can successfully “double their input” (my reason for discussing these rules was to highlight the difficulty of saying whether one of these cellular automata was “constructed for a purpose” or was just “doing what it does”):

Many years went by. There’d be an occasional project at our Summer School about rule space, and occasionally about adaptation. I maintained an interest in foundational questions in biology, gradually collecting information and sometimes giving talks about the subject. Meanwhile—though I didn’t particularly internalize the connection then—by the mid-2010s, through our practical work on it in the Wolfram Language, I’d gotten quite up to speed with modern machine learning. Around the same time I also heard from my friend Greg Chaitin about his efforts (as he put it) to “prove Darwin” using the kind of computational ideas he’d applied in thinking about the foundations of mathematics.

Then in 2020 came our Physics Project, with its whole formalism around things like multiway graphs. It didn’t take long to realize that, yes, what I was calling “multicomputation” wasn’t just relevant for fundamental physics; it was something quite general that could be applied in many areas, which by 2021 I was trying to catalog:

I did some thinking about each of these. The one I tackled most seriously first was metamathematics, about which I finished a book in 2022. Late that year I was finishing a (50-year-in-gestation) project—informed by our Physics Project—on understanding the Second Law of thermodynamics, and as part of this I made what I thought was some progress on thinking about the fundamental character of biological systems (though not their adaptive evolution).

And then ChatGPT arrived. And in addition to being involved with it technologically, I started to think about the science of it, and particularly about how it could work. Part of it seemed to have to do with unrecognized regularities in human language, but part of it was a reflection of the emerging “meta discovery” that somehow if you “bashed” a machine learning system hard enough, it seemed like it could manage to learn almost anything.

But why did this work? At first I thought it must just be an “obvious” consequence of high dimensionality. But I soon realized there was more to it. And as part of trying to understand the boundaries of what’s possible I ended up a couple of months ago writing a piece exploring “Can AI Solve Science?”:

I talked about different potential objectives for science (making predictions, generating narrative explanations, etc.). And deep inside the piece I had a section entitled “Exploring Spaces of Systems” in which I talked about science problems of the form “Can one find a system that does X?”—and asked whether systems like neural nets could somehow let one “jump ahead” in what would otherwise be huge exhaustive searches. As a sideshow to this, I thought it might be interesting to compare with what a non-neural-net adaptive evolution process could do.

Remembering Greg Chaitin’s ideas about connecting the halting problem to biological evolution, I wondered if perhaps one could just adaptively evolve cellular automaton rules to find ones that generated a pattern with a particular finite lifetime. I imagined it as a classic machine learning problem, with a “loss function” one needed to minimize.
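The original three lines of Wolfram Language code aren’t reproduced here, but the procedure they implement—single-point mutations, accepted whenever the loss |lifetime − target| doesn’t increase and the lifetime stays finite—can be sketched in Python. Everything below (the function names, the choice of cutoff beyond which a pattern counts as “infinite”, the seed color) is an illustrative assumption, not the original code:

```python
import random

K = 4            # number of cell colors, as in the k = 4 experiment
MAX_STEPS = 100  # cutoff: patterns surviving past this count as "infinite"

def lifetime(rule, max_steps=MAX_STEPS):
    """Steps until the pattern grown from a single seed cell dies out,
    or None if it survives past max_steps (treated as infinite)."""
    cells = {0: 1}  # position -> color; single nonzero seed (color assumed)
    for step in range(1, max_steps + 1):
        new = {}
        lo, hi = min(cells), max(cells)
        for pos in range(lo - 1, hi + 2):  # nearest-neighbor (r = 1) rule
            nbhd = tuple(cells.get(pos + d, 0) for d in (-1, 0, 1))
            c = rule.get(nbhd, 0)
            if c:
                new[pos] = c
        if not new:
            return step
        cells = new
    return None

def adapt(target, tries=10000, seed=0):
    """Hill climb by single point mutations, starting from the null rule.
    Accept a mutation iff the lifetime stays finite and the loss
    |target - lifetime| does not increase."""
    rng = random.Random(seed)
    rule = {}                # the null rule: every neighborhood maps to 0
    best = lifetime(rule)    # the null rule dies out after 1 step
    loss = abs(target - best)
    for _ in range(tries):
        if loss == 0:
            break
        nbhd = tuple(rng.randrange(K) for _ in range(3))
        if nbhd == (0, 0, 0):
            continue         # keep the background quiescent
        old = rule.get(nbhd, 0)
        rule[nbhd] = rng.randrange(K)  # point mutation
        t = lifetime(rule)
        if t is not None and abs(target - t) <= loss:
            best, loss = t, abs(target - t)   # accept
        else:
            rule[nbhd] = old                  # reject: revert the mutation
    return rule, best

rule, best = adapt(target=50)
print(best)
```

By construction the accepted lifetime is always finite and the loss never increases, so the climb can only move toward (or drift neutrally around) the target; whether it actually reaches lifetime 50 depends on the random mutation sequence.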

And so it was that just after 1 am on February 22 I wrote three lines of Wolfram Language code—and tried the experiment:

And it worked! I managed to find cellular automaton rules that would generate patterns living exactly 50 steps:

In retrospect, I was slightly lucky. First, that this ended up being such a simple experiment to try (at least in the Wolfram Language) that I did it even though I didn’t really expect it to work. And second, that for my very first experiment I picked parameters that happened to immediately work (*k* = 4, lifetime 50, etc.).

But, yes, I could in principle have done the same experiment 40 years ago, though without the Wolfram Language it wouldn’t have been so easy. Still, the computers I had back then were powerful enough that I could in principle have generated the same results then as now. But without my modern experience of machine learning I don’t think I would have tried—and I would certainly have given up too easily. And, yes, it’s a little humbling to realize that I’ve gone so many years assuming adaptive evolution was out of the reach of simple, clean experiments. But it’s satisfying now to be able to check off another mystery I’ve long wondered about. And to think that much more about the foundations of biology—and machine learning—might finally be within reach.

## Thanks

Thanks to Brad Klee, Nik Murzin and Richard Assar for their help.

The specific results and ideas I’ve presented here are mostly very recent, but they build on background conversations I’ve had—some recently, some more than 40 years ago—with many people, including: Sydney Brenner, Greg Chaitin, Richard Dawkins, David Goldberg, Nigel Goldenfeld, Jack Good, Jonathan Gorard, Stephen J. Gould, Hyman Hartman, John Holland, Christian Jacob, Stuart Kauffman, Mark Kotanchek, John Koza, Chris Langton, Katja Della Libera, Aristid Lindenmayer, Pattie Maes, Bill Mydlowec, John Novembre, Pedro de Oliveira, George Oster, Norman Packard, Alan Perelson, Thomas Ray, Philip Rosedale, Robert Rosen, Terry Sejnowski, Brian Silverman, Karl Sims, John Maynard Smith, Catherine Wolfram, Christopher Wolfram and Elizabeth Wolfram.