Stephen Wolfram Writings

What If We Had Bigger Brains? Imagining Minds beyond Ours

Stephen Wolfram — Wed, 21 May 2025 14:28:31 +0000

Cats Don’t Talk

We humans have perhaps 100 billion neurons in our brains. But what if we had many more? Or what if the AIs we built effectively had many more? What kinds of things might then become possible? At 100 billion neurons, we know, for example, that compositional language of the kind we humans use is possible. At the 100 million or so neurons of a cat, it doesn’t seem to be. But what would become possible with 100 trillion neurons? And is it even something we could imagine understanding?

My purpose here is to start exploring such questions, informed by what we’ve seen in recent years in neural nets and LLMs, as well as by what we now know about the fundamental nature of computation, and about neuroscience and the operation of actual brains (like the one that’s writing this, imaged here):

One suggestive point is that as artificial neural nets have gotten bigger, they seem to have successively passed a sequence of thresholds in capability:

So what’s next? No doubt there’ll be things like humanoid robotic control that have close analogs in what we humans already do. But what if we go far beyond the ~10¹⁴ connections that our human brains have? What qualitatively new kinds of capabilities might there then be?

If this was about “computation in general” then there wouldn’t really be much to talk about. The Principle of Computational Equivalence implies that beyond some low threshold computational systems can generically produce behavior that corresponds to computation that’s as sophisticated as it can ever be. And indeed that’s the kind of thing we see both in lots of abstract settings, and in the natural world.

But the point here is that we’re not dealing with “computation in general”. We’re dealing with the kinds of computations that brains fundamentally do. And the essence of these seems to have to do with taking in large amounts of sensory data and then coming up with what amount to decisions about what to do next.

It’s not obvious that there’d be any reasonable way to do this. The world at large is full of computational irreducibility—where the only general way to work out what will happen in a system is just to run the underlying rules for that system step by step and see what comes out:

And, yes, there are plenty of questions and issues for which there’s essentially no choice but to do this irreducible computation—just as there are plenty of cases where LLMs need to call on our Wolfram Language computation system to get computations done. But brains, for the things most important to them, somehow seem to routinely manage to “jump ahead” without in effect simulating every detail. And what makes this possible is the fundamental fact that within any system that shows overall computational irreducibility there must inevitably be an infinite number of “pockets of computational reducibility”, in effect associated with “simplifying features” of the behavior of the system.

It’s these “pockets of reducibility” that brains exploit to be able to successfully “navigate” the world for their purposes in spite of its “background” of computational irreducibility. And in these terms things like the progress of science (and technology) can basically be thought of as the identification of progressively more pockets of computational reducibility. And we can then imagine that the capabilities of bigger brains could revolve around being able to “hold in mind” more of these pockets of computational reducibility.

We can think of brains as fundamentally serving to “compress” the complexity of the world, and extract from it just certain features—associated with pockets of reducibility—that we care about. And for us a key manifestation of this is the idea of concepts, and of language that uses them. At the level of raw sensory input we might see many detailed images of some category of thing—but language lets us describe them all just in terms of one particular symbolic concept (say “rock”).

In a rough first approximation, we can imagine that there’s a direct correspondence between concepts and words in our language. And it’s then notable that human languages all tend to have perhaps 30,000 common words (or word-like constructs). So is that scale the result of the size of our brains? And could bigger brains perhaps deal with many more words, say millions or more?

“What could all those words be about?” we might ask. After all, our everyday experience makes it seem like our current 30,000 words are quite sufficient to describe the world as it is. But in some sense this is circular: we’ve invented the words we have because they’re what we need to describe the aspects of the world we care about, and want to talk about. There will always be more features of, say, the natural world that we could talk about. It’s just that we haven’t chosen to engage with them. (For example, we could perfectly well invent words for all the detailed patterns of clouds in the sky, but those patterns are not something we currently feel the need to talk in detail about.)

But given our current set of words or concepts, is there “closure” to it? Can we successfully operate in a “self-consistent slice of concept space” or will we always find ourselves needing new concepts? We might think of new concepts as being associated with intellectual progress that we choose to pursue or not. But insofar as the “operation of the world” is computationally irreducible it’s basically inevitable that we’ll eventually be confronted with things that cannot be described by our current concepts.

So why is it that the number of concepts (or words) isn’t just always increasing? A fundamental reason is abstraction. Abstraction takes collections of potentially large numbers of specific things (“tiger”, “lion”, …) and allows them to be described “abstractly” in terms of a more general thing (say, “big cats”). And abstraction is useful if it’s possible to make collective statements about those general things (“all big cats have…”), in effect providing a consistent “higher-level” way of thinking about things.

If we imagine concepts as being associated with particular pockets of reducibility, the phenomenon of abstraction is then a reflection of the existence of networks of these pockets. And, yes, such networks can themselves show computational irreducibility, which can then have its own pockets of reducibility, etc.

So what about (artificial) neural nets? It’s routine to “look inside” these, and for example see the possible patterns of activation at a given layer based on a range of possible (“real-world”) inputs. We can then think of these patterns of activation as forming points in a “feature space”. And typically we’ll be able to see clusters of these points, which we can potentially identify as “emergent concepts” that we can view as having been “discovered” by the neural net (or rather, its training). Normally there won’t be existing words in human languages that correspond to most of these concepts. They represent pockets of reducibility, but not ones that we’ve identified, and that are captured by our typical 30,000 or so words. And, yes, even in today’s neural nets, there can easily be millions of “emergent concepts”.

But will these be useful abstractions or concepts, or merely “incidental examples of compression” not connected to anything else? The construction of neural nets implies that a pattern of “emergent concepts” at one layer will necessarily feed into the next layer. But the question is really whether the concept can somehow be useful “independently”—not just at this particular place in the neural net.

And indeed the most obvious everyday use for words and concepts—and language in general—is for communication: for “transferring thoughts” from one mind to another. Within a brain (or a neural net) there are all kinds of complicated patterns of activity, different in each brain (or each neural net). But a fundamental role that concepts, words and language play is to define a way to “package up” certain features of that activity in a form that can be robustly transported between minds, somehow inducing “comparable thoughts” in all of them.

The transfer from one mind to another can never be precise: in going from the pattern of activity in one brain (or neural net) to the pattern of activity in another, there’ll always be translation involved. But—at least up to a point—one can expect that the “more that’s said” the more faithful a translation can be.

But what if there’s a bigger brain, with more “emergent concepts” inside? Then to communicate about them at a certain level of precision we might need to use more words—if not a fundamentally richer form of language. And, yes, while dogs seem to understand isolated words (“sit”, “fetch”, …), we, with our larger brains, can deal with compositional language in which we can in effect construct an infinite range of meanings by combining words into phrases, sentences, etc.

At least as we currently imagine it, language defines a certain model of the world, based on some finite collection of primitives (words, concepts, etc.). The existence of computational irreducibility tells us that such a model can never be complete. Instead, the model has to “approximate things” based on the “network of pockets of reducibility” that the primitives in the language effectively define. And insofar as a bigger brain might in essence be able to make use of a larger network of pockets of reducibility, it can then potentially support a more precise model of the world.

And it could then be that if we look at such a brain and what it does, it will inevitably seem closer to the kind of “incomprehensible and irreducible computation” that’s characteristic of so many abstract systems, and systems in nature. But it could also be that in being a “brain-like construct” it’d necessarily tap into computational reducibility in such a way that—with the formalism and abstraction we’ve built—we’d still meaningfully be able to talk about what it can do.

At the outset we might have thought any attempt for us to “understand minds beyond ours” would be like asking a cat to understand algebra. But somehow the universality of the concepts of computation that we now know—with their ability to address the deepest foundations of physics and other fields—makes it seem more plausible we might now be in a position to meaningfully discuss minds beyond ours. Or at least to discuss the rather more concrete question of what brains like ours, but bigger than ours, might be able to do.

How Brains Seem to Work

As we’ve mentioned, at least in a rough approximation, the role of brains is to turn large amounts of sensory input into small numbers of decisions about what to do. But how does this happen?

Human brains continually receive input from a few million “sensors”, mostly associated with photoreceptors in our eyes and touch receptors in our skin. This input is processed by a total of about 100 billion neurons, each responding in a few milliseconds, and mostly organized into a handful of layers. There are altogether perhaps 100 trillion connections between neurons, many quite long range. At any given moment, a few percent of neurons (i.e. perhaps a billion) are firing. But in the end, all that activity seems to feed into particular structures in the lower part of the brain that in effect “take a majority vote” a few times a second to determine what to do next—in particular with the few hundred “actuators” our bodies have.

This basic picture seems to be more or less the same in all higher animals. The total number of neurons scales roughly with the number of “input sensors” (or, in a first approximation, the surface area of the animal—i.e. volume^2/3—which determines the number of touch sensors). The fraction of brain volume that consists of connections (“white matter”) as opposed to main parts of neurons (“gray matter”) increases as a power of the number of neurons. The largest brains—like ours—have a roughly nested pattern of folds that presumably reduce average connection lengths. Different parts of our brains have characteristic functions (e.g. motor control, handling input from our eyes, generation of language, etc.), although there seems to be enough universality that other parts can usually learn to take over if necessary. And in terms of overall performance, animals with smaller brains generally seem to react more quickly to stimuli.

So what was it that made brains originally arise in biological evolution? Perhaps it had to do with giving animals a way to decide where to go next as they moved around. (Plants, which don’t move around, don’t have brains.) And perhaps it’s because animals can’t “go in more than one direction at once” that brains seem to have the fundamental feature of generating a single stream of decisions. And, yes, this is probably why we have a single thread of “conscious experience”, rather than a whole collection of experiences associated with the activities of all our neurons. And no doubt it’s also what we leverage in the construction of language—and in communicating through a one-dimensional sequence of tokens.

It’s notable how similar our description of brains is to the basic operation of large language models: an LLM processes input from its “context window” by feeding it through large numbers of artificial neurons organized in layers—ultimately taking something like a majority vote to decide what token to generate next. There are differences, however, most notably that whereas brains routinely intersperse learning and thinking, current LLMs separate training from operation, in effect “learning first” and “thinking later”.

But almost certainly the core capabilities of both brains and neural nets don’t depend much on the details of their biological or architectural structure. It matters that there are many inputs and few outputs. It matters that there’s irreducible computation inside. It matters that the systems are trained on the world as it is. And, finally, it matters how “big” they are, in effect relative to the “number of relevant features of the world”.

In artificial neural nets, and presumably also in brains, memory is encoded in the
strengths (or “weights”) of connections between neurons. And at least in neural nets it seems that the number of tokens (of textual data) that can reasonably be “remembered” is a few times the number of weights. (With current methods, the number of computational operations of training needed to achieve this is roughly the product of the total number of weights and the total number of tokens.) If there are too few weights, what happens is that the “memory” gets fuzzy, with details of the fuzziness reflecting details of the structure of the network.

But what’s crucial—for both neural nets and brains—is not so much to remember specifics of training data, but rather to just “do something reasonable” for a wide range of inputs, regardless of whether they’re in the training data. Or, in other words, to generalize appropriately from training data.

But what is “appropriate generalization”? As a practical matter, it tends to be “generalization that aligns with what we humans would do”. And it’s then a remarkable fact that artificial neural nets with fairly simple architectures can successfully do generalizations in a way that’s roughly aligned with human brains. So why does this work? Presumably it’s because there are universal features of “brain-like systems” that are close enough between human brains and neural nets. And once again it’s important to emphasize that what’s happening in both cases seems distinctly weaker than “general computation”.

A feature of “general computation” is that it can potentially involve unbounded amounts of time and storage space. But both brains and typical neural nets have just a fixed number of neurons. And although both brains and LLMs in effect have an “outer loop” that can “recycle” output to input, it’s limited.

And at least when it comes to brains, a key feature associated with this is the limit on “working memory”, i.e. memory that can readily be both read and written “in the course of a computation”. Bigger and more developed brains typically seem to support larger amounts of working memory. Adult humans can remember perhaps 5 or 7 “chunks” of data in working memory; for young children, and other animals, it’s less. Size of working memory (as we’ll discuss later) seems to be important in things like language capabilities. And the fact that it’s limited is no doubt one reason we can’t generally “run code in our brains”.

As we try to reflect on what our brains do, we’re most aware of our stream of conscious thought. But that represents just a tiny fraction of all our neural activity. Most of the activity is much less like “thought” and much more like typical processes in nature, with lots of elements seemingly “doing their own thing”. We might think of this as an “ocean of unconscious neural activity”, from which a “thread of consensus thought” is derived. Usually—much like in an artificial neural net—it’s difficult to find much regularity in that “unconscious activity”. Though when one trains oneself enough to get to the point of being able to “do something without thinking about it”, that presumably happens by organizing some part of that activity.

There’s always a question of what kinds of things we can learn. We can’t overcome computational irreducibility. But how broadly can we handle what’s computationally reducible? Artificial neural nets show a certain genericity in their operation: although some specific architectures are more efficient than others, it doesn’t seem to matter much whether the input they’re fed is images or text or numbers, or whatever. And for our brains it’s probably the same—though what we’ve normally experienced, and learned from, are the specific kinds of input the come from our eyes, ears, etc. And from these, we’ve ended up recognizing certain types of regularities—that we’ve then used to guide our actions, set up our environment, etc.

And, yes, this plugs into certain pockets of computational reducibility in the world. But there’s always further one could go. And how that might work with brains bigger than ours is at the core of what we’re trying to discuss here.

Language and Beyond

At some level we can view our brains as serving to take the complexity of the world and extract from it a compressed representation that our finite minds can handle. But what is the structure of that representation? A central aspect of it is that it ignores many details of the original input (like particular configurations of pixels). Or, in other words, it effectively equivalences many different inputs together.

But how then do we describe that equivalence class? Implementationally, say in a neural net, the equivalence class might correspond to an attractor to which many different initial conditions all evolve. In terms of the detailed pattern of activity in the neural net the attractor will typically be very hard to describe. But on a larger scale we can potentially just think of it as some kind of robust construct that represents a class of things—or what in terms of our process of thought we might describe as a “concept”.

At the lowest level there’s all sorts of complicated neural activity in our brains—most of it mired in computational irreducibility. But the “thin thread of conscious experience” that we extract from this we can for many purposes treat as being made up of higher-level “units of thought”, or essentially “discrete concepts”.

And, yes, it’s certainly our typical human experience that robust constructs—and particularly ones from which other constructs can be built—will be discrete. In principle one can imagine that there could be things like “robust continuous spaces of concepts” (“cat and dog and everything in between”). But we don’t have anything like the computational paradigm that shows us a consistent universal way that such things could fit together (there’s no robust analog of computation theory for real numbers, for example). And somehow the success of the computational paradigm—potentially all the way down to the foundations of the physical universe—doesn’t seem to leave much room for anything else.

So, OK, let’s imagine that we can represent our thread of conscious experience in terms of concepts. Well, that’s close to saying that we’re using language. We’re “packaging up” the details of our neural activity into “robust elements” which we can think of as concepts—and which are represented in language essentially by words. And not only does this “packaging” into language give a robust way for different brains to communicate; it also gives a single brain a robust way to “remember” and “redeploy” thoughts.

Within one brain one could imagine that one might be able to remember and “think” directly in terms of detailed low-level neural patterns. But no doubt the “neural environment” inside a brain is continually changing (not least because of its stream of sensory input). And so the only way to successfully “preserve a thought” across time is presumably to “package it up” in terms of robust elements, or essentially in terms of language. In other words, if we’re going to be able to consistently “think a particular thought” we probably have to formulate it in terms of something robust—like concepts.

But, OK, individual concepts are one thing. But language—or at least human language—is based on putting together concepts in structured ways. One might take a noun (“cat”) and qualify it with an adjective (“black”) to form a phrase that’s in effect a finer-grained version of the concept represented by the noun. And in a rough approximation one can think of language as formed from trees of nested phrases like this. And insofar as the phrases are independent in their structure (i.e. “context free”), we can parse such language by recursively understanding each phrase in turn—with the constraint that we can’t do it if the nesting goes too deep for us to hold the necessary stack of intermediate steps in our working memory.

An important feature of ordinary human language is that it’s ultimately presented in a sequential way. Even though it may consist of a nested tree of phrases, the words that are the leaves of that tree are spoken or written in a one-dimensional sequence. And, yes, the fact that this is how it works is surely closely connected to the fact that our brains construct a single thread of conscious experience.

In the actuality of the few thousand human languages currently in use, there is considerable superficial diversity, but also considerable fundamental commonality. For example, the same parts of speech (noun, verb, etc.) typically show up, as do concepts like “subject” and “object”. But the details of how words are put together, and how things are indicated, can be fairly different. Sometimes nouns have case endings; sometimes there are separate prepositions. Sometimes verb tenses are indicated by annotating the verb; sometimes with extra words. And sometimes, for example, what would usually be whole phrases can be smooshed together into single words.

It’s not clear to what extent commonalities between languages are the result of shared history, and to what extent they’re consequences either of the particulars of our human sensory experience of the world, or the particular construction of our brains. It’s not too hard to get something like concepts to emerge in experiments on training neural nets to pass data through a “bottleneck” that simulates a “mind-to-mind communication channel”. But how compositionality or grammatical structure might emerge is not clear.

OK, but so what might change if we had bigger brains? If neural nets are a guide, one obvious thing is that we should be able to deal directly with a larger number of “distinct concepts”, or words. So what consequences would this have? Presumably one’s language would get “grammatically shallower”, in the sense that what would otherwise have had to be said with nested phrases could now be said with individual words. And presumably this would tend to lead to “faster communication”, requiring fewer words. But it would likely also lead to more rigid communication, with less ability to tweak shades of meaning, say by changing just a few words in a phrase. (And it would presumably also require longer training, to learn what all the words mean.)

In a sense we have a preview of what it’s like to have more words whenever we deal with specialized versions of existing language, aimed say at particular technical fields. There are additional words of “jargon” available, that make certain things “faster to say” (but require longer to learn). And with that jargon comes a certain rigidity, in saying easily only what the jargon says, and not something slightly different.

So how else could language be different with a bigger brain? With larger working memory, one could presumably have more deeply nested phrases. But what about more sophisticated grammatical structures, say ones that aren’t “context free”, in the sense that different nested phrases can’t be parsed separately? My guess is that this quickly devolves into requiring arbitrary computation—and runs into computational irreducibility. In principle it’s perfectly possible to have any program as the “message” one communicates. But if one has to run the program to “determine its meaning”, that’s in general going to involve computational irreducibility.

And the point is that with our assumptions about what “brain-like systems” do, that’s something that’s out of scope. Yes, one can construct a system (even with neurons) that can do it. But not with the “single thread of decisions from sensory input” workflow that seems characteristic of brains. (There are finer gradations one could consider—like languages that are context sensitive but don’t require general computation. But the Principle of Computational Equivalence strongly suggests that the separation between nested context-free systems and ones associated with arbitrary computation is very thin, and there doesn’t seem to be any particular reason to expect that the capabilities of a bigger brain would land right there.)

Said another way: the Principle of Computational Equivalence says it’s easy to have a system that can deal with arbitrary computation. It’s just that such a system is not “brain like” in its behavior; it’s more like a typical system we see in nature.

OK, but what other “additional features” can one imagine, for even roughly “brain-like” systems? One possibility is to go beyond the idea of a single thread of experience, and to consider a multiway system in which threads of experience can branch and merge. And, yes, this is what we imagine happens at a low level in the physical universe, particularly in connection with quantum mechanics. And indeed it’s perfectly possible to imagine, for example, a “quantum-like” LLM system in which one generates a graph of different textual sequences. But just “scaling up the number of neurons” in a brain, without changing the overall architecture, won’t get to this. We have to have a different, multiway architecture. Where we have a “graph of consciousness” rather than a “stream of consciousness”, and where, in effect, we’re “thinking a graph of thoughts”, notably with thoughts themselves being able to branch and merge.

In our practical use of language, it’s most often communicated in spoken or written form—effectively as a one-dimensional sequence of tokens. But in math, for example, it’s common to have a certain amount of 2D structure, and in general there are also all sorts of specialized (usually technical) diagrammatic representations in use, often based on using graphs and networks—as we’ll discuss in more detail below.

But what about general pictures? Normally it’s difficult for us to produce these. But in generative AI systems it’s basically easy. So could we then imagine directly “communicating mental images” from one mind to another? Maybe as a practical matter some neural implant in our brain could aggregate neural signals from which a displayed image could be generated. But is there in fact something coherent that could be extracted from our brains in this way? Perhaps that can only happen after “consensus is formed”, and we’ve reduced things to a much thinner “thread of experience”. Or, in other words, perhaps the only robust way for us to “think about images” is in effect to reduce them to discrete concepts and language-like representations.

But perhaps if we “had the hardware” to display images directly from our minds it’d be a different story. And it’s sobering to imagine that perhaps the reason cats and dogs don’t appear to have compositional language is just that they don’t “have the hardware” to talk like we do (and it’s too laborious for them to “type with their paws”, etc.). And, by analogy, that if we “had the hardware” for displaying images, we’d discover we could also “think very differently”.

Of course, in some small ways we do have the ability to “directly communicate with images”, for example in our use of gestures and body language. Right now, these seem like largely ancillary forms of communication. But, yes, it’s conceivable that with bigger brains, they could be more.

And when it comes to other animals the story can be different. Cuttlefish are notable for dynamically producing elaborate patterns on their skin—giving them in a sense the hardware to “communicate in pictures”. But so far as one can tell, they produce just a small number of distinct patterns—and certainly nothing like a “pictorial generalization of compositional language”. (In principle one could imagine that “generalized cuttlefish” could do things like “dynamically run cellular automata on their skin”, just like all sorts of animals “statically” do in the process of growth or development. But to decode such patterns—and thereby in a sense enable “communicating in programs”—would typically require irreducible amounts of computation that are beyond the capabilities of any standard brain-like system.)

Sensors and Actuators

We humans have raw inputs coming into our brains from a few million sensors distributed across our usual senses of touch, sight, hearing, taste and smell (together with balance, temperature, hunger, etc.). In most cases the detailed sensor inputs are not independent; in a typical visual scene, for example, neighboring pixels are highly correlated. And it doesn’t seem to take many layers of neurons in our brains to distill our typical sensory experience from pure pieces of “raw data” to what we might view as “more independent features”.

Of course there’ll usually be much more in the raw data than just those features. But the “features” typically correspond to aspects of the data that we’ve “learned are useful to us”—normally connected to pockets of computational reducibility that exist in the environment in which we operate. Are the features we pick out all we’ll ever need? In the end, we typically want to derive a small stream of decisions or actions from all the data that comes in. But how many “intermediate features” do we need to get “good” decisions or actions?

That really depends on two things. First, what our decisions and actions are like. And second, what our raw data is like. Early in the history of our species, everything was just about “indigenous human experience”: what the natural world is like, and what we can do with our bodies. But as soon as we were dealing with technology, that changed. And in today’s world we’re constantly exposed, for example, to visual input that comes not from the natural world, but, say, from digital displays.

And, yes, we often try to arrange our “user experience” to align with what’s familiar from the natural world (say by having objects that stay unchanged when they’re moved across the screen). But it doesn’t have to be that way. And indeed it’s easy—even with simple programs—to generate for example visual images very different from what we’re used to. And in many such cases, it’s very hard for us to “tell what’s going on” in the image. Sometimes it’ll just “look too complicated”. Sometimes it’ll seem like it has pieces we should recognize, but we don’t:

When it’s “just too complicated”, that’s often a reflection of computational irreducibility. But when there are pieces we might “think we should recognize”, that can be a reflection of pockets of reducibility we’re just not familiar with. If we imagine a space of possible images—as we can readily produce with generative AI—there will be some that correspond to concepts (and words) we’re familiar with. But the vast majority will effectively lie in “interconcept space”: places where we could have concepts, but don’t, at least yet:

So what could bigger brains do with all this? Potentially they could handle more features, and more concepts. Full computational irreducibility will always in effect ultimately overpower them. But when it comes to handling pockets of reducibility, they’ll presumably be able to deal with more of them. So in the end, it’s very much as one might expect: a bigger brain should be able to track more things going on, “see more details”, etc.

Brains of our size seem like they are in effect sufficient for “indigenous human experience”. But with technology in the picture, it’s perfectly possible to “overload” them. (Needless to say, technology—in the form of filtering, data analysis, etc.—can also reduce that overload, in effect taking raw input and bringing our actual experience of it closer to something “indigenous”.)

It’s worth pointing out that while two brains of a given size might be able to “deal with the same number of features or concepts”, those features or concepts might be different. One brain might have learned to talk about the world in terms of one set of primitives (such as certain basic colors); another in terms of a different set of primitives. But if both brains are sampling “indigenous human experience” in similar environments one can expect that it should be possible to translate between these descriptions—just as it is generally possible to translate between things said in different human languages.

But what if the brains are effectively sampling “different slices of reality”? What if one’s using technology to convert different physical phenomena to forms (like images) that we can “indigenously” handle? Perhaps we’re sensing different electromagnetic frequencies; perhaps we’re sensing molecular or chemical properties; perhaps we’re sensing something like fluid motion. The kinds of features that will be “useful” may be quite different in these different modalities. Indeed, even something as seemingly basic as the notion of an “object” may not be so relevant if our sensory experience is effectively of continuous fluid motion.

But in the end, what’s “useful” will depend on what we can do. And once again, it depends on whether we’re dealing with “pure humans” (who can’t, for example, move like octopuses) or with humans “augmented by technology”. And here we start to see an issue that relates to the basic capabilities of our brains.

As “pure humans”, we have certain “actuators” (basically in the form of muscles) that we can “indigenously” operate. But with technology it’s perfectly possible for us to use quite different actuators in quite different configurations. And as a practical matter, with brains like ours, we may not be able to make them work.

For example, while humans can control helicopters, they never managed to control quadcopters—at least not until digital flight controllers could do most of the work. In a sense there were just too many degrees of freedom for brains like ours to deal with. Should bigger brains be able to do more? One would think so. And indeed one could imagine testing this with artificial neural nets. In millipedes, for example, their actual brains seem to support only a couple of patterns of motion of their legs (roughly, same phase vs. opposite phase). But one could imagine that with a bigger brain, all sorts of other patterns would become possible.

Ultimately, there are two issues at stake here. The first is having a brain be able to “independently address” enough actuators, or in effect enough degrees of freedom. The second is having a brain be able to control those degrees of freedom. And for example with mechanical degrees of freedom there are again essentially issues of computational irreducibility. Looking at the space of possible configurations—say of millipede legs—does one effectively just have to trace the path to find out if, and how, one can get from one configuration to another? Or are there instead pockets of reducibility, associated with regularities in the space of configurations, that let one “jump ahead” and figure this out without tracing all the steps? It’s those pockets of reducibility that brains can potentially make use of.

When it comes to our everyday “indigenous” experience of the world, we are used to certain kinds of computational reducibility, associated for example with familiar natural laws, say about motion of objects. But what if we were dealing with different experiences, associated with different senses?

For example, imagine (as with dogs) that our sense of smell was better developed than our sense of sight—as reflected by more nerves coming into our brains from our noses than our eyes. Our description of the world would then be quite different, based for example not on geometry revealed by the line-of-sight arrival of light, but instead by the delivery of odors through fluid motion and diffusion—not to mention the probably-several-hundred-dimensional space of odors, compared to the red, green, blue space of colors. Once again there would be features that could be identified, and “concepts” that could be defined. But those might only be useful in an environment “built for smell” rather than one “built for sight”.

And in the end, how many concepts would be useful? I don’t think we have any way to know. But it certainly seems as if one can be a successful “smell-based animal” with a smaller brain (presumably supporting fewer concepts) than one needs as a successful “sight-based animal”.

One feature of “natural senses” is that they tend to be spatially localized: an animal basically senses things only where it is. (We’ll discuss the case of social organisms later.) But what if we had access to a distributed array of sensors—say associated with IoT devices? The “effective laws of nature” that one could perceive would then be different. Maybe there would be regularities that could be captured by a small number of concepts, but it seems more likely that the story would be more complicated, and that in effect one would “need a bigger brain” to be able to keep track of what’s going on, and make use of whatever pockets of reducibility might exist.

There are somewhat similar issues if one imagines changing the timescales for sensory input. Our perception of space, for example, depends on the fact that light travels fast enough that in the milliseconds it takes our brain to register the input, we’ve already received light from everything that’s around us. But if our brains operated a million times faster (as digital electronics does) we’d instead be registering individual photons. And while our brains might aggregate these to something like what we ordinarily perceive, there may be all sorts of other (e.g. quantum optics) effects that would be more obvious.

Abstraction

The more abstractly we try to think, the harder it seems to get. But would it get easier if we had bigger brains? And might there perhaps be fundamentally higher levels of abstraction that we could reach—but only if we had bigger brains.

As a way to approach such questions, let’s begin by talking a bit about the history of the phenomenon of abstraction. We might already say that basic perception involves some abstraction, capturing as it does a filtered version of the world as it actually is. But perhaps we reach a different level when we start to ask “what if?” questions, and to imagine how things in the world could be different than they are.

But somehow when it comes to us humans, it seems as if the greatest early leap in abstraction was the invention of language, and the explicit delineation of concepts that could be quite far from our direct experience. The earliest written records tend to be rather matter of fact, mostly recording as they do events and transactions. But already there are plenty of signs of abstraction. Numbers independent of what they count. Things that should happen in the future. The concept of money.

There seems to be a certain pattern to the development of abstraction. One notices that some category of things one sees many times can be considered similar, then one “packages these up” into a concept, often described by a word. And in many cases, there’s a certain kind of self amplification: once one has a word for something (as a modern example, say “blog”), it becomes easier for us to think about the thing, and we tend to see it or make it more often in the world around us. But what really makes abstraction take off is when we start building a whole tower of it, with one abstract concept recursively being based on others.

Historically this began quite slowly. And perhaps it was seen first in theology. There were glimmerings of it in things like early (syllogistic) logic, in which one started to be able to talk about the form of arguments, independent of their particulars. And then there was mathematics, where computations could be done just in terms of numbers, independent of where those numbers came from. And, yes, while there were tables of “raw computational results”, numbers were usually discussed in terms of what they were numbers of. And indeed when it came to things like measures of weight, it took until surprisingly modern times for there to be an absolute, abstract notion of weight, independent of whether it was a weight of figs or of wool.

The development of algebra in the early modern period can be considered an important step forward in abstraction. Now there were formulas that could be manipulated abstractly, without even knowing what particular numbers x stood for. But it would probably be fair to say that there was a major acceleration in abstraction in the 19th century—with the development of formal systems that could be discussed in “purely symbolic form” independent of what they might (or might not) “actually represent”.

And it was from this tradition that modern notions of computation emerged (and indeed particularly ones associated with symbolic computation that I personally have extensively used). But the most obvious area in which towers of abstraction have been built is mathematics. One might start with numbers (that could count things). But soon one’s on to variables, functions, spaces of functions, category theory—and a zillion other constructs that abstractly build on each other.

The great value of abstraction is that it allows one to think about large classes of things all at once, instead of each separately. But how do those abstract concepts fit together? The issue is that often it’s in a way that’s very remote from anything about which we have direct experience from our raw perception of the world. Yes, we can define concepts about transfinite numbers or higher categories. But they don’t immediately relate to anything we’re familiar with from our everyday experience.

As a practical matter one can often get a sense of how high something is on the tower of abstraction by seeing how much one has to explain to build up to it from “raw experiential concepts”. Just sometimes it turns out that actually, once one hears about a certain seemingly “highly abstract” concept, one can actually explain it surprisingly simply, without going through the whole historical chain that led to it. (A notable example of this is the concept of universal computation—which arose remarkably late in human intellectual history, but is now quite easy to explain, albeit particularly given its actual widespread embodiment in technology.) But the more common case is that there’s no choice but to explain a whole tower of concepts.

At least in my experience, however, when one actually thinks about “highly abstract” things, one does it by making analogies to more familiar, more concrete things. The analogies may not be perfect, but they provide scaffolding which allows our brains to take what would otherwise be quite inaccessible steps.

At some level any abstraction is a reflection of a pocket of computational reducibility. Because if a useful abstraction can be defined, what it means is that it’s possible to say something in a “summarized” or reduced way, in effect “jumping ahead”, without going through all the computational steps or engaging with all the details. And one can then think of towers of abstraction as being like networks of pockets of computational reducibility. But, yes, it can be hard to navigate these.

Underneath, there’s lots of computational irreducibility. And if one is prepared to “go through all the steps” one can often “get to an answer” without all the “conceptual difficulty” of complex abstractions. But while computers can often readily “go through all the steps”, brains can’t. And that’s in a sense why we have to use abstraction. But inevitably, even if we’re using abstraction, and the pockets of computational reducibility associated with it, there’ll be shadows of the computational irreducibility underneath. And in particular, if we try to “explore everything”, our network of pockets of reducibility will inevitably “get complicated”, and ultimately also be mired in computational irreducibility, albeit with “higher-level” constructs than in the computational irreducibility underneath.

No finite brain will ever be able to “go all the way”, but it starts to seem likely that a bigger brain will be able to “reach further” in the network of abstraction. But what will it find there? How does the character of abstraction change when we take it further? We’ll be able to discuss this a bit more concretely when we talk about computational language below. But perhaps the main thing to say now is that—at least in my experience—most higher abstractions don’t feel as if they’re “structurally different” once one understands them. In other words, most of the time, it seems as if the same patterns of thought and reasoning that one’s applied in many other places can be applied there too, just to different kinds of constructs.

Sometimes, though, there seem to be exceptions. Shocks to intuition that seem to separate what one’s now thinking about from anything one’s thought before. And, for example, for me this happened when I started looking broadly at the computational universe. I had always assumed that simple rules would lead to simple behavior. But many years ago I discovered that in the computational universe this isn’t true (hence computational irreducibility). And this led to a whole different paradigm for thinking about things.

It feels a bit like in metamathematics. Where one can imagine one type of abstraction associated with different constructs out of which to form theorems. But where somehow there’s another level associated with different ways to build new theorems, or indeed whole spaces of theorems. Or to build proofs from proofs, or proofs from proofs of proofs, etc. But the remarkable thing is that there seems to be an ultimate construct that encompasses it all: the ruliad.

We can describe the ruliad as the entangled limit of all possible computations. But we can also describe it as the limit of all possible abstractions. And it seems to lie underneath all physical reality, as well as all possible mathematics, etc. But, we might ask, how do brains relate to it?

Inevitably, it’s full of computational irreducibility. And looked at as a whole, brains can’t get far with it. But the key idea is to think about how brains as they are—with all their various features and limitations—will “parse” it. And what I’ve argued is that what “brains as they are” will perceive about the ruliad are the core laws of physics (and mathematics) as we know them. In other words, it’s because brains are the way they are that we perceive the laws of physics that we perceive.

Would it be different for bigger brains? Not if they’re the “same kind of brains”. Because what seems to matter for the core laws of physics are really just two properties of observers. First, that they’re computationally bounded. And second, that they believe they are persistent in time, and have a single thread of experience through time. And both of these seem to be core features of what makes brains “brain-like”, rather than just arbitrary computational systems.

It’s a remarkable thing that just these features are sufficient to make core laws of physics inevitable. But if we want to understand more about the physics we’ve constructed—and the laws we’ve deduced—we probably have to understand more about what we’re like as observers. And indeed, as I’ve argued elsewhere, even our physical scale (much bigger than molecules, much smaller than the whole universe) is for example important in giving us the particular experience (and laws) of physics that we have.

Would this be different with bigger brains? Perhaps a little. But anything that something brain-like can do pales in comparison to the computational irreducibility that exists in the ruliad and in the natural world. Nevertheless, with every new pocket of computational reducibility that’s reached we get some new abstraction about the world, or in effect, some new law about how the world works.

And as a practical matter, each such abstraction can allow us to build a whole collection of new ways of thinking about the world, and making things in the world. It’s challenging to trace this arc. Because in a sense it’ll all be about “things we never thought to think about before”. Goals we might define for ourselves that are built on a tower of abstraction, far away from what we might think of as “indigenous human goals”.

It’s important to realize that there won’t just be one tower of abstraction that can be built. There’ll inevitably be an infinite network of pockets of computational reducibility, with each path leading to a different specific tower of abstraction. And indeed the abstractions we have pursued reflect the particular arc of human intellectual history. Bigger brains—or AIs—have many possible directions they can go, each one defining a different path of history.

One question to ask is to what extent reaching higher levels of abstraction is a matter of education, and to what extent it requires additional intrinsic capabilities of a brain. It is, I suspect, a mixture. Sometimes it’s really just a question of knowing “where that pocket of reducibility is”, which is something we can learn from education. But sometimes it’s a question of navigating a network of pockets, which may only be possible when brains reach a certain level of “computational ability”.

There’s another thing to discuss, related to education. And that’s the fact that over time, more and more “distinct pieces of knowledge” get built up in our civilization. There was perhaps a time in history when a brain of our size could realistically commit to memory at least the basics of much of that knowledge. But today that time has long passed. Yes, abstraction in effect compresses what one needs to know. But the continual addition of new and seemingly important knowledge, across countless specialties, makes it impossible for brains of our size to keep up.

Plenty of that knowledge is, though, quite siloed in different areas. But sometimes there are “grand analogies” to make—say pulling an idea from relativity theory and applying it to biological evolution. In a sense such analogies reveal new abstractions—but to make them requires knowledge that spans many different areas. And that’s a place where bigger brains—or AIs—can potentially do something that’s in a fundamental way “beyond us”.

Will there always be such “grand analogies” to make? The general growth of knowledge is inevitably a computationally irreducible process. And within it there will inevitably be pockets of reducibility. But how often in practice will one actually encounter “long-range connections” across “knowledge space”? As a specific example one can look at metamathematics, where such connections are manifest in theorems that link seemingly different areas of mathematics. And this example leads one to realize that at some deep level grand analogies are in a sense inevitable. In the context of the ruliad, one can think of different domains of knowledge as corresponding to different parts. But the nature of the ruliad—encompassing as it does everything that is computationally possible—inevitably imbues it with a certain homogeneity, which implies that (as the Principle of Computational Equivalence might suggest) there must ultimately be a correspondence between different areas. In practice, though, this correspondence may be at a very “atomic” (or “formal”) level, far below the kinds of descriptions (based on pockets of reducibility) that we imagine brains normally use.

But, OK, will it always take an “expanding brain” to keep up with the “expanding knowledge” we have? Computational irreducibility guarantees that there’ll always in principle be “new knowledge” to be had—separated from what’s come before by irreducible amounts of computation. But then there’s the question of whether in the end we’ll care about it. After all, it could be that the knowledge we can add is so abstruse that it will never affect any practical decisions we have to make. And, yes, to some extent that’s true (which is why only some tiny fraction of the Earth’s population will care about what I’m writing here). But another consequence of computational irreducibility is that there will always be “surprises”—and those can eventually “push into focus” even what at first seems like arbitrarily obscure knowledge.

Computational Language

Language in general—and compositional language in particular—is arguably the greatest invention of our species. But is it somehow “the top”—the highest possible representation of things? Or if, for example, we had bigger brains, is there something beyond it that we could reach?

Well, in some very formal sense, yes, compositional language (at least in idealized form) is “the top”. Because—at least if it’s allowed to include utterances of any length—then in some sense it can in principle encode arbitrary, universal computations. But this really isn’t true in any useful sense—and indeed to apply ordinary compositional language in this way would require doing computationally irreducible computations.

So we return to the question of what might in practice lie beyond ordinary human language. I wondered about this for a long time. But in the end I realized that the most important clue is in a sense right in front of me: the concept of computational language, that I’ve spent much of my life exploring.

It’s worth saying at the outset that the way computational language plays out for computers and for brains is somewhat different, and in some respects complementary. In computers you might specify something as a Wolfram Language symbolic expression, and then the “main action” is to evaluate this expression, potentially running a long computation to find out what the expression evaluates to.

Brains aren’t set up to do long computations like this. For them a Wolfram Language expression is something to use in effect as a “representation of a thought”. (And, yes, that’s an important distinction between the computational language concept of Wolfram Language, and standard “programming languages”, which are intended purely as a way to tell a computer what to do, not a way to represent thoughts.)

So what kinds of thoughts can we readily represent in our computational language? There are ones involving explicit numbers, or mathematical expressions. There are ones involving cities and chemicals, and other real-world entities. But then there are higher-level ones, that in effect describe more abstract structures.

For example, there’s NestList, which gives the result of nesting any operation, here named f:

At the outset, it’s not obvious that this would be a useful thing to do. But in fact it’s a very successful abstraction: there are lots of functions f for which one wants to do this.

In the development of ordinary human language, words tend to get introduced when they’re useful, or, in other words, when they express things one often wants to express. But somehow in human language the words one gets tend to be more concrete. Maybe they describe something that directly happens to objects in the world. Maybe they describe our impression of a human mental state. Yes, one can make rather vague statements like “I’m going to do something to someone”. But human language doesn’t normally “go meta”, doing things like NestList where one’s saying that one wants to take some “direct statement” and in effect “work with the statement”. In some sense, human language tends to “work with data”, applying a simple analog of code to it. Our computational language can “work with code” as “raw material”.

One can think about this as a “higher-order function”: a function that operates not on data, but on functions. And one can keep going, dealing with functions that operate on functions that operate on functions, and so on. And at every level one is increasing the generality—and abstraction—at which one is working. There may be many specific functions (a bit analogous to verbs) that operate on data (a bit analogous to nouns). But when we talk about operating on functions themselves we can potentially have just a single function (like NestList) that operates, quite generally, on many functions. In ordinary language, we might call such things “metaverbs”, but they aren’t something that commonly occurs.

But what makes them possible in computational language? Well, it’s taking the computational paradigm seriously, and representing everything in computational terms: objects, actions, etc. In Wolfram Language, it’s that we can represent everything as a symbolic expression. Arrays of numbers (or countries, or whatever) are symbolic expressions. Graphics are symbolic expressions. Programs are symbolic expressions. And so on.

And given this uniformity of representation it becomes feasible—and natural—to do higher-order operations, that in effect manipulate symbolic structure without being concerned about what the structure might represent. At some level we can view this as leading to the ultimate abstraction embodied in the ruliad, where in a sense “everything is pure structure”. But in practice in Wolfram Language we try to “anchor” what we’re doing to known concepts from ordinary human language—so that we use names for things (like NestList) that are derived from common English words.

In some formal sense this isn’t necessary. Everything can be “purely structural”, as it is not only in the ruliad but also in constructs like combinators, where, say, the operation of addition can be represented by:

Combinators have been around for more than a century. But they are almost impenetrably difficult for most humans to understand. Somehow they involve too much “pure abstraction”, not anchored to concepts we “have a sense of” in our brains.

It’s been interesting for me to observe over the years what it’s taken for people (including myself) to come to terms with the kind of higher-order constructs that exist in the Wolfram Language. The typical pattern is that over the course of months or years one gets used to lots of specific cases. And only after that is one able—often in the end rather quickly—to “get to the next level” and start to use some generalized, higher-order construct. But normally one can in effect only “go one level at a time”. After one groks one level of abstraction, that seems to have to “settle” for a while before one can go on to the next one.

Somehow it seems as if one is gradually “feeling out” a certain amount of computational irreducibility, to learn about a new pocket of reducibility, that one can eventually use to “think in terms of”.

Could “having a bigger brain” speed this up? Maybe it’d be useful to be able to remember more cases, and perhaps get more into “working memory”. But I rather suspect that combinators, for example, are in some sense fundamentally beyond all brain-like systems. It’s much as the Principle of Computational Equivalence suggests: one quickly “ascends” to things that are as computationally sophisticated as anything—and therefore inevitably involve computational irreducibility. There are only certain specific setups that remain within the computationally bounded domain that brain-like systems can deal with.

Of course, even though they can’t directly “run code in their brains”, humans—and LLMs—can perfectly well use Wolfram Language as a tool, getting it to actually run computations. And this means they can readily “observe phenomena” that are computationally irreducible. And indeed in the end it’s very much the same kind of thing observing such phenomena in the abstract computational universe, and in the “real” physical universe. And the point is that in both cases, brain-like systems will pull out only certain features, essentially corresponding to pockets of computational reducibility.

How do things like higher-order functions relate to this? At this point it’s not completely clear. Presumably in at least some sense there are hierarchies of higher-order functions that capture certain kinds of regularities that can be thought of as associated with networks of computational reducibility. And it’s conceivable that category theory and its higher-order generalizations are relevant here. In category theory one imagines applying sequences of functions (“morphisms”) and it’s a foundational assumption that the effect of any sequence of functions can also be represented by just a single function—which seems tantamount to saying that one can always “jump ahead”, or in other words, that everything one’s dealing with is computationally reducible. Higher-order category theory then effectively extends this to higher-order functions, but always with what seem like assumptions of computational reducibility.

And, yes, this all seems highly abstract, and difficult to understand. But does it really need to be, or is there some way to “bring it down” to a level that’s close to everyday human thinking? It’s not clear. But in a sense the core art of computational language design (that I’ve practiced so assiduously for nearly half a century) is precisely to take things that at first might seem abstruse, and somehow cast them into an accessible form. And, yes, this is something that’s about as intellectually challenging as anything—because in a sense it involves continually trying to “figure out what’s really going on”, and in effect “drilling down” to get to the foundations of everything.

But, OK, when one gets there, how simple will things be? Part of that depends on how much computational irreducibility is left when one reaches what one considers to be “the foundations”. And part in a sense depends on the extent to which one can “find a bridge” between the foundations and something that’s familiar. Of course, what’s “familiar” can change. And indeed over the four decades that I’ve been developing the Wolfram Language quite a few things (particularly in areas like functional programming) that at first seemed abstruse and unfamiliar have begun to seem more familiar. And, yes, it’s taken the collective development and dissemination of the relevant ideas to achieve that. But now it “just takes education”; it doesn’t “take a bigger brain” to deal with these things.

One of the core features of the Wolfram Language is that it represents everything as a symbolic expression. And, yes, symbolic expressions are formally able to represent any kind of computational structure. But beyond that, the important point is that they’re somehow set up to be a match for how brains work.

And in particular, symbolic expressions can be thought of “grammatically” as consisting of nested functions that form a tree-like structure; effectively a more precise version of the typical kind of grammar that we find in human language. And, yes, just as we manage to understand and generate human language with a limited working memory, so (at least at the grammatical level) we can do the same thing with computational language. In other words, in dealing with Wolfram Language we’re leveraging our faculties with human language. And that’s why Wolfram Language can serve as such an effective bridge between the way we think about things, and what’s computationally possible.

But symbolic expressions represented as trees aren’t the only conceivable structures. It’s also possible to have symbolic expressions where the elements are nodes on a graph, and the graph can even have loops in it. Or one can go further, and start talking, for example, about the hypergraphs that appear in our Physics Project. But the point is that brain-like systems have a hard time processing such structures. Because to keep track of what’s going on they in a sense have to keep track of multiple “threads of thought”. And that’s not something individual brain-like systems as we current envision them can do.

Many Brains Together: The Formation of Society

As we’ve discussed several times here, it seems to be a key feature of brains that they create a single “thread of experience”. But what would it be like to have multiple threads? Well, we actually have a very familiar example of that: what happens when we have a whole collection of people (or other animals).

One could imagine that biological evolution might have produced animals whose brains maintain multiple simultaneous threads of experience. But somehow it has ended up instead restricting each animal to just one thread of experience—and getting multiple threads by having multiple animals. (Conceivably creatures like octopuses may actually in some sense support multiple threads within one organism.)

Within a single brain it seems important to always “come to a single, definite conclusion”—say to determine where an animal will “move next”. But what about in a collection of organisms? Well, there’s still some kind of coordination that will be important to the fitness of the whole population—perhaps even something as direct as moving together as a herd or flock. And in a sense, just as all those different neuron firings in one brain get collected to determine a “final conclusion for what to do”, so similarly the conclusions of many different brains have to be collected to determine a coordinated outcome.

But how can a coordinated outcome arise? Well, there has to be communication of some sort between organisms. Sometimes it’s rather passive (just watch what your neighbor in a herd or flock does). Sometimes it’s something more elaborate and active—like language. But is that the best one can do? One might imagine that there could be some kind of “telepathic coordination”, in which the raw pattern of neuron firings is communicated from one brain to another. But as we’ve argued, such communication cannot be expected to be robust. To achieve robustness, one must “package up” all the internal details into some standardized form of communication (words, roars, calls, etc.) that one can expect can be “faithfully unpacked” and in effect “understood” by other, suitably similar brains.

But it’s important to realize that the very possibility of such standardized communication in effect requires coordination. Because somehow what goes on in one brain has to be aligned with what goes on in another. And indeed the way that’s maintained is precisely through continual communication.

So, OK, how might bigger brains affect this? One possibility is that they might enable more complex social structures. There are plenty of animals with fairly small brains that successfully form “all do the same thing” flocks, herds and the like. But the larger brains of primates seem to allow more complex “tribal” structures. Could having a bigger brain let one successfully maintain a larger social structure, in effect remembering and handling larger numbers of social connections? Or could the actual forms of these connections be more complex? While human social connections seem to be at least roughly captured by social networks represented as ordinary graphs, maybe bigger brains would for example routinely require hypergraphs.

But in general we can say that language—or standardized communication of some form—is deeply connected to the existence of a “coherent society”. For without being able to exchange something like language there’s no way to align the members of a potential society. And without coherence between members something like language won’t be useful.

As in so many other situations, one can expect that the detailed interactions between members of a society will show all sorts of computational irreducibility. And insofar as one can identify “the will of society” (or, for that matter, the “tide of history”), it represents a pocket of computational reducibility in the system.

In human society there is a considerable tendency (though it’s often not successful) to try to maintain a single “thread of society”, in which, at some level, everyone is supposed to act more or less the same. And certainly that’s an important simplifying feature in allowing brains like ours to “navigate the social world”. Could bigger brains do something more sophisticated? As in other areas, one can imagine a whole network of regularities (or pockets of reducibility) in the structure of society, perhaps connected to a whole tower of “higher-order social abstractions”, that only brains bigger than ours can comfortably deal with. (“Just being friends” might be a story for the “small brained”. With bigger brains one might instead have patterns of dependence and connectivity that can only be represented in complicated graph theoretic ways.)

Minds beyond Ours

We humans have a tremendous tendency to think—or at least hope—that our minds are somehow “at the top” of what’s possible. But with what we know now about computation and how it operates in the natural world it’s pretty clear this isn’t true. And indeed it seems as if it’s precisely a limitation in the “computational architecture” of our minds—and brains—that leads to that most cherished feature of our existence that we characterize as “conscious experience”.

In the natural world at large, computation is in some sense happening quite uniformly, everywhere. But our brains seem to be set up to do computation in a more directed and more limited way—taking in large amounts of sensory data, but then filtering it down to a small stream of actions to take. And, yes, one can remove this “limitation”. And while the result may lead to more computation getting done, it doesn’t lead to something that’s “a mind like ours”.

And indeed in what we’ve done here, we’ve tended to be very conservative in how we imagine “extending our minds”. We’ve mostly just considered what might happen if our brains were scaled up to have more neurons, while basically maintaining the same structure. (And, yes, animals physically bigger than us already have larger brains—as did Neanderthals—but what we really need to look at is size of brain relative to size of the animal, or, in effect “amount of brain for a given amount of sensory input”.)

A certain amount about what happens with different scales of brains is already fairly clear from looking at different kinds of animals, and at things like their apparent lack of human-like language. But now that we have artificial neural nets that do remarkably human-like things we’re in a position to get a more systematic sense of what different scales of “brains” can do. And indeed we’ve seen a sequence of “capability thresholds” passed as neural nets get larger.

So what will bigger brains be able to do? What’s fairly straightforward is that they’ll presumably be able to take larger amounts of sensory input, and generate larger amounts of output. (And, yes, the sensory input could come from existing modalities, or new ones, and the outputs could go to existing “actuators”, or new ones.) As a practical matter, the more “data” that has to be processed for a brain to “come to a decision” and generate an output, the slower it’ll probably be. But as brains get bigger, so presumably will the size of their working memory—as well as the number of distinct “concepts” they can “distinguish” and “remember”.

If the same overall architecture is maintained, there’ll still be just a single “thread of experience”, associated with a single “thread of communication”, or a single “stream of tokens”. At the size of brains we have, we can deal with compositional language in which “concepts” (represented, basically, as words) can have at least a certain depth of qualifiers (corresponding, say, to adjectival phrases). As brain size increases, we can expect there can both be more “raw concepts”—allowing fewer qualifiers—as well as more working memory to deal with more deeply nested qualifiers.

But is there something qualitatively different that can happen with bigger brains? Computational language (and particularly my experience with the Wolfram Language) gives some indications, the most notable of which is the idea of “going meta” and using “higher-order constructs”. Instead of, say, operating directly on “raw concepts” with (say, “verb-like”) “functions”, we can imagine higher-order functions that operate on functions themselves. And, yes, this is something of which we see powerful examples in the Wolfram Language. But it feels as if we could somehow go further—and make this more routine—if our brains in a sense had “more capacity”.

To “go meta” and “use higher-order constructs” is in effect a story of abstraction—and of taking many disparate things and abstracting to the point where one can “talk about them all together”. The world at large is full of complexity—and computational irreducibility. But in essence what makes “minds like ours” possible is that there are pockets of computational reducibility to be found. And those pockets of reducibility are closely related to being able to successfully do abstraction. And as we build up towers of abstraction we are in effect navigating through networks of pockets of computational reducibility.

The progress of knowledge—and the fact that we’re educated about it—lets us get to a certain level of abstraction. And, one suspects, the more capacity there is in a brain, the further it will be able to go.

But where will it “want to go”? The world at large—full as it is with computational irreducibility, along with infinite numbers of pockets of reducibility—leaves infinite possibilities. And it is largely the coincidence of our particular history that defines the path we have taken.

We often identify our “sense of purpose” with the path we will take. And perhaps the definiteness of our belief in purpose is related to the particular feature of brains that leads us to concentrate “everything we’re thinking” down into just a single stream of decisions and action.

And, yes, as we’ve discussed, one could in principle imagine “multiway minds” with multiple “threads of consciousness” operating at once. But we humans (and individual animals in general) don’t seem to have those. Of course, in collections of humans (or other animals) there are still inevitably multiple “threads of consciousness” —and it’s things like language that “knit together” those threads to, for example, make a coherent society.

Quite what that “knitting” looks like might change as we scale up the size of brains. And so, for example, with bigger brains we might be able to deal with “higher-order social structures” that would seem alien and incomprehensible to us today.

So what would it be like to interact with a “bigger brain”? Inside, that brain might effectively use many more words and concepts than we know. But presumably it could generate at least a rough (“explain-like-I’m-5”) approximation that we’d be able to understand. There might well be all sorts of abstractions and “higher-order constructs” that we are basically blind to. And, yes, one is reminded of something like a dog listening to a human conversation about philosophy—and catching only the occasional “sit” or “fetch” word.

As we’ve discussed several times here, if we remove our restriction to “brain-like” operation (and in particular to deriving a small stream of decisions from large amounts of sensory input) we’re thrown into the domain of general computation, where computational irreducibility is rampant, and we can’t in general expect to say much about what’s going on. But if we maintain “brain-like operation”, we’re instead in effect navigating through “networks of computational reducibility”, and we can expect to talk about things like concepts, language and towers of abstraction.

From a foundational point of view, we can imagine any mind as in effect being at a particular place in the ruliad. When minds communicate, they are effectively exchanging the rulial analog of particles—robust concepts that are somehow unchanged as they propagate within the ruliad. So what would happen if we had bigger brains? In a sense it’s a surprisingly “mechanical” story: a bigger brain—encompassing more concepts, etc.—in effect just occupies a larger region of rulial space. And the presence of abstraction—perhaps learned from a whole arc of intellectual history—can lead to more expansion in rulial space.

And in the end it seems that “minds beyond ours” can be characterized by how large the regions of the ruliad they occupy are. (Such minds are, in some very literal rulial sense, more “broad minded”.) So what is the limit of all this? Ultimately, it’s a “mind” that spans the whole ruliad, and in effect incorporates all possible computations. But in some fundamental sense this is not a mind like ours, not least because by “being everything” it “becomes nothing”—and one can no longer identify it as having a coherent “thread of individual existence”.

And, yes, the overall thrust of what we’ve been saying applies just as well to “AI minds” as to biological ones. If we remove restrictions like being set up to generate the next token, we’ll be left with a neural net that’s just “doing computation”, with no obvious “mind-like purpose” in sight. But if we make neural nets do typical “brain-like” tasks, then we can expect that they too will find and navigate pockets of reducibility. We may well not recognize what they’re doing. But insofar as we can, then inevitably we’ll mostly be sampling the parts of “minds beyond ours” that are aligned with “minds like ours”. And it’ll take progress in our whole human intellectual edifice to be able to fully appreciate what it is that minds beyond ours can do.

Thanks for recent discussions about topics covered here in particular to Richard Assar, Joscha Bach, Kovas Boguta, Thomas Dullien, Dugan Hammock, Christopher Lord, Fred Meinberg, Nora Popescu, Philip Rosedale, Terry Sejnowski, Hikari Sorensen, and James Wiles.

What Can We Learn about Engineering and Innovation from Half a Century of the Game of Life Cellular Automaton?

Stephen Wolfram — Tue, 18 Mar 2025 18:25:33 +0000

Metaengineering and Laws of Innovation

Things are invented. Things are discovered. And somehow there’s an arc of progress that’s formed. But are there what amount to “laws of innovation” that govern that arc of progress?

There are some exponential and other laws that purport to at least measure overall quantitative aspects of progress (number of transistors on a chip; number of papers published in a year; etc.). But what about all the disparate innovations that make up the arc of progress? Do we have a systematic way to study those?

We can look at the plans for different kinds of bicycles or rockets or microprocessors. And over the course of years we’ll see the results of successive innovations. But most of the time those innovations won’t stay within one particular domain—say shapes of bicycle frames. Rather they’ll keep on pulling in innovations from other domains—say, new materials or new manufacturing techniques. But if we want to get closer to the study of the pure phenomenon of innovation we need a case where—preferably over a long period of time—everything that happens can be described in a uniform way within a single narrowly defined framework.

Well, some time ago I realized that, actually, yes, there is such a case—and I’ve even personally been following it for about half a century. It’s the effort to build “engineering” structures within the Game of Life cellular automaton. They might serve as clocks, wires, logic gates, or things that generate digits of π. But the point is that they’re all just patterns of bits. So when we talk about innovation in this case, we’re talking about the rather pure question of how patterns of bits get invented, or discovered.

As a long-time serious researcher of the science of cellular automata (and of what they generically do), I must say I’ve long been frustrated by how specific, whimsical and “non-scientific” the things people do with the Game of Life have often seemed to me to be. But what I now realize is that all that detail and all that hard work have now created what amounts to a unique dataset of engineering innovation. And my goal here is to do what one can call “metaengineering”—and to study in effect what happened in that process of engineering over the nearly six decades since the Game of Life was invented.

We’ll see in rather pure form many phenomena that are at least anecdotally familiar from our overall experience of progress and innovation. Most of the time, the first step is to identify an objective: some purpose one can describe and wants to achieve. (Much more rarely, one instead observes something that happens, then realizes there’s a way one can meaningfully make use of it.) But starting from an objective, one either takes components one has, and puts human effort into arranging them to “invent” something that will achieve the objective—or in effect (usually at least somewhat systematically, and automatically) one searches to try to “discover” new ways to achieve the objective.

As we explore what’s been done with the Game of Life we’ll see occasional sudden advances—together with much larger amounts of incremental progress. We’ll see towers of technology being built, and we’ll see old, rather simple technology being used to achieve new objectives. But most of all, we’ll see an interplay between what gets discovered by searching possibilities—and what gets invented by explicit human effort.

The Principle of Computational Equivalence implies that there is, in a sense, infinite richness to what a computational system like the Game of Life can ultimately do—and it’s the role of science to explore this richness in all its breadth. But when it comes to engineering and technology the crucial question is what we choose to make the system do—and what paths we follow to get there. Inevitably, some of this is determined by the underlying computational structure of the system. But much of it is a reflection of how we, as humans, do things, and the patterns of choices we make. And that’s what we’ll be able to study—at quite large scale—by looking at the nearly six decades of work on the Game of Life.

How similar are the results of such “purposeful engineering” to the results of “blind” adaptive evolution of the kind that occurs in biology? I recently explored adaptive evolution (as it happens, using cellular automata as a model) and saw that it can routinely deliver what seem like “sequences of new ideas”. But now in the example of the Game of Life we have what we can explicitly identify as “sequences of new ideas”. And so we’re in a position to compare the results of human effort (aided, in many cases, by systematic search) with what we can “automatically” do by the algorithmic process of adaptive evolution.

In the end, we can think of the set of things that we can in principle engineer as being laid out in a kind of “metaengineering space”, much as we can think of mathematical theorems we can prove as being laid out in metamathematical space. In the mathematical case (notwithstanding some of my own work) the vast majority of theorems have historically been found purely by human effort. But, as we’ll see below, in Game-of-Life engineering it’s been a mixture of human effort and fairly automated exploration of metaengineering space. Though—much like in traditional mathematics—we’ve still in a sense always only pursuing objectives we’ve already conceptualized. And in this way what we’re doing is very different from what I’ve done for so long in studying the science (or, as I would now say, the ruliology) of what computational systems like cellular automata (of which the Game of Life is an example) do “in the wild”, when they’re unconstrained by objectives we’re trying to achieve with them.

The Nature of the Game of Life

Here’s a typical example of what it looks like to run the Game of Life:

There’s a lot of complicated—and hard to understand—stuff going on here. But there are still some recognizable structures—like the “blinkers” that alternate on successive steps

and the “gliders” that steadily move across the screen:

Seeing these structures might make one think that one should be able to “do engineering” in the Game of Life, setting up patterns that can ultimately do all sorts of things. And indeed our main subject here is the actual development of such engineering over the past nearly six decades since the introduction of the Game of Life.

What we’ll be concentrating on is essentially the “technology” of the Game of Life: how we take the “raw material” that the Game of Life provides, and make from it “meaningful engineering structures”.

But what about the science of the Game of Life? What can we say about what the Game of Life “naturally does”, independent of “useful” structures we create in it? The vast majority of the effort that’s been put into the Game of Life over the past half century hasn’t been about this. But this type of fundamental question is central to what one asks in what I now call ruliology—a kind of science that I’ve been energetically pursuing since the early 1980s.

Ruliology looks in general at classes of systems, rather then at the kind of specifics that have typically been explored in the Game of Life. And within ruliology, the Game of Life is in a sense nothing special; it’s just one of many “class 4” 2D cellular automaton (in my numbering scheme, it’s the 2-color 9-neighbor cellular automaton with outer totalistic code 224).

My own investigations of cellular automata have particularly focused in 1D than 2D examples. And I think that’s been crucial to many of the scientific discoveries I’ve made. Because somehow one learns so much more by being able to see at a glance the history of a system, rather than just seeing frames in a video go by. With a class 4 2D rule like the Game of Life, one can begin to approach this by including “trails” of what’s previously happened, and we’ll often use this kind of visualization in what follows:

We can get a more complete view of history by looking at the whole (2+1)-dimensional “spacetime history”—though then we’re confronted with 3D forms that are often somewhat difficult for our human visual system to parse:

But taking a slice through this 3D form we get “silhouette” pictures that turn out to look remarkably similar to what I generated in large quantities starting in the early 1980s across many 1D cellular automata:

Such pictures—with their complex forms—highlight the computational irreducibility that’s close at hand even in the Game of Life. And indeed it’s the presence of such computational irreducibility that ultimately makes possible the richness of engineering that can be done in the Game of Life. But in actually doing that engineering—and in setting up structures and processes that behave in understandable and “technologically useful” ways—we need to keep the computational irreducibility “bottled up”. And in the end, we can think of the path of engineering innovation in the Game of Life as like an effort to navigate through an ocean of computational irreducibility, finding “islands of reducibility” that achieve the purposes we want.

What’s Been Made in the Game of Life?

Most of the structures of “engineering interest” in the Game of Life are somehow persistent. The simplest are structures that just remain constant, some small examples being:

And, yes, structures in the Game of Life have been given all sorts of (usually whimsical) names, which I’ll use here. (And, in that vein, structures in the Game of Life that remain constant are normally called “still lifes”.)

Beyond structures that just remain constant, there are “oscillators” that produce periodic patterns:

We’ll be discussing oscillators at much greater length below, but here are a few examples (where now we’re including a visualization that shows “trails”):

Next in our inventory of classes of structures come “gliders” (or in general “spaceships”): structures that repeat periodically but move when they do so. A classic example is the basic glider, which takes on the same form every 4 steps—after moving 1 cell horizontally and 1 cell vertically:

Here are a few small examples of such “spaceship”-style structures:

Still lifes, oscillators and spaceships are most of what one sees in the “ash” that survives from typical random initial conditions. And for example the end result (after 1103 steps) from the evolution we saw in the previous section consists of:

The structures we’ve seen so far were all found not long after the Game of Life was invented; indeed, pretty much as soon it was simulated on a computer. But one feature that they all share is that they don’t systematically grow; they always return to the same number of black cells. And so one of the early surprises (in 1970) was the discovery of a “glider gun” that shoots out a glider every 30 steps forever:

Something that gives a sense of progress that’s been made in Game-of-Life “technology” is that a “more efficient” glider gun—with period 15—was discovered, but only in 2024, 54 years after the previous one:

Another kind of structure that was quickly discovered in the early history of the Game of Life is a “puffer”—a “spaceship” that “leaves debris behind” (in this case every 128 steps):

But given these kinds of “components”, what can one build? Something constructed very early was the “breeder”, that uses streams of gliders to create glider guns, that themselves then generate streams of gliders:

The original pattern covers about a quarter million cells (with 4060 being black). Running it for 1000 steps we see it builds up a triangle containing a quadratically increasing number of gliders:

OK, but knowing that it’s in principle possible to “fill a growing region of space”, is there a more efficient way to do it? The surprisingly simple answer, as discovered in 1993, is yes:

So what other kinds of things can be built in the Game of Life? Lots—even from the simple structures we’ve seen so far. For example, here’s a pattern that was constructed to compute the primes

emitting a “lightweight spaceship” at step 100 + 120n only if n is prime. It’s a little more obvious how this works when it’s viewed “in spacetime”; in effect it’s running a sieve in which all multiples of all numbers are instantiated as streams of gliders, which knock out spaceships generated at non-prime positions:

If we look at the original pattern here, it’s just made up of a collection of rather simple structures:

And indeed structures like these have been used to build all sorts of things, including for example Turing machine emulators—and also an emulator for the Game of Life itself, with this 499×499 pattern corresponding to a single emulated Life cell:

Both these last two patterns were constructed in the 1990s—from components that had been known since the early 1970s. And—as we can see—they’re large (and complicated). But do they need to be so large? One of the lessons of the Principle of Computational Equivalence is that in the computational universe there’s almost always a way to “do just as much, but with much less”. And indeed in the Game of Life many, many discoveries along these lines have been made in the past few decades.

As we’ll see, often (but not always) these discoveries built on “new devices” and “new mechanisms” that were identified in the intervening years. A long series of such “devices” and “mechanisms” involved handling “signals” associated with streams of gliders. For example, the “glider pusher” (from 1993) has the somewhat subtle (but useful) effect of “pushing” a glider by one cell when it goes past:

Another example (actually already known in 1971, and based on the period-15 “pentadecathlon” oscillator) is a glider reflector:

But a feature of this glider pusher and glider reflector is that they work only when both the glider and the stationary object are in a particular phase with respect to their periods. And this makes it very tricky to build larger structures out of these that operate correctly (and in many cases it wouldn’t be possible but for the commensurability of the period 30 of the original glider gun, and the period 15 of the glider reflector).

Could glider pushing and glider reflection be done more robustly? The answer turns out to be yes. Though it wasn’t until 2020 that the “bandersnatch” was created—a completely static structure that “pushes” gliders independent of their phase:

Meanwhile, in 2013 the “snark” had been created—which served as a phase-independent glider reflector:

One theme—to which we’ll return later—is that after certain functionality was first built in the Game of Life, there followed many “optimizations”, achieving that functionality more robustly, with smaller patterns, etc. An important methodology has revolved around so-called “hasslers”, which in effect allow one to “mine” small pieces of computational irreducibility, by providing “harnesses” that “rein in” behavior, typically returning patterns to their original states after they’ve done what one wants them to do.

So, for example, here’s a hassler (found, as it happens just on February 8, 2025!) that “harnesses” the first pattern we looked at above (that didn’t stabilize for 1103 steps) into an oscillator with period 80:

And based on this (indeed, later that same day) the most-compact-ever “spaceship gun” was constructed from this:

The Arc of Progress

We’ve talked about some of what it’s been possible to build in the Game of Life over the years. Now I want to talk about how that happened, or, in other words, the “arc of progress” in the Game of Life. And as a first indication of this, we can plot the number of new Life structures that have been identified each year (or, more specifically, the number of structures deemed significant enough to name, and to record in the LifeWiki database or its predecessors):

There’s an immediate impression of several waves of activity. And we can break this down into activity around various common categories of structures:

For oscillators we see fairly continuous activity for five decades, but with rapid acceleration recently. For “spaceships” and “guns” we see a long dry spell from the early 1970s to the 1990s, followed by fairly consistent activity since. And for conduits and reflectors we see almost nothing until sudden peaks of activity, in the mid-1990s and mid-2010s respectively.

But what was actually done to find all these structures? There have basically been two methods: construction and search. Construction is a story of “explicit engineering”—and of using human thought to build up what one wants. Search, on the other hand, is a story of automation—and of taking algorithmically generated (usually large) collections of possible patterns, and testing them to find ones that do what one wants. Particularly in more recent times it’s also become common to interleave these methods, for example using construction to build a framework, and then using search to find specific patterns that implement some feature of that framework.

When one uses construction, it’s like “inventing” a structure, and when one uses search, it’s like “discovering” it. So how much of each is being done in practice? Text mining descriptions of recently recorded structures the result is as follows—suggesting that, at least in recent times, search (i.e. “discovery”) has become the dominant methodology for finding new structures:

When the Game of Life was being invented, it wasn’t long before it was being run on computers—and people were trying to classify the things it could do. Still lifes and simple oscillators showed up immediately. And then—evolving from the (“R pentomino”) initial condition that we used at the beginning here—after 69 steps something unexpected showed up. In between complicated behavior that was hard to describe was a simple free-standing structure that just systematically moved—a “glider”:

Some other moving structures (dubbed “spaceships”) were also observed. But the question arose: could there be a structure that would somehow systematically grow forever? To find it involved a mixture of “discovery” and “invention”. In running from the (“R pentomino”) initial condition lots of things happen. But at step 785 it was noticed that there appeared the following structure:

For a while this structure (dubbed the “queen bee”) behaves in a fairly orderly way—producing two stable “beehive” structures (visible here as vertical columns). But then it “decays” into more complicated behavior:

But could this “discovered” behavior be “stabilized”? The answer was that, yes, if a “queen bee” was combined with two “blocks” it would just repeatedly “shuttle” back and forth:

What about two “queen bees”? Now whenever these collided there was a side effect: a glider was generated—with the result that the whole structure became a glider gun repeatedly producing gliders forever:

The glider gun was the first major example of a structure in the Game of Life that was found—at least in part—by construction. And within a year of it being found in November 1970, two more guns—with very similar methods of operation—had been found:

But then the well ran dry—and no further gun was found until 1990. Pretty much the same thing happened with spaceships: four were found in 1970, but no more were found until 1989. As we’ll discuss later, it was in a sense a quintessential story of computational irreducibility: there was no way to predict (or “construct”) what spaceships would exist; one just had to do the computation (i.e. search) to find out.

It was, however, easier to have incremental success with oscillators—and (as we’ll see) pretty much every year an oscillator with some new period was found, essentially always by search. Some periods were “long holdouts” (for example the first period-19 oscillator was found only in 2023), once again reflecting the effects of computational irreducibility.

Glider guns provided a source of “signals” for Life engineering. But what could one do with these signals? An important idea—that first showed up in the “breeder” in 1971—was “glider synthesis”: the concept that combinations of gliders could produce other structures. So, for example, it was found that three carefully-arranged gliders could generate a period-15 (“pentadecathlon”) oscillator:

It was also soon found that 8 gliders could make the original glider gun (the breeder made glider guns by a slightly more ornate method). And eventually there developed the conjecture that any structure that could be synthesized from gliders would need at most 15 gliders, carefully arranged at positions whose values effectively encoded the object to be constructed.

By the end of the 1970s a group of committed Life enthusiasts remained, but there was something of a feeling that “the low-hanging fruit had been picked”, and it wasn’t clear where to go next. But after a somewhat slow decade, work on the Game of Life picked up substantially towards the end of the 1980s. Perhaps my own work on cellular automata (and particularly the identification of class 4 cellular automata, of which the Game of Life is a 2D example) had something to do with. And no doubt it also helped that the fairly widespread availability of faster (“workstation class”) computers now made it possible for more people to do large-scale systematic searches. In addition, when the web arrived in the early 1990s it let people much more readily share results—and had the effect of greatly expanding and organizing the community of Life enthusiasts.

In the 1990s—along with more powerful searches that found new spaceships and guns—there was a burst of activity in constructing elaborate “machines” out of existing known structures. The idea was to start from a known type of “machine” (say a Turing machine), then to construct a Life implementation of it. The constructions were made particularly ornate by the need to make the phases of gliders, guns, etc. appropriately correspond. Needless to say, any Life configuration can be thought of as doing some computation. But the “machines” that were constructed were ones whose “purpose” and “functionality” was already well established in general computation, independent of the Game of Life.

If the 1990s saw a push towards “construction” in the Game of Life, the first decade of the 2000s saw a great expansion of search. Increasingly powerful cloud and distributed computing allowed “censuses” to be created of structures emerging from billions, then trillions of initial conditions. Mostly what was emphasized was finding new instances of existing categories of objects, like oscillators and spaceships. There were particular challenges, like (as we’ll discuss below) finding oscillators of any period (finally completely solved in 2023), or finding spaceships with different patterns of motion. Searches did yield what in censuses were usually called “objects with unusual growth”, but mostly these were not viewed as being of “engineering utility”, and so were not extensively studied (even though from the point of the “science of the Game of Life” they are, for example, perhaps the most revealing examples of computational irreducibility).

As had happened throughout the history of the Game of Life, some of the most notable new structures were created (sometimes over a long period of time) by a mixture of construction and search. For example, the “stably-reflect-gliders-without-regard-to-phase” snark—finally obtained in 2013—was the result of using parts of the (ultimately unstable) “simple-structures” construction from around 1998

and combining them with a hard-to-explain-why-it-works “still life” found by search:

Another example was the “Sir Robin knightship”—a spaceship that moves like a chess knight 2 cells down and 1 across. In 2017 a spaceship search found a structure that in 6 steps has many elements that make a knight move—but then subsequently “falls apart”:

But the next year a carefully orchestrated search was able to “find a tail” that “adds a fix” to this—and successfully produces a final “perfect knightship”:

By the way, the idea that one can take something that “almost works” and find a way to “fix it” is one that’s appeared repeatedly in the engineering history of the Game of Life. At the outset, it’s far from obvious that such a strategy would be viable. But the fact that it is seems to be similar to the story of why both biological evolution and machine learning are viable—which, as I’ve recently discussed, can be viewed as yet another consequence of the phenomenon of computational irreducibility.

One thing that’s happened many times in the history of the Game of Life is that at some point some category of structure—like a conduit—is identified, and named. But then it’s realized that actually there was something that could be seen as an instance of the same category of structure found much earlier, though without the clarity of the later instance, its significance wasn’t recognized. For example, in 1995 the “Herschel conduit” that moves a from one position to another (here in 64 steps) was discovered (by a search):

But then it was realized that—if looked at correctly—a similar phenomenon had actually already been seen in 1972, in the form of a structure that in effect takes if it is present, and “moves it” (in 28 steps) to a at a different position (albeit with a certain amount of “containable” other activity):

Looking at the plots above of the number of new structures found per year we see the largest peak after 2020. And, yes, it seems that during the pandemic people spent more time on the Game of Life—in particular trying to fill in tables of structures of particular types, for example, with each possible period.

But what about the human side of engineering in the Game of Life? The activity brought in people from many different backgrounds. And particularly in earlier years, they often operated quite independently, and with very different methods (some not even using a computer). But if we look at all “recorded structures” we can look at how many structures in total different people contributed, and when they made these contributions:

Needless to say—given that we’re dealing with an almost-60-year span—different people tend to show up as active in different periods. Looking at everyone, there’s a roughly exponential distribution to the number of (named) structures they’ve contributed. (Though note that several of the top contributors shown here found parametrized collections of structures and then recorded many instances.)

The Example of Oscillators

As a first example of systematic “innovation history” in the Game of Life let’s talk about oscillators. Here are the periods of oscillators that were found up to 1980:

As of 1980, many periods were missing. But in fact all periods are possible—though it wasn’t until 2023 that they were all filled in:

And if we plot the number of distinct periods (say below 60) found by a given year, we can get a first sense of the “arc of progress” in “oscillator technology” in the Game of Life:

Finding an oscillator of a given period is one thing. But how about the smallest oscillator of that period? We can be fairly certain that not all of these are known, even for periods below 30. But here’s a plot that shows when the progressive “smallest so far” oscillators were found for a given period (red indicates the first instance of a given period; blue the best result to date):

And here’s the corresponding plot for all periods up to 100:

But what about the actual reduction in size that’s achieved? Here’s a plot for each oscillator period showing the sequence of sizes found—in effect the “arc of engineering optimization” that’s achieved for that period:

So what are the actual patterns associated with these various oscillators? Here are some results (including timelines of when the patterns were found):

But how were these all found? The period-2 “blinker” was very obvious—showing up in evolution from almost any random initial condition. Some other oscillators were also easily found by looking at the evolution of particular, simple initial conditions. For example, a line of 10 black cells after 3 steps gives the period-15 “pentadecathlon”. Similarly, the period-3 “pulsar” emerges from a pair of length-5 blocks after 22 steps:

Many early oscillators were found by iterative experimentation, often starting with stable “still life” configurations, then perturbing them slightly, as in this period-4 case:

Another common strategy for finding oscillators (that we’ll discuss more below) was to take an “unstable” configuration, then to “stabilize” it by putting “robust” still lifes such as the “block” or the “eater” around it—yielding results like:

For periods that can be formed as LCMs of smaller periods one “construction-oriented” strategy has been to take oscillators with appropriate smaller periods, and combine them, as in:

In general, many different strategies have been used, as indicated for example by the sequence of period-3 oscillators that have been recorded over the years (where “smallest-so-far” cases are highlighted):

By the mid-1990s oscillators of many periods had been found. But there were still holdouts, like period 19 and for example pretty much all periods between 61 and 70 (except, as it happens, 66). At the time, though, all sorts of complicated constructions—say of prime generators—were nevertheless being done. And in 1996 it was figured out that one could in effect always “build a machine” (using only structures that had already been found two decades earlier) that would serve as an oscillator of any (sufficiently large) period (here 67)—effectively by “sending a signal around a loop of appropriate size”:

But by the 2010s, with large numbers of fast computers becoming available, there was again an emphasis on pure random search. A handful of highly efficient programs were developed, that could be run on anyone’s machine. In a typical case, a search might consist of starting, say, from a trillion randomly chosen initial conditions (or “soups”), identifying new structures that emerge, then seeing whether these act, for example, as oscillators. Typically any new discovery was immediately reported in online forums—leading to variations of it being tried, and new follow-on results often being reported within hours or days.

Many of the random searches started just from 16×16 regions of randomly chosen cells (or larger regions with symmetries imposed). And in a typical manifestation of computational irreducibility, many surprisingly small and “random-looking” (at least up to symmetries) results were found. So, for example, here’s the sequence of recorded period-16 oscillators with smaller-than-before cases highlighted:

Up through the 1990s results were typically found by a mixture of construction and small-scale search. But in 2016, results from large-scale random searches (sometimes symmetrical, sometimes not) started to appear.

The contrast between construction and search could be dramatic, like here for period 57:

One might wonder whether there could actually be a systematic, purely algorithmic way to find, say, possible oscillators of a given period. And indeed for one-dimensional cellular automata (as I noted in 1984), it turns out that there is. Say one considers blocks of cells of width w. Which block can follow which other is determined by a de Bruijn graph, or equivalently, a finite state machine. If one is going to have a pattern with period p, all blocks that appear in it must also be periodic with period p. But such blocks just form a subgraph of the overall de Bruijn graph, or equivalently, form another, smaller, finite state machine. And then all patterns with period p must correspond to paths through this subgraph. But how long are the blocks one has to consider?

In 1D cellular automata, it turns out that there’s an upper bound of 2^2p. But for 2D cellular automata—like the Game of Life—there is in general no such upper bound, a fact related to the undecidability of the 2D tiling problem. And the result is that there’s no complete, systematic algorithm to find oscillators in a general 2D cellular automaton, or presumably in the Game of Life.

But—as was actually already realized in the mid-1990s—it’s still possible to use algorithmic methods to “fill in” pieces of patterns. The idea is to define part of a pattern of a given period, then use this as a constraint on filling in the rest of it, finding “solutions” that satisfy the constraint using SAT-solving techniques. In practice, this approach has more often been used for spaceships than for oscillators (not least because it’s only practical for small periods). But one feature of it is that it can generate fairly large patterns with a given period.

Yet another method that’s been tried has been to generate oscillators by colliding gliders in many possible ways. But while this is definitely useful if one’s interested in what can be made using gliders, it doesn’t seem to have, for example, allowed people to find much in the way of interesting new oscillators.

Modularity

In traditional engineering a key strategy is modularity. Rather than trying to build something “all in one go”, the idea is to build a collection of independent subsystems, from which the whole system can then be assembled. But how does this work in the Game of Life? We might imagine that to identify the modular parts of a system, we’d have to know the “process” by which the system was put together, and the “intent” involved. But because in the Game of Life we’re ultimately just dealing with pure patterns of bits we can in effect just as well “come in at the end” and algorithmically figure out what pieces are operating as separate, modular parts.

So how can we do this? Basically what we want to find out is which parts of a pattern “operate independently” at a given step, in the sense that these parts don’t have any overlap in the cells they affect. Given that in the rules for the Game of Life a particular cell can affect any of the 9 cells in its neighborhood, we can say that black cells can only have “overlapping effects” if they are at most cell units apart. So then we can draw a “nearest neighbor graph” that shows which cells are connected in this sense:

But what about the whole evolution? We can draw what amounts to a causal graph that shows the causal connections between the “independent modular parts” that exist at each step:

And given this, we can summarize the “modular structure” of this particular oscillator by the causal graph:

Ultimately all that matters in the “overall operation” of the oscillator is the partial ordering defined by this graph. Parts that appear “horizontally separated” (or, more precisely, in antichains, or in physics terminology, spacelike separated) can be generated independently and in parallel. But parts that follow each other in the partial order need to be generated in that order (i.e. in physics terms, they are timelike separated).

As another example, let’s look at graphs for the various oscillators of period 16 that we showed above:

What we see is that the early period-16 oscillators were quite modular, and had many parts that in effect operated independently. But the later, smaller ones were not so modular. And indeed the last one shown here had no parts that could operate independently; the whole pattern had to be taken together at each step.

And indeed, what we’ll often see is that the more optimized a structure is, the less modular it tends to be. If we’re going to construct something “by hand” we usually need to assemble it in parts, because that’s what allows us to “understand what we’re doing”. But if, for example, we just find a structure in a search, there’s no reason for it to be “understandable”, and there’s no reason for it to be particularly modular.

Different steps in a given oscillator can involve different numbers of modular parts. But as a simple way to assess the “modularity” of an oscillator, we can just ask for the average number of parts over the course of one period. So as an example, here are the results for period-30 oscillators:

Later, we’ll discuss how we can use the level of modularity to assess whether a pattern is likely to have been found by a search or by construction. But for now, this shows how the modularity index has varied over the years for the best known progressively smaller oscillators of a given period—with the main conclusion being that as the oscillators get optimized for size, so also their modularity index tends to decrease:

Gliders & Spaceships

Oscillators are structures that cycle but do not move. “Gliders” and, more generally, “spaceships” are structures that move every time they cycle. When the Game of Life was first introduced, four examples of these (all of period 4) were found almost immediately (the last one being the result of trying to extend the one before it):

Within a couple of years, experimentation had revealed two variants, with periods 12 and 20 respectively, involving additional structures:

But after that, for nearly two decades, no more spaceships were found. In 1989, however, a systematic method for searching was invented, and in the years since, a steady stream of new spaceships have been found. A variety of different periods have been seen

as well as a variety of speeds (and three different angles):

The forms of these spaceships are quite diverse:

Some are “tightly integrated”, while some have many “modular pieces”, as revealed by their causal graphs:

Period-96 spaceships provide an interesting example of the “arc of progress” in the Game of Life. Back in 1971, a systematic enumeration of small polyominoes was done, looking for one that could “reproduce itself”. While no polyomino on its own seemed to do this, a case was found where part of the pattern produced after 48 steps seemed to reappear repeatedly every 48 steps thereafter:

One might expect this repeated behavior to continue forever. But in a typical manifestation of computational irreducibility, it doesn’t, instead stopping its “regeneration” after 24 cycles, and then reaching a steady state (apart from “radiated” gliders) after 3911 steps:

But from an engineering point of view this kind of complexity was just viewed as a nuisance, and efforts were made to “tame” and avoid it.

Adding just one still-life block to the so-called “switch engine”

produces a structure that keeps generating a “periodic wake” forever:

But can this somehow be “refactored” as a “pure spaceship” that doesn’t “leave anything behind”? In 1991 it was discovered that, yes, there was an arrangement of 13 switch engines that could successfully “clean up behind themselves”, to produce a structure that would act as a spaceship with period 96:

But could this be made simpler? It took many years—and tests of many different configurations—but in the end it was found that just 2 switch engines were sufficient:

Looking at the final pattern in spacetime gives a definite impression of “narrowly contained complexity”:

What about the causal graphs? Basically these just decrease in “width” (i.e. number of independent modular parts) as the number of engines decreases:

Like many other things in Game-of-Life engineering, both search and construction have been used to find spaceships. As an extreme example of construction let’s talk about the case of spaceships with speed 31/240. In 2013, an analog of the switch engine above was found—which “eats” blocks 31 cells apart every 240 steps:

But could this be turned into a “self-sufficient” spaceship? A year later an almost absurdly large (934852×290482) pattern was constructed that did this—by using streams of gliders and spaceships (together with dynamically assembled glider guns) to create appropriate blocks in front, and remove them behind (along with all the “construction equipment” that was used):

By 2016, a pattern with about 700× less area had been constructed. And now, just a few weeks ago, a pattern with 1300× less area (11974×45755) was constructed:

And while this is still huge, it’s still made of modular pieces that operate in an “understandable” way. No doubt there’s a much smaller pattern that operates as a spaceship of the same speed, but—computational irreducibility being what it is—we have no idea how large the pattern might be, or how we might efficiently search for it.

Glider Guns

What can one engineer in the Game of Life? A crucial moment in the development of Game-of-Life engineering was the discovery of the original glider gun in 1970. And what was particularly important about the glider gun is that it was a first example of something that could be thought of as a “signal generator”—that one could imagine would allow one to implement electrical-engineering-style “devices” in the Game of Life.

The original glider gun produces gliders every 30 steps, in a sense defining a “clock speed” of 1/30 for any “circuit” driven by it. Within a year after the original glider gun, two other “slower” glider guns had also been discovered

both working on similar principles, as suggested by their causal graphs:

It wasn’t until 1990 that any additional “guns” were found. And in the years since, a sequence of guns have been found, with a rather wide range of distinct periods:

Some of the guns found have very long periods:

But as part of the effort to do constructions in the 1990s a gun was constructed that had overall period 210, but which interwove multiple glider streams to ultimately produce gliders every 14 steps (which is the maximum rate possible, while avoiding interference of successive gliders):

Over the years, a whole variety of different glider guns have been found. Some are in effect “thoroughly controlled” constructions. Others are more based on some complex process that is reined in to the point where it just produces a stream of gliders and nothing more:

An example of a somewhat surprising glider gun—with the shortest “true period” known—was found in 2024:

The causal graph for this glider gun shows a mixture of irreducible “search-found” parts, together with a collection of “well-known” small modular parts:

By the way, in 2013 it was actually found possible to extend the construction for oscillators of any period to a construction for guns of any period (or at least any period above 78):

In addition to having streams of gliders, it’s also sometimes been found useful to have streams of other “spaceships”. Very early on, it was already known that one could create small spaceships by colliding gliders:

But by the mid-1990s it had been found that direct “spaceship guns” could also be made—and over the years smaller and smaller “optimized” versions have been found:

The last of these—from just last month—has a surprisingly simple structure, being built from components that were already known 30 years ago, and having a causal graph that shows very modular construction:

Building from History

We’ve talked about some of the history of how specific patterns in the Game of Life were found. But what about the overall “flow of engineering progress”? And, in particular, when something new is found, how much does it build on what has been found before? In real-world engineering, things like patent citations potentially give one an indication of this. But in the Game of Life one can approach the question much more systematically and directly, just asking what configurations of bits from older patterns are used in newer ones.

As we discussed above, given a pattern such as

we can pick out its “modular parts”, here rotated to canonical orientations:

Then we can see if these parts correspond to (any phase of) previously known patterns, which in this case they all do:

So now for all structures in the database we can ask what parts they involve. Here’s a plot of the overall frequencies of these parts:

It’s notable that the highest-ranked part is a so-called “eater” that’s often used in constructions, but occurs only quite infrequently in evolution from random initial conditions. It’s also notable that (for no particularly obvious reason) the frequency of the n^th most common structure is roughly 1/n.

So when were the various structures that appear here first found? As this picture shows, most—but not all—were found very early in the history of the Game of Life:

In other words, most of the parts used in structures from any time in the history of the Game of Life come from very early in its history. Or, in effect, structures typically go “back to basics” in the parts they use.

Here’s a more detailed picture, showing the relative amount of use of each part in structures from each year:

There are definite “fashions” to be seen here, with some structures “coming into fashion” for a while (sometimes, but not always, right after they were first found), and then dropping out.

One might perhaps imagine that smaller parts (i.e. ones with smaller areas) would be more popular than larger ones. But plotting areas of parts against their rank, we see that there are some large parts that are quite common, and some small ones that are rare:

We’ve seen that many of the most popular parts overall are ones that were found early in the history of the Game of Life. But plenty of distinct modular parts were also found much later. This shows the number of distinct new modular parts found across all patterns in successive years:

Normalizing by the number of new patterns found each year, we see a general gradual increase in the relative number of new modular parts, presumably reflecting the greater use of search in finding patterns, or components of patterns:

But how important have these later-found modular parts been? This shows the total rate at which modular parts found in a given year were subsequently used—and what we see, once again, is that parts found early are overwhelmingly the ones that are subsequently used:

A somewhat complementary way to look at this is to ask of all patterns found in a given year, how many are “purely de novo”, in the sense that they use no previously found modular parts (as indicated in red), and how many use previously found parts:

A cumulative version of this makes it clear that in early years most patterns are purely de novo, but later on, there’s an increasing amount of “reuse” of previously found parts—or, in other words, in later years the “engineering history” is increasingly important:

It should be said, however, that if one wants the full story of “what’s being used” it’s a bit more nuanced. Because here we’re always treating each modular part of each pattern as a separate entity, so that we consider any given pattern to “depend” only on base modular parts. But “really” it could depend on another whole structure, itself built of many modular parts. And in what we’re doing here, we’re not tracking that hierarchy of dependencies. Were we to do so, we would likely be able to see more complex “technology stacks” in the Game of Life. But instead we’re always “going down to the primitives”. (If we were dealing with electronics it’d be like asking “What are the transistors and capacitors that are being used?”, rather than “What is the caching architecture, or how is the floating point unit set up?”)

OK, but in terms of “base modular parts” a simple question to ask is how many get used in each pattern. This shows the number of (base) modular parts in patterns found in each year:

There are always a certain number of patterns that just consist of a single modular part—and, as we saw above, that was more common earlier in the history of the Game of Life. But now we also see that there have been an increasing number of patterns that use many modular parts—typically reflecting a higher degree of “construction” (rather than search) going on.

By the way, for comparison, these plots show the total areas and the numbers of (black) cells in patterns found in each year; both show increases early on, but more or less level off by the 1990s:

But, OK, if we look across all patterns in the database, how many parts do they end up using? Here’s the overall distribution:

At least for a certain range of numbers of parts, this falls roughly exponentially, reflecting the idea that it’s been exponentially less likely for people to come up with (or find) patterns that have progressively larger numbers of distinct modular parts.

How has this changed over time? This shows a cumulative plot of the relative frequencies with which different numbers of modular parts appear in patterns up to a given year

indicating that over time the distribution of the number of modular parts has gotten progressively broader—or, in other words, as we’ve seen in other ways above, more patterns make use of larger numbers of modular parts.

We’ve been looking at all the patterns that have been found. But we can also ask, say, just about oscillators. And then we can ask, for example, which oscillators (with which periods) contain which others, as in:

And looking at all known oscillators we can see how common different “oscillator primitives” are in building up other oscillators:

We can also ask in which year “oscillator primitives” at different ranks were found. Unlike in the case of all structures above, we now see that some oscillator primitives that were found only quite recently appear at fairly high ranks—reflecting the fact that in this case, once a primitive has been found, it’s often immediately useful in making oscillators that have multiples of its period:

Lifetime Hacking and Die Hards

We can think of almost everything we’ve talked about so far as being aimed at creating structures (like “clocks” and “wires”) that are recognizably useful for building traditional “machine-like” engineering systems. But a different possible objective is to find patterns that have some feature we can recognize, whether with obvious immediate “utility” or not. And as one example of this we can think about finding so-called “die hard” patterns that live as long as possible before dying out.

The phenomenon of computational irreducibility tells us that even given a particular pattern we can’t in general “know in advance” how long it’s going to take to die out (or if it ultimately dies out at all). So it’s inevitable that the problem of finding ultimate die-hard patterns can be unboundedly difficult, just like analogous problems for other computational systems (such as finding so-called “busy beavers” in Turing machines).

But in practice one can use both search and construction techniques to find patterns that at least live a long time (even if not the very longest possible time). And as an example, here’s a very simple pattern (found by search) that lives for 132 steps before dying out (the “puff” at the end on the left is a reflection of how we’re showing “trails”; all the actual cells are zero at that point):

Searching nearly 10¹⁶ randomly chosen 16×16 patterns (out of a total of ≈ 10⁷⁷ possible such patterns), the longest lifetime found is 1413 steps—achieved with a rather random-looking initial pattern:

But is this the best one can do? Well, no. Just consider a block and a spaceship n cells apart. It’ll take 2n steps for them to collide, and if the phases are right, annihilate each other:

So by picking the separation n to be large enough, we can make this configuration “live as long as we want”. But what if we limit the size of the initial pattern, say to 32×32? In 2022 the following pattern was constructed:

And this pattern is carefully set up so that after 30,274 steps, everything lines up and it dies out, as we can see in the (vertically foreshortened) spacetime diagram on the left:

And, yes, the construction here clearly goes much further than search was able to reach. But can we go yet further? In 2023 a 116×86 pattern was constructed

that it was proved eventually dies out, but only after the absurdly large number of 17↑↑↑3 steps (probably even much larger than the number of emes in the ruliad), as given by:

The Comparison with Adaptive Evolution

There are some definite rough ways in which technology development parallels biological evolution. Both involve the concept of trying out possibilities and building on ones that work. But technology development has always ultimately been driven by human effort, whereas biological evolution is, in effect, a “blind” process, based on the natural selection of random mutations. So what happens if we try to apply something like biological evolution to the Game of Life? As an example, let’s look at adaptive evolution that’s trying to maximize finite lifetime based on making a sequence of random point mutations within an initially random 16×16 pattern. Most of those mutations don’t give patterns with larger (finite) lifetimes, but occasionally there’s a “breakthrough” and the lifetime achieved so far jumps up:

The actual behaviors corresponding to the breakthroughs in this case are:

And here are some other outcomes from adaptive evolution:

In almost all cases, a limited number of steps of adaptive evolution do succeed in generating patterns with fairly long finite lifetimes. But the behavior we see typically shows no “readily understandable mechanisms”—and no obviously separable modular parts. And instead—just like in my recent studies of both biological evolution and machine learning—what we get are basically “lumps of irreducible computation” that “just happen” to show what we’re looking for (here, long lifetime).

Invented or Discovered? Made for a Purpose at All?

Let’s say we’re presented with an array of cells that’s an initial condition for the Game of Life. Can we tell “where it came from”? Is it “just arbitrary” (or “random”)? Or was it “set up for a purpose”? And if it was “set up for a purpose”, was it “invented” (and “constructed”) for that purpose, or was it just “discovered” (say by a search) to fulfill that purpose?

Whether one’s dealing with archaeology, evolutionary biology, forensic science, the identification of alien intelligence or, for that matter, theology, the question of whether something “was set up for a purpose” is a philosophically fraught one. Any behavior one sees one can potentially explain either in terms of the mechanism that produces it, or in terms of what it “achieves”. Things get a little clearer if we have a particular language for describing both mechanisms and purposes. Then we can ask questions like: “Is the behavior we care about more succinctly described in terms of its mechanism or its purpose?” So, for example, “It behaves as a period-15 glider gun” might be an adequate purpose-oriented description, that’s much shorter than a mechanism-oriented description in terms of arrangements of cells.

But what is the appropriate “lexicon of purposes” for the Game of Life? In effect, that’s a core question for Game-of-Life engineering. Because what engineering—and technology in general—is ultimately about is taking whatever raw material is available (whether from the physical world, or from the Game of Life) and somehow fashioning it into something that aligns with human purposes. But then we’re back to what counts as a valid human purpose. How deeply does the purpose have to connect in to everything we do? Is it, for example, enough for something to “look nice”, or is that not “utilitarian enough”? There aren’t absolute answers to these questions. And indeed the answers can change over time, as new uses for things are discovered (or invented).

But for the Game of Life we can start with some of the “purposes” we’ve discussed here—like “be an oscillator of a certain period”, “reflect gliders”, “generate the primes” or even just “die after as long as possible”. Let’s say we just start enumerating possible initial patterns, either randomly, or exhaustively. How often will we come across patterns that “achieve one of these purposes”? And will it “only achieve that purpose” or will it also “do extra stuff” that “seems irrelevant”?

As an example, consider enumerating all possible 3×3 patterns of cells. There are altogether 2⁹ = 512 such patterns, or just 102 after reducing by symmetries. Of these, 79 evolve to purely static patterns, while 17 evolve to patterns of period 2. But only one (in two phases) is “immediately of period 2”—and so might reasonably be immediately interpreted as having been “created for the purpose of being of period 2”:

Other patterns can take a while to “become period 2”, but then at least give “pure period-2 objects”. And for example this one can be interpreted as being the smallest precursor, and taking the least time, to reach the period-2 object it produces:

There are other cases that “get to the same place” but seem to “wander around” doing so, and therefore don’t seem as convincing as having been “created for the purpose of making a period-2 oscillator”:

Then there are much more egregious cases. Like

which after 173 steps gives

but only after going through all sorts of complicated intermediate behavior

that definitely doesn’t make it look like it’s going “straight to its purpose” (unless perhaps its purpose is to produce that final pattern from the smallest initial precursor, etc.).

But, OK. Let’s imagine we have a pattern that “goes straight to” some “recognizable purpose” (like being an oscillator of a certain period). The next question is: was that pattern explicitly constructed with an understanding of how it would achieve its purpose, or was it instead “blindly found” by some kind of search?

As an example, let’s look at some period-9 oscillators:

One like

seems like it must have been constructed out of “existing parts”, while one like

seems like it could only plausibly have been found by a search.

Spacetime views don’t tell us much in these particular cases:

But causal graphs are much more revealing:

They show that in the first case there are lots of “factored modular parts”, while in the second case there’s basically just one “irreducible blob” with no obvious separable parts. And we can view this as an immediate signal for “how human” each pattern is. In a sense it’s a reflection of the computational boundedness of our minds. When there are factored modular parts that interact fairly rarely and each behave in a fairly simple way, it’s realistic for us to “get our minds around” what’s going on. But when there’s just an “irreducible blob of activity” we’d have to compute too much and keep too much in mind at once for us to be able to really “understand what’s going on” and for example produce a human-level narrative explanation of it.

If we find a pattern by search, however, we don’t really have to “understand it”; it’s just something we computationally “discover out there in the computational universe” that “happens” to do what we want. And, indeed, as in the example here, it often does what it does in a quite minimal (if incomprehensible) way. Something that’s found by human effort is much less likely to be minimal; in effect it’s at least somewhat “optimized for comprehensibility” rather than for minimality or ease of being found by search. And indeed it will often be far too big (e.g. in terms of number of cells) for any pure exhaustive or random search to plausibly find it—even though the “human-level narrative” for it might be quite short.

Here are the causal graphs for all the period-9 oscillators from above:

Some we can see can readily be broken down into multiple rarely interacting distinct components; others can’t be decomposed in this kind of way. And in a first approximation, the “decomposable” ones seem to be precisely those that were somehow “constructed by human effort”, while the non-decomposable ones seem to be those that were “discovered by searches”.

Typically, the way the “constructions” are done is to start with some collection of known parts, then, by trial and error (sometimes computer assisted) see how these can be fit together to get something that does what one wants. Searches, on the other hand, typically operate on “raw” configurations of cells, blindly going through a large number of possible configurations, at every stage automatically testing whether one’s got something that does what one wants.

And in the end these different strategies reveal themselves in the character of the final patterns they produce, and in the causal graphs that represent these patterns and their behavior.

Principles of Engineering Strategy from the Game of Life

In engineering as it’s traditionally been practiced, the main emphasis tends to be on figuring out plans, and then constructing things based on those plans. Typically one starts from components one has, then tries to figure out how to combine them to incrementally build up what one wants.

And, as we’ve discussed, this is also a way of developing technology in the Game of Life. But as we’ve discussed at length, it’s not the only way. Another way is just to search for whole pieces of technology one wants.

Traditional intuition might make one assume this would be hopeless. But the repeated lesson of my discoveries about simple programs—as well as what’s been done with the Game of Life—is that actually it’s often not hopeless at all, and instead it’s very powerful.

Yes, what you get is not likely to be readily “understandable”. But it is likely to be minimal and potentially quite optimal for whatever it is that it does. I’ve often talked of this approach as “mining from the computational universe”. And over the course of many years I’ve had success with it in all sorts of disparate areas. And now, here, we’ve see in the Game of Life a particularly clean example where search is used alongside construction in developing technology.

It’s a feature of things produced by construction that they are “born understandable”. In effect, they are computationally reducible enough that we can “fit them in our finite minds” and “understand them”. But things found by search don’t have this feature. And most of the time the behavior they’ll show will be full of computational irreducibility.

In both biological evolution and machine learning my recent investigations suggest that most of what we’re seeing are “lumps of irreducible computation” found at random that just “happen to achieve the necessary objectives”. This hasn’t been something familiar in traditional engineering, but it’s something tremendously powerful. And from the examples we’ve seen here in the Game of Life it’s clear that it can often achieve things that seem completely inaccessible by traditional methods based on explicit construction.

At first we might assume that irreducible computation is too unruly and unpredictable to be useful in achieving “understandable objectives”. But if we find just the right piece of irreducible computation then it’ll achieve the objective we want, often in a very minimal way. And the point is that the computational universe is in a sense big enough that we’ll usually be able to find that “right piece of irreducible computation”.

One thing we see in Game-of-Life engineering is something that’s in a sense a compromise between irreducible computation and predictable construction. The basic idea is to take something that’s computationally irreducible, and to “put it in a cage” that constrains it to do what one wants. The computational irreducibility is in a sense the “spark” in the system; the cage provides the control we need to harness that spark in a way that meets our objectives.

Let’s look at some examples. As our “spark” we’ll use the R pentomino that we discussed at the very beginning. On its own, this generates all sorts of complex behavior—that for the most part doesn’t align with typical objectives we might define (though as a “side show” it does happen to generate gliders). But the idea is to put constraints on the R pentomino to make it “useful”.

Here’s a case where we’ve tried to “build a road” for the R pentomino to go down:

And looking at this every 18 steps we see that, at least for a while, the R pentomino has indeed moved down the road. But it’s also generated something of an “explosion”, and eventually this explosion catches up, and the R pentomino is destroyed.

So can we maintain enough control to let the R pentomino survive? The answer is yes. And here, for example, is a period-12 oscillator, “powered” by an R pentomino at its center:

Without the R pentomino, the structure we’ve set up cycles with period 6:

And when we insert the R pentomino this structure “keeps it under control”—so that the only effect it ultimately has is to double the period, t0 12.

Here’s a more dramatic example. Start with a static configuration of four so-called “eaters”:

Now insert two R pentominoes. They’ll start doing their thing, generating what seems like quite random behavior. But the “cage” defined by the “eaters” limits what can happen, and in the end what emerges is an oscillator—that has period 129:

What else can one “make R pentominoes do”? Well, with appropriate harnesses, they can for example be used to “power” oscillators with many different periods:

“Be an oscillator of a certain period” is in a sense a simple objective. But what about more complex objectives? Of course, any pattern of cells in the Game of Life will do something. But the question is whether that something aligns with technological objectives we have.

Generically, things in the Game of Life will behave in computationally irreducible ways. And it’s this very fact that gives such richness to what can be done with the Game of Life. But can the computational irreducibility be controlled—and harnessed for technological purposes? In a sense that is the core challenge of engineering in both the Game of Life, and in the real world. (It’s also rather directly the challenge we face in making use of the computational power of AI, but still adequately aligning it with human objectives.)

As we look at the arc of technological development in the Game of Life we see over the course of half a century all sorts of different advances being made. But will there be an end to this? Will we eventually run out of inventions and discoveries? The underlying presence of computational irreducibility makes it clear that we will not. The only thing that might end is the set of objectives we’re trying to meet. We now know how to make oscillators of any period. And unless we insist on for example finding the smallest oscillator of a given period, we can consider the problem of finding oscillators solved, with nothing more to discover.

In the real world nature and the evolution of the universe inevitably confront us with new issues, which lead to new objectives. In the Game of Life—as in any other abstract area, like mathematics—the issue of defining new objectives is up to us. Computational irreducibility leads to infinite diversity and richness of what’s possible. The issue for us is to figure out what direction we want to go. And the story of engineering and technology in the Game of Life gives us, in effect, a simple model for the issues we confront in other areas of technology, like AI.

Some Personal Backstory

I’m not sure if I made the right decision back in 1981. I had come up with a very simple class of systems and was doing computer experiments on them, and was starting to get some interesting results. And when I mentioned what I was doing to a group of (then young) computer scientists they said “Oh, those things you’re studying are called cellular automata”. Well, actually, the cellular automata they were talking about were 2D systems while mine were 1D. And though that might seem like a technical difference, it has a big effect on one’s impression of what’s going on—because in 1D one can readily see “spacetime histories” that gave an immediate sense of the “whole behavior of the system”, while in 2D one basically can’t.

I wondered what to call my models. I toyed with the term “polymones”—as a modernized nod to Leibniz’s monads. But in the end I decided that I should stick with a simpler connection to history, and just call my models, like their 2D analogs, “cellular automata”. In many ways I’m happy with that decision. Though one of its downsides has been a certain amount of conceptual confusion—more than anything centered around the Game of Life.

People often know that the Game of Life is an example of a cellular automaton. And they also know that within the Game of Life lots of structures (like gliders and glider guns) can be set up to do particular things. Meanwhile, they hear about my discoveries about the generation of complexity in cellular automata (like rule 30). And somehow they conflate these things—leading to all too many books etc. that show pictures of simple gliders in the Game of Life and say “Look at all this complexity!”

At some level it’s a confusion between science and engineering. My efforts around cellular automata have centered on empirical science questions like “What does this cellular automaton do if you run it?” But—as I’ve discussed at length above—most of what’s been done with the Game of Life has centered instead on questions of engineering, like “What recognizable (or useful) structures can you build in the system?” It’s a different objective, with different results. And, in particular, by asking to “engineer understandable technology” one’s specifically eschewing the phenomenon of computational irreducibility—and the whole story of the emergence of complexity that’s been so central to my own scientific work on cellular automata and so much else.

Many times over the years, people would show me things they’d been able to build in the Game of Life—and I really wouldn’t know what to make of them. Yes, they seemed like impressive hacks. But what was the big picture? Was this just fun, or was there some broader intellectual point? Well, finally, not long ago I realized: this is not a story of science, it’s a story about the arc of engineering, or what one can call “metaengineering”.

And back in 2018, in connection with the upcoming 50th anniversary of the Game of Life, I decided to see what I could figure out about this. But I wasn’t satisfied with how far I got, and other priorities interceded. So—beyond one small comment that ended up in a 2020 New York Times article—I didn’t write anything about what I’d done. And the project languished. Until now. When somehow my long-time interest in “alien engineering”, combined with my recent results about biological evolution coalesced into a feeling that it was time to finally figure out what we could learn from all that effort that’s been put into the Game of Life.

In a sense this brings closure to a very long-running story for me. The first time I heard about the Game of Life was in 1973. I was an early teenager then, and I’d just gotten access to a computer. By today’s standards the computer (an Elliott 903C) was a primitive one: the size of a desk, programmed with paper tape, with only 24 kilobytes of memory. I was interested in using it for things like writing a simulator for the physics of idealized gas molecules. But other kids who had access to the computer were instead more interested (much as many kids might be today) in writing games. Someone wrote a “Hunt the Wumpus” game. And someone else wrote a program for the “Game of Life”. The configurations of cells at each generation were printed out on a teleprinter. And for some reason people were particularly taken with the “Cheshire cat” configuration, in which all that was left at the end (as in Alice in Wonderland) was a “smile”. At the time, I absolutely didn’t see the point of any of this. I was interested in science, not games, and the Game of Life pretty much lost me at “Game”.

For a number of years I didn’t have any further contact with the Game of Life. But then I met Bill Gosper, who I later learned had in 1970 discovered the glider gun in the Game of Life. I met Gosper first “online” (yes, even in 1978 that was a thing, at least if you used the MIT-MC computer through the ARPANET)—then in person in 1979. And in 1980 I visited him at Xerox PARC, where he described himself as part of the “entertainment division” and gave me strange math formulas printed on a not-yet-out-of-the-lab color laser printer

and also showed me a bitmapped display (complete with GUI) with lots of pixels dancing around that he enthusiastically explained were showing the Game of Life. Knowing what I know now, I would have been excited by what I saw. But at the time, it didn’t really register.

Still, in 1981, having started my big investigation of 1D cellular automata, and having made the connection to the 2D case of the Game of Life, I started wondering whether there was something “scientifically useful” that I could glean from all the effort I knew (particularly from Gosper) had been put into Life. It didn’t help that almost none of the output of that effort had been published. And in those days before the web, personal contact was pretty much the only way to get unpublished material. One of my larger “finds” was from a friend of mine from Oxford who passed on “lab notebook pages” he’d got from someone who was enumerating outcomes from different Game-of-Life initial configurations:

And from material like this, as well as my own simulations, I came up with some tentative “scientific conclusions”, which I summarized in 1982 in a paragraph in my first big paper about cellular automata:

But then, at the beginning of 1983, as part of my continuing effort to do science on cellular automata, I made a discovery. Among all cellular automata there seemed to be four basic classes of behavior, with class 4 being characterized by the presence of localized structures, sometimes just periodic, and sometimes moving:

I immediately recognized the analogy to the Game of Life, and to oscillators and gliders there. And indeed this analogy was part of what “tipped me off” to thinking about the ubiquitous computational capabilities of cellular automata, and to the phenomenon of computational irreducibility.

Meanwhile, in March 1983, I co-organized what was effectively the first-ever conference on cellular automata (held at Los Alamos)—and one of the people I invited was Gosper. He announced his Hashlife algorithm (which was crucial to future Life research) there, and came bearing gifts: printouts for me of Life, that I annotated, and still have in my archives:

I asked Gosper to do some “more scientific” experiments for me—for example starting from a region of randomness, then seeing what happened:

But Gosper really wasn’t interested in what I saw as being science; he wanted to do engineering, and make constructions—like this one he gave me, showing two glider guns exchanging streams of gliders (why would one care, I wondered):

I’d mostly studied 1D cellular automata—where I’d discovered a lot by systematically looking at their behavior “laid out in spacetime”. But in early 1984 I resolved to also systematically check out 2D cellular automata. And mostly the resounding conclusion was that their basic behavior was very similar to 1D. Out of all the rules we studied, the Game of Life didn’t particularly stand out. But—mostly to provide a familiar comparison point—I included pictures of it in the paper we wrote:

And we also went to the trouble of making a 3D “spacetime” picture of the Game of Life on a Cray supercomputer—though it was too small to show anything terribly interesting:

It had been a column in Scientific American in 1970 that had first propelled the Game of Life to public prominence—and that had also launched the first great Life engineering challenge of finding a glider gun. And in both 1984 and 1985 a successor to that very same column ran stories about my 1D cellular automata. And in 1985, in collaboration with Scientific American, I thought it would be fun and interesting to reprise the 1970 glider gun challenge, but now for 1D class 4 cellular automata:

Many people participated. And my main conclusion was: yes, it seemed like one could do the same kinds of engineering in typical 1D class 4 cellular automata as one could in the Game of Life. But this was all several years before the web, and the kind of online community that has driven so much Game of Life engineering in modern times wasn’t yet able to form.

Meanwhile, by the next year, I was starting the development of Mathematica and what’s now the Wolfram Language, and for a few years didn’t have much time to think about cellular automata. But in 1987 when Gosper got involved in making pre-release demos of Mathematica he once again excitedly told me about his discoveries in the Game of Life, and gave me pictures like:

It was in 1992 that the Game of Life once again appeared in my life. I had recently embarked on what would become the 10-year project of writing my book A New Kind of Science. I was working on one of the rather few “I already have this figured out” sections in the book—and I wanted to compare class 4 behavior in 1D and 2D. How was I to display the Game of Life, especially in a static book? Equipped with what’s now the Wolfram Language it was easy to come up with visualizations—looking “out” into a spacetime slice with more distant cells “in a fog”, as well as “down” into a fog of successive states:

And, yes, it was immediately striking how similar the spacetime slice looked to my pictures of 1D class 4 cellular automata. And when I wrote a note for the end of the book about Life, the correspondence became even more obvious. I’d always seen the glider gun as a movie. But in a spacetime slice it “made much more sense”, and looked incredibly similar to analogous structures in 1D class 4 cellular automata:

In A New Kind of Science I put a lot of effort into historical notes. And as a part of such a note on “History of cellular automata” I had a paragraph about the Game of Life:

I first met John Conway in September 1983 (at a conference in the south of France). As I would tell his biographer many years later, my relationship with Conway was complicated from the start. We were both drawn to systems defined by very simple rules, but what we found interesting about them was very different. I wanted to understand the big picture and to explore science-oriented questions (and what I would now call ruliology). Conway, on the other hand, was interested in specific, often whimsically presented results—and in questions that could be couched as mathematical theorems.

In my conversations with Conway, the Game of Life would sometimes come up, but Conway never seemed too interested in talking about it. In 2001, though, when I was writing my note about the history of 2D cellular automata, I spent several hours specifically asking Conway about the Game of Life and its history. At first Conway told me the standard origin story that Life had arisen as a kind of game. A bit later he said he’d at the time just been hired as a logic professor, and had wanted to use Life as a simple way to enumerate the recursive functions. In the end, it was hard to disentangle true recollections from false (or “elaborated”) ones. And, notably, when asked directly about the origin of the specific rules of Life, he was evasive. Of course, none of that should detract from Conway’s achievement in the concept of the Game of Life, and in the definition of the hacker-like culture around it—the fruits of which have now allowed me to do what I’ve done here.

For many years after the publication of A New Kind of Science in 2002, I didn’t actively engage with the Game of Life—though I would hear from Life enthusiasts with some frequency, but none as much as Gosper, from whom I was a recipient of hundreds of messages about Life, a typical example from 2017 concerning

and saying:

Subject: Re: Unlimited(?) novelty from just two LWSS backrakes

Novelty is mediated by the sporadic glider gas (which forms very sparse
beams), sporadic debris (forming sparse lines), and is hidden in sporadic
defects in the denser beams and lines. At this scale, each screen pixel
represents 262144 x 262144 Life cells. Thus very sparse lines, e.g. density
10^-5, appear solid, while being very nearly transparent to gliders.

After 3.4G, (sparse) new glider beams are still fading up. The beams
repeatedly strafe the x and y axis stalagmites.

I suspect this will (very) eventually lead to a positive density of
switch-engines, and thus quadratic population growth.

⋮

Finally, around 4.2G, an eater1 (fish hook):

Depending on background novelty radiation, there ought to be one of
these every few billion, all lying on a line through the origin.

⋮

With much help from Tom R, I slogged to 18G, with *zero* new nonmovers
in the 4th quadrant, causing me to propose a mechanism that precluded
future new ones. But then Andrew Trevorrow fired up his Big Mac (TM),
ran 60G, and found three new nonmovers! They are, respectively, a mirror
image(!) of the 1st eater, and two blinkers, in phase, but not aligned with
the origin. I.e., all four are "oners'", or at least will lie on different
trash trails.

I’m still waiting for one of these to sprout switch-engines and begin quadratic
growth. But here’s a puzzle: Doesn’t the gas of sparse gliders (actually glider
packets) in the diagonal strips athwart the 1st quadrant already reveal (small
coefficient) quadratic growth? Which will *eventually* dominate? The area of the
strips is increasing quadratically. Their density *appears* to be at least holding,
but possibly along only one axis. I don’t see where quadratically many gliders could
arise. They’re being manufactured at a (roughly) fixed rate. Imagine the above
picture in the distant future. Where is the amplification that will keep those
strips full? ‐‐Bill

Does it just happen to come out that way, or was it somehow made to be that way? It was a big shock to my intuition at the beginning of the 1980s when I began to see all the richness that even very simple programs can produce. And it made me start to wonder about all our technological and other achievements. With our goals and intentions, were we producing things that were somehow different from what even simple programs “could do anyway”? How would we be able to tell whether that interstellar radio signal was the product of some sophisticated civilization, or just something that “happened naturally”? My Principle of Computational Equivalence implied that at an ultimate level there wouldn’t really be a way to tell. But I kept on wondering whether there might at least be some signature of “purposes like ours” that we could detect.

At first it was extraterrestrial intelligence and animal intelligence, later also artificial intelligence. But the question kept on coming back: what distinguishes what’s engineered from what just “scientifically happens”? (And, yes, there was a theological question in there too.) I had wondered for a while about using the Game of Life as a testing ground for this, and as the 50th anniversary of the Game of Life approached in 2018, I took this as a cue to explore it.

Over the years I had accumulated a paper file perhaps 6 inches thick about the Game of Life (a few samples from which I’ve shown above). But looking around the web I was impressed at how much well-organized material there now was out there about the Game of Life. I started to try to analyze it, imagining that I might see something like an analog of Moore’s law. Meanwhile, over the preceding decade I had written a lot about the history of science, and I thought that as part of my contribution to the 50th anniversary of the Game of Life I should try to write about its history. What were the stories of all those people whose names were attached to discoveries in Life? A research assistant of mine began to track them down, and interview them. It turned out to be a very disparate group, many of whom knew little about each other. (Though they often, but not always, had in common graduate-level education in math.) And in any case it became clear that writing a coherent history was going to be a huge undertaking. In addition, the first few ways I tried to discern trends in data about the Game of Life data didn’t yield much. And soon the 50th anniversary had passed—and I got busy with other things.

But the project of studying the “metaengineering” of the Game of Life stayed on my “to do” list (and a couple of students at our Wolfram Summer School worked on it). Then in 2022 a nice book on the Game of Life came out (by Nathaniel Johnston and Dave Greene, the latter of whom had actually been at our Summer School back in 2011). Had my project been reduced to just reading this book, I wondered. I soon realized that it hadn’t. And there were now all kinds of questions on which I imagined a study of the Game of Life could shed light. Not only questions about the “signature of purpose”. But also questions about novelty and creativity. And about the arc and rhythm of innovation.

Then in 2024 came the surprises of my work on biological evolution, and on machine learning. And I found myself again wondering about how things work when there’s “intentional engineering”. And so I finally decided to do my long-planned study of the Game of Life. There’s much, much more that can be done. But I think what I’ve done here provides an indication of some of the directions one can go, and some of what there is to discover in what is effectively the new field of “computational metaengineering”.

Notes

Thanks to Willem Nielsen of the Wolfram Institute for extensive help, as well as to Ed Pegg of Wolfram Research. (Thanks also to Brad Klee for earlier work.) Over the years, I’ve interacted with many people about the Game of Life. In rough order of my first (“Life”) interactions with them, these include: Jeremy Barford (1973), Philip Gladstone (1973), Nicholas Goulder (1973), Norman Routledge (1973), Bill Gosper (1979), Tim Robinson (1981), Paul Leyland (1981), Norman Margolus (1982), John Conway (1983), Brian Silverman (1985), Eric Weisstein (1999), Ed Pegg (2000), Jon C. R. Bennett (2006), Robert Wainwright (2010), Dave Greene (2011), Steve Bourne (2018), Tanha Kate (2018), Simon Norton (2018), Adam Goucher (2019), Keith Patarroyo (2021), Steph Macurdy (2021), Mark McAndrew (2022), Richard Assar (2024) and Nigel Martin (2025). And, of course, thanks to the many people who’ve contributed over the past half century to the historical progression of Life engineering that I’ve been analyzing here.

Note added April 24, 2025: Thanks to Dave Greene who pointed out an incorrect historical inference, which has now been updated.

Towards a Computational Formalization for Foundations of Medicine

Stephen Wolfram — Mon, 03 Feb 2025 23:27:46 +0000

A Theory of Medicine?

As it’s practiced today, medicine is almost always about particulars: “this has gone wrong; this is how to fix it”. But might it also be possible to talk about medicine in a more general, more abstract way—and perhaps to create a framework in which one can study its essential features without engaging with all of its details?

My goal here is to take the first steps towards such a framework. And in a sense my central result is that there are many broad phenomena in medicine that seem at their core to be fundamentally computational—and to be captured by remarkably simple computational models that are readily amenable to study by computer experiment.

I should make it clear at the outset that I’m not trying to set up a specific model for any particular aspect or component of biological systems. Rather, my goal is to “zoom out” and create what one can think of as a “metamodel” for studying and formalizing the abstract foundations of medicine.

What I’ll be doing builds on my recent work on using the computational paradigm to study the foundations of biological evolution. And indeed in constructing idealized organisms we’ll be using the very same class of basic computational models. But now, instead of considering idealized genetic mutations and asking what types of idealized organisms they produce, we’re going to be looking at specific evolved idealized organisms, and seeing what effect perturbations have on them. Roughly, the idea is that an idealized organism operates in its normal “healthy” way if there are no perturbations—but perturbations can “derail” its operation and introduce what we can think of as “disease”. And with this setup we can then think of the “fundamental problem of medicine” as being the identification of additional perturbations that can “treat the disease” and put the organism at least approximately back on its normal “healthy” track.

As we’ll see, most perturbations lead to lots of detailed changes in our idealized organism, much as perturbations in biological organisms normally lead to vast numbers of effects, say at a molecular level. But as in medicine, we can imagine that all we can observe (and perhaps all we care about) are certain coarse-grained features or “symptoms”. And the fundamental problem of medicine is then to work out from these symptoms what “treatment” (if any) will end up being useful. (By the way, when I say “symptoms” I mean the whole cluster of signs, symptoms, tests, etc. that one might in practice use, say for diagnosis.)

It’s worth emphasizing again that I’m not trying here to derive specific, actionable, medical conclusions. Rather, my goal is to build a conceptual framework in which, for example, it becomes conceivable for general phenomena in medicine that in the past have seemed at best vague and anecdotal to begin to be formalized and studied in a systematic way. At some level, what I’m trying to do is a bit like what Darwinism did for biological evolution. But in modern times there’s a critical new element: the computational paradigm, which not only introduces all sorts of new, powerful theoretical concepts, but also leads us to the practical methodology of computer experimentation. And indeed much of what follows is based on the (often surprising) results of computer experiments I’ve recently done that give us raw material to build our intuition—and structure our thinking—about fundamental phenomena in medicine.

A Minimal Metamodel

How can we make a metamodel of medicine? We need an idealization of biological organisms and their behavior and development. We need an idealization of the concept of disease for such organisms. And we need an idealization of the concept of treatment.

For our idealization of biological organisms we’ll use a class of simple computational systems called cellular automata (that I happen to have studied since the early 1980s). Here’s a specific example:

What’s going on here is that we’re progressively constructing the pattern on the left (representing the development and behavior of our organism) by repeatedly applying cases of the rules on the right (representing the idealized genome—and other biochemical, etc. rules—of our organism). Roughly we can think of the pattern on the left as corresponding to the “life history” of our organism—growing, developing and eventually dying as it goes down the page. And even though there’s a rather organic look to the pattern, remember that the system we’ve set up isn’t intended to provide a model for any particular real-world biological system. Rather, the goal is just for it to capture enough of the foundations of biology that it can serve as a successful metamodel to let us explore our questions about the foundations of medicine.

Looking at our model in more detail, we see that it involves a grid of squares—or “cells” (computational, not biological)—each having one of 4 possible colors (white and three others). We start from a single red “seed” cell on the top row of the grid, then compute the colors of cells on subsequent steps (i.e. on subsequent rows down the page) by successively applying the rules on the right. The rules here are basically very simple. But we can see that when we run them they lead to a fairly complicated pattern—which in this case happens to “die out” (i.e. all cells become white) after exactly 101 steps.

So what happens if we perturb this system? On the left here we’re showing the system as above, without perturbation. But on the right we’re introducing a perturbation by changing the color of a particular cell (on step 16)—leading to a rather different (if qualitatively similar) pattern:

Here are the results of some other perturbations to our system:

Some perturbations (like the one in the second panel here) quickly disappear; in essence the system quickly “heals itself”. But in most cases even single-cell perturbations like the ones here have a long-term effect. Sometimes they can “increase the lifetime” of the organism; often they will decrease it. And sometimes—like in the last case shown here—they will lead to essentially unbounded “tumor-like” growth.

In biological or medical terms, the perturbations we’re introducing are minimal idealizations of “things that can happen to an organism” in the course of its life. Sometimes the perturbations will have little or no effect on the organism. Or at least they won’t “really hurt it”—and the organism will “live out its natural life” (or even extend it a bit). But in other cases, a perturbation can somehow “destabilize” the organism, in effect “making it develop a disease”, and often making it “die before its time”.

But now we can formulate what we can think of as the “fundamental problem of medicine”: given that perturbations have had a deleterious effect on an organism, can we find subsequent perturbations to apply that will serve as a “treatment” to overcome the deleterious effect?

The first panel here shows a particular perturbation that makes our idealized organism die after 47 steps. The subsequent panels then show various “treatments” (i.e. additional perturbations) that serve at least to “keep the organism alive”:

In the later panels here the “life history” of the organism gets closer to the “healthy” unperturbed form shown in the final panel. And if our criterion is restoring overall lifetime, we can reasonably say that the “treatment has been successful”. But it’s notable that the detailed “life history” (and perhaps “quality of life”) of the organism will essentially never be the same as before: as we’ll see in more detail later, it’s almost inevitably the case that there’ll be at least some (and often many) long-term effects of the perturbation+treatment even if they’re not considered deleterious.

So now that we’ve got an idealized model of the “problem of medicine”, what can we say about solving it? Well, the main thing is that we can get a sense of why it’s fundamentally hard. And beyond anything else, the central issue is a fundamentally computational one: the phenomenon of computational irreducibility.

Given any particular cellular automaton rule, with any particular initial condition, one can always explicitly run the rule, step by step, from that initial condition, to see what will happen. But can one do better? Experience with mathematical science might make one imagine that as soon as one knows the underlying rule for a system, one should in principle immediately be able to “solve the equations” and jump ahead to work out everything about what the system does, without explicitly tracing through all the steps. But one of the central things I discovered in studying simple programs back in the early 1980s is that it’s common for such systems to show what I called computational irreducibility, which means that the only way to work out their detailed behavior is essentially just to run their rules step by step and see what happens.

So what about biology? One might imagine that with its incremental optimization, biological evolution would produce systems that somehow avoid computational irreducibility, and (like simple machinery) have obvious easy-to-understand mechanisms by which they operate. But in fact that’s not what biological evolution typically seems to produce. And instead—as I’ve recently argued—what it seems to do is basically just to put together randomly found “lumps of irreducible computation” that happen to satisfy its fitness criterion. And the result is that biological systems are full of computational irreducibility, and mostly aren’t straightforwardly “mechanically explainable”. (The presence of computational irreducibility is presumably also why theoretical biology based on mathematical models has always been so challenging.)

But, OK, given all this computational irreducibility, how is it that medicine is even possible? How is it that we can know enough about what a biological system will do to be able to determine what treatment to use on it? Well, computational irreducibility makes it hard. But it’s a fundamental feature of computational irreducibility that within any computationally irreducible process there must always be pockets of computational reducibility. And if we’re trying to achieve only some fairly coarse objective (like maximizing overall lifetime) it’s potentially possible to leverage some pocket of computational reducibility to do this.

(And indeed pockets of computational reducibility within computational irreducibility are what make many things possible—including having understandable laws of physics, doing higher mathematics, etc.)

The Diversity and Classification of Disease

With our simple idealization of disease as the effect of perturbations on the life history of our idealized organism, we can start asking questions like “What is the distribution of all possible diseases?”

And to begin exploring this, here are the patterns generated with a random sample of the 4383 possible single-point perturbations to the idealized organism we’ve discussed above:

Clearly there’s a lot of variation in these life histories—in effect a lot of different symptomologies. If we average them all together we lose the detail and we just get something close to the original:

But if we look at the distribution of lifetimes, we see that while it’s peaked at the original value, it nevertheless extends to both shorter and longer values:

In medicine (or at least Western medicine) it’s been traditional to classify “things that can go wrong” in terms of discrete diseases. And we can imagine also doing this in our simple model. But it’s already clear from the array of pictures above that this is not going to be a straightforward task. We’ve got a different detailed pattern for every different perturbation. So how should we group them together?

Well—much as in medicine—it depends on what we care about. In medicine we might talk about signs and symptoms, which in our idealized model we can basically identify with features of patterns. And as an example, we might decide that the only features that matter are ones associated with the boundary shape of our pattern:

So what happens to these boundary shapes with different perturbations? Here are the most frequent shapes found (together with their probabilities):

We might think of these as representing “common diseases” of our idealized organism. But what if we look at all possible “diseases”—at least all the ones produced by single-cell perturbations? Using boundary shape as our way to distinguish “diseases” we find that if we plot the frequency of diseases against their rank we get roughly a power law distribution (and, yes, it’s not clear why it’s a power law):

What are the “rare diseases” (i.e. ones with low frequency) like? Their boundary shapes can be quite diverse:

But, OK, can we somehow quantify all these “diseases”? For example, as a kind of “imitation medical test” we might look at how far to the left the boundary of each pattern goes. With single-point perturbations, 84% of the time it’s the same as in the unperturbed case—but there’s a distribution of other, “less healthy” results (here plotted on a log scale)

with extreme examples being:

And, yes, we could diagnose any pattern that goes further to the left than the unperturbed one as a case of, say, “leftiness syndrome”. And we might imagine that if we set up enough tests, we could begin to discriminate between many discrete “diseases”. But somehow this seems quite ad hoc.

So can we perhaps be more systematic by using machine learning? Let’s say we just look at each whole pattern, then try to place it in an image feature space, say a 2D one. Here’s an example of what we get:

The details of this depend on the particulars of the machine learning method we’ve used (here the default FeatureSpacePlot method in Wolfram Language). But it’s a fairly robust result that “visually different” patterns end up separated—so that in effect the machine learning is successfully automating some kind of “visual diagnosis”. And there’s at least a little evidence that the machine learning will identify separated clusters of patterns that we can reasonably identify as “truly distinct diseases”—even as the more common situation is that between any two patterns, there are intermediate ones that aren’t neatly classified as one disease or the other.

Somewhat in the style of the human “International Classification of Diseases” (ICD), we can try arranging all our patterns in a hierarchy—though it’s basically inevitable that we’ll always be able to subdivide further, and there’ll never be a clear point at which we can say “we’ve classified all the diseases”:

By the way, in addition to talking about possible diseases, we also need to discuss what counts as “healthy”. We could say that our organism is only “healthy” if its pattern is exactly what it would be without any perturbation (“the natural state”). But what probably better captures everyday medical thinking is to say that our organism should be considered “healthy” if it doesn’t have symptoms (or features) that we consider bad. And in particular, at least “after the fact” we might be able to say that it must have been healthy if its lifetime turned out to be long.

It’s worth noting that even in our simple model, while there are many perturbations that reduce lifetime, there are also perturbations that increase lifetime. In the course of biological evolution, genetic mutations of the overall underlying rules for our idealized organism might have managed to achieve a certain longevity. But the point is that nothing says “longevity perturbations” applied “during the life of the organism” can’t get further—and indeed here are some examples where they do:

And, actually, in a feature that’s not (at least yet) reflected in human medicine, there are perturbations than can make the lifetime very significantly longer. And for the particular idealized organism we’re studying here, the most extreme examples obtained with single-point perturbations are:

OK, but what happens if we consider perturbations at multiple points? There are immediately vastly more possibilities. Here are some examples of the 10 million or so possible configurations of two perturbations:

And here are examples with three perturbations:

Here are examples if we try to apply five perturbations (though sometimes the organism is “already dead” before we can apply later perturbations):

What happens to the overall distribution of lifetimes in these cases? Already with two perturbations, the distribution gets much broader, and with three or more, the peak at the original lifetime has all but disappeared, with a new peak appearing for organisms that in effect die almost immediately:

In other words, the particular idealized organism that we’re studying is fairly robust against one perturbation, and perhaps even two, but with more perturbations it’s increasingly likely to succumb to “infant mortality”. (And, yes, if one increases the number of perturbations the “life expectancy” progressively decreases.)

But what about the other way around? With multiple perturbations, can the organism in effect “live forever”? Here are some examples where it’s still “going strong” after 300 steps:

But after 500 steps most of these have died out:

As is typical in the computational universe (perhaps like in medicine) there are always surprises, courtesy of computational irreducibility. Like the sudden appearance of the obviously periodic case (with period 25):

As well as the much more complicated cases (where in the final pictures the pattern has been “rectified”):

So, yes, in these cases the organism does in effect “live forever”—though not in an “interesting” way. And indeed such cases might remind us of tumor-like behavior in biological organisms. But what about a case that not only lives forever, but also grows forever? Well, needless to say, lurking out in the computational universe, one can find an example:

The “incidence” of this behavior is about one in a million for 2 perturbations (or, more precisely, 7 out of 9.6 million possibilities), and one in 300,000 for 3 perturbations. And although there presumably are even more complicated behaviors out there to find, they don’t show up with 2 perturbations, and their incidence with 3 perturbations is below about one in 100 million.

Diagnosis & Prognosis

A fundamental objective in medicine is to predict from tests we do or symptoms and signs we observe what will happen. And, yes, we now know that computational irreducibility inevitably makes this in general hard. But also know from experience that a certain amount of prediction is possible—which we can now interpret as successfully managing to tap into pockets of computational reducibility.

So as an example, let’s ask what the prognosis is for our idealized organism based on the width of its pattern we measure at a certain step. So here, for example, is what happens to the original lifetime distribution (in green) if we consider only cases where the width of the measured pattern after 25 steps is less than its unperturbed (“healthy”) value (and where we’re dropping the 1% of cases when the organism was “already dead” before 25 steps):

Our “narrow” cases represent about 5% of the total. Their median lifetime is 57, as compared with the overall median of 106. But clearly the median alone does not tell the whole story. And nor do the two survival curves:

And, for example, here are the actual widths as a function of time for all the narrow cases, compared to the sequence of widths for the unperturbed case:

These pictures don’t make it look promising that one could predict lifetime from the single test of whether the pattern was narrow at step 25. Like in analogous medical situations, one needs more data. One approach in our case is to look at actual “narrow” patterns (up to step 25)—here sorted by ultimate lifetime—and then to try to identify useful predictive features (though, for example, to attempt any serious machine learning training would require a lot more examples):

But perhaps a simpler approach is not just to do a discrete “narrow or not” test, but rather to look at the actual width at step 25. So here are the lifetimes as a function of width at step 25

and here’s the distribution of outcomes, together with the median in each case:

The predictive power of our width measurement is obviously quite weak (though there’s doubtless a way to “hack p values” to get at least something out). And, unsurprisingly, machine learning doesn’t help. Like here’s a machine learning prediction (based on decision tree methods) for lifetime as a function of width (that, yes, is very close to just being the median):

Does it help if we use more history? In other words, what happens if we make our prediction not just from the width at a particular step, but from the history of all widths up to that point? As one approach, we can make a collection of “training examples” of what lifetimes particular “width histories” (say up to step 25) lead to:

There’s already something of an issue here, because a given width history—which, in a sense is a “coarse graining” of the detailed “microscopic” history—can lead to multiple different final lifetimes:

But we can still go ahead and try to use machine learning to predict lifetimes from width histories based on training on (say, half) of our training data—yielding less than impressive results (with the vertical line being associated with multiple lifetimes from a single width history in the training data):

So how can we do better? Well, given the underlying setup for our system, if we could determine not just the width but the whole precise sequence of values for all cells, even just at step 25, then in principle we could use this as an “initial condition” and run the system forward to see what it does. But regardless of it being “medically implausible” to do this, it isn’t much of a prediction anyway; it’s more just “watch and see what happens”. And the point is that insofar as there’s computational irreducibility, one can’t expect—at least in full generality—to do much better. (And, as we’ll argue later, there’s no reason to think that organisms produced by biological evolution will avoid computational irreducibility at this level.)

But still, within any computationally irreducible system, there are always pockets of computational reducibility. So we can expect that there will be some predictions that can be made. But the question is whether those predictions will be about things we care about (like lifetime) or even about things we can measure. Or, in other words, will they be predictions that speak to things like symptoms?

Our Physics Project, for example, involves all sorts of underlying processes that are computationally irreducible. But the key point there is that what physical observers like us perceive are aggregate constructs (like overall features of space) that show significant computational reducibility. And in a sense there’s an analogous issue here: there’s computational irreducibility underneath, but what do “medical observers” actually perceive, and are there computationally reducible features related to that? If we could find such things, then in a sense we’d have identified “general laws of medicine” much like we now have “general laws of physics”.

The Problem of Finding Treatments

We’ve talked a bit about giving a prognosis for what will happen to an idealized organism that’s suffered a perturbation. But what about trying to fix it? What about trying to intervene with another “treatment perturbation” that can “heal” the system, and give it a life history that’s at least close to what it would have had without the original perturbation?

Here’s our original idealized organism, together with how it behaves when it “suffers” a particular perturbation that significantly reduces its lifetime:

But what happens if we now try applying a second perturbation? Here are a few random examples:

None of these examples convincingly “heal” the system. But let’s (as we can in our idealized model) just enumerate all possible second perturbations (here 1554 of them). Then it turns out that a few of these do in fact successfully give us patterns that at least exactly reproduce the original lifetime:

Do these represent true examples of “healing”? Well, it depends on what we mean. Yes, they’ve managed to make the lifetime exactly what it would have been without the original “disease-inducing” perturbation. But in essentially all cases we see here that there are various “long-term side effects”—in the sense that the detailed patterns generated end up having obvious differences from the original unperturbed “healthy” form.

The one exception here is the very first case, in which the “disease was caught early enough” that the “treatment perturbation” manages to completely heal the effects of the “disease perturbation”:

We’ve been talking here about intervening with “treatment perturbations” to “heal” our “disease perturbation”. But actually it turns out that there are plenty of “disease perturbations” which automatically “heal themselves”, without any “treatment” intervention. In fact, of all possible 4383 single perturbations, 380 essentially heal themselves.

In many cases, the “healing” happens very locally, after one or two steps:

But there are also more complicated cases, where perturbations produce fairly large-scale changes in the pattern—that nevertheless “spontaneously heal themselves”:

(Needless to say, in cases where a perturbation “spontaneously heals itself”, adding a “treatment perturbation” will almost always lead to a worse outcome.)

So how should we think about perturbations that spontaneously heal themselves? They’re like seeds for diseases that never take hold, or like diseases that quickly burn themselves out. But from a theoretical point of view we can think of them as being where the unperturbed life history of our idealized organism is acting as attractor, to which certain perturbed states inexorably converge—a bit like how friction can dissipate perturbations to patterns of motion in a mechanical system.

But let’s say we have a perturbation that doesn’t “spontaneously heal itself”. Then to remediate it we have to “do the medical thing” and in our idealized model try to find a “treatment perturbation”. So how might we systematically set about doing that? Well, in general, computational irreducibility makes it difficult. And as one indication of this, this shows what lifetime is achieved by “treatment perturbations” made at each possible point in the pattern (after the initial perturbation):

We can think of this as providing a map of what the effects of different treatment perturbations will be. Here are some other examples, for different initial perturbations (or, in effect, different “diseases”):

There’s some regularity here. But the main observation is that different detailed choices of treatment perturbations will often have very different effects. In other words, even “nearby treatments” will often lead to very different outcomes. Given computational irreducibility, this isn’t surprising. But in a sense it underscores the difficulty of finding and applying “treatments”. By the way, cells indicated in dark red above are ones where treatment leads to a pattern that lives “excessively long”—or in effect shows tumor-like characteristics. And the fact that these are scattered so seemingly randomly reflects the difficulty of predicting whether such effects will occur as a result of treatment.

In what we’ve done so far here, our “treatment” has always consisted of just a single additional perturbation. But what about applying more perturbations? For example, let’s say we do a series of experiments where after our first “treatment perturbation” we progressively try other treatment perturbations. If a given additional perturbation doesn’t get further from the desired lifetime, we keep it. Otherwise we reject it, and try another perturbation. Here’s an example of what happens if we do this:

The highlighted panels represent perturbations we kept. And here’s how the overall lifetime “converges” over successive iterations in our experiment:

In what we just did, we allowed additional treatment perturbations to be added at any subsequent step. But what if we require treatment perturbations to always be added on successive steps—starting right after the “disease perturbation” occurred? Here’s an example of what happens in this case:

And here’s what we see zooming in at the beginning:

In a sense this corresponds to “doing aggressive treatment” as soon as the initial “disease perturbation” has occurred. And a notable feature of the particular example here is that when our succession of treatment perturbations have succeeded in “restoring the lifetime” (which happens fairly quickly), the life history they produce is similar (though not identical) to the original unperturbed case.

That definitely doesn’t always happen, as this example illustrates—but it’s fairly common:

It’s worth pointing out that if we allowed ourselves to do many single perturbations at the same time (i.e. on the same row of the pattern) we could effectively just “define new initial conditions” for the pattern, and, for example, perfectly “regenerate” the original unperturbed pattern after this “reset”. And in general we can imagine in effect “hot-wiring” the organism by applying large numbers of treatment perturbations that just repeatedly direct it back to its unperturbed form.

But such extensive and detailed “intervention”—that in effect replaces the whole state of the organism—seems far from what might be practical in typical (current) medicine (except perhaps in some kind of “regenerative treatment”). And indeed in actual (current) medicine one is normally operating in a situation where one does not have anything close to perfect “cell-by-cell” information on the state of an organism—and instead one has to figure out things like what treatment to give based on much coarser “symptom-level” information. (In some ways, though, the immune system does something closer to cell-by-cell “treatment”.)

So what can one do given coarse-grained information? As one example, let’s consider trying to predict what treatment perturbation will be best using the kind of pattern-width information we discussed above. Specifically, let’s say that we have the history of the overall width of a pattern up to a particular point, then from this we want to predict what treatment perturbation will lead to the best lifetime outcome for the system. There are a variety of ways we could approach this, but one is to make predictions of where to apply a treatment perturbation using machine learning trained on examples of optimal such perturbations.

This is analogous to what we did in the previous section in applying machine learning to predict lifetime from width history. But now we want to predict from width history what treatment perturbation to apply. To generate our training data we can search for treatment perturbations that lead to the unperturbed lifetime when starting from life histories with a given width history. Now we can use a simple neural net to create a predictor that tries to tell us from a width history what “treatment to give”. And here are comparisons between our earlier search results based on looking at complete life histories—and (shown with red arrows) the machine learning predictions based purely on width history before the original disease perturbation:

It’s clear that the machine learning is doing something—though it’s not as impressive as perhaps it looks, because a wide range of perturbations all in fact give rather similar life histories. So as a slightly more quantitative indication of what’s going on, here’s the distribution of lifetimes achieved by our machine-learning-based therapy:

Our “best treatment” was able to give lifetime 101 in all these cases. And while the distribution we’ve now achieved looks peaked around the unperturbed value, dividing this distribution by what we’d get without any treatment at all makes it clear that not so much was achieved by the machine learning we were able to do:

And in a sense this isn’t surprising; our machine learning—based, as it is, on coarse-grained features—is quite weak compared to the computational irreducibility of the underlying processes at work.

The Effect of Genetic Diversity

In what we’ve done so far, we’ve studied just a single idealized organism—with a single set of underlying “genetic rules”. But in analogy to the situation with humans, we can imagine a whole population of genetically slightly different idealized organisms, with different responses to perturbations, etc.

Many changes to the underlying rules for our idealized organism will lead to unrecognizably different patterns, that don’t, for example, have the kind of finite-but-long lifetimes we’ve been interested in. But it turns out that in the rules for our particular idealized organism there are some specific changes that actually don’t have any effect at all—at least on the unperturbed pattern of behavior. And the reason for this is that in generating the unperturbed pattern these particular cases in the rule happen never to be used:

And the result is that any one of the 4³ = 64 possible choices of outcomes for those cases in the rule will still yield the same unperturbed pattern. If there’s a perturbation, however, different cases in the rule can be sampled—including these ones. It’s as if cases in the rule that are initially “non-coding” end up being “coding” when the path of behavior is changed by a perturbation. (Or, said differently, it’s like different genes being activated when conditions are different.)

So to make an idealized model of something like a population with genetic diversity, we can look at what happens with different choices of our (initially) “non-coding” rule outcomes:

Before the perturbation, all these inevitably show the same behavior, because they’re never sampling “non-coding” rule cases. But as soon as there’s a perturbation, the pattern is changed, and after varying numbers of steps, previously “non-coding” rule cases do get sampled—and can affect the outcome.

Here are the distinct cases of what happens in all 64 “genetic variants”—with the red arrow in each case indicating where the pattern first differs from what it is with our original idealized organism:

And here is then the distribution of lifetimes achieved—in effect showing the differing consequences of this particular “disease perturbation” on all our genetic variants:

What happens with other “disease perturbations”? Here’s a sample of distributions of lifetimes achieved (where “__” corresponds to cases where all 64 genetic variants yield the same lifetime):

OK, so what about the overall lifetime distribution across all (single) perturbations for each of the genetic variants? The detailed distribution we get is different for each variant. But their general shape is always remarkably similar

though taking differences from the case of our original idealized organism reveals some structure:

As another indication of the effect of genetic diversity, we can plot the survival curve averaged over all perturbations, and compare the case for our original idealized organism with what happens if we average equally over all 64 genetic variants. The difference is small, but there is a longer tail for the average of the genetic variants than for our specific original idealized organism:

We’ve seen how our idealized genetic variation affects “disease”. But how does it affect “treatment”? For the “disease” above, we already saw that there’s a particular “treatment perturbation” that successfully returns our original idealized organism to its “natural lifespan”. So what happens if we apply this same treatment across all the genetic variants? In effect this is like doing a very idealized “clinical trial” of our potential treatment. And what we see is that the results are quite diverse—and indeed more diverse than from the disease on it own:

In essence what we’re seeing is that, yes, there are some genetic variants for which the treatment still works. But there are many for which there are (often fairly dramatic) side effects.

Biological Evolution and Our Model Organism

So where did the particular rule for the “model organism” we’ve been studying come from? Well, we evolved it—using a slight generalization of the idealized model for biological evolution that I recently introduced. The goal of our evolutionary process was to find a rule that generates a pattern that lives as long as possible, but not infinitely long—and that does so robustly even in the presence of perturbations. In essence we used lifetime (or, more accurately, “lifetime under perturbation”) as our “fitness function”, then progressively evolved our rule (or “genome”) by random mutations to try to maximize this fitness function.

In more detail, we started from the null (“everything turns white”) rule, then successively made random changes to single cases in the rule (“point mutations”)—keeping the resulting rule whenever the pattern it generated had a lifetime (under perturbation) that wasn’t smaller (or infinite). And with this setup, here’s the particular (random) sequence of rules we got (showing for each rule the outcome for each of its 64 cases):

Many of these rules don’t “make progress” in the sense that they increase the lifetime under perturbation. But every so often there’s a “breakthrough”, and a rule with a longer lifetime under perturbation is reached:

And, as we see, the rule for the particular model organism we’ve been using is what’s reached at the end.

In studying my recent idealized model for biological evolution, I considered fitness functions like lifetime that can directly be computed just by running the underlying rule from a certain initial condition. But here I’m generalizing that a bit, and considering as a fitness function not just lifetime, but “lifetime under perturbation”, computed by taking a particular rule, and finding the minimum lifetime of all patterns produced by it with certain random perturbations applied.

So, for example, here the “lifetime under perturbation” would be considered to be the minimum of the lifetimes generated with no perturbation, and with certain random perturbations—or in this case 60:

This plot then illustrates how the (lifetime-under-perturbation) fitness (indicated by the blue line) behaves in the course of our adaptive evolution process, right around where the fitness-60 “breakthrough” above occurs:

What’s happening in this plot? At each adaptive step, we’re considering a new rule, obtained by a point mutation from the previous one. Running this rule we get a certain lifetime. If this lifetime is finite, we indicate it by a green dot. Then we apply a certain set of random perturbations—indicating the lifetimes we get by gray dots. (We could imagine using all sorts of schemes for picking the random perturbations; here what we’re doing is to perturb random points on about a tenth of the rows in the unperturbed pattern.)

Then the minimum lifetime for any given rule we indicate by a red dot—and this is the fitness we assign to that rule. So now we can see the whole progression of our adaptive evolution process:

One thing that’s notable is that the unperturbed lifetimes (green dots) are considerably larger than the final minimum lifetimes (red dots). And what this means is that our requirement of “robustness”, implemented by looking at lifetime under perturbation rather than just unperturbed lifetime, considerably reduces the lifetimes we can reach. In other words, if our idealized organism is going to be robust, it won’t tend to be able to have as long a lifetime as it could if it didn’t have to “worry about” random perturbations.

And to illustrate this, here’s a typical example of a much longer lifetime obtained by adaptive evolution with the same kind of rule we’ve been using (k = 4, r = 1 cellular automaton), but now with no perturbations and with fitness being given purely by the unperturbed lifetime (exactly as in my recent work on biological evolution):

OK, so given that we’re evolving with a lifetime-under-perturbation fitness function, what are some alternatives to our particular model organism? Here are a few examples:

At an overall level, these seem to react to perturbations much like our original model organism:

One notable feature here, though, is that there seems to be a tendency for simpler overall behavior to be less disrupted by perturbations. In other words, our idealized “diseases” seem to have less dramatic effects on “simpler” idealized organisms. And we can see a reflection of this phenomenon if we plot the overall (single-perturbation) lifetime distributions for the four rules above:

But despite detailed differences, the main conclusion seems to be that there’s nothing special about the particular model organism we’ve used—and that if we repeated our whole analysis for different model organisms (i.e. “different idealized species”) the results we’d get would be very much the same.

What It Means and Where to Go from Here

So what does all this mean? At the outset, it wasn’t clear there’d be a way to usefully capture anything about the foundations of medicine in a formalized theoretical way. But in fact what we’ve found is that even the very simple computational model we’ve studied seems to successfully reflect all sorts of features of what we see in medicine. Many of the fundamental effects and phenomena are, it seems, not the result of details of biomedicine, but instead are at their core purely abstract and computational—and therefore accessible to formalized theory and metamodeling. This kind of methodology is very different from what’s been traditional in medicine—and isn’t likely to lead directly to specific practical medicine. But what it can do is to help us develop powerful new general intuition and ways of reasoning—and ultimately an understanding of the conceptual foundations of what’s going on.

At the heart of much of what we’ve seen is the very fundamental—and ubiquitous—phenomenon of computational irreducibility. I’ve argued recently that computational irreducibility is central to what makes biological evolution work—and that it’s inevitably imprinted on the core “computational architecture” of biological organisms. And it’s this computational irreducibility that inexorably leads to much of the complexity we see so ubiquitously in medicine. Can we expect to find a simple narrative explanation for the consequences of some perturbation to an organism? In general, no—because of computational irreducibility. There are always pockets of computational reducibility, but in general we can have no expectation that, for example, we’ll be able to describe the effects of different perturbations by neatly classifying them into a certain set of distinct “diseases”.

To a large extent the core mission of medicine is about “treating diseases”, or in our terms, about remediating or reversing the effects of perturbations. And once again, computational irreducibility implies there’s inevitably a certain fundamental difficulty in doing this. It’s a bit like with the Second Law of thermodynamics, where there’s enough computational irreducibility in microscopic molecular dynamics that to seriously reverse—or outpredict—this dynamics is something that’s at least far out of range for computationally bounded observers like us. And in our medical setting the analog of that is that “computationally bounded interventions” can only systematically lead to medical successes insofar as they tap into pockets of computational reducibility. And insofar as they are exposed to overall computational irreducibility they will inevitably seem to show a certain amount of apparent randomness in their outcomes.

In traditional approaches to medicine one ultimately tends to “give in to the randomness” and go no further than to assign probabilities to things. But an important feature of what we’ve done here is that in our idealized computational models we can always explicitly see what’s happening inside. Often—largely as a consequence of computational irreducibility—it’s complicated. But the fact that we can see it gives us the opportunity to get much more clarity about the fundamental mechanisms involved. And if we end up summarizing what happens by giving probabilities and doing statistics it’s because this is something we’re choosing to do, not something we’re forced to do because of our lack of knowledge of the systems we’re studying.

There’s much to do in our effort to explore the computational foundations of medicine. But already there are some implications that are beginning to emerge. Much of the workflow of medicine today is based on classifying things that can go wrong into discrete diseases. But what we’ve seen here (which is hardly surprising given practical experience with medicine) is that when one looks at the details, a huge diversity of things can happen—whose characteristics and outcomes can’t really be binned neatly into discrete “diseases”.

And indeed when we try to figure out “treatments” the details matter. As a first approximation, we might base our treatments on coarse graining into discrete diseases. But—as the approach I’ve outlined here can potentially help us analyze—the more we can directly go from detailed measurements to detailed treatments (through computation, machine learning, etc.), the more promising it’s likely to be. Not that it’s easy. Because in a sense we’re trying to beat computational irreducibility—with computationally bounded measurements and interventions.

In principle one can imagine a future in which our efforts at treatment have much more computational sophistication (and indeed the immune system presumably already provides an example in nature). We can imagine things like algorithmic drugs and artificial cells that are capable of amounts of computation that are a closer match for the irreducible computation of an organism. And indeed the kind of formalized theory that I’ve outlined here is likely what one needs to begin to get an idea of how such an approach might work. (In the thermodynamic analogy, what we need to do is a bit like reversing entropy increase by sending in large numbers of “smart molecules”.)

(By the way, seeing how difficult it potentially is to reverse the effects of a perturbation provides all the more impetus to consider “starting from scratch”—as nature does in successive generations of organisms—and simply wholesale regenerating elements of organisms, rather than trying to “fix what’s there”. And, yes, in our models this is for example like starting to grow again from a new seed, and letting the resulting pattern knit itself into the existing one.)

One of the important features of operating at the level of computational foundations is that we can expect conclusions we draw to be very general. And we might wonder whether perhaps the framework we’ve described here could be applied outside of medicine. And to some extent I suspect it can—potentially to areas like robustness of large-scale technological and social systems and specifically things like computer security and computer system failures. (And, yes, much as in medicine one can imagine for example “classifying diseases” for computer systems.) But things likely won’t be quite the same in cases like these—because the underlying systems have much more human-determined mechanisms, and less “blind” adaptive evolution.

But when it comes to medicine, the very presence of computational irreducibility introduced by biological evolution is what potentially allows one to develop a robust framework in which one can draw conclusions purely on the basis of abstract computational phenomena. Here I’ve just begun to scratch the surface of what’s possible. But I think we’ve already seen enough that we can be confident that medicine is yet another field whose foundations can be seen as fundamentally rooted in the computational paradigm.

Thanks & Notes

Thanks to Wolfram Institute researcher Willem Nielsen for extensive help.

I’ve never written anything substantial about medicine before, though I’ve had many interactions with the medical research and biomedical communities over the years—that have gradually extended my knowledge and intuition about medicine. (Thanks particularly to Beatrice Golomb, who over the course of more than forty years has helped me understand more about medical reasoning, often emphasizing “Beatrice’s Law” that “Everything in medicine is more complicated than you can possibly imagine, even taking account of Beatrice’s Law”…)

Launching Version 14.2 of Wolfram Language & Mathematica: Big Data Meets Computation & AI

Stephen Wolfram — Thu, 23 Jan 2025 19:00:09 +0000

The Drumbeat of Releases Continues…

Just under six months ago (176 days ago, to be precise) we released Version 14.1. Today I’m pleased to announce that we’re releasing Version 14.2, delivering the latest from our R&D pipeline.

This is an exciting time for our technology, both in terms of what we’re now able to implement, and in terms of how our technology is now being used in the world at large. A notable feature of these times is the increasing use of Wolfram Language not only by humans, but also by AIs. And it’s very nice to see that all the effort we’ve put into consistent language design, implementation and documentation over the years is now paying dividends in making Wolfram Language uniquely valuable as a tool for AIs—complementing their own intrinsic capabilities.

But there’s another angle to AI as well. With our Wolfram Notebook Assistant launched last month we’re using AI technology (plus a lot more) to provide what amounts to a conversational interface to Wolfram Language. As I described when we released Wolfram Notebook Assistant, it’s something extremely useful for experts and beginners alike, but ultimately I think its most important consequence will be to accelerate the ability to go from any field X to “computational X”—making use of the whole tower of technology we’ve built around Wolfram Language.

So, what’s new in 14.2? Under the hood there are changes to make Wolfram Notebook Assistant more efficient and more streamlined. But there are also lots of visible extensions and enhancements to the user-visible parts of the Wolfram Language. In total there are 80 completely new functions—along with 177 functions that have been substantially updated.

There are continuations of long-running R&D stories, like additional functionality for video, and additional capabilities around symbolic arrays. Then there are completely new areas of built-in functionality, like game theory. But the largest new development in Version 14.2 is around handling tabular data, and particularly, big tabular data. It’s a whole new subsystem for Wolfram Language, with powerful consequences throughout the system. We’ve been working on it for quite a few years, and we’re excited to be able to release it for the first time in Version 14.2.

Talking of working on new functionality: starting more than seven years ago we pioneered the concept of open software design, livestreaming our software design meetings. And, for example, since the release of Version 14.1, we’ve done 43 software design livestreams, for a total of 46 hours (I’ve also done 73 hours of other livestreams in that time). Some of the functionality that’s now in Version 14.2 we started work on quite a few years ago. But we’ve been livestreaming long enough that pretty much anything that’s now in Version 14.2 we designed live and in public on a livestream at some time or another. It’s hard work doing software design (as you can tell if you watch the livestreams). But it’s always exciting to see the fruits of those efforts come to fruition in the system we’ve been progressively building for so long. And so, today, it’s a pleasure to be able to release Version 14.2 and to let everyone use the things we’ve been working so hard to build.

Notebook Assistant Chat inside Any Notebook

Last month we released the Wolfram Notebook Assistant to “turn words into computation”—and help experts and novices alike make broader and deeper use of Wolfram Language technology. In Version 14.1 the primary way to use Notebook Assistant is through the separate “side chat” Notebook Assistant window. But in Version 14.2 “chat cells” have become a standard feature of any notebook available to anyone with a Notebook Assistant subscription.

Just type ‘ as the first character of any cell, and it’ll become a chat cell:

Now you can start chatting with the Notebook Assistant:

With the side chat you have a “separate channel” for communicating with the Notebook Assistant—that won’t, for example, be saved with your notebook. With chat cells, your chat becomes an integral part of the notebook.

We actually first introduced Chat Notebooks in the middle of 2023—just a few months after the arrival of ChatGPT. Chat Notebooks defined the interface, but at the time, the actual content of chat cells was purely from external LLMs. Now in Version 14.2, chat cells are not limited to separate Chat Notebooks, but are available in any notebook. And by default they make use of the full Notebook Assistant technology stack, which goes far beyond a raw LLM. In addition, once you have a Notebook Assistant + LLM Kit subscription, you can seamlessly use chat cells; no account with external LLM providers is needed.

The chat cell functionality in Version 14.2 inherits all the features of Chat Notebooks. For example, typing ~ in a new cell creates a chat break, that lets you start a “new conversation”. And when you use a chat cell, it’s able to see anything in your notebook up to the most recent chat break. (By the way, when you use Notebook Assistant through side chat it can also see what selection you’ve made in your “focus” notebook.)

By default, chat cells are “talking” to the Notebook Assistant. But if you want, you can also use them to talk to external LLMs, just like in our original Chat Notebook—and there’s a convenient menu to set that up. Of course, if you’re using an external LLM, you don’t have all the technology that’s now in the Notebook Assistant, and unless you’re doing LLM research, you’ll typically find it much more useful and valuable to use chat cells in their default configuration—talking to the Notebook Assistant.

Bring Us Your Gigabytes! Introducing Tabular

Lists, associations, datasets. These are very flexible ways to represent structured collections of data in the Wolfram Language. But now in Version 14.2 there’s another: Tabular. Tabular provides a very streamlined and efficient way to handle tables of data laid out in rows and columns. And when we say “efficient” we mean that it can routinely juggle gigabytes of data or more, both in core and out of core.

Let’s do an example. Let’s start off by importing some tabular data:

This is data on trees in New York City, 683,788 of them, each with 45 properties (sometimes missing). Tabular introduces a variety of new ideas. One of them is treating tabular columns much like variables. Here we’re using this to make a histogram of the values of the "tree_dbh" column in this Tabular:

You can think of a Tabular as being like an optimized form of a list of associations, where each row consists of an association whose keys are column names. Functions like Select then just work on Tabular:

Length gives the number of rows:

CountsBy treats the Tabular as a list of associations, extracting the value associated with the key "spc_latin" (“Latin species”) in each association, and counting how many times that value occurs ("spc_latin" here is short for #"spc_latin"&):

To get the names of the columns we can use the new function ColumnKeys:

Viewing Tabular as being like a list of associations we can extract parts—giving first a specification of rows, and then a specification of columns:

There are lots of new operations that we’ve been able to introduce now that we have Tabular. An example is AggregrateRows, which constructs a new Tabular from a given Tabular by aggregating groups of rows, in this case ones with the same value of "spc_latin", and then applying a function to those rows, in this case finding the mean value of "tree_dbh":

An operation like ReverseSortBy then “just works” on this table, here reverse sorting by the value of "meandbh":

Here we’re making an ordinary matrix out of a small slice of data from our Tabular:

And now we can plot the result, giving the positions of Virginia pine trees in New York City:

When should you use a Tabular, rather than, say a Dataset? Tabular is specifically set up for data that is arranged in rows and columns—and it supports many powerful operations that make sense for data in this “rectangular” form. Dataset is more general; it can have an arbitrary hierarchy of data dimensions, and so can’t in general support all the “rectangular” data operations of Tabular. In addition, by being specialized for “rectangular” data, Tabular can also be much more efficient, and indeed we’re making use of the latest type-specific methods for large-scale data handling.

If you use TabularStructure you can see some of what lets Tabular be so efficient. Every column is treated as data of a specific type (and, yes, the types are consistent with the ones in the Wolfram Language compiler). And there’s streamlined treatment of missing data (with several new functions added specifically to handle this):

What we’ve seen so far is Tabular operating with “in-core” data. But you can quite transparently also use Tabular on out-of-core data, for example data stored in a relational database.

Here’s an example of what this looks like:

It’s a tabular that points to a table in a relational database. It doesn’t by default explicitly display the data in the Tabular (and in fact it doesn’t even get it into memory—because it might be huge and might be changing quickly as well). But you can still specify operations just like on any other Tabular. This finds out what columns are there:

And this specifies an operation, giving the result as a symbolic out-of-core Tabular object:

You can “resolve” this, and get an explicit in-memory Tabular using ToMemory:

Manipulating Data in Tabular

Let’s say you’ve got a Tabular—like this one based on penguins:

There are lots of operations you can do that manipulate the data in this Tabular in a structured way—giving you back another Tabular. For example, you could just take the last 2 rows of the Tabular:

Or you could sample 3 random rows:

Other operations depend on the actual content of the Tabular. And because you can treat each row like an association, you can set up functions that effectively refer to elements by their column names:

Note that we can always use #[name] to refer to elements in a column. If name is an alphanumeric string then we can also use the shorthand #name. And for other strings, we can use #"name". Some functions let you just use "name" to indicate the function #["name"]:

So far we’ve talked only about arranging or selecting rows in a Tabular. What about columns? Here’s how we can construct a tabular that has just two of the columns from our original Tabular:

What if we don’t just want existing columns, but instead want new columns that are functions of these? ConstructColumns lets us define new columns, giving their names and the functions to be used to compute values in them:

(Note the trick of writing out Function to avoid having to put parentheses, as in "species"(StringTake[#species,1]&).)

ConstructColumns lets you take an existing Tabular and construct a new one. TransformColumns lets you transform columns in an existing Tabular, here replacing species names by their first letters:

TransformColumns also lets you add new columns, specifying the content of the columns just like in ConstructColumns. But where does TransformColumns put your new columns? By default, they go at the end, after all existing columns. But if you specifically list an existing column, that’ll be used as a marker to determine where to put the new column ("name"Nothing removes a column):

Everything we’ve seen so far operates separately on each row of a Tabular. But what if we want to “gulp in” a whole column to use in our computation—say, for example, computing the mean of a whole column, then subtracting it from each value. ColumnwiseValue lets you do this, by supplying to the function (here Mean) a list of all the values in whatever column or columns you specify:

ColumnwiseValue effectively lets you compute a scalar value by applying a function to a whole column. There’s also ColumnwiseThread, which lets you compute a list of values, that will in effect be “threaded” into a column. Here we’re creating a column from a list of accumulated values:

By the way, as we’ll discuss below, if you’ve externally generated a list of values (of the right length) that you want to use as a column, you can do that directly by using InsertColumns.

There’s another concept that’s very useful in practice in working with tabular data, and that’s grouping. In our penguin data, we’ve got an individual row for each penguin of each species. But what if we want instead to aggregate all the penguins of a given species, for example computing their average body mass? Well, we can do this with AggregateRows. AggregateRows works like ConstructColumns in the sense that you specify columns and their contents. But unlike ConstructColumns it creates new “aggregated” rows:

What is that first column here? The gray background of its entries indicates that it’s what we call a “key column”: a column whose entries (perhaps together with other key columns) can be used to reference rows. And later, we’ll see how you can use RowKey to indicate a row by giving a value from a key column:

But let’s go on with our aggregation efforts. Let’s say that we want to group not just by species, but also by island. Here’s how we can do that with AggregateRows:

In a sense what we have here is a table whose rows are specified by pairs of values (here “species” and “island”). But it’s often convenient to “pivot” things so that these values are used respectively for rows and for columns. And you can do that with PivotTable:

Note the —’s, which indicate missing values; apparently there are no Gentoo penguins on Dream island, etc.

PivotTable normally gives exactly the same data as AggregateRows, but in a rearranged form. One additional feature of PivotTable is the option IncludeGroupAggregates which includes All entries that aggregate across each type of group:

If you have multiple functions that you’re computing, AggregateRows will just give them as separate columns:

PivotTable can also deal with multiple functions—by creating columns with “extended keys”:

And now you can use RowKey and ExtendedKey to refer to elements of the resulting Tabular:

Getting Data into Tabular

We’ve seen some of the things you can do when you have data as a Tabular. But how does one get data into a Tabular? There are several ways. The first is just to convert from structures like lists and associations. The second is to import from a file, say a CSV or XLSX (or, for larger amounts of data, Parquet)—or from an external data store (S3, Dropbox, etc.). And the third is to connect to a database. You can also get data for Tabular directly from the Wolfram Knowledgebase or from the Wolfram Data Repository.

Here’s how you can convert a list of lists into a Tabular:

And here’s how you can convert back:

It works with sparse arrays too, here instantly creating a million-row Tabular

that takes 80 MB to store:

Here’s what happens with a list of associations:

You can get the same Tabular by entering its data and its column names separately:

By the way, you can convert a Tabular to a Dataset

and in this simple case you can convert it back to a Tabular too:

In general, though, there are all sorts of options for how to convert lists, datasets, etc. to Tabular objects—and ToTabular is set up to let you control these. For example, you can use ToTabular to create a Tabular from columns rather than rows:

How about external data? In Version 14.2 Import now supports a "Tabular" element for tabular data formats. So, for example, given a CSV file

Import can immediately import it as a Tabular:

This works very efficiently even for huge CSV files with millions of entries. It also does well at automatically identifying column names and headers. The same kind of thing works with more structured files, like ones from spreadsheets and statistical data formats. And it also works with modern columnar storage formats like Parquet, ORC and Arrow.

Import transparently handles both ordinary files, and URLs (and URIs), requesting authentication if needed. In Version 14.2 we’re adding the new concept of DataConnectionObject, which provides a symbolic representation of remote data, essentially encapsulating all the details of how to get the data. So, for example, here’s a DataConnectionObject for an S3 bucket, whose contents we can immediately import:

(In Version 14.2 we’re supporting Amazon S3, Azure Blob Storage, Dropbox, IPFS—with many more to come. And we’re also planning support for data warehouse connections, APIs, etc.)

But what about data that’s too big—or too fast-changing—to make sense to explicitly import? An important feature of Tabular (mentioned above) is that it can transparently handle external data, for example in relational databases.

Here’s a reference to a large external database:

This defines a Tabular that points to a table in the external database:

We can ask for the dimensions of the Tabular—and we see that it has 158 million rows:

The table we’re looking at happens to be all the line-oriented data in OpenStreetMap. Here are the first 3 rows and 10 columns:

Most operations on the Tabular will now actually get done in the external database. Here we’re asking to select rows whose “name” field contains "Wolfram":

The actual computation is only done when we use ToMemory, and in this case (because there’s a lot of data in the database) it takes a little while. But soon we get the result, as a Tabular:

And we learn that there are 58 Wolfram-named items in the database:

Another source of data for Tabular is the built-in Wolfram Knowledgebase. In Version 14.2 EntityValue supports direct output in Tabular form:

The Wolfram Knowledgebase provides lots of good examples of data for Tabular. And the same is true of the Wolfram Data Repository—where you can typically just apply Tabular to get data in Tabular form:

Cleaning Data for Tabular

In many ways it’s the bane of data science. Yes, data is in digital form. But it’s not clean; it’s not computable. The Wolfram Language has long been a uniquely powerful tool for flexibly cleaning data (and, for example, for advancing through the ten levels of making data computable that I defined some years ago).

But now, in Version 14.2, with Tabular, we have a whole new collection of streamlined capabilities for cleaning data. Let’s start by importing some data “from the wild” (and, actually, this example is cleaner than many):

(By the way, if there was really crazy stuff in the file, we might have wanted to use the option MissingValuePattern to specify a pattern that would just immediately replace the crazy stuff with Missing[…].)

OK, but let’s start by surveying what came in here from our file, using TabularStructure:

We see that Import successfully managed to identify the basic type of data in most of the columns—though for example it can’t tell if numbers are just numbers or are representing quantities with units, etc. And it also identifies that some number of entries in some columns are “missing”.

As a first step in data cleaning, let’s get rid of what seems like an irrelevant "id" column:

Next, we see that the elements in the first column are being identified as strings—but they’re really dates, and they should be combined with the times in the second column. We can do this with TransformColumns, removing what’s now an “extra column” by replacing it with Nothing:

Looking at the various numerical columns, we see that they’re really quantities that should have units. But first, for convenience, let’s rename the last two columns:

Now let’s turn the numerical columns into columns of quantities with units, and, while we’re at it, also convert from °C to °F:

Here’s how we can now plot the temperature as a function of time:

There’s a lot of wiggling there. And looking at the data we see that we’re getting temperature values from several different weather stations. This selects data from a single station:

What’s the break in the curve? If we just scroll to that part of the tabular we’ll see that it’s because of missing data:

So what can we do about this? Well, there’s a powerful function TransformMissing that provides many options. Here we’re asking it to interpolate to fill in missing temperature values:

And now there are no gaps, but, slightly mysteriously, the whole plot extends further:

The reason is that it’s interpolating even in cases where basically nothing was measured. We can remove those rows using Discard:

And now we won’t have that “overhang” at the end:

Sometimes there’ll explicitly be data that’s missing; sometimes (more insidiously) the data will just be wrong. Let’s look at the histogram of pressure values for our data:

Oops. What are those small values? Presumably they’re wrong. (Perhaps they were transcription errors?) We can remove such “anomalous” values by using TransformAnomalies. Here we’re telling it to just completely trim out any row where the pressure was “anomalous”:

We can also get TransformAnomalies to try to “fix” the data. Here we’re just replacing any anomalous pressure by the previous pressure listed in the tabular:

You can also tell TransformAnomalies to “flag” any anomalous value and make it “missing”. But, if we’ve got missing values what then happens if we try to do computations on them? That’s where MissingFallback comes in. It’s fundamentally a very simple function—that just returns its first non-missing argument:

But even though it’s simple, it’s important in making it easy to handle missing values. So, for example, this computes a “northspeed”, falling back to 0 if data needed for the computation is missing:

The Structure of Tabular

We’ve said that a Tabular is “like” a list of associations. And, indeed, if you apply Normal to it, that’s what you’ll get:

But internally Tabular is stored in a much more compact and efficient way. And it’s useful to know something about this, so you can manipulate Tabular objects without having to “take them apart” into things like lists and associations. Here’s our basic sample Tabular:

What happens if we extract a row? Well, we get a TabularRow object:

If we apply Normal, we get an association:

Here’s what happens if we instead extract a column:

Now Normal gives a list:

We can create a TabularColumn from a list:

Now we can use InsertColumns to insert a symbolic column like this into an existing Tabular (including the "b" tells InsertColumns to insert the new column after the “b” column):

But what actually is a Tabular inside? Let’s look at the example:

TabularStructure gives us a summary of the internal structure here:

The first thing to notice is that everything is stated in terms of columns, reflecting the fact that Tabular is a fundamentally column-oriented construct. And part of what makes Tabular so efficient is then that within a column everything is uniform, in the sense that all the values are the same type of data. In addition, for things like quantities and dates, we factor the data so that what’s actually stored internally in the column is just a list of numbers, with a single copy of “metadata information” on how to interpret them.

And, yes, all this has a big effect. Like here’s the size in bytes of our New York trees Tabular from above:

But if we turn it into a list of associations using Normal, the result is about 14x larger:

OK, but what are those “column types” in the tabular structure? ColumnTypes gives a list of them:

These are low-level types of the kind used in the Wolfram Language compiler. And part of what knowing these does is that it immediately tells us what operations we can do on a particular column. And that’s useful both in low-level processing, and in things like knowing what kind of visualization might be possible.

When Import imports data from something like a CSV file, it tries to infer what type each column is. But sometimes (as we mentioned above) you’ll want to “cast” a column to a different type, specifying the “destination type” using Wolfram Language type description. So, for example, this casts column “b” to a 32-bit real number, and column “c” to units of meters:

By the way, when a Tabular is displayed in a notebook, the column headers indicate the types of data in the corresponding columns. So in this case, there’s a little in the first column to indicate that it contains strings. Numbers and dates basically just “show what they are”. Quantities have their units indicated. And general symbolic expressions (like column “f” here) are indicated with . (If you hover over a column header, it gives you more detail about the types.)

The next thing to discuss is missing data. Tabular always treats columns as being of a uniform type, but keeps an overall map of where values are missing. If you extract the column you’ll see a symbolic Missing:

But if you operate on the tabular column directly it’ll just behave as if the missing data is, well, missing:

By the way, if you’re bringing in data “from the wild”, Import will attempt to automatically infer the right type for each column. It knows how to deal with common anomalies in the input data, like NaN or null in a column of numbers. But if there are other weird things—like, say, notfound in the middle of a column of numbers—you can tell Import to turn such things into ordinary missing data by giving them as settings for the option MissingValuePattern.

There are a couple more subtleties to discuss in connection with the structure of Tabular objects. The first is the notion of extended keys. Let’s say we have the following Tabular:

We can “pivot this to columns” so that the values x and y become column headers, but “under” the overall column header “value”:

But what is the structure of this Tabular? We can use ColumnKeys to find out:

You can now use these extended keys as indices for the Tabular:

In this particular case, because the “subkeys” "x" and "y" are unique, we can just use those, without including the other part of the extended key:

Our final subtlety (for now) is somewhat related. It concerns key columns. Normally the way we specify a row in a Tabular object is just by giving its position. But if the values of a particular column happen to be unique, then we can use those instead to specify a row. Consider this Tabular:

The fruit column has the feature that each entry appears only once—so we can create a Tabular that uses this column as a key column:

Notice that the numbers for rows have now disappeared, and the key column is indicated with a gray background. In this Tabular, you can then reference a particular row using for example RowKey:

Equivalently, you can also use an association with the column name:

What if the values in a single column are not sufficient to uniquely specify a row, but several columns together are? (In a real-world example, say one column has first names, and another has last names, and another has dates of birth.) Well, then you can designate all those columns as key columns:

And once you’ve done that, you can reference a row by giving the values in all the key columns:

Tabular Everywhere

Tabular provides an important new way to represent structured data in the Wolfram Language. It’s powerful in its own right, but what makes it even more powerful is how it integrates with all the other capabilities in the Wolfram Language. Many functions just immediately work with Tabular. But in Version 14.2 hundreds have been enhanced to make use of the special features of Tabular.

Most often, it’s to be able to operate directly on columns in a Tabular. So, for example, given the Tabular

we can immediately make a visualization based on two of the columns:

If one of the columns has categorical data, we’ll recognize that, and plot it accordingly:

Another area where Tabular can immediately be used is machine learning. So, for example, this creates a classifier function that will attempt to determine the species of a penguin from other data about it:

Now we can use this classifier function to predict species from other data about a penguin:

We can also take the whole Tabular and make a feature space plot, labeling with species:

Or we could “learn the distribution of possible penguins”

and randomly generate 3 “fictitious penguins” from this distribution:

Algebra with Symbolic Arrays

One of the major innovations of Version 14.1 was the introduction of symbolic arrays—and the ability to create expressions involving vector, matrix and array variables, and to take derivatives of them. In Version 14.2 we’re taking the idea of computing with symbolic arrays a step further—for the first time systematically automating what has in the past been the manual process of doing algebra with symbolic arrays, and simplifying expressions involving symbolic arrays.

Let’s start by talking about ArrayExpand. Our longstanding function Expand just deals with expanding ordinary multiplication, effectively of scalars—so in this case it does nothing:

But in Version 14.2 we also have ArrayExpand which will do the expansion:

ArrayExpand deals with many generalizations of multiplication that aren’t commutative:

In an example like this, we really don’t need to know anything about a and b. But sometimes we can’t do the expansion without, for example, knowing their dimensions. One way to specify those dimensions is as a condition in ArrayExpand:

An alternative is to use an explicit symbolic array variable:

In addition to expanding generalized products using ArrayExpand, Version 14.2 also supports general simplification of symbolic array expressions:

The function ArraySimplify will specifically do simplification on symbolic arrays, while leaving other parts of expressions unchanged. Version 14.2 supports many kinds of array simplifications:

We could do these simplifications without knowing anything about the dimensions of a and b. But sometimes we can’t go as far without knowing these. For example, if we don’t know the dimensions we get:

But with the dimensions we can explicitly simplify this to an n×n identity matrix:

ArraySimplify can also take account of the symmetries of arrays. For example, let’s set up a symbolic symmetric matrix:

And now ArraySimplify can immediately resolve this:

The ability to do algebraic operations on complete arrays in symbolic form is very powerful. But sometimes it’s also important to look at individual components of arrays. And in Version 14.2 we’ve added ComponentExpand to let you get components of arrays in symbolic form.

So, for example this takes a 2-component vector and writes it out as an explicit list with two symbolic components:

Underneath, those components are represented using Indexed:

Here’s the determinant of a 3×3 matrix, written out in terms of symbolic components:

And here’s a matrix power:

Given 3D vectors and we can also for example form the cross product

and we can then go ahead and dot it into an inverse matrix:

Language Tune-Ups

As a daily user of the Wolfram Language I’m very pleased with how smoothly I find I can translate computational ideas into code. But the more we’ve made it easy to do, the more we can see new places where we can polish the language further. And in Version 14.2—like every version before it—we’ve added a number of “language tune-ups”.

A simple one—whose utility becomes particularly clear with Tabular—is Discard. You can think of it as a complement to Select: it discards elements according to the criterion you specify:

And along with adding Discard, we’ve also enhanced Select. Normally, Select just gives a list of the elements it selects. But in Version 14.2 you can specify other results. Here we’re asking for the “index” (i.e. position) of the elements that NumberQ is selecting:

Something that can be helpful in dealing with very large amounts of data is getting a bit vector data structure from Select (and Discard), that provides a bit mask of which elements are selected or not:

By the way, here’s how you can ask for multiple results from Select and Discard:

In talking about Tabular we already mentioned MissingFallback. Another function related to code robustification and error handling is the new function Failsafe. Let’s say you’ve got a list which contains some “failed” elements. If you map a function f over that list, it’ll apply itself to the failure elements just as to everything else:

But quite possibly f wasn’t set up to deal with these kinds of failure inputs. And that’s where Failsafe comes in. Because Failsafe[f][x] is defined to give f[x] if x is not a failure, and to just return the failure if it is. So now we can map f across our list with impunity, knowing it’ll never be fed failure input:

Talking of tricky error cases, another new function in Version 14.2 is HoldCompleteForm. HoldForm lets you display an expression without doing ordinary evaluation of the expression. But—like Hold—it still allows certain transformations to get made. HoldCompleteForm—like HoldComplete—prevents all these transformations. So while HoldForm gets a bit confused here when the sequence “resolves”

HoldCompleteForm just completely holds and displays the sequence:

Another piece of polish added in Version 14.2 concerns Counts. I often find myself wanting to count elements in a list, including getting 0 when a certain element is missing. By default, Counts just counts elements that are present:

But in Version 14.2 we’ve added a second argument that lets you give a complete list of all the elements you want to count—even if they happen to be absent from the list:

As a final example of language tune-up in Version 14.2 I’ll mention AssociationComap. In Version 14.0 we introduced Comap as a “co-” (as in “co-functor”, etc.) analog of Map:

In Version 14.2 we’re introducing AssociationComap—the “co-” version of AssociationMap:

Think of it as a nice way to make labeled tables of things, as in:

Brightening Our Colors; Spiffing Up for 2025

In 2014—for Version 10.0—we did a major overhaul of the default colors for all our graphics and visualization functions, coming up with what we felt was a good solution. (And as we’ve just noticed, somewhat bizarrely, it turned out that in the years that followed, many of the graphics and visualization libraries out there seemed to copy what we did!) Well, a decade has now passed, visual expectations (and display technologies) have changed, and we decided it was time to spiff up our colors for 2025.

Here’s what a typical plot looked like in Versions 10.0 through 14.1:

And here’s the same plot in Version 14.2:

By design, it’s still completely recognizable, but it’s got a little extra zing to it.

With more curves, there are more colors. Here’s the old version:

And here’s the new version:

Histograms are brighter too. The old:

And the new:

Here’s the comparison between old (“2014”) and new (“2025”) colors:

It’s subtle, but it makes a difference. I have to say that increasingly over the past few years, I’ve felt I had to tweak the colors in almost every Wolfram Language image I’ve published. But I’m excited to say that with the new colors that urge has gone away—and I can just use our default colors again!

LLM Streamlining & Streaming

We first introduced programmatic access to LLMs in Wolfram Language in the middle of 2023, with functions like LLMFunction and LLMSynthesize. At that time, these functions needed access to external LLM services. But with the release last month of LLM Kit (along with Wolfram Notebook Assistant) we’ve made these functions seamlessly available for everyone with a Notebook Assistant + LLM Kit subscription. Once you have your subscription, you can use programmatic LLM functions anywhere and everywhere in Version 14.2 without any further set up.

There are also two new functions: LLMSynthesizeSubmit and ChatSubmit. Both are concerned with letting you get incremental results from LLMs (and, yes, that’s important, at least for now, because LLMs can be quite slow). Like CloudSubmit and URLSubmit, LLMSynthesizeSubmit and ChatSubmit are asynchronous functions: you call them to start something that will call an appropriate handler function whenever a certain specified event occurs.

Both LLMSynthesizeSubmit and ChatSubmit support a whole variety of events. An example is "ContentChunkReceived": an event that occurs when there’s a chunk of content received from the LLM.

Here’s how one can use that:

The LLMSynthesizeSubmit returns a TaskObject, but then starts to synthesize text in response to the prompt you’ve given, calling the handler function you specified every time a chunk of text comes in. After a few moments, the LLM will have finished its process of synthesizing text, and if you ask for the value of c you’ll see each of the chunks it produced:

Let’s try this again, but now setting up a dynamic display for a string s and then running LLMSynthesizeSubmit to accumulate the synthesized text into this string:

ChatSubmit is the analog of ChatEvaluate, but asynchronous—and you can use it to create a full chat experience, in which content is streaming into your notebook as soon as the LLM (or tools called by the LLM) generate it.

Streamlining Parallel Computation: Launch All the Machines!

For nearly 20 years we’ve had a streamlined capability to do parallel computation in Wolfram Language, using functions like ParallelMap, ParallelTable and Parallelize. The parallel computation can happen on multiple cores on a single machine, or across many machines on a network. (And, for example, in my own current setup I have 7 machines right now with a total of 204 cores.)

In the past few years, partly responding to the increasing number of cores typically available on individual machines, we’ve been progressively streamlining the way that parallel computation is provisioned. And in Version 14.2 we’ve, yes, parallelized the provisioning of parallel computation. Which means, for example, that my 7 machines all start their parallel kernels in parallel—so that the whole process is now finished in a matter of seconds, rather than potentially taking minutes, as it did before:

Another new feature for parallel computation in Version 14.2 is the ability to automatically parallelize across multiple variables in ParallelTable. ParallelTable has always had a variety of algorithms for optimizing the way it splits up computations for different kernels. Now that’s been extended so that it can deal with multiple variables:

As someone who very regularly does large-scale computations with the Wolfram Language it’s hard to overstate how seamlessly important its parallel computation capabilities have been to me. Usually I’ll first figure out a computation with Map, Table, etc. Then when I’m ready to do the full version I’ll swap in ParallelMap, ParallelTable, etc. And it’s remarkable how much difference a 200x increase in speed makes (assuming my computation doesn’t have too much communication overhead).

(By the way, talking of communication overhead, two new functions in Version 14.2 are ParallelSelect and ParallelCases, which allow you to select and find cases in lists in parallel, saving communication overhead by sending only final results back to the master kernel. This functionality has actually been available for a while through Parallelize[ … Select[ … ] … ] etc., but it’s streamlined in Version 14.2.)

Follow that ____! Tracking in Video

Let’s say we’ve got a video, for example of people walking through a train station. We’ve had the capability for some time to take a single frame of such a video, and find the people in it. But in Version 14.2 we’ve got something new: the capability to track objects that move around between frames of the video.

Let’s start with a video:

We could take an individual frame, and find image bounding boxes. But as of Version 14.2 we can just apply ImageBoundingBoxes to the whole video at once:

Then we can apply the data on bounding boxes to highlight people in the video—using the new HighlightVideo function:

But this just separately indicates where people are in each frame; it doesn’t connect them from one frame to another. In Version 14.2 we’ve added VideoObjectTracking to follow objects between frames:

Now if we use HighlightVideo, different objects will be annotated with different colors:

This picks out all the unique objects identified in the course of the video, and counts them:

“Where’s the dog?”, you might ask. It’s certainly not there for long:

And if we find the first frame where it is supposed to appear it does seem as if what’s presumably a person on the lower right has been mistaken for a dog:

And, yup, that’s what it thought was a dog:

Game Theory

“What about game theory?”, people have long asked. And, yes, there has been lots of game theory done with the Wolfram Language, and lots of packages written for particular aspects of it. But in Version 14.2 we’re finally introducing built-in system functions for doing game theory (both matrix games and tree games).

Here’s how we specify a (zero-sum) 2-player matrix game:

This defines payoffs when each player takes each action. We can represent this by a dataset:

An alternative is to “plot the game” using MatrixGamePlot:

OK, so how can we “solve” this game? In other words, what action should each player take, with what probability, to maximize their average payoff over many instances of the game? (It’s assumed that in each instance the players simultaneously and independently choose their actions.) A “solution” that maximizes expected payoffs for all players is called a Nash equilibrium. (As a small footnote to history, John Nash was a long-time user of Mathematica and what’s now the Wolfram Language—though many years after he came up with the concept of Nash equilibrium.) Well, now in Version 14.2, FindMatrixGameStrategies computes optimal strategies (AKA Nash equilibria) for matrix games:

This result means that for this game player 1 should play action 1 with probability and action 2 with probability , and player 2 should do these with probabilities and . But what are their expected payoffs? MatrixGamePayoff computes that:

It can get pretty hard to keep track of the different cases in a game, so MatrixGame lets you give whatever labels you want for players and actions:

These labels are then used in visualizations:

What we just showed is actually a standard example game—the “prisoner’s dilemma”. In the Wolfram Language we now have GameTheoryData as a repository of about 50 standard games. Here’s one, specified to have 4 players:

And it’s less trivial to solve this game, but here’s the result—with 27 distinct solutions:

And, yes, the visualizations keep on working, even when there are more players (here we’re showing the 5-player case, indicating the 50th game solution):

It might be worth mentioning that the way we’re solving these kinds of games is by using our latest polynomial equation solving capabilities—and not only are we able to routinely find all possible Nash equilibria (not just a single fixed point), but we’re also able to get exact results:

In addition to matrix games, which model games in which players simultaneously pick their actions just once, we’re also supporting tree games, in which players take turns, producing a tree of possible outcomes, ending with a specified payoff for each of the players. Here’s an example of a very simple tree game:

We can get at least one solution to this game—described by a nested structure that gives the optimal probabilities for each action of each player at each turn:

Things with tree games can get more elaborate. Here’s an example—in which other players sometimes don’t know which branches were taken (as indicated by states joined by dashed lines):

What we’ve got in Version 14.2 represents rather complete coverage of the basic concepts in a typical introductory game theory course. But now, in typical Wolfram Language fashion, it’s all computable and extensible—so you can study more realistic games, and quickly do lots of examples to build intuition.

We’ve so far concentrated on “classic game theory”, notably with the feature (relevant to many current applications) that all action nodes are the result of a different sequence of actions. However, games like tic-tac-toe (that I happened to recently study using multiway graphs) can be simplified by merging equivalent action nodes. Multiple sequences of actions may lead to the same game of tic-tac-toe, as is often the case for iterated games. These graph structures don’t fit into the kind of classic game theory trees we’ve introduced in Version 14.2—though (as my own efforts I think demonstrate) they’re uniquely amenable to analysis with the Wolfram Language.

Computing the Syzygies, and Other Advances in Astronomy

There are lots of “coincidences” in astronomy—situations where things line up in a particular way. Eclipses are one example. But there are many more. And in Version 14.2 there’s now a general function FindAstroEvent for finding these “coincidences”, technically called syzygies (“sizz-ee-gees”), as well as other “special configurations” of astronomical objects.

A simple example is the September (autumnal) equinox:

Roughly this is when day and night are of equal length. More precisely, it’s when the sun is at one of the two positions in the sky where the plane of the ecliptic (i.e. the orbital plane of the earth around the sun) crosses the celestial equator (i.e. the projection of the earth’s equator)—as we can see here (the ecliptic is the yellow line; the celestial equator the blue one):

As another example, let’s find the next time over the next century when Jupiter and Saturn will be closest in the sky:

They’ll get close enough to see their moons together:

There are an incredible number of astronomical configurations that have historically been given special names. There are equinoxes, solstices, equiluxes, culminations, conjunctions, oppositions, quadratures—as well as periapses and apoapses (specialized to perigee, perihelion, periareion, perijove, perikrone, periuranion, periposeideum, etc.). In Version 14.2 we support all these.

So, for example, this gives the next time Triton will be closest to Neptune:

A famous example has to do with the perihelion (closest approach to the Sun) of Mercury. Let’s compute the position of Mercury (as seen from the Sun) at all its perihelia in the first couple of decades of the nineteenth century:

We see that there’s a systematic “advance” (along with some wiggling):

So now let’s quantitatively compute this advance. We start by finding the times for the first perihelia in 1800 and 1900:

Now we compute the angular separation between the positions of Mercury at these times:

Then divide this by the time difference

and convert units:

Famously, 43 arcseconds per century of this is the result of deviations from the inverse square law of gravity introduced by general relativity—and, of course, accounted for by our astronomical computation system. (The rest of the advance is the result of traditional gravitational effects from Venus, Jupiter, Earth, etc.)

PDEs Now Also for Magnetic Systems

More than a decade and a half ago we made the commitment to make the Wolfram Language a full strength PDE modeling environment. Of course it helped that we could rely on all the other capabilities of the Wolfram Language—and what we’ve been able to produce is immeasurably more valuable because of its synergy with the rest of the system. But over the years, with great effort, we’ve been steadily building up symbolic PDE modeling capabilities across all the standard domains. And at this point I think it’s fair to say that we can handle—at an industrial scale—a large part of the PDE modeling that arises in real-world situations.

But there are always more cases for which we can build in capabilities, and in Version 14.2 we’re adding built-in modeling primitives for static and quasistatic magnetic fields. So, for example, here’s how we can now model an hourglass-shaped magnet. This defines boundary conditions—then solves the equations for the magnetic scalar potential:

We can then take that result, and, for example, immediately plot the magnetic field lines it implies:

Version 14.2 also adds the primitives to deal with slowly varying electric currents, and the magnetic fields they generate. All of this immediately integrates with our other modeling domains like heat transfer, fluid dynamics, acoustics, etc.

There’s much to say about PDE modeling and its applications, and in Version 14.2 we’ve added more than 200 pages of additional textbook-style documentation about PDE modeling, including some research-level examples.

New Features in Graphics, Geometry & Graphs

Graphics has always been a strong area for the Wolfram Language, and over the past decade we’ve also built up very strong computational geometry capabilities. Version 14.2 adds some more “icing on the cake”, particularly in connecting graphics to geometry, and connecting geometry to other parts of the system.

As an example, Version 14.2 adds geometry capabilities for more of what were previously just graphics primitives. For example, this is a geometric region formed by filling a Bezier curve:

And we can now do all our usual computational geometry operations on it:

Something like this now works too:

Something else new in Version 14.2 is MoleculeMesh, which lets you build computable geometry from molecular structures. Here’s a graphical rendering of a molecule:

And here now is a geometric mesh corresponding to the molecule:

We can then do computational geometry on this mesh:

Another new feature in Version 14.2 is an additional method for graph drawing that can make use of symmetries. If you make a layered graph from a symmetrical grid, it won’t immediately render in a symmetrical way:

But with the new "SymmetricLayeredEmbedding" graph layout, it will:

User Interface Tune-Ups

Making a great user interface is always a story of continued polishing, and we’ve now been doing that for the notebook interface for nearly four decades. In Version 14.2 there are several notable pieces of polish that have been added. One concerns autocompletion for option values.

We’ve long shown completions for options that have a discrete collection of definite common settings (such as All, Automatic, etc.). In Version 14.2 we’re adding “template completions” that give the structure of settings, and then let you tab through to fill in particular values. In all these years, one of the places I pretty much always find myself going to in the documentation is the settings for FrameLabel. But now autocompletion immediately shows me the structure of these settings:

Also in autocompletion, we’ve added the capability to autocomplete context names, context aliases, and symbols that include contexts. And in all cases, the autocompletion is “fuzzy” in the sense that it’ll trigger not only on characters at the beginning of a name but on ones anywhere in the name—which means that you can just type characters in the name of a symbol, and relevant contexts will appear as autocompletions.

Another small convenience added in Version 14.2 is the ability to drag images from one notebook to any other notebook, or, for that matter, to any other application that can accept dragged images. It’s been possible to drag images from other applications into notebooks, but now you can do it the other way too.

Something else that’s for now specific to macOS is enhanced support for icon preview (as well as Quick Look). So now if you have a folder full of notebooks and you select Icon view, you’ll see a little representation of each notebook as an icon of its content:

Under the hood in Version 14.2 there are also some infrastructural developments that will enable significant new features in subsequent versions. Some of these involve generalized support for dark mode. (Yes, one might initially imagine that dark mode would somehow be trivial, but when you start thinking about all the graphics and interface elements that involve colors, it’s clear it’s not. Though, for example, after significant effort we did recently release dark mode for Wolfram|Alpha.)

So, for example, in Version 14.2 you’ll find the new symbol LightDarkSwitched, which is part of the mechanism for specifying styles that will automatically switch for light and dark modes. And, yes, there is a style option LightDark that will switch modes for notebooks—and which is at least experimentally supported.

Related to light/dark mode is also the notion of theme colors: colors that are defined symbolically and can be switched together. And, yes, there’s an experimental symbol ThemeColor related to these. But the full deployment of this whole mechanism won’t be there until the next version.

The Beginnings of Going Native on GPUs

Many important pieces of functionality inside the Wolfram Language automatically make use of GPUs when they are available. And already 15 years ago we introduced primitives for low-level GPU programming. But in Version 14.2 we’re beginning the process of making GPU capabilities more readily available as a way to optimize general Wolfram Language usage. The key new construct is GPUArray, which represents an array of data that will (if possible) be stored so as to be immediately and directly accessible to your GPU. (On some systems, it will be stored in separate “GPU memory”; on others, such as modern Macs, it will be stored in shared memory in such a way as to be directly accessible by the GPU.)

In Version 14.2 we’re supporting an initial set of operations that can be performed directly on GPU arrays. The operations available vary slightly from one type of GPU to another. Over time, we expect to use or create many additional GPU libraries that will extend the set of operations that can be performed on GPU arrays.

Here is a random ten-million-element vector stored as a GPU array:

The GPU on the Mac on which I am writing this supports the necessary operations to do this purely in its GPU, giving back a GPUArray result:

Here’s the timing:

And here’s the corresponding ordinary (CPU) result:

In this case, the GPUArray result is about a factor of 2 faster. What factor you get will vary with the operations you’re doing, and the particular hardware you’re using. So far, the largest factors I’ve seen are around 10x. But as we build more GPU libraries, I expect this to increase—particularly when what you’re doing involves a lot of compute “inside the GPU”, and not too much memory access.

By the way, if you sprinkle GPUArray in your code it’ll normally never affect the results you get—because operations always default to running on your CPU if they’re not supported on your GPU. (Usually GPUArray will make things faster, but if there are too many “GPU misses” then all the “attempts to move data” may actually slow things down.) It’s worth realizing, though, that GPU computation is still not at all well standardized or uniform. Sometimes there may only be support for vectors, sometimes also matrices—and there may be different data types with different numerical precision supported in different cases.

And Even More…

In addition to all the things we’ve discussed here so far, there are also a host of other “little” new features in Version 14.2. But even though they may be “little” compared to other things we’ve discussed, they’ll be big if you happen to need just that functionality.

For example, there’s MidDate—that computes the midpoint of dates:

And like almost everything involving dates, MidDate is full of subtleties. Here it’s computing the week 2/3 of the way through this year:

In math, functions like DSolve and SurfaceIntegrate can now deal with symbolic array variables:

SumConvergence now lets one specify the range of summation, and can give conditions that depend on it:

A little convenience that, yes, I asked for, is that DigitCount now lets you specify how many digits altogether you want to assume your number has, so that it appropriately counts leading 0s:

Talking of conveniences, for functions like MaximalBy and TakeLargest we added a new argument that says how to sort elements to determine “the largest”. Here’s the default numerical order

and here’s what happens if we use “symbolic order” instead:

There are always so many details to polish. Like in Version 14.2 there’s an update to MoonPhase and related functions, both new things to ask about, and new methods to compute them:

In another area, in addition to major new import/export formats (particularly to support Tabular) there’s an update to "Markdown" import that gives results in plaintext, and there’s an update to "PDF" import that gives a mixed list of text and images.

And there are lots of other things too, as you can find in the “Summary of New and Improved Features in 14.2”. By the way, it’s worth mentioning that if you’re looking at a particular documentation page for a function, you can always find out what’s new in this version just by pressing show changes:

Download your 14.2 now! » (It’s already live in the Wolfram Cloud!)

Who Can Understand the Proof? A Window on Formalized Mathematics

Stephen Wolfram — Thu, 09 Jan 2025 22:42:31 +0000

The Simplest Axiom for Logic

Theorem (Wolfram with Mathematica, 2000):
The single axiom ((a•b)•c)•(a•((a•c)•a))c is a complete axiom system for Boolean algebra (and is the simplest possible)

For more than a century people had wondered how simple the axioms of logic (Boolean algebra) could be. On January 29, 2000, I found the answer—and made the surprising discovery that they could be about twice as simple as anyone knew. (I also showed that what I found was the simplest possible.)

It was an interesting result—that gave new intuition about just how simple the foundations of things can be, and for example helped inspire my efforts to find a simple underlying theory of physics.

But how did I get the result? Well, I used automated theorem proving (specifically, what’s now FindEquationalProof in Wolfram Language). Automated theorem proving is something that’s been around since at least the 1950s, and its core methods haven’t changed in a long time. But in the rare cases it’s been used in mathematics it’s typically been to confirm things that were already believed to be true. And in fact, to my knowledge, my Boolean algebra axiom is actually the only truly unexpected result that’s ever been found for the first time using automated theorem proving.

But, OK, so we know it’s true. And that’s interesting. But what about the proof? Does the proof, for example, show us why the result is true? Well, actually, in a quarter of a century, nobody (including me) has ever made much headway at all in understanding the proof (which, at least in the form we currently know it, is long and complicated). So is that basically inevitable—say as a consequence of computational irreducibility? Or is there some way—perhaps using modern AI—to “humanize” the proof to a point where one can understand it?

It is, I think, an interesting challenge—that gets at the heart of what one can (and can’t) expect to achieve with formalized mathematics. In what follows, I’ll discuss what I’ve been able to figure out—and how it relates to foundational questions about what mathematics is and how it can be done. And while I think I’ve been able to clarify some of the issues, the core problem is still out there—and I’d like to issue it here as a challenge:

Challenge: Understand the proof of the Theorem

What do I mean by “understand”? Inevitably, “understand” has to be defined in human terms. Something like “so a human can follow and reproduce it”—and, with luck, feel like saying “aha!” at some point, the kind of way they might on hearing a proof of the Pythagorean theorem (or, in logic, something like de Morgan’s law Not[And[p, q]]Or[Not[p], Not[q]]).

It should be said that it’s certainly not clear that such an understanding would ever be possible. After all, as we’ll discuss, it’s a basic metamathematical fact that out of all possible theorems almost none have short proofs, at least in terms of any particular way of stating the proofs. But what about an “interesting theorem” like the one we’re considering here? Maybe that’s different. Or maybe, at least, there’s some way of building out a “higher-level mathematical narrative” for a theorem like this that will take one through the proof in human-accessible steps.

In principle one could always imagine a somewhat bizarre scenario in which people would just rote learn chunks of the proof, perhaps giving each chunk some name (a bit like how people learned bArbArA and cElArEnt syllogisms in the Middle Ages). And in terms of these chunks there’d presumably then be a “human way” to talk about the proof. But learning the chunks—other than as some kind of recreational or devotional activity—doesn’t seem to make much sense unless there’s metamathematical structure that somehow connects the chunks to “general concepts” that are widely useful elsewhere.

But of course it’s still conceivable that there might be a “big theory” that would lead us to the theorem in an “understandable way”. And that could be a traditional mathematical theory, built up with precise, if potentially very abstract, constructs. But what about something more like a theory in natural science? In which we might treat our automatically generated proof as an object for empirical study—exploring its characteristics, trying to get intuition about it, and ultimately trying to deduce the analog of “natural laws” that give us a “human-level” way of understanding it.

Of course, for many purposes it doesn’t really matter why the theorem is true. All that matters is that it is true, and that one can deduce things on the basis of it. But as one thinks about the future of mathematics, and the future of doing mathematics, it’s interesting to explore to what extent it might or might not ultimately be possible to understand in a human-accessible way the kind of seemingly alien result that the theorem represents.

The Proof as We Know It

I first presented a version of the proof on two pages of my 2002 book A New Kind of Science, printing it in 4-point type to make it fit:

Today, generating a very similar proof is a one-liner in Wolfram Language (as we’ll discuss below, the · dot here can be thought of as representing the Nand operation):

The proof involves 307 (mostly rather elaborate) steps. And here’s one page of it (out of about 30)—presented in the form of a computable Wolfram Language dataset:

What’s the basic idea of this proof? Essentially it’s to perform a sequence of purely structural symbolic operations that go from our axiom to known axioms of Boolean algebra. And the proof does this by proving a series of lemmas which can be combined to eventually give what we want:

The highlighted “targets” here are the standard Sheffer axioms for Boolean algebra from 1913:

And, yes, even though these are quite short, the intermediate lemmas involved in the proof get quite long—the longest involving 60 symbols (i.e. having LeafCount 60):

It’s as if to get to where it’s going, the proof ends up having to go through the wilds of metamathematical space. And indeed one gets a sense of this if one plots the sizes (i.e. LeafCount) of successive lemmas:

Here’s the distribution of these sizes, showing that while they’re often small, there’s a long tail (note, by the way, that if dot · appears n times in a lemma, the LeafCount will be 2n + 3):

So how are these lemmas related? Here’s a graph of their interdependence (with the size of each dot being proportional to the size of the lemma it represents):

Zooming in on the top we see more detail:

We start from our axiom, then derive a whole sequence of lemmas—as we’ll see later, always combining two lemmas to create a new one. (And, yes, we could equally well call these things theorems—but we generate so many of them it seems more natural to call them “lemmas”.)

So, OK, we’ve got a complicated proof. But how can we check that it’s correct? Well, from the symbolic representation of the proof in the Wolfram Language we can immediately generate a “proof function” that in effect contains executable versions of all the lemmas—implemented using simple structural operations:

And when you run this function, it applies all these lemmas and checks that the result comes out right:

And, yes, this is basically what one would do in a proof assistant system (like Lean or Metamath)—except that here the steps in the proof were generated purely automatically, without any human guidance (or effort). And, by the way, the fact that we can readily translate our symbolic proof representation into a function that we can run provides an operational manifestation of the equivalence between proofs and programs.

But let’s look back at our lemma-interdependence “proof graph”. One notable feature is that we see several nodes with high out-degree—corresponding to what we can think of as “pivotal lemmas” from which many other lemmas end up directly being proved. So here’s a list of the “most pivotal” lemmas in our proof:

Or, more graphically, here are the results for all lemmas that occur:

So what are the “pivotal lemmas”? a · b = b · a we readily recognize as commutativity. But the others—despite their comparative simplicity—don’t seem to correspond to things that have specifically shown up before in the mathematical literature (or, as we’ll discuss later, that’s at least what the current generation of LLMs tell us).

But looking at our proof graph something we can conclude is that a large fraction of the “heavy lifting” needed for the whole proof has already happened by the time we can prove a · b = b · a. So, for the sake of avoiding at least some of hairy detail in the full proof, in most of what follows, we’ll concentrate on the proof of a · b = b · a—which FindEquationalProof tells us we can accomplish in 104 steps, with a proof graph of the form

with the sizes of successive lemmas (in what is basically a breadth-first traversal of the proof graph) being:

The “Machine Code” of the Proof

It’s already obvious from the previous section that the proof as we currently know it is long, complicated, and fiddly—and in many ways reminiscent of something at a “machine-code” level. But to get a grounded sense of what’s going on in the proof, it’s useful to dive into the details—even if, yes, they can be seriously hard to wrap one’s head around.

At a fundamental level, the way the proof—say of a · b = b · a—works is by starting from our axiom, and then progressively deducing new lemmas from pairs of existing lemmas. In the simplest case, that deduction works by straightforward symbolic substitution. So, for example, let’s say we have the lemmas

and

Then it turns out that from these lemmas we can deduce:

Or, in other words, knowing that the first two lemmas hold for any a gives us enough information about · that the third lemma must inevitably also hold. So how do we derive this?

Our lemmas in effect define two-way equivalences: their left-hand sides are defined as equal to their right-hand sides, which means that if we see an expression that (structurally) matches one side of a lemma, we can always replace it by the other side of the lemma. And to implement this, we can write our second lemma explicitly as a rule—where to avoid confusion we’re using x rather than a:

But if we now look at our first lemma, we see that there’s part of it (indicated with a frame) that matches the left-hand side of our rule:

If we replace this part (which is at position {2,2}) using our rule we then get

which is precisely the lemma we wanted to deduce.

We can summarize what happened here as a fragment of our proof graph—in which a “substitution event” node takes our first two lemmas as input, and “outputs” our final lemma:

As always, the symbolic expressions we’re working with here can be represented as trees:

The substitution event then corresponds to a tree rewriting:

The essence of automated theorem proving is to find a particular sequence of substitutions etc. that get us from whatever axioms or lemmas we’re starting with, to whatever lemmas or theorems we want to reach. Or in effect to find a suitable “path” through the multiway graph of all possible substitutions etc. that can be made.

So, for example, in the particular case we’re considering here, this is the graph that represents all possible transformations that can occur through a single substitution event:

The particular transformation (or “path”) we’ve used to prove a · a = a · ((a · a) · a) is highlighted. But as we can see, there are many other possible lemmas that can be generated, or in other words that can be proved from the two lemmas we’ve given as input. Put another way, we can think of our input lemmas as implying or entailing all the other lemmas shown here. And, by analogy to the concept of a light cone in physics, we can view the collection of everything entailed by given lemmas or given events as the (future) “entailment cone” of those lemmas or events. A proof that reaches a particular lemma is then effectively a path in this entailment cone—analogous in physics to a world line that reaches a particular spacetime point.

If we continue building out the entailment cone from our original lemmas, then after two (substitution) events we get:

There are 21 lemmas generated here. But it turns out that beyond the lemma we already discussed there are only three (highlighted here) that appear in the proof we are studying here:

And indeed the main algorithmic challenge of theorem proving is to figure out which lemmas to generate in order to get a path to the theorem one’s trying to prove. And, yes, as we’ll discuss later, there are typically many paths that will work, and different algorithms will yield different paths and therefore different proofs.

But, OK, seeing how new lemmas can be derived from old by substitution is already quite complicated. But actually there’s something even more complicated we need to discuss: deriving lemmas not only by substitution but also by what we’ve called bisubstitution.

We can think of both substitution and bisubstitution as turning one lemma X == Y into a transformation rule (either X Y or Y X), and then applying this rule to another lemma, to derive a new lemma. In ordinary substitution, the left-hand side of the rule directly matches (in a Wolfram Language pattern-matching sense) a subexpression in the lemma we’re transforming. But the key point is that all the variables that appear in both our lemmas are really “pattern variables” (x_ etc. in Wolfram Language). So that means there’s another way that one lemma can transform another, in which in effect replacements are made not only in the lemma being transformed, but also in the lemma that’s doing the transforming.

The net effect, though, is still to take two lemmas and derive another, as in:

But in tracing through the details of our proof, we need to distinguish “substitution events” (shown yellowish) from “bisubstitution” ones (shown reddish). (In FindEquationalProof in Wolfram Language, lemmas produced by ordinary substitution are called “substitution lemmas”, while lemmas produced by bisubstitution are called “critical pair lemmas”.)

OK, so how does bisubstitution work? Let’s look at an example. We’re going to be transforming the lemma

using the lemma (which in this case happens to be our original axiom)

to derive the new lemma:

We start by creating a rule from the second lemma. In this case, the rule we need happens to be reversed relative to the way we wrote the lemma, and this means that (in the canonical form we’re using) it’s convenient to rename the variables that appear:

To do our bisubstitution we’re going to apply this rule to a subterm of our first lemma. We can write that first lemma with explicit pattern variables:

As always, the particular names of those variables don’t matter. And to avoid confusion, we’re going to rename them:

Now look at this subterm of this lemma (which is part {2,1,1,2} of the expression):

It turns out that with appropriate bindings for pattern variables this can be matched (or “unified”) with the left-hand side of our rule. This provides a way to find such bindings:

(Note that in these bindings things like c_ stand only for explicit expressions, like c_, not for expressions that the ordinary Wolfram Language pattern c_ would match.)

Now if we apply the bindings we’ve found to the left-hand side of our rule

and to the subterm we picked out from our lemma

we see that we get the same expression. Which means that with these bindings the subterm matches the left-hand side of our rule, and we can therefore replace this subterm with the right-hand side of the rule. To see all this in operation, we first apply the bindings we’ve found to the lemma we’re going to transform (and, as it happens, the binding for y_ is the only one that matters here):

Now we take this form and apply the rule at the position of the subterm we identified:

Renaming variables

we now finally get exactly the lemma that we were trying to derive:

And, yes, getting here was a pretty complicated process. But with the symbolic character of our lemmas, it’s one that is inevitably possible, and so can be used in our proof. And in the end, out of the 101 lemmas used in the proof, 47 were derived by ordinary substitution, while 54 were derived by bisubstitution.

And indeed the first few steps of the proof turn out to use only bisubstituion. An example is the first step—which effectively applies the original axiom to itself using bisubsitution:

And, yes, even this very first step is pretty difficult to follow.

If we start from the original axiom, there are 16 lemmas that can be derived purely by a single ordinary substitution (effectively of the axiom into itself)—resulting in the following entailment cone:

As it happens, though, none of the 16 new lemmas here actually get used in our proof. On the other hand, in the bisubstitution entailment cone

there are 24 new lemmas, and 4 of them get used in the proof—as we can see from the first level of the proof graph (here rotated for easier rendering):

At the next level of the entailment cone from ordinary substitution, there are 5062 new lemmas—none of which get used in the proof. But of the 31431 new lemmas in the (pure) bisubstitution entailment cone, 13 do get used:

At the next level, lemmas generated by ordinary substitution also start to get used:

Here’s another rendering of these first few levels of the proof graph:

Going to another couple of levels we’re starting to see quite a few independent chains of lemmas developing

which eventually join up when we assemble the whole proof graph:

A notable feature of this proof graph is that it has more bisubstitution events at the top, and more ordinary substitution events at the bottom. So why is that? Essentially it seems to be because bisubstitution events tend to produce larger lemmas, and ordinary substitution events tend to produce smaller ones—as we can see if we plot input and output lemma sizes for all events in the proof:

So in effect what seems to be happening is that the proof first has to “spread out in metamathematical space”, using bisubstitution to generate large lemmas “far out in metamathematical space”. Then later the proof has to “corral things back in”, using ordinary substitution to generate smaller lemmas. And for example, at the very end, it’s a substitution event that yields the final theorem we’re trying to prove:

And earlier in the graph, there’s a similar “collapse” to a small (and rather pivotal) lemma:

As the plot above indicates, ordinary substitution can lead to large lemmas, and indeed bisubstitution can also lead to smaller ones, as in

or slightly more dramatically:

But, OK, so this is some of what’s going on at a “machine-code” level inside the proof we have. Of course, given our axiom and the operations of substitution and bisubstitution there are inevitably a huge number of different possible proofs that could be given. The particular proof we’re considering is what the Wolfram Language FindEquationalProof gives. (In the Appendix, we’ll also look at results from some other automated theorem proving systems. The results will be very comparable, if usually a little lengthier.)

We won’t discuss the detailed (and rather elaborate) algorithms inside FindEquationalProof. But fundamentally what they’re doing is to try constructing certain lemmas, then to find sequences of lemmas that eventually form a “path” to what we’re trying to prove. And as some indication of what’s involved in this, here’s a plot of the number of “candidate lemmas” that are being maintained as possible when different lemmas in the proof are generated:

And, yes, for a while there’s roughly exponential growth, leveling off at just over a million when we get to the “pulling everything together” stage of the proof.

Unrolling the Proof

In what we’ve done so far, we’ve viewed our proof as working by starting from an axiom, then progressively building up lemmas, until eventually we get to the theorem we want. But there’s an alternative view that’s in some ways useful in getting a more direct, “mechanical” intuition about what’s going on in the proof.

Let’s say we’re trying to prove that our axiom implies that p · q = q · p. Well, then there must be some way to start from the expression p · q and just keep on judiciously applying the axiom until eventually we get to the expression q · p. And, yes, the number of axiom application steps required might be very large. But ultimately, if it’s true that the axiom implies p · q = q · p there must be a path that gets from p · q to q · p.

But before considering the case of our full proof, let’s start with something simpler. Let’s assume that we’ve already established the lemmas:

Then we can treat them as axioms, and ask a question like whether they imply the lemma

or, in our current approach, whether they can be used to form a path from a · a to a · (a · (a · a)).

Well, it’s not too hard to see that in fact there is such a path. Apply our second lemma to a · a to get:

But now this subterm

matches the left-hand of the first lemma, so that it can be replaced by the right-hand side of that lemma (i.e. by a · (a · a)), giving in the end the desired a · (a · (a · a)).

So now we can summarize this process as:

In what follows, it’ll be convenient to label lemmas. We’ll call our original axiom A1, we’ll call our successive lemmas generated by ordinary substitution Sn and the ones generated by bisubsitution Bn:

In our proof we’ll also use and to indicate whether we’re going to use the lemma (say X = Y) in the “forward direction” X Y or the “reverse direction” X Y. And with this labeling, the proof we just gave (which is for the lemma S23) becomes:

Each step here is a pure substitution, and requires no replacement in the rule (i.e. “axiom”) being used. But proofs like this can also be done with bisubstitution, where replacements are applied to the rule to get it in a form where it can directly be applied to transform an expression:

OK, so how about the first lemma in our full proof? Here’s a proof that its left-hand side can be transformed to its right-hand side just by judiciously applying the original axiom:

Here’s a corresponding proof for the second lemma:

Both these involve bisubstitution. Here’s a proof of the first lemma derived purely by ordinary substitution:

This proof is using not only the original axiom but also the lemma B5. Meanwhile, B5 can be proved using the original axiom together with B2:

But now, inserting the proof we just gave above for B2, we can give a proof of B5 just in terms of the original axiom:

And recursively continuing this unrolling process, we can then prove S1 purely using the original axiom:

What about the whole proof? Well, at the very end we have:

If we “unroll” one step we have

and after 2 steps:

In principle we could go on with this unrolling, in effect recursively replacing each rule by the sequence of transformations that represents its proof. Typically this process will, however, generate exponentially longer proof sequences. But say for lemma S5

the result is still very easily manageable:

We can summarize this result by in effect plotting the sizes of the intermediate expressions involved—and indicating what part of each expression is replaced at each step (with as above indicating “forward” use of the axiom A1 and “backward” A1 ):

For lemma B33

the unrolled proof is now 30 steps long

while for lemma S11

the unrolled proof is 88 steps long:

But here there is a new subtlety. Doing a direct substitution of the “proof paths” for the lemmas used to prove S11 in our original proof gives a proof of length 104:

But this proof turns out to be repetitive, with the whole gray section going from one copy to another of:

As an example of a larger proof, we can consider lemma B47:

And despite the simplicity of this lemma, our proof for it is 1008 steps long:

If we don’t remove repetitive sections, it’s 6805 steps:

Can we unroll the whole proof of a · b = b · a? We can get closer by considering lemma S36:

Its proof is 27105 steps long:

The distribution of expression sizes follows a roughly exponential distribution, with a maximum of 20107:

Plotting the expression sizes on a log scale one gets:

And what stands out most here is a kind of recursive structure—which is the result of long sequences that basically represent the analog of “subroutine calls” that go back and repeatedly prove lemmas that are needed.

OK, so what about the whole proof of a · b = b · a? Yes, it can be unrolled—in terms of 83,314 applications of the original axiom. The sequence of expression sizes is:

Or on a log scale:

The distribution of expression sizes now shows clear deviation from being exponential:

The maximum is 63245, which occurs just 81 steps after the exact midpoint of the proof. In other words, in the middle, the proof has wandered incredibly far out in metamathematical space (there are altogether CatalanNumber[63245] ≈ 10³⁸⁰⁷⁰ possible expressions of the size it reaches).

The proof returns to small expressions just a few times; here are all the cases in which the size is below 10:

So, yes, it is possible to completely unroll the proof into a sequence of applications of the original axiom. But if one does this, it inevitably involves repeating lots of work. Being able to use intermediate lemmas in effect lets one “share common subparts” in the proof. So that one ends up with just 104 “rule applications”, rather than 83314. Not that it’s easy to understand those 104 steps…

Is There a Better Notation?

Looking at our proof—either in its original “lemma” form, or in its “unrolled” form—the most striking aspect of it is how complicated (and incomprehensible) it seems to be. But one might wonder whether much of that complexity is just the result of not “using the right notation”. In the end, we’ve got a huge number of expressions written in terms of · operations that we can interpret as Nand (or Nor). And maybe it’s a little like seeing the operation of a microprocessor down at the level of individual gates implementing Nands or Nors. And might there perhaps be an analog of a higher-level representation—with higher-level operations (even like arithmetic) that are more accessible to us humans?

It perhaps doesn’t help that Nand itself is a rather non-human construct. For example, not a single natural human language seems to have a word for Nand. But there are combinations of Nands that have more familiar interpretations:

But what combinations actually occur in our proof? Here are the most common subexpressions that appear in lemmas in the proof:

And, yes, we could give the most common of these special names. But it wouldn’t really help in “compressing” the proof—or making it easier to understand.

What about “upgrading” our “laws of inference”, i.e. the way that we can derive new lemmas from old? Perhaps instead of substitution and bisubstitution, which both take two lemmas and produce one more, we could set up more elaborate “tactics” that for example take in more input lemmas. We’ve seen that if we completely unroll the proof, it gets much longer. So perhaps there is a “higher-order” setup that for example dramatically shortens the proof.

One way one might identify this is by seeing commonly repeating structures in the subgraphs that lead to lemmas. But in fact these subgraphs are quite diverse:

What Are the Popular Lemmas?

A typical feature of human-written mathematical proofs is that they’re “anchored” by famous theorems or lemmas. They may have fiddly technical pieces. But usually there’s a backbone of “theorems people know”.

We have the impression that the proof we’re discussing here “spends most of its time wandering around the wilds of metamathematical space”. But perhaps it visits waypoints that are somehow recognizable, or at least should be. Or in other words, perhaps out in the metamathematical space of lemmas there are ones that are somehow sufficiently popular that they’re worth giving names to, and learning—and can then be used as “reference points” in terms of which our proof becomes simpler and more human accessible.

It’s a story very much like what happens with human language. There are things out there in the world, but when there’s a certain category of them that are somehow common or important enough, we make a word for them in our language, which we can then use to “compactly” refer to them. (It’s again the same story when it comes to computational language, and in particular the Wolfram Language, except that in that case it’s been my personal responsibility to come up with the appropriate definitions and names for functions to represent “common lumps of computation”.)

But, OK, so what are the “popular lemmas” of Nand proofs? One way to explore this is to enumerate statements that are “true about Nand”—then to look at proofs of these statements (say found with FindEquationalProof from our axiom) and see what lemmas show up frequently in them.

Enumerating statements “true about Nand”, starting from the smallest, we get

where we have highlighted statements from this list that appear as lemmas in our proof.

Proving each of these statements from our original axiom, here are the lengths of proofs we find (for all 1341 distinct theorems with up to LeafCount 4 on each side):

A histogram shows that it’s basically a bimodal distribution

with the smallest “long-proof” theorem being:

In aggregate, all these proofs use about 200,000 lemmas. But only about 1200 of these are distinct. And we can plot which lemmas are used in which proofs—and we see that there are indeed many lemmas that are used across wide ranges of proofs, while there are a few others that are “special” to each proof (the diagonal stripe is associated with lemmas close to the statement being proved):

If we rank all distinct lemmas from most frequently to least frequently used, we get the following distribution of lemma usage frequencies across all our proofs:

It turns out that there is a “common core” of 49 lemmas that are used in every single one of the proofs. So what are these lemmas? Here’s a plot of the usage frequency of lemmas against their size—with the “common ones” populating the top line:

And at first this might seem surprising. We might have expected that short lemmas would be the most frequent, but instead we’re seeing long lemmas that always appear, the very longest being:

So why is this? Basically it’s that these long lemmas are being used at the beginning of every proof. They’re the result of applying bisubstitution to the original axiom, and in some sense they seem to be laying down a kind of net in metamathematical space that then allows more diverse—and smaller—lemmas to be derived.

But how are these “common core” popular lemmas distributed within proofs? Here are a few examples:

And what we see is that while, yes, the common core lemmas are always at the beginning, they don’t seem to have a uniform way of “plugging into” the rest of the proof. And it doesn’t, for example, seem as if there’s just some small set of (perhaps simple) “waypoint” lemmas that one can introduce that will typically shorten these proofs.

If one effectively allows all the common core lemmas to be used as axioms, then inevitably proofs will be shortened; for example, the proof of a · b = b · a—which only ends up using 5 of the common core lemmas—is now shortened to 51 lemmas:

It doesn’t seem to become easier to understand, though. And if it’s unrolled, it’s still 5013 steps.

Still, one can ask what happens if one just introduces particular “recognizable” lemmas as additional axioms. For example, if we include “commutativity” a · b = b · a then we find that, yes, we do manage to reduce the lengths of some proofs, but certainly not all:

Are there any other “pivotal” lemmas we could add? In particular, what about lemmas that can help with the length-200 or more proofs? It turns out that all of these proofs involve the lemma:

So what happens if we add this? Well, it definitely reduces proof lengths:

And sometimes it even seems like it brings proofs into “human range”. For example, a proof of

from our original axiom has length 56. Adding in commutativity reduces it to length 18. And adding our third lemma reduces it to just length 9—and makes it not even depend directly on the original axiom:

But despite the apparent simplicity here, the steps involved—particularly when bisubstitution is used—are remarkably hard to follow. (Note the use of a = a as a kind of “implicit axiom”—something that has actually also appeared, without comment, in many of our other proofs.)

Can We Get a Shorter Proof?

The proof that we’ve been studying can be seen in some ways as a rather arbitrary artifact. It’s the output of FindEquationalProof, with all its specific detailed internal algorithms and choices. In the Appendix, we’ll see that other automated theorem proving systems give very similar results. But we still might wonder whether actually the complexity of the proof as we’ve been studying it is just a consequence of the details of our automated theorem proving—and that in fact there’s a much shorter (and perhaps easier to understand) proof that exists.

One approach we could take—reminiscent of higher category theory—is to think about just simplifying the proof we have, effectively using proof-to-proof transformations. And, yes, this is technically difficult, though it doesn’t seem impossible. But what if there are “holes” in proof space? Then a “continuous deformation” of one proof into another will get stuck, and even if there is a much shorter proof, we’re liable to get “topologically stuck” before we find it.

One way to be sure we’re getting the shortest proof of a particular lemma is to explicitly find the first place that lemma appears in the (future) entailment cone of our original axiom. For example, as we saw above, a single substitution event leads to the entailment cone:

Every lemma produced here is, by construction, in principle derivable by a proof involving a single substitution event. But if we actually use FindEquationalProof to prove these lemmas, the proofs we get most involve 2 events (and in one case 4):

If we take another step in the entailment cone, we get a total of 5062 lemmas. From the way we generated them, we know that all these lemmas can in principle be reached by proofs of length 2. But if we run FindEquationalProof on them, we find a distribution of proof lengths:

And, yes, there is one lemma (with LeafCount 183) that is found only by a proof of length 15. But most often the proof length is 4—or about double what it could be.

If we generate the entailment cone for lemmas using bisubstitution rather than just ordinary substitution, there are slightly more cases where FindEquationalProof does worse at getting minimal proofs.

For example, the lemma

and 3 others can be generated by a single bisubstitution from the original axiom, but FindEquationalProof gives only proofs of length 4 for all of these.

What about unrolled proofs, in which one can generate an entailment cone by starting from a particular expression, and then applying the original axiom in all possible ways? For example, let’s say we start with:

Then applying bisubstitution with the original axiom once in all possible ways gives:

Applying bisubstitution a second time gives a larger entailment cone:

But now it turns out that—as indicated—one of the expressions in this cone is:

So this shows that the lemma

can in principle be reached with just two steps of “unrolled” proof:

And in this particular case, if we use FindEquationalProof and then unroll the resulting proof we also get a proof of length 3—but it goes through a different intermediate expression:

As it happens, this intermediate expression is also reached in the entailment cone that we get by starting from our “output” expression and then applying two bisubsitutions:

What Actually Is the “·”? Models and the Proof

We can think of logic (or Boolean algebra) as being associated with a certain collection of theorems. And what our axiom does is to provide something from which all theorems of logic (and nothing but theorems of logic) can be derived. At some level, we can think of it as just being about symbolic expressions. But in our effort to understand what’s going on—say with our proof—it’s sometimes useful to ask how we can “concretely” interpret these expressions.

For example, we might ask what the · operator actually is. And what kinds of things can our symbolic variables be? In effect we’re asking for what in model theory are called “models” of our axiom system. And in aligning with logic the most obvious model to discuss is one in which variables can be True or False, and the · represents either the logical operator Nand or the logical operator Nor.

The truth table, say for Nand, is:

And as expected, with this model for ·, we can confirm that our original axiom holds:

In general, though, our original axiom allows two size-2 models (that we can interpret as Nand and Nor):

It allows no size-3 models, and in fact in general allows only models of size 2ⁿ; for example, for size 4 its models are:

So what about a · b = b · a? What models does it allow? For size 2, it’s all 8 possible models with symmetric “multiplication tables”:

But the crucial point is that the 2 models for our original axiom system are part of these. In other words, at least for size-2 models, satisfying the original axiom system implies satisfying a · b = b · a.

And indeed any lemma derived from our axiom system must allow the models associated with our original axiom system. But it may also allow more—and sometimes many more. So here’s a map of our proof, showing how many models (out of 16 possible) each lemma allows:

Here are the results for size-3 models:

And, once again, these look complicated. We can think of models as defining—in some sense—what lemmas are “about”. So, for example, our original axiom is “about” Nand and Nor. The lemma a · b = b · a is “about” symmetric functions. And so on. And we might have hoped that we could gain some understanding of our proof by looking at how different lemmas that occur in it “sculpt” what is being talked about. But in fact we just seem to end up with complicated descriptions of sets that don’t seem to have any obvious relationship with each other.

What about a Higher-Level Abstraction?

If there’s one thing that stands out about our proof—and the analysis we’ve given of it here—it’s how fiddly and “in the weeds” it seems to be. But is that because we’re missing some big picture? Is there actually a more abstract way of discussing things, that gets to our result without having to go through all the details?

In the history of mathematics many of the most important themes have been precisely about finding such higher-level abstractions. We could start from the explicit symbolic axioms

or even

and start building up theorems much as we’ve done here. Or we could recognize that these are axioms for group theory, and then start using the abstract ideas of group theory to derive our theorems.

So is there some higher-level version of what we’re discussing here? Remember that the issue is not about the overall structure of Boolean algebra; rather it’s about the more metamathematical question of how one can prove that all of Boolean algebra can be generated from the axiom:

In the last few sections we’ve tried a few semi-empirical approaches to finding higher-level representations. But they haven’t gotten very far. And to get further we’re probably going to need a serious new idea.

And, if history is a guide, we’re going to need to come up with an abstraction that somehow “goes outside of the system” before “coming back”. It’s like trying to figure out the real roots of a cubic equation, and realizing that the best way to do this is to introduce complex numbers, even though the imaginary parts will cancel at the end.

In the direct exploration of our proof, it feels as if the intermediate lemmas we generate “wander off into the wilds of metamathematical space” before coming back to establish our final result. And if we were using a higher-level abstraction, we’d instead be “wandering off” into the space of that abstraction. But what we might hope is that—at least with the concepts we would use in discussing that abstraction—the path that would be involved would be “short enough to be accessible to human understanding”.

Will we be able to find such an abstraction? It’s a subtle question. Because in effect it asks whether we can reduce the computational effort needed for the proof—or, in other words, whether we can find a pocket of computational reducibility in what in general will be a computationally irreducible process. But it’s not a question that can really be answered just for our specific proof on it own. After all, our “abstraction” could in principle just involve introducing a primitive that represents our whole proof or a large part of it. But to make it what we can think of as a real abstraction we need something that spans many different specific examples—and, in our case, likely many axiomatic systems or symbolic proofs.

So is such an abstraction possible? In the history of mathematics the experience has been that after enough time (often measured in centuries) has passed, abstractions tend to be found. But at some level this has been self fulfilling. Because the areas that are considered to have remained “interesting for mathematics” tend to be just those where general abstractions have in fact been found.

In ruliology, though, the typical experience has been different. Because there it’s been routine to sample the computational universe of possible simple programs and encounter computational irreducibility. In the end it’s still inevitable that among the computational irreducibility there must be pockets of computational reducibility. But the issue is that these pockets of computational reducibility may not involve features of our system that we care about.

So is a proof of the kind we’re discussing here more like ruliology, or more like “typical mathematics”? Insofar as it’s a mathematical-style proof of a mathematical statement it feels more like typical mathematics. But insofar as it’s something found by the computational process of automated theorem proving it perhaps seems more ruliology.

But what might a higher-level abstraction for it look like? Figuring that out is probably tantamount to finding the abstraction. But perhaps one can at least expect that in some ways it will be metamathematical, and more about the structure and character of proofs than about their content. Perhaps it will be something related to the framework of higher category theory, or some form of meta-algebra. But as of now, we really don’t know—and we can’t even say that such an abstraction with any degree of generality is possible.

LLMs to the Rescue?

The unexpected success of LLMs in language generation and related tasks has led to the idea that perhaps eventually systems like LLMs will be able to “do everything”—including for example math. We already know—not least thanks to Wolfram Language—that lots of math can be done computationally. But often the computations are hard—and, as in the example of the proof we’re discussing here, incomprehensible to humans. So the question really is: can LLMs “humanize” what has to be done in math, turning everything into a human-accessible narrative? And here our proof seems like an excellent—if challenging—test case.

But what happens if we just ask a current LLM to generate the proof from scratch? It’s not a good picture. Very often the LLM will eagerly generate a proof, but it’ll be completely wrong, often with the same kind of mistakes that a student somewhat out of their depth might make. Here’s a typical response where an LLM simply assumes that the · operator is associative (which it isn’t in Boolean algebra) then produces a proof that on first blush looks at least vaguely plausible, but is in fact completely wrong:

Coming up with an explanation for what went wrong is basically an exercise in “LLM psychology”. But in a first approximation one might say the following. LLMs are trained to “fill in what’s typical”, where “typical” is defined by what appears in the training set. But (absent some recent Wolfram Language and Wolfram|Alpha based technology of ours) what’s been available as a training set has been human-generated mathematical texts, where, yes, operators are often associative, and typical proofs are fairly short. And in the “psychology of LLMs” an LLM is much more likely to “do what’s typical” than to “rigorously follow the rules”.

If you press the LLM harder, then it might just “abdicate”, and suggest using the Wolfram Language as a tool to generate the proof. So what happens if we do that, then feed the finished proof to the LLM and ask it to explain? Well, typically it just does what LLMs do so well, and writes an essay:

So, yes, it does fine in “generally framing the problem”. But not on the details. And if you press it for details, it’ll typically eventually just start parroting what it was given as input.

How else might we try to get the LLM to help? One thing I’ve certainly wondered is how the lemmas in the proof relate to known theorems—perhaps in quite different areas of mathematics. It’s something one might imagine one would be able to answer by searching the literature of mathematics. But, for example, textual search won’t be sufficient: it has to be some form of semantic search based on the meaning or symbolic structure of lemmas, not their (fairly arbitrary) textual presentation. A vector database might be all one needs, but one can certainly ask an LLM too:

It’s not extremely helpful, though, charmingly, it correctly identifies the source of our original axiom. I’ve tried similar queries for our whole set of lemmas across a variety of LLMs, with a variety of RAG systems. Often the LLM will talk about an interpretation for some lemma—but the lemma isn’t actual present in our proof. But occasionally the LLM will mention possible connections (“band theory”; “left self-distributive operations in quandles”; “Moufang loops”)—though so far none have seemed to quite hit the mark.

And perhaps this failure is itself actually a result—telling us that the lemmas that show up in our proof really are, in effect, out in the wilds of metamathematical space, probing places that haven’t ever been seriously visited before by human mathematics.

But beyond LLMs, what about more general machine learning and neural net approaches? Could we imagine using a neural net as a probe to find “exploitable regularities” in our proof? It’s certainly possible, but I suspect that the systematic algorithmic methods we’ve already discussed for finding optimal notations, popular lemmas, etc. will tend to do better. I suppose it would be one thing if our systematic methods had failed to even find a proof. Then we might have wanted something like neural nets to try to guess the right paths to follow, etc. But as it is, our systematic methods rather efficiently do manage to successfully find a proof.

Of course, there’s still the issue that we’re discussing here that the proof is very “non-human”. And perhaps we could imagine that neural nets, etc.—especially when trained on existing human knowledge—could be used to “form concepts” that would help us humans to understand the proof.

We can get at least a rough analogy for how this might work by looking at visual images produced by a generative AI system trained from billions of human-selected images. There’s a concept (like “a cube”) that exists somewhere in the feature space of possible images. But “around” that concept are other things—“out in interconcept space”—that we don’t (at least yet) explicitly have words for:

And it’ll presumably be similar for math, though harder to represent in something like a visual way. There’ll be existing math concepts. But these will be embedded in a vast domain of “mathematical interconcept space” that we humans haven’t yet “colonized”. And what we can imagine is that—perhaps with the help of neural nets, etc.—we can identify a limited number of “points in interconcept space” that we can introduce as new concepts that will, for example, provide useful “waypoints” in understanding our proof.

But Why Is the Theorem True?

It’s a common human urge to think that anything that’s true must be true for a reason. But what about our theorem? Why is it true? Well, we’ve seen a proof. But somehow that doesn’t seem satisfactory. We want “an explanation we can understand”. But we know that in general we can’t always expect to get one.

It’s a fundamental implication of computational irreducibility that things can happen where the only way to “see how they happen” is just to “watch them happen”; there’s no way to “compress the explanation”.

Consider the following patterns. They’re all generated by cellular automata. And all live exactly 100 steps before dying out. But why?

In a few cases it seems like we can perhaps at least begin to imagine “narratively describing” a mechanism. But most of the time all we can say is basically that they “live 100 steps because they do”.

It’s a quintessential consequence of computational irreducibility. It might not be what we’d expect, or hope for. But it’s reality in the computational universe. And it seems very likely that our theorem—and its proof—is like this too. The theorem in effect “just happens to be true”—and if you run the steps in the proof (or find the appropriate path in the entailment cone) you’ll find that it is. But there’s no “narrative explanation”. No “understanding of why it’s true”.

Intuition and Automated Theorem Proving

We’ve been talking a lot about the proof of our theorem. But where did the theorem to prove come from in the first place? Its immediate origin was an exhaustive search I did of simple axiom systems, filtering for ones that could conceivably generate Boolean algebra, followed by testing each of the candidates using automated theorem proving.

But how did I even get the idea of searching for a simple axiom system for Boolean algebra? Based on the axiom systems for Boolean algebra known before—and the historical difficulty of finding them—one might have concluded that it was quite hopeless to find an axiom system for Boolean algebra by exhaustive search. But by 2000 I had nearly two decades of experience in exploring the computational universe—and I was well used to the remarkable phenomenon that even very simple computational rules can lead to behavior of great complexity. So the result was that when I came to think about axiom systems and the foundations of mathematics my intuition led me to imagine that perhaps the simplest axiom system for something like Boolean algebra might be simple enough to exhaustively search for.

And indeed discovering the axiom system we’ve discussed here helped further expand and deepen my intuition about the consequences of simple rules. But what about the proof? What intuition might one get from the proof as we now know it, and as we’ve discussed here?

There’s much intuition to be got from observing the world as it is. But for nearly half a century I’ve had another crucial source of intuition: observing the computational universe—and doing computational experiments. I was recently reflecting on how I came to start developing intuition in this way. And what it might mean for intuition I could now develop from things like automated theorem proving and AI.

Back in the mid-1970s my efforts in particle physics led me to start using computers to do not just numerical, but also algebraic computations. In numerical computations it was usual to just get a few numbers out, that perhaps one could plot to make a curve. But in algebraic computations one instead got out formulas—and often very ornate ones full of structure and detail. And for me it was routine to get not just one formula, but many. And looking at these formulas I started to develop intuition about them. What functions would they involve? What algebraic form would they take? What kind of numbers would they involve?

I don’t think I ever consciously realized that I was developing a new kind of computationally based intuition. But I soon began to take it for granted. And when—at the beginning of the 1980s—I started to explore the consequences of simple abstract systems like cellular automata it was natural to expect that I would get intuition from just “seeing” how they behaved. And here there was also another important element. Because part of the reason I concentrated on cellular automata was precisely because one could readily visualize their behavior on a computer.

I don’t think I would have learned much if I’d just been printing out “numerical summaries” of what cellular automata do. But as it was, I was seeing their behavior in full detail. And—surprising though what I saw was—I was soon able to start getting an intuition for what could happen. It wasn’t a matter of knowing what the value of every cell would be. But I started doing things like identifying four general classes of cellular automata, and then recognizing the phenomenon of computational irreducibility.

By the 1990s I was much more broadly exploring the computational universe—always trying to see what could happen there. And in almost all cases it was a story of defining simple rules, then running them, and making an explicit step-by-step visualization of what they do—and thereby in effect “seeing computation in action”.

In recent years—spurred by our Physics Project—I’ve increasingly explored not just computational processes, but also multicomputational ones. And although it’s more difficult I’ve made every effort to visualize the behavior of multiway systems—and to get intuition about what they do.

But what about automated theorem proving? In effect, automated theorem proving is about finding a particular path in a multiway system that leads to a theorem we want. We’re not getting to see “complete behavior”; we’re in effect just seeing one particular “solution” for how to prove a theorem.

And after one’s seen many examples, the challenge once again is to develop intuition. And that’s a large part of what I’ve been trying to do here. It’s crucial, I think, to have some way to visualize what’s happening—in effect because visual input is the most efficient way to get information into our brains. And while the visualizations we’ve developed here aren’t as direct and complete as, say, for cellular automaton evolution, I think they begin to give some overall sense of our proof—and other proofs like it.

In studying simple programs like cellular automata, the intuition I developed led me to things like my classification of cellular automaton behavior, as well as to bigger ideas like the Principle of Computational Equivalence and computational irreducibility. So having now exposed myself to automated theorem proving as I exposed myself to algebraic computation and the running of simple rules in the past, what general principles might I begin to see? And might they, for example, somehow make the fact that our proof works ultimately seem “obvious”?

In some ways yes, but in other ways no. Much as with simple programs, there are axiom systems so simple that, for example, the multiway systems they generate are highly regular. But beyond a low threshold, it’s common to get very complicated—and in many ways seemingly random—multiway system structures. Typically an infinite number of lemmas are generated, with little or no obvious regularity in their forms.

And one can expect that—following the ideas of universal computation—it’ll typically be possible to encode in any one such multiway system the behavior of any other multiway system. In terms of axioms what one’s saying is that if one sets up the right translation between theorems, one will be able to use any one such axiom system to generate the theorems of any other. But the issue is that the translation will often make major changes to the structure of the theorems, and in effect define not just a “mathematical translation” (like between geometry and algebra) but a metamathematical one (as one would need to get from Peano arithmetic to set theory).

And what this means is that it isn’t surprising that even a very simple axiom system can generate a complicated set of possible lemmas. But knowing this doesn’t immediately tell one whether those lemmas will align with some particular existing theory—like Boolean algebra. And in a sense that’s a much more detailed question.

At some metamathematical level it might not be a natural question. But at a “mathematical level” it is. And it’s what we have to address in connection with the theorem—and proof—we’re discussing here. Many aspects of the overall form and properties of the proof will be quite generic, and won’t depend on the particulars of the axiom system we’re using. But some will. And quite what intuition we may be able to get about these isn’t clear. And perhaps it’ll necessarily be fragmented and specific—in effect responding to the presence of computational irreducibility.

It’s perhaps worth commenting that LLMs—and machine learning in general—represent another potential source of intuition. That intuition may well be more about the general features of us as observers and thinkers. But such intuition is potentially critical in framing just what we can experience, not only in the natural world, but also in the mathematical and metamathematical worlds. And perhaps the apparent impotence of LLMs when faced with the proof we’ve been discussing already tells us something significant about the nature of “mathematical observers” like us.

So What Does It Mean for the Future of Mathematics?

Let’s say we never manage to “humanize” the proof we’ve been discussing here. Then in effect we’ll end up with a “black-box theorem”—that we can be sure is true—but we’ll never know quite how or why. So what would that mean for mathematics?

Traditionally, mathematics has tended to operate in a “white box” kind of way, trying to build narrative and understanding along with “facts”. And in this respect it’s very different from natural science. Because in natural science much of our knowledge has traditionally been empirical—derived from observing the world or experimenting on it—and without any certainty that we can “understand its origins”.

Automated theorem proving of the kind we’re discussing here—or, for that matter, pretty much any exploratory computational experimentation—aligns mathematics much more with natural science, deriving what’s true without an expectation of having a narrative explanation of why.

Could one imagine practicing mathematics that way? One’s already to some extent following such a path as soon as one introduces axiom systems to base one’s mathematics on. Where do the axiom systems come from? In the time of Euclid perhaps they were thought of as an idealization of nature. But in more modern times they are realistically much more the result of human choice and human aesthetics.

So let’s say we determine (given a particular axiom system) that some black-box theorem is true. Well, then we can just add it, just as we could another axiom. Maybe one day it’ll be possible to prove P≠NP or the Riemann Hypothesis from existing axioms of mathematics (if they don’t in fact turn out to be independent). And—black box or not—we can expect to add them to what we assume in subsequent mathematics we do, much as they’re routinely added right now, even though their status isn’t yet known.

But it’s one thing to add one or two “black-box theorems”. But what happens when black-box theorems—that we can think of as “experimentally determined”—start to dominate the landscape of mathematics?

Well, then mathematics will take on much more of the character of ruliology—or of an experimental science. When it comes to the applications of mathematics, this probably won’t make much difference, except that in effect mathematics will be able to become much more powerful. But the “inner experience” of mathematics will be quite different—and much less “human”.

If one indeed starts from axioms, it’s not at the outset obvious why everything in mathematics should not be mired in the kind of alien-seeming metamathematical complexity that we’ve encountered in the discussion of our proof here. But what I’ve argued elsewhere is that the fact that in our experience of doing mathematics it’s not is a reflection of how “mathematical observers like us” sample the raw metamathematical structure generated by axioms (or ultimately by the subaxiomatic structure of the ruliad).

The physics analogy I’ve used is that we succeed in doing mathematics at a “fluid dynamics level”, far above the detailed “molecular dynamics level” of things like the proof we’ve discussed here. Yes, we can ask questions—like ones about the structure of our proof—that probe the axiomatic “molecular dynamics level”. But it’s an important fact that in doing what we normally think of as mathematics we almost never have to; there’s a coherent way to operate purely at the “fluid dynamics level”.

Is it useful to “dip down” to the molecular dynamics? Definitely yes, because that’s where we can readily do computations—like those in our proof, or in general those going on in the internals of the Wolfram Language. But a key idea in the design of the Wolfram Language is to provide a computational language that can express concepts at a humanized “fluid dynamics” level—in effect bridging between the way humans can think and understand things, and the way raw computation can be done with them.

And it’s notable that while we’ve had great success over the years in defining “human-accessible” high-level representations for what amount to the “inputs” and “outputs” of computations, that’s been much less true of the “ongoing processes” of computation—or, for example, of the innards of proofs.

Is there a good “human-level” way to represent proofs? If the proofs are short, it’s not too difficult (and the step-by-step solutions technology of Wolfram|Alpha provides a good large-scale example of what can be done). But—as we’ve discussed—computational irreducibility implies that some proofs will inevitably be long.

If they’re not too long, then at least some parts of them might be constructed by human effort, say in a system like a proof assistant. But as soon as there’s much automation (whether with automated theorem proving or with LLMs) it’s basically inevitable that one will end up with things that at least approach what we’ve seen with the proof we’re discussing here.

What can then be done? Well, that’s the challenge. Maybe there is some way to simplify, abstract or otherwise “humanize” the proof we’ve been discussing. But I rather doubt it. I think this is likely one of those cases where we inevitably find ourselves face to face with computational irreducibility.

And, yes, there’s important science (particularly ruliology) to do on the structures we see. But it’s not mathematics as it’s traditionally been practiced. But that’s not to say that the results that come out of things like our proof won’t be useful for mathematics. They will be. But they make mathematics more like an experimental science—where what matters most is in effect the input and output rather than a “publishable” or human-readable derivation in between. And where the key issue in making progress is less in the innards of derivations than in defining clear computational ways to express input and output. Or, in effect, in capturing “human-level mathematics” in the primitives and structure of computational language.

Appendix: What about a Different Theorem Proving System?

The proof we’ve been discussing here was created using FindEquationalProof in the Wolfram Language. But what if we were to use a different automated theorem proving system? How different would the results be? In the spectrum of things that automated theorem proving systems do, our proof here is on the difficult end. And many existing automated theorem proving systems don’t manage to do it all. But some of the stronger ones do. And in the end—despite their different internal algorithms and heuristics—it’s remarkable how similar the results they give are to those from the Wolfram Language FindEquationalProof (differences in the way lemmas vs. inference steps, etc. are identified make detailed quantitative comparisons difficult):

Thanks

Thanks to Nik Murzin of the Wolfram Institute for his extensive help as part of the Wolfram Institute Empirical Metamathematics Project. Also Roger Germundsson, Sergio Sandoval, Adam Strzebonski, Michael Trott, Liubov Tupikina, James Wiles and Carlos Zapata for input. Thanks to Arnim Buch and Thomas Hillenbrand for their work in the 1990s on Waldmeister which is now part of FindEquationalProof (also to Jonathan Gorard for his 2017 work on the interface for FindEquationalProof). I was first seriously introduced to automated theorem proving in the late 1980s by Dana Scott, and have interacted with many people about it over the years, including Richard Assar, Bruno Buchberger, David Hillman, Norm Megill, Todd Rowland and Matthew Szudzik. (I’ve also interacted with many people about proof assistant, proof presentation and proof verification systems, both recently and in the past.)

Useful to the Point of Being Revolutionary: Introducing Wolfram Notebook Assistant

Stephen Wolfram — Mon, 09 Dec 2024 18:38:15 +0000

Note: As of today, copies of Wolfram Version 14.1 are being auto-updated to allow subscription access to the capabilities described here. [For additional installation information see here.]

Just Say What You Want! Turning Words into Computation

Nearly a year and a half ago—just a few months after ChatGPT burst on the scene—we introduced the first version of our Chat Notebook technology to integrate LLM-based chat into Wolfram Notebooks. For the past year and a half we’ve been building on those foundations. And today I’m excited to be able to announce that we’re releasing the fruits of those efforts: the first version of our Wolfram Notebook Assistant.

There are all sorts of gimmicky AI assistants out there. But Notebook Assistant isn’t one of them. It’s a serious, deep piece of new technology, and what’s more important, it’s really, really useful! In fact, I think it’s so useful as to be revolutionary. Personally, I thought I was a pretty efficient user of Wolfram Language—but Notebook Assistant has immediately made me not only significantly more efficient, but also more ambitious in what I try to do. I hadn’t imagined just how useful Notebook Assistant was going to be. But seeing it now I can say for sure that it’s going to raise the bar for what everyone can do. And perhaps most important of all, it’s going to open up computational language and computational thinking to a vast range of new people, who in the past assumed that those things just weren’t accessible to them.

Leveraging the decades of work we’ve done on the design and implementation of the Wolfram Language (and Wolfram|Alpha), Notebook Assistant lets people just say in their own words what they want to do; then it does its best to crispen it up and give a computational implementation. Sometimes it goes all the way and just delivers the answer. But even when there’s no immediate “answer” it does remarkably well at building up structures where things can be represented computationally and tackled concretely. People really don’t need to know anything about computational language—or computational thinking to get started; Notebook Assistant will take their ideas, rough as they may be, and frame them in computational language terms.

I’ve long seen Wolfram Language as uniquely providing the infrastructure and “notation” to enable “computational X” for all fields X. I’m excited to say that I think Notebook Assistant now bridges “the last mile” to let anyone—at almost any level—access the power of computational language, and “do computational X”. In its original conception, Wolfram Notebook Assistant was just intended to be “useful”. But it’s emerging as something much more than that; something positively revolutionary.

“I can’t believe it’ll do anything useful with that”, I’ll think. But then I’ll try it. And, very often, something amazing will happen. Something that gets me past some sticking point or over some confusion. Something that gives me an unexpected new building block—or new idea—for what I’m trying to do. And that uses the medium of our computational language to take me beyond where I would ever have reached before.

So how does one use Notebook Assistant? Once you’ve signed up you can just go to the toolbar of any notebook, and open a Notebook Assistant chat window:

Now tell Notebook Assistant what you want to do. The more precise and explicit you are, the better. But you don’t have to have thought things through. Just type what comes into your mind. Imagine you’ve been working in a notebook, and (somehow) you’ve got a picture of some cats. You wonder “How can I find the cats in this picture?” Well, just ask Notebook Assistant!

Notebook Assistant gives some narrative text, and then a piece of Wolfram Language code—which you can just run in your notebook (by pressing ):

It seems a bit like magic. You say something vague, and Notebook Assistant turns it into something precise and computational—which you can then run. It’s not always as straightforward as in this example. But the important thing is that in practice (at least in my rather demanding experience) Notebook Assistant essentially always does spectacularly well at being useful—and at telling me things that move forward what I’m trying to do.

Big or Small, Just Try It!

Imagine that sitting next to you, you had someone very knowledgeable about Wolfram Language and about computational thinking in general. Think what you might ask them. That’s what you should ask Notebook Assistant. And if there’s one thing to communicate here, it’s “Just try it!” You might think what you’re thinking about is too vague, or too specific, or too technical. But just try asking Notebook Assistant. In my experience, you’ll be amazed at what it’s able to do, and how helpful it’s able to be.

Maybe you’re an experienced Wolfram Language user who “knows there must be a way to do something”, but can’t quite remember how. Just ask Notebook Assistant. And not only will it typically be able to find the function (or whatever) you need; it’ll also usually be able to create a code fragment that does the very specific thing you asked about. And, by the way, it’ll save you lots of typing (and debugging) by filling in those fiddly options and so on just how you need them. And even if it doesn’t quite nail it, it’ll have given a skeleton of what you need, that you can then readily edit. (And, yes, the fact that it’s realistic to edit it relies on the fact that Wolfram Language represents it in a way that humans can readily read as well as write.)

What if you’re a novice, who’s never used Wolfram Language before, and never really been exposed to computational thinking, or for that matter, “techie stuff” at all? Well, the remarkable thing is that Notebook Assistant will still be able to help you—a lot. You can ask it something very vague, that doesn’t even seem particularly computational. It does remarkably well at “computationalizing things”. Taking what you’ve said, and finding a way to address it computationally—and to lead you into the kind of computational thinking that’ll be needed for the particular thing you’re trying to do.

In what follows, we’ll see a whole range of different ways to use Notebook Assistant. In fact, even as I’ve been writing this, I’ve discovered quite a few new ways to use it that I’d never thought of before.

There are some general themes, though. The most important is the way Notebook Assistant pivotally relies on the Wolfram Language. In a sense, the main mission of Notebook Assistant is to make things computational. And the whole reason it can so successfully do that is that it has the Wolfram Language as its target. It’s leveraging the unique nature of the Wolfram Language as a full-scale computational language, able to coherently represent abstract and real-world things in a computational way.

One might think that the Wolfram Language would in the end be mainly an “implementation layer”—serving to make what Notebook Assistant produces runnable. But in reality it’s very, very much more than that. In particular, it’s basically the medium—the language—in which computational ideas are communicated. When Notebook Assistant generates Wolfram Language, it’s not just something for the computer to run; it’s also something for humans to read. Yes, Notebook Assistant can produce text, and that’s useful, especially for contextualizing things. But the most concentrated and poignant communication comes in the Wolfram Language it produces. Want the TL;DR? Just look at the Wolfram Language code!

Part of how Wolfram Language code manages to communicate so much so efficiently is that it’s precise. You can just mention the name of a function, and you know precisely what it does. You don’t have to “scaffold” it with text to make its meaning clear.

But there’s something else as well. With its symbolic character—and with all the coverage and consistency that we’ve spent so much effort on over the decades—the Wolfram Language is uniquely able to “communicate in fragments”. Any fragment of Wolfram Language code can be run, and more important, it can smoothly fit into a larger structure. And that means that even small fragments of code that Notebook Assistant generates can be used as building blocks.

It produces Wolfram Language code. You read the code (and it’s critical that it’s set up to be read). You figure out if it’s what you want. (And if it’s not, you edit it, or ask Notebook Assistant to do that.) Then you can use that code as a robust building block in whatever structure—large or small—that you might be building.

In practice, a critical feature is that you don’t have to foresee how Notebook Assistant is going to respond to what you asked. It might nail the whole thing. Or it might just take steps in the right direction. But then you just look at what it produced, and decide what to do next. Maybe in the end you’ll have to “break the problem down” to get Notebook Assistant to deal with it. But there’s no need to do that in advance—and Notebook Assistant will often surprise you by how far it’s able to get on its own.

You might imagine that Notebook Assistant would usually need you to break down what you’re asking into “pure computational questions”. But in effect it has good enough “general knowledge” that it doesn’t. And in fact it will usually do better the more context you give it about why you’re asking it to do something. (Is it for chemical engineering, or for sports analytics, or what?)

But how ambitious can what you ask Notebook Assistant be? What if you ask it something “too big”? Yes, it won’t be able to solve that 100-year-old problem or build a giant software system in its immediate output. But it does remarkably well at identifying pieces that it can say something about, and that can help you understand how to get started. So, as with many things about Notebook Assistant, you shouldn’t assume that it won’t be helpful; just try it and see what happens! And, yes, the more you use Notebook Assistant, the more you’ll learn just what kind of thing it does best, and how to get the most out of it.

So how should you ultimately think about Notebook Assistant? Mainly you should think of it like an very knowledgeable and hardworking expert. But at a more mundane level it can serve as a super-enhanced documentation lookup system or code completion system. It can also take something vague you might ask it, and somehow undauntedly find the “closest formalizable construct”—that it can then compute with.

An important feature is that it is—in human terms—almost infinitely patient and hardworking. Where a human might think: “it’s too much trouble to write out all those details”, Notebook Assistant just goes ahead and does it. And, yes, it saves you huge amounts of typing. But, more important, it makes it “cheap” to do things more perfectly and more completely. So that means you actually end up labeling those plot axes, or adding a comment to your code, or coming up with meaningful names for your variables.

One of the overarching points about Notebook Assistant is that it lowers the barrier to getting help. You don’t have to think carefully about formulating your question. You don’t have to go clicking through lots of links. And you don’t have to worry that it’s too trivial to waste a coworker’s time on the question. You can just ask Notebook Assistant. Oh, and it’ll give you a response immediately. (And you can go back and forth with it, and ask it to clarify and refine things.)

“How Can I Do That?”

At least for me it’s very common: you have something in your mind that you want to do, but you don’t quite know how to achieve it in the Wolfram Language. Well, now you can just ask Notebook Assistant!

I’ll show various examples here. It’s worth emphasizing that these examples typically won’t look exactly the same if you run them again. Notebook Assistant has a certain amount of “AI-style random creativity”—and it also routinely makes use of what you’ve done earlier in a session, etc. It also has to be said that Notebook Assistant will sometimes make mistakes—or will misunderstand what you’re asking it. But if you don’t like what it did, you can always press the button to generate a new response.

Let’s start off with a basic computational operation:

As an experienced user of Wolfram Language, a simple “do it with FoldList” would already have been enough. But Notebook Assistant goes all the way—generating specific code for exactly what I asked. Courtesy of Wolfram Language, the code is very short and easy to read. But Notebook Assistant does something else for one as well: it produces an example of the code in action—which lets one check that it really does what one wanted. Oh, and then it goes even further, and tells me about a function in the Wolfram Function Repository (that I, for one, had never heard of; wait did I write it?) that directly does the operation I want.

OK, so that was a basic computational operation. Now let’s try something a little more elaborate:

This involves several steps, but Notebook Assistant nails it, giving a nice example. (And, yes, it’s reading the Wolfram Language documentation, so often its examples are based on that.)

But even after giving an A+ result right at the top, Notebook Assistant goes on, talking about various options and extensions. And despite being (I think) quite an expert on what the Wolfram Language can do, I was frankly surprised by what it came up with; I didn’t know about these capabilities!

There’s an incredible amount of functionality built into the Wolfram Language (yes, four decades worth of it). And quite often things you want to do can be done with just a single Wolfram Language function. But which one? One of the great things about Notebook Assistant is that it’s very good at taking “raw thoughts”, sloppily worded, and figuring out what function you need. Like here, bam, “use LineGraph!”

You can ask Notebook Assistant “fairly basic” questions, and it’ll respond with nice, synthesized-on-the-spot “custom documentation”:

You can also ask it about obscure and technical things; it knows about every Wolfram Language function, with all its details and options:

Notebook Assistant is surprisingly good at writing quite minimal code that does sophisticated things:

If you ask it open-ended questions, it’ll often answer with what amount to custom-synthesized computational essays:

Notebook Assistant is pretty good at “pedagogically explaining what you can do”:

In everything we’ve seen so far, the workflow is that you ask Notebook Assistant something, then it generates a result, and then you use it. But everything can be much more interactive, and you can go back and forth with Notebook Assistant—say refining what you want it to do.

Here I had something in mind, but I was quite sloppy in describing it. And although Notebook Assistant came up with a reasonable interpretation of what I asked, it wasn’t really what I had in mind:

So I went back and edited what I asked (right there in the Notebook Assistant window), and tried again:

The result was better, but still not right. But all I had to do was to tell it to make a change, and lo and behold, I got what I was thinking of:

By the way, you can also perfectly well ask about deployment to the web:

And while I might have some minor quibbles (why use a string for the molecule name, not "Chemical"; why not use CloudPublish; etc.) what Notebook Assistant produces works, and provides an excellent scaffold for further development. And, as it often does, Notebook Assistant adds a kind of “by the way, did you know?” at the end, showing how one could use ARPublish to produce output for augmented reality.

Here’s one last example: creating a user interface element. I want to make a slider-like control that goes around (like an analog clock):

Well, actually, I had in mind something more minimal:

Impressive. Even if maybe it got that from some documentation or other example. But what if I wanted to tweak it? Well, actually, Notebook Assistant does seem to understand what it has:

“Can You Just Do That for Me?”

What we’ve seen so far are a few examples of asking Notebook Assistant to tell us how to do things. But you can also just ask Notebook Assistant to do things for you, in effect producing “finished goods”:

Pretty impressive! And it even just went ahead and made the picture. By the way, if I wanted the code packaged up into a single line, I can just ask for that:

Notebook Assistant can generate interactive content too. And—very usefully—you don’t have to give precise specifications up front: Notebook Assistant will automatically pick “sensible defaults” (that, yes, you can trivially edit later, or just tell Notebook Assistant to change it for you):

Here’s an example that requires putting together several different ideas and functions. But Notebook Assistant manages it just fine—and in fact the code it produces is interesting and clarifying to read:

Notebook Assistant knows about every area of Wolfram Language functionality—here synthetic geometry:

And here chemistry:

It also knows about things like the Wolfram Function Repository, here running a function from there that generates a video:

Here’s something that again leverages Notebook Assistant’s encyclopedic knowledge of Wolfram Language capabilities, now pulling in real-time data:

I can’t resist trying a few more examples:

Let’s try something involving more sophisticated math:

(I would have used RegularPolygon[5], and I don’t think DiscretizeRegion is necessary … but what Notebook Assistant did is still very impressive.)

Or here’s some more abstract math:

OK, so Notebook Assistant provides a very powerful way to go from words to computational results. So what then is the role of computational language and of “raw Wolfram Language”? First of all, it’s the Wolfram Language that makes everything we’ve seen here work; it’s what the words are being turned into so that they can be computed from. But there’s something much more than that. The Wolfram Language isn’t just for computers to compute with. It’s also for humans to think with. And it’s an incredibly powerful medium for that thinking. Like a great generalization of mathematical notation from the distant past, it provides a streamlined way to broadly formalize things in computational terms—and to systematically build things up.

Notebook Assistant is great for getting started with things, and for producing a first level of results. But words aren’t ultimately an efficient way say how to build up from there. You need the crisp, formal structure of computational language. In which even the tiny amounts of code you write can be incredibly powerful.

Now that I’ve been using Notebook Assistant for a while I think I can say that on quite a few occasions it’s helped me launch things, it’s helped me figure out details, and it’s helped me debug things that have gone wrong. But the backbone of my computational progress has been me writing Wolfram Language myself (though quite often starting from something Notebook Assistant wrote). Notebook Assistant is an important new part of the “on ramp” to Wolfram Language; but it’s raw Wolfram Language that lets one really zoom forward to build new structures and achieve what’s computationally possible.

“Where Do I Start?”

Computational thinking is an incredibly powerful approach. But sometimes it’s hard to get started with, particularly if you’re not used to it. And although one might not imagine it, Notebook Assistant can be very useful here, essentially helping one brainstorm about what direction to take.

I was explaining this to our head of Sales, and tried:

I really didn’t expect this to do anything terribly useful … and I was frankly amazed at what happened. Pushing my luck I tried:

Obviously this isn’t the end of the story, but it’s a remarkably good beginning—going from a vague request to something that’s set up to be thought about computationally.

Here’s another example. I’m trying to invent a good system for finding books in my library. I just took a picture of a shelf of books behind my desk:

Once again, a very impressive result. Not the final answer, but a surprisingly good start. That points me in the direction of image processing and segmentation. At first, it’s running too slowly, so it downsamples the image. Then it tells me I might need to tweak the parameters. So I just ask it to create a tool to do that:

And then:

It’s very impressive how much Notebook Assistant can help one go “from zero to computation”. And when one gets used to using it, it starts to be quite natural to just try it on all sorts of things one’s thinking about. But if it’s just “quick, tell me something to compute”, it’s usually harder to come up with anything.

And that reminds me of the very first time I ever saw a computer in real life. It was 1969 and I was 9 years old (and the computer was an IBM mainframe). The person who was showing me the computer asked me: “So what do you want to compute?” I really had no idea at that time “what one might compute”. Rather lamely I said “the weight of a dinosaur”. So, 55 years later, let’s try that again:

And let’s try going further:

“Tweak the Details for Me”

Something I find very useful with Notebook Assistant is having it “tweak the details” of something I’ve already generated. For example, let’s say I have a basic plot of a sine curve in a notebook:

Assuming I have that notebook in focus, Notebook Assistant will “see” what’s there. So then I can tell it to modify my sine curve—and what it will do is produce new code with extra details added:

That’s a good result. But as a Wolfram Language aficionado I notice that the code is a bit more complicated than it needs to be. So what can I do about it? Well, I can just ask Notebook Assistant to simplify it:

I can keep going, asking it to further “embellish” the plot:

Let’s push our luck and try going even further:

Oops. Something went wrong. No callouts, and a pink “error” box. I tried regenerating a few times. Often that helps. But this time it didn’t seem to. So I decided to give Notebook Assistant a suggestion:

And now it basically got it. And with a little more back and forth I can expect to get exactly what I want.

In the Wolfram Language, functions (like Plot) are set up to have good automatic defaults. But when you want, for example, to achieve some particular, detailed look, you often have to end up specifying all sorts of additional settings. And Notebook Assistant is very good at doing this, and in effect, patiently typing out all those option settings, etc.

“What Went Wrong? Fix It!”

Let’s say you wrote some Wolfram Language (or perhaps Notebook Assistant did it for you). And let’s say it doesn’t work. Maybe it just produces the wrong output. Or maybe it generates all sorts of messages when it runs. Either way, you can just ask the Assistant “What went wrong?”

Here the Assistant rather patiently and clearly explained the message that was generated, then suggested “correct code”:

The Assistant tends to be remarkably helpful in situations like this—even for an experienced Wolfram Language user like me. In a sense, though, it has an “unfair advantage”. Not only has it learned “what’s reasonable” from seeing large amounts of Wolfram Language code; it also has access to “internal information”—like a stream of telemetry about messages that were generated (as well as stack traces, etc.).

In general, Notebook Assistant is rather impressive at “spotting errors” even in long and sophisticated pieces of Wolfram Language code—and in suggesting possible fixes. And I can say that this is a way in which using Notebook Assistant has immediately saved me significant time in doing things with Wolfram Language.

“Improve My Code”

Notebook Assistant doesn’t just know how to write Wolfram Language code; it knows how to write good Wolfram Language code. And in fact if you give it even a sloppy “outline” of Wolfram Language code, the Assistant is usually quite good at making it clean and complete. And that’s important not only in being able to produce code that will run correctly; it’s also important in making code that’s clear enough that you can understand it (courtesy of the readability of good Wolfram Language code).

Here’s an example starting with a rather horrible piece of Wolfram Language code on the right:

The code on the right is quite buggy (it doesn’t initialize list, for example). But Notebook Assistant guesses what it’s supposed to do, and then makes nice “Wolfram style” versions, explaining what it’s doing.

If the code you’re dealing with is long and complicated, Notebook Assistant may (like a person) get confused. But you can always select a particular part, then ask Notebook Assistant specifically about that. And the symbolic nature—and coherence—of the Wolfram Language will typically mean that Notebook Assistant will be able to act “modularly” on the piece that you’ve selected.

Something I’ve found rather useful is to have Notebook Assistant refactor code for me. Here I’m starting from a sequence of separate inputs (yes, itself generated by Notebook Assistant) and I’m turning it into a single function:

Now we can use the function however we want:

Going the other way is useful too. And Notebook Assistant is surprisingly good at grokking what a piece of code is “about”, and coming up with reasonable names for variables, functions, etc.:

Yet another thing Notebook Assistant is good at is knowing all sorts of tricks to make code run faster:

“Explain That to Me”

“What does that piece of code actually do?” Good Wolfram Language code—like good prose or good mathematical formalism—can succinctly communicate ideas, in its case in computational terms, precisely grounded in the definition of the language. But (as with prose and math) you sometimes need a more detailed exploration. And providing narrative explanations of code is something else that Notebook Assistant is good at. Here it’s taking a single line of (rather elegant) Wolfram Language code and writing a whole essay about what the code is doing:

What if you have a long piece of code, and you just want to explain some small part of it? Well, since Notebook Assistant sees selections you make, you can just select one part of your code, and Notebook Assistant will know that’s what you want to explain.

“Fill in the Paperwork for Me”

The Wolfram Language is carefully designed to have built-in functions that just “do what you need”, without having to use idioms or set up repeated boilerplate. But there are situations where there’s inevitably a certain amount of “bureaucracy” to do. For example, let’s say you’re writing a function to deploy to the Function Repository. You enter the definition for the function into a Function Resource Definition Notebook. But now you have to fill in documentation, examples, etc. And in fact that’s often the part that typically takes the longest. But now you can ask Notebook Assistant to do it for you. Here I put the cursor in the Examples section:

It’s always a good idea to set up tests for functions you define. And this is another thing Notebook Assistant can help with:

The “Inspiration Button”

All the examples of interacting with Notebook Assistant that we’ve seen so far involve using the Notebook Assistant window, that you can open with the button on the notebook toolbar. But another method involves using the button in the toolbar, which we’ve been calling the “inspiration button”.

When you use the Notebook Assistant window, the Assistant will always try to figure out what you’re talking about. For example, if you say “Plot that” it’ll use what it knows about what notebook you’re using, and where you are in it, to try to work out what you mean by “that”. But when you use the button it’ll specifically try to “provide inspiration at your current selection”.

Let’s say you’ve typed Plot[Sin[x]. Press and it’ll suggest a possible completion:

After using that suggestion, you can keep going:

You can think of the button as providing a sophisticated meaning-aware autocomplete.

It also lets you do things like code simplification. Imagine you’ve written the (rather grotesque):

If you want to get rid of the For loops, just select them and press the button to get a much simpler version:

Want to go even further? Select that result and Notebook Assistant manages to get to a one-liner:

Magic Writing, Magic Coding

At some level it seems bizarre. Write a text cell that describes code to follow it. Start an Input cell, then press and Notebook Assistant will try to magically write the code!

You can go the other way as well. Start with the code, then start a CodeText cell above it, and it’ll “magically” write a caption:

If you start a heading cell, it’ll try to make up a heading:

Start a Text cell, and it’ll try to “magically” write relevant textual content:

You can go even further: just put the cursor underneath the existing content, and press —and Notebook Assistant will start suggesting how you can go on:

As I write this, of course I had to try it: what does Notebook Assistant think I should write next? Here’s what it suggests (and, yes, in this case, those aren’t such bad ideas):

The Practicalities of the Assistant

One of the objectives for Notebook Assistant is to have it provide “hassle-free” access to AI and LLM technology integrated into the Wolfram System. And indeed, once you’ve set up your subscription (within your Wolfram Account), everything “just works”. Under the hood, there’s all sorts of technology, servers, etc. But you don’t have to worry about any of that; you can just use Notebook Assistant as a built-in part of the Wolfram Notebook experience.

As you work with Notebook Assistant, you’ll get progressively better intuition about where it can best help you. (And, yes, we’ll be continually updating Notebook Assistant, so it’ll often be worth trying things again if a bit of time has passed.) Notebook Assistant—like any AI-based system—has definite human-like characteristics, including sometimes making mistakes. Often those mistakes will be obvious (e.g. code with incorrect syntax colored red); sometimes they may be more difficult to spot. But the great thing about Notebook Assistant is that it’s firmly anchored to the “solid ground” of Wolfram Language. And any time it writes Wolfram Language code that you can see does what you want, you can always confidently use it.

There are some things that will help Notebook Assistant do its best for you. Particularly important is giving it the best view of the “context” for what you ask it. Notebook Assistant will generally look at whatever has already been said in a particular chat. So if you’re going to change the subject, it’s best to use the button to start a new chat, so Notebook Assistant will focus on the new subject, and not get confused by what you (or it) said before.

When you open the Notebook Assistant chat window you’ll often want to talk about—or refer to—material in some other notebook. Generally Notebook Assistant will assume that the notebook you last used is the one that’s relevant—and that any selection you have in that notebook is the thing to concentrate on the most. If you want Notebook Assistant to focus exclusively on what you’re saying in the chat window, one way to achieve that is to start a blank notebook. Another approach is to use the menu, which provides more detailed control over what material Notebook Assistant will consider. (For now, it just deals with notebooks you have open—but external files, URLs, etc. are coming soon.)

Notebook Assistant will by default store all your chat sessions. You can see your chat history (with chats automatically assigned names by the Assistant) by pressing the History button. You can delete chats from your history here. You can also “pop out” chats with , creating standalone notebooks that you can save, send to other people, etc.

So what’s inside Notebook Assistant? It’s quite a tower of technology. The core of its “linguistic interface” is an LLM (actually, several different LLMs)—trained on extensive Wolfram Language material, and with access to a variety of tools, especially Wolfram Language evaluators. Also critical to Notebook Assistant is its access to a variety of RAGs based on vector databases, that it uses for immediate semantic search of material such as Wolfram Language documentation. Oh, and then there’s a lot of technology to connect Notebook Assistant to the symbolic internal structure of notebooks, etc.

So when you use Notebook Assistant, where is it actually running? Its larger LLM tasks are currently running on cloud servers. But a substantial part of its functionally is running right on your computer—using Wolfram Language (notably the Wolfram machine learning framework, vector database system, etc.) And because these things are running locally, the Assistant can request access to local information on your computer—as well as avoiding the latency of accessing cloud-based systems.

Chats in Your Main Notebook (Coming Soon)

Much of the time, you want your interactions with Notebook Assistant to be somehow “off on the side”—say in the Notebook Assistant window, or in the inspiration button menu. But sometimes you want your interactions to be right in your main notebook.

And for this you’ll soon (in Version 14.2) be able to use an enhanced version of the Chat Notebook technology that we developed last year, not just in a separate “Chat Notebook”, but fully integrated into any notebook.

At the beginning of a cell in any notebook, just press ‘. You get a chat cell that communicates with Notebook Assistant:

And now the output from that chat cell is placed directly below in the notebook—so you can create a notebook that mixes standard notebook content with chat content.

It all works basically just like a fully integrated version of our Chat Notebook technology. (And this functionality is already available in Version 14.1 if you explicitly create a chat notebook with File > New > Chat Notebook.) As in Chat Notebooks, you use a chat break (with ~) to start a new chat within the same notebook. (In general, when you use a chat cell in an ordinary notebook to access Notebook Assistant, the assistant will see only material that occurs before the chat, and within the same chat block.)

Also Introducing: LLM Kit

In mid-2023 we introduced LLMFunction, LLMSynthesize and related functions (as well as ChatEvaluate, ImageSynthesize, etc.) to let you access LLM functionality directly within the Wolfram Language. Until now these functions required connection to an external LLM provider. But along with Notebook Assistant we’re introducing today LLM Kit—which allows you to access all LLM functionality in the Wolfram Language directly through a subscription within your Wolfram Account.

It’s all very easy: as soon as you enable your subscription, not only Notebook Assistant but also all LLM functionality will just work, going through our LLM service. (And, yes, Notebook Assistant is basically built on top of LLM Kit and the LLM service access it defines.)

When you’ve enabled your Notebook Assistant + LLM Kit subscription, this is what you’ll see in the Preferences panel:

Our LLM service is primarily aimed at “human speed” LLM usage, in other words, things like responding to what you ask the Notebook Assistant. But the service also seamlessly supports programmatic things like LLMFunction. And for anything beyond small-scale uses of LLMFunction, etc. you’ll probably want to upgrade from the basic “Essentials” subscription level to the “Pro” level. And if you want to go “industrial scale” in your LLM usage, you can do that by explicitly purchasing Wolfram Service Credits.

Everything is set up to be easy if you use our Wolfram LLM service—and that’s what Notebook Assistant is based on. But for Chat Notebooks and programmatic LLM functionality, our Wolfram Language framework also supports connection to a wide range of external LLM service providers. You have to have your own external subscription to whatever external service you want to use. But once you have the appropriate access key you’ll be able to set things up so that you can pick that LLM provider interactively in Chat Notebooks, programmatically through LLMConfiguration, or in the Preferences panel.

(By the way, we’re continually monitoring the performance of different LLMs on Wolfram Language generation; you can see weekly benchmark results at the Wolfram LLM Benchmark Project website—or get the data behind that from the Wolfram Data Repository.)

Opening Up the Ability to “Go Computational”

There’s really never been anything quite like it before: a way of automatically taking what can be quite vague human thoughts and ideas, and making them crisp and structured—by expressing them computationally. And, yes, this is made possible now by the unexpectedly effective linguistic interface that LLMs give us. But ultimately what makes it possible is that the LLMs have a target: the Wolfram Language in all its breadth and depth.

For me it’s an exciting moment. Because it’s a moment where everything we’ve been building these past four decades is suddenly much more broadly accessible. Expert users of Wolfram Language will be able to make use of all sorts of amazing nooks of functionality they never knew about. And people who’ve never used Wolfram Language before—or never even formulated anything computationally—will suddenly be able to do so.

And it’s remarkable what kinds of things one can “make computational”. Let’s say you ask Wolfram Notebook Assistant to make up a story. Like pretty much anything today with LLMs inside, it’ll dutifully do that:

But how can one make something like this computational? Well, just ask Notebook Assistant:

And what it does is rather remarkable: it uses Wolfram Language to create an interactive agent-based computational game version of the story!

Computation is the great paradigm of our times. And the development of “computational X” for all X seems destined to be the future of pretty much every field. The whole tower of ideas and technology that is the modern Wolfram Language was built precisely to provide the computational language that is needed. But now Notebook Assistant is dramatically broadening access to that—making it possible to get “computational language superpowers” using just ordinary (and perhaps even vague) natural language.

And even though I’ve now been living the computational language story for more than four decades Notebook Assistant keeps on surprising me with what it manages to make computational. It’s incredibly powerful to be able “go computational”. And even if you can’t imagine how it could work in what you’re doing, you should still just try it! Notebook Assistant may well surprise you—and in that moment show you a path to leverage the great power of the computational paradigm in ways that you’re never imagined.

Subscribe to Wolfram Notebook Assistant now »

Foundations of Biological Evolution: More Results & More Surprises

Stephen Wolfram — Thu, 05 Dec 2024 23:13:27 +0000

This is a follow-on to Why Does Biological Evolution Work? A Minimal Model for Biological Evolution and Other Adaptive Processes [May 3, 2024].

Even More from an Extremely Simple Model

A few months ago I introduced an extremely simple “adaptive cellular automaton” model that seems to do remarkably well at capturing the essence of what’s happening in biological evolution. But over the past few months I’ve come to realize that the model is actually even richer and deeper than I’d imagined. And here I’m going to describe some of what I’ve now figured out about the model—and about the often-surprising things it implies for the foundations of biological evolution.

The starting point for the model is to view biological systems in abstract computational terms. We think of an organism as having a genotype that’s represented by a program, that’s then run to produce its phenotype. So, for example, the cellular automaton rules on the left correspond to a genotype which are then run to produce the phenotype on the right (starting from a “seed” of a single red cell):

The key idea in our model is to adaptively evolve the genotype rules—say by making single “point mutations” to the list of outcomes from the rules:

At each step in the adaptive evolution we “accept” a mutation if it leads to a phenotype that has a higher—or at least equal—fitness relative to what we had before. So, for example, taking our fitness function to be the height (i.e. lifetime) of the phenotype pattern (with patterns that are infinite being assigned zero fitness), a sequence of (randomly chosen) adaptive evolution steps that go from the null rule to the rule above might be:

What if we make a different sequence of randomly chosen adaptive evolution steps? Here are a few examples of what happens—each in a sense “using a different idea” for how to achieve high fitness:

And, yes, one can’t help but be struck by how “lifelike” this all looks—both in the complexity of these patterns, and in their diversity. But what is ultimately responsible for what we’re seeing? It’s long been a core question about biological evolution. Are the forms it produces the result of careful “sculpting” by the environment (and by the fitness functions it implies)—or are their most important features somehow instead a consequence of something more intrinsic and fundamental that doesn’t depend on details of fitness functions?

Well, let’s say we pick a different fitness function—for example, not the height of a phenotype pattern, but instead its width (or, more specifically, the width of its bounding box). Here are some results of adaptive evolution in this case:

And, yes, the patterns we get are now ones that achieve larger “bounding box width”. But somehow there’s still a remarkable similarity to what we saw with a rather different fitness function above. And, for example, in both cases, high fitness, it seems, is normally achieved in a complicated and hard-to-understand way. (The last pattern is a bit of an exception; as can also happen in biology, this is a case where for once there’s a “mechanism” in evidence that we can understand.)

So what in the end is going on? As I discussed when I introduced the model a few months ago, it seems that the “dominant force” is not selection according to fitness functions, but instead the fundamental computational phenomenon of computational irreducibility. And what we’ll find here is that in fact what we see is, more than anything, the result of an interplay between the computational irreducibility of the process by which our phenotypes develop, and the computational boundedness of typical forms of fitness functions.

The importance of such an interplay is something that’s very much come into focus as a result of our Physics Project. And indeed it now seems that the foundations of both physics and mathematics are—more than anything—reflections of this interplay. And now it seems that’s true of biological evolution as well.

In studying our model, there are many detailed phenomena we’ll encounter—most of which seem to have surprisingly direct analogs in actual biological evolution. For example, here’s what happens if we plot the behavior of the fitness function for our first example above over the course of the adaptive evolution process:

We see a sequence of “plateaus”, punctuated by jumps in fitness that reflect some “breakthrough” being made. In the picture, each red dot represents the fitness associated with a genotype that was tried. Many fall below the line of “best results so far”. But there are also plenty of red dots that lie right on the line. And these correspond to genotypes that yield the same fitness that’s already been achieved. But here—as in actual biological evolution—it’s important that there can be “fitness-neutral evolution”, where genotypes change, but the fitness does not. Usually such changes of genotype yield not just the same fitness, but also the exact same phenotype. Sometimes, however, there can be multiple phenotypes with the same fitness—and indeed this happens at one stage in the example here

and at multiple stages in the second example we showed above:

The Multiway Graph of All Possible Evolutions

In the previous section we saw examples of the results of a few particular random sequences of mutations. But what if we were to look at all possible sequences of mutations? As I discussed when I introduced the model, it’s possible to construct a multiway graph that represents all possible mutation paths. Here’s what one gets for symmetric k = 2, r = 2 rules—starting from the null rule, and using height as a fitness function:

The way this graph is constructed, there are arrows from a given phenotype to all phenotypes with larger (finite) height that can be reached by a single mutation.

But what if our fitness function is width rather than height? Well, then we get a different multiway graph in which arrows go to phenotypes not with larger height but instead with larger width:

So what’s really going on here? Ultimately one can think of there being an underlying graph (that one might call the “mutation graph”) in which every edge represents a transformation between two phenotypes that can be achieved by a single mutation in the underlying genotype:

At this level, the transformations can go either way, so this graph is undirected. But the crucial point is that as soon as one imposes a fitness function, it defines a particular direction for each transformation (at least, each transformation that isn’t fitness neutral for this fitness function). And then if one starts, say, from the null rule, one will pick out a certain “evolution cone” subgraph of the original mutation graph.

So, for example, with width as the fitness function, the subgraph one gets is what’s highlighted here:

There are several subtleties here. First, we simplified the multiway graph by doing transitive reduction and drawing only the minimal edges necessary to define the connectivity of the graph. If we want to see all possible single-mutation transformations between phenotypes we need to do transitive completion, in which case for the width fitness function the multiway graph we get is:

But now there’s another subtlety. The edges in the multiway graph represent fitness-changing transformations. But there are also fitness-neutral transformations. And occasionally these can even lead to different (though equal-fitness) phenotypes, so that really each node in the graph above (say, the transitively reduced one) should sometimes be associated with multiple phenotypes

which can “fitness neutrally” transform into each other, as in:

But even this isn’t the end of the subtleties. Fitness-neutral sets typically contain many genotypes differing by changes of rule cases that don’t affect the phenotype they produce. But it may be that just one or a few of these genotypes are “primed” to be able to generate another phenotype with just one additional mutation. Or, in other words, each node in the multiway graph above represents a whole class of genotypes “equivalent under fitness-neutral transformations”, and when we draw an arrow it indicates that some genotype in that class can be transformed by a single mutation to some genotype in the class associated with a different phenotype:

But beyond the subtleties, the key point is that particular fitness functions in effect just define particular orderings on the underlying mutation graph. It’s somewhat like choices of reference frames or families of simultaneity surfaces in physics. Different choices of fitness function in effect define different ways in which the underlying mutation graph can be “navigated” by evolution over the course of time.

As it happens, the results are not so different between height and width fitness functions. Here’s a combined multiway graph, indicating transformations variously allowed by these different fitness functions:

Homing in on a small part of this graph, we see that there are different “flows” associated with maximizing height and maximizing width:

With a single fitness function that for any two phenotypes systematically treats one phenotype as fitter than another, the multiway graph must always define a definite flow. But as soon as one considers changing fitness functions in the course of evolution, it’s possible to get cycles in the multiway graph, as in the example above—so that, in effect, “evolution can repeat itself”.

Fitness Functions Based on Aspect Ratio

We’ve looked at fitness functions based on maximizing height and on maximizing width. But what if we try to combine these? Here’s a plot of the widths and heights of all phenotypes that occur in the symmetric k = 2, r = 2 case we studied above:

We could imagine a variety of ways to define “fitness frontiers” here. But as a specific example, let’s consider fitness functions that are based on trying to achieve specific aspect ratios—i.e. phenotypes that are as close as possible to a particular constant-aspect-ratio line in the plot above.

With the symmetric k = 2, r = 2 rules we’re using here, only a certain set of aspect ratios can ever be obtained:

The corresponding phenotypes (with their aspect ratios) are:

As we change the aspect ratio that we’re trying to achieve, the evolution multiway graph will change:

In all cases we’re starting from the null rule. For target aspect ratio 1.0 this rule itself already achieves that aspect ratio—so the multiway graph in that case is trivial. But in general, different aspect ratios yield evolution multiway graphs that are different subgraphs of the complete mutation graph we saw above.

So if we follow all possible paths of evolution, how close can we actually get to any given target aspect ratio? This plot shows what final aspect ratios can be achieved as a function of target aspect ratio:

And in a sense this is a summary of the effect of “developmental constraints” for “adaptive cellular automaton organisms” like this. If there were no constraints then for every target aspect ratio it’d be possible to get an “organism” with that aspect ratio—so in the plot there’d be a point lying on the red line. But in actuality the process of cellular automaton growth imposes constraints—that in particular allows only certain phenotypes, with certain aspect ratios, to exist. And beyond that, which phenotypes can actually be reached by adaptive evolution depends on the evolution multiway graph, with “different turns” on the graph leading to different fitness (i.e. different aspect ratio) phenotypes.

But what the plot above shows overall is that for a certain range of target aspect ratios, adaptive evolution is successfully able to get at least close to those aspect ratios. If the target aspect ratio gets out of that range, however, “developmental constraints” come in that prevent the target from being reached.

With “larger genomes”, i.e. rules with larger numbers of cases to specify, it’s possible to do better, and to more accurately achieve particular aspect ratios, over larger ranges of values. And indeed we can see some version of this effect even for symmetric k = 2, r = 2 rules by plotting aspect ratios that can be achieved as a function of the number of cases that need to be specified in the rule:

As an alternative visualization, we can plot the “best convergence to the target” as a function of the number of rule cases—and once again we see that larger numbers of rule cases let us get closer to target aspect ratios:

It’s worth mentioning that—just as we discussed for height and width fitness functions above—there are subtleties here associated with fitness-neutral sets. For example, here are sets of phenotypes that all have the specified aspect ratios—with phenotypes that can be reached by single point mutations being joined:

In the evolution multiway graphs above, we included only one phenotype for each fitness-neutral set. But here’s what we get for target aspect ratio 0.7 if we show all phenotypes with a given fitness:

Note that on the top line, we don’t just get the null rule. Instead, we get four phenotypes, all of which, like the null rule, have aspect ratio 1, and so are equally far from the target aspect ratio 0.7.

The picture above is only the transitively reduced graph. But if we include all possible transformations associated with single point mutations, we get instead:

Based on this graph, we can now make what amounts to a foliation, showing collections of phenotypes reached by a certain minimum number of mutations, progressively approaching our target aspect ratio (here 0.7):

Here’s what we get from the range of target aspect ratios shown above (where, as above, “terminal phenotypes” are highlighted):

In a sense these sequences show us what phenotypes can appear at progressive stages in the “fossil record” for different (aspect-ratio) fitness functions in our very simple model. The highlighted cases are “evolutionary dead ends”. The others can evolve further.

Unreachable Cases

Our model takes the process of adaptive evolution to never “go backwards”, or, in other words, to never evolve from a particular genotype to one with lower fitness. But this means that starting with a certain genotype (say the null rule) there may be genotypes (and hence phenotypes) that will never be reached.

With height as a fitness function, there are just two single (“orphan”) phenotypes that can’t be reached:

And with width as the fitness function, it turns out the very same phenotypes also can’t be reached:

But if we use a fitness function that, for example, tries to achieve aspect ratio 0.7, we get many more phenotypes that can’t be reached starting from the null rule:

In the original mutation graph all the phenotypes appear. But when we foliate (or, more accurately, order) that graph using a particular fitness function, some phenotypes become unreachable by evolutionarily-possible transformations—in a rough analogy to the way some events in physics can become unreachable in the presence of an event horizon.

Multiway Graphs for Larger Rule Spaces

So far we’ve discussed multiway graphs here only for symmetric k = 2, r = 2 rules. There are a total of 524,288 (= 2¹⁹) possible such rules, producing 77 distinct phenotypes. But what about larger classes of rules? As an example, we can consider all k = 2, r = 2 rules, without the constraint of symmetry. There are 2,147,483,648 (= 2³¹) possible such rules, and there turn out to be3137distinct phenotypes.

For the height fitness function, the complete multiway graph in this case is

or, annotated with actual phenotypes:

Click image to zoom and pan

If instead we just show bounding boxes, it’s easier to see where long-lifetime phenotypes occur:

With a different graph layout the evolution multiway graph (with initial node indicated) becomes:

One subtlety here is that the null rule has no successors with single point mutation. When we were talking about symmetric k = 2, r = 2 rules, we took a “single point mutation” always to change both a particular rule case and its mirror image. But if we don’t have the symmetry requirement, a single point mutation really can just change a single rule case. And if we start from the null range and look at the results of changing just one bit (i.e. the output of just one rule case) in all possible ways we find that we either get the same pattern as with the null rule, or we get a pattern that grows without bound:

Or, put another way, we can’t get anywhere with single bit mutations starting purely from the null rule. So what we’ve done is instead to start our multiway graph from k = 2, r = 2 rule 20, which has two bits “on”, and gives phenotype:

But starting from this, just one mutation (together with a sequence of fitness-neutral mutations) is sufficient to give94phenotypes—or49after removing mirror images:

The total number of new phenotypes we can reach after successively more (non-fitness-neutral) mutations is

while the successive longest-lifetime patterns are:

And what we see here is that it’s in principle possible to achieve long lifetimes even with fairly few mutations. But when the mutations are done at random, it can still take a very large number of steps to successfully “random walk” to long lifetime phenotypes.

And out of a total of2407distinct phenotypes,984are “dead ends” where no further evolution is possible. Some of these dead ends have long lifetimes

but others have very short lifetimes:

There’s much more to explore in this multiway graph—and we’ll continue a bit below. But for now let’s look at another evolution multiway graph of accessible size: the one for symmetric k = 3, r = 1 rules. There are a total of 129,140,163 (= 3¹⁷) possible such rules, that yield a total of14,778distinct phenotypes:

Click image to zoom and pan

Showing only bounding boxes of patterns this becomes:

Unlike the k = 2, r = 2 case, we can now start this whole graph with the null rule. However, if we look at all possible symmetric k = 3, r = 1 rules, there turn out to be 6 “isolates” that can’t be reached from the null rule by adaptive evolution with the height fitness function:

Starting from the null rule, the number of phenotypes reached after successively more (non-fitness-neutral) mutations is

and the successive longest-lived of these phenotypes are:

Aspect Ratio Fitness

Just as we looked at fitness functions based on aspect ratio above for symmetric k = 2, r = 2 rules, so now we can do this for the whole space of all possible k = 2, r = 2 rules. Here’s a plot of the heights and widths of patterns that can be achieved with these rules:

These are the possible aspect ratios this implies:

And here’s their distribution (on a log scale):

The range of possible values extends much further than for symmetric k = 2, r = 2 rules: to rather than to . The patterns now with the largest aspect ratios are

while those with the smallest aspect ratios are:

Note that just as for symmetric k = 2, r = 2 rules, to reach a wider range of aspect ratios, more cases in the rule have to be specified:

So what happens if we use adaptive evolution to try to reach different possible target aspect ratios? Most of the time (at least up to aspect ratio ≈ 3) there’s some sequence of mutations that will do it—though often we can get stuck at a different aspect ratio:

If we look at the “best convergence” to a given target aspect ratio then we see that this improves as we increase the number of cases specified in the rule:

So what does the multiway graph look like for a fitness function associated with a particular aspect ratio? Here’s the result for aspect ratio 3:

The initial node involves patterns with aspect ratio 1—actually a fitness-neutral set of 263 of them. And as we go through the multiway graph, the aspect ratios get nearer to 3. The very closest they get, though, are for the patterns (whose locations are indicated on the graph):

But actually (as we saw in the lineup above), there is a rule that gives aspect ratio exactly 3:

But it turns out that this rule can’t be reached by adaptive evolution using single point mutations. In effect, adaptive evolution isn’t “strong enough” to achieve the exact aspect ratio we want; we can think of it as being “unpredictably prevented” by computationally irreducible “developmental constraints”.

OK, so what about the symmetric k = 3, r = 1 rules? Here’s how they’re distributed in width and height:

And, yes, in a typical “there are always surprises” story, there’s a strange height 265, width 173 pattern that shows up:

The overall possible aspect ratios are now

and their (log) distribution is:

The phenotypes with the largest aspect ratios are

while those with the smallest aspect ratios are:

Once again, to reach a larger range of aspect ratios, one has to specify more cases in the rule:

If we try to target a certain aspect ratio, there’s somewhat more of a tendency to get stuck than for k = 2, r = 2 rules—perhaps somewhat as a result of there now being fewer total rules (though more phenotypes) available:

Branching in the Multiway Evolution Graph

Looking at a typical multiway evolution graph such as

we see that different phenotypes can be quite separated in the graph—a bit like organisms on different branches of the tree of life in actual biology. But how can we characterize this separation? One approach is to compute the so-called dominator tree of the graph:

We can think of this as a way to provide a map of the least common ancestors of all nodes. The tree is set up so that given two nodes you just trace up the tree to find their common ancestor. Another interpretation of the tree is that it shows you what nodes you have no choice but to pass through in getting from the initial node to any given node—or, in other words, what phenotypes adaptive evolution has to produce on the way to a given phenotype.

Here’s another rendering of the tree:

We can think of this as the analog of the biological tree of life, with successive branchings picking out finer and finer “taxonomic domains” (analogous to kingdoms, phyla, etc.)

The tree also shows us something else: how significant different links or nodes are—and how much of the tree one would “lop off” if they were removed. Or, put a different way, how much would be achieved by blocking a certain link or node—as one might imagine doing to try to block the evolution of bacteria or tumor cells?

What if we look at larger multiway evolution graphs, like the complete k = 2, r = 2 one? Once again we can construct a dominator tree:

It’s notable that there’s tremendous variance in the “fan out” here, with the phenotypes with largest successor counts being the rather undistinguished:

But what if one’s specifically trying to reach, say, one of the maximum lifetime (length308) phenotypes? Well, then one has to follow the paths in a particular subgraph of the original multiway evolution graph

corresponding to the phenotype graph:

If one goes off this “narrow path” then one simply can’t reach the length-308 phenotype; one inevitably gets stuck in what amounts to another branch of the analog of the “tree of life”. So if one is trying to “guide evolution” to a particular outcome, this tells one that one needs to block off lots of “exit ramps”.

But what “fraction of the whole graph” is the subgraph that leads to the length-308 phenotype? The whole graph has2409vertices and 3878 edges, while the subgraph has 64 vertices and 119 edges, i.e. in both cases about 3%. A different measure is what fraction of all paths through the graph lead to the length-308 phenotype. The total number of paths is606,081,while the number leading to the length-308 phenotype is 1260, or about 0.2%. Does this tell us what the probability of reaching that phenotype will be if we just make a random sequence of mutations? Not quite, because in the multiway evolution graph many equivalencings have been done, notably for fitness-neutral sets. And if we don’t do such equivalencings, it turns out (as we’ll discuss below) that the corresponding number is significantly smaller—about 0.007%.

Exact-Match Fitness Functions

The fitness functions we’ve been considering so far look only at coarse features of phenotype patterns—like their height, width and aspect ratio. But what happens if we have a fitness function that’s maximal only for a phenotype that exactly matches a particular pattern?

As an example, let’s consider k = 2, r = 1 cellular automata with phenotypes grown for a specific number of steps—and with a fitness function that counts the number of cells that agree with ones in a target:

Let’s say we start with the null rule, then adaptively evolve by making single point mutations to the rule (here just 8 bits). With a target of the rule 30 pattern, this is the multiway graph we get:

And what we see is that after a grand tour of nearly a third of all possible rules, we can successfully reach the rule 30 pattern. But we can also get stuck at rule 86 and rule 190 patterns—even though their fitness values are much lower:

If we consider all possible k = 2, r = 1 cellular automaton patterns as targets, it turns out that these can always be reached by adaptive evolution from the null rule—though a little less than half the time there are other possible endpoints (here specified by rule numbers) at which the evolution process can get stuck:

So far we’ve been assuming that we have a fitness function that’s maximized by matching some pattern generated by a cellular automaton pattern. But what if we pick some quite different pattern to match against? Say our pattern is:

With k = 2, r = 1 rules (running with wraparound in a finite-size region), we can construct a multiway graph

and find out that the maximum fitness endpoints are the not-very-good approximations:

We can also get to these by applying random mutations:

But what if we try a larger rule space, say k = 2, r = 2 rules? Our approximations to the “A” image get a bit better:

Going to k = 2, r = 3 leads to slightly better (but not great) final approximations:

If we try to do the same thing with our target instead being

we get for example

while with target

we get (even less convincing) results like:

What’s going on here? Basically it’s that if we try to set up too intricate a fitness function, then our rule spaces won’t contain rules that successfully maximize it, and our adaptive evolution process will end up with a variety of not-very-good approximations.

How Fitness Builds Up

When one looks at an evolution process like

one typically has the impression that successive phenotypes are achieving greater fitness by somehow progressively “building on the ideas” of earlier ones. And to get a more granular sense of this we can highlight cells at each step that are using “newly added cases” in the rule:

We can think of new rule cases as a bit like new genes in biology. So what we’re seeing here is the analog of new genes switching on (or coming into existence) as we progress through the process of biological evolution.

Here’s what happens for some other paths of evolution:

What we see is quite variable. There are a few examples where new rule cases show up only at the end, as if a new “incrementally engineered” pattern was being “grafted on at the end”. But most of the time new rule cases show up sparsely dotted all over the pattern. And somehow those few “tweaks” lead to higher fitness—even though there’s no obvious reason why, and no obvious way to predict where they should be.

It’s interesting to compare this with actual biology, where it’s pretty common to see what appear to be “random gratuitous changes” between apparently very similar organisms. (And, yes, this can lead to all sorts of problems in things like comparing toxicity or drug effectiveness in model animals versus humans.)

There are many ways to consider quantitatively characterizing how “rule utilization” builds up. As just one example, here are plots for successive phenotypes along the evolution paths shown above of what stages in growth new rule cases show up:

But Is It Explainable?

Here are two “adaptively evolved” long-lifetime rules that we discussed at the beginning:

We can always run these rules and see what patterns they produce. But is there a way to explain what they do? And for example to analyze how they manage to yield lifetimes? Or is what we’re seeing in these rules basically “pure computational irreducibility” where the only way to tell what patterns they will generate—and how long they’ll live—is just explicitly to run them step by step?

The second rule here seems to have a bit more regularity than the first, so let’s tackle it first. Let’s look at the “blade” part. Once such an object—of any width—has formed, its behavior will basically be repetitive, and it’s easy to predict what will happen:

The left-hand edge moves by 1 position every 7 steps, and the right-hand edge by 4 positions every 12 steps. And since , however wide the initial configuration is, it’ll always die out, after a number of steps that’s roughly times the initial width.

But OK, how does a configuration like this get produced? Well, that’s far from obvious. Here’s what happens with a sequence of few-cell initial conditions …:

So, yes, it doesn’t always directly make the “blade”. Sometimes, for example, it instead makes things like these, some of which basically just become repetitive, and live forever:

And even if it starts with a “blade texture” unexpected things can happen:

There are repetitive patterns that can persist—and indeed the “blade” uses one of these:

Starting from a random initial condition one sees various kinds of behavior, with the blade being fairly common:

But none of this really makes much of a dent in “explaining” why with this rule, starting from a single red cell, we get a long-lived pattern. Yes, once the “blade” forms, we know it’ll take a while to come to a point. But beyond this little pocket of computational reducibility we can’t say much in general about what the rule does—or why, for example, a blade forms with this initial condition.

So what about our other rule? There’s no obvious interesting pocket of reducibility there at all. Looking at a sequence of few-cell initial conditions we get:

And, yes, there’s all sorts of different behavior that can occur:

The first of these patterns is basically periodic, simply shifting 2 cells to the left every 56 steps. The third one dies out after 369 steps, and the fourth one becomes basically periodic (with period 56) after 1023 steps:

If we start from a random initial condition we see a few places where things die out in a repeatable pattern. But mostly everything just looks very complicated:

As always happens, the rule supports regions of repetitive behavior, but they don’t normally extend far enough to introduce any significant computational reducibility:

So what’s the conclusion? Basically it’s that these rules—like pretty much all others we’ve seen here—behave in essentially computationally irreducible ways. Why do they have long lifetimes? All we can really say is “because they do”. Yes, we can always run them and see what happens. But we can’t make any kind of “explanatory theory”, for example of the kind we’re used to in mathematical approaches to physics.

Distribution in Morphospace

We can think of the pattern of growth seen in each phenotype as defining what we might call in biology its “morphology”. So what happens if we try to operate as “pure taxonomists”, laying out different phenotypes in “morphospace”? Here’s a result based on using machine learning and FeatureSpacePlot:

And, yes, this tends to group “visually similar” phenotypes together. But how does proximity in morphospace relate to proximity in genotypes? Here is the same arrangement of phenotypes as above, but now indicating the transformations associated with single mutations in genotype:

If for example we consider maximizing for height, only some of the phenotypes are picked out:

For width, a somewhat different set are picked out:

And here is what happens if our fitness function is based on aspect ratio:

In other words, different fitness functions “select out” different regions in morphospace.

We can also construct a morphospace not just for symmetric but for all k = 2, r = 2 rules:

The detailed pattern here is not particularly significant, and, more than anything, just reflects the method of dimension reduction that we’ve used. What is more meaningful, however, is how different fitness functions select out different regions in morphospace. This shows the results for fitness functions based on height and on width—with points colored according to the actual values of height and width for those phenotypes:

Here are the corresponding results for fitness functions based on different aspect ratios, where now the coloring is based on closeness to the target aspect ratio:

What’s the main conclusion here? We might have expected that different fitness functions would cleanly select visibly different parts of morphospace. But at least with our machine-learning-based way of laying out morphospace that’s not what we’re seeing. And it seems likely that this is actually a general result—and that there is no layout procedure that can make any “easy to describe” fitness function “geometrically simple” in morphospace. And once again, this is presumably a consequence of underlying computational irreducibility—and to the fact that we can’t expect any morphospace layout procedure to be able to provide a way to “untangle the irreducibility” that will work for all fitness functions.

Probabilities and the Time Course of Evolution

In what we’ve done so far, we’ve mostly been concerned with things like what sequences of phenotypes can ever be produced by adaptive evolution. But in making analogies to actual biological evolution—and particularly to how it’s captured in the fossil record—it’s also relevant to discuss time, and to ask not only what phenotypes can be produced, but also when, and how frequently.

For example, let’s assume there’s a constant rate of point mutations in time. Then starting from a given rule (like the null rule) there’ll be a certain rate at which transitions to other rules occur. Some of these transitions will lead to rules that are selected out. Others will be kept, but will yield the same phenotype. And still others will lead to transitions to different phenotypes.

We can represent this by a “phenotype transition diagram” in which the thickness of each outgoing edge from a given phenotype indicates the fraction of all possible mutations that lead to the transition associated with that edge:

Gray self-loops in this diagram represent transitions that lead back to the same phenotype (because they change cases in the rule that don’t matter). Pink self-loops correspond to transitions that lead to rules that are selected out. We don’t show rules that have been selected out here; instead we assume that in this case we just “wait at the original phenotype” and don’t make a transition.

We can annotate the whole symmetric k = 2, r = 2 multiway evolution graph with transition probabilities:

Underlying this graph is a matrix of transition probabilities between all 2¹⁹ possible symmetric k = 2, r = 2 rules (where the structure reflects the fact that many rules transform to rules which differ only by one bit):

Keeping only distinct phenotypes and ordering by lifetime, we can then make a matrix of phenotype transition probabilities:

Treating the transitions as a Markov process, this allows us to compute the expected frequency of each phenotype as a function of time (i.e. number of mutations):

What’s basically happening here is that there’s steady evolution away from the single-cell phenotype. There are some intermediate phenotypes that come and go, but in the end, everything “flows” to the final (“leaf”) phenotypes on the multiway evolution graph—leading to a limiting “equilibrium” probability distribution:

Stacking the different curves, we get an alternative visualization of the evolution of phenotype frequencies:

If we were “running evolution” with enough separate individuals, these would be the limiting curves we’d get. If we reduced the number of individuals, we’d start to see fluctuations—and there’d be a certain probability, for example, for a particular phenotype to end up with zero individuals, and effectively go extinct.

So what happens with a different fitness function? Here’s the result using width instead of height:

And here are results for fitness functions based on a sequence of targets for aspect ratio:

And, yes, the fitness function definitely influences the time course of our adaptive evolution process.

So far we’ve been looking only at symmetric k = 2, r = 2 rules. If we look at the space of all possible k = 2, r = 2 rules, the behavior we see is similar. For example, here’s the time evolution of possible phenotypes based on our standard height fitness function:

And this is what we see if we look only at the longest-lifetime (i.e. largest-height) cases:

As the scale here indicates, such long-lived phenotypes are quite rare—though most still occur with nonzero frequency even after arbitrarily large times (which is an inevitable given that they appear as “maximal fitness” terminal nodes in the multiway graph).

And indeed if we plot the final frequencies of phenotypes against their lifetimes we see that there are a wide range of different cases:

The phenotypes with the highest “equilibrium” frequencies are

with some having fairly small lifetimes, and others larger.

The Macroscopic Flow of Evolution

In the previous section, we looked at the time course of evolution with various different—but fixed—fitness functions. But what if we had a fitness function that changes with time—say analogous to an environment for biological evolution that changes with time?

Here’s what happens if we have an aspect ratio fitness function whose target value increases linearly with time:

The behavior we see is quite complex, with certain phenotypes “winning for a while” but then dying out, often quite precipitously—with others coming to take their place.

If instead the target aspect ratio decreases with time, we see rather different behavior:

(The discontinuous derivatives here are basically associated with the sudden appearance of new phenotypes at particular target aspect ratio values.)

It’s also possible to give a “shock to the system” by suddenly changing the target aspect ratio:

And what we see is that sometimes this shock leads to fewer surviving phenotypes, and sometimes to more.

We can think of a changing fitness function as being something that applies a “macroscopic driving force” to our system. Things happen quickly down at the level of individual mutation and selection events—but the fitness function defines overall “goals” for the system that in effect change only slowly. (It’s a bit like a fluid where there are fast molecular-scale processes, but typically slow changes of macroscopic parameters like pressure.)

But if the fitness function defines a goal, how well does the system manage to meet it? Here’s a comparison between an aspect ratio goal (here, linearly increasing) and the distribution of actual aspect ratios achieved, with the darker curve indicating the mean aspect ratio obtained by a weighted average over phenotypes, and the lighter blue area indicating the standard deviation:

And, yes, as we might have expected from earlier results, the system doesn’t do particularly well at achieving the goal. Its behavior is ultimately not “well sculpted” by the forces of a fitness function; instead it is mostly dominated by the intrinsic (computationally irreducible) dynamics of the underlying adaptive evolution process.

One important thing to note however is that our results depend on the value of a parameter: essentially the rate at which underlying mutations occur relative to the rate of change of the fitness function. In the picture above 5000 mutations occur over the time the fitness function goes from minimum to maximum value. This is what happens if we change the number of mutations that occur (or, in effect, the “mutation rate”):

Generally—and not surprisingly—adaptive evolution does better at achieving the target when the mutation rate is higher, though in both the cases shown here, nothing gets terribly close to the target.

In their general character our results here seem reminiscent of what one might expect in typical studies of continuum systems, say based on differential equations. And indeed one can imagine that there might be “continuum equations of adaptive evolution” that govern situations like the ones we’ve seen here. But it’s important to understand that it’s far from self evident that this is possible. Because underneath everything is a multiway evolution graph with a definite and complicated structure. And one might think that the details of this structure would matter to the overall “continuum evolution process”. And indeed sometimes they will.

But—as we have seen throughout our Physics Project—underlying computational irreducibility leads to a certain inevitable simplicity when looking at phenomena perceived by computationally bounded observers. And we can expect that something similar can happen with biological evolution (and indeed adaptive evolution in general). Assuming that our fitness functions (and their process of change) are computationally bounded, then we can expect that their “aggregate effects” will follow comparatively simple laws—which we can perhaps think of as laws for the “flow of evolution” in response to external input.

Can Evolution Be Reversed?

In the previous section we saw that with different fitness functions, different time series of phenotypes appear, with some phenotypes, for example, sometimes “going extinct”. But let’s say evolution has proceeded to a certain point with a particular fitness function—and certain phenotypes are now present. Then one question we can ask is whether it’s possible to “reverse” that evolution, and revert to phenotypes that were present before. In other words, if we change the fitness function, can we make evolution “go backwards”?

We’ve often discussed a fitness function based on maximizing total (finite) lifetime. But what if, after using this fitness function for a while, we “reverse it”, now minimizing total lifetime?

Consider the multiway evolution graph for symmetric k = 2, r = 2 rules starting from the null rule, with the fitness function yet again being maximizing lifetime:

But what if we now say the fitness function minimizes lifetime? If we start from the longest-lifetime phenotype we get the “lifetime minimization” multiway graph:

We can compare this “reversed graph” to the “forward graph” based on all paths from the null rule to the maximum-lifetime rule:

And in this case we see that the phenotypes that occur are almost the same, with the exception of the fact that can appear in the reverse case.

So what happens when we look at all k = 2, r = 2 rules? Here’s the “reverse graph” starting from the longest-lifetime phenotype:

A total of 345 phenotypes appear here eventually leading all the way back to . In the overall “forward graph” (which has to start from rather than ) a total of2409phenotypes appear, though (as we saw above) only 64 occur in paths that eventually lead to the maximum lifetime phenotype:

And what we see here is that the forward and reverse graphs look quite different. But could we perhaps construct a fitness function for the reverse graph that will successfully corral the evolution process to precisely retrace the steps of the forward graph?

In general, this isn’t something we can expect to be able to do. Because to do so would in effect require “breaking the computational irreducibility” of the system. It would require having a fitness function that can in essence predict every detail of the evolution process—and in so doing be in a position to direct it. But to achieve this, the fitness function would in a sense have to be computationally as sophisticated as the evolution process itself.

It’s a variant of an argument we’ve used several times here. Realistic fitness functions are computationally bounded (and in practice often very coarse). And that means that they can’t expect to match the computational irreducibility of the underlying evolution process.

There’s an analogy to the Second Law of thermodynamics. Just as the microscopic collisions of individual molecules are in principle easy to reverse, so potentially are individual transitions in the evolution graph. But putting many collisions or many transitions together leads to a process that is computationally sophisticated enough that the fairly coarse means at our disposal can’t “decode” and reverse it.

Put another way, there is in practice a certain inevitable irreversibility to both molecular dynamics and biological evolution. Yes, with enough computational effort—say carefully controlling the fitness function for every individual organism—it might in principle be possible to precisely “reverse evolution”. But in practice the kinds of fitness functions that exist in nature—or that one can readily set up in a lab—are computationally much too weak. And as a result one can’t expect to be able to get evolution to precisely retrace its steps.

Random or Selected? Can One Tell?

Given only a genotype, is there a way to tell whether it’s “just random” or whether it’s actually the result of some long and elaborate process of adaptive evolution? From the genotype one can in principle use the rules it defines to “grow” the corresponding phenotype—and then look at whether it has an “unusually large” fitness. But the question is whether it’s possible to tell anything directly from the genotype, without going through the computational effort of generating the phenotype.

At some level it’s like asking, whether, say, from a cellular automaton rule, one can predict the ultimate behavior of the cellular automaton. And a core consequence of computational irreducibility is that one can’t in general expect to do this. Still, one might imagine that one could at least make a “reasonable guess” about whether a genotype is “likely” to have been chosen “purely randomly” or to have been “carefully selected”.

To explore this, we can look at the genotypes for symmetric k = 2, r = 2 rules, say ordered by their lifetime-based fitness—with black and white here representing “required” rule cases, and gray representing undetermined ones (which can all independently be either black or white):

On the right is a summary of how many white, black and undetermined (gray) outcomes are present in each genotype. And as we have seen several times, to achieve high fitness all or almost all of the outcomes must be determined—so that in a sense all or almost all of the genome is “being used”. But we still need to ask whether, given a certain actual pattern of outcomes, we can successfully guess whether or not a genotype is the result of selection.

To get more of a sense of this, we can look at plots of the probabilities for different outcomes for each case in the rule, first (trivially) for all combinatorially possible genotypes, then for all genotypes that give viable (i.e. in our case, finite-lifetime) phenotypes, and then for “selected genotypes”:

Certain cases are always completely determined for all viable genomes—but rather trivially so, because, for example, if then the pattern generated will expand at maximum speed forever, and so cannot have a finite lifetime.

So what happens for all k = 2, r = 2 rules? Here are the actual genomes that lead to particular fitness levels:

And now here are the corresponding probabilities for different outcomes for each case in the rule:

And, yes, given a particular setup we could imagine working out from results like these at least an approximation to the likelihood for a given randomly chosen genome to be a selected one. But what’s true in general? Is there something that can be determined with bounded computational effort (i.e. without explicitly computing phenotypes and their fitnesses) that gives a good estimate of whether a genome is selected? There are good reasons to believe that computational irreducibility will make this impossible.

It’s a different story, of course, if one’s given a “fully computed” phenotype. But at the genome level—without that computation—it seems unlikely that one can expect to distinguish random from “selected-somehow” genotypes.

Adaptive Evolution of Initial Conditions

In making our idealized model of biological evolution we’ve focused (as biology seems to) on the adaptive evolution of the genotype—or, in our case, the underlying rule for our cellular automata. But what if instead of changing the underlying rule, we change the initial condition used to “grow each organism”?

For example, let’s say that we start with the “single cell” we’ve been using so far, but then at each step in adaptive evolution we change the value of one cell in the initial condition (say within a certain distance of our original cell)—then keep any initial condition that does not lead to a shorter lifetime:

The sequence of lifetimes (“fitness values”) obtained in this process of adaptive evolution is

and the “breakthrough” initial conditions are:

The basic setup is similar to what we’ve seen repeatedly in the adaptive evolution of rules rather than initial conditions. But one immediate difference is that, at least in the example we’ve just seen, changing initial conditions does not as obviously “introduce new ideas” for how to increase lifetime; instead, it gives more of an impression of just directly extending “existing ideas”.

So what happens more generally? Rules with k = 2, r = 1 tend to show either infinite growth or no growth—with finite lifetimes arising only from direct “erosion” of initial conditions (here for rules 104 and 164):

For k = 2, r = 2 rules the story is more complicated, even in the symmetric case. Here are the sequences of longest lifetime patterns obtained with all possible progressively wider initial conditions with various rules:

Again, there is a certain lack of “fundamentally new ideas” in evidence, though there are definitely “mechanisms” that get progressively extended with larger initial conditions. (One notable regularity is that the maximum lifetimes of patterns often seem roughly proportional to the width of initial condition allowed.)

Can adaptive evolution “discover more”? Typically, when it’s just modifying initial conditions in a fixed region, it doesn’t seem so—again it seems to be more about “extending existing mechanisms” than introducing new ones:

2D Cellular Automata

Everything we’ve done so far has been for 1D cellular automata. So what happens if we go to 2D? In the end, the story is going to be very similar to 1D—except that the rule spaces even for quite minimal neighborhoods are vastly larger.

With k = 2 colors, it turns out that with a 5-cell neighborhood one can’t “escape from the null rule” by single point mutations. The issue is that any single case one adds in the rule will either do nothing, or will lead only to unbounded growth. And even with a 9-cell neighborhood one can’t get rules that show growth that is neither limited nor infinite with a single-cell initial condition. But with a initial condition this is possible, and for example here is a sequence of phenotype patterns generated by adaptive evolution using lifetime as a fitness function:

Here’s what these patterns look like when “viewed from above”:

And here’s how the fitness progressively increases in this case:

There are a total of 2⁵¹² ≈ 10¹⁵⁴ possible 9-neighbor rules, and in this vast rule space it’s easy for adaptive evolution to find rules with long finite lifetimes. (By the way, I’ve no idea what the absolute maximum “busy beaver” lifetime in this space is.)

Just as in 1D, there’s a fair amount of variation in the behavior one sees. Here are some examples of the “final rules” for various instances of adaptive evolution:

In a few cases one can readily “see the mechanism” for the lifetime—say associated with collisions between localized structures. But mostly, as in the other examples we’ve seen, there’s no realistic “narrative explanation” for how these rules achieve long yet finite lifetimes.

The Turing Machine Case

OK, so we’ve now looked at 2D as well as 1D cellular automata. But what about systems that aren’t cellular automata at all? Will we still see the same core phenomena of adaptive evolution that we’ve identified in cellular automata? The Principle of Computational Equivalence would certainly lead one to expect that we would. But to check at least one example let’s look at Turing machines.

Here’s a Turing machine with s = 3 states for its head, and k = 2 colors for cells on its tape:

The Turing machine is set up to halt if it ever reaches a case in the rule where the output is . Starting from a blank initial condition, this particular Turing machine halts after 19 steps.

So what happens if we try to adaptively evolve Turing machines with long lifetimes (i.e. that take many steps to halt)? Say we start from a “null rule” that halts in all cases, and then we make a sequence of single point mutations in the rule, keeping ones that don’t lead the Turing machine to halt in fewer steps than before. Here’s an example where the adaptive evolution eventually reaches a Turing machine that takes 95 steps to halt:

The sequence of (“breakthrough”) mutations involved here is

corresponding to a fitness curve of the form:

And, yes, all of this is very analogous to what we’ve seen in cellular automata. But one difference is that with Turing machines there are routinely much larger jumps in halting times. And the basic reason for this is just that Turing machines have much less going on at any particular step than typical cellular automata do—so it can take them much longer to achieve some particular state, like a halting state.

Here’s an example of adaptive evolution in the space of s = 3, k = 3 Turing machines—and in this case the final halting time is long enough that we’ve had to squash the image vertically (by a factor of 5):

The fitness curve in this case is best viewed on a logarithmic scale:

But while the largest-lifetime cellular automata that we saw above typically seemed to have very complex behavior, the largest-lifetime Turing machine here seems, at least on the face of it, to operate in a much more “systematic” and “mechanical” way. And indeed this becomes even more evident if we compress our visualization by looking only at steps on which the Turing machine head reverses its direction:

Long-lifetime Turing machines found by adaptive evolution are not always so simple, though they still tend to show more regularity than long-lifetime cellular automata:

But—presumably because Turing machines are “less efficient” than cellular automata—the very longest possible lifetimes can be very large. It’s not clear whether rules with such lifetimes can be found by adaptive evolution—not least because even to evaluate the fitness function for any particular candidate rule could take an unbounded time. And indeed among s = 3, k = 3 rules the very longest possible is about 10¹⁷ steps—achieved by the rule

with the following “very pedantic behavior”:

So what about multiway evolution graphs? There are a total of 20,736 s = 2, k = 2 Turing machines with halting states allowed. From these there are 37 distinct finite-lifetime phenotypes:

Just as in other cases we’ve investigated, there are fitness-neutral sets such as:

Taking just one representative from each of these 18 sets, we can then construct a multiway evolution graph for 2,2 Turing machines with lifetime as our fitness function:

Here’s the analogous result for 3,2 Turing machines—with2250distinct phenotypes, and a maximum lifetime of 21 steps (and the patterns produced by the machines just show by “slabs”):

We could pick other fitness functions (like maximum pattern width, number of head reversals, etc.) But the basic structure and consequences of adaptive evolution seem to work very much the same in Turing machines as in cellular automata—much as we expect from the Principle of Computational Equivalence.

Multiway Turing Machines

Ordinary Turing machines (as well as ordinary cellular automata) in effect always follow a single path of history, producing a definite sequence of states based on their underlying rule. But it’s also possible to study multiway Turing machines in which many paths of history can be followed. Consider for example the rule:

The case in this rule has two possible outcomes—so this is a multiway system, and to represent its behavior we need a multiway graph:

From a biological point of view, we can potentially think of such a multiway system as an idealized model for a process of adaptive evolution. So now we can ask: can we evolve this evolution? Or, in other words, can we apply adaptive evolution to systems like multiway Turing machines?

As an example, let’s assume that we make single point mutation changes to just one case in a multiway Turing machine rule:

Many multiway Turing machines won’t halt, or at least won’t halt on all their branches. But for our fitness function let’s assume we require multiway Turing machines to halt on all branches (or at least go into loops that revisit the same states), and then let’s take the fitness to be the total number of nodes in the multiway graph when everything has halted. (And, yes, this is a direct generalization of our lifetime fitness function for ordinary Turing machines.)

So with this setup here are some examples of sequences of “breakthroughs” in adaptive evolution processes:

But what about looking at all possible paths of evolution for multiway Turing machines? Or, in other words, what about making a multiway graph of the evolution of multiway Turing machines?

Here’s an example of what we get by doing this (showing at each node just a single example of a fitness-neutral set):

So what’s really going on here? We’ve got a multiway graph of multiway graphs. But it’s worth understanding that the inner and outer multiway graphs are a bit different. The outer one is effectively a rulial multiway graph, in which different parts correspond to following different rules. The inner one is effectively a branchial multiway graph, in which different parts correspond to different ways of applying a particular rule. Ultimately, though, we can at least in principle expect to encode branchial transformations as rulial ones, and vice versa.

So we can think of the adaptive evolution of multiway Turing machines as a first step in exploring “higher-order evolution”: the evolution of evolution, etc. And ultimately in exploring inevitable limits of recursive evolution in the ruliad—and how these might relate to the formation of observers in the ruliad.

Some Conclusions

What does all this mean for the foundations of biological evolution? First and foremost, it reinforces the idea of computational irreducibility as a dominant force in biology. One might have imagined that what we see in biology must have been “carefully sculpted” by fitness constraints (say imposed by the environment). But what we’ve found here suggests that instead much of what we see is actually just a direct reflection of computational irreducibility. And in the end, more than anything else, what biological evolution seems to be doing is to “recruit” lumps of irreducible computation, and set them up so as to achieve “fitness objectives”.

It is, as I recently discovered, very similar to what happens in machine learning. And in both cases this picture implies that there’s a limit to the kind of explanations one can expect to get. If one asks why something has the form it does, the answer will often just be: “because that’s the lump of irreducible computation that happened to be picked up”. And there isn’t any reason to think that there’ll be a “narrative explanation” of the kind one might hope for in traditional science.

The simplicity of models makes it possible to study not just particular possible paths of adaptive evolution, but complete multiway graphs of all possible paths. And what we’ve seen here is that fitness functions in effect define a kind of traversal order or (roughly) foliation for such multiway graphs. If such foliations could be arbitrarily complex, then they could potentially pick out specific outcomes for evolution—in effect successfully “sculpting biology” through the details of natural selection and fitness functions.

But the point is that fitness functions and resulting foliations of multiway evolution graphs don’t get arbitrarily complex. And even as the underlying processes by which phenotypes develop are full of computational irreducibility, the fitness functions that are applied are computationally bounded. And in effect the complexity that is perhaps the single most striking immediate feature of biological systems is therefore a consequence of the interplay between the computational boundedness of selection processes, and the computational irreducibility of underlying processes of growth and development.

All of this relies on the fundamental idea that biological evolution—and biology—are at their core computational phenomena. And given this interpretation, there’s then a remarkable unification that’s emerging.

It begins with the ruliad—the abstract object corresponding to the entangled limit of all possible computational processes. We’ve talked about the ruliad as the ultimate foundation for physics, and for mathematics. And we now see that we can think of it as the ultimate foundation for biology too.

In physics what’s crucial is that observers like us “parse” the ruliad in certain ways—and that these ways lead us to have a perception of the ruliad that follows core known laws of physics. And similarly, when observers like us do mathematics, we can think of ourselves as “extracting that mathematics” from the way we parse the ruliad. And now what we’re seeing is that biology emerges because of the way selection from the environment, etc. “parses” the ruliad.

And what makes this view powerful is that we have to assume surprisingly little about how selection works to still be able to deduce important things about biology. In particular, if we assume that the selection operates in a computationally bounded way, then just from the inevitable underlying computational irreducibility “inherited” from the ruliad, we immediately know that biology must have certain features.

In physics, the Second Law of thermodynamics arises from the interplay of underlying computational irreducibility of mechanical processes involving many particles or other objects, and our computational boundedness as observers. We have the impression that “randomness is increasing” because as computationally bounded observers we can’t “decrypt” the underlying computational irreducibility.

What’s the analog of this in biology? Much as we can’t expect to “say what happens” in a system that follows the Second Law, so we can’t expect to “explain by selection” what happens in a biological system. Or, put another way, much of what we see in biology is just the way it is because of computational irreducibility—and try as we might it won’t be “explainable” by some fitness criterion that we can describe.

But that doesn’t mean that we can’t expect to deduce “general laws of biology”, much as there are general laws about gases whose detailed structure follows the Second Law. And in what we’ve done here we can begin to see some hints of what those general laws might look like.

They’ll be things like bulk statements about possible paths of evolution, and the effect of changing the constraints on them—a bit like laws of fluid mechanics but now applied to the rulial space of possible genotypes. But if there’s one thing that’s clear it’s that the minimal model we’ve developed of biological evolution has remarkable richness and potential. In the past it’s been possible to say things about what amounts to the pure combinatorics of evolution; now we can start talking in a structured way about what evolution actually does. And in doing this we go in the direction of finally giving biology a foundation as a theoretical science.

There’s So Much More to Study!

Even though this is my second long piece about my minimal model of biological evolution, I’ve barely scratched the surface of what can be done with it. First and foremost there are many detailed connections to be made with actual phenomena that have been observed—or could be observed—in biology. But there are also many things to be investigated directly about the model itself—and in effect much ruliology to be done on it. And what’s particularly notable is how accessible a lot of that ruliology is. (And, yes, you can click any picture here to get the Wolfram Language code that generates it.) What are some obvious things to do? Here are few. Investigate other fitness functions. Other rule spaces. Other initial conditions. Other evolution strategies. Investigate evolving both rules and initial conditions. Investigate different kinds of changes of fitness functions during evolution. Investigate the effect of having a much larger rule space. Investigate robustness (or not) to perturbations.

In what I’ve done here, I’ve effectively aggregated identical genotypes (and phenotypes). But one could also investigate what happens if one in effect “traces every individual organism”. The result will be abstract structures that generalize the multiway systems we’ve shown here—and that are associated with higher levels of abstract formalism capable of describing phenomena that in effect go “below species”.

For historical notes see here »

Thanks

Thanks to Wolfram Institute fellows Richard Assar and Nik Murzin for their help, as well as to the supporters of the new Wolfram Institute initiative in theoretical biology. Thanks also to Brad Klee for his help. Related student projects were done at our Summer Programs this year by Brian Mboya, Tadas Turonis, Ahama Dalmia and Owen Xuan.

Since writing my first piece about biological evolution in March, I’ve had occasion to attend two biology conferences: SynBioBeta and WISE (“Workshop on Information, Selection, and Evolution” at the Carnegie Institution). I thank many attendees at both conferences for their enthusiasm and input. Curiously, before the WISE conference in October 2024 the last conference I had attended on biological evolution was more than 40 years earlier: the June 1984 Mountain Lake Conference on Evolution and Development.

On the Nature of Time

Stephen Wolfram — Tue, 08 Oct 2024 21:41:58 +0000

The Computational View of Time

Time is a central feature of human experience. But what actually is it? In traditional scientific accounts it’s often represented as some kind of coordinate much like space (though a coordinate that for some reason is always systematically increasing for us). But while this may be a useful mathematical description, it’s not telling us anything about what time in a sense “intrinsically is”.

We get closer as soon as we start thinking in computational terms. Because then it’s natural for us to think of successive states of the world as being computed one from the last by the progressive application of some computational rule. And this suggests that we can identify the progress of time with the “progressive doing of computation by the universe”.

But does this just mean that we are replacing a “time coordinate” with a “computational step count”? No. Because of the phenomenon of computational irreducibility. With the traditional mathematical idea of a time coordinate one typically imagines that this coordinate can be “set to any value”, and that then one can immediately calculate the state of the system at that time. But computational irreducibility implies that it’s not that easy. Because it says that there’s often essentially no better way to find what a system will do than by explicitly tracing through each step in its evolution.

In the pictures on the left there’s computational reducibility, and one can readily see what state will be after any number of steps t. But in the pictures on the right there’s (presumably) computational irreducibility, so that the only way to tell what will happen after t steps is effectively to run all those steps:

And what this implies is that there’s a certain robustness to time when viewed in these computational terms. There’s no way to “jump ahead” in time; the only way to find out what will happen in the future is to go through the irreducible computational steps to get there.

There are simple idealized systems (say with purely periodic behavior) where there’s computational reducibility, and where there isn’t any robust notion of the progress of time. But the point is that—as the Principle of Computational Equivalence implies—our universe is inevitably full of computational irreducibility which in effect defines a robust notion of the progress of time.

The Role of the Observer

That time is a reflection of the progress of computation in the universe is an important starting point. But it’s not the end of the story. For example, here’s an immediate issue. If we have a computational rule that determines each successive state of a system it’s at least in principle possible to know the whole future of the system. So given this why then do we have the experience of the future only “unfolding as it happens”?

It’s fundamentally because of the way we are as observers. If the underlying system is computationally irreducible, then to work out its future behavior requires an irreducible amount of computational work. But it’s a core feature of observers like us that we are computationally bounded. So we can’t do all that irreducible computational work to “know the whole future”—and instead we’re effectively stuck just doing computation alongside the system itself, never able to substantially “jump ahead”, and only able to see the future “progressively unfold”.

In essence, therefore, we experience time because of the interplay between our computational boundedness as observers, and the computational irreducibility of underlying processes in the universe. If we were not computationally bounded, we could “perceive the whole of the future in one gulp” and we wouldn’t need a notion of time at all. And if there wasn’t underlying computational irreducibility there wouldn’t be the kind of “progressive revealing of the future” that we associate with our experience of time.

A notable feature of our everyday perception of time is that it seems to “flow only in one direction”—so that for example it’s generally much easier to remember the past than to predict the future. And this is closely related to the Second Law of thermodynamics, which (as I’ve argued at length elsewhere) is once again a result of the interplay between underlying computational irreducibility and our computational boundedness. Yes, the microscopic laws of physics may be reversible (and indeed if our system is simple—and computationally reducible—enough of this reversibility may “shine through”). But the point is that computational irreducibility is in a sense a much stronger force.

Imagine that we prepare a state to have orderly structure. If its evolution is computationally irreducible then this structure will effectively be “encrypted” to the point where a computationally bounded observer can’t recognize the structure. Given underlying reversibility, the structure is in some sense inevitably “still there”—but it can’t be “accessed” by a computationally bounded observer. And as a result such an observer will perceive a definite flow from orderliness in what is prepared to disorderliness in what is observed. (In principle one might think it should be possible to set up a state that will “behave antithermodynamically”—but the point is that to do so would require predicting a computationally irreducible process, which a computationally bounded observer can’t do.)

One of the longstanding confusions about the nature of time has to do with its “mathematical similarity” to space. And indeed ever since the early days of relativity theory it’s seemed convenient to talk about “spacetime” in which notions of space and time are bundled together.

But in our Physics Project that’s not at all how things fundamentally work. At the lowest level the state of the universe is represented by a hypergraph which captures what can be thought of as the “spatial relations” between discrete “atoms of space”. Time then corresponds to the progressive rewriting of this hypergraph.

And in a sense the “atoms of time” are the elementary “rewriting events” that occur. If the “output” from one event is needed to provide “input” to another, then we can think of the first event as preceding the second event in time—and the events as being “timelike separated”. And in general we can construct a causal graph that shows the dependencies between different events.

So how does this relate to time—and spacetime? As we’ll discuss below, our everyday experience of time is that it follows a single thread. And so we tend to want to “parse” the causal graph of elementary events into a series of slices that we can view as corresponding to “successive times”. As in standard relativity theory, there typically isn’t a unique way to assign a sequence of such “simultaneity surfaces”, with the result that there are different “reference frames” in which the identifications of space and time are different.

The complete causal graph bundles together what we usually think of as space with what we usually think of as time. But ultimately the progress of time is always associated with some choice of successive events that “computationally build on each other”. And, yes, it’s more complicated because of the possibilities of different choices. But the basic idea of the progress of time as “the doing of computation” is very much the same. (In a sense time represents “computational progress” in the universe, while space represents the “layout of its data structure”.)

Very much as in the derivation of the Second Law (or of fluid mechanics from molecular dynamics), the derivation of Einstein’s equations for the large-scale behavior of spacetime from the underlying causal graph of hypergraph rewriting depends on the fact that we are computationally bounded observers. But even though we’re computationally bounded, we still have to “have something going on inside”, or we wouldn’t record—or sense—any “progress in time”.

It seems to be the essence of observers like us—as captured in my recent Observer Theory—that we equivalence many different states of the world to derive our internal perception of “what’s going on outside”. And at some rough level we might imagine that we’re sensing time passing by the rate at which we add to those internal perceptions. If we’re not adding to the perceptions, then in effect time will stop for us—as happens if we’re asleep, anesthetized or dead.

It’s worth mentioning that in some extreme situations it’s not the internal structure of the observer that makes perceived time stop; instead it’s the underlying structure of the universe itself. As we’ve mentioned, the “progress of the universe” is associated with successive rewriting of the underlying hypergraph. But when there’s been “too much activity in the hypergraph” (which physically corresponds roughly to too much energy-momentum), one can end up with a situation in which “there are no more rewrites that can be done”—so that in effect some part of the universe can no longer progress, and “time stops” there. It’s analogous to what happens at a spacelike singularity (normally associated with a black hole) in traditional general relativity. But now it has a very direct computational interpretation: one’s reached a “fixed point” at which there’s no more computation to do. And so there’s no progress to make in time.

Multiple Threads of Time

Our strong human experience is that time progresses as a single thread. But now our Physics Project suggests that at an underlying level time is actually in effect multithreaded, or, in other words, that there are many different “paths of history” that the universe follows. And it is only because of the way we as observers sample things that we experience time as a single thread.

At the level of a particular underlying hypergraph the point is that there may be many different updating events that can occur, and each sequence of such updating event defines a different “path of history”. We can summarize all these paths of history in a multiway graph in which we merge identical states that arise:

But given this underlying structure, why is it that we as observers believe that time progresses as a single thread? It all has to do with the notion of branchial space, and our presence within branchial space. The presence of many paths of history is what leads to quantum mechanics; the fact that we as observers ultimately perceive just one path is associated with the traditionally-quite-mysterious phenomenon of “measurement” in quantum mechanics.

When we talked about causal graphs above, we said that we could “parse” them as a series of “spacelike” slices corresponding to instantaneous “states of space”—represented by spatial hypergraphs. And by analogy we can similarly imagine breaking multiway graphs into “instantaneous slices”. But now these slices don’t represent states of ordinary space; instead they represent states of what we call branchial space.

Ordinary space is “knitted together” by updating events that have causal effects on other events that can be thought of as “located at different places in space”. (Or, said differently, space is knitted together by the overlaps of the elementary light cones of different events.) Now we can think of branchial space as being “knitted together” by updating events that have effects on events that end up on different branches of history.

(In general there is a close analogy between ordinary space and branchial space, and we can define a multiway causal graph that includes both “spacelike” and “branchlike” directions—with the branchlike direction supporting not light cones but what we can call entanglement cones.)

So how do we as observers parse what’s going on? A key point is that we are inevitably part of the system we’re observing. So the branching (and merging) that’s going on in the system at large is also going on in us. So that means we have to ask how a “branching mind” will perceive a branching universe. Underneath, there are lots of branches, and lots of “threads of history”. And there’s lots of computational irreducibility (and even what we can call multicomputational irreducibility). But computationally bounded observers like us have to equivalence most of those details to wind up with something that “fits in our finite minds”.

We can make an analogy to what happens in a gas. Underneath, there are lots of molecules bouncing around (and behaving in computationally irreducible ways). But observers like us are big compared to molecules, and (being computationally bounded) we don’t get to perceive their individual behavior, but only their aggregate behavior—from which we extract a thin set of computationally reducible “fluid-dynamics-level” features.

And it’s basically the same story with the underlying structure of space. Underneath, there’s an elaborately changing network of discrete atoms of space. But as large, computationally bounded observers we can only sample aggregate features in which many details have been equivalenced, and in which space tends to seem continuous and describable in basically computationally reducible ways.

So what about branchial space? Well, it’s basically the same story. Our minds are “big”, in the sense that they span many individual branches of history. And they’re computationally bounded so they can’t perceive the details of all those branches, but only certain aggregated features. And in a first approximation what then emerges is in effect a single aggregated thread of history.

With sufficiently careful measurements we can sometimes see “quantum effects” in which multiple threads of history are in evidence. But at a direct human level we always seem to aggregate things to the point where what we perceive is just a single thread of history—or in effect a single thread of progression in time.

It’s not immediately obvious that any of these “aggregations” will work. It could be that important effects we perceive in gases would depend on phenomena at the level of individual molecules. Or that to understand the large-scale structure of space we’d continually be having to think about detailed features of atoms of space. Or, similarly, that we’d never be able to maintain a “consistent view of history”, and that instead we’d always be having to trace lots of individual threads of history.

But the key point is that for us to stay as computationally bounded observers we have to pick out only features that are computationally reducible—or in effect boundedly simple to describe.

Closely related to our computational boundedness is the important assumption we make that we as observers have a certain persistence. At every moment in time, we are made from different atoms of space and different branches in the multiway graph. Yet we believe we are still “the same us”. And the crucial physical fact (that has to be derived in our model) is that in ordinary circumstances there’s no inconsistency in doing this.

So the result is that even though there are many “threads of time” at the lowest level—representing many different “quantum branches”—observers like us can (usually) successfully still view there as being a single consistent perceived thread of time.

But there’s another issue here. It’s one thing to say that a single observer (say a single human mind or a single measuring device) can perceive history to follow a single, consistent thread. But what about different human minds, or different measuring devices? Why should they perceive any kind of consistent “objective reality”?

Essentially the answer, I think, is that they’re all sufficiently nearby in branchial space. If we think about physical space, observers in different parts of the universe will clearly “see different things happening”. The “laws of physics” may be the same—but what star (if any) is nearby will be different. Yet (at least for the foreseeable future) for all of us humans it’s always the same star that’s nearby.

And so it is, presumably, in branchial space. There’s some small patch in which we humans—with our shared origins—exist. And it’s presumably because that patch is small relative to all of branchial space that all of us perceive a consistent thread of history and a common objective reality.

There are many subtleties to this, many of which aren’t yet fully worked out. In physical space, we know that effects can in principle spread at the speed of light. And in branchial space the analog is that effects can spread at the maximum entanglement speed (whose value we don’t know, though it’s related by Planck unit conversions to the elementary length and elementary time). But in maintaining our shared “objective” view of the universe it’s crucial that we’re not all going off in different directions at the speed of light. And of course the reason that doesn’t happen is that we don’t have zero mass. And indeed presumably nonzero mass is a critical part of being observers like us.

In our Physics Project it’s roughly the density of events in the hypergraph that determines the density of energy (and mass) in physical space (with their associated gravitational effects). And similarly it’s roughly the density of events in the multiway graph (or in branchial graph slices) that determines the density of action—the relativistically invariant analog of energy—in branchial space (with its associated effects on quantum phase). And though it’s not yet completely clear how this works, it seems likely that once again when there’s mass, effects don’t just “go off at the maximum entanglement speed in all directions”, but instead stay nearby.

There are definitely connections between “staying at the same place”, believing one is persistent, and being computationally bounded. But these are what seem necessary for us to have our typical view of time as a single thread. In principle we can imagine observers very different from us—say with minds (like the inside of an idealized quantum computer) capable of experiencing many different threads of history. But the Principle of Computational Equivalence suggests that there’s a high bar for such observers. They need not only to be able to deal with computational irreducibility but also multicomputational irreducibility, in which one includes both the process of computing new states, and the process of equivalencing states.

And so for observers that are “anything like us” we can expect that once again time will tend to be as we normally experience it, following a single thread, consistent between observers.

(It’s worth mentioning that all of this only works for observers like us “in situations like ours”. For example, at the “entanglement horizon” for a black hole—where branchially-oriented edges in the multiway causal graph get “trapped”—time as we know it in some sense “disintegrates”, because an observer won’t be able to “knit together” the different branches of history to “form a consistent classical thought” about what happens.)

Time in the Ruliad

In what we’ve discussed so far we can think of the progress of time as being associated with the repeated application of rules that progressively “rewrite the state of the universe”. In the previous section we saw that these rules can be applied in many different ways, leading to many different underlying threads of history.

But so far we’ve imagined that the rules that get applied are always the same—leaving us with the mystery of “Why those rules, and not others?” But this is where the ruliad comes in. Because the ruliad involves no such seemingly arbitrary choices: it’s what you get by following all possible computational rules.

One can imagine many bases for the ruliad. One can make it from all possible hypergraph rewritings. Or all possible (multiway) Turing machines. But in the end it’s a single, unique thing: the entangled limit of all possible computational processes. There’s a sense in which “everything can happen somewhere” in the ruliad. But what gives the ruliad structure is that there’s a definite (essentially geometrical) way in which all those different things that can happen are arranged and connected.

So what is our perception of the ruliad? Inevitably we’re part of the ruliad—so we’re observing it “from the inside”. But the crucial point is that what we perceive about it depends on what we are like as observers. And my big surprise in the past few years has been that assuming even just a little about what we’re like as observers immediately implies that what we perceive of the ruliad follows the core laws of physics we know. In other words, by assuming what we’re like as observers, we can in effect derive our laws of physics.

The key to all this is the interplay between the computational irreducibility of underlying behavior in the ruliad, and our computational boundedness as observers (together with our related assumption of our persistence). And it’s this interplay that gives us the Second Law in statistical mechanics, the Einstein equations for the structure of spacetime, and (we think) the path integral in quantum mechanics. In effect what’s happening is that our computational boundedness as observers makes us equivalence things to the point where we are sampling only computationally reducible slices of the ruliad, whose characteristics can be described using recognizable laws of physics.

So where does time fit into all of this? A central feature of the ruliad is that it’s unique—and everything about it is “abstractly necessary”. Much as given the definition of numbers, addition and equality it’s inevitable that one gets 1 + 1 = 2, so similarly given the definition of computation it’s inevitable that one gets the ruliad. Or, in other words, there’s no question about whether the ruliad exists; it’s just an abstract construct that inevitably follows from abstract definitions.

And so at some level this means that the ruliad inevitably just “exists as a complete thing”. And so if one could “view it from outside” one could think of it as just a single timeless object, with no notion of time.

But the crucial point is that we don’t get to “view it from the outside”. We’re embedded within it. And, what’s more, we must view it through the “lens” of our computational boundedness. And this is why we inevitably end up with a notion of time.

We observe the ruliad from some point within it. If we were not computationally bounded then we could immediately compute what the whole ruliad is like. But in actuality we can only discover the ruliad “one computationally bounded step at a time”—in effect progressively applying bounded computations to “move through rulial space”.

So even though in some abstract sense “the whole ruliad is already there” we only get to explore it step by step. And that’s what gives us our notion of time, through which we “progress”.

Inevitably, there are many different paths that we could follow through the ruliad. And indeed every mind (and every observer like us)—with its distinct inner experience—presumably follows a different path. But much as we described for branchial space, the reason we have a shared notion of “objective reality” is presumably that we are all very close together in rulial space; we form in a sense a tight “rulial flock”.

It’s worth pointing out that not every sampling of the ruliad that may be accessible to us conveniently corresponds to exploration of progressive slices of time. Yes, that kind of “progression in time” is characteristic of our physical experience, and our typical way of describing it. But what about our experience, say, of mathematics?

The first point to make is that just as the ruliad contains all possible physics, it also contains all possible mathematics. If we construct the ruliad, say from hypergraphs, the nodes are now not “atoms of space”, but instead abstract elements (that in general we call emes) that form pieces of mathematical expressions and mathematical theorems. We can think of these abstract elements as being laid out now not in physical space, but in some abstract metamathematical space.

In our physical experience, we tend to remain localized in physical space, branchial space, etc. But in “doing mathematics” it’s more as if we’re progressively expanding in metamathematical space, carving out some domain of “theorems we assume are true”. And while we could identify some kind of “path of expansion” to let us define some analog of time, it’s not a necessary feature of the way we explore the ruliad.

Different places in the ruliad in a sense correspond to describing things using different rules. And by analogy to the concept of motion in physical space, we can effectively “move” from one place to another in the ruliad by translating the computations done by one set of rules to computations done by another. (And, yes, it’s nontrivial to even have the possibility of “pure motion”.) But if we indeed remain localized in the ruliad (and can maintain what we can think of as our “coherent identity”) then it’s natural to think of there being a “path of motion” along which we progress “with time”. But when we’re just “expanding our horizons” to encompass more paradigms and to bring more of rulial space into what’s covered by our minds (so that in effect we’re “expanding in rulial space”), it’s not really the same story. We’re not thinking of ourselves as “doing computation in order to move”. Instead, we’re just identifying equivalences and using them to expand our definition of ourselves, which is something that we can at least approximate (much like in “quantum measurement” in traditional physics) as happening “outside of time”. Ultimately, though, everything that happens must be the result of computations that occur. It’s just that we don’t usually “package” these into what we can describe as a definite thread of time.

So What in the End Is Time?

From the paradigm (and Physics Project ideas) that we’ve discussed here, the question “What is time?” is at some level simple: time is what progresses when one applies computational rules. But what’s critical is that time can in effect be defined abstractly, independent of the details of those rules, or the “substrate” to which they’re applied. And what makes this possible is the Principle of Computational Equivalence, and the ubiquitous phenomenon of computational irreducibility that it implies.

To begin with, the fact that time can robustly be thought of as “progressing”, in effect in a linear chain, is a consequence of computational irreducibility—because computational irreducibility is what tells us that computationally bounded observers like us can’t in general ever “jump ahead”; we just have to follow a linear chain of steps.

But there’s something else as well. The Principle of Computational Equivalence implies that there’s in a sense just one (ubiquitous) kind of computational irreducibility. So when we look at different systems following different irreducible computational rules, there’s inevitably a certain universality to what they do. In effect they’re all “accumulating computational effects” in the same way. Or in essence progressing through time in the same way.

There’s a close analogy here with heat. It could be that there’d be detailed molecular motion that even on a large scale worked noticeably differently in different materials. But the fact is that we end up being able to characterize any such motion just by saying that it represents a certain amount of heat, without getting into more details. And that’s very much the same kind of thing as being able to say that such-and-such an amount of time has passed, without having to get into the details of how some clock or other system that reflects the passage of time actually works.

And in fact there’s more than a “conceptual analogy” here. Because the phenomenon of heat is again a consequence of computational irreducibility. And the fact that there’s a uniform, “abstract” characterization of it is a consequence of the universality of computational irreducibility.

It’s worth emphasizing again, though, that just as with heat, a robust concept of time depends on us being computationally bounded observers. If we were not, then we’d able to break the Second Law by doing detailed computations of molecular processes, and we wouldn’t just describe things in terms of randomness and heat. And similarly, we’d be able to break the linear flow of time, either jumping ahead or following different threads of time.

But as computationally bounded observers of computationally irreducible processes, it’s basically inevitable that—at least to a good approximation—we’ll view time as something that forms a single one-dimensional thread.

In traditional mathematically based science there’s often a feeling that the goal should be to “predict the future”—or in effect to “outrun time”. But computational irreducibility tells us that in general we can’t do this, and that the only way to find out what will happen is just to run the same computation as the system itself, essentially step by step. But while this might seem like a letdown for the power of science, we can also see it as what gives meaning and significance to time. If we could always jump ahead then at some level nothing would ever fundamentally be achieved by the passage of time (or, say, by the living of our lives); we’d always be able to just say what will happen, without “living through” how we got there. But computational irreducibility gives time and the process of it passing a kind of hard, tangible character.

So what does all this imply for the various classic issues (and apparent paradoxes) that arise in the way time is usually discussed?

Let’s start with the question of reversibility. The traditional laws of physics basically apply both forwards and backwards in time. And the ruliad is inevitably symmetrical between “forward” and “backward” rules. So why is it then that in our typical experience time always seems to “run in the same direction”?

This is closely related to the Second Law, and once again it’s consequence of our computational boundedness interacting with underlying computational irreducibility. In a sense what defines the direction of time for us is that we (typically) find it much easier to remember the past than to predict the future. Of course, we don’t remember every detail of the past. We only remember what amounts to certain “filtered” features that “fit in our finite minds”. And when it comes to predicting the future, we’re limited by our inability to “outrun” computational irreducibility.

Let’s recall how the Second Law works. It basically says that if we set up some state that’s “ordered” or “simple” then this will tend to “degrade” to one that’s “disordered” or “random”. (We can think of the evolution of the system as effectively “encrypting” the specification of our starting state to the point where we—as computationally bounded observers—can no longer recognize its ordered origins.) But because our underlying laws are reversible, this degradation (or “encryption”) must happen when we go both forwards and backwards in time:

But the point is that our “experiential” definition of the direction of time (in which the “past” is what we remember, and the “future” is what we find hard to predict) is inevitably aligned with the “thermodynamic” direction of time we observe in the world at large. And the reason is that in both cases we’re defining the past to be something that’s computationally bounded (while the future can be computationally irreducible). In the experiential case the past is computationally bounded because that’s what we can remember. In the thermodynamic case it’s computationally bounded because those are the states we can prepare. In other words, the “arrows of time” are aligned because in both cases we are in effect “requiring the past to be simpler”.

So what about time travel? It’s a concept that seems natural—and perhaps even inevitable—if one imagines that “time is just like space”. But it becomes a lot less natural when we think of time in the way we’re doing here: as a process of applying computational rules.

Indeed, at the lowest level, these rules are by definition just sequentially applied, producing one state after another—and in effect “progressing in one direction through time”. But things get more complicated if we consider not just the raw, lowest-level rules, but what we might actually observe of their effects. For example, what if the rules lead to a state that’s identical to one they’ve produced before (as happens, for example, in a system with periodic behavior)? If we equivalence the state now and the state before (so we represent both as a single state) then we can end up with a loop in our causal graph (a “closed timelike curve”). And, yes, in terms of the raw sequence of applying rules these states can be considered different. But the point is that if they are identical in every feature then any observer will inevitably consider them the same.

But will such equivalent states ever actually occur? As soon as there’s computational irreducibility it’s basically inevitable that the states will never perfectly match up. And indeed for the states to contain an observer like us (with “memory”, etc.) it’s basically impossible that they can match up.

But can one imagine an observer (or a “timecraft”) that would lead to states that match up? Perhaps somehow it could carefully pick particular sequences of atoms of space (or elementary events) that would lead it to states that have “happened before”. And indeed in a computationally simple system this might be possible. But as soon as there’s computational irreducibility, this simply isn’t something one can expect any computationally bounded observer to be able to do. And, yes, this is directly analogous to why one can’t have a “Maxwell’s demon” observer that “breaks the Second Law”. Or why one can’t have something that carefully navigates the lowest-level structure of space to effectively travel faster than light.

But even if there can’t be time travel in which “time for an observer goes backwards”, there can still be changes in “perceived time”, say as a result of relativistic effects associated with motion. For example, one classic relativistic effect is time dilation, in which “time goes slower” when objects go faster. And, yes, given certain assumptions, there’s a straightforward mathematical derivation of this effect. But in our effort to understand the nature of time we’re led to ask what its physical mechanism might be. And it turns out that in our Physics Project it has a surprisingly direct—and almost “mechanical”—explanation.

One starts from the fact that in our Physics Project space and everything in it is represented by a hypergraph which is continually getting rewritten. And the evolution of any object through time is then defined by these rewritings. But if the object moves, then in effect it has to be “re-created at a different place in space”—and this process takes up a certain number of rewritings, leaving fewer for the intrinsic evolution of the object itself, and thus causing time to effectively “run slower” for it. (And, yes, while this is a qualitative description, one can make it quite formal and precise, and recover the usual formulas for relativistic time dilation.)

Something similar happens with gravitational fields. In our Physics Project, energy-momentum (and thus gravity) is effectively associated with greater activity in the underlying hypergraph. And the presence of this greater activity leads to more rewritings, causing “time to run faster” for any object in that region of space (corresponding to the traditional “gravitational redshift”).

More extreme versions of this occur in the context of black holes. (Indeed, one can roughly think of spacelike singularities as places where “time ran so fast that it ended”.) And in general—as we discussed above—there are many “relativistic effects” in which notions of space and time get mixed in various ways.

But even at a much more mundane level there’s a certain crucial relationship between space and time for observers like us. The key point is that observers like us tend to “parse” the world into a sequence of “states of space” at successive “moments in time”. But the fact that we do this depends on some quite specific features of us, and in particular our effective physical scale in space as compared to time.

In our everyday life we’re typically looking at scenes involving objects that are perhaps tens of meters away from us. And given the speed of light that means photons from these objects get to us in less than a microsecond. But it takes our brains milliseconds to register what we’ve seen. And this disparity of timescales is what leads us to view the world as consisting of a sequence of states of space at successive moments in time.

If our brains “ran” a million times faster (i.e. at the speed of digital electronics) we’d perceive photons arriving from different parts of a scene at different times, and we’d presumably no longer view the world in terms of overall states of space existing at successive times.

The same kind of thing would happen if we kept the speed of our brains the same, but dealt with scenes of a much larger scale (as we already do in dealing with spacecraft, astronomy, etc.).

But while this affects what it is that we think time is “acting on”, it doesn’t ultimately affect the nature of time itself. Time remains that computational process by which successive states of the world are produced. Computational irreducibility gives time a certain rigid character, at least for computationally bounded observers like us. And the Principle of Computational Equivalence allows there to be a robust notion of time independent of the “substrate” that’s involved: whether us as observers, the everyday physical world, or, for that matter, the whole universe.

Nestedly Recursive Functions

Stephen Wolfram — Fri, 27 Sep 2024 17:50:59 +0000

Yet Another Ruliological Surprise

Integers. Addition. Subtraction. Maybe multiplication. Surely that’s not enough to be able to generate any serious complexity. In the early 1980s I had made the very surprising discovery that very simple programs based on cellular automata could generate great complexity. But how widespread was this phenomenon?

At the beginning of the 1990s I had set about exploring this. Over and over I would consider some type of system and be sure it was too simple to “do anything interesting”. And over and over again I would be wrong. And so it was that on the night of August 13, 1993, I thought I should check what could happen with integer functions defined using just addition and subtraction.

I knew, of course, about defining functions by recursion, like Fibonacci:

But could I find something like this that would have complex behavior? I did the analog of what I have done so many times, and just started (symbolically) enumerating possible definitions. And immediately I saw cases with nested functions, like:

(For some reason I wanted to keep the same initial conditions as Fibonacci: f[1] = f[2] = 1.) What would functions like this do? My original notebook records the result in this case:

But a few minutes later I found something very different: a simple nestedly recursive function with what seemed like highly complex behavior:

I remembered seeing a somewhat similarly defined function discussed before. But the behavior I’d seen reported for that function, while intricate, was nested and ultimately highly regular. And, so far as I could tell, much like with rule 30 and all the other systems I’d investigated, nobody had ever seen serious complexity in simple recursive functions.

It was a nice example. But it was one among many. And when I published A New Kind of Science in 2002, I devoted just four pages (and 7 notes) to “recursive sequences”—even though the gallery I made of their behavior became a favorite page of mine:

A year after the book was published we held our first Wolfram Summer School, and as an opening event I decided to do a live computer experiment—in which I would try to make a real-time science discovery. The subject I chose was nestedly recursive functions. It took a few hours. But then, yes, we made a discovery! We found that there was a nestedly recursive function simpler than the ones I’d discussed in A New Kind of Science that already seemed to have very complex behavior:

Over the couple of decades that followed I returned many times to nestedly recursive functions—particularly in explorations I did with high school and other students, or in suggestions I made for student projects. Then recently I used them several times as “intuition-building examples” in various investigations.

I’d always felt my work with nestedly recursive functions was unfinished. Beginning about five years ago—particularly energized by our Physics Project—I started looking at harvesting seeds I’d sown in A New Kind of Science and before. I’ve been on quite a roll, with a few pages or even footnotes repeatedly flowering into rich book-length stories. And finally—particularly after my work last year on “Expression Evaluation and Fundamental Physics”—I decided it was time to try to finish my exploration of nestedly recursive functions.

Our modern Wolfram Language tools—as well as ideas from our Physics Project—provided some new directions to explore. But I still thought I pretty much knew what we’d find. And perhaps after all these years I should have known better. Because somehow in the computational universe—and in the world of ruliology—there are always surprises.

And here, yet again, there was indeed quite a surprise.

The Basic Idea

Consider the definition (later we’ll call this “P312”)

which we can also write as:

The first few values for f[n] generated from this definition are:

Continuing further we get:

But how are these values actually computed? To see that we can make an “evaluation graph” in which we show how each value of f[n] is computed from ones with smaller values of n, here starting from f[20]:

The gray nodes represent initial conditions: places where f[n] was sampled for n ≤ 0. The two different colors of edges correspond to the two different computations done in evaluating each f[n]:

Continuing to f[30] we get:

But what’s the structure of this graph? If we pull out the “red” graph on its own, we can see that it breaks into two path graphs, that consist of the sequences of the f[n] for odd and even n, respectively:

The “blue” graph, on the other hand, breaks into four components—each always a tree—leading respectively to the four different initial conditions:

And for example we can now plot f[n], showing which tree each f[n] ends up being associated with:

We’ll be using this same basic setup throughout, though for different functions. We’ll mostly consider recursive definitions with a single term (i.e. with a single “outermost f”, not two, as in Fibonacci recurrences).

The specific families of recursive functions we’ll be focusing on are:

And with this designation, the function we just introduced is P312.

A Closer Look at P312 ( f[n_] := 3 + f[n – f[n – 2]] )

Let’s start off by looking in more detail at the function we just introduced. Here’s what it does up to n = 500:

It might seem as if it’s going to go on “seemingly randomly” forever. But if we take it further, we get a surprise: it seems to “resolve itself” to something potentially simpler:

What’s going on? Let’s plot this again, but now showing which “blue graph tree” each value is associated with:

And now what we see is that the f[–3] and f[–2] trees stop contributing to f[n] when n is (respectively) 537 and 296, and these trees are finite (and have sizes 53 and 15):

The overall structures of the “remaining” trees—here shown up to f[5000]—eventually start to exhibit some regularity:

We can home in on this regularity by arranging these trees in layers, starting from the root, then plotting the number of nodes in each successive layer:

Looking at these pictures suggests that there should be some kind of more-or-less direct “formula” for f[n], at least for large n. They also suggest that such a formula should have some kind of mod-6 structure. And, yes, there does turn out to be essentially a “formula”. Though the “formula” is quite complicated—and reminiscent of several other “strangely messy” formulas in other ruliological cases—like Turing machine 600720 discussed in A New Kind of Science or combinator s[s[s]][s][s][s][s].

Later on, we’ll see the much simpler recursive function P111 (f[n_] := 1 + f[n – f[n – 1]]). The values for this function form a sequence in which successive blocks of length k have value k:

P312 has the same kind of structure, but much embellished. First, it has 6 separate riffled (“mod”) subsequences. Each subsequence then consists of a sequence of blocks. Given a value n, this computes which subsequence this is on, which block for that subsequence it’s in, and where it is within that block:

So, for example, here are results for multiples of 1000:

For n = 1000 we’re not yet in the “simple” regime, we can’t describe the sequence in any simple way, and our “indices” calculation is meaningless. For n = 2000 it so happens that we are at block 0 for the mod-1 subsequence. And the way things are set up, we just start by giving exactly the form of block 0 for each mod. So for mod 1 the block is:

But now n = 2000 has offset 16 within this block, so the final value of f[2000] is simply the 16th value from this list, or 100. f[2001] is then simply the next element within this block, or 109. And so on—until we reach the end of the block.

But what if we’re not dealing with block 0? For example, according to the table above, f[3000] is determined by mod-3 block 1. It turns out there’s a straightforward, if messy, way to compute any block b (for mod m):

So now we have a way to compute the value, say of f[3000], effectively just by “evaluating a formula”:

And what’s notable is that this evaluation doesn’t involve any recursion. In other words, at the cost of “messiness” we’ve—somewhat surprisingly—been able to unravel all the recursion in P312 to arrive at a “direct formula” for the value of f[n] for any n.

So what else can we see about the behavior of f[n] for P312? One notable feature is its overall growth rate. For large n, it turns out that (as can be seen by substituting this form into the recursive definition and taking a limit):

One thing this means is that our evaluation graph eventually has a roughly conical form:

This can be compared to the very regular cone generated by P111 (which has asymptotic value ):

If one just looks at the form of the recursive definition for P312 it’s far from obvious “how far back” it will need to probe, or, in other words, what values of f[n] one will need to specify as initial conditions. As it turns out, though, the only values needed are f[–3], f[–2], f[–1] and f[0].

How can one see this? In 3 + f[n – f[n – 2]] it’s only the outer f that can probe “far back” values. But how far it actually goes back depends on how much larger f[n – 2] gets compared to n. Plotting f[n – 2] and n together we have:

And the point is that only for very few values of n does f[n – 2] exceed n—and it’s these values that probe back. Meanwhile, for larger n, there can never be additional “lookbacks”, because f[n] only grows like .

So does any P312 recursion always have the same lookback? So far, we’ve considered specifically the initial condition f[n] = 1 for all n ≤ 0. But what if we change the value of f[0]? Here are plots of f[n] for different cases:

And it turns out that with f[0] = z, the lookback goes to –z for z ≥ 3, and to z – 4 for 1 ≤ z ≤ 2.

(If z ≤ 0 the function f[n] is basically not defined, because the recursion is trying to compute f[n] from f[n], f[n + 1], etc., so never “makes progress”.)

The case f[0] = 2 (i.e. z = 2) is the one that involves the least lookback—and a total of 3 initial values. Here is the evaluation graph in this case:

By comparison, here is the evaluation graph for the case f[0] = 5, involving 6 initial values:

If we plot the value of f[n] as a function of f[0] we get the following:

For n < 3 f[0], f[n] always has simple behavior, and is essentially periodic in n with period 3:

And it turns out that for any specified initial configuration of values, there is always only bounded lookback—with the bound apparently being determined by the largest of the initial values f[n_init].

So what about the behavior of f[n] for large n? Just like in our original f[0] = 1 case, we can construct “blue graph trees” rooted at each of the initial conditions. In the case f[0] = 1 we found that of the 4 trees only two continue to grow as n increases. As we vary f[0], the number of “surviving trees” varies quite erratically:

What if instead of just changing f[0], and keeping all other f[–k] = 1, we set f[n] = s for all n ≤ 0? The result is somewhat surprising:

For s ≥ 2, the behavior turns out to be simple—and similar to the behavior of P111.

So what can P312 be made to do if we change its initial conditions? With f[n] = 2 for n < 0, we see that for small f[0] the behavior remains “tame”, but as f[0] increases it starts showing its typical complexity:

One question to ask is what set of values f[n] takes on. Given that the initial values have certain residues mod 3, all subsequent values must have the same residues. But apart from this constraint, it seems that all values for f[n] are obtained—which is not surprising given that f[n] grows only like .

The “P Family”: f[n_] := a + f[n – b f[n – c]]

P312 is just one example of the “P family” of sequences defined by:

Here is the behavior of some other Pabc sequences:

And here are their evaluation graphs:

P312 is the first “seriously complex” example.

P111 (as mentioned earlier) has a particularly simple form

which corresponds to the simple formula:

The evaluation graph in this case is just:

Only a single initial condition f[0] = 1 is used, and there is only a single “blue graph tree” with a simple form:

Another interesting case is P123:

Picking out only odd values of n we get:

This might look just like the behavior of P111. But it’s not. The lengths of the successive “plateaus” are now

with differences:

But this turns out to be exactly a nested sequence generated by joining together the successive steps in the evolution of the substitution system:

P123 immediately “gets into its final behavior”, even for small n. But—as we saw rather dramatically with P312—there can be “transient behavior” that doesn’t “resolve” until n is large. A smaller case of this phenomenon occurs with P213. Above n = 68 it shows a simple “square root” pattern of behavior, basically like P111. But for smaller n it’s a bit more complicated:

And in this case the transients aren’t due to “blue graph trees” that stop growing. Instead, there are only two trees (associated with f[0] and f[–1]), but both of them soon end up growing in very regular ways:

The “T Family”: f[n_] := a f[n – b f[n – c]]

What happens if our outermost operation is not addition, but multiplication?

Here are some examples of the behavior one gets. In each case we’re plotting on a log scale—and we’re not including T1xx cases, which are always trivial:

We see that some sequences have regular and readily predictable behavior, but others do not. And this is reflected in the evaluation graphs for these functions:

The first “complicated case” is T212:

The evaluation graph for f[50] in this case has the form:

And something that’s immediately notable is that in addition to “looking back” to the values of f[0] and f[–1], this also looks back to the value of f[–24]. Meanwhile, the evaluation graph for f[51] looks back not only to f[0] and f[–1] but also to f[–3] and f[–27]:

How far back does it look in general? Here’s a plot showing which lookbacks are made as a function of n (with the roots of the “blue graph trees” highlighted):

There’s alternation between behaviors for even and odd n. But apart from that, additional lookbacks are just steadily added as n increases—and indeed the total number of lookbacks seems to follow a simple pattern:

But—just for once—if one looks in more detail, it’s not so simple. The lengths of the successive “blocks” are:

So, yes, the lookbacks are quite “unpredictable”. But the main point here is that—unlike for the P family—the number of lookbacks isn’t limited. In a sense, to compute T212 for progressively larger n, progressively more information about its initial conditions is needed.

When one deals with ordinary, unnested recurrence relations, one’s always dealing with a fixed lookback. And the number of initial conditions then just depends on the lookback. (So, for example, the Fibonacci recurrence has lookback 2, so needs two initial conditions, while the standard factorial recurrence has lookback 1, so needs only one initial condition.)

But for the nested recurrence relation T212 we see that this is no longer true; there can be an unboundedly large lookback.

OK, but let’s look back at the actual T212 sequence. Here it is up to larger values of n:

Or, plotting each point as a dot:

Given the recursive definition of f[n], the values of f[n] must always be powers of 2. This shows where each successive power of 2 is first reached as a function of n:

Meanwhile, this shows the accumulated average of f[n] as a function of n:

This is well fit by 0.38 Log[n], implying that, at least with this averaging, f[n] asymptotically approximates n^0.26. And, yes, it is somewhat surprising that what seems like a very “exponential” recursive definition should lead to an f[n] that increases only like a power. But, needless to say, this is the kind of surprise one has to expect in the computational universe.

It’s worth noticing that f[n] fluctuates very intensely as a function of n. The overall distribution of values is very close to exponentially distributed—for example with the distribution of logarithmic values of f[n] for n between 9 million and 10 million being:

What else can we say about this sequence? Let’s say we reduce mod 2 the powers of 2 for each f[n]. Then we get a sequence which starts:

This is definitely not “uniformly random”. But if one look at blocks of sequential values, one can plot at what n each of the 2^b possible configurations of a length-b block first appears:

And eventually it seems as if all length-b blocks for any given b will appear.

By the way, whereas in the P family, there were always a limited number of “blue graph trees” (associated with the limited number of initial conditions), for T212 the number of such trees increases with n, as more initial conditions are used. So, for example, here are the trees for f[50] and f[51]:

We’ve so far discussed T212 only with the initial condition f[n] = 1 for n ≤ 0. The fact that f[n] is always a power of 2 relies on every initial value also being a power of 2. But here’s what happens, for example, if f(n) = 2^s for n ≤ 0:

In general, one can think of T212 as transforming an ultimately infinite sequence of initial conditions into an infinite sequence of function values, with different forms of initial conditions potentially giving very different sequences of function values:

(Note that not all choices of initial conditions are possible; some lead to “f[n] = f[n]” or “f[n] = f[n + 1]” situations, where the evaluation of the function can’t “make progress”.)

The “Summer School” Sequence T311 (f[n_] := 3 f[n – f[n – 1]])

Having explored T212, let’s now look at T311—the original one-term nestedly recursive function discovered at the 2003 Wolfram Summer School:

Here’s its basic behavior:

And here is its evaluation graph—which immediately reveals a lot more lookback than T212:

Plotting lookbacks as a function of n we get:

Much as with T212, the total number of lookbacks varies with n in the fairly simple way (~ 0.44 n):

Continuing the T311 sequence further, it looks qualitatively very much like T212:

And indeed T311—despite its larger number of lookbacks—seems to basically behave like T212. In a story typical of the Principle of Computational Equivalence, T212 seems to have already “filled out the computational possibilities”, so T311 “doesn’t have anything to add”.

The “S Family”: f[n_] := n – f[f[n – a] – b]

As another (somewhat historically motivated) example of nestedly recursive functions, consider what we’ll call the “S family”, defined by:

Let’s start with the very minimal case S10 (or “S1”):

Our standard initial condition f[n] = 1 for n ≤ 0 doesn’t work here, because it implies that f[1] = 1 – f[1]. But if we take f[n] = 1 for n ≤ 1 we get:

Meanwhile, with f[n] = 1 for n ≤ 3 we get:

The first obvious feature of both these results is their overall slope: 1/ϕ ≈ 0.618, where ϕ is the golden ratio. It’s not too hard to see why one gets this slope. Assume that for large n we can take f[n] = σ n. Then substitute this form into both sides of the recursive definition for the S family to get σ n == n – σ (σ (n – a) – b). For large n all that survives is the condition for the coefficients of n

which has solution σ = 1/ϕ.

Plotting f[n] – n/ϕ for the case f[n] = 1 for n ≤ 1 we get:

The evaluation graph is this case has a fairly simple form

as we can see even more clearly with a different graph layout:

It’s notable that only the initial condition f[1] = 1 is used—leading to a single “blue graph tree” that turns out to have a very simple “Fibonacci tree” form (which, as we’ll discuss below, has been known since the 1970s):

From this it follows that f[n] related to the “Fibonacci-like” substitution system

and in fact the sequence of values of f[n] can be computed just as:

And indeed it turns out that in this case f[n] is given exactly by:

What about when f[n] = 1 not just for n ≤ 1 but beyond? For n ≤ 2 the results are essentially the same as for n ≤ 1. But for n ≤ 3 there’s a surprise: the behavior is considerably more complicated—as we can see if we plot f[n] – n/ϕ:

Looking at the evaluation graph in this case we see that the only initial conditions sampled are f[1] = 1 and f[3] = 1 (with f[2] only being reached if one specifically starts with f[2]):

And continuing the evaluation graph we see a mixture of irregularity and comparative regularity:

The plot of f[n] has a strange “hand-drawn” appearance, with overall regularity but detailed apparent randomness. The most obvious large-scale feature is “bursting” behavior (interspersed in an audio rendering with an annoying hum). The bursts all seem to have approximately (though not exactly) the same structure—and get systematically larger. The lengths of successive “regions of calm” between bursts (characterized by runs with Abs[f[n] – n/ϕ] < 3) seem to consistently increase by a factor ϕ:

What happens to S1 with other initial conditions? Here are a few examples:

So how does Sa depend on a? Sometimes there’s at least a certain amount of clear regularity; sometimes it’s more complicated:

As is very common, adding the parameter b in the definition doesn’t seem to lead to fundamentally new behavior—though for b > 0 the initial condition f[n] = 1, n ≤ 0 can be used:

In all cases, only a limited number of initial conditions are sampled (bounded by the value of a + b in the original definition). But as we can see, the behavior can either be quite simple, or can be highly complex.

More Complicated Rules

Highly complex behavior arises even from very simple rules. It’s a phenomenon one sees all over the computational universe. And we’re seeing it here in nestedly recursive functions. But if we make the rules (i.e. definitions) for our functions more complicated, will we see fundamentally different behavior, or just more of the same?

The Principle of Computational Equivalence (as well as many empirical observations of other systems) suggests that it’ll be “more of the same”: that once one’s passed a fairly low threshold the computational sophistication—and complexity—of behavior will no longer change.

And indeed this is what one sees in nestedly recursive functions. But below the threshold different kinds of things can happen with different kinds of rules.

There are several directions in which we can make rules more complicated. One that we won’t discuss here is to use operations (conditional, bitwise, etc.) that go beyond arithmetic. Others tend to involve adding more instances of f in our definitions.

An obvious way to do this is to take f[n_] to be given by a sum of terms, “Fibonacci style”. There are various specific forms one can consider. As a first example—that we can call ab—let’s look at:

The value of a doesn’t seem to matter much. But changing b we see:

12 has unbounded lookback (at least starting with f[n] = 1 for n ≤ 0), but for larger b, 1b has bounded lookback. In both 13 and 15 there is continuing large-scale structure (here visible in log plots)

though this does not seem to be reflected in the corresponding evaluation graphs:

As another level of Fibonacci-style definition, we can consider ab:

But the typical behavior here does not seem much different from what we already saw with one-term definitions involving only two f’s:

(Note that aa is equivalent to a. Cases like 13 lead after a transient to pure exponential growth.)

A somewhat more unusual case is what we can call abc:

Subtracting overall linear trends we get:

For 111 using initial conditions f[1] = f[2] = 1 and plotting f[n] – n/2 we get

which has a nested structure that is closely related to the result of concatenating binary digit sequences of successive integers:

But despite the regularity in the sequence of values, the evaluation graph for this function is not particularly simple:

So how else might we come up with more complicated rules? One possibility is that instead of “adding f’s by adding terms” we can add f’s by additional nesting. So, for example, we can consider what we can call S₃1 (here shown with initial condition f[n] = 1 for n ≤ 3):

We can estimate the overall slope here by solving for x in x == 1 – x³ to get ≈ 0.682. Subtracting this off we get:

We can also consider deeper nestings. At depth d the slope is the solution to x == 1 – x^d. Somewhat remarkably, in all cases the only initial conditions probed are f[1] = 1 and f[3] = 1:

As another example of “higher nesting” we can consider the class of functions (that we call a):

Subtracting a constant 1/ϕ slope we get:

The evaluation graph for 1 is complicated, but has some definite structure:

What happens if we nest even more deeply, say defining a functions:

With depth-d nesting, we can estimate the overall slope of f[n] by solving for x in

so that for the d = 3 case here the overall slope is the real root of or about 0.544. Subtracting out this overall slope we get:

And, yes, the sine-curve-like form of 5 is very odd. Continuing 10x longer, though, things are “squaring off”:

What happens if we continue nesting deeper? stays fairly tame:

However, already allows for more complicated behavior:

And for different values of a there are different regularities:

There are all sorts of other extensions and generalizations one might consider. Some involve alternate functional forms; others involve introducing additional functions, or allowing multiple arguments to our function f.

An Aside: The Continuous Case

In talking about recursive functions f[n] we’ve been assuming—as one normally does—that n is always an integer. But can we generalize what we’re doing to functions f[x] where x is a continuous real number?

Consider for example a continuous analog of the Fibonacci recurrence:

This produces a staircase-like function whose steps correspond to the usual Fibonacci numbers:

Adjusting the initial condition produces a slightly different result:

We can think of these as being solutions to a kind of “Fibonacci delay equation”—where we’ve given initial conditions not at discrete points, but instead on an interval.

So what happens with nestedly recursive functions? We can define an analog of S1 as:

Plotting this along with the discrete result we get:

In more detail, we get

where now the plateaus occur at the (“Wythoff numbers”) .

Changing the initial condition to be x ≤ 3 we get:

Removing the overall slope by subtracting x/ϕ gives:

One feature of the continuous case is that one can continuously change initial conditions—though the behavior one gets typically breaks into “domains” with discontinuous boundaries, as in this case where we’re plotting the value of f[x] as a function of x and the “cutoff” in the initial conditions f[x], x ≤ :

So what about other rules? A rule like P312 (f[n_] := 3 + f[n – f[n – 2]]) given “constant” initial conditions effectively just copies and translates the initial interval, and gives a simple order-0 interpolation of the discrete case. With initial condition f[x] = x some segments get “tipped”:

All the cases we’ve considered here don’t “look back” to negative values, in either the discrete or continuous case. But what about a rule like T212 (f[n_] := 2 f[n – 1 f[n – 2]]) that progressively “looks back further”? With the initial condition f[x] = 1 for x ≤ 0, one gets the same result as in the discrete case:

But if one uses the initial condition f[x ] = Abs[x – 1] for x ≤ 0 (the Abs[x – 1] is needed to avoid ending up with f[x] depending on f[y] for y > x) one instead has

yielding the rather different result:

Continuing for larger x (on a log scale) we get:

Successively zooming in on one of the first “regions of noise” we see that it ultimately consists just of a large number of straight segments:

What’s going on here? If we count the number of initial conditions that are used for different values of x we see that this has discontinuous changes, leading to disjoint segments in f[x]:

Plotting over a larger range of x values the number of initial conditions used is:

And plotting the actual values of those initial conditions we get:

If we go to later, “more intense” regions of noise, we see more fragmentation—and presumably in the limit x ∞ we get the analog of an essential singularity in f[x]:

For the S family, with its overall n/ϕ trend, even constant initial conditions—say for S1—already lead to tipping, here shown compared to the discrete case:

How Do You Actually Compute Recursive Functions?

Let’s say we have a recursive definition—like the standard Fibonacci one:

How do we actually use this to compute the value of, say, f[7]? Well, we can start from f[7], then use the definition to write this as f[6] + f[5], then write f[6] as f[5] + f[4], and so on. And we can represent this using a evaluation graph, in the form:

But this computation is in a sense very wasteful; for example, it’s independently computing f[3] five separate times (and of course getting the same answer each time). But what if we just stored each f[n] as soon as we compute, and then just retrieve that stored (“cached”) value whenever we need it again?

In the Wolfram Language, it’s a very simple change to our original definition:

And now our evaluation graph becomes much simpler:

And indeed it’s this kind of minimal evaluation graph that we’ve been using in everything we’ve discussed so far.

What’s the relationship between the “tree” evaluation graph, and this minimal one? The tree graph is basically an “unrolled” version of the minimal graph, in which all the possible paths that can be taken from the root node to the initial condition nodes have been treed out.

In general, the number of edges that come out of a single node in a evaluation graph will be equal to the number of instances of the function f that appear on the right-hand side of the recursive definition we’re using (i.e. 2 in the case of the standard Fibonacci definition). So this means that if the maximum length of path from the root to the initial conditions is s, the maximum number of nodes that can appear in the “unrolled” graph is 2^s. And whenever there are a fixed set of initial conditions (i.e. if there’s always the same lookback), the maximum path length is essentially n—implying in the end that the maximum possible number of nodes in the unrolled graph will be 2ⁿ.

(In the actual case of the Fibonacci recurrence, the number of nodes in the unrolled graph is, or about 1.6ⁿ.)

But if we actually evaluate f[7]—say in the Wolfram Language—what is the sequence of f[n]’s that we’ll end up computing? Or, in effect, how will the evaluation graph be traversed? Here are the results for the unrolled and minimal evaluation graphs—i.e. without and with caching:

Particularly in the first case this isn’t the only conceivable result we could have gotten. It’s the way it is here because of the particular “leftmost innermost” evaluation order that the Wolfram Language uses by default. In effect, we’re traversing the graph in a depth-first way. In principle we could use other traversal orders, leading to f[n]’s being evaluated in different orders. But unless we allow other operations (like f[3] + f[3] 2 f[3]) to be interspersed with f evaluations, we’ll still always end up with the same number of f evaluations for a given evaluation graph.

But which is the “correct” evaluation graph? The unrolled one? Or the minimal one? Well, it depends on the computational primitives we’re prepared to use. With a pure stack machine, the unrolled graph is the only one possible. But if we allow (random-access) memory, then the minimal graph becomes possible.

OK, so what happens with nestedly recursive functions? Here, for example, are unrolled and minimal graphs for T212:

Here are the sequences of f[n]’s that are computed:

And here’s a comparison of the number of nodes (i.e. f evaluations) from unrolled and minimal evaluation graphs (roughly 1.2ⁿ and 0.5 n, respectively):

Different recursive functions lead to different patterns of behavior. The differences are less obvious in evaluation graphs, but can be quite obvious in the actual sequence of f[n]’s that are evaluated:

But although looking at evaluation sequences from unrolled evaluation graphs can be helpful as a way of classifying behavior, the exponentially more steps involved in the unrolled graph typically makes this impractical in practice.

Primitive Recursive or Not?

Recursive functions have a fairly long history, that we’ll be discussing below. And for nearly a hundred years there’s been a distinction made between “primitive recursive functions” and “general recursive functions”. Primitive recursive functions are basically ones where there’s a “known-in-advance” pattern of computation that has to be done; general recursive functions are ones that may in effect make one have to “search arbitrarily far” to get what one needs.

In Wolfram Language terms, primitive recursive functions are roughly ones that can be constructed directly using functions like Nest and Fold (perhaps nested); general recursive functions can also involve functions like NestWhile and FoldWhile.

So, for example, with the Fibonacci definition

the function f[n] is primitive recursive and can be written, say, as:

Lots of the functions one encounters in practice are similarly primitive recursive—including most “typical mathematical functions” (Plus, Power, GCD, Prime, …). And for example functions that give the results of n steps in the evolution of a Turing machine, cellular automaton, etc. are also primitive recursive. But functions that for example test whether a Turing machine will ever halt (or give the state that it achieves if and when it does halt) are not in general primitive recursive.

On the face of it, our nestedly recursive functions seem like they must be primitive recursive, since they don’t for example appear to be “searching for anything”. But things like the presence of longer and longer lookbacks raise questions. And then there’s the potential confusion of the very first example (dating from the late 1920s) of a recursively defined function known not to be primitive recursive: the Ackermann function.

The Ackermann function has three (or sometimes two) arguments—and, notably, its definition (here given in its classic form) includes nested recursion:

This is what the evaluation graphs look like for some small cases:

Looking at these graphs we can begin to see a pattern. And in fact there’s a simple interpretation: f[m, x, y] for successive m is doing progressively more nested iterations of integer successor operations. f[0, x, y] computes x + y; f[1, x, y] does “repeated addition”, i.e. computes x × y; f[2, x, y] does “repeated multiplication”, i.e. computes xy; f[3, x, y] does “tetration”, i.e. computes the “power tower” Nest[x^#&, 1, y]; etc.

Or, alternatively, these can be given explicitly in successively more nested form:

And at least in this form f[m, x, y] involves m nestings. But a given primitive recursive function can involve only a fixed number of nestings. It might be conceivable that we could rewrite f[m, x, y] in certain cases to involve only a fixed number of nestings. But if we look at f[m, m, m] then this turns out to inevitably grow too rapidly to be represented by a fixed number of nestings—and thus cannot be primitive recursive.

But it turns out that the fact that this can happen depends critically on the Ackermann function having more than one argument—so that one can construct the “diagonal” f[m, m, m].

So what about our nestedly recursive functions? Well, at least in the form that we’ve used them, they can all be written in terms of Fold. The key idea is to accumulate a list of values so far (conveniently represented as an association)—sampling whichever parts are needed—and then at the end take the last element. So for example the “Summer School function” T311

can be written:

An important feature here is that we’re getting Lookup to give 1 if the value it’s trying to look up hasn’t been filled in yet, implementing the fact that f[n] = 1 for n ≤ 0.

So, yes, our recursive definition might look back further and further. But it always just finds value 1—which is easy for us to represent without, for example, any extra nesting, etc.

The ultimate (historical) definition of primitive recursion, though, doesn’t involve subsets of the Wolfram Language (the definition was given almost exactly 100 years too early!). Instead, it involves a specific set of simple primitives:

(An alternative, equivalent definition for recursion—explicitly involving Fold—is r[g_, h_] := Fold[{u, v} h[u, v, ##2]], g[##2], Range[0, #1 – 1]] &.)

So can our nestedly recursive functions be written purely in terms of these primitives? The answer is yes, though it’s seriously complicated. A simple function like Plus can for example be written as r[p[1], s], so that e.g. r[p[1], s][2,3]5. Times can be written as r[z, c[Plus, p[1], p[3]]] or r[z, c[r[p[1], s], p[1], p[3]]], while Factorial can be written as r[c[s, z], c[Times, p[1], c[s, p[2]]]]. But even Fibonacci, for example, seems to require a very much longer specification.

In writing “primitive-recursive-style” definitions in Wolfram Language we accumulated values in lists and associations. But in the ultimate definition of primitive recursion, there are no such constructs; the only form of “data” is positive integers. But for our definitions of nestedly recursive functions we can use a “tupling function” that “packages up” any list of integer values into a single integer (and an untupling function that unpacks it). And we can do this say based on a pairing (2-element-tupling) function like:

But what about the actual If[n ≤0, 1, ...] lookback test itself? Well, If can be written in primitive recursive form too: for example, r[c[s, z], c[f, c[s, p[2]]]][n] is equivalent to If[n ≤ 0, 1, f[n]].

So our nestedly recursive functions as we’re using them are indeed primitive recursive. Or, more strictly, finding values f[n] is primitive recursive. Asking questions like “For what n does f[n] reach 1000?” might not be primitive recursive. (The obvious way of answering them involves a FoldWhile-style non-primitive-recursive search, but proving that there’s no primitive recursive way to answer the question is likely very much harder.)

By the way, it’s worth commenting that while for primitive recursive functions it’s always possible to compute a value f[n] for any n, that’s not necessarily true for general recursive functions. For example, if we ask “For what n does f[n] reach 1000?” there might simply be no answer to this; f[n] might never reach 1000. And when we look at the computations going on underneath, the key distinction is that in evaluating primitive recursive functions, the computations always halt, while for general recursive functions, they may not.

So, OK. Our nestedly recursive functions can be represented in “official primitive recursive form”, but they’re very complicated in that form. So that raises the question: what functions can be represented simply in this form? In A New Kind of Science I gave some examples, each minimal for the output it produces:

And then there’s the most interesting function I found:

It’s the simplest primitive recursive function whose output has no obvious regularity:

Because it’s primitive recursive, it’s possible to express it in terms of functions like Fold—though it’s two deep in those, making it in some ways more complicated (at least as far as the Grzegorczyk hierarchy that counts “Fold levels” is concerned) than our nestedly recursive functions:

But there’s still an issue to address with nestedly recursive functions and primitive recursion. When we have functions (like T212) that “reach back” progressively further as n increases, there’s a question of what they’ll find. We’ve simply assumed f[n] = 1 for n ≤0. But what if there was something more complicated there? Even if f[–m] was given by some primitive recursive function, say p[m], it seems possible that in computing f[n] one could end up somehow “bouncing back and forth” between positive and negative arguments, and in effect searching for an m for which p[m] has some particular value, and in doing that searching one could find oneself outside the domain of primitive recursive functions.

And this raises yet another question: are all definitions we can give of nestedly recursive functions consistent? Consider for example:

Now ask: what is f[1]? We apply the recursive definition. But it gives us f[1] = 1 – f[f[0]] or f[1] = 1 – f[1], or, in other words, an inconsistency. There are many such inconsistencies that seem to “happen instantly” when we apply definitions. But it seems conceivable that there could be “insidious inconsistencies” that show up only after many applications of a recursive definition. And it’s also conceivable that one could end up with “loops” like f[i] = f[i]. And things like this could be reasons that f[n] might not be a “total function”, defined for all n.

We’ve seen all sorts of complex behavior in nestedly recursive functions. And what the Principle of Computational Equivalence suggests is that whenever one sees complex behavior, one must in some sense be dealing with computations that are “as sophisticated as any computation can be”. And in particular one must be dealing with computations that can somehow support computation universality.

So what would it mean for a nestedly recursive function to be universal? For a start, one would need some way to “program” the function. There seem to be a couple of possibilities. First, one could imagine packing both “code” and “data” into the argument n of f[n]. So, for example, one might use some form of tupling function to take a description of a rule and an initial state for a Turing machine, together with a specification of a step number, then package all these things into an integer n that one feeds into one’s universal nestedly recursive function f. Then the idea would be that the value computed for f[n] could be decoded to give the state of the Turing machine at the specified step. (Such a computation by definition always halts—but much as one computes with Turing machines by successively asking for the next steps in their evolution, one can imagine setting up a “harness” that just keeps asking for values of f[n] at an infinite progression of values n.)

Another possible approach to making a universal nestedly recursive function is to imagine feeding in a “program” through the initial conditions one gives for the function. There might well need to be decoding involved, but in some sense what one might hope is that just by changing its initial conditions one could get a nestedly recursive function with a specific recursive definition to emulate a nestedly recursive function with any other recursive definition (or, say, for a start, any linear recurrence).

Perhaps one could construct a complicated nestedly recursive function that would have this property. But what the Principle of Computational Equivalence suggests is that it should be possible to find the property even in “naturally occurring cases”—like P312 or T212.

The situation is probably going to be quite analogous to what happened with the rule 110 cellular automaton or the s = 2, k = 3 596440 Turing machine. By looking at the actual typical behavior of the system one got some intuition about what was likely to be going on. And then later, with great effort, it became possible to actually prove computation universality.

In the case of nestedly recursive functions, we’ve seen here examples of just how diverse the behavior generated by changing initial conditions can be. It’s not clear how to harness this diversity to extract some form of universality. But it seems likely that the “raw material” is there. And that nestedly recursive functions will show themselves as able join so many other systems in fitting into the framework defined by the Principle of Computational Equivalence.

Some History

Once one has the concept of functions and the concept of recursion, nestedly recursive functions aren’t in some sense a “complicated idea”. And between this fact and the fact that nestedly recursive functions haven’t historically had a clear place in any major line of mathematical or other development it’s quite difficult to be sure one’s accurately tracing their history. But I’ll describe here at least what I currently know.

The concept of something like recursion is very old. It’s closely related to mathematical induction, which was already being used for proofs by Euclid around 300 BC. And in a quite different vein, around the same time (though not recorded in written form until many centuries later) Fibonacci numbers arose in Indian culture in connection with the enumeration of prosody (“How many different orders are there in which to say the Sanskrit words in this veda?”).

Then in 1202 Leonardo Fibonacci, at the end of his calculational math book Liber Abaci (which was notable for popularizing Hindu-Arabic numerals in the West) stated—more or less as a recreational example—his “rabbit problem” in recursive form, and explicitly listed the Fibonacci numbers up to 377. But despite this early appearance, explicit recursively defined sequences remained largely a curiosity until as late as the latter part of the twentieth century.

The concept of an abstract function began to emerge with calculus in the late 1600s, and became more solidified in the 1700s—but basically always in the context of continuous arguments. A variety of specific examples of recurrence relations—for binomial coefficients, Bernoulli numbers, etc.—were in fairly widespread use. But there didn’t seem to have yet been a sense that there was a general mathematical structure to study.

In the course of the 1800s there had been an increasing emphasis on rigor and abstraction in mathematics, leading by the latter part of the century to a serious effort to axiomatize concepts associated with numbers. Starting with concepts like the recursive definition of integers by repeated application of the successor operation, by the time of Peano’s axioms for arithmetic in 1891 there was a clear general notion (particularly related to the induction axiom) that (integer) functions could be defined recursively. And when David Hilbert’s program of axiomatizing mathematics got underway at the beginning of the 1900s, it was generally assumed that all (integer) functions of interest could actually be defined specifically using primitive recursion.

The notation for recursively specifying functions gradually got cleaner, making it easier to explore more elaborate examples. And in 1927 Wilhelm Ackermann (a student of Hilbert’s) introduced (in completely modern notation) a “reasonable mathematical function” that—as we discussed above—he showed was not primitive recursive. And right there, in his paper, without any particular comment, is a nestedly recursive function definition:

In 1931 Kurt Gödel further streamlined the representation of recursion, and solidified the notion of general recursion. There soon developed a whole field of recursion theory—though most of it was concerned with general issues, not with specific, concrete recursive functions. A notable exception was the work of Rózsa Péter (Politzer), beginning in the 1930s, and leading in 1957 to her book Recursive Functions—which contains a chapter on “Nested Recursion” (here in English translation):

But despite the many specific (mostly primitive) recursive functions discussed in the rest of the book, this chapter doesn’t stray far from the particular function Ackermann defined (or at least Péter’s variant of it).

What about the recreational mathematics literature? By the late 1800s there were all sorts of publications involving numbers, games, etc. that at least implicitly involved recursion (an example being Édouard Lucas’s 1883 Tower of Hanoi puzzle). But—perhaps because problems tended to be stated in words rather than mathematical notation—it doesn’t seem as if nestedly recursive functions ever showed up.

In the theoretical mathematics literature, a handful of somewhat abstract papers about “nested recursion” did appear, an example being one in 1961 by William Tait, then at Stanford:

But, meanwhile, the general idea of recursion was slowly beginning to go from purely theoretical to more practical. John McCarthy—who had coined the term “artificial intelligence”—was designing LISP as “the language for AI” and by 1960 was writing papers with titles like “Recursive Functions of Symbolic Expressions and Their Computation by Machine”.

In 1962 McCarthy came to Stanford to found the AI Lab there, bringing with him enthusiasm for both AI and recursive functions. And by 1968 these two topics had come together in an effort to use “AI methods” to prove properties of programs, and in particular programs involving recursive functions. And in doing this, John McCarthy came up with an example he intended to be awkward—that’s exactly a nestedly recursive function:

In our notation, it would be:

And it became known as “McCarthy’s 91-function” because, yes, for many n, f[n] = 91. These days it’s trivial to evaluate this function—and to find out that f[n] = 91 only up to n = 102:

But even the evaluation graph is somewhat large

and in pure recursive evaluation the recursion stack can get deep—which back then was a struggle for LISP systems to handle.

There were efforts at theoretical analysis, for example by Zohar Manna, who in 1974 published Mathematical Theory of Computation which—in a section entitled “Fixpoints of Functionals”—presents the 91-function and other nestedly recursively functions, particularly in the context of evaluation-order questions.

In the years that followed, a variety of nestedly recursive functions were considered in connection with proving theorems about programs, and with practical assessments of LISP systems, a notable example being Ikuo Takeuchi’s 1978 triple recursive function:

But in all these cases the focus was on how these functions would be evaluated, not on what their behavior would be (and it was typically very simple).

But now we have to follow another thread in the story. Back in 1961, right on the Stanford campus, a then-16-year-old Douglas Hofstadter was being led towards nestedly recursive functions. As Doug tells it, it all started with him seeing that squares are interspersed with gaps of 1 or 2 between triangular numbers, and then noticing patterns in those gaps (and later realizing that they showed nesting). Meanwhile, at Stanford he had access to a computer running Algol, a language which (like LISP and unlike Fortran) supported recursion (though this wasn’t particularly advertised, since recursion was still generally considered quite obscure).

And as Doug tells it, within a year or two he was using Algol to do things like recursively create trees representing English sentences. Meanwhile—in a kind of imitation of the Eleusis “guess-a-card-rule” game—Doug was apparently challenging his fellow students to a “function game” based on guessing a math function from specified values. And, as he tells it, he found that functions that were defined recursively were the ones people found it hardest to guess.

That was all in the early 1960s, but it wasn’t until the mid-1970s that Doug Hofstadter returned to such pursuits. After various adventures, Doug was back at Stanford—writing what became his book Gödel, Escher, Bach. And in 1977 he sent a letter to Neil Sloane, creator of the 1973 A Handbook of Integer Sequences (and what’s now the Online Encyclopedia of Integer Sequences, or OEIS):

As suggested by the accumulation of “sequence ID” annotations on the letter, Doug’s “eta sequences” had actually been studied in number theory before—in fact, since at least the 1920s (they are now usually called Beatty sequences). But the letter went on, now introducing some related sequences—that had nestedly recursive definitions:

As Doug pointed out, these particular sequences (which were derived from golden ratio versions of his “eta sequences”) have a very regular form—which we would now call nested. And it was the properties of this form that Doug seemed most concerned about in his letter. But actually, as we saw above, just a small change in initial conditions in what I’m calling S1 would have led to much wilder behavior. But that apparently wasn’t something Doug happened to notice. A bit later in the letter, though, there was another nestedly recursive sequence—that Doug described as a “horse of an entirely nother color”: the “absolutely CRAZY” Q sequence:

Two years later, Doug’s Gödel, Escher, Bach book was published—and in it, tucked away at the bottom of page 137, a few pages after a discussion of recursive generation of text (with examples such as “the strange bagels that the purple cow without horns gobbled”), there was the Q sequence:

Strangely, though, there was no picture of it, and Doug listed only 17 terms (which, until I was writing this, was all I assumed he had computed):

So now nestedly recursive sequences were out in the open—in what quickly became a very popular book. But I don’t think many people noticed them there (though, as I’ll discuss, I did). Gödel, Escher, Bach is primarily a playful book focused on exposition—and not the kind of place you’d expect to find a new, mathematical-style result.

Still—quite independent of the book—Neil Sloane showed Doug’s 1977 letter to his Bell Labs colleague Ron Graham, who within a year made a small mention of the Q sequence in a staid academic math publication (in a characteristic “state-it-as-a-problem” Erdös form):

There was a small and tight-knit circle of serious mathematicians—essentially all of whom, as it happens, I personally knew—who would chase these kinds of easy-to-state-but-hard-to-solve problems. Another was Richard Guy, who soon included the sequence as part of problem E31 in his Unsolved Problems in Number Theory, and mentioned it again a few years later.

But for most of the 1980s little was heard about the sequence. As it later turns out, a senior British applied mathematician named Brian Conolly (who wasn’t part of the aforementioned tight-knit circle) had—presumably as a kind of hobby project—made some progress, and in 1986 had written to Guy about it. Guy apparently misplaced the letter, but later told Conolly that John Conway and Sol Golomb had worked on similar things.

Conway presumably got the idea from Hofstadter’s work (though he had a habit of obfuscating his sources). But in any case, on July 15, 1988, Conway gave a talk at Bell Labs entitled “Some Crazy Sequences” (note the word “crazy”, just like in Hofstadter’s letter to Sloane) in which he discussed the regular-enough-to-be-mathematically-interesting sequence (which we’re calling G₃111 here):

Despite its visual regularity, Conway couldn’t mathematically prove certain features of the wiggles in the sequence—and in his talk offered a $10,000 prize for anyone who could. By August a Bell Labs mathematician named Colin Mallows had done it. Conway claimed (later to be contradicted by video evidence) that he’d only offered $1000—and somehow the whole affair landed as a story in the August 30 New York Times under the heading “Intellectual Duel: Brash Challenge, Swift Response”. But in any case, this particular nestedly recursive sequence became known as “Conway’s Challenge Sequence”.

So what about Sol Golomb? It turns out he’d started writing a paper—though never finished it:

He’d computed 280 terms of the Q sequence (he wasn’t much of a computer user) and noticed a few coincidences. But he also mentioned another kind of nestedly recursive sequence, no doubt inspired by his work on feedback shift registers:

As he noted, the behavior depends greatly on the initial conditions, though is always eventually periodic—with his student Unjeng Cheng having found long-period examples.

OK, so by 1988 nestedly recursive functions had at least some notoriety. So what happened next? Not so much. There’s a modest academic literature that’s emerged over the last few decades, mostly concentrated very specifically around “Conway’s Challenge Sequence”, Hofstadter’s Q function, or very similar “meta Fibonacci” generalizations of them. And so far as I know, even the first published large-scale picture of the Q sequence only appeared in 1998 (though I had pictures of it many years earlier):

Why wasn’t more done with nestedly recursive functions? At some level it’s because they tend to have too much computational irreducibility—making it pretty difficult to say much about them in traditional mathematical ways. But perhaps more important, studying them broadly is really a matter of ruliology: it requires the idea of exploring spaces of rules, and of expecting the kinds of behavior and phenomena that are characteristic of systems in the computational universe. And that’s something that’s still not nearly as widely understood as it should be.

My Personal Story with Nestedly Recursive Functions

I think 1979 was the year when I first took recursion seriously. I’d heard about the Fibonacci sequence (though not under that name) as a young child a decade earlier. I’d implicitly (and sometimes explicitly) encountered recursion (sometimes through error messages!) in computer algebra systems I’d used. In science, I’d studied fractals quite extensively (Benoit Mandelbrot’s book having appeared in 1977), and I’d been exposed to things like iterated maps. And I’d quite extensively studied cascade processes, notably of quarks and gluons in QCD.

As I think about it now, I realize that for several years I’d written programs that made use of recursion (and I had quite a lot of exposure to LISP, and the culture around it). But it was in 1979—having just started using C—that I first remember writing a program (for doing percolation theory) where I explicitly thought “this is using recursion”. But then, in late 1979, I began to design SMP (“Symbolic Manipulation Program”), the forerunner of the modern Wolfram Language. And in doing this I quickly solidified my knowledge of mathematical logic and the (then-fledgling) field of theoretical computer science.

My concept of repeated transformations for symbolic expressions—which is still the core of Wolfram Language today—is somehow fundamentally recursive. And by the time we had the first signs of life for our SMP system, Fibonacci was one of our very first tests. We soon tried the Ackermann function too. And in 1980 I became very interested in the problem of evaluation order, particularly for recursive functions—and the best treatment I found of it (though at the time not very useful to me) was in none other than the book by Zohar Manna that I mentioned above. (In a strange twist, I was at that time also studying gauge choices in physics—and it was only last year that I realized that they’re fundamentally the same thing as evaluation orders.)

It was soon after it came out in 1979 that I first saw Douglas Hofstadter’s book. At the time I wasn’t too interested in its Lewis-Carroll-like aspects, or its exposition; I just wanted to know what the “science meat” in it was. And somehow I found the page about the Q sequence, and filed it away as “something interesting”.

I’m not sure when I first implemented the Q sequence in SMP, but by the time we released Version 1.0 in July 1981, there it was: an external package (hence the “X” prefix) for evaluating “Hofstadter’s recursive function”, elegantly using memoization—with the description I gave saying (presumably because that’s what I’d noticed) that its values “have several properties of randomness”:

Firing up a copy of SMP today—running on a virtual machine that still thinks it’s 1986—I can run this code, and easily compute the function:

I can even plot it—though without an emulator for a 1980s-period storage-tube display, only the ASCIIfied rendering works:

So what did I make of the function back in 1981? I was interested in how complexity and randomness could occur in nature. But at the time, I didn’t have enough of a framework to understand the connection. And, as it was, I was just starting to explore cellular automata, which seemed a lot more “nature like”—and which soon led me to things like rule 30 and the phenomenon of computational irreducibility.

Still, I didn’t forget the Q sequence. And when I was building Mathematica I again used it as a test (the .tq file extension came from the brief period in 1987 when we were trying out “Technique” as the name of the system):

When Mathematica 1.0 was released on June 23, 1988, the Q sequence appeared again, this time as an example in the soon-in-every-major-bookstore Mathematica book:

I don’t think I was aware of Conway’s lecture that occurred just 18 days later. And for a couple of years I was consumed with tending to a young product and a young company. But by 1991, I was getting ready to launch into basic science again. Meanwhile, the number theorist (and today horologist) Ilan Vardi—yet again from Stanford—had been working at Wolfram Research and writing a book entitled Computational Recreations in Mathematica, which included a long section on the analysis of Takeuchi’s nested recursive function (“TAK”). My email archive records an exchange I had with him:

He suggested a “more symmetrical” nested recursive function. I responded, including a picture that made it fairly clear that this particular function would have only nested behavior, and not seem “random”:

By the summer of 1991 I was in the thick of exploring different kinds of systems with simple rules, discovering the remarkable complexity they could produce, and filling out what became Chapter 3 of A New Kind of Science: “The World of Simple Programs”. But then came Chapter 4: “Systems Based on Numbers”. I had known since the mid-1980s about the randomness in things like digit sequences produced by successive arithmetic operations. But what about randomness in pure sequences of integers? I resolved to find out just what it would take to produce randomness there. And so it was that on August 13, 1993, I came to be enumerating possible symbolic forms for recursive functions—and selecting ones that could generate at least 10 terms:

As soon as I plotted the “survivors” I could see that interesting things were going to happen:

Was this complexity somehow going to end? I checked out to 10 million terms. And soon I started collecting my “prize specimens” and making a gallery of them:

I had some one-term recurrences, and some two-term ones. Somewhat shortsightedly I was always using “Fibonacci-like” initial conditions f[1] = f[2] = 1—and I rejected any recurrence that tried to “look back” to f[0], f[–1], etc. And at the time, with this constraint, I only found “really interesting” behavior in two-term recurrences.

In 1994 I returned briefly to recursive sequences, adding a note “solving” a few of them, and discussing the evaluation graphs of others:

When I finally finished A New Kind of Science in 2002, I included a list of historical “Close approaches” to its core discoveries, one of them being Douglas Hofstadter’s work on recursive sequences:

In retrospect, back in 1981 I should have been able to take the “Q sequence” and recognize in it the essential “rule 30 phenomenon”. But as it was, it took another decade—and many other explorations in the computational universe—for me to build up the right conceptual framework to see this. In A New Kind of Science I studied many kinds of systems, probing them far enough, I hoped, to see their most important features. But recursive functions were an example where I always felt there was more to do; I felt I’d only just scratched the surface.

In June 2003—a year after A New Kind of Science was published—we held our first summer school. And as a way to introduce methodology—and be sure that people knew I was fallible and approachable—I decided on the first day of the summer school to do a “live experiment”, and try to stumble my way to discovering something new, live and in public.

A few minutes before the session started, I picked the subject: recursive functions. I began with some examples I knew. Then it was time to go exploring. At first lots of functions “didn’t work” because they were looking back too far. But then someone piped up “Why not just say that f[n] is 1 whenever n isn’t a positive integer?” Good idea! And very easy to try.

Soon we had the “obvious” functions written (today Apply[Plus, ...] could be just Total[...], but otherwise there’s nothing “out of date” here):

In a typical story of Wolfram-Language-helps-one-think-clearly, the obvious function was also very general, and allowed a recurrence with any number of terms. So why not start with just one term? And immediately, there it was—what we’re now calling T311:

And then a plot (yes, after Version 6 one didn’t need the Show or the trailing “;”):

Of course, as is the nature of computational constructions, this is something timeless—that looks the same today as it did 21 years ago (well, except that now our plots display with color by default).

I thought this was a pretty neat discovery. And I just couldn’t believe that years earlier I’d failed to see the obvious generalization of having “infinite” initial conditions.

The next week I did a followup session, this time talking about how one would write up a discovery like this. We started off with possible titles (including audience suggestions):

And, yes, the first title listed is exactly the one I’ve now used here. In the notebook I created back then, there were first some notes (some of which should still be explored!):

Three hours later (on the afternoon of July 11, 2003) there’s another notebook, with the beginnings of a writeup:

By the way, another thing we came up with at the summer school was the (non-nestedly) recursive function:

Plotting g[n + 1] – g[n] gives:

And, yes, bizarrely (and reminiscent of McCarthy’s 91-function) for n ≥ 396, g[n + 1] – g[n] is always 97, and g[n] = 38606 + 97 (n – 396).

But in any case, a week or so after my “writeups” session, the summer school was over. In January 2004 I briefly picked the project up again, and made some pictures that, yes, show interesting structure that perhaps I should investigate now:

In the years that followed, I would occasionally bring nestedly recursive functions out again—particularly in interacting with high school and other students. And at our summer programs I suggested projects with them for a number of students.

In 2008, they seemed like an “obvious interesting thing” to add to our Demonstrations Project:

But mostly, they languished. Until, that is, my burst of “finish this” intellectual energy that followed the launch of our Physics Project in 2020. So here now, finally, after a journey of 43 years, I feel like I’ve been able to do some justice to nestedly recursive functions, and provided a bit more illumination to yet another corner of the computational universe.

(Needless to say, there are many, many additional questions and issues to explore. Different primitives, e.g. Mod, Floor, etc. Multiple functions that refer to each other. Multiway cases. Functions based on rational numbers. And endless potential approaches to analysis, identifying pockets of regularity and computational reducibility.)

Thanks

Thanks to Brad Klee for extensive help. Thanks also to those who’ve worked on nestedly recursive functions as students at our summer programs over the years, including Roberto Martinez (2003), Eric Rowland (2003), Chris Song (2021) and Thomas Adler (2024). I’ve benefitted from interactions about nestedly recursive functions with Ilan Vardi (1991), Tal Kubo (1993), Robby Villegas (2003), Todd Rowland (2003 etc.), Jim Propp (2004), Matthew Szudzik (2005 etc.), Joseph Stocke (2021 etc.), Christopher Gilbert (2024) and Max Niedermann (2024). Thanks to Doug Hofstadter for extensive answers to questions about history for this piece. It’s perhaps worth noting that I’ve personally known many of the people mentioned in the history section here (with the dates I met them indicated): John Conway (1984), Paul Erdös (1986), Sol Golomb (1981), Ron Graham (1983), Benoit Mandelbrot (1986), John McCarthy (1981) and Neil Sloane (1983).

Bibliography of Nestedly Recursive Functions »

Five Most Productive Years: What Happened and What’s Next

Stephen Wolfram — Thu, 29 Aug 2024 16:31:46 +0000

So… What Happened?

Today is my birthday—for the 65th time. Five years ago, on my 60th birthday, I did a livestream where I talked about some of my plans. So… what happened? Well, what happened was great. And in fact I’ve just had the most productive five years of my life. Nine books. 3939 pages of writings (1,283,267 words). 499 hours of podcasts and 1369 hours of livestreams. 14 software product releases (with our great team). Oh, and a bunch of big—and beautiful—ideas and results.

It’s been wonderful. And unexpected. I’ve spent my life alternating between technology and basic science, progressively building a taller and taller tower of practical capabilities and intellectual concepts (and sharing what I’ve done with the world). Five years ago everything was going well, and making steady progress. But then there were the questions I never got to. Over the years I’d come up with a certain number of big questions. And some of them, within a few years, I’d answered. But others I never managed to get around to.

And five years ago, as I explained in my birthday livestream, I began to think “it’s now or never”. I had no idea how hard the questions were. Yes, I’d spent a lifetime building up tools and knowledge. But would they be enough? Or were the questions just not for our time, but only perhaps for some future century?

At several points before in my life I’d faced such issues—and things had worked out well (A New Kind of Science, Wolfram|Alpha, etc.). And from this, I had gotten a certain confidence about what might be possible. In addition, as a serious student of intellectual history, I had a sense of what kind of boldness was needed. Five years ago there wasn’t really anything that made me need to do something big and new. But I thought: “What the heck. I might as well try. I’ll never know what’s possible unless I try.”

A major theme of my work since the early 1980s had been exploring the consequences of simple computational rules. And I had found the surprising result that even extremely simple rules could lead to immensely complex behavior. So what about the universe? Could it be that at a fundamental level our whole universe is just following some simple computational rule?

I had begun my career in the 1970s as a teenager studying the frontiers of existing physics. And at first I couldn’t see how computational rules could connect to what is known in physics. But in the early 1990s I had an idea, and by the late 1990s I had developed it and gotten some very suggestive results. But when I published these in A New Kind of Science in 2002, even my friends in the physics community didn’t seem to care—and I decided to concentrate my efforts elsewhere (e.g. building Wolfram|Alpha, Wolfram Language, etc.).

But I didn’t stop thinking “one day I need to get back to my physics project”. And in 2019 I decided: “What the heck. Let’s try it now.” It helped that I’d made a piece of technical progress the year before, and that now two young physicists were enthusiastic to work with me on the project.

And so it was, soon after my birthday in 2019, that we embarked on our Physics Project. It was a mixture of computer experiments and big concepts. But before the end of 2019 it was clear: it was going to work! It was an amazing experience. Thing after thing in physics that had always been mysterious I suddenly understood. And it was beautiful—a theory of such strength built on a structure of such incredible simplicity and elegance.

We announced what we’d figured out in April 2020, right when the pandemic was in full swing. There was still much to do (and there still is today). But the overall picture was clear. I later learned that a century earlier many well-known physicists were beginning to think in a similar direction (matter is discrete, light is discrete; space must be too) but back then they hadn’t had the computational paradigm or the other tools needed to move this forward. And now the responsibility had fallen on us to do this. (Pleasantly enough, given our framework, many modern areas of mathematical physics seemed to fit right in.)

And, yes, figuring out the basic “machine code” for our universe was of course pretty exciting. But seeing an old idea of mine blossom like this had another very big effect on me. It made me think: “OK, what about all those other projects I’ve been meaning to do? Maybe it’s time to do those too.”

And something else had happened as well. In doing the Physics Project we’d developed a new way of thinking about things—not just computational, but “multicomputational”. And actually, the core ideas behind this were in A New Kind of Science too. But somehow I’d never taken them seriously enough before, and never extended my intuition to encompass them. But now with the Physics Project I was doing this. And I could see that the ideas could also go much further.

So, yes, I had a new and powerful conceptual framework for doing science. And I had all the technology of the modern Wolfram Language. But in 2020 I had another thing too—in effect, a new distribution channel for my ideas and efforts. Early in my career I had used academic papers as my “channel” (at one point in 1979 even averaging a paper every few weeks). But in the late 1980s I had a very different kind of channel: embodying my ideas in the design and implementation of Mathematica and what’s now the Wolfram Language. Then in the 1990s I had another channel: putting everything together into what became my book A New Kind of Science.

After that was published in 2002 I would occasionally write small posts—for the community site around the science in my book, for our corporate blog, etc. And in 2010 I started my own blog. At first I mostly just wrote small, fun pieces. But by 2015—partly driven by telling historical stories (200th anniversary of George Boole, 200th anniversary of Ada Lovelace, …)—the things I was writing were getting ever meatier. (There’d actually already been some meaty ones about personal analytics in 2012.)

And by 2020 my pattern was set and I would routinely write 50+ -page pieces, full of pictures (all with immediately runnable “click-to-copy” code) and intended for anyone who cared to read them. Finally I had a good channel again. And I started using it. As I’d found over the years—whether with language documentation or with A New Kind of Science—the very act of exposition was a critical part of organizing and developing my ideas.

And now I started producing pieces. Some were directly about specific topics around the Physics Project. But within two months I was already writing about a “spinoff”: “Exploring Rulial Space: The Case of Turing Machines”. I had realized that one of the places the ideas of the Physics Project should apply was to the foundations of mathematics, and to metamathematics. In a footnote to A New Kind of Science I had introduced the idea of “empirical metamathematics”. And in the summer of 2020, fuelled by my newfound “finish those old projects” mindset, I ended up writing an 80-page piece on “The Empirical Metamathematics of Euclid and Beyond”.

December 7, 1920 was the date a certain Moses Schönfinkel introduced what we now call combinators: the very first clear foundations for universal computation. I had always found combinators interesting (if hard to understand). I had used ideas from them back around 1980 in the predecessor of what’s now the Wolfram Language. And I had talked about them a bit in A New Kind of Science. But as the centenary approached, I decided to make a more definitive study, in particular using methods from the Physics Project. And, for good measure, even in the middle of the pandemic I tracked down the mysterious history of Moses Schönfinkel.

In March 2021, there was another centenary, this time of Emil Post’s tag system, and again I decided to finish what I’d started in A New Kind of Science, and write a definitive piece, this time running to about 75 pages.

One might have thought that the excursions into empirical metamathematics, combinators, tag systems, rulial and multiway Turing machines would be distractions. But they were not. Instead, they just deepened my understanding and intuition for the new ideas and methods that had come out of the Physics Project. As well as finishing projects that I’d wondered about for decades (and the world had had open for a century).

Perhaps not surprisingly given its fundamental nature, the Physics Project also engaged with some deep philosophical issues. People would ask me about them with some regularity. And in March 2021 I started writing a bit about them, beginning with a piece on consciousness. The next month I wrote “Why Does the Universe Exist? Some Perspectives from Our Physics Project”. (This piece of writing happened to coincide with the few days in my life when I’ve needed to do active cryptocurrency trading—so I was in the amusing position of thinking about a philosophical question about as deep as they come, interspersed with making cryptocurrency trades.)

Everything kept weaving together. These philosophical questions made me internalize just how important the nature of the observer is in our Physics Project. Meanwhile I started thinking about the relationship of methods from the Physics Project to distributed computing, and to economics. And in May 2021 that intersected with practical blockchain questions, which caused me to write about “The Problem of Distributed Consensus”—which would soon show up again in the science and philosophy of observers.

The fall of 2021 involved really leaning into the new multicomputational paradigm, among other things giving a long list of where it might apply: metamathematics, chemistry, molecular biology, evolutionary biology, neuroscience, immunology, linguistics, economics, machine learning, distributed computing. And, yes, in a sense this was my “to do” list. In many ways, half the battle was just defining this. And I’m happy to say that just three years later, we’ve already made a big dent in it.

While all of this was going on, I was also energetically pursuing my “day job” as CEO of Wolfram Research. Version 12.1 of the Wolfram Language had come out less than a month before the Physics Project was announced. Version 12.2 right after the combinator centenary. And in 2021 there were two new versions. In all 635 new functions, all of which I had carefully reviewed, and many of which I’d been deeply involved in designing.

It’s a pattern in the history of science (as well as technology): some new methodology or some new paradigm is introduced. And suddenly vast new areas are opened up. And there’s lots of juicy “low-hanging fruit” to be picked. Well, that’s what had happened with the ideas from our Physics Project, and the concept of multicomputation. There were many directions to go, and many people wanting to get involved. And in 2021 it was becoming clear that something organizational had to be done: this wasn’t a job for a company (even for one as terrific and innovative as ours is), it was a job for something like an institute. (And, yes, in 2022, we indeed launched what’s now the Wolfram Institute for Computational Foundations of Science.)

But back in 1986, I had started the very first institute concentrating on complexity and how it could arise from simple rules. Running it hadn’t been a good fit for me back then, and very quickly I started our company. In 2002, when A New Kind of Science was published, I’d thought again about starting an institute. But it didn’t happen. But now there really seemed to be no choice. I started reflecting on what had happened to “complexity”, and whether there was something to leverage from the institutional structure that had grown up around it. Nearly 20 years after the publication of A New Kind of Science, what should “complexity” be now?

I wrote “Charting a Course for ‘Complexity’: Metamodeling, Ruliology and More”—and in doing so, finally invented a word for the “pure basic science of what simple rules do”: ruliology.

My original framing of what became our Physics Project had been to try to “find a computational rule that gives our universe”. But I’d always found this unsatisfying. Because even if we had the rule, we’d still be left asking “why this one, and not another?” But in 2020 there’d been a dawning awareness of a possible answer.

Our Physics Project is based on the idea of applying rules to abstract hypergraphs that represent space and everything in it. But given a particular rule, there are in general many ways it can be applied. And a key idea in our Physics Project is that somehow it’s always applied in all these ways—leading to many separate threads of history, that branch and merge—and, importantly, giving us a way to understand quantum mechanics.

We talked about these different threads of history corresponding to different places in branchial space—and about how the laws of quantum mechanics are the direct analogs in branchial space (or branchtime) of the laws of classical mechanics (and gravity) in physical space (or spacetime). But what if instead of just applying a given rule in all possible ways, we applied all possible rules in all possible ways?

What would one get? In November 2021 I came up with a name for it: the ruliad. A year and a half earlier I’d already been starting to talk about rulial space—and the idea of us as observers perceiving the universe in terms of our particular sampling of rulial space. But naming the ruliad really helped to crystallize the concept. And I began to realize that I had come upon a breathtakingly broad intellectual arc.

The ruliad is the biggest computational thing there can be: it’s the entangled limit of all possible computations. It’s abstract and it’s unique—and it’s as inevitable in its structure as 2 + 2 = 4. It encompasses everything computational—including us. So what then is physics? Well, it’s a description of how observers like us embedded in the ruliad perceive the ruliad.

Back in 1984 I’d introduced what I saw as being the very central concept of computational irreducibility: the idea that there are many computational processes whose outcomes can be found only by following them step by step—with no possibility of doing what mathematical science was used to, and being able to “jump ahead” and make predictions without going through each step. At the beginning of the 1990s, when I began to work on A New Kind of Science, I’d invented the Principle of Computational Equivalence—the idea that systems whose behavior isn’t obviously simple will always tend to be equivalent in the sophistication of the computations they do.

Given the Principle of Computational Equivalence, computational irreducibility was inevitable. It followed from the fact that the observer could only be as computationally sophisticated as the system they were observing, and so would never be able to “jump ahead” and shortcut the computation. There’d come to be a belief that eventually science would always let one predict (and control) things. But here—from inside science—was a fundamental limitation on the power of science. All these things I’d known in some form since the 1980s, and with clarity since the 1990s.

But the ruliad took things to another level. For now I could see that the very laws of physics we know were determined by the way we are as observers. I’d always imagined that the laws of physics just are the way they are. But now I realized that we could potentially derive them from the inevitable structure of the ruliad, and very basic features of what we’re like as observers.

I hadn’t seen this philosophical twist coming. But somehow it immediately made sense. We weren’t getting our laws of physics from nothing; we were getting them from being the way we are. Two things seemed to be critical: that as observers we are computationally bounded, and that (somewhat relatedly) we believe we are persistent in time (i.e. we have a continuing thread of experience through time).

But even as I was homing in on the idea of the ruliad as it applied to physics, I was also thinking about another application: the foundations of mathematics. I’d been interested in the foundations of mathematics for a very long time; in fact, in the design of Mathematica (and what’s now the Wolfram Language) and its predecessor SMP, I’d made central use of ideas that I’d developed from thinking about the foundations of mathematics. And in A New Kind of Science, I’d included a long section on the foundations of mathematics, discussing things like the network of all possible theorems, and the space of all possible axiom systems.

But now I was developing a clearer picture. The ruliad represented not only all possible physics, but also all possible mathematics. And the actual mathematics that we perceive—like the actual physics—would be determined by our nature as observers, in this case mathematical observers. There were lots of technical details, and it wasn’t until March 2022 that I published “The Physicalization of Metamathematics and Its Implications for the Foundations of Mathematics”.

In some ways this finished what I’d started in the mid-1990s. But it went much further than I expected, in particular in providing a sweeping unification of the foundations of physics and mathematics. It talked about what the ultimate limit of mathematics would be like. And it talked about how “human-level mathematics”—where we can discuss things like the Pythagorean theorem rather than just the microdetails of underlying axioms—emerges for observers like us just like our human-level impression of physical space emerges from the underlying network of atoms of space.

One of the things I’d discovered in computational systems is how common computational irreducibility is, along with undecidability. And I had always wondered why undecidability wasn’t more common in typical mathematics. But now I had an answer: it just isn’t what mathematical observers like us “see” in the ruliad. At some level, this was a very philosophical result. But for me it also had practical implications, notably greatly validating the idea of using higher-level computational language to represent useful human-level mathematics, rather than trying to drill down to “axiomatic machine code”.

October 22, 2021 had marked a third of a century of Mathematica. And May 14, 2022 was the 20th anniversary of A New Kind of Science. And in contextualizing my activities, and planning for the future, I’ve increasingly found it useful to reflect on what I’ve done before, and how it’s worked out. And in both these cases I could see that seeds I’d planted many years earlier had blossomed, sometimes in ways I’d suspected they might, and sometimes in ways that far exceeded what I’d imagined.

What had I done right? The key, it seemed, was drilling down to find the essence of things, and then developing that. Even if I hadn’t been able to imagine quite what could be built on them, I’d been able to construct solid foundations, that successfully encapsulated things in the cleanest and simplest ways.

In talking about observers and the ruliad—and in fact our Physics Project in general—I kept on making analogies to the way that the gas laws and fluid dynamics emerge from the complicated underlying dynamics of molecules. And at the core of this is the Second Law of thermodynamics.

Well, as it happens, the very first foundational question in physics that I ever seriously studied was the origin of the Second Law. But that was when I was 12 years old, in 1972. For more than a century the Second Law had been quite mysterious. But when I discovered computational irreducibility in 1984 I soon realized that it might be the key to the Second Law. And in the summer of 2022—armed with a new perspective on the importance of observers—I decided I’d better once and for all write down how the Second Law works.

Once again, there were lots of technical details. And as a way to check my ideas I decided to go back and try to untangle the rather confused 150-year history of the Second Law. It was an interesting exercise, satisfying for seeing how my new ways of thinking clarified things, but cautionary in seeing how wrong turns had been taken—and solidified—in the past. But in the end, there it was: the Second Law was a consequence of the interplay between underlying computational irreducibility, and our limitations as observers.

It had taken half a century, but finally I had finished the project I’d started when I was 12 years old. I was on a roll finishing things. But I was also realizing that a bigger structure than I’d ever imagined was emerging. The Second Law project completed what I think is the most beautiful thing I’ve ever discovered. That all three of the core theories of twentieth century physics—general relativity, quantum mechanics and the Second Law (statistical mechanics)—have the same origin: the interplay between the underlying computational structure of the ruliad, and our characteristics and limitations as observers.

And I knew it didn’t stop there. I’d already applied the same kind of thinking to the foundations of mathematics. And I was ready to start applying it to all sorts of deep questions in science, in philosophy, and beyond. But at the end of 2022, just as I was finishing my pieces about the Second Law, there was a surprise: ChatGPT.

I’d been following AI and neural nets for decades. I first simulated a neural net in 1981. My first company, started in 1981, had, to my chagrin, been labeled an “AI company”. And from the early 2010s we’d integrated neural nets into the Wolfram Language. But—like the creators of ChatGPT—I didn’t expect the capabilities that emerged in ChatGPT. And as soon as I saw ChatGPT I started trying to understand it. What was it really doing? What would its capabilities be?

In the world at large, there was a sense of shock: if AI can do this now, soon it’ll be able to do everything. But I immediately thought about computational irreducibility. And it gave us limitations. But those limitations would inevitably apply to AIs as well. There would be things that couldn’t be “quickly figured out by pure thought”—by humans and AIs alike. And, by the way, I’d just spent four decades building a way to represent things computationally, and actually do systematic computations on them—because that was the point of the Wolfram Language.

So immediately I could see we were in a very interesting position. The Wolfram Language had the completely unique mission of creating a full-scale computational language. And now this was a crucial tool for AIs. The AIs could provide a very interesting and useful broad linguistic interface. But when it came to solid computation, they were—like humans—going to need a tool. Conveniently, Wolfram|Alpha already communicated in natural language. And it took only a few weeks to hook up Wolfram|Alpha—and Wolfram Language—to ChatGPT. We’d given “computational superpowers” to the AI.

ChatGPT was everywhere. And people kept asking me about it. And over and over again I ended up explaining things about it. So at the beginning of February 2023 I decided it’d be better for me just to write down once and for all what I knew. It took a little over a week (yes, I’m a fast writer)—and then I had an “explainer” (that ran altogether to 76 pages) of ChatGPT.

Partly it talked in general about how machine learning and neural nets work, and how ChatGPT in particular works. But what a lot of people wanted to know was not “how” but “why” ChatGPT works. Why was something like that possible? Well, in effect ChatGPT was showing us a new science discovery—about language. Everyone knows that there’s a certain syntactic grammar of language—like that, in English, sentences typically have the form noun-verb-noun. But what ChatGPT was showing us is that there’s also a semantic grammar—some pattern of rules for what words can be put together and make sense.

I’ve thought about the foundations of language for a long time (which isn’t too surprising, given the four decades I’ve spent as a computational language designer). So in effect I was well primed to think about its interaction with ChatGPT. And it also helped that—as I’ll talk about below—one of my long-unfinished projects is precisely on a formal framework for capturing meaning that I call “symbolic discourse language”.

In technology and other things I always like best situations where basically nothing is known, and one has to invent everything from scratch. And that’s what was happening for functionality based on LLMs in the middle of 2023. How would LLM-based Wolfram Language functions work? How would a prompt repository work? How would LLMs interact with notebooks?

Meanwhile, there was still lots of foment in the world about the “AI shock”. Before the arrival of the Physics Project in 2019—I’d been quite involved in AI philosophy, AI ethics, etc. And in March 2023 I wrote a piece on “Will AIs Take All Our Jobs and End Human History—or Not?” In the end—after all sorts of philosophical arguments, and an analysis of actual historical data—the answer was: “It’s Complicated”. But along the way computational irreducibility and the ruliad were central elements: limiting the controllability of AIs, allowing for an infinite frontier of invention, and highlighting the inevitable meaninglessness of everything in the absence of human choice.

By this point (and actually, with remarkable speed) my explainer on ChatGPT had turned into a book—that proved extremely popular (and now, for example, exists in over 10 languages). It was nice that people found the book useful—and perhaps it helped remove some of the alarming mystique of AI. But I couldn’t help noticing that of all the many things I’d written, this had been one of the fastest to write, yet it was garnering one of the largest readerships.

One might have imagined that AI was pretty far from our Physics Project, the ruliad, etc. But actually it soon became clear that there were close connections, and that there were things to learn in both directions. In particular, I’d come to think of minds that work in different ways as occupying different positions in the ruliad. But how could one get intuition about what such minds would experience—or observe? Well, I realized, one could just look at generative AI. In July I wrote “Generative AI Space and the Mental Imagery of Alien Minds”. I called this the “cats in hats piece”, because, yes, it has lots of pictures of (often bizarrely distorted) cats (in hats)—used as examples of what happens if one moves a mind around in rulial space. But despite the whimsy of the cats, this piece provided a surprisingly useful window into what for me has been a very longstanding question of how other minds might perceive things.

And this fed quite directly into my piece on “Observer Theory” in December 2023. Ever since things like Turing machines we’ve had a formal model for the process of computation. My goal was to do the same kind of thing for the process of observation. In a sense, computation constructs sequences of new things, say with time. Observation, on the other hand, equivalences things together, so they fit in finite minds. And just what equivalencing is done—by our senses, our measuring devices, our thinking—determines what our ultimate perceptions will be. Or, put another way, if we can characterize well enough what we’re like as observers, it’ll show us how we sample the ruliad, and what we’ll perceive the laws of physics to be.

When I started the Physics Project I wasn’t counting on it having any applications for hundreds of years. But quite soon it became clear that actually there were going to be all sorts of near-term applications, particularly of the formalism of multicomputation. And every time one used that formalism one could get more intuition about features of the Physics Project, particularly related to quantum mechanics. I ended up writing a variety of “ruliological” pieces, all, as it happens, expanding on footnotes in A New Kind of Science. There was “Multicomputation with Numbers” (October 2021), “Games and Puzzles as Multicomputational Systems” (June 2022) and “Aggregation and Tiling as Multicomputational Processes” (November 2023). And in September 2023 there was also “Expression Evaluation and Fundamental Physics”.

Back around 1980—when I was working on SMP—I’d become interested in the theory of expression evaluation. And finally, now, with the Physics Project—and my work on combinators and metamathematics—four decades later I had a principled way to study it (potentially with immediate application in distributed computing and computational language design around that). And I could check off progress on another long-pending project.

I give many talks, and do many podcasts and livestreams—essentially all unprepared. But in October 2023 I agreed to give a TED talk. And I just didn’t see any way to fit a reasonable snapshot of my activities into 18 minutes without preparation. How was I to coherently explain the Physics Project, the ruliad and computational language in such a short time? I called the talk “How to Think Computationally about AI, the Universe and Everything”. And I began with what for me was a new condensation: “Human language. Mathematics. Logic. These are all ways to formalize the world. And in our century there’s a new and yet more powerful one: computation.”

Over the years I’d done all sorts of seemingly very different projects in science and in technology. But somehow it seemed like they were now all converging. Back in 1979, for example, I’d invented the idea of transformations for symbolic expressions as a foundation for computational language. But now—more than four decades later—our Physics Project was saying that those kinds of transformations (specifically on hypergraphs) were just what the “machine code of the universe” was made of.

Since the 1980s I’d thought that computation was a useful paradigm with which to think about the world. But now our Physics Project and the ruliad were saying that it wasn’t just useful; it was the underlying paradigm of the world. For some time I’d been viewing our whole Wolfram Language effort as a way to provide a way to formalize computation for the purposes of both humans and machines. Four hundred years ago mathematical notation had streamlined mathematical thinking, allowing what became the mathematical sciences to develop. I saw what we were doing with our computational language as a way to streamline computational thinking, and allow “computational X” for all fields “X” to develop.

I began to see computational thinking as a way to “humanize” the ruliad; to pick out those parts that are meaningful to humans. And I began to see computational language as the bridge between the power of raw computation, and the kinds of things we humans think about.

But how did AI fit in? At the beginning of 2024, lots of people were still asking in effect “Can AI Solve Science?” So I decided to analyze that. I certainly didn’t expect AI to be able to “break computational irreducibility”. And it didn’t. Yes, it could automate much of what humans could do in a quick look. But formalized, irreducible computation: that was going to need computational language, not AI.

It’s easy to be original in the computational universe: if you pick a rule at random, it’s overwhelmingly likely nobody’s ever looked at it before. But will anyone care? They’ll care if in effect that part of the ruliad has been “colonized”; if there’s already a human connection to it. But what if you define some attribute that you want, then just “search out there” for a rule that exhibits it? That’s basically what biological evolution—or machine learning training—seems to do.

And as a kind of off-hand note I decided to just see if I could make a minimal model for that. I’d tried before—in the mid-1980s. And in the 1990s when I was writing A New Kind of Science I’d become convinced that computational irreducibility was in a sense a stronger force than adaptive evolution, and that when complex behavior was seen in biology, it was computational irreducibility that should take most of the credit.

But I decided to just do the experiment and see. And although computational irreducibility in a sense tells one to always “expect the unexpected”, in all these years I’ve never fully come to terms with that—and I’m still regularly surprised by what simple systems somehow “cleverly” manage to do. And so it was with my minimal model of biological evolution.

I’d always wondered why biological evolution managed to work at all, why it didn’t “get stuck”, and how it managed to come up with the ornate “solutions” it did. Well, now I knew: and it turned out it was, once again, a story of computational irreducibility. And I’d managed to finish another project that I started in the 1980s.

But then there was machine learning. And despite all the energy around it—as well as practical experience with it—it didn’t seem like there was a good foundational understanding of what it was doing or why it worked. For a couple of years I’d been asking all the machine learning experts I ran into what they knew. But mostly they confirmed that, yes, it wasn’t well understood. And in fact several of them suggested that I’d be the best person to figure it out.

So just a few weeks ago, starting with ideas from the biological evolution project, and mixing in some things I tried back in 1985, I decided to embark on exploring minimal models of machine learning. I just posted the results last week. And, yes, one seems to be able to see the essence of machine learning in systems vastly simpler than neural nets. In these systems one can visualize what’s going on—and it’s basically a story of finding ways to put together lumps of irreducible computation to do the tasks we want. Like stones one might pick up off the ground to put together into a stone wall, one gets something that works, but there’s no reason for there to be any understandable structure to it.

Like so many of the projects I’ve done in the past five years, I could in principle have done this project much earlier—even in the 1980s. But back then I didn’t have the intuition, the tools or the intellectual confidence to actually dive in and get the project done. And what’s been particularly exciting over the past five years is that I can feel—and even very tangibly see—how what I can do has grown. With every project I’ve done I’ve further honed my intuition, developed more tools (both conceptual and practical), and built my intellectual confidence. Could I have gotten here earlier in my life? I don’t think so. I think to get to where I am now required the kind of journey I’ve taken through science, technology and the other things I’ve done. A living example of the phenomenon of computational irreducibility.

The Process of Getting Things Done

I started my career young—and usually found myself the “youngest person in the room”. But shockingly fast all those years whizzed by, and now I’m usually the “oldest person in the room”. But somehow I always still seem to feel like a young whippersnapper—not settled into some expected pattern, and “pushing for the future”.

I’ve always done projects that are hard. Projects that many people thought were impossible. Projects that stretched my capabilities to the limit. And to do this has required a certain mixture of confidence and humility. Confidence that it’s worth me trying the project. Humility in not assuming that it’ll be easy for me.

I’ve learned a lot of fields by now, and with them a lot of different ways of thinking. But somehow it’s never enough to make the projects I do easy. Somehow the projects are always far enough out on the frontier that I have to learn new things and new ways of thinking to succeed at them. And so there I am, often the only person in the room whose project isn’t somehow easy for them. And who still has to be pushing, whippersnapper style.

At this point, a fair fraction of the projects I do are ones that I’ve thought about for a long time; a smaller fraction are opportunistic—coming into scope just now as a result of something I’ve done, or something that’s happened in the world at large. Before the past five years I had a lot of projects that had languished, often for decades. Yes, I thought they would be interesting, and I gradually collected information about them. But somehow I wasn’t quite in a place to tackle them.

But now I feel quite differently. In the past five years, I’ve gone back and finished a fair fraction of all those languishing projects. And it’s been great. Without exception, the projects turned out to be richer and more interesting than I expected. Often I realized I really couldn’t have done them without the tools and ideas (and infrastructure) I now have. And—often to my great surprise—the projects turned out to have very direct connections to big themes around the ruliad, the Physics Project and, for that matter, computational language.

Why was this happening? Partly it’s a tribute to the breadth of the computational (and now multicomputational) paradigm. But partly it has to do with the specific character of projects I was choosing—always seeking what seemed like the simplest, most foundational versions of things.

I’ve done quite a few big projects in my life, many seemingly very different. But as I look back, I realize that all my projects have a certain overall pattern to them. They’re all about taking something that seems complicated, then drilling down to find the foundations of what’s going on, and then building up from these—often with considerable engineering-style effort. And the methods and tools I’ve developed have in a sense implicitly been optimized for this pattern of work.

I suppose one gets used to the rhythm of it all. The time when one’s drilling down, slowly trying to understand things. The time when one’s doing all the work to build the big structure up. And yes, it’s all hard. But by now I know the signs of progress, and they’re always energizing to see.

At any given time, I’ll have many projects gestating—often for years or decades. But once a project becomes active, it’s usually the only one I’m working on. And I’ll work on it with great intensity, pushing hard to keep going until it’s done. Often I’ll be working with other people, usually much younger than me. And I think it’s always a surprise that I’ll routinely be the one who works with the greatest intensity—every day, at all hours.

I think I’m pretty efficient too. Of course, it helps that I have a tool—Wolfram Language—that I’ve been building for decades to support me. And it helps that I’ve developed all kinds of practices around how I organize code and notebooks I create, and how I set up my process of writing about things. Of course, it also helps that I have very capable people around me to make suggestions, explore additional directions, fill in details, check things, and get my write-ups produced and published.

As I have written about elsewhere, my life is in many ways set up to be quite simple and routine. I get up at the same time every day, eat the same thing for breakfast, and so on. But in a sense this frees me to concentrate on the intellectual things I’m doing—which are different every day, often in unexpected ways.

But how is it that I even get the time to do all these intellectual things? After all, I am—as I have been for the past 38 years—the CEO of a very active tech company. Two things I think help (in addition, of course, to the fact that I have such a great long-term team at the company). First, organization. And second, resolve. Every day I’ll have tightly scheduled meetings over the course of the working day. (And there are lots of details to this. I get up in the late morning, then do my first two meetings while walking, and so on.) But somehow—mostly on evenings and weekends—I find time to work intensely on my intellectual projects.

It’s not as if I ignore everything else in the world. But I do have a certain drive—and resolve—that fills any time available with my projects, and somehow seems to succeed in getting them done. (And, yes, there are many optimizations in the details of my life, saving me all sorts of time. And it probably helps that I’ve been a work-from-home CEO now for 33 years.)

One might have thought that CEOing would greatly detract from being able to do intellectual work. But I find the exact opposite. Because in my experience the discipline of strategy and decision making (as well as communicating thoughts and ideas to other people) that comes with CEOing is critical to being able to do incisive intellectual work. And, by the way, the kind of thinking that goes with intellectual work is also incredibly valuable in being an effective CEO.

There’s another critical part to my “formula”. And that has to do with exposition. For me, the exposition of a project is an integral part of the project. Part of it is that the very definition of the question is often one of the most important parts of a project. But more than that, it’s through exposition that I find I really understand things. It takes a certain discipline. It can be easy enough to make some highfalutin technical statement. But can one grind it down into truly simple pieces that one can immediately understand? Yes, that means other people will be able to understand it too. But for me, what’s critical is that that’s the way I can tell if I’m getting things right. And for me the exposition is what in the end defines the backbone of a project.

Normally I write quickly, and basically without revision. But whenever there’s a piece I’m finding unduly hard to write I know that’s where I’m muddled, and need to go back and understand what’s going on. Some of my projects (like creating this piece, for example) end up being essentially “pure writing”. But most are deeply computational—and full of computer experiments. And just as I put a lot of effort into making written exposition clear, I do the same for computational language, and for pictures. Indeed, many of my projects are in large measure driven by pictures. Usually these are what one can think of as “algorithmic diagrams”—created automatically with a structure optimized for exposition.

And the pictures aren’t just useful for presenting what I’ve done; they’re also critical to my own efforts to figure things out. And I’ve learned that it’s important to get the presentational details of pictures right as early as possible in a project—to give myself the best chance to notice things.

Often the projects I do require exploring large numbers of possible systems. And somehow with great regularity this leads to me ending up looking at large arrays of little pictures. Yes, there’s a lot of “looking” that can be automated. But in the end computational irreducibility means there’ll always be the unexpected, that I basically have to see for myself.

A great thing about the Wolfram Language is that it’s been very stable ever since it was first released. And that means that I can take notebooks even from the 1980s and immediately run them today. And, yes, given all the “old” projects I’ve worked on in the past five years, that’s been very important.

But in addition to being very stable, the Wolfram Language is also very self contained—and very much intended to be readable by humans. And the result is something that I’ve found increasingly important: every computational picture in everything I write has Wolfram Language code “behind it”, that you can get by clicking. All the time I find myself going back to previous things I’ve written, and picking up click-to-copy code to run for some new case, or use as the basis for something new I’m doing.

And of course that click-to-copy code is open for anyone to use. Not only for its “computational content”, but also for the often-elaborate visuals it implements.

Most of my writings over the past five years have been about new basic science. But interspersed with this—along with pieces about technology and about philosophy—are pieces about history. And in fact many of my scientific pieces have had extensive historical sections as well.

Why do I put such effort into history? Partly I just find it fun to figure out. But mostly it’s to contextualize my understanding of things. Particularly in the past five years I’ve ended up working on a whole sequence of projects that are in a sense about changing longstanding directions in science. And to feel confident about making such changes, one has to know why people went in those directions in the first place. And that requires studying history.

Make no mistake: history—or at least good history—is hard. Often there’ll be a standard simple story about how some discovery was suddenly made, or how some direction was immediately defined. But the real story is usually much more complicated—and much more revealing of the true intellectual foundations of what was figured out. Almost never did someone discover something “one day”; almost always it took many years to build up the conceptual framework so that “one day” the key thing could even be noticed.

When I do history I always make a big effort to look at the original documents. And often I realize that’s critical—because it’s only with whatever new understanding I’ve developed that one would stand a chance of correctly interpreting what’s in the documents. And even if one’s mainly interested in the history of ideas, I’ve always found it’s crucial to also understand the people who were involved with them. What was their motivation? What was their practical situation? What kinds of things did they know about? What was their intellectual style in thinking about things?

It has helped me greatly that I’ve had my own experiences in making discoveries—that gives me an intuition for how the process of discovery works. And it also helps that I’ve had my fair share of “worldly” experiences. Still, often it’s at first a mystery how some idea developed or some discovery got made. But my consistent experience is that with enough effort one can almost always solve it.

Particularly for the projects I’ve done in recent years, it often leaves me with a strange feeling of connection. For in many cases I find out that the things I’ve now done can be viewed as direct follow-ons to ideas that were thought about a century or more ago, and for one reason or another ignored or abandoned since.

And I’m then usually left with a strong sense of responsibility. An idea that was someone’s great achievement had been buried and lost to the world. But now I have found it again, and it rests on me to bring it into the future.

In addition to writing about “other people’s history”, I’ve also been writing quite a bit about my own history. And in the last few years I’ve made a point of explaining my personal history around the science—and technology—I describe. In doing this, it helps a lot that I have excellent personal archives—that routinely let me track to within minutes discoveries I made even four decades ago.

My goal in describing my own history is to help other people contextualize things I write about. But I have to say that time and time again I’ve found the effort to piece together my own history extremely valuable just for me. As I go through life, I try to build up a repertoire of patterns for how things I do fit together. But often those patterns aren’t visible at the time. And it takes going back—often years later—to see them.

I do the projects I do first and foremost for myself. But I’ve always liked the idea that other people can get their own pleasure and benefit from my projects. And—basically starting with the Physics Project—I’ve tried to open to the world not just the results of my projects, but the process by which they’re done.

I post my working notebooks. Whenever practical I livestream my working meetings. And, perhaps taking things to an extreme, I record even my own solitary work, posting it in “video work logs”. (Except I just realized I forgot to record the writing I’m doing right now!)

A couple of years before the Physics Project I actually also opened up my technology development activities—livestreaming our software design reviews, in the past five years 692 hours of them. (And, yes, I put a lot of work and effort into designing the Wolfram Language!)

At the beginning of the pandemic I thought: “There are all these kids out of school. Let me try to do a little bit of public service and livestream something about science and technology for them.” And that’s how I started my “Science & Technology Q&A for Kids & Others” livestreams, that I’ve now been doing for four and a half years. Along the way, I’ve added “History of Science & Technology Q&A”, “Future of Science & Technology Q&A”, and “Business, Innovation & Managing Life Q&A”. Altogether I’ve done 272 hours of these, that have generated 376 podcast episodes.

Twice a week I sit down in front of a camera, watch the feed of questions, and try to answer them. It’s always off the cuff, completely unprepared. And I find it a great experience. I can tell that over the time I’ve been doing this, I’ve become a better and more fluent explainer, which no doubt helps my written exposition too. Often in answering questions I’ll come up with a new way to explain something, that I’ve never thought of before. And often there’ll be questions that make me think about things I’ve never thought about at all before. Indeed, several of my recent projects actually got started as a result of questions people asked.

When I was younger I always just wanted to get on with research, create things, and so on; I wasn’t interested in education. But as I’ve gotten older I’ve come to really like education. Partly it’s because I feel I learn a lot myself from it, but mostly it’s because I find it fulfilling to use what I know and try to help people develop.

I’ve always been interested in people—a useful attribute in running a talent-rich company for four decades. (I’m particularly interested in how people develop through their lives—leading me recently, for example, to organize a 50-year reunion for my elementary school class.) I’ve had a long-time “hobby” of mentoring CEOs and kids (both being categories of people who tend to believe that anything is possible).

But my main educational efforts are concentrated in a few weeks of the year when we do our Wolfram Summer School (started in 2003) and our Wolfram High School Summer Research Program (started in 2012). All the students in these programs (775 of them over the past five years) do an original project, and one of my jobs is to come up with what all these projects should be. Over the course of the year I’ll accumulate ideas—though rather often when I actually meet a student I’ll invent something new.

I obviously do plenty of projects myself. But it’s always an interesting—and invigorating—experience to see so many projects get done with such intensity at our summer programs. Plus, I get lots of extra practice in framing projects that helps when I come to frame my own projects.

At this point, I’ve spent years trying to organize my life to optimize it for what I want to get out of it. I need long stretches of time when I can concentrate coherently. But I like having a diversity of activities, and I’m pretty sure I wouldn’t have the energy and effectiveness I do without that. Over the years, I’ve added in little pieces. Like my weekly virtual sessions where I “do my homework” with a group of kids, working on something that I need to get done, but that doesn’t quite fit elsewhere. Or my weekly sessions with local kids, talking about things that make me and them think. Or, for that matter, my “call while driving” list of calls it’s good to make, but wouldn’t usually quite get the priority to happen.

Doing all the things I do is hard work. But it’s what I want to do. Yes, things can drag from time to time. But at this point I’m so used to the rhythm of projects that I don’t think I notice much. And, yes, I work basically every hour of every day I can. Do I have hobbies? Well, back when I was an academic, business was my main “hobby”. When I started CEOing, science became a “hobby”. Writing. Education. Livestreaming. These were all “hobbies” too. But somehow one of the patterns of my life is that nothing really stays quite as a “true hobby”.

What’s Next?

The past five years have not only been my most productive ever, but they’ve also built more “productivity momentum” than I’ve had before. So, what’s next? I have a lot of projects currently “in motion”, or ready to “get into motion”. Then I have many more that are in gestation, for which the time may finally have come. But I know there’ll also be surprises: projects that suddenly occur to me, or that I suddenly realize are possible. And one of the great challenges is to be in a position to actually jump into such things.

It has to be said that there’s always a potentially complicated tradeoff. To what extent should one “tend” the things one’s already done, and to what extent should one do new things? Of course, there are some things that are never “done”—like the Wolfram Language, which I started building 38 years ago, and still (energetically) work on every day. Or the Physics Project, where there’s just so much to figure out. But one of the things that’s worked well in most of the basic science projects I’ve done in the past five years or is that once I’ve written my piece about the project, I can usually consider the project “done for now”. It always takes a lot of effort to get a project to the point where I can write about it. But I work hard to make sure I only have to do it once; that I’ve “picked the low-hanging fruit”, so I don’t feel I have to come back “to add a little more”.

I put a lot of effort into the pieces I write about my projects. And I also give talks, do interviews, etc. (about 500 altogether in the past five years). But I certainly don’t “market” my efforts as much as I could. It’s a decision I’ve made: that at this point in my life—particularly with the burst of productivity I’m experiencing—I want to spend as much of my time as possible doing new things. And so I need to count on others to follow up and spread knowledge about what I’ve done, whether in the academic world, on Wikipedia, the web, etc. (And, yes, pieces I write and the pictures they contain are set up to be immediately reproducible wherever appropriate.)

OK, so what specific new things are currently in my pipeline? Well, there’s lots of science (and related intellectual things). And there’s also lots of technology. But let’s talk about science first.

A big story is the Physics Project—where there’s a lot to be done, in many different directions. There’s foundational theory to be developed. And there are experimental implications to be found.

It’d be great if we could find experimental evidence of the discreteness of space, or maximum entanglement speed, or a host of other unexpected phenomena in our models. A century or so ago it was something of a stroke of luck that atoms were big enough that they could be detected. And we don’t know if the discreteness of space is something we’ll be able to detect now—or only centuries from now.

There are phenomena—particularly associated with black holes—that might effectively serve as powerful “spacetime microscopes”. And there are phenomena like dimension fluctuations that could potentially show up in a variety of astrophysical settings. But one direction I’m particularly interested in exploring is what one might call “spacetime heat”—the effect of detailed microscopic dynamics in the hypergraph that makes up spacetime. Could “dark matter”, for example, not be “matter” at all, but instead be associated with spacetime heat?

Part of investigating this involves building practical simulation software to investigate our models on as large a scale as possible. And part of it involves “good, old-fashioned physics”, figuring out how to go from underlying foundational effects to observable phenomena.

And there’s a foundational piece to this too. How does one set up mathematics—and mathematical physics—when one’s starting from a hypergraph? A traditional manifold is ultimately built up from Euclidean space. But what kind of object is the limit of a hypergraph? To understand this, we need to construct what I’m calling infrageometry—and infracalculus alongside it. Infrageometry—as its name suggests—starts from something lower level than traditional geometry. And the challenge is in effect to build a “21st century Euclid”, then Newton, etc.—eventually finding generalizations of things like differential geometry and algebraic topology that answer questions like what 3-dimensional curvature tensors are like, or how we might distinguish local gauge degrees of freedom from spatial ones in a limiting hypergraph.

Another direction has to do with particles—like electrons. The fact is that existing quantum field theory in a sense only really deals with particles indirectly, by thinking of them as perturbations in a field—which in turn is full of (usually unobservable) zero-point fluctuations. In our models, the structure of everything—from spacetime up—is determined by the “fluctuating” structure of the underlying hypergraph (or, more accurately, by the whole multiway graph of “possible fluctuations”). And what this suggests is that there’s in a sense a much lower level version of the Feynman diagrams we use in quantum field theory and where we can discuss the “effect of particles” without ever having to say exactly what a particle “is”.

I must say that I expected we’d have to know what particles were even to talk about energy. But it turned out there was a “bulk” way to do that. And maybe similarly there’s an indirect way to talk about interactions between particles. My guess is that in our model particles are structures a bit like black holes—but we may be able to go a very long way without having to know the details.

One of the important features of our models is that quantum mechanics is “inevitable” in them. And one of the projects I’m hoping to do is to finally “really understand quantum mechanics”. In general terms, it’s connected to the way branching observers (like us) perceive branching universes. But how do we get intuition for this, and what effects can we expect? Several projects over the past years (like multiway Turing machines, multiway games, multiway aggregation, etc.) I’ve done in large part to bolster my intuition about branchial space and quantum mechanics.

I first worked on quantum computers back in 1980. And at the time, I thought that the measurement process (whose mechanism isn’t described in the standard formalism of quantum mechanics) would be a big problem for them. Years have gone by, and enthusiasm for quantum computers has skyrocketed. In our models there’s a rather clear picture that inside a quantum computer there are “many threads of history” that can in effect do computations in parallel. But for an observer like us to “know what the answer is” we have to knit those threads together. And in our models (particularly with my observer theory efforts) we start to be able to see how that might happen, and what the limitations might be.

Meanwhile, in the world at large there are all sorts of experimental quantum computers being built. But what are their limitations? I have a suspicion that there’s some as-yet-unknown fundamental physics associated with these limitations. It’s like building telescopes: you polish the mirror, and keep on making engineering tweaks. But unless you know about diffraction, you won’t understand why your resolution is limited. And I have a slight hope that even existing results on quantum computers may be enough to see limitations perhaps associated with maximum entanglement speed in our models. And the way our models work, knowing this speed, you can for example immediately deduce the discreteness scale of space.

Back in 1982, I and another physicist wrote two papers on “Properties of the Vacuum”. Part 1 was mechanical properties. Part 2 was electrodynamic. We announced a part 3, on gravitational properties. But we never wrote it. Well, finally, it looks as if our Physics Project shows us how to think about such properties. So perhaps it’s time to finally write “Part 3”, and respond to all those people who sent preprint request cards for it four decades ago.

One of the great conclusions of our Physics Project—and the concept of the ruliad—is that we have the laws of physics we do because we are observers of the kind we are. And just knowing very coarsely about us as observers seems to already imply the major laws of twentieth century physics. And to be able to say more, I think we need more characterization of us as observers. And my guess is, for example, that some feature of us that we probably consider completely obvious is what leads us to perceive space as (roughly) three dimensional. And indeed I increasingly suspect that the whole structure of our Physics Project can be derived—a bit like early derivations of special relativity—from certain axiomatic assumptions about our nature as observers, and fundamental features of computation.

There’s plenty to do on our Physics Project, and I’m looking forward to making progress with all of it. But the ideas of the Physics Project—and multicomputation in general—apply to lots of other fields too. And I have many projects planned on these.

Let’s talk first about chemistry. I never found chemistry interesting as a kid. But as we’ve added chemistry functionality in the Wolfram Language, I’ve understood more about it, and why it’s interesting. And I’ve also followed molecular computing since the 1980s. And now, largely inspired by thinking about multicomputation, I’ve become very interested in what one might call the foundations of chemistry. Actually, what I’m most interested in is what I’m calling “subchemistry”. I suppose one can think of it as having a similar kind of relation to chemistry as infrageometry has to geometry.

In ordinary chemistry, one thinks about reactions between different species of molecules. And to calculate rates of reactions, one multiplies concentrations of different species, implicitly assuming that there’s perfect randomness in which specific molecules interact. But what if one goes to a lower level, and starts talking about the interactions not of species of molecules, but individual molecules? From our Physics Project we get the idea of making causal graphs that represent the causal relations between different specific interaction events.

In a gas the assumption of molecular-level randomness will probably be pretty good. But even in a liquid it’ll be more questionable. And in more exotic materials it’ll be a completely different story. And I suspect that there are “subchemical” processes that can potentially be important, perhaps in a sense finding a new “slice of computational reducibility” within the general computational irreducibility associated with the Second Law.

But the most important potential application of subchemistry is in biology. If we look at biological tissue, a basic question might be: “What phase of matter is it?” One of the major takeaways from molecular biology in the last few decades has been that in biological systems, molecules (or at least large ones) are basically never just “bouncing around randomly”. Instead, their motion is typically carefully orchestrated.

So when we look at biological tissue—or a biological system—we’re basically seeing the result of “bulk orchestration”. But what are the laws of bulk orchestration? We don’t know. But I want to find out. I think the “mechanoidal phase” that I identified in studying the Second Law is potentially a good test case.

If we look at a microprocessor, it’s not very useful to describe it as “containing a gas of electrons”. And similarly, it’s not useful to describe a biological cell as “being liquid inside”. But just what kind of theory is needed to have a more useful description we don’t know. And my guess is that there’ll be some new level of abstraction that’s needed to think about this (perhaps a bit like the new abstraction that was needed to formulate information theory).

Biology is not big on theory. Yes, there’s natural selection. And there’s the digital nature of biomolecules. But mostly biology has ended up just accumulating vast amounts of data (using ever better instrumentation) without any overarching theory. But I suspect that in fact there’s another foundational theory to be found in biology. And if we find it, a lot of the data that’s been collected will suddenly fall into place.

There’s the “frankly molecular” level of biology. And there’s the more “functional” level. And I was surprised recently to be able to find a very minimal model that seems to capture “functional” aspects of biological evolution. It’s a surprisingly rich model, and there’s much more to explore with it, notably about how different “ideas” get propagated and developed in the process of adaptive evolution—and what kinds of tree-of-life-style branchings occur.

And then there’s the question of self replication—a core feature of biology. Just how simple a system can exhibit it in a “biologically relevant way”? I had thought that self replication was “just relevant for biology”. But in thinking about the problem of observers in the ruliad, I’ve come to realize that it’s also relevant at a foundational level there. It’s no good to just have one observer; you have to have a whole “rulial flock” of similar ones. And to get similar ones you need something like self replication.

Talking of “societies of observers” brings me to another area I want to study: economics. How does a coherent economic system emerge from all the microscopic transactions and other events in a society? I suspect it’s a story that’s in the end similar to the theories we’ve studied in physics—from the emergence of bulk properties in fluids, to the emergence of continuum spacetime, and so on. But now in economics we’re dealing not with fluid density or metric, but instead with things like price. I don’t yet know how it will work out. Maybe computational reducibility will be associated with value. Maybe computational irreducibility will be what determines robustness of value. But I suspect that there’s a way of thinking about “economic observers” in the ruliad—and figuring out what “natural laws” they’ll “inevitably observe”. And maybe some of those natural laws will be relevant in thinking about the kind of questions we humans care about in economics.

It’s rather amazing in how many different areas one seems to be able to apply the kind of approach that’s emerged from the Physics Project, the ruliad, etc. One that I’ve very recently tackled is machine learning. And in my effort to understand its foundations, I’ve ended up coming up with some very minimal models. My purpose was to understand the essence of machine learning. But—somewhat to my surprise—it looks as if these minimal models can actually be practical ways to do machine learning. Their hardware-level tradeoffs are somewhat different. But—given my interest in practical technology—I want to see if one can build out a practical machine-learning framework that’s based on these (fundamentally discrete) models.

And while I’m not currently planning to investigate this myself, I suspect that the approach I’ve used to study machine learning can also be applied to neuroscience, and perhaps to linguistics. And, yes, there’ll probably be a lot of computational irreducibility in evidence. And once again one has to hope that the pockets of computational reducibility that exist will give rise to “natural laws” that are useful for what we care about in these fields.

In addition to these “big” projects, I’m also hoping to do a variety of “smaller” projects. Many I started decades ago, and in fact mentioned in A New Kind of Science. But now I feel I have the tools, intuition and intellectual momentum to finally finish them. Nestedly recursive functions. Deterministic random tilings. Undecidability in the three-body problem. “Meta-engineering” in the Game of Life. These might on their own seem esoteric. But my repeated experience—particularly in the past five years—is that by solving problems like these one builds examples and intuition that have surprisingly broad application.

And then there are history projects. Just what did happen to theories of discrete space in the early twentieth century (and how close did people like Einstein get to the ideas of our Physics Project)? What was “ancient history” of neural nets, and why did people come to assume they should be based on continuous real numbers? I fully expect that as I investigate these things, I’ll encounter all sorts of “if only” situations—where for example some unpublished note languishing in an archive (or attic) would have changed the course of science if it had seen the light of day long ago. And when I find something like this, it’s yet more motivation to actually finish those projects of mine that have been languishing so long in the filesystem of my computer.

There’s a lot I want to do “down in the computational trenches”, in physics, chemistry, biology, economics, etc. But there are also things at a more abstract level in the ruliad. There’s more to study about metamathematics, and about how mathematics that we humans care about can emerge from the ruliad. And there are also foundational questions in computer science. P vs. NP, for example, can be formulated as an essentially geometric problem in the ruliad—and conceivably there are mathematical methods (say from higher category theory) that might give insight into it.

Then there are questions about hyperruliads and hyporuliads. In a hyperruliad that’s based on hypercomputation, there will be hyperobservers. But is there a kind of “rulial relativity” that makes their perception of things just the same as “ordinary observers” in the ordinary ruliad? A way to get some insight into this may be to study hyporuliads—versions of the ruliad in which there are only limited levels of computation possible. A bit like the way a spacelike singularity associated with a black hole supports only limited time histories, or a decidable axiomatic theory supports only proofs of limited length, there will be limitations in the hyporuliad. And by studying them, there’s a possibility that we’ll be able to see more about issues like what kinds of mathematical axioms can be compatible with observers like us.

It’s worth commenting that our Physics Project—and the ruliad—have all sorts of connections and resonances with long-studied ideas in philosophy. “Didn’t Kant talk about that? Isn’t that similar to Leibniz?”, etc. I’ve wanted to try to understand these historical connections. But while I’ve done a lot of work on the historical development of ideas, the ideas in question have tended to be more focused, and more tied to concrete formalism than they usually are in philosophy. “Did Kant actually mean that, or something completely different?” You might have to understand all his works to know. And that’s more than I think I can do.

I invented the concept of the ruliad as a matter of science. But it’s now clear that the ruliad has all sorts of connections and resonances not only with philosophy but also with theology. Indeed, in a great many belief systems there’s always been the idea that somehow in the end “everything is one”. In cases where this gets slightly more formalized, there’s often some kind of combinatorial enumeration involved (think: I Ching, or various versions of “counting the names of God”).

There are all sorts of examples where long-surviving “ancient beliefs” end up having something to them, even if the specific methods of post-1600s science don’t have much to say about them. One example is the notion of a soul, which we might now see as an ancient premonition of the modern notion of abstract computation. And whenever there’s a belief that’s ancient, there’s likely to have been lots of thinking done around it over the millennia. So if we can, for example, see a connection to the ruliad, we can expect to leverage that thinking. And perhaps also be able to provide new input that can refine the belief system in interesting and valuable ways.

I’m always interested in different viewpoints about things—whether from science, philosophy, theology, wherever. And an extreme version of this is to think about how other “alien” minds might view things. Nowadays I think of different minds as effectively being at different places in the ruliad. Humans with similar backgrounds have minds that are close in rulial space. Cats and dogs have minds that are further away. And the weather (with its “mind of its own”) is still further.

Now that we have AIs we potentially have a way to study the correspondence—and communication—between “different minds”. I looked at one aspect of this in my “cats” piece. But my recent work on the foundations of machine learning suggests a broader approach, that can also potentially tell us things about the fundamental character of language, and about how it serves as a medium that can “transport thoughts” from one mind to another.

Many non-human animals seem to have at least some form of language—though mostly in effect just a few standalone words. But pretty unquestionably the greatest single invention of our species is language—and particularly compositional language where words and phrases can fit together in an infinite number of ways. But is there something beyond compositional language? And, for example, where might we get if our brains were bigger?

With the 100 billion neurons in our brains, we seem to be able to handle about 50,000 words. If we had a trillion neurons we’d probably be able to handle more words (though perhaps more slowly), in effect letting us describe more things more easily. But what about something fundamentally beyond compositional language? Something perhaps “higher order”?

With a word we are in effect conflating all instances of a certain concept into a single object that we can then work with. But typically with ordinary words we’re dealing with what we might call “static concepts”. So what about “ways of thinking”, or paradigms? They’re more like active, functional concepts. And it’s a bit like dogs versus us: dogs deal with a few standalone words; we “package” those together into whole sentences and beyond. And at the next level, we could imagine in effect packaging things like generators of meaningful sentences.

Interestingly enough, we have something of a preview of ideas like this—in computational language. And this is one of those places where my efforts in science—and philosophy—start to directly intersect with my efforts in technology.

The foundation of the Wolfram Language is the idea of representing everything in computational terms, and in particular in symbolic computational terms. And one feature of such a representation is that it can encompass both “data” and “code”—i.e. both things one might think about, and ways one might think about them.

I first started building Wolfram Language as a practical tool—though one very much informed by my foundational ideas. And now, four decades later, the Wolfram Language has emerged as the largest single project of my life, and something that, yes, I expect to always put immense effort into. It wasn’t long ago that we finally finished my 1991 to-do list for Wolfram Language—and we have many projects running now that will take years to complete. But the mission has always remained the same: to take the concept of computation and apply it as broadly as possible, through the medium of computational language.

Now, however, I have some additional context for that—viewing computational language as a bridge from what we humans think about to what’s possible in the computational universe. And this helps in framing some of the ways to expand the foundations of our computational language, for example to multicomputation, or to hypergraph-based representations. It also helps in understanding the character of current AI, and how it needs to interact with computational language.

In the Wolfram Language we’ve been steadily trying to create a representation for everything. And when it comes to definitive, objective things we’ve gotten a long way. But there’s more than that in everyday discourse. For example, I might say “I’m going to drink a glass of orange juice.” Well, we do just fine at representing “a glass of orange juice” in the Wolfram Language, and we can compute lots of things—like nutrition content—about it. But what about “I’m going to drink…”? For that we need something different.

And, actually, I’ve been thinking for a shockingly long time about what one might need. I first considered the question in the early 1980s, in connection with “extending SMP to AI”. I learned about the attempts to make “philosophical languages” in the 1600s, and about some of the thinking around modern conlangs (constructed languages). Something that always held me back, though, was use cases. Yes, I could see how one could use things like this for tasks like customer service. But I wasn’t too excited about that.

But finally there was blockchain, and with it, smart contracts. And around 2015 I started thinking about how one might represent contracts in general not in legalese but in some precise computational way. And the result was that I began to crispen my ideas about what I called “symbolic discourse language”. I thought about how this might relate to questions like a “constitution for AIs” and so on. But I never quite got around to actually starting to design the specifics of the symbolic discourse language.

But then along came LLMs, together with my theory that their success had to do with a “semantic grammar” of language. And finally now we’ve launched a serious project to build a symbolic discourse language. And, yes, it’s a difficult language design problem, deeply entangled with a whole range of foundational issues in philosophy. But as, by now at least, the world’s most experienced language designer (for better or worse), I feel a responsibility to try to do it.

In addition to language design, there’s also the question of making all the various “symbolic calculi” that describe in appropriately coarse terms the operation of the world. Calculi of motion. Calculi of life (eating, dying, etc.). Calculi of human desires. Etc. As well as calculi that are directly supported by the computation and knowledge in the Wolfram Language.

And just as LLMs can provide a kind of conversational linguistic interface to the Wolfram Language, one can expect them also to do this to our symbolic discourse language. So the pattern will be similar to what it is for Wolfram Language: the symbolic discourse language will provide a formal and (at least within its purview) correct underpinning for the LLM. It may lose the poetry of language that the LLM handles. But from the outset it’ll get its reasoning straight.

The symbolic discourse language is a broad project. But in some sense breadth is what I have specialized in. Because that’s what’s needed to build out the Wolfram Language, and that’s what’s needed in my efforts to pull together the foundations of so many fields.

And in maintaining a broad range of interests there are some where I imagine that someday there’ll be a project I can do, but there may for example be many years of “ambient technology” that are needed before that project will be feasible. Usually, though, I have some “conceptual idea” of what the project might be. For example, I’ve followed robotics, imagining that one day there’ll be a way to do “general-purpose robotics”, perhaps constructing everything out of modular elements. I’ve followed biomedicine, partly out of personal self interest, and partly because I think it’ll relate to some of the foundational questions I’m asking in biology.

But in addition to all the projects where the goal is basic research, or technology development, I’m also hoping to pursue my interests in education. Much of what I hope to do relates to content, but some of it relates to access and motivation. I don’t have perfect evidence, but I strongly believe there’s a lot of young talent out there in the world that never manages to connect for example with things like the educational programs we put on. We–and I—have tried quite hard over the years to “bridge the gap”. But with the world as it is, it’s proved remarkably difficult. But it’s still a problem I’d like to solve, and I’ll keep picking away at it, hoping to change for the better some kids’ “trajectories”.

But about content I believe my path is clearer. With the modern Wolfram Language I think we’ve gone a long way towards being able to take computational thinking about almost anything, and being able to represent it in a formalized way, and compute from it. But how do people manage to do the computational thinking in the first place? Well, like mathematical thinking and other formalized kinds of thinking, they have to learn how to do it.

For years people have been telling me I should “write the book” to teach this. And finally in January of this year I started. I’m not sure how long it will take, but I’ll soon be starting to post sections I’ve written so far.

My goal is to create a general book—and course—that’s an introduction to computational thinking at a level suitable for typical first-year college students. Lots of college students these days say they want to study “computer science”. But really it’s computational X for some field X that they’re ultimately interested in. And neither the theoretical nor the engineering aspects of typical “computer science” are what’s most relevant to them. What they need to know is computational thinking as it might be applied to computational X—not “CS” but what one might call “CX”.

So what will CX101 be like? In some ways more like a philosophy course than a CS one. Because in the end it’s about generally learning to think, albeit in the new paradigm of computation. And the point is that once someone has a clear computational conceptualization of something, then it’s our job in the Wolfram Language to make sure that it’s easy for them to concretely implement it.

But how does one teach computational conceptualization? What I’ve concluded is that one needs to anchor it in actual things in the world. Geography. Video. Genomics. Yes, there are principles to explain. But they need practical context to make them useful, or even understandable. And what I’m finding is that framing everything computationally makes things incredibly much easier to explain than before. (A test example coming soon is whether I can easily explain math ideas like algebra and calculus this way.)

OK, so that’s a lot of projects. But I’m excited about all of them, and can’t wait to make them happen. At an age when many of my contemporaries are retiring, I feel like I’m just getting started. And somehow the way my projects keep on connecting back to things I did decades ago makes me feel—in a computational irreducibility kind of way—that there’s something necessary about all the steps I’ve taken. I feel like the things I’ve done have let me climb some hills. But now there are many more hills that have come into view. And I look forward to being able to climb those too. For myself and for the world.

What’s Really Going On in Machine Learning? Some Minimal Models

Stephen Wolfram — Thu, 22 Aug 2024 18:28:17 +0000

The Mystery of Machine Learning

It’s surprising how little is known about the foundations of machine learning. Yes, from an engineering point of view, an immense amount has been figured out about how to build neural nets that do all kinds of impressive and sometimes almost magical things. But at a fundamental level we still don’t really know why neural nets “work”—and we don’t have any kind of “scientific big picture” of what’s going on inside them.

The basic structure of neural networks can be pretty simple. But by the time they’re trained up with all their weights, etc. it’s been hard to tell what’s going on—or even to get any good visualization of it. And indeed it’s far from clear even what aspects of the whole setup are actually essential, and what are just “details” that have perhaps been “grandfathered” all the way from when computational neural nets were first invented in the 1940s.

Well, what I’m going to try to do here is to get “underneath” this—and to “strip things down” as much as possible. I’m going to explore some very minimal models—that, among other things, are more directly amenable to visualization. At the outset, I wasn’t at all sure that these minimal models would be able to reproduce any of the kinds of things we see in machine learning. But, rather surprisingly, it seems they can.

And the simplicity of their construction makes it much easier to “see inside them”—and to get more of a sense of what essential phenomena actually underlie machine learning. One might have imagined that even though the training of a machine learning system might be circuitous, somehow in the end the system would do what it does through some kind of identifiable and “explainable” mechanism. But we’ll see that in fact that’s typically not at all what happens.

Instead it looks much more as if the training manages to home in on some quite wild computation that “just happens to achieve the right results”. Machine learning, it seems, isn’t building structured mechanisms; rather, it’s basically just sampling from the typical complexity one sees in the computational universe, picking out pieces whose behavior turns out to overlap what’s needed. And in a sense, therefore, the possibility of machine learning is ultimately yet another consequence of the phenomenon of computational irreducibility.

Why is that? Well, it’s only because of computational irreducibility that there’s all that richness in the computational universe. And, more than that, it’s because of computational irreducibility that things end up being effectively random enough that the adaptive process of training a machine learning system can reach success without getting stuck.

But the presence of computational irreducibility also has another important implication: that even though we can expect to find limited pockets of computational reducibility, we can’t expect a “general narrative explanation” of what a machine learning system does. In other words, there won’t be a traditional (say, mathematical) “general science” of machine learning (or, for that matter, probably also neuroscience). Instead, the story will be much closer to the fundamentally computational “new kind of science” that I’ve explored for so long, and that has brought us our Physics Project and the ruliad.

In many ways, the problem of machine learning is a version of the general problem of adaptive evolution, as encountered for example in biology. In biology we typically imagine that we want to adaptively optimize some overall “fitness” of a system; in machine learning we typically try to adaptively “train” a system to make it align with certain goals or behaviors, most often defined by examples. (And, yes, in practice this is often done by trying to minimize a quantity normally called the “loss”.)

And while in biology there’s a general sense that “things arise through evolution”, quite how this works has always been rather mysterious. But (rather to my surprise) I recently found a very simple model that seems to do well at capturing at least some of the most essential features of biological evolution. And while the model isn’t the same as what we’ll explore here for machine learning, it has some definite similarities. And in the end we’ll find that the core phenomena of machine learning and of biological evolution appear to be remarkably aligned—and both fundamentally connected to the phenomenon of computational irreducibility.

Most of what I’ll do here focuses on foundational, theoretical questions. But in understanding more about what’s really going on in machine learning—and what’s essential and what’s not—we’ll also be able to begin to see how in practice machine learning might be done differently, potentially with more efficiency and more generality.

Traditional Neural Nets

Note: Click any diagram to get Wolfram Language code to reproduce it.

To begin the process of understanding the essence of machine learning, let’s start from a very traditional—and familiar—example: a fully connected (“multilayer perceptron”) neural net that’s been trained to compute a certain function f[x]:

If one gives a value x as input at the top, then after “rippling through the layers of the network” one gets a value at the bottom that (almost exactly) corresponds to our function f[x]:

Scanning through different inputs x, we see different patterns of intermediate values inside the network:

And here’s (on a linear and log scale) how each of these intermediate values changes with x. And, yes, the way the final value (highlighted here) emerges looks very complicated:

So how is the neural net ultimately put together? How are these values that we’re plotting determined? We’re using the standard setup for a fully connected multilayer network. Each node (“neuron”) on each layer is connected to all nodes on the layer above—and values “flow” down from one layer to the next, being multiplied by the (positive or negative) “weight” (indicated by color in our pictures) associated with the connection through which they flow. The value of a given neuron is found by totaling up all its (weighted) inputs from the layer before, adding a “bias” value for that neuron, and then applying to the result a certain (nonlinear) “activation function” (here ReLU or Ramp[z], i.e. If[z < 0, 0, z]).

What overall function a given neural net will compute is determined by the collection of weights and biases that appear in the neural net (along with its overall connection architecture, and the activation function it’s using). The idea of machine learning is to find weights and biases that produce a particular function by adaptively “learning” from examples of that function. Typically we might start from a random collection of weights, then successively tweak weights and biases to “train” the neural net to reproduce the function:

We can get a sense of how this progresses (and, yes, it’s complicated) by plotting successive changes in individual weights over the course of the training process (the spikes near the end come from “neutral changes” that don’t affect the overall behavior):

The overall objective in the training is progressively to decrease the “loss”—the average (squared) difference between true values of f[x] and those generated by the neural net. The evolution of the loss defines a “learning curve” for the neural net, with the downward glitches corresponding to points where the neural net in effect “made a breakthrough” in being able to represent the function better:

It’s important to note that typically there’s randomness injected into neural net training. So if one runs the training multiple times, one will get different networks—and different learning curves—every time:

But what’s really going on in neural net training? Effectively we’re finding a way to “compile” a function (at least to some approximation) into a neural net with a certain number of (real-valued) parameters. And in the example here we happen to be using about 100 parameters.

But what happens if we use a different number of parameters, or set up the architecture of our neural net differently? Here are a few examples, indicating that for the function we’re trying to generate, the network we’ve been using so far is pretty much the smallest that will work:

And, by the way, here’s what happens if we change our activation function from ReLU
to the smoother ELU :

Later we’ll talk about what happens when we do machine learning with discrete systems. And in anticipation of that, it’s interesting to see what happens if we take a neural net of the kind we’ve discussed here, and “quantize” its weights (and biases) in discrete levels:

The result is that (as recent experience with large-scale neural nets has also shown) the basic “operation” of the neural net does not require precise real numbers, but survives even when the numbers are at least somewhat discrete—as this 3D rendering as a function of the discreteness level δ also indicates:

Simplifying the Topology: Mesh Neural Nets

So far we’ve been discussing very traditional neural nets. But to do machine learning, do we really need systems that have all those details? For example, do we really need every neuron on each layer to get an input from every neuron on the previous layer? What happens if instead every neuron just gets input from at most two others—say with the neurons effectively laid out in a simple mesh? Quite surprisingly, it turns out that such a network is still perfectly able to generate a function like the one we’ve been using as an example:

And one advantage of such a “mesh neural net” is that—like a cellular automaton—its “internal behavior” can readily be visualized in a rather direct way. So, for example, here are visualizations of “how the mesh net generates its output”, stepping through different input values x:

And, yes, even though we can visualize it, it’s still hard to understand “what’s going on inside”. Looking at the intermediate values of each individual node in the network as a function of x doesn’t help much, though we can “see something happening” at places where our function f[x] has jumps:

So how do we train a mesh neural net? Basically we can use the same procedure as for a fully connected network of the kind we saw above (ReLU activation functions don’t seem to work well for mesh nets, so we’re using ELU here):

Here’s the evolution of differences in each individual weight during the training process:

And here are results for different random seeds:

At the size we’re using, our mesh neural nets have about the same number of connections (and thus weights) as our main example of a fully connected network above. And we see that if we try to reduce the size of our mesh neural net, it doesn’t do well at reproducing our function:

Making Everything Discrete: A Biological Evolution Analog

Mesh neural nets simplify the topology of neural net connections. But, somewhat surprisingly at first, it seems as if we can go much further in simplifying the systems we’re using—and still successfully do versions of machine learning. And in particular we’ll find that we can make our systems completely discrete.

The typical methodology of neural net training involves progressively tweaking real-valued parameters, usually using methods based on calculus, and on finding derivatives. And one might imagine that any successful adaptive process would ultimately have to rely on being able to make arbitrarily small changes, of the kind that are possible with real-valued parameters.

But in studying simple idealizations of biological evolution I recently found striking examples where this isn’t the case—and where completely discrete systems seemed able to capture the essence of what’s going on.

As an example consider a (3-color) cellular automaton. The rule is shown on the left, and the behavior one generates by repeatedly applying that rule (starting from a single-cell initial condition) is shown on the right:

The rule has the property that the pattern it generates (from a single-cell initial condition) survives for exactly 40 steps, and then dies out (i.e. every cell becomes white). And the important point is that this rule can be found by a discrete adaptive process. The idea is to start, say, from a null rule, and then at each step to randomly change a single outcome out of the 27 in the rule (i.e. make a “single-point mutation” in the rule). Most such changes will cause the “lifetime” of the pattern to get further from our target of 40—and these we discard. But gradually we can build up “beneficial mutations”

that through “progressive adaptation” eventually get to our original lifetime-40 rule:

We can make a plot of all the attempts we made that eventually let us reach lifetime 40—and we can think of this progressive “fitness” curve as being directly analogous to the loss curves in machine learning that we saw before:

If we make different sequences of random mutations, we’ll get different paths of adaptive evolution, and different “solutions” for rules that have lifetime 40:

Two things are immediately notable about these. First, that they essentially all seem to be “using different ideas” to reach their goal (presumably analogous to the phenomenon of different branches in the tree of life). And second, that none of them seem to be using a clear “mechanical procedure” (of the kind we might construct through traditional engineering) to reach their goal. Instead, they seem to be finding “natural” complicated behavior that just “happens” to achieve the goal.

It’s nontrivial, of course, that this behavior can achieve a goal like the one we’ve set here, as well as that simple selection based on random point mutations can successfully reach the necessary behavior. But as I discussed in connection with biological evolution, this is ultimately a story of computational irreducibility—particularly in generating diversity both in behavior, and in the paths necessary to reach it.

But, OK, so how does this model of adaptive evolution relate to systems like neural nets? In the standard language of neural nets, our model is like a discrete analog of a recurrent convolutional network. It’s “convolutional” because at any given step the same rule is applied—locally—throughout an array of elements. It’s “recurrent” because in effect data is repeatedly “passed through” the same rule. The kinds of procedures (like “backpropagation”) typically used to train traditional neural nets wouldn’t be able to train such a system. But it turns out that—essentially as a consequence of computational irreducibility—the very simple method of successive random mutation can be successful.

Machine Learning in Discrete Rule Arrays

Let’s say we want to set up a system like a neural net—or at least a mesh neural net—but we want it to be completely discrete. (And I mean “born discrete”, not just discretized from an existing continuous system.) How can we do this? One approach (that, as it happens, I first considered in the mid-1980s—but never seriously explored) is to make what we can call a “rule array”. Like in a cellular automaton there’s an array of cells. But instead of these cells always being updated according to the same rule, each cell at each place in the cellular automaton analog of “spacetime” can make a different choice of what rule it will use. (And although it’s a fairly extreme idealization, we can potentially imagine that these different rules represent a discrete analog of different local choices of weights in a mesh neural net.)

As a first example, let’s consider a rule array in which there are two possible choices of rules: k = 2, r = 1 cellular automaton rules 4 and 146 (which are respectively class 2 and class 3):

A particular rule array is defined by which of these rules is going to be used at each (“spacetime”) position in the array. Here are a few examples. In all cases we’re starting from the same single-cell initial condition. But in each case the rule array has a different arrangement of rule choices—with cells “running” rule 4 being given a background, and those running rule 146 a one:

We can see that different choices of rule array can yield very different behaviors. But (in the spirit of machine learning) can we in effect “invert this”, and find a rule array that will give some particular behavior we want?

A simple approach is to do the direct analog of what we did in our minimal modeling of biological evolution: progressively make random “single-point mutations”—here “flipping” the identity of just one rule in the rule array—and then keeping only those mutations that don’t make things worse.

As our sample objective, let’s ask to find a rule array that makes the pattern generated from a single cell using that rule array “survive” for exactly 50 steps. At first it might not be obvious that we’d be able to find such a rule array. But in fact our simple adaptive procedure easily manages to do this:

As the dots here indicate, many mutations don’t lead to longer lifetimes. But every so often, the adaptive process has a “breakthrough” that increases the lifetime—eventually reaching 50:

Just as in our model of biological evolution, different random sequences of mutations lead to different “solutions”, here to the problem of “living for exactly 50 steps”:

Some of these are in effect “simple solutions” that require only a few mutations. But most—like most of our examples in biological evolution—seem more as if they just “happen to work”, effectively by tapping into just the right, fairly complex behavior.

Is there a sharp distinction between these cases? Looking at the collection of “fitness” (AKA “learning”) curves for the examples above, it doesn’t seem so:

It’s not too difficult to see how to “construct a simple solution” just by strategically placing a single instance of the second rule in the rule array:

But the point is that adaptive evolution by repeated mutation normally won’t “discover” this simple solution. And what’s significant is that the adaptive evolution can nevertheless still successfully find some solution—even though it’s not one that’s “understandable” like this.

The cellular automaton rules we’ve been using so far take 3 inputs. But it turns out that we can make things even simpler by just putting ordinary 2-input Boolean functions into our rule array. For example, we can make a rule array from And and Xor functions (r = 1/2 rules 8 and 6):

Different And+Xor ( + ) rule arrays show different behavior:

But are there for example And+Xor rule arrays that will compute any of the 16 possible (2-input) functions? We can’t get Not or any of the 8 other functions with —but it turns out we can get all 8 functions with (additional inputs here are assumed to be ):

And in fact we can also set up And+Xor rule arrays for all other “even” Boolean functions. For example, here are rule arrays for the 3-input rule 30 and rule 110 Boolean functions:

It may be worth commenting that the ability to set up such rule arrays is related to functional completeness of the underlying rules we’re using—though it’s not quite the same thing. Functional completeness is about setting up arbitrary formulas, that can in effect allow long-range connections between intermediate results. Here, all information has to explicitly flow through the array. But for example the functional completeness of Nand (r = 1/2 rule 7, ) allows it to generate all Boolean functions when combined for example with First (r = 1/2 rule 12, ), though sometimes the rule arrays required are quite large:

OK, but what happens if we try to use our adaptive evolution process—say to solve the problem of finding a pattern that survives for exactly 30 steps? Here’s a result for And+Xor rule arrays:

And here are examples of other “solutions” (none of which in this case look particularly “mechanistic” or “constructed”):

But what about learning our original f[x] = function? Well, first we have to decide how we’re going to represent the numbers x and f[x] in our discrete rule array system. And one approach is to do this simply in terms of the position of a black cell (“one-hot encoding”). So, for example, in this case there’s an initial black cell at a position corresponding to about x = –1.1. And then the result after passing through the rule array is a black cell at a position corresponding to f[x] = 1.0:

So now the question is whether we can find a rule array that successfully maps initial to final cell positions according to the mapping x f[x] we want. Well, here’s an example that comes at least close to doing this (note that the array is taken to be cyclic):

So how did we find this? Well, we just used a simple adaptive evolution process. In direct analogy to the way it’s usually done in machine learning, we set up “training examples”, here of the form:

Then we repeatedly made single-point mutations in our rule array, keeping those mutations where the total difference from all the training examples didn’t increase. And after 50,000 mutations this gave the final result above.

We can get some sense of “how we got there” by showing the sequence of intermediate results where we got closer to the goal (as opposed to just not getting further from it):

Here are the corresponding rule arrays, in each case highlighting elements that have changed (and showing the computation of f[0] in the arrays):

Different sequences of random mutations will lead to different rule arrays. But with the setup defined here, the resulting rule arrays will almost always succeed in accurately computing f[x]. Here are a few examples—in which we’re specifically showing the computation of f[0]:

And once again an important takeaway is that we don’t see “identifiable mechanism” in what’s going on. Instead, it looks more as if the rule arrays we’ve got just “happen” to do the computations we want. Their behavior is complicated, but somehow we can manage to “tap into it” to compute our f[x].

But how robust is this computation? A key feature of typical machine learning is that it can “generalize” away from the specific examples it’s been given. It’s never been clear just how to characterize that generalization (when does an image of a cat in a dog suit start being identified as an image of a dog?). But—at least when we’re talking about classification tasks—we can think of what’s going on in terms of basins of attraction that lead to attractors corresponding to our classes.

It’s all considerably easier to analyze, though, in the kind of discrete system we’re exploring here. For example, we can readily enumerate all our training inputs (i.e. all initial states containing a single black cell), and then see how frequently these cause any given cell to be black:

By the way, here’s what happens to this plot at successive “breakthroughs” during training:

But what about all possible inputs, including ones that don’t just contain a single black cell? Well, we can enumerate all of them, and compute the overall frequency for each cell in the array to be black:

As we would expect, the result is considerably “fuzzier” than what we got purely with our training inputs. But there’s still a strong trace of the discrete values for f[x] that appeared in the training data. And if we plot the overall probability for a given final cell to be black, we see peaks at positions corresponding to the values 0 and 1 that f[x] takes on:

But because our system is discrete, we can explicitly look at what outcomes occur:

The most common overall is the “meaningless” all-white state—that basically occurs when the computation from the input “never makes it” to the output. But the next most common outcomes correspond exactly to f[x] = 0 and f[x] = 1. After that is the “superposition” outcome where f[x] is in effect “both 0 and 1”.

But, OK, so what initial states are “in the basins of attraction of” (i.e. will evolve to) the various outcomes here? The fairly flat plots in the last column above indicate that the overall density of black cells gives little information about what attractor a particular initial state will evolve to.

So this means we have to look at specific configurations of cells in the initial conditions. As an example, start from the initial condition

which evolves to:

Now we can ask what happens if we look at a sequence of slightly different initial conditions. And here we show in black and white initial conditions that still evolve to the original “attractor” state, and in pink ones that evolve to some different state:

What’s actually going on inside here? Here are a few examples, highlighting cells whose values change as a result of changing the initial condition:

As is typical in machine learning, there doesn’t seem to be any simple characterization of the form of the basin of attraction. But now we have a sense of what the reason for this is: it’s another consequence of computational irreducibility. Computational irreducibility gives us the effective randomness that allows us to find useful results by adaptive evolution, but it also leads to changes having what seem like random and unpredictable effects. (It’s worth noting, by the way, that we could probably dramatically improve the robustness of our attractor basins by specifically including in our training data examples that have “noise” injected.)

Multiway Mutation Graphs

In doing machine learning in practice, the goal is typically to find some collection of weights, etc. that successfully solve a particular problem. But in general there will be many such collections of weights, etc. With typical continuous weights and random training steps it’s very difficult to see what the whole “ensemble” of possibilities is. But in our discrete rule array systems, this becomes more feasible.

Consider a tiny 2×2 rule array with two possible rules. We can make a graph whose edges represent all possible “point mutations” that can occur in this rule array:

In our adaptive evolution process, we’re always moving around a graph like this. But typically most “moves” will end up in states that are rejected because they increase whatever loss we’ve defined.

Consider the problem of generating an And+Xor rule array in which we end with lifetime-4 patterns. Defining the loss as how far we are from this lifetime, we can draw a graph that shows all possible adaptive evolution paths that always progressively decrease the loss:

The result is a multiway graph of the type we’ve now seen in a great many kinds of situations—notably our recent study of biological evolution.

And although this particular example is quite trivial, the idea in general is that different parts of such a graph represent “different strategies” for solving a problem. And—in direct analogy to our Physics Project and our studies of things like game graphs—one can imagine such strategies being laid out in a “branchial space” defined by common ancestry of configurations in the multiway graph.

And one can expect that while in some cases the branchial graph will be fairly uniform, in other cases it will have quite separated pieces—that represent fundamentally different strategies. Of course, the fact that underlying strategies may be different doesn’t mean that the overall behavior or performance of the system will be noticeably different. And indeed one expects that in most cases computational irreducibility will lead to enough effective randomness that there’ll be no discernable difference.

But in any case, here’s an example starting with a rule array that contains both And and Xor—where we observe distinct branches of adaptive evolution that lead to different solutions to the problem of finding a configuration with a lifetime of exactly 4:

Optimizing the Learning Process

How should one actually do the learning in machine learning? In practical work with traditional neural nets, learning is normally done using systematic algorithmic methods like backpropagation. But so far, all we’ve done here is something much simpler: we’ve “learned” by successively making random point mutations, and keeping only ones that don’t lead us further from our goal. And, yes, it’s interesting that such a procedure can work at all—and (as we’ve discussed elsewhere) this is presumably very relevant to understanding phenomena like biological evolution. But, as we’ll see, there are more efficient (and probably much more efficient) methods of doing machine learning, even for the kinds of discrete systems we’re studying.

Let’s start by looking again at our earlier example of finding an And+Xor rule array that gives a “lifetime” of exactly 30. At each step in our adaptive (“learning”) process we make a single-point mutation (changing a single rule in the rule array), keeping the mutation if it doesn’t take us further from our goal. The mutations gradually accumulate—every so often reaching a rule array that gives a lifetime closer to 30. Just as above, here’s a plot of the lifetime achieved by successive mutations—with the “internal” red dots corresponding to rejected mutations:

We see a series of “plateaus” at which mutations are accumulating but not changing the overall lifetime. And between these we see occasional “breakthroughs” where the lifetime jumps. Here are the actual rule array configurations for these breakthroughs, with mutations since the last breakthrough highlighted:

But in the end the process here is quite wasteful; in this example, we make a total of 1705 mutations, but only 780 of them actually contribute to generating the final rule array; all the others are discarded along the way.

So how can we do better? One strategy is to try to figure out at each step which mutation is “most likely to make a difference”. And one way to do this is to try every possible mutation in turn at every step (as in multiway evolution)—and see what effect each of them has on the ultimate lifetime. From this we can construct a “change map” in which we give the change of lifetime associated with a mutation at every particular cell. The results will be different for every configuration of rule array, i.e. at every step in the adaptive evolution. But for example here’s what they are for the particular “breakthrough” configurations shown above (elements in regions that are colored gray won’t affect the result if they are changed; ones colored red will have a positive effect (with more intense red being more positive), and ones colored blue a negative one:

Let’s say we start from a random rule array, then repeatedly construct the change map and apply the mutation that it implies gives the most positive change—in effect at each step following the “path of steepest descent” to get to the lifetime we want (i.e. reduce the loss). Then the sequence of “breakthrough” configurations we get is:

And this in effect corresponds to a slightly more direct “path to a solution” than our sequence of pure single-point mutations.

By the way, the particular problem of reaching a certain lifetime has a simple enough structure that this “steepest descent” method—when started from a simple uniform rule array—finds a very “mechanical” (if slow) path to a solution:

What about the problem of learning f[x] = ? Once again we can make a change map based on the loss we define. Here are the results for a sequence of “breakthrough” configurations. The gray regions are ones where changes will be “neutral”, so that there’s still exploration that can be done without affecting the loss. The red regions are ones that are in effect “locked in” and where any changes would be deleterious in terms of loss:

So what happens in this case if we follow the “path of steepest descent”, always making the change that would be best according to the change map? Well, the results are actually quite unsatisfactory. From almost any initial condition the system quickly gets stuck, and never finds any satisfactory solution. In effect it seems that deterministically following the path of steepest descent leads us to a “local minimum” from which we cannot escape. So what are we missing in just looking at the change map? Well, the change map as we’ve constructed it has the limitation that it’s separately assessing the effect of each possible individual mutation. It doesn’t deal with multiple mutations at a time—which could well be needed in general if one’s going to find the “fastest path to success”, and avoid getting stuck.

But even in constructing the change map there’s already a problem. Because at least the direct way of computing it scales quite poorly. In an n×n rule array we have to check the effect of flipping about n² values, and for each one we have to run the whole system—taking altogether about n⁴ operations. And one has to do this separately for each step in the learning process.

So how do traditional neural nets avoid this kind of inefficiency? The answer in a sense involves a mathematical trick. And at least as it’s usually presented it’s all based on the continuous nature of the weights and values in neural nets—which allow us to use methods from calculus.

Let’s say we have a neural net like this

that computes some particular function f[x]:

We can ask how this function changes as we change each of the weights in the network:

And in effect this gives us something like our “change map” above. But there’s an important difference. Because the weights are continuous, we can think about infinitesimal changes to them. And then we can ask questions like “How does f[x] change when we make an infinitesimal change to a particular weight w_i?”—or equivalently, “What is the partial derivative of f with respect to w_i at the point x?” But now we get to use a key feature of infinitesimal changes: that they can always be thought of as just “adding linearly” (essentially because ε² can always be ignored compared to ε). Or, in other words, we can summarize any infinitesimal change just by giving its “direction” in weight space, i.e. a vector that says how much of each weight should be (infinitesimally) changed. So if we want to change f[x] (infinitesimally) as quickly as possible, we should go in the direction of steepest descent defined by all the derivatives of f with respect to the weights.

In machine learning, we’re typically trying in effect to set the weights so that the form of f[x] we generate successfully minimizes whatever loss we’ve defined. And we do this by incrementally “moving in weight space”—at every step computing the direction of steepest descent to know where to go next. (In practice, there are all sorts of tricks like “ADAM” that try to optimize the way to do this.)

But how do we efficiently compute the partial derivative of f with respect to each of the weights? Yes, we could do the analog of generating pictures like the ones above, separately for each of the weights. But it turns out that a standard result from calculus gives us a vastly more efficient procedure that in effect “maximally reuses” parts of the computation that have already been done.

It all starts with the textbook chain rule for the derivative of nested (i.e. composed) functions:

This basically says that the (infinitesimal) change in the value of the “whole chain” d[c[b[a[x]]]] can be computed as a product of (infinitesimal) changes associated with each of the “links” in the chain. But the key observation is then that when we get to the computation of the change at a certain point in the chain, we’ve already had to do a lot of the computation we need—and so long as we stored those results, we always have only an incremental computation to perform.

So how does this apply to neural nets? Well, each layer in a neural net is in effect doing a function composition. So, for example, our d[c[b[a[x]]]] is like a trivial neural net:

But what about the weights, which, after all, are what we are trying to find the effect of changing? Well, we could include them explicitly in the function we’re computing:

And then we could in principle symbolically compute the derivatives with respect to these weights:

For our network above

the corresponding expression (ignoring biases) is

where ϕ denotes our activation function. Once again we’re dealing with nested functions, and once again—though it’s a bit more intricate in this case—the computation of derivatives can be done by incrementally evaluating terms in the chain rule and in effect using the standard neural net method of “backpropagation”.

So what about the discrete case? Are there similar methods we can use there? We won’t discuss this in detail here, but we’ll give some indications of what’s likely to be involved.

As a potentially simpler case, let’s consider ordinary cellular automata. The analog of our change map asks how the value of a particular “output” cell is affected by changes in other cells—or in effect what the “partial derivative” of the output value is with respect to changes in values of other cells.

For example, consider the highlighted “output” cell in this cellular automaton evolution:

Now we can look at each cell in this array, and make a change map based on seeing whether flipping the value of just that cell (and then running the cellular automaton forwards from that point) would change the value of the output cell:

The form of the change map is different if we look at different “output cells”:

Here, by the way, are some larger change maps for this and a couple of other cellular automaton rules:

But is there a way to construct such change maps incrementally? One might have thought that there would immediately be at least for cellular automata that (unlike the cases here) are fundamentally reversible. But actually such reversibility doesn’t seem to help much—because although it allows us to “backtrack” whole states of the cellular automaton, it doesn’t allow us to trace the separate effects of individual cells.

So how about using discrete analogs of derivatives and the chain rule? Let’s for example call the function computed by one step in rule 30 cellular automaton evolution w[x, y, z]. We can think of the “partial derivative” of this function with respect to x at the point x as representing whether the output of w changes when x is flipped starting from the value given:

(Note that “no change” is indicated as False or , while a change is indicated as True or . And, yes, one can either explicitly compute the rule outcomes here, and then deduce from them the functional form, or one can use symbolic rules to directly deduce the functional form.)

One can compute a discrete analog of a derivative for any Boolean function. For example, we have

and

which we can write as:

We also have:

And here is a table of “Boolean derivatives” for all 2-input Boolean functions:

And indeed there’s a whole “Boolean calculus” one can set up for these kinds of derivatives. And in particular, there’s a direct analog of the chain rule:

where Xnor[x,y] is effectively the equality test x == y:

But, OK, how do we use this to create our change maps? In our simple cellular automaton case, we can think of our change map as representing how a change in an output cell “propagates back” to previous cells. But if we just try to apply our discrete calculus rules we run into a problem: different “chain rule chains” can imply different changes in the value of the same cell. In the continuous case this path dependence doesn’t happen because of the way infinitesimals work. But in the discrete case it does. And ultimately we’re doing a kind of backtracking that can really be represented faithfully only as a multiway system. (Though if we just want probabilities, for example, we can consider averaging over branches of the multiway system—and the change maps we showed above are effectively the result of thresholding over the multiway system.)

But despite the appearance of such difficulties in the “simple” cellular automaton case, such methods typically seem to work better in our original, more complicated rule array case. There’s a bunch of subtlety associated with the fact that we’re finding derivatives not only with respect to the values in the rule array, but also with respect to the choice of rules (which are the analog of weights in the continuous case).

Let’s consider the And+Xor rule array:

Our loss is the number of cells whose values disagree with the row shown at the bottom. Now we can construct a change map for this rule array both in a direct “forward” way, and “backwards” using our discrete derivative methods (where we effectively resolve the small amount of “multiway behavior” by always picking “majority” values):

The results are similar, though in this case not exactly the same. Here are a few other examples:

And, yes, in detail there are essentially always local differences between the results from the forward and backward methods. But the backward method—like in the case of backpropagation in ordinary neural nets—can be implemented much more efficiently. And for purposes of practical machine learning it’s actually likely to be perfectly satisfactory—especially given that the forward method is itself only providing an approximation to the question of which mutations are best to do.

And as an example, here are the results of the forward and backward methods for the problem of learning the function f[x] = , for the “breakthrough” configurations that we showed above:

What Can Be Learned?

We’ve now shown quite a few examples of machine learning in action. But a fundamental question we haven’t yet addressed is what kind of thing can actually be learned by machine learning. And even before we get to this, there’s another question: given a particular underlying type of system, what kinds of functions can it even represent?

As a first example consider a minimal neural net of the form (essentially a single-layer perceptron):

With ReLU (AKA Ramp) as the activation function and the first set of weights all taken to be 1, the function computed by such a neural net has the form:

With enough weights and biases this form can represent any piecewise linear function—essentially just by moving around ramps using biases, and scaling them using weights. So for example consider the function:

This is the function computed by the neural net above—and here’s how it’s built up by adding in successive ramps associated with the individual intermediate nodes (neurons):

(It’s similarly possible to get all smooth functions from activation functions like ELU, etc.)

Things get slightly more complicated if we try to represent functions with more than one argument. With a single intermediate layer we can only get “piecewise (hyper)planar” functions (i.e. functions that change direction only at linear “fault lines”):

But already with a total of two intermediate layers—and sufficiently many nodes in each of these layers—we can generate any piecewise function of any number of arguments.

If we limit the number of nodes, then roughly we limit the number of boundaries between different linear regions in the values of the functions. But as we increase the number of layers with a given number of nodes, we basically increase the number of sides that polygonal regions within the function values can have:

So what happens with the mesh nets that we discussed earlier? Here are a few random examples, showing results very similar to shallow, fully connected networks with a comparable total number of nodes:

OK, so how about our fully discrete rule arrays? What functions can they represent? We already saw part of the answer earlier when we generated rule arrays to represent various Boolean functions. It turns out that there is a fairly efficient procedure based on Boolean satisfiability for explicitly finding rule arrays that can represent a given function—or determine that no rule array (say of a given size) can do this.

Using this procedure, we can find minimal And+Xor rule arrays that represent all (“even”) 3-input Boolean functions (i.e. r = 1 cellular automaton rules):

It’s always possible to specify any n-input Boolean function by an array of 2ⁿ bits, as in:

But we see from the pictures above that when we “compile” Boolean functions into And+Xor rule arrays, they can take different numbers of bits (i.e. different numbers of elements in the rule array). (In effect, the “algorithmic information content” of the function varies with the “language” we’re using to represent them.) And, for example, in the n = 3 case shown here, the distribution of minimal rule array sizes is:

There are some functions that are difficult to represent as And+Xor rule arrays (and seem to require 15 rule elements)—and others that are easier. And this is similar to what happens if we represent Boolean functions as Boolean expressions (say in conjunctive normal form) and count the total number of (unary and binary) operations used:

OK, so we know that there is in principle an And+Xor rule array that will compute any (even) Boolean function. But now we can ask whether an adaptive evolution process can actually find such a rule array—say with a sequence of single-point mutations. Well, if we do such adaptive evolution—with a loss that counts the number of “wrong outputs” for, say, rule 254—then here’s a sequence of successive breakthrough configurations that can be produced:

The results aren’t as compact as the minimal solution above. But it seems to always be possible to find at least some And+Xor rule array that “solves the problem” just by using adaptive evolution with single-point mutations.

Here are results for some other Boolean functions:

And so, yes, not only are all (even) Boolean functions representable in terms of And+Xor rule arrays, they’re also learnable in this form, just by adaptive evolution with single-point mutations.

In what we did above, we were looking at how machine learning works with our rule arrays in specific cases like for the function. But now we’ve got a case where we can explicitly enumerate all possible functions, at least of a given class. And in a sense what we’re seeing is evidence that machine learning tends to be very broad—and capable at least in principle of learning pretty much any function.

Of course, there can be specific restrictions. Like the And+Xor rule arrays we’re using here can’t represent (“odd”) functions where . (The Nand+First rule arrays we discussed above nevertheless can.) But in general it seems to be a reflection of the Principle of Computational Equivalence that pretty much any setup is capable of representing any function—and also adaptively “learning” it.

By the way, it’s a lot easier to discuss questions about representing or learning “any function” when one’s dealing with discrete (countable) functions—because one can expect to either be able to “exactly get” a given function, or not. But for continuous functions, it’s more complicated, because one’s pretty much inevitably dealing with approximations (unless one can use symbolic forms, which are basically discrete). So, for example, while we can say (as we did above) that (ReLU) neural nets can represent any piecewise-linear function, in general we’ll only be able to imagine successively approaching an arbitrary function, much like when you progressively add more terms in a simple Fourier series:

Looking back at our results for discrete rule arrays, one notable observation that is that while we can successfully reproduce all these different Boolean functions, the actual rule array configurations that achieve this tend to look quite messy. And indeed it’s much the same as we’ve seen throughout: machine learning can find solutions, but they’re not “structured solutions”; they’re in effect just solutions that “happen to work”.

Are there more structured ways of representing Boolean functions with rule arrays? Here are the two possible minimum-size And+Xor rule arrays that represent rule 30:

At the next-larger size there are more possibilities for rule 30:

And there are also rule arrays that can represent rule 110:

But in none of these cases is there obvious structure that allows us to immediately see how these computations work, or what function is being computed. But what if we try to explicitly construct—effectively by standard engineering methods—a rule array that computes a particular function? We can start by taking something like the function for rule 30 and writing it in terms of And and Xor (i.e. in ANF, or “algebraic normal form”):

We can imagine implementing this using an “evaluation graph”:

But now it’s easy to turn this into a rule array (and, yes, we haven’t gone all the way and arranged to copy inputs, etc.):

“Evaluating” this rule array for different inputs, we can see that it indeed gives rule 30:

Doing the same thing for rule 110, the And+Xor expression is

the evaluation graph is

and the rule array is:

And at least with the evaluation graph as a guide, we can readily “see what’s happening” here. But the rule array we’re using is considerably larger than our minimal solutions above—or even than the solutions we found by adaptive evolution.

It’s a typical situation that one sees in many other kinds of systems (like for example sorting networks): it’s possible to have a “constructed solution” that has clear structure and regularity and is “understandable”. But minimal solutions—or ones found by adaptive evolution—tend to be much smaller. But they almost always look in many ways random, and aren’t readily understandable or interpretable.

So far, we’ve been looking at rule arrays that compute specific functions. But in getting a sense of what rule arrays can do, we can consider rule arrays that are “programmable”, in that their input specifies what function they should compute. So here, for example, is an And+Xor rule array—found by adaptive evolution—that takes the “bit pattern” of any (even) Boolean function as input on the left, then applies that Boolean function to the inputs on the right:

And with this same rule array we can now compute any possible (even) Boolean function. So here, for example, it’s evaluating Or:

Other Kinds of Models and Setups

Our general goal here has been to set up models that capture the most essential features of neural nets and machine learning—but that are simple enough in their structure that we can readily “look inside” and get a sense of what they are doing. Mostly we’ve concentrated on rule arrays as a way to provide a minimal analog of standard “perceptron-style” feed-forward neural nets. But what about other architectures and setups?

In effect, our rule arrays are “spacetime-inhomogeneous” generalizations of cellular automata—in which adaptive evolution determines which rule (say from a finite set) should be used at every (spatial) position and every (time) step. A different idealization (that in fact we already used in one section above) is to have an ordinary homogeneous cellular automaton—but with a single “global rule” determined by adaptive evolution. Rule arrays are the analog of feed-forward networks in which a given rule in the rule array is in effect used only once as data “flows through” the system. Ordinary homogeneous cellular automata are like recurrent networks in which a single stream of data is in effect subjected over and over again to the same rule.

There are various interpolations between these cases. For example, we can imagine a “layered rule array” in which the rules at different steps can be different, but those on a given step are all the same. Such a system can be viewed as an idealization of a convolutional neural net in which a given layer applies the same kernel to elements at all positions, but different layers can apply different kernels.

A layered rule array can’t encode as much information as a general rule array. But it’s still able to show machine-learning-style phenomena. And here, for example, is adaptive evolution for a layered And+Xor rule array progressively solving the problem of generating a pattern that lives for exactly 30 steps:

One could also imagine “vertically layered” rule arrays, in which different rules are used at different positions, but any given position keeps running the same rule forever. However, at least for the kinds of problems we’ve considered here, it doesn’t seem sufficient to just be able to pick the positions at which different rules are run. One seems to either need to change rules at different (time) steps, or one needs to be able to adaptively evolve the underlying rules themselves.

Rule arrays and ordinary cellular automata share the feature that the value of each cell depends only on the values of neighboring cells on the step before. But in neural nets it’s standard for the value at a given node to depend on the values of lots of nodes on the layer before. And what makes this straightforward in neural nets is that (weighted, and perhaps otherwise transformed) values from previous nodes are taken to be combined just by simple numerical addition—and addition (being n-ary and associative) can take any number of “inputs”. In a cellular automaton (or Boolean function), however, there’s always a definite number of inputs, determined by the structure of the function. In the most straightforward case, the inputs come only from nearest-neighboring cells. But there’s no requirement that this is how things need to work—and for example we can pick any “local template” to bring in the inputs for our function. This template could either be the same at every position and every step, or it could be picked from a certain set differently at different positions—in effect giving us “template arrays” as well as rule arrays.

So what about having a fully connected network, as we did in our very first neural net examples above? To set up a discrete analog of this we first need some kind of discrete n-ary associative “accumulator” function to fill the place of numerical addition. And for this we could pick a function like And, Or, Xor—or Majority. And if we’re not just going to end up with the same value at each node on a given layer, we need to set up some analog of a weight associated with each connection—which we can achieve by applying either Identity or Not (i.e. flip or not) to the value flowing through each connection.

Here’s an example of a network of this type, trained to compute the function we discussed above:

There are just two kinds of connections here: flip and not. And at each node we’re computing the majority function—giving value 1 if the majority of its inputs are 1, and 0 otherwise. With the “one-hot encoding” of input and output that we used before, here are a few examples of how this network evaluates our function:

This was trained just using 1000 steps of single-point mutation applied to the connection types. The loss systematically goes down—but the configuration of the connection types continues to look quite random even as it achieves zero loss (i.e. even after the function has been completely learned):

In what we’ve just done we assume that all connections continue to be present, though their types (or effectively signs) can change. But we can also consider a network where connections can end up being zeroed out during training—so that they are effectively no longer present.

Much of what we’ve done here with machine learning has centered around trying to learn transformations of the form x f[x]. But another typical application of machine learning is autoencoding—or in effect learning how to compress data representing a certain set of examples. And once again it’s possible to do such a task using rule arrays, with learning achieved by a series of single-point mutations.

As a starting point, consider training a rule array (of cellular automaton rules 4 and 146) to reproduce unchanged a block of black cells of any width. One might have thought this would be trivial. But it’s not, because in effect the initial data inevitably gets “ground up” inside the rule array, and has to be reconstituted at the end. But, yes, it’s nevertheless possible to train a rule array to at least roughly do this—even though once again the rule arrays we find that manage to do this look quite random:

But to set up a nontrivial autoencoder let’s imagine that we progressively “squeeze” the array in the middle, creating an increasingly narrow “bottleneck” through which the data has to flow. At the bottleneck we effectively have a compressed version of the original data. And we find that at least down to some width of bottleneck, it’s possible to create rule arrays that—with reasonable probability—can act as successful autoencoders of the original data:

The success of LLMs has highlighted the use of machine learning for sequence continuation—and the effectiveness of transformers for this. But just as with other neural nets, the forms of transformers that are used in practice are typically very complicated. But can one find a minimal model that nevertheless captures the “essence of transformers”?

Let’s say that we have a sequence that we want to continue, like:

We want to encode each possible value by a vector, as in

so that, for example, our original sequence is encoded as:

Then we have a “head” that reads a block of consecutive vectors, picking off certain values and feeding pairs of them into And and Xor functions, to get a vector of Boolean values:

Ultimately this head is going to “slide” along our sequence, “predicting” what the next element in the sequence will be. But somehow we have to go from our vector of Boolean values to (probabilities of) sequence elements. Potentially we might be able to do this just with a rule array. But for our purposes here we’ll use a fully connected single-layer Identity+Not network in which at each output node we just find the sum of the number of values that come to it—and treat this as determining (through a softmax) the probability of the corresponding element:

In this case, the element with the maximum value is 5, so at “zero temperature” this would be our “best prediction” for the next element.

To train this whole system we just make a sequence of random point mutations to everything, keeping mutations that don’t increase the loss (where the loss is basically the difference between predicted next values and actual next values, or, more precisely, the “categorical cross-entropy”). Here’s how this loss progresses in a typical such training:

At the end of this training, here are the components of our minimal transformer:

First come the encodings of the different possible elements in the sequence. Then there’s the head, here shown applied to the encoding of the first elements of the original sequence. Finally there’s a single-layer discrete network that takes the output from the head, and deduces relative probabilities for different elements to come next. In this case the highest-probability prediction for the next element is that it should be element 6.

To do the analog of an LLM we start from some initial “prompt”, i.e. an initial sequence that fits within the width (“context window”) of the head. Then we progressively apply our minimal transformer, for example at each step taking the next element to be the one with the highest predicted probability (i.e. operating “at zero temperature”). With this setup the collection of “prediction strengths” is shown in gray, with the “best prediction” shown in red:

Running this even far beyond our original training data, we see that we get a “prediction” of a continued sine wave:

As we might expect, the fact that our minimal transformer can make such a plausible prediction relies on the simplicity of our sine curve. If we use “more complicated” training data, such as the “mathematically defined” () blue curve in

the result of training and running a minimal transformer is now:

And, not surprisingly, it can’t “figure out the computation” to correctly continue the curve. By the way, different training runs will involve different sequences of mutations, and will yield different predictions (often with periodic “hallucinations”):

In looking at “perceptron-style” neural nets we wound up using rule arrays—or, in effect, spacetime-inhomogeneous cellular automata—as our minimal models. Here we’ve ended up with a slightly more complicated minimal model for transformer neural nets. But if we were to simplify it further, we would end up not with something like a cellular automaton but instead with something like a tag system, in which one has a sequence of elements, and at each step removes a block from the beginning, and—depending on its form—adds a certain block at the end, as in:

And, yes, such systems can generate extremely complex behavior—reinforcing the idea (that we have repeatedly seen here) that machine learning works by selecting complexity that aligns with goals that have been set.

And along these lines, one can consider all sorts of different computational systems as foundations for machine learning. Here we’ve been looking at cellular-automaton-like and tag-system-like examples. But for example our Physics Project has shown us the power and flexibility of systems based on hypergraph rewriting. And from what we’ve seen here, it seems very plausible that something like hypergraph rewriting can serve as a yet more powerful and flexible substrate for machine learning.

So in the End, What’s Really Going On in Machine Learning?

There are, I think, several quite striking conclusions from what we’ve been able to do here. The first is just that models much simpler than traditional neural nets seem capable of capturing the essential features of machine learning—and indeed these models may well be the basis for a new generation of practical machine learning.

But from a scientific point of view, one of the things that’s important about these models is that they are simple enough in structure that it’s immediately possible to produce visualizations of what they’re doing inside. And studying these visualizations, the most immediately striking feature is how complicated they look.

It could have been that machine learning would somehow “crack systems”, and find simple representations for what they do. But that doesn’t seem to be what’s going on at all. Instead what seems to be happening is that machine learning is in a sense just “hitching a ride” on the general richness of the computational universe. It’s not “specifically building up behavior one needs”; rather what it’s doing is to harness behavior that’s “already out there” in the computational universe.

The fact that this could possibly work relies on the crucial—and at first unexpected—fact that in the computational universe even very simple programs can ubiquitously produce all sorts of complex behavior. And the point then is that this behavior has enough richness and diversity that it’s possible to find instances of it that align with machine learning objectives one’s defined. In some sense what machine learning is doing is to “mine” the computational universe for programs that do what one wants.

It’s not that machine learning nails a specific precise program. Rather, it’s that in typical successful applications of machine learning there are lots of programs that “do more or less the right thing”. If what one’s trying to do involves something computationally irreducible, machine learning won’t typically be able to “get well enough aligned” to correctly “get through all the steps” of the irreducible computation. But it seems that many “human-like tasks” that are the particular focus of modern machine learning can successfully be done.

And by the way, one can expect that with the minimal models explored here, it becomes more feasible to get a real characterization of what kinds of objectives can successfully be achieved by machine learning, and what cannot. Critical to the operation of machine learning is not only that there exist programs that can do particular kinds of things, but also that they can realistically be found by adaptive evolution processes.

In what we’ve done here we’ve often used what’s essentially the very simplest possible process for adaptive evolution: a sequence of point mutations. And what we’ve discovered is that even this is usually sufficient to lead us to satisfactory machine learning solutions. It could be that our paths of adaptive evolution would always be getting stuck—and not reaching any solution. But the fact that this doesn’t happen seems crucially connected to the computational irreducibility that’s ubiquitous in the systems we’re studying, and that leads to effective randomness that with overwhelming probability will “give us a way out” of anywhere we got stuck.

In some sense computational irreducibility “levels the playing field” for different processes of adaptive evolution, and lets even simple ones be successful. Something similar seems to happen for the whole framework we’re using. Any of a wide class of systems seem capable of successful machine learning, even if they don’t have the detailed structure of traditional neural nets. We can see this as a typical reflection of the Principle of Computational Equivalence: that even though systems may differ in their details, they are ultimately all equivalent in the computations they can do.

The phenomenon of computational irreducibility leads to a fundamental tradeoff, of particular importance in thinking about things like AI. If we want to be able to know in advance—and broadly guarantee—what a system is going to do or be able to do, we have to set the system up to be computationally reducible. But if we want the system to be able to make the richest use of computation, it’ll inevitably be capable of computationally irreducible behavior. And it’s the same story with machine learning. If we want machine learning to be able to do the best it can, and perhaps give us the impression of “achieving magic”, then we have to allow it to show computational irreducibility. And if we want machine learning to be “understandable” it has to be computationally reducible, and not able to access the full power of computation.

At the outset, though, it’s not obvious whether machine learning actually has to access such power. It could be that there are computationally reducible ways to solve the kinds of problems we want to use machine learning to solve. But what we’ve discovered here is that even in solving very simple problems, the adaptive evolution process that’s at the heart of machine learning will end up sampling—and using—what we can expect to be computationally irreducible processes.

Like biological evolution, machine learning is fundamentally about finding things that work—without the constraint of “understandability” that’s forced on us when we as humans explicitly engineer things step by step. Could one imagine constraining machine learning to make things understandable? To do so would effectively prevent machine learning from having access to the power of computationally irreducible processes, and from the evidence here it seems unlikely that with this constraint the kind of successes we’ve seen in machine learning would be possible.

So what does this mean for the “science of machine learning”? One might have hoped that one would be able to “look inside” machine learning systems and get detailed narrative explanations for what’s going on; that in effect one would be able to “explain the mechanism” for everything. But what we’ve seen here suggests that in general nothing like this will work. All one will be able to say is that somewhere out there in the computational universe there’s some (typically computationally irreducible) process that “happens” to be aligned with what we want.

Yes, we can make general statements—strongly based on computational irreducibility—about things like the findability of such processes, say by adaptive evolution. But if we ask “How in detail does the system work?”, there won’t be much of an answer to that. Of course we can trace all its computational steps and see that it behaves in a certain way. But we can’t expect what amounts to a “global human-level explanation” of what it’s doing. Rather, we’ll basically just be reduced to looking at some computationally irreducible process and observing that it “happens to work”—and we won’t have a high-level explanation of “why”.

But there is one important loophole to all this. Within any computationally irreducible system, there are always inevitably pockets of computational reducibility. And—as I’ve discussed at length particularly in connection with our Physics Project—it’s these pockets of computational reducibility that allow computationally bounded observers like us to identify things like “laws of nature” from which we can build “human-level narratives”.

So what about machine learning? What pockets of computational reducibility show up there, from which we might build “human-level scientific laws”? Much as with the emergence of “simple continuum behavior” from computationally irreducible processes happening at the level of molecules in a gas or ultimate discrete elements of space, we can expect that at least certain computationally reducible features will be more obvious when one’s dealing with larger numbers of components. And indeed in sufficiently large machine learning systems, it’s routine to see smooth curves and apparent regularity when one’s looking at the kind of aggregated behavior that’s probed by things like training curves.

But the question about pockets of reducibility is always whether they end up being aligned with things we consider interesting or useful. Yes, it could be that machine learning systems would exhibit some kind of collective (“EEG-like”) behavior. But what’s not clear is whether this behavior will tell us anything about the actual “information processing” (or whatever) that’s going on in the system. And if there is to be a “science of machine learning” what we have to hope for is that we can find in machine learning systems pockets of computational reducibility that are aligned with things we can measure, and care about.

So given what we’ve been able to explore here about the foundations of machine learning, what can we say about the ultimate power of machine learning systems? A key observation has been that machine learning works by “piggybacking” on computational irreducibility—and in effect by finding “natural pieces of computational irreducibility” that happen to fit with the objectives one has. But what if those objectives involve computational irreducibility—as they often do when one’s dealing with a process that’s been successfully formalized in computational terms (as in math, exact science, computational X, etc.)? Well, it’s not enough that our machine learning system “uses some piece of computational irreducibility inside”. To achieve a particular computationally irreducible objective, the system would have to do something closely aligned with that actual, specific objective.

It has to be said, however, that by laying bare more of the essence of machine learning here, it becomes easier to at least define the issues of merging typical “formal computation” with machine learning. Traditionally there’s been a tradeoff between the computational power of a system and its trainability. And indeed in terms of what we’ve seen here this seems to reflect the sense that “larger chunks of computational irreducibility” are more difficult to fit into something one’s incrementally building up by a process of adaptive evolution.

So how should we ultimately think of machine learning? In effect its power comes from leveraging the “natural resource” of computational irreducibility. But when it uses computational irreducibility it does so by “foraging” pieces that happen to advance its objectives. Imagine one’s building a wall. One possibility is to fashion bricks of a particular shape that one knows will fit together. But another is just to look at stones one sees lying around, then to build the wall by fitting these together as best one can.

And if one then asks “Why does the wall have such-and-such a pattern?” the answer will end up being basically “Because that’s what one gets from the stones that happened to be lying around”. There’s no overarching theory to it in itself; it’s just a reflection of the resources that were out there. Or, in the case of machine learning, one can expect that what one sees will be to a large extent a reflection of the raw characteristics of computational irreducibility. In other words, the foundations of machine learning are as much as anything rooted in the science of ruliology. And it’s in large measure to that science we should look in our efforts to understand more about “what’s really going on” in machine learning, and quite possibly also in neuroscience.

Historical & Personal Notes

In some ways it seems like a quirk of intellectual history that the kinds of foundational questions I’ve been discussing here weren’t already addressed long ago—and in some ways it seems like an inexorable consequence of the only rather recent development of certain intuitions and tools.

The idea that the brain is fundamentally made of connected nerve cells was considered in the latter part of the nineteenth century, and took hold in the first decades of the twentieth century—with the formalized concept of a neural net that operates in a computational way emerging in full form in the work of Warren McCulloch and Walter Pitts in 1943. By the late 1950s there were hardware implementations of neural nets (typically for image processing) in the form of “perceptrons”. But despite early enthusiasm, practical results were mixed, and at the end of the 1960s it was announced that simple cases amenable to mathematical analysis had been “solved”—leading to a general belief that “neural nets couldn’t do anything interesting”.

Ever since the 1940s there had been a trickle of general analyses of neural nets, particularly using methods from physics. But typically these analyses ended up with things like continuum approximations—that could say little about the information-processing aspects of neural nets. Meanwhile, there was an ongoing undercurrent of belief that somehow neural networks would both explain and reproduce how the brain works—but no methods seemed to exist to say quite how. Then at the beginning of the 1980s there was a resurgence of interest in neural networks, coming from several directions. Some of what was done concentrated on very practical efforts to get neural nets to do particular “human-like” tasks. But some was more theoretical, typically using methods from statistical physics or dynamical systems.

Before long, however, the buzz died down, and for several decades only a few groups were left working with neural nets. Then in 2011 came a surprise breakthrough in using neural nets for image analysis. It was an important practical advance. But it was driven by technological ideas and development—not any significant new theoretical analysis or framework.

And this was also the pattern for almost all of what followed. People spent great effort to come up with neural net systems that worked—and all sorts of folklore grew up about how this should best be done. But there wasn’t really even an attempt at an underlying theory; this was a domain of engineering practice, not basic science.

And it was in this tradition that ChatGPT burst onto the scene in late 2022. Almost everything about LLMs seemed to be complicated. Yes, there were empirically some large-scale regularities (like scaling laws). And I quickly suspected that the success of LLMs was a strong hint of general regularities in human language that hadn’t been clearly identified before. But beyond a few outlier examples, almost nothing about “what’s going on inside LLMs” has seemed easy to decode. And efforts to put “strong guardrails” on the operation of the system—in effect so as to make it in some way “predictable” or “understandable”—typically seem to substantially decrease its power (a point that now makes sense in the context of computational irreducibility).

My own interaction with machine learning and neural nets began in 1980 when I was developing my SMP symbolic computation system, and wondering whether it might be possible to generalize the symbolic pattern-matching foundations of the system to some kind of “fuzzy pattern matching” that would be closer to human thinking. I was aware of neural nets but thought of them as semi-realistic models of brains, not for example as potential sources of algorithms of the kind I imagined might “solve” fuzzy matching.

And it was partly as a result of trying to understand the essence of systems like neural nets that in 1981 I came up with what I later learned could be thought of as one-dimensional cellular automata. Soon I was deeply involved in studying cellular automata and developing a new intuition about how complex behavior could arise even from simple rules. But when I learned about recent efforts to make idealized models of neural nets using ideas from statistical mechanics, I was at least curious enough to set up simulations to try to understand more about these models.

But what I did wasn’t a success. I could neither get the models to do anything of significant practical interest—nor did I manage to derive any good theoretical understanding of them. I kept wondering, though, what relationship there might be between cellular automata that “just run”, and systems like neural nets that can also “learn”. And in fact in 1985 I tried to make a minimal cellular-automaton-based model to explore this. It was what I’m now calling a “vertically layered rule array”. And while in many ways I was already asking the right questions, this was an unfortunate specific choice of system—and my experiments on it didn’t reveal the kinds of phenomena we’re now seeing.

Years went by. I wrote a section on “Human Thinking” in A New Kind of Science, that discussed the possibility of simple foundational rules for the essence of thinking, and even included a minimal discrete analog of a neural net. At the time, though, I didn’t develop these ideas. By 2017, though, 15 years after the book was published—and knowing about the breakthroughs in deep learning—I had begun to think more concretely about neural nets as getting their power by sampling programs from across the computational universe. But still I didn’t see quite how this would work.

Meanwhile, there was a new intuition emerging from practical experience with machine learning: that if you “bashed” almost any system “hard enough”, it would learn. Did that mean that perhaps one didn’t need all the details of neural networks to successfully do machine learning? And could one perhaps make a system whose structure was simple enough that its operation would for example be accessible to visualization? I particularly wondered about this when I was writing an exposition of ChatGPT and LLMs in early 2023. And I kept talking about “LLM science”, but didn’t have much of a chance to work on it.

But then, a few months ago, as part of an effort to understand the relation between what science does and what AI does, I tried a kind of “throwaway experiment”—which, to my considerable surprise, seemed to successfully capture some of the essence of what makes biological evolution possible. But what about other adaptive evolution—and in particular, machine learning? The models that seemed to be needed were embarrassingly close to what I’d studied in 1985. But now I had a new intuition—and, thanks to Wolfram Language, vastly better tools. And the result has been my effort here.

Of course this is only a beginning. But I’m excited to be able to see what I consider to be the beginnings of foundational science around machine learning. Already there are clear directions for practical applications (which, needless to say, I plan to explore). And there are signs that perhaps we may finally be able to understand just why—and when—the “magic” of machine learning works.

Thanks

Thanks to Richard Assar of the Wolfram Institute for extensive help. Thanks also to Brad Klee, Tianyi Gu, Nik Murzin and Max Niederman for specific results, to George Morgan and others at Symbolica for their early interest, and to Kovas Boguta for suggesting many years ago to link machine learning to the ideas in A New Kind of Science.

Yet More New Ideas and New Functions: Launching Version 14.1 of Wolfram Language & Mathematica

Stephen Wolfram — Wed, 31 Jul 2024 21:53:02 +0000

For the 36th Time… the Latest from Our R&D Pipeline

There’s Now a Unified Wolfram App

Vector Databases and Semantic Search

RAGs and Dynamic Prompting for LLMs

Connect to Your Favorite LLM

Symbolic Arrays and Their Calculus

Binomials and Pitchforks: Navigating Mathematical Conventions

Fixed Points and Stability for Differential and Difference Equations

The Steady Advance of PDEs

Symbolic Biomolecules and Their Visualization

Optimizing Neural Nets for GPUs and NPUs

The Statistics of Dates

Building Videos with Programs

Optimizing the Speech Recognition Workflow

Historical Geography Becomes Computable

Astronomical Graphics and Their Axes

When Is Earthrise on Mars? New Level of Astronomical Computation

Geometry Goes Color, and Polar

New Computation Flow in Notebooks: Introducing Cell-Linked %

The UX Journey Continues: New Typing Affordances, and More

Syntax for Natural Language Input

Diff[ ] … for Notebooks and More!

Lots of Little Language Tune-Ups

Making the Wolfram Compiler Easier to Use

Even Smoother Integration with External Languages

Standalone Wolfram Language Applications!

And Yet More…

For the 36th Time… the Latest from Our R&D Pipeline

Today we celebrate the arrival of the 36th (x.x) version of the Wolfram Language and Mathematica: Version 14.1. We’ve been doing this since 1986: continually inventing new ideas and implementing them in our larger and larger tower of technology. And it’s always very satisfying to be able to deliver our latest achievements to the world.

We released Version 14.0 just half a year ago. And—following our modern version scheduling—we’re now releasing Version 14.1. For most technology companies a .1 release would contain only minor tweaks. But for us it’s a snapshot of what our whole R&D pipeline has delivered—and it’s full of significant new features and new enhancements.

If you’ve been following our livestreams, you may have already seen many of these features and enhancements being discussed as part of our open software design process. And we’re grateful as always to members of the Wolfram Language community who’ve made suggestions—and requests. And in fact Version 14.1 contains a particularly large number of long-requested features, some of which involved development that has taken many years and required many intermediate achievements.

There’s lots of both extension and polishing in Version 14.1. There are a total of 89 entirely new functions—more than in any other version for the past couple of years. And there are also 137 existing functions that have been substantially updated. Along with more than 1300 distinct bug fixes and specific improvements.

Some of what’s new in Version 14.1 relates to AI and LLMs. And, yes, we’re riding the leading edge of these kinds of capabilities. But the vast majority of what’s new has to do with our continued mission to bring computational language and computational knowledge to everything. And today that mission is even more important than ever, supporting not only human users, but also rapidly proliferating AI “users”—who are beginning to be able to routinely make even broader and deeper use of our technology than humans.

Each new version of Wolfram Language represents a large amount of R&D by our team, and the encapsulation of a surprisingly large number of ideas about what should be implemented, and how it should be implemented. So, today, here it is: the latest stage in our four-decade journey to bring the superpower of the computational paradigm to everything.

There’s Now a Unified Wolfram App

In the beginning we just had “Mathematica”—that we described as “A System for Doing Mathematics by Computer”. But the core of “Mathematica”—based on the very general concept of transformations for symbolic expressions—was always much broader than “mathematics”. And it didn’t take long before “mathematics” was an increasingly small part of the system we had built. We agonized for years about how to rebrand things to better reflect what the system had become. And eventually, just over a decade ago, we did the obvious thing, and named what we had “the Wolfram Language”.

But when it came to actual software products and executables, so many people were familiar with having a “Mathematica” icon on their desktop that we didn’t want to change that. Later we introduced Wolfram|One as a general product supporting Wolfram Language across desktop and cloud—with Wolfram Desktop being its desktop component. But, yes, it’s all been a bit confusing. Ultimately there’s just one “bag of bits” that implements the whole system we’ve built, even though there are different usage patterns, and differently named products that the system supports. Up to now, each of these different products has been a different executable, that’s separately downloaded.

But starting with Version 14.1 we’re unifying all these things—so that now there’s just a single unified Wolfram app, that can be configured and activated in different ways corresponding to different products.

So now you just go to wolfram.com/download-center and download the Wolfram app:

After you’ve installed the app, you activate it as whatever product(s) you’ve got: Wolfram|One, Mathematica, Wolfram|Alpha Notebook Edition, etc. Why have separate products? Each one has a somewhat different usage pattern, and provides a somewhat different interface optimized for that usage pattern. But now the actual downloading of bits has been unified; you just have to download the unified Wolfram app and you’ll get what you need.

Vector Databases and Semantic Search

Let’s say you’ve got a million documents (or webpages, or images, or whatever) and you want to find the ones that are “closest” to something. Version 14.1 now has a function—SemanticSearch—for doing this. How does SemanticSearch work? Basically it uses machine learning methods to find “vectors” (i.e. lists) of numbers that somehow represent the “meaning” of each of your documents. Then when you want to know which documents are “closest” to something, SemanticSearch computes the vector for the something, and then sees which of the document vectors are closest to this vector.

In principle one could use Nearest to find closest vectors. And indeed this works just fine for small examples where one can readily store all the vectors in memory. But SemanticSearch uses a full industrial-strength approach based on the new vector database capabilities of Version 14.1—which can work with huge collections of vectors stored in external files.

There are lots of ways to use both SemanticSearch and vector databases. You can use them to find documents, snippets within documents, images, sounds or anything else whose “meaning” can somehow be captured by a vector of numbers. Sometimes the point is to retrieve content directly for human consumption. But a particularly strong modern use case is to set up “retrieval-augmented generation” (RAG) for LLMs—in which relevant content found with a vector database is used to provide a “dynamic prompt” for the LLM. And indeed in Version 14.1—as we’ll discuss later—we now have LLMPromptGenerator to implement exactly this pipeline.

But let’s come back to SemanticSearch on its own. Its basic design is modeled after TextSearch, which does keyword-based searching of text. (Note, though, that SemanticSearch also works on many things other than text.)

In direct analogy to CreateSearchIndex for TextSearch, there’s now a CreateSemanticSearchIndex for SemanticSearch. Let’s do a tiny example to see how it works. Essentially we’re going to make an (extremely restricted) “inverse dictionary”. We set up a list of definition word elements:

Now create a semantic search index from this:

Behind the scenes this is a vector database. But we can access it with SemanticSearch:

And since “whale” is considered closest, it comes first.

What about a more realistic example? Instead of just using 3 words, let’s set up definitions for all words in the dictionary. It takes a little while (like a few minutes) to do the machine learning feature extraction for all the definitions. But in the end you get a new semantic search index:

This time it has 39,186 entries—but SemanticSearch picks out the (by default) 10 that it considers closest to what you asked for (and, yes, there’s an archaic definition of “seahorse” as “walrus”):

We can see a bit more detail about what’s going on by asking SemanticSearch to explicitly show us distances:

And plotting these we can see that “whale” is the winner by a decent margin:

One subtlety when dealing with semantic search indices is where to store them. When they’re sufficiently small, you can store them directly in memory, or in a notebook. But usually you’ll want to store them in a separate file, and if you want to share an index you’ll want to put this file in the cloud. You can do this either interactively from within a notebook

or programmatically:

And now the SemanticSearchIndex object you have can be used by anyone, with its data being accessed in the cloud.

In most cases SemanticSearch will be what you need. But sometimes it’s worthwhile to “go underneath” and directly work with vector databases. Here’s a collection of small vectors:

We can use Nearest to find the nearest vector to one we give:

But we can also do this with a vector database. First we create the database:

And now we can search for the nearest vector to the one we give:

In this case we get exactly the same answer as from Nearest. But whereas the mission of Nearest is to give us the mathematically precise nearest vector, VectorDatabaseSearch is doing something less precise—but is able to do it for extremely large numbers of vectors that don’t need to be stored directly in memory.

Those vectors can come from anywhere. For example, here they’re coming from extracting features from some images:

Now let’s say we’ve got a specific image. Then we can search our vector database to get the image whose feature vector is closest to the one for the image we provided:

And, yes, this works for other kinds of objects too:

CreateSemanticSearchIndex and CreateVectorDatabase create vector databases from scratch using data you provide. But—just like with text search indices—an important feature of vector databases is that you can incrementally add to them. So, for example, UpdateSemanticSearchIndex and AddToVectorDatabase let you efficiently add individual entries or lists of entries to vector databases.

In addition to providing capabilities for building (and growing) your own vector databases, there are several pre-built vector databases that are now available in the Wolfram Data Repository:

So now we can use a pre-built vector database of Wolfram Language function documentation to do a semantic search for snippets that are “semantically close” to being about iterating functions:

(In the next section, we’ll see how to actually “synthesize a report” based on this.)

The basic function of SemanticSearch is to determine what “chunks of content” are closest to what you are asking about. But given a semantic search index (AKA vector database) there are also other important things you can do. One of them is to use TextSummarize to ask not for specific chunks but rather for some kind of overall summary of what can be said about a given topic from the content in the semantic search index:

RAGs and Dynamic Prompting for LLMs

How does one tell an LLM what one wants it to do? Fundamentally, one provides a prompt, and then the LLM generates output that “continues” that prompt. Typically the last part of the prompt is the specific question (or whatever) that a user is asking. But before that, there’ll be “pre-prompts” that prime the LLM in various ways to determine how it should respond.

In Version 13.3 in mid-2023 (i.e. a long time ago in the world of LLMs!) we introduced LLMPrompt as a symbolic way to specify a prompt, and we launched the Wolfram Prompt Repository as a broad source for pre-built prompts. Here’s an example of using LLMPrompt with a prompt from the Wolfram Prompt Repository:

In its simplest form, LLMPrompt just adds fixed text to “pre-prompt” the LLM. LLMPrompt is also set up to take arguments that modify the text it’s adding:

But what if one wants the LLM to be pre-prompted in a way that depends on information that’s only available once the user actually asks their question (like, for example, the text of the question itself)? In Version 14.1 we’re adding LLMPromptGenerator to dynamically generate pre-prompts. And it turns out that this kind of “dynamic prompting” is remarkably powerful, and—particularly together with tool calling—opens up a whole new level of capabilities for LLMs.

For example, we can set up a prompt generator that produces a pre-prompt that gives the registered name of the user, so the LLM can use this information when generating its answer:

Or for example here the prompt generator is producing a pre-prompt about sunrise, sunset and the current time:

And, yes, if the pre-prompt contains extra information (like about the Moon) the LLM will (probably) ignore it:

As another example, we can take whatever the user asks, and first do a web search on it, then include as a pre-prompt snippets we get from the web. The result is that we can get answers from the LLM that rely on specific “web knowledge” that we can’t expect will be “known in detail” by the raw LLM:

But often one doesn’t want to just “search at random on the web”; instead one wants to systematically retrieve information from some known source to give as “briefing material” to the LLM to help it in generating its answer. And a typical way to implement this kind of “retrieval-augmented generation (RAG)” is to set up an LLMPromptGenerator that uses the SemanticSearch and vector database capabilities that we introduced in Version 14.1.

So, for example, here’s a semantic search index generated from my (rather voluminous) writings:

By setting up a prompt generator based on this, I can now ask the LLM “personal questions”:

How did the LLM “know that”? Internally the prompt generator used SemanticSearch to generate a collection of snippets, which the LLM then “trawled through” to produce a specific answer:

It’s already often very useful just to “retrieve static text” to “brief” the LLM. But even more powerful is to brief the LLM with what it needs to call tools that can do further computation, etc. So, for example, if you want the LLM to write and run Wolfram Language code that uses functions you’ve created, you can do that by having it first “read the documentation” for those functions.

As an example, this uses a prompt generator that uses a semantic search index built from the Wolfram Function Repository:

Connect to Your Favorite LLM

There are now many ways to use LLM functionality from within the Wolfram Language, and Wolfram Notebooks. You can do it programmatically, with LLMFunction, LLMSynthesize, etc. You can do it interactively through Chat Notebooks and related chat capabilities.

But (at least for now) there’s no full-function LLM built directly into the Wolfram Language. So that means that (at least for now) you have to choose your “flavor” of external LLM to power Wolfram Language LLM functionality. And in Version 14.1 we have support for basically all major available foundation-model LLMs.

We’ve made it as straightforward as possible to set up connections to external LLMs. Once you’ve done it, you can select what you want directly in any Chat Notebook

or from your global Preferences:

When you’re using a function you specify the “model” (i.e. service and specific model name) as part of the setting for LLMEvaluator:

In general you can use LLMConfiguration to define the whole configuration of an LLM you want to connect to, and you can make a particular configuration your default either interactively using Preferences, or by explicitly setting the value of $LLMEvaluator.

So how do you initially set up a connection to a new LLM? You can do it interactively by pressing Connect in the AI Settings pane of Preferences. Or you can do it programmatically using ServiceConnect:

At the “ServiceConnect level” you have very direct access to the features of LLM APIs, though unless you’re studying LLM APIs you probably won’t need to use these. But talking of LLM APIs, one of the things that’s now easy to do with Wolfram Language is to compare LLMs, for example programmatically sending the same question to multiple LLMs:

And in fact we’ve recently started posting weekly results that we get from a full range of LLMs on the task of writing Wolfram Language code (conveniently, the exercises in my book An Elementary Introduction to the Wolfram Language have textual “prompts”, and we have a well-developed system that we’ve used for many years in assessing code for the online course based on the book):

Symbolic Arrays and Their Calculus

I want A to be an n×n matrix. I don’t want to say what its elements are, and I don’t even want to say what n is. I just want to have a way to treat the whole thing symbolically. Well, in Version 14.1 we’ve introduced MatrixSymbol to do that.

A MatrixSymbol has a name (just like an ordinary symbol)—and it has a way to specify its dimensions. We can use it, for example, to set up a symbolic representation for our matrix A:

Hovering over this in a notebook, we’ll get a tooltip that explains what it is:

We can ask for its dimensions as a tensor:

Here’s its inverse, again represented symbolically:

That also has dimensions n×n:

In Version 14.1 you can not only have symbolic matrices, you can also have symbolic vectors and, for that matter, symbolic arrays of any rank. Here’s a length-n symbolic vector (and, yes, we can have a symbolic vector named v that we assign to a symbol v):

So now we can construct something like the quadratic form:

A classic thing to compute from this is its gradient with respect to the vector :

And actually this is just the same as the “vector derivative”:

If we do a second derivative we get:

What happens if we differentiate with respect to ? Well, then we get a symbolic identity matrix

which again has dimensions n×n:

is a rank-2 example of a symbolic identity array:

If we give n an explicit value, we can get an explicit componentwise array:

Let’s say we have a function of , like Total. Once again we can find the derivative with respect to :

And now we see another symbolic array construct: SymbolicOnesArray:

This is simply an array whose elements are all 1:

Differentiating a second time gives us a SymbolicZerosArray:

Although we’re not defining explicit elements for , it’s sometimes important to specify, for example, that all the elements are reals:

For a vector whose elements are reals, it’s straightforward to find the derivative of the norm:

The third derivative, though, is a bit more complicated:

The ⊗ here is TensorProduct, and the T:(1,3,2) represents Transpose[..., {1, 3, 2}].

In the Wolfram Language, a symbol, say s, can stand on its own, and represent a “variable”. It can also appear as a head—as in s[x]—and represent a function. And the same is true for vector and matrix symbols:

Importantly, the chain rule also works for matrix and vector functions:

Things get a bit trickier when one’s dealing with functions of matrices:

The here represents ArrayDot[..., ..., 2], which is a generalization of Dot. Given two arrays u and v, Dot will contract the last index of u with the first index of v:

ArrayDot[u, v, n], on the other hand, contracts the last n indices of u with the first n of v. ArrayDot[u, v, 1] is just the same as Dot[u, v]:

But now in this particular example all the indices get “contracted out”:

We’ve talked about symbolic vectors and matrices. But—needless to say—what we have is completely general, and will work for arrays of any rank. Here’s an example of a p×q×r array:

The overscript indicates that this is an array of rank 3.

When one takes derivatives, it’s very easy to end up with high-rank arrays. Here’s the result of differentiating with respect to a matrix:

is a rank-4 n×n×n×n identity array.

When one’s dealing with higher-rank objects, there’s one more construct that appears—that we call SymbolicDeltaProductArray. Let’s set up a rank-3 array with dimensions 3×3×3:

Now let’s compute a derivative:

The result is a rank-5 array that’s effectively a combination of two KroneckerDelta objects for indices 1,4 and 2,5, respectively:

We can visualize this with ArrayPlot3D:

The most common way to deal with arrays in the Wolfram Language has always been in terms of explicit lists of elements. And in this representation it’s extremely convenient that operations are normally done elementwise:

Non-lists are then by default treated as scalars—and for example here added into every element:

But now there’s something new, namely symbolic arrays—which in effect implicitly contain multiple list elements, and thus can’t be “added into every element”:

This is what happens when we have an “ordinary scalar” together with a symbolic vector:

How does this work? “Under the hood” there’s a new attribute NonThreadable which specifies that certain heads (like ArraySymbol) shouldn’t be threaded by Listable functions (like Plus).

By the way, ever since Version 9 a dozen years ago we’ve had a limited mechanism for assuming that symbols represent vectors, matrices or arrays—and now that mechanism interoperates with all our new symbolic array functionality:

When you’re doing explicit computations there’s often no choice but to deal directly with individual array elements. But it turns out that there are all sorts of situations where it’s possible to work instead in terms of “whole” vectors, matrices, etc. And indeed in the literature of fields like machine learning, optimization, statistics and control theory, it’s become quite routine to write down formulas in terms of symbolic vectors, matrices, etc. And what Version 14.1 now adds is a streamlined way to compute in terms of these symbolic array constructs.

The results are often very elegant. So, for example, here’s how one might set up a general linear least-squares problem using our new symbolic array constructs. First we define a symbolic n×m matrix A, and symbolic vectors b and x:

Our goal is to find a vector that minimizes . And with our definitions we can now immediately write down this quantity:

To extremize it, we compute its derivative

and to ensure we get a minimum, we compute the second derivative:

These are standard textbook formulas, but the cool thing is that in Version 14.1 we’re now in a position to generate them completely automatically. By the way, if we take another derivative, the result will be a zero tensor:

We can look at other norms too:

Binomials and Pitchforks: Navigating Mathematical Conventions

Binomial coefficients have been around for at least a thousand years, and one might not have thought there could possibly be anything shocking or controversial about them anymore (notwithstanding the fictional Treatise on the Binomial Theorem by Sherlock Holmes’s nemesis Professor Moriarty). But in fact we have recently been mired in an intense debate about binomial coefficients—which has caused us in Version 14.1 to introduce a new function PascalBinomial alongside our existing Binomial.

When one’s dealing with positive integer arguments, there’s no issue with binomials. And even when one extends to generic complex arguments, there’s again a unique way to do this. But negative integer arguments are a special degenerate case. And that’s where there’s trouble—because there are two different definitions that have historically been used.

In early versions of Mathematica, we picked one of these definitions. But over time we realized that it led to some subtle inconsistencies, and so for Version 7—in 2008—we changed to the other definition. Some of our users were happy with the change, but some were definitely not. A notable (vociferous) example was my friend Don Knuth, who has written several well-known books that make use of binomial coefficients—always choosing what amounts to our pre-2008 definition.

So what could we do about this? For a while we thought about adding an option to Binomial, but to do this would have broken our normal conventions for mathematical functions. And somehow we kept on thinking that there was ultimately a “right answer” to how binomial coefficients should be defined. But after a lot of discussion—and historical research—we finally concluded that since at least before 1950 there have just been two possible definitions, each with their own advantages and disadvantages, with no obvious “winner”. And so in Version 14.1 we decided just to introduce a new function PascalBinomial to cover the “other definition”.

And—though at first it might not seem like much—here’s a big difference between Binomial and PascalBinomial:

Part of why things get complicated is the relation to symbolic computation. Binomial has a symbolic simplification rule, valid for any n:

But there isn’t a corresponding generic simplification rule for PascalBinomial:

FunctionExpand shows us the more nuanced result in this case:

To see a bit more of what’s going on, we can compute arrays of nonzero results for Binomial and PascalBinomial:

Binomial[n, k] has the “nice feature” that it’s symmetric in k even when n < 0. But this has the “bad consequence” that Pascal’s identity (that says a particular binomial coefficient is the sum of two coefficients “above it”) isn’t always true. PascalBinomial, on the other hand, always satisfies the identity, and it’s in recognition of this that we put “Pascal” in its name.

And, yes, this is all quite subtle. And, remember, the differences between Binomial and PascalBinomial only show up at negative integer values. Away from such values, they’re both given by the same expression, involving gamma functions. But at negative integer values, they correspond to different limits, respectively:

The story of Binomial and PascalBinomial is a complicated one that mainly affects only the upper reaches of discrete mathematics. But there’s another, much more elementary convention that we’ve also tackled in Version 14.1: the convention of what the arguments of trigonometric functions mean.

We’ve always taken the “fundamentally mathematical” point of view that the x in Sin[x] is in radians:

You’ve always been able to explicitly give the argument in degrees (using Degree—or after Version 3 in 1996—using °):

But a different convention would just say that the argument to Sin should always be interpreted as being in degrees, even if it’s just a plain number. Calculators would often have a physical switch that globally toggles to this convention. And while that might be OK if you are just doing a small calculation and can physically see the switch, nothing like that would make any sense at all in our system. But still, particularly in elementary mathematics, one might want a “degrees version” of trigonometric functions. And in Version 14.1 we’ve introduced these:

One might think this was somewhat trivial. But what’s nontrivial is that the “degrees trigonometric functions” are consistently integrated throughout the system. Here, for example, is the period in SinDegrees:

You can take the integral as well

and the messiness of this form shows why for more than three decades we’ve just dealt with Sin[x] and radians.

Fixed Points and Stability for Differential and Difference Equations

All sorts of differential equations have the feature that their solutions exhibit fixed points. It’s always in principle been possible to find these by looking for points where derivatives vanish. But in Version 14.1 we now have a general, robust function that takes the same form of input as DSolve and finds all fixed points:

Here’s a stream plot of the solutions to our equations, together with the fixed points we’ve found:

And we can see that there are two different kinds of fixed points here. The ones on the left and right are “stable” in the sense that solutions that start near them always stay near them. But it’s a different story for the fixed points at the top and bottom; for these, solutions that start nearby can diverge. The function DStabilityConditions computes fixed points, and specifies whether they are stable or not:

As another example, here are the Lorenz equations, which have one unstable fixed point, and two stable ones:

If your equations have parameters, their stability fixed points can depend on those parameters:

Extracting the conditions here, we can now plot the region of parameter space where this fixed point is stable:

This kind of stability analysis is important in all sorts of fields, including dynamical systems theory, control theory, celestial mechanics and computational ecology.

And just as one can find fixed points and do stability analysis for differential equations, one can also do it for difference equations—and this is important for discrete dynamical systems, digital control systems, and for iterative numerical algorithms. Here’s a classic example in Version 14.1 for the logistic map:

The Steady Advance of PDEs

Five years ago—in Version 11.3—we introduced our framework for symbolically representing physical systems using PDEs. And in every version since we’ve been steadily adding more and more capabilities. At this point we’ve now covered the basics of heat transfer, mass transport, acoustics, solid mechanics, fluid mechanics, electromagnetics and (one-particle) quantum mechanics. And with our underlying symbolic framework, it’s easy to mix components of all these different kinds.

Our goal now is to progressively cover what’s needed for more and more kinds of applications. So in Version 14.1 we’re adding von Mises stress analysis for solid mechanics, electric current density models for electromagnetics and anisotropic effective masses for quantum mechanics.

So as an example of what’s now possible, here’s a piece of geometry representing a spiral inductor of the kind that might be used in a modern MEMS device:

Let’s define our variables—voltage and position:

And let’s specify parameters—here just that the material we’re going to deal with is copper:

Now we’re in a position to set up the PDE for this system, making use of the new constructs ElectricCurrentPDEComponent and ElectricCurrentDensityValue:

All it takes to solve this PDE for the voltage is then:

From the voltage we can compute the current density

and then plot it (and, yes, the current tends to avoid the corners):

Symbolic Biomolecules and Their Visualization

Ever since Version 12.2 we’ve had the ability to represent and manipulate bio sequences of the kind that appear in DNA, RNA and proteins. We’ve also been able to do things like import PDB (Protein Data Bank) files and generate graphics from them. But now in Version 14.1 we’re adding a symbolic BioMolecule construct, to represent the full structure of biomolecules:

Ultimately this is “just a molecule” (and in this case its data is so big it’s not by default stored locally in your notebook):

But what BioMolecule does is also to capture the “higher-order structure” of the molecule, for example how it’s built up from distinct chains, where structures like α-helices occur in these, and so on. For example, here are the two (bio sequence) chains that appear in this case:

And here are where the α-helices occur:

What about visualization? Well, there’s BioMoleculePlot3D for that:

There are different “themes” you can use for this:

Here’s a raw-atom-level view:

You can combine the views—and for example add coordinate values (specified in angstroms):

You can also specify “color rules” that define how particular parts of the biomolecule should be rendered:

But the structure here isn’t just something you can make graphics out of; it’s also something you can compute with. For example, here’s a geometric region formed from the biomolecule:

And this computes its surface area (in square angstroms):

The Wolfram Language has built-in data on a certain number of proteins. But you can get data on many more proteins from external sources—specifying them with external identifiers:

When you get a protein—say from an external source—it’ll often come with a 3D structure specified, for example as deduced from experimental measurements. But even without that, Version 14.1 will attempt to find at least an approximate structure—by using machine-learning-based protein-folding methods. As an example, here’s a random bio sequence:

If you make a BioMolecule out of this, a “predicted” 3D structure will be generated:

Here’s a visualization of this structure—though more work would be needed to determine how it’s related to what one might actually observe experimentally:

Optimizing Neural Nets for GPUs and NPUs

Many computers now come with GPU and NPU hardware accelerators for machine learning, and in Version 14.1 we’ve added more support for these. Specifically, on macOS (Apple Silicon) and Windows machines, built-in functions like ImageIdentify and SpeechRecognize now automatically use CoreML (Neural Engine) and DirectML capabilities—and the result is typically 2X to 10X faster performance.

We’ve always supported explicit CUDA GPU acceleration, for both training and inference. But in Version 14.1 we now support CoreML and DirectML acceleration for inference tasks with explicitly specified neural nets. But whereas this acceleration is now the default for built-in machine-learning-based functions, for explicitly specified models the default isn’t yet the default.

So, for example, this doesn’t use GPU acceleration:

But you can explicitly request it—and then (assuming all features of the model can be accelerated) things will typically run significantly faster:

We’re continually sprucing up our infrastructure for machine learning. And as part of that, in Version 14.1 we’ve enhanced our diagrams for neural nets to make layers more visually distinct—and to immediately produce diagrams suitable for publication:

The Statistics of Dates

We’ve been releasing versions of what’s now the Wolfram Language for 36 years. And looking at that whole collection of release dates, we can ask statistical questions. Like “What’s the median date for all the releases so far?” Well, in Version 14.1 there’s a direct way to answer that—because statistical functions like Median now just immediately work on dates:

What if we ask about all 7000 or so functions in the Wolfram Language? Here’s a histogram of when they were introduced:

And now we can compute the median, showing quantitatively that, yes, Wolfram Language development has speeded up:

Dates are a bit like numbers, but not quite. For example, their “zero” shifts around depending on the calendar. And their granularity is more complicated than precision for numbers. In addition, a single date can have multiple different representations (say in different calendars or with different granularities). But it nevertheless turns out to be possible to define many kinds of statistics for dates. To understand these statistics—and to compute them—it’s typically convenient to make one’s whole collection of dates have the same form. And in Version 14.1 this can be achieved with the new function ConformDates (which here converts all dates to the format of the first one listed):

By the way, in Version 14.1 the whole pipeline for handling dates (and times) has been dramatically speeded up, most notably conversion from strings, as needed in the import of dates.

The concept of doing statistics on dates introduces another new idea: date (and time) distributions. And in Version 14.1 there are two new functions DateDistribution and TimeDistribution for defining such distributions. Unlike for numerical (or quantity) distributions, date and time distributions require the specification of an origin, like Today, as well as of a scale, like "Days":

But given this symbolic specification, we can now do operations just like for any other distribution, say generating some random variates:

Building Videos with Programs

Introduced in Version 6 back in 2007, Manipulate provides an immediate way to create an interactive “manipulable” interface. And it’s been possible for a long time to export Manipulate objects to video. But just what should happen in the video? What sliders should move in what way? In Version 12.3 we introduced AnimationVideo to let you make a video in which one parameter is changing with time. But now in Version 14.1 we have ManipulateVideo which lets you create a video in which many parameters can be varied simultaneously. One way to specify what you want is to say for each parameter what value it should get at a sequence of times (by default measured in seconds from the beginning of the video). ManipulateVideo then produces a smooth video by interpolating between these values:

(An alternative is to specify complete “keyframes” by giving operations to be done at particular times.)

ManipulateVideo in a sense provides a “holistic” way to create a video by controlling a Manipulate. And in the last several versions we’ve introduced many functions for creating videos from “existing structures” (for example FrameListVideo assembles a video from a list of frames). But sometimes you want to build up videos one frame at a time. And in Version 14.1 we’ve introduced SowVideo and ReapVideo for doing this. They’re basically the analog of Sow and Reap for video frames. SowVideo will “sow” one or more frames, and all frames you sow will then be collected and assembled into a video by ReapVideo:

One common application of SowVideo/ReapVideo is to assemble a video from frames that are programmatically picked out by some criterion from some other video. So, for example, this “sows” frames that contain a bird, then “reaps” them to assemble a new video.

Another way to programmatically create one video from another is to build up a new video by progressively “folding in” frames from an existing video—which is what the new function VideoFrameFold does:

Version 14.1 also has a variety of new “convenience functions” for dealing with videos. One example is VideoSummaryPlot which generates various “at-a-glance” summaries of videos (and their audio):

Another new feature in Version 14.1 is the ability to apply audio processing functions directly to videos:

And, yes, it’s a bird:

Optimizing the Speech Recognition Workflow

We first introduced SpeechRecognize in 2019 in Version 12.0. And now in Version 14.1 SpeechRecognize is getting a makeover.

The most dramatic change is speed. In the past, SpeechRecognize would typically take at least as long to recognize a piece of speech as the duration of the speech itself. But now in Version 14.1, SpeechRecognize runs many tens of times faster, so you can recognize speech much faster than real time.

And what’s more, SpeechRecognize now produces full, written text, complete with capitalization, punctuation, etc. So here, for example, is a transcription of a little video:

There’s also a new function, VideoTranscribe, that will take a video, transcribe its audio, and insert the transcription back into the subtitle track of the video.

And, by the way, SpeechRecognize runs entirely locally on your computer, without having to access a server (except maybe for updates to the neural net it’s using).

In the past SpeechRecognize could only handle English. In Version 14.1 it can handle 100 languages—and can automatically produce translated transcriptions. (By default it produces transcriptions in the language you’ve specified with $Language.) And if you want to identify what language a piece of audio is in, LanguageIdentify now works directly on audio.

SpeechRecognize by default produces a single string of text. But it now also has the option to break up its results into a list, say of sentences:

And in addition to producing a transcription, SpeechRecognize can give time intervals or audio fragments for each element:

Historical Geography Becomes Computable

History is complicated. But that doesn’t mean there isn’t much that can be made computable about it. And in Version 14.1 we’re taking a major step forward in making historical geography computable. We’ve had extensive geographic computation capabilities in the Wolfram Language for well over a decade. And in Version 14.1 we’re extending that to historical geography.

So now you can not only ask for a map of where the current country of Italy is, you can also ask to make a map of the Roman Empire in 100 AD:

And “the Roman Empire in 100 AD” is now a computable entity. So you can ask for example what its approximate area was:

And you can even make a plot of how the area of the Roman Empire changed over the period from 0 AD to 200 AD:

We’ve been building our knowledgebase of historical geography for many years. Of course, country borders may be disputed, and—particularly in the more distant past—may not have been well defined. But by now we’ve accumulated computable data on basically all of the few thousand known historical countries. Still—with history being complicated—it’s not surprising that there are all sorts of often subtle issues.

Let’s start by asking what historical countries the location that’s now Mexico City has been in. GeoIdentify gives the answer:

And already we see subtlety. For example, our historical country entities are labeled by their overall beginning and ending dates. But most of them covered Mexico City only for part of their existence. And here we can see what’s going on:

Often there’s subtlety in identifying what should count as a “different country”. If there was just an “acquisition” or a small “change of management” maybe it’s still the same country. But if there was a “dramatic reorganization”, maybe it’s a different country. Sometimes the names of countries (if they even had official names) give clues. But in general it’s taken lots of case-by-case curation, trying to follow the typical conventions used by historians of particular times and places.

For London we see several “close-but-we-consider-it-a-different-country” issues—along with various confusing repeated conquerings and reconquerings:

Here’s a timeline plot of the countries that have contained London:

And because everything is computable, it’s easy to identify the longest contiguous segment here:

GeoIdentify can tell us what entities something like a city is inside. GeoEntities, on the other hand, can tell us what entities are inside something like a country. So, for example, this tells us what historical countries were inside (or at least overlapped with) the current boundaries of the UK in 800 AD:

This then makes a map (the extra list makes these countries be rendered separately):

In the Wolfram Language we have data on quite a few kinds of historical entities beyond countries. For example, we have extensive data on military conflicts. Here we’re asking what military conflicts occurred within the borders of what’s now France between 200 BC and 200 AD:

Here’s a map of their locations:

And here are conflicts in the Atlantic Ocean in the period 1939–1945:

And—combining several things—here’s a map of conflicts that, at the time when they occurred, were within the region of what was then Carthage:

There are all sorts of things that we can compute from historical geography. For example, this asks for the (minimum) geo distance between the territory of the Roman Empire and the Han Dynasty in 100 AD:

But what about the overall minimum distance across all years when these historical countries existed? This gives the result for that:

Let’s compare this with a plot of these two entities:

But there’s a subtlety here. What version of the Roman Empire is it that we’re showing on the map here? Our convention is by default to show historical countries “at their zenith”, i.e. at the moment when they had their maximum extent.

But what about other choices? Dated gives us a way to specify a particular date. But another possibility is to include in what we consider to be a particular historical country any territory that was ever part of that country, at any time in its history. And you can do this using GeoVariant[…, "UnionArea"]. In the particular case we’re showing here, it doesn’t make much difference, except that there’s more territory in Germany and Scotland included in the Roman Empire:

By the way, you can combine Dated and GeoVariant, to get things like “the zenith within a certain period” or “any territory that was included at any time within a period”. And, yes, it can get quite complicated. In a rather physics-like way you can think of the extent of a historical country as defining a region in spacetime—and indeed GeoVariant[…, "TimeSeries"] in effect represents a whole “stack of spacelike slices” in this spacetime region:

And—though it takes a little while—you can use it to make a video of the rise and fall of the Roman Empire:

Astronomical Graphics and Their Axes

It’s complicated to define where things are in the sky. There are four main coordinate systems that get used in doing this: horizon (relative to local horizon), equatorial (relative to the Earth’s equator), ecliptic (relative to the orbit of the Earth around the Sun) and galactic (relative to the plane of the galaxy). And when we draw a diagram of the sky (here on white for clarity) it’s typical to show the “axes” for all these coordinate systems:

But here’s a tricky thing: how should those axes be labeled? Each one is different: horizon is most naturally labeled by things like cardinal directions (N, E, S, W, etc.), equatorial by hours in the day (in sidereal time), ecliptic by months in the year, and galactic by angle from the center of the galaxy.

In ordinary plots axes are usually straight, and labeled uniformly (or perhaps, say, logarithmically). But in astronomy things are much more complicated: the axes are intrinsically circular, and then get rendered through whatever projection we’re using.

And we might have thought that such axes would require some kind of custom structure. But not in the Wolfram Language. Because in the Wolfram Language we try to make things general. And axes are no exception:

So in AstroGraphics all our various axes are just AxisObject constructs—that can be computed with. And so, for example, here’s a Mollweide projection of the sky:

If we insist on “seeing the whole sky”, the bottom half is just the Earth (and, yes, the Sun isn’t shown because I’m writing this after it’s set for the day…):

Things get a bit wild if we start adding grid lines, here for galactic coordinates:

And, yes, the galactic coordinate axis is indeed aligned with the plane of the Milky Way (i.e. our galaxy):

When Is Earthrise on Mars? New Level of Astronomical Computation

When will the Earth next rise above the horizon from where the Perseverance rover is on Mars? In Version 14.1 we can now compute this (and, yes, this is an “Earth time” converted from Mars time using the standard barycentric celestial reference system (BCRS) solar-system-wide spacetime coordinate system):

This is a fairly complicated computation that takes into account not only the motion and rotation of the bodies involved, but also various other physical effects. A more “down to Earth” example that one might readily check by looking out of one’s window is to compute the rise and set times of the Moon from a particular point on the Earth:

There’s a slight variation in the times between moonrises:

Over the course of a year we see systematic variations associated with the periods of different kinds of lunar months:

There are all sorts of subtleties here. For example, when exactly does one define something (like the Sun) to have “risen”? Is it when the top of the Sun first peeks out? When the center appears? Or when the “whole Sun” is visible? In Version 14.1 you can ask about any of these:

Oh, and you could compute the same thing for the rise of Venus, but now to see the differences, you’ve got to go to millisecond granularity (and, by the way, granularities of milliseconds down to picoseconds are new in Version 14.1):

By the way, particularly for the Sun, the concept of ReferenceAltitude is useful in specifying the various kinds of sunrise and sunset: for example, “civil twilight” corresponds to a reference altitude of –6°.

Geometry Goes Color, and Polar

Last year we introduced the function ARPublish to provide a streamlined way to take 3D geometry and publish it for viewing in augmented reality. In Version 14.1 we’ve now extended this pipeline to deal with color:

(Yes, the color is a little different on the phone because the phone tries to make it look “more natural”.)

And now it’s easy to view this not just on a phone, but also, for example, on the Apple Vision Pro:

Graphics have always had color. But now in Version 14.1 symbolic geometric regions can have color too:

And constructive geometric operations on regions preserve color:

Two other new functions in Version 14.1 are PolarCurve and FilledPolarCurve:

And while at this level this may look simple, what’s going on underneath is actually seriously complicated, with all sorts of symbolic analysis needed in order to determine what the “inside” of the parametric curve should be.

Talking about geometry and color brings up another enhancement in Version 14.1: plot themes for diagrams in synthetic geometry. Back in Version 12.0 we introduced symbolic synthetic geometry—in effect finally providing a streamlined computable way to do the kind of geometry that Euclid did two millennia ago. In the past few versions we’ve been steadily expanding our synthetic geometry capabilities, and now in Version 14.1 one notable thing we’ve added is the ability to use plot themes—and explicit graphics options—to style geometric diagrams. Here’s the default version of a geometric diagram:

Now we can “theme” this for the web:

New Computation Flow in Notebooks: Introducing Cell-Linked %

In building up computations in notebooks, one very often finds oneself wanting to take a result one just got and then do something with it. And ever since Version 1.0 one’s been able to do this by referring to the result one just got as %. It’s very convenient. But there are some subtle and sometimes frustrating issues with it, the most important of which has to do with what happens when one reevaluates an input that contains %.

Let’s say you’ve done this:

But now you decide that actually you wanted Median[ % ^ 2 ] instead. So you edit that input and reevaluate it:

Oops! Even though what’s right above your input in the notebook is a list, the value of % is the latest result that was computed, which you can’t now see, but which was 3.

OK, so what can one do about this? We’ve thought about it for a long time (and by “long” I mean decades). And finally now in Version 14.1 we have a solution—that I think is very nice and very convenient. The core of it is a new notebook-oriented analog of %, that lets one refer not just to things like “the last result that was computed” but instead to things like “the result computed in a particular cell in the notebook”.

So let’s look at our sequence from above again. Let’s start typing another cell—say to “try to get it right”. In Version 14.1 as soon as we type % we see an autosuggest menu:

The menu is giving us a choice of (output) cells that we might want to refer to. Let’s pick the last one listed:

The object is a reference to the output from the cell that’s currently labeled In[1]—and using now gives us what we wanted.

But let’s say we go back and change the first (input) cell in the notebook—and reevaluate it:

The cell now gets labeled In[5]—and the (in In[4]) that refers to that cell will immediately change to :

And if we now evaluate this cell, it’ll pick up the value of the output associated with In[5], and give us a new answer:

So what’s really going on here? The key idea is that signifies a new type of notebook element that’s a kind of cell-linked analog of %. It represents the latest result from evaluating a particular cell, wherever the cell may be, and whatever the cell may be labeled. (The object always shows the current label of the cell it’s linked to.) In effect is “notebook front end oriented”, while ordinary % is kernel oriented. is linked to the contents of a particular cell in a notebook; % refers to the state of the Wolfram Language kernel at a certain time.

gets updated whenever the cell it’s referring to is reevaluated. So its value can change either through the cell being explicitly edited (as in the example above) or because reevaluation gives a different value, say because it involves generating a random number:

OK, so always refers to “a particular cell”. But what makes a cell a particular cell? It’s defined by a unique ID that’s assigned to every cell. When a new cell is created it’s given a universally unique ID, and it carries that same ID wherever it’s placed and whatever its contents may be (and even across different sessions). If the cell is copied, then the copy gets a new ID. And although you won’t explicitly see cell IDs, works by linking to a cell with a particular ID.

One can think of as providing a “more stable” way to refer to outputs in a notebook. And actually, that’s true not just within a single session, but also across sessions. Say one saves the notebook above and opens it in a new session. Here’s what you’ll see:

The is now grayed out. So what happens if we try to reevaluate it? Well, we get this:

If we press Reconstruct from output cell the system will take the contents of the first output cell that was saved in the notebook, and use this to get input for the cell we’re evaluating:

In almost all cases the contents of the output cell will be sufficient to allow the expression “behind it” to be reconstructed. But in some cases—like when the original output was too big, and so was elided—there won’t be enough in the output cell to do the reconstruction. And in such cases it’s time to take the Go to input cell branch, which in this case will just take us back to the first cell in the notebook, and let us reevaluate it to recompute the output expression it gives.

By the way, whenever you see a “positional %” you can hover over it to highlight the cell it’s referring to:

Having talked a bit about “cell-linked %” it’s worth pointing out that there are still cases when you’ll want to use “ordinary %”. A typical example is if you have an input line that you’re using a bit like a function (say for post-processing) and that you want to repeatedly reevaluate to see what it produces when applied to your latest output.

In a sense, ordinary % is the “most volatile” in what it refers to. Cell-linked % is “less volatile”. But sometimes you want no volatility at all in what you’re referring to; you basically just want to burn a particular expression into your notebook. And in fact the % autosuggest menu gives you a way to do just that.

Notice the that appears in whatever row of the menu you’re selecting:

Press this and you’ll insert (in iconized form) the whole expression that’s being referred to:

Now—for better or worse—whatever changes you make in the notebook won’t affect the expression, because it’s right there, in literal form, “inside” the icon. And yes, you can explicitly “uniconize” to get back the original expression:

Once you have a cell-linked % it always has a contextual menu with various actions:

One of those actions is to do what we just mentioned, and replace the positional by an iconized version of the expression it’s currently referring to. You can also highlight the output and input cells that the is “linked to”. (Incidentally, another way to replace a by the expression it’s referring to is simply to “evaluate in place” , which you can do by selecting it and pressing CMDReturn or ShiftControlEnter.)

Another item in the menu is Replace With Rolled-Up Inputs. What this does is—as it says—to “roll up” a sequence of “ references” and create a single expression from them:

What we’ve talked about so far one can think of as being “normal and customary” uses of . But there are all sorts of corner cases that can show up. For example, what happens if you have a that refers to a cell you delete? Well, within a single (kernel) session that’s OK, because the expression “behind” the cell is still available in the kernel (unless you reset your $HistoryLength etc.). Still, the will show up with a “red broken link” to indicate that “there could be trouble”:

And indeed if you go to a different (kernel) session there will be trouble—because the information you need to get the expression to which the refers is simply no longer available, so it has no choice but to show up in a kind of everything-has-fallen-apart “surrender state” as:

is primarily useful when it refers to cells in the notebook you’re currently using (and indeed the autosuggest menu will contain only cells from your current notebook). But what if it ends up referring to a cell in a different notebook, say because you copied the cell from one notebook to another? It’s a precarious situation. But if all relevant notebooks are open, can still work, though it’s displayed in purple with an action-at-a-distance “wi-fi icon” to indicate its precariousness:

And if, for example, you start a new session, and the notebook containing the “source” of the isn’t open, then you’ll get the “surrender state”. (If you open the necessary notebook it’ll “unsurrender” again.)

Yes, there are lots of tricky cases to cover (in fact, many more than we’ve explicitly discussed here). And indeed seeing all these cases makes us not feel bad about how long it’s taken for us to conceptualize and implement .

The most common way to access is to use the % autosuggest menu. But if you know you want a , you can always get it by “pure typing”, using for example ESC%ESC. (And, yes, ESC%%ESC or ESC%5ESC etc. also work, so long as the necessary cells are present in your notebook.)

The UX Journey Continues: New Typing Affordances, and More

We invented Wolfram Notebooks more than 36 years ago, and we’ve been improving and polishing them ever since. And in Version 14.1 we’re implementing several new ideas, particularly around making it even easier to type Wolfram Language code.

It’s worth saying at the outset that good UX ideas quickly become essentially invisible. They just give you hints about how to interpret something or what to do with it. And if they’re doing their job well, you’ll barely notice them, and everything will just seem “obvious”.

So what’s new in UX for Version 14.1? First, there’s a story around brackets. We first introduced syntax coloring for unmatched brackets back in the late 1990s, and gradually polished it over the following two decades. Then in 2021 we started “automatching” brackets (and other delimiters), so that as soon as you type “f[” you immediately get f[ ].

But how do you keep on typing? You could use an to “move through” the ]. But we’ve set it up so you can just “type through” ] by typing ]. In one of those typical pieces of UX subtlety, however, “type through” doesn’t always make sense. For example, let’s say you typed f[x]. Now you click right after [ and you type g[, so you’ve got f[g[x]. You might think there should be an autotyped ] to go along with the [ after g. But where should it go? Maybe you want to get f[g[x]], or maybe you’re really trying to type f[g[],x]. We definitely don’t want to autotype ] in the wrong place. So the best we can do is not autotype anything at all, and just let you type the ] yourself, where you want it. But remember that with f[x] on its own, the ] is autotyped, and so if you type ] yourself in this case, it’ll just type through the autotyped ] and you won’t explicitly see it.

So how can you tell whether a ] you type will explicitly show up, or will just be “absorbed” as type-through? In Version 14.1 there’s now different syntax coloring for these cases: yellow if it’ll be “absorbed”, and pink if it’ll explicitly show up.

This is an example of non-type-through, so Range is colored yellow and the ] you type is “absorbed”:

And this is an example of non-type-through, so Round is colored pink and the ] you type is explicitly inserted:

This may all sound very fiddly and detailed—and for us in developing it, it is. But the point is that you don’t explicitly have to think about it. You quickly learn to just “take the hint” from the syntax coloring about when your closing delimiters will be “absorbed” and when they won’t. And the result is that you’ll have an even smoother and faster typing experience, with even less chance of unmatched (or incorrectly matched) delimiters.

The new syntax coloring we just discussed helps in typing code. In Version 14.1 there’s also something new that helps in reading code. It’s an enhanced version of something that’s actually common in IDEs: when you click (or select) a variable, every instance of that variable immediately gets highlighted:

What’s subtle in our case is that we take account of the scoping of localized variables—putting a more colorful highlight on instances of a variable that are in scope:

One place this tends to be particularly useful is in understanding nested pure functions that use #. By clicking a # you can see which other instances of # are in the same pure function, and which are in different ones (the highlight is bluer inside the same function, and grayer outside):

On the subject of finding variables, another change in Version 14.1 is that fuzzy name autocompletion now also works for contexts. So if you have a symbol whose full name is context1`subx`var2 you can type c1x and you’ll get a completion for the context; then accept this and you get a completion for the symbol.

There are also several other notable UX “tune-ups” in Version 14.1. For many years, there’s been an “information box” that comes up whenever you hover over a symbol. Now that’s been extended to entities—so (alongside their explicit form) you can immediately get to information about them and their properties:

Next there’s something that, yes, I personally have found frustrating in the past. Say you’ve a file, or an image, or something else somewhere on your computer’s desktop. Normally if you want it in a Wolfram Notebook you can just drag it there, and it will very beautifully appear. But what if the thing you’re dragging is very big, or has some other kind of issue? In the past, the drag just failed. Now what happens is that you get the explicit Import that the dragging would have done, so that you can run it yourself (getting progress information, etc.), or you can modify it, say adding relevant options.

Another small piece of polish that’s been added in Version 14.1 has to do with Preferences. There are a lot of things you can set in the notebook front end. And they’re explained, at least briefly, in the many Preferences panels. But in Version 14.1 there are now (i) buttons that give direct links to the relevant workflow documentation:

Syntax for Natural Language Input

Ever since shortly after Wolfram|Alpha was released in 2009, there’ve been ways to access its natural language understanding capabilities in the Wolfram Language. Foremost among these has been CTRL=—which lets you type free-form natural language and immediately get a Wolfram Language version, often in terms of entities, etc.:

Generally this is a very convenient and elegant capability. But sometimes one may want to just use plain text to specify natural language input, for example so that one doesn’t interrupt one’s textual typing of input.

In Version 14.1 there’s a new mechanism for this: syntax for directly entering free-form natural language input. The syntax is a kind of a “textified” version of CTRL=: =[…]. When you type =[...] as input nothing immediately happens. It’s only when you evaluate your input that the natural language gets interpreted—and then whatever it specifies is computed.

Here’s a very simple example, where each =[…] just turns into an entity:

But when the result of interpreting the natural language is an expression that can be further evaluated, what will come out is the result of that evaluation:

One feature of using =[…] instead of CTRL= is that =[…] is something anyone can immediately see how to type:

But what actually is =[…]? Well, it’s just input syntax for the new function FreeformEvaluate:

You can use FreeformEvaluate inside a program—here, rather whimsically, to see what interpretations are chosen by default for “a” followed by each letter of the alphabet:

By default, FreeformEvaluate interprets your input, then evaluates it. But you can also specify that you want to hold the result of the interpretation:

Diff[ ] … for Notebooks and More!

It’s been a very long-requested capability: give me a way to tell what changed, particularly in a notebook. It’s fairly easy to do “diffs” for plain text. But for notebooks—as structured symbolic documents—it’s a much more complicated story. But in Version 14.1 it’s here! We’ve got a function Diff for doing diffs in notebooks, and actually also in many other kinds of things.

Here’s an example, where we’re requesting a “side-by-side view” of the diff between two notebooks:

And here’s an “alignment chart view” of the diff:

Like everything else in the Wolfram Language, a “diff” is a symbolic expression. Here’s an example:

There are lots of different ways to display a diff object; many of them one can select interactively with the menu:

But the most important thing about diff objects is that they can be used programmatically. And in particular DiffApply applies the diffs from a diff object to an existing object, say a notebook.

What’s the point of this? Well, let’s imagine you’ve made a notebook, and given a copy of it to someone else. Then both you and the person to whom you’ve given the copy make changes. You can create a diff object of the diffs between the original version of the notebook, and the version with your changes. And if the changes the other person made don’t overlap with yours, you can just take your diffs and use DiffApply to apply your diffs to their version, thereby getting a “merged notebook” with both sets of changes made.

But what if your changes might conflict? Well, then you need to use the function Diff3. Diff3 takes your original notebook and two modified versions, and does a “three-way diff” to give you a diff object in which any conflicts are explicitly identified. (And, yes, three-way diffs are familiar from source control systems in which they provide the back end for making the merging of files as automated as possible.)

Notebooks are an important use case for Diff and related functions. But they’re not the only one. Diff can perfectly well be applied, for example, just to lists:

There are many ways to display this diff object; here’s a side-by-side view:

And here’s a “unified view” reminiscent of how one might display diffs for lines of text in a file:

And, speaking of files, Diff, etc. can immediately be applied to files:

Diff, etc. can also be applied to cells, where they can analyze changes in both content and styles or metadata. Here we’re creating two cells and then diffing them—showing the result in a side by side:

In “Combined” view the “pure insertions” are highlighted in green, the “pure deletions” in red, and the “edits” are shown as deletion/insertion stacks:

Many uses of diff technology revolve around content development—editing, software engineering, etc. But in the Wolfram Language Diff, etc. are set up also to be convenient for information visualization and for various kinds of algorithmic operations. For example, to see what letters differ between the Spanish and Polish alphabets, we can just use Diff:

Here’s the “pure visualization”:

And here’s an alternate “unified summary” form:

Another use case for Diff is bioinformatics. We retrieve two genome sequences—as strings—then use Diff:

We can take the resulting diff object and show it in a different form—here character alignment:

Under the hood, by the way, Diff is finding the differences using SequenceAlignment. But while Diff is giving a “high-level symbolic diff object”, SequenceAlignment is giving a direct low-level representation of the sequence alignment:

Information visualization isn’t restricted to two-way diffs; here’s an example with a three-way diff:

And here it is as a “unified summary”:

There are all sorts of options for diffs. One that is sometimes important is DiffGranularity. By default the granularity for diffs of strings is "Characters":

But it’s also possible to set it to be "Words":

Coming back to notebooks, the most “interactive” form of diff is a “report”:

In such a report, you can open cells to see the details of a specific change, and you can also click to jump to where the change occurred in the underlying notebooks.

When it comes to analyzing notebooks, there’s another new feature in Version 14.1: NotebookCellData. NotebookCellData gives you direct programmatic access to lots of properties of notebooks. By default it generates a dataset of some of them, here for the notebook in which I’m currently authoring this:

There are properties like the word count in each cell, the style of each cell, the memory footprint of each cell, and a thumbnail image of each cell.

Ever since Version 6 in 2007 we’ve had the CellChangeTimes option which records when cells in notebooks are created or modified. And now in Version 14.1 NotebookCellData provides direct programmatic access to this data. So, for example, here’s a date histogram of when the cells in the current notebook were last changed:

Lots of Little Language Tune-Ups

It’s part of a journey of almost four decades. Steadily discovering—and inventing—new “lumps of computational work” that make sense to implement as functions or features in the Wolfram Language. The Wolfram Language is of course very much strong enough that one can build essentially any functionality from the primitives that already exist in it. But part of the point of the language is to define the best “elements of computational thought”. And particularly as the language progresses, there’s a continual stream of new opportunities for convenient elements that get exposed. And in Version 14.1 we’ve implemented quite a diverse collection of them.

Let’s say you want to nestedly compose a function. Ever since Version 1.0 there’s been Nest for that:

But what if you want the abstract nested function, not yet applied to anything? Well, in Version 14.1 there’s now an operator form of Nest (and NestList) that represents an abstract nested function that can, for example, be composed with other functions, as in

or equivalently:

A decade ago we introduced functions like AllTrue and AnyTrue that effectively “in one gulp” do a whole collection of separate tests. If one wanted to test whether there are any primes in a list, one can always do:

But it’s better to “package” this “lump of computational work” into the single function AnyTrue:

In Version 14.1 we’re extending this idea by introducing AllMatch, AnyMatch and NoneMatch:

Another somewhat related new function is AllSameBy. SameQ tests whether a collection of expressions are immediately the same. AllSameBy tests whether expressions are the same by the criterion that the value of some function applied to them is the same:

Talking of tests, another new feature in Version 14.1 is a second argument to QuantityQ (and KnownUnitQ), which lets you test not only whether something is a quantity, but also whether it’s a specific type of physical quantity:

And now talking about “rounding things out”, Version 14.1 does that in a very literal way by enhancing the RoundingRadius option. For a start, you can now specify a different rounding radius for particular corners:

And, yes, that’s useful if you’re trying to fit button-like constructs together:

By the way, RoundingRadius now also works for rectangles inside Graphics:

Let’s say you have a string, like “hello”. There are many functions that operate directly on strings. But sometimes you really just want to use a function that operates on lists—and apply it to the characters in a string. Now in Version 14.1 you can do this using StringApply:

Another little convenience in Version 14.1 is the function BitFlip, which, yes, flips a bit in the binary representation of a number:

When it comes to Boolean functions, a detail that’s been improved in Version 14.1 is the conversion to NAND representation. By default, functions like BooleanConvert have allowed Nand[p] (which is equivalent to Not[p]). But in Version 14.1 there’s now "BinaryNAND" which yields for example Nand[p, p] instead of just Nand[p] (i.e. Not[p]). So here’s a representation of Or in terms of Nand:

Making the Wolfram Compiler Easier to Use

Let’s say you have a piece of Wolfram Language code that you know you’re going to run a zillion times—so you want it to run absolutely as fast as possible. Well, you’ll want to make sure you’re doing the best algorithmic things you can (and making the best possible use of Wolfram Language superfunctions, etc.). And perhaps you’ll find it helpful to use things like DataStructure constructs. But ultimately if you really want your code to run absolutely as fast as your computer can make it, you’ll probably want to set it up so that it can be compiled using the Wolfram Compiler, directly to LLVM code and then machine code.

We’ve been developing the Wolfram Compiler for many years, and it’s becoming steadily more capable (and efficient). And for example it’s become increasingly important in our own internal development efforts. In the past, when we wrote critical inner-loop internal code for the Wolfram Language, we did it in C. But in the past few years we’ve almost completely transitioned instead to writing pure Wolfram Language code that we then compile with the Wolfram Compiler. And the result of this has been a dramatically faster and more reliable development pipeline for writing inner-loop code.

Ultimately what the Wolfram Compiler needs to do is to take the code you write and align it with the low-level capabilities of your computer, figuring out what low-level data types can be used for what, etc. Some of this can be done automatically (using all sorts of fancy symbolic and theorem-proving-like techniques). But some needs to be based on collaboration between the programmer and the compiler. And in Version 14.1 we’re adding several important ways to enhance that collaboration.

The first thing is that it’s now easy to get access to information the compiler has. For example, here’s the type declaration the compiler has for the built-in function Dimensions:

And here’s the source code of the actual implementation the compiler is using for Dimensions, calling its intrinsic low-level internal functions like CopyTo:

A function like Map has a vastly more complex set of type declarations:

For types themselves, CompilerInformation lets you see their type hierarchy:

And for data structure types, you can do things like see the fields they contain, and the operations they support:

And, by the way, something new in Version 14.1 is the function OperationDeclaration which lets you declare operations to add to a data structure type you’ve defined.

Once you actually start running the compiler, a convenient new feature in Version 14.1 is a detailed progress monitor that lets you see what the compiler is doing at each step:

As we said, the key to compilation is figuring out how to align your code with the low-level capabilities of your computer. The Wolfram Language can do arbitrary symbolic operations. But many of those don’t align with low-level capabilities of your computer, and can’t meaningfully be compiled. Sometimes those failures to align are the result of sophistication that’s possible only with symbolic operations. But sometimes the failures can be avoided if you “unpack” things a bit. And sometimes the failures are just the result of programming mistakes. And now in Version 14.1 the Wolfram Compiler is starting to be able to annotate your code to show where the misalignments are happening, so you can go through and figure out what to do with them. (It’s something that’s uniquely possible because of the symbolic structure of the Wolfram Language and even more so of Wolfram Notebooks.)

Here’s a very simple example:

In compiled code, Sin expects a numerical argument, so a Boolean argument won’t work. Clicking the Source button lets you see where specifically something went wrong:

If you have several levels of definitions, the Source button will show you the whole chain:

Here’s a slightly more complicated piece of code, in which the specific place where there’s a problem is highlighted:

In a typical workflow you might start from pure Wolfram Language code, without Typed and other compilation information. Then you start adding such information, repeatedly trying the compilation, seeing what issues arise, and fixing them. And, by the way, because it’s completely efficient to call small pieces of compiled code within ordinary Wolfram Language code, it’s common to start by annotating and compiling the “innermost inner loops” in your code, and gradually “working outwards”.

But, OK, let’s say you’ve successfully compiled a piece of code. Most of the time it’ll handle certain cases, but not others (for example, it might work fine with machine-precision numbers, but not be capable of handling arbitrary precision). By default, compiled code that’s running is set up to generate a message and revert to ordinary Wolfram Language evaluation if it can’t handle something:

But in Version 14.1 there a new option CompilerRuntimeErrorAction that lets you specify an action to take (or, in general, a function to apply) whenever a runtime error occurs. A setting of None aborts the whole computation if there’s a runtime error:

Even Smoother Integration with External Languages

Let’s say there’s some functionality you want to use, but the only implementation you have is in a package in some external language, like Python. Well, it’s now basically seamless to work with such functionality directly in the Wolfram Language—plugging into the whole symbolic framework and functionality of the Wolfram Language.

As a simple example, here’s a function that uses the Python package faker to produce a random sentence (which of course would also be straightforward to do directly in Wolfram Language):

The first time you run RandomSentence, the progress monitor will show you all sorts of messy things happening under the hood, as Python versions get loaded, dependencies get set up, and so on. But the point is that it’s all automatic, and so you don’t have to worry about it. And in the end, out pops the answer:

And if you run the function again, all the setup will already have been done, and the answer will pop out immediately:

An important piece of automation here is the conversion of data types. One of the great things about the Wolfram Language is that it has fully integrated symbolic representations for a very wide range of things—from videos to molecules to IP addresses. And when there are standard representations for these things in a language like Python, we’ll automatically convert to and from them.

But particularly with more sophisticated packages, there’ll be a need to let the package deal with its own “external objects” that are basically opaque to the Wolfram Language, but can be handled as atomic symbolic constructs there.

For example, let’s say we’ve started a Python external package chess (and, yes, there’s a paclet in the Wolfram Paclet Repository that has considerably more chess functionality):

Now the state of a chessboard can be represented by an external object:

We can define a function to plot the board:

And now in Version 14.1 you can just pass your external object to the external function:

You can also directly extract attributes of the external object:

And you can call methods (here to make a chess move), changing the state of the external object:

Here’s a plot of a new board configuration:

This computes all legal moves from the current position, representing them as external objects:

Here are UCI string representations of these:

In what we’re doing here we’re immediately performing each external operation. But Version 14.1 introduces the construct ExternalOperation which lets you symbolically represent an external operation, and for example build up collections of such operations that can all be performed together in a single external evaluation. ExternalObject supports various built-in operations for each environment. So, for example, in Python we can use Call and GetAttribute to get the symbolic representation:

If we evaluate this, all these operations will get done together in the external environment:

Standalone Wolfram Language Applications!

Let’s say you’re writing an application in pretty much any programming language—and inside it you want to call Wolfram Language functionality. Well, you could always do that by using a web API served from the Wolfram Cloud. And you could also do it locally by running the Wolfram Engine. But in Version 14.1 there’s something new: a way of integrating a standalone Wolfram Language runtime right into your application. The Wolfram Language runtime is a dynamic library that you link into your program, and then call using a C-based API. How big is the runtime? Well, it depends on what you want to use in the Wolfram Language. Because we now have the technology to prune a runtime to include only capabilities needed for particular Wolfram Language code. And the result is that adding the Wolfram Language will often increase the disk requirements of your application only by a remarkably small amount—like just a few hundred megabytes or even less. And, by the way, you can distribute the Wolfram runtime as an integrated part of an application, with its users not needing their own licenses to run it.

OK, so how does creating a standalone Wolfram-enabled application actually work? There’s a lot of software engineering (associated with the Wolfram Language runtime, how it’s called, etc.) under the hood. But at the level of the application programmer you only have to deal with our Standalone Applications SDK—whose interface is rather simple.

As an example, here’s the C code part of a standalone application that uses the Wolfram Language to identify what (human) language a piece of text is in. The program here takes a string of text on its command line, then runs the Wolfram Language LanguageIdentify function on it, and then prints a string giving the result:

If we ignore issues of pruning, etc. we can compile this program just with (and, yes, the file paths are necessarily a bit long):

Now we can run the resulting executable directly from the command line—and it’ll act just like any other executable, even though inside it’s got all the power of a Wolfram Language runtime:

If we look at the C program above, it basically begins just by starting the Wolfram Language runtime (using WLR_SDK_START_RUNTIME()). But then it takes the string (argv[1]) from the command line, embeds it in a Wolfram Language expression LanguageIdentify[string], evaluates this expression, and extracts a raw string from the result.

The functions, etc. that are involved here are part of the new Expression API supported by the Wolfram Language runtime dynamic library. The Expression API provides very clean capabilities for building up and taking apart Wolfram Language expressions from C. There are functions like wlr_Symbol("string") that form symbols, as well as macros like wlr_List(elem₁, elem₂, …) and wlr_E(head, arg₁, arg₂, …) that build up lists and general expressions. Then there’s the function wlr_Eval(expr) that calls the Wolfram Language evaluator. With functions like wlr_StringData(expr, &result, …) you can then extract content from expressions (here the characters in a string) and put it into C data structures.

How does the Expression API relate to WSTP? WSTP (“Wolfram Symbolic Transfer Protocol”) is our protocol for transferring symbolic expressions between processes. The Expression API, on the other hand, operates within a single process, providing the “glue” that connects C code to expressions in the Wolfram Language runtime.

One example of a real-world use of our new Standalone Applications technology is the LSPServer application that will soon be in full distribution. LSPServer started from a pure (though somewhat lengthy) Wolfram Language paclet that provides Language Server Protocol services for annotating Wolfram Language code in programs like Visual Studio Code. To build the LSPServer standalone application we just wrote a tiny C program that calls the paclet, then compiled this and linked it against our Standalone Applications SDK. Along the way (using tools that we’re planning to soon make available)—and based on the fact that only a small part of the full functionality of the Wolfram Language is needed to support LSPServer—we pruned the Wolfram Language runtime, in the end getting a complete LSPServer application that’s only about 170 MB in size, and that shows no outside signs of having Wolfram Language functionality inside.

And Yet More…

Is that all? Well, no. There’s more. Like new formatting of Root objects (yes, I was frustrated with the old one). Or like a new drag-and-drop-to-answer option for QuestionObject quizzes. Or like all the documentation we’ve added for new types of entities and interpreters.

In addition, there’s also the continual stream of new data that we’ve curated, or that’s flowed in real time into the Wolfram Knowledgebase. And beyond the core Wolfram Language itself, there’ve also been lots of functions added to the Wolfram Function Repository, lots of paclets added to the Wolfram Language Paclet Repository, not to mention new entries in the Wolfram Neural Net Repository, Wolfram Data Repository, etc.

Yes, as always it’s been a lot of work. But today it’s here, and we’re proud of it: Version 14.1!

Download your 14.1 now! » (It’s already live in the Wolfram Cloud!)