“What is this a picture of?” Humans can usually answer such questions instantly, but in the past it’s always seemed out of reach for computers to do this. For nearly 40 years I’ve been sure computers would eventually get there—but I’ve wondered when.
I’ve built systems that give computers all sorts of intelligence, much of it far beyond the human level. And for a long time we’ve been integrating all that intelligence into the Wolfram Language.
Now I’m excited to be able to say that we’ve reached a milestone: there’s finally a function called ImageIdentify built into the Wolfram Language that lets you ask, “What is this a picture of?”—and get an answer.
And today we’re launching the Wolfram Language Image Identification Project on the web to let anyone easily take any picture (drag it from a web page, snap it on your phone, or load it from a file) and see what ImageIdentify thinks it is:
It won’t always get it right, but most of the time I think it does remarkably well. And to me what’s particularly fascinating is that when it does get something wrong, the mistakes it makes mostly seem remarkably human.
It’s a nice practical example of artificial intelligence. But to me what’s more important is that we’ve reached the point where we can integrate this kind of “AI operation” right into the Wolfram Language—to use as a new, powerful building block for knowledge-based programming.
Now in the Wolfram Language
In a Wolfram Language session, all you need do to identify an image is feed it to the ImageIdentify function:
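A minimal sketch of what that looks like (the image file here is hypothetical; any photograph will do):

    img = Import["ExampleImages/giraffe.jpg"]; (* hypothetical path; any image works *)
    ImageIdentify[img]
    (* returns a symbolic entity, e.g. one representing "giraffe" *)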
What you get back is a symbolic entity that the Wolfram Language can then do more computation with—like, in this case, figuring out whether you’ve got an animal, a mammal, etc. Or just ask for a definition:
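For example (a sketch; the "Definition" property is illustrative, since the properties available depend on the kind of entity returned):

    result = ImageIdentify[img];
    result["Definition"] (* look up a definition through the entity framework *)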
Or, say, generate a word cloud from its Wikipedia entry:
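One way to sketch that, using the entity’s common name to fetch its Wikipedia article:

    WordCloud[DeleteStopwords[WikipediaData[CommonName[result]]]]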
And if one had lots of photographs, one could immediately write a Wolfram Language program that, for example, gave statistics on the different kinds of animals, or planes, or devices, or whatever, that appear in the photographs.
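A minimal sketch of such a program, assuming photos is a list of images:

    (* tally what ImageIdentify sees in each photo, most common first *)
    SortBy[Tally[ImageIdentify /@ photos], -Last[#] &]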
With ImageIdentify built right into the Wolfram Language, it’s easy to create APIs, or apps, that use it. And with the Wolfram Cloud, it’s also easy to create websites—like the Wolfram Language Image Identification Project.
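For instance, here’s a sketch of a one-line public image identification API, using the standard APIFunction and CloudDeploy constructs:

    CloudDeploy[
      APIFunction[{"photo" -> "Image"}, ImageIdentify[#photo] &],
      Permissions -> "Public"]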
Personal Backstory
For me personally, I’ve been waiting a long time for ImageIdentify. Nearly 40 years ago I read books with titles like The Computer and the Brain that made it sound inevitable we’d someday achieve artificial intelligence—probably by emulating the electrical connections in a brain. And in 1980, buoyed by the success of my first computer language, I decided I should think about what it would take to achieve full-scale artificial intelligence.
Part of what encouraged me was that—in an early premonition of the Wolfram Language—I’d based my first computer language on powerful symbolic pattern matching that I imagined could somehow capture certain aspects of human thinking. But I knew that while tasks like image identification were also based on pattern matching, they needed something different—a more approximate form of matching.
I tried to invent things like approximate hashing schemes. But I kept on thinking that brains manage to do this, and that we should be able to get clues from them. And this led me to start studying idealized neural networks and their behavior.
Meanwhile, I was also working on some fundamental questions in natural science—about cosmology and about how structures arise in our universe—and studying the behavior of self-gravitating collections of particles.
And at some point I realized that both neural networks and self-gravitating gases were examples of systems that had simple underlying components, but somehow achieved complex overall behavior. And in getting to the bottom of this, I wound up studying cellular automata and eventually making all the discoveries that became A New Kind of Science.
So what about neural networks? They weren’t my favorite type of system: they seemed a little too arbitrary and complicated in their structure compared to the other systems that I studied in the computational universe. But every so often I would think about them again, running simulations to understand more about the basic science of their behavior, or trying to see how they could be used for practical tasks like approximate pattern matching:
Neural networks in general have had a remarkable roller-coaster history. They first burst onto the scene in the 1940s. But by the 1960s, their popularity had waned, and the word was that it had been “mathematically proven” that they could never do anything very useful.
It turned out, though, that this was only true for one-layer “perceptron” networks. And in the early 1980s, there was a resurgence of interest, based on neural networks that also had a “hidden layer”. But despite knowing many of the leaders of this effort, I have to say I remained something of a skeptic, not least because I had the impression that neural networks were mostly being used for tasks that seemed easy to do in lots of other ways.
I also felt that neural networks were overly complex as formal systems—and at one point even tried to develop my own alternative. But still I supported people at my academic research center studying neural networks, and included papers about them in my Complex Systems journal.
I knew that there were practical applications of neural networks out there—like for visual character recognition—but they were few and far between. And as the years went by, little of general applicability seemed to emerge.
Machine Learning
Meanwhile, we’d been busy developing lots of powerful and very practical ways of analyzing data, in Mathematica and in what would become the Wolfram Language. And a few years ago we decided it was time to go further—and to try to integrate highly automated general machine learning. The idea was to make broad, general functions with lots of power; for example, to have a single function Classify that could be trained to classify any kind of thing: say, day vs. night photographs, sounds from different musical instruments, urgency level of email, or whatever.
We put in lots of state-of-the-art methods. But, more importantly, we tried to achieve complete automation, so that users didn’t have to know anything about machine learning: they just had to call Classify.
I wasn’t initially sure it was going to work. But it does, and spectacularly.
People can give training data on pretty much anything, and the Wolfram Language automatically sets up classifiers for them to use. We’re also providing more and more built-in classifiers, like for languages, or country flags:
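For example (dayPhotos, nightPhotos, and newPhoto here are hypothetical; "Language" is one of the built-in classifiers):

    (* train a classifier from labeled examples *)
    c = Classify[<|"day" -> dayPhotos, "night" -> nightPhotos|>];
    c[newPhoto]

    (* use a built-in classifier *)
    Classify["Language", "una manzana al día"] (* identifies the text as Spanish *)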
And a little while ago, we decided it was time to try a classic large-scale classifier problem: image identification. And the result now is ImageIdentify.
It’s All about Attractors
What is image identification really about? There are some number of named kinds of things in the world, and the point is to tell which of them a particular picture is of. Or, more formally, to map all possible images into a certain set of symbolic names of objects.
We don’t have any intrinsic way to describe an object like a chair. All we can do is just give lots of examples of chairs, and effectively say, “Anything that looks like one of these we want to identify as a chair.” So in effect we want images that are “close” to our examples of chairs to map to the name “chair”, and others not to.
Now, there are lots of systems that have this kind of “attractor” behavior. As a physical example, think of a mountainscape. A drop of rain may fall anywhere on the mountains, but (at least in an idealized model) it will flow down to one of a limited number of lowest points. Nearby drops will tend to flow to the same lowest point. Drops far away may be on the other side of a watershed, and so will flow to other lowest points.
The drops of rain are like our images; the lowest points are like the different kinds of objects. With raindrops we’re talking about things physically moving, under gravity. But images are composed of digital pixels. And instead of thinking about physical motion, we have to think about digital values being processed by programs.
And exactly the same “attractor” behavior can happen there. For example, there are lots of cellular automata in which one can change the colors of a few cells in their initial conditions, but still end up in the same fixed “attractor” final state. (Most cellular automata actually show more interesting behavior that doesn’t go to a fixed state, but it’s less clear how to apply this to recognition tasks.)
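Rule 232, the “majority” rule in which each cell takes on the majority color of itself and its two neighbors, is a simple example of this attractor behavior (a small sketch):

    init = ReplacePart[ConstantArray[0, 60], {20 -> 1, 21 -> 1, 22 -> 1, 40 -> 1}];
    ArrayPlot[CellularAutomaton[232, init, 15]]
    (* the block of three cells survives; the isolated cell dies out,
       so flipping it makes no difference to the final state *)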
So what happens if we take images and apply cellular automaton rules to them? In effect we’re doing image processing, and indeed some common image processing operations (both done on computers and in human visual processing) are just simple 2D cellular automata.
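As a small illustration, here’s a 2D “majority” cellular automaton step applied repeatedly to a binarized built-in test image (a sketch; on 0/1 data, taking the majority of each 3×3 neighborhood is the same as a median filter):

    img = Binarize[ExampleData[{"TestImage", "House"}]];
    step[m_] := MedianFilter[m, 1]; (* one CA step: 3x3 majority on binary data *)
    ArrayPlot[Nest[step, ImageData[img], 5]] (* a few steps smooth away small blobs *)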
It’s easy to get cellular automata to pick out certain features of an image—like blobs of dark pixels. But for real image identification, there’s more to do. In the mountain analogy, we have to “sculpt” the mountainscape so that the right raindrops flow to the right points.
Programs Automatically Made
So how do we do this? In the case of digital data like images, it isn’t known how to do this in one fell swoop; we only know how to do it iteratively, and incrementally. We have to start from a base “flat” system, and gradually do the “sculpting”.
There’s a lot that isn’t known about this kind of iterative sculpting. I’ve thought about it quite extensively for discrete programs like cellular automata (and Turing machines), and I’m sure something very interesting can be done. But I’ve never figured out just how.
For systems with continuous (real-number) parameters, however, there’s a great method called backpropagation, which is based on calculus. It’s essentially a version of the very common method of gradient descent, in which one computes derivatives, then uses them to work out how to change parameters so that the system better fits the behavior one wants.
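In its simplest form, the idea is just to repeatedly move each parameter a little in the direction that decreases the error (a one-parameter sketch):

    f[w_] := (w - 3)^2; (* a toy error function to minimize *)
    NestList[# - 0.1 f'[#] &, 0., 10] (* each step moves w downhill, toward 3 *)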
So what kind of system should one use? A surprisingly general choice is neural networks. The name makes one think of brains and biology. But for our purposes, neural networks are just formal computational systems, consisting of compositions of multi-input functions with continuous parameters and discrete thresholds.
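In Wolfram Language terms, a single “neuron” and a network built from layers of them might be sketched like this (a schematic illustration, not the structure ImageIdentify actually uses; the parameters w1, b1, w2, b2 are hypothetical):

    neuron[w_, b_][x_] := LogisticSigmoid[w . x + b] (* weighted sum, then a smooth threshold *)
    layer[ws_, bs_][x_] := MapThread[neuron[#1, #2][x] &, {ws, bs}]
    net[x_] := layer[w2, b2][layer[w1, b1][x]] (* a two-layer composition *)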
How easy is it to make one of these neural networks perform interesting tasks? In the abstract, it’s hard to know. And for at least 20 years my impression was that in practice neural networks could mostly do only things that were also pretty easy to do in other ways.
But a few years ago that began to change. And one started hearing about serious successes in applying neural networks to practical problems, like image identification.
What made that happen? Computers (and especially the linear algebra done on GPUs) got fast enough that—with a variety of algorithmic tricks, some actually involving cellular automata—it became practical to train neural networks with millions of neurons on millions of examples. (By the way, these were “deep” neural networks, no longer restricted to having very few layers.) And somehow this suddenly brought large-scale practical applications within reach.
Why Now?
I don’t think it’s a coincidence that this happened right when the number of artificial neurons being used came within striking distance of the number of neurons in relevant parts of our brains.
It’s not that this number is significant on its own. Rather, it’s that if we’re trying to do tasks—like image identification—that human brains do, then it’s not surprising if we need a system with a similar scale.
Humans can readily recognize a few thousand kinds of things—roughly the number of picturable nouns in human languages. Lower animals likely distinguish vastly fewer kinds of things. But if we’re trying to achieve “human-like” image identification—and effectively map images to words that exist in human languages—then this defines a certain scale of problem, which, it appears, can be solved with a “human-scale” neural network.
There are certainly differences between computational and biological neural networks—although after a network is trained, the process of, say, getting a result from an image seems rather similar. But the methods used to train computational neural networks are significantly different from what it seems plausible for biology to use.
Still, in the actual development of ImageIdentify, I was quite shocked at how much was reminiscent of the biological case. For a start, the number of training images—a few tens of millions—seemed very comparable to the number of distinct views of objects that humans get in their first couple of years of life.
All It Saw Was the Hat
There were also quirks of training that seemed very close to what’s seen in the biological case. For example, at one point we’d made the mistake of including no human faces in our training data. And when we showed the system a picture of Indiana Jones, it was blind to the presence of his face, and just identified the picture as a hat. Not surprising, perhaps, but to me strikingly reminiscent of the classic vision experiment in which kittens reared in an environment of vertical stripes are blind to horizontal stripes.
Probably much like the brain, the ImageIdentify neural network has many layers, containing a variety of different kinds of neurons. (The overall structure, needless to say, is nicely described by a Wolfram Language symbolic expression.)
It’s hard to say meaningful things about much of what’s going on inside the network. But if one looks at the first layer or two, one can recognize some of the features that it’s picking out. And they seem to be remarkably similar to features we know are picked out by real neurons in the primary visual cortex.
I myself have long been interested in things like visual texture recognition (are there “texture primitives”, like there are primary colors?), and I suspect we’re now going to be able to figure out a lot about this. I also think it’s of great interest to look at what happens at later layers in the neural network—because if we can recognize them, what we should see are “emergent concepts” that in effect describe classes of images and objects in the world—including ones for which we don’t yet have words in human languages.
We Lost the Anteaters!
Like many projects we tackle for the Wolfram Language, developing ImageIdentify required bringing many diverse things together. Large-scale curation of training images. Development of a general ontology of picturable objects, with mapping to standard Wolfram Language constructs. Analysis of the dynamics of neural networks using physics-like methods. Detailed optimization of parallel code. Even some searching in the style of A New Kind of Science for programs in the computational universe. And lots of judgement calls about how to create functionality that would actually be useful in practice.
At the outset, it wasn’t clear to me that the whole ImageIdentify project was going to work. And early on, the rate of utterly misidentified images was disturbingly high. But one issue after another got addressed, and gradually it became clear that finally we were at a point in history when it would be possible to create a useful ImageIdentify function.
There were still plenty of problems. The system would do well on certain things, but fail on others. Then we’d adjust something, and there’d be new failures, and a flurry of messages with subject lines like “We lost the anteaters!” (about how pictures that ImageIdentify used to correctly identify as anteaters were suddenly being identified as something completely different).
Debugging ImageIdentify was an interesting process. What counts as reasonable input? What’s reasonable output? How should one choose between giving more-specific results and giving results one can be more certain aren’t incorrect (is it just a dog, or a hunting dog, or a beagle)?
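Current versions of ImageIdentify expose this tradeoff directly through its SpecificityGoal option (a sketch; dogPhoto is hypothetical):

    ImageIdentify[dogPhoto, SpecificityGoal -> "Low"]  (* e.g. just "dog" *)
    ImageIdentify[dogPhoto, SpecificityGoal -> "High"] (* e.g. "beagle", when it's sufficiently certain *)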
Sometimes we saw things that at first seemed completely crazy. A pig misidentified as a “harness”. A piece of stonework misidentified as a “moped”. But the good news was that we always found a cause—like confusion from the same irrelevant objects repeatedly being in training images for a particular type of object (e.g. “the only time ImageIdentify had ever seen that type of Asian stonework was in pictures that also had mopeds”).
To test the system, I often tried slightly unusual or unexpected images:
And what I found was something very striking, and charming. Yes, ImageIdentify could be completely wrong. But somehow the errors seemed very understandable, and in a sense very human. It seemed as if what ImageIdentify was doing was successfully capturing some of the essence of the human process of identifying images.
So what about things like abstract art? It’s a kind of Rorschach-like test for both humans and machines—and an interesting glimpse into the “mind” of ImageIdentify:
Out into the Wild
Something like ImageIdentify will never truly be finished. But a couple of months ago we released a preliminary version in the Wolfram Language. And today we’ve updated that version, and used it to launch the Wolfram Language Image Identification Project.
We’ll continue training and developing ImageIdentify, not least based on feedback and statistics from the site. As with Wolfram|Alpha in the domain of natural language understanding, without actual usage by humans there’s no realistic way to assess progress—or even to define just what the goals should be for “natural image understanding”.
I must say that I find it fun to play with the Wolfram Language Image Identification Project. It’s satisfying after all these years to see this kind of artificial intelligence actually working. But more than that, when you see ImageIdentify respond to a weird or challenging image, there’s often a certain “aha” feeling, as if one had just been shown, in a very human-like way, some new insight (or joke) about an image.
Underneath, of course, it’s just running code—with very simple inner loops that are pretty much the same as, for example, in my neural network programs from the beginning of the 1980s (except that now they’re Wolfram Language functions, rather than low-level C code).
It’s a fascinating—and extremely unusual—example in the history of ideas: neural networks were studied for 70 years, and repeatedly dismissed. Yet now they are what has brought us success in such a quintessential example of an artificial intelligence task as image identification. I expect the original pioneers of neural networks—like Warren McCulloch and Walter Pitts—would find little surprising about the core of what the Wolfram Language Image Identification Project does, though they might be amazed that it’s taken 70 years to get here.
But to me the greater significance is what can now be done by integrating things like ImageIdentify into the whole symbolic structure of the Wolfram Language. What ImageIdentify does is something humans learn to do in each generation. But symbolic language gives us the opportunity to represent shared intellectual achievements across all of human history. And making all these things computational is, I believe, something of monumental significance that I am only just beginning to understand.
But for today, I hope you will enjoy the Wolfram Language Image Identification Project. Think of it as a celebration of where artificial intelligence has reached. Think of it as an intellectual recreation that helps build intuition for what artificial intelligence is like. But don’t forget the part that I think is most exciting: it’s also practical technology, that you can use here and now in the Wolfram Language, and deploy wherever you want.