Machine Learning for Middle Schoolers

(An Elementary Introduction to the Wolfram Language is available in print, as an ebook, and free on the web—as well as in Wolfram Programming Lab in the Wolfram Open Cloud. There’s also now a free online hands-on course based on it.)

An Elementary Introduction to the Wolfram Language

A year ago I published a book entitled An Elementary Introduction to the Wolfram Language—as part of my effort to teach computational thinking to the next generation. I just published the second edition of the book—with (among other things) a significantly extended section on modern machine learning.

I originally expected my book’s readers would be high schoolers and up. But it’s actually also found a significant audience among middle schoolers (11- to 14-year-olds). So the question now is: can one teach the core concepts of modern machine learning even to middle schoolers? Well, the interesting thing is that—thanks to the whole technology stack we’ve now got in the Wolfram Language—the answer seems to be “yes”!

Here’s what I did in the book:

After this main text, the book has Exercises, Q&A and Tech Notes.


The Backstory

What was my thinking behind this machine learning section? Well, first, it has to fit into the flow of the book—using only concepts that have already been introduced, and, when possible, reinforcing them. So it can talk about images, and real-world data, and graphs, and text—but not functional programming or external data resources.

Chapter list

With modern machine learning, it’s easy to show “wow” examples—like our website from 2015 (based on the Wolfram Language ImageIdentify function). But my goal in the book was also to communicate a bit of the background and intuition of how machine learning works, and where it can be used.

I start off by explaining that machine learning is different from traditional “programming”, because it’s based on learning from examples, rather than on explicitly specifying computational steps. The first thing I discuss is something that doesn’t really need all the fanciness of modern neural-net machine learning: it’s recognizing what languages text fragments are from:

LanguageIdentify[{"thank you", "merci", "dar las gracias", "感謝", "благодарить"}]

Kids (and other people) can sort of imagine (or discuss in a classroom) how something like this might work—looking words up in dictionaries, etc. And I think it’s useful to give a first example that doesn’t seem like “pure magic”. (In reality, LanguageIdentify uses a combination of traditional lookup, and modern machine learning techniques.)

But then I give a much more “magic” example—of ImageIdentify:


I don’t immediately try to explain how it works, but instead go on to something different: sentiment analysis. Kids have lots of fun trying out sentiment analysis. But the real point here is that it shows the idea of making a “classifier”: there are an infinite number of possible inputs, but only (in this case) 3 possible outputs:

Classify["Sentiment", "I'm so excited to be programming"]

Having seen this, we’re ready to give a little more indication of how something like this works. And what I do is to show the function Classify classifying handwritten digits into 0s and 1s. I’m not saying what’s going on inside, but people can get the idea that Classify is given a bunch of examples, and then it’s using those to classify a particular input as being 0 or 1:


OK, but how does it do this? In reality one’s dealing with ideas about attractors—and inputs that lie in the basins of attraction for particular outputs. But in a first approximation, one can say that inputs that are “nearer to”, say, the 0 examples are taken to be 0s, and inputs that are nearer to the 1 examples are taken to be 1s.

People don’t usually have much difficulty with that explanation—unless they start to think too hard about what “nearest” might really mean in this context. But rather than concentrating on that, what I do in the book is just to talk about the case of numbers, where it’s really easy to see what “nearest” means:

Nearest[{10, 20, 30, 40, 50, 60, 70, 80}, 22]

Nearest isn’t the most exciting function to play with: one potentially puts a lot of things in, and then just one “nearest thing” comes out. Still, Nearest is nice because its functionality is pretty easy to understand (and one can have reasonable guesses about algorithms it could use).
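For readers who want to see the bare bones of this "nearest example" idea outside the Wolfram Language, it takes only a few lines of Python. (This is just the intuition, not what Nearest or Classify actually does inside; the distance here is plain absolute difference.)

```python
def nearest(examples, x):
    """Return the example nearest to x, like Nearest[examples, x]."""
    return min(examples, key=lambda e: abs(e - x))

def classify(labeled_examples, x):
    """Return the label of the nearest labeled example: the whole
    'classify by nearness' idea in one line."""
    value, label = min(labeled_examples, key=lambda pair: abs(pair[0] - x))
    return label

print(nearest([10, 20, 30, 40, 50, 60, 70, 80], 22))  # -> 20
print(classify([(0.1, "0"), (0.2, "0"), (0.9, "1")], 0.85))  # -> 1
```

The second function is the whole conceptual content of the digit classifier above: inputs nearer the "0" examples come out as 0, inputs nearer the "1" examples come out as 1.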

Having seen Nearest for numbers, I show Nearest for colors. In the book, I’ve already talked about how colors are represented by red-green-blue triples of numbers, so this isn’t such a stretch—but seeing Nearest operate on colors begins to make it a little more plausible that it could operate on things like images too.
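Since colors are just red-green-blue triples, "nearest color" is nothing more than distance in 3D space. Here's a minimal Python sketch of that idea (using ordinary Euclidean distance; the actual Wolfram Language computation may use a perceptually tuned color distance, so treat this as an approximation):

```python
import math

# Colors as red-green-blue triples of numbers between 0 and 1.
palette = {
    "red":   (1.0, 0.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue":  (0.0, 0.0, 1.0),
}

def nearest_color(target):
    """Name of the palette color nearest to target in RGB space."""
    return min(palette, key=lambda name: math.dist(palette[name], target))

print(nearest_color((1.0, 0.4, 0.1)))  # an orange-ish color -> red
```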


Next I show the case of words. In the book, I’ve already done quite a bit with strings and words. In the main text I don’t talk about the precise definition of “nearness” for words, but again, kids easily get the basic idea. (In a Tech Note, I do talk about EditDistance, another good algorithmic operation that people can think about and try out.)

Nearest[WordList[], "good", 10]
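Edit distance is itself something kids can implement and reason about. Here's a Python sketch of the classic Levenshtein algorithm (counting single-character insertions, deletions, and substitutions), which is the basic idea behind EditDistance:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[len(b)]

print(edit_distance("good", "gold"))      # -> 1
print(edit_distance("kitten", "sitting")) # -> 3
```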

OK, so how does one get from here to something like ImageIdentify? The approach I used is to talk next about OCR and TextRecognize. This doesn’t seem as “magic” as ImageIdentify (and lots of people know about “OCR’ing documents”), but it’s a good place to get a further idea of what ImageIdentify is doing.

Turning a piece of text into an image, and then back into the same text again, doesn’t seem that impressive or useful. But it gets more interesting if one blurs the text out (and, yes, blurring an image is something I talked about earlier in the book):

Table[Blur[Rasterize[Style["hello", 20]], r], {r, 0, 4}]

Given the blurred image, the question is: can one still recognize the text? At this stage in the book I haven’t talked about /@ (Map) or % (last output) yet, so I have to write the code out a bit more verbosely. But the result is:

TextRecognize /@ %

And, yes, when the image isn’t too blurred, TextRecognize can recognize the text, but when the text gets too blurred, it stops being able to. I like this example, because it shows something impressive—but not “magic”—happening. And I think it’s useful to show both where machine learning-based functions succeed, and where they fail. By the way, the result here is different from the one in the book—because the text font is different, and those details matter when one’s on the edge of what can be recognized. (If one was doing this in a class, for example, one might try some different fonts and sizes, and discuss why some survive more blurring than others.)

TextRecognize shows how one can effectively do something like ImageIdentify, but with just 26 letterforms (well, actually, TextRecognize handles many more glyphs than that). But now in the book I show ImageIdentify again, blurring like we did with letters:

Table[Blur[(* cheetah image *), r], {r, 0, 22, 2}]

ImageIdentify /@ %

It’s fun to see what it does, but it’s also helpful. Because it gives a sense of the “attractor” around the “cheetah” concept: stay fairly close and the cheetah can still be recognized; go too far away and it can’t. (A slightly tricky issue is that we’re continually producing new, better neural nets for ImageIdentify—so even between when the book was finished and today there’ve been some new nets—and it so happens they give different results for the not-a-cheetah cases. Presumably the new results are “better”, though it’s not clear what that means, given that we don’t have an official right-answer “blurred cheetah” category, and who’s to say whether the blurriest image is more like a whortleberry or a person.)

I won’t go through my whole discussion of machine learning from the book here. Suffice it to say that after discussing explicitly trained functions like TextRecognize and ImageIdentify, I start discussing “unsupervised learning”, and things like clustering in feature space. I think our new FeatureSpacePlot is particularly helpful.

It’s fairly clear what it means to arrange colors:


But then one can “do the same thing” with images of letters. (In the book the code is a little longer, because I haven’t talked about /@ yet.)

FeatureSpacePlot[Rasterize /@ Alphabet[]]

And what’s nice about this is that—as well as being useful in its own right—it also reinforces the idea of how something like TextRecognize might work by finding the “nearest letter” to whatever input it’s given.

My final example in the section uses photographs. FeatureSpacePlot does a nice job of separating images of different kinds of things—again giving an idea of how ImageIdentify might work:


Obviously in just 10 pages in an elementary book I’m not able to give a complete exposition of modern machine learning. But I was pleased to see how many of the core concepts I was able to touch on.

Of course, the fact that this was possible at all depends critically on our whole Wolfram Language technology stack. Whether it’s the very fact that we have machine learning in the language, or the fact that we can seamlessly work with images or text or whatever, or the whole (28-year-old!) Wolfram Notebook system that lets us put all these pieces together—all these pieces are critical to making it possible to bring modern machine learning to people like middle schoolers.

And what I really like is that what one gets to do isn’t toy stuff: one can take what I’m discussing in the book, and immediately apply it in real-world situations. At some level the fact that this works is a reflection of the whole automation emphasis of the Wolfram Language: there’s very sophisticated stuff going on inside, but it’s automated at all levels, so one doesn’t need to be an expert and understand the details to be able to use it—or to get a good intuition about what can work and what can’t.

Going Further

OK, so how would one go further in teaching machine learning?

One early thing might be to start talking about probabilities. ImageIdentify has various possible choices of identifications, but what probabilities does it assign to them?
ImageIdentify[(* image *), All, 10, "Probability"]

This can lead to a useful discussion about prior probabilities, and about issues like trading off specificity for certainty.

But the big thing to talk about is training. (After all, “machine learning trainer” will surely be a big future career for some of today’s middle schoolers…) And the good news is that in the Wolfram Language environment, it’s possible to make training work with only a modest amount of data.

Let’s get some examples of images of characters from Guardians of the Galaxy by searching the web (we’re using an external search API, so you unfortunately can’t do exactly this on the Open Cloud):

data = AssociationMap[WebImageSearch[#, "Thumbnails"] &, {"Star-Lord", "Gamora", "Groot", "Rocket Raccoon"}]

Now we can use these images as training material to create a classifier:


And, sure enough, it can identify Rocket:


And, yes, it thinks a real raccoon is him too:


How does it do it? Well, let’s look at FeatureSpacePlot:


Some of this looks good—but some looks confusing. Because it’s arranging some of the images not according to who they’re of, but just according to their background colors. And here we begin to see some of the subtlety of machine learning. The actual classifier we built works only because in the training examples for each character there were ones with different backgrounds—so it can figure out that background isn’t the only distinguishing feature.

Actually, there’s another critical thing as well: Classify isn’t starting from scratch in classifying the images. Because it’s already been pre-trained to pick out “good features” that help distinguish real-world images. In fact, it’s actually using everything it learned from the creation of ImageIdentify—and the tens of millions of images it saw in connection with that—to know up front what features it should pay attention to.

It’s a bit weird to see, but internally Classify is characterizing each image as a list of numbers, each associated with a different “feature”:


One can do an extreme version of this in which one insists that each image is reduced to just two numbers—and that’s essentially how FeatureSpacePlot determines where to position an image:


Under the Hood

OK, but what’s going on under the hood? Well, it’s complicated. But in the Wolfram Language it’s easy to see—and getting a look at it helps in terms of getting an intuition about how neural nets really work. So, for example, here’s the low-level Wolfram Language symbolic representation of the neural net that powers ImageIdentify:

net = NetModel["Wolfram ImageIdentify Net for WL 11.1"]

And there’s actually even more: just click and keep drilling down:

net = NetModel["Wolfram ImageIdentify Net for WL 11.1"]

And yes, this is hard to understand—certainly for middle schoolers, and even for professionals. But if we take this whole neural net object, and apply it to a picture of a tiger, it’ll do what ImageIdentify does, and tell us it’s a tiger:


But here’s a neat thing, made possible by a whole stack of functionality in the Wolfram Language: we can actually go “inside” the neural net, to get a sense of what’s happening. As an example, let’s just take the first 3 “layers” of the network, apply them to the tiger, and visualize what comes out:

Image /@ Take[net, 3][(* tiger image *)]

Basically what’s happening is that the network has made lots of copies of the original image, and then processed each of them to pick out a different aspect of the image. (What’s going on actually seems to be remarkably similar to the first few levels of visual processing in the brain.)

What if we go deeper into the network? Here’s what happens at layer 10. The images are more abstracted, and presumably pick out higher-level features:

Image /@ Take[Take[net, 10][(* tiger image *)], 20]

Go to level 20, and the network is “thinking about” lots of little images:

ImageAssemble[Partition[Image /@ Take[net, 20][(* tiger image *)], 30]]

But by level 28, it’s beginning to “come to some conclusions”, with only a few of its possible channels of activity “lighting up”:

ImageAdjust[ImageAssemble[Partition[Image /@ Take[net, 28][(* tiger image *)], 50]]]

Finally, by level 31, all that’s left is an array of numbers, with a few peaks visible:

ListLinePlot[Take[net, 31][(* tiger image *)]]

And after applying the very last layer of the network (a “softmax” layer), only a couple of peaks are left:

ListLinePlot[net[(* tiger image *), None], PlotRange -> All]
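What a "softmax" layer does is simple enough to write out in a few lines of Python: it exponentiates its inputs and normalizes them, so the outputs are positive, sum to 1, and the peaks get exaggerated (this sketch shows the idea, not the network's internal implementation):

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1.
    Subtracting the max first is a standard numerical-stability trick."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 5.0])
print(probs)       # the highest score gets by far the biggest probability
print(sum(probs))  # -> 1.0 (up to rounding)
```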

And the highest one is exactly the one that corresponds to the concept of “tiger”:


I’m not imagining that middle schoolers will follow all these details (and no, nobody should be learning neural net layer types like they learn parts of the water cycle). But I think it’s really useful to see “inside” ImageIdentify, and get even a rough sense of how it works. To someone like me it still seems a little like magic that it all comes together as it does. But what’s great is that now with our latest Wolfram Language tools one can easily look inside, and start getting an intuition about what’s going on.

The Process of Training

The idea of the Wolfram Language Classify function is to do machine learning at the highest possible level—as automatically as possible, and building on as much pre-training as possible. But if one wants to get a more complete feeling for what machine learning is like, it’s useful to see what happens if one instead tries to just train a neural net from scratch.

There is an immediate practical issue though: to get a neural net, starting from scratch, to actually do anything useful, one typically has to give it a very large amount of training data—which is hard to collect and wrangle. But the good news here is that with the recent release of the Wolfram Data Repository we have a growing collection of ready-to-use training sets immediately available for use in the Wolfram Language.

Like here’s the classic MNIST handwritten digit training set, with its 60,000 training examples:


One thing one can do with a training set like this is just feed a random sample of it into Classify. And sure enough this gives one a classifier function that’s essentially a simple version of TextRecognize for handwritten digits:

c = Classify[RandomSample[ResourceData["MNIST"], 1000]]

And even with just 1000 training examples, it does pretty well:


And, yes, we can use FeatureSpacePlot to see how the different digits tend to separate in feature space:

FeatureSpacePlot[First /@ RandomSample[ResourceData["MNIST"], 250]]

But, OK, what if we want to actually train a neural net from scratch, with none of the fancy automation of Classify? Well, first we have to set up a raw neural net. And conveniently, the Wolfram Language has a bunch of classic neural nets built in. Here one’s called LeNet:

lenet = NetModel["LeNet"]

It’s much simpler than the ImageIdentify net, but it’s still pretty complicated. But we don’t have to understand what’s inside it to start training it. Instead, in the Wolfram Language, we can just use NetTrain (which, needless to say, automatically applies all the latest GPU tricks and so on):

net = NetTrain[lenet, RandomSample[ResourceData["MNIST"], 1000]]

It’s pretty neat to watch the training happening, and to see the orange line of the neural net’s error rate for fitting the examples keep going down. After about 20 seconds, NetTrain decides it’s gone far enough, and generates a final trained net—which works pretty well:


If you stop the training early, it won’t do quite so well:

net = NetTrain[lenet, RandomSample[ResourceData["MNIST"], 1000], MaxTrainingRounds -> 1]


In the professional world of machine learning, there’s a whole art and science of figuring out the best parameters for training. But with what we’ve got now in the Wolfram Language, nothing is stopping a middle schooler from doing their own experiments, visualizing and analyzing the results, and getting as good an intuition as anyone.

What Are Neural Nets Made Of?

OK, so if we want to really get down to the lowest level, we have to talk about what neural nets are made of. I’m not sure how much of this is middle-school stuff—but as soon as one knows about graphs of functions, one can already explain quite a bit. Because, you see, the “layers” in a neural net are actually just functions, that take numbers in, and put numbers out.

Take layer 2 of LeNet. It’s essentially just a simple Ramp function, which we can immediately plot (and, yes, it looks like a ramp):

Plot[Ramp[x], {x, -1, 1}]

Neural nets don’t typically just deal with individual numbers, though. They deal with arrays (or “tensors”) of numbers—represented in the Wolfram Language as nested lists. And each layer takes an array of numbers in, and puts an array of numbers out. Here’s a typical single layer:

layer = NetInitialize[LinearLayer[4, "Input" -> 2]]

This particular layer is set up to take 2 numbers as input, and put 4 numbers out:

layer[{2, 3}]

It might seem to be doing something quite “random”, and actually it is. Because the actual function the layer is implementing is determined by yet another array of numbers, or “weights”—which NetInitialize here just sets randomly. Here’s what it set them to in this particular case:

NetExtract[layer, "Weights"]
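The function a LinearLayer computes is easy to write out by hand: multiply the input by the weight matrix and add the biases. Here's a Python sketch (the randomly chosen weights stand in for what NetInitialize produces; the actual values are of course different):

```python
import random

random.seed(0)  # so the "random" weights are reproducible

# 2 inputs -> 4 outputs: a 4x2 weight matrix and 4 biases.
W = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
b = [random.uniform(-1, 1) for _ in range(4)]

def linear_layer(x):
    """Compute W.x + b, which is all a linear layer does."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

print(linear_layer([2, 3]))  # 4 random-looking numbers out
```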

Why is any of this useful? Well, the crucial point is that what NetTrain does is to progressively tweak the weights in each layer of a neural network to try to get the overall behavior of the net to match the training examples you gave.
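One can see the essence of this weight tweaking in a tiny Python example. Take the simplest possible "net", y = w*x + b, and repeatedly nudge w and b in the direction that reduces the error on the training examples. This is gradient descent, which is what NetTrain is doing in vastly more elaborate form:

```python
# Training examples sampled from y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(-3, 4)]

w, b = 0.0, 0.0  # start with "know nothing" weights
rate = 0.01      # how big each nudge is

for step in range(2000):
    for x, y in data:
        err = (w * x + b) - y
        # Nudge w and b downhill on the squared error err**2:
        w -= rate * 2 * err * x
        b -= rate * 2 * err

print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

After a couple of thousand passes through the examples, the weights have been tweaked to essentially reproduce the function the examples came from.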

There are two immediate issues, though. First, the structure of the network has to be such that it’s possible to get the behavior you want by using some appropriate set of weights. And second, there has to be some way to progressively tweak weights so as to get to appropriate values.

Well, it turns out a single LinearLayer like the one above can’t do anything interesting. Here’s a contour plot of (the first element of) its output, as a function of its two inputs. And as the name LinearLayer might suggest, we always get something flat and linear out:

ContourPlot[First[layer[{x, y}]], {x, -1, 1}, {y, -1, 1}]

But here’s the big discovery that makes neural nets useful: if we chain together several layers, it’s easy to get something much more complicated. (And, yes, in the Wolfram Language outputs from one layer get knitted into inputs to the next layer in a nice, automatic way.) Here’s an example with 4 layers—two linear layers and two ramps:

net = NetInitialize[NetChain[{LinearLayer[10], Ramp, LinearLayer[1], Ramp}, "Input" -> 2]]

And now when we plot the function, it’s more complicated:

ContourPlot[net[{x, y}], {x, -1, 1}, {y, -1, 1}]
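Why chaining helps is easy to see even in a toy Python version: composing linear layers with ramps in between produces functions with "kinks" that no single linear layer can have. Here's a hand-picked example (these little weight matrices are made up purely for illustration) whose chain ends up computing |x − y|:

```python
def ramp(v):
    """Elementwise Ramp: negative numbers become 0."""
    return [max(0.0, x) for x in v]

def linear(W, b):
    """Return a layer function v -> W.v + b."""
    return lambda v: [sum(w * x for w, x in zip(row, v)) + bi
                      for row, bi in zip(W, b)]

def net(v):
    # A chain of layers is just function composition.
    for layer in [linear([[1, -1], [-1, 1]], [0, 0]), ramp,
                  linear([[1, 1]], [0])]:
        v = layer(v)
    return v[0]

# The composed function is ramp(x-y) + ramp(y-x), i.e. |x - y|:
print(net([0.5, 0.0]), net([0.0, 0.5]), net([0.5, 0.5]))  # -> 0.5 0.5 0.0
```

No single linear layer could compute |x − y|; it's the ramp in the middle that makes the difference.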

We can actually also look at an even simpler case—of a neural net with 3 layers, and just one number as final output. (For technical reasons, it’s nice to still have 2 inputs, though we’ll always set one of those inputs to the constant value of 1.)

net = NetInitialize[NetChain[{LinearLayer[3], Ramp, LinearLayer[1]}, "Input" -> 2]]

Here’s what this particular network does as a function of its input:

Plot[net[{x, 1}], {x, -2, 2}]

Inside the network, there’s an array of 3 numbers being generated—and it turns out that “3” causes there to be at most 3 (+1) distinct linear parts in the function. Increase the 3 to 100, and things can get more complicated:

net = NetInitialize[NetChain[{LinearLayer[100], Ramp, LinearLayer[1]}, "Input" -> 2]]
Plot[net[{x, 1}], {x, -2, 2}]

Now, the point is that this is in a sense a “random function”, determined by the particular random weights picked by NetInitialize. If we run NetInitialize a bunch of times, we’ll get a bunch of different results:

Table[With[{net = NetInitialize[NetChain[{LinearLayer[100], Ramp, LinearLayer[1]}, "Input" -> 2]]}, Plot[net[{x, 1}], {x, -2, 2}]], 8]

But the big question is: can we find an instance of this “random function” that’s useful for whatever we’re trying to do? Or, more particularly, can we find a random function that reproduces particular training examples?

Let’s imagine that our training examples give the values of the function at the dots in this plot (by the way, the setup here is more like machine learning in the style of Predict than Classify):

ListLinePlot[Table[Mod[n^2, 5], {n, 15}], Mesh -> All]

Here’s an instance of our network again:

net = NetInitialize[NetChain[{LinearLayer[100], Ramp, LinearLayer[1]}, "Input" -> 2]]

And here’s a plot of what it initially does over the range of the training examples (and, yes, it’s obviously completely wrong):

Plot[net[{n, 1}], {n, 1, 15}]

Well, let’s just try training our network on our training data using NetTrain:

net = NetTrain[net, data = Table[{n, 1} -> {Mod[n^2, 5]}, {n, 15}]]

After about 20 seconds of training on my computer, there’s some vague sign that we’re beginning to reproduce at least some aspects of the original training data. But it’s at best slow going—and it’s not clear what’s eventually going to happen.

Plot[net[{n, 1}], {n, 1, 15}]

It’s a frontier question in neural net research just what structure of net will work best in any particular case (yes, we’re working on this question). But here let’s just try a slightly more complicated network:

net = NetInitialize[NetChain[{LinearLayer[100], Tanh, LinearLayer[10], Ramp, LinearLayer[1]}, "Input" -> 2]]

Random instances of this network don’t give very different results from our last network (though the presence of that Tanh layer makes the functions a bit smoother):


But now let’s do some training (data was defined above):

net = NetTrain[net, data]

And here’s the result—which is surprisingly decent:

Plot[net[{n, 1}], {n, 1, 15}]

In fact, if we compare it to our original training data we see that the training values lie right on the function that the neural net produced:

Show[Plot[net[{n, 1}], {n, 1, 15}], ListPlot[Table[Mod[n^2, 5], {n, 1, 15}], PlotStyle -> Red]]

Here’s what happened during the training process. The neural net effectively “tried out” a bunch of different possibilities, finally settling on the result here:

Machine learning animation

In what sense is the result “correct”? Well, it fits the training examples, and that’s really all we can ask. Because that’s all the input we gave. How it “interpolates” between the training examples is really its own business. We’d like it to learn to “generalize” from the data it’s given—but it can’t really deduce much about the whole distribution of the data from the few points it’s being given here, so the kind of smooth interpolation it’s doing is as good as anything.

Outside the range of the training values, the neural net does what seem to be fairly random things—but again, there’s no “right answer” so one can’t really fault it:

Plot[net[{n, 1}], {n, -5, 25}]

But the fact that, despite the arbitrariness and messiness of our original neural net, we were able to successfully train it at all is quite remarkable. Neural nets of pretty much the type we’re talking about here had actually been studied for more than 60 years—but until the modern “deep learning revolution” nobody knew that it was going to be practical to train them for real problems.

But now—particularly with everything we have now in the Wolfram Language—it’s easy for anyone to do this.

So Much to Explore

Modern machine learning is very new—so even many of the obvious experiments haven’t been tried yet. But with our whole Wolfram Language setup there’s a lot that even middle schoolers can do. For example (and I admit I’m curious about this as I write this post): one can ask just how much something like the tiny neural net we were studying can learn.

Here’s a plot of the lengths of the first 60 Roman numerals:

ListLinePlot[Table[StringLength[RomanNumeral[n]], {n, 60}]]

After a small amount of training, here’s what the network managed to reproduce:

NetTrain[net, Table[{n, 1} -> {StringLength[RomanNumeral[n]]}, {n, 60}]];
Plot[%[{n, 1}], {n, 1, 60}]

And one might think that maybe this is the best it’ll ever do. But I was curious if it could eventually do better—and so I just let it train for 2 minutes on my computer. And here’s the considerably better result that came out:

NetTrain[net, Table[{n, 1} -> {StringLength[RomanNumeral[n]]}, {n, 60}], MaxTrainingRounds -> Quantity[2, "Minutes"]];

Plot[%[{n, 1}], {n, 1, 60}]

I think I can see why this particular thing works the way it does. And seeing it suggests all sorts of new questions to pursue. But to me the most exciting point is the overarching one of just how wide open this territory is—and how easy it is now to explore it.

Yes, there are plenty of technical details—some fundamental, some superficial. But transcending all of these, there’s intuition to be developed. And that’s something that can perfectly well start with the middle schoolers…

