How I Learned to Stop Worrying and Love Optimization

Imagine a traveler stranded in the mountains at night. Somewhere far below, there are valleys with food, warmth, safety. It is pitch-black: no stars, no moon. In order to reach the safety of a valley, the traveler must navigate by feel alone. They slide their feet around, pat the ground to feel which way it slopes, then take a step in the direction of steepest descent, always following the gradient of the land. If the traveler is lucky and the landscape accommodating, gradient descent will bring them eventually out of the mountains.

I first heard this parable in a class on training neural networks to recognize objects in images. We would feed in an image, and the network would guess whether the image showed a horse or a bird or a dog or a frog. In our attempts to train the network, we always began with a sort of template, a structure with many empty slots, each slot meant to hold a single number. By filling the slots with different numbers, we could create different versions of our neural network. To make a prediction, the neural network would add and multiply the numbers in its slots together with the image we fed it. After the network made a prediction, we calculated how far it had been from the truth. This number, which we called the loss, told us how good that particular version of the network was. Even the tiny network we worked with had more than a hundred thousand slots. We never knew which way of filling the network’s slots would enable it to guess correctly, so we picked the numbers randomly to begin with. From there, we began to search, looking for a version of the network with a lower loss.

We sought the optimal network using an algorithm called gradient descent. Instead of mountains and valleys, we explored the optimization landscape of the neural network. In this abstract mathematical space, each point represented a different version of the neural network, while the network’s loss determined the elevation at that point. Assigned to implement and optimize neural networks for homework, I would flick back and forth between my code and an elevation profile showing the optimization as it unfolded in real time. If the optimization went well, this graph would start high on the left side of the screen and descend towards the right as the optimization progressed. Looking at the graph, watching the network explore the unknown landscape of itself, I began to feel as though I, too, were a traveler in the darkness.
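The heart of those homework scripts was a short loop: make predictions, measure the loss, take a step downhill, repeat. The sketch below is a stand-in for what I actually wrote, not the original code; the two-slot "network" and the data are invented for illustration, but the loop has the same shape.

```python
import numpy as np

# A toy "network" with just two slots (weights), fit to invented data.
# The homework networks had more than a hundred thousand slots; the loop is the same.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))                                # invented inputs
y = x @ np.array([3.0, -2.0]) + 0.5 * rng.normal(size=100)   # invented targets

slots = rng.normal(size=2)                # start from random numbers
step_size = 0.1

for step in range(100):
    predictions = x @ slots               # add and multiply the slots with the input
    errors = predictions - y
    loss = np.mean(errors ** 2)           # how far the guesses were from the truth
    gradient = 2 * x.T @ errors / len(y)  # which way the landscape slopes
    slots -= step_size * gradient         # a step in the direction of steepest descent
    if step % 10 == 0:
        print(step, loss)                 # the elevation profile, descending left to right
```

Run it and the printed losses trace the same descending curve I used to watch on my screen.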

Optimization is an essential part of the vocabulary of technological hype. In self-help books it appears as a foolproof method for success. In advertising, the promise that a product has been optimized assures the purchaser that its development and production have been scientific and systematic. In the everyday settings where it is most often encountered, optimization is about efficiency and perfection. But after a few weeks of the neural network class, it meant something different to me. The more time I spent actually optimizing things, the more the word made me think of the vastness of the unknown, the necessity of trying things before judging them, and the connectedness of everything in the world to everything else. Optimization, I found, made efficiency and perfection delightfully irrelevant.

My earliest memory of being optimized is from middle school, when my dad enrolled me and my sister in an after-school math program. There was no teaching or tutoring in this program. Instead, I was trained through brute repetition. Each week, I received a bundle of worksheets, one packet for each day. The packets were full of arithmetic problems, printed in a grid on both sides of each sheet. The hypothesis seemed to be that the sheer quantity of problems would gradually and irresistibly rewire my brain, turn me into a machine which could intuitively, effortlessly, produce the answer to any arithmetic problem. Problems in, answers out — and me in the middle, changing bit by bit, slowly becoming a more perfect calculator. Any conscious arithmetical understanding I might develop was expected to emerge spontaneously from the problems themselves. Once a week, I went into the after-school program’s office, an underground room in the back of a strip mall, where I completed another worksheet, a weekly assessment. My score on this assessment measured my ongoing transformation into a human calculator.

I studied math in college, then took a job as a software engineer at Google. I was grateful to have a job that would help me pay off my hundred thousand dollars of student debt. Still, I wanted more. I had believed the company’s recruiting slogans, had expected to make computers do things that felt like magic, things the world had never seen before. Instead, my work felt routine and inconsequential. One of tens of thousands of engineers, I felt like a cog in a vast machine. I felt interchangeable.

This was because the kind of software I had been hired to write — “traditional” computer programs made of if-statements and for-loops — had become distinctly unfashionable within the company. Instead, everyone was obsessed with neural networks, computer programs loosely inspired by the biological structure of the human brain. Google was already using neural networks to recognize cats in YouTube videos, recognize speech, and suggest email responses. The company’s star researchers were all working on the technology, and every week they would take the stage at all-hands meetings to share progress on their latest projects: beautifying photographs, proving mathematical theorems, improvising Bach harmonies. Across the street from where I worked, there was a room full of camera-equipped robot arms controlled by neural networks learning hand-eye coordination. Elsewhere in the company, another team was working on using neural networks to automate conventional software engineering.

Feeling envious and excited and a little threatened, I enrolled in a course to learn how the new type of computer program worked. It was here that I learned to march the neural network through the landscape of all possible versions of itself. The network’s loss determined the landscape’s elevation, meaning that the least-bad version of the network would be found at the lowest point in this landscape, the bottom of its deepest valley. At each point along this march, the algorithm would calculate how the landscape sloped around the network’s current position, then direct the network to take a step in the direction of steepest descent, always following the gradient of the land. At each step, the algorithm knew only where it was and where it had been. Extrapolating even a few steps ahead was infeasible. Surveying the landscape from some high point in order to plan the descent was out of the question. The problem was that the optimization landscape was unimaginably vast: not vast in extent, like Siberia or the Sahara, but in dimension. The two dimensions of east/west and north/south along the earth’s surface are nothing compared to the hundred thousand or more directions (one for each slot) in the optimization landscape of even a small neural network. Every additional direction multiplied the number of choices. Unable to look ahead, the algorithm traversed the landscape as though in pitch-darkness, navigating by feel alone.
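Written as a formula (my notation, not the course's), each step of the march is a single update: the slots $\theta$ move against the gradient of the loss $L$, scaled by a small step size $\eta$:

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t)$$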

The instructors cautioned that optimizing neural networks was uncertain and unpredictable: there was always a possibility of failure. If the algorithm marched the network out onto a plateau, the flatness of its surroundings would provide no signals about which way to turn, and it would come to a standstill; if into a region of steep cliffs, the network would undergo a kind of neural death as too-strong signals filled its slots with infinities. Or it might get stuck in a high valley, the landscape curving upwards all around it, trapping it in place: a dreaded local optimum. Our instructors offered folk wisdom for navigating these situations, strategies that seemed to work, even though scientists couldn’t yet explain why. They explained how to assess the health of an ongoing optimization, how to read its loss curve like an augury and diagnose the likely causes of failure. Above all, they counseled us to develop our own intuitions for the likely paths different algorithms would take or how alterations to the network’s mathematical structure might reshape its optimization landscape. I found the exploratory and improvisational spirit of their guidance invigorating.

Whenever I talked to my sister on the phone, she would inevitably ask me what I planned to do next: how long I planned to stay at my job, whether my boyfriend and I would stay together, when I would move back east. I resented these questions and felt somehow ashamed for not having the answers. For years, I had done everything I was supposed to: achieved high test scores and good grades, gotten into a good college, secured a stable, professional job. But beyond hitting these external targets, I had drifted, without a focus or goal. I hadn’t needed one. Now that I had some autonomy, I was realizing I didn’t know how to articulate what I wanted from life. I was pursuing a feeling rather than a credential.

Studying neural networks, I noticed that a network’s goal was defined obliquely in terms of a large dataset of examples, rather than a tidy formula. Indeed, I learned, neural networks were most useful for pursuing goals which could not be expressed neatly. Training a neural network for homework, I observed the way it moved towards its incorporeal, unconsolidated goal in tiny increments, never certain of its exact destination. This, I thought, was how I had been moving through life. In life, this strategy seemed like aimless drifting. But here, it worked. In just the few minutes since I started training the network, the loss had gone down. The network was learning, and I had numbers to prove it. Seeing its quantifiable success eased the sense of inadequacy I felt at not having a blueprint for my future.

The course wasn’t easy. I struggled on every assignment, filling sheets of paper with calculations before arriving at a solution I could translate into code. I stayed up late many nights, testing my equations and trying to optimize the network. Many times, the optimization failed because of some mistake I had made in one of the many equations. I would track down the error, fix it, and try again. Despite my best efforts, I received poor grades on the assignments. This course was my first serious foray into mathematics after graduating from college, and part of my rationale for enrolling was to prove that my poor grades in college had been the result of circumstances, rather than a verdict on my abilities. But when I tried my best and still didn’t do well, I was tempted to view my performance as confirmation of my incorrigible intellectual inferiority. My housemate was taking the same course, and he seemed to get top marks easily: proof (so my inner self whispered) that my abilities were second-rate.

Haunted by the suspicion of my own unalterable mediocrity, I found it strangely comforting to watch the loss curve downwards as I optimized a neural network. Here it was: real improvement, before my eyes, rigorous and quantifiable. Even I, skeptical by nature, could hardly disbelieve the evidence for improvability when I had implemented it myself and run it on my own computer. If my neural network could become better, then surely so could I.

After I completed the course, I mentioned to my boss that I wanted to take another one, a mathematics course entirely focused on optimization theory. He laughed: “Why would you want to do that?” He told me I’d surely be able to perform any optimization I needed with some off-the-shelf software library and said there was no need to understand how it actually worked. I didn’t know how to tell him that when I was learning about optimization, I felt like I was learning about life.

Optimization is the mobilization of measurement in the service of desire. It thrives on feedback loops, the tighter the better. In the last two decades, the rise of mobile personal computing has produced an unprecedented abundance of feedback. Today, our devices contain sensors, processors, and rich displays: just what is needed to measure our means and computationally coordinate them in pursuit of quantified ends. When I shift my bedtime so that my smart watch will rate my sleep more highly, I am optimizing myself. When I order groceries online or catch a ride through an app, I know the app is using my fees and my feedback to optimize the driver’s behavior. When I go to work I am optimized in turn by my employer using software for “continuous performance management.”

The idea of optimization was first articulated by a statistician named George Dantzig, who spent World War II as a planner for the US Army Air Forces. Dantzig’s group developed programs for assigning military personnel to tasks and deciding how supplies and weapons should be moved around. Their programs produced instructions for tens of thousands of people; Dantzig later described this effort as the equivalent of running the economy of a small nation.

During the war, Dantzig came to view the many varied planning problems he faced as essentially similar abstract problems of means and ends, and he sought a general method for reconciling the two. His great innovation, first articulated in a 1947 lecture, was to express the attainment of a desired end as a single number calculated in terms of the available means. Dantzig called this calculation an objective function, and he developed an algorithm for discovering the values which would maximize or minimize a particular objective function in order to reveal the best means to achieve a desired end. It was this idea, of expressing a goal as a number and then using that number to guide action towards the goal, which would eventually become known as optimization.
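In modern notation (mine, not Dantzig's), the kind of problem his method solved, a linear program, fits on a single line: choose nonnegative quantities $x$ of the available means so as to minimize an objective $c^\top x$ while respecting constraints on what the means can supply:

$$\min_{x \ge 0} \; c^\top x \quad \text{subject to} \quad A x \le b$$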

In the decades following Dantzig’s 1947 paper, optimization was embraced enthusiastically by the military, chemicals manufacturers, and airlines. It quietly transformed food, health, and travel: people knowing nothing of optimization nevertheless ate cattle and poultry raised on a mix of different feeds optimized to deliver calories as cheaply as possible. They filled up their cars with gasoline blended from different petroleum products optimized to produce an acceptable freezing point, vapor pressure, and octane rating, all again at the lowest possible cost. But despite its widespread adoption in industry, optimization remained esoteric. When, in the late 1950s, one of Dantzig’s students assembled an early textbook on the subject, publisher after publisher rejected it, doubtful it would find an audience. Dantzig’s collaborators understood the mathematics of optimization better than anyone else in the world. But even they were skeptical when he told them he was toying with using optimization to direct his own life, starting with an algorithmically-generated weight-loss diet. One collaborator, himself a famous mathematician, admonished Dantzig that optimization was meant to be used on other people, not on oneself.

The first artifacts of optimization to capture my attention were a series of images I saw at an art exhibition. The images had been created using a then-nine-month-old technique called DeepDream. Their creators had first trained a neural network to recognize objects in images the same way I would in class several years later. Then, they had fixed the numbers inside the network, and turned the whole thing backwards, treating the network’s input as a grid of empty slots, one for each pixel. They used optimization to fill those slots with numbers which would maximize the image’s resemblance (as assessed by the neural network) to some object.
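Stripped of the refinements that make the published pictures so rich, the mechanism is a few lines of gradient ascent on the pixels. The sketch below shows the bare idea in PyTorch, assuming a pretrained torchvision classifier; the class index and step count are arbitrary choices of mine, and real DeepDream code adds tricks such as working at multiple scales.

```python
import torch
from torchvision import models

# Minimal sketch of the "network turned backwards" idea, not the full DeepDream recipe.
model = models.googlenet(weights="DEFAULT").eval()   # a pretrained recognizer; its numbers stay fixed
for p in model.parameters():
    p.requires_grad_(False)

image = torch.randn(1, 3, 224, 224, requires_grad=True)  # a grid of empty slots, one per pixel
target_class = 130                                        # an arbitrary class index, chosen for illustration

optimizer = torch.optim.Adam([image], lr=0.05)
for step in range(200):
    optimizer.zero_grad()
    score = model(image)[0, target_class]  # how strongly the network "sees" the target in the image
    (-score).backward()                    # ascend: maximize the score by minimizing its negative
    optimizer.step()
```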

The DeepDream exhibition was full of strange and captivating images. One picture showed a landscape dotted with palaces, each set in the middle of a circular lawn mowed in swirls, each swirl forming a smaller circular lawn around a smaller palace. I could not tell whether the palaces which proliferated along the edges of the large swirls were tiny, or very far away. Near and far seemed strangely interleaved, as though the picture showed something much larger than an ordinary three-dimensional landscape.

In the years after DeepDream was first developed, scientists used the technique to build an atlas of the patterns and objects recognized by each neuron in one of the most widely-used visual recognition networks. Browsing the atlas, one image catches my eye: a pattern like chicken wire woven from live earthworms, sinuous and slippery. The technique produces not a single archetypal earthworm, but a never-ending sea. Optimization multiplies the subject of the image until everything else is squeezed out. Every pixel is pressed into service. And the patterns never repeat: every pinkish-brown curve is unique. This is not homogeneity, but totality. Not the same earthworm repeated ad infinitum, but all the earthworms which ever existed — all the earthworms which might ever exist — heaped together until they obscure everything else.

I love watching these images form. The process starts with white noise, like static on a television screen. Slowly, patterns or shapes start to emerge, like figures coming out of a thick fog. They are gray and indistinct at first, but become increasingly sharp and colorful as the optimization progresses. Optimization coaxes structure and meaning and order out of the chaos.

Some of the DeepDream images in the original exhibition almost resembled Cubist paintings by Picasso or Braque. Another reminded me of van Gogh’s Starry Night. Looking at these pictures, I wondered whether the optimization process which produced them had anything in common with the way humans made art. Perhaps the neuron being maximized by the DeepDream algorithm might be like a feeling the human artist sought to express, and optimization the artistic process of iteratively refining an expression until it felt true.

When I was not attending exhibitions of optimized pictures, I lurked on arXiv, a website where neural network researchers share their work. I filled my browser tabs with scientific papers about neural network architectures and objective functions and optimization landscapes, papers I could only half-understand, but devoured anyway. Many of these papers came with accompanying video demonstrations. I’ve watched one of these many times. It shows a series of cartoon figures running through various simulated obstacle courses. The research paper explains that the figures were set on their obstacle courses without any knowledge of how to operate their limbs and scored on how far they could get through the obstacle course without falling down. The neural networks which controlled the figures’ movements learned to run by optimizing this score. At every step, when the neural network was scored on how far its body had traveled, numbers flowed backwards through the network. The impact of every trip-up, every overbalancing, every overlooked obstacle was propagated from the macro-scale to the micro. Every neuron informed its neighbors how they should respond differently next time to make the body move faster and farther.

In the video, one cartoon figure leaps over gaps in the terrain like a hurdler: push-off perfectly timed, leading foot outstretched to make contact with the ground on the other side. Faced with a floating platform stretching across the course at chest-height, it bends backwards smoothly, doing the limbo without slowing down. Another figure runs up a staircase with long, even strides. At the top, hidden from view until a few moments before, there is a large pit. The cartoon figure makes a heroic leap and almost overbalances when its front foot touches the ground on the other side, but by swinging its arms up is able to recover its balance without breaking stride. Its movements look graceful and easy. Sometimes when I’m watching the video, I scrub backwards after watching the humanoid figure leap over the pit, just for the pleasure of seeing it again.

There are many ways of turning reading and writing into optimization problems. The most common approach is to predict the next word in a text, given all the words that have come before. Other approaches mask out words randomly and try to guess the missing words, or scramble words in a sentence and attempt to unscramble them. All of these are ways of turning reading into a game, an activity whose success or failure can be quantified. Texts are scrambled, erased, or corrupted, and it is always the job of the neural network to recover the human original.
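The next-word version of the game reduces each guess to one number. Here is a minimal sketch of that reduction; the six-word "corpus", the tiny vocabulary, and the small recurrent network are all invented for illustration, not anyone's published setup.

```python
import torch
import torch.nn as nn

# Invented vocabulary and text; the network's size and architecture are arbitrary.
vocab = {"the": 0, "traveler": 1, "descends": 2, "into": 3, "valley": 4}
text = ["the", "traveler", "descends", "into", "the", "valley"]
ids = torch.tensor([vocab[w] for w in text])

embed = nn.Embedding(len(vocab), 16)
lstm = nn.LSTM(16, 32, batch_first=True)
readout = nn.Linear(32, len(vocab))

inputs, targets = ids[:-1], ids[1:]           # guess each word from the ones that came before
hidden, _ = lstm(embed(inputs).unsqueeze(0))  # unsqueeze adds a batch dimension
logits = readout(hidden).squeeze(0)
loss = nn.functional.cross_entropy(logits, targets)  # how wrong the guesses were, as one number
loss.backward()                                       # numbers flow backwards through the network
```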

In the early 1950s, George Dantzig’s doctor advised him that he was overweight and should go on a diet. Naturally, Dantzig’s first instinct was to approach his doctor-ordered diet as an optimization problem. He drew up a table of ingredients, punched it onto cards, and fed them into RAND’s IBM 701 mainframe (one of only nineteen then in existence). On the first day, the computer dictated an optimal diet of 500 gallons of vinegar per day; on the second, 200 bouillon cubes; on the third, two pounds of bran; and on the fourth, two pounds of blackstrap molasses. On the fifth day, Dantzig abandoned his optimized diet.

Like Dantzig, I have dabbled in self-optimization. I wanted to become a better writer and thought I might cleverly circumvent the labor of self-improvement by fabricating an improved version of my writerly self. Over several weeks one summer, I fine-tuned a neural network on a decade of my own writing: art criticism, diaries, a draft of this essay. With one line of code, I downloaded a pretrained model. Another line of code initialized the optimization. I let it run, then inspected the results.
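Those two lines were roughly the ones below. This is a reconstruction, not a transcript of what I ran: the model name, the file name, and the training settings are stand-ins, and a real run would batch over the whole corpus rather than looping on one chunk of it.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Stand-in sketch of the fine-tuning run; the model name, file, and settings are illustrative.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")   # the one line that downloads a pretrained model

text = open("my_writing.txt").read()                   # hypothetical file: a decade of diaries, criticism, drafts
ids = tokenizer(text, return_tensors="pt").input_ids[:, :512]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # the line that initializes the optimization
model.train()
for step in range(100):
    loss = model(ids, labels=ids).loss   # next-word prediction loss on my own sentences
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```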

The network picked up my favorite subjects: art, memories of cooking with friends, confessions about friction with coworkers, long ambivalent interrogations of relationships. It also picked up my stylistic tics: the obsessive introspection of my diaries and art criticism. It didn’t satisfy me, though. The text I generated came without effort, but was uninspiring. It showed me my past self when what I wanted was my future.

The comic failure of Dantzig’s own diet optimization resulted inevitably from the naivety of his optimization objective. His goal was sensible enough: he sought to maximize the feeling of fullness after each meal so he could eat as little as possible and still feel satisfied. In modern terms, he wanted a low-calorie-density diet. But the way Dantzig made this vague notion measurable was absurd. He calculated for each food its percentage of solid (non-water) content, and scored each diet according to the weighted solid content of all its foods — as though water were unnecessary and all edible solids equivalent in their nutritional content. In optimizing, he sought the diet with the highest score. The resulting diets were, as they say in computer science, a case of “garbage in, garbage out.” But although Dantzig quickly abandoned his optimized diet, he never abandoned optimization. Indeed, for the next fifty years, right until his death in 2005, optimization was the primary focus of Dantzig’s life.
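The shape of the mistake is easy to reproduce. The sketch below uses SciPy's linear-programming routine with numbers I have invented, not Dantzig's table: score each food by its solid fraction, cap the total cost, and ask for the highest-scoring diet. The answer is always a mountain of whichever single food scores best per dollar.

```python
from scipy.optimize import linprog

# Invented numbers in the spirit of Dantzig's table, not his actual data.
foods = ["vinegar", "bouillon", "bran", "molasses"]
solid_fraction = [0.05, 0.95, 0.90, 0.75]   # the naive objective: fraction of non-water content
cost_per_unit = [0.10, 0.05, 0.20, 0.15]    # a constraint: keep the diet affordable

# linprog minimizes, so negate the objective to maximize total "solidity".
result = linprog(
    c=[-s for s in solid_fraction],
    A_ub=[cost_per_unit],         # total cost of the diet...
    b_ub=[5.0],                   # ...must stay under an invented budget
    bounds=[(0, None)] * len(foods),
)
print(dict(zip(foods, result.x)))  # the "optimal" diet: almost all of one absurdly high-scoring food
```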

Even before I saw the results of my optimized writing, I knew the optimization objective I was using was a caricature no less ridiculous than Dantzig’s. I wanted my writing to respond to the world with sensitivity and to creatively imagine new worlds, but I trained it only to imitate what already exists. After a few weeks, I too abandoned the experiment. Still, I have not been able to bring myself to abandon optimization itself.

Now, instead of actually applying optimization to my life, I think in terms of optimization. I fall back on metaphor. I treat fuzzy, underspecified desires as though they might be optimizable. I imagine applying algorithms to situations which are unmeasured or unmeasurable. I embrace optimization not as a means of action, but as a way of thinking, a way of seeing.

As I write this, I have found myself often thinking about the activity of writing as a kind of optimization. I imagine optimizing without constraining myself to readily available datasets or known techniques. Freed of the constraints of implementation feasibility, I imagine a different optimization process, one which feels more truthful to my own experience. When I write, I am trying to find a match between an arrangement of words on a page and something inside me, an inchoate feeling or an idea. If I am troubled or uncertain, I write to bring about a change within myself, a sense of resolution. Working incrementally, rewriting a sentence here, a paragraph there, I try to move my whole manuscript towards something I can’t yet articulate, but will know when I see it. I think of myself as optimizing the feeling of rightness in the writing. I recall DeepDream images emerging out of multicolored static and hope that what I am doing will create order and meaning out of the jumble of words on the page in front of me. I cannot implement this optimization. But in my head, I am living it. I have become the optimizer, and I have become the network through which optimization flows.

Notes

I wrote this between April 2020 and May 2021 and returned to it on and off through September 2022, though it was only posted in the spring of 2023.

Thanks to Ivy Lau, Abigail Kelly, and Joseph Bolling, who listened to drafts of this essay on two 15-hour drives between San Francisco and Washington State. Thanks to Ione Barrows and Mike Webb for their feedback, and to Julia Aizuss at The Point for holding me to a high standard of emotional accuracy.

The header GIF is an excerpt from the video accompanying the paper “Emergence of Locomotion Behaviours in Rich Environments.”