Specification and AI

I’ve been looking for a grounded way to reason about the limits and potential of the new era of AI technology. Is it mostly a fun toy, or will future advances put most people out of a job (or somewhere in between)?

I take inspiration from a fun computer science activity where I pretend to be a robot trying to cross a crowded classroom, and a group of kids takes turns instructing me to take a step forward, back, left, or right. Inevitably, one of their instructions won’t quite line up and a step will send me crashing into a desk (which is also part of the fun).

The takeaway is that computers do exactly what you tell them to do, not necessarily what you want them to do. In other words, the core problem is specification: how to translate the needs and goals in your head into instructions that a computer can follow.
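
To make the point concrete, here is a toy sketch of the classroom activity in Python. The grid, the desk positions, and the move names are all invented for illustration; the point is only that each instruction is executed literally, with no notion of the goal behind it.

```python
# The "robot" executes each instruction literally, with no idea what the
# instructions were for. Grid coordinates and desk positions are made up.
DESKS = {(2, 1), (3, 3)}  # cells occupied by desks
MOVES = {"forward": (0, 1), "back": (0, -1), "left": (-1, 0), "right": (1, 0)}

def run(instructions, start=(0, 0)):
    x, y = start
    for step in instructions:
        dx, dy = MOVES[step]
        x, y = x + dx, y + dy
        if (x, y) in DESKS:
            return f"crashed into a desk at {(x, y)}"
    return f"arrived safely at {(x, y)}"

# One instruction that doesn't quite line up is enough:
print(run(["forward", "right", "right"]))  # crashed into a desk at (2, 1)
```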

AI tools clearly raise the level at which you can communicate: it is now plausible to use higher-level concepts like “walk across the classroom while avoiding desks.” But no matter how smart, an AI still can’t read your mind. It might know what you’ve done in the past, and what other people have done. But it doesn’t know what you want to do today unless you can describe it.

In other words, the extent to which AI tools can automate a task depends on how complicated it is to specify what you want.

Since I’m a software developer, let’s imagine a future intelligent assistant that might take my job by being able to fulfill a request like “build a great weather app”. Will such a tool ever come to exist?

What makes a weather app great? There’s no definitive answer — rather it’s a question of what you happen to want, today. How much do you care about the rain vs. wind vs. clouds? How much do you care about today’s conditions vs. tomorrow and next week? How much detail do you want to see? How much time are you willing to wait for data to load? How much will it cost? You’ll have to tell the imagined AI assistant about all the things you care about and don’t care about in order for it to make an app that’s great for you. That might still require a lot of work from you, the human.
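
To get a feel for how much there still is to say, here is a hypothetical sketch of a small slice of that specification; every field and value below is invented, and a real conversation would have to cover far more.

```python
# A hypothetical, very incomplete slice of the specification you'd have to
# hand the assistant. Every key and value here is invented for illustration.
weather_app_spec = {
    "priorities": {"rain": "high", "wind": "medium", "clouds": "low"},
    "forecast_horizon": ["today", "tomorrow", "next 7 days"],
    "detail": "hourly for today, daily after that",
    "max_load_time_seconds": 2,
    "monthly_budget_usd": 0,
    # ...plus units, locations, notifications, layout, and on and on.
}
```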

Consider all the time people spend in meetings trying to get everyone on the same page about how, exactly, to best move forward. I don’t see how AI technology would remove the need for this. If you want to take everyone’s goals into account, you’ll still need to spend a lot of time talking it all through with the AI. If you skip that step and ask the AI to make decisions, you’ll only be getting a cultural average and/or a roll of the dice. That might be good enough in some cases, but it’s certainly not equivalent.

On the other hand, when requests are relatively simple and your goals are relatively universal, AI is likely to be transformative.

Either way, the limit of automation is the complexity of specifying what you want.

AI Fashion

As a way to experiment with recent generative AI tools, I challenged myself to design a piece of clothing for each color of the rainbow. The results are a sort of fashion line with a theme of bold, angular patterns.

I experimented with a variety of tools and approaches, but all of the images above were generated using free tools based on Stable Diffusion XL: either the macOS app Draw Things or the open source project Fooocus. I also used Pixelmator Pro in a few cases to fix small issues with faces, hands, and clothing via more traditional photo-editing techniques.
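
For the curious, the call that these apps wrap looks roughly like the sketch below. It uses the Hugging Face diffusers library rather than Draw Things or Fooocus, and the prompt is an invented example rather than one actually used for the gallery.

```python
# A minimal SDXL text-to-image sketch using Hugging Face diffusers
# (not one of the apps named above); the prompt is an invented example.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")  # or "mps" on Apple Silicon

image = pipe(
    prompt="fashion photo, red blazer with a bold angular geometric pattern",
    negative_prompt="blurry, distorted hands, extra fingers",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("candidate.png")
```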

Each image was selected from around 5 to 50 alternatives, each of which took between 1 and 6 minutes for the system to generate (depending on hardware and settings). So the gallery above represents at least 10 hours of total compute time.
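
As a rough sanity check on that figure, taking the midpoints of those ranges and assuming seven final images (one per color of the rainbow):

```python
# Back-of-envelope estimate only; the actual per-image counts varied.
images = 7                    # one per color of the rainbow
candidates_per_image = 25     # midpoint of "around 5 to 50"
minutes_per_candidate = 3.5   # midpoint of "1 and 6 minutes"
print(images * candidates_per_image * minutes_per_candidate / 60)  # ~10.2 hours
```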

In some cases, I needed to iterate repeatedly on the prompt text, adding or emphasizing terms to guide the system towards the balance of elements that I wanted. In other cases, I just needed to let the model produce more images (with the same prompt) before I found one that was close enough to my vision. In a few cases, I used a promising output image as the input for a second round of generation in order to more precisely specify the scene, outfit, and pose.
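
That second-round, image-to-image step looks roughly like this in diffusers (again a sketch rather than the exact workflow of Draw Things or Fooocus; the file names, prompt, and strength value are placeholders):

```python
# Rough sketch of image-to-image refinement: a promising first-round output
# seeds a second round. File names, prompt, and strength are placeholders.
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")

init_image = load_image("candidate.png")  # the promising first-round image
refined = pipe(
    prompt="same outfit, full-body studio pose, soft lighting",
    image=init_image,
    strength=0.45,  # lower strength preserves more of the original composition
).images[0]
refined.save("refined.png")
```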

It’s impressive to see how realistic the output of these tools is getting, though they certainly have limits. If you specify too many details, some will be ignored, especially if they do not commonly co-occur. I also started to get a feel for the limits and biases of the training data, as evidenced by how much weight I needed to give different words before the generated images would actually start to reflect their meaning.

It’s also clear that the model does not have a deep understanding of physics or anatomy. AI-generated images famously struggle with hands, sometimes using too many fingers or fusing them together. The model also often failed to depict mechanical objects with realistic structure — I more or less gave up trying to generate bicycles, barbells, and drum sticks.

Overall, the experience of generating the fashion gallery felt less like automation and more like a new form of photography. Rather than having to buy gear, hire a model, sew an outfit, and travel to a location, you can describe all those things in words and virtually take the photo. But you still need the artistic vision to come up with a concept, as well as the editorial discretion to discard the vast majority of images — which is also the case in traditional photography.

Lastly, it was interesting to notice that the process of adjusting prompts and inspecting results was not so different from trying to communicate with another person. You’re never sure exactly how your words will be interpreted, and you sometimes need to iterate for a while to come to a shared understanding of “less like that” and “more like this”.