
    AI Image Generation Algorithms - Breaking The Rules, Gently


    Introduction

    A little while ago, I made a couple of videos detailing my informal exploration of a variety of different artificial intelligence image generators. Nothing very technical—I'm more interested in studying them as a phenomenon than as a technology. Since I made those videos, I've had access to a couple of different algorithms that are more advanced and capable. Let's have a look at what I managed to get those to do.

    [Music]

    In this video, I'll be using and then later abusing DALL-E from OpenAI and Stable Diffusion from Stability AI. I thought I might give these things the same text prompts that I used in the previous video, but the results were a bit of a mixed bag, with some triumphs and some slight disappointments, which I'll explain as we go. But let's just do one or two comparisons.

    So last time I asked for a dog made of bricks and got this. This time, DALL-E responded with this and Stable Diffusion with this. Clearly an improvement.

    Some things didn't work so well whatever I tried. The strange animal in a field, which produced one of the more interesting sets of images previously, gave one of the least interesting this time. Similarly, the prompt for a very long bird mostly got me somewhat realistic-looking pictures of tallish birds. Previously, when I asked for "boy with apple," I got things like this. When I asked the same this time, I got very literal responses. Whereas a lot of the algorithms I examined before were specifically trying to return something that looks like a work of art, Stable Diffusion and DALL-E are instead trying to return exactly what you asked for. So with these algorithms, a more verbose text prompt is often required to get closer to the desired kind of output.

    So instead, if I ask for an oil painting of a boy with apple in the style of Johannes Van Hoytl the Younger, I get these, which I think are outstanding. That's probably enough of the comparison, but I do want to spend a moment talking about what's happening when these algorithms generate images.
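    As an aside, for anyone who wants to reproduce this kind of prompt comparison at home, here is a minimal sketch of how one might send the same verbose prompt to Stable Diffusion using Hugging Face's diffusers library. The checkpoint name, guidance scale, and step count are illustrative assumptions, not the settings used for the images in this video.

```python
# Minimal sketch: text-to-image with Stable Diffusion via the diffusers library.
# The checkpoint, guidance scale, and step count are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # a commonly used public checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# A terse prompt tends to produce a literal result; a more verbose prompt
# steers the model towards a particular style, as described above.
prompt = ("an oil painting of a boy with apple "
          "in the style of Johannes Van Hoytl the Younger")

image = pipe(prompt, guidance_scale=7.5, num_inference_steps=50).images[0]
image.save("boy_with_apple.png")
```

    Running the same script with the terse prompt "boy with apple" and then with the verbose oil-painting version is a quick way to see the literal-versus-stylised difference described above.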

    These algorithms know what things look like and can imagine things they've never seen. Now I should make it clear I'm not suggesting that they're sentient or sapient or self-aware or anything like that. It's just necessary to pick some convenient shorthand terms in order to describe the capabilities of the thing. So when I say they know or imagine or see, I'm really saying that they've been sufficiently trained and configured to be able to perform a task that we would describe as knowing or imagining or seeing if we were the ones doing the task.

    But what do I really mean by that? Well for one thing, we can ask them to create realistic images. So if I ask for a sunlit glass of flowers on a pine table, I get images that plausibly look like what I asked for. The shadows and caustics—that is, the focused light thrown by the vase in the sunlight—are realistic. It's been able to create these images because it's studied enough examples to be able to understand how glass looks, how shadows work, how sunlight is refracted and focused through glass objects, and so on, even though those things were not specific objectives of the learning process. The understanding, if that's the right word, of refraction is an emergent property of the learning process.

    Now the skeptics might still be thinking these could just be stock photos. Okay, let's change the prompt to a sunlit glass sculpture of a lobster on a pine table, and we get these, again showing plausible shadows and the play of light. Still think there might happen to be four ready-made glass-lobster-on-a-pine-table photos in a database somewhere? Fine. Let's try a sunlit glass sculpture of a Citroën 2CV on a pine table. Really, it will try to generate anything you like. Obviously, the algorithm will have seen some images of Citroën 2CVs and pine tables and glass, but not all together. I'm sure it's not cobbling these images together from pieces; it's creating them from trained knowledge of what the world looks like.

    Now it doesn't always get it exactly right. Sometimes this appears to be because it couldn't parse the sentence perfectly. For example, if I ask for a squirrel holding a box of multi-colored metal balls on a red table, sometimes it works and you get this. Other times it misunderstands which attribute belongs to which object: in this picture, the table isn't red, the wall is. Or if I ask for an oil painting of a squirrel holding a box of multi-colored metal balls on a red table, sometimes I get exactly that; other times I get an oil painting of a squirrel on a red table next to a box of multi-colored metal balls. Still pretty impressive, I think. And let's not forget that humans can easily misinterpret the syntax of compound sentences too; newspaper "crash blossoms" are a good example.

    One thing they tell you not to do with these, not because it's dangerous or anything—in fact what they tell you is not to bother doing—is to ask for text or written output because these algorithms have not been trained to produce written output. They know what the world looks like, what paintings look like, what drawings and sculptures look like, but they don't know how to write. Except it's true that within the training data there will have been pictures that included text within them: street signs, labels on bottles, greetings cards, posters—that sort of thing.

    So whilst they don't know how to write, they do know what writing looks like. So what happens when we do the thing we're advised not to bother doing and ask for some text output? Well, I found it both interesting and amusing.

    Here we go. I asked for a cartoon drawing of a sign that says "Danger Thin Ice." I got "Danger dinge, dinge, danger, ting this ding the danger ding is the dinge." Next, I asked for an inspirational message. I got that. "Love Tay starts the art feisting." I mean it inspires something! Let's try asking for a proverb: "Pop plays the over post our protest priyaz Nat over is is polits ever." Okay, how about the message from a fortune cookie? "I worry, foreign wisdom there for all of us."

    In all of these experiments, the output looks like text; it even contains recognizable letters, sometimes whole words. But I think this is because the training data set for the algorithms must have contained some pictures of posters and signs and such that contain text, and maybe those images were tagged with something useful and relevant. But rather than learning to read and write, the algorithms learn to draw pictures of text. Like this: a request for emergency instructions, which came out as "emergingi imaginary endley or wanged and princi emperingly and lnc endley amine glenty ranked." Well, it certainly sounds important.

    I used the outpainting feature of DALL-E, where you can give it any image (here, one of its own) and it will try to extend it into a bigger view by filling in what it considers to be plausible pieces. So we get more of this sign, which was not just wanged, but "urgency wanged." Stable Diffusion gave similar results for emergency instructions, including "entermanstony" and "Emma Earth sentisi."

    And just to jump back to the outpainting thing for a moment: I gave DALL-E a prompt consisting of the first verse of Lewis Carroll's poem "Jabberwocky" and it drew this, which looks like it's trying to be part of the cover of a book. I was curious to see what the rest of the book might look like, so I used the outpainting feature. And here it is: "bingas the boozing walkus."
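    For anyone who wants to try this kind of outpainting themselves, here is a minimal sketch using the OpenAI image edit endpoint: the original picture is pasted onto a larger transparent canvas, and the transparent border is what the model is asked to fill in. The file names, canvas size, and prompt below are hypothetical examples, not the exact ones used here.

```python
# Minimal sketch of DALL-E-style outpainting via the OpenAI image edit endpoint.
# File names, canvas size, and prompt are hypothetical; assumes the source image
# is a square PNG smaller than the target canvas.
from PIL import Image
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Paste the original image onto a larger transparent canvas; the transparent
# border is the region the model will be asked to fill in.
original = Image.open("emergency_sign.png").convert("RGBA")
canvas = Image.new("RGBA", (1024, 1024), (0, 0, 0, 0))
offset = ((1024 - original.width) // 2, (1024 - original.height) // 2)
canvas.paste(original, offset)
canvas.save("padded.png")

# With no separate mask supplied, the image's own transparency marks
# the area to be generated.
result = client.images.edit(
    model="dall-e-2",
    image=open("padded.png", "rb"),
    prompt="a weathered emergency instructions sign, extended view",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)
```

    Feeding the newly extended picture back in as the next starting image is one way to keep growing the view outwards, which is roughly the idea behind the outpainting feature described above.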

    But I digress. Some of the outputs, when I try to read them out loud, evoke a weird and completely unscientific feeling in my mind that they might represent some sort of archetypal version of English, as though somehow these things had learned to make the primitive word shapes of the English language, abstracted from their meaning because they're being drawn as pictures of words.

    This is one of the thoughts that goes through my head, and I know it's probably complete and fanciful nonsense. But I dropped a note to one of my favorite YouTubers for videos about language, Simon Roper, to discuss it. Simon has a really lovely channel containing many interesting videos about the reconstruction of the accents and pronunciation of ancient forms of English, as well as other related languages. It's well worth a look if you're interested in that sort of thing, and even if you think you're not, because his videos are really good and they might actually get you interested.

    To Simon's ear and mind, these archetypes I imagined I might be seeing and hearing simply weren't there at all. Honestly, this didn't surprise me. My fanciful amateur imaginings of what I think I can see often seem to play out like that, and this one felt like a huge stretch from the start. However, Simon was a great sport about it and agreed to try to read some of the outputs where possible in an Old English style. And that's what you're about to hear, a couple of short poems about cheese written by Stable Diffusion and DALL-E and performed by Simon Roper.

    [Simon Roper reads the AI-generated cheese poems]

    So thank you very much, Simon; words can't express my gratitude for your willingness to play along here. Don't forget to check out Simon Roper's YouTube channel (link in the card and description). So that was my little journey off of one of the edges of the map with AI image generation. I hope you found that interesting, and I think if there's anything to take away from this, it's that sometimes deliberately not following the guidelines can be a bit of fun. I'm not suggesting you break the law or anything or that you circumvent safety protocols, but not all instructions are about safety or law. Thanks for watching, and I hope to see you again soon.

    [Music]

    Keywords

    • AI Image Generators
    • DALL-E
    • OpenAI
    • Stable Diffusion
    • Stability AI
    • Image Generation
    • Text Prompts
    • Art
    • Training Data
    • Outpainting
    • Refraction
    • Shadows
    • Language
    • Simon Roper

    FAQ

    Q: Which algorithms did you use in your experiments? A: I used DALL-E from OpenAI and Stable Diffusion from Stability AI.

    Q: How do these algorithms differ from the ones you explored previously? A: These newer algorithms are designed to return exactly what you ask for rather than interpreting the prompt creatively. They often require more detailed text prompts to produce the desired output.

    Q: Can these AI algorithms understand complex visual concepts? A: Yes, they can. For example, they can understand how glass looks, how shadows work, and how sunlight is refracted through objects, even though these were not the specific objectives of their training.

    Q: Do these algorithms always get the prompts right? A: No, sometimes they misinterpret complex or compound sentences. However, their visual outputs are often impressively close to what was requested.

    Q: Can these AI algorithms produce textual outputs? A: They haven't been specifically trained for written output and often produce amusing or nonsensical text. However, they can generate text that looks visually like writing due to exposure to images containing text.

    Q: What is outpainting? A: Outpainting is a feature where the algorithm extends an image by filling in the surrounding area with what it considers plausible continuations of the original image.

    Q: Is there anything to be cautious about when using these AI algorithms? A: While not generally dangerous, you should have realistic expectations about possible errors or misinterpretations, especially with complex text prompts. Following guidelines can ensure more reliable results, but occasionally breaking the rules (safely) can lead to interesting discoveries!
