Elephant in The Room of the Visual AI


Surely you've heard the phrase "the elephant in the room". It's a common metaphor used to describe a significant problem or issue that everyone is oddly ignoring. But what happens when you try to visualize this phrase literally using AI models? Surprisingly, creating an image where people ignore a live elephant in the living room turns out to be quite a challenge for artificial intelligence.

Dream Machine and the Elephant in the Room

I've had this obsession with generating a short video of such a scene. I began by feeding my text prompts directly to Luma's Dream Machine. Finally, I thought, I can just 'dream up' whatever nonsense I like and, as long as it's nothing criminal or indecent, it will generate the scene for me.

By the 3rd try I started suspecting something odd was going on. I also prompted for the elephant alone in the room, and there was no issue there.

In the 1st video, as the camera turns, there's a third dude raging and raving instead of an elephant. In the 2nd one, the two men just keep arguing. In the 3rd, a woman is petting the elephant. Elephant alone: fine, but not what I wanted.

So then I decided to head over to MidJourney, generate a couple of HQ images, and make the video from them. Thankfully, Lumalabs has just introduced a super useful new feature: keyframes. Instead of a text prompt, you can just give it 2 images that will serve as the starting and finishing scenes.

MidJourney's Elephant in the Room

OK, MidJourney actually followed the prompt almost fully in one of its first attempts.


Photorealistic elephant in the living room is in focus, seen behind the two people arguing with each other passionately ignoring the elephant

Image #3 is almost perfect; at least the man is not looking at the elephant. But you can notice straight away that in every picture there's somebody taking note of the damn animal. I kept requesting variations and modifying my prompt in different ways. These were my other results:

midjourney prompt variations

I even tried the --no operator with the phrase "no looking at elephant" LMAO!

midjourney --no operator fail

They're not looking, all right, they're staring at the damn elephant! Possibly praying to it, telling it their life stories, gauging how many steaks one elephant makes... whatever they're doing they're definitely acutely aware of it being there.

So basically my two most usable, true to prompt images came out of variations (when you click the V button for a chosen image and it produces similar ones), not from tweaking my prompts. MidJourney managed to offer a couple of close enough images out of ... about 20+ attempts. Here they are:

Another Successful Output from Variation

I headed over to ChatGPT to try the same idea and to ask it what the hell might be going on here. Surprisingly, DALL-E spawned two images in a row that respected the prompt perfectly.

dalle elephant in the room

dalle elephant image2

It was going real well, until suddenly the same problem reoccurred:

glitch strikes again

People are NOT facing each other and the elephant is NOT calm, despite the prompt calling for that.

What's the Problem?

I then decided to ask why my prompt might be so hard to implement. Here's what ChatGPT 4o had to say:

The difficulty AI models have in placing a live elephant in a room with people who ignore its presence likely stems from several factors related to how these models interpret and generate images based on contextual and semantic understanding. Here are some key reasons:

1. Cultural Context and Common Phrases: The phrase "elephant in the room" is a common metaphor for an obvious problem or issue that people are ignoring. AI models trained on large datasets might associate this phrase with the need for attention or interaction, making it challenging to generate a scene where the elephant is present but ignored.
2. Contextual Coherence: AI models aim to create contextually coherent images. In real life, an elephant in a room would be highly noticeable, and people would likely react to its presence. Therefore, the model might default to generating interactions to maintain what it perceives as logical or realistic coherence.
3. Training Data Limitations: The datasets used to train these models might have fewer examples of scenarios where large, unusual objects or animals are ignored by people. Most images in training data likely depict people interacting with noticeable elements in their environment.
4. Attention Mechanisms in AI: Attention mechanisms in AI models prioritize significant or unusual elements in a scene. A live elephant in a room is highly unusual, prompting the model to focus on generating appropriate human reactions to maintain a realistic portrayal.
5. Difficulty in Understanding Neglect: AI models might struggle with the concept of purposeful neglect or ignorance in a visual context. Recognizing and rendering scenes where significant elements are ignored requires a nuanced understanding of human behavior and context, which can be challenging for AI.

To address this challenge, ChatGPT suggested, "one approach could be to explicitly instruct the model to place the elephant in the room and ensure that the people are actively engaged in an argument without any gestures or gaze directed toward the elephant." - I mean, thanks, that's what I've been doing.

But I had an idea: test #2, contextual coherence, by introducing a description that suggests a surreal or dream-like context, to help the model understand the intended paradox. I relayed that to GPT and it agreed and even wrote me an updated prompt, which went like this:

A surreal, dream-like living room scene where a live, photorealistic elephant stands calmly next to a bookshelf. Two people are engaged in a heated argument, gesturing wildly and completely ignoring the elephant's presence, as if it doesn't exist.

Um.. that was also a 'no can do'. As was the next variation of that prompt.

stillnopelol

I then went to test #1 - is it the elephant? Because of the expression? Fine, let's try the monkey then! Get that monkey into the room now!

And look! The bloody monkey is also attracting attention!

midjourney monkey prompt

One variant was close enough to the prompt, with people actually ignoring the monkey, although it was yelling along with them. I hadn't specified that the monkey should be calm. I prompted twice more with the wording 'sitting still' and 'calmly sitting still'.

midjourney contagion effect

In the first option of the first generation (left batch) the monkey was still, and two people were arguing while ignoring it, BUT a third person had to sneak in and SHE had to sit there looking at the damn monkey. In another variant the monkey was a picture on the wall, and in the last one the monkey is also yapping away and gesticulating, so not still.

In the right batch, where the monkey really had to be 'calmly sitting still', the people also became docile and quiet. There seems to be a contagion effect of some sort going on.
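For anyone who wants to repeat this kind of test a bit more systematically, here's a minimal sketch that builds the prompt grid (animal x stillness wording) as plain strings to paste into an image generator. Everything in it is an assumption for illustration: the function name `build_prompts`, the lists, and the template, which is loosely based on the elephant prompt quoted earlier. The exact monkey prompts aren't recorded above, and the sketch doesn't call any tool's API.

```python
from itertools import product

# Hypothetical values for illustration; not the exact prompts used in the post.
ANIMALS = ["elephant", "monkey"]
STILLNESS = ["", "sitting still", "calmly sitting still"]

# Template loosely modeled on the elephant prompt quoted earlier (an assumption).
TEMPLATE = (
    "Photorealistic {animal} in the living room{stillness}, seen behind "
    "two people arguing with each other passionately, ignoring the {animal}"
)


def build_prompts(animals=ANIMALS, stillness=STILLNESS):
    """Return one prompt string per (animal, stillness wording) pair."""
    prompts = []
    for animal, still in product(animals, stillness):
        # Only add a leading space when there is stillness wording to insert.
        suffix = f" {still}" if still else ""
        prompts.append(TEMPLATE.format(animal=animal, stillness=suffix))
    return prompts


if __name__ == "__main__":
    for prompt in build_prompts():
        print(prompt)
```

Changing only one thing per prompt (the animal, or the stillness wording) makes it easier to tell which change caused an effect like the 'contagion' one, rather than guessing from ad hoc rewrites.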

Oh well, one thing we know for sure now: it's not just the elephant. Wild animals in rooms with people cannot be ignored by those people. That's the rule in visual AI's Bible: "Thou shalt not let people ignore a freaking monkey, nor an elephant, nor a penguin in the room with them".