SANA: High-Resolution Image Generation with Linear Diffusion Transformers


NVIDIA, MIT, and Tsinghua University recently introduced SANA, a diffusion model built for high-resolution image synthesis on a linear diffusion Transformer (DiT) architecture. SANA can generate images at up to 4096×4096 resolution with strong text-image alignment, and it is fast enough to run on a laptop GPU.

Paper: https://arxiv.org/abs/2410.10629

Demo: https://nv-sana.mit.edu/

GitHub (no code yet): https://github.com/NVlabs/Sana

Key features include:

High Compression: Uses a 32× compression autoencoder, reducing tokens and speeding up image generation.
Efficient Linear Transformer: Replaces quadratic attention with faster linear attention and adds a 3×3 convolution to preserve local detail, with no positional encoding needed.
Compact Text Encoder: Utilizes “Gemma,” a small, decoder-only language model, for better text understanding and alignment with minimal memory usage.
Optimized Training & Sampling: Flow-DPM-Solver reduces sampling steps, while caption labeling and CLIPScore-based selection improve text-image consistency with fewer training steps.
Impressive Performance: At 512×512, SANA-0.6B delivers 5× the throughput of PixArt-α and PixArt-Σ, while at 1024×1024 SANA-1.6B is about 23× faster than FLUX-dev, offering high-quality images with low latency.
Compact and Fast: Available in 0.6B and 1.6B parameter sizes, SANA produces 1024×1024 images in under a second on 16GB VRAM, ideal for fast, detailed images with extended context handling.
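
To make the first two features more concrete, here is a minimal Python sketch. It is not SANA's actual code. It assumes a conventional 8× autoencoder paired with a patch size of 2, and SANA's 32× autoencoder paired with a patch size of 1 (the patch sizes are my assumption, since the list above only gives the compression factor). It also assumes a plain ReLU feature map, which is one common way to build linear attention. All function names are made up for illustration, and the 3×3 convolution and multi-head details are left out.

```python
import random

def token_count(height, width, ae_factor, patch_size):
    """Tokens the transformer processes for one image (hypothetical helper)."""
    step = ae_factor * patch_size  # pixels covered by one token along each side
    return (height // step) * (width // step)

def relu(m):
    return [[max(x, 0.0) for x in row] for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def quadratic_attention(q, k, v, eps=1e-6):
    """ReLU-kernel attention computed directly: builds the full n x n score matrix."""
    scores = matmul(relu(q), transpose(relu(k)))   # n x n
    out = matmul(scores, v)                        # n x d_v
    return [[x / (sum(s) + eps) for x in row] for row, s in zip(out, scores)]

def linear_attention(q, k, v, eps=1e-6):
    """Same result, reordered: build a small d x d_v summary once and reuse it."""
    q, k = relu(q), relu(k)
    kv = matmul(transpose(k), v)                   # d x d_v, no n x n matrix needed
    k_sum = [sum(col) for col in zip(*k)]          # length d
    out = matmul(q, kv)                            # n x d_v
    return [[x / (sum(a * b for a, b in zip(q_row, k_sum)) + eps) for x in row]
            for row, q_row in zip(out, q)]

# Token counts for a 1024 x 1024 image under the assumed settings.
print(token_count(1024, 1024, ae_factor=8, patch_size=2))   # 4096
print(token_count(1024, 1024, ae_factor=32, patch_size=1))  # 1024

# Both attention variants agree on random inputs (n tokens, d channels).
random.seed(0)
n, d = 6, 4
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
a, b = quadratic_attention(q, k, v), linear_attention(q, k, v)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print(max_diff < 1e-9)  # True
```

Under those assumptions a 1024×1024 image drops from 4096 tokens to 1024. The linear version never builds the n × n score matrix, so its cost grows linearly with the token count rather than quadratically.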

AI Podcast about Sana Image Generator

This is a test AI podcast, dedicated to introducing Sana and featuring its images.

This comparison table was published by the developers:

Methods	Throughput (samples/s)	Latency (s)	Params (B)	Speedup	FID 👇	CLIP 👆	GenEval 👆	DPG 👆
512 × 512 resolution
PixArt-α	1.5	1.2	0.6	1.0×	6.14	27.55	0.48	71.6
PixArt-Σ	1.5	1.2	0.6	1.0×	6.34	27.62	0.52	79.5
Sana-0.6B	6.7	0.8	0.6	5.0×	5.67	27.92	0.64	84.3
Sana-1.6B	3.8	0.6	1.6	2.5×	5.16	28.19	0.66	85.5
1024 × 1024 resolution
LUMINA-Next	0.12	9.1	2.0	2.8×	7.58	26.84	0.46	74.6
SDXL	0.15	6.5	2.6	3.5×	6.63	29.03	0.55	74.7
PlayGroundv2.5	0.21	5.3	2.6	4.9×	6.09	29.13	0.56	75.5
Hunyuan-DiT	0.05	18.2	1.5	1.2×	6.54	28.19	0.63	78.9
PixArt-Σ	0.4	2.7	0.6	9.3×	6.15	28.26	0.54	80.5
DALLE3	-	-	-	-	-	-	0.67	83.5
SD3-medium	0.28	4.4	2.0	6.5×	11.92	27.83	0.62	84.1
FLUX-dev	0.04	23.0	12.0	1.0×	10.15	27.47	0.67	84.0
FLUX-schnell	0.5	2.1	12.0	11.6×	7.94	28.14	0.71	84.8
Sana-0.6B	1.7	0.9	0.6	39.5×	5.81	28.36	0.64	83.6
Sana-1.6B	1.0	1.2	1.6	23.3×	5.76	28.67	0.66	84.8
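
As a rough sanity check on the Speedup column, the sketch below assumes that speedup is simply throughput divided by the throughput of the row marked 1.0× (FLUX-dev at 1024 × 1024). That reading is my assumption, not something the table states, and the `speedup` helper is made up for illustration.

```python
def speedup(throughput, baseline_throughput):
    """Throughput ratio relative to a baseline (both in samples/s)."""
    return throughput / baseline_throughput

# Throughput values copied from the 1024 x 1024 rows of the table above.
throughput_1024 = {
    "FLUX-dev": 0.04,      # row marked 1.0x, used as the baseline here
    "FLUX-schnell": 0.5,
    "PixArt-Σ": 0.4,
    "Sana-0.6B": 1.7,
    "Sana-1.6B": 1.0,
}

baseline = throughput_1024["FLUX-dev"]
for name, value in throughput_1024.items():
    print(f"{name}: {speedup(value, baseline):.1f}x")
```

With the rounded throughputs this gives 25.0× for Sana-1.6B and 42.5× for Sana-0.6B, close to the published 23.3× and 39.5× but not identical. The gap is probably down to rounding in the published throughput figures, so treat the recomputed numbers as approximate.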

Sana's Real-World Capabilities & Tests

Clear Text in Images: Generates crisp text, including styles like neon signs and banners.
Logo Design: Produces logos comparable to specialized AI tools.
Fast, High-Quality Results: Creates high-res images fast, even on standard laptops.
Stability: The online demo hosted by MIT is responsive, generating a 1472×960 px image in about 20 seconds.
Versatile Styles: Capable of producing images in styles from realistic to hyperrealistic.

Here are some prompt tests so far:

Man's portrait (realistic prompt test)

Raw, DSLR photo. Foreground: A rugged-looking man with a rebellious style, featuring medium-length, tousled dark brown hair with some strands falling across his face. He has facial hair, including a goatee and a light mustache. His expression is intense and serious, with deep-set eyes that give a penetrating gaze. The man is wearing a yellow bomber jacket with a zipper down the front. Underneath the jacket, a white T-shirt is visible at the neckline. He also wears several accessories, including necklaces and small hoop earrings, adding to the edgy and artistic look. Background: solid teal. Lighting: soft diffused lighting.

I'm not surprised that longer text pushes the model into writing gibberish. That's the industry standard so far, so no breakthroughs there:

Sana's prompt with long text in Fantasy Art style

A raw photo of a 22-year-old woman with messy hair against a clear blue sky screaming from excitement while holding a neon sign that says "Sana is crazy! (Or maybe I am)" Soft diffused lighting.

The prompt I've used previously with Flux didn't quite work:

Sana astronaut with a rock portal prompt example

pov of an austronaut holding a small red rock that has a portal showing the Statue of Liberty, alien landscape of a red planet in the background, surreal

And this one had a weird understanding of a Jack-O-Lantern, but hey, it's still neat and moody. Maybe the pumpkin turned into a spooky dude and walked off.

A spooky, mysterious foggy Halloween scene with thick fog. The colors are rich and saturated, and the lighting is soft and diffused, adding to the eerie moody atmosphere. Jack-O-Lantern and haunted house in background.

But some people might appreciate that Sana seems to be very generous not only with fingers (which is not uncommon, Flux and even Midjourney still like to throw in a sixth finger as a bonus) but also with breasts.

Prompt: A medium shot of a sexy 22-year-old woman holding a neon sign that says "Sana" on her chest level. With voluminous blonde hair, against a light purple backdrop, screaming from excitement.

Room for Future Enhancements

While SANA excels in speed and alignment, it could improve in facial detail, potentially through fine-tuning.

Only the paper and demo are available online so far, and the developers aim to release the following soon:

Training and inference code
Model zoo
Diffusers
Compatibility with ComfyUI

SANA’s potential to run on average laptops with tools like ComfyUI is a promising development for free, local, high-quality image generation. So put this on your radar.

Last modified 09 November 2024 at 21:19

Published: Oct 29, 2024 at 5:15 PM