We explore using versatile format information from rich text, including font size, color, style, and footnotes, for text-to-image generation. Our framework enables intuitive local style control, precise color generation, and supplementary descriptions for long prompts.
"A pizza with mushrooms, pepperonis, and pineapples on the top, 4k, photorealism."
"A young woman sits at a table in a beautiful, lush garden, reading a book on the table."
Styles: Johannes Vermeer, Claude Monet
"A mesmerizing sight that captures the beauty of a rose blooming, close up"
"A close-up of a cat1 riding a scooter. Tropical trees in the background."
1A cat wearing glasses and has a bandana around its neck.
"A vibrant field of sunflowers under the sunset. A towering mountain1 rises up in the distance."
1A single snow mountain. Styles: Van Gogh, Hokusai.
"A nightstand1 next to a bed with pillows on it. Grey wall2 bedroom."
1Nightstand with some books. 2Accent shelf with plants on the wall.
A small pond (Ukiyo-e) surrounded by skyscraper (Cubism).
A fountain (Claude Monet) in front of an elegant castle (Pixel Art).
A night sky filled with stars (Van Gogh) above a turbulent sea with giant waves (Ukiyo-e).
The awe-inspiring sky and sea (J.M.W. Turner) by a coast with flowers and grasses in spring (Claude Monet).
A coffee table1 sits in front of a sofa2 on a cozy carpet. A painting3 on the wall. cinematic lighting, trending on artstation, 4k, hyperrealistic, focused, extreme details.
1A rustic wooden coffee table adorned with scented candles and many books.
2A plush sofa with a soft blanket and colorful pillows on it.
3A painting of wheat field with a cottage in the distance, close up shot, trending on artstation, cgsociety, hd, calm, complimentary colours, realistic lighting, by Albert Bierstadt, Frederic Edwin Church.
The plain-text prompt is first fed to the diffusion model to collect the cross-attention maps. The attention maps are averaged across heads, layers, and time steps, and the maximum is then taken across tokens to create the token maps. The rich-text prompt obtained from the editor is stored in JSON format, which provides the attributes of each token span. According to the attributes of each token, the corresponding controls are applied as denoising prompts or guidance on the regions indicated by the token maps.
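The token-map step described above can be illustrated with a minimal NumPy sketch. It is not the released implementation. It assumes the cross-attention maps are already stacked into one array of shape (time steps, layers, heads, height, width, tokens), and it reads "maximum across tokens" as: score each span by the maximum over its own tokens, then assign each pixel to the best-scoring span. The function name `compute_token_maps`, the JSON schema, and the `token_ids` field are hypothetical and chosen only for illustration.

```python
import json

import numpy as np


def compute_token_maps(attn, span_token_ids):
    """Turn stacked cross-attention maps into one boolean mask per span.

    attn: array of shape (T, L, heads, H, W, num_tokens) (assumed layout).
    span_token_ids: list of token-index lists, one list per rich-text span.
    Returns a boolean array of shape (num_spans, H, W).
    """
    # Average across time steps, layers, and heads -> (H, W, num_tokens).
    mean_attn = attn.mean(axis=(0, 1, 2))
    # Score each span by the maximum over its own tokens -> (H, W, num_spans).
    span_scores = np.stack(
        [mean_attn[..., ids].max(axis=-1) for ids in span_token_ids], axis=-1
    )
    # Each pixel is assigned to the span with the highest score.
    winner = span_scores.argmax(axis=-1)
    return np.stack([winner == k for k in range(len(span_token_ids))])


# Hypothetical JSON layout for a rich-text prompt: one entry per token span.
RICH_TEXT_JSON = """
[
  {"text": "A small pond", "token_ids": [1, 2, 3],
   "attributes": {"style": "Ukiyo-e"}},
  {"text": "surrounded by skyscraper", "token_ids": [4, 5, 6],
   "attributes": {"style": "Cubism"}}
]
"""


def demo():
    """Pair each span's attributes with its region, using random attention."""
    spans = json.loads(RICH_TEXT_JSON)
    rng = np.random.default_rng(0)
    # Stand-in for attention collected from a real diffusion model.
    attn = rng.random((4, 3, 8, 16, 16, 7))
    maps = compute_token_maps(attn, [s["token_ids"] for s in spans])
    return spans, maps


if __name__ == "__main__":
    spans, maps = demo()
    for span, region in zip(spans, maps):
        print(span["text"], span["attributes"], int(region.sum()), "pixels")
```

Each returned mask marks the region where that span's attributes would be applied. How a given attribute is turned into a denoising prompt or a guidance term is left out of the sketch.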
[1] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv:2208.01626, 2022.
We thank Mia Tang, Aaron Hertzmann, Nupur Kumari, Gaurav Parmar, Ruihan Gao, and Aniruddha Mahapatra for their helpful discussions, code review, and paper reading. We thank AK, Radamés Ajna, and other HuggingFace team members for their help and support with our online demo. This work is partly supported by NSF award no. 239076 and NSF grants no. IIS-1910132 and IIS-2213335.
@article{ge2023expressive,
title={Expressive Text-to-Image Generation with Rich Text},
author={Ge, Songwei and Park, Taesung and Zhu, Jun-Yan and Huang, Jia-Bin},
journal={arXiv preprint arXiv:2304.06720},
year={2023}
}