Expressive Text-to-Image Generation with Rich Text

1University of Maryland, College Park   2Adobe Research   3Carnegie Mellon University

We explore using the formatting information in rich text, such as font size, color, style, and footnotes, for text-to-image generation. Our framework enables intuitive local style control, precise color generation, and supplementary descriptions for long prompts.

"A pizza with mushrooms, pepperonis, and pineapples on the top, 4k, photorealism."

"A young woman sits at a table in a beautiful, lush garden, reading a book on the table."

Styles: Johannes Vermeer, Claude Monet

"A mesmerizing sight that captures the beauty of a rose blooming, close up"

"A close-up of a cat1 riding a scooter. Tropical trees in the background."

1A cat wearing glasses and a bandana around its neck.

"A vibrant field of sunflowers under the sunset. A towering mountain1 rises up in the distance."

1A single snow mountain. Styles: Van Gogh, Hokusai.

"A nightstand1 next to a bed with pillows on it. Grey wall2 bedroom."

1Nightstand with some books. 2Accent shelf with plants on the wall.

Font color controls the precise color of objects

Font style indicates the styles of local regions

A small pond (Ukiyo-e) surrounded by skyscraper (Cubism).

A fountain (Claude Monet) in front of an elegant castle (Pixel Art).

A night sky filled with stars (Van Gogh) above a turbulent sea with giant waves (Ukiyo-e).

The awe-inspiring sky and sea (J.M.W. Turner) by a coast with flowers and grasses in spring (Claude Monet).

Footnote provides supplementary descriptions, enabling complex text prompts

A coffee table1 sits in front of a sofa2 on a cozy carpet. A painting3 on the wall. cinematic lighting, trending on artstation, 4k, hyperrealistic, focused, extreme details.

1A rustic wooden coffee table adorned with scented candles and many books.
2A plush sofa with a soft blanket and colorful pillows on it.
3A painting of wheat field with a cottage in the distance, close up shot, trending on artstation, cgsociety, hd, calm, complimentary colours, realistic lighting, by Albert Bierstadt, Frederic Edwin Church.

Rich-text-to-image Generation Framework

The plain-text prompt is first fed to the diffusion model to collect cross-attention maps. These attention maps are averaged across heads, layers, and time steps, and the maximum is then taken across tokens to create token maps. The rich-text prompt obtained from the editor is stored in JSON format, which provides the attributes of each token span. Depending on the attributes of each token, the corresponding controls are applied, either as denoising prompts or as guidance, to the regions indicated by the token maps.
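
As a rough illustration of the token-map step, the following NumPy sketch averages a stack of cross-attention maps over time steps, layers, and heads, assigns each pixel to the token with the largest averaged attention, and then pairs each formatted span with the region covered by its tokens. It is a minimal sketch under stated assumptions, not the authors' implementation: the array layout, the hard per-pixel assignment used to interpret "maximum across tokens", the JSON schema (`text`, `token_indices`, `attributes`), and the function names `token_maps_from_attention` and `span_regions` are all hypothetical.

```python
import json

import numpy as np


def token_maps_from_attention(attn):
    """Turn stacked cross-attention maps into hard per-token maps.

    Assumed layout (hypothetical): attn has shape
    (time_steps, layers, heads, tokens, height, width), with every layer
    already resized to a common spatial resolution.
    """
    # Average over time steps, layers, and heads -> (tokens, height, width).
    mean_attn = attn.mean(axis=(0, 1, 2))
    # For each pixel, find the token with the maximum averaged attention.
    winner = mean_attn.argmax(axis=0)
    token_ids = np.arange(mean_attn.shape[0])[:, None, None]
    # One binary map per token; each pixel belongs to exactly one token.
    return (winner[None, :, :] == token_ids).astype(np.float32)


def span_regions(rich_text_json, token_maps):
    """Pair each formatted span with the image region of its tokens."""
    regions = []
    for span in json.loads(rich_text_json):
        attributes = span.get("attributes", {})
        if not attributes:
            continue  # plain spans need no extra control
        # Union of the token maps that belong to this span.
        mask = token_maps[span["token_indices"]].max(axis=0)
        regions.append((span["text"], attributes, mask))
    return regions


# Toy example: random numbers stand in for real attention maps, and the
# token indices are illustrative rather than produced by a real tokenizer.
rng = np.random.default_rng(0)
attn = rng.random((2, 3, 4, 5, 8, 8))
maps = token_maps_from_attention(attn)

rich_text = json.dumps([
    {"text": "A ", "token_indices": [0]},
    {"text": "small pond", "token_indices": [1, 2],
     "attributes": {"font": "Ukiyo-e"}},
    {"text": " surrounded by ", "token_indices": [3]},
    {"text": "skyscraper", "token_indices": [4],
     "attributes": {"font": "Cubism"}},
])
for text, attributes, mask in span_regions(rich_text, maps):
    print(text, attributes, int(mask.sum()), "pixels")
```

Each returned mask marks where a span's attributes (here a style name) would apply, which is the role the token maps play when region-specific prompts or guidance are used during denoising. A hard argmax keeps the regions disjoint; a soft normalization across tokens would be an equally plausible reading of the description.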




We thank Mia Tang, Aaron Hertzmann, Nupur Kumari, Gaurav Parmar, Ruihan Gao, and Aniruddha Mahapatra for their helpful discussion, code reviewing, and paper reading. We thank AK, Radamés Ajna, and other HuggingFace team members for their help and support with our online demo. This work is partly supported by NSF award no. 239076, NSF grants no. IIS-1910132 and IIS-2213335.


      title={Expressive Text-to-Image Generation with Rich Text},
      author={Ge, Songwei and Park, Taesung and Zhu, Jun-Yan and Huang, Jia-Bin},
      journal={arXiv preprint arXiv:2304.06720},