Expressive Text-to-Image Generation with Rich Text

Songwei Ge¹ Taesung Park² Jun-Yan Zhu³ Jia-Bin Huang¹

¹University of Maryland, College Park ²Adobe Research ³Carnegie Mellon University

ICCV 2023 & IJCV 2025

We explore using the versatile formatting information in rich text, such as font size, color, style, and footnotes, for text-to-image generation and editing. Our framework enables several forms of control, including intuitive local style control, precise color generation, and supplementary descriptions for long prompts. Check out our paper for more applications.

A pizza with mushrooms, pepperonis, and pineapples on the top.

A Gothic church in the sunset with a beautiful landscape in the background.

A night sky filled with stars above a turbulent sea with giant waves.

Styles: Van Gogh, Ukiyo-e

A close-up photo of a corgi wearing a hat¹, beach and ocean in the background.

Styles: Impressionism.

¹A lady's hat.

A man in suit¹ with a green apple on his face.

¹A colorful Hawaiian shirt.

A dog playing guitar on a boat, sailing in the ocean.

A girl with long hair sitting in a cafe, by a table with coffee¹ on it, best quality, ultra detailed, dynamic pose.

¹Ceramic coffee cup with intricate design, a dance of earthy browns and delicate gold accents. The dark, velvety latte is in it.

A pixel art of a duck with a gun¹ in hand, wearing a hat², minimalist, flat

¹A bouquet of flowers.
²A black hat decorated with a red flower.

A watercolor painting of the detective duck wearing a sheriff uniform¹ and holding a vintage handgun².

¹A dark green, washed jacket.
²A beautiful flower bouquet made of pink roses.

A kid wearing a backpack riding a bike in a street with fallen leaves.

A panda¹ standing on a cliff by a waterfall.

¹Happy kung fu panda, asian art, ultra detailed.

A portrait of a man with a golden beard wearing a hat.

Style: Cubism

Font color controls the precise color of objects

Comparison with existing methods:

Font style indicates the styles of local regions

Comparison with existing methods:

A small pond (Ukiyo-e) surrounded by skyscraper (Cubism).

A night sky filled with stars (Van Gogh) above a turbulent sea with giant waves (Ukiyo-e).

Footnote provides supplementary descriptions, enabling complex text prompts

Comparison with existing methods:

A car¹ driving on the road. A bicycle² nearby a tree³. A cityscape⁴ in the background.

¹A sleek sports car gleams on the road in the sunlight, with its aerodynamic curves and polished finish catching the light.
²A bicycle with rusted frame and worn tires.
³A dead tree with a few red apples on it.
⁴A bustling Hongkong cityscape with towering skyscrapers.

A coffee table¹ sits in front of a sofa² on a cozy carpet. A painting³ on the wall. cinematic lighting, trending on artstation, 4k, hyperrealistic, focused, extreme details.

¹A rustic wooden coffee table adorned with scented candles and many books.
²A plush sofa with a soft blanket and colorful pillows on it.
³A painting of wheat field with a cottage in the distance, close up shot, trending on artstation, cgsociety, hd, calm, complimentary colours, realistic lighting, by Albert Bierstadt, Frederic Edwin Church.

Rich-text-to-image Generation Framework

The plain-text prompt is first fed to the diffusion model to collect the self-attention and cross-attention maps. The attention maps are averaged across heads, layers, and time steps. The self-attention maps are then used to segment the image via spectral clustering, and the cross-attention maps label each segment. The rich-text prompts obtained from the editor are stored in JSON format, providing attributes for each token span. According to the attributes of each token, the corresponding controls are applied as denoising prompts or guidance on the regions indicated by the token maps. We preserve the structure and background of the plain-text generation by injecting the features or blending the noised samples.
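The data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the JSON layout (a list of `ops` with `insert` and optional `attributes`, in the style of a rich-text editor delta), the function names, the argmax labeling rule, and the toy one-dimensional "images" are all assumptions made for illustration. The spectral clustering step is omitted, so the segments are given by hand.

```python
import json


def parse_rich_text(delta_json):
    """Split a rich-text JSON string into a plain prompt and attributed spans.

    Assumed layout: {"ops": [{"insert": str, "attributes": {...}}, ...]}.
    """
    plain, spans = "", []
    for op in json.loads(delta_json)["ops"]:
        text = op["insert"]
        attrs = op.get("attributes", {})
        if attrs:
            # Record where the formatted span sits inside the plain prompt.
            spans.append({"text": text, "start": len(plain),
                          "end": len(plain) + len(text), "attributes": attrs})
        plain += text
    return plain, spans


def label_segments(segments, cross_attn):
    """Label each segment with the span that has the highest mean cross-attention.

    segments: list of pixel-index lists (stand-in for the clustering output).
    cross_attn: dict mapping span text to a per-pixel attention list.
    The argmax rule is a simplification chosen for this sketch.
    """
    labels = []
    for pixels in segments:
        mean_attn = {name: sum(attn[p] for p in pixels) / len(pixels)
                     for name, attn in cross_attn.items()}
        labels.append(max(mean_attn, key=mean_attn.get))
    return labels


def blend(plain_sample, region_sample, mask):
    """Keep the plain-text sample outside the mask and the region sample inside."""
    return [m * r + (1 - m) * p
            for p, r, m in zip(plain_sample, region_sample, mask)]


# Toy usage: one colored span, four "pixels", two segments.
delta = json.dumps({"ops": [
    {"insert": "a Gothic "},
    {"insert": "church", "attributes": {"color": "#b26b00"}},
    {"insert": " in the sunset"},
]})
plain, spans = parse_rich_text(delta)

segments = [[0, 1], [2, 3]]
cross_attn = {"church": [0.9, 0.8, 0.1, 0.2], "sunset": [0.1, 0.2, 0.7, 0.9]}
labels = label_segments(segments, cross_attn)

# Binary token map for "church": 1 on pixels of segments labeled "church".
mask = [0, 0, 0, 0]
for pixels, label in zip(segments, labels):
    if label == "church":
        for p in pixels:
            mask[p] = 1
blended = blend([1.0, 1.0, 1.0, 1.0], [5.0, 5.0, 5.0, 5.0], mask)
```

In this toy run the span "church" claims the first segment, so only those two positions take values from the region-specific sample while the rest keep the plain-text sample, mirroring how blending preserves the background.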

References

[1] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[2] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. CVPR, 2023.
[3] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. arXiv preprint arXiv:2301.13826, 2023.
[4] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CVPR, 2022.

Acknowledgment

We thank Mia Tang, Aaron Hertzmann, Nupur Kumari, Gaurav Parmar, Ruihan Gao, and Aniruddha Mahapatra for their helpful discussion, code reviewing, and paper reading. We thank AK, Radamés Ajna, and other HuggingFace team members for their help and support with our online demo. This work is partly supported by NSF award no. 239076, NSF grants no. IIS-1910132 and IIS-2213335.

BibTeX

@inproceedings{ge2023expressive,
      title={Expressive Text-to-Image Generation with Rich Text},
      author={Ge, Songwei and Park, Taesung and Zhu, Jun-Yan and Huang, Jia-Bin},
      booktitle={IEEE International Conference on Computer Vision (ICCV)},
      year={2023}
}
@article{ge2025expressive,
      title={Expressive Text-to-Image Generation and Editing with Rich Text},
      author={Ge, Songwei and Park, Taesung and Zhu, Jun-Yan and Huang, Jia-Bin},
      journal={International Journal of Computer Vision (IJCV)},
      year={2025}
}