Generating precise poses for Stable Diffusion ControlNet usually requires finding a reference image or manually posing a 3D model. New "Text-to-Pose" tools allow you to generate these skeletal structures directly from a text prompt.

How it works

These models function similarly to text-to-image models but output coordinate data instead of pixels.

  1. Text Encoder: A transformer (like CLIP or T5) converts your text ("a man and a woman hugging") into vector embeddings.
  2. Pose Decoder: A specialized network translates those embeddings into 2D joint coordinates (keypoints).
  3. Rendering: The coordinates are plotted onto a black canvas using the OpenPose color scheme (e.g., distinct colors for limbs and joints), which ControlNet can read.
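The rendering step above can be sketched in a few lines. This is a minimal illustration, not the actual tool's code: the joint names, coordinates, and colors below are made-up placeholders standing in for a model's keypoint output, and real renderers draw thicker, anti-aliased strokes.

```python
import numpy as np

# Hypothetical keypoint output from a pose decoder: (x, y) pixel coordinates.
# Names and values are illustrative, not a real model's output.
KEYPOINTS = {
    "nose": (256, 80), "neck": (256, 140),
    "r_shoulder": (216, 140), "l_shoulder": (296, 140),
    "r_hip": (236, 280), "l_hip": (276, 280),
}

# Limbs as (joint_a, joint_b, RGB color) -- one distinct color per limb,
# mimicking the OpenPose-style color scheme that ControlNet reads.
LIMBS = [
    ("nose", "neck", (0, 0, 255)),
    ("neck", "r_shoulder", (255, 85, 0)),
    ("neck", "l_shoulder", (255, 170, 0)),
    ("neck", "r_hip", (0, 255, 85)),
    ("neck", "l_hip", (0, 255, 170)),
]

def render_pose(keypoints, limbs, size=(512, 512)):
    """Plot joint coordinates onto a black canvas as colored line segments."""
    canvas = np.zeros((size[1], size[0], 3), dtype=np.uint8)
    for a, b, color in limbs:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        # Sample enough points along the segment to leave no pixel gaps.
        n = max(abs(x1 - x0), abs(y1 - y0)) + 1
        xs = np.linspace(x0, x1, n).round().astype(int)
        ys = np.linspace(y0, y1, n).round().astype(int)
        canvas[ys, xs] = color  # 1-px line; real tools draw thicker strokes
    return canvas

pose_map = render_pose(KEYPOINTS, LIMBS)
print(pose_map.shape)        # (512, 512, 3)
print((pose_map > 0).any())  # True: the skeleton was drawn
```

The resulting RGB array is exactly the kind of image ControlNet's OpenPose conditioning expects: a black background with a color-coded skeleton.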

Option 1: The Web Service (Easiest)

If you just need the pose data quickly without setting up a pipeline, use the web interface.

Option 2: ComfyUI (Integrated)

For automation or complex workflows, you can run the model locally. The weights are openly released, and a ComfyUI custom node wraps them so the generated pose maps can feed directly into a ControlNet node graph.

Research Paper

The underlying technology often stems from research into motion generation and human-centric diffusion adapters.