Generating precise poses for Stable Diffusion ControlNet usually requires finding a reference image or manually posing a 3D model. New "Text-to-Pose" tools allow you to generate these skeletal structures directly from a text prompt.
How it works
These models function similarly to text-to-image models but output coordinate data instead of pixels.
- Text Encoder: A transformer (like CLIP or T5) converts your text ("a man and a woman hugging") into vector embeddings.
- Pose Decoder: A specialized network translates those embeddings into 2D joint coordinates (keypoints).
- Rendering: The coordinates are plotted onto a black canvas using the OpenPose color scheme (e.g., distinct colors for limbs and joints), which ControlNet can read.
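The rendering stage is the easiest to illustrate in code. Below is a minimal, dependency-free sketch of plotting joint coordinates onto a black canvas as a colored skeleton; the keypoints, limb list, and exact colors are illustrative placeholders (loosely following the OpenPose convention of one distinct color per limb), not actual decoder output:

```python
# Sketch of the "Rendering" stage: plot 2D keypoints onto a black canvas
# as a colored skeleton image that ControlNet can read.
# The keypoints and colors below are illustrative, not real model output.

def draw_line(canvas, p0, p1, color):
    """Draw a straight line between two points (Bresenham's algorithm)."""
    x0, y0 = p0
    x1, y1 = p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        canvas[y0][x0] = color
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy

def render_pose(keypoints, limbs, width=64, height=64):
    """Render each limb onto a black RGB canvas (nested lists of tuples)."""
    canvas = [[(0, 0, 0)] * width for _ in range(height)]
    for (a, b), color in limbs:
        draw_line(canvas, keypoints[a], keypoints[b], color)
    return canvas

# Hypothetical decoder output: a few named joints as (x, y) pixels.
keypoints = {"neck": (32, 12), "r_shoulder": (24, 16), "r_elbow": (20, 28)}
# Each limb gets its own color, as in the OpenPose scheme.
limbs = [(("neck", "r_shoulder"), (255, 0, 0)),
         (("r_shoulder", "r_elbow"), (255, 85, 0))]
canvas = render_pose(keypoints, limbs)
```

A real renderer would also draw circles at the joints and use the full 18- or 25-keypoint OpenPose layout, but the principle is the same: the pose model's output is just coordinates, and the image is a deterministic plot of them.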
Option 1: The Web Service (Easiest)
If you just need the pose data quickly without setting up a pipeline, use the web interface.
- Tool: Text-to-Pose.com
- Workflow:
  - Type your prompt.
  - Download the generated stick-figure image.
  - Upload it to the ControlNet unit in your Stable Diffusion interface (A1111, Forge, etc.).
Option 2: ComfyUI (Integrated)
For automation or more complex workflows, you can run the model locally: the weights are openly released, and a ComfyUI custom node wraps them.
- Extension: ComfyUI Text-to-Pose (by logicalor)
- Model Weights: Hosted on Hugging Face (e.g., clement-bonnet/t2p-transformer-v0).
- Workflow:
  - Install via ComfyUI Manager (search "Text to Pose").
  - Use the Text To Pose node to generate the pose image.
  - Feed the output directly into a ControlNet Apply node.
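Beyond wiring nodes in the graph editor, a workflow like the one above can also be queued programmatically through ComfyUI's HTTP API (`POST /prompt`). The sketch below is hedged: the `TextToPose` class name and the node input fields are assumptions (and the graph is heavily simplified), so export your own graph in API format to get the real names before using this:

```python
# Hedged sketch: queueing a Text-to-Pose -> ControlNet graph via ComfyUI's
# HTTP API (POST /prompt). The "TextToPose" class name and the input fields
# are assumptions -- export your working graph in API format to confirm them.
import json
import urllib.request

def build_workflow(prompt_text: str) -> dict:
    """Build an API-format graph: text -> pose image -> ControlNet."""
    return {
        "1": {  # hypothetical custom node that generates the pose image
            "class_type": "TextToPose",
            "inputs": {"text": prompt_text},
        },
        "2": {  # feed node 1's first output (the image) into ControlNet
            "class_type": "ControlNetApply",
            "inputs": {"image": ["1", 0], "strength": 1.0},
        },
    }

def queue_prompt(workflow: dict, host: str = "127.0.0.1:8188") -> None:
    """Submit the graph to a locally running ComfyUI server."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"http://{host}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    queue_prompt(build_workflow("a man and a woman hugging"))
```

Note the `["1", 0]` convention: in API-format workflows, a node input that references another node is a pair of the source node's id and its output index, which is how the pose image flows from the generator node into the ControlNet node.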
Research Paper
The underlying technology often stems from research into motion generation and human-centric diffusion adapters.