Generating precise poses for Stable Diffusion ControlNet usually requires finding a reference image or manually posing a 3D model. New "Text-to-Pose" tools allow you to generate these skeletal structures directly from a text prompt.

How it works

These models function similarly to text-to-image models but output coordinate data instead of pixels.

  1. Text Encoder: A transformer (like CLIP or T5) converts your text ("a man and a woman hugging") into vector embeddings.
  2. Pose Decoder: A specialized network translates those embeddings into 2D joint coordinates (keypoints).
  3. Rendering: The coordinates are plotted onto a black canvas using the OpenPose color scheme (e.g., distinct colors for limbs and joints), which ControlNet can read.
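The rendering step above can be sketched in a few lines. This is a minimal illustration, not the actual tool's code: the joint names, coordinates, and colors below are made-up placeholders standing in for a model's keypoint output, and real renderers draw thicker, anti-aliased strokes.

```python
import numpy as np

# Hypothetical keypoint output from a pose decoder: (x, y) pixel coordinates.
# Names and values are illustrative, not a real model's output.
KEYPOINTS = {
    "nose": (256, 80), "neck": (256, 140),
    "r_shoulder": (216, 140), "l_shoulder": (296, 140),
    "r_hip": (236, 280), "l_hip": (276, 280),
}

# Limbs as (joint_a, joint_b, RGB color) -- one distinct color per limb,
# mimicking the OpenPose-style color scheme that ControlNet reads.
LIMBS = [
    ("nose", "neck", (0, 0, 255)),
    ("neck", "r_shoulder", (255, 85, 0)),
    ("neck", "l_shoulder", (255, 170, 0)),
    ("neck", "r_hip", (0, 255, 85)),
    ("neck", "l_hip", (0, 255, 170)),
]

def render_pose(keypoints, limbs, size=(512, 512)):
    """Plot joint coordinates onto a black canvas as colored line segments."""
    canvas = np.zeros((size[1], size[0], 3), dtype=np.uint8)
    for a, b, color in limbs:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        # Sample enough points along the segment to leave no pixel gaps.
        n = max(abs(x1 - x0), abs(y1 - y0)) + 1
        xs = np.linspace(x0, x1, n).round().astype(int)
        ys = np.linspace(y0, y1, n).round().astype(int)
        canvas[ys, xs] = color  # 1-px line; real tools draw thicker strokes
    return canvas

pose_map = render_pose(KEYPOINTS, LIMBS)
print(pose_map.shape)        # (512, 512, 3)
print((pose_map > 0).any())  # True: the skeleton was drawn
```

The resulting RGB array is exactly the kind of image ControlNet's OpenPose conditioning expects: a black background with a color-coded skeleton.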

Option 1: The Web Service (Easiest)

If you just need the pose data quickly without setting up a pipeline, use the web interface.

Option 2: ComfyUI (Integrated)

For automation or complex workflows, you can run the model locally. The weights are openly released, and a ComfyUI custom node wraps them so the generated pose maps can feed directly into a ControlNet node graph.

Research Paper

The underlying technology often stems from research into motion generation and human-centric diffusion adapters.