Wan 2.5 is best prompted like a short film shot, not a still-image caption. Because many implementations support audio, prompts should often describe not just visuals and motion, but also timing, performance, camera behavior, and sound.
Subject + Environment + Action + Camera + Lighting + Style + Audio + Constraints
Prompts that read like film direction notes tend to work best, because implementations may support lip sync, native audio generation, and uploaded audio tracks, and the model tends to follow detailed prompts more closely.
Keep one shot idea per clip.
Who + where + what happens + camera + light + style + sound + negatives
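The formula above can be sketched as a small prompt builder. This is a minimal illustration, not part of any Wan 2.5 API: the `ShotPrompt` class, its field names, and the inline `Audio:` and `Avoid:` labels are assumptions made for the example. Some implementations take negatives in a separate field, in which case the constraints slot would be passed there instead.

```python
from dataclasses import dataclass, fields


@dataclass
class ShotPrompt:
    """Hypothetical container for the eight slots of the prompt formula."""
    subject: str
    environment: str
    action: str
    camera: str
    lighting: str
    style: str
    audio: str = ""        # optional: leave empty if the clip has no sound
    constraints: str = ""  # optional: negatives, written inline here

    def render(self) -> str:
        # Walk the slots in declaration order, skip empty ones,
        # and end each slot with exactly one period.
        parts = [getattr(self, f.name).strip() for f in fields(self)]
        return " ".join(
            (p if p.endswith(".") else p + ".") for p in parts if p
        )


shot = ShotPrompt(
    subject="A woman in her 30s in a yellow raincoat, hair damp",
    environment="a narrow city street at night, light rain",
    action="she stops, looks up, and opens a red umbrella",
    camera="medium shot, slow push-in at eye level",
    lighting="neon signs reflecting off wet pavement",
    style="cinematic, shallow depth of field",
    audio="Audio: steady rain, distant traffic, no dialogue",
    constraints="Avoid: text overlays, extra people, camera shake",
)
prompt = shot.render()
print(prompt)
```

Keeping each slot as a separate field makes it easy to see which part of the formula is missing before a prompt is sent.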
Describe the subject through details that visibly affect generation: wardrobe, age range, posture, hair, expression, emotional state.
Be specific enough to anchor the shot: indoors or outdoors, time of day, weather, props, cleanliness, crowd level.
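Use visible motion. Prefer concrete actions over abstract feelings: "she wipes her eyes and turns away" gives the model something to animate, while "she feels sad" does not.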
Specify framing, angle, and movement instead of letting the model improvise.
Because many Wan 2.5 implementations support audio, sound should be prompted intentionally when relevant.
Useful audio instruction types:
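For text-to-video (T2V), the prompt must do more worldbuilding, because there is no source image to establish the subject and setting.
For image-to-video (I2V), the image already supplies the look, so focus on what moves, what stays fixed, camera behavior, atmosphere, audio, and what must be preserved.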
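For I2V, the emphasis described above can be sketched as a helper that separates what moves from what must stay fixed. The function name `i2v_prompt`, its parameters, and the `Motion:` / `Keep unchanged:` labels are hypothetical conventions for this example, not keywords that Wan 2.5 is known to require.

```python
def i2v_prompt(moves, stays_fixed, camera, atmosphere="", audio=""):
    """Assemble an image-to-video prompt from motion and preservation notes.

    moves and stays_fixed are lists of short phrases; the remaining
    arguments are plain strings, and empty ones are omitted.
    """
    parts = [
        "Motion: " + "; ".join(moves),
        "Keep unchanged: " + "; ".join(stays_fixed),
        "Camera: " + camera,
    ]
    if atmosphere:
        parts.append("Atmosphere: " + atmosphere)
    if audio:
        parts.append("Audio: " + audio)
    return ". ".join(parts) + "."


prompt = i2v_prompt(
    moves=["her hair sways in the wind", "steam rises from the cup"],
    stays_fixed=["face", "outfit", "background layout"],
    camera="static, locked-off framing",
    atmosphere="quiet early morning",
    audio="soft cafe ambience",
)
print(prompt)
```

A T2V prompt would instead need the full subject and environment description, since nothing is inherited from an image.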
One camera move is usually enough.
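A rough way to check a draft against this guideline is to count the camera-move terms it contains. The sketch below is a heuristic only: the `CAMERA_MOVES` vocabulary is an assumed, incomplete list, and plain keyword matching can misfire (for example, "pan" in "frying pan") or miss inflected forms such as "pans".

```python
import re

# Assumed vocabulary for illustration; extend it to match your own phrasing.
CAMERA_MOVES = ["pan", "tilt", "dolly", "push-in", "pull-out",
                "zoom", "orbit", "tracking", "crane"]

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in CAMERA_MOVES) + r")\b"
)


def camera_moves(prompt: str) -> list[str]:
    """Return the distinct camera-move terms found in the prompt, sorted."""
    return sorted(set(_PATTERN.findall(prompt.lower())))


draft = "Medium shot, pan left, then zoom in and orbit the subject"
found = camera_moves(draft)
if len(found) > 1:
    print("Consider keeping a single camera move; found:", ", ".join(found))
```

A draft that triggers the warning is usually better split into separate clips, one move each.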
Clear subject + clear location + one visible action + one camera move + one lighting concept + one style lane + intentional audio + explicit constraints.