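Explainer · 7 min read · Published 2026-05-12

Text-to-Video AI: How It Works

Understand the principles behind text-to-video AI. Learn how diffusion models, transformers, and temporal attention turn text into moving video.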

You type a sentence. Thirty seconds later, a video exists that never existed before — photorealistic, with coherent motion, consistent characters, and plausible physics. This is not magic, but it is genuinely remarkable engineering. Understanding how text-to-video AI works is not just academic curiosity — it directly improves your ability to write prompts that produce the output you want. This article explains the complete pipeline from text input to video output, written for practitioners who want to understand the system they are using.

The 10,000-Foot View

Text-to-video generation is a five-stage pipeline:

1. Text understanding: a language model converts your prompt into a mathematical representation of meaning
2. Noise initialization: the system starts with pure random noise in a compressed space
3. Iterative denoising: the core model gradually removes noise, guided by your text embedding, until a coherent video emerges
4. Temporal coherence: specialized attention mechanisms ensure frame-to-frame consistency
5. Decoding: the compressed representation is expanded back to full-resolution pixel video

Each stage has specific implications for how you should write prompts. Let us go deeper.

Stage 1: Text Encoding — How the Model Reads Your Prompt

Your text prompt is not processed word-by-word the way a human reads it. Instead, a pre-trained language model (typically CLIP ViT-L, T5-XXL, or a proprietary variant) converts the prompt into a sequence of high-dimensional vectors, one per token, each a list of 768-4096 numbers, that together encode the semantic meaning of your description.

What This Means for Prompt Writing

Word order matters less than you think. The encoder captures meaning holistically. "A red car driving fast on a highway" and "On a highway, a fast red car is driving" produce nearly identical embeddings.
Specific nouns beat adjectives. "Golden retriever" encodes more visual information than "large fluffy dog." The model was trained on captioned videos where specific terms appeared alongside specific visuals.
Technical terms work. "Dolly-in", "rack focus", "anamorphic" — these terms appeared in the training data (film production descriptions) and encode specific visual meanings.
There is a token limit. Most encoders truncate at 77-256 tokens. Content beyond this limit is literally invisible to the model. Front-load your most important descriptors.
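
The effect of the token limit can be illustrated with a rough sketch. It assumes whitespace splitting as a stand-in for a real subword tokenizer and a 77-token cutoff; `visible_part` is a hypothetical helper for illustration, not any platform's API.

```python
def visible_part(prompt: str, max_tokens: int = 77) -> tuple[str, str]:
    """Split a prompt into the part an encoder would keep and the part it would drop.

    Assumption: one whitespace-separated word counts as one token. Real
    subword tokenizers usually produce more tokens than words, so the true
    cutoff tends to arrive earlier than this estimate.
    """
    words = prompt.split()
    kept = " ".join(words[:max_tokens])
    dropped = " ".join(words[max_tokens:])
    return kept, dropped


if __name__ == "__main__":
    prompt = "golden retriever running on a beach at sunset, " + "very detailed " * 40
    kept, dropped = visible_part(prompt, max_tokens=77)
    print(len(kept.split()), "words kept;", len(dropped.split()), "words dropped")
```

Anything that lands in `dropped` never reaches the video model, which is the reason to put the subject, action, and camera move at the start of the prompt.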

Stage 2: The Latent Space — Why Generation is Possible at All

A 5-second 1080p video at 24fps contains about 2.07 million pixels per frame (1920 x 1080) times 3 color channels times 120 frames = approximately 750 million pixel values. Running a diffusion model directly on that many values would be prohibitively expensive. The solution: work in a compressed latent space.

A Variational Autoencoder (VAE) compresses video into a latent representation that is 8-16x smaller in each spatial dimension and 4-8x smaller temporally. With 8x spatial and 4x temporal compression, a 5-second 1080p video becomes a ~135x240x30 latent tensor: roughly 1 million positions (each holding a small number of latent channels) instead of 750 million pixel values. The model generates in this compressed space, then the VAE decoder expands it back to full resolution.
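
The compression arithmetic can be checked with a few lines of Python. This is a sketch under stated assumptions: RGB input with 3 channels, 8x spatial and 4x temporal compression (the low end of the ranges above), and latent channels left out of the count. The function names are illustrative only.

```python
def raw_pixel_values(width, height, fps, seconds, channels=3):
    """Number of raw values in an uncompressed RGB clip."""
    return width * height * channels * fps * seconds


def latent_positions(width, height, fps, seconds, spatial_factor=8, temporal_factor=4):
    """Number of positions in the latent grid (latent channels not counted).

    Assumption: 8x spatial and 4x temporal compression. Real VAEs differ in
    how they handle the first frame and in how many channels each latent
    position carries.
    """
    frames = fps * seconds
    return (width // spatial_factor) * (height // spatial_factor) * (frames // temporal_factor)


if __name__ == "__main__":
    raw = raw_pixel_values(1920, 1080, fps=24, seconds=5)
    latent = latent_positions(1920, 1080, fps=24, seconds=5)
    print(f"raw values:       {raw:,}")
    print(f"latent positions: {latent:,}")
    print(f"reduction:        {raw // latent}x")
```

Swapping in 1280 x 720 shows how the latent grid shrinks with resolution, which is a quick way to compare the relative cost of different output sizes.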

What This Means for Output Quality

Fine details are limited by compression. The VAE cannot perfectly reconstruct details smaller than its compression ratio. This is why AI video sometimes has slightly soft textures — information was lost in the latent bottleneck.
Resolution drives generation time. 1080p has 2.25x as many pixels as 720p, so the latent is about 2.25x larger and generation takes at least that much longer; with full attention the cost can grow faster than linearly.
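Temporal compression explains frame rate limits. Most models generate roughly 8-12 latent frames per second of video, which decode to 24-30 visible frames per second. Each added second means more latent frames to denoise, which contributes to the current practical limit of 5-10 seconds.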

Stage 3: Iterative Denoising — The Core Generation Process

This is where the actual video is created. The model starts with pure Gaussian noise in the latent space and runs 20-50 denoising steps. At each step, the model predicts what noise to remove, conditioned on your text embedding. Early steps establish global structure (composition, major shapes, motion direction). Later steps refine details (textures, lighting, fine motion).
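
The loop described above can be sketched with a toy example. This is not a real sampler: the `toy_denoiser` below cheats by knowing the clean target, standing in for a trained network conditioned on a text embedding, and the step rule is a simple linear schedule chosen for illustration rather than any model's actual noise schedule.

```python
import random


def toy_denoiser(sample, target):
    """Stand-in for the trained network: returns the 'noise' still in the sample.

    Assumption: a real model estimates this from the noisy latent, the step
    index, and the text embedding. Here the clean target is simply known.
    """
    return [s - t for s, t in zip(sample, target)]


def denoise(target, steps=30, seed=0):
    """Start from Gaussian noise and remove a share of the predicted noise each step."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise, same size as the latent
    errors = []
    for step in range(steps):
        predicted_noise = toy_denoiser(sample, target)
        remaining = steps - step  # the last step removes whatever noise is left
        sample = [s - n / remaining for s, n in zip(sample, predicted_noise)]
        errors.append(max(abs(s - t) for s, t in zip(sample, target)))
    return sample, errors


if __name__ == "__main__":
    clean_latent = [0.5, -1.0, 2.0, 0.0]  # hypothetical 4-value "latent"
    result, errors = denoise(clean_latent, steps=30)
    print("error after step 1: ", round(errors[0], 3))
    print("error after step 15:", round(errors[14], 3))
    print("error after step 30:", round(errors[-1], 3))
```

The shape of the loop is the part that carries over to real systems: start from noise, ask the model what to remove, remove part of it, and repeat for a fixed number of steps.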

The Classifier-Free Guidance Scale

This parameter (often called CFG or guidance scale) controls how strongly the model follows your prompt versus generating freely. Higher values (7-15) produce output that closely matches your text but can look over-saturated or artificial. Lower values (1-5) produce more natural-looking video but may drift from your description. Most platforms set this automatically, but understanding it explains why the same prompt can produce different results at different settings.
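
As a minimal sketch, classifier-free guidance is commonly written as a blend of two noise predictions per step: one made without the prompt and one made with it. The lists below are hypothetical stand-ins for the model's outputs, and `apply_guidance` is an illustrative name, not a specific library's function.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    guidance_scale = 0 ignores the prompt, 1 uses the conditioned prediction
    as-is, and larger values extrapolate further in the prompt's direction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


if __name__ == "__main__":
    uncond = [0.0, 2.0]  # hypothetical prediction with an empty prompt
    cond = [1.0, 4.0]    # hypothetical prediction with the user's prompt
    for scale in (1.0, 3.0, 7.5):
        print(scale, apply_guidance(uncond, cond, scale))
```

Because high scales push the prediction well past what the model produced for the prompt alone, they can overshoot, which is consistent with the over-saturated look described above. In many open-source implementations a negative prompt is fed in place of the empty prompt, so the same formula moves the result away from it.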

What This Means for Prompt Writing

Contradictory prompts confuse the denoiser. "A sunny rainy day" forces the model to average between two incompatible states, producing muddy output. Be internally consistent.
Negative prompts work by subtracting. When you list "watermark, blurry" as a negative prompt, the sampler pushes each denoising step away from the noise prediction associated with those concepts. This is why negative prompts are so effective: they are not just filtering, they are actively steering generation away from failure modes.
More steps = more detail but diminishing returns. Steps 1-10 establish 80% of the video. Steps 10-30 refine details. Steps 30-50 add marginal improvement. This is why fast models (fewer steps) can still produce good results.

Stage 4: Temporal Attention — What Makes Video Different from Images

An image model generates one frame. A video model generates 24-120 frames that must be temporally coherent — the same object must look the same across frames, motion must be smooth, and physics must be plausible. This is achieved through temporal attention layers.

How Temporal Attention Works

In a standard image model, each spatial position attends to every other spatial position (self-attention). In a video model, an additional temporal attention layer lets each position at time T attend to the same position at times T-1, T-2, T+1, T+2, etc. This creates information flow across time, ensuring consistency.
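
A stripped-down sketch of temporal attention at a single spatial position follows. It assumes one attention head and skips the learned query, key, and value projections a real layer would apply, so the raw feature vectors play all three roles. The function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(features):
    """Mix information across frames for one spatial position.

    features[t] is the feature vector of that position at frame t.
    Assumption: identity projections and a single head, for clarity.
    """
    dim = len(features[0])
    mixed = []
    for query in features:
        # Similarity of this frame to every frame, scaled as in dot-product attention.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in features]
        weights = softmax(scores)
        # Each output is a weighted average of the same position across all frames.
        mixed.append([sum(w * frame[d] for w, frame in zip(weights, features)) for d in range(dim)])
    return mixed


if __name__ == "__main__":
    frames = [[0.0], [1.0]]  # one position, two frames, one feature each
    print(temporal_attention(frames))
```

Each output blends the same position across all frames, so values that disagree between frames are pulled toward each other. That averaging is the mechanism behind the frame-to-frame consistency described above.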

The DiT Architecture Revolution

Most 2024-2026 video models use Diffusion Transformers (DiT) instead of the older U-Net architecture. DiT processes the entire spatiotemporal volume with full attention — every position can attend to every other position across both space and time. This produces better long-range consistency (a character at second 1 matches second 10) but requires significantly more compute.

3D Full Attention vs Factorized Attention

3D Full Attention (Seedance, Kling): Every token attends to every other token across height, width, and time simultaneously. Best quality but most expensive. This is why these models are slower.
Factorized Attention (Wan, Hailuo): Spatial attention and temporal attention are computed separately, then combined. Faster but can produce subtle inconsistencies between frames. This is why these models are faster but occasionally have flickering.
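
The cost gap between the two designs can be estimated by counting query-key pairs. This is a back-of-the-envelope sketch: it assumes one token per latent position (real models usually group positions into patches first) and ignores everything except the attention score computation.

```python
def full_3d_pairs(frames, height, width):
    """Every token attends to every token across time, height, and width."""
    tokens = frames * height * width
    return tokens * tokens


def factorized_pairs(frames, height, width):
    """Spatial attention within each frame, then temporal attention per position."""
    spatial = frames * (height * width) ** 2    # each frame attends within itself
    temporal = (height * width) * frames ** 2   # each position attends across frames
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical latent grid: 30 frames of 135 x 240 positions.
    full = full_3d_pairs(30, 135, 240)
    fact = factorized_pairs(30, 135, 240)
    print(f"full 3D:    {full:,} pairs")
    print(f"factorized: {fact:,} pairs")
    print(f"ratio:      {full / fact:.1f}x")
```

Under these assumptions the full 3D count is close to the number of frames times the factorized count, which matches the speed and quality trade-off described above.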

What This Means for Prompt Writing

Describe motion explicitly. Temporal attention needs motion cues in your prompt. Without them, the model may produce a static scene or random motion. "Camera slowly pushes in" gives clear temporal direction.
Simpler motion = better results. One clear motion (a person walking left to right) is easier for temporal attention to maintain than complex multi-directional motion (three people doing different things simultaneously).
Duration affects coherence. Longer videos (8-10s) are harder to keep consistent than shorter ones (3-5s). If you need 10 seconds, consider whether two 5-second clips edited together might produce better results.

Stage 5: Decoding — From Latent to Pixels

The VAE decoder takes the denoised latent representation and reconstructs full-resolution video frames. This is a learned upsampling process — the decoder was trained to reconstruct video from compressed representations, so it can add plausible high-frequency detail that was lost during compression.

Some models add a super-resolution step after decoding — generating at 720p internally and upscaling to 1080p with a separate model. This is faster than generating at 1080p natively but can introduce subtle artifacts at edges.

Why Different Models Produce Different Results: A Technical Explanation

Factor	Impact on Output	Example
Training data	Determines style, quality ceiling, and content biases	Kling trained on massive face datasets = best faces
Model size (parameters)	Larger models capture more nuance but are slower	Seedance ~15B params vs Wan ~5B params
Attention architecture	Full 3D = better consistency, factorized = faster	Seedance (full 3D) vs Wan (factorized)
Text encoder	Better encoder = better prompt understanding	T5-XXL understands complex descriptions better than CLIP
VAE quality	Better VAE = sharper output, less compression artifacts	Newer VAEs preserve more fine detail
Denoising steps	More steps = more detail but slower	Seedance uses 50 steps, Wan uses 20-30

Current Limitations and Why They Exist

Duration limit (5-10s): Temporal attention scales quadratically with sequence length. Doubling duration quadruples compute cost. This is a fundamental architectural constraint, not just a business decision.
Physics failures: Models learn statistical correlations, not actual physics. They know water usually flows down, but do not understand gravity. Novel physical scenarios (unusual angles, complex interactions) can break.
Text rendering: Text requires pixel-perfect spatial precision that diffusion models struggle with. The latent space compression loses the fine detail needed for readable characters.
Hand anatomy: Hands have complex articulation with many degrees of freedom. The training data contains relatively few clear hand close-ups compared to faces, so the model has less to learn from.
Character consistency across generations: Each generation starts from random noise. Without explicit conditioning (like image-to-video), there is no mechanism to ensure the same character appears in separate generations.
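
The quadratic scaling behind the duration limit can be made concrete with a small calculation. The `tokens_per_second` value is a hypothetical placeholder, since the real figure depends on resolution, compression, and patch size; only the ratios matter here.

```python
def attention_cost(seconds, tokens_per_second=1000):
    """Relative attention cost for one clip, assuming cost grows with tokens squared.

    Assumption: tokens_per_second is a made-up constant; it cancels out of ratios.
    """
    tokens = seconds * tokens_per_second
    return tokens ** 2


if __name__ == "__main__":
    one_long_clip = attention_cost(10)
    two_short_clips = 2 * attention_cost(5)
    print("10s vs 5s:", one_long_clip / attention_cost(5))
    print("one 10s clip vs two 5s clips:", one_long_clip / two_short_clips)
```

Under this assumption a single 10-second clip costs four times as much attention compute as a 5-second clip and twice as much as two separate 5-second clips.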

Practical Applications of This Knowledge

Understanding the pipeline directly improves your results:

Front-load important details in your prompt (token limit means later content may be truncated)
Use specific nouns over adjectives (better encoding in the text model)
Describe one clear motion rather than complex multi-directional action (easier for temporal attention)
Use negative prompts aggressively (they actively steer denoising, not just filter)
Choose shorter durations when possible (better temporal coherence)
Use image-to-video when you need visual precision (bypasses text encoding ambiguity)
Match model to task based on architecture (full 3D attention for quality, factorized for speed)

Ready to apply this knowledge? Try the AI Video Generator with 50 free credits. Experiment with different prompt structures and see how the pipeline responds. Compare models to see how architectural differences produce different outputs from identical prompts.
