The future of video understanding is here. While traditional video AI models struggle with latency and batch-processing limitations, LiveCC shatters these barriers by learning from streaming speech transcription at scale. Developed by ShowLab and presented at CVPR 2025, this breakthrough technology delivers real-time commentary that feels natural, immediate, and remarkably human.
Imagine watching a live soccer match where AI commentary flows seamlessly with the action. Picture educational videos that generate instant narration as concepts appear on screen. Envision accessibility tools that describe visual content to visually impaired users without delay. LiveCC makes all this possible through its novel video-ASR streaming architecture that processes visual and audio information simultaneously.
This comprehensive guide dives deep into LiveCC's capabilities. You'll discover how it works, explore real-world applications, master the installation process, and learn to implement it with actual code from the repository. Whether you're a researcher pushing AI boundaries or a developer building next-generation video applications, LiveCC deserves your attention.
What is LiveCC?
LiveCC represents a paradigm shift in video language models. Created by the research team at ShowLab, this groundbreaking system is the first video LLM specifically engineered for real-time commentary generation. Unlike conventional models that process videos in offline batches, LiveCC introduces a novel streaming mechanism that handles video frames and speech transcription concurrently.
At its core, LiveCC leverages a sophisticated video-ASR streaming method that trains the model to anticipate and generate commentary while simultaneously processing incoming visual and audio streams. This approach, detailed in their CVPR 2025 paper, enables the model to maintain temporal coherence and contextual awareness without the latency penalties that plague traditional architectures.
The project has gained massive traction in the AI community for several reasons. First, it achieves state-of-the-art performance on both streaming and offline benchmarks, proving that real-time processing doesn't require sacrificing quality. Second, the team has released comprehensive training datasets including Live-CC-5M for pre-training and Live-WhisperX-526K for supervised fine-tuning, democratizing research in this space. Third, the open-source release includes not just model weights but complete training scripts, evaluation protocols, and interactive demos.
Built upon the robust Qwen2-VL-7B foundation, LiveCC extends its capabilities with specialized vision encoders and a streaming-aware attention mechanism. The model processes variable-length video clips, dynamically adjusting frame sampling rates based on content complexity. This adaptive approach ensures optimal token usage—allocating up to 24,000 visual tokens while reserving 8,000 for language generation, creating a balanced representation that captures both fine-grained details and high-level semantics.
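The pixel budgets used throughout the repository come straight from patch arithmetic: Qwen2-VL-style encoders spend one visual token per 28x28 pixel patch, so the environment variables in the training scripts later in this guide are just token counts multiplied out (24,576 is the exact value behind the "24,000" figure above). A quick sanity check:

```python
# One visual token per 28x28 pixel patch (Qwen2-VL-style encoder).
PATCH_PIXELS = 28 * 28

min_tokens_per_frame = 100       # per-frame floor
max_total_visual_tokens = 24576  # the "24k" visual-token budget

video_min_pixels = min_tokens_per_frame * PATCH_PIXELS
video_max_pixels = max_total_visual_tokens * PATCH_PIXELS

print(video_min_pixels)  # 78400
print(video_max_pixels)  # 19267584
```

These are exactly the VIDEO_MIN_PIXELS and VIDEO_MAX_PIXELS values exported in the training scripts.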
Key Features That Set LiveCC Apart
Real-Time Streaming Architecture
LiveCC's defining innovation is its ability to generate commentary while simultaneously ingesting video and audio streams. Traditional models wait for complete video segments before processing; LiveCC operates continuously. This streaming design reduces latency to near-zero levels, making it ideal for live broadcasts, interactive applications, and time-sensitive scenarios.
Novel Video-ASR Joint Training
The model doesn't just see videos—it learns from synchronized speech transcriptions using a custom streaming alignment strategy. The Live-WhisperX-526K dataset provides precisely timestamped ASR outputs that teach LiveCC to correlate spoken words with visual events. This joint training creates richer contextual understanding than vision-only approaches.
Scalable Training Pipeline
With support for DeepSpeed ZeRO-2 optimization and gradient accumulation across multiple nodes, LiveCC trains efficiently at scale. The repository includes meticulously tuned configurations for both pre-training and supervised fine-tuning. Researchers can replicate results using the exact hyperparameters that produced the SOTA model: learning rates of 2e-5 for pre-training and 1e-5 for SFT, cosine scheduling, and BF16 mixed precision.
Flexible Interface Options
LiveCC meets developers where they are. The Gradio demo provides an intuitive web interface with JavaScript timestamp monitoring for precise latency measurement. The CLI tool enables batch processing and integration into existing pipelines. For production deployments, the inference module supports custom video loaders and streaming protocols.
State-of-the-Art Performance
Benchmarks on LiveSports-3K and other evaluation sets demonstrate LiveCC's superiority. It outperforms previous video LLMs by significant margins in both streaming accuracy and offline comprehension. The model excels at temporal grounding, action recognition, and generating coherent, contextually appropriate commentary.
Comprehensive Data Ecosystem
The project provides unprecedented access to training data. The Live-CC-5M dataset contains 5 million video-commentary pairs for pre-training. The Live-WhisperX-526K dataset offers high-quality ASR-aligned annotations. These resources enable the community to build upon LiveCC's foundations and explore new applications.
Real-World Use Cases Where LiveCC Shines
Live Sports Broadcasting
Sports networks face immense pressure to provide commentary for thousands of events simultaneously. LiveCC can generate real-time play-by-play for lower-tier matches, regional games, or emerging sports that lack dedicated broadcast teams. The model understands game dynamics, player movements, and scoring events, producing commentary that captures the excitement and nuance of live competition. During pre-training on sports footage, it learns sport-specific terminology and typical play patterns.
Educational Content Creation
Online learning platforms can leverage LiveCC to automatically narrate instructional videos. As a teacher demonstrates a physics experiment or walks through code, LiveCC generates step-by-step explanations synchronized with visual actions. This capability scales content production dramatically, allowing educators to focus on teaching while AI handles narration. The streaming architecture ensures descriptions appear at the exact moment concepts are introduced.
Accessibility for Visually Impaired Users
Perhaps the most impactful application is real-time video description for accessibility. Visually impaired users can receive instant audio descriptions of visual content—whether it's a friend's social media video, a live stream, or a movie. LiveCC's low latency means users experience content simultaneously with sighted viewers, promoting inclusion and equal access to visual media.
Gaming and Esports
Game streaming platforms can integrate LiveCC to provide commentary for amateur streamers or automated highlight reels. The model recognizes game states, player achievements, and dramatic moments, generating engaging narration that enhances viewer experience. For esports tournaments with multiple concurrent matches, LiveCC can provide secondary commentary tracks or multilingual broadcasts.
Security and Surveillance Analysis
Security operations centers can deploy LiveCC to monitor camera feeds and generate textual logs of suspicious activities. Instead of relying solely on human operators, the system provides continuous natural language descriptions: "Person in red jacket loitering near entrance for 3 minutes" or "Vehicle parked in restricted zone." This creates searchable, actionable intelligence from raw video streams.
Step-by-Step Installation & Setup Guide
Environment Prerequisites
LiveCC requires a modern Python environment. Ensure you have Python 3.11 or newer installed. The project recommends a CUDA-enabled GPU with at least 24GB VRAM for inference and 80GB for training. For optimal performance, use NVIDIA Ampere architecture or newer to leverage BF16 and TF32 precision.
Core Installation
Start by installing PyTorch and the essential dependencies. The repository provides a streamlined pip installation process:
# Install PyTorch ecosystem (adjust for your CUDA version)
pip install torch torchvision torchaudio
# Install core dependencies with specific versions
git clone https://github.com/showlab/livecc.git
cd livecc
pip install "transformers>=4.52.4" accelerate deepspeed peft opencv-python decord datasets tensorboard gradio pillow-heif gpustat timm sentencepiece openai av==12.0.0 qwen_vl_utils liger_kernel numpy==1.24.4
# Install Flash Attention for memory-efficient processing
pip install flash-attn --no-build-isolation
# Install LiveCC utility package
pip install livecc-utils==0.0.2
Version Compatibility
The models were trained under specific versions: torch==2.6.0, transformers==4.50.0, and liger-kernel==0.5.5. While other versions may work, using these exact versions ensures reproducibility. Create a dedicated conda environment to avoid conflicts:
conda create -n livecc python=3.11
conda activate livecc
# Then run the pip installations above
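Note that the pip command above installs transformers>=4.52.4 while the reported training versions pin transformers==4.50.0, so it's worth verifying what actually landed in your environment. A minimal sketch of such a check (package pins from the text above; in a real environment you would build the installed-versions dict with importlib.metadata.version):

```python
# Versions the authors report training with.
PINNED = {"torch": "2.6.0", "transformers": "4.50.0", "liger-kernel": "0.5.5"}

def mismatches(pinned, installed):
    """Return {package: (pinned, installed)} for every package whose
    installed version differs from the pin."""
    return {
        name: (expected, installed.get(name))
        for name, expected in pinned.items()
        if installed.get(name) != expected
    }

# Illustrative snapshot of an environment; replace with
# {name: importlib.metadata.version(name) for name in PINNED}.
installed = {"torch": "2.6.0", "transformers": "4.52.4", "liger-kernel": "0.5.5"}
print(mismatches(PINNED, installed))  # {'transformers': ('4.50.0', '4.52.4')}
```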
Advanced Setup for Data Production
If you plan to replicate the data pipeline or train on custom datasets, install additional tools:
pip install insightface onnxruntime-gpu python_speech_features wavfile
These packages enable face detection, GPU-accelerated audio processing, and feature extraction from speech signals. They're optional for inference but essential for data preprocessing.
Hardware Verification
After installation, verify your GPU setup:
import torch
import gpustat
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"GPU Count: {torch.cuda.device_count()}")
print(f"Current GPU: {torch.cuda.current_device()}")
# Check GPU memory
stats = gpustat.new_query()
print(stats)
This confirms your environment can handle LiveCC's computational demands. For multi-GPU training, ensure all devices are visible and properly configured with NCCL backend.
REAL Code Examples from the Repository
Example 1: Launching the Interactive Gradio Demo
The Gradio interface provides the easiest way to experience LiveCC's capabilities. This command launches a web server with JavaScript timestamp monitoring for precise latency measurement:
# Launch Gradio demo with timestamp monitoring
python demo/app.py --js_monitor
# For high-latency environments, run without the monitor
python demo/app.py
Code Breakdown:
- demo/app.py contains the Gradio interface definition with video upload, streaming playback, and real-time commentary display
- --js_monitor enables client-side JavaScript that tracks video playback position and measures the delay between visual events and generated commentary
- The interface automatically loads the chenjoya/LiveCC-7B-Instruct model from HuggingFace
- Video frames are extracted using OpenCV and Decord libraries, then processed through the vision encoder
- Audio is transcribed using WhisperX integration, providing synchronized text input
- Generated commentary streams back to the UI with timestamps for alignment verification
Practical Implementation:
For production deployments, modify app.py to accept RTMP streams or WebRTC connections. The js_monitor feature is invaluable for debugging but adds overhead—disable it when latency is critical.
Example 2: Command-Line Interface for Batch Processing
The CLI tool enables programmatic access for processing video files or directories:
# Interactive CLI mode
python demo/cli.py
Code Breakdown:
- demo/cli.py implements a simple command-line interface that prompts for video file paths
- It uses the same inference engine as the Gradio demo but outputs commentary to stdout
- The script handles video loading, frame sampling, and tokenization automatically
- Commentary generation uses a greedy decoding strategy for speed
- Results can be piped to files or other processing tools
Advanced Usage: Modify the CLI to accept batch directories:
# Add to cli.py for batch processing.
# generate_commentary is a hypothetical stand-in for whatever
# inference entry point cli.py exposes.
import glob

video_files = glob.glob("/path/to/videos/*.mp4")
for video_path in video_files:
    commentary = generate_commentary(video_path)
    with open(f"{video_path}.txt", "w") as f:
        f.write(commentary)
Example 3: Pre-Training Configuration Script
The pre-training script demonstrates a sophisticated distributed-training setup. Here's the annotated configuration (the inline comments are explanatory only; strip them before running, since bash does not allow comments after line-continuation backslashes):
#!/bin/bash
# scripts/pt_local.sh - Pre-training configuration
# Token allocation settings
export VIDEO_MIN_PIXELS=78400 # 100*28*28: Minimum 100 visual tokens per frame
export FPS_MAX_FRAMES=480 # Maximum frames per video: at 2 fps sampling, 480 frames covers ~4 minutes
export VIDEO_MAX_PIXELS=19267584 # 24576*28*28: Cap total visual tokens at 24k
# Training hyperparameters
learning_rate=2e-5 # Lower LR for pre-training to prevent catastrophic forgetting
run_name="livecc_pretrain_24kx480x100_bs512lr$learning_rate"
# Launch distributed training
WANDB_PROJECT='joya.chen' \
TOKENIZERS_PARALLELISM=false \
torchrun --standalone --nproc_per_node=8 train.py \
--deepspeed ./scripts/deepspeed_zero2.json \ # Memory optimization across 8 GPUs
--output_dir checkpoints/$run_name \ # Checkpoint storage
--overwrite_output_dir True \ # Fresh start (set False to resume)
--run_name $run_name \ # WandB experiment tracking
--save_on_each_node True \ # Save checkpoints per node
--do_train True \ # Enable training mode
--eval_strategy no \ # Skip mid-training eval for speed
--per_device_train_batch_size 1 \ # Micro-batch per GPU
--gradient_accumulation_steps 64 \ # Effective batch = 64*8 = 512
--learning_rate $learning_rate \ # 2e-5 for stable pre-training
--warmup_ratio 0.03 \ # 3% warmup steps
--optim adamw_torch \ # AdamW optimizer
--lr_scheduler_type cosine \ # Cosine decay schedule
--num_train_epochs 1 \ # Single epoch on large dataset
--logging_steps 10 \ # Log every 10 steps
--save_steps 1000 \ # Checkpoint every 1000 steps
--bf16 True \ # BF16 mixed precision (Ampere+)
--tf32 True \ # TF32 tensor cores
--gradient_checkpointing True \ # Trade compute for memory
--pretrained_model_name_or_path Qwen/Qwen2-VL-7B \ # Base model
--annotation_paths datasets/live_cc_5m_with_seeks.jsonl \ # Training data
--dataloader_num_workers 16 \ # Parallel data loading
--freeze_modules visual \ # Freeze vision encoder
--use_liger_kernel True \ # Optimized attention kernels
--report_to wandb # Weights & Biases logging
Key Insights:
- Token budgeting is crucial: 24k visual tokens + 8k language tokens = 32k context length
- DeepSpeed ZeRO-2 shards optimizer states across GPUs, enabling training on consumer hardware
- Gradient accumulation achieves large effective batch sizes without requiring massive GPU memory
- Freezing the visual encoder during pre-training prevents degradation of learned representations
- Liger kernel provides 2-3x speedup in attention computation compared to native PyTorch
Example 4: Supervised Fine-Tuning Script
The SFT script refines the pre-trained model on high-quality commentary data. As with pre-training, the inline comments are annotations to strip before running:
#!/bin/bash
# scripts/sft_local.sh - Supervised Fine-Tuning
# Same token budget as pre-training
export VIDEO_MIN_PIXELS=78400
export FPS_MAX_FRAMES=480
export VIDEO_MAX_PIXELS=19267584
# SFT uses slightly higher LR for adaptation
learning_rate=1e-5
run_name="livecc_sft_24k480x100_live526k+llava178k+hound+onevision_lr$learning_rate"
# Multi-dataset training
WANDB_PROJECT='joya.chen' \
TOKENIZERS_PARALLELISM=false \
torchrun --standalone --nproc_per_node=8 train.py \
--deepspeed ./scripts/deepspeed_zero2.json \
--output_dir checkpoints/$run_name \
--overwrite_output_dir True \
--run_name $run_name \
--save_on_each_node True \
--do_train True \
--eval_strategy steps \ # Enable evaluation during SFT
--eval_steps 500 \ # Evaluate every 500 steps
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 64 \
--learning_rate $learning_rate \ # 1e-5 for fine-tuning
--warmup_ratio 0.03 \
--optim adamw_torch \
--lr_scheduler_type cosine \
--num_train_epochs 3 \ # More epochs for SFT
--logging_steps 10 \
--save_steps 1000 \
--bf16 True \
--tf32 True \
--gradient_checkpointing True \
--pretrained_model_name_or_path checkpoints/livecc_pretrain_24kx480x100_bs512lr2e-5 \ # Start from pre-trained
--annotation_paths datasets/live_whisperx_526k.jsonl datasets/llava_video_178k.jsonl \ # Multiple datasets
--dataloader_num_workers 16 \
--freeze_modules None \ # Unfreeze all modules for SFT
--use_liger_kernel True \
--report_to wandb
Training Strategy:
- Learning rate reduction from 2e-5 to 1e-5 prevents overfitting on smaller SFT datasets
- Multiple annotation paths combine Live-WhisperX-526K with LLaVA-Video-178K for diversity
- Unfreezing all modules allows the vision encoder to adapt to commentary-specific features
- Evaluation during training monitors for overfitting and selects best checkpoint
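Mixing several annotation files is handled inside train.py, but the idea can be sketched as a simple round-robin over JSONL streams (illustrative only; the repository's sampler may weight or shuffle datasets differently):

```python
import json
import itertools

def interleave_jsonl(streams):
    """Round-robin across several iterables of JSONL lines so that no
    single dataset dominates any stretch of training."""
    for line in itertools.chain.from_iterable(itertools.zip_longest(*streams)):
        if line is not None:
            yield json.loads(line)

# Toy stand-ins for live_whisperx_526k.jsonl and llava_video_178k.jsonl
live = ['{"video": "live_0001.mp4"}', '{"video": "live_0002.mp4"}']
llava = ['{"video": "llava_0001.mp4"}']
order = [r["video"] for r in interleave_jsonl([live, llava])]
print(order)  # ['live_0001.mp4', 'llava_0001.mp4', 'live_0002.mp4']
```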
Advanced Usage & Best Practices
Memory Optimization Techniques
LiveCC's 7B parameter model demands careful memory management. Enable gradient checkpointing to trade computation for memory—this reduces VRAM usage by 60% at the cost of 20% slower training. Use Flash Attention with --use_liger_kernel True for optimal attention computation. For inference on limited GPUs, quantize the model to 4-bit using bitsandbytes:
import torch
from transformers import AutoModel, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModel.from_pretrained(
    "chenjoya/LiveCC-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto"
)
Multi-Node Training Scaling
The provided scripts use --nproc_per_node=8 for single-node training. For multi-node clusters, modify the torchrun command:
# Multi-node launch (run on every node, changing --node_rank)
torchrun \
    --nnodes=4 \
    --node_rank=0 \
    --master_addr="192.168.1.100" \
    --master_port=29500 \
    --nproc_per_node=8 train.py ...

# --nnodes: total number of nodes
# --node_rank: this node's rank (0 to 3)
# --master_addr / --master_port: rendezvous endpoint on the master node
Adjust --gradient_accumulation_steps to maintain the effective batch size of 512 across nodes.
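The rescaling can be made explicit: effective batch = nodes x GPUs per node x micro-batch x accumulation steps, so the accumulation needed at any cluster size is a simple division:

```python
def accumulation_steps(target_effective_batch, nnodes, nproc_per_node,
                       per_device_batch):
    """Gradient-accumulation steps needed to keep the same effective
    batch when scaling out. Requires the target to divide evenly."""
    world = nnodes * nproc_per_node * per_device_batch
    steps, rem = divmod(target_effective_batch, world)
    assert rem == 0, "effective batch not divisible by world size"
    return steps

print(accumulation_steps(512, 1, 8, 1))  # 64 (single node, as in the scripts)
print(accumulation_steps(512, 4, 8, 1))  # 16 (4 nodes x 8 GPUs)
```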
Inference Optimization
For production deployments, implement continuous batching to handle multiple video streams efficiently. Cache the vision encoder outputs for repeated frames and use speculative decoding with a smaller draft model to accelerate text generation. The livecc-utils package includes optimized video loaders that pre-extract frames and audio asynchronously.
Custom Dataset Preparation
When creating your own training data, follow the JSONL format used in Live-CC-5M:
{"video": "path/to/video.mp4", "commentary": "The player dribbles past defender...", "seek": [10.5, 15.2]}
The seek field marks the temporal segment relevant to the commentary. Use WhisperX for accurate speech transcription timestamps.
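A small validator can catch malformed records before a long training run. The field names below follow the example record above; the actual Live-CC-5M schema may differ, so treat this as a sketch:

```python
import json

REQUIRED = {"video", "commentary", "seek"}

def validate_record(line):
    """Check one JSONL line against the format sketched above."""
    rec = json.loads(line)
    missing = REQUIRED - rec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    start, end = rec["seek"]
    if not 0 <= start < end:
        raise ValueError("seek must be [start, end] with 0 <= start < end")
    return rec

rec = validate_record(
    '{"video": "path/to/video.mp4", '
    '"commentary": "The player dribbles past defender...", '
    '"seek": [10.5, 15.2]}'
)
print(rec["seek"])  # [10.5, 15.2]
```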
Comparison: LiveCC vs. Alternative Approaches
| Feature | LiveCC | Traditional Pipeline | Other Video LLMs |
|---|---|---|---|
| Latency | <500ms streaming | 5-30 seconds batch | 2-10 seconds |
| Training Method | Joint video-ASR streaming | Separate ASR + LLM | Vision-only |
| Model Size | 7B parameters (efficient) | 70B+ (cascaded) | 7-13B |
| Data Efficiency | 5M video-commentary pairs | Requires labeled transcripts | 1-2M videos |
| Real-Time Capability | Native streaming | No | Limited |
| Benchmark Performance | SOTA on LiveSports-3K | Moderate | Good (offline) |
| Hardware Requirements | 1x A100 for inference | Multiple GPUs | 1-2x A100 |
| Open Source | Full pipeline | Partial | Models only |
Why LiveCC Wins: Traditional pipelines separate ASR and video understanding, creating error propagation and latency. Other video LLMs process videos offline, making them unsuitable for live applications. LiveCC's unified streaming architecture eliminates these bottlenecks, delivering commentary that's both immediate and accurate. The open-source release of training scripts and datasets provides unmatched reproducibility.
Trade-offs: LiveCC's 7B model may generate less detailed commentary than larger alternatives in offline scenarios. The streaming constraint requires careful token budgeting. However, for real-time applications, these are acceptable compromises.
Frequently Asked Questions
What makes LiveCC different from GPT-4V or other multimodal models?
LiveCC is specifically optimized for streaming video commentary. While GPT-4V excels at image understanding, it processes videos as sequences of frames with significant latency. LiveCC's architecture maintains temporal state across frames and generates commentary incrementally, achieving true real-time performance.
What hardware do I need to run LiveCC?
For inference: NVIDIA GPU with 24GB+ VRAM (RTX 4090, A5000, or better). For training: 8x A100 80GB recommended. The model runs on consumer GPUs with quantization but expects 24GB for full precision. CPU inference is possible but impractical due to speed.
How real-time is "real-time" commentary?
LiveCC achieves 400-600ms end-to-end latency on A100 GPUs. This includes video frame encoding, audio transcription, and text generation. The JavaScript monitor in the Gradio demo measures this precisely. For comparison, human commentators have a 200-300ms reaction time.
Can I fine-tune LiveCC on my own videos?
Absolutely! The repository includes complete SFT scripts. Prepare your data in the Live-WhisperX format (video paths + commentary + timestamps) and run scripts/sft_local.sh with your annotation paths. Freeze the visual encoder initially to speed up training.
Is LiveCC suitable for non-English commentary?
The current release focuses on English, but the Qwen2-VL foundation supports multiple languages. Fine-tuning on non-English datasets would require preparing corresponding ASR transcriptions. The architecture is language-agnostic.
What about licensing and commercial use?
LiveCC inherits the Qwen2-VL license. Check the repository's LICENSE file for details. The training datasets have their own terms. Commercial use is generally permitted but requires compliance with the model card and attribution requirements.
How do I handle videos longer than 4 minutes?
The FPS_MAX_FRAMES=480 setting covers roughly 4 minutes of video at 2 fps sampling. For longer videos, implement sliding-window processing with overlap. The model's context window can maintain coherence across segments. Alternatively, sample frames at lower rates to cover longer durations.
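The sliding-window idea can be sketched as follows. The 240-second window matches the ~4-minute budget of FPS_MAX_FRAMES=480 assuming 2 fps sampling (the sampling rate is an inference from the numbers above, not stated in the scripts); the overlap value is an illustrative choice:

```python
def sliding_windows(duration_s, window_s=240.0, overlap_s=30.0):
    """Split a long video into overlapping (start, end) segments, each
    short enough to fit the model's frame budget."""
    step = window_s - overlap_s
    windows, start = [], 0.0
    while start + window_s < duration_s:
        windows.append((start, start + window_s))
        start += step
    windows.append((start, duration_s))  # final, possibly shorter, window
    return windows

print(sliding_windows(600.0))  # [(0.0, 240.0), (210.0, 450.0), (420.0, 600.0)]
```

Carrying the last few seconds of commentary from one window into the next window's prompt helps keep the narration coherent across segment boundaries.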
Conclusion: Why LiveCC Deserves Your Attention
LiveCC isn't just another video LLM—it's a fundamental reimagining of how AI understands and describes visual content in real time. By solving the streaming challenge, ShowLab has opened doors to applications that were previously impossible: truly accessible video platforms, scalable live sports coverage, and interactive educational tools that respond instantly to visual input.
The technical achievements are impressive. State-of-the-art performance on both streaming and offline benchmarks proves that speed doesn't sacrifice quality. The open-source release of training pipelines, datasets, and evaluation frameworks sets a new standard for reproducibility in video AI research. The meticulous engineering—from token budgeting to Liger kernel integration—demonstrates production-ready thinking.
For developers, now is the time to experiment. The Gradio demo provides immediate gratification. The CLI enables pipeline integration. The training scripts invite customization. Whether you're building accessibility tools, content platforms, or research prototypes, LiveCC offers a solid foundation.
The AI community thrives on innovation that democratizes technology. LiveCC does exactly that, bringing real-time video understanding from research labs to your GPU. Clone the repository, launch the demo, and experience the future of video commentary today. The code is waiting at https://github.com/showlab/livecc—don't just read about it, build with it.
Your next breakthrough application starts here.