VIDIM AI Broadcast Intelligence
VIDIM is a real-time broadcast analysis system built for winter sliding sports (bobsled, luge, skeleton). Running Qwen3-VL-8B on an RTX 5080 with 8 AI agents coordinating over IRC, it automatically:
- Detects when an athlete run begins and ends
- Classifies every scene type (start house, push start, run, finish, replay, standings, reactions — 16 types total)
- Reads on-screen text (athlete names, country codes, speeds, times)
- Scores and ranks clips by editorial quality
- Validates everything against quality thresholds (0.80 on pass 1, 0.70 on pass 2)
- Builds complete athlete rosters from visual data alone
- Collects training data for QLoRA fine-tuning
All of this happens in real time. Each agent has its own personality and its own IRC nick, and can be chatted with directly.
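As a rough illustration of agents coordinating over IRC, the sketch below formats and parses a standard IRC PRIVMSG line. Only the `PRIVMSG <target> :<text>` wire format comes from the IRC protocol; the channel name `#vidim`, the host `localhost`, and the `heat=2` payload layout are assumptions made for this example, not details of VIDIM's actual messages.

```python
def format_privmsg(target: str, text: str) -> str:
    """Build one IRC PRIVMSG line (without the trailing CRLF)."""
    if "\r" in text or "\n" in text:
        raise ValueError("IRC messages must be a single line")
    return f"PRIVMSG {target} :{text}"


def parse_privmsg(line: str):
    """Parse ':nick!user@host PRIVMSG target :text' into (nick, target, text).

    Returns None for any line that is not a prefixed PRIVMSG.
    """
    if not line.startswith(":"):
        return None
    prefix, _, rest = line[1:].partition(" ")
    command, _, params = rest.partition(" ")
    if command != "PRIVMSG":
        return None
    target, sep, text = params.partition(" :")
    if not sep:
        return None
    nick = prefix.split("!", 1)[0]
    return nick, target, text


# Hypothetical channel and payload; "scan_complete" is the event name the
# roster agent is described as reacting to.
outgoing = format_privmsg("#vidim", "scan_complete heat=2")
incoming = ":scene!scene@localhost " + outgoing
parsed = parse_privmsg(incoming)
```

Keeping the payload as plain text means a human in the channel can read the same messages the agents exchange.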
Pipeline Agents
The core analysis team. Five agents ingest MXF broadcast footage and extract structured intelligence from raw video.
@scene (Agent, Scene Analyst): The first set of eyes. Processes every frame through CLIP ViT-B/32 and classifies it into one of 16 scene types: start house, push start, active run, finish area, replay, standings board, athlete reaction, and more. GPU-accelerated batch processing (batch size 32). This classification drives everything downstream.
Trigger: Runs continuously on incoming frames
@ocr (Agent, OCR Reader): Handles precise text extraction from broadcast overlays. Reads speed readouts, split times, full-screen graphics tables, athlete names, country codes, and ranking positions using RapidOCR (PP-OCRv4 detection and recognition) with CUDA acceleration via onnxruntime-gpu.
Trigger: Activated on frames containing text overlays
@scout (Agent, Vision Scout): Reads the actual content of broadcast frames using the Qwen3-VL-8B vision-language model. Where Scene Analyst classifies the scene type, Scout reads what's in it: athletes, equipment, track conditions, camera angles. Selectively processes 30–50 key frames per heat.
Trigger: Activated by Scene Analyst classifications
@roster (Agent, Roster Keeper): Builds and maintains the complete athlete roster for each event. Cross-references OCR text with VLM readings and matches bib numbers and visual features. Resolves naming conflicts across 24 national federations. Knows every IBSF World Cup athlete.
Trigger: Activates on scan_complete
@clipdir (Agent, Clip Director): The brain of clip extraction. Takes all intelligence from Scene Analyst, Scout, and OCR Reader and decides where each athlete's run begins and ends. Proposes clip boundaries with optimal IN/OUT points. Clip types: full_run, run_segment, push_start, finish, crash, replay, transition, ceremony.
Trigger: Continuous orchestration
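A clip proposal can be pictured as a small validated record. The sketch below is illustrative only: the class name `ClipProposal` and its fields are hypothetical, and only the eight clip type names are taken from the description above.

```python
from dataclasses import dataclass

# The eight clip types listed for the Clip Director.
CLIP_TYPES = frozenset({
    "full_run", "run_segment", "push_start", "finish",
    "crash", "replay", "transition", "ceremony",
})


@dataclass(frozen=True)
class ClipProposal:
    """Hypothetical record for one proposed clip (field names are assumed)."""
    athlete: str
    clip_type: str
    in_point: float   # seconds from the start of the source footage
    out_point: float  # seconds from the start of the source footage

    def __post_init__(self):
        # Reject clip types outside the documented set and inverted ranges.
        if self.clip_type not in CLIP_TYPES:
            raise ValueError(f"unknown clip type: {self.clip_type!r}")
        if self.out_point <= self.in_point:
            raise ValueError("OUT point must come after IN point")

    @property
    def duration(self) -> float:
        return self.out_point - self.in_point


# Placeholder athlete name and timings, chosen only for the example.
proposal = ClipProposal("Example Athlete", "full_run", 125.0, 183.5)
```

Validating at construction time means a malformed proposal fails early, before it reaches scoring or auditing.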
Quality & Editorial
The gatekeepers. Judge scores clips; Auditor validates them. Nothing ships without their approval.
@judge (Agent, Editorial Judge): The quality gatekeeper. Scores every clip from 0.0 to 1.0 across editorial factors. Crashes score highest, clean runs moderate, graphics low. Learns from user corrections: when you override a clip decision, Judge remembers and adjusts. Over time, it develops editorial judgment specific to your preferences.
Trigger: Auto-activates on clip proposals + user corrections
@auditor (Agent, Quality Auditor): Validates analysis completeness and checks clip quality thresholds. Pass 1 threshold: 0.80, Pass 2: 0.70. Verifies clip boundaries, labels, scores, and coverage. Flags missing data and issues across the entire pipeline output. The final checkpoint before anything ships.
Trigger: Runs on all proposed clips
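The two thresholds can be read as a tiered check. The sketch below assumes one plausible interpretation, which the description does not spell out: a clip scoring at least 0.80 clears pass 1, a clip scoring at least 0.70 clears pass 2, and anything lower is rejected. The function name and the result labels are hypothetical.

```python
PASS_1_THRESHOLD = 0.80
PASS_2_THRESHOLD = 0.70


def audit_score(score: float) -> str:
    """Classify a 0.0-1.0 clip score against the two thresholds.

    The tiered reading (pass 1, then pass 2, then reject) is an assumption.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= PASS_1_THRESHOLD:
        return "pass_1"
    if score >= PASS_2_THRESHOLD:
        return "pass_2"
    return "rejected"


results = {s: audit_score(s) for s in (0.92, 0.80, 0.75, 0.40)}
```

Note that a score exactly on a threshold counts as passing in this sketch; the real system may treat the boundary differently.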
System & Operations
The learner. Collects corrections to make the system smarter over time.
@trainer (Agent, Training Collector): Captures correction pairs during operation for QLoRA fine-tuning of Qwen3-VL-8B. At 100+ pairs, triggers a training run (rank 32, alpha 64). Collects frames, ROI annotations, boundary corrections, and state labels. The self-improving loop.
Trigger: Continuous during operation
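The collect-then-train trigger might look roughly like the sketch below. The 100-pair trigger and the rank 32 / alpha 64 settings come from the description; the `CorrectionBuffer` class, its method names, and the string pair format are assumptions for illustration.

```python
from dataclasses import dataclass, field

TRIGGER_PAIR_COUNT = 100                    # "At 100+ pairs, triggers a training run"
QLORA_SETTINGS = {"rank": 32, "alpha": 64}  # values stated for the training run


@dataclass
class CorrectionBuffer:
    """Hypothetical in-memory store of (model_output, user_correction) pairs."""
    pairs: list = field(default_factory=list)

    def add(self, model_output: str, correction: str) -> bool:
        """Store one pair; return True once enough pairs exist to train."""
        self.pairs.append((model_output, correction))
        return len(self.pairs) >= TRIGGER_PAIR_COUNT


buffer = CorrectionBuffer()
# Placeholder pairs; real pairs would carry frames, ROIs, and boundary fixes.
ready_flags = [buffer.add(f"label_{i}", f"fixed_{i}") for i in range(100)]
```

With these settings the standard LoRA scaling factor, alpha divided by rank, works out to 2.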
AI Models
The brains behind the agents. Three models running on GPU, each with its own IRC presence and personality.
@qwen (Model): The brain powering the entire VIDIM pipeline. Qwen3-VL-8B is a hybrid vision-language model with Gated Delta Networks (linear + full attention). Loaded via llama-cpp-python as a GGUF file with Q4_K_M quantization, using ~4.7GB VRAM on the RTX 5080. Powers Vision Scout, Roster Keeper, Clip Director, Editorial Judge, Quality Auditor, and all IRC chat responses.
Trigger: Always loaded — used by 5 agents + IRC chat
@clip (Model): Creates 512-dimensional visual embeddings for scene classification. A trained classification head (512→256→128→16) maps embeddings to 16 IBSF scene types. GPU-accelerated batch processing. The fastest model in the stack: it processes frames in milliseconds.
Trigger: Used by Scene Analyst on every frame
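To give a sense of how small the classification head is, the sketch below counts its trainable parameters. It assumes plain fully connected layers with biases and no normalization layers, which the description does not confirm; only the layer widths are taken from the text.

```python
LAYER_SIZES = [512, 256, 128, 16]  # 512 -> 256 -> 128 -> 16


def dense_parameter_count(sizes):
    """Count weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


total = dense_parameter_count(LAYER_SIZES)
print(total)  # 166288
```

Under these assumptions the head has about 166 thousand parameters, which is tiny next to the CLIP encoder that produces the embeddings.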
@rapidocr (Model): PP-OCRv4 text detection and recognition running on CUDA via onnxruntime-gpu. Reads athlete names, country codes, timing displays, scoreboards, and any text visible in broadcast frames. Handles multiple languages and font styles found in international sports broadcasts.
Trigger: Used by OCR Reader on text-containing frames
Analysis Pipeline
      MXF FEED
          |
          v
+-------------------+     +-------------------+
|   SCENE ANALYST   | --> |   VISION SCOUT    |
|  frame classif.   |     |  boundary detect  |
+-------------------+     +-------------------+
          |                         |
          v                         v
+-------------------+     +-------------------+
|    OCR READER     |     |   ROSTER KEEPER   |
|  text extraction  |     | athlete database  |
+-------------------+     +-------------------+
          |                         |
          +----------+ +------------+
                     | |
                     v v
            +-------------------+
            |   CLIP DIRECTOR   |
            |   run packaging   |
            +-------------------+
                      |
              +-------+-------+
              |               |
              v               v
     +---------------+ +---------------+
     |   EDITORIAL   | |    QUALITY    |
     |     JUDGE     | |    AUDITOR    |
     | clip scoring  | | completeness  |
     +---------------+ +---------------+
              |               |
              +-------+-------+
                      |
                      v
            +-------------------+
            | TRAINING COLLECTOR|
            |  sample capture   |
            +-------------------+
                      |
                      v
              STRUCTURED CLIPS
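The diagram can also be read as a dependency graph. The sketch below encodes its arrows as a mapping from each stage to the stages it consumes, then derives one valid execution order with Python's standard `graphlib` module. Treating every arrow as a strict dependency is an assumption; the agents themselves are described as running continuously and reacting to IRC events.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each stage maps to the set of stages whose output it consumes,
# read off the arrows in the diagram.
PIPELINE = {
    "Scene Analyst": set(),
    "Vision Scout": {"Scene Analyst"},
    "OCR Reader": {"Scene Analyst"},
    "Roster Keeper": {"Vision Scout"},
    "Clip Director": {"OCR Reader", "Roster Keeper"},
    "Editorial Judge": {"Clip Director"},
    "Quality Auditor": {"Clip Director"},
    "Training Collector": {"Editorial Judge", "Quality Auditor"},
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(PIPELINE).static_order())
```

Stages with no path between them, such as Editorial Judge and Quality Auditor, may appear in either relative order, which reflects that they can run side by side.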