Case 04

Multi-model AI video localization pipeline

Four AI services in one coherent pipeline. Source video in, localized derivative video out, for under $1.

Role: RTP Agency · Timeline: 3 months in production · Status: Active deployment

01 Transcription: Whisper Large (self-hosted)
02 Rewrite: Gemini (preserve meaning)
03 Voiceover: multilingual TTS
04 Clip matching: Vertex AI + Qdrant
05 Assembly: FFmpeg (GPU acceleration)

The business problem

A media localization agency wanted to adapt source videos into new variations for specific markets — different languages, audiences, and niches. The requirement: create genuinely derivative videos (newly assembled footage, rewritten scripts, new voiceover), not 1:1 translations, so that each result would be a distinct asset for its market.

This was a custom commercial request, not an off-the-shelf product — nothing like it existed on the market. The client wanted to test a content-scaling hypothesis: can AI adapt source material into new variations with enough quality and volume to make the operation economically justified?

What was hard

The naive approach — translate, re-voice, and republish the same footage — doesn't work for two reasons:

  • Originality — reusing the source footage and structure yields a near-duplicate; the result must be a genuinely new asset
  • Distinctiveness at scale — each result must differ visually and structurally, not copy the original

The system had to output videos with:

  • New footage (assembled from a library, not from the source)
  • Rewritten scripts (preserve the meaning, change the wording)
  • New voiceover in the target language
  • And still be semantically coherent — the new footage must genuinely match what the new audio says

Architecture: a two-stage pipeline

Stage 1: populating the library

First, the system builds a searchable visual library:

  • The user bulk-submits video URLs via a Telegram bot (processed through a queue)
  • Videos are downloaded on the server, then segmented by scene detection (cut detection, not fixed intervals)
  • Each segment receives a semantic embedding via Vertex AI
  • Embeddings and segments are stored locally in a Qdrant vector database
  • Each segment also receives a JSON description of its content (which later improves matching accuracy)

An architectural decision worth noting: the previous implementation stored embeddings in Firebase, which was expensive and more than this workload needed. We moved everything to a local Qdrant instance, completely removing recurring database costs. Now only embedding creation costs money; storage and retrieval carry no recurring fee.
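
The segmentation step in Stage 1 can be sketched as follows. This is a minimal illustration, not the production code: it assumes an upstream step has already produced a per-frame difference score, and the names `frame_diffs`, `Segment`, and `split_on_cuts`, as well as the 0.5 threshold, are hypothetical. It shows how cut detection yields variable-length segments rather than fixed intervals.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One library segment, bounded by two detected cuts (times in seconds)."""
    start: float
    end: float


def split_on_cuts(frame_diffs, fps, threshold=0.5):
    """Split a video into segments wherever the frame difference spikes.

    frame_diffs[i] is an assumed score in [0, 1] describing how much frame i+1
    differs from frame i; a score above `threshold` is treated as a cut.
    """
    total_frames = len(frame_diffs) + 1
    # A cut detected at index i means a new segment starts at frame i+1.
    cut_frames = [i + 1 for i, d in enumerate(frame_diffs) if d > threshold]
    boundaries = [0] + cut_frames + [total_frames]
    return [
        Segment(start=a / fps, end=b / fps)
        for a, b in zip(boundaries, boundaries[1:])
    ]


# Six frames at 2 fps with one hard cut between frames 2 and 3.
segments = split_on_cuts([0.1, 0.05, 0.9, 0.1, 0.2], fps=2)
print(segments)
```

Each resulting segment would then be embedded and stored with its JSON description, as the list above describes.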

Stage 2: generating a new video

When the client wants to make a new video:

  • They send a link to the source + configuration (language, voice, music, emoji, subtitles — all selected via the bot)
  • FFmpeg extracts the audio from the source video
  • The audio is transcribed via self-hosted Whisper Large (locally, to avoid API costs at scale)
  • The transcript is rewritten by Gemini — preserving meaning while changing the wording
  • The rewritten script is translated into the target language
  • Multilingual TTS generates voiceover in the selected voice/language
  • Clips from the library are matched to the new audio segments by embedding proximity (embeddings from Vertex AI, nearest-neighbor search in Qdrant)
  • FFmpeg assembles the final video: matched clips + new audio + selected enhancements (background music, sounds, memes, subtitles)
  • The finished video is sent to the client in Telegram
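
The audio-extraction step can be sketched as an FFmpeg invocation built in Python. The flags are standard FFmpeg options, but the specific choices here (mono, 16 kHz, 16-bit PCM WAV, the file names, and the function names) are assumptions about what a speech transcriber commonly takes as input, not details from the project.

```python
import subprocess


def build_extract_audio_cmd(video_path, audio_path, sample_rate=16000):
    """Return the FFmpeg argument list that drops video and writes mono PCM audio."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample (16 kHz is an assumption)
        "-c:a", "pcm_s16le",      # 16-bit PCM, i.e. plain WAV
        audio_path,
    ]


def extract_audio(video_path, audio_path):
    """Run FFmpeg; raises CalledProcessError if FFmpeg exits with an error."""
    subprocess.run(build_extract_audio_cmd(video_path, audio_path), check=True)


cmd = build_extract_audio_cmd("source.mp4", "source.wav")
print(" ".join(cmd))
```

Building the argument list separately from running it keeps the command easy to log and to test without FFmpeg installed.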

Key technical decisions

Why Vertex AI for embeddings

OpenAI did not offer API access to the needed video embedding model at the time. Local alternatives were expensive to operate. Vertex AI gave the best price/quality balance for production.

Why self-hosted Whisper

At scale, API costs for transcription become significant. Self-hosting on a local GPU completely removed recurring transcription costs.

Why multilingual TTS through a reseller

Instead of a direct subscription with hard plan limits, we used a reseller on a pay-as-you-go model. Same quality, no subscription lock-in, easier to scale costs.

Why Qdrant locally

A vector database on a local server removed recurring cloud database costs. The entire library lived on a single home server (10th-gen i5 + GTX 1070).

Cost engineering

A per-video cost breakdown for a 20-minute result:

  • Embedding creation (one-time per source video): negligible
  • Whisper transcription: free (self-hosted)
  • Gemini rewrite + translation: ~cents
  • Multilingual voiceover (via reseller): the main cost item
  • Storage: free (local)
  • Processing: electricity only

Total per video: under $1, even for long 20-minute content.

It's precisely this cost structure that makes localization at scale economically justified — manual localization of a 20-minute video would take a designer/editor 8–15 hours of work.

Production environment

  • Deployed on a home server (10th-gen i5 + GTX 1070, 16 GB RAM)
  • A single Telegram bot interface — the client sends URLs, gets finished videos
  • FFmpeg with GPU acceleration for video assembly
  • Throughput: ~2 videos per hour for 20-minute clips (the bottleneck is assembly)
  • Scalable design: the architecture supports parallel deployment across several GPU nodes (top-tier cards aren't needed — a 2060/3060 is enough for this load)
  • Embedding namespaces by category (for example, separate libraries for cooking and gaming content) — keeps semantic matching relevant within domains
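
Category namespaces amount to searching only within one domain's vectors. The sketch below illustrates the idea with a plain in-memory dictionary and cosine similarity. It is a stand-in for the Qdrant setup, not its API, and the class name, category names, segment ids, and three-dimensional vectors are made up for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class NamespacedLibrary:
    """Toy segment library: one independent vector list per category."""

    def __init__(self):
        self._items = defaultdict(list)  # category -> [(segment_id, vector)]

    def add(self, category, segment_id, vector):
        self._items[category].append((segment_id, vector))

    def search(self, category, query, top_k=3):
        """Return the top_k (segment_id, score) pairs from one category only."""
        scored = [
            (sid, cosine(query, vec))
            for sid, vec in self._items.get(category, [])
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


library = NamespacedLibrary()
library.add("cooking", "pan_closeup", [0.9, 0.1, 0.0])
library.add("cooking", "chopping", [0.7, 0.3, 0.0])
library.add("gaming", "boss_fight", [0.9, 0.1, 0.0])

# The gaming clip has an identical vector but is never considered for a cooking query.
hits = library.search("cooking", [1.0, 0.0, 0.0], top_k=2)
print([sid for sid, _ in hits])
```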

Challenges solved

1. Embedding quality for visual matching

The first implementation produced poor semantic matches — unrelated footage was matched to new audio about topic X. The solution: we augmented each segment's embedding with a JSON description of its content, sharply increasing matching relevance.
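
The text does not say how the JSON description was combined with the embedding, so the following is only one plausible illustration. It blends an embedding similarity score with a simple keyword overlap between the script and the description. The `tags` field, the 0.6 weight, the function names, and the sample values are all hypothetical.

```python
import json


def keyword_overlap(script_text, description):
    """Fraction of the segment's description tags that appear in the script text."""
    words = set(script_text.lower().split())
    tags = [t.lower() for t in description.get("tags", [])]
    if not tags:
        return 0.0
    return sum(1 for t in tags if t in words) / len(tags)


def blended_score(visual_similarity, script_text, description, weight=0.6):
    """Mix embedding similarity with description overlap (weight is a free parameter)."""
    overlap = keyword_overlap(script_text, description)
    return weight * visual_similarity + (1 - weight) * overlap


script = "slice the onion and heat the pan"
candidates = [
    # (segment_id, embedding similarity to the audio segment, JSON description)
    ("seg_a", 0.80, json.loads('{"tags": ["car", "road"]}')),
    ("seg_b", 0.75, json.loads('{"tags": ["onion", "pan"]}')),
]

ranked = sorted(
    candidates,
    key=lambda c: blended_score(c[1], script, c[2]),
    reverse=True,
)
print([segment_id for segment_id, _, _ in ranked])
```

In this toy case the segment with the slightly lower embedding similarity ranks first because its description actually matches the script, which is the kind of correction the description layer is meant to provide.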

2. Pacing and rhythm in assembled videos

Auto-assembled videos initially looked unnatural — segments too short (under 1.5s) or too long (over 15s), cuts at awkward moments. We built constraints into the assembly logic: minimum/maximum segment duration, avoiding adjacent repeated segments, audio level normalization.
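
The duration and no-repeat constraints can be sketched as a selection pass over ranked candidates. This is a simplified illustration under assumed data shapes (each audio slot gets a best-first list of clip ids with durations); the function and clip names are hypothetical, and audio level normalization is not covered here.

```python
MIN_SEGMENT_S = 1.5   # thresholds taken from the text above
MAX_SEGMENT_S = 15.0


def pick_clips(ranked_candidates_per_slot):
    """Choose one clip per audio slot, honoring duration and no-repeat rules.

    Each slot is a list of (clip_id, duration_seconds) ordered best match first.
    Returns a list of clip ids, with None where no candidate qualifies.
    """
    chosen = []
    previous = None
    for candidates in ranked_candidates_per_slot:
        pick = None
        for clip_id, duration in candidates:
            if not (MIN_SEGMENT_S <= duration <= MAX_SEGMENT_S):
                continue  # too short or too long to look natural
            if clip_id == previous:
                continue  # avoid showing the same clip twice in a row
            pick = clip_id
            break
        chosen.append(pick)
        previous = pick
    return chosen


slots = [
    [("kitchen", 4.0), ("market", 6.0)],
    [("kitchen", 4.0), ("market", 6.0)],   # best match repeats, so fall back
    [("flash", 0.8), ("kitchen", 4.0)],    # best match is too short
]
print(pick_clips(slots))
```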

3. Migration from Firebase to local Qdrant

The legacy architecture stored embeddings in Firebase with recurring costs. We moved the entire pipeline to local Qdrant, completely removing ongoing database costs.

4. Whisper translation quality

Standard Whisper translations sometimes produced clumsy results. We added Gemini as a rewrite layer, which improved both meaning preservation and language naturalness in the target language.

The result

< $1: per 20-minute video
4+: AI services orchestrated
~2/hr: throughput for 20-min clips
3 mo: active operation in production

End-to-end automation — the client sends a URL in Telegram and receives a publish-ready video with no intermediate manual steps. Cost under $1 per 20-minute clip. The scalable architecture is built to expand across multiple GPU instances and category libraries.

Technology stack

Language: Python
Video processing: FFmpeg (GPU acceleration)
Embeddings: Vertex AI
Transcription: Whisper Large (self-hosted)
LLM rewrite: Gemini
Speech synthesis: Multilingual TTS
Vector database: Qdrant (self-hosted)
Interface: Telegram Bot

What this demonstrates

  • Multi-model AI orchestration — we combined 4+ AI services into one coherent pipeline
  • Semantic understanding of video content — embedding-based matching for video-audio coherence
  • End-to-end product engineering — from a raw URL to a finished result, fully automated
  • Cost optimization through architecture — strategic decisions on what to self-host versus run via API, keeping per-video cost under $1 even on a premium AI stack
  • Custom solution development — we built something that didn't exist as a product, for a specific commercial task

Similar challenge?

Tell us what you're building — we'd be glad to talk it through.
