Case 04

Multi-model AI video localization pipeline

Four AI services in one coherent pipeline. Source video in, localized derivative video out, for under $1.

Role: RTP Agency · Timeline: 3 months in production · Status: Active deployment

Transcription: Whisper Large (self-hosted)

Rewrite: Gemini (preserves meaning)

Voiceover: multilingual TTS

Clip matching: Vertex AI + Qdrant

Assembly: FFmpeg (GPU acceleration)

The business problem

A media localization agency wanted to adapt source videos into new variations for specific markets — different languages, audiences, and niches. The requirement: create genuinely derivative videos (newly assembled footage, rewritten scripts, new voiceover), not 1:1 translations, so that each result would be a distinct asset for its market.

This was a custom commercial request, not an off-the-shelf product — nothing like it existed on the market. The client wanted to test a content-scaling hypothesis: can AI adapt source material into new variations with enough quality and volume to make the operation economically justified?

What was hard

The naive approach — translate, re-voice, and republish the same footage — doesn't work for two reasons:

Originality — reusing the source footage and structure yields a near-duplicate; the result must be a genuinely new asset
Distinctiveness at scale — each result must differ visually and structurally, not copy the original

The system had to output videos with:

New footage (assembled from a library, not from the source)
Rewritten scripts (preserve the meaning, change the wording)
New voiceover in the target language
And still be semantically coherent — the new footage must genuinely match what the new audio says

Architecture: a two-stage pipeline

Stage 1: populating the library

First, the system builds a searchable visual library:

The user bulk-submits video URLs via a Telegram bot (processed through a queue)
Videos are downloaded on the server, then segmented by scene detection (cut detection, not fixed intervals)
Each segment receives a semantic embedding via Vertex AI
Embeddings and segments are stored locally in a Qdrant vector database
Each segment also receives a JSON description of its content (which later improves matching accuracy)
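
A minimal sketch of the cut-based segmentation step, assuming FFmpeg's `select` scene filter together with `showinfo` is used to find cuts. The threshold of 0.4 and all function names here are illustrative assumptions, not the production values.

```python
import re
import subprocess

# showinfo prints one line per selected frame, including "pts_time:<seconds>".
PTS_TIME = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_cuts(showinfo_log):
    """Extract cut timestamps (seconds) from FFmpeg showinfo output."""
    return sorted(float(m.group(1)) for m in PTS_TIME.finditer(showinfo_log))


def detect_cuts(path, threshold=0.4):
    """Run FFmpeg scene detection on a file; threshold is an assumed value."""
    cmd = [
        "ffmpeg", "-hide_banner", "-i", path,
        "-vf", f"select='gt(scene,{threshold})',showinfo",
        "-an", "-f", "null", "-",
    ]
    # FFmpeg writes filter logs to stderr, not stdout.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_cuts(result.stderr)


def cuts_to_segments(cuts, duration):
    """Turn cut timestamps into (start, end) segments covering the video."""
    bounds = [0.0] + [c for c in cuts if 0.0 < c < duration] + [duration]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]
```

Each `(start, end)` pair would then be cut out as its own clip and passed on for embedding.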

An architectural decision worth noting: the previous implementation stored embeddings in Firebase, which was expensive and more than the workload needed. We moved everything to a local Qdrant instance, completely removing recurring database costs. Now only embedding creation costs money; storage and retrieval carry no recurring fees.

Stage 2: generating a new video

When the client wants to make a new video:

They send a link to the source + configuration (language, voice, music, emoji, subtitles — all selected via the bot)
FFmpeg extracts the audio from the source video
The audio is transcribed via self-hosted Whisper Large (locally, to avoid API costs at scale)
The transcript is rewritten by Gemini — preserving meaning while changing the wording
The rewritten script is translated into the target language
Multilingual TTS generates voiceover in the selected voice/language
Clips from the library are matched to the new audio segments by embedding proximity (Vertex AI embeddings, searched in Qdrant)
FFmpeg assembles the final video: matched clips + new audio + selected enhancements (background music, sounds, memes, subtitles)
The finished video is sent to the client in Telegram
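
The matching step above can be illustrated with a small, dependency-free sketch. It assumes every audio segment and every library clip already has an embedding in the same vector space; in production the nearest-neighbor search would be delegated to Qdrant rather than done in a Python loop. The function names and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_clips(audio_vectors, library):
    """For each audio segment vector, pick the closest clip id.

    library maps clip id -> embedding vector.
    """
    return [
        max(library, key=lambda clip_id: cosine(vec, library[clip_id]))
        for vec in audio_vectors
    ]


# Toy two-dimensional vectors, for illustration only.
library = {"frying_pan": [1.0, 0.0], "game_controller": [0.0, 1.0]}
audio = [[0.9, 0.1], [0.2, 0.8]]
matches = match_clips(audio, library)
```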

Key technical decisions

Why Vertex AI for embeddings

OpenAI did not offer API access to the needed video embedding model at the time. Local alternatives were expensive to operate. Vertex AI gave the best price/quality balance for production.

Why self-hosted Whisper

At scale, API costs for transcription become significant. Self-hosting on a local GPU completely removed recurring transcription costs.
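
As a rough sketch of what the local transcription step could look like, the helpers below build the two commands involved: FFmpeg extracting a mono 16 kHz WAV, and the open-source `whisper` command-line tool transcribing it. Whether production used the CLI or the Python API is not stated, so treat the CLI choice, the helper names, and the file paths as assumptions.

```python
import subprocess


def extract_audio_cmd(video_path, wav_path):
    """FFmpeg command: drop video, downmix to mono, resample to 16 kHz."""
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn", "-ac", "1", "-ar", "16000", wav_path]


def whisper_cmd(wav_path, out_dir, model="large"):
    """openai-whisper CLI command writing a JSON transcript to out_dir."""
    return ["whisper", wav_path, "--model", model,
            "--output_format", "json", "--output_dir", out_dir]


def transcribe(video_path, wav_path, out_dir):
    """Run both steps; requires ffmpeg and whisper on PATH."""
    subprocess.run(extract_audio_cmd(video_path, wav_path), check=True)
    subprocess.run(whisper_cmd(wav_path, out_dir), check=True)
```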

Why multilingual TTS through a reseller

Instead of a direct subscription with hard plan limits, we used a reseller on a pay-as-you-go model. Same quality, no subscription lock-in, easier to scale costs.

Why Qdrant locally

A vector database on a local server removed recurring cloud database costs. The entire library lived on a single home server (10th-gen i5 + GTX 1070).
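
A hedged sketch of how segments might be written to a local Qdrant instance with the `qdrant-client` package. The collection name, payload fields, URL, and ID scheme are illustrative assumptions; the vector size is taken from the data rather than hard-coded, since the embedding model is not specified here.

```python
import uuid


def point_id(source_url, start):
    """Deterministic UUID, so re-ingesting a video overwrites its old points."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source_url}#{start:.3f}"))


def segment_payload(source_url, start, end, description):
    """Metadata stored next to the vector (field names are hypothetical)."""
    return {"source_url": source_url, "start": start,
            "end": end, "description": description}


def store_segments(segments, vectors, collection="clips",
                   url="http://localhost:6333"):
    """Upsert segments; each segment is a dict with the payload keys above."""
    # Imported lazily so the helpers above work without qdrant-client installed.
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    client = QdrantClient(url=url)
    if not client.collection_exists(collection):
        client.create_collection(
            collection_name=collection,
            vectors_config=VectorParams(size=len(vectors[0]),
                                        distance=Distance.COSINE),
        )
    points = [
        PointStruct(id=point_id(seg["source_url"], seg["start"]),
                    vector=vec, payload=segment_payload(**seg))
        for seg, vec in zip(segments, vectors)
    ]
    client.upsert(collection_name=collection, points=points)
```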

Cost engineering

A per-video cost breakdown for a 20-minute result:

Embedding creation (one-time per source video): negligible
Whisper transcription: free (self-hosted)
Gemini rewrite + translation: on the order of cents
Multilingual voiceover (via reseller): the main cost item
Storage: free (local)
Processing: electricity only

Total per video: under $1, even for long 20-minute content.

It's precisely this cost structure that makes localization at scale economically justified — manual localization of a 20-minute video would take a designer/editor 8–15 hours of work.

Production environment

Deployed on a home server (10th-gen Intel Core i5 + GTX 1070, 16 GB RAM)
A single Telegram bot interface — the client sends URLs, gets finished videos
FFmpeg with GPU acceleration for video assembly
Throughput: ~2 videos per hour for 20-minute clips (the bottleneck is assembly)
Scalable design: the architecture supports parallel deployment across several GPU nodes. Top-tier cards aren't needed; an RTX 2060 or 3060 is enough for this load
Embedding namespaces by category (for example, separate libraries for cooking and gaming content) — keeps semantic matching relevant within domains
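
For a sense of what the throughput figure means at scale, here is a back-of-the-envelope estimate. It assumes each added GPU node sustains the same ~2 videos per hour and that nodes run around the clock; real scaling may be less than linear.

```python
def videos_per_day(nodes, per_node_per_hour=2.0, hours_per_day=24.0):
    """Estimated daily output, assuming linear scaling across nodes."""
    return nodes * per_node_per_hour * hours_per_day


single_node = videos_per_day(1)   # the current home server
three_nodes = videos_per_day(3)   # hypothetical three-node deployment
```

Under these assumptions one node yields about 48 twenty-minute videos per day, and three nodes about 144.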

Challenges solved

1. Embedding quality for visual matching

The first implementation produced poor semantic matches: audio about a given topic was often paired with unrelated footage. The solution: we augmented each segment's embedding with a JSON description of its content, sharply increasing matching relevance.
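
One way such augmentation could work is to score each clip on both its visual embedding and an embedding of its description, then blend the two. The sketch below illustrates that idea only; the blending formula, the 0.5 weight, and the field names are assumptions, not the production logic.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def blended_score(query_vec, clip, weight=0.5):
    """Weighted mix of visual similarity and description similarity."""
    visual = cosine(query_vec, clip["visual_vec"])
    described = cosine(query_vec, clip["description_vec"])
    return weight * visual + (1.0 - weight) * described


def rank_clips(query_vec, clips, weight=0.5):
    """Return clips sorted from best to worst blended score."""
    return sorted(clips, key=lambda c: blended_score(query_vec, c, weight),
                  reverse=True)


# Two clips that look alike visually but are described differently.
clips = [
    {"id": "b", "visual_vec": [1.0, 0.0], "description_vec": [0.0, 1.0]},
    {"id": "a", "visual_vec": [1.0, 0.0], "description_vec": [1.0, 0.0]},
]
ranked = rank_clips([1.0, 0.0], clips)
```

In this toy case the description breaks the tie between two visually identical clips, which is the kind of disambiguation the JSON descriptions are meant to provide.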

2. Pacing and rhythm in assembled videos

Auto-assembled videos initially looked unnatural — segments too short (under 1.5s) or too long (over 15s), cuts at awkward moments. We built constraints into the assembly logic: minimum/maximum segment duration, avoiding adjacent repeated segments, audio level normalization.
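
The duration and repetition constraints can be sketched as a simple selection pass. Here each slot in the timeline has a ranked list of candidate clips; the 1.5 s and 15 s bounds come from the text above, while the data layout and function name are hypothetical. Audio normalization is not covered in this sketch.

```python
def assemble_plan(slots, min_len=1.5, max_len=15.0):
    """Pick one clip per slot, respecting pacing constraints.

    slots: list of candidate lists, each [(clip_id, duration_seconds), ...]
    ranked best first. Returns [(clip_id, used_duration), ...].
    """
    plan = []
    previous = None
    for candidates in slots:
        for clip_id, duration in candidates:
            # Skip clips that are too short or repeat the previous segment.
            if clip_id == previous or duration < min_len:
                continue
            # Long clips are trimmed to the maximum segment length.
            plan.append((clip_id, min(duration, max_len)))
            previous = clip_id
            break
    return plan


slots = [
    [("a", 4.0)],
    [("a", 3.0), ("b", 20.0)],
    [("c", 1.0), ("b", 5.0), ("d", 2.0)],
]
plan = assemble_plan(slots)
```

A slot with no acceptable candidate is simply skipped in this version; a fuller implementation would need a fallback, such as widening the search.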

3. Migration from Firebase to local Qdrant

The legacy architecture stored embeddings in Firebase with recurring costs. We moved the entire pipeline to local Qdrant, completely removing ongoing database costs.

4. Whisper translation quality

Standard Whisper translations sometimes produced clumsy results. We added Gemini as a rewrite layer, which improved both meaning preservation and language naturalness in the target language.

The result

< $1

Per 20-minute video

4

AI services orchestrated

~2/hr

Throughput for 20-min clips

3 mo

Active operation in production

End-to-end automation — the client sends a URL in Telegram and receives a publish-ready video with no intermediate manual steps. Cost under $1 per 20-minute clip. The scalable architecture is built to expand across multiple GPU instances and category libraries.

Technology stack

Language	Python
Video processing	FFmpeg (GPU acceleration)
Embeddings	Vertex AI
Transcription	Whisper Large (self-hosted)
LLM rewrite	Gemini
Speech synthesis	Multilingual TTS
Vector database	Qdrant (self-hosted)
Interface	Telegram Bot

What this demonstrates

Multi-model AI orchestration — we combined 4+ AI services into one coherent pipeline
Semantic understanding of video content — embedding-based matching for video-audio coherence
End-to-end product engineering — from a raw URL to a finished result, fully automated
Cost optimization through architecture — strategic decisions on what to self-host versus run via API, keeping per-video cost under $1 even on a premium AI stack
Custom solution development — we built something that didn't exist as a product, for a specific commercial task

