Case Study AI • Media & Entertainment

From Manual Metadata to Intelligent Video Search

Turning 2.5 million videos into searchable, time-addressable assets with a cost-efficient hybrid AI pipeline.

Client: Kurator • Nimia · Jan 2026
2.5M
Videos processed & indexed
~100×
Cost reduction vs. cloud-only
95%+
Transcription word accuracy
30K+
Hours of long-form content
~0
Manual tagging remaining
Near-complete elimination. Teams do a quick spot-check per batch.
hrs → min
Time saved per upload batch
From hours of manual tagging to minutes of quality review.
$2.5K
Cost per additional GPU unit
Scalable throughput by adding on-prem GPUs vs. six-figure contracts.
Kurator
Media & Entertainment · Broadcast Footage · Video Licensing
Kurator • Nimia
Video licensing & discovery

A video discovery platform with millions of high-value assets

Kurator is a video licensing and discovery platform within Nimia, serving major media and entertainment buyers with high-value archival and broadcast footage — including news, interviews, and historical content.

The platform's core value is helping customers find the right moment inside long video assets, then enabling easy purchase with confidence in rights management. But at scale, that promise depended entirely on metadata quality.

2.5M+
Videos in catalog
30K+
Hours of long-form content

Search and tagging had become a serious bottleneck

At millions of videos, several cloud-first and vendor-based approaches were explored and rejected: cost, accuracy, and data-transfer trade-offs made each one impractical at Kurator's scale.

challenge 01

Manual tagging didn't scale

Teams spent hours per batch entering transcripts, metadata, keywords, and compliance flags — often with inconsistent results across the catalog.

Operations
challenge 02

Thin metadata weakened search

Incomplete, inconsistent metadata limited search quality and undermined buyer confidence at the point of purchase.

Search
challenge 03

Finding moments meant manual scrubbing

Locating a short clip inside hour-long broadcast footage required scrubbing through video by hand.

Discovery
challenge 04

Cloud-only AI was cost-prohibitive

Cloud-only and vendor pipelines meant six-figure contracts at Kurator's scale, with data-transfer and accuracy trade-offs on top.

Cost

A hybrid AI video intelligence pipeline built for scale

AccelOne designed and built a multi-model hybrid execution architecture that balances performance with economics — running heavy inference on-premises while using cloud services selectively and only when necessary.

Intelligent video processing platform

A hybrid architecture balancing performance, security, and cost.

Up to ~1000× cost reduction in high-volume paths (three orders of magnitude)

Video library
Millions of videos · Legacy to 8K

On-premises · GPU machines · Heavy inference
Whisper (large) · Gemma 3 · OpenCV
Transcription · Video analysis · Metadata generation · Frame parsing
~$2.5K–$3K per GPU unit

Selective · Cloud · AWS · Orchestration & delivery
Rekognition · Staging · Delivery
Selective API usage only · Celebrity / brand recognition
Rekognition invoked only when faces are detected

Output

Transcript
VTT · Time-coded

Tags & metadata
Auto-generated · Structured

Celebrity detection
High-confidence · Gated

Searchable index
Frame-accurate navigation

Key Results
Millions of videos processed · Scalable · Secure · Cost-effective at massive scale

~100×
Overall cost reduction

95%+
Transcription accuracy

Millions
Videos processed

7-step AI pipeline

Each component is modular and optimized for cost and reliability at multi-million-video scale. The system analyzes every video, extracts meaningful signals, and makes long-form content searchable down to the exact moment.

Step 01

Hybrid execution model

Heavy video inference runs on on-prem GPU machines, while AWS handles orchestration, staging, and delivery. Avoids runaway cloud costs at scale.

Architecture
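As an illustrative sketch only (not Kurator's production code), the hybrid split can be thought of as a simple routing rule: GPU-heavy inference stays on-premises at a fixed hardware cost, while only narrow, high-value lookups go out to AWS. Task and queue names here are hypothetical.

```python
# Hypothetical sketch of the hybrid routing rule; task names are illustrative.
ON_PREM_TASKS = {"transcription", "frame_analysis", "face_detection"}
CLOUD_TASKS = {"celebrity_recognition"}  # Rekognition: gated and batched

def route(task: str) -> str:
    """Return where a pipeline task should execute."""
    if task in ON_PREM_TASKS:
        return "on-prem-gpu-queue"  # heavy inference: fixed hardware cost
    if task in CLOUD_TASKS:
        return "aws-api"            # selective calls: variable cost, minimized
    raise ValueError(f"unknown task: {task}")
```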
Step 02

Cost-aware frame sampling

Instead of analyzing every frame, the pipeline samples one frame every two seconds — selected through testing to balance coverage, accuracy, and cost.

1 frame / 2 sec
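A minimal sketch of this sampling step with OpenCV (function and parameter names are ours, added for illustration):

```python
import cv2

def sample_frames(video_path: str, every_seconds: float = 2.0):
    """Yield (timestamp_seconds, frame) pairs, one frame every N seconds.

    Seeking by frame index avoids decoding every frame of a long asset
    (actual efficiency depends on the codec).
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(1, int(round(fps * every_seconds)))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            yield idx / fps, frame
    cap.release()
```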
Step 03

On-prem vision analysis

Sampled frames are analyzed using Gemma 3 running locally on GPUs. The model generates concise on-screen descriptions that feed metadata and summaries.

Gemma 3
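The case study does not specify the serving stack for Gemma 3. As one common way to run it on local GPUs, here is a hedged sketch using an Ollama server; the model tag and prompt are assumptions:

```python
import ollama  # assumes a local Ollama server hosting a Gemma 3 vision model

def describe_frame(frame_path: str) -> str:
    """Ask the local vision-language model for a one-sentence description."""
    response = ollama.chat(
        model="gemma3",  # model tag depends on the local deployment
        messages=[{
            "role": "user",
            "content": "Describe what is on screen in one concise sentence.",
            "images": [frame_path],
        }],
    )
    return response["message"]["content"]
```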
Step 04

Gated celebrity detection

Face detection runs first using OpenCV. Only when faces are present does the system invoke AWS Rekognition for celebrity identification.

OpenCV
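A minimal sketch of the gating pattern, assuming a Haar-cascade face detector (the case study does not say which OpenCV detector is used) and the boto3 RecognizeCelebrities API:

```python
import cv2
import boto3

rekognition = boto3.client("rekognition")  # only called when a face is found
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def identify_celebrities(frame) -> list:
    """Run the cheap local gate first; hit the paid API only on a face hit."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return []  # no faces: skip the external call entirely
    ok, jpeg = cv2.imencode(".jpg", frame)
    result = rekognition.recognize_celebrities(Image={"Bytes": jpeg.tobytes()})
    return [c["Name"] for c in result.get("CelebrityFaces", [])]
```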
Step 05

Inference optimization

Frames are resized and combined into mosaic batches before being sent to Rekognition, cutting external API calls by up to 50× while preserving detection accuracy.

50× fewer API calls
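Illustratively, a 7×7 mosaic covers 49 sampled frames per API call, roughly the 50× reduction cited. A sketch, with tile size and grid layout as assumed parameters:

```python
import cv2
import numpy as np

def build_mosaic(frames, tile=320, grid=(7, 7)):
    """Pack up to grid[0] * grid[1] resized frames into one image.

    One Rekognition call then covers ~49 frames instead of one each;
    detections are mapped back to source frames by tile position.
    """
    rows = []
    for r in range(grid[0]):
        row = []
        for c in range(grid[1]):
            i = r * grid[1] + c
            if i < len(frames):
                row.append(cv2.resize(frames[i], (tile, tile)))
            else:
                row.append(np.zeros((tile, tile, 3), dtype=np.uint8))  # pad
        rows.append(np.hstack(row))
    return np.vstack(rows)
```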
Step 06

Time-coded transcription

Whisper (large) runs on the same on-prem GPUs, producing time-coded VTT transcripts that power keyword search, frame-accurate navigation, and downstream metadata extraction.

Whisper (large)
Step 07

Quality & reliability controls

The pipeline filters blank frames, removes blurred images, normalizes transcription artifacts, and includes modular retries and validation for production reliability.

Production-grade
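A minimal sketch of the blank- and blur-frame gate, using the common variance-of-Laplacian sharpness measure (the case study names the filtering but not the method; thresholds here are illustrative):

```python
import cv2

def is_usable(frame, blur_threshold=100.0, blank_threshold=10.0) -> bool:
    """Drop blank and blurred frames before they reach the models.

    Thresholds are illustrative; in practice they would be tuned per catalog.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if gray.std() < blank_threshold:  # near-uniform image: likely blank
        return False
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    return sharpness >= blur_threshold  # low variance: likely blurred
```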

Real outcomes, measurable impact

Manual tagging is nearly eliminated, results are reliable enough for day-to-day production use, and the cost is dramatically lower than any traditional cloud-based approach.

~100×
Overall cost reduction

Compared to cloud-only or vendor pipelines. Throughput scales by adding low-cost GPU machines at $2.5K–$3K each — shifting video intelligence from a capital project into a repeatable operational capability.

95%+
Transcription word accuracy

Consistently exceeds 95% in spot-checked samples, approaching human-level performance under good audio conditions. Powers keyword search, time-based navigation, and downstream metadata extraction.

hrs→min
Reduction in tagging time

Manual metadata entry reduced from hours to minutes per batch. Teams perform a quick spot-check and add only information requiring human judgment, freeing them to focus on quality.

~1000×
Cost reduction in high-volume paths

Achieved in specific high-volume processing paths, including the 50× reduction in external API calls to AWS Rekognition from mosaic batching.

Cost comparison by architecture approach

Relative cost normalized to cloud-only baseline (100%)

Cloud-only pipeline
100%
Vendor contract
~88%
Hybrid on-prem pipeline (this project)
~1%

Open-source intelligence, cloud applied selectively

The pipeline combines open-source models with selective cloud services, running primarily on on-prem GPU infrastructure to deliver production-grade accuracy while avoiding the cost, lock-in, and unpredictability of cloud-only approaches.

Whisper

Speech-to-text transcription. Time-coded VTT output for keyword search and frame-accurate navigation.

Open Source
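As a sketch of how time-coded VTT output can be produced with the open-source openai-whisper package (the production pipeline's exact invocation is not published; this shows the general shape):

```python
import whisper  # openai-whisper, running on the on-prem GPUs

def transcribe_to_vtt(video_path: str, vtt_path: str) -> None:
    """Produce a time-coded WebVTT transcript for keyword search."""
    model = whisper.load_model("large")
    result = model.transcribe(video_path)

    def ts(seconds: float) -> str:
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"  # HH:MM:SS.mmm

    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in result["segments"]:
            f.write(f"{ts(seg['start'])} --> {ts(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")
```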

Gemma 3

On-prem vision-language model for frame analysis. Generates on-screen descriptions that feed metadata and summaries.

Open Source

OpenCV

Lightweight real-time face detection. Acts as a gate to prevent unnecessary stream processing and API costs.

Open Source

Rekognition

Celebrity identification. Used selectively — only invoked after face detection confirms a face is present in frame.

Cloud · Selective

This hybrid approach delivers production-grade accuracy while avoiding the cost, lock-in, and unpredictability of cloud-only architectures — making large-scale video intelligence economically sustainable at 2.5M+ video scale.

A searchable, time-addressable catalog spanning millions of hours

What was previously locked behind sparse tagging and manual review is now discoverable at scale — while balancing cost efficiency with production-grade reliability.

before

Hours per batch of manual tagging with inconsistent results across the catalog

Incomplete metadata limited search quality and buyer confidence at point of purchase

Manual scrubbing required to find short clips inside hour-long broadcast footage

Six-figure vendor contracts for cloud-only AI at Kurator's scale — cost-prohibitive

after

Minutes of spot-check per batch. Tagging is automatic, consistent, and complete.

95%+ transcription accuracy powering search, navigation, and metadata across the full catalog

Frame-accurate navigation — jump directly to any moment in millions of hours of content

~100× cost reduction. GPU on-prem at $2.5K–$3K per unit for scalable, repeatable throughput

The catalog is now discoverable by:

What's said

Keyword search via time-coded VTT transcription

What's shown

Frame-level descriptions from Gemma 3 vision analysis

Who appears

Celebrity identification via gated AWS Rekognition

When it happens

Timecode-level navigation inside any long-form asset
