Why I chose a multi-model architecture instead of a single AI API

AccelOne developer Sebastian Sznur shares how he designed a multi-model AI architecture to make 40,000+ hours of video searchable while reducing costs by 95% and protecting data privacy.

Sebastian Sznur Jun 11, 2026

SOFTWARE DEVELOPMENT • AI • DIGITAL TRANSFORMATION

The challenge

One of the most interesting challenges I've worked on at AccelOne involved making a massive video library searchable with AI.

The platform contained more than 40,000 hours of video content, and users needed to find very specific moments within that library. Sometimes they were looking for a particular spoken phrase. Other times they needed to find a brand appearing on screen, a public figure, a specific scene, or a visual event.

The challenge wasn't just storing the content. It was making it discoverable.

At the time, all metadata was created manually. Titles, descriptions, tags, people, camera information, and other details were entered by humans. That process was expensive, time-consuming, and impossible to scale as the library continued to grow.

The goal was clear:

Use AI to automatically generate metadata, detect visual and audio events, identify sensitive content, and dramatically improve search capabilities.

The challenge was doing all of that without creating an unsustainable operating cost or compromising data privacy.

Why it was more complex than it looked

My first instinct was simple.

Why not use one of the large AI video analysis platforms already available in the market?

Several vendors offered transcription, object detection, moderation, face recognition, and search capabilities through a single API. On paper, it seemed like the fastest path to a solution.

Then I started evaluating the constraints.

→ The first issue was scale.

Processing a few hours of video through a cloud API is affordable. Processing 40,000 hours is a completely different conversation. Once I projected the cost across the entire library and future uploads, the numbers became difficult to justify.

→ The second issue was privacy.

The video content itself was the product. Uploading every video to a third-party service introduced a level of exposure that I wasn't comfortable with. Even if the providers offered strong security guarantees, I wanted to minimize the amount of sensitive content leaving our environment.

At that point, the challenge stopped being an AI problem and became an architecture problem.

How could I achieve cloud-level content understanding while maintaining control over both cost and privacy?

Challenge snapshot

Goal:

Make 40,000+ hours of video searchable with AI.

Constraints:

Massive processing volume
Cost efficiency
Data privacy
Search accuracy
Continuous scalability

How I approached the decision

I evaluated two primary approaches.

Option 1: A single cloud-based AI platform

This option was appealing for obvious reasons.

Most of the infrastructure already existed. Many of the required features were available out of the box. Development would be faster and operational complexity would be lower.

But the tradeoffs were significant:

High long-term operating costs
Dependence on a single vendor
Raw video content leaving our environment
Limited control over model behavior and optimization

For a smaller content library, I might have chosen this approach.

For this scale, I couldn't justify it.

Option 2: A multi-model local architecture

The alternative was building a system composed of specialized models, most of them running locally.

This approach introduced more engineering complexity. There were more components to manage, more integrations to maintain, and more opportunities for things to go wrong.

But it offered two critical advantages:

Control over costs
Control over data

The more I analyzed the problem, the more it became clear that privacy wasn't simply a feature requirement. It was an architectural constraint.

Once I accepted that, the decision became much easier.

Decision framework

Criteria	Single Cloud API	Multi-Model Local Architecture
Cost Control	Low	High
Data Privacy	Medium	High
Vendor Dependency	High	Low
Scalability	Medium	High
Engineering Complexity	Low	High

I intentionally chose additional engineering complexity because it solved the constraints that mattered most.

The solution

I built a multi-model pipeline where each component was responsible for a specific task.

Instead of relying on a single general-purpose system, I optimized each stage independently.

Audio processing

I started by extracting audio tracks from videos and processing them locally using Whisper.

This approach gave me accurate transcriptions while avoiding the need to transfer large video files through external services.

Working with audio-only files also reduced processing overhead and helped minimize hallucinations in the transcripts.
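A minimal sketch of what this stage might look like, assuming ffmpeg is installed and the open-source openai-whisper Python package is used. The article names only Whisper, so the function names, the "base" model size, and the 16 kHz mono WAV format (the rate Whisper resamples to internally) are illustrative assumptions.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_audio_cmd(video_path: str, audio_path: str) -> list[str]:
    """Build an ffmpeg command that drops the video stream and writes mono 16 kHz audio."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", video_path,
        "-vn",           # discard the video stream
        "-ac", "1",      # downmix to mono
        "-ar", "16000",  # resample to 16 kHz
        audio_path,
    ]


def extract_audio(video_path: str, out_dir: str) -> str:
    """Write <video stem>.wav into out_dir and return its path."""
    audio_path = str(Path(out_dir) / (Path(video_path).stem + ".wav"))
    subprocess.run(build_ffmpeg_audio_cmd(video_path, audio_path), check=True)
    return audio_path


def transcribe(audio_path: str, model_name: str = "base") -> list[dict]:
    """Return timestamped transcript segments produced by a local Whisper model."""
    import whisper  # lazy import: requires `pip install openai-whisper`

    model = whisper.load_model(model_name)  # model size is an assumption
    result = model.transcribe(audio_path)
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]
```

Keeping the segment timestamps is what would later let a search hit point at a specific moment in the video rather than at the whole file.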

Visual analysis

Processing every frame of every video would have been prohibitively expensive.

Instead, I sampled frames strategically.
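The article does not describe the sampling strategy, so the following is only a sketch of the simplest version: one frame from the middle of each fixed-length window. The function name and the 10-second interval are hypothetical.

```python
import math


def sample_timestamps(duration_s: float, interval_s: float = 10.0) -> list[float]:
    """Return one timestamp (in seconds) at the midpoint of each interval-long window."""
    if duration_s <= 0 or interval_s <= 0:
        return []
    n_windows = math.ceil(duration_s / interval_s)
    midpoints = ((i + 0.5) * interval_s for i in range(n_windows))
    # The last window may be shorter than interval_s; skip its midpoint if it
    # falls beyond the end of the video.
    return [round(t, 3) for t in midpoints if t < duration_s]


print(sample_timestamps(60))  # [5.0, 15.0, 25.0, 35.0, 45.0, 55.0]
```

Under these assumptions, a one-hour video at 30 frames per second goes from 108,000 frames to 360 sampled frames. A scene-change detector could replace the fixed interval without changing the rest of the pipeline.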

Each selected frame was analyzed by local multimodal models, with Gemma 12B served through Ollama.

The models generated:

Visual descriptions
Summaries
Scene context
Content warnings
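As a rough illustration of how one frame could be sent to a local model, here is a sketch against Ollama's REST endpoint on its default port. The prompt wording, the JSON keys, and the function names are assumptions made for this example, and the model tag is left as a parameter because the article does not give the exact one.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# Hypothetical prompt; the keys mirror the four outputs listed above.
PROMPT = (
    "Describe this video frame. Reply with a JSON object containing the keys "
    "'description', 'summary', 'scene_context' and 'content_warnings' (a list)."
)


def build_payload(image_bytes: bytes, model: str, prompt: str = PROMPT) -> dict:
    """Build the request body for a single-image, non-streaming generation call."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,   # return one complete response instead of chunks
        "format": "json",  # ask Ollama to constrain the output to valid JSON
    }


def describe_frame(image_path: str, model: str) -> dict:
    """Send one frame to the local model and parse its JSON answer."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), model)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # The generated text is in the "response" field; parsing can still fail
    # if the model returns malformed JSON, so callers should handle that.
    return json.loads(body["response"])
```

Because the request never leaves localhost, the frame stays inside the environment, which is the property the architecture was built around.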

One of the most important lessons came from improving the inputs rather than changing the models.

I discovered that removing blurry frames, overlays, and low-information images dramatically improved output quality.

The better the input, the better the results.
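The article does not say how poor frames were detected. One common heuristic for blur is the variance of the Laplacian: sharp images have strong edges and therefore a high variance, while blurry or nearly uniform frames score low. The sketch below implements it in plain Python on a grayscale image given as a list of rows; the threshold is a hypothetical value that would need tuning on real footage, and detecting overlays would need a separate check.

```python
def laplacian_variance(gray: list[list[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over the interior pixels of a grayscale image."""
    height = len(gray)
    width = len(gray[0]) if height else 0
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    if not responses:
        return 0.0  # image too small to have interior pixels
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def is_usable_frame(gray: list[list[float]], threshold: float = 100.0) -> bool:
    """Keep a frame only if it has enough edge detail (threshold is an assumption)."""
    return laplacian_variance(gray) >= threshold
```

With OpenCV available, the same score is usually computed as cv2.Laplacian(gray, cv2.CV_64F).var(), which is far faster than the pure-Python loop shown here.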

Sensitive content detection

For NSFW detection, I iteratively refined prompts until false positives reached an acceptable level.

For face detection, I used OpenCV.

Celebrity recognition presented a different challenge.

I decided to use an external service for that specific task, but I wanted to minimize both cost and exposure.

Instead of sending large numbers of images individually, I created composite mosaics containing multiple frames in a single request.

This reduced the number of requests by approximately 90%.
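The arithmetic behind that figure can be sketched as follows. The article does not state how many frames went into each mosaic; ten tiles per mosaic is an assumption chosen because it yields a 90% reduction, and the grid layout and function names are hypothetical.

```python
import math


def plan_mosaics(frame_ids: list[str], tiles_per_mosaic: int = 10,
                 columns: int = 5) -> list[list[dict]]:
    """Group frames into mosaics and record each frame's tile position.

    The (row, col) position is what lets a recognition result located in the
    mosaic be mapped back to the original frame.
    """
    mosaics = []
    for start in range(0, len(frame_ids), tiles_per_mosaic):
        batch = frame_ids[start:start + tiles_per_mosaic]
        mosaics.append([
            {"frame_id": frame_id, "row": i // columns, "col": i % columns}
            for i, frame_id in enumerate(batch)
        ])
    return mosaics


def request_reduction(n_frames: int, tiles_per_mosaic: int) -> float:
    """Fraction of external requests saved compared with one request per frame."""
    if n_frames <= 0:
        return 0.0
    n_requests = math.ceil(n_frames / tiles_per_mosaic)
    return 1 - n_requests / n_frames


frames = [f"frame_{i}" for i in range(100)]
print(len(plan_mosaics(frames)))   # 10 requests instead of 100
print(request_reduction(100, 10))  # 0.9
```

The actual image compositing (pasting each frame into its tile) could be done with any imaging library; only the layout bookkeeping is shown here.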

More importantly, I never needed to send raw video files outside the system.

Brand detection

Brand recognition required more experimentation than any other component.

I tested multiple providers, compared results, and eventually combined outputs with custom filtering logic.

No single service consistently outperformed the others, so a hybrid approach delivered the best results.
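The filtering logic itself is not described in the article. One plausible shape for it, shown purely as a sketch, is a vote across providers: keep a brand if enough providers agree, or if a single provider is very confident. The provider names, thresholds, and function name below are hypothetical.

```python
from collections import defaultdict


def merge_brand_detections(
    provider_results: dict[str, list[tuple[str, float]]],
    min_votes: int = 2,
    solo_confidence: float = 0.9,
) -> list[str]:
    """Combine (brand, confidence) detections from several providers.

    A brand is kept if at least `min_votes` providers report it, or if any
    single provider reports it with confidence >= `solo_confidence`.
    """
    voters = defaultdict(set)
    best_confidence = defaultdict(float)
    for provider, detections in provider_results.items():
        for brand, confidence in detections:
            key = brand.strip().lower()  # normalise spelling across providers
            voters[key].add(provider)
            best_confidence[key] = max(best_confidence[key], confidence)
    return sorted(
        brand for brand in voters
        if len(voters[brand]) >= min_votes
        or best_confidence[brand] >= solo_confidence
    )


results = {
    "provider_a": [("Acme", 0.60), ("Globex", 0.95)],
    "provider_b": [("acme ", 0.70), ("Initech", 0.40)],
}
print(merge_brand_detections(results))  # ['acme', 'globex']
```

Here "acme" survives because two providers agree, "globex" because one provider is highly confident, and "initech" is dropped as a low-confidence single report.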

Search infrastructure

The final step was bringing everything together.

Audio transcriptions, visual descriptions, metadata, content warnings, and recognition results were consolidated into a single searchable index using Elasticsearch.

That transformed more than 40,000 hours of video into something users could actually explore.
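To make the consolidation step concrete, here is a sketch of what one indexed document and one search request body might look like. The field names, the per-segment granularity, and the boost on the transcript are assumptions; the article only states that the outputs were consolidated into a single Elasticsearch index. The query assumes content_warnings is mapped as a keyword field.

```python
from typing import Optional


def build_segment_doc(video_id: str, start_s: float, end_s: float,
                      transcript: str, visual_description: str,
                      brands: list[str], people: list[str],
                      content_warnings: list[str]) -> dict:
    """One searchable document per video segment (hypothetical schema)."""
    return {
        "video_id": video_id,
        "start_s": start_s,
        "end_s": end_s,
        "transcript": transcript,
        "visual_description": visual_description,
        "brands": brands,
        "people": people,
        "content_warnings": content_warnings,
    }


def build_search_query(text: str,
                       exclude_warnings: Optional[list[str]] = None) -> dict:
    """Build an Elasticsearch request body that searches speech and visuals together."""
    query = {
        "bool": {
            "must": [{
                "multi_match": {
                    "query": text,
                    # "^2" boosts transcript matches; the weight is an assumption
                    "fields": ["transcript^2", "visual_description",
                               "brands", "people"],
                }
            }]
        }
    }
    if exclude_warnings:
        # Filter out segments flagged with any of the given warnings
        query["bool"]["must_not"] = [
            {"terms": {"content_warnings": exclude_warnings}}
        ]
    return {"query": query}
```

Because every document carries start and end times, a hit can take the user directly to the matching moment instead of the start of the video.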

Architecture overview

Video Content → Frame Sampling + Audio Extraction → Local AI Models → Metadata Generation → Search Indexing → Intelligent Search Experience

The outcome

The final system achieved the objectives that initially seemed difficult to reconcile.

Cost efficiency

Compared to the fully cloud-based approach, processing costs were reduced by approximately 95%.

That difference transformed the solution from a theoretical possibility into something operationally sustainable.

Privacy

Raw video content remained within our environment.

The only external processing involved carefully selected still-image composites for celebrity recognition.

The architecture respected privacy without sacrificing functionality.

Search experience

Users could now search for:

Spoken phrases
Visual scenes
Brands
Faces
Events

Search quality improved dramatically compared to manually entered metadata.

Operational scalability

Frame sampling and optimized external requests kept processing costs low enough to continuously analyze new content as it was uploaded.

What surprised me most

Going into the project, I assumed model selection would be the hardest part.

It wasn't.

The biggest improvements came from cleaning inputs and refining prompts.

Removing poor-quality frames often delivered better results than switching to a more sophisticated model.

I also learned that specialized models working together can outperform a single expensive general-purpose solution, especially when constraints like cost and privacy matter.

Looking back, I would make the same architectural decision again.

Newer multimodal models could simplify some parts of the implementation today, but the core principle remains valid.

The architecture mattered more than the individual models.

Engineering principle

One lesson from this project continues to influence how I approach AI systems today:

The most important AI decision is often not which model you choose. It's how you design the system around the constraints.

In this case, privacy and cost shaped the architecture.

Once those constraints became clear, the right solution became clear as well.

Sometimes the best engineering decision is not finding a bigger model.

It's designing a smarter system.
