The challenge
One of the most interesting challenges I've worked on at AccelOne involved making a massive video library searchable with AI.
The platform contained more than 40,000 hours of video content, and users needed to find very specific moments within that library. Sometimes they were looking for a particular spoken phrase. Other times they needed to find a brand appearing on screen, a public figure, a specific scene, or a visual event.
The challenge wasn't just storing the content. It was making it discoverable.
At the time, all metadata was created manually. Titles, descriptions, tags, people, camera information, and other details were entered by humans. That process was expensive, time-consuming, and impossible to scale as the library continued to grow.
The goal was clear:
Use AI to automatically generate metadata, detect visual and audio events, identify sensitive content, and dramatically improve search capabilities.
The challenge was doing all of that without creating an unsustainable operating cost or compromising data privacy.
Why it was more complex than it looked
My first instinct was simple.
Why not use one of the large AI video analysis platforms already available in the market?
Several vendors offered transcription, object detection, moderation, face recognition, and search capabilities through a single API. On paper, it seemed like the fastest path to a solution.
Then I started evaluating the constraints.
→ The first issue was scale.
Processing a few hours of video through a cloud API is affordable. Processing 40,000 hours is a completely different conversation. Once I projected the cost across the entire library and future uploads, the numbers became difficult to justify.
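The scaling effect is plain arithmetic, and a minimal sketch makes it concrete. The per-minute rate below is a placeholder chosen only to show how the total grows with volume. It is not a quote from any vendor, and `projected_cost` is a hypothetical helper.

```python
def projected_cost(hours: float, price_per_minute: float) -> float:
    """Cost of processing `hours` of video at a flat per-minute rate."""
    return hours * 60 * price_per_minute


LIBRARY_HOURS = 40_000    # library size mentioned in this article
PLACEHOLDER_RATE = 0.10   # hypothetical USD per minute, NOT a real vendor price

pilot = projected_cost(10, PLACEHOLDER_RATE)
full = projected_cost(LIBRARY_HOURS, PLACEHOLDER_RATE)

print(f"10-hour pilot: ${pilot:,.2f}")
print(f"Full library:  ${full:,.2f}")
print(f"Scale factor:  {full / pilot:,.0f}x")
```

Whatever the real rate is, the backlog costs several thousand times as much as a pilot, and every future upload adds to the total.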
→ The second issue was privacy.
The video content itself was the product. Uploading every video to a third-party service introduced a level of exposure that I wasn't comfortable with. Even if the providers offered strong security guarantees, I wanted to minimize the amount of sensitive content leaving our environment.
At that point, the challenge stopped being an AI problem and became an architecture problem.
How could I achieve cloud-level content understanding while maintaining control over both cost and privacy?
Challenge snapshot
Make 40,000+ hours of video searchable with AI.
Constraints:
- Massive processing volume
- Cost efficiency
- Data privacy
- Search accuracy
- Continuous scalability
How I approached the decision
I evaluated two primary approaches.
Option 1: A single cloud-based AI platform
This option was appealing for obvious reasons.
Most of the infrastructure already existed. Many of the required features were available out of the box. Development would be faster and operational complexity would be lower.
But the tradeoffs were significant:
- High long-term operating costs
- Dependence on a single vendor
- Raw video content leaving our environment
- Limited control over model behavior and optimization
For a smaller content library, I might have chosen this approach.
For this scale, I couldn't justify it.
Option 2: A multi-model local architecture
The alternative was building a system composed of specialized models, most of them running locally.
This approach introduced more engineering complexity. There were more components to manage, more integrations to maintain, and more opportunities for things to go wrong.
But it offered two critical advantages:
- Control over costs
- Control over data
The more I analyzed the problem, the more it became clear that privacy wasn't simply a feature requirement. It was an architectural constraint.
Once I accepted that, the decision became much easier.
Decision framework
| Criteria | Single Cloud API | Multi-Model Local Architecture |
| --- | --- | --- |
| Cost Control | Low | High |
| Data Privacy | Medium | High |
| Vendor Dependency | High | Low |
| Scalability | Medium | High |
| Engineering Complexity | Low | High |
I intentionally chose additional engineering complexity because it solved the constraints that mattered most.
The solution
I built a multi-model pipeline where each component was responsible for a specific task.
Instead of relying on a single general-purpose system, I optimized each stage independently.
Audio processing
I started by extracting audio tracks from videos and processing them locally using Whisper.
This approach gave me accurate transcriptions while avoiding the need to transfer large video files through external services.
It also reduced processing overhead and helped minimize hallucinations.
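A minimal sketch of this stage, assuming `ffmpeg` is installed and the open-source `openai-whisper` package is used. The 16 kHz mono settings, the `base` model size, and the function names are illustrative assumptions, not details of the project.

```python
import subprocess
from pathlib import Path


def ffmpeg_audio_command(video: Path, audio: Path) -> list[str]:
    """Build an ffmpeg command that drops the video stream and writes mono 16 kHz audio."""
    return [
        "ffmpeg", "-y",    # overwrite the output file if it already exists
        "-i", str(video),
        "-vn",             # discard the video stream
        "-ac", "1",        # downmix to mono
        "-ar", "16000",    # resample to 16 kHz
        str(audio),
    ]


def transcribe(video: Path, model_name: str = "base") -> str:
    """Extract the audio track, then transcribe it locally with Whisper."""
    import whisper  # imported lazily so the command builder works without the package

    audio = video.with_suffix(".wav")
    subprocess.run(ffmpeg_audio_command(video, audio), check=True)
    model = whisper.load_model(model_name)
    return model.transcribe(str(audio))["text"]


print(" ".join(ffmpeg_audio_command(Path("clip.mp4"), Path("clip.wav"))))
```

Only the small audio file reaches the transcription model, and nothing leaves the machine.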
Visual analysis
Processing every frame of every video would have been prohibitively expensive.
Instead, I sampled frames strategically.
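The simplest form of sampling is a fixed interval, sketched below. The sampling strategy is not spelled out here, so the five-second interval and the `sample_timestamps` helper are assumptions for illustration only. A scene-change detector would be an equally plausible choice.

```python
def sample_timestamps(duration_s: float, interval_s: float = 5.0) -> list[float]:
    """Return one timestamp (in seconds) per interval, centered within each interval."""
    if duration_s <= 0 or interval_s <= 0:
        return []
    timestamps = []
    t = interval_s / 2
    while t < duration_s:
        timestamps.append(round(t, 3))
        t += interval_s
    return timestamps


one_hour = sample_timestamps(3600, interval_s=5.0)
total_frames_at_25fps = 3600 * 25
print(f"{len(one_hour)} sampled frames instead of {total_frames_at_25fps}")
```

Even a conservative interval cuts the number of frames to analyze by two orders of magnitude.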
Each selected frame was analyzed using local multimodal models, with Gemma 12B running through Ollama.
The models generated:
- Visual descriptions
- Summaries
- Scene context
- Content warnings
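As a sketch of what one such call could look like, the snippet below builds a request for Ollama's local `/api/generate` endpoint, which accepts base64-encoded images alongside a prompt. The model tag `gemma3:12b`, the prompt wording, and the function names are assumptions. The exact model build and prompts used in the project are not given here.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
MODEL_TAG = "gemma3:12b"                            # assumed tag, adjust to the installed model
PROMPT = "Describe this frame, summarize the scene, and list any content warnings."  # hypothetical


def build_request(jpeg_bytes: bytes, prompt: str = PROMPT, model: str = MODEL_TAG) -> dict:
    """Build the JSON payload for a single-frame, non-streaming generate call."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(jpeg_bytes).decode("ascii")],
        "stream": False,
    }


def describe_frame(jpeg_bytes: bytes) -> str:
    """Send one frame to the local Ollama server and return the generated text."""
    body = json.dumps(build_request(jpeg_bytes)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


payload = build_request(b"abc")
print(sorted(payload))
```

Because the server runs on localhost, the frame never crosses the network boundary.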
One of the most important lessons came from improving the inputs rather than changing the models.
I discovered that removing blurry frames, overlays, and low-information images dramatically improved output quality.
The better the input, the better the results.
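One common heuristic for rejecting blurry frames is the variance of the Laplacian: sharp images have strong edges, so the Laplacian response varies widely, while blurred or flat images score close to zero. The sketch below implements it in plain Python on a grayscale grid. The threshold is a hypothetical value that would need tuning, and this is not necessarily the filter used in the project. With OpenCV, the same score is usually computed as `cv2.Laplacian(gray, cv2.CV_64F).var()`.

```python
from statistics import pvariance


def laplacian_variance(gray: list[list[int]]) -> float:
    """Variance of the 4-neighbor Laplacian over the interior pixels of a grayscale image."""
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def is_sharp(gray: list[list[int]], threshold: float = 100.0) -> bool:
    """Keep a frame only if its edge content exceeds the (assumed) threshold."""
    return laplacian_variance(gray) >= threshold


flat = [[128] * 5 for _ in range(5)]                                      # no detail at all
checker = [[255 if (x + y) % 2 else 0 for x in range(5)] for y in range(5)]  # maximal detail

print(is_sharp(flat), is_sharp(checker))
```

The input must be at least 3x3 pixels so that there is an interior to measure.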
Sensitive content detection
For NSFW detection, I iteratively refined prompts until false positives reached an acceptable level.
For face detection, I used OpenCV.
Celebrity recognition presented a different challenge.
I decided to use an external service for that specific task, but I wanted to minimize both cost and exposure.
Instead of sending large numbers of images individually, I created composite mosaics containing multiple frames in a single request.
This reduced the number of requests by approximately 90%.
More importantly, I never needed to send raw video files outside the system.
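The mosaic idea can be sketched as a layout calculation: each frame is assigned to a mosaic and a pixel offset within it, and the request count drops by the number of tiles per mosaic. The grid size and tile dimensions below are assumptions, since the actual layout is not stated.

```python
import math


def mosaic_layout(n_frames: int, cols: int = 3, rows: int = 3,
                  tile_w: int = 320, tile_h: int = 180) -> list[tuple[int, int, int]]:
    """Map each frame index to (mosaic index, x offset, y offset) in pixels."""
    per_mosaic = cols * rows
    placements = []
    for i in range(n_frames):
        mosaic, slot = divmod(i, per_mosaic)   # which mosaic, and which slot inside it
        row, col = divmod(slot, cols)          # slots fill left to right, top to bottom
        placements.append((mosaic, col * tile_w, row * tile_h))
    return placements


def request_count(n_frames: int, per_mosaic: int) -> int:
    """Number of external requests needed when frames are batched into mosaics."""
    return math.ceil(n_frames / per_mosaic)


frames = 100
requests_3x3 = request_count(frames, 9)
requests_5x2 = request_count(frames, 10)
print(f"3x3 grid: {requests_3x3} requests, 5x2 grid: {requests_5x2} requests")
```

A grid of roughly ten tiles per mosaic is consistent with the reduction of about 90% described above. The offsets can then be used to paste each frame into a blank canvas with any imaging library, and to map a recognized face back to its source frame.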
Brand detection
Brand recognition required more experimentation than any other component.
I tested multiple providers, compared results, and eventually combined outputs with custom filtering logic.
No single service consistently outperformed the others, so a hybrid approach delivered the best results.
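One possible shape for such a hybrid is a voting rule: keep a brand only when enough providers report it with enough confidence. The custom filtering logic is not described in detail, so the thresholds, provider names, and the `combine_brand_detections` helper below are all hypothetical.

```python
from collections import defaultdict


def combine_brand_detections(
    provider_results: dict[str, list[tuple[str, float]]],
    min_confidence: float = 0.6,
    min_votes: int = 2,
) -> list[str]:
    """Return brands reported by at least `min_votes` providers above `min_confidence`."""
    votes: dict[str, set[str]] = defaultdict(set)
    for provider, detections in provider_results.items():
        for brand, confidence in detections:
            if confidence >= min_confidence:
                # Normalize names so "Acme" and "acme" count as the same brand.
                votes[brand.strip().lower()].add(provider)
    return sorted(brand for brand, providers in votes.items() if len(providers) >= min_votes)


results = {
    "provider_a": [("Acme", 0.90), ("Globex", 0.40)],
    "provider_b": [("acme", 0.70), ("Initech", 0.95)],
}
agreed = combine_brand_detections(results)
print(agreed)
```

Requiring agreement trades some recall for fewer false positives. Lowering `min_votes` to 1 would favor recall instead.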
Search infrastructure
The final step was bringing everything together.
Audio transcriptions, visual descriptions, metadata, content warnings, and recognition results were consolidated into a single searchable index using Elasticsearch.
That transformed thousands of hours of video into something users could actually explore.
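To make the consolidation step concrete, here is a sketch of one indexed document per video segment and a query across its text fields, using Elasticsearch's `multi_match` query. The field names, the per-segment granularity, and the boost on the transcript are assumptions made for illustration, not the project's actual schema.

```python
def build_segment_document(video_id: str, start_s: float, end_s: float,
                           transcript: str, visual_description: str,
                           brands: list[str], people: list[str],
                           content_warnings: list[str]) -> dict:
    """Merge the outputs of every pipeline stage into one searchable document."""
    return {
        "video_id": video_id,
        "start_s": start_s,
        "end_s": end_s,
        "transcript": transcript,
        "visual_description": visual_description,
        "brands": brands,
        "people": people,
        "content_warnings": content_warnings,
    }


def build_search_query(text: str, size: int = 10) -> dict:
    """Search spoken and visual fields together, weighting the transcript higher."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["transcript^2", "visual_description", "brands", "people"],
            }
        },
    }


doc = build_segment_document("vid-001", 30.0, 35.0, "welcome back to the show",
                             "two hosts at a desk", ["acme"], [], [])
query = build_search_query("hosts at a desk")
print(query["query"]["multi_match"]["fields"])
```

Storing start and end times on each document is what lets a search result point to a specific moment rather than a whole video.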
Architecture overview
Video Content → Frame Sampling + Audio Extraction → Local AI Models → Metadata Generation → Search Indexing → Intelligent Search Experience
The outcome
The final system achieved the objectives that initially seemed difficult to reconcile.
Cost efficiency
Compared to the fully cloud-based approach, processing costs were reduced by approximately 95%.
That difference transformed the solution from a theoretical possibility into something operationally sustainable.
Privacy
Raw video content remained within our environment.
The only external processing involved carefully selected still-image composites for celebrity recognition.
The architecture respected privacy without sacrificing functionality.
Search experience
Users could now search for:
- Spoken phrases
- Visual scenes
- Brands
- Faces
- Events
Search quality improved dramatically compared to manually entered metadata.
Operational scalability
Frame sampling and optimized external requests kept processing costs low enough to continuously analyze new content as it was uploaded.
What surprised me most
Going into the project, I assumed model selection would be the hardest part.
It wasn't.
The biggest improvements came from cleaning inputs and refining prompts.
Removing poor-quality frames often delivered better results than switching to a more sophisticated model.
I also learned that specialized models working together can outperform a single expensive general-purpose solution, especially when constraints like cost and privacy matter.
Looking back, I would make the same architectural decision again.
Newer multimodal models could simplify some parts of the implementation today, but the core principle remains valid.
The architecture mattered more than the individual models.
Engineering principle
One lesson from this project continues to influence how I approach AI systems today:
The most important AI decision is often not which model you choose. It's how you design the system around the constraints.
In this case, privacy and cost shaped the architecture.
Once those constraints became clear, the right solution became clear as well.
Sometimes the best engineering decision is not finding a bigger model.
It's designing a smarter system.