<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Untitled Publication]]></title><description><![CDATA[Untitled Publication]]></description><link>https://blog.pushkaryadav.in</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 12:06:15 GMT</lastBuildDate><atom:link href="https://blog.pushkaryadav.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Cost-Effective Video Search with Frame-Based Multimodal Embeddings]]></title><description><![CDATA[TL;DR: Traditional video models are costly and often impractical for large-scale search. By splitting videos into ~800 frames per hour, embedding both visuals and transcribed audio into a vector database, we can build a precise and low-cost system to...]]></description><link>https://blog.pushkaryadav.in/cost-effective-video-search-with-frame-based-multimodal-embeddings</link><guid isPermaLink="true">https://blog.pushkaryadav.in/cost-effective-video-search-with-frame-based-multimodal-embeddings</guid><category><![CDATA[video search embeddings]]></category><category><![CDATA[video indexing]]></category><category><![CDATA[Vector Databases]]></category><category><![CDATA[FFmpeg]]></category><category><![CDATA[#multimodalai]]></category><dc:creator><![CDATA[Pushkar Yadav]]></dc:creator><pubDate>Thu, 14 Aug 2025 09:08:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755162382142/2d3f65c1-ba27-4967-8be1-807dcd600602.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR:</strong> Traditional video models are costly and often impractical for large-scale search. 
By splitting videos into ~800 frames per hour and embedding both visuals and transcribed audio into a vector database, we can build a precise, low-cost system for querying exact video moments. This approach makes multimodal search affordable without sacrificing accuracy.</p>
<ol>
<li><h3 id="heading-why-video-search-is-expensive-today">Why Video Search is Expensive Today</h3>
<p> Most AI models don't process videos directly. To search within a video, you typically need to process it frame by frame, create embeddings for each frame, and store these embeddings in a database. For a 1-hour video, this can cost nearly a dollar with models like Gemini-2.0-flash. Costs rise quickly when scaling to hundreds of hours of content, and using more advanced models increases the price even more. This makes precise, multimodal search (visual + audio) expensive and often impractical for everyday use.</p>
<p> By dividing videos into frames and embedding both visuals and transcribed audio into a vector database, we can create a precise, low-cost system that finds exact video moments, making multimodal search affordable without losing accuracy.</p>
<p> Here I will show you an approach that reduces cost while improving accuracy 🌸</p>
</li>
<li><h3 id="heading-splitting-videos-into-frames">Splitting Videos Into Frames</h3>
<p> As of today, there are models like Gemini that can directly process videos. However, I won't be using them because I want to create something that remembers videos by frames and costs less than traditional video models.</p>
<p> <a target="_blank" href="https://ffmpeg.org/"><strong>FFmpeg</strong></a> can split a video into any number of frames.</p>
<p> I wrote a script that splits a video into <code>800</code> frames per hour. Source videos commonly run at 30 or 60 FPS; the script uses FFmpeg to sample them down to .jpeg files at 800 frames per hour (FPH).</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754893550601/f08a4643-08e3-44c2-998c-fd4239cc09ca.png" alt class="image--center mx-auto" /></p>
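<p>The extraction script itself isn't shown, so here is a minimal sketch of that step, assuming FFmpeg is on the PATH (the folder and helper names are illustrative, not from the original script). 800 frames per hour works out to 800/3600 ≈ 0.22 frames per second, which FFmpeg's <code>fps</code> filter can sample directly:</p>

```python
import subprocess

FRAMES_PER_HOUR = 800

def extraction_cmd(video_path: str, out_dir: str) -> list[str]:
    # 800 frames per hour = 800/3600 ~= 0.22 frames per second,
    # independent of the source frame rate.
    fps = FRAMES_PER_HOUR / 3600
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",            # sample down to ~0.22 fps
        "-q:v", "2",                    # high-quality JPEG output
        f"{out_dir}/frame_%05d.jpeg",   # frame_00001.jpeg, frame_00002.jpeg, ...
    ]

# subprocess.run(extraction_cmd("videos-to-train/clip.mp4", "extracted_frames"), check=True)
```

<p>Because the <code>fps</code> filter resamples by time rather than dropping every Nth frame, the same command works for any source frame rate.</p>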
</li>
<li><h3 id="heading-from-frames-to-embeddings">From Frames to Embeddings</h3>
<p> This part gets interesting because there are models that can help with video-based embeddings. Here, I used <a target="_blank" href="https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings">multimodal embedding</a> from Vertex AI, and there's a good reason for that. This model supports both text and images, making our text-to-image searches feel precise.</p>
<p> Loop over the <code>extracted_frames</code> folder and embed each frame with <a target="_blank" href="https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings">google/multimodalembedding</a>, which converts it into a <code>1408-dimension vector</code>; store the vectors in a vector database configured for cosine similarity at dimension 1408.</p>
<p> With a bit more looping, we're ready to use a script that processes all videos in the <code>video-to-train</code> folder. It creates their <code>extracted_frames</code> and saves their vectors to Upstash. The <em>id</em> of each vector is structured to point to any video at a specific timestamp.</p>
<p> <strong>Example:</strong> <code>videos-to-train/12115024_3840_2160_30fps.mp4-11/16</code> points to the video named <em>12115024_3840_2160_30fps.mp4</em> in the <em>videos-to-train</em> folder, at frame 11 of 16 (roughly 69% of the way through the video).</p>
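<p>A sketch of that indexing loop, assuming the Vertex AI SDK and the Upstash Vector client are installed and credentials are configured; the helper names are mine, not from the original script:</p>

```python
def frame_vector_id(video_path: str, frame_idx: int, total_frames: int) -> str:
    # Encodes video + position, e.g. "videos-to-train/clip.mp4-11/16",
    # so a search hit points at an exact moment in a specific video.
    return f"{video_path}-{frame_idx}/{total_frames}"

def index_frames(video_path: str, frame_dir: str, total_frames: int) -> None:
    """Embed every extracted frame and upsert it into Upstash Vector (sketch)."""
    from upstash_vector import Index
    from vertexai.vision_models import Image, MultiModalEmbeddingModel

    model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
    index = Index.from_env()  # reads UPSTASH_VECTOR_REST_URL / _TOKEN

    for i in range(1, total_frames + 1):
        frame = Image.load_from_file(f"{frame_dir}/frame_{i:05d}.jpeg")
        emb = model.get_embeddings(image=frame)  # 1408-dimension vector
        index.upsert(vectors=[(frame_vector_id(video_path, i, total_frames),
                               emb.image_embedding)])
```

<p>Putting the video path and frame position into the vector id means no extra lookup table is needed at query time: the id alone resolves back to a file and a timestamp.</p>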
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754896580657/122a1ba2-5b85-46f1-b997-d659df975a8a.gif" alt class="image--center mx-auto" /></p>
</li>
<li><h3 id="heading-querying-the-vector-database">Querying the Vector Database</h3>
<p> For querying, the same model embeds the text prompt into a 1408-dimension vector, which is then matched against the frame vectors stored in Upstash.</p>
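<p>A sketch of the query side under the same assumptions; the <code>hit_to_seconds</code> helper is my own addition and converts a hit's structured id back into a playback position:</p>

```python
def hit_to_seconds(vector_id: str, duration_s: float) -> float:
    # "videos-to-train/clip.mp4-11/16" -> 11/16 of the way through the video
    _, fraction = vector_id.rsplit("-", 1)
    num, den = fraction.split("/")
    return duration_s * int(num) / int(den)

def search(prompt: str, top_k: int = 3):
    """Embed a text prompt and return the closest frame vectors (sketch)."""
    from upstash_vector import Index
    from vertexai.vision_models import MultiModalEmbeddingModel

    model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
    emb = model.get_embeddings(contextual_text=prompt)  # same 1408-dim space as the frames
    return Index.from_env().query(vector=emb.text_embedding, top_k=top_k)
```

<p>Because the text and image embeddings share one vector space, cosine similarity against the stored frame vectors is all the matching required.</p>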
<p> Here’s a demo of a few queries:</p>
<p> Prompt: Shot of a river from a cliff where clouds seem coming towards me</p>
 <iframe width="100%" height="450" src="https://www.youtube.com/embed/CbMrTGrnEEk"></iframe>

<p> Prompt: squirrel jumping off a toast in the forest</p>
 <iframe width="100%" height="450" src="https://www.youtube.com/embed/YwUeRVXV5T8"></iframe>

<p> Watch how it accurately identified the moment when the squirrel was about to jump off the toast.</p>
</li>
<li><h3 id="heading-whats-next-adding-audio-context">What’s Next: Adding Audio Context</h3>
<p> With this approach, a query can pinpoint a single moment across thousands of videos, and it doesn't stop there.</p>
<p> Audio can also be extracted with FFmpeg at the same rate as the frames, transcribed to text, and embedded alongside the frame data in the same vector. That way, not only the visuals but also small pieces of dialogue from the video can be precisely located.</p>
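<p>As a sketch of that audio step (the FFmpeg flags and helper name are my assumptions, and the transcription model is left open since the post doesn't name one): at 800 frames per hour, each frame covers 3600/800 = 4.5 seconds of audio.</p>

```python
SECONDS_PER_FRAME = 3600 / 800  # 4.5 s of audio accompanies each frame

def audio_slice_cmd(video_path: str, frame_idx: int, out_path: str) -> list[str]:
    # Cut the 4.5 s audio window for frame `frame_idx` (0-based) as a clip,
    # ready for a speech-to-text model of your choice.
    start = frame_idx * SECONDS_PER_FRAME
    return [
        "ffmpeg", "-i", video_path,
        "-ss", str(start), "-t", str(SECONDS_PER_FRAME),
        "-vn",                        # drop the video stream
        "-ac", "1", "-ar", "16000",   # mono, 16 kHz: a typical STT input format
        out_path,
    ]

# Run each command with subprocess, transcribe the clip, then embed the text
# alongside the matching frame so one vector carries both modalities.
```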
</li>
</ol>
<p>This method shows that video search doesn’t have to be expensive. By combining frame extraction, embeddings, and audio transcripts, you can build a multimodal system that pinpoints exact moments in hours of footage at a fraction of the cost of traditional video models.</p>
]]></content:encoded></item><item><title><![CDATA[Realtime GitHub Readme Tweets]]></title><description><![CDATA[Hey now you can integrate your tweets into github readme in realtime. I have created a api which will fetch your tweets and give you a response in picture format. A tweet looks like this:

Let's see how to integrate this into your readme.

visit twee...]]></description><link>https://blog.pushkaryadav.in/realtime-github-readme-tweets</link><guid isPermaLink="true">https://blog.pushkaryadav.in/realtime-github-readme-tweets</guid><category><![CDATA[README]]></category><category><![CDATA[GitHub]]></category><category><![CDATA[tools]]></category><category><![CDATA[tweeco]]></category><dc:creator><![CDATA[Pushkar Yadav]]></dc:creator><pubDate>Thu, 16 Mar 2023 02:26:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750901033063/92f3b570-18c9-450b-b391-b3f06f4b2fd0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hey, now you can integrate your tweets into your GitHub README in real time. I have created an API that fetches your tweets and returns them as an image. A tweet looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1678932672262/4e5b045b-aac4-46f8-a7e0-89542fe343ce.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-lets-see-how-to-integrate-this-into-your-readme">Let's see how to integrate this into your readme.</h2>
<ol>
<li><h3 id="heading-visit-tweecopushkaryadavinhttpstweecopushkaryadavin">Visit <a target="_blank" href="https://tweeco.pushkaryadav.in/">tweeco.pushkaryadav.in</a></h3>
</li>
<li><h3 id="heading-enter-your-twitter-username"><strong>Enter your Twitter username</strong></h3>
<p> Here you have two choices: enter your Twitter username, or enter a specific tweet URL.</p>
<ul>
<li><p>A username will always return your latest tweet rendered as an SVG</p>
</li>
<li><p>A tweet URL will return that specific tweet rendered as an SVG</p>
</li>
</ul>
</li>
</ol>
<ol start="3">
<li><h3 id="heading-copy-the-markdown-code-and-paste-it-in-your-readme-file"><strong>Copy the markdown code and paste it in your readme file</strong></h3>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1678932914168/ec7d676c-f8bf-4bef-96b8-e289e88a6d4e.png" alt class="image--center mx-auto" /></p>
</li>
<li><h3 id="heading-costumization"><strong>Customization</strong></h3>
</li>
</ol>
<p>You can customize the rendered tweet with URL query parameters. Add any of these to the URL and you will get a customized tweet:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Parameter</td><td>Effect</td></tr>
</thead>
<tbody>
<tr>
<td>?text=fff</td><td>text color</td></tr>
<tr>
<td>?width=700</td><td>width of the rendered image</td></tr>
<tr>
<td>?border=000</td><td>border color</td></tr>
<tr>
<td>?bg=333</td><td>background color</td></tr>
<tr>
<td>?title=F5D76E</td><td>title color</td></tr>
<tr>
<td>?icon=F5D76E</td><td>twitter icon color</td></tr>
</tbody>
</table>
</div><pre><code class="lang-markdown">[<span class="hljs-string">![</span>](<span class="hljs-link">https://tweeco.pushkaryadav.in/api/handle/pushkaryadavin?text=fff&amp;border=000&amp;width=700&amp;bg=333&amp;title=F5D76E&amp;icon=F5D76E</span>)](<span class="hljs-link">https://tweeco.pushkaryadav.in</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1678933167082/4082987c-9db0-4488-bba0-2a99191be88e.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>Some Tips:</p>
<ul>
<li><p>Do not use <code>#</code> in color code. Use <code>F5D76E</code> instead of <code>#F5D76E</code></p>
</li>
<li><p>For colors you can refer to <a target="_blank" href="https://colpic.pushkaryadav.in/">COLPIC</a> or any color code website.</p>
</li>
<li><p><code>PRO TIP 😎</code>: You can also use this API on your own website. Just use the API URL in an <code>img</code> tag.</p>
</li>
</ul>
</li>
</ul>
<ol start="5">
<li><p>Commit, and you're done.</p>
<p> This is how my GitHub looks after adding this tweet integration:</p>
<p> <a target="_blank" href="http://github.com/pushkarydv">github.com/pushkarydv</a></p>
</li>
</ol>
<h2 id="heading-need-help"><strong>Need Help?</strong></h2>
<p>See the GitHub repo for more information and ongoing updates.</p>
<ul>
<li><p><a target="_blank" href="https://github.com/pushkarydv/readme-tweets">GitHub Repository</a></p>
</li>
<li><p><a target="_blank" href="https://tweeco.pushkaryadav.in/">Tweeco website</a></p>
</li>
<li><p><a target="_blank" href="https://twitter.com/pushkaryadavin">Twitter</a></p>
</li>
</ul>
]]></content:encoded></item></channel></rss>