Our Models

Two specialized models, not one general one.

Different encoders, different objectives, different embedding spaces. Deployed together, designed separately.

The Power of Alignment

Mikshi Analyze: turning what happened into language an operator can act on.

A video-language model with its own visual encoder. It turns a clip into grounded, temporally precise language.

Search returns the evidence. Analyze returns the explanation. They talk through clips and timestamps, not a shared embedding space.

  • Scene understanding
    Critical activity vs. background motion in dense feeds.
  • Temporal grounding
    Sub-second timestamps emitted as tokens, not aligned after the fact.
  • Visual QA
    Natural-language interface over archived and live video.
  • Anomaly detection
    A judgment and an explanation, not an opaque score.
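To make the temporal-grounding point concrete, here is a minimal sketch of how a downstream consumer might read timestamps that arrive as tokens inside the generated text. The `<t=SECONDS>` token format, the `GroundedSpan` type, and `parse_grounded_output` are assumptions made for illustration; the token format Mikshi Analyze actually emits is not specified here.

```python
import re
from typing import List, NamedTuple


class GroundedSpan(NamedTuple):
    start_s: float  # seconds from the start of the clip
    end_s: float
    text: str


# Hypothetical token format: "<t=START>description<t=END>".
_SPAN = re.compile(r"<t=(\d+(?:\.\d+)?)>(.*?)<t=(\d+(?:\.\d+)?)>", re.DOTALL)


def parse_grounded_output(output: str) -> List[GroundedSpan]:
    """Extract (start, end, text) spans from text with inline timestamp tokens."""
    spans = []
    for match in _SPAN.finditer(output):
        start, end = float(match.group(1)), float(match.group(3))
        if end < start:
            continue  # skip malformed spans rather than guess at a fix
        spans.append(GroundedSpan(start, end, match.group(2).strip()))
    return spans


# Invented example output, not real model output.
example = "<t=12.4>A van stops at the gate.<t=15.1> <t=15.1>The driver exits.<t=17.8>"
spans = parse_grounded_output(example)
```

Because the timestamps are part of the output itself, the consumer only parses them; no separate alignment pass over the video is needed in this sketch.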
Mikshi Analyze 1.0

clip in · grounded language out

Mikshi Search 1.0

multi-vector · per segment

The Art of Detail

Mikshi Search: a video-native encoder for long-duration CCTV.

It reads frames and the time between them, then emits multi-vector embeddings: one set of tokens per segment rather than a single pooled vector, so temporal structure survives indexing.

Late interaction between query and segment tokens returns the right seconds, not the right hour. A day of footage becomes searchable in milliseconds.
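The late-interaction step can be sketched as follows. This is a minimal illustration, assuming a ColBERT-style MaxSim score: each query token takes its best cosine match among a segment's tokens, and those matches are summed. The source does not state which scoring rule Mikshi Search uses. The function names, the toy 2-D vectors, and the index keyed by (start, end) seconds are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Span = Tuple[float, float]  # (start_s, end_s) of a segment; hypothetical layout


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], segment_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best cosine match among segment tokens."""
    if not query_tokens or not segment_tokens:
        return 0.0
    q = [normalize(v) for v in query_tokens]
    s = [normalize(v) for v in segment_tokens]
    return sum(max(dot(qv, sv) for sv in s) for qv in q)


def rank_segments(
    query_tokens: Sequence[Vector],
    index: Dict[Span, Sequence[Vector]],
    top_k: int = 3,
) -> List[Tuple[float, Span]]:
    """Score every segment in the index and return the top_k (score, span) pairs."""
    scored = [(maxsim_score(query_tokens, toks), span) for span, toks in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data: two 2-second segments, each with two token embeddings.
index = {
    (0.0, 2.0): [[1.0, 0.0], [0.9, 0.1]],
    (2.0, 4.0): [[0.0, 1.0], [1.0, 0.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]
best_score, best_span = rank_segments(query, index, top_k=1)[0]
```

In this toy example only the second segment contains a good match for both query tokens, so it ranks first. A production system would use an approximate nearest-neighbor index rather than the exhaustive loop shown here.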

Search recovers what was seen. Analyze explains what it meant. Together they make long-form CCTV legible.

Why CCTV is the hard case

CCTV isn't just more video. It's a qualitatively different regime, and it shapes every design decision.

We don't treat CCTV as an application of a general video model. We treat general video understanding as a byproduct of solving CCTV.

Hours, not seconds.

One camera records 24 hours of footage a day. A deployment produces thousands of hours. Web-clip models don't survive this scale.

Mostly nothing, occasionally everything.

Returning the right hour is useless. Operators need the right seconds.

Fixed viewpoint, drifting conditions.

Same scene for months, but lighting, weather, and occlusion never stop changing.

Vision-only signal.

No speech, no narration. The visual and temporal channel has to carry it alone.

The cost is missed events, not slow ones.

Value is lost when a moment is never flagged, not when a model is a second slow.

Research Focus

Three themes run through our work.

01

Two specialized models, not one general one.

Retrieval and reasoning have different objectives, data, and latency profiles. We don't force them through a shared representation; they compose at the clip level.
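One way to picture composition at the clip level is a pipeline where the only thing passed between the two models is a clip reference, never an embedding. The sketch below is illustrative only: `ClipRef`, `search_then_analyze`, and the two stub functions are hypothetical names standing in for the real services, whose interfaces are not described here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ClipRef:
    """The shared contract: which camera, and which seconds."""
    camera_id: str
    start_s: float
    end_s: float


SearchFn = Callable[[str], List[ClipRef]]   # query -> ranked clips
AnalyzeFn = Callable[[ClipRef, str], str]   # clip + question -> explanation


def search_then_analyze(
    query: str, search: SearchFn, analyze: AnalyzeFn, max_clips: int = 3
) -> List[Tuple[ClipRef, str]]:
    """Retrieve candidate clips, then ask for an explanation of each one."""
    return [(clip, analyze(clip, query)) for clip in search(query)[:max_clips]]


# Stubs standing in for the two models (placeholders, not real behavior).
def fake_search(query: str) -> List[ClipRef]:
    return [ClipRef("cam-01", 3605.0, 3609.5), ClipRef("cam-07", 120.0, 124.0)]


def fake_analyze(clip: ClipRef, question: str) -> str:
    return f"{clip.camera_id} {clip.start_s:.1f}-{clip.end_s:.1f}s: placeholder answer to {question!r}"


results = search_then_analyze("van at gate", fake_search, fake_analyze, max_clips=1)
```

Because the contract is just a camera ID and a time range, either side could be retrained or swapped without touching the other in this sketch.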

02

Recovering missed events.

The value isn't running faster than humans. It's seeing what humans stop seeing several hours into a shift.

03

Multi-vector embeddings.

A segment is a sequence of embeddings, not a point. Pooling destroys exactly the temporal detail retrieval needs.
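A small toy example may help show what pooling loses. In the sketch below, two segments have identical mean-pooled vectors, yet only one of them contains a time step that matches the query. All vectors, names, and the choice of mean pooling with cosine similarity are assumptions for illustration, not a description of the production encoder.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def mean_pool(tokens: Sequence[Vector]) -> List[float]:
    """Collapse a sequence of token embeddings into a single averaged vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def best_token(query: Vector, tokens: Sequence[Vector]) -> Tuple[int, float]:
    """Return (index, similarity) of the token that best matches the query."""
    scores = [cosine(query, t) for t in tokens]
    i = max(range(len(scores)), key=scores.__getitem__)
    return i, scores[i]


# One token per time step. The "event" happens at step 2 of the first segment.
event_segment = [[0, 1], [0, 1], [1, 0], [0, 1]]
diffuse_segment = [[0.25, 0.75]] * 4
query = [1, 0]

pooled_event = cosine(query, mean_pool(event_segment))
pooled_diffuse = cosine(query, mean_pool(diffuse_segment))
event_step, event_score = best_token(query, event_segment)
diffuse_step, diffuse_score = best_token(query, diffuse_segment)
```

After pooling, the two segments are indistinguishable to the query. With per-token embeddings, the first segment matches strongly, and the index of the matching token says where in the segment the match occurred.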

Target Applications

Built for settings where video is produced faster than anyone can watch it.

Retrieval and reasoning are both needed, neither alone is enough, and the latency budget is set by the operator. Mikshi is shaped by that constraint.

Traffic Analytics

One feed, live and post-hoc.

Live incidents, investigations, and reporting, with no re-indexing in between.

Crisis Response

From many feeds to the right moment.

Retrieval across feeds decides how fast the moment reaches the person.

Security Monitoring

Flag it. Then justify it.

Real-time anomalies returned as language an operator can audit.

Why this matters

Most footage is watched by no one. Mikshi changes that.

Most vision-language models (VLMs) were built around static images. Industrial deployments need something else: an understanding of how a scene evolves over time.

Search surfaces the moment. Analyze describes what happened. One video intelligence surface for the cameras that are already running.