WorkflowVision AI

An applied AI system that automatically verifies industrial processes against Standard Operating Procedures (SOPs) using video analysis. Given a recorded video and a SOP, the system identifies each step, matches it to the procedure, and produces a timestamped compliance report.

The pipeline combines three AI models: NVIDIA Cosmos Reason 2 and X-CLIP (Microsoft Research) run in parallel as complementary vision-language models (VLMs) - one for language-grounded visual reasoning, one for zero-shot video-text similarity ranking - and cross-check each other to improve accuracy. Gemma 2 then handles structured SOP matching from the vision models' outputs.

Built with FastAPI, served on RunPod GPU instances, and integrated via OpenAI-compatible NVIDIA NIM endpoints.

Technologies: Python, FastAPI, OpenCV, X-CLIP, NVIDIA Cosmos Reason 2, Gemma 2, NVIDIA NIM, RunPod, httpx

Cosmos Reason 2 X-CLIP Gemma 2

Want to kick off a new project or
get your derailed project back on track & schedule?
Tell us more