Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific supervision. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M, the number of independent proposal samplings, can be tuned to the desired proposal coverage. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11), and outperforms other zero-shot approaches on object-level RBDC by over 5×. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7× more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks.
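The stratified grid representation can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: we assume the clip is split into equal temporal strata, the middle frame of each stratum is sampled, and the samples are tiled left-to-right, top-to-bottom into one image for the VLM; the exact stratification rule and grid layout used by GridVAD may differ.

```python
import numpy as np

def stratified_grid(frames, rows=2, cols=4):
    """Tile a clip into a single grid image by stratified sampling:
    split the clip into rows*cols equal temporal strata, take the
    middle frame of each stratum, and tile them in temporal order
    (left-to-right, top-to-bottom)."""
    n = len(frames)
    k = rows * cols
    # Middle frame index of each of the k temporal strata.
    idx = [(2 * i + 1) * n // (2 * k) for i in range(k)]
    tiles = [frames[j] for j in idx]
    grid_rows = [np.concatenate(tiles[r * cols:(r + 1) * cols], axis=1)
                 for r in range(rows)]
    return np.concatenate(grid_rows, axis=0)

# Toy clip: 32 "frames" of shape (8, 8, 3), pixel value = frame index.
clip = [np.full((8, 8, 3), t, dtype=np.uint8) for t in range(32)]
grid = stratified_grid(clip, rows=2, cols=4)
print(grid.shape)  # → (16, 32, 3): frames 2, 6, ..., 30 tiled 2x4
```

One grid image per VLM call is what keeps the per-clip budget fixed: the clip's temporal extent is folded into the spatial layout of a single prompt image.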
GridVAD follows a propose–ground–propagate decomposition. A VLM proposes free-form anomalies over stratified frame grids; a Self-Consistency Consolidation step suppresses hallucinations; Grounding DINO and SAM2 then localise and propagate each detection. No stage requires anomaly supervision.
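The consolidation step can be sketched as follows. This is a minimal illustration, not the paper's code: we assume proposals are free-form strings matched after simple normalisation, and that a proposal survives when it recurs in at least `min_votes` of the M independent samplings (both the matching rule and the threshold are assumptions here).

```python
from collections import Counter

def self_consistency_consolidation(samplings, min_votes=2):
    """Keep only anomaly proposals that recur across independent VLM
    samplings; proposals seen in a single sampling are treated as
    likely hallucinations and dropped.

    samplings : list of M lists of proposal strings, one list per
                independent VLM call over the same stratified grid.
    """
    votes = Counter()
    for proposals in samplings:
        # Count each distinct proposal at most once per sampling.
        for p in set(prop.strip().lower() for prop in proposals):
            votes[p] += 1
    return [p for p, v in votes.items() if v >= min_votes]

# Example: three independent samplings over one clip.
samplings = [
    ["a person riding a bicycle", "a car on the walkway"],
    ["a person riding a bicycle", "a dog"],
    ["a person riding a bicycle", "a car on the walkway"],
]
print(sorted(self_consistency_consolidation(samplings, min_votes=2)))
# → ['a car on the walkway', 'a person riding a bicycle']
```

Raising `min_votes` trades recall for precision, which matches the ablation finding that filtering improves pixel-level metrics at some cost in object-level recall.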
Each clip shows the input video alongside GridVAD's predicted pixel-level anomaly masks across both benchmarks.
Table 1. Quantitative comparison on UCSD Ped2. Baseline numbers are transcribed from the TAO paper; our results are in the last row. Bold = best, underline = second best.
| Method | Pixel-AUROC ↑ | Pixel-AP ↑ | Pixel-AUPRO ↑ | Pixel-F1 ↑ | RBDC ↑ | TBDC ↑ |
|---|---|---|---|---|---|---|
| AdaCLIP (Fully fine-tuned) | 53.06 | 4.97 | 50.66 | 11.19 | 12.3 | 15.5 |
| AnomalyCLIP (Fully fine-tuned) | 54.25 | 23.73 | 38.59 | 7.48 | 13.1 | 21.0 |
| DDAD (Fully trained) | 55.87 | 5.61 | 15.12 | 2.67 | 18.01 | 13.29 |
| SimpleNet (Fully trained) | 52.49 | 20.51 | 44.05 | 10.71 | <u>51.18</u> | 27.75 |
| DRAEM (Fully trained) | 69.58 | 30.63 | 35.78 | 10.89 | 44.26 | <u>70.64</u> |
| TAO (Partially fine-tuned) | <u>75.11</u> | **50.78** | **72.97** | **64.12** | **83.6** | **93.2** |
| AdaCLIP (Zero-shot) | 51.02 | 1.32 | 33.98 | 2.61 | 5.8 | 10.6 |
| AnomalyCLIP (Zero-shot) | 51.63 | 21.20 | 36.34 | 5.92 | 7.5 | 11.2 |
| GridVAD (Ours, Zero-shot) | **77.59** | <u>38.53</u> | <u>66.82</u> | <u>42.09</u> | 38.96 | 37.70 |
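Pixel-AUROC in Table 1 scores every pixel independently against the ground-truth anomaly masks. The sketch below is a minimal illustration of that metric, not the paper's evaluation code: it assumes binary ground-truth masks and per-pixel anomaly score maps, flattened across all frames into one dataset-level ROC curve via scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pixel_auroc(gt_masks, score_maps):
    """Dataset-level pixel AUROC: flatten every frame's ground-truth
    mask (0/1) and score map (higher = more anomalous), then compute
    a single ROC-AUC over all pixels."""
    y_true = np.concatenate([m.ravel() for m in gt_masks])
    y_score = np.concatenate([s.ravel() for s in score_maps])
    return roc_auc_score(y_true, y_score)

# Toy example: one 4x4 frame with a 2x2 anomalous region.
gt = np.zeros((4, 4), dtype=int)
gt[1:3, 1:3] = 1
pred = np.zeros((4, 4))
pred[1:3, 1:3] = 0.9   # mostly well-localised prediction
pred[1, 1] = 0.3       # one anomalous pixel scored weakly
pred[0, 0] = 0.4       # one normal pixel scored above it
print(round(pixel_auroc([gt], [pred]), 4))  # → 0.9792 (= 47/48)
```

Note that a hard binary mask (all scores 0 or 1) still yields a valid AUROC, which is how segmentation-style outputs like GridVAD's can be scored under this metric.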
@inproceedings{paper,
  title     = {GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids},
  author    = {Mohamed Eltahir and Ahmed O. Ibrahim and Obada Siralkhatim and Tabarak Abdallah and Sondos Mohamed},
  booktitle = {arXiv},
  year      = {2026},
  url       = {https://arxiv.org/abs/XXXX.XXXXX}
}