
GridVAD: Open-Set Video Anomaly Detection via
Spatial Reasoning over Stratified Frame Grids

Mohamed Eltahir1,†, Ahmed O. Ibrahim2,*, Obada Siralkhatim2,*, Tabarak Abdallah2, Sondos Mohamed3,‡
1King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia  ·  2 Independent Researcher  ·  3 National Center for Research (NCR), Khartoum, Sudan
* Equal Contribution  ·  † Corresponding Author  ·  ‡ Supervising Author
Paper Code Demo
GridVAD Pipeline
Figure 1. Overview of the GridVAD pipeline. We convert a video into M stratified frame grids and process each with a VLM to generate anomaly proposals. A semantic consolidation step filters inconsistent proposals. Each surviving anomaly description is passed to a prompt-based object detector (Grounding DINO) to localize the anomaly, and SAM2 propagates it across frames to produce temporally consistent pixel-level anomaly masks.

Abstract

Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M is chosen according to the desired number of proposals. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11), and outperforms other zero-shot approaches on object-level RBDC by over 5×. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7× more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks.


Methodology

GridVAD follows a propose–ground–propagate decomposition. A VLM proposes free-form anomalies over stratified frame grids; a Self-Consistency Consolidation step suppresses hallucinations; Grounding DINO and SAM2 then localise and propagate each detection. No stage requires anomaly supervision.
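As an illustration of this decomposition and its fixed M + 1 call budget, the control flow can be sketched as below. All function names are hypothetical stubs (not from the released code), and consolidation is simplified to exact-string matching where the paper uses a text-only LLM:

```python
from collections import Counter

def vlm_propose(grid, calls):
    """Stub for one VLM call on a grid image; returns free-form proposals."""
    calls.append("vlm")
    return grid  # in this sketch, a "grid" is just its list of proposal strings

def consolidate(all_proposals, tau, calls):
    """Self-Consistency Consolidation: keep proposals with support >= tau.
    (The real pipeline uses a text-only LLM; exact matching stands in here.)"""
    calls.append("llm")  # the single consolidation call
    support = Counter(p for sample in all_proposals for p in set(sample))
    return [p for p, s in support.items() if s >= tau]

def run_gridvad(grids, tau=2):
    """Propose over M grids, consolidate once: exactly M + 1 model calls.
    Grounding DINO and SAM2 (not modelled here) then localise and propagate."""
    calls = []
    proposals = [vlm_propose(g, calls) for g in grids]
    return consolidate(proposals, tau, calls), len(calls)

# Three independent grid samplings of the same clip:
grids = [["bicycle on walkway", "glare artefact"],
         ["bicycle on walkway"],
         ["bicycle on walkway", "odd shadow"]]
kept, n_calls = run_gridvad(grids, tau=2)  # only the recurring proposal survives
```

Genuine anomalies recur across independent samplings, so "bicycle on walkway" survives with support 3 while the two one-off proposals are filtered, and the call count is M + 1 = 4 regardless of clip length.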

01 · Representation: Stratified Grid Sampling. A clip is divided into K = 9 temporal bins. One frame per bin is sampled and tiled into a single grid image G(m); this is repeated M times with independent draws for complementary temporal coverage. (→ M grid images)

02 · Detection: Open-Set Anomaly Proposal via VLM. Each grid is passed to Qwen3-VL 30B with a structured prompt for free-form detection, description, and temporal localisation. There is no predefined category list: the VLM reasons from first principles. (→ M sets of proposals)

03 · Filtering: Self-Consistency Consolidation (SCC). Proposals from all M grids are consolidated by a text-only LLM. Only detections with support σ ≥ τ across samplings are retained: genuine anomalies recur; hallucinations do not. (→ surviving proposals)

04 · Grounding: Grounding DINO. Anchors each surviving proposal to a bounding box in the anomaly anchor frame, providing precise spatial coordinates for mask propagation. (→ bounding boxes)

05 · Propagation: SAM2. Propagates each bounding box as a dense pixel-level mask through the SCC-refined temporal window, producing temporally consistent segmentation across all anomaly frames.

Output: pixel-level anomaly masks. Temporally consistent dense segmentation with no domain training required. Fixed VLM budget: M + 1 calls per clip; 2.7× more call-efficient than per-frame querying.
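As a concrete sketch of the stratified sampling step (array shapes, function name, and bin-edge convention are our assumptions, not the authors' implementation), sampling one frame per temporal bin and tiling K = 9 picks into a 3×3 grid might look like:

```python
import numpy as np

def stratified_grid(frames, K=9, rng=None):
    """Sample one frame per temporal bin and tile the K picks into a
    sqrt(K) x sqrt(K) grid image (K = 9 -> 3x3, as in the paper).

    frames: array of shape (T, H, W, C). Hypothetical sketch only."""
    rng = rng or np.random.default_rng()
    T = frames.shape[0]
    edges = np.linspace(0, T, K + 1).astype(int)       # K temporal bins
    idx = [int(rng.integers(edges[i], max(edges[i] + 1, edges[i + 1])))
           for i in range(K)]                           # one draw per bin
    tiles = frames[idx]                                 # (K, H, W, C)
    r = int(np.sqrt(K))
    H, W = tiles.shape[1], tiles.shape[2]
    grid = (tiles.reshape(r, r, H, W, -1)
                 .swapaxes(1, 2)                        # (r, H, r, W, C)
                 .reshape(r * H, r * W, -1))
    return grid, idx

# M independent draws give M complementary grid images:
# grids = [stratified_grid(frames)[0] for _ in range(M)]
```

Because each draw is independent but stratified, every grid covers the whole clip while different draws pick different frames within each bin, which is what makes the recurrence test in SCC informative.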

Experimental Results

Qualitative Results

Each clip shows the input video alongside GridVAD's predicted pixel-level anomaly masks across both benchmarks.

Qualitative Analysis & Benchmarking

GridVAD detects non-pedestrian objects and unusual behaviours via zero-shot reasoning. Visualizing Transparency: These results showcase successful detections alongside False Negatives (missed anomalies) and False Positives (normal events incorrectly flagged as anomalous). By comparing predicted masks against the Ground Truth (GT), we highlight the model's precision as well as its limitations.


Quantitative Evaluation

Table 1. Quantitative comparison on UCSD Ped2. Baseline numbers are transcribed from the TAO paper; our results are in the last row. Bold = best, italic = second best.

| Method | Pixel-AUROC ↑ | Pixel-AP ↑ | Pixel-AUPRO ↑ | Pixel-F1 ↑ | RBDC ↑ | TBDC ↑ |
|---|---|---|---|---|---|---|
| AdaCLIP (Fully fine-tuned) | 53.06 | 4.97 | 50.66 | 11.19 | 12.3 | 15.5 |
| AnomalyCLIP (Fully fine-tuned) | 54.25 | 23.73 | 38.59 | 7.48 | 13.1 | 21.0 |
| DDAD (Fully trained) | 55.87 | 5.61 | 15.12 | 2.67 | 18.01 | 13.29 |
| SimpleNet (Fully trained) | 52.49 | 20.51 | 44.05 | 10.71 | *51.18* | 27.75 |
| DRAEM (Fully trained) | 69.58 | 30.63 | 35.78 | 10.89 | 44.26 | *70.64* |
| TAO (Partially fine-tuned) | *75.11* | **50.78** | **72.97** | **64.12** | **83.6** | **93.2** |
| AdaCLIP (Zero-shot) | 51.02 | 1.32 | 33.98 | 2.61 | 5.8 | 10.6 |
| AnomalyCLIP (Zero-shot) | 51.63 | 21.20 | 36.34 | 5.92 | 7.5 | 11.2 |
| GridVAD (Ours, Zero-shot) | **77.59** | *38.53* | *66.82* | *42.09* | 38.96 | 37.70 |
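The pixel-level metrics in Table 1 reduce to standard binary-classification quantities over flattened masks. As a hedged illustration (the function and setup are ours, not the evaluation tooling behind the table), Pixel-AUROC can be computed with the rank (Mann-Whitney) formulation:

```python
import numpy as np

def pixel_auroc(scores, gt):
    """Pixel-AUROC via the rank (Mann-Whitney U) formulation.

    scores: per-pixel anomaly scores; gt: binary ground-truth masks.
    Minimal sketch; ties receive their average rank."""
    s = np.ravel(scores).astype(float)
    y = np.ravel(gt).astype(bool)
    ranks = np.empty(s.size)
    ranks[np.argsort(s, kind="stable")] = np.arange(1, s.size + 1)
    for v in np.unique(s):              # average ranks across tied scores
        tie = s == v
        if tie.sum() > 1:
            ranks[tie] = ranks[tie].mean()
    n_pos, n_neg = y.sum(), (~y).sum()
    return (ranks[y].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Perfectly separated scores give 1.0 and uninformative scores give 0.5, so GridVAD's 77.59 reads as the fraction of (anomalous, normal) pixel pairs ranked in the correct order.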

Citation

BibTeX

@inproceedings{paper,
  title     = {GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids},
  author    = {Eltahir, Mohamed and Ibrahim, Ahmed O. and Siralkhatim, Obada and Abdallah, Tabarak and Mohamed, Sondos},
  booktitle = {arXiv},
  year      = {2026},
  url       = {https://arxiv.org/abs/XXXX.XXXXX}
}