A Distractor-Aware Memory (DAM) for
Visual Object Tracking with SAM2

Faculty of Computer and Information Science, University of Ljubljana
*Equal Contribution

SAM2.1++ tracking examples

Abstract

Memory-based trackers such as SAM2 demonstrate remarkable performance but still struggle with distractors. We propose a new plug-in distractor-aware memory (DAM) and management strategy that substantially improve tracking robustness. The new model is demonstrated on SAM2.1, leading to SAM2.1++, which sets a new state of the art on six benchmarks, including the most challenging VOT/S benchmarks, without additional training. We also propose a new distractor-distilled (DiDi) dataset to better study the distractor problem.

DiDi: A distractor-distilled dataset

DiDi is a distractor-distilled tracking dataset created to address the scarcity of distractors in current visual object tracking benchmarks. To support the evaluation and analysis of tracking performance amidst distractors, we semi-automatically distilled several existing benchmarks into the DiDi dataset. The dataset is available for download.

Example frames from the DiDi dataset showing distractors. Targets are denoted by green bounding boxes.

SOTA comparison on DiDi

Model                Quality     Accuracy    Robustness
TransT               0.465       0.669       0.678
KeepTrack            0.502       0.646       0.748
SeqTrack             0.529       0.714       0.718
AQATrack             0.535       0.693       0.753
AOT                  0.541       0.622       0.852
Cutie                0.575       0.704       0.776
ODTrack              0.608       0.740 🥇    0.809
SAM2.1Long           0.646       0.719       0.883
SAM2.1 (baseline)    0.649 🥉    0.720       0.887 🥉
SAMURAI              0.680 🥈    0.722 🥉    0.930 🥈
SAM2.1++ (ours)      0.694 🥇    0.727 🥈    0.944 🥇
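
The medal rankings in the table can be re-derived from the raw scores, and the gain of SAM2.1++ over the SAM2.1 baseline can be expressed as a relative improvement. The snippet below is a minimal sketch for readers who want to check the numbers. The scores are copied from the table, while the helper names `top_k` and `relative_gain` are illustrative assumptions and not part of any released toolkit.

```python
# Scores copied from the DiDi comparison table: (quality, accuracy, robustness).
RESULTS = {
    "TransT": (0.465, 0.669, 0.678),
    "KeepTrack": (0.502, 0.646, 0.748),
    "SeqTrack": (0.529, 0.714, 0.718),
    "AQATrack": (0.535, 0.693, 0.753),
    "AOT": (0.541, 0.622, 0.852),
    "Cutie": (0.575, 0.704, 0.776),
    "ODTrack": (0.608, 0.740, 0.809),
    "SAM2.1Long": (0.646, 0.719, 0.883),
    "SAM2.1 (baseline)": (0.649, 0.720, 0.887),
    "SAMURAI": (0.680, 0.722, 0.930),
    "SAM2.1++ (ours)": (0.694, 0.727, 0.944),
}
METRICS = ("quality", "accuracy", "robustness")


def top_k(metric, k=3):
    """Return the names of the k best models for a metric (hypothetical helper)."""
    idx = METRICS.index(metric)
    ranked = sorted(RESULTS.items(), key=lambda item: item[1][idx], reverse=True)
    return [name for name, _ in ranked[:k]]


def relative_gain(model, baseline, metric):
    """Relative improvement of `model` over `baseline` (hypothetical helper)."""
    idx = METRICS.index(metric)
    base = RESULTS[baseline][idx]
    return (RESULTS[model][idx] - base) / base


if __name__ == "__main__":
    for metric in METRICS:
        print(metric, top_k(metric))
    for metric in METRICS:
        gain = relative_gain("SAM2.1++ (ours)", "SAM2.1 (baseline)", metric)
        print(f"{metric}: {gain:+.1%} vs. baseline")
```

On these numbers, the largest relative gains over the baseline are in quality and robustness, while accuracy changes only marginally, which is in line with the claim that the distractor-aware memory mainly improves robustness.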

Qualitative comparison

Tracking fast-moving objects amidst distractors
Long-term tracking amidst distractors, with frequent target occlusions
Target redetection after occlusion amidst distractors
Tracking over very long duration and appearance changes