PervFormer
May 2026
For years, the computer vision community has debated a fundamental trade-off: accuracy on long video benchmarks versus compute and memory cost.
| Model | Something-Something V2 (Accuracy) | Kinetics-700 (GFLOPs) | GPU Memory (128 frames) |
| :--- | :--- | :--- | :--- |
| TimeSformer | 62.5% | 1,900 | 42 GB |
| VideoMAE | 70.8% | 2,100 | OOM (>80 GB) |
| PervFormer | 74.2% | 980 | 23 GB |
```python
import torch
import torch.nn as nn

class PervasiveAttention(nn.Module):
    def __init__(self, dim, num_probes=64):
        super().__init__()
        self.num_probes = num_probes
        # Learnable latent probes (global memory)
        self.probes = nn.Parameter(torch.randn(1, num_probes, dim))
```
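To make the probe mechanism concrete, here is a minimal sketch of how learnable probes could interact with frame tokens via cross-attention. Everything beyond the `probes` parameter above is an illustrative assumption: the class name `PervasiveAttentionSketch`, the `num_heads` parameter, the read/write structure, and the use of `nn.MultiheadAttention` are not taken from PervFormer's published implementation.

```python
import torch
import torch.nn as nn

class PervasiveAttentionSketch(nn.Module):
    """Hypothetical probe-based attention: not the official PervFormer code."""

    def __init__(self, dim, num_probes=64, num_heads=8):
        super().__init__()
        # Learnable latent probes acting as a small global memory.
        self.probes = nn.Parameter(torch.randn(1, num_probes, dim))
        # Read: probes gather context from all frame tokens.
        self.read_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Write-back: frame tokens query the compressed probe memory.
        self.write_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len, dim)
        batch = tokens.size(0)
        probes = self.probes.expand(batch, -1, -1)
        # Each pass costs O(seq_len * num_probes), avoiding O(seq_len ** 2).
        probes, _ = self.read_attn(probes, tokens, tokens)
        out, _ = self.write_attn(tokens, probes, probes)
        return out

# Usage: 4096 frame tokens of width 256, compressed through 64 probes.
x = torch.randn(2, 4096, 256)
y = PervasiveAttentionSketch(dim=256)(x)  # -> (2, 4096, 256)
```

Because the probe count is fixed, both attention passes scale linearly in the number of frame tokens, which would be consistent with the flat memory footprint reported in the table above.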
A robot navigating a warehouse doesn't need to remember every pixel from 10 seconds ago. It needs to remember that a forklift moved a pallet (semantic) and that the path is now clear (spatial). PervFormer's memory probes act as a working memory, drastically reducing drift in SLAM-based systems.