Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets

The paper “Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets” by Yeshwanth Kumar Adimoolam, Charalambos Poullis, and Melinos Averkiou will appear in IEEE/CVF Computer Vision and Pattern Recognition (CVPR), 2026.

TL;DR: The AICrowd Mapping Challenge dataset has two serious problems: roughly 90% of its training images are duplicates, and about 93% of its validation images also appear in the training split. The paper documents these issues and proposes a perceptual-hashing pipeline for detecting duplicates and leakage in building footprint datasets before training.

The paper analyzes three popular geospatial datasets used for building footprint extraction (INRIA, SpaceNet 2, and AICrowd Mapping Challenge) and finds that while INRIA and SpaceNet 2 are essentially clean, AICrowd is severely compromised. Roughly 89% of its ~280k training images are duplicates (exact or augmented), and about 93% of validation images also appear in the training split, which constitutes massive data leakage. After deduplication and leakage removal, the training set shrinks from 280,741 images to just 15,392 unique, non-leaked images.

This contamination has real consequences: methods such as HiSup and PolyWorld, benchmarked on AICrowd, show inflated metrics and even reproduce incorrect ground truth annotations, a clear sign of memorization rather than generalization.

The detection pipeline itself is lightweight. It uses perceptual hashing (pHash) with a 64-bit hash and a Hamming distance threshold of 0, meaning two images are flagged as duplicates only when their hashes are identical, and it runs at ~4 ms per image on a single CPU. Compared to average hashing, pHash proves more robust, with near-perfect precision and recall on a controlled benchmark. The takeaway is that AICrowd, as commonly used, is unsuitable for fair benchmarking without cleaning, and the pipeline offers a simple, efficient way to audit any large-scale image dataset for these issues.
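Since the source code is not yet released, here is a minimal sketch of how a pHash-based audit of this kind could look. It is not the authors' implementation. It assumes each image has already been converted to grayscale and resized to a small square matrix (32x32 is a common choice in pHash implementations), and it thresholds the 8x8 low-frequency DCT block at its median, which is also a common convention rather than something the post specifies. The names `dct_low`, `phash64`, `audit`, and `random_image` are hypothetical, and the synthetic images stand in for real tiles.

```python
import math
import random
import statistics
from collections import defaultdict


def dct_low(v, keep):
    """First `keep` coefficients of an unnormalized 1-D DCT-II of v."""
    n = len(v)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(v))
        for k in range(keep)
    ]


def phash64(pixels, side=8):
    """64-bit perceptual hash of a square grayscale matrix (assumed 32x32)."""
    rows = [dct_low(r, side) for r in pixels]            # DCT along each row
    block = [dct_low(col, side) for col in zip(*rows)]   # then along each column
    coeffs = [c for line in block for c in line]         # 8x8 low-frequency block
    med = statistics.median(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (c > med)                   # one bit per coefficient
    return bits


def audit(train, val):
    """Return (redundant training ids, leaked validation ids).

    A Hamming distance threshold of 0 means hashes must be equal, so a
    dictionary lookup replaces any pairwise comparison.
    """
    groups = defaultdict(list)
    for name, px in train.items():
        groups[phash64(px)].append(name)
    # Keep the first image of every hash group; the rest are redundant.
    redundant = sorted(n for ids in groups.values() for n in ids[1:])
    leaked = sorted(n for n, px in val.items() if phash64(px) in groups)
    return redundant, leaked


def random_image(seed, size=32):
    """Synthetic stand-in for a preprocessed grayscale tile."""
    rng = random.Random(seed)
    return [[rng.random() * 255 for _ in range(size)] for _ in range(size)]


a, b = random_image(1), random_image(2)
train = {"t0": a, "t1": [row[:] for row in a], "t2": b}   # t1 duplicates t0
val = {"v0": [row[:] for row in b], "v1": random_image(3)}  # v0 leaks from t2
redundant, leaked = audit(train, val)
print("redundant training images:", redundant)
print("leaked validation images:", leaked)
```

Because matching is by exact hash equality, grouping by hash keeps the audit linear in the number of images. A nonzero Hamming threshold would need a nearest-neighbor search over hashes instead of a plain lookup.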

Research paper: Coming soon
Project page: https://datainspector.app/
Source code: Coming soon