Comparison of text-based methods for detecting duplication in document image databases
22 December 1999
Proceedings Volume 3967, Document Recognition and Retrieval VII; (1999) https://doi.org/10.1117/12.373496
Event: Electronic Imaging, 2000, San Jose, CA, United States
Abstract
This paper presents an experimental evaluation of several text-based methods for detecting duplication in document image databases using uncorrected OCR output. The task is challenging both because printed documents can suffer a wide range of degradations and because there are conflicting interpretations of what it means to be a 'duplicate.' We report results for five sets of experiments exploring various aspects of the problem space. While the techniques studied are generally robust in the face of most types of OCR errors, there are nonetheless important differences, which we identify and discuss in detail.
© (1999) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Daniel P. Lopresti "Comparison of text-based methods for detecting duplication in document image databases", Proc. SPIE 3967, Document Recognition and Retrieval VII, (22 December 1999); https://doi.org/10.1117/12.373496
KEYWORDS
Databases
Optical character recognition
Vector spaces
Chlorine
Algorithm development
Detection and tracking algorithms
Data modeling