Comparison of text-based methods for detecting duplication in document image databases

Daniel P. Lopresti

doi:10.1117/12.373496

22 December 1999

Proceedings Volume 3967, Document Recognition and Retrieval VII; (1999) https://doi.org/10.1117/12.373496
Event: Electronic Imaging, 2000, San Jose, CA, United States

Abstract

This paper presents an experimental evaluation of several text-based methods for detecting duplication in document image databases using uncorrected OCR output. This task is challenging because of both the wide range of degradations printed documents can suffer, and conflicting interpretations of what it means to be a 'duplicate.' We report results for five sets of experiments exploring various aspects of the problem space. While the techniques studied are generally robust in the face of most types of OCR errors, there are nonetheless important differences which we identify and discuss in detail.

Citation

Daniel P. Lopresti "Comparison of text-based methods for detecting duplication in document image databases", Proc. SPIE 3967, Document Recognition and Retrieval VII, (22 December 1999); https://doi.org/10.1117/12.373496
