This work provides an open-source method for extracting relevant information from scanned documents such as bills, bank statements, and invoices. The solution supports documents in 10 different languages and extracts data irrespective of template or structure. Existing solutions based on OpenCV and deep learning are available, but none provides a generic, high-accuracy solution with support for multiple languages. The proposed method first identifies the language of the input document using a pre-trained fastText model. The document is then segmented into text regions using the Run Length Smoothing Algorithm (RLSA). The RLSA output is passed through a custom pattern recognition algorithm that keeps only the regions likely to contain data relevant to invoices or account statements. The filtered segments are passed to the Tesseract OCR module for raw text extraction. Based on the identified language, the extracted raw text is mapped against language-specific entity libraries, and the final key-value pairs are stored in JSON or CSV files. When tested on more than 1000 documents, the proposed solution achieved an average accuracy of 90.27% across all supported languages.
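The segmentation step can be illustrated with a minimal sketch of RLSA in its commonly described form. This is not the authors' implementation: the function names, the convention that 1 marks an ink pixel and 0 marks background, the choice to fill only gaps bounded by ink on both sides, and the threshold values are all assumptions made for illustration. The sketch smooths the binarized page horizontally and vertically, then combines the two results with a logical AND so that nearby characters merge into block-like text regions.

```python
from typing import List

Image = List[List[int]]  # assumed convention: 1 = ink pixel, 0 = background


def rlsa_row(row: List[int], threshold: int) -> List[int]:
    """Fill background gaps of length <= threshold lying between two ink pixels."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, px in enumerate(row):
        if px == 1:
            if last_ink is not None:
                gap = i - last_ink - 1
                if 0 < gap <= threshold:
                    for j in range(last_ink + 1, i):
                        out[j] = 1
            last_ink = i
    return out


def rlsa_horizontal(image: Image, threshold: int) -> Image:
    """Apply run-length smoothing to every row."""
    return [rlsa_row(r, threshold) for r in image]


def rlsa_vertical(image: Image, threshold: int) -> Image:
    """Apply run-length smoothing to every column via transposition."""
    columns = [list(c) for c in zip(*image)]
    smoothed = [rlsa_row(c, threshold) for c in columns]
    return [list(r) for r in zip(*smoothed)]


def rlsa(image: Image, h_threshold: int, v_threshold: int) -> Image:
    """Combine horizontal and vertical smoothing with a pixel-wise AND."""
    horizontal = rlsa_horizontal(image, h_threshold)
    vertical = rlsa_vertical(image, v_threshold)
    return [[a & b for a, b in zip(hr, vr)] for hr, vr in zip(horizontal, vertical)]


if __name__ == "__main__":
    # Two "words" separated by a wide gap; letters inside each word are close.
    line = [[1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1]]
    print(rlsa_horizontal(line, threshold=2))
```

In practice the thresholds would be tuned to the scan resolution and font size, and the connected components of the smoothed image would then serve as candidate text regions for the filtering and OCR stages.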