Indexing and Retrieval of Low Quality Handwritten Documents

Title Indexing and Retrieval of Low Quality Handwritten Documents
Author Huaigu Cao
Publisher
Pages 101
Release 2008
Genre
ISBN

Decades of development in document analysis and recognition techniques have made it possible to convert large numbers of documents into electronic formats and store them on computers. In recent years, advances in information retrieval have provided a powerful tool for prompt access to the information contained in those documents. Inspired by the success of applications in these two areas, in this thesis we investigate methods that aim to improve the performance of retrieving handwritten document images. Unlike the retrieval of machine-printed documents, for which very high OCR accuracy can be expected, the retrieval of handwritten document images is more challenging because of document analysis and recognition errors. In existing methods for retrieving handwritten document images, the index is usually built on the text collected from the top-n (n > 1) candidates returned by a word recognizer. Different weights may be applied to the candidates according to their ranks. Effective as these primitive methods are, they assume flawless word segmentation and isolated word recognition; as a result, they are vulnerable to word segmentation errors and cannot take advantage of the language model, which has become a standard component of state-of-the-art handwriting recognition systems. However, incorporating word segmentation scores (probabilities) and a language model into any existing indexing technique generally increases the complexity of the problem. In our indexing method, we solved this problem by separating the term counts from standard IR models, estimating them at the word sequence level, and plugging them back into the IR models. A fast algorithm using dynamic programming was proposed to reduce the time complexity. In addition to the application in document retrieval, we also used the word segmentation information in keyword retrieval.
In another major contribution of this thesis, we applied Markov random field (MRF) modeling to the binarization problem. The MRF can precisely describe the constraint of local smoothness in the image. We can also use the smoothness constraint to remove the grid from a form image, which is a very useful application in form image preprocessing. In effect, this research addresses a general topic in the preprocessing of degraded handwritten document images. Applications in both handwriting recognition and handwritten document image retrieval can benefit from our approach.
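To make the baseline described in this abstract concrete, here is a minimal sketch of indexing on the top-n candidates of a word recognizer: each candidate contributes its normalized score to an expected term count, which could then be plugged into a standard IR model in place of an exact count. This is an illustration only, not the thesis's method. It treats every word image independently and ignores the segmentation scores and language model that the thesis incorporates. The function name, the input layout, and the toy scores are all assumptions.

```python
from collections import defaultdict


def expected_term_counts(word_slots):
    """Sum normalized candidate scores per term over all word images.

    word_slots: one inner list per segmented word image, holding
    (candidate_word, score) pairs from a recognizer's top-n list.
    Scores are normalized per word image so they act like posterior
    probabilities (an assumption of this sketch).
    """
    counts = defaultdict(float)
    for candidates in word_slots:
        total = sum(score for _, score in candidates)
        if total <= 0:
            continue  # skip word images with no usable scores
        for word, score in candidates:
            counts[word] += score / total
    return dict(counts)


# Hypothetical recognizer output for a three-word line.
slots = [
    [("the", 0.9), ("she", 0.1)],
    [("cat", 0.6), ("cut", 0.3), ("oat", 0.1)],
    [("the", 0.5), ("tho", 0.5)],
]
counts = expected_term_counts(slots)
print(counts)
```

Because each word image distributes a total weight of one across its candidates, the expected counts sum to the number of word images, so they remain comparable to ordinary term counts.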

Content-based Handwritten Document Indexing and Retrieval

Title Content-based Handwritten Document Indexing and Retrieval
Author
Publisher
Pages 121
Release 2008
Genre
ISBN

Information retrieval on textual data has been well studied and its applications (such as web searching) have become ubiquitous in our daily lives. However, content-based image retrieval on handwritten document collections still remains a challenging problem. Here "content-based" means that the search will analyze the actual content of the images, instead of merely the metadata. In the context of handwritten documents, the word "content" might refer to different things, such as writing style, the shape of words and characters, or the truth of the writing. Accordingly, two different types of retrieval can be performed: "query by example" and semantic (or "query by text") retrieval. While both have their own applications in the real world, the second is more intuitive and user-friendly, since it uses not only the low-level underlying computational features, but also an understanding of the documents. This work explores several automatic techniques for both types of retrieval on handwritten document collections. These techniques are three-fold: (i) indexing, (ii) "query by example" retrieval and (iii) "query by text" retrieval. For indexing, we focus on the problems of word segmentation and transcript mapping. Word segmentation is the task of segmenting text line images into word images, which is one of the most important preprocessing steps for any word-level analysis or recognition. We propose the use of a neural network with a new set of global and local features to classify gaps as inter-word or intra-word. The transcript mapping problem is an alignment problem between the handwritten document image and its transcript. It is not a trivial task, simply because the word segmentation algorithm is error prone. A recognition-based dynamic programming algorithm is proposed to solve this problem. It is also shown to improve the accuracy of automatic word segmentation.
In "query by example" retrieval, the query can be either a full-page document or a single word image. For document-level retrieval, a statistical model is learned to determine whether the writing styles of two documents are similar or not. Gamma and Gaussian distributions are used for the modeling. Word-level retrieval is performed by a feature-based similarity search algorithm. For each word image, a 1024-bit binary feature vector is extracted for this purpose. "Query by text" retrieval is a more challenging task because word-level segmentation is error prone and word recognition with a large lexicon is still an unsolved problem. The current solution to this problem is to manually annotate the collection, which is costly. Borrowing an idea from machine translation in textual information retrieval, we propose a statistical approach to word recognition and use the probabilistic annotation results to do language model retrieval on handwritten documents. The performance of all these approaches is empirically compared on several test collections. The main contributions of this work are a detailed examination of different levels of content-based image retrieval for handwritten documents, and the development of a retrieval system that allows either image or text queries. The new word segmentation method shows improved performance over a previous method and is useful in forensic document analysis. In addition, a large handwriting database of 3824 pages (about 573,600 labeled words) was created using the proposed transcript-mapping algorithm. This database was used predominantly in this dissertation and serves as a useful resource for future handwriting analysis and recognition research.
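As an illustration of word-level "query by example" search over 1024-bit binary feature vectors, the sketch below ranks indexed word images by Hamming distance to the query vector. The abstract does not say which similarity measure the dissertation uses, so Hamming distance is an assumption here, as are the function names, the image labels, and the toy vectors. Each 1024-bit vector is stored as a Python integer.

```python
N_BITS = 1024  # feature vector length stated in the abstract


def hamming_distance(a, b):
    """Number of bit positions in which two feature vectors differ."""
    return bin(a ^ b).count("1")


def nearest_words(query, index, k=3):
    """Return the k indexed word images closest to the query.

    index maps a word-image label to its feature vector (an int).
    The result is a list of (label, distance) pairs, nearest first.
    """
    scored = [(label, hamming_distance(query, bits)) for label, bits in index.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy vectors, hypothetical: real features would come from the word images.
all_ones = (1 << N_BITS) - 1
index = {
    "img_a": all_ones,           # every bit set
    "img_b": all_ones ^ 0b1111,  # differs from img_a in 4 bits
    "img_c": 0,                  # no bit set
}
query = all_ones ^ 0b1           # differs from img_a in 1 bit
results = nearest_words(query, index, k=2)
print(results)
```

A linear scan like this is enough for small collections; larger ones would need an approximate or partitioned search, which is outside the scope of this sketch.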

Probabilistic Indexing for Information Search and Retrieval in Large Collections of Handwritten Text Images

Title Probabilistic Indexing for Information Search and Retrieval in Large Collections of Handwritten Text Images
Author
Publisher Springer Nature
Pages 372
Release 2024
Genre Automatic indexing
ISBN 3031553896

This book provides a comprehensive presentation of a recently introduced framework, named "probabilistic indexing" (PrIx), for searching text in large collections of document images and other related applications. It fosters the development of new search engines for effective information retrieval from manuscripts which, however, lack the electronic text (transcripts) that would typically be required for such search and retrieval tasks. The book is structured into 11 chapters and three appendices. The first two chapters briefly outline the necessary fundamentals and state of the art in pattern recognition, statistical decision theory, and handwritten text recognition. Chapter 3 presents approaches for indexing (as opposed to spotting) each region of a handwritten text image which is likely to contain a word. Next, Chapter 4 describes models adopted for handwritten text in images, namely hidden Markov models, convolutional and recurrent neural networks and language models, and provides full details of weighted finite-state transducer (WFST) concepts and methods, needed in further chapters of the book. Chapter 5 explains the set of techniques and algorithms developed to generate image probabilistic indexes which allow for fast search and retrieval of textual information in the indexed images. Chapter 6 then presents experimental evaluations of the proposed framework and algorithms on different traditional benchmark datasets and compares them with other approaches, while Chapter 7 reviews the most popular keyword-spotting approaches. Chapter 8 explains how PrIx can support classical free-text search tools, while Chapter 9 presents new methods that use PrIx not only for searching, but also to deal with text analytics and other related natural language processing and information extraction tasks. 
Chapter 10 shows how the proposed solutions can be used to effectively index very large collections of handwritten document images, before Chapter 11 finally summarizes the book and suggests promising lines of future research. The appendices detail the necessary mathematical foundations for the work and present details of the text image collections and datasets used in the experiments throughout the book. This book is written for researchers and (post-)graduate students in pattern recognition and information retrieval. It will also be of interest to people in areas like history, criminology, or psychology who need technical support to evaluate, understand or decode historical or contemporary handwritten text.
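The general idea of searching untranscribed images through a probabilistic index can be sketched as follows: each entry records a word hypothesis, the image region where it may appear, and a probability that it does, and a query returns the entries above a chosen confidence threshold. This is a loose illustration under stated assumptions, not the PrIx data structures or algorithms from the book. The class name, field layout, page identifiers, and probabilities are all hypothetical.

```python
from collections import defaultdict


class ProbabilisticIndex:
    """Toy word-to-region index with a probability per entry."""

    def __init__(self):
        # word -> list of (probability, page_id, bounding_box)
        self._postings = defaultdict(list)

    def add(self, word, page_id, box, prob):
        """Record that `word` may appear in region `box` of `page_id`."""
        self._postings[word.lower()].append((prob, page_id, box))

    def search(self, word, threshold=0.5):
        """Return entries for `word` with probability >= threshold,
        most probable first."""
        hits = [p for p in self._postings.get(word.lower(), []) if p[0] >= threshold]
        return sorted(hits, key=lambda p: p[0], reverse=True)


index = ProbabilisticIndex()
# The same region can carry several competing word hypotheses.
index.add("river", "page_12", (40, 110, 95, 30), 0.92)
index.add("rider", "page_12", (40, 110, 95, 30), 0.06)
index.add("river", "page_31", (12, 300, 90, 28), 0.35)

loose = index.search("river", threshold=0.3)
strict = index.search("River", threshold=0.5)
print(loose)
print(strict)
```

Lowering the threshold trades precision for recall: the loose query also returns the uncertain hit on the second page, while the strict query keeps only the confident one.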

Advances In Digital Document Processing And Retrieval

Title Advances In Digital Document Processing And Retrieval
Author Bidyut Baran Chaudhuri
Publisher World Scientific
Pages 334
Release 2013-11-20
Genre Computers
ISBN 9814583898

The participation of researchers in the most important international conferences in the field shows that activity in automatic document processing has been growing continuously. This book is an edited volume on digital document processing whose chapters are written by several internationally renowned researchers in the domain. It will be useful for both students and researchers working on various aspects of document image analysis and recognition problems. It contains chapters on topics that are not covered by any textbook: some are more futuristic, like “Going beyond the Myth of Paperlessness”, and others address interesting application areas, like “The Role of Document Image Analysis in Trustworthy Elections” and “Word Recognition for Museum Index Cards with SNT-Grid”. People developing document analysis software for industry may also find the chapters useful and attractive. The language of the chapters is simple and clear, with drawings and diagrams wherever necessary. An adequate number of references is given at the end of each chapter. Overall, the book is highly readable and will be an asset to the community. Renowned contributors include George Nagy, Hiromichi Fujisawa, F Kimura, D Lopresti, Chew Lim Tan, S Uchida, Thierry Paquet, Laurent Heutte, V Govindaraju, R Manmatha.

Statistical Techniques For Efficient Indexing And Retrieval Of Document Images

Title Statistical Techniques For Efficient Indexing And Retrieval Of Document Images
Author Anurag Bhardwaj
Publisher
Pages 144
Release 2010
Genre
ISBN

We have developed statistical techniques to improve the performance of document image search systems where the intermediate step of OCR-based transcription is not used. Previous research in this area has largely focused on challenges pertaining to the generation of small lexicons for processing handwritten documents and the enhancement of poor-quality document images. However, in practice one must deal with several additional challenges, such as processing multilingual documents which are predominantly in non-Latin scripts. In this dissertation we have developed script-independent and content-based retrieval techniques to access document images from multilingual digital libraries containing both printed and unconstrained handwritten documents. Our work advances the state of the art in retrieval of Indic documents. Our two-fold solution involves keyword spotting for scripts with existing OCR solutions and a semi-supervised recognition-free approach when an OCR option is unavailable. We have also designed a novel framework for content-based retrieval of handwritten documents that captures the stylistic properties of handwriting. This framework is adapted from the Latent Dirichlet Allocation (LDA) model for handwriting to learn the latent handwriting styles (e.g., cursive, loopy) present in a given corpus without any manual annotations or grammar. We have successfully applied this style modeling technique to the forensic document analysis task of writer identification. Finally, we have extended the idea of content-based retrieval of historical documents by formulating, for the first time, the problem of temporal indexing and retrieval of such manuscripts. We use a novel subspace learning technique for estimating the age of a scanned document image and apply it to retrieve other documents of similar age in the collection.
The proposed subspace learning technique (hGLRAM) is based on a globally as well as locally optimized hierarchical generalized low rank approximation of matrices (GLRAM) that learns a tree-based low-dimensional representation of document images for robust modeling of aging patterns. The methods developed in this dissertation have been validated on publicly available datasets: handwritten documents from the IAM database, George Washington's letters dataset, and printed document datasets available from the Google Book Project and the Million Book Project. The accuracy of our methods is significantly superior to results reported in the literature.

Indexing and Retrieval of Non-Text Information

Title Indexing and Retrieval of Non-Text Information
Author Diane Rasmussen Neal
Publisher Walter de Gruyter
Pages 440
Release 2012-10-30
Genre Language Arts & Disciplines
ISBN 3110260581

This volume encompasses a collection of research papers related to the indexing and retrieval of online non-text information. In recent years, the Internet has seen an exponential increase in the number of documents placed online that are not in textual format. These documents appear in a variety of contexts, such as user-generated content sharing websites and social networking websites, and in a variety of formats, including photographs, videos, recorded music, and data visualizations. The prevalence of these contexts and data formats presents a particularly challenging task for information indexing and retrieval research because of many difficulties, such as assigning suitable semantic metadata, processing and extracting non-textual content automatically, and designing retrieval systems that "speak in the native language" of non-text documents.

Dissertation Abstracts International

Title Dissertation Abstracts International
Author
Publisher
Pages 840
Release 2009
Genre Dissertations, Academic
ISBN
