Content-based Handwritten Document Indexing and Retrieval

Title Content-based Handwritten Document Indexing and Retrieval PDF eBook
Author
Publisher
Pages 121
Release 2008
Genre
ISBN

Information retrieval on textual data has been well studied, and its applications (such as web searching) have become ubiquitous in our daily lives. However, content-based image retrieval on handwritten document collections remains a challenging problem. Here "content-based" means that the search analyzes the actual content of the images rather than merely the metadata. In the context of handwritten documents, the word "content" might refer to different things, such as writing style, the shape of words and characters, or the truth of the writing. Accordingly, two different types of retrieval can be performed: "query by example" and semantic (or "query by text") retrieval. While both have their own real-world applications, the second is more intuitive and user-friendly, since it uses not only the low-level computational features but also an understanding of the documents.
This work explores several automatic techniques for both types of retrieval on handwritten document collections. These techniques are three-fold: (i) indexing, (ii) "query by example" retrieval and (iii) "query by text" retrieval. For indexing, we focus on the problems of word segmentation and transcript mapping. Word segmentation is the task of segmenting text line images into word images, one of the most important preprocessing steps for any word-level analysis or recognition. We propose the use of a neural network with a new set of global and local features to classify gaps as inter-word or intra-word. The transcript mapping problem is an alignment problem between the handwritten document image and its transcript. It is not a trivial task, simply because the word segmentation algorithm is error prone. A recognition-based dynamic programming algorithm is proposed to solve this problem. It is also shown to improve the accuracy of automatic word segmentation.
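The transcript-mapping idea above can be illustrated with a small dynamic-programming alignment. This is only a sketch under stated assumptions, not the dissertation's algorithm: recognizer output strings stand in for word images, and the cost function `match_cost`, the `skip` penalty of 0.6, and the function name `align` are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher


def match_cost(recognized, truth):
    """Hypothetical cost: 0.0 for identical strings, up to 1.0 for no overlap."""
    return 1.0 - SequenceMatcher(None, recognized.lower(), truth.lower()).ratio()


def align(recognized, transcript, skip=0.6):
    """Align recognizer outputs (one per segmented word image) with transcript words.

    Returns (segment_index, transcript_index) pairs in reading order.
    None on either side marks a segment or a word that was left unmatched.
    """
    n, m = len(recognized), len(transcript)
    # cost[i][j]: cheapest alignment of the first i segments with the first j words
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = cost[i - 1][0] + skip, "seg"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = cost[0][j - 1] + skip, "word"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = cost[i - 1][j - 1] + match_cost(recognized[i - 1], transcript[j - 1])
            cost[i][j], back[i][j] = min(
                (pair, "pair"),                  # segment i matches word j
                (cost[i - 1][j] + skip, "seg"),  # spurious segment (over-segmentation)
                (cost[i][j - 1] + skip, "word"), # transcript word with no segment
            )
    # Follow the stored back-pointers from the end to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "pair":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "seg":
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]


# Toy input: "brown" was over-segmented into "br" and "own".
segments = ["tbe", "quick", "br", "own", "fox"]
transcript = ["the", "quick", "brown", "fox"]
alignment = align(segments, transcript)
```

Entries with `None` flag segments or transcript words that could not be paired, which is where a segmentation error would be suspected and corrected.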
In "query by example" retrieval, the query can be either a full-page document or a single word image. For document-level retrieval, a statistical model is learned to determine whether the writing styles of two documents are similar. Gamma and Gaussian distributions are used for the modeling. Word-level retrieval is performed by a feature-based similarity search algorithm. For each word image, a 1024-bit binary feature vector is extracted for this purpose. "Query by text" retrieval is a more challenging task because word-level segmentation is error prone and word recognition with a large lexicon is still an unsolved problem. The current solution is to manually annotate the collection, which is costly. Borrowing the idea of machine translation from textual information retrieval, we propose a statistical approach to word recognition and use the probabilistic annotation results to perform language model retrieval on handwritten documents. The performance of all these approaches is compared empirically on several test collections. The main contributions of this work are a detailed examination of different levels of content-based image retrieval for handwritten documents, and the development of a retrieval system that allows either image or text queries. The new word segmentation method shows improved performance over a previous method and is useful in forensic document analysis. In addition, a large handwriting database of 3824 pages (about 573,600 labeled words) was created using the proposed transcript-mapping algorithm. This database was used predominantly in this dissertation, and it serves as a useful resource for future handwriting analysis and recognition research.
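As a rough illustration of similarity search over 1024-bit binary feature vectors, the sketch below ranks indexed word images by Hamming distance to a query. The abstract does not say which similarity measure the dissertation uses, so Hamming distance is an assumption here, as are the names `hamming`, `search`, and the toy image identifiers; each feature vector is stored as a Python integer.

```python
N_BITS = 1024  # feature vector length stated in the abstract


def hamming(a, b):
    """Number of bit positions in which two feature vectors differ."""
    return bin(a ^ b).count("1")


def search(query, index, top_k=3):
    """Return the top_k (image_id, distance) pairs, nearest first."""
    scored = [(image_id, hamming(query, vec)) for image_id, vec in index.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:top_k]


ALL_ONES = (1 << N_BITS) - 1

# Toy index with made-up identifiers and synthetic feature vectors.
index = {
    "img_001": ALL_ONES,        # every bit set
    "img_002": 0,               # no bit set
    "img_003": (1 << 512) - 1,  # only the low 512 bits set
}

# Query: every bit set except the lowest three.
query = ALL_ONES ^ 0b111
results = search(query, index)
```

A linear scan like this is enough for small collections; a large index would need a faster nearest-neighbor structure, which is outside the scope of this sketch.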

Handwritten Historical Document Analysis, Recognition, And Retrieval - State Of The Art And Future Trends

Title Handwritten Historical Document Analysis, Recognition, And Retrieval - State Of The Art And Future Trends PDF eBook
Author Andreas Fischer
Publisher World Scientific
Pages 269
Release 2020-11-11
Genre Computers
ISBN 9811203253

In recent years, libraries and archives all around the world have increased their efforts to digitize historical manuscripts. To integrate the manuscripts into digital libraries, pattern recognition and machine learning methods are needed to extract and index the contents of the scanned images. The unique compendium describes the outcome of the HisDoc research project, a pioneering attempt to study the whole processing chain of layout analysis, handwriting recognition, and retrieval of historical manuscripts. This description is complemented with an overview of other related research projects, in order to convey the current state of the art in the field and outline future trends. This must-have volume is a relevant reference work for librarians, archivists and computer scientists.

Indexing and Retrieval of Low Quality Handwritten Documents

Title Indexing and Retrieval of Low Quality Handwritten Documents PDF eBook
Author Huaigu Cao
Publisher
Pages 101
Release 2008
Genre
ISBN

Decades of development in document analysis and recognition techniques have made it possible to convert large numbers of documents into electronic formats and store them on computers. In recent years, advances in information retrieval have provided a powerful tool for prompt access to the information contained in these documents. Inspired by the success of applications in these two areas, in this thesis we investigate methods that aim to improve the performance of retrieving handwritten document images. Unlike the retrieval of machine-printed documents, for which very high OCR accuracy can be expected, the retrieval of handwritten document images is more challenging because of document analysis and recognition errors. In existing methods for retrieving handwritten document images, the index is usually built on the text collected from the top-n (n > 1) candidates returned by a word recognizer. Different weights may be applied to the candidates according to their ranks. Effective as these primitive methods are, they assume flawless word segmentation and isolated word recognition; they are therefore vulnerable to word segmentation errors and cannot take advantage of the language model, which has become a standard component of state-of-the-art handwriting recognition systems. However, incorporating the word segmentation scores (probabilities) and the language model into existing indexing techniques generally increases the complexity of the problem. In our indexing method, we solved this challenging problem by separating the term counts from standard IR models, estimating them at the word sequence level, and plugging them back into the IR models. A fast algorithm using dynamic programming was proposed to reduce the time complexity. In addition to the application in document retrieval, we also used the word segmentation information in keyword retrieval.
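The baseline described above, indexing the top-n recognizer candidates with rank-dependent weights, can be sketched as follows. This illustrates the existing approach rather than the thesis's own sequence-level estimate, and the function name `weighted_term_counts`, the halving weights, and the toy candidate lists are assumptions made for the example.

```python
from collections import defaultdict


def weighted_term_counts(candidates_per_word, weights=(1.0, 0.5, 0.25)):
    """Accumulate soft term counts for one document.

    candidates_per_word: one list per word image, holding the recognizer's
    candidates ordered from best to worst.
    weights: hypothetical weight per rank; candidates ranked beyond
    len(weights) are ignored.
    """
    counts = defaultdict(float)
    for candidates in candidates_per_word:
        for rank, term in enumerate(candidates[: len(weights)]):
            counts[term] += weights[rank]
    return dict(counts)


# Made-up recognizer output for a four-word line of handwriting.
document = [
    ["meet", "meat", "melt"],
    ["me", "we"],
    ["at", "as", "it"],
    ["noon", "moon"],
]
counts = weighted_term_counts(document)
```

The resulting soft counts can replace integer term frequencies in a standard retrieval model, so a query for "moon" still finds this line, with a lower score than a query for "noon".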
In another major contribution of this thesis, we applied Markov random field (MRF) modeling to the binarization problem. The MRF can precisely describe the constraint of local smoothness in the image. We can also use the smoothness constraint to remove the grid from a form image, which is a very useful application in form image preprocessing. This research in effect addresses a general topic in the preprocessing of degraded handwritten document images. Applications in both handwriting recognition and handwritten document image retrieval can benefit from our approach.
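To show how a local smoothness constraint can influence binarization, here is a minimal sketch that uses an Ising-style smoothness prior optimized with iterated conditional modes (ICM). It is a generic textbook formulation, not the MRF model from the thesis; the energy terms, the `beta` weight, the global threshold, and the name `icm_binarize` are all assumptions.

```python
NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def icm_binarize(gray, beta=0.5, threshold=128, sweeps=5):
    """Binarize a grayscale image given as a list of rows of 0..255 values.

    Returns a list of rows of labels, 1 for ink and 0 for paper.
    """
    h, w = len(gray), len(gray[0])
    # Start from a plain global threshold: dark pixels are ink.
    labels = [[1 if gray[y][x] < threshold else 0 for x in range(w)] for y in range(h)]
    for _ in range(sweeps):
        changed = False
        for y in range(h):
            for x in range(w):
                best, best_energy = labels[y][x], None
                for cand in (0, 1):
                    # Data term: distance from the ideal intensity of the class.
                    ideal = 0 if cand == 1 else 255
                    data = abs(gray[y][x] - ideal) / 255.0
                    # Smoothness term: count 4-neighbours with a different label.
                    disagree = sum(
                        1
                        for dy, dx in NEIGHBOURS
                        if 0 <= y + dy < h and 0 <= x + dx < w
                        and labels[y + dy][x + dx] != cand
                    )
                    energy = data + beta * disagree
                    if best_energy is None or energy < best_energy:
                        best, best_energy = cand, energy
                if best != labels[y][x]:
                    labels[y][x] = best
                    changed = True
        if not changed:
            break  # labels are stable, no need for further sweeps
    return labels


# Toy image: light paper with a single dark speck in the middle.
speck = [[230] * 5 for _ in range(5)]
speck[2][2] = 90
cleaned = icm_binarize(speck)
```

Raising `beta` makes the result smoother at the risk of erasing thin strokes, so in practice it would need tuning on real document images.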

Artificial Intelligence for Maximizing Content Based Image Retrieval

Title Artificial Intelligence for Maximizing Content Based Image Retrieval PDF eBook
Author Ma, Zongmin
Publisher IGI Global
Pages 450
Release 2009-01-31
Genre Computers
ISBN 1605661759

Discusses major aspects of content-based image retrieval (CBIR) using current technologies and applications within the artificial intelligence (AI) field.

Document Analysis Systems VI

Title Document Analysis Systems VI PDF eBook
Author Simone Marinai
Publisher Springer Science & Business Media
Pages 575
Release 2004-08-26
Genre Computers
ISBN 3540230602

This volume contains papers selected for presentation at the 6th IAPR Workshop on Document Analysis Systems (DAS 2004) held during September 8–10, 2004 at the University of Florence, Italy. Several papers represent the state of the art in a broad range of “traditional” topics such as layout analysis, applications to graphics recognition, and handwritten documents. Other contributions address the description of complete working systems, which is one of the strengths of this workshop. Some papers extend the application domains to other media, like the processing of Internet documents. The peculiarity of this 6th workshop was the large number of papers related to digital libraries and to the processing of historical documents, a task which frequently requires the analysis of color documents. A total of 17 papers are associated with these topics, whereas two years ago (in DAS 2002) only a couple of papers dealt with these problems. In our view there are three main reasons for this new wave in the DAS community. From the scientific point of view, several research fields reached a thorough knowledge of techniques and problems that can be effectively solved, and this expertise can now be applied to new domains. Another incentive has been provided by several research projects funded by the EC and the NSF on topics related to digital libraries.

On-line Handwritten Document Understanding

Title On-line Handwritten Document Understanding PDF eBook
Author Anoop M. Namboodiri
Publisher
Pages 376
Release 2004
Genre Graphology
ISBN

Document Recognition and Retrieval

Title Document Recognition and Retrieval PDF eBook
Author
Publisher
Pages 236
Release 2004
Genre Image processing
ISBN
