Saturday, April 24, 2010

Optical Character Recognition For Tamil Characters

Abstract
In today's world of evolving technology, language is one of the barriers that stops the various software technologies from reaching literate people who know only their regional languages. To be truly effective, evolving technologies should also be language compatible.

Document analysis and optical character recognition (OCR) systems have been under research for a few decades. Still, it remains a highly challenging task to implement an OCR system that works under all possible conditions and gives highly accurate results. Considering this, in this project we have developed an effective optical character recognition application for Tamil characters.

In this project, we have implemented:

a) a preprocessing stage, in which noise removal, binarization by Otsu’s threshold method, skeletonization by a two-pass parallel algorithm, and feature extraction by line and word segmentation are carried out, and

b) a recognition stage, which consists of finding the centroid and geometric moments of the character images.

Introduction

Optical character recognition for Tamil characters is a software project that converts printed Tamil characters, and image forms of Tamil characters, into a system-compatible word-processing document in Tamil.

Scope of the Project

The motivation behind our work was to develop a character recognition system which can be further developed to recognize printed and handwritten Tamil characters. Such a system would be extremely useful for digitizing the innumerable ancient documents, available both on paper and palm leaves, that date back a few hundred years. These invaluable ancient works are in the process of being lost to termites in libraries in India, due to the lack of proper OCR systems. Therefore, our attempt here is a humble effort towards the noble goal of developing an efficient OCR system, in the public domain, for Tamil characters.

Description of the Project Work

Tamil is a classical South Indian language spoken in India, Sri Lanka, Singapore, and Malaysia. Its character set comprises 247 letters.

The 126 unique Tamil characters, which form the 247 letters, are shown in Fig. 1.



Figure 1 shows the unique characters that can be printed using the Inaimathi TSC font family at 12-point size. This font follows the Tamil Standard Code for Information Interchange (TSCII) [4] encoding.

Our Approach

Any OCR implementation consists of a number of preprocessing steps followed by the actual recognition. The number and types of preprocessing algorithms employed on the scanned image depend on many factors: the age of the document, paper quality, resolution of the scanned image, the amount of skew in the image, the format and layout of the images and text, the kind of script used, and the type of characters (printed or handwritten). The recognition stage usually involves calculating a number of statistical parameters and then recognizing the character from them. Typical preprocessing stages include noise cleaning, binarization, skeletonization, skew detection and correction, and feature extraction such as line and word segmentation.

In our approach, we assumed fixed-size (12 pt.) printed Tamil characters from the Inaimathi TSC family with no skew. So, the preprocessing stages included noise cleaning, binarization, skeletonization, and feature extraction.

Figure 2 shows the flow of our approach.

Noise Removal

Scanned documents often contain noise that arises due to printer, scanner, print quality, age of the document, etc. Therefore, it is necessary to filter this noise before we process the image. The commonly used approach is to low-pass filter the image and use it for later processing.
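As a concrete illustration, here is a minimal low-pass filtering sketch in Python with NumPy; the 3x3 box kernel is our illustrative choice, and a Gaussian or median filter would serve equally well.

import numpy as np

def low_pass_filter(img, k=3):
    # Smooth a gray-scale image with a k x k averaging (box) kernel.
    pad = k // 2
    padded = np.pad(img.astype(float), pad, mode='edge')
    out = np.zeros(img.shape, dtype=float)
    # sum the k*k shifted copies of the image, then normalize
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)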

In our case, we assumed that the document does not have high noise content, in which case the image can be binarized directly.


Image Binarization

Binarization is a technique by which a gray-scale image is converted to a binary image. The most common method is to select a proper threshold for the image: all intensity values below the threshold are converted to one chosen intensity level (say, “black”), and all intensity values above the threshold are converted to the other (say, “white”).

We used Otsu’s threshold algorithm to binarize our gray-scale image. In our convention, black represented a character (or noise) and white represented the background. Otsu’s method makes the assumption that the histogram of the gray-scale image has a bimodal distribution. Examples of thresholding applied to gray-scale images are shown in later figures.
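For illustration, here is a minimal sketch of Otsu’s method in Python with NumPy, assuming an 8-bit gray-scale image; this is our own illustrative implementation, not the exact code used.

import numpy as np

def otsu_threshold(gray):
    # Otsu's method: pick the threshold that maximizes the
    # between-class variance of the (assumed bimodal) histogram.
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = gray.size
    sum_all = float(np.dot(np.arange(256), hist))
    sum_bg, weight_bg = 0.0, 0
    best_t, best_var = 0, 0.0
    for t in range(256):
        weight_bg += hist[t]            # pixels with intensity <= t
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray):
    # Black (0) = character or noise, white (255) = background.
    return np.where(gray > otsu_threshold(gray), 255, 0).astype(np.uint8)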

Skew Detection And Estimation

During the scanning operation, a certain amount of image skew is unavoidable, especially when many documents are to be scanned in a limited amount of time; documents with no skew at all are rare. Skew detection and correction is therefore an important pre-processing operation. The skew angle is the angle that the text lines of the scanned image make with the horizontal direction; once detected, it can be corrected using an affine transformation. Many research papers are dedicated to this problem alone, which shows the complexity involved in this calculation. Despite our initial efforts, due to lack of time and the problem's complexity, we did not achieve highly successful results. So, we did not explore this problem in great detail, and our documents were scanned without introducing skew.
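Although we did not implement it, a common projection-profile estimator can be sketched for reference: rotate the binarized page over a range of candidate angles and keep the angle at which the row-wise projection has maximum variance (sharpest line peaks). The sketch below assumes NumPy and SciPy are available.

import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary, max_angle=5.0, step=0.25):
    # binary: 2-D array with 1 for text pixels, 0 for background.
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = rotate(binary, angle, reshape=False, order=0)
        score = rotated.sum(axis=1).var()   # variance of row projection
        if score > best_score:
            best_score, best_angle = score, angle
    return best_angle   # deskew by rotating through -best_angle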

Image Thinning (Skeletonization)

Skeletonization or thinning is a process by which a one-pixel-wide representation (the skeleton) of an object is obtained while preserving the connectedness of the object and its end points. Hence, this process can be seen as a conditional deletion of boundary pixels. Thinning is of practical importance in applications that involve further structural analysis, because it can greatly simplify an analysis that could otherwise become complex. A number of thinning algorithms have been proposed and are in use; the most common is the classical Hilditch algorithm [2] and its variants.

We used a simple algorithm for thinning our images. The algorithm makes two passes: one pass over the actual image and another over an intermediate image constructed from the original during the first pass. The result after the two passes decides whether a pixel is to be removed to obtain the skeleton of the object of interest.
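This two-pass scheme matches the well-known Zhang-Suen parallel thinning algorithm; the sketch below follows that classical formulation and should be read as illustrative, not as the exact variant we used.

import numpy as np

def zhang_suen_thin(img):
    # img: 2-D array with 1 for object pixels, 0 for background.
    # Pixels are only flagged during a pass and deleted together
    # afterwards, which preserves the connectedness of the object.
    img = img.astype(np.uint8).copy()
    rows, cols = img.shape
    changed = True
    while changed:
        changed = False
        for pass_id in range(2):
            to_delete = []
            for y in range(1, rows - 1):       # borders left untouched
                for x in range(1, cols - 1):
                    if img[y, x] == 0:
                        continue
                    # neighbors P2..P9, clockwise from the pixel above
                    p = [img[y-1, x], img[y-1, x+1], img[y, x+1],
                         img[y+1, x+1], img[y+1, x], img[y+1, x-1],
                         img[y, x-1], img[y-1, x-1]]
                    b = sum(p)                 # nonzero neighbors
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1
                            for i in range(8)) # 0 -> 1 transitions
                    if pass_id == 0:
                        ok = p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0
                    else:
                        ok = p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0
                    if 2 <= b <= 6 and a == 1 and ok:
                        to_delete.append((y, x))
            for y, x in to_delete:
                img[y, x] = 0
            changed = changed or bool(to_delete)
    return img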

Bounding Box

Before analyzing any character, it is important to identify the (pixel) boundaries of that character; thus, a bounding box has to be identified for each character. For this, we first calculate the row-wise horizontal projection of the whole document and identify the start and end positions of each line from the valleys in the projection. Having found the line boundaries, we then find the vertical projection of each line, the valleys of which give the boundaries of each character in the line [1]. Since all the characters in a line may not be of equal height, what we have now is only an approximate bounding rectangle for each character. The bounding rectangle is then used to identify the exact bounding box. This is a crucial step in moment-based recognition methods and can be calculated from a further projection within the bounding rectangle.
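A minimal sketch of this projection-based segmentation follows, assuming a NumPy binary array with 1 for text pixels; the function names are ours, chosen for illustration.

import numpy as np

def segment_runs(profile):
    # Return (start, end) index pairs of the nonzero runs of a
    # projection profile; the zero gaps between runs are the valleys.
    runs, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def character_boxes(binary):
    # Lines from the row-wise projection of the page, characters from
    # the column-wise projection of each line, then a final row-wise
    # projection inside each rectangle to get the exact bounding box.
    boxes = []
    for top, bottom in segment_runs(binary.sum(axis=1)):
        line = binary[top:bottom, :]
        for left, right in segment_runs(line.sum(axis=0)):
            rows = segment_runs(line[:, left:right].sum(axis=1))
            if rows:
                y0, y1 = rows[0][0], rows[-1][1]
                boxes.append((top + y0, top + y1, left, right))
    return boxes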

Feature Space

Each character to be recognized is expressed as a point in an N-dimensional vector space, where N is the number of features considered. Here, we used a 6-dimensional feature space (N = 6), where the features are:

• the centroid (x, y) of the character,

x = M10/M00; y = M01/M00

• four non-linear combinations of central moments [3],

f1 = M20 + M02 + M00
f2 = M20 - M02 + M11
f3 = M10 - M01
f4 = M30 + M03

where Mpq is the central moment [1]. In [3], geometric moments are used, while here we use central moments due to their translation-invariant nature.
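A sketch of the feature computation, implementing the definitions above directly; note that the central moments M10 and M01 vanish by definition, so f3 as written is identically zero for central moments (it is not when geometric moments are used, as in [3]).

import numpy as np

def moment(img, p, q, cx=0.0, cy=0.0):
    # Moment of order (p, q); with cx, cy set to the centroid,
    # this is the central moment.
    ys, xs = np.nonzero(img)
    return float(np.sum((xs - cx) ** p * (ys - cy) ** q))

def feature_vector(char_img):
    # 6-D feature vector (x, y, f1, f2, f3, f4) of a binary character.
    m00 = moment(char_img, 0, 0)
    x = moment(char_img, 1, 0) / m00
    y = moment(char_img, 0, 1) / m00
    def mu(p, q):
        return moment(char_img, p, q, cx=x, cy=y)
    f1 = mu(2, 0) + mu(0, 2) + mu(0, 0)
    f2 = mu(2, 0) - mu(0, 2) + mu(1, 1)
    f3 = mu(1, 0) - mu(0, 1)   # zero by construction for central moments
    f4 = mu(3, 0) + mu(0, 3)
    return np.array([x, y, f1, f2, f3, f4])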

By including more features for each character in our analysis, we can increase the dimensions of the feature space, which will help improve the accuracy of the OCR. Let Qi denote the characters we can recognize. Here, we used 31 characters for recognition (so, max(i) = 31). Table 1 shows the values of these features for each character we used, along with its corresponding (TSCII) character code.
Character Recognition

Once we identify the bounding box for a character, we can calculate its parameters in our feature space. Let us represent the character to be recognized as a point P in N-dimensional feature space.

If we represent the properties of each character in our database (the characters we can recognize) with subscript i, then the error distance is

ei = sqrt [ (xi - x)^2 + (yi - y)^2 + (fi1 - f1)^2
+ (fi2 - f2)^2 + (fi3 - f3)^2 + (fi4 - f4)^2 ]

This square root of the sum of squared errors, computed for every Qi, gives a measure of resemblance to each character we can recognize. The character with the least ei (least error distance) is taken as the recognized character. This method is essentially a nearest-neighbor classifier.
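A minimal sketch of this matching step; the database layout (a list of TSCII code and feature-vector pairs, one per Qi) is our assumption for illustration.

import numpy as np

def recognize(p, database):
    # p: feature vector of the unknown character.
    # database: list of (tscii_code, feature_vector) pairs.
    # Returns the code with the least error distance e_i; the distance
    # itself can be used to flag ambiguous, look-alike characters.
    e_min, best_code = min(
        (np.linalg.norm(p - q), code) for code, q in database)
    return best_code, e_min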

Once we recognize a character, the result can be written to a text file with its corresponding character code. Here, we used the standards proposed by the TSCII group [4] and hence the output was easily readable in Tamil using a text-editor.

Complexities


An astute reader will have noticed that some characters in our chosen set closely resemble one another. Fig. 9 shows some of these similar characters. The centroids and moments of such characters are similar, and hence a difficulty arises in recognition. We can identify some of these cases by their correspondingly high error distances. This problem can only be solved by increasing the number of features, which is presently not addressed and is left as future work.

Future Work



Our efforts to correct skew demanded more time and effort than we anticipated, and hence we decided to incorporate skew correction in the future. Possible improvements to make this work a more robust OCR include a higher-dimensional feature space, character slant correction, and size and font invariance. An extensive statistical analysis of letter distribution in documents may also improve the accuracy rate. To recognize characters on a page from a magazine or newspaper, a thorough study of document type recognition has to be made. Ultimately, the goal of any OCR is to recognize hand-written characters, which is a highly complex and demanding task.

Proposals

We propose to deliver demonstration software and a working model.

Applications

• Mobile OCR for the blind and for travelers.

• Screen capture with OCR.

• Aid for the blind and illiterate.

• Preservation of old manuscripts.


References

[1] H. Bunke and P. S. P. Wang, Handbook of Character Recognition and Document Image Analysis, World Scientific, 1997.

[2] S. Mori, H. Nishida, and H. Yamada, Optical Character Recognition, John Wiley & Sons, 1999.

[4] Tamil Standard Code for Information Interchange, http://www.tamil.net/tscii




