A rule-based post-processing approach to improve Persian OCR performance

Document Type: Article

Authors

1 Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran

2 Department of Electrical and Computer Engineering, University of Memphis, Memphis, USA

Abstract

Optical Character Recognition (OCR) systems convert images containing text into editable text. The accuracy of these systems is now acceptable for high-quality images with a simple structure. However, performance degrades for images with a complex structure or low quality, and in the presence of noise, scratches, pictures, stamps, or other non-textual symbols. This paper proposes a Persian OCR post-processing technique to increase the accuracy of OCR systems on challenging real-world samples. The proposed method extracts five features from each line of the text and applies seven rules to decide whether that line should be ignored. To evaluate the method, the Khana (structure-based) and Bina (deep learning-based) Persian OCR systems were used, and a dataset of 200 complex-structure images and 100 simple-structure images was collected. On the complex-structure images, the accuracy of Khana and Bina is 39% and 58%, respectively; after applying the proposed post-processing method, it increases to 93% and 91%, respectively.
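The general idea of line-level, rule-based filtering can be illustrated with a minimal Python sketch. The abstract does not list the five features or the seven rules, so every feature, threshold, rule, and function name below is a hypothetical placeholder chosen for illustration, not the authors' actual method.

```python
from dataclasses import dataclass


@dataclass
class LineFeatures:
    # All five features are illustrative assumptions, not the paper's features.
    char_count: int        # non-space characters in the line
    persian_ratio: float   # share of characters that are Arabic-block letters
    digit_ratio: float     # share of characters that are digits
    symbol_ratio: float    # share of characters that are neither letters nor digits
    mean_word_len: float   # average characters per whitespace-separated token


def extract_features(line: str) -> LineFeatures:
    # Drop whitespace and the zero-width non-joiner, which is common in Persian text.
    chars = [c for c in line if not c.isspace() and c != "\u200c"]
    n = len(chars)
    if n == 0:
        return LineFeatures(0, 0.0, 0.0, 0.0, 0.0)
    persian = sum(1 for c in chars if "\u0600" <= c <= "\u06FF" and c.isalpha())
    digits = sum(1 for c in chars if c.isdigit())
    symbols = sum(1 for c in chars if not c.isalnum())
    words = len(line.split())
    return LineFeatures(n, persian / n, digits / n, symbols / n, n / words)


# Hypothetical rules: each one flags a line as probable non-text OCR output.
# The thresholds are arbitrary and would need tuning on real data.
RULES = [
    ("too_short", lambda f: f.char_count < 3),
    ("mostly_symbols", lambda f: f.symbol_ratio > 0.5),
    ("few_persian_letters", lambda f: f.persian_ratio < 0.3 and f.digit_ratio < 0.5),
    ("fragmented", lambda f: f.mean_word_len < 1.5),
]


def should_ignore(line: str) -> bool:
    # A line is ignored as soon as any rule fires.
    features = extract_features(line)
    return any(rule(features) for _, rule in RULES)


def postprocess(ocr_text: str) -> str:
    # Keep only the lines that no rule marks as noise.
    return "\n".join(l for l in ocr_text.splitlines() if not should_ignore(l))


sample = "این یک متن نمونه است\n|| ~ .. ^\n۱۲۳۴۵۶\nxq"
cleaned = postprocess(sample)
```

In this sketch the first and third lines of `sample` (a Persian sentence and a run of Persian digits) are kept, while the line of stray symbols and the two-letter Latin fragment are dropped. Keeping rules as a list of named predicates makes it easy to add, remove, or re-tune individual rules without touching the feature extraction.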

Keywords


Z. Khosrobeigi et al./Scientia Iranica, Transactions D: Computer Science & ... 27 (2020) 3019–3033