A rule-based post-processing approach to improve Persian OCR performance

Document Type : Article

Authors

1 Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran

2 Department of Electrical and Computer Engineering, University of Memphis, Memphis, USA

Abstract

Optical Character Recognition (OCR) systems convert images that contain text into editable text. The accuracy of these systems is now acceptable for high-quality images with a simple structure. However, performance degrades for images with a complex structure or low quality, and in the presence of noise, scratches, pictures, stamps, or other non-textual symbols. This paper proposes a Persian OCR post-processing technique that increases the accuracy of OCR systems on challenging real-world samples. The proposed method extracts five features from each line of the text and applies seven proposed rules to decide whether that line should be ignored. To evaluate the method, the Khana (structure-based) and Bina (deep learning-based) Persian OCR systems were used, and a dataset of 200 complex-structure images and 100 simple-structure images was collected. On the complex-structure images, the accuracy of Khana and Bina is 39% and 58%, respectively; after applying the proposed post-processing method, it increases to 93% and 91%, respectively.
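The general idea of line-level, rule-based filtering can be illustrated with a minimal sketch. The abstract does not list the five features or the seven rules, so everything below is an assumption made for illustration only: the feature names (`num_chars`, `persian_ratio`, `digit_ratio`, `symbol_ratio`), the thresholds, and the helper functions are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class LineFeatures:
    # Hypothetical features; NOT the five features used in the paper.
    num_chars: int        # non-space characters in the line
    persian_ratio: float  # share of letters from the Arabic/Persian Unicode block
    digit_ratio: float    # share of digits (Latin or Persian)
    symbol_ratio: float   # share of characters that are neither letters nor digits


def is_persian_letter(ch: str) -> bool:
    # Letters in the basic Arabic block (U+0600..U+06FF), which covers Persian.
    return "\u0600" <= ch <= "\u06FF" and ch.isalpha()


def extract_features(line: str) -> LineFeatures:
    chars = [c for c in line if not c.isspace()]
    n = len(chars)
    if n == 0:
        return LineFeatures(0, 0.0, 0.0, 0.0)
    persian = sum(1 for c in chars if is_persian_letter(c))
    digits = sum(1 for c in chars if c.isdigit())
    symbols = sum(1 for c in chars if not c.isalnum())
    return LineFeatures(n, persian / n, digits / n, symbols / n)


# Each rule returns True when the line looks like non-textual noise.
# Thresholds are illustrative guesses, not values from the paper.
RULES = [
    lambda f: f.num_chars < 3,                               # too short to be text
    lambda f: f.persian_ratio < 0.5 and f.digit_ratio < 0.5,  # mostly neither letters nor digits
    lambda f: f.symbol_ratio > 0.4,                          # dominated by symbols
]


def should_ignore(line: str) -> bool:
    features = extract_features(line)
    return any(rule(features) for rule in RULES)


def postprocess(ocr_text: str) -> str:
    # Keep only the lines that no rule flags as noise.
    return "\n".join(l for l in ocr_text.splitlines() if not should_ignore(l))
```

Under these assumptions, `postprocess` would drop a line such as `|| ~~ .: __` (typical of a stamp or scratch misread as text) while keeping a Persian sentence or a date written in Persian digits. A real system would tune both the features and the thresholds on labeled OCR output.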

Keywords


Citation : Z. Khosrobeigi et al., Scientia Iranica, Transactions D: Computer Science & ... 27 (2020) 3019-3033