Skip to main content
Log in

Segmentation techniques for recognition of Arabic-like scripts: A comprehensive survey

  • Published:
Education and Information Technologies Aims and scope Submit manuscript

Abstract

Arabic script based text recognition system has been a popular field of research for many years that can be used in the learning and teaching process to the students and educators how to read and understand educational contents of Arabic script. The challenging nature of Arabic script recognition has attracted the attention of researchers from both industry and academic circles but these efforts have not achieved good results until now. Segmentation of Urdu script when written in Nasta’liq writing style is very difficult task due to the complexity of writing style as compare to Naskh writing style. Good segmentation is one of the reasons for high accuracy. Character segmentation has been a critical phase of the OCR process. The higher recognition rates for isolated characters as compare to results of words or connected character well illustrate the importance of segmentation. Current study investigates the recent work for character segmentation and challenges for segmentation for Arabic script based languages.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14

Similar content being viewed by others

References

  • Abidi, A., Siddiqi, I., & Khurshid, K. (2011). Towards searchable digital Urdu libraries—a word spotting based retrieval approach. In Proc. International Conference on Document Analysis and Recognition (ICDAR’11) (pp. 1344–1348). IEEE.

  • Ahmad, Z., Orakzai, J. K., Shamsher, I., & Adnan, A. (2007). Urdu Nastaleeq optical character recognition. In Proc. World Academy of Science, Engineering and Technology, volume 26.

  • Ahmad, Z., Orakzai, J. K., & Shamsher, I. (2009). Urdu compound character recognition using feed forward neural networks. In Proc. 2nd International Conference on Computer Science and Information Technology (ICCSIT’09) (pp. 457–462).

  • Ahmad, R., Amin, S. H., & Khan, M. A. U. (2010). Scale and rotation invariant recognition of cursive Pashto scriptusing SIFT features. In Proc. 6th International Conference on Emerging Technologies (ICET’10) (pp. 299–303). Islamabad, Pakistan.

  • Akram, M., & Hussain, S. (2010). Word segmentation for Urdu OCR system. In Proc. 8th Workshop on Asian Language Resources (pp. 87–93). Beijing, China. Asian Federation for Natural Language Processing.

  • Akram, Q. U., Hussain, S., & Habib, Z. (2010). Font size independent OCR for Noori Nastaleeq. In Proc. Graduate Colloquium on Computer Sciences (GCCS), Department of Computer Science, FAST-NU Lahore, Pakistan, volume 1.

  • Akram, Q. u. A., Hussain, S., Niazi, A., Anjum, U., & Irfan, F. (2014). Adapting Tesseract for complex scripts: An example for Urdu Nastalique. In 11th IAPR International Workshop on Document Analysis Systems.

  • Al-Badr, B., & Mahmoud, S. A. (1995). Survey and bibliography of Arabic optical text recognition. Signal Processing, 41(1), 49–77.

    Article  MATH  Google Scholar 

  • Alginahi, Y. M. (2012). A survey on Arabic character segmentation. International Journal on Document Analysis and Recognition, 1–22.

  • Ali, A., Ahmad, M., Rafiq, N., Akber, J., Ahmad, U., & Akmal. (2004). Language independent optical character recognition for hand written text. In Proc. 8th International Multitopic IEEE Conference (INMIC’04) (pp. 79–84).

  • Asad, M., Butt, A. S., Chaudhry, S., & Hussain, S. (2004). Rule-based expert system for Urdu Nastaleeq justification. In Proc. 8th International Multitopic IEEE Conference (INMIC’04) (pp. 591–596).

  • Bukhari, S. S., Shafait, F., & Breuel, T. M. (2012). Layout analysis of Arabic script documents. In Computer Analysis of Images and Patterns, volume 5702 of Lecture Notes in Computer Science (pp. 35–53). Springer.

  • Cheung, A., Bennamoun, M., & Bergmann, N. W. (2001). An Arabic optical character recognition system using recognition-based segmentation. Pattern Recognition, 34(2), 215–233.

    Article  MATH  Google Scholar 

  • CLE. “CLE Urdu HFL 14 point size,” CLE. [Online]. Available: http://www.cle.org.pk/clestore/cleurduhfl14pt.htm on Nov, 2014.

  • CLE. “CLE Urdu HFL 16 Point Size,” CLE. (2014). Available: http://www.cle.org.pk/clestore/cleurduhfl16pt.htm on Nov, 2014.

  • CLE. “Image Corpora” CLE. [Online]. Available: http://www.cle.org.pk/clestore/imagecorpora.htm.

  • Fareen, N., Khan, M. A., & Durrani, A. (2012). Urdu Naskh typography pattern recognition of type foundries. International Journal of Computational Linguistics (IJCL), 3(2).

  • Graves, A., Liwicki, M., Fern, S., Bertolami, R., Bunke, H., & Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, no. 5:85517868.

  • Hasan, A. U., Ahmed, S. B., Rashid, F., Shafait, F., & Breuel, T. M. (2013). Offline printed Urdu Nastaleeq script recognition with Bidirectional LSTM networks. In 12th International Conference on Document Analysis and Recognition (ICDAR’13) (pp. 1061–1065).

  • Husain, S. A. (2002). A multi-tier holistic approach for Urdu Nastaliq recognition. In Multi topic conference, 2002. Abstracts. INMIC 2002. International (pp. 84–84). IEEE.

  • Hussain, S. A., Zaman, S., & Ayub, M. (2009). A self organizing map based Urdu Nasakh character recognition. In Proc. International Conference on Emerging Technologies (ICET’09) (pp. 267–273). Islamabad, Pakistan.

  • Iftikhar, U. (2011). Recognition of Urdu ligatures. Master’s thesis, VIBOT Consortium and German Research Center for Artificial Intelligence (DFKI).

  • Ijaz, M., & Hussain, S. (2007). Corpus based Urdu Lexicon Development. In Proc. Conference on Language Technology.

  • Jamil, A. M. (2008). Noori Nastaliq revolution in Urdu composing. Elite publishers for Noori Nastaliq Foundation.

  • Jamil. (2010). Valid ligatures of Urdu. In Center for Language Engineering.

  • Javed, S. T. (2007). Investigation into a segmentation based OCR for the Nastaleeq writing system. Master’s thesis, National University of Computer & Emerging Sciences, Lahore, Pakistan.

  • Javed, S. T., & Hussain, S. (2009). Improving Nastalique-specific Pre-recognition process for Urdu OCR. In Proc. 13th International Multitopic IEEE Conference (INMIC’09) (pp. 1–6). IEEE.

  • Javed, S. T., Hussain, S., Maqbool, A., Asloob, S., Jamil, S., & Moin, H. (2010). Segmentation free Nastalique Urdu OCR. World Academy of Science, Engineering and Technology, 46, 456–461.

    Google Scholar 

  • Khan, K., Ullah, R., Khan, N. A., & Naveed, K. (2012). Urdu character recognition using principal component analysis. International Journal of Computer Applications, 60(11).

  • Korashy, A. E. & Shafait, F. (2013). Search space reduction for holistic ligature recognition in Urdu Nastalique script. In 2013 12th International Conference on Document Analysis and Recognition (ICDAR’13) (pp. 1125–1129). IEEE.

  • Lehal, G. S. (2012). Choice of recognizable units for Urdu OCR. In Proc. Workshop on Document Analysis and Recognition (DAR’12) (pp. 79–85). New York, NY, USA: ACM.

  • Lehal, G. S. (2013). Ligature segmentation for Urdu OCR. In 12th International Conference on Document Analysis and Recognition (ICDAR’ 2013) (pp. 1130–1134). IEEE.

  • Lorigo, L. M., & Govindaraju, V. (2006). Offline Arabic handwriting recognition: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5), 712–724.

    Article  Google Scholar 

  • Mahar, J. A., Memon, G. Q., & Danwar, S. H. (2011). Algorithms for Sindhi word segmentation using Lexicon-Driven approach. International Journal of Academic Research, 3(3), 28.

    Google Scholar 

  • Malik, H., & Fahiem, M. A. (2009). Segmentation of printed Urdu scripts using structural features. In Proc. 2nd International Conference in Visualisation (VIZ’09) (pp. 191–195). IEEE.

  • Märgner, V., & El-Abed, H., (2008). Databases and competitions: strategies to improve Arabic recognition systems. In: Proceedings of the Conference on Arabic and Chinese Handwriting Recognition. Springer, Berlin, Heidelberg: 82. [18] Marti U-Vand Horst Bunke.

  • Megherbi, D. B., Lodhi, S. M., & Boulenouar, A. J. (2000). Fuzzy logic model based technique with application to Urdu character recognition. Proceedings of SPIE Applications of Artificial Neural Networks in Image Processing, 3962, 13–24.

    Google Scholar 

  • Naz, S., Hayat, K., Razzak, M. I., Anwar, M. W., & Akbar, H. (2013a). Arabic script based language character recognition: Nasta’liq vs Naskh analysis. In World Congress on Computer and Information Technology (WCCIT’13) (pp. 1–7). IEEE.

  • Naz, S., Hayat, K., Razzak, M. I., Anwar, M.W., & Akbar, H. (2013b). Challenges in baseline detection of cursive script languages. In Science and Information Conference (SAI’13) (pp. 551–556). IEEE.

  • Naz, S., Hayat, K., Razzak, M. I., Anwar, M. W., Madani, S. A., & Khan, S. U. (2013c). The optical character recognition of Urdu-like cursive scripts. Pattern Recognition, 47(3), 1229–1248.

    Article  Google Scholar 

  • Naz, M., Akram, Q. u., & Hussain, S. (2014). Inarization and its evaluation for Urdu Nastalique document images. In CLT.

  • Pal, U., & Chaudhuri, B. B. (2004). Indian script character recognition: a survey. Pattern Recognition, 37(9), 1887–1899.

    Article  Google Scholar 

  • Pal, U., & Sarkar, A. (2003). Recognition of printed Urdu script. In Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR’03).

  • Parvez, M. T., & Mahmoud, S. A. (2012). Arabic handwriting recognition using structural and syntactic pattern attributes. Pattern Recognition.

  • Rafiq, M. I., Wali, A., Ghazali, M. A., Bashir, S., Hussain, S., & Zia, A., et al. (2002). Contextual shape analysis of Nastaliq. Akhbar-e-Urdu. A Journal of National Language Authority, Islamabad, Pakistan, 288–302.

  • Ramteke, R. J., & Pathan, I. K. (2013). Noise reduction in Urdu document ImageSpatial and frequency domain approaches. In Proceedings of the Fourth International Conference on Signal and Image Processing 2012 (ICSIP’12) (443–452).

  • Ramteke, R. J., Imran, K. P., & Mehrotra, S. C. (2011). Skew angle estimation of Urdu document images: a moments based approach. International Journal of Machine Learning and Computing, 1(1), 7–12.

    Article  Google Scholar 

  • Raza, A., Siddiqi, I., Abidi, A., & Arif, F. (2012). An unconstrained benchmark Urdu handwritten Sentence database with automatic line segmentation. Proc. International Conference on Frontiers in Handwriting Recognition (ICFHR’12) (pp. 491–496).

  • Rehman, M. A. U. (2010). A new scale invariant optimized chain code for Nastaliq character representation. In Proc. 2nd International Conference on Computer Modeling and Simulation (ICCMS’10), volume 4 (pp. 400–403). IEEE.

  • Sabbour, N., & Shafait, F. (2013). A segmentation-free approach to Arabic and Urdu OCR. In Proc. SPIE International Society for Optics and Photonics, volume 86580. International Society for Optics and Photonics.

  • Sagheer, M. W., He, C. L., Nobile, N., & Suen, C. Y. (2009). A new large Urdu database for off-line handwriting recognition. In Proc. International Conference on Image Analysis and Processing (ICIAP’09), volume 5716 of Lecture Notes in Computer Science (pp. 538–546). Springer.

  • Sardar, S., & Wahab, A. (2010). Optical character recognition system for Urdu. In 2010 International Conference on Information and Emerging Technologies (ICIET’10) (pp. 1–5). IEEE.

  • Sattar, S. A. (2009a). A technique for the design and implementation of an OCR for Printed Nastaliue Text. PhD thesis, NED University of Engineering & Technology, Karachi.

  • Sattar, S. A. (2009b). A finite state model for Urdu Nastalique optical character recognition. IJCSNS International Journal of Computer Science and Network Security, No.9.

  • Sattar, S. A., Haque, S., Pathan, M. K., & Gee, Q. (2008). Implementation challenges for Nastaliq character recognition. Communications in Computer and Information Science, ISSN 1865–0929, Springer-Verlag, Berlin, Germany, 20:279–285.

  • Satti, D. A. (2013). Offline Urdu Nastaliq OCR for printed text using analytical approach. Master’s thesis, Quaid-i-Azam University Islamabad, Pakistan.

  • Satti, D. A., & Saleem, K. (2012). Complexities and implementation challenges in offline Urdu Nastaliq OCR. In Proc. Conference on Language & Technology 2012 (CLT12) (pp. 85–91). Lahore, Pakistan: University of Engineering & Technology (UET).

  • Shah, Z. A. (2002). Ligature based optical character recognition of Urdu-Nastaleeq Font. In Proc. 6th International Multitopic IEEE Conference (INMIC’02) (pp. 25–25). IEEE.

  • Shaikh, N. A., Mallah, G. A., & Shaikh, Z. A. (2009). Character segmentation of Sindhi, an Arabic style scripting language, using height profile vector. Australian Journal of Basic and Applied Sciences, 3(4), 4160–4169.

    Google Scholar 

  • Shamsher, I., Ahmad, Z., Orakzai, J. K., & Adnan, A. (2007). OCR for printed Urdu script using feed forward neural network. Proceedings of World Academy of Science, Engineering and Technology, 23, 172–175.

    Google Scholar 

  • Smith, R. (2007). An overview of the Tesseract OCR engine. In ICDAR.

  • Smith, R., Antonova, D., & Lee, D.-S. (2009). Adapting the Tesseract open source OCR engine for multilingual OCR. In International Workshop on Multilingual OCR, Barcelona, Spain.

  • Wahab, M., Amin, H., & Ahmed, F. (2009). Shape analysis of Pashto script and creation of image database for OCR. In Proc. International Conference on Emerging Technologies (ICET’09) (287–290). IEEE. 10

  • Wali, A., & Hussain, S. (2007). Innovations and advanced techniques in computer and information sciences and engineering, chapter context sensitive shape-substitution in Nastaliq writing system: analysis and formulation (pp. 53–58). Springer.

  • Wali, A., Gulzar, A., Zia, A., Ghazali, M. A., Rafiq, M. I., & Niaz, M. S., et al. (2002). Features for Noori Nastaleeq. Akhbar-e-Urdu. A Journal of National Language Authority, Islamabad, Pakistan, 303–308.

  • Zahedi, M., & Eslami, S. (2011). Farsi/Arabic optical font recognition using SIFT features. Procedia Computer Science, World Conference on Information Technology (WCIT’11), 3, 1055–1059.

    Google Scholar 

  • Zeki, A. M. (2005). The segmentation problem in arabic character recognition the state of the art. In Proc. 1st International Conference on Information and Communication Technologies (ICICT’05) (pp. 11–26). IEEE.

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Muhammad I. Razzak.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Naz, S., Umar, A.I., Shirazi, S.H. et al. Segmentation techniques for recognition of Arabic-like scripts: A comprehensive survey. Educ Inf Technol 21, 1225–1241 (2016). https://doi.org/10.1007/s10639-015-9377-5

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10639-015-9377-5

Keywords

Navigation