Variance-based features for keyword extraction in Persian and English text documents

Document Type: Article

Authors

1 Faculty of New Sciences and Technologies (FNST), University of Tehran, Tehran, Iran

2 Kish International Campus, University of Tehran, Kish, Iran

Abstract

This paper addresses automatic keyword extraction in Persian and English text documents. In general, keyword extraction assigns a weight to each token and selects the words with the highest weights as keywords. We propose four methods for weighting words and compare them with five existing weighting techniques: term frequency (TF), term frequency-inverse document frequency (TF-IDF), variance, discriminative feature selection (DFS), and document length normalization based on unit words (LNU). The proposed weighting methods are based on variance features: the variance to TF-IDF ratio, the variance to TF ratio, the intersection of TF and variance, and the intersection of variance and IDF.
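As a rough illustration of ratio-based weighting, the sketch below computes TF, TF-IDF, variance, and the two variance ratios for a toy corpus. The abstract does not give the exact definitions, so several choices here are assumptions: variance is taken as the population variance of a term's per-document count, TF is the total count over the corpus, IDF is log(N/df), and the names `term_weights` and `top_keywords` are hypothetical. The intersection-based methods are not sketched because the abstract does not say how they are formed.

```python
import math


def term_weights(docs):
    """Compute illustrative weights per term.

    docs: list of documents, each a list of tokens.
    All definitions below are assumptions, not the paper's formulas.
    """
    n_docs = len(docs)
    vocab = sorted({token for doc in docs for token in doc})
    weights = {}
    for term in vocab:
        counts = [doc.count(term) for doc in docs]   # per-document frequency
        tf = sum(counts)                             # corpus-level term frequency
        df = sum(1 for c in counts if c > 0)         # document frequency
        idf = math.log(n_docs / df)
        mean = tf / n_docs
        var = sum((c - mean) ** 2 for c in counts) / n_docs  # population variance
        tfidf = tf * idf
        weights[term] = {
            "tf": tf,
            "tfidf": tfidf,
            "variance": var,
            "var_to_tf": var / tf,
            # A term present in every document has IDF 0; give it weight 0.
            "var_to_tfidf": var / tfidf if tfidf > 0 else 0.0,
        }
    return weights


def top_keywords(weights, key, k):
    """Return the k terms with the highest weight under the chosen scheme."""
    ranked = sorted(weights, key=lambda t: (-weights[t][key], t))
    return ranked[:k]


docs = [["a", "a", "a", "a", "b"], ["b", "c"], ["b", "c"]]
w = term_weights(docs)
print(top_keywords(w, "var_to_tf", 2))
```

Under these assumptions, a term concentrated in a few documents gets a high variance relative to its frequency, while a term spread evenly across all documents gets a variance of zero.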
For evaluation, the documents are clustered with K-means, expectation maximization (EM), and Ward hierarchical clustering, using the extracted keywords as feature vectors. The entropy of the resulting clusters with respect to the pre-defined classes of the documents is used as the evaluation metric. For these evaluations, we collected and labelled a set of Persian documents. The results show that one of the proposed weighting methods, the variance to TF ratio, performs best for Persian, while the variance to TF-IDF ratio yields the best entropy for English.
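The entropy-based evaluation can be sketched as follows. The abstract does not state the formula, so this assumes the commonly used size-weighted cluster entropy: for each cluster, the Shannon entropy (base 2) of the class labels it contains, averaged over clusters in proportion to cluster size. The function name `clustering_entropy` is hypothetical. Lower values mean the clusters agree better with the pre-defined classes.

```python
import math
from collections import Counter


def clustering_entropy(cluster_ids, class_labels):
    """Size-weighted entropy of class labels within clusters (assumed metric).

    cluster_ids:  cluster assignment for each document
    class_labels: pre-defined class for each document (same order)
    """
    n = len(cluster_ids)
    clusters = {}
    for cid, label in zip(cluster_ids, class_labels):
        clusters.setdefault(cid, []).append(label)

    total = 0.0
    for members in clusters.values():
        size = len(members)
        h = 0.0
        for count in Counter(members).values():
            p = count / size
            h -= p * math.log2(p)      # Shannon entropy of this cluster
        total += (size / n) * h        # weight by cluster size
    return total


# Clusters that match the classes exactly have zero entropy.
print(clustering_entropy([0, 0, 1, 1], ["x", "x", "y", "y"]))
# Clusters that split each class evenly have the maximum entropy for two classes.
print(clustering_entropy([0, 0, 1, 1], ["x", "y", "x", "y"]))
```

The cluster assignments could come from any of the three clustering methods; the metric only needs the assignment and the reference labels.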

Keywords

