A statistical approach to knowledge discovery: Bootstrap analysis of language models for knowledge base population from unstructured text

Document Type : Article


1 Department of Computer Engineering and Information Technology, Amirkabir University of Technology, Tehran, Iran

2 Department of Computational Linguistics and Phonetics, Saarland University, Saarbrücken, Germany


In this paper, we propose a novel approach for knowledge discovery from textual data. The generated knowledge base can be used as one of the main components in the cognitive process of question answering systems. The proposed model automatically extracts relations between named entities in Persian. It is a bootstrapping approach based on the n-gram model: it finds the representative textual patterns of relations as n-grams and uses them to extract new knowledge about given named entities. The main motivation for this work is the sentence structure of Persian, which, in contrast to English, follows subject-object-verb order. The proposed approach is purely statistical and requires no background knowledge of the target language. This makes our method applicable to any open-domain relation extraction task. As our test-bed, however, we focus on the domain of biographical data of international poets and scientists to build a knowledge base about them. Qualitative evaluation based on human assessment provides evidence for the efficacy of our method.
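To make the idea of bootstrapping over n-gram patterns concrete, the following is a minimal sketch and not the authors' implementation. It assumes pre-tokenized sentences, a small set of seed (entity, value) pairs for one relation, and a known set of named entities; the function names (`between`, `learn_patterns`, `apply_patterns`, `bootstrap`), the toy English sentences, and the parameters `max_n`, `top_k`, and `rounds` are all hypothetical choices made for illustration. Patterns are simply the token n-grams found between an entity and its value, ranked by raw frequency.

```python
from collections import Counter


def between(tokens, left, right):
    """Tokens strictly between the first `left` and the next `right`, else None."""
    if left not in tokens:
        return None
    i = tokens.index(left)
    if right not in tokens[i + 1:]:
        return None
    j = tokens.index(right, i + 1)
    return tuple(tokens[i + 1:j])


def learn_patterns(sentences, pairs, max_n=3, top_k=2):
    """Count the n-grams that link known pairs and keep the most frequent ones."""
    counts = Counter()
    for tokens in sentences:
        for entity, value in pairs:
            gap = between(tokens, entity, value)
            if gap and len(gap) <= max_n:
                counts[gap] += 1
    return [pattern for pattern, _ in counts.most_common(top_k)]


def apply_patterns(sentences, patterns, entities):
    """Extract (entity, value) pairs wherever a pattern directly follows an entity."""
    found = set()
    for tokens in sentences:
        for pattern in patterns:
            n = len(pattern)
            # i is the start of the pattern; one token must follow it as the value.
            for i in range(1, len(tokens) - n):
                if tuple(tokens[i:i + n]) == pattern and tokens[i - 1] in entities:
                    found.add((tokens[i - 1], tokens[i + n]))
    return found


def bootstrap(sentences, seeds, entities, rounds=2):
    """Alternate between learning patterns and extracting pairs until nothing new appears."""
    known = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(sentences, known)
        new_pairs = apply_patterns(sentences, patterns, entities) - known
        if not new_pairs:
            break
        known |= new_pairs
    return known


# Toy data (illustrative only, not from the paper's corpus).
sentences = [
    "Hafez was born in Shiraz".split(),
    "Khayyam was born in Nishapur".split(),
    "Rumi was born in Balkh".split(),
    "Saadi lived in Shiraz".split(),
]
entities = {"Hafez", "Khayyam", "Rumi", "Saadi"}
seeds = {("Hafez", "Shiraz")}

result = bootstrap(sentences, seeds, entities)
print(sorted(result))
# [('Hafez', 'Shiraz'), ('Khayyam', 'Nishapur'), ('Rumi', 'Balkh')]
```

In this sketch the single seed yields the pattern "was born in", which then recovers two further pairs; the sentence about Saadi is not matched because no learned pattern covers "lived in". A subject-object-verb language such as Persian would place the relation-bearing tokens differently, so the pattern window would need to be adapted accordingly.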

