Insights into relevant knowledge extraction techniques: a comprehensive review

The Journal of Supercomputing

Abstract

More than 50 million journal papers will have been published by the end of 2019, with 2 million more published every year. The number of conference papers is even higher, and millions of other types of scientific research are added to the knowledge base every year. Scientific databases such as Web of Science, Scopus, and PubMed index millions of scientific papers, and Google Scholar indexes a huge amount of scientific knowledge across diverse domains. However, current systems return long lists of results when users search for relevant papers, leaving them little choice but to skim the lists manually. This article surveys the techniques used by knowledge-based organizations to identify relevant research papers. We categorized the current literature into content-based, metadata-based, collaborative filtering, and citation-based techniques. We then evaluated the published techniques and the research-based products used to identify relevant documents, and identified the strengths and limitations of each approach. This review will help readers understand the internal workings of current state-of-the-art techniques for finding relevant papers, compare their respective strengths and limitations, and explore previously proposed techniques targeting this area.

Author information

Corresponding author

Correspondence to Moloud Abdar.

Cite this article

Shahid, A., Afzal, M.T., Abdar, M. et al. Insights into relevant knowledge extraction techniques: a comprehensive review. J Supercomput 76, 1695–1733 (2020). https://doi.org/10.1007/s11227-019-03009-y

  • DOI: https://doi.org/10.1007/s11227-019-03009-y
