Abstract
By the end of 2019, more than 50 million journal papers will have been published, with 2 million more published every year. The number of conference papers is even higher, and millions of other scientific research outputs are added to the knowledge base every year. Scientific databases such as Web of Science, Scopus, and PubMed index millions of scientific papers, and Google Scholar indexes a huge amount of scientific knowledge across diverse domains. However, current systems return long lists of results when users search for relevant papers, leaving them with little choice other than manually skimming through the lists. This article surveys the techniques that knowledge-based organizations use to identify relevant research papers. We categorized the current literature into content-based, metadata-based, collaborative filtering, and citation-based techniques. Further, we evaluated the published techniques and research-based products used to identify relevant documents and identified the strengths and limitations of each approach. This review should help readers understand the internal workings of current state-of-the-art techniques for finding relevant papers, understand their strengths and limitations, and explore previously proposed techniques in this area.
Cite this article
Shahid, A., Afzal, M.T., Abdar, M. et al. Insights into relevant knowledge extraction techniques: a comprehensive review. J Supercomput 76, 1695–1733 (2020). https://doi.org/10.1007/s11227-019-03009-y
DOI: https://doi.org/10.1007/s11227-019-03009-y