Academia versus social media: A psycho-linguistic analysis
Introduction
Recent studies have found that positive words are on the rise in the content of scientific articles [2], [50]. The underlying reason is thought to be publication pressure [2], which leads authors to exaggerate [50] or misreport (or "spin", the distortion of study findings) [31] in the hope of getting papers accepted. These findings illustrate the potential of lexicon analysis of large-scale data to discover patterns hidden in scientific archives.
Our previous work found differences between academia and social media in the degree of formality, including the expression of emotional information, the use of first-person pronouns to refer to the authors, and the mixing of English varieties [36]. The current study uses the same data-driven approach as [50] to capture the language styles conveyed in PubMed articles. However, instead of two features (positive and negative words), almost 80 psycho-linguistic categories will be extracted from the content of PubMed abstracts. These language features will be compared with those of other media: an online encyclopedia (Wikipedia), online diaries (web-logs, e.g., Live Journal), online forums (Reddit), and micro-blogs (Twitter). Language features of PubMed articles across different cohorts of author affiliations or journals will also be examined. Additionally, trends in the language styles of scientific publications over the last four decades will be investigated.
An advanced cluster-computing framework will be employed to process approximately six terabytes (TB) of data containing billions of records from all the media. Machine learning techniques will be applied to discover linguistic predictors that distinguish scientific articles from other media. The same techniques will also be used to detect language discriminators of scientific articles in high impact factor journals and to compare language features by whether the authors are affiliated inside or outside an English-speaking country. Correlation with time will be used to capture possible trends in the language styles of PubMed articles over the last 40 years.
A key contribution of this work is to provide a comprehensive set of linguistic predictors that can be used to differentiate scientific articles from other media, as well as to distinguish subgroups of academic publications. Understanding the differences may help to develop ‘gold standards’ of language style for academic writing. Additionally, the trends may help to better understand the evolution of scientific writing, as well as anomalies along the way.
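As a rough illustration of what "correlation with time" could look like, the sketch below computes a Pearson correlation between publication year and the yearly average of one language feature. The choice of Pearson's coefficient, the function name `pearson_r`, and the yearly values are all assumptions made for illustration; the text above does not specify the correlation measure, and the numbers are not results from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up yearly averages of a single feature (percentage of words in a
# hypothetical "positive emotion" category); illustration only.
years = [1980, 1990, 2000, 2010]
posemo_pct = [1.0, 1.2, 1.5, 1.9]

r = pearson_r(years, posemo_pct)
print(f"correlation with time: {r:.3f}")
```

A coefficient close to +1 or -1 would suggest a steady rise or decline of the feature over the period, while a value near 0 would suggest no linear trend.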
The current paper is organized as follows. Section 2 presents the background literature. Section 3 outlines the proposed methods, data, and experimental set-up. Section 4 presents the results. Section 5 concludes the paper.
Language styles as features
To capture the language styles represented in text, categorizing words into linguistic groups is a popular practice. For instance, the Linguistic Inquiry and Word Count (LIWC) package [41], [40], [39], which originated in psychology, returns almost 80 psycho-linguistic categories for an input text. LIWC has been used across different research areas in sociology and psychology, such as examining status, dominance and social hierarchy, honesty and deception, thinking styles, and individual differences [48].
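The idea of categorizing words into linguistic groups can be sketched with a toy lexicon. The example below is only an illustration: the category names, the word lists in `TOY_LEXICON`, and the function `category_percentages` are hypothetical and are not the LIWC dictionary or its API. It assumes each feature is expressed as the percentage of tokens in a text that belong to a category.

```python
import re
from collections import Counter

# Toy lexicon mapping category -> words. Hypothetical; NOT the LIWC dictionary.
TOY_LEXICON = {
    "posemo": {"novel", "amazing", "innovative", "good"},
    "negemo": {"bad", "poor", "fail"},
    "i": {"i", "me", "my"},
}


def category_percentages(text, lexicon=TOY_LEXICON):
    """Return, for each category, the percentage of tokens that fall in it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    total = len(tokens)
    return {c: (100.0 * counts[c] / total if total else 0.0) for c in lexicon}


features = category_percentages("I think my novel method is amazing")
print(features)
```

Each document is thereby reduced to a fixed-length vector of category percentages, which is the kind of representation a classifier can consume.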
Methodology
In this work, to build datasets for experiments, scientific articles will be downloaded from PubMed and social media posts will be crawled from a variety of user-generated sources: online encyclopaedias, diaries, forums, and micro-blogs. Then linguistic categories for scientific articles and social media posts will be extracted. These categories will be used as features in the classifications of PubMed papers versus other media as well as of PubMed papers in different cohorts. The trend of
PubMed versus other media
To create a balanced dataset, one million documents from each class were randomly sampled for the experiments. Because the classes are of equal size, accuracy is a suitable measure of classification performance.
Although word count was excluded from the classification models, the accuracy achieved with the other LIWC features is still high. The accuracy in all the balanced binary classifications is more than 94%, ranging from 94.5% (PubMed versus Reddit post prediction) to 98.3% (PubMed versus Live Journal prediction).
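A minimal sketch of balanced sampling, and of why accuracy is informative in that setting, is given below. The helper names `balanced_sample` and `accuracy`, the tiny corpus, and the sample size are assumptions for illustration, not the study's actual pipeline, which drew one million documents per class.

```python
import random


def balanced_sample(docs_by_class, n_per_class, seed=0):
    """Draw the same number of documents, without replacement, from each class."""
    rng = random.Random(seed)
    sample = []
    for label, docs in docs_by_class.items():
        for doc in rng.sample(docs, n_per_class):
            sample.append((doc, label))
    rng.shuffle(sample)
    return sample


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Placeholder corpus with unequal class sizes (illustration only).
corpus = {
    "pubmed": [f"abstract {i}" for i in range(50)],
    "reddit": [f"post {i}" for i in range(200)],
}
data = balanced_sample(corpus, n_per_class=20)
labels = [label for _, label in data]

# A classifier that always predicts one class scores exactly 0.5 here.
baseline = accuracy(labels, ["pubmed"] * len(labels))
print(baseline)
```

With equal class sizes, a trivial single-class predictor reaches only 50% accuracy, so an accuracy well above that level reflects genuine discriminative power rather than class imbalance.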
Conclusion
This study investigated the linguistic features of biomedical articles in comparison with those of a variety of media: an online encyclopedia, online diaries, online forums, and micro-blogs. Results indicated that distinct linguistic styles differentiated PubMed articles, a representative of scientific-style media, from the other sources. Overall, linguistic features were found to be powerful predictors of scientific articles, effectively separating academia from other media.
Acknowledgment
This work is partially supported by the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning.
References (53)
et al. Building a Twitter opinion lexicon from automatically-annotated tweets. Knowl.-Based Syst. (2016)
et al. New avenues in knowledge bases for natural language processing. Knowl.-Based Syst. (2016)
et al. Big social data analysis. Knowl.-Based Syst. (2014)
et al. Wiki-surgery? Internal validity of Wikipedia as a medical and surgical reference. J. Am. Coll. Surg. (2007)
Wikipedia as an evidence source for nursing and healthcare students. Nurse Educ. Today (2011)
Teaching with wikis: toward a networked pedagogy. Comput. Compos. (2008)
et al. Aspect extraction for opinion mining with a deep convolutional neural network. Knowl.-Based Syst. (2016)
et al. Figurative messages and affect in Twitter: differences between #irony, #sarcasm and #not. Knowl.-Based Syst. (2016)
Staying afloat in the rising tide of science. Cell (2016)
et al. The use of superlatives in cancer research. JAMA Oncol. (2016)
‘Novel, amazing, innovative’: positive words on the rise in science papers. Nature
Size matters: word count as a measure of quality on Wikipedia. Proceedings of the International Conference on World Wide Web
Evaluation of pharmacist use and perception of Wikipedia as a drug information resource. Ann. Pharmacother.
Wikipedia as a data source for political scientists: accuracy and completeness of coverage. PS: Polit. Sci. Polit.
Affective computing and sentiment analysis. IEEE Intell. Syst.
Computational intelligence for big social data analysis. IEEE Comput. Intell. Mag.
New avenues in opinion mining and sentiment analysis. IEEE Intell. Syst.
Confronting stem cell hype. Science
Informal elements in English academic writing: threats or opportunities for advanced non-native speakers. Writing: Texts, Processes and Practices
What online communities can tell us about electronic cigarettes and hookah use: a study using text mining and visualization techniques. J. Med. Internet Res.
Using blogs and wikis in a graduate public health course. Med. Ref. Serv. Q.
Measuring post traumatic stress disorder in Twitter. Proceedings of the International Conference on Weblogs and Social Media
Survey of review spam detection using machine learning techniques. J. Big Data
Mental health discourse on Reddit: self-disclosure, social support, and anonymity. Proceedings of the International Conference on Weblogs and Social Media
MapReduce: simplified data processing on large clusters. Commun. ACM
Social media update 2014. Technical report
Thin Nguyen is a research scientist in the Centre for Pattern Recognition and Data Analytics at Deakin University, Australia. He received his PhD from Curtin University, Australia in 2012 in the area of social media analysis and machine learning. His broad research interests lie in data analytics, pattern recognition, affective understanding and web-scale analysis. In particular, his research work has focused on sentiment analysis, personalisation, crowdsourcing in social media and medical Internet research. His current research topic is viewing the Web as a sensing platform and as a surrogate to develop innovative and novel ways to monitor and predict disease outcomes. One example is to exploit population level web search activity behaviour to construct new computational methods to serve as a proxy for chronic disease risk.
Svetha Venkatesh is Alfred Deakin Professor of Computer Science and Director of the Centre for Pattern Recognition and Data Analytics (PRaDA) at Deakin University. She is a Fellow of the Australian Academy of Technological Sciences and Engineering and the International Association of Pattern Recognition. She is on the editorial board of IEEE Transactions on Multimedia and was on the board of ACM Transactions on Multimedia (2008-2011). She is a program member of several international conferences such as ACM Multimedia. Venkatesh has developed frontier technologies in large-scale pattern recognition, exemplified by more than 300 publications and 9 patents. One start-up company spun out of these patents is Virtual Observer, which is based on paradigm-shifting methods that leverage mobile cameras to deliver wide-area surveillance solutions. The technology was runner-up in both the WA Inventor of the Year (Early Stage) and the Global Security Challenge (Asia-Pacific) in 2007. A more recent spin-out company is iCetana, which is based on novel methods for finding anomalies in video data. iCetana won the Broadband Innovation Award at the prestigious Tech23 in 2010. Venkatesh has recently made important contributions to the field of health analytics and won the prestigious Barwon Health Researcher of the Year Award in 2013.
Dinh Phung is currently a professor in the School of Information Technology, Deakin University, Australia. He obtained a bachelor of computer science with first-class honours and a PhD from Curtin University in 2001 and 2005, respectively. His primary research interests are statistical machine learning, graphical models, Bayesian statistics, and their applications in pervasive health, multimedia, social computing, and healthcare analytics, with an established publication record in these areas. Before joining Deakin, he held the Curtin Research Fellowship at Curtin University from 2006 to 2011 and received an International Research Fellowship from SRI International in 2005. In 2008, he was invited to the Dagstuhl school series in Germany on the topic of context modelling and understanding. He further received the Early Career Research and Development Award and the Curtin Innovation Award from Curtin University in 2010 and 2011, respectively.