Academia versus social media: A psycho-linguistic analysis

https://doi.org/10.1016/j.jocs.2017.08.011

Highlights

  • Extract almost 80 linguistic features from PubMed papers published over the last 40 years.

  • Extract the same features from online encyclopaedias, diaries, forums, and micro-blogs.

  • Classify PubMed papers versus other media & classify different cohorts of PubMed papers.

  • Detect trends of linguistic features in PubMed papers over the last 40 years.

  • Employ advanced cluster computing to process 5.8 terabytes of data comprising 3.6 billion records.

Abstract

Publication pressure has influenced the way scientists report their experimental results. It has recently been found that scientific outcomes are exaggerated or distorted (spin) to improve the chances of publication. Apart from inspecting the content for spin, language style has proven to be a good trace of such behaviour. For example, the use of words from emotion lexicons has been used to interpret exaggeration and overstatement in academia. This work adopts a data-driven approach to explore a comprehensive set of psycho-linguistic features in a large corpus of PubMed papers published over the last four decades. The language features of other media – an online encyclopedia (Wikipedia), online diaries (web-logs), online forums (Reddit), and micro-blogs (Twitter) – are also extracted. Several binary classifications are employed to discover linguistic predictors of scientific abstracts versus other media, as well as strong predictors of scientific articles in different cohorts of impact factors and author affiliations. Trends of language styles expressed in scientific articles over the course of 40 years have also been discovered, tracing the evolution of academic writing over that period. The study demonstrates advances in lightning-fast cluster computing for dealing with large-scale data, consisting of 5.8 terabytes of data containing 3.6 billion records from all the media. The good performance of the advanced cluster computing framework suggests the potential of pattern recognition in data at scale.

Introduction

Recently it has been found that positive words have been on the rise in the content of scientific articles [2], [50]. The underlying reason is thought to be publication pressure [2], leading authors to exaggerate [50] or misreport (or spin – distort study findings) [31] in the hope of getting papers accepted. These findings illustrate the potential of lexicon analysis of large-scale data to discover patterns hidden in scientific archives.

Our previous work found differences between academia and social media in the degree of formality, including expressing emotional information, using first-person pronouns to refer to the authors, and mixing English varieties [36]. The current study uses the same data-driven approach as [50] to capture the language styles conveyed in PubMed articles. However, instead of two features – positive and negative words – almost 80 psycho-linguistic categories will be extracted from the content of PubMed abstracts. The language features will be compared with those of other media, consisting of an online encyclopedia (Wikipedia), online diaries (web-logs, e.g., Live Journal), online forums (Reddit), and micro-blogs (Twitter). Language features among PubMed articles in different cohorts of author affiliations or journals will also be examined. Additionally, trends of language styles in scientific publications over the last four decades will be investigated.

An advanced cluster computing framework will be employed to process approximately six terabytes (TB) of data containing billions of records from all the media. Machine learning techniques will be applied to discover the linguistic predictors that distinguish scientific articles from other media. The techniques are also used to detect language discriminators of scientific articles in high impact factor journals and to compare language features by author affiliation, either inside or outside an English-speaking country. Correlation with time will be used to capture possible trends in the language styles expressed in PubMed articles over the last 40 years.
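The cluster computing framework is described only as "lightning-fast", a phrase that matches how Apache Spark describes itself; assuming Spark, a minimal sketch of distributing LIWC-style feature extraction over a large corpus might look as follows. The file paths, record layout, and extract_features placeholder are illustrative, not the paper's actual pipeline.

# Minimal PySpark sketch: distribute LIWC-style feature extraction over a large
# corpus of abstracts. Paths, schema, and extract_features() are hypothetical;
# the real pipeline and feature set are described in the paper, not shown here.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, DoubleType

spark = SparkSession.builder.appName("psycholinguistic-features").getOrCreate()

def extract_features(text):
    """Placeholder: return a dict of category -> proportion of words in that
    category (e.g. from a LIWC-like lexicon). Real implementation not shown."""
    words = (text or "").lower().split()
    total = max(len(words), 1)
    positive = {"novel", "amazing", "robust"}   # toy lexicon, illustration only
    return {"posemo": sum(w in positive for w in words) / total}

extract_udf = F.udf(extract_features, MapType(StringType(), DoubleType()))

# One JSON record per document: {"id": ..., "year": ..., "text": ...} (assumed layout)
abstracts = spark.read.json("hdfs:///corpora/pubmed_abstracts/*.json")
features = abstracts.withColumn("liwc", extract_udf(F.col("text")))
features.select("id", "year", "liwc").write.parquet("hdfs:///features/pubmed")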

A key contribution of this work is to provide a comprehensive set of linguistic predictors that can be used to differentiate scientific articles from other media, as well as to distinguish subgroups of academic publications. Understanding these differences may help to develop 'gold standards' of language style for academic writing. Additionally, the trends may help us better understand the evolution of scientific writing, as well as anomalies in its development.

The current paper is organized as follows. Section 2 presents the background literature. Section 3 outlines the proposed methods, data, and experimental set-up. Section 4 presents the results. Section 5 concludes the paper.

Section snippets

Language styles as features

To capture the language styles represented in text, categorizing words into linguistic groups is a popular practice. For instance, the Linguistic Inquiry and Word Count (LIWC) package [41], [40], [39], which originated in psychology, returns almost 80 psycho-linguistic categories for an input text. LIWC has been used across different research areas in sociology and psychology, such as examining status, dominance and social hierarchy, honesty and deception, thinking styles and individual differences [48]
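As a toy illustration of this word-categorization idea (not the actual LIWC dictionary, which is proprietary and covers roughly 80 categories), scoring a text against a small invented lexicon might look like this:

# Toy illustration of LIWC-style scoring: the proportion of tokens that fall in
# each psycho-linguistic category. The mini-lexicon below is invented for the
# example; the real LIWC dictionary is far larger and proprietary.
import re
from collections import Counter

LEXICON = {
    "posemo":  {"amazing", "novel", "good", "innovative"},
    "negemo":  {"bad", "worse", "fail"},
    "i":       {"i", "me", "my", "mine"},
    "tentat":  {"maybe", "perhaps", "possibly"},
}

def liwc_like_scores(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return {cat: sum(counts[w] for w in words) / total
            for cat, words in LEXICON.items()}

print(liwc_like_scores("Our novel, innovative method is amazing and maybe robust."))
# -> {'posemo': 0.33..., 'negemo': 0.0, 'i': 0.0, 'tentat': 0.11...}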

Methodology

In this work, to build the datasets for the experiments, scientific articles will be downloaded from PubMed and social media posts will be crawled from a variety of user-generated sources: online encyclopaedias, diaries, forums, and micro-blogs. Linguistic categories will then be extracted for the scientific articles and social media posts. These categories will be used as features in the classifications of PubMed papers versus other media, as well as of PubMed papers in different cohorts. The trend of
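A minimal sketch of the trend-analysis step mentioned above, correlating each linguistic feature with publication year, assuming the extracted features are gathered in a pandas DataFrame with one row per paper; the column names and toy data are hypothetical.

# Sketch of the trend analysis: correlate each linguistic feature with
# publication year. Assumes a DataFrame with a 'year' column and one column
# per LIWC-style feature (names hypothetical).
import pandas as pd
from scipy.stats import spearmanr

def feature_trends(df, feature_cols, year_col="year"):
    """Return the Spearman correlation (and p-value) of each feature's yearly
    mean with the year itself, as a rough indicator of a monotonic trend."""
    yearly = df.groupby(year_col)[feature_cols].mean().reset_index()
    rows = []
    for col in feature_cols:
        rho, p = spearmanr(yearly[year_col], yearly[col])
        rows.append({"feature": col, "spearman_rho": rho, "p_value": p})
    return pd.DataFrame(rows).sort_values("spearman_rho", ascending=False)

# Toy data: positive-emotion words drifting upward over four decades,
# first-person pronouns fluctuating with no trend.
toy = pd.DataFrame({
    "year":   list(range(1975, 2015)) * 2,
    "posemo": [0.01 + 0.0005 * (y - 1975) for y in range(1975, 2015)] * 2,
    "i":      [0.02 + 0.001 * (y % 3) for y in range(1975, 2015)] * 2,
})
print(feature_trends(toy, ["posemo", "i"]))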

PubMed versus other media

To create a balanced dataset, one million documents from each class were randomly sampled for the experiments. This makes accuracy a suitable measure of performance for the classifications.

Although word count was not included in the classification models, the accuracy achieved with the other LIWC features is still high. The accuracy of all the balanced binary classifications is above 94%, ranging from 94.5% (PubMed versus Reddit posts) to 98.3% (PubMed versus Live Journal posts).
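The exact classifier is not given in these snippets; as a minimal sketch of the balanced binary set-up with accuracy as the metric, logistic regression on synthetic stand-in feature matrices could look like the following (X_pubmed and X_reddit are hypothetical names for the extracted LIWC feature matrices).

# Minimal sketch of one balanced binary classification (e.g. PubMed vs Reddit)
# on LIWC-style feature vectors, with accuracy as the metric. Logistic
# regression is a placeholder, and the feature matrices are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_docs, n_features = 1_000, 80        # stand-in for 1M docs x ~80 LIWC features
X_pubmed = rng.normal(0.0, 1.0, size=(n_docs, n_features))
X_reddit = rng.normal(0.3, 1.0, size=(n_docs, n_features))   # synthetic shift

X = np.vstack([X_pubmed, X_reddit])
y = np.concatenate([np.ones(n_docs), np.zeros(n_docs)])       # balanced classes

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f}")
# The coefficients of the fitted model indicate which features best
# discriminate the two classes.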

Conclusion

This study investigated the linguistic features of bio-medical articles in comparison with those of a variety of media, consisting of an online encyclopedia, online diaries, online forums, and micro-blogs. The results indicated that distinct linguistic styles differentiated PubMed articles, representative of scientific-style media, from the other sources. Overall, linguistic features were found to be powerful predictors of scientific articles, effectively separating academia from other media.

Acknowledgment

This work is partially supported by the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning.


References (53)

  • P. Ball, ‘Novel, amazing, innovative’: positive words on the rise in science papers, Nature (2015)

  • J.E. Blumenstock, Size matters: word count as a measure of quality on Wikipedia, Proceedings of the International Conference on World Wide Web (2008)

  • L. Brokowski et al., Evaluation of pharmacist use and perception of Wikipedia as a drug information resource, Ann. Pharmacother. (2009)

  • A.R. Brown, Wikipedia as a data source for political scientists: accuracy and completeness of coverage, PS: Polit. Sci. Polit. (2011)

  • E. Cambria, Affective computing and sentiment analysis, IEEE Intell. Syst. (2016)

  • E. Cambria et al., Computational intelligence for big social data analysis, IEEE Comput. Intell. Mag. (2016)

  • E. Cambria et al., New avenues in opinion mining and sentiment analysis, IEEE Intell. Syst. (2013)

  • T. Caulfield et al., Confronting stem cell hype, Science (2016)

  • Y.-Y. Chang et al., Informal elements in English academic writing: threats or opportunities for advanced non-native speakers, Writing: Texts, Processes and Practices (1999)

  • A.T. Chen et al., What online communities can tell us about electronic cigarettes and hookah use: a study using text mining and visualization techniques, J. Med. Internet Res. (2015)

  • L. Cobus, Using blogs and wikis in a graduate public health course, Med. Ref. Serv. Q. (2009)

  • G. Coppersmith et al., Measuring post traumatic stress disorder in Twitter, Proceedings of the International Conference on Weblogs and Social Media (2014)

  • M. Crawford et al., Survey of review spam detection using machine learning techniques, J. Big Data (2015)

  • M. De Choudhury et al., Mental health discourse on Reddit: self-disclosure, social support, and anonymity, Proceedings of the International Conference on Weblogs and Social Media (2014)

  • J. Dean et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)

  • M. Duggan et al., Social media update 2014. Technical report (2015)

Thin Nguyen is a research scientist in the Centre for Pattern Recognition and Data Analytics at Deakin University, Australia. He received his PhD from Curtin University, Australia in 2012 in the area of social media analysis and machine learning. His broad research interests lie in data analytics, pattern recognition, affective understanding and web-scale analysis. In particular, his research work has focused on sentiment analysis, personalisation, crowdsourcing in social media and medical Internet research. His current research topic is viewing the Web as a sensing platform and as a surrogate to develop innovative ways to monitor and predict disease outcomes. One example is exploiting population-level web search activity to construct new computational methods that serve as a proxy for chronic disease risk.

Svetha Venkatesh is Alfred Deakin Professor of Computer Science and Director of the Centre for Pattern Recognition and Data Analytics (PRaDA) at Deakin University. She is a Fellow of the Australian Academy of Technological Sciences and Engineering and of the International Association of Pattern Recognition. She is on the editorial board of IEEE Transactions on Multimedia and was on the board of ACM Transactions on Multimedia (2008-2011). She is a program committee member of several international conferences such as ACM Multimedia. Venkatesh has developed frontier technologies in large-scale pattern recognition, exemplified by more than 300 publications and 9 patents. One start-up company spun out of these patents is Virtual Observer, based on paradigm-shifting methods that leverage mobile cameras to deliver wide-area surveillance solutions. The technology was runner-up in both the WA Inventor of the Year (Early Stage) and the Global Security Challenge (Asia-Pacific) in 2007. A more recent spin-out company, iCetana, is based on novel methods for finding anomalies in video data. iCetana won the Broadband Innovation Award at the prestigious Tech23 in 2010. Venkatesh has recently made important contributions to the field of health analytics and won the prestigious Barwon Health Researcher of the Year Award in 2013.

Dinh Phung is currently a professor in the School of Information Technology, Deakin University, Australia. He obtained a bachelor of computer science with first-class honours and a PhD from Curtin University in 2001 and 2005, respectively. His primary research interests are statistical machine learning, graphical models, Bayesian statistics and their applications in pervasive health, multimedia, social computing and healthcare analytics, with an established publication record in these areas. Before joining Deakin, he held the Curtin Research Fellowship from Curtin University from 2006 to 2011 and an International Research Fellowship from SRI International in 2005. In 2008, he was invited to the Dagstuhl school series on the topic of context modelling and understanding in Germany. He further received the Early Career Research and Development Award and the Curtin Innovation Award from Curtin University in 2010 and 2011, respectively.
