SPDF: Set Probabilistic Distance Features for Prediction of Population Health Outcomes via Social Media

  • Conference paper
Data Mining (AusDM 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1127)

Abstract

Measuring population health outcomes is critical to understanding the health status of communities and, in turn, to developing appropriate health-care programmes for them. This task requires predictions of population health status that are fast and accurate, yet scalable to populations of different sizes. To meet these requirements, this paper proposes a method for automatically predicting population health outcomes from social media using Set Probabilistic Distance Features (SPDF). SPDF are mid-level features built upon the similarity in posting patterns between populations, and they hold several advantages. First, they can be applied to various low-level features. Second, they are well suited to problems with weakly labelled data, i.e., problems where only the labels of sets are available and the labels of the sets’ elements are not explicitly provided. We thoroughly evaluate our approach on the task of predicting health indices of US counties, using a large-scale dataset collected from Twitter. We apply SPDF to two different textual features: latent topics and linguistic styles. We conduct two case studies: across-year and across-county prediction. Performance is validated against the Behavioral Risk Factor Surveillance System (BRFSS) surveys. Experimental results show that, with linguistic style features, the proposed approach achieves state-of-the-art performance in predicting all health indices in both case studies.
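The abstract does not specify how the set-level distances are computed, so the following is only a minimal sketch of the general idea of a distance feature between sets, not the authors' implementation. It assumes that each population (e.g. a county) is a set of low-level post feature vectors, that each set is summarized by a diagonal Gaussian, and that the distance between two sets is the symmetric Kullback-Leibler divergence between their Gaussians. The function names, the choice of distribution, and the choice of divergence are all hypothetical and introduced here for illustration.

```python
import math
from statistics import fmean, pvariance


def fit_diag_gaussian(posts, eps=1e-6):
    """Summarize a set of post feature vectors by per-dimension mean and variance.

    ASSUMPTION: a diagonal Gaussian is used only for illustration.
    `eps` keeps variances strictly positive.
    """
    dims = list(zip(*posts))  # transpose: one tuple of values per feature dimension
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) + eps for d in dims]
    return means, variances


def kl_diag_gaussian(p, q):
    """KL(p || q) for two diagonal Gaussians given as (means, variances)."""
    (mu_p, var_p), (mu_q, var_q) = p, q
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def symmetric_kl(p, q):
    """Symmetric KL divergence, used here as the (assumed) set distance."""
    return kl_diag_gaussian(p, q) + kl_diag_gaussian(q, p)


def set_distance_features(target_posts, reference_sets):
    """Describe one population by its distances to a list of reference populations.

    Only set-level information is needed, which matches the weakly labelled
    setting: individual posts carry no health label.
    """
    target = fit_diag_gaussian(target_posts)
    return [symmetric_kl(target, fit_diag_gaussian(ref)) for ref in reference_sets]


# Toy usage: two "counties", each a small set of 2-dimensional post features.
county_a = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
county_b = [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1]]
features = set_distance_features(county_a, [county_a, county_b])
```

In this sketch, the resulting feature vector (one distance per reference population) would then be passed to a standard regressor to predict a set-level health index; that stage is omitted because the abstract gives no detail about it.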

Corresponding author

Correspondence to Hung Nguyen.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Nguyen, H., Nguyen, D.T., Nguyen, T. (2019). SPDF: Set Probabilistic Distance Features for Prediction of Population Health Outcomes via Social Media. In: Le, T., et al. Data Mining. AusDM 2019. Communications in Computer and Information Science, vol 1127. Springer, Singapore. https://doi.org/10.1007/978-981-15-1699-3_5

  • DOI: https://doi.org/10.1007/978-981-15-1699-3_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-1698-6

  • Online ISBN: 978-981-15-1699-3

  • eBook Packages: Computer Science, Computer Science (R0)
