Recentred local profiles for authorship attribution

ROBERT LAYTON; PAUL WATTERS; RICHARD DAZELEY

doi:10.1017/S1351324911000180

Published online by Cambridge University Press: 09 June 2011

ROBERT LAYTON: Affiliation:
Internet Commerce Security Laboratory, University of Ballarat, Australia e-mail: r.layton@icsl.com.au, p.watters@ballarat.edu.au
PAUL WATTERS: Affiliation:
Internet Commerce Security Laboratory, University of Ballarat, Australia e-mail: r.layton@icsl.com.au, p.watters@ballarat.edu.au
RICHARD DAZELEY: Affiliation:
Data Mining and Informatics Research Group, University of Ballarat, Australia e-mail: r.dazeley@ballarat.edu.au

Abstract

Authorship attribution methods aim to determine the author of a document using information gathered from a set of documents with known authors. One way of performing this task is to create profiles containing distinctive features known to be used by each author. This paper presents a new method of creating an author or document profile that detects features considered distinctive compared to normal language usage. This recentring approach creates more accurate profiles than previous methods, as demonstrated empirically using a known corpus of authorship problems. The method, named recentred local profiles, determines authorship accurately, relative to other methods in the literature, using a simple ‘best matching author’ approach to classification. The proposed method is also shown to be more stable than related methods as parameter values change. Using a weighted voting scheme, recentred local profiles is shown to outperform other methods in authorship attribution, with an overall accuracy of 69.9% on the Ad-hoc Authorship Attribution Competition corpus, representing a significant improvement over related methods.
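
The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, the profile size, or the distance measure, so the choices here (character trigrams, a background estimated from all known texts, the largest absolute frequency differences as the profile, and one minus cosine similarity as the distance) are assumptions made for illustration, and all function names are hypothetical.

```python
from collections import Counter
import math


def ngram_freqs(text, n=3):
    """Relative frequency of each character n-gram in `text`."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def recentred_profile(text, background, n=3, size=50):
    """Assumed recentring step: subtract the background ("normal usage")
    frequency from the text's frequency and keep the `size` n-grams
    with the largest absolute difference."""
    freqs = ngram_freqs(text, n)
    diffs = {g: freqs.get(g, 0.0) - background.get(g, 0.0)
             for g in set(freqs) | set(background)}
    # Sort by decreasing |difference|; ties broken by n-gram for determinism.
    top = sorted(diffs, key=lambda g: (-abs(diffs[g]), g))[:size]
    return {g: diffs[g] for g in top}


def profile_distance(p, q):
    """One minus cosine similarity; n-grams missing from a profile count as 0.
    This distance is an assumption, not taken from the abstract."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def best_matching_author(document, known_texts, n=3, size=50):
    """'Best matching author': return the author whose profile is closest.
    `known_texts` maps an author name to that author's concatenated text."""
    background = ngram_freqs(" ".join(known_texts.values()), n)
    profiles = {author: recentred_profile(text, background, n, size)
                for author, text in known_texts.items()}
    doc_profile = recentred_profile(document, background, n, size)
    return min(profiles,
               key=lambda a: profile_distance(doc_profile, profiles[a]))
```

In this sketch the parameters `n` and `size` stand in for the parameter values whose effect on stability the abstract mentions; the weighted voting scheme is not shown.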

Type: Articles
Information: Natural Language Engineering, Volume 18, Issue 3, July 2012, pp. 293–312

DOI: https://doi.org/10.1017/S1351324911000180
Copyright: Copyright © Cambridge University Press 2011
