
Recentred local profiles for authorship attribution

Published online by Cambridge University Press: 09 June 2011

ROBERT LAYTON
Affiliation:
Internet Commerce Security Laboratory, University of Ballarat, Australia e-mail: r.layton@icsl.com.au, p.watters@ballarat.edu.au
PAUL WATTERS
Affiliation:
Internet Commerce Security Laboratory, University of Ballarat, Australia e-mail: r.layton@icsl.com.au, p.watters@ballarat.edu.au
RICHARD DAZELEY
Affiliation:
Data Mining and Informatics Research Group, University of Ballarat, Australia e-mail: r.dazeley@ballarat.edu.au

Abstract

Authorship attribution methods aim to determine the author of a document using information gathered from a set of documents with known authors. One way to perform this task is to create a profile for each author containing the distinctive features that author is known to use. This paper presents a new method of creating an author or document profile that detects features that are distinctive relative to normal language usage. This recentring approach creates more accurate profiles than previous methods, as demonstrated empirically on a known corpus of authorship problems. The method, named recentred local profiles, determines authorship accurately, compared with other methods in the literature, using a simple ‘best matching author’ approach to classification. It is also shown to be more stable than related methods as parameter values change. With a weighted voting scheme, recentred local profiles outperforms other authorship attribution methods, reaching an overall accuracy of 69.9% on the Ad-hoc Authorship Attribution Competition corpus, a significant improvement over related methods.
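
The idea of recentring a profile against normal language usage and then choosing the best matching author can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not specify the feature type, the profile size, or the distance measure, so the use of character trigrams, a baseline estimated from the pooled training texts, a cosine-based distance, and all function names (`ngram_freqs`, `recentred_profile`, `cosine_distance`, `attribute`) are assumptions made here for illustration. The weighted voting scheme mentioned in the abstract is not shown.

```python
from collections import Counter
from math import sqrt


def ngram_freqs(text, n=3):
    """Relative frequency of each character n-gram in `text` (assumed feature type)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {g: c / total for g, c in Counter(grams).items()}


def recentred_profile(text, baseline, n=3, size=50):
    """Keep the `size` n-grams whose frequency deviates most from the baseline.

    `baseline` stands in for "normal language usage"; how it is estimated
    is an assumption of this sketch.
    """
    freqs = ngram_freqs(text, n)
    recentred = {g: freqs.get(g, 0.0) - baseline.get(g, 0.0)
                 for g in set(freqs) | set(baseline)}
    # Sort by absolute deviation, breaking ties alphabetically for determinism.
    top = sorted(recentred, key=lambda g: (-abs(recentred[g]), g))[:size]
    return {g: recentred[g] for g in top}


def cosine_distance(p, q):
    """1 - cosine similarity between two sparse profiles (assumed distance)."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def attribute(unknown_text, known_texts, n=3, size=50):
    """Return the 'best matching author' for `unknown_text`.

    `known_texts` maps each author to their concatenated training text.
    """
    baseline = ngram_freqs(" ".join(known_texts.values()), n)
    profiles = {author: recentred_profile(text, baseline, n, size)
                for author, text in known_texts.items()}
    unknown = recentred_profile(unknown_text, baseline, n, size)
    return min(profiles, key=lambda a: cosine_distance(unknown, profiles[a]))
```

In this sketch a feature carries weight only to the extent that its frequency differs from the baseline, so n-grams that every author uses at the usual rate contribute little to the comparison.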

Type
Articles
Copyright
Copyright © Cambridge University Press 2011
