Consensus Clustering and Supervised Classification for Profiling Phishing Emails in Internet Commerce Security

Dazeley, Richard; Yearwood, John L.; Kang, Byeong H.; Kelarev, Andrei V.

doi:10.1007/978-3-642-15037-1_20

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6232)

Included in the following conference series:

Pacific Rim Knowledge Acquisition Workshop

Abstract

This article investigates internet commerce security applications of a novel combined method, which uses unsupervised consensus clustering algorithms in combination with supervised classification methods. First, a variety of independent clustering algorithms are applied to a randomized sample of data. Second, several consensus functions and sophisticated algorithms are used to combine these independent clusterings into one final consensus clustering. Third, the consensus clustering of the randomized sample is used as a training set to train several fast supervised classification algorithms. Finally, these fast classification algorithms are used to classify the whole large data set. One of the advantages of this approach is in its ability to facilitate the inclusion of contributions from domain experts in order to adjust the training set created by consensus clustering. We apply this approach to profiling phishing emails selected from a very large data set supplied by the industry partners of the Centre for Informatics and Applied Optimization. Our experiments compare the performance of several classification algorithms incorporated in this scheme.
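
The four steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithms, consensus functions, or classifiers used, so the sketch assumes k-means runs with different settings as the independent clusterings, a co-association (pairwise agreement) threshold as the consensus function, and a nearest-centroid rule as the fast classifier. All function names (`kmeans`, `consensus`, `profile`) and the two-dimensional toy data are hypothetical.

```python
import random
from itertools import combinations

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda c: dist2(p, centers[c]))

def kmeans(points, k, seed, iters=10):
    """Lloyd's k-means with farthest-first initialisation (assumed base clusterer)."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
    return labels

def consensus(clusterings, threshold=0.5):
    """Merge points that share a cluster in more than `threshold` of the runs."""
    n = len(clusterings[0])
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        agreement = sum(c[i] == c[j] for c in clusterings) / len(clusterings)
        if agreement > threshold:
            parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

def profile(data, sample_size, seed=0):
    """Steps 1-4: sample, cluster independently, combine, train, classify all data."""
    sample = random.Random(seed).sample(data, sample_size)
    # Step 1: several independent clusterings of the randomized sample.
    runs = [kmeans(sample, k, s) for k, s in [(2, 1), (2, 2), (2, 3), (3, 4), (3, 5)]]
    # Step 2: one consensus clustering; its labels form the training set.
    train_labels = consensus(runs)
    # Step 3: train a fast classifier (here: one centroid per consensus cluster).
    groups = {}
    for p, l in zip(sample, train_labels):
        groups.setdefault(l, []).append(p)
    centers = [centroid(groups[l]) for l in sorted(groups)]
    # Step 4: classify the whole data set.
    return [nearest(p, centers) for p in data]

# Toy usage: two well-separated groups standing in for email feature vectors.
rng = random.Random(0)
group_a = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
group_b = [(rng.uniform(19, 21), rng.uniform(19, 21)) for _ in range(200)]
labels = profile(group_a + group_b, sample_size=40, seed=1)
print(len(set(labels)), "profiles found")
```

In this sketch, `train_labels` is the point at which a domain expert could inspect and adjust the training set before the classifier is built, which is the advantage the abstract highlights.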

Author information

Authors and Affiliations

Centre for Informatics and Applied Optimization Graduate School of ITMS, University of Ballarat, P.O. Box 663, Ballarat, Victoria, 3353, Australia
Richard Dazeley, John L. Yearwood & Andrei V. Kelarev
School of Computing and Information Systems, University of Tasmania, Private Bag 100, Hobart, Tasmania, 7001, Australia
Byeong H. Kang

Editor information

Editors and Affiliations

School of Computing and Information Systems, University of Tasmania, TAS 7250, Launceston, Tasmania, Australia
Byeong-Ho Kang
Computing Department, Division of Information and Communication Sciences, Macquarie University, 2109, Sydney, NSW, Australia
Debbie Richards

About this paper

Cite this paper

Dazeley, R., Yearwood, J.L., Kang, B.H., Kelarev, A.V. (2010). Consensus Clustering and Supervised Classification for Profiling Phishing Emails in Internet Commerce Security. In: Kang, BH., Richards, D. (eds) Knowledge Management and Acquisition for Smart Systems and Services. PKAW 2010. Lecture Notes in Computer Science, vol 6232. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15037-1_20

DOI: https://doi.org/10.1007/978-3-642-15037-1_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15036-4
Online ISBN: 978-3-642-15037-1
eBook Packages: Computer Science, Computer Science (R0)
