Consensus Clustering and Supervised Classification for Profiling Phishing Emails in Internet Commerce Security

  • Conference paper
Knowledge Management and Acquisition for Smart Systems and Services (PKAW 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6232)

Abstract

This article investigates internet commerce security applications of a novel method that combines unsupervised consensus clustering algorithms with supervised classification methods. First, a variety of independent clustering algorithms are applied to a randomized sample of the data. Second, several consensus functions and sophisticated algorithms are used to combine these independent clusterings into one final consensus clustering. Third, the consensus clustering of the randomized sample is used as a training set to train several fast supervised classification algorithms. Finally, these fast classification algorithms are used to classify the whole large data set. One advantage of this approach is that domain experts can adjust the training set created by consensus clustering. We apply this approach to profiling phishing emails selected from a very large data set supplied by the industry partners of the Centre for Informatics and Applied Optimization. Our experiments compare the performance of several classification algorithms incorporated in this scheme.
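
The four steps described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the data are synthetic two-dimensional points, plain k-means with different seeds stands in for the "variety of independent clustering algorithms", a co-association majority vote stands in for the consensus functions, and a nearest-centroid rule stands in for the fast supervised classifiers. All function names are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = mean(members)
    return labels

def consensus(labelings, threshold=0.5):
    """Co-association consensus (an illustrative choice): link two points
    when more than `threshold` of the clusterings put them together, then
    label the connected components of that link graph."""
    n = len(labelings[0])
    final = [-1] * n
    next_label = 0
    for start in range(n):
        if final[start] != -1:
            continue
        final[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if final[j] == -1:
                    agree = sum(l[i] == l[j] for l in labelings)
                    if agree / len(labelings) > threshold:
                        final[j] = next_label
                        stack.append(j)
        next_label += 1
    return final

def train_nearest_centroid(points, labels):
    """Fast supervised stand-in: one centroid per consensus cluster."""
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    return {l: mean(ps) for l, ps in groups.items()}

def classify(model, p):
    """Assign p to the label of the nearest centroid."""
    return min(model, key=lambda l: dist2(p, model[l]))

# Synthetic stand-in for the large data set: two well-separated groups.
rng = random.Random(0)
data = ([(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)] +
        [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(200)])

sample = rng.sample(data, 40)                         # randomized sample
labelings = [kmeans(sample, 2, random.Random(seed))   # step 1: independent clusterings
             for seed in range(5)]
sample_labels = consensus(labelings)                  # step 2: consensus clustering
model = train_nearest_centroid(sample, sample_labels) # step 3: train on the sample
predicted = [classify(model, p) for p in data]        # step 4: classify everything
```

In this sketch, the point at which a domain expert could intervene is between steps 2 and 3: `sample_labels` is an ordinary list that can be corrected by hand before it is passed to `train_nearest_centroid`.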

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dazeley, R., Yearwood, J.L., Kang, B.H., Kelarev, A.V. (2010). Consensus Clustering and Supervised Classification for Profiling Phishing Emails in Internet Commerce Security. In: Kang, BH., Richards, D. (eds) Knowledge Management and Acquisition for Smart Systems and Services. PKAW 2010. Lecture Notes in Computer Science, vol 6232. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15037-1_20

  • DOI: https://doi.org/10.1007/978-3-642-15037-1_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15036-4

  • Online ISBN: 978-3-642-15037-1

  • eBook Packages: Computer Science, Computer Science (R0)
