Abstract
This article investigates internet commerce security applications of a novel combined method that pairs unsupervised consensus clustering algorithms with supervised classification methods. First, a variety of independent clustering algorithms are applied to a randomized sample of the data. Second, several consensus functions and sophisticated algorithms are used to combine these independent clusterings into one final consensus clustering. Third, the consensus clustering of the randomized sample is used as a training set for several fast supervised classification algorithms. Finally, these fast classification algorithms are used to classify the whole large data set. One advantage of this approach is that it lets domain experts contribute by adjusting the training set created by consensus clustering. We apply this approach to profiling phishing emails selected from a very large data set supplied by the industry partners of the Centre for Informatics and Applied Optimization. Our experiments compare the performance of several classification algorithms incorporated in this scheme.
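The four steps described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the toy two-dimensional points stand in for email feature vectors, plain k-means with different settings stands in for the "variety of independent clustering algorithms", a thresholded co-association rule stands in for the consensus functions, and a nearest-centroid rule stands in for the fast supervised classifiers. All function names, parameters, and data below are assumptions chosen for illustration.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = mean(members)
    return labels

def consensus(clusterings, threshold=0.5):
    """Link two points if they share a cluster in more than `threshold`
    of the base clusterings; connected components become consensus labels."""
    n = len(clusterings[0])
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            together = sum(c[i] == c[j] for c in clusterings)
            if together / len(clusterings) > threshold:
                parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(n)})
    return [roots.index(find(i)) for i in range(n)]

def train_centroids(points, labels):
    """Nearest-centroid 'training': one mean vector per consensus cluster."""
    return {lab: mean([p for p, l in zip(points, labels) if l == lab])
            for lab in set(labels)}

def classify(point, centroids):
    """Assign a point to the consensus cluster with the closest centroid."""
    return min(centroids, key=lambda lab: dist2(point, centroids[lab]))

# Hypothetical toy data: two well-separated groups of 100 points each.
rng = random.Random(0)
data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)]
data += [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(100)]

sample = rng.sample(data, 40)                             # randomized sample
base = [kmeans(sample, k, rng) for k in (2, 2, 2, 3, 3)]  # step 1: independent clusterings
sample_labels = consensus(base)                           # step 2: consensus clustering
# A domain expert could correct entries of sample_labels here before training.
centroids = train_centroids(sample, sample_labels)        # step 3: train on the sample
all_labels = [classify(p, centroids) for p in data]       # step 4: classify the whole set
```

The expensive part (several clusterings plus the pairwise consensus step, which is quadratic in the sample size) runs only on the 40-point sample, while the full data set is handled by the cheap classifier. That division of labour is the reason the scheme scales to a very large data set.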
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dazeley, R., Yearwood, J.L., Kang, B.H., Kelarev, A.V. (2010). Consensus Clustering and Supervised Classification for Profiling Phishing Emails in Internet Commerce Security. In: Kang, BH., Richards, D. (eds) Knowledge Management and Acquisition for Smart Systems and Services. PKAW 2010. Lecture Notes in Computer Science, vol 6232. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15037-1_20
DOI: https://doi.org/10.1007/978-3-642-15037-1_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15036-4
Online ISBN: 978-3-642-15037-1
eBook Packages: Computer Science (R0)