Abstract
Learning from a small number of examples is a challenging problem in machine learning. An effective way to improve performance is to exploit knowledge from other related tasks. Multi-task learning (MTL) is one such paradigm, which aims to improve performance by jointly modeling multiple related tasks. Although numerous classification and regression models exist in the machine learning literature, most MTL models are built around ridge or logistic regression. A few works propose multi-task extensions of specific techniques such as support vector machines and Gaussian processes. However, all these MTL models are tied to particular classification or regression algorithms, and there is no single MTL algorithm that can be applied at a meta level to any given learning algorithm. Addressing this problem, we propose a generic, model-agnostic joint modeling framework that can take any classification or regression algorithm of a practitioner's choice (standard or custom-built) and build its MTL variant. The key observation driving our framework is that, due to the small number of examples, the estimates of task parameters are usually poor, and we show that this leads, with high probability, to an under-estimation of the relatedness between any two tasks. We derive an algorithm that brings the tasks closer to their true relatedness by improving the estimates of task parameters, which is achieved through appropriate sharing of data across tasks. We provide detailed theoretical underpinnings of the algorithm. Through experiments with both synthetic and real datasets, we demonstrate that the multi-task variants of several classifiers and regressors (logistic regression, support vector machine, k-nearest neighbor, random forest, ridge regression, support vector regression) convincingly outperform their single-task counterparts. We also show that the proposed model performs comparably to, or better than, many state-of-the-art MTL and transfer learning baselines.
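The workflow sketched in the abstract (fit each task independently, estimate pairwise task relatedness from the resulting parameter estimates, then refit each task using data borrowed from sufficiently related tasks) can be illustrated in a few lines. The sketch below is not the paper's actual algorithm: the cosine-based relatedness measure, the `relatedness_threshold` parameter, and the assumption of a linear scikit-learn estimator exposing `coef_` are simplifications introduced here for concreteness.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def mtl_fit(base_estimator, tasks, relatedness_threshold=0.5):
    """Illustrative model-agnostic MTL wrapper.

    tasks: dict mapping a task id to a (X, y) pair.
    Assumes a linear base estimator exposing `coef_` after fitting.
    """
    # First pass: single-task fits give (noisy) per-task parameter estimates.
    single = {t: clone(base_estimator).fit(X, y) for t, (X, y) in tasks.items()}

    models = {}
    for t, (X, y) in tasks.items():
        Xs, ys = [X], [y]
        wt = single[t].coef_.ravel()
        for u, (Xu, yu) in tasks.items():
            if u == t:
                continue
            wu = single[u].coef_.ravel()
            # Cosine similarity of parameter estimates as a relatedness proxy.
            rel = wt @ wu / (np.linalg.norm(wt) * np.linalg.norm(wu))
            if rel > relatedness_threshold:
                # Second pass: borrow data from sufficiently related tasks.
                Xs.append(Xu)
                ys.append(yu)
        models[t] = clone(base_estimator).fit(np.vstack(Xs), np.concatenate(ys))
    return models
```

Any estimator whose parameters admit a relatedness measure could be substituted for the logistic regression used here; that substitutability is the point of a model-agnostic formulation.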
Notes
In the case of a general nonlinear model, \(d\le N_{t'}\). For linear models, assuming a linearly independent set of data in task \(t'\), \(d={\text {min}}\left( M,N_{t'}\right) \).
The underestimation concerns the magnitude of relatedness, irrespective of its sign: positive relatedness values are often estimated as smaller positive values, and negative relatedness values as weaker negative values, i.e. closer to zero.
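The shrinkage described in this note can be checked numerically: adding estimation noise to two related task parameter vectors attenuates their measured relatedness toward zero on average. The dimension, noise scale, and use of cosine similarity below are illustrative choices made here, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 50  # parameter dimension (illustrative)

# Two strongly related "true" task parameter vectors.
w1 = rng.standard_normal(M)
w2 = 0.9 * w1 + 0.1 * rng.standard_normal(M)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

true_rel = cosine(w1, w2)

# Noisy parameter estimates, as one would obtain from small training samples.
noisy_rels = [
    cosine(w1 + rng.standard_normal(M), w2 + rng.standard_normal(M))
    for _ in range(1000)
]

print(f"true relatedness:           {true_rel:.3f}")
print(f"mean estimated relatedness: {np.mean(noisy_rels):.3f}")
```

On average the estimated relatedness keeps the sign of the true value but is noticeably smaller in magnitude, matching the note.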
Ethics approval obtained through University and the hospital—12/83.
Cite this article
Gupta, S., Rana, S., Saha, B. et al. A new transfer learning framework with application to model-agnostic multi-task learning. Knowl Inf Syst 49, 933–973 (2016). https://doi.org/10.1007/s10115-016-0926-z