Abstract
Learning preference models from human-generated data is an important task in modern information processing systems. A popular setting consists of simple input ratings, assigned numerical values to indicate their relevancy with respect to a specific query. Since ratings are often specified within a small range, several objects may receive the same rating, thus creating ties among objects for a given query. Dealing with this phenomenon presents the general problem of modelling preferences in the presence of ties in a query-specific manner. To this end, we present in this paper a novel approach that constructs probabilistic models directly on the collection of objects, exploiting the combinatorial structure induced by the ties among them. The proposed probabilistic setting allows exploration of a super-exponential combinatorial state-space with an unknown number of partitions and an unknown order among them. Learning and inference in such a large state-space are challenging, yet we present efficient algorithms to perform these tasks. Our approach exploits discrete choice theory, imposing a generative process in which the finite set of objects is partitioned into subsets in a stagewise procedure, thus reducing the state-space significantly at each stage. Efficient Markov chain Monte Carlo algorithms are then presented for the proposed models. We demonstrate that the models can potentially be trained in a large-scale setting of hundreds of thousands of objects on an ordinary computer. In fact, in some special cases with appropriate model specification, our models can be learned in linear time. We evaluate the models on two application areas: (i) document ranking with data from the Yahoo! challenge and (ii) collaborative filtering with movie data. We demonstrate that the models are competitive against the state of the art.
Notes
This rating-to-rank conversion is not reversible, since we cannot generally infer ratings from a ranking. First, the top rating for each query is always converted into rank 1, even if it is not the maximum score on the rating scale. Second, there are no gaps in a ranking, whereas we may rate the best object with 5 stars but the second best with only 3 stars.
We are aware that clickthrough data can help to obtain a complete ordering, but the data may be noisy.
We caution against confusing ‘rating’ and ‘ranking’ here. Ranking is the process of sorting a set of objects in increasing or decreasing order, whereas in ‘rating’, each object is assigned a value indicating its preference.
Strictly speaking, a partition can be an empty set, but we deliberately leave out this case because empty sets do not contribute to the probability mass of the model, and the case does not match the real-world intuition of an object’s worth.
More precisely, when the number of partitions K is given, the cardinality ranges from 1 to \(N-K+1\) since partitions are non-empty.
The usual understanding would also contain the empty set, but we exclude it in this paper.
i.e. the function value does not depend on the order of elements within the partition.
To illustrate this intuition, suppose the remainder set is \(R_{k}=\left\{ a,b\right\} \), hence its power set, excluding \(\emptyset \), contains 3 subsets \(\left\{ a\right\} ,\left\{ b\right\} ,\left\{ a,b\right\} \). Under the arithmetic mean assumption, the denominator in Eq. (7) becomes \(\phi \left( r_{a}\right) +\phi \left( r_{b}\right) +\frac{1}{2}\left\{ \phi \left( r_{a}\right) +\phi \left( r_{b}\right) \right\} =(1+\frac{1}{2})\sum _{x\in \left\{ a,b\right\} }\phi \left( r_{x}\right) \). The constant term is \(C=\frac{3}{2}\) in this case.
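This enumeration can be checked directly. The sketch below uses hypothetical worth values for \(\phi(r_a)\) and \(\phi(r_b)\) and confirms that the subset-mean sum equals \(C\) times the total worth, with \(C=\frac{3}{2}\):

```python
from itertools import combinations

def mean_denominator(worths):
    """Sum of arithmetic-mean aggregated worths over all non-empty subsets."""
    items = list(worths)
    total = 0.0
    for m in range(1, len(items) + 1):
        for subset in combinations(items, m):
            total += sum(worths[x] for x in subset) / len(subset)
    return total

# Remainder set {a, b}; the worth values are illustrative placeholders.
worths = {"a": 2.0, "b": 4.0}
denom = mean_denominator(worths)
assert abs(denom - 1.5 * sum(worths.values())) < 1e-9  # C = 3/2
```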
To be more precise, for \(k=1\) we define \( X _{1:0}\) to be \(\emptyset \).
This is 2-D because we also need to index the parameters as well as the subsets.
We especially thank the reviewer who pointed out that the computation could be efficient for this case.
Please note that these states are defined for the Markov random field under study only.
A confusion may arise here: although during training each training query q is supplied with a list of related objects and their ratings, during the ranking phase the system still needs to return a ranking over the list of related objects for an unseen query.
In document querying, for example, the list may consist of all documents which contain one or more query words.
Note that generally \(K\le M+1\) because there may be gaps in rating scales for a specific query.
This is much larger than the commonly used LETOR 3.0 and 4.0 data sets. During the preparation of this manuscript, we learnt that Microsoft had released two large data sets of size comparable to Yahoo!’s, but due to time constraints, we do not report results on them here.
Strictly speaking, RankNet uses neural networks as the scoring function, but the overall loss is still logistic; for simplicity, we use a simple perceptron.
We are aware that the results should be accompanied by error bars; however, since the data are huge, running the experiments repeatedly is extremely time-consuming.
Our result using second-order features was submitted to the Yahoo! challenge and placed in the top 4 % of 1055 teams, even though our main purpose was to propose a new theoretical and useful model.
The code is available at: http://www.cofirank.org/downloads. We implement a simple wrapper to compute the ERR and NDCG scores (at various positions), which are not available in the code.
Note that, this is different from saying the states of variables are independent.
References
Adomavicius G, Tuzhilin A (2005) Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans Knowl Data Eng 17(6):734–749
Becchetti L, Colesanti UM, Marchetti-Spaccamela A, Vitaletti A (2011) Recommending items in pervasive scenarios: models and experimental analysis. Knowl Inf Syst 28(3):555–578
Bradley RA, Terry ME (1952) Rank analysis of incomplete block designs. Biometrika 39:324–345
Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst 30(1–7):107–117
Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, Hullender G (2005) Learning to rank using gradient descent. In: Proceedings of ICML, 96
Cao Z, Qin T, Liu TY, Tsai MF, Li H (2007) Learning to rank: from pairwise approach to listwise approach. In: Proceedings of the 24th international conference on machine learning, 136 pp. ACM
Carreira-Perpiñán MA, Hinton GE (2005) On contrastive divergence learning. In: Cowell RG, Ghahramani Z (eds) Proceedings of the 10th international workshop on artificial intelligence and statistics (AISTATS). Society for Artificial Intelligence and Statistics, Barbados, pp 33–40, Jan 6–8
Chapelle O, Chang Y (2011) Yahoo! learning to rank challenge overview. JMLR workshop and conference proceedings, vol 14, pp 1–24
Chapelle O, Metlzer D, Zhang Y, Grinspan P (2009) Expected reciprocal rank for graded relevance. In: CIKM. ACM, pp 621–630
Chu W, Ghahramani Z (2006) Gaussian processes for ordinal regression. J Mach Learn Res 6(1):1019
Chu W, Keerthi SS (2007) Support vector ordinal regression. Neural Comput 19(3):792–815
Cossock D, Zhang T (2008) Statistical analysis of Bayes optimal subset ranking. IEEE Trans Inf Theory 54(11):5140–5154
Davidson RR (1970) On extending the Bradley-Terry model to accommodate ties in paired comparison experiments. J Am Stat Assoc 65(329):317–328
Diaconis P (1988) Group representations in probability and statistics. Institute of Mathematical Statistics Hayward, CA
Dietterich TG, Lathrop RH, Lozano-Pérez T (1997) Solving the multiple instance problem with axis-parallel rectangles. Artif Intell 89(1–2):31–71
Fligner MA, Verducci JS (1988) Multistage ranking models. J Am Stat Assoc 83(403):892–901
Freund Y, Iyer R, Schapire RE, Singer Y (2004) An efficient boosting algorithm for combining preferences. J Mach Learn Res 4(6):933–969
Fürnkranz J, Hüllermeier E (2010) Preference learning. Springer, New York
Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell PAMI 6(6):721–742
Glenn WA, David HA (1960) Ties in paired-comparison experiments using a modified Thurstone-Mosteller model. Biometrics 16(1):86–109
Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109
Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14:1771–1800
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
Huang J, Guestrin C, Guibas L (2009) Fourier theoretic probabilistic inference over permutations. J Mach Learn Res 10:997–1070
Huang TK, Weng RC, Lin CJ (2006) Generalized Bradley-Terry models and multi-class probability estimates. J Mach Learn Res 7:115
Järvelin K, Kekäläinen J (2002) Cumulated gain-based evaluation of IR techniques. ACM Trans Inf Syst TOIS 20(4):446
Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of SIGKDD. ACM, New York, NY, USA, pp 133–142
Kendall MG (1938) A new measure of rank correlation. Biometrika 30(1/2):81–93
Koren Y (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: KDD
Lauritzen SL (1996) Graphical models. Oxford Science Publications, Oxford
Lebanon G, Mao Y (2008) Non-parametric modeling of partially ranked data. J Mach Learn Res 9:2401–2429
Leung CW, Chan SC, Chung F (2006) A collaborative filtering framework based on fuzzy association rules and multiple-level similarity. Knowl Inf Syst 10(3):357–381
Liu NN, Zhao M, Yang Q (2009) Probabilistic latent preference analysis for collaborative filtering. In: CIKM. ACM, pp 759–766
Liu TY (2009) Learning to rank for information retrieval. Found Trends Inf Retr 3(3):225–331
Luce RD (1959) Individual choice behavior. Wiley, New York
Mallows CL (1957) Non-null ranking models. I. Biometrika 44(1):114–130
Marden JI (1995) Analyzing and modeling rank data. Chapman & Hall/CRC, London
Marlin B, Swersky K, Chen B, de Freitas N (2010) Inductive principles for restricted Boltzmann machine learning. In: Proceedings of the 13th international conference on artificial intelligence and statistics, Chia Laguna Resort, Sardinia, Italy
Mureşan M (2008) A concrete approach to classical analysis. Springer, Berlin
Neal RM (2001) Annealed importance sampling. Stat Comput 11(2):125–139
Plackett RL (1975) The analysis of permutations. Appl Stat 24(2):193–202
Rao PV, Kupper LL (1967) Ties in paired-comparison experiments: a generalization of the Bradley-Terry model. J Am Stat Assoc 62(317):194–204
Resnick P, Iacovou N, Suchak M, Bergstorm P, Riedl J (1994) GroupLens: an open architecture for collaborative filtering of netnews. In: Proceedings of ACM conference on computer supported cooperative work. Chapel Hill, North Carolina. ACM, pp 175–186
Sarwar B, Karypis G, Konstan J, Reidl J (2001) Item-based collaborative filtering recommendation algorithms. In: Proceedings of the 10th international conference on World Wide Web. ACM Press, New York, NY, USA, pp 285–295
Shi Y, Larson M, Hanjalic A (2010) List-wise learning to rank with matrix factorization for collaborative filtering. In: ACM RecSys. ACM, pp 269–272
Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15(1):72–101
Tieleman T, Hinton G (2009) Using fast weights to improve persistent contrastive divergence. In: Proceedings of the 26th annual international conference on machine learning. ACM, New York, NY, USA
Truyen T, Phung DQ, Venkatesh S (2011) Probabilistic models over ordered partitions with applications in document ranking and collaborative filtering. In: Proceedings of SIAM conference on data mining (SDM), Mesa, Arizona, USA. SIAM
van Lint JH, Wilson RM (1992) A course in combinatorics. Cambridge University Press, Cambridge
Vembu S, Gärtner T (2010) Label ranking algorithms: a survey. In Preference learning, p 45
Volkovs MN, Zemel RS (2009) BoltzRank: learning to maximize expected ranking gain. In: Proceedings of the 26th annual international conference on machine learning. ACM, New York, NY, USA
Weimer M, Karatzoglou A, Le Q, Smola A (2008) CoFi\(^{RANK}\)-maximum margin matrix factorization for collaborative ranking. Adv Neural Inf Process Syst 20:1593–1600
Xia F, Liu TY, Wang J, Zhang W, Li H (2008) Listwise approach to learning to rank: theory and algorithm. In: Proceedings of ICML, pp 1192–1199
Younes L (1989) Parametric inference for imperfectly observed Gibbsian fields. Probab Theory Relat Fields 82(4):625–645
Zhou K, Xue GR, Zha H, Yu Y (2008) Learning to rank with ties. In: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, pp 275–282
Acknowledgments
We thank the anonymous reviewers for their constructive comments, in particular the suggestion of minimum/maximum aggregation functions.
Appendix
1.1 Computing \(C_{k}\)
Here we calculate the constant \(C_{k}\) in Eq. (11). Let us rewrite the equation for ease of comprehension
where \(2^{R_{k}}\) is the power set with respect to the set \(R_{k}\), or the set of all non-empty subsets of \(R_{k}\). Equivalently
If all objects have the same worth, then this simplifies to
where \(N_{k}=|R_{k}|\). In the last equation, we have used the fact that \(\sum _{S\in 2^{R_{k}}}1\) is the number of all possible non-empty subsets, or equivalently the size of the power set excluding the empty set, which is \(2^{N_{k}}-1\). One way to derive this result is to imagine a collection of \(N_{k}\) variables, each with two states: \(\mathtt {selected}\) and \(\mathtt {notselected}\), where \(\mathtt {selected}\) means the object belongs to the subset. Since there are \(2^{N_{k}}\) such configurations over all states, the number of non-empty subsets must be \(2^{N_{k}}-1\).
For arbitrary objects, let us examine the probability that an object x belongs to a random subset of size m, which is \(\frac{m}{N_{k}}\). Recall from standard combinatorics that the number of m-element subsets is the binomial coefficient \(\binom{N_{k}}{m}\), where \(1\le m\le N_{k}\). Thus the number of times an object appears in m-element subsets is \(\binom{N_{k}}{m}\frac{m}{N_{k}}\). Taking into account that this count is weighted down by m (i.e. |S| in Eq. (11)), the contribution towards \(C_{k}\) is \(\binom{N_{k}}{m}\frac{1}{N_{k}}\). Finally, we can compute the constant \(C_{k}\), which is the weighted number of times an object belongs to any subset of any size, as follows
We have made use of the known identity \(\sum _{m=1}^{N_{k}}\binom{N_{k}}{m}=2^{N_{k}}-1\).
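The derivation above can be verified by brute force. The sketch below (with arbitrary worths as placeholders) enumerates all non-empty subsets and confirms that the ratio of the subset-mean sum to the total worth equals \(\frac{2^{N_{k}}-1}{N_{k}}\):

```python
from itertools import combinations
import random

def brute_force_constant(n, seed=0):
    """Ratio of the subset-mean sum to the total worth, by enumeration."""
    rng = random.Random(seed)
    phi = [rng.uniform(0.5, 2.0) for _ in range(n)]  # arbitrary worths
    total = 0.0
    for m in range(1, n + 1):
        for s in combinations(range(n), m):
            total += sum(phi[i] for i in s) / m  # mean aggregation over S
    return total / sum(phi)

# Matches the closed form (2^n - 1) / n for small n.
for n in range(1, 8):
    assert abs(brute_force_constant(n) - (2 ** n - 1) / n) < 1e-9
```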
1.2 Computing \(M_{k}(x)\)
We now calculate the constant \(M_{k}(x)\) in Eq. (17), which is reproduced here for clarity:
First, we arrange the objects in decreasing order of worth \(\phi (x)\). For notational convenience, we assume the order is \(1,2,3,\ldots ,N_{k}\). The largest object appears in a subset consisting of only itself, and in \(2^{N_{k}-1}-1\) other subsets. Thus \(M_{k}(1)=2^{N_{k}-1}\), since for every subset to which the largest object belongs, the maximum aggregation is, by definition, the worth of that object. Now, removing the largest object, consider the second largest one. By the same argument, \(M_{k}(2)=2^{N_{k}-2}\). Continuing the same line of reasoning, we end up with \(M_{k}(n)=2^{N_{k}-n}\).
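The counting argument above can also be checked by enumeration. In this sketch, indices \(1,\ldots,N_{k}\) stand in for objects sorted by decreasing worth, so object n attains the maximum in a subset exactly when it has the smallest index:

```python
from itertools import combinations

def count_max_subsets(n_objects, obj):
    """Count non-empty subsets of {1..n_objects} in which `obj` has the
    smallest index, i.e. the largest worth under the decreasing ordering."""
    count = 0
    universe = list(range(1, n_objects + 1))
    for m in range(1, n_objects + 1):
        for s in combinations(universe, m):
            if min(s) == obj:
                count += 1
    return count

N = 6
for n in range(1, N + 1):
    assert count_max_subsets(N, n) == 2 ** (N - n)  # M_k(n) = 2^(N_k - n)
```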
1.3 Pairwise losses
Let \(f(x_{i},w)\) be the scoring function parameterised by w that takes the input vector \(x_{i}\) and outputs a real value indicating the relevancy of object i. Let \(\delta _{ij}(w)=f(x_{i},w)-f(x_{j},w)\). Pairwise models are quite similar in their general setting. The only difference is the specific loss function:
However, these losses behave quite differently from each other. For RankNet and RankBoost, minimising the loss widens the margin between the scores for \(x_{i}\) and \(x_{j}\) as much as possible; the difference is that RankNet is less sensitive to noise due to the log-scale. Ranking SVM, however, aims only to achieve a margin of 1, while rank regression attempts to bound the margin by 1.
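A minimal sketch of the four pairwise losses, using the standard forms consistent with the description above (logistic for RankNet, exponential for RankBoost, hinge for Ranking SVM, squared margin for rank regression); the precise forms used in the paper may differ slightly:

```python
import math

def ranknet_loss(delta):      # logistic: widens the margin, log-scale damps noise
    return math.log(1.0 + math.exp(-delta))

def rankboost_loss(delta):    # exponential: widens the margin aggressively
    return math.exp(-delta)

def ranksvm_loss(delta):      # hinge: satisfied once the margin reaches 1
    return max(0.0, 1.0 - delta)

def rankregress_loss(delta):  # squared: pulls the margin toward exactly 1
    return (delta - 1.0) ** 2
```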
At first sight, the cost of gradient evaluation for pairwise losses would be \(\mathcal {O}(0.5N(N-1)F)\), where F is the number of parameters. However, we can achieve \(\max \{\mathcal {O}(0.5N(N-1)),\mathcal {O}(NF)\}\) as follows. The overall loss for a particular query is
Taking derivative with respect to w yields
As \(\left\{ \frac{\partial \ell (x_{i}\succ x_{j};w)}{\partial \delta _{ij}}\right\} _{i,j|x_{i}\succ x_{j}}\) can be computed in \(\mathcal {O}(0.5N(N-1))\) time, and \(\left\{ \frac{\partial f_{i}}{\partial w}\right\} _{i}\) in \(\mathcal {O}(NF)\) time, the overall cost would be \(\max \{\mathcal {O}(0.5N(N-1)),\mathcal {O}(NF)\}\).
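The factoring above can be sketched as follows, assuming for concreteness a linear scorer \(f(x,w)=w^{\top }x\) and a hinge pairwise loss (both illustrative choices, not the paper's specification). Per-pair derivatives are accumulated into per-object coefficients, so the feature-dimension work is a single \(\mathcal {O}(NF)\) pass:

```python
import numpy as np

def pairwise_grad(X, ratings, w):
    """Gradient of the summed hinge pairwise loss with a linear scorer
    f(x, w) = w @ x. Pair coefficients cost O(N^2); feature work is O(NF)."""
    scores = X @ w                      # O(NF)
    coef = np.zeros(len(scores))        # accumulated d loss / d f_i per object
    for i in range(len(scores)):
        for j in range(len(scores)):
            if ratings[i] > ratings[j]:          # x_i preferred over x_j
                d = scores[i] - scores[j]
                g = -1.0 if d < 1.0 else 0.0     # d hinge / d delta_ij
                coef[i] += g
                coef[j] -= g
    return coef @ X                     # O(NF): one pass over the features
```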
1.4 Learning the pairwise ties models
This subsection describes the details of learning the pairwise ties models discussed in Sect. 6.
1.4.1 Davidson method
Recall from Sect. 2 that in the Davidson method, the probability masses are defined as
where \(Z_{ij}=\phi (x_{i})+\phi (x_{j})+\nu \sqrt{\phi (x_{i})\phi (x_{j})}\) and \(\nu \ge 0\). For simplicity of unconstrained optimisation, let \(\nu =e^{\beta }\) for \(\beta \in \mathbb {R}\). Let \(P_{i}=P(x_{i}\succ x_{j};w)\), \(P_{j}=P(x_{j}\succ x_{i};w)\) and \(P_{ij}=P(x_{i}\sim x_{j};w)\).
Taking derivatives of the log-likelihood gives
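As a numerical sanity check on the Davidson masses defined above, the following sketch (with hypothetical worth values) verifies that the three probabilities sum to one:

```python
import math

def davidson_masses(phi_i, phi_j, beta):
    """Davidson win/loss/tie probabilities with nu = exp(beta) >= 0."""
    nu = math.exp(beta)
    tie = nu * math.sqrt(phi_i * phi_j)
    Z = phi_i + phi_j + tie             # Z_ij from the definition above
    return phi_i / Z, phi_j / Z, tie / Z

# Illustrative worths: object i twice as worthy as object j.
P_i, P_j, P_ij = davidson_masses(2.0, 1.0, 0.0)
assert abs(P_i + P_j + P_ij - 1.0) < 1e-12
assert P_i > P_j                        # the worthier object wins more often
```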
1.4.2 Rao-Kupper method
Recall from Sect. 2 that the Rao-Kupper model defines the following probability masses
where \(\theta \ge 1\) is the ties factor and w is the model parameter. Note that \(\phi (.)\) is also a function of w, which we omit here for clarity. For ease of unconstrained optimisation, let \(\theta =1+e^{\alpha }\) for \(\alpha \in \mathbb {R}\). In learning, we want to estimate both \(\alpha \) and w. Let
Taking partial derivatives of the log-likelihood gives
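A similar sanity check applies to the Rao-Kupper model, using the standard Rao-Kupper probability masses with the substitution \(\theta =1+e^{\alpha }\) (worth values here are hypothetical placeholders):

```python
import math

def rao_kupper_masses(phi_i, phi_j, alpha):
    """Rao-Kupper win/loss/tie probabilities with theta = 1 + exp(alpha) > 1."""
    theta = 1.0 + math.exp(alpha)
    p_i = phi_i / (phi_i + theta * phi_j)
    p_j = phi_j / (phi_j + theta * phi_i)
    p_tie = ((theta ** 2 - 1.0) * phi_i * phi_j) / (
        (phi_i + theta * phi_j) * (phi_j + theta * phi_i))
    return p_i, p_j, p_tie

# Illustrative worths; the three masses must sum to one.
P_i, P_j, P_ij = rao_kupper_masses(2.0, 1.0, 0.0)
assert abs(P_i + P_j + P_ij - 1.0) < 1e-12
```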
Tran, T., Phung, D. & Venkatesh, S. Modelling human preferences for ranking and collaborative filtering: a probabilistic ordered partition approach. Knowl Inf Syst 47, 157–188 (2016). https://doi.org/10.1007/s10115-015-0840-9