Modelling human preferences for ranking and collaborative filtering: a probabilistic ordered partition approach

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Learning preference models from human-generated data is an important task in modern information processing systems. A popular setting consists of simple input ratings: numerical values assigned to objects to indicate their relevancy with respect to a specific query. Since ratings are often specified within a small range, several objects may receive the same rating, creating ties among objects for a given query. Dealing with this phenomenon presents the general problem of modelling preferences that are query-specific and contain ties. To this end, we present in this paper a novel approach that constructs probabilistic models directly on the collection of objects, exploiting the combinatorial structure induced by the ties among them. The proposed probabilistic setting allows exploration of a super-exponential combinatorial state-space with an unknown number of partitions and an unknown order among them. Learning and inference in such a large state-space are challenging, yet we present efficient algorithms to perform these tasks. Our approach exploits discrete choice theory, imposing a generative process in which the finite set of objects is partitioned into subsets in a stagewise procedure, thus significantly reducing the state-space at each stage. Efficient Markov chain Monte Carlo algorithms are then presented for the proposed models. We demonstrate that the model can potentially be trained in a large-scale setting of hundreds of thousands of objects using an ordinary computer. In fact, in some special cases with appropriate model specification, our models can be learned in linear time. We evaluate the models on two application areas: (i) document ranking with data from the Yahoo! challenge and (ii) collaborative filtering with movie data. We demonstrate that the models are competitive against the state of the art.


Notes

  1. This rating-to-rank conversion is not reversible since we cannot generally infer ratings from a ranking. First, the top rating for each query is always converted into rank 1, even if it is not the maximum score in the rating scale. Second, rankings have no gaps, whereas ratings may: we may rate the best object 5 stars but the second best only 3 stars.

  2. http://www.netflixprize.com/.

  3. We are aware that clickthrough data can help to obtain a complete ordering, but the data may be noisy.

  4. We caution against confusing ‘rating’ and ‘ranking’ here. Ranking is the process of sorting a set of objects in increasing or decreasing order, whereas in rating, each object is assigned a value indicating its preference.

  5. Strictly speaking, a partition could be an empty set, but we deliberately leave this case out, because empty sets do not contribute to the probability mass of the model, and an empty partition does not match the real-world intuition of an object’s worth.

  6. More precisely, when the number of partitions K is given, the cardinality ranges from 1 to \(N-K+1\) since partitions are non-empty.

  7. This process resembles the generative process of the Plackett–Luce discrete choice model [35, 41], except that we apply it to partitions rather than single elements. It is clear from this that the Plackett–Luce model is a special case of ours, wherein each partition \(X_{k}\) reduces to a singleton.

  8. The usual understanding would also contain the empty set, but we exclude it in this paper.

  9. i.e. the function value does not depend on the order of elements within the partition.

  10. To illustrate this intuition, suppose the remainder set is \(R_{k}=\left\{ a,b\right\} \), hence its power set, excluding \(\emptyset \), contains 3 subsets \(\left\{ a\right\} ,\left\{ b\right\} ,\left\{ a,b\right\} \). Under the arithmetic mean assumption, the denominator in Eq. (7) becomes \(\phi \left( r_{a}\right) +\phi \left( r_{b}\right) +\frac{1}{2}\left\{ \phi \left( r_{a}\right) +\phi \left( r_{b}\right) \right\} =(1+\frac{1}{2})\sum _{x\in \left\{ a,b\right\} }\phi \left( r_{x}\right) \). The constant term is \(C=\frac{3}{2}\) in this case.

  11. To be more precise, for \(k=1\) we define \( X _{1:0}\) to be \(\emptyset \).

  12. This is 2-D because we also need to index the parameters as well as the subsets.

  13. We especially thank the reviewer who pointed out that the computation could be efficient for this case.

  14. Please note that these states are defined for the Markov random field under study only.

  15. A confusion may arise here: although during training each training query q is supplied with a list of related objects and their ratings, during the ranking phase the system still needs to return a ranking over the list of related objects for an unseen query.

  16. In document querying, for example, the list may consist of all documents which contain one or more query words.

  17. Note that generally \(K\le M+1\) because there may be gaps in rating scales for a specific query.

  18. This is much larger than the commonly used LETOR 3.0 and 4.0 data sets. During the preparation of this manuscript, we learnt that Microsoft had released two large data sets of size comparable to Yahoo!’s, but due to time constraints, we do not report results on them here.

  19. Strictly speaking, RankNet uses a neural network as the scoring function, but the overall loss is still logistic; for simplicity, we use a simple perceptron.

  20. We are aware that the results should be reported with error bars; however, since the data are huge, running the experiments repeatedly is extremely time-consuming.

  21. Our result using second-order features was submitted to the Yahoo! challenge and placed in the top 4 % of 1055 teams, although our main purpose was to propose a new, theoretically grounded and useful model.

  22. http://grouplens.org/node/73.

  23. The code is available at: http://www.cofirank.org/downloads. We implement a simple wrapper to compute the ERR and NDCG scores (at various positions), which are not available in the code.

  24. Note that, this is different from saying the states of variables are independent.

References

  1. Adomavicius G, Tuzhilin A (2005) Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans Knowl Data Eng 17(6):734–749

  2. Becchetti L, Colesanti UM, Marchetti-Spaccamela A, Vitaletti A (2011) Recommending items in pervasive scenarios: models and experimental analysis. Knowl Inf Syst 28(3):555–578

  3. Bradley RA, Terry ME (1952) Rank analysis of incomplete block designs. Biometrika 39:324–345

  4. Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst 30(1–7):107–117

  5. Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, Hullender G (2005) Learning to rank using gradient descent. In: Proceedings of ICML, 96

  6. Cao Z, Qin T, Liu TY, Tsai MF, Li H (2007) Learning to rank: from pairwise approach to listwise approach. In: Proceedings of the 24th international conference on machine learning, 136 pp. ACM

  7. Carreira-Perpiñán MA, Hinton GE (2005) On contrastive divergence learning. In: Cowell RG, Ghahramani Z (eds) Proceedings of the 10th international workshop on artificial intelligence and statistics (AISTATS). Society for Artificial Intelligence and Statistics, Barbados, pp 33–40, Jan 6–8

  8. Chapelle O, Chang Y (2011) Yahoo! learning to rank challenge overview. JMLR workshop and conference proceedings, vol 14, pp 1–24

  9. Chapelle O, Metlzer D, Zhang Y, Grinspan P (2009) Expected reciprocal rank for graded relevance. In: CIKM. ACM, pp 621–630

  10. Chu W, Ghahramani Z (2006) Gaussian processes for ordinal regression. J Mach Learn Res 6(1):1019

  11. Chu W, Keerthi SS (2007) Support vector ordinal regression. Neural Comput 19(3):792–815

  12. Cossock D, Zhang T (2008) Statistical analysis of Bayes optimal subset ranking. IEEE Trans Inf Theory 54(11):5140–5154

  13. Davidson RR (1970) On extending the Bradley-Terry model to accommodate ties in paired comparison experiments. J Am Stat Assoc 65(329):317–328

  14. Diaconis P (1988) Group representations in probability and statistics. Institute of Mathematical Statistics Hayward, CA

  15. Dietterich TG, Lathrop RH, Lozano-Pérez T (1997) Solving the multiple instance problem with axis-parallel rectangles. Artif Intell 89(1–2):31–71

  16. Fligner MA, Verducci JS (1988) Multistage ranking models. J Am Stat Assoc 83(403):892–901

  17. Freund Y, Iyer R, Schapire RE, Singer Y (2004) An efficient boosting algorithm for combining preferences. J Mach Learn Res 4(6):933–969

  18. Fürnkranz J, Hüllermeier E (2010) Preference learning. Springer, New York

  19. Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell PAMI 6(6):721–742

  20. Glenn WA, David HA (1960) Ties in paired-comparison experiments using a modified Thurstone-Mosteller model. Biometrics 16(1):86–109

  21. Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109

  22. Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14:1771–1800

  23. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507

  24. Huang J, Guestrin C, Guibas L (2009) Fourier theoretic probabilistic inference over permutations. J Mach Learn Res 10:997–1070

  25. Huang TK, Weng RC, Lin CJ (2006) Generalized Bradley-Terry models and multi-class probability estimates. J Mach Learn Res 7:115

  26. Järvelin K, Kekäläinen J (2002) Cumulated gain-based evaluation of IR techniques. ACM Trans Inf Syst TOIS 20(4):446

  27. Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of SIGKDD. ACM, New York, NY, USA, pp 133–142

  28. Kendall MG (1938) A new measure of rank correlation. Biometrika 30(1/2):81–93

  29. Koren Y (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: KDD

  30. Lauritzen SL (1996) Graphical models. Oxford Science Publications, Oxford

  31. Lebanon G, Mao Y (2008) Non-parametric modeling of partially ranked data. J Mach Learn Res 9:2401–2429

  32. Leung CW, Chan SC, Chung F (2006) A collaborative filtering framework based on fuzzy association rules and multiple-level similarity. Knowl Inf Syst 10(3):357–381

  33. Liu NN, Zhao M, Yang Q (2009) Probabilistic latent preference analysis for collaborative filtering. In: CIKM. ACM, pp 759–766

  34. Liu TY (2009) Learning to rank for information retrieval. Found Trends Inf Retr 3(3):225–331

  35. Luce RD (1959) Individual choice behavior. Wiley, New York

  36. Mallows CL (1957) Non-null ranking models. I. Biometrika 44(1):114–130

  37. Marden JI (1995) Analyzing and modeling rank data. Chapman & Hall/CRC, London

  38. Marlin B, Swersky K, Chen B, de Freitas N (2010) Inductive principles for restricted Boltzmann machine learning. In: Proceedings of the 13th international conference on artificial intelligence and statistics, Chia Laguna Resort, Sardinia, Italy, May 2010

  39. Mureşan M (2008) A concrete approach to classical analysis. Springer, Berlin

  40. Neal RM (2001) Annealed importance sampling. Stat Comput 11(2):125–139

  41. Plackett RL (1975) The analysis of permutations. Appl Stat 24(2):193–202

  42. Rao PV, Kupper LL (1967) Ties in paired-comparison experiments: a generalization of the Bradley-Terry model. J Am Stat Assoc 62(317):194–204

  43. Resnick P, Iacovou N, Suchak M, Bergstorm P, Riedl J (1994) GroupLens: an open architecture for collaborative filtering of netnews. In: Proceedings of ACM conference on computer supported cooperative work. Chapel Hill, North Carolina. ACM, pp 175–186

  44. Sarwar B, Karypis G, Konstan J, Reidl J (2001) Item-based collaborative filtering recommendation algorithms. In: Proceedings of the 10th international conference on World Wide Web. ACM Press, New York, NY, USA, pp 285–295

  45. Shi Y, Larson M, Hanjalic A (2010) List-wise learning to rank with matrix factorization for collaborative filtering. In: ACM RecSys. ACM, pp 269–272

  46. Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15(1):72–101

  47. Tieleman T, Hinton G (2009) Using fast weights to improve persistent contrastive divergence. In: Proceedings of the 26th annual international conference on machine learning. ACM, New York, NY, USA

  48. Truyen T, Phung DQ, Venkatesh S (2011) Probabilistic models over ordered partitions with applications in document ranking and collaborative filtering. In: Proceedings of SIAM conference on data mining (SDM), Mesa, Arizona, USA. SIAM

  49. van Lint JH, Wilson RM (1992) A course in combinatorics. Cambridge University Press, Cambridge

  50. Vembu S, Gärtner T (2010) Label ranking algorithms: a survey. In: Preference learning. Springer, p 45

  51. Volkovs MN, Zemel RS (2009) BoltzRank: learning to maximize expected ranking gain. In: Proceedings of the 26th annual international conference on machine learning. ACM, New York, NY, USA

  52. Weimer M, Karatzoglou A, Le Q, Smola A (2008) CoFi\(^{RANK}\)-maximum margin matrix factorization for collaborative ranking. Adv Neural Inf Process Syst 20:1593–1600

  53. Xia F, Liu TY, Wang J, Zhang W, Li H (2008) Listwise approach to learning to rank: theory and algorithm. In: Proceedings of ICML, pp 1192–1199

  54. Younes L (1989) Parametric inference for imperfectly observed Gibbsian fields. Probab Theory Relat Fields 82(4):625–645

  55. Zhou K, Xue GR, Zha H, Yu Y (2008) Learning to rank with ties. In: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, pp 275–282

Acknowledgments

We thank anonymous reviewers for constructive comments, in particular, the suggestion of minimum/maximum aggregation functions.

Author information

Correspondence to Truyen Tran.

Appendix

1.1 Computing \(C_{k}\)

We here calculate the constant \(C_{k}\) in Eq. (11). Let us rewrite the equation for ease of comprehension

$$\begin{aligned} \sum _{S\in 2^{R_{k}}}\frac{1}{\left| S\right| }\sum _{x\in S}\phi _{k}\left( x\right) =C_{k}\times \sum _{x\in R_{k}}\phi _{k}(x) \end{aligned}$$

where \(2^{R_{k}}\) denotes the power set of \(R_{k}\) restricted to non-empty subsets, that is, the set of all non-empty subsets of \(R_{k}\). Equivalently

$$\begin{aligned} C_{k}=\sum _{S\in 2^{R_{k}}}\frac{1}{\left| S\right| }\sum _{x\in S}\frac{\phi _{k}\left( x\right) }{\sum _{x\in R_{k}}\phi _{k}(x)} \end{aligned}$$

If all objects have the same worth, then this simplifies to

$$\begin{aligned} C_{k}= & {} \sum _{S\in 2^{R_{k}}}\frac{1}{\left| S\right| }\sum _{x\in S}\frac{1}{N_{k}}=\frac{1}{N_{k}}\sum _{S\in 2^{R_{k}}}1=\frac{2^{N_{k}}-1}{N_{k}} \end{aligned}$$

where \(N_{k}=|R_{k}|\). In the last equation, we have made use of the fact that \(\sum _{S\in 2^{R_{k}}}1\) is the number of all possible non-empty subsets, which is known to be \(2^{N_{k}}-1\). One way to derive this result is to imagine a collection of \(N_{k}\) variables, each with two states: \(\mathtt {selected}\) and \(\mathtt {notselected}\), where \(\mathtt {selected}\) means the object belongs to the subset. There are \(2^{N_{k}}\) such configurations, and exactly one of them (all \(\mathtt {notselected}\)) yields the empty set, so the number of non-empty subsets must be \(2^{N_{k}}-1\).

For arbitrary objects, let us examine the probability that a given object x belongs to a uniformly chosen subset of size m, which is \(\frac{m}{N_{k}}\). Recall from standard combinatorics that the number of m-element subsets is the binomial coefficient \(\left( {{\begin{array}{ll} N_{k}\\ m\\ \end{array}}}\right) \), where \(1\le m\le N_{k}\). Thus the number of m-subsets containing a given object is \(\left( {{\begin{array}{ll} N_{k}\\ m\\ \end{array}}}\right) \frac{m}{N_{k}}\). Taking into account that each appearance is weighted down by m (i.e. |S| in Eq. (11)), the contribution towards \(C_{k}\) is \(\left( {{\begin{array}{ll} N_{k}\\ m\\ \end{array}}}\right) \frac{1}{N_{k}}\). Finally, we can compute the constant \(C_{k}\), which is the weighted number of times an object belongs to any subset of any size, as follows

$$\begin{aligned} C_{k}= & {} \sum _{m=1}^{N_{k}} \left( {{\begin{array}{ll} N_{k}\\ m\\ \end{array}}}\right) \frac{1}{N_{k}}=\frac{1}{N_{k}}\sum _{m=1}^{N_{k}}\left( {{\begin{array}{ll} N_{k}\\ m\\ \end{array}}}\right) =\frac{2^{N_{k}}-1}{N_{k}} \end{aligned}$$

We have made use of the known identity \(\sum _{m=1}^{N_{k}} \left( {{\begin{array}{ll} N_{k}\\ m\\ \end{array}}}\right) =2^{N_{k}}-1\).
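The identity can be confirmed by brute-force enumeration of subsets. The sketch below is our own illustration (not from the paper): it checks that the left-hand side of Eq. (11), normalised by the total worth, equals \((2^{N_{k}}-1)/N_{k}\) for arbitrary positive worths.

```python
from itertools import combinations

def c_k_bruteforce(phis):
    """Sum, over all non-empty subsets S, of the mean worth of S,
    normalised by the total worth sum_x phi(x).
    By the derivation above, this should equal (2^N - 1) / N."""
    total = 0.0
    for m in range(1, len(phis) + 1):
        for subset in combinations(phis, m):
            total += sum(subset) / m
    return total / sum(phis)

# Arbitrary positive worths; the identity holds regardless of the values.
phis = [0.3, 1.7, 2.0, 0.5]
n = len(phis)
assert abs(c_k_bruteforce(phis) - (2**n - 1) / n) < 1e-9
```

Note that \(C_{k}\) depends only on \(N_{k}\), not on the worths themselves, which is what makes the arithmetic-mean aggregation cheap to normalise.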

1.2 Computing \(M_{k}(x)\)

We now calculate the constant \(M_{k}(x)\) in Eq. (17), which is reproduced here for clarity:

$$\begin{aligned} \sum _{S \in 2^{R_{k}}}\max _{x\in S}\phi (x)=\sum _{x\in R_{k}}M_{k}(x)\phi (x) \end{aligned}$$
(27)

First, we arrange the objects in decreasing order of worth \(\phi (x)\). For notational convenience, we assume that this order is \(1,2,3,\ldots ,N_{k}\). The largest object appears in the subset consisting of only itself and in \(2^{N_{k}-1}-1\) other subsets. Thus \(M_{k}(1)=2^{N_{k}-1}\), since for every subset to which the largest object belongs, the maximum aggregation is the worth of that object, by definition. Now remove the largest object and consider the second largest one. By the same argument, \(M_{k}(2)=2^{N_{k}-2}\). Continuing this line of reasoning, we end up with \(M_{k}(n)=2^{N_{k}-n}\).
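As a sanity check, the weights \(M_{k}(n)\) can be recovered by brute force: enumerate all non-empty subsets, credit the object attaining the maximum in each, and compare against \(2^{N_{k}-n}\). This is our own illustrative sketch, assuming distinct worths so the maximiser is unique.

```python
from itertools import combinations

def max_aggregation_weights(phis):
    """Brute-force recovery of M_k(x): for every non-empty subset,
    credit the object attaining the maximum worth, so that
    sum_S max_{x in S} phi(x) = sum_x weights[x] * phi(x)."""
    n = len(phis)
    weights = [0] * n
    for m in range(1, n + 1):
        for subset in combinations(range(n), m):
            # Distinct worths assumed, so the maximiser is unique.
            winner = max(subset, key=lambda i: phis[i])
            weights[winner] += 1
    return weights

# Objects sorted in decreasing order of worth, as in the text:
# the n-th largest (1-based) should get weight 2^(N - n).
phis = [4.0, 3.0, 2.0, 1.0]
assert max_aggregation_weights(phis) == [8, 4, 2, 1]
```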

1.3 Pairwise losses

Let \(f(x_{i},w)\) be the scoring function parameterised by w that takes the input vector \(x_{i}\) and outputs a real value indicating the relevancy of the object i. Let \(\delta _{ij}(w)=f(x_{i},w)-f(x_{j},w)\). Pairwise models are quite similar in their general setting. The only difference is the specific loss function:

$$\begin{aligned} \ell (x_{i}\succ x_{j};w)={\left\{ \begin{array}{ll} \log (1+\exp \left\{ -\delta _{ij}(w)\right\} ) &{} \hbox { (RankNet) }\\ \max \left\{ 0,1-\delta _{ij}(w)\right\} &{} \hbox { (Ranking SVM)}\\ (1-\delta _{ij}(w))^{2} &{} \hbox { (Rank Regress)}\\ \exp \left\{ -\delta _{ij}(w)\right\} &{} \hbox { (Rank Boost)} \end{array}\right. } \end{aligned}$$

However, these losses behave quite differently from each other. For RankNet and Rank Boost, minimising the loss widens the margin between the scores for \(x_{i}\) and \(x_{j}\) as much as possible; the difference is that RankNet is less sensitive to noise due to the log scale. Ranking SVM, by contrast, aims only to achieve a margin of 1, while Rank Regress pulls the margin towards exactly 1, penalising margins both smaller and larger than 1.
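For concreteness, the four losses can be written as a single function of the score difference. This is our own sketch; the string names are ours, not standard identifiers.

```python
import math

def pairwise_loss(delta, kind):
    """The four pairwise losses, as functions of the score difference
    delta_ij = f(x_i, w) - f(x_j, w) for a preferred pair x_i > x_j."""
    if kind == "ranknet":        # logistic loss
        return math.log1p(math.exp(-delta))
    if kind == "ranking_svm":    # hinge loss with target margin 1
        return max(0.0, 1.0 - delta)
    if kind == "rank_regress":   # squared deviation from margin 1
        return (1.0 - delta) ** 2
    if kind == "rank_boost":     # exponential loss
        return math.exp(-delta)
    raise ValueError(kind)

# At a large correct margin, the hinge is flat at zero while the squared
# loss grows again, illustrating the different behaviours described above.
assert pairwise_loss(3.0, "ranking_svm") == 0.0
assert pairwise_loss(3.0, "rank_regress") == 4.0
```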

At first sight, the cost of gradient evaluation for pairwise losses would be \(\mathcal {O}(0.5N(N-1)F)\), where F is the number of parameters. However, we can achieve \(\max \{\mathcal {O}(0.5N(N-1)),\mathcal {O}(NF)\}\) as follows. The overall loss for a particular query is

$$\begin{aligned} \mathfrak {L}= & {} \sum _{i,j|x_{i}\succ x_{j}}\ell (x_{i}\succ x_{j};w) \end{aligned}$$

Taking derivative with respect to w yields

$$\begin{aligned} \frac{\partial \mathfrak {L}}{\partial w}= & {} \sum _{i,j|x_{i}\succ x_{j}} \frac{\partial \ell (x_{i} \succ x_{j};w)}{\partial \delta _{ij}}\left( \frac{\partial f_{i}}{\partial w}-\frac{\partial f_{j}}{\partial w}\right) \\= & {} \sum _{i}\frac{\partial f_{i}}{\partial w}\sum _{j|x_{i}\succ x_{j}}\frac{\partial \ell (x_{i}\succ x_{j};w)}{\partial \delta _{ij}}-\sum _{j}\frac{\partial f_{j}}{\partial w}\sum _{i|x_{i}\succ x_{j}}\frac{\partial \ell (x_{i}\succ x_{j};w)}{\partial \delta _{ij}} \end{aligned}$$

since \(\partial \delta _{ij}/\partial w=\partial f_{i}/\partial w-\partial f_{j}/\partial w\).

As \(\left\{ \frac{\partial \ell (x_{i}\succ x_{j};w)}{\partial \delta _{ij}}\right\} _{i,j|x_{i}\succ x_{j}}\) can be computed in \(\mathcal {O}(0.5N(N-1))\) time, and \(\left\{ \frac{\partial f_{i}}{\partial w}\right\} _{i}\) in \(\mathcal {O}(NF)\) time, the overall cost would be \(\max \{\mathcal {O}(0.5N(N-1)),\mathcal {O}(NF)\}\).
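A minimal sketch of this two-pass computation, assuming for illustration a linear scorer \(f(x,w)=w\cdot x\) and the RankNet loss (the function name and data layout are ours; the paper allows any differentiable scorer):

```python
import math

def ranknet_grad(X, ranks, w):
    """Gradient of the summed RankNet loss for one query, via the
    two-pass factorisation above: the pair terms cost O(N^2) scalar
    work, the feature terms O(NF)."""
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in X]
    n, F = len(X), len(w)
    coef = [0.0] * n                 # per-object gradient coefficients
    for i in range(n):
        for j in range(n):
            if ranks[i] > ranks[j]:  # x_i is preferred over x_j
                # d ell / d delta_ij for the logistic (RankNet) loss
                dl = -1.0 / (1.0 + math.exp(scores[i] - scores[j]))
                coef[i] += dl        # goes with +df_i/dw
                coef[j] -= dl        # goes with -df_j/dw
    # df_i/dw = x_i for the linear scorer: one final O(NF) pass.
    return [sum(coef[i] * X[i][k] for i in range(n)) for k in range(F)]
```

For two objects with equal scores where the first is preferred, `ranknet_grad([[1.0, 0.0], [0.0, 1.0]], [1, 0], [0.0, 0.0])` returns `[-0.5, 0.5]`: a descent step raises the preferred object's score, as expected.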

1.4 Learning the pairwise ties models

This subsection describes the details of learning the pairwise ties models discussed in Sect. 6.

1.4.1 Davidson method

Recall from Sect. 2 that in the Davidson method, the probability masses are defined as

$$\begin{aligned} P(x_{i}\succ x_{j};w)= & {} \frac{1}{Z_{ij}}\phi (x_{i});\quad P(x_{i}\sim x_{j};w)=\frac{1}{Z_{ij}}\nu \sqrt{\phi (x_{i})\phi (x_{j})} \end{aligned}$$

where \(Z_{ij}=\phi (x_{i})+\phi (x_{j})+\nu \sqrt{\phi (x_{i})\phi (x_{j})}\) and \(\nu \ge 0\). For simplicity of unconstrained optimisation, let \(\nu =e^{\beta }\) for \(\beta \in \mathbb {R}\). Let \(P_{i}=P(x_{i}\succ x_{j};w)\), \(P_{j}=P(x_{j}\succ x_{i};w)\) and \(P_{ij}=P(x_{i}\sim x_{j};w)\).

Taking derivatives of the log-likelihood gives

$$\begin{aligned} \frac{\partial \log P(x_{i}\succ x_{j};w)}{\partial w}= & {} (1-P_{i}-0.5P_{ij})\frac{\partial \log \phi (x_{i},w)}{\partial w}-(P_{j}+0.5P_{ij})\frac{\partial \log \phi (x_{j},w)}{\partial w}\\ \frac{\partial \log P(x_{i}\succ x_{j};w)}{\partial \beta }= & {} -P_{ij}\\ \frac{\partial \log P(x_{i}\sim x_{j};w)}{\partial w}= & {} (0.5-P_{i}-0.5P_{ij})\frac{\partial \log \phi (x_{i},w)}{\partial w}\\&+(0.5-P_{j}-0.5P_{ij})\frac{\partial \log \phi (x_{j},w)}{\partial w}\\ \frac{\partial \log P(x_{i}\sim x_{j};w)}{\partial \beta }= & {} 1-P_{ij}. \end{aligned}$$
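The \(\beta \)-derivative \(\partial \log P(x_{i}\succ x_{j})/\partial \beta =-P_{ij}\) can be verified by finite differences. The following is our own sketch with arbitrary fixed worths (the \(\phi \) values are illustrative).

```python
import math

def davidson_probs(phi_i, phi_j, nu):
    """Davidson model: (win, lose, tie) probabilities for one pair."""
    tie_mass = nu * math.sqrt(phi_i * phi_j)
    z = phi_i + phi_j + tie_mass
    return phi_i / z, phi_j / z, tie_mass / z

# Finite-difference check of d log P(x_i > x_j) / d beta = -P_ij,
# holding the worths fixed and setting nu = exp(beta).
phi_i, phi_j, beta, eps = 2.0, 1.5, 0.3, 1e-6

def logp_win(b):
    return math.log(davidson_probs(phi_i, phi_j, math.exp(b))[0])

numeric = (logp_win(beta + eps) - logp_win(beta - eps)) / (2 * eps)
p_tie = davidson_probs(phi_i, phi_j, math.exp(beta))[2]
assert abs(numeric + p_tie) < 1e-6
```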

1.4.2 Rao-Kupper method

Recall from Sect. 2 that the Rao-Kupper model defines the following probability masses

$$\begin{aligned} P(x_{i}\succ x_{j};w)= & {} \frac{\phi (x_{i})}{\phi (x_{i})+\theta \phi (x_{j})}\\ P(x_{i}\sim x_{j};w)= & {} \frac{(\theta ^{2}-1)\phi (x_{i})\phi (x_{j})}{\left[ \phi (x_{i})+\theta \phi (x_{j})\right] \left[ \theta \phi (x_{i})+\phi (x_{j})\right] } \end{aligned}$$

where \(\theta \ge 1\) is the ties factor and w is the model parameter. Note that \(\phi (.)\) is also a function of w, which we omit here for clarity. For ease of unconstrained optimisation, let \(\theta =1+e^{\alpha }\) for \(\alpha \in \mathbb {R}\). In learning, we want to estimate both \(\alpha \) and w. Let

$$\begin{aligned} P_{i}= & {} \frac{\phi (x_{i})}{\phi (x_{i})+(1+e^{\alpha })\phi (x_{j})};\quad P_{j}^{*}=\frac{\phi (x_{j})}{\phi (x_{i})+(1+e^{\alpha })\phi (x_{j})};\\ P_{i}^{*}= & {} \frac{\phi (x_{i})}{(1+e^{\alpha })\phi (x_{i})+\phi (x_{j})};\quad P_{j}=\frac{\phi (x_{j})}{(1+e^{\alpha })\phi (x_{i})+\phi (x_{j})}. \end{aligned}$$

Taking partial derivatives of the log-likelihood gives

$$\begin{aligned} \frac{\partial \log P(x_{i}\succ x_{j};w)}{\partial w}= & {} (1-P_{i})\frac{\partial \log \phi (x_{i},w)}{\partial w}-(1+e^{\alpha })P_{j}^{*}\frac{\partial \log \phi (x_{j},w)}{\partial w} \\ \frac{\partial \log P(x_{i}\succ x_{j};w)}{\partial \alpha }= & {} -P_{j}^{*}e^{\alpha } \\ \frac{\partial \log P(x_{i}\sim x_{j};w)}{\partial w}= & {} (1-P_{i}-(1+e^{\alpha })P_{i}^{*})\frac{\partial \log \phi (x_{i},w)}{\partial w} \\&+\, (1-P_{j}-(1+e^{\alpha })P_{j}^{*})\frac{\partial \log \phi (x_{j},w)}{\partial w} \\ \frac{\partial \log P(x_{i}\sim x_{j};w)}{\partial \alpha }= & {} \left( \frac{2(1+e^{\alpha })}{(1+e^{\alpha })^{2}-1}-P_{i}^{*}-P_{j}^{*}\right) e^{\alpha }. \end{aligned}$$
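The \(\alpha \)-derivative of the winning log-probability can likewise be checked by finite differences against its analytic value \(-e^{\alpha }\phi (x_{j})/\left[ \phi (x_{i})+\theta \phi (x_{j})\right] \); our own sketch with arbitrary worths:

```python
import math

def rk_win_prob(phi_i, phi_j, alpha):
    """Rao-Kupper P(x_i > x_j) with ties factor theta = 1 + exp(alpha)."""
    theta = 1.0 + math.exp(alpha)
    return phi_i / (phi_i + theta * phi_j)

phi_i, phi_j, alpha, eps = 2.0, 1.5, 0.2, 1e-6
theta = 1.0 + math.exp(alpha)
# Coefficient phi_j / (phi_i + theta * phi_j) from the definitions above.
p_j_star = phi_j / (phi_i + theta * phi_j)

numeric = (math.log(rk_win_prob(phi_i, phi_j, alpha + eps))
           - math.log(rk_win_prob(phi_i, phi_j, alpha - eps))) / (2 * eps)
assert abs(numeric + p_j_star * math.exp(alpha)) < 1e-6
```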


Cite this article

Tran, T., Phung, D. & Venkatesh, S. Modelling human preferences for ranking and collaborative filtering: a probabilistic ordered partition approach. Knowl Inf Syst 47, 157–188 (2016). https://doi.org/10.1007/s10115-015-0840-9
