perform m pairwise comparisons or 2m thumb evaluations, where m is varied from 1 to kn.

Thumb-based Approach: In each iteration of the thumbs algorithm, we pick a random item, biased towards those with higher score estimates. Specifically, we rank all items by their score estimates and pick the rth-ranked item with probability proportional to 1/r^γ. In each thumb review, the item with score x wins (receives a thumbs-up) with probability f(x) = erf(x/√2).
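A minimal sketch of one thumbs iteration, in Python; the 1/r^γ bias and the erf win probability follow the text, while clamping erf to [0, 1] (it is negative for negative scores) is our assumption:

```python
import math
import random

def pick_biased(num_items, gamma):
    """Pick the r-th ranked item (1-based rank r) with probability ~ 1/r^gamma."""
    weights = [1.0 / r ** gamma for r in range(1, num_items + 1)]
    return random.choices(range(num_items), weights=weights, k=1)[0]

def thumb_review(score):
    """Thumbs-up with probability f(x) = erf(x / sqrt(2)).
    Clamping to [0, 1] is our assumption: erf is negative for x < 0."""
    p = max(0.0, min(1.0, math.erf(score / math.sqrt(2))))
    return random.random() < p
```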

Comparison-based Approach: In each iteration, we first pick one item based on the rank estimates above: the rth-ranked item is picked with probability proportional to 1/r^γ. If the rth item is picked, the second item is the (r+1)th-ranked one. Let the scores of these items be x and y respectively. In each comparison, the item with score x wins with probability f(x − y).
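One comparison iteration can be sketched similarly; picking rank r with probability proportional to 1/r^γ and pairing it with rank r + 1 follows the text, while reusing the clamped erf from the thumbs case as f is our assumption:

```python
import math
import random

def f(diff):
    """Win probability for score difference diff, read as erf(diff / sqrt(2));
    the clamp to [0, 1] is our assumption."""
    return max(0.0, min(1.0, math.erf(diff / math.sqrt(2))))

def pick_pair(ranked_scores, gamma):
    """Pick rank r (1-based) with probability ~ 1/r^gamma; pair with rank r + 1."""
    n = len(ranked_scores)
    weights = [1.0 / r ** gamma for r in range(1, n)]  # r = 1 .. n-1 so r+1 exists
    i = random.choices(range(n - 1), weights=weights, k=1)[0]
    return i, i + 1

def compare(x, y):
    """The item with score x beats the item with score y with probability f(x - y)."""
    return random.random() < f(x - y)
```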

The algorithm for updating scores also keeps track of the number of times every item has been selected for comparison. This serves as a discounting factor, capturing the intuition that after many comparisons the scores start to stabilize, so one comparison can alter a score only marginally. Suppose the items i_1 and i_2 have been evaluated k_1 and k_2 times respectively, and let their current score estimates be s_1 and s_2. The discounting is done based on c_i = s / 2^{0.5(1 + k_i)}, where i = 1, 2. If i_1 gets voted up in the comparison, then s_1 is incremented by c_1 · (1 − 1/(1 + e^{s_2 − s_1})) and s_2 is decremented by c_2 · 1/(1 + e^{s_1 − s_2}). The update when i_2 wins is symmetric.
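The discounted update can be sketched as follows. We read the discount as c_i = s / 2^{0.5(1 + k_i)}; treating s as a tunable step-size constant is our interpretation of the garbled formula, and both update terms are computed from the pre-update scores:

```python
import math

def update_scores(s1, s2, k1, k2, winner, s=1.0):
    """Discounted Elo-style update after one comparison.
    c_i = s / 2**(0.5 * (1 + k_i)); interpreting s as a step-size constant
    is our assumption."""
    c1 = s / 2 ** (0.5 * (1 + k1))
    c2 = s / 2 ** (0.5 * (1 + k2))
    if winner == 1:
        inc = c1 * (1 - 1 / (1 + math.exp(s2 - s1)))
        dec = c2 / (1 + math.exp(s1 - s2))
        return s1 + inc, s2 - dec
    # symmetric update when item 2 wins
    inc = c2 * (1 - 1 / (1 + math.exp(s1 - s2)))
    dec = c1 / (1 + math.exp(s2 - s1))
    return s1 - dec, s2 + inc
```

Note that the winner's increment shrinks as 2^{-0.5 k}, so heavily evaluated items move less per comparison, which is exactly the stabilization intuition above.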

Evaluation Metric: We compute the MRR (mean reciprocal rank) for the item with the highest score in the ranking produced by each of the two approaches above. The MRR is the mean of the quantity 1/R, where R is the rank of the actual top item in the ranking produced by the approach. MRR is at most 1, and an MRR of 1 means the top item was always correctly identified as the best item.
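Concretely, the metric can be computed as below, where each entry of ranks is the (1-based) rank R of the true top item in one run:

```python
def mean_reciprocal_rank(ranks):
    """MRR: the mean of 1/R over runs, where R is the rank of the true top item."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For example, ranks of 1, 2, and 4 over three runs give an MRR of (1 + 0.5 + 0.25) / 3 ≈ 0.583.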

Our experiments are performed over several runs, varying m, γ, and the number of players.

The comparison-based ranking algorithm consistently outperforms the thumbs-based algorithm for the parameter settings 1.0 and 1.5. Further, the rate at which the comparison-based algorithm improves as the number of evaluations increases is also higher. A similar behavior is observed with an increasing number of players; the comparison-based algorithm seems more resistant to a large number of players, even when the evaluation is done on just the top few items in the produced rankings.

See Figures 7, 8, and 9 for the results for the settings 1.5, 1.0, and 0.5 respectively.

# 8.2 Shoutvelocity system

Our system has the history of all 4853 pairwise comparisons performed by users over a set of 1245 items. We study the convergence of MRR as the number of reviews increases. Since we do not have the "true" scores of each item, we take many random orderings of the collection of 4853 reviews and replay each ordering through the ELO rating system. The score for each item, averaged over these random orderings, is taken to be the true score of the item.
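The replay-and-average procedure can be sketched as follows; the standard chess-Elo update (base-10 logistic, scale 400, K-factor 32) stands in for the system's actual update rule, which is an assumption made here for illustration:

```python
import random

def replay_elo(reviews, num_items, k_factor=32.0, num_orderings=100, seed=0):
    """Estimate "true" scores by replaying the (winner, loser) review log in
    many random orders through an Elo update and averaging the final scores."""
    rng = random.Random(seed)
    totals = [0.0] * num_items
    for _ in range(num_orderings):
        scores = [0.0] * num_items
        order = list(reviews)
        rng.shuffle(order)
        for winner, loser in order:
            # expected score of the winner under the Elo logistic model
            expected = 1.0 / (1.0 + 10 ** ((scores[loser] - scores[winner]) / 400.0))
            scores[winner] += k_factor * (1.0 - expected)
            scores[loser] -= k_factor * (1.0 - expected)
        for i, score in enumerate(scores):
            totals[i] += score
    return [t / num_orderings for t in totals]
```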

As in the simulations above, we measure how MRR increases with the number of reviews; Figure 10 shows this growth as a function of the number of comparisons.

Next, we report the latency of reviewing in the Shoutvelocity system. Figure 11 plots, for each latency x, what fraction

Figure 7: MRRs for thumb-based and comparison-based ranking for the setting 1.5. At m = 1000, for the best γ, MRR is 0.244 for the comparison-based approach and 0.019 for thumbs.

Figure 8: MRRs for thumb-based and comparison-based ranking for the setting 1.0. At m = 1000, for the best γ, MRR is 0.149 for the comparison-based approach and 0.0186 for thumbs.

of items waited for x reviews to get their first feedback. We can see that all items got their first feedback within 20 reviews, and most waited fewer than 5.

Finally, in Figure 12 we show the cumulative distribution of scores for shouts from the Shoutvelocity system. (In our system, for displaying scores, we add 5 to every shout's score so that all scores are nonnegative.) The scores roughly follow a normal distribution, consistent with Elo's assumption that performance is normally distributed.

# 9. RELATED WORK

Ranking plays two central roles in today's social networks: (a) surfacing the most relevant result for the user's query, and (b) aligning incentives with the goal of eliciting relevant user-generated content. There has been a considerable amount of work analyzing simple voting schemes like thumbs-up/down versus comparison-based voting schemes [13]. In the former, the user sees one result and votes it up or down to indicate her preference, while in the latter, the user compares more than one result (usually two) and ranks them according to her preferences. The Bradley-Terry-Luce (BTL) model (Bradley & Terry [10] and Luce [15]) is often applied to pairwise comparison data to