Fixing Arc Rating Calculations


Arcanaville

 

Posted

This is a proposal for an alternate means of calculating and displaying the overall rating for an arc in the mission browser. It is derived from a formula used in QA to prevent exceptional conditions from being reported when the sample size is too small to be statistically valid.

[color=orange]Warning:[/color] Moderate code/math geek levels! Run!<<tweet!>>

The current method of calculating the rating is to take the mean average: sum up all the stars given to the arc and divide by the number of times the arc has been rated. Classic stuff; it can be coded as follows:

<font class="small">Code:[/color]<hr /><pre>
function mean(float sum, float n)
    return (sum / n)
end function
</pre><hr />

Of course, this has the problem that the first rating given to the arc has a completely disproportionate effect until other ratings follow. An arc with a single 5-star rating will come up before an arc with nineteen 5-star ratings and one 4-star rating. But griefers and cartels aside, the second arc can more reliably be considered high quality.

In other words, for small sample sizes the mean is too volatile, or has too much "beta": it jumps around too much to get a good idea of what the real value is. One way to counter that is to assume the arc is an average (3-star) arc at the beginning, with a history of, say, 30 ratings already behind it, and let the actual ratings move it away from that value. Here's a first pass at that idea:

<font class="small">Code:[/color]<hr /><pre>
final_rating = beta_reduced_mean_naive(total_stars, rated_count, 30.0, 3.0)
output(final_rating)

function beta_reduced_mean_naive(float sum, float n, float min_sample, float default)
    return (sum + (default * min_sample)) / (n + min_sample)
end function
</pre><hr />

It's not bad, and as the number of real ratings grows, the seeded rating becomes less and less important. However, we can do better. With this version we can't actually reach 5 stars (or 0 stars) no matter how good or bad the arc actually is. And if we have 60 real ratings, then we have about 90 total real and seed ratings, with a third of them not reflecting reality. Furthermore, the impact of those 30 seed ratings never completely goes away. So, what we do is change the code so that as we gain real data, the seed data diminishes in absolute as well as relative value, until it goes away completely. Here's our improved function:

<font class="small">Code:[/color]<hr /><pre>
function beta_reduced_mean(float sum, float n, float min_sample, float default)
    float n2 = max(0.0, (min_sample - n))
    return (sum + (default * n2)) / (n + n2)
end function
</pre><hr />

If this is run with the same parameters as above, the first rating gives the same value as before, but once there are 30 or more real ratings, n2 is zero and the seed data is completely ignored.
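To make that concrete, here's the same function translated to Ruby (matching the test code later in the thread), with a couple of spot checks:

```ruby
# Ruby version of beta_reduced_mean, to show the seed fading out.
def beta_reduced_mean(sum, n, min_sample, default)
  n2 = [0.0, min_sample - n].max
  (sum + default * n2) / (n + n2)
end

# One real 5-star against 29 remaining 3-star seeds barely moves the needle:
beta_reduced_mean(5.0, 1.0, 30.0, 3.0)    # => 92/30, about 3.07
# At 30 real ratings the seed term is gone and we get the plain mean back:
beta_reduced_mean(120.0, 30.0, 30.0, 3.0) # => 4.0
```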

Here's my pseudocode for a test driver for the function. You can adapt this to any language you want, and try playing with different values for the number of trials and the default.

<font class="small">Code:[/color]<hr /><pre>
function run(float trials, float default)
    float ratings = 0.0
    float total_stars = 0.0

    float trial
    for trial = 1.0 to trials step 1.0
        ratings += 1.0
        float stars = get_player_rating()
        total_stars = total_stars + stars
        float mean_rating = mean(total_stars, ratings)
        float final_rating = beta_reduced_mean(total_stars, ratings, trials, default)

        output(trial, stars, mean_rating, final_rating)
    next trial
end function
</pre><hr />

Here are some results, assuming a seed history of 15 3-star ratings. The columns are: run #, rating given, stars with the current mean method (exact value to two decimal places in parentheses), and stars with the beta-reduced mean method (likewise). For the ratings, I assume a common, aggravating scenario: you have a good arc that generally gets 4- and 5-star ratings, but the first rater is a griefer or has some mission fetish ("I always 0-star any mission where I have to click a glowie!") and zero-stars you. See how the last two columns begin far apart, but converge to the same value.

<font class="small">Code:[/color]<hr /><pre>
01) | 0 | 0(0.00) | 3(2.80)
02) | 4 | 2(2.00) | 3(2.87)
03) | 4 | 3(2.67) | 3(2.93)
04) | 5 | 3(3.25) | 3(3.07)
05) | 4 | 3(3.40) | 3(3.13)
06) | 4 | 4(3.50) | 3(3.20)
07) | 4 | 4(3.57) | 3(3.27)
08) | 5 | 4(3.75) | 3(3.40)
09) | 5 | 4(3.89) | 4(3.53)
10) | 5 | 4(4.00) | 4(3.67)
11) | 5 | 4(4.09) | 4(3.80)
12) | 4 | 4(4.08) | 4(3.87)
13) | 5 | 4(4.15) | 4(4.00)
14) | 4 | 4(4.14) | 4(4.07)
15) | 4 | 4(4.13) | 4(4.13)
</pre><hr />

If this method of calculating ratings is adopted (I can dream!), I would strongly encourage the concurrent adoption of Arcanaville's suggestion that no ratings at all be shown until 3 or so ratings are in the can. Otherwise, you get a strange situation where the first rating is given, and both the giver and the author are left wondering how you get a "3" out of a "5".

Note that this does NOT get rid of the issue of many griefers and/or cartels forcing high or low ratings on an arc. That's a completely different issue. This is simply designed to reduce the wild rating swings while an arc is enduring its initial ratings.

A comparative rating system (where the player is simply asked to decide which of two arcs is better) is probably the best solution to all of these issues, but it is awkward to implement in a way that works well for the user.
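For what it's worth, the comparative idea has a standard scoring mechanism behind it: an Elo-style update, sketched below in Ruby. This is purely my illustration (the 1500 starting score, the K-factor, and the function name are all assumptions), not something specified anywhere in this thread.

```ruby
# Elo-style scoring for "which of these two arcs is better?" answers.
# K controls how fast scores move after each head-to-head comparison.
def elo_update(winner, loser, k = 16.0)
  # Expected chance the winner would win, given the current scores.
  expected = 1.0 / (1.0 + 10.0 ** ((loser - winner) / 400.0))
  [winner + k * (1.0 - expected), loser - k * (1.0 - expected)]
end

# Two arcs start even; one head-to-head win moves 8 points across.
a, b = elo_update(1500.0, 1500.0)  # => [1508.0, 1492.0]
```

Arcs would then be ranked by score, which sidesteps star inflation entirely; the remaining (hard) part is the UI for collecting the comparisons.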


(84554) A Mid-Winter's Night Dream
(148487) Punk 'n' Pie

 

Posted

For those of you with insanely high geek levels, here's the test code I used in Ruby to create the above output and to play with different parameters.

Most of you should probably just pretend I never posted this.

<font class="small">Code:[/color]<hr /><pre>
class Critic
  def initialize
    @count = 0
  end

  def rate
    @count += 1
    if @count == 1
      0.0
    elsif rand(3) == 0
      4.0
    else
      5.0
    end
  end # def rate
end # class Critic

def mean(sum, n)
  sum.to_f / n.to_f
end

def beta_reduced_mean(sum, n, min_sample, default)
  n2 = [0.0, (min_sample - n)].max.to_f
  (sum.to_f + (default * n2)) / (n.to_f + n2)
end

def run(trials, default, critic)
  ratings = total_stars = 0.0
  1.upto(trials) do |trial|
    ratings += 1.0
    total_stars += stars = critic.rate
    mean_rating = mean(total_stars, ratings)
    final_rating = beta_reduced_mean(total_stars, ratings, trials, default)
    yield trial, stars, mean_rating, final_rating
  end
end

puts '-' * 55
c = Critic.new
run(15, 3.0, c) do |t, s, m, r|
  print "#{'%02d' % t})",
        " | #{'%0.0f' % s}",
        " | #{'%0.0f' % m}(#{'%0.2f' % m})",
        " | #{'%0.0f' % r}(#{'%0.2f' % r})",
        "\n"
end

puts '', '', ''
</pre><hr />



 

Posted

... interesting.

It WOULD be good to adopt that model. At the very least it'd give the arc a little bit of inertia until there was a solid base of ratings behind it.

I of course would NOT publicize what the 'seed' value is. Otherwise the 5-star and 0-star cartels would know what to shoot for.



"City of Heroes. April 27, 2004 - August 31, 2012. Obliterated not with a weapon of mass destruction, not by an all-powerful supervillain... but by a cold-hearted and cowardly corporate suck-up."

 

Posted

[ QUOTE ]

I of course would NOT publicize what the 'seed' value is. Otherwise the 5-star and 0-star cartels would know what to shoot for.

[/ QUOTE ]

Well, that's more a question of removing outliers. One way of doing that once you have about 20 ratings or so would be to throw out the top and bottom 5% of the ratings, and then calculate the mean based on the middle 90%, which we hope is more representative of actual player reaction.
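That throw-out-the-extremes approach (a trimmed mean) can be sketched in Ruby. The 5% figure comes from the paragraph above; the function name is mine, and it assumes enough ratings that dropping one from each end still leaves data:

```ruby
# Trimmed mean: sort the ratings, drop the top and bottom 5% (at least one
# value from each end), and average what's left. Assumes 3+ ratings.
def trimmed_mean(ratings, trim_fraction = 0.05)
  cut = [(ratings.size * trim_fraction).floor, 1].max
  kept = ratings.sort[cut...-cut]
  kept.sum / kept.size
end

# Twenty ratings: eighteen 4-stars plus a 0-star griefer and a 5-star shill.
trimmed_mean([0.0] + [4.0] * 18 + [5.0])  # => 4.0 (both extremes dropped)
```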

Interesting note: I saw in an Arcanaville post that the actual mean rating for arcs out in the wild is 3.6.



 

Posted

What about using the median instead?


 

Posted

I think this sounds like a fairly decent way to deal with the issue.

Unless of course they ALSO allow us to tag aspects for ratings, like story, gameplay, and refinement.


 

Posted

Arcanaville and I actually already discussed this in a different thread, although she never answered to my satisfaction why she didn't think it was a good idea.

The context for it was the formula used by IMDB for calculating rankings, which is essentially the same one you give above with different constants.
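For reference, the weighted-rating formula IMDB published for its Top 250 can be written out in Ruby. The constants below (m = 30, c = 3.0) are my picks to match this thread, not IMDB's values:

```ruby
# IMDB's published weighted rating: WR = (v/(v+m))*R + (m/(v+m))*C, where
# R is this arc's mean, v its vote count, m a minimum-votes threshold, and
# c the mean rating across all arcs. Algebraically this simplifies to
# (v*R + m*c) / (v + m) -- exactly the "naive" seeded mean from the OP.
def weighted_rating(r, v, m, c)
  (v / (v + m)) * r + (m / (v + m)) * c
end

weighted_rating(5.0, 1.0, 30.0, 3.0)  # => 95/31, about 3.06
```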

As an aside, Arcanaville did come down firmly on the side of not throwing out the outliers. The argument for doing so is that the outliers are likely to be statistically invalid, but that's not a claim we can make in this case; there may be a real minority population for whom those outlier votes are representative, and we shouldn't deprive them of their "voice" in the ranking. Fortunately, dealing with outliers is a separate issue from the ranking algorithm.


And for a while things were cold,
They were scared down in their holes
The forest that once was green
Was colored black by those killing machines

 

Posted

[ QUOTE ]
Arcanaville and I actually already discussed this in a different thread, although she never answered to my satisfaction why she didn't think it was a good idea.

[/ QUOTE ]

What I said was that while it was a reasonable attempt to rank arcs with substantial ratings associated with them, it doesn't do well with arcs that have only a few ratings. While the whole *point* of such a system is to dampen the effects of a few ratings, it also makes it (as an inescapable side effect) impossible for ratings to serve one of their intended purposes, which is to distinguish better arcs and draw attention to them. By definition, an arc won't be designated a "good" arc until it's gotten so many ratings that the whole purpose of drawing attention to it is rendered moot.

Thus, it can help rank the most frequently rated arcs, but it can't be used to attract attention to new or underplayed arcs. It might make a good HoF tool, but it's not a good "diamond in the rough" tool.


[Guide to Defense] [Scrapper Secondaries Comparison] [Archetype Popularity Analysis]

In one little corner of the universe, there's nothing more irritating than a misfile...
(Please support the best webcomic about a cosmic universal realignment by impaired angelic interference resulting in identity crisis angst. Or I release the pigmy water thieves.)