With star ratings, I think an important point that often gets ignored is: different people use stars in different ways. One user might 5-star most things, but give the occasional 4- or 3-star review if they have a problem. But another user might 3-star by default, and save their 4- and 5-star reviews for exceptionally good cases.
I wonder if a simple way to fix that might be to reinterpret everyone's star ratings as percentiles, based on the overall distribution of stars in their reviews. "This user gives 5 stars 10% of the time, so we'll interpret a 5-star review from them as anything in the range 90-100 -- assume 95%."
You would probably also want to reinterpret the results for each user. "This item's review scores average out to 84%. For user A, that's 4.5 stars, but for user B, it's only 3.5 stars."
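A rough sketch of that per-user percentile remapping (function names and the sample rating lists are made up for illustration): a star rating maps to the midpoint of its percentile band within that one user's history, and a percentile maps back to stars via the user's empirical quantiles.

```python
def star_to_percentile(user_ratings, stars):
    """Map a star rating to the midpoint of its percentile band
    within this user's own rating distribution."""
    n = len(user_ratings)
    below = sum(1 for r in user_ratings if r < stars) / n        # strictly lower
    at_or_below = sum(1 for r in user_ratings if r <= stars) / n
    return 100 * (below + at_or_below) / 2                       # band midpoint

def percentile_to_stars(user_ratings, pct):
    """Inverse mapping: the star level sitting at the given
    percentile of this user's distribution."""
    ranked = sorted(user_ratings)
    idx = min(int(pct / 100 * len(ranked)), len(ranked) - 1)
    return ranked[idx]

generous = [5] * 9 + [4]             # 5-stars most things
tough = [3] * 7 + [4] * 2 + [5]      # 3-stars by default

star_to_percentile(generous, 5)      # 55.0 -- a 5 from them is unremarkable
star_to_percentile(tough, 5)         # 95.0 -- a 5 from them is exceptional
percentile_to_stars(generous, 84)    # 5 stars
percentile_to_stars(tough, 84)       # 4 stars
```

The tough rater matches the example above: they give 5 stars 10% of the time, so their 5-star reviews land in the 90–100 band and read as 95.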
The big downside is that star ratings become subjective. But they're already subjective, and ignoring that problem doesn't make the results any better. Average star ratings on all the big websites and app stores right now are garbage -- they'll usually warn you if some Amazon product is terrible, but that's about all.
If you crunch all the review data and figure out the best possible recommendations, you end up with collaborative filtering and the Netflix Prize. It's a shame that so much great work was done for that competition, but nobody seems to be using it now. Netflix themselves just use a trivial upvote scheme now.
But I wonder if there's some much simpler approach that still gets pretty good results.
I wrote this a couple of years ago [1]. I think we need to remove subjectivity on ratings by asking more specific questions and only allowing a binary answer.
1. Is the food good?
2. Is the service good?
3. Is the atmosphere good?
Those are pretty simple to answer. Often when I see 1-star reviews, it's because of a single element of the experience, not the experience overall.
It's easier to leave a review because there's less cognitive load. It's easier to search for what you want: if I have my foodie hat on, I don't particularly care about the service. If it's a night out with a customer, that becomes more important all of a sudden.
And then you can generate some sort of average score based on the answers to these questions to calculate the 5-star rating.
I do prefer that over stars, but I think it potentially misses some information. Let's say most people answer "good" for all the categories. Does that just mean the place is good overall, or is it fantastic?
To put it another way, how do you distinguish the 4.0-star places from the 4.9-star places?
With conventional star ratings, you're reliant on most people using stars consistently. With a series of yes/no questions, you're relying on a potentially small pool of "no" answers to give you a useful signal.
I think stack ranking would be much more powerful. "How does this place compare to others? Average, better than average, in your all time top 5?" Everybody's feedback would be completely clear. It's not obvious how to aggregate that into a single rating number though.
Given a set of questions - e.g. "how's the food", "how's the atmosphere", "how's the service", etc. - you could figure out how the restaurant scores relative to others by stack ranking based on the % of answers to a particular question that got a "Yes". The numbers should hopefully reflect a normal distribution, and from there you get your /5 rating.
If everybody answers "yes" to all of the questions - good value, service, food, atmosphere - then that suggests to me that it's a great restaurant. And you can have a lot of questions that are even asked randomly to limit the number of questions per user.
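One way that aggregation could look (a sketch under the assumptions above; restaurant names, answer data, and the linear 1–5 mapping are all invented): compute each restaurant's yes-rate per question, then convert its percentile rank among all restaurants into a score out of 5.

```python
def yes_rate(answers):
    """Fraction of yes (1) answers for one question at one restaurant."""
    return sum(answers) / len(answers)

def score_out_of_five(rates, restaurant):
    """Rank one restaurant's yes-rate against all the others,
    then scale the 0-1 percentile onto a 1-5 score."""
    r = rates[restaurant]
    others = len(rates) - 1
    below = sum(1 for v in rates.values() if v < r)
    percentile = below / others if others else 0.5
    return round(1 + 4 * percentile, 1)

# "Is the food good?" answers, one list per restaurant (made-up data):
food_rates = {
    "Taco X": yes_rate([1, 1, 1, 0]),   # 75% yes
    "Taco Y": yes_rate([1, 1, 1, 1]),   # 100% yes
    "Diner Z": yes_rate([1, 0, 0, 0]),  # 25% yes
}
score_out_of_five(food_rates, "Taco Y")   # 5.0 -- top of the ranking
score_out_of_five(food_rates, "Diner Z")  # 1.0 -- bottom
```

Because the score is relative rank rather than a raw yes-rate, a question where nearly everyone answers "yes" everywhere still separates the merely good from the great, which addresses the "4.0 vs 4.9" worry above.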
I rate a lot of places highly that do a lot of things well but don't have great service, because I don't think the service is bad enough to bring the rating down. But that's data that's being lost.
I like your idea of stack ranking but with a different flavour. I think that "in your all time top 5" is a hard question to answer. How about this though - if we know you've been to Taco Place X and now you're going to Taco Place Y, maybe the question is "are the tacos at Y better than X", "is the atmosphere at Y better than X" or even "is Y better than X" (but I like the idea of collecting more granular data).
If you collect this^ data to stack rank, it definitely gives you a better distribution of restaurants relative to each other in each category.
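Those pairwise "is Y better than X" answers could be aggregated with something like an Elo-style update (my own suggestion, not anything from the thread; the restaurant names and k-factor are arbitrary): each comparison nudges the winner up and the loser down, and sorting the scores gives you the stack rank.

```python
def elo_update(ratings, winner, loser, k=32):
    """One Elo-style update from a single 'winner beat loser' answer."""
    ra, rb = ratings[winner], ratings[loser]
    expected_win = 1 / (1 + 10 ** ((rb - ra) / 400))
    ratings[winner] = ra + k * (1 - expected_win)
    ratings[loser] = rb - k * (1 - expected_win)

ratings = {"Taco X": 1500.0, "Taco Y": 1500.0, "Taco Z": 1500.0}

# Three users say Y's tacos beat X's; one says Z's beat Y's.
for _ in range(3):
    elo_update(ratings, "Taco Y", "Taco X")
elo_update(ratings, "Taco Z", "Taco Y")

sorted(ratings, key=ratings.get, reverse=True)
# -> ['Taco Y', 'Taco Z', 'Taco X']
```

A nice property here is that nobody ever has to answer an absolute question like "all time top 5" - the ranking falls out of lots of easy head-to-head comparisons. You'd keep a separate ratings table per question (tacos, atmosphere, etc.) to preserve the granular data.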
As a consumer, with this level of granularity, I can select what I care about tonight. If I'm grabbing takeout for lunch at work, does a five star rating even matter? I should ask Siri "show me the top fast and delicious takeout restaurants near me" and she should do: "select name from restaurants where distance < 500m order by (speed + flavour) limit 3;" and from there I will pick something from that list that looks nice. That seems like a nice UX.
There's a body of research on this, and it suggests that ratings are more meaningful as you add options, up to about 5 or 6.
That is, if you asked people to do the ratings once, and then asked them 1 hour later, there would be more consistency across time as you add options from 2 to 3 to 4, up to about 5 or 6.
The problem with binary ratings is that, as much as you might think otherwise, you're forcing a kind of hazy, grey experiential assessment into 0 or 1. And in doing so, people near the boundary (whatever that might be) will vacillate between them. E.g., people who feel "meh" about something are forced to choose something else, and sometimes they'll say 0 and sometimes 1. The more options you give, the more reliable / meaningful the ratings will be.
This example is interesting to me because it's something most people can relate to and illustrates the complications of utility-based and Bayesian formulations of the problem. You end up having to decide on utilities and/or priors.
To me the answer is to weight the data maximally in forming a posterior, in which case you end up using a reference prior. Similar kinds of arguments about utilities lead to reference priors. Reference priors can be complicated to compute, but for things like multinomials over ordinal ratings, reference priors have been worked out fairly well.
To me it always made sense to allow people to sort by the center of the estimate, or the lower bound (maybe using different language).
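To make the "sort by the lower bound" idea concrete for binary yes/no data (a simplification of the multinomial case above): the Jeffreys prior Beta(1/2, 1/2) is the reference prior for a binomial, and you can sort items by a lower bound on the posterior yes-rate. This sketch uses a normal approximation to the Beta posterior to stay dependency-free; the function name is made up.

```python
import math

def jeffreys_lower_bound(yes, n, z=1.96):
    """Approximate lower bound on the posterior mean yes-rate under
    the Jeffreys prior Beta(1/2, 1/2), via a normal approximation
    to the Beta(yes + 1/2, n - yes + 1/2) posterior."""
    a, b = yes + 0.5, n - yes + 0.5
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return max(0.0, mean - z * math.sqrt(var))

# 4/4 yes with little data vs 90/100 with lots of data:
jeffreys_lower_bound(4, 4)     # wide interval -> bound drops well below 0.9
jeffreys_lower_bound(90, 100)  # narrow interval -> bound stays near 0.9
```

Sorting by this bound rather than the raw average is what keeps a place with four perfect reviews from outranking one with ninety positives out of a hundred; sorting by `mean` instead gives you the "center of the estimate" ordering.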
I think 1-4 stars is the ideal rating style. I wish that were used more often.
A choice of 1-4 stars gives you enough freedom to express your opinion, without being overwhelming. It's a small enough range to be reasonably objective (almost everybody will interpret it as 1 star = bad, 2 = passable, 3 = good, 4 = great). And with an even number of choices there's no middle "meh" option -- you're forced to make a choice between 2 and 3.
Of course it's important not to ruin it by adding extra options, like 0 stars or half-stars. (That was Ebert's big mistake!)
Edit to add: to relate this to the parent post, I'm thinking that maybe ranking things as 1-4 stars in several categories could be the best of both worlds.