Don’t Base ‘Customer Ratings’ Sorting on Averages Only

This is the 8th in a series of 9 articles based on research findings from our e-commerce product list usability study.

Both our qualitative and quantitative test findings show that users expect ‘Customer Ratings’ sorting to function differently from how it’s currently implemented at 86% of major e-commerce sites.

During our most recent study on e-commerce Product Lists & Filtering, the second most applied sort type across all test sessions was sorting by ‘customer ratings’ (the most utilized was ‘lowest price’). However, the test sessions identified a critical mismatch between how users expect sorting by ‘Customer Ratings’ to function and how 86% of e-commerce sites currently implement it. This mismatch caused great user frustration and curtailed the subjects’ ability to find what they considered “highly rated” products.

In this article we’ll outline why users expect ‘Customer Rating’ sorting to function differently, how you can align your sorting logic with user expectations, and provide examples from leading e-commerce sites which already have this new sorting logic implemented.

Test Observation: Users Don’t Trust Averages Based on 1-4 Ratings

The typical mismatch between how users expect customer ratings to function and how they’re implemented stems from the intent users have when applying the “Customer Rating” sort type. From the test sessions it’s clear that most users rely on customer ratings as a way to quickly tap into the “wisdom of the crowd” – the collective opinion and experiences of other shoppers.

During testing, the “Customer Rating” sort type was used most frequently when the subjects were browsing for products where they had little domain knowledge and therefore sought to rely on the insights and experiences of others to make an otherwise difficult decision and to reduce the risk of purchasing an “inadequate” product.

However, when benchmarking the product list experience of 50 major e-commerce sites, we found that on 86% of those sites, “Customer Ratings” sorting is implemented as a naive rating average sorted in descending order, where a 5-star-average-rated product will be placed before a 4.8-star-average product regardless of how many ratings those averages are based on.

“This one only has a single rating, so that isn’t trustworthy at all,” a subject noted after realizing several of the products positioned first when sorting by customer ratings only had 1-2 ratings. She and all of the other subjects who sorted by “Customer Ratings” at REI found this inadequate, and instead favored the products with a 4.0-4.5 star average based on 5+ ratings.

“But then again, I can see it’s only a single review. That’s of course not so.. so.. this could be fake,” a subject speculated after having clicked on the first few products in the list, which he had sorted by “Top Rated”, continuing, “It could just as well be the manufacturer who was in here and posted a good review.”

When sorting by ‘Customer Ratings’, most sites will position a product with a single 5-star rating before a product with a 4.8-star average based on 18 votes. Technically this is correct, as the former product does have a higher average. Yet it is a naive implementation that doesn’t take the sample size into account, and indeed, nearly all users will find the latter product to be a much better indicator of a product “recommended by the crowd” when looking to make a product selection.

So while it may be mathematically correct to place the 5-star average first, it fails to account for the reliability of the average. A sample size of 1 is obviously flawed – a fact that wasn’t lost on the test subjects, who assumed that products with only a handful of perfect ratings were usually either a coincidence (a couple of ‘fanboys’) or even the manufacturer or site representatives who’d given the rating, and they would often find such products highly questionable.
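To make the failure mode concrete, here’s a minimal sketch (with hypothetical product names and rating data) of the naive sort described above:

```python
# Hypothetical products: a perfect average on a tiny sample vs.
# slightly lower averages backed by many more ratings.
products = [
    {"name": "Trail Pack A", "avg_rating": 5.0, "num_ratings": 1},
    {"name": "Trail Pack B", "avg_rating": 4.8, "num_ratings": 18},
    {"name": "Trail Pack C", "avg_rating": 4.5, "num_ratings": 52},
]

# Naive approach: sort purely by the average, ignoring sample size.
naive_order = sorted(products, key=lambda p: p["avg_rating"], reverse=True)

for p in naive_order:
    print(f'{p["name"]}: {p["avg_rating"]} stars ({p["num_ratings"]} ratings)')
```

The single-rating 5-star product lands on top – exactly the ordering that test subjects distrusted.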

Meanwhile, the reliability of customer-rating averages based on several votes was never called into question by the test subjects. In practice, skepticism began to drop when the average was based on 5+ votes. This high level of skepticism toward a low number of perfect ratings has been confirmed during our prior Checkout and Mobile E-commerce usability studies as well.

Survey: Users Prefer Higher Number of Ratings Despite Slightly Lower Average

To get a more quantitative understanding of users’ bias against fully trusting a 5-star rating average based on just a few ratings, we tested three different rating averages against 2,024 people.

Methodology: In total, three surveys were conducted with a total of 3,501 participants (split roughly evenly across the three surveys), testing different rating averages versus number of votes. Each survey showed the respondent two list items (shown in the result graphs) and asked them to pick the one they would purchase. To avoid sequencing bias, the display order of the two answers was randomized for each respondent.

For two otherwise identical products, where one product has a 5-star average based on 2 ratings and the other has a 4.5-star average based on 12 ratings, 73% would pick the one with the higher number of ratings despite its lower average. This confirms the test observations: when a perfect average is based on only a few ratings, users will often prefer other products with a slightly lower average but a higher number of ratings.

As noted in our earlier investigation of Users’ Perception of Product Ratings, product ratings essentially function as a type of social proof for users, letting them tap into the “wisdom of the crowd”, using good ratings as a proxy for “high quality” or “value for money”. The thinking goes that if a lot of other users are happy with a product it means that it must be a bargain or of high quality – or both. (Which is why users lacking domain knowledge or experience with the product find product ratings particularly useful because it allows them to rely on the domain knowledge and product experience of other customers.) The article also outlines why the number of ratings should always be displayed in conjunction with the rating average.

Solution: Sorting Logic Should Account for Both the Number of Ratings and the Average

To better match the user’s expectations and intent behind sorting by ‘Customer Ratings’, a site’s sorting logic has to take the number of ratings into account as well and not rely solely on the average score. In essence, when a user decides to sort by ‘customer ratings’, the products with a 5-star average based on just 1-4 ratings should not be placed before any products with a 4.5+ star average based on 50+ ratings.

Home Depot’s “Avg. Customer Rating” sorting takes both the rating average and the number of votes into account when determining display sequence. Notice how products with lower averages but more votes are placed above 5-star-rated products with only a few ratings.

The sorting logic should instead be weighted to account for the combination of rating average and the total number of ratings. This aligns much better with the intent the vast majority of users have when they sort by ‘Customer Ratings’ (i.e., “show me what other users think are the best products”). For instance, notice in the Home Depot example above how products with a 4.5-star average based on 50 and 36 ratings respectively are placed before the two products with a 5.0-star average based on only 6 ratings.
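One well-known way to implement such weighting (not necessarily the formula Home Depot or the other sites use) is a Bayesian average, which pulls each product’s observed average toward a site-wide prior mean until the product has accumulated enough ratings. The `site_mean` and `weight` values below are hypothetical and would need tuning per site:

```python
def bayesian_average(avg_rating, num_ratings, site_mean=4.0, weight=10):
    """Weighted score that pulls a product's average toward the site-wide
    mean until enough ratings have accumulated. `site_mean` and `weight`
    are tunable assumptions, not values from the article."""
    return (num_ratings * avg_rating + weight * site_mean) / (num_ratings + weight)

products = [
    {"name": "A", "avg": 5.0, "n": 2},   # perfect average, tiny sample
    {"name": "B", "avg": 4.5, "n": 50},  # slightly lower, well-established
    {"name": "C", "avg": 4.5, "n": 36},
]

ranked = sorted(
    products,
    key=lambda p: bayesian_average(p["avg"], p["n"]),
    reverse=True,
)
# The well-established 4.5-star products now outrank the 2-vote 5.0-star
# product, matching the user expectation described above.
```

Raising `weight` makes the logic more conservative (new products need more votes to rank highly); lowering it behaves closer to the naive average.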

Now, a simpler 5-vote “cutoff” – simply excluding (i.e., not calculating an average for) any product with fewer than 5 votes – could also be adopted. However, this is a much less sophisticated solution and obviously won’t work well for smaller sites or in categories with few user ratings.
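A sketch of that cutoff approach, here implemented as demoting (rather than hiding) products below the threshold; the threshold value and the handling of sub-threshold products are assumptions:

```python
MIN_VOTES = 5  # the cutoff discussed above; the exact threshold is a judgment call

def cutoff_sort(products):
    """Sort by average rating, but push every product with fewer than
    MIN_VOTES ratings below all products that meet the threshold."""
    return sorted(
        products,
        key=lambda p: (p["n"] >= MIN_VOTES, p["avg"]),
        reverse=True,
    )

items = [
    {"name": "A", "avg": 5.0, "n": 1},
    {"name": "B", "avg": 4.6, "n": 22},
]
# cutoff_sort(items) places B first despite its lower average,
# because A falls below the vote threshold.
```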

While it’s true that the weighted sorting method makes the actual sorting logic less transparent to the user (as it changes from a simple high-to-low logic to a more complex equation), during testing, this issue proved to be far less severe than the issues caused by listing products with 5-star averages first even if their average was only based on a handful of ratings. Without a weighted logic the most trusted products with 4.5+ averages based on dozens or hundreds of ratings will be scattered across several pages of results, making it very difficult for users to find the products which are “recommended by the crowd”.

The exact weighting between the average and the number of ratings will likely vary based on site context and audience and may require ongoing tweaking and A/B split-testing. For inspiration, here are the few major e-commerce sites we’ve identified that currently have a weighted sorting logic for their customer ratings: Overstock, Amazon, Crutchfield, Best Buy, Home Depot, and Lowe’s.

This article presents the research findings from just 1 of the 650+ UX guidelines in Baymard Premium – get full access to learn how to create a “State of the Art” e-commerce user experience.

Authored by Christian Holst on June 16, 2015

