This page describes the methodologies used for Baymard Institute’s 67,000+ hours of large-scale e-commerce UX research. More specifically, Baymard’s research is based on:
In the following sections, each of the four research methodologies is described in detail.
To purchase access to all of the research findings and Baymard Premium, go to baymard.com/research. See the Roadmap & Changelog page for the new research Baymard has just published, along with the roadmap for what’s coming.
The below video provides an overview of Baymard’s research methodology and the structure of the research foundation, along with how it’s used in Baymard’s core services:
A large part of the guideline research content comes from large-scale qualitative usability testing. Specifically, it’s based on 12 rounds of qualitative usability testing with a total of 1,900+ test subject/site sessions following the “Think Aloud” protocol (in-person, 1:1 moderated lab usability testing).
For a study following the think-aloud protocol, a binomial probability calculation shows that, on average, 95% of all usability problems with an occurrence rate of 14% or higher will be discovered when just 20 test subjects are used. Since there will always be contextual site differences, the aim of this research body is not to arrive at final statistical conclusions of whether 61.4% or 62.3% of your users will encounter a specific interface issue. The aim is rather to examine the full breadth of the user’s online shopping behavior, and to present the issues that are most likely to cause interface, experience, and usability problems for a reasonable subgroup of your users. And, just as importantly, to present the solutions and design patterns that during testing were verified to consistently constitute a high-performing e-commerce user experience.
In total, the test subjects encountered 14,500 usability issues in the process of finding, exploring, evaluating, and purchasing products at the tested e-commerce sites, all of which have been distilled into 580+ usability guidelines on what consistently constitutes a good e-commerce user experience.
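The binomial calculation mentioned above can be reproduced in a few lines. This is a minimal sketch, assuming each test subject independently encounters a given problem with the same probability (the occurrence rate); the function name is illustrative and does not come from Baymard’s materials.

```python
def discovery_probability(occurrence_rate: float, subjects: int) -> float:
    """Probability that at least one of `subjects` test subjects encounters
    a problem that affects a fraction `occurrence_rate` of users.

    Assumes subjects are independent and share the same occurrence rate.
    """
    # Chance that no subject hits the problem, then take the complement.
    return 1.0 - (1.0 - occurrence_rate) ** subjects


p = discovery_probability(0.14, 20)
print(f"{p:.3f}")  # 0.951
```

Under these assumptions, 14% is roughly the lowest occurrence rate at which 20 subjects still give a discovery probability of at least 95%; at 13% the figure drops to about 94%.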
The 1:1 usability testing was conducted following the qualitative “Think Aloud” protocol. Each test subject was given 2–4 tasks, depending on the type of task and how fast the subject was. Each subject’s test session lasted about 1 hour, and the subjects were allowed breaks between each site tested. The usability studies tasked real users with finding, evaluating, selecting, and purchasing products matching everyday purchasing preferences, such as “find a case for your current camera”, “an outfit for a party”, “a jacket you’d like for the upcoming spring”, “try to change your account password”, etc.
The subjects were told to behave as they would on their own, including abandoning a site for a competitor and going off-site to search for information. When experiencing problems, the test subjects were asked open-ended questions such as “What are you thinking right now?”, “Why did you click there?”, “What did you expect would happen?”, etc.
If the test subjects got completely stuck, they were not helped at first. If they were still unable to proceed, they were given help to get past a specific problem, were given another task, or were directed to another test site for the same task. On such occasions, it was noted in the research log and counted as a “failed task.” A task was also recorded as failed if test subjects completely misinterpreted a product due to the product page layout or features; for example, if a subject ended up concluding they wanted product A because it was smaller than product B, when in fact the opposite was the case.
Any personal information submitted by accident has been edited out of the screenshots used in this report or replaced with dummy data. The compensation given was up to $100 in cash.
The test sites were rotated between the test subjects to distribute subjects evenly on the tested sites. The test sites across our different rounds of testing are Adidas, Allbirds, Amazon, American Apparel, Amnesty Shop, ASOS, Apple, Appalachian Mountain Company, All About Dance, AllPosters, American Eagle Outfitters, Avis, Banana Republic, Best Buy, Bed Bath & Beyond, BitDefender, Blue Apron, Blue Nile, Build.com, Box.com, B&H Photo, Bose, Caraway, Chanty, Chemist Direct, Cisco Webex, Cole Haan, Crutchfield, Crate&Barrel, Coastal.com, Daniel Wellington, Drugstore.com, eBags, Enterprise.com, Etsy, Egnyte, Equal Parts, Farer, Flock.com, Foot Locker, FTD, Fandango, GAP, Gilt, Great Jones, Go Outdoors, GoToMeeting, Greats, HelloFresh, Herschel, HomeChef, Home Depot, H&M, HobbyTron, IKEA, JBL, Kaspersky, KitchenAid, Kohl’s, Levi’s, Lowe’s, L’Occitane, Macy’s, Maya Chia, MAHALO Care, McAfee, Milo, Mirror, mvmt, Microsoft, Monastery Made, Neiman Marcus, Newegg, Nordstrom, Norton, Oakley, Old Navy, Overstock, Patagonia, Pixmania, Pottery Barn, Perfume.com, PetSmart, REI, Sears, Sephora, Slack, Shop Bop, Sahalie, Staples, Southwest Airlines, SunBasket, Sync.com, Target, Tesco, Tempo, Tonal, Toys’R’Us, The Entertainer/TheToyShop.com, Thomann, Thousand Fell, Ulta, Under Armour, United Airlines, Urban Outfitters, Walgreen’s, Walmart, Wayfair, Williams Sonoma, Zappos, Zoom, and 1-800-Flowers.
Additional test methodology notes specific to the 12 rounds of testing:
Of course, the 580+ guidelines aren’t all equally important to, or impactful on, the user’s experience. Some issues will be the direct cause of site abandonments, while others will only amount to friction in the form of users stopping, doubting, getting anxious, getting frustrated, or performing futile actions.
Therefore, each of the 580+ guidelines has two ratings assigned, a severity rating and a frequency rating:
The guideline’s ‘Severity’ and ‘Frequency’ ratings are combined into an overall ‘Importance’ rating that describes the overall importance of reading a guideline in three levels: “Detail”, “Impactful”, and “Essential”. This importance level is stated at the top of every guideline page.
Within each topic in Baymard Premium the guidelines are generally listed based on their importance (based on their combined frequency and severity). Hence, the first guidelines in each topic tend to be those with the largest impact on UX, and are often direct roadblocks. That’s not to say the guidelines presented later don’t matter – they are numerous and even if they are unlikely to cause abandonments individually, they can collectively add up and result in a high-friction experience.
While we generally advise focusing on the most severe issues first, the combined impact of a guideline should be judged against the specific cost of making the improvement. In our experience, teams that prioritize this way tend to see the best return on investment.
In addition to the severity and frequency ratings, select guidelines also have one or more special characteristics that also boost the importance of reading the guideline (the ‘Importance’ rating):
Another major part of the research methodology and dataset is a comprehensive UX benchmark. Specifically, Baymard has conducted 18 rounds of manual benchmarking of 80 top-grossing US and European e-commerce sites across 700+ UX guidelines.
The benchmarking specifically consists of “heuristic evaluations” of the 80 e-commerce sites using Baymard’s usability guidelines (derived from the large-scale qualitative usability testing) as the 700+ review heuristics and weighted UX scoring parameters.
The UX performance score for each of the 700+ guidelines is weighted based on its Frequency and Severity rating, and each site was graded on a 7-point scale across all 700+ guidelines. This has led to a comprehensive benchmark database with 96,800+ manually assigned and weighted UX performance scores, and 63,000+ additional implementation examples for the guidelines from top retailers (organized and performance-verified), each annotated with review notes.
The total UX performance score assigned to each benchmarked site is essentially an expression of how good (or bad) an e-commerce user experience a first-time user will have at the site, based on the 580+ guidelines.
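To illustrate the general idea of weighting per-guideline grades by frequency and severity, here is a minimal sketch of a plain weighted average. It is not Baymard’s actual scoring algorithm, which is not specified here: the -3 to +3 grade range, the numeric weights, the choice of frequency times severity as the weight, and all names are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class GuidelineScore:
    grade: int        # hypothetical 7-point grade, assumed to run from -3 to +3
    frequency: float  # hypothetical numeric frequency weight
    severity: float   # hypothetical numeric severity weight


def weighted_score(scores: list[GuidelineScore]) -> float:
    """Weighted mean of grades, using frequency * severity as the weight
    (an assumed weighting, chosen only for illustration)."""
    total_weight = sum(s.frequency * s.severity for s in scores)
    if total_weight == 0:
        raise ValueError("at least one guideline must have a non-zero weight")
    weighted_sum = sum(s.grade * s.frequency * s.severity for s in scores)
    return weighted_sum / total_weight


example = [
    GuidelineScore(grade=3, frequency=2.0, severity=3.0),   # weight 6
    GuidelineScore(grade=-1, frequency=1.0, severity=1.0),  # weight 1
]
print(round(weighted_score(example), 2))  # 2.43
```

In this sketch, a guideline with a high combined weight dominates the result, so a violation of a frequent, severe guideline pulls the score down far more than a violation of a minor one.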
The specific theme score is calculated using a weighted multi-parameter algorithm with self-healing normalization:
Below is a brief description of the main elements in the algorithm:
The annotated highlights/pins found in the benchmark are examples that the reviewer judged to be of interest to the reader. It’s the site’s overall adherence or violation of a guideline that is used to calculate the site’s UX performance score. Thus, you may find a specific Highlight that shows an example of how a site adheres to a guideline, even though that same site is scored to violate the guideline (typically because the site violates the guideline at another page), and vice versa.
All site reviews were conducted by Baymard employees. All reviews were conducted as a new customer would experience them, so no existing accounts or browsing history were used (except for Accounts & Self-Service benchmarking). For the US-based and UK-based sites, an IP address from that country was used. In cases where multiple local or language versions of a site existed, the US/UK site version was used for the benchmark.
In the benchmark screenshots, only 1–2 versions of each page are depicted, but the reviewer also investigated 15–30 other pages, which were used for the benchmark scoring and the detailed highlight screenshots.
Notes specific for:
In addition to the qualitative usability testing following the “think aloud” protocol, eye-tracking was also used for select testing. The eye-tracking study included 32 participants using a Tobii eye-tracker, with a moderator present in the lab during the test sessions (for task and technical questions only). Each session took approximately 20–30 minutes. All eye-tracking test subjects tested 4 sites: Cabelas, REI, L.L.Bean, and AllPosters. The eye-tracking test sessions began by starting the test subjects at a product listing page and asking them to, for example, “find a pair of shoes you like in this list and buy it.”
The eye-tracking subjects were given the option to use either their personal information or a made-up ID handed on a slip of paper. Most opted for the made-up ID. Any personal information has been edited out of the screenshots used in this report or replaced with dummy data. The compensation given was up to $50 in cash.
Lastly, the fourth research methodology relied upon for the Baymard Premium dataset is quantitative studies. The quantitative study component is in the form of 9 quantitative studies or tests with a total of 14,453 participants. The studies each sought answers on:
The UX benchmark graph contains the summarized results of the 18 rounds of manual UX performance benchmarking Baymard Institute’s research staff have conducted of 80 top-grossing US and European e-commerce sites.
This UX performance benchmark database contains a total of 96,800+ manually assigned and weighted UX performance scores, along with 63,300+ “best practice” implementation examples from top retailers. The graph summarizes the 35,000+ most recent UX performance ratings.
The total UX performance score assigned to each of the 80 benchmarked e-commerce sites is an expression of how good or bad a user experience a first-time user will have at the site, based on 700+ weighted e-commerce UX guidelines.
The graph itself has multiple nested layers, where you can drill down into the more granular sub-performances. To reveal the deepest layers and the 700+ guidelines, you will need Baymard Premium research access. Full access also provides you a tool to self-assess your own websites and prototypes, to get a fully comparable scorecard with direct performance comparison to the public benchmark database.
The UX performance benchmarking is conducted as “heuristic evaluations” of the 80 e-commerce sites. But instead of using 10–20 broad and generic usability heuristics, Baymard’s usability guidelines are used as 700+ highly detailed and weighted review heuristics. These 700+ guidelines come directly from Baymard’s 67,000+ hours of large-scale qualitative usability testing.
The UX performance score for each of the 700+ guidelines is weighted based on its observed impact during usability testing, and each of the 80 sites is graded on a 7-point scale across all 700+ guidelines.
The specific theme and topic performances are calculated using a weighted multi-parameter algorithm with self-healing normalization. This ensures that the general progress of the e-commerce industry and users’ ever-increasing expectations are factored into the performance scoring (updated multiple times each year):
For full testing details, see our research methodology description.
Baymard Institute can’t be held responsible for any kind of usage or correctness of the provided information.
The screenshots used may contain images and artwork that can be both copyright and/or trademark protected by their respective owners. Baymard Institute does not claim to have ownership of the artwork that might be featured within these screenshots, and solely captures and stores the website screenshots in order to provide constructive review and feedback within the topic of web design and web usability.
Citations, images, and paraphrasing may not be published anywhere without first having written consent from Baymard (reach out to email@example.com).
See the full terms and conditions for Baymard public and paid-for services, content, and publications.