This page describe the methodologies used for Baymard Institute’s 130,000 hours of large scale E-commerce UX research. More specifically, Baymard’s research is based on:
In the following sections the methodology for each of the four research methodologies is described in detail.
To purchase access to all of the research findings and Baymard Premium go to baymard.com/research. See the Roadmap & Changelog page for what new research Baymard has just published along with the roadmap for what’s coming.
The below video provides an overview of Baymard’s research methodology and cover how you can utilize the test data in your work:
A large part of the guideline research content comes from large-scale qualitative usability testing. Specifically it’s based on 25 rounds qualitative usability testing with a total of 4,400+ test subject/site sessions following the “Think Aloud” protocol, 1:1 moderated lab usability testing.
The usability test sessions occured both in in US, UK, Ireland, Germany, and the Nordics (see list of websites further down).
For a study following the think aloud protocol, a binomial probability calculation shows that 95% of all usability problems with an occurrence rate of 14% or higher will be discovered on average, when just 20 test subjects were used. Since there will always be contextual site differences, the aim of this research body is not to arrive at final statistical conclusions of whether 61.4% or 62.3% of your users will encounter a specific interface issue. The aim is rather to examine the full breadth of the user’s online shopping behavior, and present the issues which are most likely to cause interface, experience and usability issues for just a reasonable sub-group of your users will face. And as importantly, to present the solutions and design patterns that during testing were verified to consistently constitute a high performing e-commerce user experience.
In total the test subjects encountered 34,000+ usability issues in the process of finding, exploring, evaluating, and purchasing products at the tested e-commerce sites – all of which have been distilled into 700 usability guidelines on what consistently constitute a good e-commerce user experience.
The 1:1 usability testing was conducted following the qualitative “Think Aloud” protocol. Each test subject was given 2–4 tasks, depending on the type of task and how fast the subject was. The duration of each subject’s test session was ~ 1 hour long, and the subjects were allowed breaks between each site tested. The usability studies tasked real users with finding, evaluating, selecting and purchasing products matching everyday purchasing preferences such as “find a case for your current camera”, “an outfit for a party”, “a jacket you’d like for the upcoming spring”, try to change your account password, etc.
The subjects were told to behave as they would on their own, including abandoning a site for a competitor and going off-site to search for information. When experiencing problems, the test subjects were asked open-ended questions such as What are you thinking right now?, Why did you click there?, What did you expect would happen?, etc.
If the test subjects got completely stuck, they were not helped at first. If they were still unable to proceed, they were given help to get past a specific problem, were given another task, or were directed to another test site for the same task. On such occasions, it was noted in the research log and counted as a “failed task.” A task was also recorded as failed if test subjects completely misinterpreted a product due to the product page layout or features; for example, if a subject ended up concluding they wanted product A because it was smaller than product B, when in fact the opposite was the case.
Any personal information submitted by accident has been edited out of the screenshots used in this report or replaced with dummy data. The compensation given was up to $100 in cash (except ‘B2B’ and ‘hard to recruit’ participants who were given up to $300).
The test sites were rotated between the test subjects to distribute subjects evenly on the tested sites.
The 240+ test sites across our 25 rounds of testing are:
Additional test methodology notes specific for the 25 rounds of testing:
For the Homepage & Category Navigation testing: in order to avoid artificially forcing the subjects to use category navigation on the tested sites, this study was conducted as a combined e-commerce category navigation and search study. This way it was up to the test subjects themselves to decide if they preferred to search or navigate via the categories to find what they were looking for (i.e., they were never asked to use one approach over the other). Furthermore, it allowed the subjects to mix category navigation and search.
During the test sessions the subjects experienced 900+ usability issues specifically related to the homepage, category and main navigation, site taxonomy, category pages, and similar. The issues ranged from minor interruptions to severe misconceptions about the basic premises of how to find products at an e-commerce site, with task completion rates going as low 10-30% when asked to find fairly common products, e.g. a sleeping bag for cold weather, a spring jacket, or a camera with a case.
The 19 tested sites were: Amazon, Best Buy, Blue Nile, Chemist Direct, Drugstore.com, eBags, Gilt, Go Outdoors, H&M, IKEA, Macy’s, Newegg, Pixmania, Pottery Barn, REI, Tesco, Toys’R’Us, The Entertainer/TheToyShop.com, and Zappos.
For the On-Site Search testing: in order to avoid artificially forcing the subjects to use search on the tested sites, this study was conducted as a combined e-commerce category navigation and search study. This way it was up to the test subjects themselves to decide if they preferred to search or navigate via the categories to find what they were looking for (i.e., they were never asked to use one approach over the other). Furthermore, it allowed the subjects to mix category navigation and search.
During the test sessions, more than 750 usability issues arose specific to e-commerce search. All of these issues have been analyzed and distilled into 49 guidelines on search usability, specifically for an e-commerce context.
The observed search issues often proved so severe that 31% of the time the subjects were either unable to find the items they were looking for or became so frustrated that they decided to abandon. And 65% of the time, the subjects needed more than one search attempt, with three to four query iterations not being uncommon. This is despite testing major e-commerce sites and the tasks being fairly basic, such as “find a case for your laptop,” “find a sofa set you like,” etc.
The 1:1 “think aloud” test protocol was used to test the 19 sites: Amazon, Best Buy, Blue Nile, Chemist Direct, Drugstore.com, eBags, Gilt, Go Outdoors, H&M, IKEA, Macy’s, Newegg, Pixmania, Pottery Barn, REI, Tesco, Toys’R’Us, The Entertainer/TheToyShop.com, and Zappos.
For the Product Lists & Filtering testing: in order to avoid artificially forcing the subjects to use search on the tested sites, this study was conducted as a combined e-commerce category navigation and search study. This way it was up to the test subjects themselves to decide if they preferred to search or navigate via the categories to find what they were looking for (i.e., they were never asked to use one approach over the other). Furthermore, it allowed the subjects to mix category navigation and search.
During the test sessions the subjects experienced 580+ usability issues specifically related to the product list design and tools. Crucially, significant performance gaps were identified between the tested sites’ product list implementations. Sites with just mediocre product list implementations saw massive abandonments rates of 67-90% whereas other sites with better implementations saw only 17-33% abandonments, despite the test subjects trying to find the exact same products. These sites all had equally relevant products to what the test subjects were looking for, the difference stemmed solely from the design and features of each site’s product list.
The 19 test sites were: Amazon, Best Buy, Blue Nile, Chemist Direct, Drugstore.com, eBags, Gilt, Go Outdoors, H&M, IKEA, Macy’s, Newegg, Pixmania, Pottery Barn, REI, Tesco, Toys’R’Us, The Entertainer/TheToyShop.com, and Zappos.
For the Product Details Page testing: the subjects mainly tested the product details page, and the 1:1 qualitative test sessions used a task design that facilitated comparison shopping across sites.
Each task started by asking the subjects to choose between two product types, and pick the one they could relate to best. For example, asking subjects to choose between shopping for a grill or a fridge, or choosing between shopping for a backpack or a baby gift. After selecting a product type, they were sent to two different product pages, each from a different site (e.g., a gas grill at Lowe’s and another at Home Depot). The two products at each site weren’t identical, but very closely related substitutes, so subjects had to perform an actual comparison to determine which suited their needs the best.
The 12 sites tested were: Lowe’s, Sears, Home Depot, Sephora, L’Occitane, Nordstrom, Patagonia, Crutchfield, Bose, KitchenAid, Cole Haan, and eBags.
For the Cart & Checkout testing: the usability studies tasked real users completing the purchase for multiple different types of products – all the way from “shopping cart” to “completed sale”.
Each test subject tested 5 - 8 checkouts, depending on the type of task and how fast they were. The duration of each subject’s test session was ~ 1 hour long, and the subjects were allowed breaks between each site tested. The subjects were on some sites allowed to perform a guest checkout, whereas on other sites they were asked to sign into their existing account (some tasks personal account, other tasks a synthetic account) to complete an account-based checkout flow.
During the test sessions the subjects experienced 2,700+ usability issues specifically related to the checkout flow and design. All of these findings have been analyzed and distilled into 134 specific usability guidelines on how to design and structure the best possible performing checkout flow.
The sites tested in the two rounds of 1:1 qualitative think aloud test sessions were:
For the Account & Self-Service testing: For the self-service and account study the testing was done both as in-lab moderated usability testing and as remote moderated usability testing.
The in-lab usability testing gave users a series of account related tasks (using a synthetic account) such as “can you try to update your password/shipping address/newsletter subscriptions”, “you just placed this order, how would you track it” “can you cancel it”, etc.
For the remote moderated usability testing subjects were required that were about to purchase items on any given e-commerce site. Using remote testing a series of smaller test sessions were initiated each time they made any order interaction; from ordering, receiving order confirmation emails, tracking order online, receiving the order, initiating order return, return tracking, return confirmation, etc.
In total the subject encountered 1,400 usability issues which have been distilled into the Account & Self-Service guidelines.
The issues were found when testing the follwoing 17 websites: Home Depot, Amazon, Best Buy, Nordstrom, Build, REI, Walmart, B&H Photo, Apple, Appalachian Mountain Company, GAP, All About Dance, Bed Bath & Beyond, Shop Bop, Old Navy, and Sahalie.
For the Mobile E-Commerce testing: The Mobile e-commerce research is based on two different round of large scale mobile e-commerce testing, of a total of 29 major e-commerce sites. The usability study tasked real users with finding, evaluating, selecting and purchasing everyday products at mobile e-commerce sites.
The 1:1 “think aloud” test protocol was used to test the 29 mobile sites: 1-800-Flowers, Amazon, Avis, Best Buy, Buy.com, Coastal.com, Enterprise.com, Fandango, Foot Locker, FTD, GAP, H&M, Macy’s, REI, Southwest Airlines, Toy’R’Us, United Airlines, Walmart, Target, Overstock, Herschel, Under Armour, JBL, Adidas, Kohl’s, B&H Photo, Newegg, Walgreen’s, and Staples. Each test subject tested 3 - 6 mobile sites, depending on how fast they were. The duration of each subject’s test session varied between 1 and 1.5 hours, and the subjects were allowed breaks between each site tested.
While these sites are by no means small, the test subjects encountered 3,700+ usability-related issues during the test sessions. Ranging from small interruptions with an interface to severe misconceptions about the basic premises of the m-commerce site, resulting in abandonments.
Our studies show that 43.2% of all smartphone or tablet users have abandoned an online order during the checkout flow in the past 2 months. And 61% sometimes or always switch to their desktop devices when having to complete the mobile checkout process. Also, despite testing the mobile e-commerce sites of major e-commerce businesses such as Walmart, Amazon, Avis, United, BestBuy, FTD, Fandango, etc. numerous test subjects were unable to complete a purchase at multiple of these sites. Note that the word is unable, not unwilling. The usability issues were that severe.
Each of the 700 guidelines of course aren’t equally important and impactful on the user’s experience. Some issues will be the direct cause for site abandonments, while others will only amount to friction in the form of users stopping, doubting, getting anxious, getting frustrated, or performing futile actions.
Therefore, each of the 700 guidelines have two ratings assigned, a severity rating and a frequency rating:
The guideline’s ‘Severity’ and ‘Frequency’ rating is combined into an overall ‘Importance’ rating, that describes the overall importance of reading a guideline in three levels; “Detail”, “Impactful” and “Essential”. This importance level is stated at the top of every guideline page.
Within each topic in Baymard Premium the guidelines are generally listed based on their importance (based on their combined frequency and severity). Hence, the first guidelines in each topic tend to be those with the largest impact on UX, and are often direct roadblocks. That’s not to say the guidelines presented later don’t matter – they are numerous and even if they are unlikely to cause abandonments individually, they can collectively add up and result in a high-friction experience.
While we generally advise to focus on the most severe issues first, the combined impact of a guideline should be judged against the specific cost for making the improvement. In our experience, teams that prioritize this way tend to see the best return on investment.
In addition to the severity and frequency ratings, select guidelines also have one or more special characteristics, that also boost the importance of reading the guideline (the ‘Importance’ rating):
Another major part of the research methodology and dataset is a comprehensive UX benchmark. Specifically, Baymard have conducted 54 rounds of manual benchmarking of 214 top-grossing US and European e-commerce sites across 700 UX guidelines .
The benchmarking specifically consists of “heuristic evaluations” of the 214 e-commerce sites using Baymard’s usability guidelines (derived from the large-scale qualitative usability testing) as the 700 review heuristics and weighted UX scoring parameters.
The UX performance score for each of the 700 guidelines is weighted based on its Frequency and Severity rating, and each site was graded on a 7 point scale, across all 700 guidelines. This has led to a comprehensive benchmark database with 225,000+ manually assigned and weighted UX performance scores, and 150,000+ additional implementation examples for the guidelines from top retailers (organized and performance-verified), each annotated with review notes.
The total UX performance score assigned to each benchmarked site is essentially an expression of how good (or bad) a e-commerce user experience a first-time user will have at the site based on the 700 guidelines.
The specific theme score is calculated using a weighted multi-parameter algorithm with self-healing normalization:
Below is a brief description of the main elements in the algorithm:
The 150,000+ annotated highlights/pins found in the benchmark are examples that the reviewer judged to be of interest to the reader. It’s the site’s overall adherence or violation of a guideline that is used to calculate the site’s UX performance score. Thus, you may find a specific Highlight that shows an example of how a site adheres to a guideline, even though that same site is scored to violate the guideline (typically because the site violates the guideline at another page), and vice versa.
All site reviews were conducted by Baymard employees. All reviews were conducted as a new customer would experience them — hence no existing accounts or browsing history were used (except for Accounts & Self-Service benchmarking). For the US-based and UK based sites an IP address from that country was used. In the case multiple local or language versions of a site existed, the US/UK site version was used for the benchmark.
In the benchmark screenshots only 1-2 versions of each page is depicted, but the reviewer investigated 15-30 other pages which were used for the benchmark scoring and detailed highlight screenshots as well.
Notes specific for:
In addition to the qualitative usability testing following the “think aloud” protocol, eye-tracking was also used for select testing. The eye-tracking test study included 32 participants using a Tobii eye-tracker, with a moderator present in the lab during the test sessions (for task and technical questions only), which took approx. 20–30 minutes. All eye-tracking test subjects tested 4 sites: Cabelas, REI, L.L.Bean, and AllPosters. The eye-tracking test sessions began by starting the test subjects at a product listing page and asking them to, for example, “find a pair of shoes you like in this list and buy it.”
The eye-tracking subjects were given the option to use either their personal information or a made-up ID handed on a slip of paper. Most opted for the made-up ID. Any personal information has been edited out of the screenshots used in this report or replaced with dummy data. The compensation given was up to $50 in cash.
Lastly, the fourth research methodology relied upon for the Baymard Premium dataset is quantitative studies. The quantitative study component is in the form of 11 quantitative studies or tests with a total of 18,397 participants. The studies each sought answers on:
The UX benchmark graph contains the summarized results of the 54 rounds of manual UX performance benchmarking Baymard Institute’s research staff have conducted of 214 top-grossing US and European e-commerce sites.
This UX performance benchmark database contains a total of 225,000+ manually assigned and weighted UX performance scores, along with 150,000+ “best practice” implementation examples from 214 top retailers. The graph summarizes the 50,000+ most recent UX performance ratings.
The total UX performance score assigned to each of the 214 benchmarked e-commerce sites is an expression of how good or bad a user experience that a first-time user will have at the site, based on 700 weighted e-commerce UX guidelines.
The graph itself has multiple nested layers, where you can drill down into the more granular sub-performances. To reveal the deepest layers and the 700 guidelines, you will need Baymard Premium research access. Full access also provides you a tool to self-assess your own websites and prototypes, to get a fully comparable scorecard with direct performance comparison to the public benchmark database.
The UX performance benchmarking is conducted as a “heuristic evaluations” of the 214 e-commerce sites. But instead of using 10-20 broad and generic usability heuristics, Baymard’s usability guidelines are used as 700 highly detailed and weighted review heuristics. These 700 guidelines come directly from Baymard’s 130,000 hours of large-scale qualitative usability testing.
The UX performance score for each of the 700 guidelines is weighted based on its observed impact during usability testing, and each of the 214 sites is graded on a 7 point scale, across all 700 guidelines.
The specific theme and topic performances are calculated using a weighted multi-parameter algorithm with self-healing normalization. This ensures that the general progress of the e-commerce industry and users’ ever-increasing expectations are factored into the performance scoring (updated multiple times each year):
For full testing details, see our research methodology description.
Baymard Institute can’t be held responsible for any kind of usage or correctness of the provided information.
The screenshots used may contain images and artwork that can be both copyright and/or trademark protected by their respective owners. Baymard Institute does not claim to have ownership of the artwork that might be featured within these screenshots, and solely captures and stores the website screenshots in order to provide constructive review and feedback within the topic of web design and web usability.
Citations, images, and paraphrasing may not be published anywhere without first having written consent from Baymard (reach out to email@example.com).
See the full terms and conditions for Baymard’s public and paid-for services, content, and publications.