In this article, I will explain how we arrive at a single overall score in our PageSpeed UX Score checker. I will also admit why looking at a single score is actually not a good idea.
Google's Core Web Vitals thresholds
Google did the heavy lifting for us already. They introduced Core Web Vitals. And even came up with thresholds per metric, based on their research. And given the amount of data Google has, there is no need to reinvent the wheel here.
Improving pagespeed beyond Core Web Vitals
SEO gains will stop when passing the thresholds of each individual Core Web Vital metric. However, the user experience and conversion will continue to improve when optimizing.
The good threshold for LCP is 2.5 seconds. Depending on your niche, optimizing from 1 to 0.9 second LCP might not make a business impact anymore. But you might actually want to optimize beyond the 2.5 second threshold to increase sales or signups.
An extra pair of min and max
With that in mind, we use an absolute minimum and derived maximum per metric. Basically introducing a second set of boundaries on top of the existing 2.5 second and 4 second threshold for LCP.
Next to a new boundary, we also use a weight for each metric. Lighthouse does this as well, so nothing new here. Even Web Vitals does it, where the 3 Core Web Vitals each have an equal weight of 33.33%.
Our minimum and weight per metric are as follows:
Metric | Good thresholds | Our minimum | Unit | Weight |
Largest Contentful Paint | 2500 | 1000 | ms | 25 |
First Input Delay | 100 | 50 | ms | 0 |
Cumulative Layout Shift | 0.1 | 0 | | 25 |
First Contentful Paint | 1800 | 500 | ms | 20 |
Interaction to Next Paint | 200 | 100 | ms | 25 |
Time to First Byte | 800 | 200 | ms | 5 |
As Interaction to Next Paint will become a Core Web Vitals metric in March 2024, we already reduced the weight of FID to zero. In other words, we anticipated the upcoming Core Web Vitals change, and the future set of Core Web Vitals (LCP, INP and CLS) has an equal weight in our tool.
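The table above can be written down as a small configuration. This is only a sketch of how such a config might look: the class and variable names are made up for illustration, while the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricConfig:
    good: float     # Google's "good" threshold
    minimum: float  # our absolute minimum (rewarded with 100%)
    unit: str       # "ms", or "" for the unitless CLS
    weight: int     # relative weight in the overall score


# Values taken from the table; the structure itself is hypothetical.
METRICS = {
    "LCP": MetricConfig(good=2500, minimum=1000, unit="ms", weight=25),
    "FID": MetricConfig(good=100, minimum=50, unit="ms", weight=0),
    "CLS": MetricConfig(good=0.1, minimum=0, unit="", weight=25),
    "FCP": MetricConfig(good=1800, minimum=500, unit="ms", weight=20),
    "INP": MetricConfig(good=200, minimum=100, unit="ms", weight=25),
    "TTFB": MetricConfig(good=800, minimum=200, unit="ms", weight=5),
}

# Sanity checks: the weights add up to 100 and FID no longer counts.
total_weight = sum(config.weight for config in METRICS.values())
fid_is_ignored = METRICS["FID"].weight == 0
```

With FID at zero, the three future Core Web Vitals (LCP, INP and CLS) each carry 25 of the 100 points.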
A low weight for TTFB
Our theory behind the weights of TTFB and FCP is as follows:
- TTFB won't be noticed by users, FCP will (so, TTFB mainly is interesting for technical stakeholders);
- CrUX shows TTFB as experimental (for reasons I agree with);
- FCP can make up for bad TTFB;
- I acknowledge that it doesn't work the other way around;
- but it's then still FCP that users will see first.
And because the moment of first engagement is important for UX, we gave FCP quite a high weight.
Calculating the percentage
The next step is to calculate the percentage per metric. When a metric reaches its absolute minimum, we reward that metric with the maximum score (100%). Using the same logic, a metric that reaches its absolute maximum gets 0%, and the score won't drop any lower than that.
Copying Lighthouse
We ended up using the way Lighthouse visualizes its scores, because chances are most stakeholders are already familiar with it.
This means we matched the Core Web Vitals thresholds with their 0, 50 and 90% reasoning:
- An LCP of 2.5 seconds will get you at least a 90% score;
- An LCP of 4 seconds will get you a maximum score of 50%;
- et cetera.
This means that whether you get a 91 or 92% LCP score depends on where between 2.5 and 1.0 seconds your LCP ends up.
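To make the mapping concrete, here is a minimal sketch of a per-metric score. It assumes straight-line interpolation between the anchor points (absolute minimum = 100%, good threshold = 90%, poor threshold = 50%, derived maximum = 0%). The article does not state which curve is used between those points, and the derived maximum of 6000 ms in the example is a placeholder, not a value from our tool.

```python
def metric_score(value, minimum, good, poor, maximum):
    """Map a metric value to a 0-100 score.

    Assumption: linear interpolation between the four anchor points.
    """
    if value <= minimum:
        return 100.0
    if value >= maximum:
        return 0.0
    anchors = [(minimum, 100.0), (good, 90.0), (poor, 50.0), (maximum, 0.0)]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x0 <= value <= x1:
            return y0 + (y1 - y0) * (value - x0) / (x1 - x0)
    raise ValueError("anchor points must be in ascending order")


# LCP example: 1000 and 2500/4000 come from the article,
# 6000 is a hypothetical derived maximum.
lcp_at_good = metric_score(2500, 1000, 2500, 4000, 6000)     # 90.0
lcp_at_poor = metric_score(4000, 1000, 2500, 4000, 6000)     # 50.0
lcp_in_between = metric_score(1750, 1000, 2500, 4000, 6000)  # 95.0
```

Under this assumption, an LCP halfway between the absolute minimum and the good threshold lands halfway between 90% and 100%.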
Summing it up
Then, each metric's score is multiplied by its weight. We sum up the results and divide that sum by the total of all used weights, resulting in an awesome score of your real user experience.
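As a sketch, that calculation is a plain weighted average. The weights below are the ones from our table; the per-metric scores are invented for the example, and the function name is hypothetical. Metrics without data or with a zero weight are left out of both the sum and the divisor, which is one reading of "all used weights".

```python
WEIGHTS = {"LCP": 25, "FID": 0, "CLS": 25, "FCP": 20, "INP": 25, "TTFB": 5}


def overall_score(scores, weights):
    """Weighted average of per-metric scores (each 0-100)."""
    used = {
        name: weight
        for name, weight in weights.items()
        if name in scores and weight > 0
    }
    total = sum(used.values())
    if total == 0:
        raise ValueError("no weighted metrics available")
    return sum(scores[name] * weight for name, weight in used.items()) / total


# Invented example scores, not real measurements.
example_scores = {"LCP": 95, "CLS": 100, "FCP": 80, "INP": 90, "TTFB": 60}
result = overall_score(example_scores, WEIGHTS)  # 90.25
```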
Shortcomings and alternatives
Let's not ignore the elephant in the room though: pagespeed is more nuanced than a single number. The fact that Google's Core Web Vitals consist of 3 metrics illustrates this perfectly.
Don't be blindsided
But the following scenarios also illustrate how a single score can leave you unaware of remaining issues:
- LCP
LCP experiences likely vary per template type. A score per page template group would already be a better approach. RUMvision helps with this by using regular expressions to group page data, while still enabling you to zoom in on individual pages.
- FID & INP
FID and INP might look good to go in your aggregated Core Web Vitals data. However, if 20% of your pageviews are requested on low-end devices, those users will have a harder time interacting with your pages. This is one of the reasons why Google recommends supplementing Core Web Vitals with your own RUM.
- TTFB and FCP
FCP and TTFB can easily be impacted by ad traffic, campaigns and traffic spikes, for example. There is no way to spot this with aggregated data, let alone an overall score. But with the right data, it can be a minute's work to spot discrepancies amongst query strings and improve caching rules accordingly.
Google Search Console might help in highlighting issues, especially when issues differ per template type. Most Real User Monitoring solutions can help pinpoint issues and the conditions under which they happen the most.
For example, knowing what condition is to blame will help you to determine how and where you should start fixing it. Because it could either be the user's device, the internet speed, the page, the server, response time of (blocking and/or external) files and even the weather/heat.
Pagespeed is not a single number
I do acknowledge that it's way more convenient for most stakeholders to look at a single pagespeed score to know where they stand. They can then quickly move on to the next todo on their list. That's why some website owners need dedicated pagespeed advocates.
Pagespeed, and especially UX, just can't be communicated via a single number. Pagespeed advocates know this. They help prevent pagespeed and UX edge cases from holding back a merchant's further conversion growth, and regressions from being introduced after deploys.
Alternative approaches
Our approach might not be the right one at all. As a matter of fact, and as illustrated above, there is no optimal way to communicate UX health via a single score. A score per template would already be a better approach. That is easily achieved with RUM data, as grouping page URLs into page template buckets becomes easier. But even then, the individual metrics still matter as well.
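Grouping URLs into template buckets can be sketched with a few regular expressions. The patterns, template names and sample values below are all hypothetical and would differ per site; this is not how RUMvision implements it, just an illustration of the idea.

```python
import re
from collections import defaultdict

# Hypothetical URL patterns per page template; first match wins.
TEMPLATE_PATTERNS = [
    ("home", re.compile(r"^/$")),
    ("product", re.compile(r"^/products/[^/]+/?$")),
    ("category", re.compile(r"^/category/")),
]


def template_for(path):
    """Return the name of the first template whose pattern matches."""
    for name, pattern in TEMPLATE_PATTERNS:
        if pattern.search(path):
            return name
    return "other"


def group_by_template(samples):
    """Group (path, lcp_ms) samples into template buckets."""
    groups = defaultdict(list)
    for path, lcp_ms in samples:
        groups[template_for(path)].append(lcp_ms)
    return dict(groups)


# Invented sample data.
samples = [
    ("/", 900),
    ("/products/blue-shoe", 2100),
    ("/products/red-shoe", 2600),
    ("/category/shoes", 1500),
    ("/about", 1200),
]
groups = group_by_template(samples)
```

Each bucket can then get its own per-metric scores and its own overall score, instead of one number for the whole site.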
Incorporating distribution percentages
There are two types of data that can be used to come up with a score:
- Percentiles
Google assesses Core Web Vitals using the experience at the 75th percentile. So, instead of looking at the best or average experience, Google is looking at the best experience within the group of 25% worst experiences. You could look at the 95th percentile as well, but there will be a point where user conditions might be too challenging to meet the Core Web Vitals thresholds for that group;
- Distributions
Nevertheless, one could still use distribution data of all metrics, as that is publicly available in Google's free-to-use APIs.
Per metric, Google will show the percentage per good, moderate and poor groups. In their API response, you'll find this number in the density field. Even when passing a metric's threshold, you might want to use the percentage of remaining poor experiences.
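As a rough sketch, this is how the density values could be read from a CrUX API response. The dictionary below only mimics the shape of a response for one metric: the 76.5% share of good experiences is the figure discussed in this article, while the moderate and poor shares and the p75 value are made up for illustration. The function name is hypothetical, and it assumes the three histogram bins are returned in good, moderate, poor order.

```python
# Shape modelled on a CrUX API record; the numbers are illustrative only.
sample_response = {
    "record": {
        "metrics": {
            "largest_contentful_paint": {
                "histogram": [
                    {"start": 0, "end": 2500, "density": 0.765},
                    {"start": 2500, "end": 4000, "density": 0.150},
                    {"start": 4000, "density": 0.085},
                ],
                "percentiles": {"p75": 2450},
            }
        }
    }
}


def distribution(response, metric):
    """Return the share of good, moderate and poor experiences."""
    bins = response["record"]["metrics"][metric]["histogram"]
    # Assumption: exactly three bins, ordered good, moderate, poor.
    good, moderate, poor = (float(item["density"]) for item in bins)
    return {"good": good, "moderate": moderate, "poor": poor}


lcp_distribution = distribution(sample_response, "largest_contentful_paint")
not_good = lcp_distribution["moderate"] + lcp_distribution["poor"]
```

Here `not_good` is the share of pageviews that did not get a good LCP, which is exactly the part a single percentile hides.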
Did you know? Google's CrUX APIs allow you to get performance metrics of any domain that has a sufficient amount of data.
You do need to get a Google API key. Or just use one of our free tools to skip the registration and development parts.
Incorporating the distribution data could help prevent stakeholders from being blindsided. The shortcoming of our approach is that users will be able to see a 100% score, even when a big portion of their pageviews might still result in below-par UX.
A screenshot of an LCP value that is passing at the 75th percentile. However, looking at the distribution, we're able to see that it's only passing Core Web Vitals by a little bit, as 76.5% having a good LCP experience is just a bit more than Google's 75% mark.
But as a result, more than 23% of pageviews aren't resulting in an optimal LCP experience, so this merchant might want to continue optimizing the LCP for the sake of conversion.
To prevent this from happening, both the percentile and the percentage of poor plus maybe moderate experiences could be used to determine where between 90% and 100% the score should end up.
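One possible way to do this is sketched below. It is purely an illustration and not something our tool does today: the formula and the function name are assumptions. The idea is that the part of the score above 90% only counts in full when there are no moderate or poor experiences left, and shrinks to nothing when their combined share reaches 25%.

```python
def adjusted_score(score, moderate_share, poor_share):
    """Scale the 90-100 band by the share of non-good experiences.

    Hypothetical formula: scores below 90 are left untouched.
    """
    if score <= 90:
        return score
    non_good = moderate_share + poor_share
    # 0% non-good keeps the full band, 25% or more removes it entirely.
    factor = min(1.0, max(0.0, 1.0 - non_good / 0.25))
    return 90 + (score - 90) * factor


# A 100% score with 23.5% moderate plus poor experiences (made-up split).
barely_passing = adjusted_score(100, 0.150, 0.085)
# The same score with no moderate or poor experiences left.
fully_good = adjusted_score(100, 0.0, 0.0)
```

With this sketch, a site that only just passes the 75% mark ends up close to 90% instead of at 100%.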
But at the same time, by using an absolute minimum on top of Google's thresholds, we're automatically incorporating moderate and poor experiences already. That's because using an LCP of 1000 milliseconds at the 75th percentile as the absolute minimum, would mean that:
- users dealing with more challenging conditions (both device and internet connectivity related, for example) won't end up having an LCP of 1 second;
- but they should still benefit from any optimizations to achieve an LCP of 1 second at the 75th percentile, maybe causing the LCP to be below 2.5 seconds even at the 95th percentile.
The mobile LCP score of rumvision.com at time of writing. A 0.68 second LCP obviously is a very healthy score, but there still are users with a moderate and poor experience.
However, the conversion-related benefits of optimizing the pagespeed beyond this point might not outweigh the costs anymore. You might not even be able to help out those remaining users, depending on their conditions.
Raising the bar
In conversations with fellow pagespeed specialists, we've also heard of an approach where one would only be able to earn a green score when at least all Core Web Vital metrics have a green score as well.
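A minimal sketch of such a gate, assuming "green" means 90% or higher as in Lighthouse, and assuming a missing metric counts as not green. The cap of 89 and all names are hypothetical choices for this example.

```python
CORE_WEB_VITALS = ("LCP", "INP", "CLS")
GREEN = 90  # lower bound of the green range, as in Lighthouse


def gated_score(overall, metric_scores):
    """Cap the overall score just below green unless every CWV is green."""
    all_green = all(
        metric_scores.get(metric, 0) >= GREEN for metric in CORE_WEB_VITALS
    )
    if all_green:
        return overall
    return min(overall, GREEN - 1)


# Invented scores: INP is not green, so the overall 93 is capped.
capped = gated_score(93, {"LCP": 98, "INP": 75, "CLS": 100})
# All three are green, so the overall score is kept as is.
kept = gated_score(93, {"LCP": 98, "INP": 91, "CLS": 100})
```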
Conclusion
The conclusion is very simple: there is no right way here. And different approaches are possible. We might implement some -or even all- of them one day.
Meanwhile, we will continue to emphasize that it's important to not be blindsided by a single score. This way, website owners are less likely to leave money -or conversions in general- on the table.