Learning to Crawl: Site Performance and Googlebot

Angry Googlebots at a maternity ward.

You never forget the birth of your first child. For me, this meant sitting in the delivery room, glued to Slack on my phone. I was spending my last hours before paternity leave triaging the performance issues that had been plaguing TeePublic for months.

The problem had started, poetically, nine months earlier. While my son grew week over week, so did TeePublic’s server response times. TeePublic was ‘slow for a fast site’, with the average response time on our product detail page floating just north of 300ms. That page represented 80% of our traffic, and its numbers cleared our bar for ‘good enough’, freeing our small development team to focus elsewhere.

TeePublic’s On-Demand Headaches

TeePublic is an artist marketplace where anyone can upload original art and get a storefront offering that artwork for sale across T-shirts, hoodies, phone cases, and more. By 2020, we had millions of designs across over a dozen product categories, each with potentially hundreds of SKUs. TeePublic used Print-on-Demand technology for fulfillment; whereas a traditional e-commerce platform would need to print and hold everything it sold, TeePublic could wait until after a customer purchased to create the finished product. The result was a product catalog with billions of SKUs.

But in January 2020, our application's product detail page was struggling. We hadn't neglected it; in fact, improving the product page's performance had become a semi-annual tradition. Yet every time we shipped a fix, it was only a matter of months, sometimes just weeks, before those gains evaporated and the site returned to its previous levels. Every millisecond of improvement we created was eaten up by more and more requests.

For most sites, more traffic meant more users. It would be one thing if our site were overwhelmed by a t-shirt-hungry public—a situation that would arise, unbeknownst to us, with the Spring 2020 Covid-driven e-commerce surge. But no, this traffic wasn't customers. This crush was coming from an insatiable, content devouring beast: Google.

Death by Google Crawl

TeePublic was not an e-commerce site. Despite the t-shirts that eventually made their way to doorsteps across the country, TeePublic was an SEO machine. Like a hydroelectric dam pulling electricity from the flow of water, TeePublic was a machine built to turn organic Google searches into American Dollars. Everything about the site, from its home page to navigation to search pages, was designed to maximize crawl and discoverability for Google. Actual human users were a secondary concern.

By January 2020, we had succeeded too well, and it was causing problems. Google’s crawlers were crushing us. Most of the time, the site was ‘slow fast’, crawl was steady, and Google was a Good Bot. But there were stretches, lasting hours or days, when Google's usual restraint would vanish and Googlebot would begin mercilessly hammering our site. The number of bot requests would double or even triple. Our ‘slow but working’ site would start timing out, creating real usability problems for our human customers. We called this state “Hell Crawl” and were baffled as to why it happened and how we could hope to weather the storm it brought.

We needed a fix.

What About Caching?

The first and most obvious solution is to cache the requested content. Rather than render the page every time it’s needed, render it once, store it somewhere close to where you can re-use it: in the browser, on a CDN, in a memory store. The fastest code to run is no code at all. This wouldn’t work for us, though.
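For context, the obvious version of that in Rails is only a few lines. A minimal sketch, with the controller, cache key, and ten-minute TTL all invented for illustration:

```ruby
class ProductsController < ApplicationController
  def show
    # Let the CDN and browser hold onto the rendered page instead of
    # asking Rails to build it again. The ten-minute TTL is made up.
    expires_in 10.minutes, public: true

    # Or keep the expensive fragment in a shared memory store.
    @product = Product.find(params[:id])
    @rendered_details = Rails.cache.fetch(["product-details", @product], expires_in: 10.minutes) do
      render_to_string partial: "products/details", locals: { product: @product }
    end
  end
end
```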

First, the volume of cached data would be enormous. Billions of rendered pages would amount to terabytes of data to cache. Even with the strategies we had previously employed to compress our data and store only the unique datasets, we were only ever able to cache a small fraction of our platform’s pages. Worse, our page space was quite literally the pessimal case for caching. Our target customer, Google, only ever loaded a page once. We'd cache a page only after it was needed, for a single visitor who was unlikely to request it again anytime soon.

Why Not Throttle Bots?

The next solution is to block or throttle bot traffic. After all, it wasn’t like Google’s bots were hiding their intent. We had a web application firewall in place; stemming the deluge would be trivial. Unfortunately, that cure was worse than the disease.
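And ‘trivial’ really did mean trivial. A rule like the following, sketched here with the Rack::Attack gem rather than our actual WAF configuration, and with an invented request budget, would have capped Googlebot at the edge:

```ruby
# config/initializers/rack_attack.rb
# A sketch of edge throttling with the Rack::Attack gem. The 600-requests-
# per-minute budget is an invented number, not a recommendation.
Rack::Attack.throttle("googlebot crawl budget", limit: 600, period: 1.minute) do |request|
  # Return a discriminator (here, the client IP) to count the request
  # against the budget; return nil to let it through untouched.
  request.ip if request.user_agent.to_s.include?("Googlebot")
end
```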

TeePublic loved crawl. We wanted it. We needed it. And Google would interpret any throttling, redirecting, or routing as a sign that we didn’t want it, and its crawlers would happily go elsewhere. The bot-induced performance issues were bad, but halting the traffic would be worse. It would’ve meant deliberately breaking the SEO-to-Dollars engine we’d spent years building.

We were trapped in SAW XVII: Nerd Stuff, where a crawl-hungry website had its gluttony punished with a never-ending crush of Googlebots.

Two Steps Forward, Two Steps Back

All that winter, the performance issues grew. Our tune-and-tweak approach signaled to Google that we could handle slightly more traffic, which it gladly supplied. An equilibrium formed between response time and crawl traffic: any improvement we made manifested as more requests to process. By spring, we realized our scattershot approach wasn’t going to work, and we were no closer to solving “Hell Crawl”.

Our work exposed the architectural issues that kept us from improvements at the scale we needed. TeePublic was hosted on Heroku. While the service was good for smaller sites, its offering lacked certain flexibility; we couldn’t, for example, route Googlebot requests to their own servers. If we added server capacity, we’d flood the database. Heroku’s request queueing infrastructure created a positive feedback loop in which falling behind on request volume made future requests even slower. We couldn’t improve database performance because the bottleneck wasn’t on the server: it was inside Heroku Postgres’ single-threaded PgBouncer process. And we couldn’t split reads across multiple databases without rearchitecting elements of the application and upgrading Rails. Even then, those read databases would cost money.

But what choice did we have? We got to work.

The site was rendering product pages at an average of 800ms for 80,000 requests per minute in January. With hard work, we were rendering product pages at an average of 800ms for 130,000 requests per minute by the summer. For all the impact we had made, we had made no impact. And we still couldn’t handle “Hell Crawl”.

A Light at the End of the Birth Canal

We were self-taught in the art of Rails performance; I realized we needed an expert. We brought in Nate Berkopec from Speedshop, a widely recognized guru on Rails performance. Beyond his work helping us interpret our application’s performance profile, we also enrolled our entire engineering team in his workshop. We knew what we were doing when the work was obvious; Nate showed us the non-obvious work, and the ways in which even seemingly small issues would manifest as big problems.

Instead of trying to ‘catch up’ to our current crawl demand, we set a goal of 200,000 rpm and worked backwards to determine the capacity we needed. With x processes, running y threads, at an average response time of z... we could estimate our capacity and needs before writing a line of code. Even better, the work we’d done to understand our architecture meant our technical roadmap was exactly where it needed to be for the refactors and upgrades we would need to fully support our forecasted demand. We had planned the major upgrades for the end of September, right before my paternity leave.
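For the curious, that back-of-the-envelope math fits in a console session. Every figure below is invented for illustration rather than our real configuration:

```ruby
# Back-of-the-envelope capacity: how many requests per minute can the
# fleet absorb? All inputs are illustrative, not TeePublic's real setup.
servers           = 40     # dynos
processes         = 4      # Puma workers per dyno
threads           = 5      # threads per worker
avg_response_time = 0.3    # seconds

concurrent_slots    = servers * processes * threads        # requests in flight at once
requests_per_minute = concurrent_slots * (60 / avg_response_time)

puts requests_per_minute # => 160000.0, before leaving any headroom for spikes
```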

A New Standard is Born

Or at least that was the plan. A few days into September, a routine prenatal checkup turned into a hospital visit for high blood pressure, which turned into a baby being born 2 days later. As much as I wanted to keep tabs on our performance work, my attention was needed elsewhere.

Luckily, our team was on it. Thanks largely to our excellent Director of Engineering, the performance work stayed right on track. Throughout September and October, the team launched the major upgrades that would reset our bar for performance: a product page refactor that dramatically reduced DB calls; a Rails upgrade that unlocked read database support; and the integration of a read database into the application to get around our PgBouncer bottleneck. The team managed to scale us past Googlebot's appetite just ahead of the November holiday traffic crunch we knew was coming.
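The read-database piece is worth sketching. With the Rails upgrade in place, routing reads to a replica looks roughly like this; the database, model, and controller names are placeholders rather than our actual code:

```ruby
# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # The keys name entries in config/database.yml; "primary_replica" is a
  # placeholder for whatever the follower database is actually called.
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# A hypothetical read-only action pinned to the replica, sparing the
# primary (and its single-threaded PgBouncer) the extra connections.
class ProductsController < ApplicationController
  def show
    ActiveRecord::Base.connected_to(role: :reading) do
      @product = Product.find(params[:id])
      render :show
    end
  end
end
```

Rails 6 can also switch roles automatically per request; either way, read traffic stops competing for the primary's connections.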

Lessons on Crawl Performance

Be Proactive with Site Capacity

Our engineering team's approach to determining site capacity was unsophisticated. We knew roughly what traffic we could expect during peak, and roughly what the performance profile of the site was, and that was good enough. When traffic quickly exceeded those estimates, we were constantly playing catch-up without a fixed target.

With Speedshop, we developed a proactive approach to site capacity. Little’s Law let us understand the relationship between response time, request volume, and capacity. Once we understood how our site would perform, we could set our capacity correctly or target specific response times for the demand we forecast.
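The law itself is one line: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends being served. Applied to a 200,000 rpm target at a 200ms average response time:

```ruby
# Little's Law: requests in flight = arrival rate * time in system
arrival_rate   = 200_000 / 60.0  # a 200,000 rpm target, in requests per second
time_in_system = 0.200           # a 200ms average response time

puts (arrival_rate * time_in_system).round # => 667 requests in flight, plus headroom
```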

Average is Overrated

TeePublic was ‘fast enough’ on average, but our reliance on averages hid plenty of performance issues we should’ve addressed sooner. Our 90th, 95th, and 99th percentile response times were far worse than a normal distribution would’ve allowed, but we convinced ourselves these were rare events that didn’t warrant our attention. In reality, given the volume of requests TeePublic was handling and the complexity of serving web requests, we were letting a huge number of slow requests sabotage our web servers.

Googlebot is Multiple

Our logs told us we were being hammered by Googlebot, but there was nuance in where that Googlebot traffic came from.

Google uses Googlebot for three purposes. The first two are the obvious ones: validation and crawl. Validation comes from submitting your site to Google via sitemaps or Search Console; Google is looking to understand your site and index it. Next is crawl: the organic process of Googlebot browsing your site, finding links, and seeing where they go. Both of these sources are generally good citizens, mindful not to crush your site.

There’s a third use of Googlebot though: shopping feed validation. When an e-commerce site submits content to Google Shopping, a more specialized version of Googlebot checks the page you attached to a product listing. Unlike validation or crawl, which tune their willingness to crawl based on your site’s performance, Google Shopping’s Googlebot is going to validate every page you submitted, performance be damned.

This was the source of our “Hell Crawl”. While we battled these performance issues, another engineering team had been working with Marketing to vastly expand the volume of products we listed in Google Shopping. The crushing crawl we thought was happening randomly was actually fairly predictable Google Shopping crawl, triggered by major updates to our marketing feeds.

Control Your Crawlable Surface Area

Make sure you’re exposing only the pages you want crawled, and review your logs to verify crawl is going only where you expect. Don’t assume Googlebot will respect every boundary or understand your page intent, either.

Googlebot will respect noindex directives, but that doesn’t mean Google won’t crawl the page. nofollow might stop an organic crawl from clicking through a link, but if Googlebot finds that link anywhere else on your site, it’s going to dive right in. Query strings are an easy way to complicate your crawlable surface area, and Google will treat those URLs as first-class crawl targets; setting a canonical meta value points Google at a preferred indexing target but doesn’t stop the crawling.

And don’t trust that your site is being crawled correctly! Review your logs and see where crawl is actually going.
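Even a crude tally is revealing. A sketch, assuming a combined-format access log at an invented path, and trusting the user agent string (strictly speaking, you’d confirm real Googlebot traffic with a reverse-DNS lookup):

```ruby
# Tally which URLs Googlebot is actually hitting, most-crawled first.
tallies = Hash.new(0)

File.foreach("/var/log/nginx/access.log") do |line|
  next unless line.include?("Googlebot")
  # e.g. ... "GET /some-page?locale=en HTTP/1.1" ...
  if (request = line[/"(?:GET|HEAD) ([^ "]+)/, 1])
    tallies[request] += 1   # keep the query string; that's often where the mess hides
  end
end

tallies.sort_by { |_url, count| -count }.first(20).each do |url, count|
  puts format("%8d  %s", count, url)
end
```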

An incident with a client reemphasized that. What started as routine Rails performance work turned out to have its root cause in Googlebot traffic to malformed category URLs: some-page?locale=encategories, some-page?locale=encategories/categories/categories/, and more, with the word categories repeated dozens, sometimes hundreds, of times.

The source was a bit of sloppy Rails link building: user_path(@user) + '/categories'. Any query string already on the page would be included in the output of user_path, which would then have the /categories path appended after it. Visiting the malformed page would show Googlebot a link to yet another malformed URL, which our many bot friends would dutifully crawl, finding another, more deeply nested bad category link. In the end, bots trying to navigate this tangle of recursive category links were making up 80% of the site’s traffic.
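The fix was to stop doing string surgery on generated URLs and let the router build the whole path. Roughly, with a nested route and helper name that are illustrative rather than the client's actual code:

```ruby
# Before: string concatenation on a generated path. If user_path carries a
# query string (say, a locale), "/categories" lands inside the query value:
#   "/users/jane?locale=en" + "/categories" #=> "/users/jane?locale=en/categories"
link_to "Categories", user_path(@user) + "/categories"

# After: declare the nested route and let the router build the path.
# config/routes.rb
resources :users do
  resources :categories, only: :index
end

# The helper keeps the path and the query string separate, so there is
# nothing malformed for Googlebot to discover and recurse into.
link_to "Categories", user_categories_path(@user)
```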

The Aftermath

By the end, our site hadn’t just met our crawl challenges; it had demolished them. We’d load tested to over 250,000 requests per minute, more than double our current capacity, while staying at a frosty 200ms average. Finally, almost miraculously, we'd built more request capacity than Google was interested in using. But more importantly, we’d reset our own expectations for performance and leveled up our team to meet them. We’d learned a lot about how Googlebot worked, and how to make sure it worked for us. We'd learned how to crawl.

Just like my son would be doing.

cue an enthusiastic, teary-eyed standing ovation