Beating captchas and saving seconds

A fun internship story about beating captchas and speeding things up.


One of the hardest (and most fun) problems I’ve solved came up during the second week of my internship at Commenda (summer '23). The product lead casually mentioned an ongoing bounty that had been open for a few months and asked if I wanted to take a shot. I said sure, not really knowing what I was getting into.

The task was to check whether a company name was already registered in Delaware. The previous engineer had written a script that scraped data from Stripe. It used fake cursor movements to dodge bot detection and took around 25 seconds per request. To make things worse, Stripe would log the account out every 7 days, so someone had to manually log back in. Not ideal.
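For context, faking cursor movement in a browser automation tool like Puppeteer looks roughly like the sketch below (this isn't the original script; the URL and coordinates are made up). Every "human-like" move is a series of small, delayed steps, which is presumably where a lot of those 25 seconds went.

```ts
import puppeteer from "puppeteer";

// Illustration only: human-looking cursor movement is just lots of tiny steps.
async function humanLikeSearch() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://dashboard.stripe.com", { waitUntil: "networkidle2" });

  // Each move is broken into many intermediate steps to look less like a bot,
  // and every step costs time.
  await page.mouse.move(200, 300, { steps: 40 });
  await page.mouse.move(640, 420, { steps: 40 });
  await page.mouse.click(640, 420);

  await browser.close();
}
```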

I wrote a simple script using Puppeteer to fill the forms directly on the Delaware website, but of course, there was a captcha. I tried every image-to-text and OCR model I could find (this was before GPT-4 had vision, and there weren't any really good vision models around). I even scraped a bunch of captcha images and trained a model on them with some of the top open-source OCR tools on GitHub, but no luck. (Funnily enough, I later found out this is totally possible; the write-up is such a good read.)
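Stripped down, the form-filling part looked something like this. The selectors here are placeholders rather than the real ones, and the search URL is from memory and may have changed:

```ts
import puppeteer from "puppeteer";

async function checkName(name: string): Promise<boolean> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Delaware's public entity name search (URL from memory; may have changed).
  await page.goto(
    "https://icis.corp.delaware.gov/Ecorp/EntitySearch/NameSearch.aspx",
    { waitUntil: "domcontentloaded" }
  );

  await page.type("#entityName", name);       // placeholder selector
  // ...this is the point where the captcha got in the way...
  await page.click("#searchButton");          // placeholder selector
  await page.waitForSelector("#results");     // placeholder selector

  // Treat any hit in the results table as "already registered".
  const taken = await page.$eval(
    "#results",
    (el) => !(el.textContent ?? "").includes("No records")
  );

  await browser.close();
  return taken;
}
```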

Out of curiosity (and a bit of desperation), I right-clicked the captcha image and hit "Search image with Google". Boom! Google got it right on the first try! So I started sending the captchas to the Google Cloud Vision API instead, and got perfect results. But the whole thing still took around 17-18 seconds per request, which was way too slow.
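Wiring that up with the official Node client (@google-cloud/vision) is only a few lines. A minimal sketch, assuming credentials come from GOOGLE_APPLICATION_CREDENTIALS:

```ts
import { ImageAnnotatorClient } from "@google-cloud/vision";

const client = new ImageAnnotatorClient();

// Run plain OCR on a captcha screenshot and return the detected characters.
async function solveCaptcha(image: Buffer): Promise<string> {
  const [result] = await client.textDetection({ image: { content: image } });
  // The first annotation holds the full detected text for the image.
  const text = result.textAnnotations?.[0]?.description ?? "";
  return text.replace(/\s+/g, "");
}
```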

The Delaware site itself was painfully slow, even in a normal browser. But I didn't need the full page, just the document itself, not the rest of the page's resources. So I blocked all the unnecessary stuff like images, stylesheets, and fonts (thank God for Puppeteer's request interception), which brought the time down. But the timings were still inconsistent, sometimes taking way longer. Turns out, when a captcha failed, the site regenerated a new one, and that reload was the real bottleneck. So I ran two or three Puppeteer instances in parallel and used whichever finished first. That made the process way more consistent, and I managed to get request times down to 4-5 seconds.
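Both tricks are short in Puppeteer. A rough sketch, with checkName being the form-filling function from earlier and the captcha image itself let through so it can still be solved:

```ts
import type { Page } from "puppeteer";

// Drop images, stylesheets and fonts; keep the document and anything else needed.
// In practice the captcha image URL has to be whitelisted so it can be solved.
async function blockHeavyResources(page: Page): Promise<void> {
  await page.setRequestInterception(true);
  page.on("request", (req) => {
    const type = req.resourceType();
    if (type === "image" || type === "stylesheet" || type === "font") {
      req.abort();
    } else {
      req.continue();
    }
  });
}

// Race a few independent attempts and take whichever resolves first.
// Promise.any ignores attempts that fail (e.g. a bad captcha solve forcing a reload).
async function checkNameFast(name: string, attempts = 3): Promise<boolean> {
  return Promise.any(Array.from({ length: attempts }, () => checkName(name)));
}
```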

It still felt slow, though, and I wanted to understand why, even when fetching just the document. I used dig and curl (against ipinfo.io) to trace which server was actually serving the responses, and found their infra was closest to the us-west-2 region. So I hosted our Puppeteer script there. That shaved off even more time, bringing the total down to just 2 seconds in production.
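If you want to do the same check without leaving Node, something like this gives you roughly what dig and curl ipinfo.io do (the hostname here is a placeholder):

```ts
import { resolve4 } from "node:dns/promises";

// Placeholder: whatever host actually serves the search responses.
const HOST = "icis.corp.delaware.gov";

async function whereIsItHosted(): Promise<void> {
  const [ip] = await resolve4(HOST);                        // what dig reports
  const res = await fetch(`https://ipinfo.io/${ip}/json`);  // what curl ipinfo.io reports
  const info = await res.json();
  console.log(ip, info.city, info.region, info.org);        // rough location + hosting org
}
```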

For context, the Delaware site took around 6.5 seconds to fully load in a browser (tested at the time of writing this, from my home in Kozhikode, India). But now, we were pulling the needed data in just 2 seconds.

I remember hopping on a call with the CEO afterward. He didn’t quite believe it at first and said we should let it run in production for a week to be sure. And it held up. That whole moment felt surreal. I got the bounty, and a return offer later.
