Beating captchas and saving seconds

A fun internship story about beating captchas and speeding things up.


One of the hardest (and most fun) problems I’ve solved was during the second week of my internship at Commenda (summer '23). The product lead casually mentioned there was an ongoing bounty—one that had been open for a few months—and asked if I wanted to take a shot. I said sure, not really knowing what I was getting into.

The task was to check if a company name was already registered in Delaware. The previous engineer had written a script that scraped data from Stripe. It used fake cursor movements to dodge anti-bot systems and took around 25 seconds per request. To make things worse, Stripe would log out the account every 7 days, so someone had to manually log back in. Not ideal.

I wrote a simple script using Puppeteer to fill the forms directly on the Delaware website, but of course, there was a captcha. I tried all sorts of image-to-text and OCR models I could find (this was before capable vision models were widely available). I even scraped a bunch of captcha images and trained a model using some of the top open-source OCR tools on GitHub, but no luck. (Funnily enough, I later found out this is totally possible, and the write-up is such a good read.)
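For flavour, here's roughly what the direct approach looked like. This is a minimal TypeScript sketch, not the production script: the URL and the selectors (`#entityName`, `#searchBtn`, `#results`) are hypothetical placeholders, and it ignores the captcha step that was about to derail everything.

```ts
import puppeteer from "puppeteer";

// Minimal sketch of filling the Delaware name-search form directly.
// The URL and selectors are hypothetical placeholders, not the real page's IDs.
async function checkName(name: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto("https://example-delaware-search.gov/namesearch"); // placeholder URL
    await page.type("#entityName", name);   // hypothetical selector
    await page.click("#searchBtn");         // hypothetical selector
    await page.waitForSelector("#results"); // hypothetical selector
    return await page.$eval("#results", (el) => el.textContent ?? "");
  } finally {
    await browser.close();
  }
}
```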

Out of curiosity (and a bit of desperation), I right-clicked the captcha image and hit “Search image with Google.” Boom—Google got it right on the first try. So I sent the captcha image to the Google Cloud Vision API, and it worked really well. But the whole thing still took around 17–18 seconds per request. Way too slow.
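The captcha step then became a single API call. Here's a sketch using the official `@google-cloud/vision` Node client, assuming credentials are configured via `GOOGLE_APPLICATION_CREDENTIALS` and the captcha image has already been saved to disk (e.g. with Puppeteer's `elementHandle.screenshot()`):

```ts
import vision from "@google-cloud/vision";

// Send a captcha screenshot to Cloud Vision's text detection and
// return the detected text with whitespace stripped out.
async function solveCaptcha(imagePath: string): Promise<string> {
  const client = new vision.ImageAnnotatorClient();
  const [result] = await client.textDetection(imagePath);
  // The first annotation, when present, is the full detected text.
  const fullText = result.textAnnotations?.[0]?.description ?? "";
  return fullText.replace(/\s+/g, "");
}
```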

I noticed the Delaware site was just... slow, even in my browser. That’s when I realised: I didn’t need the full page—just the document data. So I blocked all the unnecessary stuff like images, styles, and fonts (thank God for Puppeteer), which brought the time down to 4–5 seconds. Progress, but not enough.
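The mechanism here is Puppeteer's request interception, which lets you abort requests by resource type before they ever hit the network. Something along these lines:

```ts
import { Page } from "puppeteer";

// Abort heavyweight resources so only the document (and any XHR/fetch
// calls) actually load. This is the standard Puppeteer interception pattern.
async function blockHeavyResources(page: Page): Promise<void> {
  await page.setRequestInterception(true);
  page.on("request", (req) => {
    if (["image", "stylesheet", "font", "media"].includes(req.resourceType())) {
      req.abort(); // we never render the page, so none of this is needed
    } else {
      req.continue();
    }
  });
}
```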

Sometimes it still took longer, 10 seconds in some cases. Turns out, when the captcha solve failed, the script had to click retry, eating up extra time. So I ran two or three Puppeteer instances in parallel and used whichever finished first. That made the process way more consistent, with request times holding steady at around 4–5 seconds.
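Conceptually this is just a race between independent attempts: `Promise.any` settles with the first one to succeed and ignores attempts that reject along the way. A sketch, reusing the hypothetical `checkName` from earlier:

```ts
// Launch three independent attempts and take the first one to succeed.
// Promise.any ignores rejections (e.g. a failed captcha forcing a retry)
// unless every attempt fails.
async function checkNameFast(name: string): Promise<string> {
  return Promise.any([checkName(name), checkName(name), checkName(name)]);
}
```

In a real version you'd also want to close the losing browser instances once a winner comes back, or the strays pile up fast.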

Still, I felt we could do better. I wanted to understand why it was slow even when fetching just the document. I used dig and a curl request to ipinfo.io to trace which servers were serving the responses, and found their infra was closest to AWS's us-west-2 region. So I hosted our Puppeteer script there. That shaved off even more time, bringing the total down to just 2 seconds in production.

For context, the Delaware site still takes around 6.5 seconds to fully load in a browser (from my home in Kerala, India :) But now we were pulling the data we needed in just 2 seconds.

I remember hopping on a call with the CEO afterward. He didn’t quite believe it at first and said we should let it run in production for a week to be sure. And it held up. That whole moment felt surreal. I got the bounty, and a return offer later.
