Published on

How to use the headerless browser to fetch the HK KMB ETA data? (Fight with the reCAPTCHA)

Authors
  • avatar
    Name
    Adam Liu
    Twitter

In my previous post say-byebye-to-the-eta-features-in-my-wechat-miniprogram, I mentioned a solution to fetch the ETA data of HK KMB. The solution is to fetching simulate a real browser behavior by using the headerless browser.

What's is the headerless browser?

The headerless browser is a kind of special browser without the user interface(UI). The typical use cases are:

  1. Test automation for the website E2E testing. (Cypress.io is a good example)
  2. Screenshot capturing, web crawler...

The headerless browser allows us to automatically control the browser behavior. Now we can easily hack the browser by using some Node.JS library Puppeteer.

Before using Puppeteer, you should know an important point. Puppeteer just provided you a Node.JS API to automated the Browser (Chrome or Chromium) over a browser Protocol called DevTools. Specifically, for Puppeteer, there're two libraries puppeteer vs puppeteer-core.

For puppeteer library, the command npm install will install the Chromium automatically. For local development install puppeteer directly will make it easier to run your code. Since the package size will very big including the Chromium, the total project will exceed the package size limitation if running in AWS Lambda. So production the production

Let's start my story.


The 1st attempt

The implementation is easy. You can get the code in my Github: https://github.com/adam0x01/kmb-eta-api/blob/master/handler.js#L59-L81

In my first attempt, I tested the above function locally and then exposed the function as an HTTP API endpoint. For deployment, I hosted the HTTP server in an EC2 instance (same as the server of this blog) and used the PM2 process to manage the process. Looks like anythings good and smooth. My Wechat mini-program was back to normal. 😎

One day later, as usual, I opened my mini-program. The bus ETA data was gone and empty... WTF?😢

When SSH to my EC2 server, I got a lot of error in PM2 log file:

/?route=E42&bound=1&seq=2&bsicode=SH05-T-0825-2
{
  Error: 'Captcha validation error',
  err_message: '',
  msg: '{\n' +
    '  "success": true,\n' +
    '  "challenge_ts": "2020-12-02T13:25:33Z",\n' +
    '  "hostname": "search.kmb.hk",\n' +
    '  "score": 0.1,\n' +
    '  "action": "get_eta"\n' +
    '}'
}

I tested the function in my local computer immediately. It works... How come it can't work in EC2?

The Ultimate Guide to Local Development Setup | by Zico Deng | Medium

My code only works on my machine..

My first guess was google reCAPTCHA would learn that there was a lot of abnormal traffic from a single IP address.


The 2n attempt

Let's me just show you my code: https://github.com/adam0x01/kmb-eta-api

So why not run the function in AWS Lambda?

This the AWS IP address range. Search your AWS region you can find the possible outbound IP address. I think it's impossible for Google reCAPTCHA to block all the AWS IP.

Important: To enjoy the AWS IP address pool, please ensure your AWS Lambda is not associated with any VPC subnet. Without specified VPC, the lambda will be executed in the default system-managed virtual private cloud. There's an option to disable your Lambda VPC.

null

Lambda settings to configure the VPC.

During the deployment of the lambda, I have encountered an AWS Lambda deployment error using the serverless package. The entire lambda function package size is more than 250 MB, which exceeds the limitation of lambda.

How to resolve the AWS Lambda package size limitation?

If you want to run your puppeteer function in AWS Lambda, you should use the lambda layer to install Chromium. For the lambda body import puppeteer-core. You can get help for installing the lambda layer https://github.com/alixaxel/chrome-aws-lambda#aws-lambda-layer.


The follow-up

After my 2nd attempt, my WeChat mini-program function was resumed again. It has been running without errors for half a month. 👍

But a new issue is coming. The headerless browser in lambda performance is really bad. For each API call, the average response time is around 10 seconds.

For the bad performance, there're some reason:

  1. Lambda cold starts (High probability, some package can Warmup the lambda periodically)
  2. Headerless browser instance can't be shared and need to re-initialized for each API call. (Very high probability)
  3. Too many networks back and forth between Lambda and KMB server.
  4. Lambda not enough RAM.

Before taking action, there's good practice to measure the lambda performance so that our effort did not spend in the wrong direction.

For lambda performance monitoring, I suggest the Datadog or Newrelic. Here's a post introducing how deeply Datadog can monitor https://www.datadoghq.com/blog/key-metrics-for-monitoring-aws-lambda/. I think Newrelic can do a similar thing to Datadog, especially after NewRelic released the new platform and pricing schema New Relic One.

Thanks for reading.