How to use a headless browser to fetch the HK KMB ETA data? (A fight with reCAPTCHA)

In my previous post, say-byebye-to-the-eta-features-in-my-wechat-miniprogram, I mentioned a possible way to fetch the HK KMB ETA data: simulate real browser behavior with a headless browser.

What is a headless browser?

A headless browser is a browser without a user interface (UI). Typical use cases are:

  1. Test automation for the website E2E testing. ( is a good example)
  2. Screenshot capturing, web crawler…

A headless browser lets us control browser behavior programmatically. Nowadays this is easy to do with a Node.js library such as Puppeteer.

Before using Puppeteer, you should know one important point. Puppeteer only provides a Node.js API for automating a browser (Chrome or Chromium) over the DevTools Protocol. Specifically, Puppeteer comes as two packages: puppeteer and puppeteer-core.

With the puppeteer package, npm install downloads Chromium automatically, so installing puppeteer directly makes local development easier. However, the package becomes very large once Chromium is included, and the whole project will exceed the package size limit if it runs in AWS Lambda. So for production, puppeteer-core, which does not download Chromium, is the better fit.
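As a rough illustration of what driving a browser through Puppeteer looks like, here is a minimal sketch. The helper name `fetchPageTitle` is hypothetical and is not taken from the original code; the Puppeteer module is passed in as a parameter so the same function works with either `puppeteer` or `puppeteer-core`.

```javascript
// Minimal sketch: open a page in a headless browser and read its title.
// `puppeteer` is injected, so either require("puppeteer") or
// require("puppeteer-core") can be passed in.
async function fetchPageTitle(puppeteer, url, launchOptions = {}) {
  const browser = await puppeteer.launch({ headless: true, ...launchOptions });
  try {
    const page = await browser.newPage();
    // Wait until the network is mostly idle so client-side scripts have run.
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.title();
  } finally {
    // Always close the browser, otherwise the Chromium process leaks.
    await browser.close();
  }
}
```

With the full package this could be called as `fetchPageTitle(require("puppeteer"), "https://example.com")`. With `puppeteer-core`, `launchOptions.executablePath` has to point at an existing Chrome or Chromium binary.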

Let’s start my story.

The 1st attempt

The implementation is easy. You can get the code in my Github:

In my first attempt, I tested the above function locally and then exposed it as an HTTP API endpoint. For deployment, I hosted the HTTP server on an EC2 instance (the same server as this blog) and used PM2 to manage the process. Everything looked good and smooth. My WeChat mini-program was back to normal. 😎

One day later, as usual, I opened my mini-program. The bus ETA data was gone and empty… WTF?😢

When I SSHed into my EC2 server, I found a lot of errors in the PM2 log file:

  Error: 'Captcha validation error',
  err_message: '',
  msg: '{\n' +
    '  "success": true,\n' +
    '  "challenge_ts": "2020-12-02T13:25:33Z",\n' +
    '  "hostname": "",\n' +
    '  "score": 0.1,\n' +
    '  "action": "get_eta"\n' +

I tested the function on my local computer immediately. It worked… How come it didn’t work on EC2?

The Ultimate Guide to Local Development Setup | by Zico Deng | Medium
My code only works on my machine..

My first guess was that Google reCAPTCHA had noticed a lot of abnormal traffic coming from a single IP address.
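The telling field in the log is `score`. reCAPTCHA v3 rates each request from 0.0 (very likely a bot) to 1.0 (very likely a human), so 0.1 means the request was treated as automated even though the token itself was valid. Below is a small sketch of how such a verdict can be read. The function name `assessCaptcha` is made up for illustration, and the 0.5 threshold is only Google's suggested default; the threshold the KMB server actually uses is unknown.

```javascript
// Parse the JSON string found in the `msg` field of the error log and
// decide whether the request looked human enough.
// ASSUMPTION: 0.5 is reCAPTCHA v3's suggested default threshold; the
// value used by the KMB server is not known.
function assessCaptcha(msgJson, threshold = 0.5) {
  const verdict = JSON.parse(msgJson);
  return {
    score: verdict.score,
    action: verdict.action,
    // `success: true` only means the token was valid, not that the
    // score was high enough to be trusted.
    accepted: verdict.success === true && verdict.score >= threshold,
  };
}

// Same fields as in the PM2 log above.
const logged = JSON.stringify({
  success: true,
  challenge_ts: "2020-12-02T13:25:33Z",
  hostname: "",
  score: 0.1,
  action: "get_eta",
});

console.log(assessCaptcha(logged));
// → { score: 0.1, action: 'get_eta', accepted: false }
```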

The 2nd attempt

Let me just show you my code:

So why not run the function in AWS Lambda?

This is the AWS IP address range. Search for your AWS region and you can find the possible outbound IP addresses. I think it’s impossible for Google reCAPTCHA to block all the AWS IPs.

Important: To benefit from the AWS IP address pool, make sure your Lambda function is not associated with any VPC subnet. Without a specified VPC, the Lambda runs in the default system-managed virtual private cloud. There’s an option to disable the VPC for your Lambda.
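To get a feeling for how large that pool is, the published list can be filtered by region. AWS publishes its ranges as JSON at https://ip-ranges.amazonaws.com/ip-ranges.json, with a `prefixes` array whose entries carry `ip_prefix`, `region` and `service`. The sketch below counts the IPv4 addresses for one region. The helper name `countAddresses` and the sample data are made up, and which `service` value covers Lambda's outbound traffic is an assumption.

```javascript
// Count the IPv4 addresses AWS publishes for one region and service.
// `ranges` has the shape of https://ip-ranges.amazonaws.com/ip-ranges.json.
// ASSUMPTION: "EC2" is used as the default service for illustration only.
function countAddresses(ranges, region, service = "EC2") {
  return ranges.prefixes
    .filter((p) => p.region === region && p.service === service)
    .map((p) => Number(p.ip_prefix.split("/")[1]))
    // A /N IPv4 block contains 2^(32 - N) addresses.
    .reduce((total, prefixLen) => total + 2 ** (32 - prefixLen), 0);
}

// Illustrative sample only: these CIDR blocks are made up, not real AWS data.
const sample = {
  prefixes: [
    { ip_prefix: "10.0.0.0/16", region: "ap-east-1", service: "EC2" },
    { ip_prefix: "10.1.0.0/24", region: "ap-east-1", service: "EC2" },
    { ip_prefix: "10.2.0.0/24", region: "us-east-1", service: "EC2" },
  ],
};

console.log(countAddresses(sample, "ap-east-1")); // → 65792
```

Against the real file the result is an upper bound rather than an exact figure, since published blocks can overlap.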

Lambda settings to configure the VPC.

While deploying the Lambda with the serverless package, I hit a deployment error: the entire function package was larger than 250 MB, which exceeds the Lambda limit.

How to resolve the AWS Lambda package size limitation?

If you want to run your Puppeteer function in AWS Lambda, you should use a Lambda layer to install Chromium, and import puppeteer-core in the function body. You can get help for installing the lambda layer
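Because puppeteer-core ships without a browser, it has to be told where the Chromium binary from the layer lives. Lambda extracts layers under /opt, but the exact path depends on the layer you pick. The sketch below only builds the launch options; the function name `lambdaLaunchOptions`, the `CHROMIUM_PATH` variable and the `/opt/chromium` default are all hypothetical.

```javascript
// Build launch options for puppeteer-core, which needs an explicit
// `executablePath` because it does not download a browser.
// ASSUMPTIONS: the CHROMIUM_PATH variable name and the /opt/chromium
// default are hypothetical; the real path depends on the installed layer.
function lambdaLaunchOptions(env = process.env) {
  return {
    executablePath: env.CHROMIUM_PATH || "/opt/chromium",
    headless: true,
    args: [
      // Flags commonly needed in restricted container environments.
      "--no-sandbox",
      "--disable-setuid-sandbox",
      // Do not rely on the /dev/shm shared-memory directory.
      "--disable-dev-shm-usage",
    ],
  };
}
```

The result would then be passed to `require("puppeteer-core").launch(lambdaLaunchOptions())` inside the function body.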

The follow-up

After my 2nd attempt, my WeChat mini-program was working again. It has been running without errors for half a month. 👍

But a new issue came up. The performance of the headless browser in Lambda is really bad: for each API call, the average response time is around 10 seconds.

There are some possible reasons for the bad performance:

  1. Lambda cold starts. (High probability; some packages can warm up the Lambda periodically.)
  2. The headless browser instance can’t be shared and needs to be re-initialized for each API call. (Very high probability.)
  3. Too many network round trips between Lambda and the KMB server.
  4. The Lambda does not have enough RAM.
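For the second point, one common mitigation is to keep the browser in module scope: a Lambda container keeps its module state between warm invocations, so only a cold start pays the launch cost. Here is a sketch of that idea, not something the original code is known to do. The name `createBrowserCache` is made up, and `launch` is whatever function starts the browser.

```javascript
// Cache one browser per Lambda container. Module-scope state survives
// between warm invocations, so only a cold start pays the launch cost.
// `launch` is injected, e.g. () => puppeteer.launch(options).
function createBrowserCache(launch) {
  let browserPromise = null;
  return function getBrowser() {
    if (!browserPromise) {
      browserPromise = launch().catch((err) => {
        // Forget a failed launch so the next call can retry.
        browserPromise = null;
        throw err;
      });
    }
    return browserPromise;
  };
}
```

Each invocation would then call `await getBrowser()` and open a fresh page instead of a fresh browser. The sketch does not handle a browser that crashes after a successful launch.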

Before taking action, it’s good practice to measure the Lambda performance so that we don’t spend effort in the wrong direction.

For Lambda performance monitoring, I suggest Datadog or New Relic. Here’s a post introducing how deeply Datadog can monitor. I think New Relic can do something similar, especially after it released its new platform and pricing scheme, New Relic One.
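Even without a monitoring product, a first rough measurement is possible by timing each step and logging the result; in Lambda, anything written with `console.log` ends up in CloudWatch Logs. The helper below is a minimal sketch with a made-up name, `timed`.

```javascript
// Run an async step and record how long it took, in milliseconds.
async function timed(label, timings, step) {
  const start = process.hrtime.bigint();
  try {
    return await step();
  } finally {
    // Recorded even if the step throws, so failures are measured too.
    const elapsedNs = process.hrtime.bigint() - start;
    timings[label] = Number(elapsedNs) / 1e6;
  }
}

// Hypothetical usage inside a handler (launchBrowser and fetchEta are
// placeholders, not functions from the original code):
//   const timings = {};
//   const browser = await timed("launch", timings, () => launchBrowser());
//   const eta = await timed("fetchEta", timings, () => fetchEta(browser));
//   console.log(JSON.stringify(timings));
```

Comparing the launch time with the time spent talking to the KMB server shows which of the suspected causes deserves attention first.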

Thanks for reading.

Say ByeBye👋 to the ETA feature in my WeChat miniprogram


Last year, I released a WeChat miniprogram as an alternative to the KMB 1933 app.

Why did I spend my time making this mini-program? I had the pain points below when using the KMB app; they are based on the KMB 1933 app version released around mid-2019.

  1. Full-screen ads kept popping up on my iPhone.
  2. The app was slow to open and crashed occasionally.
  3. I only care about a few bus routes and stops; don’t give me a lot of info I don’t need.

My solution

My miniprogram has these three key features to solve the pain points:

  1. Home screen: Show the ETA of the bookmarked bus stops (swipe left to delete a bookmark).
  2. Second screen: Show the ETA of nearby stops (swipe left on a row to bookmark the stop).
  3. Third screen: Search bus routes (bus announcements, schedules, and map views of the routes).
miniprogram screenshots

You’re welcome to try it, just scan this QR code with WeChat:

The miniprogram QR Code

The ETA features

ETA (Estimated Time of Arrival) is the key info of the app. There are two ways we can try to get the ETA info:

  1. KMB official Web site:
  2. KMB 1933 APP

My miniprogram uses the KMB official website as the data source.

KMB official website ETA feature screenshot.

The ETA feature on the KMB official website is a pure front-end function without any authentication. You can easily find the JS source code and inspect how it integrates with the API. I can tell you there’s no fancy encryption stuff.

However, the KMB official website has tried many ways to protect the API endpoint from abuse. The most recent key change is that KMB introduced Google reCAPTCHA to protect the API.

The new captcha key for invoking the ETA API.

The captcha key is generated when the user opens the KMB website and is bound to the KMB domain (I guess it is bound to the user’s IP as well). So I think the captcha key is one-off and can’t be reused.

I created a codesandbox demo to try the captcha.

The sandbox site uses my own Google reCAPTCHA key. If you replace it with the KMB key 6LdiOd8ZAAAAACukKcCRmmf_Ll2hgSIVya22YR99, you will get the error “ERROR for site owner: Invalid domain for site key”.

Possible solutions

Since the KMB website is a pure front-end app, one possible solution is to simulate the browser in the Node.js runtime to get the Google captcha token and then invoke the ETA API. The headless browser can be driven by Puppeteer or PhantomJS.

Running a browser adds a huge overhead, or it may require some daemon service to speed things up, so serverless environments such as Lambda or cloud functions may not be suitable for hosting this kind of service. (My miniprogram API is hosted on WeChat Cloud Functions.)

Another solution is to hack the KMB 1933 app. For example, you can use a proxy app such as Charles to monitor the app’s traffic with its backend API; hopefully, you can find a dedicated or better-organized API that the app uses to get the ETA data.

Usually, the app will use HTTPS to secure communication. The good news is that Charles can act as a man-in-the-middle HTTPS proxy, so you’re able to view the communication between the app and the SSL web server in plain text. The bad news is that if the app enables public key pinning (certificate pinning), Charles will be useless as a proxy.
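With Puppeteer, the token step could look roughly like the sketch below. It assumes the page has already navigated to the site that owns the key, so that the site's reCAPTCHA v3 script has defined the global `grecaptcha`. The function name `getCaptchaToken` and the action string are illustrative, and the call to the ETA API itself is left out because its endpoint and parameters are not covered here.

```javascript
// Ask the reCAPTCHA v3 script already loaded in the page for a token.
// `page` is a Puppeteer Page that has navigated to the site owning the key.
async function getCaptchaToken(page, siteKey, action) {
  return page.evaluate(
    // This function is serialized and runs inside the browser, where the
    // site's own reCAPTCHA script has defined the global `grecaptcha`.
    (key, act) =>
      new Promise((resolve, reject) => {
        grecaptcha.ready(() => {
          grecaptcha.execute(key, { action: act }).then(resolve, reject);
        });
      }),
    siteKey,
    action
  );
}
```

Because the token is requested from a real page on the real domain, the “Invalid domain for site key” error does not apply, although the score attached to the token is still decided by Google.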

Thank you for reading.




Lately I’ve been shuffling through Li Zhi’s (李志) live albums, and this song came up: 普希金 (Pushkin) – 李志x丁薇丨live 2015 动静







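Of course, I also came across some interesting lines from Pushkin (this is the speed-reading method of an impatient age [Google keywords: 普希金 名言]):

  1. Reading and studying mean building up your own thoughts and knowledge with the help of other people’s thoughts and knowledge. (What a good reason to push yourself to read)
  2. Reading is the best way of learning; following the thoughts of great people is a fascinating thing. (What a good explanation of why we read)
  3. The design and creation of the world should be centered on people, not on making money. People do not live with money as their object; a person’s object is usually another person. (What a good principle for product design) (Quote this at any IT product launch and the tone immediately goes up a notch)
  4. Whatever one may say, love that holds no hope and asks for nothing in return will surely touch a woman’s heart more than any calculated seduction. (What a good way to win a girl’s heart)
  5. There is no happiness, only freedom and peace. (What a good reason to accept your own mediocrity and ordinariness)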



A way of approaching a problem: Didn’t find class “”


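This was Android development with targetSdkVersion 23. The task was to integrate the AWS Android IoT SDK (2.16.12) into the app, and I hit the problem in the title while building the app in Android Studio.

The cause is a bug in one of the SDK’s dependencies, 'org.eclipse.paho:org.eclipse.paho.client.mqttv3:1.2.2'.

My fix was to downgrade the AWS SDK Android IoT SDK to 2.14.2, because that version depends on org.eclipse.paho:org.eclipse.paho.client.mqttv3:1.1.0, which does not have the compatibility problem with the older API Level 23.

Four days ago (2020 Jun 19), AWS SDK Android released 2.16.13, which fixes the problem above. The issue is: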



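However, when I hit the problem in the title, I did not downgrade the AWS SDK right away. My process was:

1. Google the build error from the Android Studio console:

The key information is: Didn’t find class “”

2. That led me to this GitHub issue

3. After reading through that very long issue, I found that version 1.2.2 of org.eclipse.paho:org.eclipse.paho.client.mqttv3 has a bug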


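4. I found that the Gradle dependency of AWS SDK Android 2.16.12 is exactly version 1.2.2 😂

SDK 2.16.12 was released on 11 Apr. In fact, the problem would be solved if the AWS SDK simply updated mqtt to 1.2.3.

5. The solutions I thought of at the time

5.1 Submit a PR to the AWS SDK to update mqtt to 1.2.3, then wait for review and release.

5.2 Download the AWS SDK source code myself, modify it locally, and import the modified module locally.

I went with 5.2. aws-android-sdk-iot is just one Gradle subproject of the whole SDK.

The Gradle dependency chain of this library is: aws-android-sdk-iot => aws-android-sdk-core & aws-android-sdk-testutils

In other words, to change the Gradle dependencies of aws-android-sdk-iot, I would have to put the source code of all the related libraries into my own repository, which is a bit overkill. In the end I gave up on this approach.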



So I opened my computer, went through the Git file change history of the AWS SDK Android IoT module, found a suitable version, changed build.gradle, synced Gradle, and the app built successfully right away. 👍