How to use a headless browser to fetch the HK KMB ETA data (fighting reCAPTCHA)

In my previous post, say-byebye-to-the-eta-features-in-my-wechat-miniprogram, I mentioned a solution for fetching the ETA data of HK KMB: simulate real browser behavior by using a headless browser.

What is a headless browser?

A headless browser is a browser without a user interface (UI). Typical use cases are:

  1. Test automation for website E2E testing (Cypress.io is a good example).
  2. Screenshot capturing, web crawling…

A headless browser lets us control browser behavior automatically. We can do this easily with a Node.js library such as Puppeteer.

Before using Puppeteer, you should know one important point. Puppeteer just provides a Node.js API to automate the browser (Chrome or Chromium) over the DevTools Protocol. Specifically, there are two packages: puppeteer and puppeteer-core.

With the puppeteer package, npm install downloads Chromium automatically. For local development, installing puppeteer directly makes it easier to run your code. But because the package becomes very large once Chromium is included, the whole project will exceed the package size limit if it runs in AWS Lambda. So in production I use puppeteer-core instead, as described in the 2nd attempt below.

Let’s start my story.


The 1st attempt

The implementation is easy. You can get the code on my GitHub: https://github.com/adam0x01/kmb-eta-api/blob/master/handler.js#L59-L81

In my first attempt, I tested the above function locally and then exposed it as an HTTP API endpoint. For deployment, I hosted the HTTP server on an EC2 instance (the same server as this blog) and used PM2 to manage the process. Everything looked good and smooth. My WeChat mini-program was back to normal. 😎

One day later, as usual, I opened my mini-program. The bus ETA data was gone and empty… WTF?😢

When I SSHed into my EC2 server, I found a lot of errors in the PM2 log file:

/?route=E42&bound=1&seq=2&bsicode=SH05-T-0825-2
{
  Error: 'Captcha validation error',
  err_message: '',
  msg: '{\n' +
    '  "success": true,\n' +
    '  "challenge_ts": "2020-12-02T13:25:33Z",\n' +
    '  "hostname": "search.kmb.hk",\n' +
    '  "score": 0.1,\n' +
    '  "action": "get_eta"\n' +
    '}'
}

I tested the function on my local computer immediately. It worked… How come it didn’t work on EC2?

The Ultimate Guide to Local Development Setup | by Zico Deng | Medium
My code only works on my machine..

My first guess was that Google reCAPTCHA had learned there was a lot of abnormal traffic from a single IP address.
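
The log hints at why the call was rejected: the token itself was accepted ("success": true), but the score was only 0.1. In reCAPTCHA v3 the score runs from 0.0 (very likely a bot) to 1.0 (very likely a human), and each site picks its own cut-off. Below is a minimal sketch of reading such a payload. The function name and the 0.5 threshold are assumptions for illustration, not KMB's actual logic.

```javascript
// Sketch: interpret a reCAPTCHA v3 verification payload like the one in the log.
// MIN_SCORE is an assumed threshold; the real cut-off used by KMB is unknown.
const MIN_SCORE = 0.5;

function inspectCaptchaVerdict(msg, minScore = MIN_SCORE) {
  const verdict = JSON.parse(msg);
  return {
    tokenValid: verdict.success === true,
    score: verdict.score,
    action: verdict.action,
    // A valid token can still be rejected when the score is too low.
    looksLikeBot: typeof verdict.score === 'number' && verdict.score < minScore,
  };
}

// Same fields as the msg string in the PM2 log.
const sampleMsg = JSON.stringify({
  success: true,
  challenge_ts: '2020-12-02T13:25:33Z',
  hostname: 'search.kmb.hk',
  score: 0.1,
  action: 'get_eta',
});

console.log(inspectCaptchaVerdict(sampleMsg));
```

Read this way, the error is less about an invalid token and more about the request looking automated, which fits the guess about traffic from a single IP address.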


The 2nd attempt

Let me just show you my code: https://github.com/adam0x01/kmb-eta-api

So why not run the function in AWS Lambda?

AWS publishes its IP address ranges. Search for your AWS region to find the possible outbound IP addresses. I think it’s impossible for Google reCAPTCHA to block all AWS IPs.

Important: To enjoy the AWS IP address pool, make sure your Lambda function is not associated with any VPC subnet. Without a specified VPC, the function runs in the default system-managed virtual private cloud. There’s an option to disable the VPC for your Lambda.

Lambda settings to configure the VPC.

During deployment, I hit an AWS Lambda deployment error when using the serverless package. The entire function package was larger than 250 MB, which exceeds the Lambda limit.

How to resolve the AWS Lambda package size limitation?

If you want to run your Puppeteer function in AWS Lambda, you should use a Lambda layer to provide Chromium, and import puppeteer-core in the function body. You can find help for installing the layer at https://github.com/alixaxel/chrome-aws-lambda#aws-lambda-layer.
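
As a rough sketch of that setup, following the usage shown in the chrome-aws-lambda README: the layer supplies the Chromium binary and launch flags, and the function body drives it through puppeteer-core. The createHandler wrapper and the event.url field are hypothetical names for illustration, not code taken from the repository linked above.

```javascript
// Sketch of a Lambda handler that takes Chromium from the chrome-aws-lambda layer.
// `chromium` is passed in so the function package itself stays small.
function createHandler(chromium) {
  return async function handler(event) {
    let browser = null;
    try {
      browser = await chromium.puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath,
        headless: chromium.headless,
      });
      const page = await browser.newPage();
      // Assumption: the caller passes the page to open in event.url.
      await page.goto(event.url);
      return { statusCode: 200, body: await page.title() };
    } finally {
      // Always release Chromium, even when navigation throws.
      if (browser !== null) await browser.close();
    }
  };
}

// In the deployed function (with the layer attached):
// exports.handler = createHandler(require('chrome-aws-lambda'));
```

Passing the module in also makes the handler easy to exercise locally with a stub, without downloading Chromium.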


The follow-up

After my 2nd attempt, my WeChat mini-program worked again. It has been running without errors for half a month. 👍

But a new issue came up. The performance of the headless browser in Lambda is really bad: the average response time of each API call is around 10 seconds.

There are several possible reasons for the bad performance:

  1. Lambda cold starts (high probability; some packages can warm up the Lambda periodically).
  2. The headless browser instance can’t be shared and needs to be re-initialized for each API call (very high probability).
  3. Too many network round trips between Lambda and the KMB server.
  4. Not enough RAM for the Lambda.
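
Reason 2 may be partly avoidable: Lambda keeps module-level state alive between invocations while a container stays warm, so a browser launched once can, in principle, be reused. Below is a minimal sketch of that idea. Here `launch` is a hypothetical function returning a Promise of a browser, for example a wrapper around Puppeteer's launch call.

```javascript
// Sketch: launch the browser once per Lambda container and reuse it while warm.
// `launch` is a hypothetical parameter: any function returning a Promise of a browser.
let sharedBrowserPromise = null;

function getBrowser(launch) {
  if (sharedBrowserPromise === null) {
    sharedBrowserPromise = launch().catch((err) => {
      // Forget a failed launch so the next call can retry.
      sharedBrowserPromise = null;
      throw err;
    });
  }
  return sharedBrowserPromise;
}
```

Whether a browser process survives Lambda freezing the container between calls is something to verify by measurement, so treat this as an experiment rather than a fix.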

Before taking action, it’s good practice to measure the Lambda performance so that our effort isn’t spent in the wrong direction.

For Lambda performance monitoring, I suggest Datadog or New Relic. Here’s a post introducing how deeply Datadog can monitor Lambda: https://www.datadoghq.com/blog/key-metrics-for-monitoring-aws-lambda/. I think New Relic can do similar things, especially after it released its new platform and pricing scheme, New Relic One.
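
One low-tech way to start measuring, before wiring up a monitoring service, is to time each stage inside the function and log the result. A minimal sketch follows; `timed` and the stage labels are hypothetical names.

```javascript
// Sketch: time each stage of a request to see where the ~10 seconds go.
// Results are written into the `timings` object, in milliseconds.
async function timed(label, fn, timings) {
  const start = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    // Nanoseconds to milliseconds.
    timings[label] = Number(process.hrtime.bigint() - start) / 1e6;
  }
}
```

Wrapping the browser launch, the page load, and the ETA request in separate timed(...) calls and logging the timings object as JSON would show which of the four suspects actually dominates.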

Thanks for reading.

Say ByeBye👋 to the ETA feature in my WeChat miniprogram

Background

Last year, I released a WeChat miniprogram which is an alternative to the KMB 1933 app.

Why did I spend my time making this mini-program? I had the pain points below when using the KMB app; they are based on the KMB 1933 app version released around mid-2019.

  1. The full-screen ads keep popping up on my iPhone.
  2. The app was slow to open and crashed occasionally.
  3. I only care about some bus routes and stops; don’t give me info I don’t need.

My solution

My miniprogram has three key features to solve these pain points:

  1. Home screen: Show the ETA of the bookmarked bus stops (swipe left to delete the bookmark).
  2. Second screen: Show the ETA of nearby stops (swipe the row left to bookmark the stop).
  3. Third screen: Search bus routes (bus announcements, schedules, and map views of the routes).
miniprogram screenshots

You’re welcome to try it; just scan this QR code with WeChat:

The miniprogram QR Code

The ETA features

ETA (estimated time of arrival) is the key info of the app. There are two ways we can try to get the ETA info:

  1. KMB official website: https://search.kmb.hk/KMBWebSite/index.aspx?lang=tc
  2. KMB 1933 app

My miniprogram uses the KMB official website as the data source.

KMB official website ETA feature screenshot.

The ETA feature of the KMB official website is a pure front-end function without any authentication. You can easily find the JS source code and inspect how it integrates with the API. I can tell you there’s no fancy encryption stuff.

However, the KMB official website has tried many ways to protect the API endpoint from abuse. The key recent change is that KMB introduced Google reCAPTCHA to protect the API.

The new captcha key for invoking the ETA API.

The captcha key is generated when the user opens the KMB website and is bound to the KMB domain (I guess it is bound to the user’s IP as well). So I believe the captcha key is one-off and can’t be reused.

I created a codesandbox demo to try the captcha.

https://codesandbox.io/embed/friendly-cartwright-dtp23?fontsize=14&hidenavigation=1&theme=dark

The sandbox site uses my own Google reCAPTCHA key. If you replace it with the KMB key 6LdiOd8ZAAAAACukKcCRmmf_Ll2hgSIVya22YR99, you will get the error “ERROR for site owner: Invalid domain for site key”.

Possible solutions

Since the KMB website is a pure front-end app, one possible solution is to simulate the browser in the Node.js runtime to get the Google captcha token and then invoke the ETA API. The headless browser can be driven by Puppeteer or PhantomJS.
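
The first half of that idea could look roughly like the sketch below: open the KMB page in a headless browser, so the token is generated on the KMB domain, and ask reCAPTCHA for a token there. It assumes the page loads the reCAPTCHA v3 script and exposes the global grecaptcha object, and that the action name is get_eta; both are assumptions, as is the fetchCaptchaToken name. How the token is then sent to the ETA API is left out, since that depends on the site's own front-end code.

```javascript
// Sketch: get a reCAPTCHA token from inside the real KMB page.
// `puppeteer` is passed in (puppeteer locally, puppeteer-core elsewhere).
const KMB_URL = 'https://search.kmb.hk/KMBWebSite/index.aspx?lang=tc';
const KMB_SITE_KEY = '6LdiOd8ZAAAAACukKcCRmmf_Ll2hgSIVya22YR99';

async function fetchCaptchaToken(puppeteer, launchOptions = {}) {
  const browser = await puppeteer.launch({ headless: true, ...launchOptions });
  try {
    const page = await browser.newPage();
    await page.goto(KMB_URL, { waitUntil: 'networkidle2' });
    // This callback runs inside the page, where `grecaptcha` is assumed to exist.
    return await page.evaluate(
      (siteKey) =>
        new Promise((resolve, reject) => {
          grecaptcha.ready(() => {
            // Assumption: the site uses the action name 'get_eta'.
            grecaptcha.execute(siteKey, { action: 'get_eta' }).then(resolve, reject);
          });
        }),
      KMB_SITE_KEY
    );
  } finally {
    // Close the browser whether or not a token was obtained.
    await browser.close();
  }
}
```

Locally this would be called as fetchCaptchaToken(require('puppeteer')). Because the token is requested from a page served by the KMB domain, it avoids the “Invalid domain for site key” error seen in the sandbox.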

Running a browser is a huge overhead, or it may require some daemon service to speed things up, so serverless environments such as Lambda or Cloud Functions may not be suitable for hosting this kind of service. (My miniprogram API is hosted on WeChat Cloud Functions.)

Another solution is to hack the KMB 1933 app. For example, use a proxy app such as Charles to monitor the app’s traffic with the backend API; hopefully, you can find a dedicated or better-organized API that the app uses to get the ETA data.

Usually, the app will use HTTPS to secure communication. The good news is that Charles can act as a man-in-the-middle HTTPS proxy, so you can view the communication between the app and the SSL server in plain text. The bad news is that if the app pins the server’s certificate or public key, Charles will be useless as a proxy.

Thank you for reading.

《普希金》 (Pushkin)

This post is about how serious literature reaches a mass audience when pop music is the carrier.


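Listening to music can be addictive too. What is it like to go without music? Ask a heavy smoker what a day without cigarettes feels like.

Recently I’ve been shuffling through Li Zhi’s (李志) live albums, and this song came up: 普希金 – 李志x丁薇丨live 2015 动静 https://www.youtube.com/watch?v=TXuLHJm9wsU

Once you listen, you can tell it’s different: it has an academic style of pop music, because it is a Ding Wei (丁薇) song. If you know her, you may have heard 《女孩与四重奏》, which she wrote and which was originally sung by the singer Ma Ge (马格). If you remember Ma Ge, you may have heard her 《远远的远,远远》… But Ma Ge left show business long ago. Back in the 90s, Ma Ge was probably like some bands (I’m thinking of 鲍家街43号 here): forced by the need to make a living, they had no choice but to disband or change careers.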


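I should get back to the title; I’ve drifted off topic.

《普希金》 is a bit special. It feels like a hook: when it comes up among 100 shuffled songs, you stop and google it to see what’s going on. (This post is one of the results.)

Setting aside the pull of the melody, I at least feel that the title adds a lot to the song.

The first line of the lyrics: 假如你不在我身边… (If you are not by my side…)

Isn’t this just Pushkin’s poem 《假如生活欺骗了你》 (If by Life You Were Deceived)?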

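Of course, I also found some interesting lines from Pushkin (this is the fast-reading method of an impatient age [Google keywords: 普希金 名言]):

  1. Reading and learning mean building your own thoughts and knowledge with the help of other people’s thoughts and knowledge. (What a good reason to urge yourself to read.)
  2. Reading is the best way to learn; following the thoughts of great people is a fascinating pursuit. (What a good explanation of why to read.)
  3. The design and creation of the world should center on people, not on making money; people do not live for money, and what people care about is usually other people. (What a good principle for product design.) (Quote this at any IT product launch and it instantly looks classier.)
  4. Say what you will, love that holds no hope and asks for no reward will surely move a woman’s heart more than any calculated seduction. (What a good way to win a girl’s heart.)
  5. There is no happiness, only freedom and peace. (What a good reason to accept being mediocre and ordinary.)

See, I’ve summed up the great writer Pushkin in a few sentences.

The end.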

An approach to solving a problem: Didn’t find class “javax.net.ssl.SNIHostName”

TL;DR

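Android development, targetSdkVersion 23. The task was to integrate the AWS Android IoT SDK (2.16.12) into the app, and I hit the error in the title while building the app in Android Studio.

The cause is a bug in one of the SDK’s dependencies, 'org.eclipse.paho:org.eclipse.paho.client.mqttv3:1.2.2'.

My fix: downgrade the AWS SDK Android IoT SDK to 2.14.2, because that version depends on org.eclipse.paho:org.eclipse.paho.client.mqttv3:1.1.0, which has no compatibility problem with the older API Level 23.

Four days ago (19 Jun 2020), AWS SDK Android released 2.16.13, which fixes the problem above. The pull request is: https://github.com/aws-amplify/aws-sdk-android/pull/1572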

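Details

The point I want to make in this post is that downgrading can solve compatibility problems introduced by a newer version.

But when I hit the problem in the title, I didn’t downgrade the AWS SDK right away. My process was:

1. Google the build error from the Android Studio console:

The key message is: Didn’t find class “javax.net.ssl.SNIHostName”

2. Find this GitHub issue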

https://github.com/eclipse/paho.mqtt.java/issues/633

3. After reading that very long issue, find that version 1.2.2 of org.eclipse.paho:org.eclipse.paho.client.mqttv3 has a bug

But it is already fixed in 1.2.3, which has been released.

https://github.com/eclipse/paho.mqtt.java/milestone/9?closed=1

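4. Find that the Gradle dependency of AWS SDK Android 2.16.12 is exactly version 1.2.2 😂

SDK 2.16.12 was released on 11 Apr. In fact, the AWS SDK only needs to update its mqtt dependency to 1.2.3 to solve the problem.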

https://github.com/aws-amplify/aws-sdk-android/blob/release_v2.16.12/aws-android-sdk-iot/build.gradle

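5. The fixes I thought of at the time

5.1 Submit a PR to the AWS SDK to update mqtt to 1.2.3, then wait for review and release.

5.2 Download the AWS SDK source code, modify it locally, and import the modified module locally.

I went with 5.2. aws-android-sdk-iot is just one Gradle subproject of the whole SDK.

The Gradle dependency chain of this library is: aws-android-sdk-iot => aws-android-sdk-core & aws-android-sdk-testutils

In other words, to change the Gradle dependencies of aws-android-sdk-iot, I would have to put the source code of all the related libraries in my own repository, which is a bit overkill. In the end I gave up on this approach.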

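How did I come up with the downgrade?

The problem wasn’t solved when I left work. After work I kept thinking about how to fix it, and in the shower inspiration struck: why not downgrade?

So I turned on my computer, checked the Git file change history of the AWS SDK Android IoT module, found a suitable version, modified build.gradle, synced Gradle, and the app built successfully right away. 👍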