Where's my Voi scooter: [2] Deciding the specifications

I researched the topic in the previous blog: I now know how to query the API and what the response looks like. In this blog, I aim to start writing the program that gets the scooter data.

Current problem: access token generation

Each time I want to query the locations, I need to supply an access token, which comes from submitting a session request with my authentication token (not to be confused with the access token). The access token expires after 15 minutes, so I have to generate a new one after that.

If I plan to send location requests frequently, say every minute, I don't want to make a session request alongside every location request, as that effectively doubles the requests I make, which is not polite to the server[1]. Therefore I wish to find a way to only get an access token when I need one.
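One common pattern for this is to cache the access token together with its expiry time and only perform a session request when the cached token has expired. This is a minimal sketch, not Voi's actual API: the fetch function is a stand-in for whatever code performs the session request, and the 15-minute lifetime comes from the behaviour described above.

```python
import time

class TokenCache:
    """Caches an access token and refreshes it only when it has expired."""

    def __init__(self, fetch_token, lifetime_seconds=15 * 60):
        # fetch_token is whatever function performs the session request
        self._fetch_token = fetch_token
        self._lifetime = lifetime_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Refresh slightly early (30 s margin) so a request never
        # goes out with an already-expired token.
        if self._token is None or time.time() > self._expires_at - 30:
            self._token = self._fetch_token()
            self._expires_at = time.time() + self._lifetime
        return self._token
```

With one location request per minute, this turns roughly one session request per location request into one per fifteen, which is about as polite as it gets without longer-lived tokens.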

[1] The term polite comes from web scraping, where it means not sending too many requests to a web server at the same time, to avoid overwhelming it. I am using this term to say that I wish to minimise the requests I send to the API server.

Literature review

I'm doing a literature review again because I want to make sure what I'm going to do has not been done before. If I did the project first and then found out someone had done exactly the same thing, I would have wasted all that time. Although some might argue I would still have learned something while doing it.

On Github

I searched for projects on Voi scooters on GitHub instead of Google because I am searching for code. To my surprise, I actually found a decent number of results, which I wasn't able to the last time I searched on Google. The projects I found are:

On Google

The last time I searched on Google, most of the results were from the Voi website, so this time I decided to add -site:voiscooters.com to exclude results from that domain, and it helped. These are some extra results I found on Google that might help:

Back to work

Now that I've learned what others have done on this subject, it is time to actually build my program. My plan is to have the program constantly query the API and save the results for further analysis.

The challenges that I currently face

  • How should I save the results? Should I create a file per query? A file per minute? Per hour?
  • How do I keep the result concise to save storage space?
  • When should I request an access token?
  • Voi scooters are turned off at night, what happens then?

But one important thing: I should store my token in a separate file that is not committed to my git history, since it is private. Otherwise, others would be able to steal my identity and potentially do something bad. I learned this from the time I accidentally committed my Discord bot token; luckily, Discord runs a scanning bot on GitHub, and it found my token before the bad guys did, generated a new token for me, and notified me about it.
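A minimal way to do this is to read the token from a small file that is listed in .gitignore, so it never enters the git history. The file name token.txt here is my own choice, not anything Voi-specific:

```python
from pathlib import Path

# token.txt is listed in .gitignore, so it is never committed
TOKEN_FILE = Path("token.txt")

def load_token() -> str:
    # strip() removes the trailing newline most editors add
    return TOKEN_FILE.read_text().strip()
```

Environment variables are a common alternative, but a gitignored file is the simplest option for a single-machine script like this one.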

Remove a file completely from git history

I found out how to remove a file completely from the git history. This is useful because if I committed my token and removed it in a newer commit, it would still be in the history and people could access it. The way to completely remove it is to run git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch path_to_file" HEAD, where path_to_file is the file that contains the secret, then git push -f to force push the rewritten history. A side effect is that it can remove the file from the working directory too, so keep that in mind.

Making location requests

For now, my plan is to make a request every minute, which would be 1,440 requests per day. I should also process the data I get from the API since, as seen in the last blog, the response contains a lot of useless data. Each scooter has an id, short, battery, location in lng and lat, zoneId, category, locked, lockType and lock status. But all I need is:

  • short: since it corresponds to the id
  • battery: would be nice to know the battery consumption
  • location: obviously I'm trying to track it
  • locked: I'm not sure whether a scooter still shows up in the system while it is unlocked (edit: it is confirmed that the API only returns locked scooters, so no need to store it)

And I will manually generate the following data:

  • vehicle count: to see how many scooters' locations are returned
  • timestamp: when I made this request, in ISO format for better readability
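Putting the two lists above together, the reduction step could look like the sketch below. The shape of the raw response here is an assumption based on the fields listed (in particular, that lng and lat sit under a location key), not the exact Voi schema:

```python
import datetime

DATA_FORMAT = ["short", "battery", "lng", "lat", "locked"]

def reduce_response(vehicles):
    """Keep only the useful fields and add the count and timestamp."""
    data = [
        [v["short"], v["battery"], v["location"]["lng"], v["location"]["lat"], v["locked"]]
        for v in vehicles
    ]
    return {
        "time_stamp": datetime.datetime.now().isoformat(),
        "vehicle_count": len(data),
        "data_format": DATA_FORMAT,
        "vehicle_data": data,
    }
```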

I made some requests, and if I just store the reduced JSON with an indent of 4:

{
    "time_stamp": "2022-06-01T08:14:42.305285",
    "vehicle_count": ----,
    "vehicle_data": [
        {
            "short": "----",
            "battery": --,
            "lng": ----------------------,
            "lat": ----------------------,
            "locked": true
        },
        {
            "short": "----",
            "battery": --,
            "lng": ----------------------,
            "lat": ----------------------,
            "locked": true
        },
...

A single file is 225KB, which is 230,787 bytes. At 1,440 such files a day, that's about 316MB, which is a lot, so I will try to cut it down.

By storing each vehicle as a list with no field labels:

{
    "time_stamp": "2022-06-01T08:20:09.340059",
    "vehicle_count": ----,
    "data_format": [
        "short",
        "battery",
        "lng",
        "lat",
        "locked"
    ],
    "vehicle_data": [
        [
            "----",
            --,
            ----------------------,
            ----------------------,
            true
        ],
        [
            "----",
            --,
            ----------------------,
            ----------------------,
            true
        ],
...

I reduced the file size to 172KB, or 176,814 bytes.

By not using indent=4:

{"time_stamp": "2022-06-01T08:21:39.708911", "vehicle_count": ----, "data_format": ["short", "battery", "lng", "lat", "locked"], "vehicle_data": [["----", --, ---------...

the file size became 71.2KB, or 72,939 bytes. I never thought some whitespace could drive up the file size so much.
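The whitespace cost comes straight from the json.dumps options. A quick way to see it, with fake data in the same shape as the records above:

```python
import json

# fake record: 100 identical vehicles, same shape as the reduced JSON
record = {
    "time_stamp": "2022-06-01T08:21:39",
    "vehicle_count": 100,
    "data_format": ["short", "battery", "lng", "lat", "locked"],
    "vehicle_data": [["abcd", 90, 13.194, 55.705, True]] * 100,
}

pretty = json.dumps(record, indent=4)
default = json.dumps(record)                      # no indent, but a space after , and :
compact = json.dumps(record, separators=(",", ":"))  # no whitespace at all

print(len(pretty), len(default), len(compact))
```

The separators=(",", ":") argument drops even the spaces after commas and colons that the default settings keep, which shaves off a few more percent on top of removing the indent.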

That will be about 100MB of data each day for 1,440 requests, or maybe only around 66MB, since the scooters only operate from 6 am to 10 pm, plus they stop when it is raining.
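A quick sanity check of that arithmetic (taking MB to mean MiB):

```python
per_file = 72_939                             # bytes per compact file
per_day_24h = per_file * 24 * 60 / 1024**2    # one request per minute, all day
per_day_16h = per_file * 16 * 60 / 1024**2    # only 6 am to 10 pm

print(per_day_24h, per_day_16h)
```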

I should group the records into hours, so I only generate 24 files a day, since a directory with tens of thousands of tiny files gets slow to work with.
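A simple way to group by hour is to derive the file name from the request time, so every record within the same hour appends to the same file. The naming scheme and the JSON Lines format here are my own choices, not anything from the API:

```python
import datetime
import json

def hourly_filename(now=None):
    # e.g. "2022-06-01_08.jsonl" -- one file per hour
    now = now or datetime.datetime.now()
    return now.strftime("%Y-%m-%d_%H") + ".jsonl"

def append_record(record, now=None):
    # JSON Lines: one record per line, so appending is cheap and
    # a half-written file only loses its last line
    with open(hourly_filename(now), "a") as f:
        f.write(json.dumps(record, separators=(",", ":")) + "\n")
```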

What I plan to do in the future

Now that I have calculated the file sizes, I will start coding and see how it goes in the next blog.