The idea was to build an application that detects when a website goes down.
Specifically, the check had to be region-aware.
For example, a website can be down in Singapore while working perfectly well in Europe.
What it does
This application checks whether a site is accessible from around the world.
The user specifies an application and the interval at which its health is checked.
A task runs in different regions of the world, checking whether that application is available.
The user can also view the history of these checks from different regions.
How we built it
It is designed to work in an environment that spans multiple regions (much like AWS).
Several things were kept in mind when designing the architecture:
Making sure that scaling (both within a region and across regions) is straightforward
Making sure that writing execution results to the database is as fast as possible
The application has four main components:
Executor
Webui
PostgreSQL
Couchbase
Executor and Webui are services written in Go, while PostgreSQL and Couchbase are popular databases.
PostgreSQL is used to store information about the application.
Couchbase is used to store the results of the execution.
Every region we want to run in needs at least one Couchbase instance.
Couchbase instances in different regions should be linked by XDCR (cross data center replication).
An executor simply writes its results to the Couchbase instance in its own region.
Webui reads the results (along with replicated data) from the Couchbase instance in its own region.
Challenges we ran into
The biggest challenge I ran into was finding a database that fits these requirements.
I wanted a database that can do the following:
Do distributed writes (a write can be done on any instance in a cluster)
Give good replication performance even across regions
Handle time-series data
Because of these points, I chose Couchbase.
Another major challenge was making sure that the setup for the databases and the application itself was easy to scale.
I used Terraform to help with that.
Accomplishments that we're proud of
The performance of the application
Though there are no metrics for this yet, in my observation the application performs better than other products of this kind.
This is due to several factors, a major one being that the executor is very lightweight and does only one specific thing: making a very simple network request.
Written in Go, the executor performs really well, and I feel more performance can be extracted from it with simple optimizations.
This simple design, coupled with Go's optimizations for the arm64 architecture, gives really good performance on Graviton CPUs.
I am able to check the health of 100 applications using one t4g.micro instance running this executor.
What we learned
One big thing I learned is that building software that scales well is significantly different from building single-instance solutions.
I had to rework the architecture several times before I was sure it would work.
Another big thing I realized is that most databases support replication in which the replicas are read-only (you can write to the master node and read from any other node).
This database architecture would not fit my design, and finding a good database that can do distributed writes was difficult.
In the end I settled on Couchbase. After working with it, I found that it is not perfect and has a learning curve, but it gets the job done.
What's next for Alshain: Application Health Check
Performance optimization
It is clear that more performance can be extracted from the executor, which I look forward to doing.
Better database
Couchbase has some performance issues with aggregation queries. I look forward to solving them within Couchbase itself or replacing the database layer with something else entirely.