Stories by Samkit Jain on Medium

Minimise the risk of misuse in CI/CD pipeline

Samkit Jain — Fri, 19 Nov 2021 11:46:18 GMT

When a CI/CD pipeline is set up, it is generally configured to run on commit pushes, pull request creation, and similar events. Developers are granted access to start the workflow through either customer managed policies or policies managed by the cloud provider. It is essential that malicious actors cannot misuse this access to tamper with the application or take control of the servers by executing unwarranted commands with the granted privileges.

In this article, we will be discussing the steps you can take to minimise this threat. We will be giving examples of GitHub Actions and AWS CodeBuild as the CI/CD pipelines with the resources residing within the AWS ecosystem. The same principles can be applied to other workflows as well.

Threats

Photo by FLY:D on Unsplash

Exfiltration of secrets and build artifacts compromising the integrity of the system.
Injection of malicious commands into the buildspec tampering with the application.
Taking control of sensitive resources in the system.
Becoming the weakest link in the application security.

Prevention Strategies


Be strict with the access

When using an AWS IAM role or user in your CI/CD pipeline, be specific with the access it needs. Be as strict as possible and grant the least privileges. For example, if the role needs to download the file from Amazon S3, grant only the s3:GetObject permission instead of granting full access to S3. Even with the read permission, restrict the access to the paths it needs. Instead of granting access to all the files in the bucket, restrict the access to only the files your role needs.

The Condition element of an IAM policy lets you narrow a permission even further. For example, you can use it to restrict access to requests originating from a certain IP address. An example of a strict policy using the Condition element is

https://medium.com/media/96eb42a61e3597791086cdcdff268940/href
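
As a rough illustration of what such a policy could contain, the Python sketch below builds a policy document that allows only s3:GetObject on a single prefix and only from one IP range. The bucket name, prefix and CIDR range are placeholders chosen for this sketch, not values from the embedded example.

```python
import json


def build_read_only_policy(bucket, prefix, cidr):
    """Return an IAM policy document granting s3:GetObject on one prefix from one IP range."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                # Only objects under the given prefix, not the whole bucket.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
                # Only requests originating from the given CIDR range.
                "Condition": {"IpAddress": {"aws:SourceIp": cidr}},
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_read_only_policy("example-bucket", "build-artifacts", "203.0.113.0/24")
print(json.dumps(policy, indent=2))
```

Generating the document from a small function like this makes it harder to widen the resource or drop the condition by accident when the policy is copied between roles.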

Do not hardcode secrets

Never hardcode secrets. Secrets can be API keys, IAM user credentials or any other sensitive information. If you are using GitHub Actions, take advantage of encrypted secrets to store them securely. If you are using AWS CodeBuild, take advantage of AWS Secrets Manager or AWS Systems Manager Parameter Store to load the environment variables securely into the build environment.

In AWS CodeBuild, you can load sensitive keys from Secrets Manager like below (ensure that the CodeBuild service role has access to fetch and decrypt the secret)

https://medium.com/media/32ff6bee092e3a2e19fd93285ae882f9/href
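
Once the build environment exposes a secret as an environment variable, scripts can read it at runtime instead of embedding the value. A minimal Python sketch, assuming a hypothetical helper name and variable name that are not part of the original setup:

```python
import os


def require_secret(name):
    """Read a secret from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        # Failing early avoids running the rest of the build with an empty credential.
        raise RuntimeError(f"Required secret {name!r} is not set in the environment")
    return value


# Example usage (API_KEY is a made-up variable name):
# api_key = require_secret("API_KEY")
```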

Diligently do code review

Perform proper code review and ensure that no compromising piece of code gets merged in. Pull requests that change the workflow files must be reviewed by the code owners before they can be merged. That way, any attempt to mess with the deployment servers would be averted.

Ensure that only trustworthy URLs are accessed to download packages or data.

On GitHub, you can take advantage of CODEOWNERS. For all the CI/CD related files, you can add a code owner so that no change is merged before explicit approval from a code owner. An example .github/CODEOWNERS may look like the following, which makes the GitHub user username the owner of all the files in the deployment directory.

https://medium.com/media/a596ad9c9259912fb018ae0c3850a3ca/href

Adding an extra layer is also preferable for buildspec and even deployment files (AWS CloudFormation templates or Terraform files). You can add a check to ensure that no critical resource is being modified in those templates. The check would fail if

a critical resource is being deleted.
suspicious permissions are being granted to an IAM user.
a critical resource is being modified.

and many more depending on your use-case. To do so for AWS CloudFormation templates, you can create a stack policy like the one below which disables all updates to the production database.

https://medium.com/media/00b938d273ac613c893350054f76b226/href
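
The checks listed earlier (deleted or modified critical resources, suspicious IAM permissions) could also be sketched as a small pre-merge script that compares the template before and after a change. This is only an illustration: the logical ID ProductionDatabase, the function names and the rule that flags Action "*" are assumptions made for this sketch.

```python
# Hypothetical logical IDs of resources that must never be removed or altered.
CRITICAL_RESOURCES = {"ProductionDatabase"}


def grants_wildcard_action(node):
    """Recursively look for an IAM statement whose Action is "*"."""
    if isinstance(node, dict):
        action = node.get("Action")
        actions = action if isinstance(action, list) else [action]
        if "*" in actions:
            return True
        return any(grants_wildcard_action(value) for value in node.values())
    if isinstance(node, list):
        return any(grants_wildcard_action(value) for value in node)
    return False


def find_violations(old_template, new_template, critical=CRITICAL_RESOURCES):
    """Compare two parsed CloudFormation templates and list risky changes."""
    old_resources = old_template.get("Resources", {})
    new_resources = new_template.get("Resources", {})
    violations = []
    for logical_id in sorted(critical):
        if logical_id not in old_resources:
            continue  # Nothing to protect yet.
        if logical_id not in new_resources:
            violations.append(f"{logical_id}: critical resource deleted")
        elif new_resources[logical_id] != old_resources[logical_id]:
            violations.append(f"{logical_id}: critical resource modified")
    for logical_id, resource in sorted(new_resources.items()):
        is_iam = resource.get("Type", "").startswith("AWS::IAM::")
        if is_iam and grants_wildcard_action(resource):
            violations.append(f"{logical_id}: IAM resource grants Action '*'")
    return violations
```

A CI step could fail the build whenever find_violations returns a non-empty list, which complements the stack policy rather than replacing it.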

Restrict build triggers

Restrict the workflow to run only on certain trigger events. For example, if you have a workflow that deploys your stack, configure it to run on pull request merge events and not on all pull request events. Identify when and how frequently the workflows need to run and specify the trigger events accordingly. For example, if you are maintaining a Python library, you probably have a workflow that publishes the library to PyPI. In such cases, you can configure the trigger to be the creation of a new release or a tag instead of a commit push.

While some workflows need to run on commit or pull request events, others are triggered by specific user actions. Those workflows should be restricted to certain users. In my last article, I created a GitHub Actions workflow to merge pull requests. The trigger event was a user commenting /merge on a pull request. For such events, there should be an additional check that ensures that the comment comes only from designated users. This ensures that no one accidentally triggers a workflow that they should not be allowed to.

Restricting workflow start events at a branch level is also a good practice. You can configure the workflows to run only if the event was triggered on a specific branch. For example, you may have a deployment workflow that runs on pull request merges. You can restrict it such that the workflow runs only if the base branch is one of your deployable branches.

With GitHub Actions, you can take advantage of the if conditional to restrict access like

https://medium.com/media/89a21778548ed46a05793ec57244293e/href

Such fine control is not available in AWS CodeBuild. As a workaround, you can use bash scripts like

https://medium.com/media/cbd270e45be5aa349a3eb9262f3a0126/href

where the conditional.sh is like

https://medium.com/media/40915c140c29bb002a4877d51c006815/href
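
Whichever CI system runs the check, the gating logic itself is small. The Python sketch below shows one way it could look; the event name, branch names and user name are invented for illustration, and the real values would come from whatever your CI system exposes.

```python
# Hypothetical configuration: adjust to your own branches and team.
DEPLOYABLE_BRANCHES = {"staging", "production"}
ALLOWED_ACTORS = {"release-manager"}


def should_deploy(event, base_branch, actor):
    """Return True only for merged pull requests into a deployable branch by an allowed user."""
    return (
        event == "pull_request_merged"
        and base_branch in DEPLOYABLE_BRANCHES
        and actor in ALLOWED_ACTORS
    )


print(should_deploy("pull_request_merged", "production", "release-manager"))  # True
print(should_deploy("pull_request_opened", "production", "release-manager"))  # False
```

Keeping the rule in one function makes it easy to unit test and to reuse from a wrapper script in the build.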

Require manual approval

For deployment related workflows, especially the ones targeting production, you would want to have an additional layer of validation. In AWS CodePipeline, you can add a manual approval before the stack is finally deployed and available to the public. This adds an additional layer of checks that goes through the people of higher authority ensuring that broken/unsatisfactory builds are not deployed.

Source: https://www.trek10.com/blog/enforcing-two-person-rule-aws-codepipeline

Have environment-specific IAM roles/users

It is common practice to use a single IAM role or user and grant it access to the necessary resources in all the environments. While it results in maintaining fewer IAM roles and their permissions, it exposes a flaw that the role can access resources out of its scope. It is advisable to have a role for each environment so that the role for the development environment cannot access or modify the resources of the production environment.

With GitHub Actions, the same can be achieved like

https://medium.com/media/cf2686c2a0590b32f1b963af0e9ec39b/href

We hope the above steps helped you in strengthening the security of your CI/CD pipeline. If you have any other tip that can help in securing the pipeline even more, please share in the comments so that the others can also benefit from it.

Have suggestions that can improve the process? Come join us! Juvoxa is hiring. Send an email to hr@juvoxa.com.

Minimise the risk of misuse in CI/CD pipeline was originally published in instigence on Medium, where people are continuing the conversation by highlighting and responding to this story.

Override GitHub’s merge strategies

Samkit Jain — Wed, 20 Oct 2021 04:13:54 GMT

Perform fast-forward merge on GitHub

Photo by Yancy Min on Unsplash

In this article, you will learn how to override GitHub’s merge options using GitHub Actions. While the article focuses primarily on GitHub Actions, you can simulate the same on services like AWS CodeBuild as well.

My usual Git branch flow is

feature -> development -> staging -> production

I check out a new branch from the development branch, work on it and then raise a PR to merge it into the development branch. The changes are then merged from the development branch to the staging branch and then from the staging branch to the production branch.

For merging a pull request, GitHub provides 3 options:

Create a merge commit
Squash and merge
Rebase and merge

Create a merge commit

Even though the file changes are exactly the same in all 4 branches, except feature, all branches have additional merge commits.

This option adds all the commits from the feature branch to the base branch in a merge commit. The pull request is merged using the --no-ff option. This prevents git merge from doing a fast-forward and creates an additional merge commit.

GitHub does not allow configuring the fast-forward option and the option will always construct a new merge commit.

Disadvantage: The history doesn't remain linear and there are unwanted extra commits.

Squash and merge

2 commits in the feature branch squashed into 1 and then the same process followed for others resulting in all 4 branches having the same code but different heads.

This option combines all the commits in the pull request into a single commit into the base branch. Instead of seeing all of the contributor’s individual commits from the feature branch, the commits are combined into one commit and merged into the base branch. The squashed commits are merged using the fast-forward option.

Disadvantage: The information on the specific changes that were made and who made them is lost.

Rebase and merge

All 4 branches have the same commits but different commit SHAs for the merged commits.

This option rebases and adds all the commits from the feature branch onto the base branch individually without a merge commit. The rebased commits are merged using the fast-forward option.

The rebase and merge on GitHub will always update the committer information and create new commit SHAs, whereas git rebase outside of GitHub does not change the committer information when the rebase happens on top of an ancestor commit.

Disadvantage: The commit SHAs are changed and even though all the branches have the same code, they are considered not in sync as the heads are different.

What I needed was a merge but with the fast-forward option. It would allow me to maintain a linear history without any unwanted commits. I prefer a linear commit history because it is easier to follow, the commit SHAs are the same across branches and it is easier to backtrack, revert and cherry-pick commits.

One option to achieve this is that you can open a terminal and run

$ git checkout feature && git pull origin feature
$ git checkout base && git pull origin base
$ git merge feature
$ git push origin base

but it requires that the developer who has push access to the branch also has access to a computer for the merge to happen. We also lose the record of which specific commands were run and when, in case we need to track that. The other alternative, which I implemented, was to use GitHub Actions to override the merge options provided by GitHub.

The GitHub Action is triggered when a user comments /merge on a pull request. It includes validation checks like

The comment was on an open pull request.
The comment was by people with access.
The status checks on the pull request have passed.
There are no merge conflicts.
The comment exactly matches /merge.
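
The validation checks above can be sketched as a single function that collects every failed check. This is a simplified Python illustration, not the actual workflow code; the class and field names are made up, and in the real workflow the values would come from the GitHub event payload and API.

```python
from dataclasses import dataclass


@dataclass
class MergeComment:
    """Hypothetical snapshot of the facts the workflow needs to validate."""
    body: str
    author_has_push_access: bool
    pr_is_open: bool
    status_checks_passed: bool
    has_merge_conflicts: bool


def validation_errors(comment):
    """Return the list of failed checks; an empty list means the merge may proceed."""
    checks = [
        (comment.pr_is_open, "pull request is not open"),
        (comment.author_has_push_access, "commenter does not have access"),
        (comment.status_checks_passed, "status checks have not passed"),
        (not comment.has_merge_conflicts, "pull request has merge conflicts"),
        # Exact match on purpose: "/merge please" must not trigger a merge.
        (comment.body == "/merge", "comment does not exactly match /merge"),
    ]
    return [message for ok, message in checks if not ok]
```

Returning all failures at once, rather than stopping at the first, lets the workflow post one comment that explains everything that needs fixing.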

After the validation checks, the feature branch is merged into the base branch using the git rebase command and a success comment is added to the pull request. Here’s what the branch networks look like now:

Everything is in sync.

You can find the GitHub Actions workflow file below and can see the workflow in action on a pull request here.

https://medium.com/media/0bc966508b85a198145c4799d1b9e84c/href

Repository for quick access:

GitHub - samkit-jain/override-github-merge

This GitHub Actions workflow has helped me a lot in ensuring that the branches remain in sync and things are doable just from the GitHub.com interface. What you saw in this article is a simplified version of the action that we use. Per your needs, you can extend it to support more merge strategies, force pushes, auto-merge capability and so on.

Plan better sprints

Samkit Jain — Tue, 21 Sep 2021 13:44:48 GMT

Photo by airfocus on Unsplash

One of the most common traits you will find in successful companies is regular product updates, a trait that requires the discipline of proper planning. In this article, we will be covering how we improved sprint planning and met targets at Juvoxa by introducing RFCs in the process.

What is an RFC?

An RFC, short for Request for Comment, is a document designed to help team members propose a solution to a problem and get feedback on that proposal. An RFC is used to drive clarity and consensus.

Problems

Without an RFC in the planning process, developers think of a solution in their heads, give a rough estimate and start coding. More often than not, unknowns surface during the implementation and the deployment gets pushed back. Estimations also take a hit because something that was expected to take X hours ends up taking >>X hours. Sometimes, even after the solution is implemented, the developers have to redo the work because the requirements were not properly understood and the result does not match stakeholder expectations. Since nothing is written down, it also becomes difficult to refer back to critical decision points.

How does it help?

Including an RFC in the process forces the developers to write down the problem and the solution. Writing brings clarity and surfaces more edge cases, and handling those improves the solution. Writing also improves a person’s critical thinking. A detailed RFC ensures that the authors have understood the problem well, and the peer review process ensures that there is consensus within the team on the proposed solution.

This also indirectly results in proper task estimations as the developers would have greater clarity and understanding.

How does it fit in?

At Juvoxa, we first create a PRD (Product Requirements Document) that gives a high-level overview of the problem and the expected user experience. Designs are then created for the same. After which, the backend and the frontend team create the RFCs. The RFCs then go through the review process and once they are approved, tickets/issues are created with estimations and actual coding starts. (*tell us in the comments if you would want us to write an article on the full process as well)

Template

You can find the RFC template embedded below. If it does not load, click here.

https://medium.com/media/c80c0028b2582a023b236c99d3ac30f9/href

Feel free to use the template as needed. Do share in the comments how you adapted the template to your need and integrated it into your company’s processes.

Think you can help us improve the planning process? Come join us! Juvoxa is hiring. Send an email to hr@juvoxa.com.

Plan better sprints was originally published in instigence on Medium, where people are continuing the conversation by highlighting and responding to this story.

How we do testing at Juvoxa — Part 1

Samkit Jain — Wed, 15 Sep 2021 13:31:55 GMT


https://xkcd.com/1700/

Something that is untested is broken.

This is part 1 of our multi-part series on how we do product testing at Juvoxa. In this part, we will be discussing the API testing process.

This is how testing was done at Juvoxa

Code
Deploy
Test
Bug? Go to step 1

Sounds like something you also do? Test Driven Development (TDD) is usually sacrificed in favour of faster and regular deployments. This is a cost that many choose to bear and later regret once it gets out of control. It is understandable to make that tradeoff at the start, when manpower, resources and time are limited, but keeping it in check is crucial. In this article, we’ll be sharing how we do API testing at Juvoxa and hope that our practices will help you recover the tradeoff cost.

Our stack involves Python | Flask | Postgres. If your stack covers the same, read on.

The following sections are a bit theoretical; if you want to get right to the fun coding part, skip ahead to Coding Time.

Manual vs Automated

Automate your testing process wherever possible. With manual testing, you rely on the developer or the tester to exercise the functionality properly, which is time-consuming and error-prone. You also need to ensure that all the edge cases are covered every time. Automated testing with TDD saves developer time because the process runs without manual effort. Other benefits:

No unintended consequences as bugs are caught earlier on.
A faster transition from the development phase to the QA phase.
Reproducible user behaviour with integration testing.
An extra hour writing the tests today saves us an extra day tomorrow.
Overall more reliable and stable API system.
No more (okay, fewer) XKCD #1739 events.
Boosts developer confidence.

Minimal Requirements

The testing setup should have the following at the least

Cover both the positive (API endpoint returning an HTTP 2XX response) and negative (API endpoint returning a non-2XX HTTP response) workflows.
Validate not just the HTTP status code, but also the response payload.
For a Pull Request (PR) to be deemed ready for review, it must have a test that passes with the code changes made in the PR but fails without them. The code coverage should improve or at least remain the same; it should never decrease.
In the CI/CD pipeline, all the tests must pass before the changes can be deployed.
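
To make the first two requirements concrete, here is a minimal sketch of a positive and a negative test that check both the status code and the payload. To keep it runnable without any dependencies, it uses a toy WSGI endpoint from the Python standard library as a stand-in for a real Flask view; the endpoint, the credentials and the helper names are all made up for illustration.

```python
import io
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical credentials store, for illustration only.
USERS = {"alice@example.com": "s3cret"}


def login_app(environ, start_response):
    """Toy WSGI login endpoint standing in for a real Flask view."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = json.loads(environ["wsgi.input"].read(length) or b"{}")
    email = body.get("email")
    if email in USERS and USERS[email] == body.get("password"):
        status, payload = "200 OK", {"message": "logged in"}
    else:
        status, payload = "401 Unauthorized", {"error": "invalid credentials"}
    start_response(status, [("Content-Type", "application/json")])
    return [json.dumps(payload).encode()]


def call(app, payload):
    """Invoke the WSGI app in-process and return (status_code, json_body)."""
    raw = json.dumps(payload).encode()
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    setup_testing_defaults(environ)  # Fills in the remaining required WSGI keys.
    captured = {}

    def start_response(status, headers):
        captured["status"] = int(status.split()[0])

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


def test_login_with_correct_credentials():
    # Positive workflow: 2XX status and the expected payload.
    status, body = call(login_app, {"email": "alice@example.com", "password": "s3cret"})
    assert status == 200
    assert body == {"message": "logged in"}


def test_login_with_wrong_password():
    # Negative workflow: non-2XX status and an error payload.
    status, body = call(login_app, {"email": "alice@example.com", "password": "nope"})
    assert status == 401
    assert body == {"error": "invalid credentials"}
```

With Flask, the call helper would typically be replaced by the test client that Flask provides, while the assertions would keep the same shape.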

API Testing Strategies

Unit Testing
This involves testing a single API endpoint. The tests for that endpoint run various simulations, including both positive and negative workflows, and validate that the endpoint works as expected. For example, testing the login API endpoint with the correct credentials.

Integration Testing
This involves running multiple API endpoints in a single test. This is useful for simulating user scenarios and testing API endpoints that are dependent on each other. For example, simulating the flow in which a doctor is creating a program on the platform which involves multiple endpoints to create a base program, add contents, add tags, etc.

Coding Time

* Assuming you already have a Flask app created and use Docker.

Before the tests can be run, you need to create an ephemeral database that lives for as long as your tests are run. You don’t want to be running the tests on the production database.

To start a sample PostgreSQL database, you can use the official Docker image like so (if you are using a different database backend then you can use a different Docker image)

$ docker run -d --rm -P -p 127.0.0.1:5432:5432 -e POSTGRES_DB=db_name -e POSTGRES_USER=username -e POSTGRES_PASSWORD=password --name postgres-container postgres:13.1

The above will start a PostgreSQL 13.1 server, keep it running in the background, and create an empty database named db_name that the Postgres user username can access with the password password. The database can be reached over localhost at port 5432. The equivalent connection string is postgresql://username:password@localhost:5432/db_name.

Set up your testing directory as

tests/
|-- __init__.py
|-- api
|   |-- test_00_signup_api.py
|   `-- test_01_signin_api.py
`-- conftest.py

Test files are collected in alphabetical order, so if you want them to run in a certain order, prefix them with 00, 01, 02, …
conftest.py — Configuration file where you can register hooks.

https://medium.com/media/6fd808bcb449c580040e7cd3e19e786b/href

It is better to organise your test files in the api/ folder and have a separate file for each API.

https://medium.com/media/52efc7f0b67cfc6c6cca0c60a79f985c/href

https://medium.com/media/4b2950f06acdee550b6d5d4d32cbc3da/href

Notice that even within the individual test files, the method names carry 00, 01, … as a suffix to ensure that the tests always run in the required order. You can also leverage the test order to populate the database on the go, so you don’t need to rely on a prepopulated database (though that is an option as well).

To run the tests, first, install pytest

$ python -m pip install pytest  # Required only once
$ python -m pytest tests/

If everything is ok, you should be seeing a green tick saying all tests have passed.

If you want to run only a subset of tests, you can use the -k option as provided by pytest. More details can be found here.

The above was a very simple example that will help you get started with integrating testing in your API development process. You can extend the test cases above to suit your needs. The above example demonstrated unit testing the APIs. To create scenarios and perform integration testing, you can update the tests to call multiple APIs in sequence and validate each (for example, simulating a flow where a user registers but does not verify the email and then tries to login and perform restricted actions).

CI/CD

We use AWS CodeBuild for our CI/CD flow. A sample buildspec.yml file to run the tests as part of your CI pipeline will look like

https://medium.com/media/5b93dbd939a7066d83f30cf7d8675d34/href

Code Coverage

Code coverage is a means to measure how much of your code is executed when running the tests. The higher the coverage, the better. There are multiple coverage report generators available that go along with pytest with the most popular ones being pytest-cov and Coverage.py.

We used pytest-cov and to use it, you just need to install it via pip and pass the required parameters when running the tests.

$ python -m pip install pytest-cov
$ python -m pytest --cov=./ --cov-report=xml tests/
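
The requirement that coverage should never decrease can be checked mechanically from the generated report. The sketch below assumes the XML report is in the Cobertura-style layout, with the overall figure in a line-rate attribute on the root element; the function names and the idea of keeping a baseline report from the target branch are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def line_coverage(xml_text):
    """Return the overall line coverage (0.0 to 1.0) from a Cobertura-style report."""
    root = ET.fromstring(xml_text)
    return float(root.attrib["line-rate"])


def coverage_did_not_drop(baseline_xml, current_xml):
    """True when the current report covers at least as much as the baseline."""
    return line_coverage(current_xml) >= line_coverage(baseline_xml)


# Tiny hand-written reports standing in for real coverage.xml files.
baseline = '<coverage line-rate="0.80"></coverage>'
current = '<coverage line-rate="0.85"></coverage>'
print(coverage_did_not_drop(baseline, current))  # True
```

In a pipeline, the two strings would be read from the report of the base branch and the report of the pull request, and the build would fail when the function returns False.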

To visualise the coverage in a better way and include it in your CI setup you need to use a service like Codecov. Create an account and authenticate GitHub. Once done, you just need to upload the coverage.xml report generated above to Codecov and let it handle the rest. For private repos, you can do so using

$ bash codecov_upload.sh

where codecov_upload.sh has the content as

#!/bin/bash
bash <(curl -s https://codecov.io/bash) -t YOUR-CODECOV-TOKEN

With Codecov added, the buildspec.yml will now look like

https://medium.com/media/913b76dcc69a414d8b7b678ea29af7f9/href

Tests should not be considered as a burden but rather as a means to empower the team to build better systems. They give you confidence when adding new features and allow you to scale your development with confidence as the codebase grows.

In the upcoming articles, we will be discussing how we test our frontend. Follow us to not miss out on any new articles.

Think you can help us improve the testing process? Come join us! Juvoxa is hiring. Send an email to hr@juvoxa.com.

How we do testing at Juvoxa — Part 1 was originally published in instigence on Medium, where people are continuing the conversation by highlighting and responding to this story.

Physio Pose: A virtual physiotherapy assistant

Samkit Jain — Wed, 04 Mar 2020 07:24:01 GMT

A Proof-Of-Concept for a real-time personal virtual trainer for physiotherapy exercises.

Pose estimation on a woman doing jumping jacks

Background
Physiotherapy treatments are usually long-running as it takes time to restore a person’s movement. One has to perform the same set of exercises every day for a few months, with the correct posture, to regain the movement. Visiting a physiotherapist for every session can be pretty expensive and not everyone can afford that. Those who do the exercises in the comfort of their home have to make sure they get the posture and movement right.

This project is an attempt to create a system using computer vision that can guide, provide instant feedback and act as a personal virtual trainer that would help people do exercises. A system that focuses more on form than on reps.

GitHub repo is at https://github.com/samkit-jain/physio-pose. (It starts off from when I decided to go ahead with openpifpaf)

Usage

The main script to run is physio.py. A sample run command could look like:

python physio.py --exercise seated_right_knee_extension --joints --skeleton --save-output

There are more options available and the full list can be seen by running python physio.py -h.

The result would look something like the GIF below where the top portion would show instructions to the user and the bottom would be the skeleton of the user.

Seated knee flexion and extension

How it works

Let’s start with pose estimation. One resource limitation I faced was that my laptop does not have an Nvidia GPU, which means no CUDA, so I had to use a pose estimation model that could work on a CPU. I experimented with the following projects that can run on CPU-only machines:

LightTrack: Implementation of “LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking”. GitHub.
Lightweight OpenPose: Implementation of “Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose”. GitHub.
Openpifpaf: Implementation of “PifPaf: Composite Fields for Human Pose Estimation” in PyTorch. GitHub.
Tf-pose-estimation: TensorFlow implementation of OpenPose. GitHub.

Out of these, I found openpifpaf and tf-pose-estimation to have the highest FPS (frames per second) rates. Of the two, openpifpaf gave more accurate results in cases when the full body wasn’t visible or the body was lying sideways.

The next step involves assessing the movement of the joints. We do that by first checking whether all the required joints have been identified and are visible in the frame. Then, the user is instructed to get in the starting position. Next, the user is instructed to perform the subsequent steps involved in the exercise. The feedback is provided based on the angle made by the keypoints being covered.

Let’s take the example of seated knee flexion and extension with the right leg. It is best done sitting in a chair. The sub-steps can be written as

The right leg should be in a seated position, i.e., making an inner angle in the range of 120 to 150 degrees.
Extend the right leg such that the inner angle is 180 degrees.
Bring the right leg back to the starting position.

The program follows the same instructions and guides the user. It first checks whether the keypoints are visible. It then waits for the user to move the right leg into the seated position. Once the leg is there, the program waits for the user to extend and straighten the leg, and then for the user to bring the leg back to the starting position. During each of the waiting periods, it provides feedback to the user. For example, if the leg is not in the starting position, it instructs the user to move it there.
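
The angle-based feedback can be sketched in a few lines of Python. This is an illustration rather than the code in the repository: the function names, the messages and the 175 degree tolerance for a "straight" leg are assumptions, and the three points would be the hip, knee and ankle keypoints returned by the pose estimator.

```python
import math


def inner_angle(a, b, c):
    """Angle in degrees at point b formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)  # Assumes the three points are distinct.
    # Clamp to guard against tiny floating point overshoots outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def next_instruction(angle, state):
    """Tiny state machine for one repetition; returns (message, new_state)."""
    if state == "start":
        if 120 <= angle <= 150:
            return "Extend your right leg", "extend"
        return "Move your right leg to the starting position", "start"
    if state == "extend":
        if angle >= 175:  # Tolerance for "180 degrees" is an assumption.
            return "Bring your leg back to the starting position", "return"
        return "Keep extending your right leg", "extend"
    if state == "return":
        if 120 <= angle <= 150:
            return "Repetition complete", "done"
        return "Keep lowering your leg", "return"
    return "Repetition complete", "done"
```

For each video frame, the knee angle would be computed with inner_angle(hip, knee, ankle) and passed to next_instruction together with the current state.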

3D

3D bent elbow shoulder rolls

A 3D data feed gives a more true-to-life impression of the human body and can help provide much more accurate results.

This experiment was short-lived as creating a 3D image from a 2D image is still far from perfect, as can be seen in the GIF. The 2D video provided showed a person doing seated bent elbow shoulder rolls. The information is clearly lost in the generated 3D.

Pose similarity

https://youtu.be/TrHdgq7rFsc

The current application focuses on providing a real-time assessment of the exercise that the person is doing. A good feature could be to provide a score for how close the motion was to a perfect reference motion. The solution also needs to account for differences in video duration, as it is not guaranteed that both videos are of the same length or show the same motion at the same point in time at the same speed.

Code for this is at pose_compare.py.

Run

You need 2 videos. Run pose estimation using physio.py on both the videos and save their CSV results. The execution would be similar to:

$ python physio.py --video video1.mp4 --csv-path video1.csv
$ python physio.py --video video2.mp4 --csv-path video2.csv

To calculate how similar the 2 poses are, run:

$ python pose_compare.py video1.csv video2.csv

This would output a decimal number. The lower the better.

Explanation

The implementation is inspired by the research paper Schneider P., Memmesheimer R., Kramer I., Paulus D. (2019) Gesture Recognition in RGB Videos Using Human Body Keypoints and Dynamic Time Warping. Even though the paper focuses on gesture recognition, we can leverage the processing techniques mentioned in the paper along with dynamic time warping to measure the similarity between sequences which may vary in speed.

It is a 4 step process:

Translation: All the key points are translated such that the nose key point becomes the origin of the coordinate system.
Scaling: The key points are scaled such that the distance between the left shoulder and right shoulder key point becomes 1.
Dimension Selection: Joints that do not move significantly in the sequence are removed.
Dynamic Time Warping: An approximate Dynamic Time Warping algorithm that provides optimal or near-optimal alignments with an O(N) time and memory complexity.
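
The translation, scaling and alignment steps can be sketched with the standard library alone. Note the swap: the code below uses plain, exact dynamic time warping, which takes quadratic time, rather than the approximate linear-time variant described above, and it leaves out dimension selection for brevity. The keypoint names and function names are assumptions made for this sketch and may differ from pose_compare.py.

```python
import math


def euclid(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def normalise(frame, nose="nose", left="left_shoulder", right="right_shoulder"):
    """Translate so the nose is the origin, then scale so shoulder width is 1."""
    nx, ny = frame[nose]
    shifted = {name: (x - nx, y - ny) for name, (x, y) in frame.items()}
    width = euclid(shifted[left], shifted[right])
    return {name: (x / width, y / width) for name, (x, y) in shifted.items()}


def dtw_distance(seq_a, seq_b):
    """Exact DTW cost between two sequences of feature vectors (lower is more similar)."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclid(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The same motion performed at two different speeds aligns with zero cost.
slow = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,)]
fast = [(0.0,), (1.0,), (2.0,)]
print(dtw_distance(slow, fast))  # 0.0
```

Each video would first be turned into a sequence of normalised, flattened keypoint vectors, and the two sequences would then be passed to dtw_distance.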

Improvements

There’s a lot of scope for improvement in this implementation and pull requests are welcome.

Stabilisation: As is evident from the demo GIF shared above, the key point identification can be stabilised to give a smooth transition effect.
Multiple reps: Support for doing multiple repetitions of an exercise.
Lightweight model: A model that is light enough to run predictions at 30 fps even on a CPU machine and yet accurate enough to be usable.
Mobile application: A mobile representation of the application.

I would like to extend a special thanks to Mr. Amit Prakash Gupta who gave me the opportunity to work on this project.

Why OCR-ing a bank statement is a bad idea

Samkit Jain — Sun, 03 Jun 2018 04:08:06 GMT

A lending company needs the answer to just one question when giving loans, “Will the borrower be able to repay the loan?”.

Coming to a binary (yes or no) answer to that question involves a lot of work (both manual and automated). Lending companies go through various financial documents of the borrower and perform a thorough analysis to come to a conclusion. With the advancements in artificial intelligence and machine learning, some companies have been able to reduce the time to approve a loan to a day (but don’t share the percentage of bad loans, probably because OCR systems are still not as good as they want them to be). To understand the past spending behaviour of a borrower and predict the future ability to repay a loan, one of the financial documents that every lending company asks for is a bank account statement. A bank account statement can be tens or hundreds of pages long. An individual’s statement may contain just a few hundred transactions, while a company’s can run into the thousands.

Bank statements come in every format imaginable. You’ll find PDFs (both bank generated and scanned copies), CSVs, images and, in rare cases, even HTML files. In this post, we’ll talk about the most common type, images. Even among images, you’ll see a lot of variety:

screenshots
blurry photos
high-resolution photos
low-resolution photos
photos with folded pages
photos in bad lighting

Dealing with images has always been problematic. You cannot simply copy-paste the text and neither can you accurately get the text because of the ambiguous shape of letters and numbers. They are human readable but not machine readable.

OCR to the rescue!

Optical Character Recognition or OCR is a technology that recognizes text within an image. Humans have the ability to easily understand the text in an image, however complex (after all, we are the masters of the sacred texts!).

Source: https://www.reddit.com/r/comics/comments/7xi2bh/the_sacred_texts_oc/

Over the past few years OCR solutions have really gotten much better. They are able to recognise handwritten texts with a good amount of accuracy. Giants like Google and Microsoft have also invested in the field and have come up with their own text recognition products.

It’s well known that OCR works well when the characters are printed, the image quality is high and the lighting is ideal. Bank statement images shared by the borrower have one thing going for them: they contain printed text. But this is not enough. Even under ideal conditions, it won’t be enough.

The majority of errors in OCR systems come from incorrect classification. An OCR system usually misclassifies when a letter and a number share the same features. Some of the ambiguous cases are:

O (letter) and 0 (number)
I (uppercase i) and 1 (number)
l (lowercase L) and 1 (number)
S (uppercase s) and 5 (number)
Z (uppercase z) and 2 (number)

When using OCR for general purpose text detection, this ambiguity might not be a serious concern, but when using it on a bank statement especially for lending purposes, this is a serious concern. Misclassifying 50,000 as SO,OOO is quite serious. It would lead to missing important transactional entries.
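
One way to catch such cases before they reach the analysis is to validate every token that is supposed to be an amount. The sketch below is illustrative only: the look-alike map comes from the list above, while the function name check_amount and the amount pattern are assumptions made for this example. It flags a token as suspect when it only becomes a valid amount after swapping look-alike letters for digits, so a human can review it.

```python
import re

# Look-alike pairs taken from the list above (letter -> digit).
LOOKALIKES = {"O": "0", "I": "1", "l": "1", "S": "5", "Z": "2"}

# Assumed amount format: accepts 50,000.00 as well as Indian-style 1,00,000.00.
AMOUNT_RE = re.compile(r"^\d{1,3}(,\d{2,3})*(\.\d{1,2})?$")


def check_amount(token: str) -> str:
    """Return 'ok', 'suspect' or 'invalid' for an OCR-ed amount token."""
    if AMOUNT_RE.match(token):
        return "ok"
    # Swap every look-alike letter for its digit and test again.
    swapped = "".join(LOOKALIKES.get(ch, ch) for ch in token)
    if AMOUNT_RE.match(swapped):
        return "suspect"  # probably an amount, but a human should confirm it
    return "invalid"
```

Note that "ok" only means the token is well formed. A comma misread as a dot can still produce a well-formed but wrong amount, which is why this kind of check reduces the problem without solving it.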

Images shared by borrowers are usually not in ideal condition. They are badly lit, blurry or low resolution, have pen or pencil markings, show folded pages, etc. All these factors act as catalysts and lead to more and more incorrectly classified characters.

Some examples where OCR didn’t work for us

Could not detect all the text. Misread . as : (Microsoft Azure’s Computer Vision API)

Could not detect all the text. Misread 0 as O and , as . (Google’s Cloud Vision API)

Could not detect any text (Microsoft Azure’s Computer Vision API)

Misread , as . and C as G and . as : (Google’s Cloud Vision API)

Misread Z as 2 (Google’s Cloud Vision API)

Current state of OCR systems

When Inkredo was into P2P lending, we too accepted images of bank account statements and manually typed every entry into an Excel sheet. As expected, this was a time-consuming process, so we tried multiple OCR solutions (in-house, open-source and paid), but none of them gave us the desired result. Some of them worked really well with bad quality photos, but all of them struggled with ambiguous characters.

To conclude, OCR is not reliable for text detection in financial documents, where reading a comma as a dot (or vice versa) can make a significant difference. PDF (containing text and not scanned images) should be the preferred type because it’s not as easy to manipulate as a CSV and it is easier to extract text from a text-based PDF than from an image. Plus, you won’t have to buy expen$ive OCR solutions.

Do you think OCR is reliable when it comes to credit risk assessment? Share your thoughts in the comments.

Why OCR-ing a bank statement is a bad idea was originally published in Zodhana on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to deploy Go web application on AWS Elastic Beanstalk

Samkit Jain — Wed, 09 May 2018 05:02:38 GMT

Deploying version one of an app to production for the first time is never a walk in the park. We too faced obstacles when deploying the backend of 91paisa (written in Go) on Amazon Web Services (AWS) Elastic Beanstalk. After multiple trials and help from AWS Support, we finally deployed the initial version of 91paisa. This article will help you if you too are figuring out how to deploy your Go web app on AWS Elastic Beanstalk.

You can deploy Go web applications on AWS Elastic Beanstalk in a couple of ways:

Uploading the source code
Uploading the binary

Uploading the source code

Create a Zip file that contains the source code, with application.go (the entry point of your application) at the root. Also, create three additional files: build.sh, Buildfile and Procfile.

The directory structure will look like:

.
├── application.go
├── datastore
│   ├── db.go
│   └── ...
├── api
│   ├── api.go
│   └── ...
├── public
│   └── index.html
├── build.sh
├── Buildfile
├── Procfile
└── .ebextensions
    └── ...

https://medium.com/media/69bd11abf7f7d79c326031ef3e07f549/href https://medium.com/media/6691bd483186d460161a10ef50337345/href https://medium.com/media/e70b5ad56dfa7311e6e2e99ebb832b5f/href

Now, upload the Zip file to Elastic Beanstalk and wait. If everything works, congratulations! If not, read on.

We followed the same process, but the environment health immediately degraded. Digging through the logs, we found:

The build command was failing.

We ran eb ssh to SSH into the instance and found the files downloaded at /var/app/current/ (the GOPATH). This meant that the datastore folder was located at /var/app/current/datastore, while the application was looking for it in /var/app/current/src/github.com/91paisa/backend/datastore.

The build command was failing because of the project layout. We follow the standard structure (and naming convention) where the source code is located at go/src/github.com/91paisa/backend/ and packages are imported by specifying the complete path, i.e. import "github.com/91paisa/backend/datastore". Since Elastic Beanstalk does not know about the project layout, it does not create one, and hence the build was failing.

You can get around this by either (a) using relative paths for your packages in the import statement or (b) dividing the application into packages such that you can install them using go get.

Note: Elastic Beanstalk does not delete the previous files. For example, say your source code initially contained just three files (application.go, api.go and cmd.go), you uploaded the Zip and everything worked great. Later, you no longer need cmd.go, so you delete it and re-upload the Zip. On your environment, the cmd.go file will still be present and you’ll have to manually remove it or restart the environment. Whenever the load balancer creates a new instance, though, it contains only the latest files.

Uploading the binary

Create the binary using:

GOARCH=amd64 GOOS=linux go build -o bin/application application.go

This will create a bin folder containing the binary. Now, create a Zip file containing the bin folder. You can also add additional files like assets and .ebextensions folder.

The directory structure of the Zip will look like:

.
├── bin
│   └── application
├── public
│   └── index.html
└── .ebextensions
    └── ...

You can also create a shell script to automate the process. A basic shell script that fits my use case:

https://medium.com/media/a0e4c4013c086b16c8a2453b91de62e0/href

Don’t forget to chmod +x create_zip.sh. Run it using ./create_zip.sh.
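
If you would rather not depend on a shell, the packaging step can also be sketched in Python. This is a minimal illustration and not the script embedded above: the function name create_bundle and the default folder list are assumptions based on the directory structure shown earlier, and it expects the binary to have been built already.

```python
import zipfile
from pathlib import Path


def create_bundle(root, out_path, include=("bin", "public", ".ebextensions")):
    """Zip the listed folders (relative to root) into out_path.

    Returns the archive names that were added, in the order written.
    """
    root = Path(root)
    added = []
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for name in include:
            folder = root / name
            if not folder.is_dir():
                continue  # optional folders may be absent
            for path in sorted(folder.rglob("*")):
                if path.is_file():
                    # Store paths relative to root so bin/ sits at the Zip root.
                    arcname = path.relative_to(root).as_posix()
                    bundle.write(path, arcname)
                    added.append(arcname)
    return added
```

Calling create_bundle(".", "app.zip") from the project root would produce a Zip with the layout shown above.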

Concluding remarks

We hope this article helped you in deploying your Go app to AWS Elastic Beanstalk. If you are looking to deploy on EC2 or DigitalOcean, the process becomes much easier. All you have to do is build the executable binary and run it as a service on the server.

Know a better way to deploy? Please share your knowledge in the comments.

Wanna chat? Drop me a message on Twitter or connect with me on LinkedIn!

Originally published at http://www.91paisa.com/blog/how-to-deploy-go-web-application-on-aws-elastic-beanstalk

Developing a bank statement analyser

Samkit Jain — Thu, 14 Dec 2017 05:54:18 GMT

Snapshot of sample output

How we achieved accuracy of over 90% after reading 800+ transactions

At Inkredo, we perform flow-based credit assessment to determine the monthly repaying capacity of a customer. Our customers are small and underbanked retailers who are running a bootstrapped and consistently profitable business, yet they remain excluded from formal credit. Formal institutions have shied away from lending to the lower-middle income group because the returns from lending do not offset the cost of origination and recovery. There is no cost-effective measure to monitor income/solvency and ensure timely repayment.

The assessment involves deriving useful metrics from the bank statement of our customer. This task requires copy-pasting every transaction from the PDF of the bank statement (containing tens of pages with hundreds of rows) into an Excel file, cleaning the copied data, and then using Excel wizardry to perform some statistical operations. Imagine using Ctrl+C and Ctrl+V almost a thousand times every other day. As you might have guessed, this involves a lot of human interaction, and a single bank statement typically takes us a day to complete.

When you just want to copy a row but it selects the whole column

With our growing user base, we needed a solution to reduce the effort and time required: a smart solution that generates insights within seconds with minimal human interaction.

Input

Bank statement in PDF

Output

Sources of earning
Merchant transactions
Operational expenses
Recurring transactions
Debt (if any)
Default (if any)

Process

The 5 step process

The Research

Before starting with the development, we tackled the problem manually. For each bank statement (from various banks), we read all the transactions, highlighted keywords and assigned appropriate labels and categories to each. Then, with the generated mapping, we created a set of keywords for each category with priorities assigned.

The dataset

A bank statement covering over six months of transactions of a person running a business is usually more than 20 pages long with around 1,000 transactions. The columns are generally date, particular, balance, deposit, withdrawal, etc. For a specific bank, the result is pretty consistent and easy to play with, but every bank has its own format for bank statements. The count of columns, positioning of columns, separators, text format and abbreviations vary.

The columns we require are

Date of transaction
Particulars
Deposit amount
Withdrawal amount
Closing balance
Cheque/Reference number

These columns are found in every bank statement. For example:

HDFC

IDBI

ICICI

The naming convention might be different, but the purpose of every column remains the same.

We created a dictionary called BANK_DETAILS that contains the positions of the required columns for each bank. For example:

https://medium.com/media/5c8462111d2f6f6bb0590bcd060eecce/href
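
As a rough illustration of the idea (not the actual dictionary embedded above), such a mapping could look like the sketch below. The HDFC positions are inferred from the sample row shown later in this post, and the helper pick_columns is a hypothetical name introduced for this example.

```python
# Zero-based column positions per bank. Illustrative values only:
# the HDFC layout is an assumption inferred from one sample row.
BANK_DETAILS = {
    "HDFC": {
        "date": 0,
        "particulars": 1,
        "reference": 2,
        "withdrawal": 4,
        "deposit": 5,
        "balance": 6,
    },
}


def pick_columns(row, bank):
    """Return only the required fields of a raw row, keyed by name."""
    layout = BANK_DETAILS[bank]
    return {field: row[index] for field, index in layout.items()}
```

Supporting a new bank then means adding one more entry to the dictionary rather than changing the parsing code.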

Reading the bank statement

Reading tables from PDF documents is not an easy task. Even copying data from tables doesn’t work properly most of the time. Thankfully, there’s an open-source library called tabula that can extract tables from a PDF with nearly accurate results. We used its Python wrapper, tabula-py, for the data extraction.

https://medium.com/media/dbf342d4d39b0989781f0a27ad8e8551/href
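
A minimal sketch of this step, assuming tabula-py (and the Java runtime it needs) is installed. The function names read_statement and flatten_tables are made up for this example, and the options shown are one reasonable choice rather than the exact ones we used.

```python
def read_statement(pdf_path):
    """Extract every table in the PDF as a list of pandas DataFrames."""
    import tabula  # tabula-py; imported lazily because it needs Java

    return tabula.read_pdf(
        pdf_path,
        pages="all",  # statements span many pages
        multiple_tables=True,
        pandas_options={"header": None},  # keep header rows as plain data
    )


def flatten_tables(tables):
    """Turn the per-table DataFrames into one flat list of row lists."""
    rows = []
    for table in tables:
        rows.extend(table.values.tolist())
    return rows
```

Keeping the header rows as data (header=None) is deliberate here: every page repeats the header, so it is simpler to strip all of them in one later pass.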

Making the extracted data consistent

Table header from ICICI bank statement

Every page of the ICICI bank statement that has a transactions table contains this header row. This row is useless for the system as we are only targeting transactions.

Aim: Remove header rows from the list of transactions.

Solution: From reading multiple transactions from numerous bank statements, we realised that the closing balance column is always the last. So, the header can be treated as the rows (why plural? see next task) from the first row up to the row where the closing balance is not null. Then, go through all the rows and remove every row that is part of the header. In the end, we have rows without any header.

https://medium.com/media/32a3b31deb04a0cf5910fe58f6778fbb/href
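
The idea can be sketched as follows. This is an illustration under one stated assumption: "closing balance is not null" is interpreted here as "the last column parses as an amount", so that a header cell such as "Balance" does not count. The helper names are hypothetical.

```python
import math
import re

# Assumed amount format, e.g. 14,904.08 or -500.00
AMOUNT_RE = re.compile(r"^-?\d[\d,]*(\.\d+)?$")


def is_null(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def has_balance(row):
    """True when the last column holds something that parses as an amount."""
    value = row[-1]
    return not is_null(value) and bool(AMOUNT_RE.match(str(value).strip()))


def row_key(row):
    # NaN never equals NaN, so normalise nulls before comparing rows.
    return tuple(None if is_null(v) else str(v).strip() for v in row)


def remove_headers(rows):
    # The leading rows before the first real transaction form the header.
    header_keys = set()
    for row in rows:
        if has_balance(row):
            break
        header_keys.add(row_key(row))
    # Drop every repeat of those header rows (one set per page).
    return [row for row in rows if row_key(row) not in header_keys]
```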

Transaction from an HDFC bank statement without the sensitive information.

In the image, we can see that a particular can span multiple lines but still belong to the same row. Tabula cannot tell that multiple lines belong to the same row. It treats them as separate rows, and as a result, we get the following output:

# First line read from HDFC statement
['22/06/17', 'IMPS-7-RAHUL-HDFC-XXXXXXXX', 'XXXX7', '22/06/17', nan, '1,000.00', '14,904.08']

# Second line read from HDFC statement
[nan, '8-XXXX', nan, nan, nan, nan, nan]

Aim: Merge a particular that is split across multiple rows into one.

Solution: The first line of every entry contains the particular, date, balance, transaction amount and cheque number. Only the particulars can be multiline. So, the extra lines of a multiline particular sit between two date entries.

https://medium.com/media/7e5c4b421218ee8205ea72cb555dcad7/href
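
A minimal sketch of the merge, using the two sample lines shown above. The function name merge_multiline and the column defaults are assumptions for this example, and joining the pieces without a separator is a guess about how the text was wrapped.

```python
import math


def is_null(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def merge_multiline(rows, date_col=0, particulars_col=1, sep=""):
    """Fold continuation lines (rows without a date) into the previous entry."""
    merged = []
    for row in rows:
        if not is_null(row[date_col]):
            merged.append(list(row))  # copy so the input is not mutated
        elif merged and not is_null(row[particulars_col]):
            # Continuation line: glue its text onto the previous particular.
            previous = str(merged[-1][particulars_col])
            merged[-1][particulars_col] = previous + sep + str(row[particulars_col])
        # Anything else (a stray blank row) is dropped.
    return merged


nan = float("nan")
sample = [
    ['22/06/17', 'IMPS-7-RAHUL-HDFC-XXXXXXXX', 'XXXX7', '22/06/17', nan, '1,000.00', '14,904.08'],
    [nan, '8-XXXX', nan, nan, nan, nan, nan],
]
entries = merge_multiline(sample)
```

Here entries holds a single row whose particular is the two pieces joined together.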

Credit, debit and default

As we saw in the columns of various bank statements, the differentiation between credit, debit and default is based on whether the entry is in deposit column or withdrawal column or in some cases whether mentioned as CR/DR.

Aim: Classify every transaction as credit, debit or default.

Solution: Classify all deposits as credit and withdrawals as debit. An event of default is defined as a withdrawal that leads to a negative closing balance and is immediately followed by a deposit of the same amount.

Example of a default transaction

https://medium.com/media/4ae5bcb990be3405c9bc8b51bb4cbd0a/href
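
The rule can be sketched like this. The dictionary keys (deposit, withdrawal, balance) and the function name classify are assumptions for this example, and so is the choice to label only the withdrawal as the default while the reversing deposit stays a credit.

```python
from decimal import Decimal


def classify(transactions):
    """Label each transaction as 'credit', 'debit' or 'default'.

    Each transaction is a dict with 'deposit' and 'withdrawal'
    (Decimal or None) and 'balance' (Decimal closing balance).
    """
    labels = []
    for i, txn in enumerate(transactions):
        if txn["deposit"] is not None:
            labels.append("credit")
            continue
        label = "debit"
        nxt = transactions[i + 1] if i + 1 < len(transactions) else None
        if (
            txn["withdrawal"] is not None
            and txn["balance"] < 0  # the withdrawal overdrew the account
            and nxt is not None
            and nxt["deposit"] == txn["withdrawal"]  # reversed right away
        ):
            label = "default"
        labels.append(label)
    return labels
```

Amounts are kept as Decimal rather than float so that the equality check between the withdrawal and the reversing deposit is exact.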

Categorising transactions

To perform analysis on the bank transactions, we need to categorise every bank transaction. Categorising enables us to perform category-specific operations and answer questions such as “how much does the customer spend on operations?” or “what are the different channels of earning?”. A category can be ATM, Shopping, IMPS, NEFT, etc.

Aim: Categorise every transaction.

Solution: For every transaction, tokenise the particular and based on the occurrence and position of keywords assign a category.

https://medium.com/media/d3cfbf69a48780c5156bfa02f0fc396c/href
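
A simplified sketch of keyword-based categorisation. The keyword table, the priorities and the "Other" fallback are hypothetical values chosen for illustration and are not our real mapping.

```python
import re

# (keyword, category, priority). A lower priority number wins.
# Hypothetical entries for illustration only.
KEYWORDS = [
    ("ATM", "ATM", 1),
    ("IMPS", "IMPS", 2),
    ("NEFT", "NEFT", 2),
    ("POS", "Shopping", 3),
]


def tokenise(particular):
    """Split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", particular.upper()) if t]


def categorise(particular, default="Other"):
    tokens = tokenise(particular)
    matches = []
    for keyword, category, priority in KEYWORDS:
        if keyword in tokens:
            # Rank by priority first, then by how early the keyword occurs.
            matches.append((priority, tokens.index(keyword), category))
    if not matches:
        return default
    return min(matches)[2]
```

Matching whole tokens rather than substrings avoids false hits such as finding "POS" inside "DEPOSIT".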

Cashflow analysis

Now that we have read, cleaned and categorised transactions from the bank statement, it’s time to generate some insights. After all, what’s data without information?

Cashflow analysis helps in

Analyzing the spending, buying and saving behaviour of the user
Checking whether the user is doing any side business
Calculating growth of the business
Checking whether the user has any running loans and their payment status
Calculating repaying capacity of the user
Analyzing recurring transactions

Overall analysis

This analysis gives an overall view of the total number and amount of credits, debits and defaults in the bank statement. It also contains a categorical breakdown of cash and non-cash transactions.

Monthly analysis

This analysis is a month-wise breakdown of the overall analysis of the bank statement. It helps in calculating the growth of the business.
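
A month-wise breakdown can be sketched as below. The transaction shape (date, type, amount) and the function name monthly_summary are assumptions for this example, and the date format follows the dd/mm/yy dates seen in the sample rows earlier.

```python
from collections import defaultdict
from datetime import datetime
from decimal import Decimal


def monthly_summary(transactions):
    """Total amount per type and transaction count, keyed by 'YYYY-MM'."""
    summary = defaultdict(
        lambda: {
            "credit": Decimal("0"),
            "debit": Decimal("0"),
            "default": Decimal("0"),
            "count": 0,
        }
    )
    for txn in transactions:
        month = datetime.strptime(txn["date"], "%d/%m/%y").strftime("%Y-%m")
        summary[month][txn["type"]] += txn["amount"]
        summary[month]["count"] += 1
    return dict(summary)
```

Comparing the credit totals of consecutive months gives a simple first indicator of growth.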

Defaults analysis

This analysis shows the total number and amount of defaults in the bank statement along with the details of every default.

Recurring transactions

To understand the spending behaviour of the user, we need to know the most common transactions and answer questions like “Are there multiple NEFT transactions to/from the same person/company?” or “Is he an IRCTC agent?”. We used the Ratcliff-Obershelp algorithm to club together transactions with more than 85% similarity. For better results, we removed numbers and special characters from the strings.
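
A minimal sketch of this grouping using Python’s difflib.SequenceMatcher, whose matching is closely related to the Ratcliff-Obershelp approach. Using difflib here is an assumption for illustration, as are the function names and the greedy "compare against the first member of each group" strategy.

```python
import re
from difflib import SequenceMatcher


def normalise(particular):
    """Drop digits and special characters, keep letters and single spaces."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", particular)
    return " ".join(letters_only.upper().split())


def group_similar(particulars, threshold=0.85):
    """Club together particulars whose cleaned text is over 85% similar."""
    groups = []  # each entry: (cleaned representative, [original particulars])
    for particular in particulars:
        cleaned = normalise(particular)
        for representative, members in groups:
            if SequenceMatcher(None, representative, cleaned).ratio() > threshold:
                members.append(particular)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append((cleaned, [particular]))
    return [members for _, members in groups]
```

Groups with many members point to recurring counterparties, which is the signal this analysis is after.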

Note: The code snippets above are pseudocode to demonstrate the idea and may not cover all the edge cases.