Stories by Haohui on Medium

So… How exactly is AI being used to detect COVID-19?

Haohui — Thu, 18 Feb 2021 16:58:27 GMT

Demystifying the mathematics behind Deep Learning & Convolutional Neural Networks (CNN)

Continue reading on TDS Archive »

I Built a Telegram Bot to Combat Food Wastage — Here’s How

Haohui — Tue, 05 May 2020 13:51:42 GMT

A Complete Python Guide to Telegram Bot Creation using python-telegram-bot

Continue reading on TDS Archive »

How to Deploy a Telegram Bot using Heroku for FREE

Haohui — Mon, 04 May 2020 16:18:23 GMT

Using the python-telegram-bot library

Continue reading on TDS Archive »

Adversarial Attacks in Machine Learning and How to Defend Against Them

Haohui — Thu, 19 Dec 2019 05:34:36 GMT

Notes from the Keynote Speech by Professor Ling Liu at the 2019 IEEE Big Data Conference

Continue reading on TDS Archive »

Responsible Data Science

Haohui — Thu, 12 Dec 2019 09:51:12 GMT

The Hidden Dangers of Data Science

Notes from the IEEE Big Data Conference Keynote Speech by Lise Getoor

Data science, machine learning, artificial intelligence: these are all buzzwords that have emerged in our society. We have grown increasingly reliant on these technologies, but this growing reliance also raises the question of how justified we are in giving them our complete trust. Machine learning and deep learning models are famously black boxes: we feed data into the model and get results that we take for granted, without really questioning how these results were obtained or whether the process is justified. This issue formed the backdrop of the keynote presentation by Professor Lise Getoor at the 2019 IEEE Big Data Conference held in Los Angeles on 10 December 2019, and I will now give an overview of the enlightening talk she gave.

Image by manfredsteger from Pixabay

Background

Data science has increasingly found itself in the spotlight, with growing coverage and attention all over the world. But whenever we see data science in the news, it is mostly for something bad. For instance, many of us are familiar with the Cambridge Analytica scandal, in which personal data from millions of people’s Facebook profiles was harvested without their consent and used for political advertising. Another example is how researchers claimed to have created a model able to infer criminality from facial images.

In the keynote speech, Professor Getoor mainly focused on responsible data science in machine learning, which I will now outline.

Machine Learning — A short introduction

Machine learning has undergone several revolutions, with several themes emerging over the past century: from Concept Learning, through Statistical Learning and Optimization-based Learning, to Deep Learning.

Image taken from IEEE Big Data 2019 Keynote. Reposted with permission

Concept Learning revolves around how machines can learn a logically consistent hypothesis that correctly labels the positive and negative samples. In the 1980s, machine learning moved towards statistical learning, in particular probabilistic methods, with a focus on learning a hypothesis that maximizes probability and data likelihood. Next, machine learning moved towards optimization-based methods, for example Support Vector Machines (SVM), where the hypothesis minimizes some loss function. Now we are in the neural-inspired age of deep learning, which represents the hypothesis as a neural network.

Training and Testing

In essence, the goal of training in machine learning is to minimize the loss between the target label and the predicted label. This is formulated mathematically as such:

Image taken from IEEE Big Data 2019 Keynote. Reposted with permission

During the testing phase, the learned model is tested to determine how well it is able to predict the target label. The error is then calculated as the sum of the loss between the target label and the predicted label, formulated mathematically as such:

Image taken from IEEE Big Data 2019 Keynote. Reposted with permission
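
The slides are not reproduced in this text. As a hedged reconstruction in standard empirical-risk notation (the symbols below are an assumption and may differ from those used in the keynote), training picks the hypothesis with the smallest total loss on the n training pairs, and the test error sums the same loss over m held-out pairs:

```latex
\hat{h} \;=\; \arg\min_{h \in \mathcal{H}} \; \sum_{i=1}^{n} \ell\big(h(x_i),\, y_i\big)
\qquad\qquad
\mathrm{Err}_{\text{test}}(\hat{h}) \;=\; \sum_{j=1}^{m} \ell\big(\hat{h}(x'_j),\, y'_j\big)
```

Here h(x) is the predicted label, y is the target label, and ℓ is the chosen loss function.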

This seems relatively straightforward: train a model to reduce the loss, and you can objectively quantify its performance by calculating the error.

So what could possibly go wrong? As it turns out, quite a lot.

The Things That Could Go Wrong

In total, Professor Getoor covered 7 things that could go wrong: formalization of the problem, dealing with high dimensional data, measuring error, interpretability in deep learning, causal modelling, bias, and data dignity. I will go through each in turn.

Issue #1 — Formalization

It may seem as though coming up with the training objective is easy, but in fact, every time we train a model, we are making some frame of reference commitment to what the data are, what the labels are, and what the loss function is.

Firstly, the transformation of raw data into feature vectors requires us to make a frame of reference commitment because raw data always contains much more social and historical context which cannot be represented by the feature vector. This means we would miss much of this important information from human data whenever we transform the raw data.

Next, the choice of labels is also problematic, because who gets to define the labels? Labels can only be proxies for true data, never the real replacement.

Lastly, the choice of loss function is important as well because different loss functions penalize errors differently, and trade-offs between the factors influencing model performance are often over-simplified and force-fitted into these loss functions which may not truly represent the task requirements.
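
To make "different loss functions penalize errors differently" concrete, here is a minimal sketch comparing two common losses on a small miss and a large miss. The numbers are made up for illustration and are not from the keynote.

```python
def squared_loss(target, predicted):
    """Squared error: large misses are penalized disproportionately."""
    return (target - predicted) ** 2


def absolute_loss(target, predicted):
    """Absolute error: the penalty grows linearly with the size of the miss."""
    return abs(target - predicted)


target = 10.0
small_miss, large_miss = 9.0, 0.0  # illustrative predictions

# How many times worse is the large miss than the small one under each loss?
squared_ratio = squared_loss(target, large_miss) / squared_loss(target, small_miss)
absolute_ratio = absolute_loss(target, large_miss) / absolute_loss(target, small_miss)

print("squared loss ratio:", squared_ratio)
print("absolute loss ratio:", absolute_ratio)
```

The same pair of predictions looks ten times worse under one loss and a hundred times worse under the other, so choosing the loss already commits us to a view of which errors matter most.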

Therefore, there is a need for some criterion to evaluate whether the chosen frame of reference is appropriate. One such criterion is known as “Structural Plausibility”: there should be a plausible scientific connection between the input features and the output label. If not, no matter how well your classifier performs, you should reject the hypothesis. For instance, inferring criminality from facial images does not pass this test, because the purported “scientific connection” between the facial images (the input features) and the classification (the output label) is not scientific at all. Instead, it is an artifact of how the images were chosen. Non-criminal images were likely chosen manually by the experimenter to convey a positive impression. In contrast, the criminal images were likely selected neither by the individual depicted nor with the aim of casting that individual in a favourable light. Therefore, the model is essentially a “smile detector”, and the connection purported to be discovered is not, in fact, a “plausible scientific connection”.

Issue #2 — High Dimensional Data

The next problem with machine learning is its huge reliance on data, both for training and testing. The issue arises with high dimensional data, because the danger of overfitting is much higher. High dimensionality also brings several related problems:

The curse of dimensionality means that our intuitions break down in high dimensions, so although we may still be able to rely on our intuitions when dealing with low dimensional data, we cannot do the same with high dimensional data.
The likelihood of finding a random subset of features that appear predictive but have no real relationship with the label is high, simply by virtue of the huge dimension of the data.
The sample size required for generalization grows rapidly with dimension (exponentially in the worst case), so high dimensional data demands far more samples.
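
The second point can be illustrated with a small simulation. This is only a sketch under stated assumptions: the features and labels are pure random noise, the sample size of 20 and the feature counts are arbitrary choices, and the helper names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_samples, n_features, seed=0):
    """Strongest |correlation| between random labels and any random feature."""
    rng = random.Random(seed)
    labels = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is independent noise, so any correlation is spurious.
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, labels)))
    return best


few = best_spurious_correlation(n_samples=20, n_features=5)
many = best_spurious_correlation(n_samples=20, n_features=2000)
print("best of 5 noise features:   ", round(few, 3))
print("best of 2000 noise features:", round(many, 3))
```

With only a handful of samples, searching through enough meaningless features will almost always turn up one that looks strongly "predictive".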

One example is how the NSA tried using machine learning to identify terrorists from their cell phone usage. This was highly problematic because they used 80 variables for each cell phone user with only 7 known terrorists. When they tried testing the model in the wild, it ended up identifying an Al Jazeera reporter covering Al Qaeda as a potential terrorist! This shows how high dimensional data often leaves us more prone to errors because the sample size requirement is much higher.

Issue #3 — Measuring Error

The next issue is that of measuring model performance. Researchers often proclaim that their new state-of-the-art models have reached XX accuracy or F1, and so on. However, such a claim comes with unspoken conditions: that the dataset has a well-defined population, and that both the training and test data are representative samples of that population. This almost never holds in practice. The image below illustrates the problem wonderfully:

Image taken from IEEE Big Data 2019 Keynote. Reposted with permission

The learned model, represented in green above, may seem to fit the true model initially. However, upon further testing, it may become evident later on that the learned model does not actually represent the true model.

Issue #4 — Interpretability in Deep Learning

Interpretability in deep learning has received an increased amount of interest in the past few years. This is important because although deep learning models may achieve excellent results, we cannot know for sure if the results are because the model has really learned the important features, or that the model actually learned the wrong features and it just so happened that the features remained unchanged for the images in the same category. The trouble comes when the wrong feature is changed while the important features remain the same. If the model learned the wrong features, it may then make a wrong prediction.

One example is the paper titled “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier” by Ribeiro et al. in 2016. They trained a model to classify between a Husky and a Wolf, but it turned out to be classifying the snow and grass in the background of the picture. It turned out that the snow in the image was used to classify the image as ‘wolf’, whereas grass in the image was used to classify the image as ‘husky’. As a result, when a husky was pictured with a snow background, it was wrongly classified as a wolf.

Image taken from Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135-1144).

Issue #5 — Causal Modeling: Correlation VS Causation

The issue of correlation versus causation is yet another frequently discussed topic, especially in statistics. The idea is that CORRELATION helps with prediction: if X and Y are positively correlated, then when we observe a high X, we would expect to see a high Y. On the other hand, CAUSATION is needed for decision making: if X and Y are causally connected, then manipulating the value of X while keeping everything else constant will change the value of Y. The danger of mistaking correlation for causation lies with CONFOUNDERS: a correlation is often due to one or more latent variables that are hidden causes of both X and Y.

For example, it may seem as though a rise in sales of ice creams would lead to a rise in the number of shark attacks. To the untrained eye, it may seem as though the rise in sales causes the rise in attacks. However, there is actually a confounder — the weather. It might just be the case that the hot weather was leading to a rise in ice cream sales as well as a rise in shark attacks (because more people go to the beach thanks to the good weather), and there was only a correlation and not causation between the sales of ice cream and rise in shark attacks.
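
The ice cream example can be simulated in a few lines. This is a toy sketch: the coefficients, noise levels and temperature range are invented for illustration, and the function names are hypothetical. In the observational run, temperature drives both quantities. In the intervention run, ice cream sales are set at random regardless of the weather, which is what "manipulating X while keeping everything else constant" looks like.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def sales_attack_correlation(n_days, intervene, seed=0):
    """Correlation between ice cream sales and shark attacks over n_days."""
    rng = random.Random(seed)
    sales, attacks = [], []
    for _ in range(n_days):
        temperature = rng.uniform(10, 35)  # the confounder
        if intervene:
            # We set the sales ourselves, ignoring the weather.
            ice_cream = rng.uniform(0, 100)
        else:
            # Hot days sell more ice cream.
            ice_cream = 3 * temperature + rng.gauss(0, 5)
        # Attacks depend only on temperature, never on ice cream.
        shark = 0.2 * temperature + rng.gauss(0, 1)
        sales.append(ice_cream)
        attacks.append(shark)
    return pearson(sales, attacks)


observed = sales_attack_correlation(2000, intervene=False)
intervened = sales_attack_correlation(2000, intervene=True)
print("observational correlation:", round(observed, 3))
print("after intervention:       ", round(intervened, 3))
```

The strong observational correlation disappears once sales no longer depend on the weather, which is the signature of a confounder rather than a causal link.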

Issue #6 — Bias and Fairness

Bias in machine learning can be categorized into three categories — data bias, automation bias and algorithmic discrimination.

Firstly, data bias refers to the choice of dataset. The contents of the dataset are affected by many factors, ranging from selection bias to institutional bias and societal bias. As the saying goes, garbage in, garbage out. If the input to the system is biased, then the output will be biased. For instance, Amazon came under fire for building an AI hiring tool that discriminated against women. The reason is simple: the training set contained mainly male resumes, hence the model began to learn that males would be better employees than females based merely on the sheer number of male resumes.

Secondly, automation bias refers to the preference that we human beings have for suggestions from automated decision-making systems, often to the point of ignoring contradictory information. We tend to believe the decisions made by automated systems just because they are automated, without sparing additional thought for the validity and justifiability of these decisions. The danger then comes when decision makers start abdicating decision responsibility to algorithms. It is especially tempting to rely on algorithms for making hard decisions, and this undermines accountability.

Finally, algorithmic discrimination refers to the phenomenon whereby algorithms can amplify, operationalize and finally legitimize institutional bias. When algorithms legitimize these biases, we may reach a point whereby we no longer question these biases that we used to look on with suspicion and instead embrace them. This would be extremely dangerous to our society.

This brings us to the problem of fairness. First of all, who is the fairness for? Different metrics matter to different stakeholders. For instance, a judge would want to minimize false negatives in trials, whereas a defendant would want to reduce the likelihood of false positives, of being convicted wrongly. When dealing with issues of fairness and bias, we must always keep in mind that fairness is a social and ethical concept and not a statistical concept. Bias is subjective and hence must be considered relative to the task.
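
The tension between false positives and false negatives can be seen on a toy classifier. This is a minimal sketch: the scores, labels and thresholds are invented, and the helper names are hypothetical. Moving the decision threshold trades one kind of error for the other, so which setting is "better" depends on whose interests are being weighed.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


def predict(scores, threshold):
    """Label a case positive when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Illustrative model scores and true labels (1 = positive, 0 = negative).
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
y_true = [0, 0, 1, 1, 0, 1]

strict = error_rates(y_true, predict(scores, threshold=0.7))
lenient = error_rates(y_true, predict(scores, threshold=0.3))
print("strict threshold  (FPR, FNR):", strict)
print("lenient threshold (FPR, FNR):", lenient)
```

Neither threshold is statistically "correct"; the choice between them is an ethical judgment about which mistake is more costly.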

Issue #7 — Data Dignity

This is the last issue raised. Data is the new currency and the data we each produce are highly valuable. However, our data are frequently misused without our consent or awareness, for example in the Cambridge Analytica scandal. Hence, there is a need for data dignity, which is the ability to understand and control how your data is being used. There should also be a concept of “data as labor”, which is the ability to get paid for use of your data. This is only right because data is the new currency of the world.

Conclusion

We have gone through a brief overview of machine learning and covered the seven things that could go wrong with it. This is definitely food for thought as we ponder how often we give our unquestioning trust to machine learning algorithms and the implications this can have on our society.

A huge thanks to Professor Getoor for the wonderful and insightful keynote speech on responsible data science. It was truly enlightening.

Responsible Data Science was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

I Built a Music Sheet Transcriber — Here’s How

Haohui — Tue, 26 Nov 2019 16:12:51 GMT

Translating from notes to ABC notation has never been so easy!

Continue reading on TDS Archive »

Preparing TIFF images for image translation with Pix2Pix

Haohui — Fri, 22 Nov 2019 15:59:34 GMT

Your guide to getting started with pix2pix using tiff images

Generative Adversarial Networks (GANs) have gained a lot of attention recently for the impressive feats they have achieved, including image generation, image translation, style transfer and image colorization. In particular, pix2pix, developed by Isola et al., has become very popular as a Conditional Generative Adversarial Network (CGAN), which allows users to generate images based on an input image. Some examples include translating from semantic labelled images to a street scene, daytime photos to nighttime photos, sketches to photos and so on.

Image taken from https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

All of this is very impressive, but currently pix2pix caters mostly to PNG and JPG images. This is unfortunate, as some tasks, for example medical imaging, use TIFF images, which are lossless and hence capture more accurate detail, whereas standard JPEG files are lossy. TIFF images can also store floating-point values, whereas standard PNG and JPG images store integer values, so it is important to preserve this precision when implementing pix2pix.

Furthermore, CGANs require images to be scaled to the range [-1, 1] for more stable training. However, normalization isn’t as straightforward as calling an out-of-the-box function, because TIFF images, unlike standard PNG and JPG images whose pixel values range from 0–255, can have varying ranges of values. In my case, my TIFF images had values ranging from 0–1200!

Keeping these points in mind, I will detail how you can apply pix2pix to your TIFF images successfully.

Understanding your data

First off, find the number of channels your image has. RGB images have 3 channels, whereas grayscale images only have 1. TIFF images can come with varying numbers of channels, so it is important to understand your image data before using pix2pix, because the later decisions you make in coding the architecture will depend on this. Use the following code snippet to find the number of channels your image has:

https://medium.com/media/c07e48e9cf6297c9e9178dba8c0af39e/href
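
The embedded snippet is not reproduced in this text. A minimal sketch of the same idea, assuming Pillow can decode your TIFF file, might look like the following. The helper names count_channels and describe_tiff are illustrative and not taken from the original snippet.

```python
import numpy as np


def count_channels(pixels):
    """Number of channels of an image array shaped (H, W) or (H, W, C)."""
    pixels = np.asarray(pixels)
    if pixels.ndim == 2:
        return 1  # grayscale: no explicit channel axis
    if pixels.ndim == 3:
        return pixels.shape[2]
    raise ValueError(f"unexpected image shape: {pixels.shape}")


def describe_tiff(path):
    """Open an image with Pillow and report channels, dtype and value range."""
    from PIL import Image  # Pillow, imported here so count_channels works without it

    with Image.open(path) as img:
        pixels = np.array(img)
    return {
        "channels": count_channels(pixels),
        "dtype": str(pixels.dtype),
        "min": float(pixels.min()),
        "max": float(pixels.max()),
    }


# Example on in-memory arrays; call describe_tiff("your_image.tif") on a real file.
print(count_channels(np.zeros((4, 5), dtype=np.float32)))
print(count_channels(np.zeros((4, 5, 3), dtype=np.uint8)))
```

Reporting the dtype and value range alongside the channel count is a design choice here, since both matter for the normalization step later on.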

Preparing your dataset

Now that you have a better understanding of your dataset, you have to prepare it. Pix2pix is unique because it requires paired images across the 2 domains that correspond exactly to each other. Hence, in the official PyTorch implementation, the images have to be combined side by side to produce a composite image of size (width * 2) x height. Keeping in mind the need to preserve the precision of the values of the TIFF file, I used the PIL library to open the images and then used numpy to concatenate the two images together.

First, prepare your dataset in the following format: folderA should contain the subfolders train, validation (if any), and test (if any) containing all the images in domain A, while folderB should contain the subfolders train, validation (if any), and test (if any) containing all the images in domain B. Take care to make sure the images in folderA and folderB have the same dimensions and the same names. Then, use the following code to generate your concatenated images. The destination path (dest_path) is the directory where you want your concatenated images to be saved. The resulting name will be the same as the original name in folderA and folderB.

https://medium.com/media/1bd5c83782477dcb516d70dd2dd1e0ca/href
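
The embedded snippet is not reproduced in this text. As a hedged sketch of the approach described above (open with PIL, concatenate with numpy), something like the following could work. The function names are hypothetical, combine_folders is assumed to be called once per split (for example on folderA/train and folderB/train), and saving relies on Pillow having a mode for the resulting array, which holds for 2D float32 data but not for every TIFF layout.

```python
import os

import numpy as np


def concat_pair(image_a, image_b):
    """Place two same-sized images side by side (width doubles, height stays)."""
    a = np.asarray(image_a)
    b = np.asarray(image_b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    # axis=1 is the width axis for both (H, W) and (H, W, C) arrays.
    return np.concatenate([a, b], axis=1)


def combine_folders(folder_a, folder_b, dest_path):
    """Concatenate same-named images from two folders and save them to dest_path."""
    from PIL import Image  # Pillow, imported here so concat_pair works without it

    os.makedirs(dest_path, exist_ok=True)
    for name in sorted(os.listdir(folder_a)):
        with Image.open(os.path.join(folder_a, name)) as img_a:
            a = np.array(img_a)
        with Image.open(os.path.join(folder_b, name)) as img_b:
            b = np.array(img_b)
        # No dtype conversion happens, so the original precision is kept.
        Image.fromarray(concat_pair(a, b)).save(os.path.join(dest_path, name))


# Example on in-memory arrays standing in for a domain A / domain B pair.
pair = concat_pair(np.ones((2, 3), dtype=np.float32), np.zeros((2, 3), dtype=np.float32))
print(pair.shape, pair.dtype)
```

Working on numpy arrays rather than resizing or converting through an 8-bit image mode is what keeps the float values intact.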

Normalizing your Data

Pix2pix uses the tanh activation function for the output layer of the generator model, which produces images with pixel values in the range of [-1, 1]. Hence it is important that the discriminator receives real images in the same range as those produced by the generator model. However, out-of-the-box solutions do not work because they assume the pixel values are in the range of 0–255, as is the case for normal PNG and JPG images. This doesn’t hold for TIFF images, as the range of pixel values varies for each image, so it is important to first find the minimum and maximum for the image before dividing. The code snippet below allows you to scale images based on the original pixel values:

https://medium.com/media/3846767bded809ec92eea5cd8c51d43a/href
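
The embedded snippet is not reproduced in this text. A minimal sketch of min-max scaling to [-1, 1], assuming the minimum and maximum are taken per image as described above, could look like this. The function names are illustrative, and mapping a constant image to zeros is an assumption made here to avoid dividing by zero.

```python
import numpy as np


def scale_to_unit_range(pixels):
    """Linearly map an image's own [min, max] range onto [-1, 1]."""
    pixels = np.asarray(pixels, dtype=np.float32)
    lo = pixels.min()
    hi = pixels.max()
    if hi == lo:
        # Constant image: assumed to map to 0 rather than divide by zero.
        return np.zeros_like(pixels)
    return 2.0 * (pixels - lo) / (hi - lo) - 1.0


def restore_range(scaled, lo, hi):
    """Invert scale_to_unit_range given the original minimum and maximum."""
    scaled = np.asarray(scaled, dtype=np.float32)
    return (scaled + 1.0) / 2.0 * (hi - lo) + lo


# Example with the 0 to 1200 value range mentioned earlier.
raw = np.array([0.0, 600.0, 1200.0])
scaled = scale_to_unit_range(raw)
print(scaled)
print(restore_range(scaled, lo=0.0, hi=1200.0))
```

Keeping the original minimum and maximum around lets you map generated images back to the original value range afterwards.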

Wrapping up

So that’s it! You have prepared your TIFF dataset and are ready to implement the pix2pix code, be it with the official Torch implementation, PyTorch, TensorFlow and so on. If you face any issues, let me know in the comments and I will try my best to help you.

This article has also been published here in my blog.

Preparing TIFF images for image translation with Pix2Pix was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Upgrade your memory on Google Colab FOR FREE

Haohui — Tue, 19 Nov 2019 04:16:26 GMT

Increase the 12GB limit to 25GB

Google Colab has truly been a godsend, providing everyone with free GPU resources for their deep learning projects. However, sometimes I do find the memory to be lacking. But don’t worry, because it is actually possible to increase the memory on Google Colab FOR FREE and turbocharge your machine learning projects! Each user is currently allocated 12 GB of RAM, but this is not a fixed limit — you can upgrade it to 25GB. Seems like the saying that “there is no free lunch in this world” doesn’t hold in this case…

So without further delay, I will introduce how you can get a free upgrade from the current 12GB to 25GB. This process is actually very simple and only requires 3 lines of code! After connecting to a runtime, just type the following snippet:

a = []
# Keep allocating until the runtime runs out of RAM and crashes (this is intentional).
while True:
    a.append('1')

Credits to klazaj on Github for this code snippet!

That’s it — how simple! Simply execute the block of code and sit back and wait. After a minute or so, you will get a notification from Colab saying “Your session crashed.” (Trust me, you will actually be happy for once seeing this message).

You will receive a message on the bottom left side of your screen saying your session has crashed

This will be followed by a screen asking if you would like to switch to a high-RAM runtime.

Yes, definitely more RAM please!

Click yes, and you will be rewarded with 25GB of RAM. How wonderful!

Notice the new 25.51 GB limit. (And yes, Corgi Mode!)

Of course, let’s all be responsible and make good use of this extra memory that Google has kindly provided for us. I am extremely grateful to Google for providing us this free platform to run our machine learning and deep learning projects. I have benefited greatly from this free service and will definitely be eternally grateful!

That’s it for this post. Here’s to wishing everyone great success in their machine learning endeavors!

This article has also been published here in my blog.

Upgrade your memory on Google Colab FOR FREE was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.