Stories by Sreyan Ghosh on Medium

Denoising Autoencoders Explained

Sreyan Ghosh — Wed, 14 Oct 2020 12:16:53 GMT

Deep Learning Demystified

Autoencoders are a term popularized by the numerous online courses on Deep Learning. But what are they? Are they actually of any use? Here, I make an effort to clear up those doubts.

Photo by Elijah O'Donnell on Unsplash

Anyone who has been acquainted with Deep Learning for a few months has surely come across the term autoencoder. Many industrial tasks require familiarity with the encoder-decoder architecture, and autoencoders fit right in. They are still an active area of research for unsupervised learning, but for now they have fairly specific use cases.

Not all of you will have the same level of experience as people who have been around DL for some time, so I will try my best to “keep things simple”, if there is any such thing.

Imagine this. You are taking a walk in the park when two people on a motorcycle snatch the gold necklace off your neck. A common incident, right? Could happen to anyone. In fact, it is so common that in 2018, in Delhi, 18 cases of chain snatching were reported per day. In most cases, the culprit is on a vehicle. Vehicles can be tracked down using the license plate number, right? Not if the CCTV camera meant to be there for our safety produces images in which the numbers are barely legible. This lack of clarity has rendered hundreds of pieces of footage inadmissible in court and has let culprits get off scot-free.

Enter denoising autoencoders (DAEs). After passing the image of the bike’s plate through such a model, we can reasonably expect a more legible result. Closely related models are used for super-resolution, which in essence means taking a low-res image and producing a high-res image from it. Refer to this link for a super-resolution demonstration with code: https://keras.io/examples/vision/super_resolution_sub_pixel/

Today I will tell you a bit about how DAEs work, go through some important points in the code, and leave you with a notebook on how to denoise images.

So, what are Denoising Autoencoders? The DAE is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output. We introduce a corruption process C(x̄ | x), which represents a conditional distribution over corrupted samples x̄ given a data sample x. The autoencoder then learns a reconstruction distribution p_reconstruct(x | x̄), estimated from training pairs (x, x̄), as follows:

1. Sample a training example x from the training data.

2. Sample a corrupted version x̄ from C(x̄ | X = x).

3. Use (x, x̄) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x | x̄) = p_decoder(x | h), with h the output of the encoder f(x̄) and p_decoder typically defined by a decoder g(h).
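
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions of mine: the corruption process is taken to be additive Gaussian noise clipped to [0, 1], and the names `corrupt`, `sample_training_pair` and `noise_std` are made up for this example; they are not from the notebook.

```python
import random

def corrupt(x, noise_std, rng):
    """Sample x_bar from an assumed C(x_bar | x): Gaussian noise, clipped to [0, 1]."""
    return [min(1.0, max(0.0, v + rng.gauss(0.0, noise_std))) for v in x]

def sample_training_pair(dataset, noise_std, rng):
    x = rng.choice(dataset)              # step 1: sample a clean example x
    x_bar = corrupt(x, noise_std, rng)   # step 2: sample a corrupted version x_bar
    return x_bar, x                      # step 3: x_bar is the input, x is the target

rng = random.Random(0)
# Toy "images": flat lists of pixel intensities in [0, 1] (hypothetical data)
dataset = [[0.0, 0.5, 1.0, 0.25], [1.0, 1.0, 0.0, 0.0]]
x_bar, x = sample_training_pair(dataset, noise_std=0.2, rng=rng)
```

The important detail is the order of the pair: the model sees the noisy `x_bar` and is scored against the clean `x`.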

The last few lines must have been a whirlwind for quite a few of you. If they were not, then voila, you already know how DAEs work. If they were, fret not. I was in your position not too long ago. Compare this to YouTube videos! Your training input is the 144p video frame and your target is the 1080p video frame. So even if your model errs a bit on the way, you will still be getting something like a 540p video. Upgrade!

So how do we create this DAE? I will leave a link to the notebook on my GitHub, so check it out. In this article, I will go through some of the core aspects of the DAE.

https://medium.com/media/2b72644de0a9154d880bdcd544d74f27/href

The above snippet creates the convolutional autoencoder class. The dataset we are working with is Fashion MNIST. We add a random value to each pixel, effectively adding noise to the dataset. Coming back to our autoencoder class, we can see 2 convolutional layers in the encoder and 3 in the decoder. Fancy names aside, the purpose of this “encoder-decoder” structure is to learn a general mapping from a noisy image to a clear one.

Next, we compile the model and train it using the “fit” method. We have defined our class so that it inherits the methods of the TensorFlow Model class.

https://medium.com/media/2b5265e3933331a9fbec5ad3fce6ddca/href

I would encourage all of you to code along while reading. That way, if you run into a problem, you will already know where the error is happening and what the logic behind your code is.

Please feel free to refer to my GitHub repository for any kind of assistance in this matter: https://github.com/sreyan-ghosh/tensorflow_files/blob/master/Others/autoencoders_tf.ipynb

It would also be a pleasure to connect with you all on LinkedIn: https://www.linkedin.com/in/sreyan-ghosh-b0722a18b/

Denoising Autoencoders Explained was originally published in VITMAS on Medium, where people are continuing the conversation by highlighting and responding to this story.

Understanding Regularization Techniques in ML and DL

Sreyan Ghosh — Tue, 05 May 2020 07:54:18 GMT

Insight into Regularization Techniques

A simple yet comprehensive outlook on techniques that reduce compute time and man-hours invested in training ML and DL models.

Photo by Joe Gardner on Unsplash

The overview:

Now that industries are starting to accept “Artificial Intelligence” as an important part of predicting their company’s success, Machine Learning and Deep Learning are making their way into companies’ job profiles. But the actual decision-makers in a company (the people calling the shots: CxOs) often have a very misguided notion of what these techniques can do and how their company can benefit. People who don’t fully understand ML often see it as a technology that can solve any and all industrial problems. The following picture makes the current state of ML quite clear.

This is not satire but a fairly accurate picture of ML. It is the ability of computers, without being hard-coded, to predict events when given a set of precursor events. I will try to keep this blog as non-math as possible, but I would like to note that ML, in essence, is the act of estimating the functional relation F between x and y given many example (x, y) pairs.

But many a time, even after a model has been trained to an acceptable training accuracy, it fails miserably when put to work on the test cases.

This happens due to overfitting, that is, making the function approximate the training data too closely. Instead of understanding the general idea of how to solve the problem, the model learns the training data by rote. The following picture makes it clear.

The real function is a sinusoid (green) and we are trying to predict it from the given data. Up to the third figure, we see the model learning very well. It is an almost perfect approximation of the function, even though it does not pass through every data point. But as training continues, we see the function molding itself to fit all the data points and taking a form quite different from what is desired. This is overfitting: the training loss is driven to zero but the test loss rises.

Understanding Bias-Variance Tradeoff and the need for Regularization:

Bias is, mathematically, the difference between the expected value of the model’s prediction and the actual value of the function. We won’t be going into the underlying statistics of bias, but I will, responsibly, leave you with a scary-looking equation:
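
The equation image did not survive here, so what follows is a standard textbook form rather than necessarily the exact one originally shown. The notation is an assumption: f is the true function, f̂ is the learned model, and the expectation is taken over different training sets.

```latex
\mathrm{Bias}\big[\hat{f}(x)\big] = \mathbb{E}\big[\hat{f}(x)\big] - f(x)
```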

To make things clear, the bias of a simple, linear model is high and that of a complex, multidimensional model is low. This is because a complex model is better at fitting all the training data.

Variance shows up as the change in prediction accuracy of an ML model between training data and test data. Error due to variance is the amount by which the prediction, over one training set, differs from the expected value over all the training sets. In other words, it is how far the predictions of models trained on different training sets are from each other. Another equation follows to scare you folks off.
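
Again, the original equation image is missing. A standard way to write the variance, using the same assumed notation as for the bias (f̂ is the learned model, expectation over training sets), would be:

```latex
\mathrm{Var}\big[\hat{f}(x)\big] = \mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^{2}\Big]
```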

A simple model has a low variance whereas a complex one has a high variance.

The following graph can be used to establish the concepts of Bias and Variance clearly. The left end of the graph is the zone with high bias as both the training and the testing error are high. This is the zone of underfitting or the zone where the model has not learned enough.

The right end of the graph is an area of high variance, where the training error is low but the testing error is high. This is the zone of overfitting, where we see that even though the model has achieved a high training accuracy, and it seems like the model is near perfect, it performs poorly on test data. This is a sheer waste of computational power and the engineer’s time.

The middle zone, where both the bias and the variance are low, even if not the lowest possible, is the best zone for a model. Striking this balance is known as the Bias-Variance Tradeoff.

There are various methods we can use to achieve the Bias-Variance Tradeoff. These methods are known as Regularization Techniques.

Some common ones are:

L2 Regularization
Early Stopping
Dataset Augmentation
Ensemble methods
Dropout
Batch Normalization

L2 Regularization:

Keeping things as simple as possible, I would define L2 Regularization as “a trick to not let the model drive the training error to zero”. If only things were that simple…

While training a model, we continuously update the various variables (weights and biases; w & b) that try to predict our original function. This update takes place based on an “update rule” like Gradient Descent (which we won’t talk about). This update rule depends on the “loss function”, which is a function of these variables. If things are getting complicated, bear with me. Our aim is to minimize this “loss function”. And that’s quite intuitive, isn’t it? In any profitable industrial situation, you strive to minimize the loss. Simple, innit?

So, we minimize the loss function during training. What is special about the L2 technique is that instead of minimizing the training loss, we minimize a different version of it.
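
The equation image is missing at this point. A common way to write the regularized objective is given below; treat it as a sketch in assumed notation, where L is the ordinary training loss and λ is a hyperparameter (my addition) that controls how strongly model complexity is penalized.

```latex
\tilde{L}(w, b) = L(w, b) + \lambda \, \lVert w \rVert_2^2
```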

The first term in the equation above is the ‘loss term’, which measures how well the model fits the data. The last term is what the math folks relate to the log of a Gaussian prior over the weights. This measures the model complexity. For us laymen, it is the sum of squares of all the feature weights (w). Here again, responsibly, I leave you with:
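
The image is missing here as well. The sum of squares of the feature weights described above is usually written as follows, assuming n feature weights w_1 to w_n:

```latex
\lVert w \rVert_2^2 = w_1^2 + w_2^2 + \dots + w_n^2
```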

This model complexity is what the L2 technique quantifies. Feature weights close to zero barely affect it, but outlier weights have a huge impact. The more weight we give the above term, the more the bias increases and the variance decreases. From the above graph, it is evident that at the end of training, the bias is very low and the variance is high. So, if we were to increase the bias and decrease the variance, we would effectively land somewhere in the ‘good model’ zone of the graph. So now, we have a good model! Yay!
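
To see the penalty at work, here is a small sketch in plain Python. The assumptions are mine: a one-feature linear model y ≈ w·x + b, mean squared error as the loss term, plain gradient descent as the update rule, and made-up data. The names `fit_linear`, `l2` and `lr` are hypothetical and not from any repository.

```python
def fit_linear(xs, ys, l2=0.0, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on MSE + l2 * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the loss term (mean squared error)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Gradient of the complexity term l2 * w**2 (the bias is left unpenalized)
        grad_w += 2 * l2 * w
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 2.1, 3.9, 6.2]
w_plain, _ = fit_linear(xs, ys, l2=0.0)  # ordinary least squares fit
w_reg, _ = fit_linear(xs, ys, l2=1.0)    # same data, with the L2 penalty
```

With the penalty switched on, the learned weight is pulled toward zero, which is the "more bias, less variance" trade described above.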

Early Stopping:

This is, by far, the simplest regularization technique (well, all of them are, but you wouldn’t believe me, would you?). As training proceeds, we record the values of the variables (w & b) at which we obtain the lowest validation error. We stop training when we see the validation error rising again and keep the recorded values. This is a very useful procedure, but the downside is that when training very deep neural networks or very complex models, repeatedly writing and rewriting these best values takes a lot of processing power.
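
The bookkeeping behind early stopping can be sketched as below. Two things here are assumptions of mine: the `patience` parameter (how many non-improving epochs to tolerate before stopping, a common extension) and the list of validation losses, which is made up. In a real training loop you would also save w & b whenever a new best is found.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, best_loss, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New minimum: this is where w and b would be recorded
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, best_loss, epoch  # validation error keeps rising
    return best_epoch, best_loss, len(val_losses) - 1

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, best_loss, stop_epoch = early_stopping(val_losses, patience=2)
```

Setting `patience=1` gives the strict version described above, where training stops at the first rise in validation error.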

Dataset Augmentation:

Training a model to a good prediction state is only possible when we have a lot of data to train it on. In other words, it is quite easy to drive the training error to zero if there is too little data. Let’s take the example of training a neural network on image classification. Say we have 1000 images to train the model on. Wouldn’t it be better if we had, say, 3000 images to train it on? Without procuring extra data, we can easily “augment” the current images and create “new” ones. These are not, in fact, new to us, but to the model, they are as new as they come.

So what is augmentation? It is the act of artificially generating new data from the already available data by introducing certain variations in it while retaining the original labels. These variations depend on the type of data we are dealing with. For audio, speeding up the sample or introducing some background noise is an augmentation technique. This doesn’t change the label value. For text, we can replace a word with a synonym without changing the message it conveys. For images, we can change the viewing angle, the zoom, the lighting, and use other techniques that change the image but retain its label. Here’s some cuteness to cancel out your boredom from reading this blog and make image augmentation clear.

So now we have more data to feed our model, which makes it more difficult for it to memorize the entire thing, and therefore the training error isn’t driven to zero. Kinda like your history test, innit?
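
Here is a tiny image augmentation sketch in plain Python. Everything in it is an assumption for illustration: the image is a nested list of grayscale pixel values in [0, 255], and `flip_horizontal`, `adjust_brightness` and `augment` are hypothetical helper names. Real pipelines would use a proper image library.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def adjust_brightness(image, delta):
    """Shift every pixel by delta, keeping values inside [0, 255]."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]

def augment(image, label):
    """Turn one labelled image into three; the label is never changed."""
    return [
        (image, label),
        (flip_horizontal(image), label),
        (adjust_brightness(image, 40), label),
    ]

image = [[0, 50, 100],
         [150, 200, 250]]
samples = augment(image, "cat")  # 1 image in, 3 training samples out
```

This is how 1000 images can become 3000 without collecting anything new: the pixels differ, but a flipped or brighter cat is still a cat.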

Ensemble Methods:

Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance, bias, or improve predictions.

The above paragraph is Google’s definition of Ensemble Methods, and I’ll try to break it down for you. In this technique, we employ multiple model architectures to predict an output, be it classification or regression. Let’s say models A, B and C are given the task of classifying a dog: model A says it’s a cat, but B and C say it’s a dog. If we believe what the majority says, we arrive at the correct output, but if we were to trust the output of the first model alone, we would have erred. The same goes for regression, or value prediction: we take the weighted average of the predictions given by the 3 models to arrive at our final output. This decreases the chance of an error and improves accuracy.

The interesting part is that we needn’t spend resources on 3 different models either. We could train the same model 3 times on different batches of the data. That would serve the purpose as well. But you get the idea, don’t you?
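
The two ways of combining predictions described above can be sketched as follows. The model outputs and the weights are made up, and `majority_vote` and `weighted_average` are hypothetical helper names.

```python
from collections import Counter

def majority_vote(predictions):
    """Classification: return the label predicted by most models."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_average(predictions, weights):
    """Regression: combine numeric predictions, trusting some models more."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

# Models A, B and C look at the same picture of a dog
label = majority_vote(["cat", "dog", "dog"])

# Three value predictions, with the third model trusted twice as much
value = weighted_average([10.0, 12.0, 14.0], [1.0, 1.0, 2.0])
```

Model A’s mistake is simply outvoted, and the weights give you a knob for how much each model’s opinion counts in the regression case.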

Dropout:

Dropout is also often classified as an ensemble method. But I, for fun, think of it as the reverse. In ensemble methods you ‘ask’ the opinion of other models to arrive at a conclusion, but here you are basically silencing some of the contributors. Let me make it clear.

This is a very simple neural network whose purpose is to be a True/False classifier. Look at the output layer (green). It has 2 blobs, one that gives the probability that the output is True and the other False. The sum of the two values: you guessed it: 1! Aren’t you smart? XD.

The idea here is to make you understand that these ‘blobs’ are called nodes. Each of these nodes has a ton of complex calculations happening inside them. Remember the stuff I was talking about in L2 Regularization? It all happens here. So these nodes are the actual contributors to the output.

Dropout involves randomly turning off certain nodes during training. This changes the architecture of the model and the way information flows through the nodes. Doing this makes the model a more robust predictor. The model has to predict the same outputs with some of its contributors turned off. That’s like saying you need to get through your quiz without your topper friends being around. You gotta learn. Get it? XD.
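
Here is a minimal sketch of what turning nodes off looks like on one layer’s outputs. It uses the common “inverted dropout” variant, which is my assumption and not something stated above: surviving activations are scaled up by 1 / keep_prob so the expected total stays the same, and at prediction time nothing is dropped. The function name and values are hypothetical.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob (training only)."""
    if not training or drop_prob == 0.0:
        return list(activations)  # at prediction time every node contributes
    keep_prob = 1.0 - drop_prob
    # Kept nodes are scaled up so the expected sum of the layer is unchanged
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(42)
activations = [1.0] * 1000
dropped = dropout(activations, drop_prob=0.5, rng=rng)
```

A fresh random mask is drawn on every training step, so no single node can be relied upon and the others have to learn to cover for it.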

The Conclusion:

So that sums up my blog on regularization techniques. I intentionally did not provide you with information on Batch Normalization as that would have required me to give you the entire process of training a neural network and that would have gone against the main idea behind this blog: keeping things simple.

If you are itching to know how to code these in Python using PyTorch, refer to the following repository on GitHub. The batchnorm_dropout.ipynb file will be of interest. I will be uploading TensorFlow files to another repo as well, to have the code in both these frameworks.

https://github.com/sreyan-ghosh/pytorch_files

I’ve had an amazing time writing this out for you folks and I hope you could take away something from this. If you liked it, leave a clap. If you didn’t, you probably would’ve left the page long back. And if you have any queries, please feel free to comment down there. I’ll be looking forward to clearing your doubts.

I love making new friends, so here is my LinkedIn ID. Please connect if you wanna chat or even if you don’t. XD.

https://www.linkedin.com/in/sreyan-ghosh-b0722a18b/

Understanding Regularization Techniques in ML and DL was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.