Stories by Hassan Abedi on Medium

An Indexing Checklist for MongoDB Atlas Users

Hassan Abedi — Sat, 12 Oct 2024 19:58:08 GMT

MongoDB Atlas is a fully managed cloud database service that makes deploying, operating, and scaling MongoDB clusters easy. It takes care of infrastructure management so you can focus on building your applications. However, even though it handles many database operations automatically, optimizing query performance through proper indexing is still something users and developers must manage.

Adding indexes to the collections in MongoDB on fields that are frequently searched and accessed is crucial. Queries can become slow without the right indexes, especially as your data grows. Indexes make searching faster but can add overhead and slow down write operations if overused or applied incorrectly.

This article provides a checklist of things to consider when working with indexes, particularly in the MongoDB Atlas environment. The aim is to help you keep queries as efficient as possible while avoiding common indexing pitfalls.

1. Avoid Adding Indexes to Small Collections

Generally, it’s a good idea not to add indexes to small collections (for example, collections with fewer than 200 documents). This is because the overhead of maintaining the indexes outweighs the benefit in such cases. The exception is when the collection is read frequently, and queries become slow without the index.

2. Indexes Slow Down Writes

Every index adds work to writes, because MongoDB must also update the index when an indexed field is inserted or changed. To limit this cost, avoid adding indexes to collections where the ratio of reads to writes is low, such as collections where most operations are updates or inserts.

3. Use the Performance Advisor

MongoDB Atlas comes with a tool named Performance Advisor, which can give recommendations for adding and removing indexes based on the performance of the queries that access the data in the collection. It’s a good idea to regularly check its recommendations and apply them as needed.

4. Index Fields Used in where and sort

It’s usually beneficial to have an index on fields used in a where (filter) or sort operation to optimize query performance. In the case of sort, the values of indexed fields are effectively stored in sorted order, which allows MongoDB to return sorted results without needing additional processing time for sorting.

5. Use the Query Profiler

In MongoDB Atlas, you can use the Query Profiler tool to identify slow queries and add indexes to the relevant fields. When an index is used in a query, the planSummary field in the profiler output will show one or more IXSCAN, indicating that MongoDB is using an index to retrieve documents. For example, here’s an output from the Query Profiler showing compound indexes being used:

{
  // ...
  "planSummary": "IXSCAN { customerId: 1, transactionDate: 1, location.city: 1, hasAttachments: 1 }, IXSCAN { customerId: 1, transactionDate: 1, location.city: 1, hasAttachments: 1 }, IXSCAN { customerId: 1, transactionDate: 1, location.city: 1, hasAttachments: 1 }"
  // ...
}
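As a rough illustration of reading such output, the sketch below pulls the index key patterns out of a planSummary string. The helper names (`index_scans`, `uses_index`) and the shortened sample string are assumptions made for this example; they are not part of Atlas or any MongoDB driver.

```python
import re


def index_scans(plan_summary: str) -> list[str]:
    """Return the key pattern of every IXSCAN stage named in a planSummary string."""
    return re.findall(r"IXSCAN (\{[^}]*\})", plan_summary)


def uses_index(plan_summary: str) -> bool:
    """True when at least one IXSCAN stage appears in the summary."""
    return bool(index_scans(plan_summary))


# Shortened, hypothetical planSummary value in the same shape as the one above
summary = (
    "IXSCAN { customerId: 1, transactionDate: 1 }, "
    "IXSCAN { customerId: 1, transactionDate: 1 }"
)

for pattern in index_scans(summary):
    print("index scan on", pattern)

# A collection scan reports COLLSCAN instead of IXSCAN
print(uses_index("COLLSCAN"))
```

A check like this could be run over exported profiler entries to list the queries that still fall back to a collection scan.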

6. Optimize the Examined:Returned Ratio

Ideally, the ratio between the number of index keys examined and the number of returned documents (shown in the Query Profiler as the Examined:Returned Ratio) should be small. A ratio close to 1 is best, meaning that MongoDB examines roughly one index key per returned document, although this may not always be achievable. Note that a ratio of exactly zero is not a good sign, because it means that no index keys were examined, that is, no index is being used at all.
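The ratio itself is simple arithmetic. The sketch below computes it from the `keysExamined` and `nreturned` counters that MongoDB reports for an operation; the function name and the sample numbers are hypothetical and only serve as an illustration.

```python
def examined_returned_ratio(keys_examined: int, n_returned: int) -> float:
    """Index keys examined per returned document.

    When nothing is returned the ratio is undefined, so this sketch reports
    infinity if keys were still examined and 0.0 otherwise.
    """
    if n_returned == 0:
        return float("inf") if keys_examined else 0.0
    return keys_examined / n_returned


# Hypothetical counters for one slow query
entry = {"keysExamined": 1200, "nreturned": 40}
ratio = examined_returned_ratio(entry["keysExamined"], entry["nreturned"])
print(f"{ratio:.1f} keys examined per returned document")
```

Here the query examines 30 index keys for every document it returns, which suggests the index is not very selective for this filter.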

7. Use Atlas Search Indexes for Text Search

When using the $search stage, it’s important to have an Atlas Search Index on the field being searched. For example, if you’re performing a text search (such as a wildcard or regex search) on the vehicle.registrationNo field, you should have an Atlas Search Index on vehicle.registrationNo. The index name must also be specified in the $search stage:

// ...
{
  "$search": {
    "index": "searchIndex", // Name of the search index should be given here
    "compound": {
      "must": [
        {
          "wildcard": {
            "query": "ABC1*",
            "path": "vehicle.registrationNo",
            "allowAnalyzedField": true
          }
        }
      ]
    }
  }
}
// ...

Note that the $search stage and Atlas Search Indexes are available only in MongoDB Atlas.

8. Order of Fields in Compound Indexes

The order of fields matters in compound indexes. Fields used in equality conditions (like =) should come first, followed by fields used in sorting, and then fields used in range conditions. For example, if you have a query like this:

// ...
{
  "vehicle.registrationNo": "ABC123",
  "vehicle.model": "Sedan",
  "vehicle.year": {
    "$gte": 2010
  }
}
// ...

The compound index should be created with fields in this order: vehicle.registrationNo, vehicle.model, and vehicle.year.
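The ordering rule can be sketched as a small helper that builds the key list for a compound index. The function `esr_index_keys` is hypothetical; the list of (field, direction) pairs it returns is the general shape that drivers such as PyMongo accept when creating an index, but no driver is called here.

```python
def esr_index_keys(equality, sort, range_):
    """Order index keys as equality fields, then sort fields, then range fields.

    `equality` and `range_` are lists of field names; `sort` is a list of
    (field, direction) pairs, where direction is 1 (ascending) or -1 (descending).
    """
    keys = []
    seen = set()
    ordered = [(f, 1) for f in equality] + list(sort) + [(f, 1) for f in range_]
    for field, direction in ordered:
        if field not in seen:  # keep only the first occurrence of a field
            seen.add(field)
            keys.append((field, direction))
    return keys


# The query above: two equality conditions, no sort, one range condition
keys = esr_index_keys(
    equality=["vehicle.registrationNo", "vehicle.model"],
    sort=[],
    range_=["vehicle.year"],
)
print(keys)
```

If the same query also sorted by some other field, that field would be placed between vehicle.model and vehicle.year.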

9. High Cardinality Fields

Indexes work best on fields with high cardinality (fields with many unique values). By contrast, indexes on fields with low cardinality are often not as useful. For example, a field like status with only two possible values (active or inactive) won’t benefit much from indexing. MongoDB may still need to scan many documents to find those with status: active, which defeats the purpose of the index.
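One rough way to gauge cardinality before adding an index is to compare the number of distinct values with the number of documents in a sample. The sketch below does this on plain Python lists; the function name and the sample data are made up for illustration.

```python
def cardinality_ratio(values) -> float:
    """Distinct values divided by total values (1.0 means every value is unique)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical samples of 1000 documents each
statuses = ["active", "inactive"] * 500
emails = [f"user{i}@example.com" for i in range(1000)]

print(cardinality_ratio(statuses))  # 2 distinct values out of 1000
print(cardinality_ratio(emails))    # every value is distinct
```

A ratio near zero, as for status, hints that an index on that field alone will not narrow the search much.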

10. TTL Indexes for Automatic Document Deletion

TTL (or time-to-live) indexes can be used to automatically delete documents from a collection after a certain period. This is especially useful for collections that store session data, logs, or cache data. Note that a TTL index must be a single-field index, and the indexed field must hold a BSON date (or an array of dates); documents whose field holds any other type, including a timestamp, never expire.
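To make the expiry rule concrete, the sketch below computes when a document becomes eligible for deletion, given the date stored in the indexed field and the index's `expireAfterSeconds` value. The function names and sample dates are hypothetical. A background task removes expired documents periodically, so the actual deletion can lag slightly behind the computed time.

```python
from datetime import datetime, timedelta, timezone


def expires_at(indexed_date: datetime, expire_after_seconds: int) -> datetime:
    """Earliest moment at which the document becomes eligible for deletion."""
    return indexed_date + timedelta(seconds=expire_after_seconds)


def is_expired(indexed_date: datetime, expire_after_seconds: int, now: datetime) -> bool:
    """True once `now` has reached the document's expiry time."""
    return now >= expires_at(indexed_date, expire_after_seconds)


# Hypothetical session document created at noon UTC with a one-hour TTL
created = datetime(2024, 10, 12, 12, 0, tzinfo=timezone.utc)
print(expires_at(created, 3600))
print(is_expired(created, 3600, now=datetime(2024, 10, 12, 12, 30, tzinfo=timezone.utc)))
```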

Top DuckDB CLI Commands That You Should Know

Hassan Abedi — Fri, 27 Sep 2024 17:42:39 GMT

DuckDB is a fast in-process analytical database system. It is designed to be used as an embedded database, which means it can be easily integrated into applications and run within the same process as the application code. However, a command-line version of DuckDB is also available.

In this post, we will explore a few useful commands (also called dot commands) of the command-line version of DuckDB, which we refer to as DuckDB CLI in the remainder of this post.

Getting Started

Before we start, make sure you have already downloaded the DuckDB CLI for your OS from the DuckDB website and that it is ready to run.

DuckDB CLI (the file is named duckdb on GNU/Linux systems) Started in a Terminal Emulator

After downloading the DuckDB CLI, we need to create a toy table that we will use to demonstrate the commands and their functionalities. The table will be called employees and will have the following schema:

-- Create the `employees` table
create table employees
(
    id         integer,
    name       varchar,
    department varchar,
    salary     decimal
);

-- Insert some records into the table
insert into employees
values (1, 'Alice', 'Engineering', 100000),
       (2, 'Bob', 'Engineering', 90000),
       (3, 'Charlie', 'Sales', 80000),
       (4, 'David', 'Marketing', 70000),
       (5, 'Eve', 'Sales', 60000);

1. Using .mode to Change the Output Format

The .mode command allows us to change the output format of the query results. By default, the output format is duckbox, which displays the results in a tabular format with extensive aesthetic formatting. However, we can change the output format to many other formats, including: csv, json, latex, or even markdown. See DuckDB's documentation on supported output formats for the full list.

Example output in different modes:

-- Show the current output mode
.mode

-- Show the result in LaTeX format
.mode latex
select * from employees;

-- Show the result in CSV format
.mode csv
select * from employees;

-- Show the result in JSON format
.mode json
select * from employees;

2. Using .timer to Measure Query Execution Time

The .timer command lets us measure the execution time of queries. In other words, when we enable the timer, DuckDB will display the time taken to execute each query.

Example usage of .timer:

-- Turn the timer on
.timer on

-- Run a query
select * from employees;

-- Turn the timer off
.timer off

3. Using .schema to Display Table Structure

The .schema command displays the structure or schema of a table by printing the statements that create it. The displayed information includes column names, column types, and any constraints or indexes (like primary keys or foreign keys) associated with the table.

Example usage of .schema:

-- Create an index on the `department` column
create index idx_employees_department on employees (department);

-- Show the schema for the `employees` table
.schema employees

4. Using .read to Execute Queries from a File

The .read command lets us execute queries (SQL code) stored in a file. This is a very useful command when we have a large number of queries that we want to run in sequence or when we want to reuse queries across multiple sessions.

Example usage of .read:

-- Execute queries from the `queries.sql` file
.read queries.sql

Example content of queries.sql:

select * from employees;

5. Using .once to Redirect Output to a File

The .once command allows us to redirect the output of the next query to a file. This can be useful when we want to save the results of a query to a file for later use.

Example usage of .once:

-- Redirect the output of the next query to `output.txt`
.once output.txt

-- The (next) query
select * from employees;

6. Using .shell to Execute Shell Commands

The .shell command lets us execute shell commands from the DuckDB CLI.

Example usage of .shell:

-- List the files in the current directory
.shell ls -l

7. Using .prompt to Customize the Prompt Text

The .prompt command allows us to customize the prompt in the DuckDB CLI. For example, we can change the prompt to display the database name or any other custom text.

Example usage of .prompt:

-- Change the prompt to `EmployeeDB>` (the database name)
.prompt "EmployeeDB> "

select * from employees;

8. Using .changes to Display the Number of Affected Rows

The .changes command allows us to display the number of rows affected by a query with side effects. This is useful when you want to know how many rows were inserted, updated, or deleted by a query.

Example usage of .changes:

-- Turn on tracking changes
.changes on

-- Insert a new row into the `employees` table
insert into employees values (6, 'Frank', 'Engineering', 95000);

-- Turn off tracking changes
.changes off

9. Using .indexes to Display the Index Names

The .indexes command lets us display the names of the indexes on a table.

Example usage of .indexes:

-- Create an index on the `department` column
create index idx_employees_department on employees (department);

-- Get the names of the indexes on the `employees` table
.indexes employees

10. Using .nullvalue to Customize How NULL Values are Displayed

The .nullvalue command allows us to customize how NULL values are displayed in DuckDB CLI. By default, NULL values are displayed as an empty string. However, using .nullvalue, we can change this to any other string.

Example usage of .nullvalue:

-- Show NULL values as "N/A"
.nullvalue "N/A"

-- Insert a row with a NULL value in the `department` column
insert into employees (id, name, department, salary) values (7, 'Grace', null, 85000);

-- Show the contents of the `employees` table
select * from employees;

11. Using .cd to Change the Working Directory

The .cd command allows us to change the working directory to another directory.

Example usage of .cd:

-- Change the working directory to /tmp
.cd /tmp

Conclusion

In this post, we explored some useful DuckDB commands that can help us work more efficiently with the command-line version of DuckDB. These commands can help us customize the output format, measure query execution time, display table schemas, execute queries from a file, etc. For more information on the DuckDB CLI and its full feature list, please check out DuckDB’s documentation.

Best Books and Courses to Learn about Reinforcement Learning in 2022

Hassan Abedi — Mon, 09 May 2022 12:09:27 GMT

Source: https://flic.kr/p/aav5nQ

Introduction

Deep learning is an approach to designing, training, and building machine learning models that has become extremely popular in recent years. Arguably, deep learning’s popularity is mainly due to three reasons. Firstly, deep learning models have performed very well in many tasks in areas such as natural language processing, computer vision, and speech recognition in comparison to alternative approaches such as tree-based methods (like decision trees) and support vector machines. Secondly, deep learning is versatile, which means it can be used to solve an extensive range of problems. Thirdly, deep learning has, to a large extent, removed the need for feature engineering, which was an enduring challenge in building machine learning models.

Reinforcement learning is an area of machine learning whose main goal is to train an agent that aims to maximize a cumulative reward by taking actions in an environment. The main application of reinforcement learning is to create an agent that can solve a problem or perform a task that was previously solved or performed by a human being. As an example of such a task, reinforcement learning has been used to train an agent that plays a video game like Super Mario World.

Deep reinforcement learning comes about when deep learning is used to approximate different components of a reinforcement learning-based system, such as the value function or the policy. Deep reinforcement learning is a growing area for researchers and practitioners alike. This article aims to present a compact list of high-quality resources, including books and online courses, to help anybody curious about reinforcement learning in general, and deep reinforcement learning in particular, get started quickly. Moreover, most of the books listed are available as downloadable PDFs, and the code for the examples shown in the books is mostly available.

At any rate, I hope these resources will help you in your journey to learn more about (deep) reinforcement learning concepts and tools.

Books

Courses

Other resources

Acknowledgement

The resources introduced in this article are from this blog post by Yanzhe Bekkemoen.

Getting Started With Apache Spark Using Databricks Community Edition

Hassan Abedi — Mon, 17 Jan 2022 11:35:21 GMT

Remaining ruins of city of Babylon (source: Wikipedia)

This is a mini-tutorial that aims to help the reader get started using Databricks Community Edition to process data on an Apache Spark cluster. It walks through the steps needed to get familiar with the Databricks Community Edition environment.

Step 1: Getting started

The first thing to do is go to the webpage for Databricks Community Edition and create a user account (if you do not already have one).

Databricks Community Edition’s login page

When you have finished creating your account, log into Databricks Community Edition.

Databricks Community Edition’s environment after you have logged into it

Now go to https://github.com/habedi/datasets and download the contents of the folder datascience.stackexchange.com to your machine. It includes the data that we are going to use here.

Step 2: Creating a Spark cluster

Now we need to create an Apache Spark cluster to run our code on. To do so, click on “Compute” on the vertical dark green line on the left side of the screen.

After pressing the “Compute” button, this page shows up

Now press “Create Cluster” to create your cluster. In Databricks Community Edition, you have to set a name for your cluster and select its Databricks runtime. Note that we can set more low-level configurations, but I will not go through their details in this tutorial. Different Databricks runtimes mainly differ in their Spark version; I chose the default Databricks runtime (runtime 9.1 LTS), which comes with Spark 3.1.2. I named the cluster “MyCluster”.

Creating an Apache Spark cluster in Databricks Community Edition named “MyCluster”

The cluster has been created and is running and ready to use

Step 3: Installing libraries on the cluster

Usually, at the start, we have to install some additional libraries and packages on our cluster to do something useful with our data. Imagine we want to do some graph analytics on our Spark cluster in Python. To do that, we have to install the appropriate version of Graphframes, a graph processing library for Apache Spark. To install libraries on our cluster, click on our cluster’s name in the “Compute” section and click on “Libraries”. Libraries can be installed from different sources; here we use the Maven repository to download and install the artefacts related to Graphframes on our cluster. Apache Spark has Python, R, and Java (and Scala) APIs. Depending on the API we are using, we can install libraries for these programming languages and environments.

We can install libraries on our Spark cluster by clicking on the name of our cluster on the page that opens after pressing “Compute”; then, by pressing “Install New” under the “Libraries” tab, we can open the page for installing a library

After pressing the “Install New” button, we can choose the repository of the library we want to install and install it on our cluster

Here, we search for the Graphframes library and pick the newest version that matches the version of Spark installed on our cluster, which in our case is Graphframes 0.8.2 for Spark 3.1.x

Finally, after finding and selecting a library, we can press the “Install” button to install the library on the cluster

As you can see, Graphframes is installed on MyCluster and is ready to use

Now that the cluster is created and running, we can upload the data in the Databricks Community Edition environment.

Step 4: Uploading the data

In many real-world scenarios, we have the data in CSV format and want to do some analytics or train a machine learning model. Here, let’s assume we want to upload four compressed CSV files. (These are the files that you downloaded to your machine in step 1 from https://github.com/habedi/datasets.) To move the files to our cluster, click on the icon named “Data” located on the vertical line on the left side of the screen to open the user interface for managing the data.

The interface for data management can be accessed by clicking on the “Data” on the vertical dark green line on the left side of the screen

Now press the “Create Table” button to upload your data to DBFS. DBFS is the name of the file system that Databricks Community Edition uses to store and access data. In general, you can connect to different data providers to get your data, but in this tutorial, we assume that you have the data on your machine and need to move it to DBFS.

To upload the data to DBFS, choose the “Upload File” button, select your files, and press “Open”.

Uploading files from your machine to DBFS

As you can see, files have been successfully uploaded from your machine to DBFS and are available under the path “/FileStore/tables/NAME_OF_FILE”

Optionally, at this stage, we can select an uploaded CSV file and turn it into a Hive table. The main benefit of doing so is that we can directly run Spark SQL queries over a table that is stored as a Hive table on DBFS. But this is not a requirement; we will still be able to load the compressed CSV files that we have already uploaded to DBFS into Spark DataFrames.

An example table with inferred schema using “Creating Table with UI”
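Loading an uploaded file into a Spark DataFrame without creating a Hive table could look roughly like the sketch below. It assumes the `spark` session object that Databricks notebooks provide, a CSV file with a header row, and a hypothetical file name; the helper functions are made up for this example and are not part of Databricks or Spark.

```python
def dbfs_table_path(file_name: str) -> str:
    """Build the DBFS path under which files uploaded through the UI are stored."""
    return f"/FileStore/tables/{file_name}"


def load_csv(spark, file_name: str):
    """Read an uploaded CSV file into a Spark DataFrame.

    `spark` is the SparkSession that a Databricks notebook makes available.
    The header and inferSchema options are assumptions about the file layout.
    """
    return spark.read.csv(dbfs_table_path(file_name), header=True, inferSchema=True)


# In a notebook attached to the cluster (hypothetical file name):
# df = load_csv(spark, "posts.csv.gz")
# df.printSchema()
```

Spark picks the decompression codec from the file extension, so a gzip-compressed CSV file can usually be read the same way as an uncompressed one.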

Step 5: Processing the data on the cluster

At this stage, we are ready to start our real work, which is doing something interesting with the data we have already uploaded to our cluster. To do so, we can open a notebook and choose the API we want to use to work with the data in the cluster. By API, I mean the default programming language that we will use in our notebook. Press “Workspace” to open the side panel for creating a Jupyter notebook.

We can create a new or access an existing notebook under “Workspace”

We can create a Jupyter notebook with the default programming language set to Python to process our data on the MyCluster Spark cluster

Now copy the code below into the notebook you have just created and run the notebook by pressing the “Run All” button on the top.

https://medium.com/media/d4060d1bb35e8989142c6f29d4a02bff/href

When you run the above code in the notebook that you have just created, you should be able to see something like this in the output:

Congratulations, you have a running Spark cluster ready to use to solve your Big Data and Data Science problems.

One final thing: remember that Databricks Community Edition is a free service, so it naturally comes with some limitations. Apart from the limits on your cluster’s resources, like the number of CPUs and the amount of RAM, a cluster that has been inactive for two hours is shut down. When a cluster is shut down due to inactivity, you need to delete it and create a new one. This can be a nuisance, but it does not affect the data you have stored on DBFS. Usually, the only significant problems are the hassles of creating a new cluster and installing the appropriate libraries on it again.

Moreover, I suggest reading the following books to get familiar with Apache Spark and its applications:

Learning Spark — Lightning-Fast Data Analytics; https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
Learning Apache Spark with Python; https://runawayhorse001.github.io/LearningApacheSpark/
Spark: The Definitive Guide; https://pages.databricks.com/definitive-guide-spark.html

Writing a HelloWorld Spark application with IntelliJ IDE and Python 3 in Windows 10

Hassan Abedi — Tue, 16 Feb 2021 00:26:04 GMT

Florence, Italy; source: https://flic.kr/p/2jPh9KA

Introduction and context

In this tutorial, I want to show you how to set up a minimal working environment to develop Apache Spark applications on your Windows machine. So, without further ado, let’s go!

Step 1: installing Java SE Development Kit 11

First, go to Oracle’s website and download the Java SE development kit 11 (JDK 11) installer file for Windows 64bit from there. Then run the file you just downloaded to install the JDK 11 on your computer. After installation finishes, you can check if Java 11 is available on your computer by executing `java -version` in the Windows Command Prompt or PowerShell.

Java 11 is successfully installed and available system-wide to use

Step 2: installing Python 3

Go to Python’s website and download the Windows binary installation file for Python 3.9. Then execute the downloaded file to install Python 3. After the installation finishes, you should be able to start Python 3’s interactive shell by executing `python` in the Windows Command Prompt or PowerShell.

Python 3.9 installed and ready to use

Step 3: downloading and configuring Spark 3

First, create a folder named `BigData` in the root of your `C:` drive. Go to Apache Spark’s website and download Spark 3’s binary files (for Hadoop 2.7) to the path `C:\BigData` on your computer. Then extract the downloaded file in that directory (that is, in `C:\BigData`). After this step, the contents of the `BigData` folder should look like this:

At the time of writing of this tutorial the newest stable version of Apache Spark was Spark 3.0.1

Now rename the spark-3.* folder to spark3. Then go to https://github.com/cdarlint/winutils and download the files in this git repository as a Zip file using the green `Code` button on the top right corner. (By default the downloaded file will be named `winutils-master.zip`.)

Downloading `winutils` binary files from https://github.com/cdarlint/winutils

Move the file you downloaded to path C:\BigData and extract it there.

If everything went correctly so far, you should have two folders named `spark3` and `winutils-master` in `C:\BigData`

Now you need to set a few environment variables in Windows 10. To do so, type `edit the system environment variables` in the Windows search bar (in the bottom left corner) and press ENTER. You should now see the `System Properties` window (see the picture below).

You can add, remove, and edit the environment variables in Windows by pressing the environment variable button at the bottom of the `System Properties window`

Press the button labeled `Environment Variables` in the `System Properties` window. You should now see the `Environment Variables` window (see the picture below).

You can add, remove, or edit environment variables for your user (or for all the users) from the `Environment Variables window`

Now press the New button on the top (`User variables`) and add the following environment variables.

Setting the `HADOOP_HOME` environment variable to `C:\BigData\winutils-master\hadoop-2.7.7`

Setting the `SPARK_HOME` environment variable to `C:\BigData\spark3`

Now select the `Path` environment variable in the top panel (or list) and press the `Edit button`.

Editing the Path environment variable for the user (the top panel)

Now add `%HADOOP_HOME%\bin` and `%SPARK_HOME%\bin` to the `Path` environment variable.

`%HADOOP_HOME%\bin` and `%SPARK_HOME%\bin` are added to the `Path`. (You can add new values to the `Path` using the `New` button.)
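These settings can be sanity-checked from Python with a small sketch like the one below. The function `missing_spark_settings` is hypothetical. It assumes Windows conventions: entries in `Path` are separated by semicolons and compared case-insensitively. When it reads from `os.environ`, Windows has already expanded `%HADOOP_HOME%` and `%SPARK_HOME%` to their values.

```python
import os


def missing_spark_settings(env=None):
    """Return a list of problems found in the Spark-related environment variables."""
    env = os.environ if env is None else env
    problems = []
    # Windows separates Path entries with ';' and ignores case in paths
    path_entries = [
        p.strip().rstrip("\\/").lower() for p in env.get("PATH", "").split(";")
    ]
    for name in ("HADOOP_HOME", "SPARK_HOME"):
        home = env.get(name)
        if not home:
            problems.append(f"{name} is not set")
            continue
        bin_dir = home.rstrip("\\/") + "\\bin"
        if bin_dir.lower() not in path_entries:
            problems.append(f"{bin_dir} is not on PATH")
    return problems


# Values from this tutorial, passed explicitly so the check also runs elsewhere
example_env = {
    "HADOOP_HOME": r"C:\BigData\winutils-master\hadoop-2.7.7",
    "SPARK_HOME": r"C:\BigData\spark3",
    "PATH": r"C:\Windows;C:\BigData\winutils-master\hadoop-2.7.7\bin;C:\BigData\spark3\bin",
}
print(missing_spark_settings(example_env))
```

Calling `missing_spark_settings()` without arguments in a newly opened terminal checks the real environment; an empty list means both variables and both `bin` folders were found.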

Now open PowerShell, type spark-shell, and press ENTER. Wait until Spark’s interactive shell has started, then open http://127.0.0.1:4040 in your web browser.

Checking Spark installation by running spark-shell in Windows PowerShell

Apache Spark 3's shell Application UI; here you can monitor your Spark services and tasks

Step 4: installing IntelliJ IDE with Python plugin

Go to IntelliJ IDE’s website and download the IntelliJ IDE. You can download and use either the free community edition or the ultimate edition of the IDE. (Students can get a free license to use the ultimate edition of the IDE; they only need a university email to get one.) After the installation has finished, open the Settings window under the File menu and go to the Plugins section. Make sure the Python plugin for IntelliJ is installed.

Ensure the Python plugin is installed; you can search it under the path `File/Settings/Plugins`

Step 5: Writing and executing a Hello World Spark application

In IntelliJ IDE, create a new Python project (go to `File/New/Project`) and select Python 3.9, which you installed in step 2 of this tutorial, as the `Project SDK`. Then press the Next button twice.

Select the Python 3.9 you installed in step 2 of this tutorial as the Project SDK to be used for the project

Now pick a name for the project and a path to save the project files, and press the Finish button. (I chose the name PySparkHelloWork for my project and saved it in my Documents directory under a folder named IntelliJ, as you can see in the picture.)

The project PySparkHelloWork is ready

Open the Terminal window in the bottom left corner of the project’s main window and execute `pip install pyspark findspark` in it. Wait until `pip` finishes installing these two packages.

Now create a folder named `src` in the project and create a new Python script called main (or `main.py`) inside the `src` folder. Then copy the Python code you see in the box below into the `main.py` file and save it.

https://medium.com/media/fb586b9a8e1d0f24940c6bdbfb762d38/href

Run (or execute) the `main.py` script. You now should be able to see the results in the output.

The output of the HelloWorld Spark application after a successful execution

Congratulations! You have developed your first Spark application in Python. Happy making (more!) Spark applications. :0)

How to install and run a simple Apache Spark cluster on Windows 10?

Hassan Abedi — Wed, 12 Feb 2020 22:40:59 GMT

How to set up a very simple Apache Spark cluster on Windows 10

Introduction

This is a step-by-step guide that aims to help the reader install Apache Spark on a Windows machine. The Spark installation shown in this tutorial is the typical bare-minimum, single-instance cluster installation that you need to get a useful Spark development environment. Typically, you could use this new Spark installation to develop, test, and debug your Spark applications in languages like Java, Scala, and Python. The instructions in this tutorial are mainly for setting up a Spark environment in Ubuntu (running inside Windows). Still, the main steps will apply to other Unix(-like) OSes such as macOS without much modification. We assume throughout the tutorial that your Windows machine is connected to the internet.

Currently Apache Spark is the most popular open-source Cluster Computing framework

Step 1: Installing Ubuntu 18.04 app

Windows 10 comes with a feature that allows its users to have a fully functional Linux environment within Windows. To install Ubuntu 18.04 LTS, go to the Microsoft Store and search for the Ubuntu 18.04 LTS app. To be able to install Ubuntu, you must first make sure that the Windows Subsystem for Linux is installed on your machine. You could read more about how to install the Ubuntu app on the Windows 10 operating system from here.

Ubuntu 18.04 LTS app in Microsoft Store

Step 2: Setting a username and a password for your new Ubuntu environment

At this point, we assume that you have successfully installed the Ubuntu 18.04 LTS app on your machine. When you start the Ubuntu app for the first time, it asks for the username and password you want to use when working in the Ubuntu app. For the sake of simplicity, we used sparkusr for both the username and the password here.

Once you have installed the Ubuntu app, you can start it for the first time by pressing the Launch button

When you start the Ubuntu app for the first time, you need to provide a username and password; in our setup we used “sparkusr” for both

Step 3: Setting up your Ubuntu environment

At this point, you need to properly configure your Ubuntu environment. You can do so by installing the software packages that you need to start Apache Spark’s services inside your newly installed Ubuntu. You can install these packages by running the following commands in your terminal emulator:

https://medium.com/media/0bf4d55865295f0ac22d2f20b98e2d52/href

One important thing to remember is that we need the Java 8 Runtime Environment to run the current stable version of Spark (version 2.4.x).

Step 4: Installing Apache Spark

At this step, you need to go to Apache Spark’s website and download Spark’s pre-built binary file. At the time of writing this tutorial, the latest stable version of Spark was 2.4.5, which you could download from here. To download and install Spark, you can execute the following commands in your terminal:

https://medium.com/media/669aeeaa2ebf388c464289a46eacdbf0/href

At this point, if there was no problem in your setup, both the SPARK_HOME and JAVA_HOME environment variables should be set up correctly and you are ready to go! You can check this by running the commands echo $JAVA_HOME and echo $SPARK_HOME in your terminal; the output should look like the picture below:

If JAVA_HOME and SPARK_HOME are correctly set up, you should be able to see their correct values in the terminal

Step 5: Starting Spark services

At this point, you can start your Spark cluster services by executing the following commands in your terminal:

https://medium.com/media/f4b1309cc95fc53036d984645293bb28/href

You can check various indicators for your Spark cluster (e.g., the number of workers and the resources available in your cluster) by going to Spark’s web interface dashboard. By default, the web UI is available at http://localhost:8080/.

Spark master service is running with no available workers to execute a Spark application in the cluster

Once you have started the worker service successfully, it attaches itself to the master and shows up as a worker in your cluster, ready to execute Spark jobs

You can start Spark’s shell by running spark-shell in your terminal.

In the Spark shell environment, you can run Spark SQL and Scala code that is going to be executed on your cluster

When you have finished your work with the cluster, you need to stop the Spark services. The general pattern is to stop the worker (slave) service first and then stop the master. The easiest way to shut down our simple cluster with our current settings is to execute stop-slave.sh and then stop-master.sh in the terminal. If everything went well when you shut down the cluster, you will see no Java processes belonging to your Spark cluster services. You can check this by running the htop command in the terminal and checking the output. There should be no Spark-related Java processes running inside your Ubuntu environment.

When you have shut down your cluster, or did not start it in the first place, you should see no processes related to Spark running within your Ubuntu environment

Conclusions

In this brief tutorial, we described the general steps to install and set up a very rudimentary Apache Spark cluster on a machine running the Microsoft Windows 10 operating system. Apache Spark is a sophisticated open-source cluster computing framework with many different modes of operation and settings. What we described here is suitable for people who want to get familiar with Spark for the first time, or for people who wish to develop Spark applications on their own machines. The resulting cluster at the end of this tutorial can be used as a mock cluster for learning purposes rather than for real-world workloads. Nevertheless, working with something as exciting and useful as Apache Spark can always be enjoyable for everybody :0).

Additional resources:

With the growing popularity of machine learning nowadays, it is natural to think that people will use Apache Spark in their machine learning pipelines. So, here is a nice little Spark tutorial for those who want to deploy their machine learning models onto a Spark cluster: https://neptune.ai/blog/apache-spark-tutorial.

How to win a Data Science challenge in Kaggle?

Hassan Abedi — Thu, 06 Feb 2020 16:46:55 GMT

Introduction

This article is about winning, or at least landing a decent top rank in, a Data Science competition in Kaggle. It has been written mainly for a general audience. Nowadays everyone is talking about Data Science, AI, and Machine Learning, and how the future of the world depends on the technologies associated with these hot topics. Within this context, Kaggle is THE PLACE for Data Science enthusiasts.

What is Kaggle?

“Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.” — Wikipedia

The most striking feature of Kaggle is that it is currently the de facto place where Data Science competitions are held. For instance, many big companies, such as Microsoft, Google, and Facebook, have held various competitions in Kaggle for different reasons. The reasons to post a challenge range from getting publicity, recruiting talented people, and exploiting cheap labour to, last but not least, gathering new insights about the ways they (the companies) could tackle their data-driven problems.

Kaggle is not a new thing, but it has managed to attract a lot of attention over time!

Source: Wolfram Alpha

Why might you be interested in entering a Kaggle competition?

For most people, there could be various good reasons to compete in a Kaggle competition, including:

● Trying to learn the skills for Data Science and get experience. Working on a real-world Data Science problem is hard; some people believe that you must spend at least 10k hours on something to become a professional in it

● Build a profile for your career. Having a Kaggle profile can be a good thing to have on your resume if you want to get a job related to Data Science

● For the money. Most Kaggle challenges do not offer much of a monetary prize to the top teams, but there are still a few high-profile challenges with prizes in the million-dollar range. The difference in motivation between challenges with and without prize money is psychological. The amount of money in the award is not essential, but it helps attract many people from lower-income strata, so more people join the challenge

● For the thrill of the competition, and the coolness factor involved. For many people, it is fun to compete with others; it is somewhat like playing an online video game such as WoW or Fortnite, an entertaining activity. Also, competing in Kaggle has a certain aura of coolness around it. Years ago, at the start of the current millennium, it became apparent that the future of computing lay in scaling computers, which led to new technologies and trends. Cluster Computing (CC) was, and still is, one of those critical trends, and you could find descriptions like this one explaining why people were so excited about CC at that time:

o “Coolness Factor: There is just something really cool about playing with clusters. Although a purely subjective factor, it nonetheless has driven much activity, especially among younger practitioners who react to many of the empowerment issues above but also to the attraction of open source software, unbounded performance opportunities, the sand box mindset that enables self-motivated experimentation, and a subtle attribute of the pieces of a cluster “talking” to each other on a cooperative basis to make something happen together. There is also a strong sense of community among contributors in the field, a basis for a sense of association that appeals to many people. Working with commodity clusters is fun!” — Encyclopedia of parallel computing, Springer

By replacing the terms related to cluster computing (i.e., the boldened terms) with the right words (i.e., Data Science, ML, and AI), we could say almost all of those enticing things about Data Science, AI, and ML.

The central assumption here is that the audience (YOU!) knows enough about algorithms and programming to be able to use Kaggle. The other important assumption that I’m making is that people learn best by doing! This article does not try to teach you how to code, how to understand and implement algorithms, or how to read and analyse other people’s models.

Choose your challenge wisely

The starting point for entering a challenge in Kaggle is to pick a challenge that, firstly, interests you and, secondly, matches the skills and resources you have. One way to categorise the challenges in Kaggle is by the format and shape of the data used for the competition. In general, this method classifies the competitions into two types. The first type is challenges with tabular data, where the data is naturally represented in tables with columns and rows. The second type is challenges where the data is not innately represented as rows of records packed into tabular file formats such as CSV or MS Excel. The main differences between these two types of challenges are shown in the table below:

A binary classification of Kaggle competitions based on their data

This categorisation is not totally accurate, because there are many competitions whose data is a mixture of tabular and non-tabular data; for the sake of simplicity, let’s keep our simple classification throughout this text :-).

Generally, the type of data in the challenge you choose determines the skills you need to win that challenge, or at least to be near the top of the leaderboard.

Learn the skills you need

The minimum requirement to start working in a Kaggle competition is to be able to develop the code to submit a prediction. In many cases, people don’t start from scratch; instead, they try to use others’ code written in Python or R to start. Nevertheless, to be able to do anything serious in Kaggle, you need to have many more skills under your belt. These skills can be summarised in the following lines:

Good understanding of the software development process and related tools
Good knowledge of the primary machine learning and data mining algorithms
An analytical mind
Enough knowledge of linear algebra, optimisation, and statistics
Enough knowledge of data processing and management tools and technologies

The following picture shows an incomplete list of relevant terms associated with the required skills mentioned above:

A word cloud of many of the important terms and concepts relevant to a Data Science challenge

Find teammates

One important thing that could help you a lot while you are in a Kaggle competition is finding other people and teaming up with them. The people you look for should usually have two characteristics. Firstly, they should be better than you in some way, so you can hope to learn and improve yourself by interacting with them, possibly via osmosis. Secondly, they should preferably think differently from you. Many people approach a problem in similar ways, but for a predictive ML challenge, teammates who think the same way as you are not going to add much!

Working with smart people with refreshing ideas could be very rewarding!

Employ an agile workflow

The simple idea behind building a predictive model is that there exists a collection of functions that accept some input variables and give a good enough approximation of the labels or target variables in the dataset. During the lifespan of a challenge, you need to be able to absorb new ideas and add them to your models. This could include reading other people’s code, comments, replies, and posts to find and collect useful information about their approaches. Usually, the hard part is not finding helpful information in other people’s work and models, but somehow adding their ideas to your workflow. The problem is that you usually start a challenge with a simple model, and over time your model becomes more and more complex, so there is a looming risk that at some point in the challenge your model gets stuck in a local minimum or maximum of the score function used in the challenge.

You should automate as much as possible. Every time you build and train a model, you will notice that parts of your work are repeated. If your workflow contains many repetitive subtasks, there is an excellent chance that you can automate them to save your precious time. Subtasks such as data preprocessing (e.g., imputing the missing values), Exploratory Data Analysis (EDA), and feature engineering are good candidates for automation.

Setting up an efficient, agile Data Science workflow is an interesting challenge in itself!
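As a small illustration of turning a repeated subtask into a reusable function, here is a minimal sketch of mean imputation in plain Python. The function name impute_missing, the column names, and the toy rows are made up for this example, and mean imputation is only one of several possible strategies; in a real challenge you would likely use a data-frame library instead.

```python
from statistics import mean


def impute_missing(rows, columns):
    """Return a copy of `rows` with None in `columns` replaced by the column mean."""
    fill = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        # If a column has no observed values at all, leave its gaps as None.
        fill[col] = mean(observed) if observed else None
    return [
        {key: (fill[key] if key in fill and value is None else value)
         for key, value in row.items()}
        for row in rows
    ]


# Hypothetical toy data with two missing values.
rows = [
    {"age": 20, "income": 1000},
    {"age": None, "income": 3000},
    {"age": 40, "income": None},
]
clean = impute_missing(rows, ["age", "income"])
print(clean)
```

Because the function returns a new list and leaves the input untouched, you can call it at the start of every experiment without worrying about the state of the raw data.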

Understand the challenge description

Read the challenge description very carefully, then answer the following questions:

Do you understand what type of challenge it is? Is it a classification, regression, or ranking challenge?
What is the score or error function used in the challenge? Why did they pick this particular function?
Does the challenge require a significant amount of domain expertise?
How large are the train and test datasets?
How raw is the data? How much time do you think it will take to build a starter model?
Does the challenge require you to have access to a particular type of resource?
Is it a kernel challenge? What limitations are in place?
How long is the life span of the challenge? How much time can you invest in it?
Are there any previous challenges like this one you intend to take on? If yes, what can you scavenge or reuse from the previous similar challenges?

The point of answering these questions is to take on a challenge with a proper contextual understanding. If, for example, building a model requires a lot of computational resources (e.g., GPUs or TPUs), or specific domain knowledge about the data is a critical advantage, and you have access to neither of these, you could head into the challenge hoping for a good standing on the final private leaderboard, only to be disappointed at the end.
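To see why the question about the score function matters, consider this minimal sketch. It compares two common regression metrics, mean absolute error (MAE) and root mean squared error (RMSE), on made-up predictions; the two "models" and their numbers are purely illustrative and not taken from any real competition.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalises large individual errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [10.0, 10.0, 10.0, 10.0]
model_a = [12.0, 12.0, 12.0, 12.0]  # always off by 2
model_b = [10.0, 10.0, 10.0, 17.0]  # perfect three times, off by 7 once

print("MAE :", mae(y_true, model_a), mae(y_true, model_b))
print("RMSE:", rmse(y_true, model_a), rmse(y_true, model_b))
```

Under MAE the second model looks better, while under RMSE the first one does, so the same pair of submissions would be ranked differently depending on the metric the organisers picked.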

Do not skip Exploratory Data Analysis

Exploratory Data Analysis is an essential activity in every Data Science process. It is not something that you can learn by reading books; it is more like an art than an industrial process. But you should take it very seriously when you start a challenge in Kaggle. A typical EDA pipeline may encompass, but is not limited to, the following subtasks:

Using univariate and multivariate data analysis techniques
Data visualisation — it may include dimensionality reduction steps
Statistical hypothesis testing
Taking summary statistics from the data
Using clustering, anomaly detection/outlier detection, novelty detection techniques

EDA can be very time-consuming, especially when the dataset is large or quite raw (i.e., it needs a lot of preprocessing first). The main idea is to make the data talk to you! There are people in Kaggle who prepare great EDA kernels, and you can learn from their work.
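As one concrete example of the "summary statistics" subtask listed above, here is a minimal sketch that uses only Python's standard statistics module. The function name summarise and the toy price column are invented for illustration; real EDA would cover many columns and usually include plots as well.

```python
from statistics import mean, median, stdev


def summarise(values):
    """Basic summary statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        "stdev": stdev(present) if len(present) > 1 else 0.0,
    }


# Hypothetical column with one missing value and one suspiciously large value.
prices = [12.0, 15.0, None, 14.0, 13.0, 250.0]
stats = summarise(prices)
for name, value in stats.items():
    print(f"{name:>8}: {value}")
```

Here the mean ends up far above the median, which is a quick hint that the column may contain an outlier worth a closer look.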

Start off with a robust validation

Always construct a high-quality validation setup for your model as early as possible. The main point of this work is to make sure, before anything else, that you are not going to overfit (or underfit) on your validation or test data. Ideally, once your validation strategy is ready, you should get more or less the same score on the competition’s leaderboard as you get on your validation. It is easier said than done, though. When you have an excellent validation setup, the rest of your effort is mainly focused on three things: finding or building the best features, finding the best single model or ensemble, and finally tuning your model’s hyper-parameters!

Winning is hard, so prepare to lose but hope to win
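The text above does not prescribe a particular validation scheme. One widely used option is k-fold cross-validation, and the following is a minimal sketch of how the index splits can be generated in plain Python. The function name k_fold_indices and its parameters are our own; it also assumes that the samples are independent, so time-ordered or grouped data would need a different splitting strategy.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, valid_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # fixed seed keeps the folds reproducible
    folds = [indices[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    for i, valid_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, valid_idx


splits = list(k_fold_indices(10, k=5))
for fold_number, (train_idx, valid_idx) in enumerate(splits):
    print(fold_number, sorted(valid_idx), len(train_idx))
```

You would train on each train_idx, score on the matching valid_idx, and then compare the average of the fold scores with what you see on the leaderboard.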

To win a Kaggle competition, you need to compete against many other smart and hardworking people from all over the world. To get a gold medal, you usually have to occupy one of the top 10 to 15 places on the final leaderboard; to get a silver medal, you have to be within the top 5%; and to get a bronze, no further than the top 10%. It is apparent that winning (i.e., finishing in first place) is not going to be easy, because many of the people you are trying to beat may have unique advantages over you. Winning a Kaggle challenge depends on many factors, including but not limited to having a good understanding of the domain of the data, possibly having access to high-end computational resources that give an advantage throughout a competition, having a good grasp of the cutting-edge algorithms in the area, and finally, being lucky!

One good strategy could be to focus on a niche. In other words, try to compete in specific challenges in which you have some kind of upper hand. For instance, the current number one Kaggler, bestfitting, is almost totally focused on challenges where the data is an image dataset. One distinguishing feature of his approach is that he heavily uses Deep Neural Networks in his work.

The profile page of bestfitting, the top-ranked Kaggler at the moment of writing this article

Conclusion

It may seem obvious to you that winning or getting a medal in a Kaggle competition is not an easy task at all. You are right; winning is hard, mainly because it requires a mixture of proper amounts of work, knowledge, experience, and, very importantly, a little bit of luck! But by investing the right amount of time and effort, and with Lady Luck on your side, it is not impossible to achieve.

How to win a Data Science challenge in Kaggle? was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.