<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Hassan Abedi on Medium]]></title>
        <description><![CDATA[Stories by Hassan Abedi on Medium]]></description>
        <link>https://medium.com/@habedi?source=rss-c248b3ac59be------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*4jbqOVz3Y5UCixlbJVKu4Q.jpeg</url>
            <title>Stories by Hassan Abedi on Medium</title>
            <link>https://medium.com/@habedi?source=rss-c248b3ac59be------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 23 Jun 2026 09:26:28 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@habedi/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[An Indexing Checklist for MongoDB Atlas Users]]></title>
            <link>https://habedi.medium.com/an-indexing-checklist-for-mongodb-atlas-users-25ab66a45ba1?source=rss-c248b3ac59be------2</link>
            <guid isPermaLink="false">https://medium.com/p/25ab66a45ba1</guid>
            <category><![CDATA[nosql]]></category>
            <category><![CDATA[indexing]]></category>
            <category><![CDATA[mongodb-atlas]]></category>
            <category><![CDATA[query-optimization]]></category>
            <category><![CDATA[mongodb]]></category>
            <dc:creator><![CDATA[Hassan Abedi]]></dc:creator>
            <pubDate>Sat, 12 Oct 2024 19:58:08 GMT</pubDate>
            <atom:updated>2024-10-12T20:00:42.193Z</atom:updated>
            <cc:license>http://creativecommons.org/licenses/by/4.0/</cc:license>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ouUabXSJWg8eZAVm" /></figure><p><a href="https://www.mongodb.com/products/platform/atlas-database">MongoDB Atlas</a> is a fully managed cloud database service that makes deploying, operating, and scaling <a href="https://www.mongodb.com/">MongoDB</a> clusters easy. It takes care of infrastructure management so you can focus on building your applications. However, even though it handles many database operations automatically, optimizing query performance through proper indexing is still something users and developers must manage.</p><p>Adding indexes on the fields that are frequently searched and accessed in your MongoDB collections is crucial. Queries can become slow without the right indexes, especially as your data grows. Indexes make searching faster but can add overhead and slow down write operations if overused or applied incorrectly.</p><p>In this article, we provide a checklist of important things to consider when working with indexes, particularly in the MongoDB Atlas environment. The aim is to help the reader make sure queries are as efficient as possible while avoiding common indexing pitfalls.</p><h3><strong>1. Avoid Adding Indexes to Small Collections</strong></h3><p>Generally, it’s a good idea not to add indexes to small collections (for example, collections with fewer than 200 documents). This is because the overhead of maintaining the indexes outweighs the benefit in such cases. The exception is when the collection is read frequently, and queries become slow without the index.</p><h3><strong>2. Indexes Slow Down Writes</strong></h3><p>Each index has to be maintained when documents are written, so adding indexes makes writing to the collection slower. To limit this cost, avoid adding indexes to collections where the ratio of reads to writes is low, like collections where most operations are updates or inserts.</p><h3><strong>3. 
Use the Performance Advisor</strong></h3><p>MongoDB Atlas comes with a tool named <strong>Performance Advisor</strong>, which can give recommendations for adding and removing indexes based on the performance of the queries that access the data in the collection. It’s a good idea to regularly check its recommendations and apply them as needed.</p><h3><strong>4. Index Fields Used in </strong><strong>where and </strong><strong>sort</strong></h3><p>It’s generally beneficial to have an index on any field used in a where or sort operation to optimize query performance. In the case of sort, the values of indexed fields are effectively stored in sorted order, which allows MongoDB to return sorted results without needing additional processing time for sorting.</p><h3><strong>5. Use the Query Profiler</strong></h3><p>In MongoDB Atlas, you can use the <strong>Query Profiler</strong> tool to identify slow queries and add indexes to the relevant fields. When an index is used in a query, the planSummary field in the profiler output will show one or more IXSCAN stages, indicating that MongoDB is using an index to retrieve documents. For example, here’s an output from the Query Profiler showing compound indexes being used:</p><pre>{<br>  // ...<br>  &quot;planSummary&quot;: &quot;IXSCAN { customerId: 1, transactionDate: 1, location.city: 1, hasAttachments: 1 }, IXSCAN { customerId: 1, transactionDate: 1, location.city: 1, hasAttachments: 1 }, IXSCAN { customerId: 1, transactionDate: 1, location.city: 1, hasAttachments: 1 }&quot;<br>  // ...<br>}</pre><h3><strong>6. Optimize the </strong>Examined:Returned Ratio</h3><p>Ideally, the ratio between the number of index keys examined and the number of returned documents (shown in the Query Profiler as the Examined:Returned Ratio) should be small. A ratio close to 1 is best, since it means that nearly every index key examined led to a returned document, although this may not always be achievable. Note that an Examined:Returned Ratio of exactly zero is not a good sign, because it means that no index keys were examined, that is, no index is being used at all.</p><h3><strong>7. 
Use Atlas Search Indexes for Text Search</strong></h3><p>When using the $search operation, it’s important to have an <strong>Atlas Search Index</strong> on the field being searched. For example, if you’re performing a text search (such as a regex search) on the vehicle.registrationNo field, you should have an Atlas Search Index on vehicle.registrationNo. The index name must also be specified in the $search operation:</p><pre>// ...<br>{<br>  &quot;$search&quot;: {<br>    &quot;index&quot;: &quot;searchIndex&quot;, // Name of the search index should be given here<br>    &quot;compound&quot;: {<br>      &quot;must&quot;: [<br>        {<br>          &quot;wildcard&quot;: {<br>            &quot;query&quot;: &quot;ABC1*&quot;,<br>            &quot;path&quot;: &quot;vehicle.registrationNo&quot;,<br>            &quot;allowAnalyzedField&quot;: true<br>          }<br>        }<br>      ]<br>    }<br>  }<br>}<br>// ...</pre><p>Note that the $search operation and Atlas Search Indexes are available only in MongoDB Atlas.</p><h3><strong>8. Order of Fields in Compound Indexes</strong></h3><p>The order of fields matters in compound indexes. Fields used in equality conditions (like =) should come first, followed by fields used in sorting, and then fields used in range conditions. For example, if you have a query like this:</p><pre>// ...<br>{<br>  &quot;vehicle.registrationNo&quot;: &quot;ABC123&quot;,<br>  &quot;vehicle.model&quot;: &quot;Sedan&quot;,<br>  &quot;vehicle.year&quot;: {<br>    &quot;$gte&quot;: 2010<br>  }<br>}<br>// ...</pre><p>The compound index should be created with fields in this order: vehicle.registrationNo, vehicle.model, and vehicle.year, because the first two fields are used in equality conditions and the last one is used in a range condition.</p><h3><strong>9. High Cardinality Fields</strong></h3><p>Indexes work best on fields with high cardinality (fields with many unique values). By contrast, indexes on fields with low cardinality are often not as useful. 
For example, a field like status with only two possible values (active or inactive) won’t benefit much from indexing. MongoDB may still need to scan many documents to find those with status: active, which defeats the purpose of the index.</p><h3><strong>10. TTL Indexes for Automatic Document Deletion</strong></h3><p>TTL (or time-to-live) indexes can be used to automatically delete documents from a collection after a certain period. This is especially useful for collections that store session data, logs, or cache data. Note that a TTL index must be a single-field index, and the indexed field must hold a date value (or an array of date values); documents in which the field holds any other type are not expired.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=25ab66a45ba1" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Top DuckDB CLI Commands That You Should Know]]></title>
            <link>https://habedi.medium.com/top-duckdb-cli-commands-that-you-should-know-7783af9c1fb4?source=rss-c248b3ac59be------2</link>
            <guid isPermaLink="false">https://medium.com/p/7783af9c1fb4</guid>
            <category><![CDATA[command-line]]></category>
            <category><![CDATA[duckdb]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[tutorial]]></category>
            <category><![CDATA[database]]></category>
            <dc:creator><![CDATA[Hassan Abedi]]></dc:creator>
            <pubDate>Fri, 27 Sep 2024 17:42:39 GMT</pubDate>
            <atom:updated>2024-09-29T08:48:28.886Z</atom:updated>
            <cc:license>http://creativecommons.org/licenses/by/4.0/</cc:license>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*oKhEX6qf6GB3CRxwL0afmQ.png" /></figure><p>DuckDB is a fast in-process analytical database system. It is designed to be used as an embedded database, which means it can be easily integrated into applications and run within the same process as the application code. However, a command-line version of DuckDB is also available.</p><p>In this post, we will explore a few useful commands (also called dot commands) of the command-line version of DuckDB, which we refer to as the DuckDB CLI in the remainder of this post.</p><h3>Getting Started</h3><p>Before we start, make sure you have already downloaded the DuckDB CLI (for your OS from <a href="https://duckdb.org/docs/installation/index.html">here</a>) and that it is ready to run.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/643/1*j84RZuWr_H4opHbPRPP_Pg.png" /><figcaption>DuckDB CLI (the file is named duckdb on GNU/Linux systems) Started in a Terminal Emulator</figcaption></figure><p>After downloading the DuckDB CLI, we need to create a toy table that we will use to demonstrate the commands and their functionalities. The table will be called employees and will have the following schema:</p><pre>-- Create the `employees` table<br>create table employees<br>(<br>    id         integer,<br>    name       varchar,<br>    department varchar,<br>    salary     decimal<br>);<br><br>-- Insert some records into the table<br>insert into employees<br>values (1, &#39;Alice&#39;, &#39;Engineering&#39;, 100000),<br>       (2, &#39;Bob&#39;, &#39;Engineering&#39;, 90000),<br>       (3, &#39;Charlie&#39;, &#39;Sales&#39;, 80000),<br>       (4, &#39;David&#39;, &#39;Marketing&#39;, 70000),<br>       (5, &#39;Eve&#39;, &#39;Sales&#39;, 60000);</pre><h3>1. Using .mode to Change the Output Format</h3><p>The .mode command allows us to change the output format of the query results. 
By default, the output format is duckbox, which displays the results in a tabular format with extensive aesthetic formatting. However, we can change the output format to many other formats, including: csv, json, latex, or even markdown. See DuckDB&#39;s documentation on <a href="https://duckdb.org/docs/api/cli/output_formats.html">supported output formats</a> for the full list.</p><p>Example output in different modes:</p><pre>-- Show the current output mode<br>.mode<br><br>-- Show the result in LaTeX format<br>.mode latex<br>select * from employees;<br><br>-- Show the result in CSV format<br>.mode csv<br>select * from employees;<br><br>-- Show the result in JSON format<br>.mode json<br>select * from employees;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/903/1*MFPU4ojJVpu6H4K0IyZxMQ.png" /><figcaption>Using .mode to Change the Output Format of the Queries</figcaption></figure><h3>2. Using .timer to Measure Query Execution Time</h3><p>The .timer command lets us measure the execution time of queries. In other words, when we enable the timer, DuckDB will display the time taken to execute each query.</p><p>Example usage of .timer:</p><pre>-- Turn the timer on<br>.timer on<br><br>-- Run a query<br>select * from employees;<br><br>-- Turn the timer off<br>.timer off</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/903/1*Q5P-0KPLuEEZipNCYJdREA.png" /><figcaption>Using .timer to Measure Query Execution Time</figcaption></figure><h3>3. Using .schema to Display Table Structure</h3><p>The .schema command displays the structure or schema of a table. 
The displayed information includes column names, column types, and any constraints (like primary keys, foreign keys, or indexes) associated with the table.</p><p>Example usage of .schema:</p><pre>-- Create an index on the `department` column<br>create index idx_employees_department on employees (department);<br><br>-- Show the schema for the `employees` table<br>.schema employees</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*xVagCYNSEjt2UApcl9agsw.png" /><figcaption>Using .schema to Display Table Structure</figcaption></figure><h3>4. Using .read to Execute Queries from a File</h3><p>The .read command lets us execute queries (SQL code) stored in a file. This is a very useful command when we have a large number of queries that we want to run in sequence or when we want to reuse queries across multiple sessions.</p><p>Example usage of .read:</p><pre>-- Execute queries from the `queries.sql` file<br>.read queries.sql</pre><p>Example content of queries.sql:</p><pre>select * from employees;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*uDe-38hAT5b9h6viZ7kEpw.png" /><figcaption>Using .read to Execute Queries from a File</figcaption></figure><h3>5. Using .once to Redirect Output to a File</h3><p>The .once command allows us to redirect the output of the next query to a file. This can be useful when we want to save the results of a query to a file for later use.</p><p>Example usage of .once:</p><pre>-- Redirect the output of the next query to `output.txt`<br>.once output.txt<br><br>-- The (next) query<br>select * from employees;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*XwrkqiL7BuqrShCSfuTDvQ.png" /><figcaption>Using .once to Redirect Output to a File</figcaption></figure><h3>6. 
Using .shell to Execute Shell Commands</h3><p>The .shell command lets us execute shell commands from the DuckDB CLI.</p><p>Example usage of .shell:</p><pre>-- List the files in the current directory<br>.shell ls -l</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*IK5T1uu12Yxj2cMM6hZSSw.png" /><figcaption>Using .shell to Execute Shell Commands</figcaption></figure><h3>7. Using .prompt to Customize the Prompt Text</h3><p>The .prompt command allows us to customize the prompt in the DuckDB CLI. For example, we can change the prompt to display the database name or any other custom text.</p><p>Example usage of .prompt:</p><pre>-- Change the prompt to `EmployeeDB&gt;` (the database name)<br>.prompt &quot;EmployeeDB&gt; &quot;<br><br>select * from employees;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*EfVTfnLBLoSZEKinaqRZyA.png" /><figcaption>Using .prompt to Customize the Prompt Text</figcaption></figure><h3>8. Using .changes to Display the Number of Affected Rows</h3><p>The .changes command allows us to display the number of rows affected by a query with side effects. This is useful when you want to know how many rows were inserted, updated, or deleted by a query.</p><p>Example usage of .changes:</p><pre>-- Turn on tracking changes<br>.changes on<br><br>-- Insert a new row into the `employees` table<br>insert into employees values (6, &#39;Frank&#39;, &#39;Engineering&#39;, 95000);<br><br>-- Turn off tracking changes<br>.changes off</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*wXJeWHaTihdue2TSLC7bhA.png" /><figcaption>Using .changes to Display the Number of Affected Rows</figcaption></figure><h3>9. 
Using .indexes to Display the Index Names</h3><p>The .indexes command lets us display the names of the indexes on a table.</p><p>Example usage of .indexes:</p><pre>-- Create an index on the `department` column<br>create index idx_employees_department on employees (department);<br><br>-- Get the names of the indexes on the `employees` table<br>.indexes employees</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*7r-xoS6FrmbBU4kJCi_juQ.png" /><figcaption>Using .indexes to Display the Index Names</figcaption></figure><h3>10. Using .nullvalue to Customize How NULL Values are Displayed</h3><p>The .nullvalue command allows us to customize how NULL values are displayed in DuckDB CLI. By default, NULL values are displayed as an empty string. However, using .nullvalue, we can change this to any other string.</p><p>Example usage of .nullvalue:</p><pre>-- Show NULL values as &quot;N/A&quot;<br>.nullvalue &quot;N/A&quot;<br><br>-- Insert a row with a NULL value in the `department` column<br>insert into employees (id, name, department, salary) values (7, &#39;Grace&#39;, null, 85000);<br><br>-- Show the contents of the `employees` table<br>select * from employees;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*bRTgtr1y3hs9edUEg4FtOg.png" /><figcaption>Using .nullvalue to Customize How NULL Values are Displayed</figcaption></figure><h3>11. Using .cd to Change the Working Directory</h3><p>The .cd command allows us to change the working directory to another directory.</p><p>Example usage of .cd:</p><pre>-- Change the working directory to /tmp<br>.cd /tmp</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*zBw5zQOIdqWKnPXY1bhlPw.png" /><figcaption>Using .cd to Change the Working Directory</figcaption></figure><h3>Conclusion</h3><p>In this post, we explored some useful DuckDB commands that can help us work more efficiently with the command-line version of DuckDB. 
These commands can help us customize the output format, measure query execution time, display table schemas, execute queries from a file, etc. For more information on the DuckDB CLI and its full feature list, please check out <a href="https://duckdb.org/docs/api/cli/dot_commands">DuckDB’s documentation</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7783af9c1fb4" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Best Books and Courses to Learn about Reinforcement Learning in 2022]]></title>
            <link>https://habedi.medium.com/best-books-and-courses-to-learn-about-deep-reinforcement-learning-in-2022-85df82b2dcf5?source=rss-c248b3ac59be------2</link>
            <guid isPermaLink="false">https://medium.com/p/85df82b2dcf5</guid>
            <category><![CDATA[books]]></category>
            <category><![CDATA[online-courses]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <dc:creator><![CDATA[Hassan Abedi]]></dc:creator>
            <pubDate>Mon, 09 May 2022 12:09:27 GMT</pubDate>
            <atom:updated>2022-05-09T12:26:19.256Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/799/1*3EpZMriN-xByj91whs1DUg.jpeg" /><figcaption>Source: <a href="https://flic.kr/p/aav5nQ">https://flic.kr/p/aav5nQ</a></figcaption></figure><h3>Introduction</h3><p>Deep learning is an approach to designing, training, and building machine learning models that has become extremely popular in recent years. Arguably, deep learning’s popularity is mainly due to three reasons. Firstly, deep learning models have performed very well in many tasks in areas such as natural language processing, computer vision, and speech recognition in comparison to alternative approaches such as decision trees and support vector machines. Secondly, deep learning is versatile, which means it can be used to solve an extensive range of problems. Thirdly, deep learning has, to a large extent, removed the need for feature engineering, which was an enduring challenge in itself in building machine learning models.</p><p>Reinforcement learning is an area of machine learning whose main goal is to train an agent that aims to maximize a cumulative reward by taking actions in an environment. The main application of reinforcement learning is to create an agent that can solve a problem or perform a task that previously was solved or performed by a human being. As an example of such a task, reinforcement learning has been used to train an agent that plays a video game like <a href="https://www.youtube.com/watch?v=qv6UVOQ0F44">Super Mario World</a>.</p><p>Deep reinforcement learning comes about when deep learning is used to approximate different components of a reinforcement learning-based system, such as the policy or the value function. Deep reinforcement learning is a growing area for researchers and practitioners alike. 
This article aims to present a compact list of high-quality resources, including books and online courses, to help anybody curious about reinforcement learning in general, and deep reinforcement learning in particular, get started quickly. Moreover, most of the books listed are available as downloadable PDFs, and the code for the examples shown in the books is mostly available.</p><p>At any rate, I hope these resources will help you in your journey to learn more about (deep) reinforcement learning concepts and tools.</p><h3>Books</h3><ol><li><a href="https://mitpress.mit.edu/books/reinforcement-learning-second-edition">Reinforcement Learning: An Introduction, Second Edition, by Richard S. Sutton and Andrew G. Barto</a></li><li><a href="https://mitpress.mit.edu/books/algorithms-decision-making">Algorithms for Decision Making by Mykel J. Kochenderfer, Tim A. Wheeler and Kyle H. Wray</a></li><li><a href="https://www.cambridge.org/core/books/bandit-algorithms/8E39FD004E6CE036680F90DD0C6F09FC">Bandit Algorithms by Tor Lattimore and Csaba Szepesvári</a></li><li><a href="https://rltheorybook.github.io/">Reinforcement Learning: Theory and Algorithms by Alekh Agarwal, Nan Jiang, Sham M. 
Kakade and Wen Sun</a></li></ol><h3>Courses</h3><ol><li><a href="https://www.deepmind.com/learning-resources/reinforcement-learning-lecture-series-2021">Reinforcement Learning Lecture Series 2021 by DeepMind x UCL</a></li><li><a href="https://rltheory.github.io/">CMPUT 653: Theoretical Foundations of Reinforcement Learning by University of Alberta</a></li><li><a href="https://rail.eecs.berkeley.edu/deeprlcourse/">CS 285: Deep Reinforcement Learning by UC Berkeley</a></li><li><a href="https://github.com/huggingface/deep-rl-class">The Hugging Face Deep Reinforcement Learning Class</a></li><li><a href="https://www.deepmind.com/learning-resources/introduction-to-reinforcement-learning-with-david-silver">Introduction to Reinforcement Learning with David Silver</a></li><li><a href="https://web.stanford.edu/class/cs234/CS234Win2019/index.html">CS234: Reinforcement Learning Winter 2019 by Stanford University</a></li><li><a href="https://www.youtube.com/playlist?list=PLwRJQ4m4UJjNymuBM9RdmB3Z9N5-0IlY0">Foundations of Deep RL — 6-lecture series by Pieter Abbeel</a></li></ol><h3>Other resources</h3><ol><li><a href="https://github.com/andyljones/reinforcement-learning-discord-wiki/wiki">Reinforcement Learning Discord Wiki</a></li><li><a href="https://rlsummerschool.com/">Reinforcement Learning Summer School by Vrije Universiteit Amsterdam</a></li></ol><p><strong>Acknowledgement</strong></p><p>The resources introduced in this article are from <a href="https://xrl.ai/posts/rl-resources/">this blog post</a> by Yanzhe Bekkemoen.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=85df82b2dcf5" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Getting Started With Apache Spark Using Databricks Community Edition]]></title>
            <link>https://habedi.medium.com/getting-started-with-apache-spark-using-databricks-community-edition-bbafef7268f?source=rss-c248b3ac59be------2</link>
            <guid isPermaLink="false">https://medium.com/p/bbafef7268f</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[databricks]]></category>
            <category><![CDATA[spark]]></category>
            <category><![CDATA[apache-spark]]></category>
            <dc:creator><![CDATA[Hassan Abedi]]></dc:creator>
            <pubDate>Mon, 17 Jan 2022 11:35:21 GMT</pubDate>
            <atom:updated>2022-01-19T08:10:49.394Z</atom:updated>
            <cc:license>https://creativecommons.org/licenses/by-sa/4.0/</cc:license>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*daE3g6Vbstt2CHP605VDrg.jpeg" /><figcaption>Remaining ruins of city of Babylon (source: Wikipedia)</figcaption></figure><p>This is a mini-tutorial aiming at helping the reader get started using Databricks Community Edition to process their data on an Apache Spark cluster. The tutorial includes a set of steps that one needs to take to be able to get familiar with Databricks Community Edition’s environment.</p><h3>Step 1: Getting started</h3><p>The first thing to do is go to the webpage for <a href="https://community.cloud.databricks.com">Databricks Community Edition</a> and create a user account (if you do not already have one).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3Bw34H_Eqaq0ygGmLxS3zA.png" /><figcaption>Databricks Community Edition’s login page</figcaption></figure><p>When you have finished creating your account, log into Databricks Community Edition.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*G4J8oT8dhgqcwr9cVIg-4A.png" /><figcaption>Databricks Community Edition’s environment after you have logged into it</figcaption></figure><p>Now go to <a href="https://github.com/habedi/datasets">https://github.com/habedi/datasets</a> and download the contents of the folder <a href="https://github.com/habedi/datasets/tree/main/datascience.stackexchange.com">datascience.stackexchange.com</a> to your machine. It includes the data that we are going to use here.</p><h3>Step 2: Creating a Spark cluster</h3><p>Now, we need to create an Apache Spark cluster to run our code on it. To do so, click on “Compute” on the vertical dark green line on the left side of the screen.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i9qSBABG5KwWGgbMYYBnVA.png" /><figcaption>After pressing on “Compute” button this page will show up</figcaption></figure><p>Now press on “Create Cluster” to create your cluster. 
In Databricks Community Edition, you have to set a name for your cluster and select the Databricks runtime for your cluster. Note that we can set more low-level configurations, but I will not go through their details in this tutorial. Moreover, different Databricks runtimes mainly differ in the version of Spark they come with; I chose the default Databricks runtime (runtime 9.1 LTS) that comes with Spark 3.1.2. Also, I named the cluster “MyCluster”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oDu3s8_nm8iEyby9qKt_rA.png" /><figcaption>Creating an Apache Spark cluster in Databricks Community Edition named “MyCluster”</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6cqP70oYfEdzoreM2pRQpQ.png" /><figcaption>The cluster has been created and is running and ready to use</figcaption></figure><h3>Step 3: Installing libraries on the cluster</h3><p>Usually, at the start, we have to install some additional libraries and packages on our cluster to do something useful with our data. Imagine we want to do some graph analytics on our Spark cluster in Python. To do that, we have to install the appropriate version of Graphframes, a graph processing library for Apache Spark. To install libraries on our cluster, click on our cluster’s name in the “Compute” section and click on “Libraries”. Libraries can be installed from different sources; here, we use the Maven repository to download and install the artefacts related to Graphframes in our cluster. Apache Spark has Python, R, Java, and Scala APIs. 
Depending on the API we are using, we can install libraries for these programming languages and environments.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4A2t38cRA_vbfM7m7hFExQ.png" /><figcaption>We can install libraries in our Spark cluster by clicking on the name of our cluster on the page that opens after pressing “Compute”; then, by pressing “Install New” under the “Libraries” tab, we can open the page for installing libraries</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QKQ1jC2_tn9TJqiHoPczlg.png" /><figcaption>After pressing the “Install New” button, we can choose the repository of the library we want to install and install it on our cluster</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8hq2aNI1X3CCvSTFZq4Exg.png" /><figcaption>Here, we search for the Graphframes library and pick the newest version that matches the version of Spark installed on our cluster, which in our case is Graphframes 0.8.2 for Spark 3.1.x</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nMIzqUDBPtkdyJfGneK4tQ.png" /><figcaption>Finally, after finding and selecting a library, we can press the “Install” button to install the library on the cluster</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PmSmVwmkHC4Z2kGU74IGjQ.png" /><figcaption>As you can see, Graphframes is installed on MyCluster and is ready to use</figcaption></figure><p>Now that the cluster is created and running, we can upload the data to the Databricks Community Edition environment.</p><h3>Step 4: Uploading the data</h3><p>In many real-world scenarios, we have the data in CSV format and want to do some analytics or train a machine learning model. Here, let’s assume we want to upload four compressed CSV files. 
(These are the files that you downloaded to your machine in Step 1 from <a href="https://github.com/habedi/datasets">https://github.com/habedi/datasets</a>.) To move the files to our cluster, click on the icon named “Data” located on the vertical line on the left side of the screen to open up the user interface for managing the data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*l-uQMvdrFkO-6EZO8QQ9cA.png" /><figcaption>The interface for data management can be accessed by clicking on “Data” on the vertical dark green line on the left side of the screen</figcaption></figure><p>Now press the “Create Table” button to upload your data to DBFS. DBFS is the name of the file system that Databricks Community Edition uses to store and access the data. In general, you can connect to different data providers to get your data, but in this tutorial, we assume that you have the data on your machine and need to move it to DBFS.</p><p>To upload the data to DBFS, choose the “Upload File” button, select your files, and press “Open”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-yKwVmKL_dOst-1_kLhclA.png" /><figcaption>Uploading files from your machine to DBFS</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I14XFW_Bw9zaB_avdSTKsQ.png" /><figcaption>As you can see, files have been successfully uploaded from your machine to DBFS and are available under the path “/FileStore/tables/NAME_OF_FILE”</figcaption></figure><p>Optionally, at this stage, we can select an uploaded CSV file and turn it into a Hive table. The main benefit of doing so is that we can directly run Spark SQL queries over a table that is stored as a Hive table on DBFS. 
But this is not a requirement, and we will still be able to load the compressed CSV files that we have already uploaded to DBFS into Spark DataFrames.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PlUoo-7nFZj_RWq0t1kfkw.png" /><figcaption>An example table with inferred schema using “Creating Table with UI”</figcaption></figure><h3>Step 5: Processing the data on the cluster</h3><p>At this stage, we are ready to start our real work, which is doing something interesting with the data we have already uploaded to our cluster. To do so, we can open a notebook and choose the API we want to use to work with the data in the cluster. By API, I mean the default programming language that we will use in our notebook. Click on “Workspace” to open the side panel for creating a Jupyter notebook.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*swz1jiL7Er8DveZQRBaNAg.png" /><figcaption>We can create a new notebook or access an existing one under “Workspace”</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*n4QooV7ayx4-fGqTTaIaPg.png" /><figcaption>We can create a Jupyter notebook with the default programming language set to Python to process our data on the MyCluster Spark cluster</figcaption></figure><p>Now copy the code below into the notebook you have just created and run the notebook by pressing the “Run All” button at the top.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d4060d1bb35e8989142c6f29d4a02bff/href">https://medium.com/media/d4060d1bb35e8989142c6f29d4a02bff/href</a></iframe><p>When you run the above code in your new notebook, you should see something like this in the output:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X5o0Bdcg4Gtbqw93lxyq4w.png" /></figure><p>Congratulations, you have a running Spark cluster ready to use to solve your Big Data and Data 
Science problems.</p><p>One final thing: remember that Databricks Community Edition is a free service, so it naturally comes with some limitations. Apart from the limitations on your cluster’s resources, like the number of CPUs and the amount of RAM, you should remember that if your cluster is inactive for two hours after you create it, it will be shut down. And when a cluster is shut down due to inactivity, you will need to delete it and create a new cluster. This can be a nuisance, but it will not affect the data you have stored on DBFS. Usually, the only significant problems are the hassles of creating a new cluster and installing the appropriate libraries on it again.</p><p>Moreover, I suggest reading the following books to get familiar with Apache Spark and its applications:</p><ol><li>Learning Spark: Lightning-Fast Data Analytics; <a href="https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf">https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf</a></li><li>Learning Apache Spark with Python; <a href="https://runawayhorse001.github.io/LearningApacheSpark/">https://runawayhorse001.github.io/LearningApacheSpark/</a></li><li>Spark: The Definitive Guide; <a href="https://pages.databricks.com/definitive-guide-spark.html">https://pages.databricks.com/definitive-guide-spark.html</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Writing a HelloWorld Spark application with IntelliJ IDE and Python 3 in Windows 10]]></title>
            <link>https://habedi.medium.com/writing-a-helloworld-spark-application-with-intellij-ide-and-python-3-in-windows-10-dc009520d4ab?source=rss-c248b3ac59be------2</link>
            <guid isPermaLink="false">https://medium.com/p/dc009520d4ab</guid>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[pyspark]]></category>
            <category><![CDATA[spark]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Hassan Abedi]]></dc:creator>
            <pubDate>Tue, 16 Feb 2021 00:26:04 GMT</pubDate>
            <atom:updated>2021-05-21T06:34:39.258Z</atom:updated>
            <cc:license>https://creativecommons.org/licenses/by-sa/4.0/</cc:license>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8lfHcV-rKzwkorL0DWGz5g.jpeg" /><figcaption>Florence, Italy; source: <a href="https://flic.kr/p/2jPh9KA">https://flic.kr/p/2jPh9KA</a></figcaption></figure><h3>Introduction and context</h3><p>In this tutorial, I want to show you how to set up a minimal working environment for developing Apache Spark applications on your Windows machine. So, without further ado, let’s go!</p><h3>Step 1: installing Java SE Development Kit 11</h3><p>First, go to <a href="https://www.oracle.com/java/technologies/javase-jdk11-downloads.html">Oracle’s website</a> and download the Java SE Development Kit 11 (JDK 11) installer for 64-bit Windows. Then run the file you just downloaded to install JDK 11 on your computer. After the installation finishes, you can check whether Java 11 is available on your computer by executing `java -version` in the Windows Command Prompt or PowerShell.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/979/1*pJ-g9FXUAuC58S79mAFsyA.png" /><figcaption>Java 11 is successfully installed and available system-wide</figcaption></figure><h3>Step 2: installing Python 3</h3><p>Go to <a href="https://www.python.org/downloads/">Python’s website</a> and download the Windows installer for Python 3.9. Then execute the downloaded file to install Python 3. After the installation finishes, you should be able to start Python 3’s interactive shell by executing `python` in the Windows Command Prompt or PowerShell.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/979/1*vveELRy6PywnJEiRQ7UxEg.png" /><figcaption>Python 3.9 installed and ready to use</figcaption></figure><h3>Step 3: downloading and configuring Spark 3</h3><p>First, create a folder named `BigData` in the root of your `C:` drive. 
Go to <a href="https://spark.apache.org/downloads.html">Apache Spark’s website</a> and download Spark 3’s binary files (for Hadoop 2.7) to the path `C:\BigData` on your computer. Then extract the downloaded file in this directory (that is, in `C:\BigData`). After this step, the contents of the `BigData` folder should look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/997/1*l2WZzK4IyPodf9gMllXLdA.png" /><figcaption>At the time of writing, the newest stable version of Apache Spark was Spark 3.0.1</figcaption></figure><p>Now rename the `spark-3.*` folder to `spark3`. Then go to <a href="https://github.com/cdarlint/winutils">https://github.com/cdarlint/winutils</a> and download the files in this Git repository as a ZIP file using the green `Code` button in the top-right corner. (By default, the downloaded file will be named `winutils-master.zip`.)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mJWPS0rb45qr931jA0wHfQ.png" /><figcaption>Downloading `winutils` binary files from <a href="https://github.com/cdarlint/winutils">https://github.com/cdarlint/winutils</a></figcaption></figure><p>Move the file you downloaded to `C:\BigData` and extract it there.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1007/1*KG4dF9mzZwYb30CyeAn8GQ.png" /><figcaption>If everything has gone correctly so far, you should have two folders named `spark3` and `winutils-master` in `C:\BigData`</figcaption></figure><p>Now you must set a few environment variables in Windows 10. To do so, type `edit the system environment variables` in the Windows search bar (in the bottom-left corner) and press ENTER. 
You should now see the `System Properties` window (see the picture below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/412/1*UooNavD0oP0Qhm_6Ozyp7Q.png" /><figcaption>You can add, remove, and edit environment variables in Windows by pressing the `Environment Variables` button at the bottom of the `System Properties` window</figcaption></figure><p>Press the button labelled `Environment Variables` in the `System Properties` window. You should now see the `Environment Variables` window (see the picture below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/618/1*6aZuJDIfLS6PzuD1CKTZpw.png" /><figcaption>You can add, remove, or edit environment variables for your user (or for all users) from the `Environment Variables` window</figcaption></figure><p>Now press the `New` button in the top panel (`User variables`) and add the following environment variables.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/653/1*kIfCJDyiicpzxoOsOWIP6Q.png" /><figcaption>Setting the `HADOOP_HOME` environment variable to `C:\BigData\winutils-master\hadoop-2.7.7`</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/653/1*LZ3SXzmj8NC8-N7ijZAiqA.png" /><figcaption>Setting the `SPARK_HOME` environment variable to `C:\BigData\spark3`</figcaption></figure><p>Now select the `Path` environment variable in the top panel (or list) and press the `Edit` button.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/618/1*kd2DJfHiJsZLle_gwM9QEA.png" /><figcaption>Editing the `Path` environment variable for the user (the top panel)</figcaption></figure><p>Now add `%HADOOP_HOME%\bin` and `%SPARK_HOME%\bin` to the `Path` environment variable.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/527/1*SxS6N-CKf8eZmcM_yTjnoQ.png" /><figcaption>`%HADOOP_HOME%\bin` and `%SPARK_HOME%\bin` are added to the `Path`. 
(You can add new values to the `Path` using the `New` button.)</figcaption></figure><p>Now, open PowerShell, type `spark-shell`, and press ENTER. Wait until Spark’s interactive shell has started, then open <a href="http://localhost:4040">http://localhost:4040</a> in your web browser.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/859/1*DRAtt26a5SVBDXj1FUzeDw.png" /><figcaption>Checking the Spark installation by running `spark-shell` in Windows PowerShell</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*66-R4Gx8FjOKYR95Vko18w.png" /><figcaption>Apache Spark 3&#39;s shell Application UI; here you can monitor your Spark services and tasks</figcaption></figure><h3>Step 4: installing IntelliJ IDE with Python plugin</h3><p>Go to <a href="https://www.jetbrains.com/idea/download/#section=windows">IntelliJ IDE’s website</a> and download the IntelliJ IDE. You can download and use either the free community edition or the ultimate edition of the IDE. (Students can get a free license to use the ultimate edition of the IDE; they only need a university email address.) After the installation finishes, open the Settings window from the File menu and go to the Plugins section. Make sure the Python plugin for IntelliJ is installed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*spAGcq1TayAeB7vv2R7xBw.png" /><figcaption>Ensure the Python plugin is installed; you can search for it under the path `File/Settings/Plugins`</figcaption></figure><h3>Step 5: Writing and executing a Hello World Spark application</h3><p>In IntelliJ IDE, create a new Python project (go to `File/New/Project`) and select Python 3.9, which you installed in Step 2 of this tutorial, as the `Project SDK`. 
Then press the Next button twice.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/602/1*gKaOQUIY0GIbX5iqSo_9YQ.png" /><figcaption>Select the Python 3.9 you installed in Step 2 of this tutorial as the Project SDK to be used for the project</figcaption></figure><p>Now, pick a name for the project and a path to save the project files, and press the Finish button. (I chose the name PySparkHelloWork for my project and saved it in a folder named IntelliJ under my Documents directory, as you can see in the picture.)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/596/1*KxkHkEK6bxoxcMsT5DveIg.png" /><figcaption>The project PySparkHelloWork is ready</figcaption></figure><p>Open the Terminal window in the bottom-left corner of the project’s main window and execute `pip install pyspark findspark` in it. Wait until `pip` finishes installing these packages.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_jfKJ-oq8JIt5dqKCtHZFA.png" /></figure><p>Now create a folder named `src` in the project and create a new Python script called main (or `main.py`) inside the `src` folder. Then copy the Python code you see in the box below into the `main.py` file and save it.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/fb586b9a8e1d0f24940c6bdbfb762d38/href">https://medium.com/media/fb586b9a8e1d0f24940c6bdbfb762d38/href</a></iframe><p>Run (or execute) the `main.py` script. You should now see the results in the output.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0NtVSiUKnJp5XW6iJnR4Ew.png" /><figcaption>The output of the HelloWorld Spark application after a successful execution</figcaption></figure><p>Congratulations! You have developed your first Spark application in Python. Happy making (more!) Spark applications. 
:0)</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to install and run a simple Apache Spark cluster on Windows 10?]]></title>
            <link>https://habedi.medium.com/how-to-install-and-run-a-simple-apache-spark-cluster-on-windows-10-dc7453cf490b?source=rss-c248b3ac59be------2</link>
            <guid isPermaLink="false">https://medium.com/p/dc7453cf490b</guid>
            <category><![CDATA[apache-spark]]></category>
            <category><![CDATA[windows-10]]></category>
            <category><![CDATA[spark]]></category>
            <dc:creator><![CDATA[Hassan Abedi]]></dc:creator>
            <pubDate>Wed, 12 Feb 2020 22:40:59 GMT</pubDate>
            <atom:updated>2021-05-21T06:34:18.857Z</atom:updated>
            <cc:license>https://creativecommons.org/licenses/by-sa/4.0/</cc:license>
            <content:encoded><![CDATA[<h3>How to set up a very simple Apache Spark cluster on Windows 10?</h3><h3>Introduction</h3><p>This is a step-by-step guide that aims to help readers install <a href="https://spark.apache.org/docs/latest/index.html">Apache Spark</a> on their Windows machines. The Spark installation shown in this tutorial is the typical bare-minimum, single-instance cluster installation that you need to get a useful Spark development environment. Typically, you could use this new Spark installation to develop, test, and debug your Spark applications in languages like Java, Scala, and Python. The instructions in this tutorial mainly target a Spark environment running in Ubuntu. Still, the main steps taken in this tutorial will apply to other Unix(-like) OSes such as macOS without many modifications. Throughout the tutorial, we assume that your Windows machine is connected to the internet.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/376/0*Mrv7JZ_N-rxQkU1Z.png" /><figcaption>Apache Spark is currently one of the most popular open-source Cluster Computing frameworks</figcaption></figure><h3>Step 1: Installing Ubuntu 18.04 app</h3><p>Windows 10 comes with a feature that allows its users to have a fully functional Linux environment within Windows. To install Ubuntu 18.04 LTS, you can go to the Microsoft Store and search for the Ubuntu 18.04 LTS app there. To be able to install Ubuntu, you must first make sure that you have installed Windows Subsystem for Linux (WSL) on your machine. 
You can read more about how to install the Ubuntu app on Windows 10 <a href="https://ubuntu.com/tutorials/tutorial-ubuntu-on-windows#1-overview">here</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*q6dy1CpkBKmYzSqI0r_orQ.png" /><figcaption>Ubuntu 18.04 LTS app in Microsoft Store</figcaption></figure><h3>Step 2: Setting a username and a password for your new Ubuntu environment</h3><p>At this point, we assume that you have already successfully installed the Ubuntu 18.04 LTS app on your machine. When you start the Ubuntu app for the first time, it asks you for the username and password you want to use when working in the Ubuntu app. For the sake of simplicity, we used sparkusr for both username and password here.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TRnZPDeO5tmGOBf3V2xdKQ.png" /><figcaption>Once you have installed the Ubuntu app, you can start it for the first time by pressing the Launch button</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mxfKv0obggODqIOMwYZ93w.png" /><figcaption>When you start the Ubuntu app for the first time, you need to provide a username and password; in our setup, we used “sparkusr” for both username and password</figcaption></figure><h3>Step 3: Setting up your Ubuntu environment</h3><p>At this point, you need to properly configure your Ubuntu environment. You can do so by installing the software packages that you need to start Apache Spark’s services inside your newly installed Ubuntu. 
You can install the packages mentioned above by running the following commands in your terminal emulator:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0bf4d55865295f0ac22d2f20b98e2d52/href">https://medium.com/media/0bf4d55865295f0ac22d2f20b98e2d52/href</a></iframe><p>One important thing to remember is that we need the Java 8 Runtime Environment to be able to run the current stable version of Spark (version 2.4.x).</p><h3>Step 4: Installing Apache Spark</h3><p>At this step, you need to go to Apache Spark’s website and download Spark’s pre-built binary file. At the time of writing this tutorial, the latest stable version of Spark was 2.4.5, which you could download from <a href="https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz">here</a>. To download and install Spark, you can execute the following commands in your terminal:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/669aeeaa2ebf388c464289a46eacdbf0/href">https://medium.com/media/669aeeaa2ebf388c464289a46eacdbf0/href</a></iframe><p>At this point, if there was no problem in your setup, you should be able to see that both the SPARK_HOME and JAVA_HOME environment variables are set up correctly and you are ready to go! 
You can check this by running the commands echo $JAVA_HOME and echo $SPARK_HOME in your terminal; the output should look like the picture below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/844/1*8BRGWqM4oUKidEeON1eT-w.png" /><figcaption>If JAVA_HOME and SPARK_HOME are set up correctly, you should be able to see their correct values in the terminal</figcaption></figure><h3>Step 5: Starting Spark services</h3><p>At this point, you can start your Spark cluster services by executing the following commands in your terminal:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f4b1309cc95fc53036d984645293bb28/href">https://medium.com/media/f4b1309cc95fc53036d984645293bb28/href</a></iframe><p>You can check various indicators for your Spark cluster (e.g., the number of workers and the available resources in your cluster) by going to Spark’s web interface dashboard. By default, the web UI is available at <a href="http://localhost:8080/">http://localhost:8080/</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xhJObiOGg3IbX1Vvo-x-EA.png" /><figcaption>Spark master service is running with no available workers to execute a Spark application in the cluster</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*k4cHAlsvc1ln1mpobeb03w.png" /><figcaption>When you’ve started the worker service successfully, it attaches itself to the master and shows up as a worker in your cluster, ready to execute Spark jobs</figcaption></figure><p>You can start Spark’s shell by running spark-shell in your terminal.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dDJ-Wl2tuc19vq0ZBa0-Tw.png" /><figcaption>In the Spark shell environment, you can run Spark SQL and Scala code that is going to be executed on your cluster</figcaption></figure><p>When you have finished your work with the cluster, you need to stop the Spark 
services. The general pattern is that you first stop the worker (slave) service and then stop the master. The easiest way to shut down our simple cluster with our current settings is to execute stop-slave.sh and then stop-master.sh in the terminal, one after the other. If everything went well when you shut down the cluster, you will see no Java processes belonging to your Spark cluster services. You can check this by running the htop command in the terminal and checking the output. There should be no Spark-related Java processes running inside your Ubuntu environment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HzJ9DTfwZMRnLB8G1G6IFQ.png" /><figcaption>When you have shut down your cluster, or you did not start it in the first place, you should see no processes related to Spark running within your Ubuntu environment</figcaption></figure><h3>Conclusions</h3><p>In this brief tutorial, we described the general steps to install and set up a very rudimentary Apache Spark cluster on a machine running the Microsoft Windows 10 operating system. Apache Spark is a sophisticated open-source Cluster Computing framework with many different modes of operation and settings. What we described here is suitable for people who want to get familiar with Spark for the first time, or people who wish to develop Spark applications on their own machines. The resulting cluster at the end of this tutorial can be used as a mock cluster for learning purposes rather than real-world utility. Nevertheless, working with something as exciting and useful as Apache Spark can always be enjoyable for everybody :0).</p><h3>Additional resources:</h3><p>With the growing popularity of machine learning nowadays, it is natural to think that people will use Apache Spark in their machine learning pipelines. 
So, here is a nice little Spark tutorial for those who want to deploy their machine learning models onto a Spark cluster: <a href="https://neptune.ai/blog/apache-spark-tutorial">https://neptune.ai/blog/apache-spark-tutorial</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to win a Data Science challenge in Kaggle?]]></title>
            <link>https://medium.com/analytics-vidhya/how-to-win-a-data-science-challenge-in-kaggle-9f884ecf904?source=rss-c248b3ac59be------2</link>
            <guid isPermaLink="false">https://medium.com/p/9f884ecf904</guid>
            <category><![CDATA[kaggle-competition]]></category>
            <category><![CDATA[kaggle]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[how-to]]></category>
            <dc:creator><![CDATA[Hassan Abedi]]></dc:creator>
            <pubDate>Thu, 06 Feb 2020 16:46:55 GMT</pubDate>
            <atom:updated>2020-02-11T04:22:28.247Z</atom:updated>
            <cc:license>https://creativecommons.org/licenses/by-sa/4.0/</cc:license>
            <content:encoded><![CDATA[<h3><strong>Introduction</strong></h3><p>This article is about winning, or at least landing a decent top rank in, a Data Science competition on Kaggle. It’s been written mainly for a general audience. Now everyone is talking about Data Science, AI, and Machine Learning and how the future of the world depends on the technologies associated with these hot topics. Within this context, Kaggle is THE PLACE for Data Science enthusiasts.</p><h3><strong>What is Kaggle?</strong></h3><p>“Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.” (<a href="https://en.wikipedia.org/wiki/Kaggle">Wikipedia</a>)</p><p>The most notable feature of Kaggle is that it is currently the de facto venue for Data Science competitions. For instance, many big companies, such as Microsoft, Google, and Facebook, have held various competitions on Kaggle for different reasons. 
The reasons for posting a challenge range from getting publicity, recruiting talented people, and exploiting cheap labour to, last but not least, gathering new insights into how they (the companies) could tackle their data-driven problems.</p><p>Kaggle is not new, but it has managed to attract a lot of attention over time!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/624/1*4nrqzUlnZL6BdZ_-o9hZaw.png" /><figcaption>Source: <a href="https://www.wolframalpha.com/input/?i=kaggle">Wolfram Alpha</a></figcaption></figure><h3><strong>Why might you be interested in entering a Kaggle competition?</strong></h3><p>For most people, there can be various good reasons to compete in a Kaggle competition, including:</p><p>● <strong>Trying to learn the skills for Data Science and get experience</strong>. Working on a real-world Data Science problem is hard; some people believe that you must spend at least 10,000 hours on something to become a professional in it</p><p>● <strong>Build a profile for your career</strong>. Having a Kaggle profile can be a good thing on your resume if you want to get a job related to Data Science</p><p>● <strong>For the money</strong>. Most Kaggle challenges do not offer much prize money for the top teams, but there are still a few high-profile challenges with prizes in the million-dollar range. The difference in motivation between challenges with and without prize money is largely psychological. The size of the prize is not essential, but it helps attract many people from lower-income strata, so more people join the challenge</p><p>● <strong>For the thrill of the competition, and the coolness factor involved.</strong> For many people, it can be fun to compete with others; it is somewhat like playing an online video game such as <a href="https://worldofwarcraft.com/en-us/">WoW</a> or <a href="https://www.epicgames.com/fortnite/en-US/home">Fortnite</a>: an entertaining activity. 
Also, competing in Kaggle has a certain aura of coolness around it. Years ago, at the start of the current millennium, it became apparent that the future of computing lay in scaling up computers, which led to new technologies and trends. <a href="https://en.wikipedia.org/wiki/Computer_cluster">Cluster Computing (CC)</a> was one of the key technologies back then, and you can find descriptions like the following explaining why people were so excited about CC at that time:</p><p>o “<em>Coolness Factor</em>: There is just something really cool about playing with <strong>clusters</strong>. Although a purely subjective factor, it nonetheless has driven much activity, especially among younger practitioners who react to many of the empowerment issues above but also to the attraction of open source software, unbounded performance opportunities, the sand box mindset that enables self-motivated experimentation, and a subtle attribute of the pieces of a <strong>cluster</strong> “talking” to each other on a cooperative basis to make something happen together. There is also a strong sense of community among contributors in the field, a basis for a sense of association that appeals to many people. Working with <strong>commodity clusters</strong> is fun!” (<a href="https://www.springer.com/gp/book/9780387097657">Encyclopedia of parallel computing, Springer</a>)</p><p>By replacing the words related to cluster computing (i.e., the <strong>terms in bold</strong>) with the right words (i.e., Data Science, ML, and AI), we could say almost all of those enticing things about Data Science, AI, and ML.</p><p>The central assumption here is that the audience (YOU!) knows enough about algorithms and programming to be able to use Kaggle. The other important assumption that I’m making is that people learn best by doing! 
This article does not try to teach you how to code, how to understand and implement algorithms, or how to read and analyse other people’s models.</p><h3><strong>Choose your challenge wisely</strong></h3><p>The main starting point for entering a challenge on Kaggle is to pick a challenge that, firstly, interests you and that, secondly, you have the skills and resources to handle. One way to categorise the challenges on Kaggle is by the format and shape of the data used in the competition. In general, this method gives you two types of competitions. The first type is challenges with tabular data, where the data is naturally represented in tables with columns and rows. The second type is challenges in which the data is not naturally represented as rows of records packed into tabular file formats such as CSV or MS Excel. The main differences between these two types of challenges are shown in the table below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/895/1*mgLbWlltjOyvfldioMZbbw.png" /><figcaption>A binary classification of Kaggle competitions based on their data</figcaption></figure><p>This categorisation is not totally accurate because there are many competitions whose data is a mixture of tabular and non-tabular data; for the sake of simplicity, let’s keep our simple classification throughout this text :-).</p><p>Generally, the type of data in the challenge you choose determines the skills you need to win that challenge, or at least to be at the top of the leaderboard.</p><h3><strong>Learn the skills you need</strong></h3><p>The minimum requirement to start working on a Kaggle competition is to be able to develop the code to submit a prediction. In many cases, people don’t start from scratch; instead, they try to use others’ code written in Python or R to start. 
Nevertheless, to be able to do anything serious on Kaggle, you need to have many more skills under your belt. These skills can be summarised as follows:</p><ul><li>Good understanding of the software development process and related tools</li><li>Good knowledge of the primary machine learning and data mining algorithms</li><li>An analytical mind</li><li>Enough knowledge of<strong> linear algebra</strong>, <strong>optimisation</strong>, and <strong>statistics</strong></li><li>Enough knowledge of data processing and management tools and technologies</li></ul><p>The following picture shows an incomplete list of relevant terms associated with the aforementioned skills:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/624/1*d2RbZIE5gJ10PKhC9SFkOA.png" /><figcaption>A word cloud of many important terms and concepts relevant to a Data Science challenge</figcaption></figure><h3><strong>Find teammates</strong></h3><p>One important thing that can help you a lot while you are in a Kaggle competition is to find other people and team up with them. The people you should look for usually need to have two characteristics. Firstly, they should be better than you in some way, so you can hope to learn and improve yourself by interacting with them, possibly via osmosis. 
Secondly, they should preferably think differently from you. Many people approach a problem in similar ways, but in a predictive ML challenge, teammates who think exactly as you do are not going to add much!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/195/1*N9jAvHGAusQ072ss5-CeOg.png" /><figcaption>Working with smart people with refreshing ideas could be very rewarding!</figcaption></figure><h3><strong>Employ an agile workflow</strong></h3><p>The simple idea behind building a predictive model is that there exists a collection of functions that accept some input variables and give a good enough approximation of the labels or target variables in the dataset. During the lifespan of a challenge, you need to be able to absorb new ideas and add them to your models. This could include reading other people’s code, comments, replies, and posts to collect useful information about their approaches. Usually, the hard part is not finding helpful information in other people’s work and models, but somehow incorporating their ideas into your workflow. The problem is that you usually start a challenge with a simple model, and over time your model becomes more and more complex, so there is a real risk that at some point your model gets stuck in a local minimum or maximum of the score function used in the challenge.</p><p>You should automate as much as possible. Every time you build and train a model, you will notice that parts of your work are repeated. If your workflow contains many repetitive subtasks, there is an excellent chance that you can automate them and save precious time.
Subtasks such as data preprocessing (e.g., imputing missing values), Exploratory Data Analysis (EDA), and feature engineering are good candidates for automation.</p><p>Setting up an efficient, agile Data Science workflow is an interesting challenge in itself!</p><h3><strong>Understand the challenge description</strong></h3><p>Read the challenge description very carefully, then answer the following questions:</p><ul><li>Do you understand what type of challenge it is? Is it a classification, regression, or ranking challenge?</li><li>What score or error function is used in the challenge? Why did the organisers pick this particular function?</li><li>Does the challenge require a certain amount of domain expertise?</li><li>How large are the train and test datasets?</li><li>How raw is the data? How much time do you think it will take to build a starter model?</li><li>Does the challenge need you to have access to a particular type of resource?</li><li>Is it a kernel challenge? What limitations are in place?</li><li>How long is the lifespan of the challenge? How much time can you invest in it?</li><li>Are there any previous challenges similar to the one you intend to take on? If so, what can you scavenge or reuse from them?</li></ul><p>The point of answering these questions is to take on a challenge with a proper understanding of its context.
If building a model, for example, requires a lot of computational resources (e.g., <a href="https://en.wikipedia.org/wiki/Graphics_processing_unit">GPU</a>s or <a href="https://en.wikipedia.org/wiki/Tensor_processing_unit">TPU</a>s), or specific domain knowledge about the data is a critical advantage, and you have access to neither, you could head into the challenge hoping for a good standing on the final private leaderboard, only to be disappointed at the end.</p><h3><strong>Do not skip Exploratory Data Analysis</strong></h3><p>Exploratory Data Analysis is an essential activity in every Data Science process. It is not something you can learn just by reading books; it is more like an art than an industrial process. Still, you should take it very seriously when you start a challenge in Kaggle. A typical EDA pipeline may include, but is not limited to, the following subtasks:</p><ul><li>Using univariate and multivariate data analysis techniques</li><li>Data visualisation, which may include dimensionality reduction steps</li><li>Statistical hypothesis testing</li><li>Computing summary statistics of the data</li><li>Using clustering, anomaly/outlier detection, and novelty detection techniques</li></ul><p>EDA can be very time-consuming, especially when the dataset is large or quite raw (i.e., it needs a lot of preprocessing first). The main idea is to make the data talk to you! Some people in Kaggle prepare great EDA kernels, and you can learn from their work.</p><h3><strong>Start off with a robust validation</strong></h3><p>Always construct a high-quality validation scheme for your model as early as possible. The main point is to make sure, before anything else, that you are not going to overfit (or underfit) on your validation or test data.
Ideally, once your validation strategy is ready, you should get more or less the same score on the competition’s leaderboard as you get on your validation. It is easier said than done, though. Once you have an excellent validation, the rest of your effort is mainly focused on three things: finding or building the best features, finding the best single model or ensemble, and finally tuning your model’s hyper-parameters!</p><h3><strong>Winning is hard, so prepare to lose but hope to win</strong></h3><p>To win a Kaggle competition, you need to compete with many other smart and hardworking people from all over the world. To get a gold medal, you usually have to occupy one of the top 10 to 15 places on the final leaderboard; to get a silver medal, you have to be within the top 5%; and to get a bronze, no lower than the top 10%. Clearly, winning (i.e., finishing in first place) is not going to be easy, because many of the people you are trying to beat may have unique advantages over you. Winning a Kaggle challenge depends on many factors, including but not limited to having good domain knowledge of the data, possibly having access to high-end computational resources that give an advantage throughout a competition, having a good grasp of the cutting-edge algorithms in the area, and, finally, being lucky!</p><p>One good strategy could be to focus on a niche. In other words, try to compete in specific challenges in which you have some kind of upper hand. For instance, the current number one Kaggler, <a href="https://www.kaggle.com/bestfitting">bestfitting</a>, is almost totally focused on challenges where the data is an image dataset.
One distinguishing feature of his approach is that he relies heavily on Deep Neural Networks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/624/1*hn91fwjwMAIku863GKSbyg.png" /><figcaption>The profile page of bestfitting, the top-ranked Kaggler at the time of writing</figcaption></figure><h3>Conclusion</h3><p>It may seem obvious to you that winning or getting a medal in a Kaggle competition is not an easy task at all. You are right; winning is hard, mainly because it requires the right mixture of work, knowledge, experience, and, very importantly, a little bit of luck! But by investing the right amount of time and effort, and with Lady Luck on your side, it is not impossible to achieve.</p><hr><p><a href="https://medium.com/analytics-vidhya/how-to-win-a-data-science-challenge-in-kaggle-9f884ecf904">How to win a Data Science challenge in Kaggle?</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>