{"id":26618,"date":"2022-01-29T15:36:33","date_gmt":"2022-01-29T15:36:33","guid":{"rendered":"https:\/\/www.askpython.com\/?p=26618"},"modified":"2022-01-29T15:46:23","modified_gmt":"2022-01-29T15:46:23","slug":"feature-selection-in-python","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python\/examples\/feature-selection-in-python","title":{"rendered":"Feature Selection in Python &#8211; A Beginner&#8217;s Reference"},"content":{"rendered":"\n<p>This article is a little on the advanced side. We&#8217;ll discuss feature selection in Python for training machine learning models. It&#8217;s important to identify the important features from a dataset and eliminate the less important features that don&#8217;t improve model accuracy.<\/p>\n\n\n\n<p>Model performance can be harmed by features that are irrelevant or only partially relevant. The first and most critical phase in model design should be feature selection and data cleaning.<\/p>\n\n\n\n<p>Feature selection is a fundamental concept in machine learning that has a significant impact on your model&#8217;s performance. In this article, you&#8217;ll learn how to employ feature selection strategies in Machine Learning.<\/p>\n\n\n\n<p><strong><em>Also read: <a href=\"https:\/\/www.askpython.com\/python\/machine-learning-introduction\" data-type=\"post\" data-id=\"22853\">Machine Learning In Python \u2013 An Easy Guide For Beginner\u2019s<\/a><\/em><\/strong><\/p>\n\n\n\n<p>Let\u2019s get started!<\/p>\n\n\n\n<p>First, let us understand what feature selection is.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-feature-selection\">What is Feature Selection?<\/h2>\n\n\n\n<p>Irrelevant features in your data can reduce model accuracy, because the model may learn patterns from features that have no real bearing on the output. 
Feature selection is the process of selecting the features that contribute the most to the prediction variable or output that you are interested in, either automatically or manually.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"why-should-we-perform-feature-selection-on-our-model\">Why should we perform Feature Selection on our Model?<\/h3>\n\n\n\n<p>The following are some of the benefits of performing feature selection on a machine learning model:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Improved Model Accuracy:<\/strong> Model accuracy improves as a result of less misleading data.<\/li><li><strong>Reduced Overfitting<\/strong>: With less redundant data, there is less chance of drawing conclusions based on noise.<\/li><li><strong>Reduced Training Time<\/strong>: The model has fewer features to process, so algorithms train faster.<\/li><\/ul>\n\n\n\n<p>When feature selection removes noisy or redundant features, model accuracy often improves.<\/p>\n\n\n\n<p><strong><em>Also read: <a href=\"https:\/\/www.askpython.com\/python\/examples\/split-data-training-and-testing-set\" data-type=\"post\" data-id=\"9234\">How to Split Data into Training and Testing Sets in Python using sklearn?<\/a><\/em><\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"methods-to-perform-feature-selection\">Methods to perform Feature Selection<\/h2>\n\n\n\n<p>There are three commonly used feature selection methods that are easy to perform and yield good results.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Univariate Selection<\/li><li>Feature Importance<\/li><li>Correlation Matrix with Heatmap<\/li><\/ol>\n\n\n\n<p>Let&#8217;s take a closer look at each of these methods with an example.<\/p>\n\n\n\n<p><strong>Link to download the dataset<\/strong>: <a href=\"https:\/\/www.kaggle.com\/iabhishekofficial\/mobile-price-classification#train.csv\" target=\"_blank\" 
rel=\"noopener\">https:\/\/www.kaggle.com\/iabhishekofficial\/mobile-price-classification#train.csv<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"1-univariate-selection\">1. Univariate Selection<\/h3>\n\n\n\n<p>Statistical tests can be performed to identify which attributes have the strongest link to the output variable. The SelectKBest class in the scikit-learn library can be used with a variety of statistical tests to choose a certain number of features.<\/p>\n\n\n\n<p>The chi-squared (chi2) statistical test for non-negative features is used in the example below to select 10 of the top features from the Mobile Price Range Prediction Dataset.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\nimport numpy as np\nfrom sklearn.feature_selection import SelectKBest\nfrom sklearn.feature_selection import chi2\ndata = pd.read_csv(&quot;C:\/\/Users\/\/Intel\/\/Documents\/\/mobile_price_train.csv&quot;)\nX = data.iloc&#x5B;:,0:20]  #independent variable columns\ny = data.iloc&#x5B;:,-1]    #target variable column (price range)\n\n#extracting top 10 best features by applying SelectKBest class\nbestfeatures = SelectKBest(score_func=chi2, k=10)\nfit = bestfeatures.fit(X,y)\ndfscores = pd.DataFrame(fit.scores_)\ndfcolumns = pd.DataFrame(X.columns)\n\n#concat two dataframes\nfeatureScores = pd.concat(&#x5B;dfcolumns,dfscores],axis=1)\nfeatureScores.columns = &#x5B;&#039;Specs&#039;,&#039;Score&#039;]  #naming the dataframe columns\nprint(featureScores.nlargest(10,&#039;Score&#039;))  #printing 10 best features\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">Specs          Score\n13            ram  931267.519053\n11      px_height   17363.569536\n0   battery_power   14129.866576\n12       px_width    9810.586750\n8       mobile_wt      95.972863\n6      int_memory      89.839124\n15           sc_w      16.480319\n16      talk_time      13.236400\n4 
             fc      10.135166\n14           sc_h       9.614878<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"2-feature-importance\">2. Feature Importance<\/h3>\n\n\n\n<p>A fitted tree-based model exposes a feature_importances_ attribute that gives the importance of each feature in your dataset.<\/p>\n\n\n\n<p>Feature importance assigns a score to each of your data&#8217;s features; the higher the score, the more important or relevant the feature is to your output variable. We will use the ExtraTreesClassifier in the example below to extract the top 10 features for the dataset, because tree-based classifiers provide feature_importances_ as a built-in attribute.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\nimport numpy as np\ndata = pd.read_csv(&quot;C:\/\/Users\/\/Intel\/\/Documents\/\/mobile_price_train.csv&quot;)\nX = data.iloc&#x5B;:,0:20]  #independent variable columns\ny = data.iloc&#x5B;:,-1]    #target variable column (price range)\nfrom sklearn.ensemble import ExtraTreesClassifier\nimport matplotlib.pyplot as plt\nmodel = ExtraTreesClassifier()\nmodel.fit(X,y)\nprint(model.feature_importances_) \n\n#plot the graph of feature importances \nfeat_importances = pd.Series(model.feature_importances_, index=X.columns)\nfeat_importances.nlargest(10).plot(kind=&#039;barh&#039;)\nplt.show()\n<\/pre><\/div>\n\n\n<p>Output (exact values vary between runs because the trees are built with randomness):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[0.05945479 0.02001093 0.03442302 0.0202319  0.03345326 0.01807593\n 0.03747275 0.03450839 0.03801611 0.0335925  0.03590059 0.04702123\n 0.04795976 0.38014236 0.03565894 0.03548119 0.03506038 0.01391338\n 0.01895962 0.02066298]<\/pre>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"428\" height=\"248\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/01\/Feature-Importance-Plot.png\" alt=\"Feature Importance Plot\" 
class=\"wp-image-26627\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/01\/Feature-Importance-Plot.png 428w, https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/01\/Feature-Importance-Plot-300x174.png 300w\" sizes=\"auto, (max-width: 428px) 100vw, 428px\" \/><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"3-correlation-statistics-with-heatmap\">3. Correlation Statistics with Heatmap<\/h3>\n\n\n\n<p>Correlation describes the relationship between the features and the target variable.<br>Correlation can be:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Positive<\/strong>: An increase in one feature&#8217;s value goes along with an increase in the value of the target variable.<\/li><li><strong>Negative:<\/strong> An increase in one feature&#8217;s value goes along with a decrease in the value of the target variable.<\/li><\/ul>\n\n\n\n<p>We will plot a heatmap of the correlation matrix using the Seaborn library to find which features are most strongly correlated with the target variable.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\nimport numpy as np\nimport seaborn as sns\nimport matplotlib.pyplot as plt  #needed for plt.figure and plt.show\ndata = pd.read_csv(&quot;C:\/\/Users\/\/Intel\/\/Documents\/\/mobile_price_train.csv&quot;)\nX = data.iloc&#x5B;:,0:20]  #independent variable columns\ny = data.iloc&#x5B;:,-1]    #target variable column (price range)\n\n#obtain the correlations of each feature in the dataset\ncorrmat = data.corr()\ntop_corr_features = corrmat.index\nplt.figure(figsize=(20,20))\n#plot heat map\ng=sns.heatmap(data&#x5B;top_corr_features].corr(),annot=True,cmap=&quot;RdYlGn&quot;)\nplt.show()\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"968\" height=\"1024\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/01\/HEAT-MAP-968x1024.png\" alt=\"HEAT MAP\" class=\"wp-image-26628\" 
srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/01\/HEAT-MAP-968x1024.png 968w, https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/01\/HEAT-MAP-284x300.png 284w, https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/01\/HEAT-MAP-768x813.png 768w, https:\/\/www.askpython.com\/wp-content\/uploads\/2022\/01\/HEAT-MAP.png 1119w\" sizes=\"auto, (max-width: 968px) 100vw, 968px\" \/><\/figure><\/div>\n\n\n\n<p>Go to the last row and look at the price range. You will see how each feature correlates with the price range. &#8216;ram&#8217; is the feature most strongly correlated with the price range, followed by features such as battery power, pixel height, and pixel width. m_dep, clock_speed, and n_cores are the features least correlated with the price range.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"conclusion\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>In this article, we learned how to choose relevant features from data using univariate selection, feature importance, and the correlation matrix. Choose the method that best suits your case and use it to improve your model\u2019s accuracy.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article is a little on the advanced side. We&#8217;ll discuss feature selection in Python for training machine learning models. It&#8217;s important to identify the important features from a dataset and eliminate the less important features that don&#8217;t improve model accuracy. Model performance can be harmed by features that are irrelevant or only partially relevant. 
[&hellip;]<\/p>\n","protected":false},"author":39,"featured_media":26948,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-26618","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/26618","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=26618"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/26618\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/26948"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=26618"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=26618"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=26618"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}