{"id":12161,"date":"2021-01-30T06:05:00","date_gmt":"2021-01-30T06:05:00","guid":{"rendered":"https:\/\/www.askpython.com\/?p=12161"},"modified":"2021-02-08T15:44:54","modified_gmt":"2021-02-08T15:44:54","slug":"chi-square-test","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python\/examples\/chi-square-test","title":{"rendered":"Chi-square test in Python &#8212; All you need to know!!"},"content":{"rendered":"\n<p>Hello, readers! In this article, we will be focusing on <strong>Chi-square Test<\/strong> in Python. So, let us get started!!<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Understanding Statistical Tests for Data Science and Machine Learning<\/h2>\n\n\n\n<p>Statistical tests play an important role in the domain of Data Science and Machine Learning. With statistical tests, one can check assumptions about how the data is distributed and how the variables relate to one another.<\/p>\n\n\n\n<p>Different statistical tests apply depending on the type of variables i.e. continuous or categorical. 
For continuous data values, the following are the most used tests:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>T-test<\/strong><\/li><li><a aria-label=\"Correlation regression test (opens in a new tab)\" href=\"https:\/\/www.askpython.com\/python\/examples\/correlation-matrix-in-python\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"rank-math-link\">Correlation regression test<\/a><\/li><\/ul>\n\n\n\n<p>On the other hand, when categorical variables are involved, the popular statistical tests are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>ANOVA test<\/strong> (compares a continuous outcome across the groups of a categorical variable)<\/li><li><strong>Chi-square test<\/strong> (checks for association between two categorical variables)<\/li><\/ul>\n\n\n\n<p>Today, let us have a look at <strong>Chi-square test in Python<\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is a Chi-square Test?<\/h2>\n\n\n\n<p>The Chi-square test is a non-parametric statistical test that enables us to understand the relationship between the categorical variables of the dataset. That is, it tests whether the categorical variables are associated or independent of each other.<\/p>\n\n\n\n<p>Using the Chi-square test, we can check whether an association exists between the categorical variables of the dataset. 
This helps us judge whether one categorical variable depends on another.<\/p>\n\n\n\n<p>Let us now understand the Chi-square test in terms of hypotheses.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Hypothesis setup for Chi-square test<\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>The null hypothesis<\/strong> can be framed in the below manner: <em>The grouping variables are independent, i.e. there is no association between them.<\/em><\/li><li><strong>The alternate hypothesis<\/strong> goes as framed below: <em>The variables are associated with each other.<\/em><\/li><\/ul>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Using scipy.stats library to implement Chi-square test<\/h2>\n\n\n\n<p>In this example, we create a contingency table named &#8216;info&#8217;. Further, we make use of the <code>scipy.stats<\/code> library, which provides the <code>chi2_contingency()<\/code> function to implement the Chi-square test.<\/p>\n\n\n\n<p><strong>Example:<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# chi2_contingency runs the Chi-square test of independence on a contingency table\nfrom scipy.stats import chi2_contingency\n\n# Observed counts: 2 rows x 3 columns\ninfo = &#x5B;&#x5B;100, 200, 300], &#x5B;50, 60, 70]]\nprint(info)\n\n# The function returns four values; the last is the table of expected frequencies\nstat, p, dof, expected = chi2_contingency(info)\n\nprint(dof)\n\nsignificance_level = 0.05\nprint(&quot;p value: &quot; + str(p))\nif p &lt;= significance_level:\n\tprint(&#039;Reject NULL HYPOTHESIS&#039;)\nelse:\n\tprint(&#039;Fail to reject NULL HYPOTHESIS&#039;)\n\n<\/pre><\/div>\n\n\n<p>The function returns four values: the test statistic (which can be compared against a critical value to decide upon the hypothesis), the p-value, the degrees of freedom (the number of values that are free to vary, here (rows - 1) * (columns - 1) = 2) and the table of expected frequencies, which we do not use here.<\/p>\n\n\n\n<p>We make use of the p-value to interpret the Chi-square test.<\/p>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n<div 
class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n&#x5B;&#x5B;100, 200, 300], &#x5B;50, 60, 70]]\n2\np value: 0.001937714203415323\nReject NULL HYPOTHESIS\n<\/pre><\/div>\n\n\n<p>If the p-value is less than or equal to the chosen significance level (0.05), we reject the null hypothesis that there is no association between the variables, in favour of the alternate hypothesis.<\/p>\n\n\n\n<p>Thus, in this case, we reject the null hypothesis and conclude that the rows and columns of the passed table are associated.<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Using Chi-square test on a dataset<\/h2>\n\n\n\n<p>In this example, we will be making use of the Bike rental count dataset. You can find the dataset <a href=\"https:\/\/github.com\/Safa1615\/BIKE-RENTAL-COUNT\/blob\/master\/day.csv\" target=\"_blank\" aria-label=\"here (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"rank-math-link\">here<\/a>!<\/p>\n\n\n\n<p>Now, we will apply the Chi-square test to check whether two of its categorical variables are related.<\/p>\n\n\n\n<p>Initially, we load the dataset into the environment and then list the names of the categorical variables as shown:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport os\nimport pandas\n# Change the current working directory to the folder that holds day.csv\nos.chdir(&quot;D:\/Ediwsor_Project - Bike_Rental_Count&quot;)\nBIKE = pandas.read_csv(&quot;day.csv&quot;)\n# Names of the categorical columns in the dataset\ncategorical_col = &#x5B;&#039;season&#039;, &#039;yr&#039;, &#039;mnth&#039;, &#039;holiday&#039;, &#039;weekday&#039;, &#039;workingday&#039;,\n       &#039;weathersit&#039;]\nprint(categorical_col)\n<\/pre><\/div>\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n&#x5B;&#039;season&#039;, &#039;yr&#039;, &#039;mnth&#039;, &#039;holiday&#039;, 
&#039;weekday&#039;, &#039;workingday&#039;, &#039;weathersit&#039;]\n<\/pre><\/div>\n\n\n<p>Further, we use the crosstab() function to create a contingency table of the two variables we have selected to work on, &#8216;holiday&#8217; and &#8216;weathersit&#8217;.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nchisqt = pandas.crosstab(BIKE.holiday, BIKE.weathersit, margins=True)\nprint(chisqt)\n<\/pre><\/div>\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nweathersit    1    2   3  All\nholiday                      \n0           438  238  20  696\n1            15    6   0   21\nAll         453  244  20  717\n<\/pre><\/div>\n\n\n<p>At last, we apply the chi2_contingency() function on the table and get the statistic, p-value and degrees of freedom.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom scipy.stats import chi2_contingency\nimport numpy as np\nchisqt = pandas.crosstab(BIKE.holiday, BIKE.weathersit, margins=True)\n# Take the two holiday rows; the slice &#x5B;0:5] also picks up the &#039;All&#039; margin column\nvalue = np.array(&#x5B;chisqt.iloc&#x5B;0]&#x5B;0:5].values,\n                  chisqt.iloc&#x5B;1]&#x5B;0:5].values])\n# Keep only the statistic, p-value and degrees of freedom\nprint(chi2_contingency(value)&#x5B;0:3])\n\n<\/pre><\/div>\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n(1.0258904805937215, 0.794987564022437, 3)\n<\/pre><\/div>\n\n\n<p>From above, 1.03 is the test statistic, 0.79 is the p-value and 3 is the degrees of freedom. The degrees of freedom come out as 3 rather than 2 because the slice <code>&#x5B;0:5]<\/code> also passes the &#8216;All&#8217; margin column to the function; to test only the observed 2 x 3 counts, build the table without <code>margins=True<\/code>. As the p-value is greater than 0.05, we fail to reject the null hypothesis, i.e. the data shows no evidence of an association between the variables &#8216;holiday&#8217; and &#8216;weathersit&#8217;.<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>By this, we have come to the end of this topic. 
Feel free to comment below, in case you come across any question.<\/p>\n\n\n\n<p>Till then, Happy Analyzing!! \ud83d\ude42<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello, readers! In this article, we will be focusing on Chi-square Test in Python. So, let us get started!! Understanding Statistical Tests for Data Science and Machine Learning Statistical tests play an important role in the domain of Data Science and Machine Learning. With the statistical tests, one can presume a certain level of understanding [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":12195,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-12161","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/12161","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=12161"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/12161\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/12195"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=12161"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=12161"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=12161"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templat
ed":true}]}}