{"id":11343,"date":"2020-12-14T13:46:52","date_gmt":"2020-12-14T13:46:52","guid":{"rendered":"https:\/\/www.askpython.com\/?p=11343"},"modified":"2023-02-16T19:56:56","modified_gmt":"2023-02-16T19:56:56","slug":"unicode-in-python-unicodedata","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python-modules\/unicode-in-python-unicodedata","title":{"rendered":"Unicode In Python &#8211; The unicodedata Module Explained"},"content":{"rendered":"\n<p>Hey guys! In this tutorial, we will learn about Unicode in Python and the character properties of Unicode. So, let&#8217;s get started.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is Unicode?<\/h2>\n\n\n\n<p>Unicode associates each character and symbol with a unique number called a code point. It supports all of the world&#8217;s writing systems and ensures that data can be retrieved or combined using any combination of languages.<\/p>\n\n\n\n<p>A code point is an integer in the range 0 to 0x10FFFF (written in hexadecimal).<\/p>\n\n\n\n<p>To begin using Unicode characters in Python, we first need to understand how Python&#8217;s string module handles characters.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to interpret ASCII and Unicode in Python?<\/h2>\n\n\n\n<p>Python provides a <em>string<\/em> module that contains various constants and tools for working with strings. 
Its character constants cover only the ASCII character set.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport string\n\nprint(string.ascii_lowercase)\nprint(string.ascii_uppercase)\nprint(string.ascii_letters)\nprint(string.digits)\nprint(string.hexdigits)\nprint(string.octdigits)\nprint(string.whitespace)\nprint(string.punctuation)\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nabcdefghijklmnopqrstuvwxyz\nABCDEFGHIJKLMNOPQRSTUVWXYZ\nabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\n0123456789\n0123456789abcdefABCDEF\n01234567\n \t\n!&quot;#$%&amp;&#039;()*+,-.\/:;&lt;=&gt;?@&#x5B;\\]^_`{|}~\n<\/pre><\/div>\n\n\n<p>We can create a one-character Unicode string with the built-in <strong>chr()<\/strong> function. It takes a single integer code point as its argument and returns the corresponding character.<\/p>\n\n\n\n<p>Similarly, <strong>ord()<\/strong> is a built-in function that takes a one-character Unicode string as input and returns its code point as an integer.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nchr(57344)\nord(&#039;\\ue000&#039;)\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n&#039;\\ue000&#039;\n57344\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">What does character encoding mean in Python?<\/h2>\n\n\n\n<p>A string is a sequence of Unicode code points. To store or transmit a string, these code points are converted into a sequence of bytes. This process is called character encoding.<\/p>\n\n\n\n<p>There are many encodings, such as UTF-8, UTF-16, and ASCII. 
<\/p>\n\n\n\n<p><strong>By default, Python 3 uses UTF-8 for source files and for str.encode().<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is UTF-8 Encoding?<\/h2>\n\n\n\n<p>UTF-8 is the most widely used character encoding. UTF stands for <em>Unicode Transformation Format<\/em> and &#8216;8&#8217; means that <em>8-bit values<\/em> are used in the encoding.<\/p>\n\n\n\n<p>It has largely replaced ASCII (American Standard Code for Information Interchange) because it can represent every Unicode character and so works for languages around the world, whereas ASCII is limited to 128 characters, mainly unaccented English letters, digits, and punctuation.<\/p>\n\n\n\n<p>The first 128 code points are encoded in UTF-8 exactly as they are in ASCII, so valid ASCII text is also valid UTF-8. A character in UTF-8 is 1 to 4 bytes long.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Encoding Characters in UTF-8 Using the Python encode() function<\/h2>\n\n\n\n<p>The <a href=\"https:\/\/www.askpython.com\/python\/string\/python-encode-and-decode-functions\" class=\"rank-math-link\">encode() method<\/a> converts a string into a bytes object using the given encoding. 
The syntax of the encode() method is shown below &#8211;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nstring.encode(encoding=&#039;UTF-8&#039;,errors=&#039;strict&#039;)\n<\/pre><\/div>\n\n\n<p><strong>Parameters<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong><em>encoding<\/em><\/strong> is the name of the encoding to use; it must be one that Python supports.<\/li><li><strong><em>errors<\/em><\/strong> &#8211; how encoding errors are handled. The available error handlers are listed below.<\/li><\/ul>\n\n\n\n<ol class=\"wp-block-list\"><li><strong>strict<\/strong> &#8211; The default handler, which raises UnicodeEncodeError on failure.<\/li><li><strong>ignore<\/strong> &#8211; Drops unencodable characters from the result.<\/li><li><strong>replace<\/strong> &#8211; Replaces each unencodable character with &#8216;?&#8217;.<\/li><li><strong>xmlcharrefreplace<\/strong> &#8211; Inserts an XML character reference in place of each unencodable character.<\/li><li><strong>backslashreplace<\/strong> &#8211; Inserts a backslash escape sequence such as \\uNNNN in place of each unencodable character.<\/li><li><strong>namereplace<\/strong> &#8211; Inserts a \\N{&#8230;} escape sequence in place of each unencodable character.<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">How to use Unicode in Python with the encode() function?<\/h2>\n\n\n\n<p>Let&#8217;s now see how the string encode() method converts Unicode strings into bytes in Python.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. 
Encode a string to UTF-8<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nstring = &#039;\u00f6range&#039;\nprint(&#039;The string is:&#039;, string)\nstring_utf = string.encode()  # UTF-8 is the default encoding\nprint(&#039;The encoded string is:&#039;, string_utf)\n<\/pre><\/div>\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nThe string is: \u00f6range\nThe encoded string is: b&#039;\\xc3\\xb6range&#039;\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">2. Encoding with error parameter<\/h3>\n\n\n\n<p>Let us encode the German word wei\u00df, which means white.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nstring = &#039;wei\u00df&#039;\n\nx = string.encode(encoding=&#039;ascii&#039;, errors=&#039;backslashreplace&#039;)\nprint(x)\n\nx = string.encode(encoding=&#039;ascii&#039;, errors=&#039;ignore&#039;)\nprint(x)\n\nx = string.encode(encoding=&#039;ascii&#039;, errors=&#039;namereplace&#039;)\nprint(x)\n\nx = string.encode(encoding=&#039;ascii&#039;, errors=&#039;replace&#039;)\nprint(x)\n\nx = string.encode(encoding=&#039;ascii&#039;, errors=&#039;xmlcharrefreplace&#039;)\nprint(x)\n\nx = string.encode(encoding=&#039;UTF-8&#039;, errors=&#039;strict&#039;)\nprint(x)\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nb&#039;wei\\\\xdf&#039;\nb&#039;wei&#039;\nb&#039;wei\\\\N{LATIN SMALL LETTER SHARP S}&#039;\nb&#039;wei?&#039;\nb&#039;wei&amp;#223;&#039;\nb&#039;wei\\xc3\\x9f&#039;\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">The unicodedata module to work with Unicode in Python<\/h2>\n\n\n\n<p>The <em><strong>unicodedata<\/strong><\/em> module provides access to the <strong><em>Unicode Character Database (UCD)<\/em><\/strong>, which defines 
all character properties of all Unicode characters.<\/p>\n\n\n\n<p>Let&#8217;s look at the main functions defined within the module, each with a simple example to explain its functionality. These functions let us work with Unicode efficiently in Python.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. <strong>unicodedata.lookup(name)<\/strong><\/h3>\n\n\n\n<p>This function looks up a character by name. If a character with that name is found, it is returned. If not, KeyError is raised.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport unicodedata\n\nprint(unicodedata.lookup(&#039;LEFT CURLY BRACKET&#039;))\nprint(unicodedata.lookup(&#039;RIGHT SQUARE BRACKET&#039;))\nprint(unicodedata.lookup(&#039;ASTERISK&#039;))\nprint(unicodedata.lookup(&#039;EXCLAMATION MARK&#039;))\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n{\n]\n*\n!\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">2. <strong>unicodedata.name(chr[, default])<\/strong><\/h3>\n\n\n\n<p>This function returns the name assigned to the character <em>chr<\/em> as a string. If no name is defined, <em>default<\/em> is returned, or, if <em>default<\/em> is not given, ValueError is raised.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport unicodedata\n\nprint(unicodedata.name(&#039;%&#039;))\nprint(unicodedata.name(&#039;|&#039;))\nprint(unicodedata.name(&#039;*&#039;))\nprint(unicodedata.name(&#039;@&#039;))\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nPERCENT SIGN\nVERTICAL LINE\nASTERISK\nCOMMERCIAL AT\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">3. 
<strong>unicodedata.decimal(chr[, default])<\/strong><\/h3>\n\n\n\n<p>This function returns the decimal value assigned to the character <em>chr<\/em> as an integer. If no such value is defined, <em>default<\/em> is returned, or, if <em>default<\/em> is not given, ValueError is raised, as shown in the example below.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport unicodedata\n\nprint(unicodedata.decimal(&#039;6&#039;))\nprint(unicodedata.decimal(&#039;b&#039;))\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n6\nTraceback (most recent call last):\n  File &quot;D:\\DSCracker\\DS Cracker\\program.py&quot;, line 4, in &lt;module&gt;\n    print(unicodedata.decimal(&#039;b&#039;))\nValueError: not a decimal\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">4. <strong>unicodedata.digit(chr[, default])<\/strong><\/h3>\n\n\n\n<p>This function returns the digit value assigned to the character <em>chr<\/em> as an integer. One thing to note is that this function takes a single character as input. 
In the last line of this example, I&#8217;ve passed &#8220;20&#8221;, and the function raises a TypeError because it accepts only a single character, not a longer string.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport unicodedata\n\nprint(unicodedata.digit(&#039;9&#039;))\nprint(unicodedata.digit(&#039;0&#039;))\nprint(unicodedata.digit(&#039;20&#039;))\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n9\n0\nTraceback (most recent call last):\n  File &quot;D:\\DSCracker\\DS Cracker\\program.py&quot;, line 5, in &lt;module&gt;\n    print(unicodedata.digit(&#039;20&#039;))\nTypeError: digit() argument 1 must be a unicode character, not str\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">5. <strong>unicodedata.numeric(chr[, default])<\/strong><\/h3>\n\n\n\n<p>This function returns the numeric value assigned to the character <em>chr<\/em> as a float. If no such value is defined, <em>default<\/em> is returned, or, if <em>default<\/em> is not given, ValueError is raised.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport unicodedata\n\nprint(unicodedata.numeric(&#039;1&#039;))\nprint(unicodedata.numeric(&#039;8&#039;))\nprint(unicodedata.numeric(&#039;123&#039;))\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n1.0\n8.0\nTraceback (most recent call last):\n  File &quot;D:\\DSCracker\\DS Cracker\\program.py&quot;, line 5, in &lt;module&gt;\n    print(unicodedata.numeric(&#039;123&#039;))\nTypeError: numeric() argument 1 must be a unicode character, not str\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">6. 
<strong>unicodedata.category(chr)<\/strong><\/h3>\n\n\n\n<p>This function returns the general category assigned to the character <em>chr<\/em> as a two-letter string. The first letter gives the major class, such as \u2018L\u2019 for letter, and the second gives the subclass, such as \u2018u\u2019 for uppercase or \u2018l\u2019 for lowercase.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport unicodedata\n\nprint(unicodedata.category(&#039;P&#039;))\nprint(unicodedata.category(&#039;p&#039;))\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nLu\nLl\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">7. <strong>unicodedata.bidirectional(chr)<\/strong><\/h3>\n\n\n\n<p>This function returns the bidirectional class assigned to the character <em>chr<\/em> as a string. An empty string is returned if no such value is defined.<\/p>\n\n\n\n<p>For example, AL denotes an Arabic letter, AN denotes an Arabic number, and L denotes left-to-right.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport unicodedata\n\nprint(unicodedata.bidirectional(&#039;\\u0760&#039;))\nprint(unicodedata.bidirectional(&#039;\\u0560&#039;))\nprint(unicodedata.bidirectional(&#039;\\u0660&#039;))\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nAL\nL\nAN\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">8. <strong>unicodedata.combining(chr)<\/strong><\/h3>\n\n\n\n<p>This function returns the canonical combining class assigned to the given character <em>chr<\/em> as an integer. 
It returns 0 if there is no combining class defined.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport unicodedata\n\nprint(unicodedata.combining(&quot;\\u0317&quot;))\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n220\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">9. <strong>unicodedata.mirrored(chr)<\/strong><\/h3>\n\n\n\n<p>This function returns the <em>mirrored<\/em> property assigned to the given character <em>chr<\/em> as an integer. It returns <em>1<\/em> if the character is identified as &#8216;<em>mirrored<\/em>&#8217; in bidirectional text; otherwise it returns <em>0<\/em>.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport unicodedata\n\nprint(unicodedata.mirrored(&quot;\\u0028&quot;))\nprint(unicodedata.mirrored(&quot;\\u0578&quot;))\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n1\n0\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">10. <strong>unicodedata.normalize(form, unistr)<\/strong><\/h3>\n\n\n\n<p>This function returns the normal form <em>form<\/em> of the Unicode string <em>unistr<\/em>. 
The valid values for <em>form<\/em> are \u2018NFC\u2019, \u2018NFKC\u2019, \u2018NFD\u2019, and \u2018NFKD\u2019.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom unicodedata import normalize\n\nprint(repr(normalize(&#039;NFD&#039;, &#039;\\u00C6&#039;)))\nprint(repr(normalize(&#039;NFC&#039;, &#039;C\\u0367&#039;)))\nprint(repr(normalize(&#039;NFKD&#039;, &#039;\\u2760&#039;)))\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n&#039;\u00c6&#039;\n&#039;C\u0367&#039;\n&#039;\u2760&#039;\n<\/pre><\/div>\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>In this tutorial, we learned about Unicode and the unicodedata module, which exposes the properties of Unicode characters. Hope you all enjoyed. Stay tuned \ud83d\ude42<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<p><a href=\"https:\/\/docs.python.org\/3\/howto\/unicode.html\" class=\"rank-math-link\" target=\"_blank\" rel=\"noopener\">Unicode HOWTO (official Python docs)<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/docs.python.org\/3\/library\/unicodedata.html\" class=\"rank-math-link\" target=\"_blank\" rel=\"noopener\">unicodedata module documentation<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hey guys! In this tutorial, we will learn about Unicode in Python and the character properties of Unicode. So, let&#8217;s get started. What is Unicode? Unicode associates each character and symbol with a unique number called a code point. 
It supports all of the world&#8217;s writing systems and ensures that data can be retrieved or combined [&hellip;]<\/p>\n","protected":false},"author":18,"featured_media":11377,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-11343","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-python-modules"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/11343","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=11343"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/11343\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/11377"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=11343"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=11343"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=11343"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}