Demystifying the basics of Encoding and Decoding in Python

A short blog on my experiences dealing with non-ASCII characters in Python 3
Categories: Coding, Python
Author: Senthil Kumar
Published: December 9, 2022

Basics of Encoding and Decoding

What is Unicode?

Unicode is a standard that assigns a unique number (a code point) to every character, irrespective of the spoken language (Japanese, English, French, etc.) it comes from or the programming language (Python, Java, etc.) it is used in.


What is the purpose of Unicode?

There are a great many languages in the world. Some use the Latin writing system (English, French, Spanish, etc.), and many others, including most Asian languages, use non-Latin scripts. Unicode gives a unique numerical representation to each character of the world's major known writing systems.

  • Because each number is unique, text can be transmitted over digital channels without ambiguity

What are encoding and decoding?

  • Computers store and transmit information as bytes. Encoding is the process of converting Unicode text (code points) to bytes
  • Decoding is the process of converting bytes back to Unicode text so humans can read it

What is the Unicode Character Set (UCS)?

  • For all major languages in the world, every unique character is assigned a unique value or “code point”. This set of unique values, also representing emojis and other symbols, is the Unicode Character Set. Unicode includes characters from Latin, Greek, Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, and many others.
  • Code points are typically written in hexadecimal, such as U+0041 for the Latin capital letter “A” or U+30A2 for the Japanese katakana character “ア”.
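
As a quick check of the two code points mentioned above, the following sketch uses Python's built-in ord() and chr() together with the standard unicodedata module. The helper name code_point_label is only an illustrative name of mine, not something from a library.

```python
import unicodedata

def code_point_label(char: str) -> str:
    """Return the 'U+XXXX' label for a single character (illustrative helper)."""
    return f"U+{ord(char):04X}"

for char in ["A", "ア"]:
    # unicodedata.name() returns the official Unicode name of the character
    print(char, code_point_label(char), unicodedata.name(char))

# chr() goes the other way: from the number back to the character
assert chr(0x30A2) == "ア"
```

ord() and chr() work on code points only; no bytes are involved until the string is encoded.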

What are some commonly used encodings?

Encoding Table
| Encoding Type | Full Description | Num of bits | Where Used / Supported Character Set |
|---|---|---|---|
| ASCII | American Standard Code for Information Interchange | 7 bits | English text; supports basic Latin letters, numbers and punctuation marks |
| UTF-8 | Unicode Transformation Format, 8-bit | Variable length, min 8 bits | Supports all languages in Unicode; 8 bits for every ASCII character; up to 32 bits for other characters |
| UTF-16 | Unicode Transformation Format, 16-bit | Variable length, min 16 bits (16 or 32) | Commonly used by applications that require multi-language support |
| Latin-1 | ISO-8859-1 or Western European Encoding | 8 bits | Limited to Western European languages; does not cover the entire Unicode character set |
| UTF-32 | Unicode Transformation Format, 32-bit | Fixed length, 32 bits | Maps every code point directly to one 32-bit unit; less commonly used; high storage cost |
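
To make the table concrete, here is a small sketch that counts how many bytes each encoding needs for a few sample characters. The sample characters and the helper name byte_lengths are my own choices for illustration. The "-le" codec variants are used so that no byte order mark (BOM) is added to the count.

```python
def byte_lengths(char: str) -> dict:
    """Map encoding name -> number of bytes needed for one character."""
    sizes = {}
    for encoding in ["ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le"]:
        try:
            sizes[encoding] = len(char.encode(encoding))
        except UnicodeEncodeError:
            # The encoding has no representation for this character
            sizes[encoding] = None
    return sizes

for char in ["A", "£", "ア", "😀"]:
    print(repr(char), byte_lengths(char))
```

A value of None means the encoding cannot represent that character at all, which is the situation behind the UnicodeEncodeError seen when writing non-ASCII text with the ascii codec.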

Encoding and Decoding Strings in Python

  • In Python 3, every str is a Unicode string
  • To store or transmit it, the computer “encodes” the string into a byte string
  • By default, str.encode() uses UTF-8. You can also encode in other formats such as UTF-16
byte_string = "センティル・クマール".encode()
byte_string
b'\xe3\x82\xbb\xe3\x83\xb3\xe3\x83\x86\xe3\x82\xa3\xe3\x83\xab\xe3\x83\xbb\xe3\x82\xaf\xe3\x83\x9e\xe3\x83\xbc\xe3\x83\xab'
byte_string_utf16 = "センティル・クマール".encode('utf-16')
byte_string_utf16
b'\xff\xfe\xbb0\xf30\xc60\xa30\xeb0\xfb0\xaf0\xde0\xfc0\xeb0'
print(byte_string.decode())
print(byte_string_utf16.decode('utf-16'))
センティル・クマール
センティル・クマール
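
The decoder has to match the encoder. The sketch below reuses the same sample string and shows what I would expect when UTF-16 bytes are decoded as UTF-8: a UnicodeDecodeError by default, and replacement characters when errors="replace" is passed.

```python
text = "センティル・クマール"
utf16_bytes = text.encode("utf-16")

try:
    utf16_bytes.decode("utf-8")  # wrong codec on purpose
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print("Decoding failed:", err)

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising
lossy_text = utf16_bytes.decode("utf-8", errors="replace")
print(lossy_text)

# Decoding with the codec used for encoding restores the original string
assert utf16_bytes.decode("utf-16") == text
```

The lossy result is readable by the program but no longer equals the original text, so errors="replace" is a way to keep going, not a way to recover the data.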

About Byte Strings in Python

Byte strings are used to represent binary data, such as images, audio files, or serialized objects. Binary data is not directly representable as text and needs to be stored and processed as a sequence of bytes.

>> type(byte_string)
bytes
  • A byte string can be written directly in Python as a literal with the prefix “b”
>> forced_byte_string = b"some_string"
>> type(forced_byte_string)
bytes
  • A bytes literal can NOT contain non-ASCII characters
forced_byte_string = b"センティル・クマール"
SyntaxError: bytes can only contain ASCII literal characters.
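
Non-ASCII data can still end up in a bytes object. It just has to get there through encoding or escape sequences rather than raw characters in the literal. A minimal sketch using the first character of the name:

```python
# Option 1: encode a normal (Unicode) string
from_encode = "セ".encode("utf-8")

# Option 2: write the raw byte values as hex escapes inside a bytes literal
from_escapes = b"\xe3\x82\xbb"

# Option 3: build the object from a list of integers in range(256)
from_ints = bytes([0xE3, 0x82, 0xBB])

assert from_encode == from_escapes == from_ints
print(from_escapes.decode("utf-8"))
```
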
  • One place byte strings show up is serialization: the pickle module turns Python objects into bytes

import pickle
an_example_dict = {
  "English": "Senthil Kumar", 
  "Japanese": "センティル・クマール",
  "Chinese": "森蒂尔·库马尔",
  "Korean": "센틸 쿠마르",
  "Arabic": "سينتيل كومار",
  "Urdu": "سینتھل کمار"
}

serialized_data = pickle.dumps(an_example_dict)
print(type(serialized_data))

with open("serialized_dict.pkl", "wb") as file:
    file.write(serialized_data)
<class 'bytes'>
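
Reading the pickle back is the mirror image: open the file in binary mode ("rb") and hand the bytes to pickle.loads. The sketch below writes to a temporary directory so it can run on its own, and the smaller dictionary is just a stand-in for the one above. Only unpickle data you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

names = {"English": "Senthil Kumar", "Japanese": "センティル・クマール"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "serialized_dict.pkl")

    # "wb" / "rb": pickle data is bytes, so no text encoding is involved
    with open(path, "wb") as file:
        file.write(pickle.dumps(names))

    with open(path, "rb") as file:
        restored = pickle.loads(file.read())

print(restored == names)
```
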

Encoding and Decoding Files in Python

Saving Text Files in ASCII and UTF Formats

  • The code below throws no error, because the text is English-only (pure ASCII)
normal_text = 'Hot: Microsoft Surface Pro 4 Tablet Intel Core i7 8GB RAM 256GB.. now Pound 1079.00! #SpyPrice #Microsoft'
with open("saving_eng__only_text.txt","w",encoding="ascii") as f:
    f.write(normal_text)

  • The code below throws an error, because the text contains the non-ASCII character “£”
non_ascii_text = 'Hot: Microsoft Surface Pro 4 Tablet Intel Core i7 8GB RAM 256GB.. now £1079.00! #SpyPrice #Microsoft'
with open("saving_a_latin_string.txt","w",encoding="ascii") as f:
    f.write(non_ascii_text)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Input In [21], in <cell line: 1>()
      1 with open("saving_a_latin_string.txt","w",encoding="ascii") as f:
----> 2     f.write(non_ascii_text)

UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in position 70: ordinal not in range(128)
  • Changing the encoding to “utf-8” fixes the error
with open("saving_a_latin_string.txt","w",encoding="utf-8") as f:
    f.write(non_ascii_text)
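
When switching to UTF-8 is not an option, the errors argument of str.encode (also accepted by open) controls what happens to characters the codec cannot represent. Here is a short sketch with a shortened version of the text. Which handler is appropriate depends on whether losing the character is acceptable.

```python
price_text = "now £1079.00!"

# "replace" substitutes "?" for each unencodable character (lossy)
replaced = price_text.encode("ascii", errors="replace")

# "ignore" silently drops the character (lossy)
ignored = price_text.encode("ascii", errors="ignore")

# "backslashreplace" keeps the information as a Python-style escape
escaped = price_text.encode("ascii", errors="backslashreplace")

print(replaced)
print(ignored)
print(escaped)
```
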

Saving Non-ASCII JSON Files in different formats

  1. Saving a dict using json.dump, utf-8 encoding
  2. Saving the same dict as a json_string using json.dumps, utf-8 encoding
  3. Saving the same dict using json.dump, utf-16 encoding
import json
an_example_dict = {
  "English": "Senthil Kumar", 
  "Japanese": "センティル・クマール",
  "Chinese": "森蒂尔·库马尔",
  "Korean": "센틸 쿠마르",
  "Arabic": "سينتيل كومار",
  "Urdu": "سینتھل کمار"
}

with open("saving_the_names_dict_utf8.json","w",encoding="utf-8") as f:
    json.dump(an_example_dict, f,ensure_ascii=False)

an_example_dict_str = json.dumps(an_example_dict,ensure_ascii=False)
with open("saving_the_names_dict_utf8_using_json_string.json","w",encoding="utf-8") as f:
    f.write(an_example_dict_str)
    
with open("saving_the_names_dict_utf16.json","w",encoding="utf-16") as f:
    json.dump(an_example_dict, f,ensure_ascii=False)
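
The ensure_ascii=False argument matters here. With the default (ensure_ascii=True), json escapes every non-ASCII character as \uXXXX, so the output is pure ASCII regardless of the file encoding. A small sketch with a single-entry dictionary of my own:

```python
import json

sample = {"Japanese": "ア"}

escaped_json = json.dumps(sample)                  # default: ensure_ascii=True
raw_json = json.dumps(sample, ensure_ascii=False)  # keep the character as-is

print(escaped_json)
print(raw_json)

# Both forms load back to the same dictionary
assert json.loads(escaped_json) == json.loads(raw_json) == sample
```

The escaped form is safe for ASCII-only channels, while the raw form is easier to read and usually smaller when stored as UTF-8.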
  • How to load the dict?
with open("saving_the_names_dict_utf8.json","r",encoding="utf-8") as f:
    loaded_dict = json.load(f)

print(loaded_dict)
{'English': 'Senthil Kumar', 'Japanese': 'センティル・クマール', 'Chinese': '森蒂尔·库马尔', 'Korean': '센틸 쿠마르', 'Arabic': 'سينتيل كومار', 'Urdu': 'سینتھل کمار'}
>> cat saving_the_names_dict_utf8.json
{"English": "Senthil Kumar", "Japanese": "センティル・クマール", "Chinese": "森蒂尔·库马尔", "Korean": "센틸 쿠마르", "Arabic": "سينتيل كومار", "Urdu": "سینتھل کمار"}
>> echo "the file size:" && du -hs saving_the_names_dict.json
echo "the utf8 file size in bytes:" && wc -c saving_the_names_dict_utf8.json 
echo "the utf8 file size in bytes:" && wc -c saving_the_names_dict_utf8_using_json_string.json
echo "the utf16 file size in bytes:" && wc -c saving_the_names_dict_utf16.json

the utf8 file size in bytes:
     209 saving_the_names_dict_utf8.json
the utf8 file size in bytes:
     209 saving_the_names_dict_utf8_using_json_string.json
the utf16 file size in bytes:
     292 saving_the_names_dict_utf16.json

Takeaway: in the example above, the UTF-16 file (292 bytes) is larger than either UTF-8 file (209 bytes).
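
The same comparison can be made without touching the disk by encoding the JSON string in memory. This is a sketch with a two-entry dictionary, so the numbers will differ from the file sizes above. The "utf-16" codec prepends a 2-byte byte order mark, which is included in the count.

```python
import json

names = {"English": "Senthil Kumar", "Japanese": "センティル・クマール"}
json_text = json.dumps(names, ensure_ascii=False)

utf8_size = len(json_text.encode("utf-8"))
utf16_size = len(json_text.encode("utf-16"))

print("utf-8 bytes: ", utf8_size)
print("utf-16 bytes:", utf16_size)
```
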

Conclusion

  • Use UTF-8 everywhere (see the links under Good Sources)
    • UTF-8 can encode everything that UTF-16 can, so UTF-8 meets most use cases.
    • UTF-16 uses a minimum of 2 bytes (16 bits) per character and hence is not compatible with 7-bit ASCII, whereas UTF-8 is backwards compatible with ASCII.
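
The ASCII compatibility claim is easy to check: for pure-ASCII text the UTF-8 bytes are identical to the ASCII bytes, while the UTF-16 bytes are not. A minimal sketch:

```python
ascii_text = "Senthil Kumar"

as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")
as_utf16 = ascii_text.encode("utf-16-le")  # "-le" variant: no byte order mark

print(as_ascii == as_utf8)    # same bytes, so ASCII tools can read UTF-8
print(as_ascii == as_utf16)   # different: every character takes 2 bytes

# An ASCII decoder can read the UTF-8 bytes, but not recover the UTF-16 ones
assert as_utf8.decode("ascii") == ascii_text
assert as_utf16.decode("ascii") != ascii_text
```
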

Good Sources

  • Why UTF-8 should be used?
    • https://stackoverflow.com/a/18231475
    • http://utf8everywhere.org/
  • Other good resources
    • Encoding-Decoding in Python 3 https://www.pythoncentral.io/encoding-and-decoding-strings-in-python-3-x/