Basics of Encoding and Decoding
What is Unicode?
Unicode assigns a unique number to every character, regardless of the spoken language (Japanese, English, French, etc.) it comes from or the programming language (Python, Java, etc.) it is used in.
What is the purpose of Unicode?
There are innumerable languages in this world. Some follow the Latin writing system (English, French, Spanish, etc.), while many Asian languages use non-Latin scripts. Unicode assigns a unique numerical representation to each character of the world's known, major languages.
- The uniqueness of these code points helps in the transmission of information over digital channels
And what are encoding and decoding, you may wonder?
- Computers transmit information in bytes. Encoding is the process of converting Unicode code points to bytes
- Decoding is the process of converting bytes back to Unicode code points so humans can interpret them
What is the Unicode Character Set (UCS)?
- For all major languages in the world, every unique character is assigned a unique value or “code point”. This set of unique values, also representing emojis and other symbols, is the Unicode Character Set. Unicode includes characters from Latin, Greek, Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, and many others.
- Code points are typically represented in hexadecimal format, such as U+0041 for the Latin capital letter “A” or U+30A2 for the Japanese katakana character “ア”.
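The character-to-code-point mapping can be inspected directly in Python with the built-ins `ord()` and `chr()` — a quick sketch:

```python
# ord() gives the Unicode code point of a character;
# chr() goes the other way, from code point to character.
print(hex(ord('A')))   # 0x41   -> written as U+0041
print(hex(ord('ア')))  # 0x30a2 -> written as U+30A2
print(chr(0x0041))     # A
print(chr(0x30A2))     # ア
```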
What are some commonly used encoding techniques?
Encoding Type | Full Description | Num of Bits | Where Used / Supported Character Set |
---|---|---|---|
ASCII | American Standard Code for Information Interchange | 7 bits | For English text; supports basic Latin letters, numbers and punctuation marks |
UTF-8 | Unicode Transformation Format (8-bit) | variable length, minimum 8 bits (1 byte) | Can support multiple languages; 8 bits for ASCII characters; up to 32 bits (4 bytes) for some characters; backwards compatible with ASCII |
UTF-16 | Unicode Transformation Format (16-bit) | variable length, minimum 16 bits (2 bytes) | Commonly used in applications which require multi-language support |
Latin-1 | ISO-8859-1 or Western European Encoding | 8 bits | Limited to Western European languages; does not cover the entire Unicode character set |
UTF-32 | Unicode Transformation Format (32-bit) | fixed length, 32 bits | Provides direct mapping between code points and characters; less commonly used; high storage |
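To make the table concrete, here is a small sketch comparing how many bytes the same characters occupy under each encoding (the `utf-16-le`/`utf-32-le` codec variants are used so the byte-order mark does not inflate the counts):

```python
# byte sizes of single characters under different encodings
for ch in ['A', 'é', 'ア', '😀']:
    sizes = {enc: len(ch.encode(enc)) for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
    print(ch, sizes)
# 'A'  -> 1 byte in utf-8, 2 in utf-16, 4 in utf-32
# 'ア' -> 3 bytes in utf-8, 2 in utf-16, 4 in utf-32
# '😀' -> 4 bytes in utf-8, 4 in utf-16 (surrogate pair), 4 in utf-32
```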
Encoding and Decoding Strings in Python
- In Python, all strings by default are Unicode strings
- A Unicode string is read by the computer by “encoding” it into a byte string
- By default, Python uses `utf-8` encoding. You can also encode in `utf-16`
```python
byte_string = "センティル・クマール".encode()
print(byte_string)
# b'\xe3\x82\xbb\xe3\x83\xb3\xe3\x83\x86\xe3\x82\xa3\xe3\x83\xab\xe3\x83\xbb\xe3\x82\xaf\xe3\x83\x9e\xe3\x83\xbc\xe3\x83\xab'

byte_string_utf16 = "センティル・クマール".encode('utf-16')
print(byte_string_utf16)
# b'\xff\xfe\xbb0\xf30\xc60\xa30\xeb0\xfb0\xaf0\xde0\xfc0\xeb0'

print(byte_string.decode())
print(byte_string_utf16.decode('utf-16'))
# センティル・クマール
# センティル・クマール
```
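A byte string must be decoded with the same codec it was encoded with; a mismatched codec either raises an error or produces garbled text. A minimal sketch:

```python
data = "クマール".encode('utf-16')

# utf-16 output starts with the byte-order mark b'\xff\xfe',
# and 0xff can never appear in valid utf-8 -- so this fails:
try:
    data.decode('utf-8')
except UnicodeDecodeError as e:
    print("decode failed:", e)

# decoding with the matching codec round-trips cleanly
assert data.decode('utf-16') == "クマール"
```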
About Byte Strings in Python
Byte strings are used to represent binary data, such as images, audio files, or serialized objects. Binary data is not directly representable as text and needs to be stored and processed as a sequence of bytes.
```python
>>> type(byte_string)
bytes
```
- It is possible to create byte strings directly in Python using the prefix “b”
```python
>>> forced_byte_string = b"some_string"
>>> type(forced_byte_string)
bytes
```
- It is NOT possible to create byte strings containing non-ASCII characters with the “b” prefix
```python
>>> forced_byte_string = b"センティル・クマール"
SyntaxError: bytes can only contain ASCII literal characters.
```
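To obtain a byte string for non-ASCII text, call `.encode()` on the str instead of using a `b` literal — a quick sketch:

```python
# the b"..." literal rejects non-ASCII, but .encode() works for any text
jp_bytes = "センティル・クマール".encode('utf-8')
print(type(jp_bytes))            # <class 'bytes'>
print(jp_bytes.decode('utf-8'))  # round-trips back to the original string
```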
- One example of using byte strings is when we serialize objects (such as Python objects) using the `pickle` module
```python
import pickle

an_example_dict = {
    "English": "Senthil Kumar",
    "Japanese": "センティル・クマール",
    "Chinese": "森蒂尔·库马尔",
    "Korean": "센틸 쿠마르",
    "Arabic": "سينتيل كومار",
    "Urdu": "سینتھل کمار"
}

serialized_data = pickle.dumps(an_example_dict)
print(type(serialized_data))
# <class 'bytes'>

with open("serialized_dict.pkl", "wb") as file:
    file.write(serialized_data)
```
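Deserializing goes through the same byte interface in reverse: read the file in binary mode ("rb") and hand the bytes to pickle. A self-contained sketch of the round trip:

```python
import pickle

an_example_dict = {"English": "Senthil Kumar", "Japanese": "センティル・クマール"}

# write the pickled bytes, then read them back in binary ("rb") mode
with open("serialized_dict.pkl", "wb") as file:
    pickle.dump(an_example_dict, file)

with open("serialized_dict.pkl", "rb") as file:
    restored_dict = pickle.load(file)

assert restored_dict == an_example_dict  # non-ASCII values survive intact
```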
Encoding and Decoding Files in Python
Saving Text Files in ASCII and UTF Formats
- The below code will throw NO error, because the text is English-only (ASCII)
```python
normal_text = 'Hot: Microsoft Surface Pro 4 Tablet Intel Core i7 8GB RAM 256GB.. now Pound 1079.00! #SpyPrice #Microsoft'
with open("saving_eng__only_text.txt", "w", encoding="ascii") as f:
    f.write(normal_text)
```
- The below code will throw an error, because the text contains the non-ASCII character “£”
```python
non_ascii_text = 'Hot: Microsoft Surface Pro 4 Tablet Intel Core i7 8GB RAM 256GB.. now £1079.00! #SpyPrice #Microsoft'
with open("saving_a_latin_string.txt", "w", encoding="ascii") as f:
    f.write(non_ascii_text)
```
```
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Input In [21], in <cell line: 1>()
      1 with open("saving_a_latin_string.txt","w",encoding="ascii") as f:
----> 2     f.write(non_ascii_text)

UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in position 70: ordinal not in range(128)
```
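When staying in ASCII matters more than preserving every character, the `errors` parameter of `encode()`/`open()` offers a lossy alternative to switching codecs. A sketch (the `lossy_ascii.txt` filename is just for illustration):

```python
non_ascii_text = 'now £1079.00! #SpyPrice'

# errors='replace' substitutes '?' for unencodable characters;
# errors='ignore' silently drops them
print(non_ascii_text.encode('ascii', errors='replace'))  # b'now ?1079.00! #SpyPrice'
print(non_ascii_text.encode('ascii', errors='ignore'))   # b'now 1079.00! #SpyPrice'

# the same parameter works when opening files for writing
with open("lossy_ascii.txt", "w", encoding="ascii", errors="replace") as f:
    f.write(non_ascii_text)
```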
- Changing the encoding to “utf-8” fixes the error
```python
with open("saving_a_latin_string.txt", "w", encoding="utf-8") as f:
    f.write(non_ascii_text)
```
Saving Non-ASCII JSON Files in Different Formats
- Saving a dict using `json.dump`, utf-8 encoding
- Saving the same dict as a JSON string using `json.dumps`, utf-8 encoding
- Saving the same dict using `json.dump`, utf-16 encoding
```python
import json

an_example_dict = {
    "English": "Senthil Kumar",
    "Japanese": "センティル・クマール",
    "Chinese": "森蒂尔·库马尔",
    "Korean": "센틸 쿠마르",
    "Arabic": "سينتيل كومار",
    "Urdu": "سینتھل کمار"
}

# saving the dict with json.dump, utf-8 encoding
with open("saving_the_names_dict_utf8.json", "w", encoding="utf-8") as f:
    json.dump(an_example_dict, f, ensure_ascii=False)

# saving the same dict as a json string with json.dumps, utf-8 encoding
an_example_dict_str = json.dumps(an_example_dict, ensure_ascii=False)
with open("saving_the_names_dict_utf8_using_json_string.json", "w", encoding="utf-8") as f:
    f.write(an_example_dict_str)

# saving the same dict with json.dump, utf-16 encoding
with open("saving_the_names_dict_utf16.json", "w", encoding="utf-16") as f:
    json.dump(an_example_dict, f, ensure_ascii=False)
```
- How to load the dict?
```python
with open("saving_the_names_dict_utf8.json", "r", encoding="utf-8") as f:
    loaded_dict = json.load(f)

print(loaded_dict)
# {'English': 'Senthil Kumar', 'Japanese': 'センティル・クマール', 'Chinese': '森蒂尔·库马尔', 'Korean': '센틸 쿠마르', 'Arabic': 'سينتيل كومار', 'Urdu': 'سینتھل کمار'}
```
```shell
$ cat saving_the_names_dict_utf8.json
{"English": "Senthil Kumar", "Japanese": "センティル・クマール", "Chinese": "森蒂尔·库马尔", "Korean": "센틸 쿠마르", "Arabic": "سينتيل كومار", "Urdu": "سینتھل کمار"}

$ echo "the utf8 file size in bytes:" && wc -c saving_the_names_dict_utf8.json
the utf8 file size in bytes:
209 saving_the_names_dict_utf8.json

$ echo "the utf8 file size in bytes:" && wc -c saving_the_names_dict_utf8_using_json_string.json
the utf8 file size in bytes:
209 saving_the_names_dict_utf8_using_json_string.json

$ echo "the utf16 file size in bytes:" && wc -c saving_the_names_dict_utf16.json
the utf16 file size in bytes:
292 saving_the_names_dict_utf16.json
```
- Conclusion: in the example above, the byte size of the utf-16 file is greater than that of the utf-8 file
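The size gap comes from how each codec treats ASCII: utf-8 spends 1 byte per ASCII character while utf-16 spends 2 (plus a 2-byte byte-order mark per file), though for purely CJK text utf-16 can actually be more compact. A sketch illustrating both effects:

```python
ascii_name = "Senthil Kumar"             # 13 ASCII characters
print(len(ascii_name.encode('utf-8')))   # 13 bytes (1 per char)
print(len(ascii_name.encode('utf-16')))  # 28 bytes (2-byte BOM + 2 per char)

jp_name = "クマール"                     # 4 katakana characters
print(len(jp_name.encode('utf-8')))      # 12 bytes (3 per char)
print(len(jp_name.encode('utf-16')))     # 10 bytes (2-byte BOM + 2 per char)
```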
Conclusion
- Use `utf-8` everywhere (check more at utf8everywhere.org). UTF-8 can be used to encode anything that UTF-16 can, so most use cases can be met with utf-8.
- UTF-16 starts at a minimum of 2 bytes (16 bits) and hence is not compatible with 7-bit ASCII. UTF-8, in contrast, is backwards compatible with ASCII.
Good Sources
- Why UTF-8 should be used:
  - https://stackoverflow.com/a/18231475
  - http://utf8everywhere.org/
- Other good resources:
  - Encoding and Decoding in Python 3: https://www.pythoncentral.io/encoding-and-decoding-strings-in-python-3-x/