Basics of Encoding and Decoding
What is Unicode?
Unicode assigns a unique number to every character, regardless of the spoken language (Japanese, English, French, etc.) it comes from or the programming language (Python, Java, etc.) it is used in.
What is the purpose of Unicode?
There are innumerable languages in this world. Some follow the Latin writing system (English, French, Spanish, etc.), while many Asian languages use non-Latin scripts. Unicode assigns a unique numerical representation to each character of the world's known, major languages.
- The uniqueness of these code points helps in the transmission of information over digital channels
And what are encoding and decoding, you may wonder?
- Computers transmit information in bytes. Encoding is the process of converting Unicode code points to bytes
- Decoding is the process of converting bytes back to Unicode code points so humans can interpret them
What is the Unicode Character Set (UCS)?
- For all major languages in the world, every unique character is assigned a unique value or “code point”. This set of unique values, also representing emojis and other symbols, is the Unicode Character Set. Unicode includes characters from Latin, Greek, Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, and many others.
- Code points are typically represented in hexadecimal format, such as U+0041 for the Latin capital letter “A” or U+30A2 for the Japanese katakana character “ア”.
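The character-to-code-point mapping can be inspected directly in Python with the built-ins `ord()` and `chr()` — a quick sketch:

```python
# ord() gives the Unicode code point of a character;
# chr() goes the other way, from code point to character.
print(hex(ord('A')))   # 0x41   -> written as U+0041
print(hex(ord('ア')))  # 0x30a2 -> written as U+30A2
print(chr(0x0041))     # A
print(chr(0x30A2))     # ア
```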
What are some commonly used encoding techniques?
Encoding Type | Full Description | Num of Bits | Where Used / Supported Character Set |
---|---|---|---|
ASCII | American Standard Code for Information Interchange | 7 bits | For English text; supports basic Latin letters, numbers and punctuation marks |
UTF-8 | Unicode Transformation Format (8-bit) | variable length, minimum 8 bits (1 byte) | Can support multiple languages; 8 bits for ASCII characters; up to 32 bits (4 bytes) for some characters; backwards compatible with ASCII |
UTF-16 | Unicode Transformation Format (16-bit) | variable length, minimum 16 bits (2 bytes) | Commonly used in applications which require multi-language support |
Latin-1 | ISO-8859-1 or Western European Encoding | 8 bits | Limited to Western European languages; does not cover the entire Unicode character set |
UTF-32 | Unicode Transformation Format (32-bit) | fixed length, 32 bits | Provides direct mapping between code points and characters; less commonly used; high storage |
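To make the table concrete, here is a small sketch comparing how many bytes the same characters occupy under each encoding (the `utf-16-le`/`utf-32-le` codec variants are used so the byte-order mark does not inflate the counts):

```python
# byte sizes of single characters under different encodings
for ch in ['A', 'é', 'ア', '😀']:
    sizes = {enc: len(ch.encode(enc)) for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
    print(ch, sizes)
# 'A'  -> 1 byte in utf-8, 2 in utf-16, 4 in utf-32
# 'ア' -> 3 bytes in utf-8, 2 in utf-16, 4 in utf-32
# '😀' -> 4 bytes in utf-8, 4 in utf-16 (surrogate pair), 4 in utf-32
```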
Encoding and Decoding Strings in Python
- In Python, all strings by default are Unicode strings
- A Unicode string is read by the computer by “encoding” it into a byte string
- By default, Python uses `utf-8` encoding. You can also encode in `utf-16`
```python
byte_string = "センティル・クマール".encode()
print(byte_string)
# b'\xe3\x82\xbb\xe3\x83\xb3\xe3\x83\x86\xe3\x82\xa3\xe3\x83\xab\xe3\x83\xbb\xe3\x82\xaf\xe3\x83\x9e\xe3\x83\xbc\xe3\x83\xab'

byte_string_utf16 = "センティル・クマール".encode('utf-16')
print(byte_string_utf16)
# b'\xff\xfe\xbb0\xf30\xc60\xa30\xeb0\xfb0\xaf0\xde0\xfc0\xeb0'

print(byte_string.decode())
print(byte_string_utf16.decode('utf-16'))
# センティル・クマール
# センティル・クマール
```
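A byte string must be decoded with the same codec it was encoded with; a mismatched codec either raises an error or produces garbled text. A minimal sketch:

```python
data = "クマール".encode('utf-16')

# utf-16 output starts with the byte-order mark b'\xff\xfe',
# and 0xff can never appear in valid utf-8 -- so this fails:
try:
    data.decode('utf-8')
except UnicodeDecodeError as e:
    print("decode failed:", e)

# decoding with the matching codec round-trips cleanly
assert data.decode('utf-16') == "クマール"
```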
About Byte Strings in Python
Byte strings are used to represent binary data, such as images, audio files, or serialized objects. Binary data is not directly representable as text and needs to be stored and processed as a sequence of bytes.
```python
>>> type(byte_string)
bytes
```
- It is possible to create byte strings directly in Python using the prefix “b”
```python
>>> forced_byte_string = b"some_string"
>>> type(forced_byte_string)
bytes
```
- It is NOT possible to create byte strings containing non-ASCII characters with the “b” prefix
```python
>>> forced_byte_string = b"センティル・クマール"
SyntaxError: bytes can only contain ASCII literal characters.
```
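To obtain a byte string for non-ASCII text, call `.encode()` on the str instead of using a `b` literal — a quick sketch:

```python
# the b"..." literal rejects non-ASCII, but .encode() works for any text
jp_bytes = "センティル・クマール".encode('utf-8')
print(type(jp_bytes))            # <class 'bytes'>
print(jp_bytes.decode('utf-8'))  # round-trips back to the original string
```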
- One example of using byte strings is when we serialize objects (such as Python objects) using the `pickle` module
```python
import pickle

an_example_dict = {
    "English": "Senthil Kumar",
    "Japanese": "センティル・クマール",
    "Chinese": "森蒂尔·库马尔",
    "Korean": "센틸 쿠마르",
    "Arabic": "سينتيل كومار",
    "Urdu": "سینتھل کمار"
}

serialized_data = pickle.dumps(an_example_dict)
print(type(serialized_data))
# <class 'bytes'>

with open("serialized_dict.pkl", "wb") as file:
    file.write(serialized_data)
```
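Deserializing goes through the same byte interface in reverse: read the file in binary mode ("rb") and hand the bytes to pickle. A self-contained sketch of the round trip:

```python
import pickle

an_example_dict = {"English": "Senthil Kumar", "Japanese": "センティル・クマール"}

# write the pickled bytes, then read them back in binary ("rb") mode
with open("serialized_dict.pkl", "wb") as file:
    pickle.dump(an_example_dict, file)

with open("serialized_dict.pkl", "rb") as file:
    restored_dict = pickle.load(file)

assert restored_dict == an_example_dict  # non-ASCII values survive intact
```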
Encoding and Decoding Files in Python
Saving Text Files in ASCII and UTF Formats
- The below code will throw NO error, because the text is English-only (ASCII)
```python
normal_text = 'Hot: Microsoft Surface Pro 4 Tablet Intel Core i7 8GB RAM 256GB.. now Pound 1079.00! #SpyPrice #Microsoft'
with open("saving_eng__only_text.txt", "w", encoding="ascii") as f:
    f.write(normal_text)
```
- The below code will throw an error, because the text contains the non-ASCII character “£”
```python
non_ascii_text = 'Hot: Microsoft Surface Pro 4 Tablet Intel Core i7 8GB RAM 256GB.. now £1079.00! #SpyPrice #Microsoft'
with open("saving_a_latin_string.txt", "w", encoding="ascii") as f:
    f.write(non_ascii_text)
```
```
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Input In [21], in <cell line: 1>()
      1 with open("saving_a_latin_string.txt","w",encoding="ascii") as f:
----> 2     f.write(non_ascii_text)

UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in position 70: ordinal not in range(128)
```
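When staying in ASCII matters more than preserving every character, the `errors` parameter of `encode()`/`open()` offers a lossy alternative to switching codecs. A sketch (the `lossy_ascii.txt` filename is just for illustration):

```python
non_ascii_text = 'now £1079.00! #SpyPrice'

# errors='replace' substitutes '?' for unencodable characters;
# errors='ignore' silently drops them
print(non_ascii_text.encode('ascii', errors='replace'))  # b'now ?1079.00! #SpyPrice'
print(non_ascii_text.encode('ascii', errors='ignore'))   # b'now 1079.00! #SpyPrice'

# the same parameter works when opening files for writing
with open("lossy_ascii.txt", "w", encoding="ascii", errors="replace") as f:
    f.write(non_ascii_text)
```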
- Changing the encoding to “utf-8” fixes the error
```python
with open("saving_a_latin_string.txt", "w", encoding="utf-8") as f:
    f.write(non_ascii_text)
```
Saving Non-ASCII JSON Files in Different Formats
- Saving a dict using `json.dump`, utf-8 encoding
- Saving the same dict as a JSON string using `json.dumps`, utf-8 encoding
- Saving the same dict using `json.dump`, utf-16 encoding
```python
import json

an_example_dict = {
    "English": "Senthil Kumar",
    "Japanese": "センティル・クマール",
    "Chinese": "森蒂尔·库马尔",
    "Korean": "센틸 쿠마르",
    "Arabic": "سينتيل كومار",
    "Urdu": "سینتھل کمار"
}

# saving the dict with json.dump, utf-8 encoding
with open("saving_the_names_dict_utf8.json", "w", encoding="utf-8") as f:
    json.dump(an_example_dict, f, ensure_ascii=False)

# saving the same dict as a json string with json.dumps, utf-8 encoding
an_example_dict_str = json.dumps(an_example_dict, ensure_ascii=False)
with open("saving_the_names_dict_utf8_using_json_string.json", "w", encoding="utf-8") as f:
    f.write(an_example_dict_str)

# saving the same dict with json.dump, utf-16 encoding
with open("saving_the_names_dict_utf16.json", "w", encoding="utf-16") as f:
    json.dump(an_example_dict, f, ensure_ascii=False)
```
- How to load the dict?
```python
with open("saving_the_names_dict_utf8.json", "r", encoding="utf-8") as f:
    loaded_dict = json.load(f)

print(loaded_dict)
# {'English': 'Senthil Kumar', 'Japanese': 'センティル・クマール', 'Chinese': '森蒂尔·库马尔', 'Korean': '센틸 쿠마르', 'Arabic': 'سينتيل كومار', 'Urdu': 'سینتھل کمار'}
```
```shell
$ cat saving_the_names_dict_utf8.json
{"English": "Senthil Kumar", "Japanese": "センティル・クマール", "Chinese": "森蒂尔·库马尔", "Korean": "센틸 쿠마르", "Arabic": "سينتيل كومار", "Urdu": "سینتھل کمار"}

$ echo "the utf8 file size in bytes:" && wc -c saving_the_names_dict_utf8.json
the utf8 file size in bytes:
209 saving_the_names_dict_utf8.json

$ echo "the utf8 file size in bytes:" && wc -c saving_the_names_dict_utf8_using_json_string.json
the utf8 file size in bytes:
209 saving_the_names_dict_utf8_using_json_string.json

$ echo "the utf16 file size in bytes:" && wc -c saving_the_names_dict_utf16.json
the utf16 file size in bytes:
292 saving_the_names_dict_utf16.json
```
- Conclusion: in the example above, the byte size of the utf-16 file is greater than that of the utf-8 file
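The size gap comes from how each codec treats ASCII: utf-8 spends 1 byte per ASCII character while utf-16 spends 2 (plus a 2-byte byte-order mark per file), though for purely CJK text utf-16 can actually be more compact. A sketch illustrating both effects:

```python
ascii_name = "Senthil Kumar"             # 13 ASCII characters
print(len(ascii_name.encode('utf-8')))   # 13 bytes (1 per char)
print(len(ascii_name.encode('utf-16')))  # 28 bytes (2-byte BOM + 2 per char)

jp_name = "クマール"                     # 4 katakana characters
print(len(jp_name.encode('utf-8')))      # 12 bytes (3 per char)
print(len(jp_name.encode('utf-16')))     # 10 bytes (2-byte BOM + 2 per char)
```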
Conclusion
- Use `utf-8` everywhere (check more at utf8everywhere.org). UTF-8 can be used to encode anything that UTF-16 can, so most use cases can be met with utf-8.
- UTF-16 starts at a minimum of 2 bytes (16 bits) and hence is not compatible with 7-bit ASCII. UTF-8, in contrast, is backwards compatible with ASCII.
Good Sources
- Why UTF-8 should be used:
  - https://stackoverflow.com/a/18231475
  - http://utf8everywhere.org/
- Other good resources:
  - Encoding and Decoding in Python 3: https://www.pythoncentral.io/encoding-and-decoding-strings-in-python-3-x/