Code 20: Unicode, UTF-8 and Bytes

Converting my (Chinese) name to bytes and back
Category: coding
Author: Tony Phung
Published: December 23, 2024

1. Introduction

Unicode provides a comprehensive set of characters and assigns each a unique code point.
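As a minimal sketch of what "code point" means in practice, Python's built-in ord() and chr() convert between a character and its integer code point. The variable names here (name, code_points, rebuilt) are illustrative only and not part of the original walkthrough.

```python
# Sample string borrowed from the name used later in this post.
name = "😎梁國富⚽"

# ord() returns the Unicode code point of a single character as an int.
code_points = [ord(ch) for ch in name]

# Show each character next to its code point in the usual U+XXXX notation.
for ch, cp in zip(name, code_points):
    print(f"{ch!r} -> U+{cp:04X}")

# chr() is the inverse of ord(): it maps a code point back to its character.
rebuilt = "".join(chr(cp) for cp in code_points)
```

Note that a code point is just a number; it says nothing yet about how the character is stored as bytes.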

UTF-8 is a variable-width encoding that maps each Unicode code point to a sequence of one to four bytes, which keeps text compact to store and transmit. It is backward compatible with ASCII: the first 128 code points encode to the same single bytes as in ASCII, so any ASCII text is also valid UTF-8.
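The two properties above can be checked directly. This is a small sketch under the assumption that the sample characters below are representative; the names ascii_text, samples and widths are made up for illustration.

```python
# ASCII-only text produces identical bytes under the ASCII and UTF-8 codecs.
ascii_text = "Tony"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# UTF-8 is variable-width: a code point takes between 1 and 4 bytes.
# "\u00e9" is the precomposed letter e with an acute accent.
samples = ["A", "\u00e9", "梁", "😎"]
widths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(widths)
```

The Latin letter takes one byte, the accented letter two, the Chinese character three and the emoji four.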

Bytes are the actual data representation of the encoded text. When you encode a string using UTF-8, you get a sequence of bytes that can be stored in files, sent over networks, or processed by applications.

2. Unicoding My Name

2.1 A Cool Name “😎梁國富⚽”

my_str = "😎梁國富⚽" # a cool Chinese name

2.2 Encode string to bytes via utf-8

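The encode() method converts a string into a bytes object, here using the UTF-8 encoding scheme.

ASCII characters encode to one byte each, while non-ASCII characters take two to four bytes. In this name the emoji 😎 takes four bytes and each of the other four characters takes three, giving 16 bytes in total.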

my_str_utf8_bytes = my_str.encode("utf-8") # (str -> utf-8 bytes)
my_str_utf8_bytes
b'\xf0\x9f\x98\x8e\xe6\xa2\x81\xe5\x9c\x8b\xe5\xaf\x8c\xe2\x9a\xbd'
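To see where each group of bytes in that output comes from, one option is to encode the characters one at a time. This is only a sketch: per_char, char_count and byte_count are illustrative names, and bytes.hex() with a separator argument assumes Python 3.8 or newer.

```python
my_str = "😎梁國富⚽"

# Encode each character separately to see how many bytes it contributes.
per_char = [(ch, ch.encode("utf-8")) for ch in my_str]
for ch, encoded in per_char:
    print(f"{ch} -> {len(encoded)} bytes: {encoded.hex(' ')}")

# len() counts code points for a str, but bytes for a bytes object.
char_count = len(my_str)
byte_count = len(my_str.encode("utf-8"))
print(char_count, byte_count)
```

The string has 5 characters but its UTF-8 form has 16 bytes, which is why the length of a string and the length of its encoded bytes should not be used interchangeably.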

2.2.1 Why bytes?

Bytes objects matter in programming for several reasons:

  • Efficient Storage: Bytes are a compact way to store raw binary data: each value occupies exactly 8 bits, with no per-character overhead.

  • Low-Level Operations: Bytes are fundamental units of data in computer systems. Working with bytes enables low-level operations like memory manipulation, file I/O, and network communication.

  • Binary Data Handling: Bytes are essential for handling binary data formats like images, audio files, and executable code. These formats are represented as sequences of bytes.

  • Cryptographic Operations: In cryptography and security-related tasks, working with raw byte data is often necessary. This includes generating random numbers, hashing, and encryption/decryption.

  • Network Communication: When sending data over networks, it’s typically transmitted as byte streams. This allows for efficient transmission of various types of data.

  • Compression Algorithms: Some compression algorithms work directly on byte sequences rather than text strings.

  • Memory Efficiency: In scenarios where memory usage is critical (like embedded systems), working with bytes allows for more efficient use of available resources.

  • Performance Optimization: Certain operations, especially those involving large datasets, can be optimized by working directly with bytes rather than converting to and from strings repeatedly.

  • Interoperability: Bytes are a common denominator across programming languages and systems, so data exchanged as bytes with an agreed encoding can be read on either side.

  • Data Integrity: Binary data may contain non-printable characters or byte sequences that are not valid UTF-8. Keeping it as bytes avoids decoding errors and lossy replacement of those sequences.

  • File Handling: Many file formats, especially those used in scientific computing or specialized applications, are represented as byte streams.

  • Protocol Buffers: In distributed systems and microservice architectures, serialization formats like Protocol Buffers encode structured data as compact byte sequences for efficient transmission.
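Two of the points above, cryptographic operations and data integrity, can be illustrated with the standard library. This is a hedged sketch rather than a recommendation of any particular hashing scheme; the names data, digest, truncated and lossy are hypothetical.

```python
import hashlib

my_str = "😎梁國富⚽"
data = my_str.encode("utf-8")

# hashlib operates on bytes, not str: passing my_str directly raises TypeError.
digest = hashlib.sha256(data).hexdigest()
print(digest)

# Slicing bytes can cut through a multi-byte character.
truncated = data[:5]  # the 4-byte emoji plus only the first byte of "梁"
try:
    truncated.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)

# errors="replace" substitutes U+FFFD for the undecodable tail instead of raising,
# which keeps the program running but loses the original data.
lossy = truncated.decode("utf-8", errors="replace")
print(lossy)
```

Keeping the payload as bytes until a complete, valid sequence is available avoids both the exception and the lossy replacement.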

2.3 Convert bytes to hexadecimal

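The hex() method returns the base-16 (hexadecimal) representation of the binary data: each byte becomes two hex digits, so the 16 bytes above become a 32-character string. bytes.fromhex() reverses the conversion, and decode("utf-8") turns the bytes back into the original string.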

my_str_hex = my_str_utf8_bytes.hex() # (utf-8-bytes -> hex-#)
my_str_hex
'f09f988ee6a281e59c8be5af8ce29abd'
my_str_bytes = bytes.fromhex(my_str_hex) # (hex-# -> utf-8-bytes)
my_str_bytes
b'\xf0\x9f\x98\x8e\xe6\xa2\x81\xe5\x9c\x8b\xe5\xaf\x8c\xe2\x9a\xbd'
my_str_bytes.decode("utf-8") # (utf-8-bytes -> str)
'😎梁國富⚽'
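The whole round trip can be wrapped in two small helpers. This is a minimal sketch: str_to_hex and hex_to_str are hypothetical names introduced here, not functions from the post or from any library.

```python
def str_to_hex(text: str) -> str:
    """Encode text as UTF-8 and return the bytes as a hex string."""
    return text.encode("utf-8").hex()


def hex_to_str(hex_text: str) -> str:
    """Parse a hex string into bytes and decode them as UTF-8."""
    return bytes.fromhex(hex_text).decode("utf-8")


# Hex string copied from the output shown above.
hex_from_post = "f09f988ee6a281e59c8be5af8ce29abd"

decoded = hex_to_str(hex_from_post)
print(decoded)

# Encoding the decoded string again should reproduce the same hex digits.
round_trip_ok = str_to_hex(decoded) == hex_from_post

# Two hex digits per byte: 32 digits represent 16 bytes.
n_bytes = len(hex_from_post) // 2
print(round_trip_ok, n_bytes)
```

The round trip is lossless because hex is only a different notation for the same bytes; no information is added or removed at any step.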