Let's study Python

Python unicodedata, regular expressions, and binary data handling are essential concepts for working with text, patterns, and binary data in Python programming.

## How to Use Python unicodedata

### 1. Strings
#### 1) Unicode
If you are new to text encoding, the first character format you will encounter is probably ASCII. ASCII was defined in the 1960s and uses 7 bits, giving only 128 unique values: 26 uppercase and 26 lowercase English letters, 10 digits, punctuation marks, the space character, and non-printable control codes. As the need arose to represent characters beyond what ASCII could cover, an international standard was defined to represent characters from all languages: Unicode. Unicode assigns a unique code point to each character regardless of platform, program, or language, and it also covers mathematical symbols and other symbols in addition to the letters of written languages.

In Python 3, strings are sequences of Unicode characters. This is one of the significant changes from Python 2, and it cleanly separates text strings (`str`) from byte strings (`bytes`). To look up Unicode characters and their names, Python 3 provides the `unicodedata` module, which offers (among others) the following two functions:

- `lookup()`: Takes a character name (case-insensitive) and returns the corresponding Unicode character.
- `name()`: Takes a Unicode character and returns its name in uppercase.

Let’s explore the usage of these two functions through an example:

```python
import unicodedata

def unicode_test(value):
    # Look up the character's name, then map the name back to the character
    name = unicodedata.name(value)
    value2 = unicodedata.lookup(name)
    print("value {}, name={}, value2={}".format(value, name, value2))

unicode_test('A')
unicode_test('$')
unicode_test('\u20ac')
```

**Output:**
```
value A, name=LATIN CAPITAL LETTER A, value2=A
value $, name=DOLLAR SIGN, value2=$
value €, name=EURO SIGN, value2=€
```
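As a further sketch of the two functions (the sample characters and the `'UNKNOWN'` default are arbitrary choices for illustration), `lookup()` accepts names in any letter case, and `name()` accepts an optional default that is returned for characters without a name, such as control characters:

```python
import unicodedata

# lookup() matches names case-insensitively
euro = unicodedata.lookup('euro sign')
print(euro)  # €

# name() raises ValueError for unnamed characters (e.g. newline)
# unless a default value is supplied as the second argument
newline_name = unicodedata.name('\n', 'UNKNOWN')
print(newline_name)  # UNKNOWN

# lookup() raises KeyError when no character has the given name
try:
    unicodedata.lookup('no such character name')
except KeyError as exc:
    print('lookup failed:', exc)
```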

#### 2) Formatting
In Python, formatting refers to inserting data values into strings. It is also known as interpolation and is commonly used with the `print()` function. There are two main ways to format strings:
- Old Style: Using `%` formatting
- New Style: Using `{}` and `format()`

##### Old Style:
Old-style formatting takes the form `format_string % data`. The format string contains conversion specifiers (interpolation sequences) that mark where the data is inserted. The available conversion types are shown in the table below:

| Format | Conversion Type |
|--------|------------------|
| `%s` | String |
| `%d` | Decimal integer |
| `%x` | Hexadecimal integer |
| `%o` | Octal integer |
| `%f` | Decimal floating point |
| `%e` | Exponential floating point |
| `%g` | General format |
| `%%` | Literal % |

Let’s look at an example to understand the code execution and its results:

```python
print("1. Integer" + "\n")
print('%s' % 42)
print('%d' % 42)
print('%x' % 42)
print('%o' % 42)
print("2. Floating Point" + "\n")
print('%s' % 10.8)
print('%f' % 10.8)
print('%e' % 10.8)
print('%g' % 10.8)
print("3. String + Integer")
actor = "Richard Gere"
cat = "Chester"
weight = 28
print("My wife's favorite actor is %s" % actor)
print("Our cat %s weighs %s pounds" % (cat, weight))
```

**Output:**
```
1. Integer

42
42
2a
52
2. Floating Point

10.8
10.800000
1.080000e+01
10.8
3. String + Integer
My wife's favorite actor is Richard Gere
Our cat Chester weighs 28 pounds
```

As the example shows, each `%s` in the string marks where a value is inserted, and the value after the `%` operator supplies the data. The number of data items after `%` must match the number of conversion specifiers in the string; when there is more than one item, they are passed as a tuple, as in `(cat, weight)`.

Next, let's learn how to pad values to a specific width. To do this, add options between `%` and the type specifier; they control the minimum width, the maximum length (precision), the alignment, and the fill character. The example below sets a minimum width:

```python
print("4. Adjusting String Length")
n = 100
f = 10.8
s = "String cheese"
print("%d %f %s" % (n, f, s))
print("%10d %10f %10s" % (n, f, s))
```

**Output:**
```
4. Adjusting String Length
100 10.800000 String cheese
       100  10.800000 String cheese
```

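In the example above, the number between `%` and the type specifier sets the minimum field width. Shorter values are right-aligned and padded with spaces; longer values, such as the 13-character string here, are printed in full.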
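The other options mentioned earlier (alignment, fill, and maximum length) can be sketched the same way. The variable names `left`, `zero`, `prec`, and `short` are illustrative only:

```python
n = 100
f = 10.8
s = "String cheese"

left = "%-10d|" % n   # '-' left-aligns within the 10-character field
zero = "%010d" % n    # '0' pads with zeros instead of spaces
prec = "%10.2f" % f   # width 10, two digits after the decimal point
short = "%.6s" % s    # precision on %s keeps at most 6 characters

print(left)   # 100       |
print(zero)   # 0000000100
print(prec)   #      10.80
print(short)  # String
```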

##### New Style:
If you are using Python 3, it is recommended to use the new formatting style `{}` and `format()`. This method allows specifying the order of variables to be inserted. Let’s look at an example to understand how it works:

```python
print("1. {} Usage")
n = 100
f = 10.8
s = "String cheese"
print("{} {} {}".format(n, f, s))
```

**Output:**
```
1. {} Usage
100 10.8 String cheese
```

The example above produces nearly the same result as the old-style example, with more concise code (the float prints as `10.8` rather than `10.800000` because no type specifier is given). Another difference is that in the old style the data had to be supplied in the order in which the `%` specifiers appeared in the string, whereas with `format()` the order can be specified explicitly:

```python
print("2. {} Specifying Order")
n = 100
f = 10.8
s = "String cheese"
print("{2} {0} {1}".format(n, f, s))
```

**Output:**
```
2. {} Specifying Order
String cheese 100 10.8
```

The number inside `{}` is the zero-based position of the argument passed to `format()`, so `{2}` is replaced by the third argument, `{0}` by the first, and `{1}` by the second.

Now, let’s look at a different way of formatting strings using data structures. Suppose you have a dictionary defined as follows:

```python
dict_a = {'n': 100, 'f': 10.8, 's': "String cheese"}
```

Running the example below shows how its values can be used in a format string:

```python
print("3. Formatting Using Data Structure")
dict_a = {'n': 100, 'f': 10.8, 's': "String cheese"}
print("{0[n]} {0[f]} {0[s]} {1}".format(dict_a, "other"))
```

**Output:**
```
3. Formatting Using Data Structure
100 10.8 String cheese other
```

As shown in the example above, `{0[n]}` looks up the key `n` in the first argument (the dictionary), and `{1}` refers to the second argument. In this way you can format strings directly from dictionaries or other data structures.
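A related approach, sketched here as an alternative rather than as part of the original example, is to unpack the dictionary into keyword arguments with `**` so that the fields can be referred to by name. The names `by_name` and `mixed` are illustrative:

```python
dict_a = {'n': 100, 'f': 10.8, 's': "String cheese"}

# Each key of dict_a becomes a named field in the format string
by_name = "{n} {f} {s}".format(**dict_a)
print(by_name)  # 100 10.8 String cheese

# Positional and named fields can be mixed
mixed = "{0} {s}".format("other", **dict_a)
print(mixed)  # other String cheese
```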

You can also specify a type, as in the old style. As the example below shows, the type specifier goes after a `:` inside the braces.

```python
print("4. Adjusting String Length")
n = 100
f = 10.8
s = "String cheese"
print("{0:d} {1:f} {2:s}".format(n, f, s))
```

**Output:**
```
4. Adjusting String Length
100 10.800000 String cheese
```

Furthermore, you can set the minimum and maximum length of each field value and alignment in the following way:

```python
print("5. Setting String Length")
n = 100
f = 10.8
s = "String cheese"
print("{0:>10d} {1:^10f} {2:<10s}".format(n, f, s))
```

**Output:**
```
5. Setting String Length
       100 10.800000  String cheese
```

In the above example, the number sets the minimum field width, while `<` signifies left alignment, `^` signifies center alignment, and `>` signifies right alignment. A maximum length can be given as a precision after a `.`, for example `{2:.6s}`.
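As a small additional sketch (the fill character and precisions are chosen only for illustration), a fill character can be placed before the alignment symbol, and a precision after `.` limits the length:

```python
n = 100
f = 10.8
s = "String cheese"

filled = "{0:*^10d}".format(n)      # center in 10 characters, padded with '*'
rounded = "{0:>10.2f}".format(f)    # right-align, two digits after the point
clipped = "{0:<10.6s}|".format(s)   # at most 6 characters, padded to width 10

print(filled)   # ***100****
print(rounded)  #      10.80
print(clipped)  # String    |
```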

### 2. Regular Expressions
#### 1) What is a Regular Expression?
A regular expression is a formal language used to represent a set of strings with a specific pattern. It is commonly used for searching and replacing strings. When using regular expressions, you can easily represent complex patterns of characters that would otherwise require long conditions using if statements. However, as the code becomes simpler, readability may decrease if you do not understand the meaning of the expression.

There are two main families of regular expression syntax: POSIX and the Perl-derived PCRE (Perl Compatible Regular Expressions). The metacharacters commonly used in regular expressions are the following:

- Metacharacters: `. ^ $ * + ? { } [ ] \ | ( )`

POSIX additionally defines the named character classes below. Since each one is an expression in itself, it is enclosed in a further pair of square brackets when used inside a character class, as in `[[:digit:]]`. Note that Python's `re` module does not support these POSIX classes; it uses escapes such as `\d`, `\w`, and `\s` instead.

- POSIX character classes: `[:alnum:] [:alpha:] [:ascii:] [:blank:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:]`

To use regular expressions in Python, you need to import the `re` module. Let’s execute the example below to understand the concept:

```python
import re
result = re.match('You', 'Young Frankenstein')
print(result)
```

**Output:**
```
<re.Match object; span=(0, 3), match='You'>
```

In the code above, `'You'` is the pattern and `'Young Frankenstein'` is the input string. If the pattern is found, the function returns a match object; otherwise it returns `None`.

The `match()` function used in the previous example checks whether the pattern matches at the beginning of the input string. If it does, `match()` returns a match object that records the start and end indexes of the matched text (here `span=(0, 3)`). If the pattern does not match at the start position, it returns `None`.
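The difference between a successful and a failed match can be sketched as follows. The second pattern, `'Frank'`, is chosen for illustration; it occurs in the string but not at the start, so `match()` returns `None`:

```python
import re

source = 'Young Frankenstein'

# The pattern is at the start of the string, so a match object is returned
hit = re.match('You', source)
if hit:
    print(hit.group())  # You
    print(hit.span())   # (0, 3)

# The pattern exists in the string, but not at the start
miss = re.match('Frank', source)
print(miss)  # None
```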

Additionally, if you use the same pattern repeatedly, you can compile it first to speed up matching. Check the following example:

```python
import re
pattern = re.compile("You")
result = re.match(pattern, 'Young Frankenstein')
print(result)
```

**Output:**
```
<re.Match object; span=(0, 3), match='You'>
```
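A compiled pattern also provides the matching functions as methods, which is convenient when the same pattern is applied to many strings. A minimal sketch, using a made-up list of sample titles:

```python
import re

pattern = re.compile('You')
titles = ['Young Frankenstein', 'Blazing Saddles', 'You Only Live Twice']

# pattern.match(text) behaves like re.match(pattern, text)
matching = [title for title in titles if pattern.match(title)]
print(matching)  # ['Young Frankenstein', 'You Only Live Twice']
```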

#### 2) Related Functions
Now, let’s look at the functions used for regular expressions. The main functions are as follows:
- `match()`: As seen in the previous example, it checks whether the pattern matches at the beginning of the input string. If it does, it returns a match object; otherwise it returns `None`.
- `search()`: Similar to `match()`, but it scans the whole input string and returns a match object for the first occurrence of the pattern, regardless of its position. The start and end indexes are available through the match object's `span()` method.
- `findall()`: `match()` and `search()` return at most one match. When you need every occurrence, `findall()` returns a list of all non-overlapping matches in the input string.
- `split()`: Splits the input string at each position where the pattern matches and returns the pieces as a list.
- `sub()`: Replaces the parts of the input string that match the pattern with another string. It is similar to the string `replace()` method, but it takes a pattern rather than a literal string.

Let’s explore these functions through examples to understand their usage and results.
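The following is a minimal sketch of `search()`, `findall()`, `split()`, and `sub()` on the same input string as before. The patterns `'Frank'` and `'n'` and the replacement `'?'` are arbitrary choices for illustration:

```python
import re

source = 'Young Frankenstein'

# search(): finds the first occurrence anywhere in the string
found = re.search('Frank', source)
print(found.span())  # (6, 11)

# findall(): returns every non-overlapping match as a list
all_n = re.findall('n', source)
print(all_n)  # ['n', 'n', 'n', 'n']

# split(): cuts the string at each match
pieces = re.split('n', source)
print(pieces)  # ['You', 'g Fra', 'ke', 'stei', '']

# sub(): replaces each match with the given string
replaced = re.sub('n', '?', source)
print(replaced)  # You?g Fra?ke?stei?
```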

### 3. Binary Data
Binary data is raw bytes with no inherent text interpretation, which can make it more challenging to handle than text data. To work with binary data, you need to understand concepts such as endianness (byte order) and sign bits for integers, and how to extract and modify data in binary file formats and network packets.
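To make endianness and sign bits concrete, here is a small sketch using the built-in `int.to_bytes()` and `int.from_bytes()` methods. The value 154 and the variable names are arbitrary examples:

```python
value = 154

big = value.to_bytes(4, byteorder='big')        # most significant byte first
little = value.to_bytes(4, byteorder='little')  # least significant byte first
print(big)     # b'\x00\x00\x00\x9a'
print(little)  # b'\x9a\x00\x00\x00'

# The same byte is read differently depending on whether
# its top bit is treated as a sign bit
unsigned = int.from_bytes(b'\xff', byteorder='big', signed=False)
signed = int.from_bytes(b'\xff', byteorder='big', signed=True)
print(unsigned, signed)  # 255 -1
```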

#### 1) Bytes and Byte Arrays
In Python 3, binary data is represented by two types: bytes and byte arrays. Both are sequences of 8-bit integers, each in the range 0 to 255. The difference between them is that bytes are immutable, like tuples, while byte arrays are mutable, like lists. Let's look at an example:

```python
values = [1, 2, 3, 255]  # named "values" to avoid shadowing the built-in input()
i_bytes = bytes(values)
array_bytes = bytearray(values)
print(i_bytes)
print(array_bytes)
```

**Output:**
```
b'\x01\x02\x03\xff'
bytearray(b'\x01\x02\x03\xff')
```

In the example above, if you want to convert data to bytes, you can use the `bytes()` function, and for byte arrays, you can use the `bytearray()` function. Next, let’s verify whether values can be modified for byte types and byte array types:

```python
i_bytes[1] = 127      # Error: bytes are immutable
array_bytes[1] = 127  # OK: byte arrays are mutable
```

**Output:**
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bytes' object does not support item assignment
```

In the second example, assigning to an element of a bytes object raises an error because bytes are immutable, whereas the same assignment succeeds on a byte array.
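The same behavior can be checked in a script without stopping at the error by catching the exception. This is a minimal sketch that reuses the variable names from the example above:

```python
i_bytes = bytes([1, 2, 3, 255])
array_bytes = bytearray([1, 2, 3, 255])

# Item assignment on bytes raises TypeError
try:
    i_bytes[1] = 127
except TypeError as exc:
    print('bytes:', exc)

# The same assignment on a bytearray changes the value in place
array_bytes[1] = 127
print(array_bytes)  # bytearray(b'\x01\x7f\x03\xff')
```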

Finally, let’s examine the representable range of values. As mentioned earlier, the range is from 0 to 255, as demonstrated in the following example:

```python
i_bytes = bytes(range(0, 256))
array_bytes = bytearray(range(0, 256))
print(i_bytes)
print(array_bytes)
```

**Output:**
```
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
bytearray(b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff')
```
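Values outside this range are rejected. A brief sketch, using 256 as an example of an out-of-range value (the flag name `out_of_range_rejected` is illustrative):

```python
# Constructing bytes from a value above 255 raises ValueError
try:
    bytes([256])
    out_of_range_rejected = False
except ValueError as exc:
    out_of_range_rejected = True
    print('ValueError:', exc)

# The largest valid value is 255
largest = bytes([255])
print(largest)  # b'\xff'
```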

#### 2) Converting Binary Data: struct
Now, let’s introduce `struct`, a module that handles data similarly to C or C++ structures. With `struct`, you can convert binary data into Python data structures or vice versa. Let’s implement code to read binary data and output information about an image:

```python
import struct

valid_png_header = b'\x89PNG\r\n\x1a\n'
data = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR' + b'\x00\x00\x00\x9a\x00\x00\x00\x8d\x08\x02\x00\x00\x00\xc0'
if data[:8] == valid_png_header:
    # Width and height are stored as 4-byte integers in bytes 16 to 23
    width, height = struct.unpack('>LL', data[16:24])
    print('Valid PNG, width', width, 'height', height)
else:
    print('Not a valid PNG')
```

**Output:**
```
Valid PNG, width 154 height 141
```

In the code above, the `data` variable holds the first 30 bytes of an example image file, and `valid_png_header` is the 8-byte sequence that marks the start of a PNG file. The `unpack()` function interprets a byte sequence and converts it into Python data types. In the format string `>LL`, `>` specifies big-endian byte order and each `L` a 4-byte unsigned long integer, so bytes 16 to 23 are read as two integers: the width and the height. To provide further insight, we can examine the individual 4-byte values directly:

“`python
data[16:20]
data[20:24