Guide: decoding "ï¿½" in Python

This is a common problem, so here is a relatively thorough illustration.

For non-Unicode strings (that is, those without the u prefix, unlike u'\xc4pple'), one must decode from the string's native encoding (ISO-8859-1, also known as Latin-1, in the examples here; Python 2 itself assumes ASCII for implicit conversions unless that default has been modified with the sys.setdefaultencoding function) to Unicode, then encode to a character set that can display the characters you want. In this case I would recommend UTF-8.

First, here is a handy utility function that helps illuminate the patterns of Python 2.7 str and unicode:

>>> def tell_me_about(s): return (type(s), s)

A plain string

>>> v = "\xC4pple" # iso-8859-1 aka latin1 encoded string

>>> tell_me_about(v)
(<type 'str'>, '\xc4pple')

>>> v
'\xc4pple'        # representation in memory

>>> print v
?pple             # the raw byte '\xc4' is written to the terminal as-is;
                  # a lone '\xc4' is not valid UTF-8, so a UTF-8 terminal
                  # shows it as "?".
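
To see where the "?" comes from, here is a minimal sketch in Python 3 (an assumption; the session above is Python 2.7), where bytes stands in for Python 2's str. It decodes the Latin-1 bytes as if they were UTF-8, which is roughly what a UTF-8 terminal attempts. The variable names are illustrative only.

```python
raw = b"\xc4pple"  # Latin-1 encoded bytes for "Äpple"

# Strict UTF-8 decoding rejects the lone 0xC4 byte.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# With errors="replace", the bad byte becomes U+FFFD, the replacement character.
shown = raw.decode("utf-8", errors="replace")

print(valid_utf8)
print(shown == "\ufffdpple")
```

A terminal that draws U+FFFD (or a plain "?") in place of the byte is doing the same substitution.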

Decoding an ISO8859-1 string - converting a plain string to Unicode

>>> uv = v.decode("iso-8859-1")
>>> uv
u'\xc4pple'       # decoding iso-8859-1 becomes unicode, in memory

>>> tell_me_about(uv)
(<type 'unicode'>, u'\xc4pple')

>>> print v.decode("iso-8859-1")
Äpple             # convert unicode to the default character set
                  # (utf-8, based on sys.stdout.encoding)

>>> v.decode('iso-8859-1') == u'\xc4pple'
True              # one could have just used a unicode representation 
                  # from the start
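
One reason v.decode('iso-8859-1') == u'\xc4pple' holds is that ISO-8859-1 maps each byte value to the Unicode code point with the same number. A small sketch, assuming Python 3 (where bytes replaces Python 2's str), makes that visible:

```python
raw = b"\xc4pple"                 # ISO-8859-1 encoded bytes
text = raw.decode("iso-8859-1")   # bytes -> Unicode text

byte_values = list(raw)                  # integer value of each byte
code_points = [ord(ch) for ch in text]   # code point of each character

print(byte_values == code_points)
print(hex(code_points[0]))
```

This one-to-one mapping is specific to Latin-1; it does not hold for UTF-8 or UTF-16.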

A little more illustration - with "Ä"

>>> u"Ä" == u"\xc4"
True              # the native unicode char and escaped versions are the same

>>> "Ä" == u"\xc4"
False             # typed on a UTF-8 terminal, the str "Ä" is the two
                  # bytes '\xc3\x84' (UTF-8), not the single byte '\xc4'

>>> "Ä".decode('utf8') == u"\xc4"
True              # one can decode the string to get unicode

>>> "Ä" == "\xc4"
False             # the native character and the escaped string are
                  # of course not equal ('\xc3\x84' != '\xc4').
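
The two bytes '\xc3\x84' in the comparisons above are the UTF-8 encoding of "Ä", while '\xc4' is its Latin-1 encoding. A hedged sketch in Python 3 (assumed here, since str literals are already Unicode there) shows both byte forms coming from the same character:

```python
# Spell the character by name so the example does not depend on
# the encoding of the source file or terminal.
a_umlaut = "\N{LATIN CAPITAL LETTER A WITH DIAERESIS}"

utf8_bytes = a_umlaut.encode("utf-8")      # two bytes
latin1_bytes = a_umlaut.encode("latin-1")  # one byte

print(a_umlaut == "\xc4")
print(utf8_bytes, latin1_bytes)
print(utf8_bytes == latin1_bytes)
```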

Encoding to UTF

>>> u8 = v.decode("iso-8859-1").encode("utf-8")
>>> u8
'\xc3\x84pple'    # convert iso-8859-1 to unicode to utf-8

>>> tell_me_about(u8)
(<type 'str'>, '\xc3\x84pple')

>>> u16 = v.decode('iso-8859-1').encode('utf-16')
>>> tell_me_about(u16)
(<type 'str'>, '\xff\xfe\xc4\x00p\x00p\x00l\x00e\x00')

>>> tell_me_about(u8.decode('utf8'))
(<type 'unicode'>, u'\xc4pple')

>>> tell_me_about(u16.decode('utf16'))
(<type 'unicode'>, u'\xc4pple')
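
The '\xff\xfe' prefix in u16 above is a byte order mark (BOM) that the generic "utf-16" codec writes so a reader can tell the byte order. As a sketch under the assumption of Python 3, the following compares that codec with the explicit "utf-16-le" variant, which writes no BOM:

```python
import codecs

text = b"\xc4pple".decode("iso-8859-1")

u16 = text.encode("utf-16")        # BOM followed by platform byte order
u16_le = text.encode("utf-16-le")  # explicit little-endian, no BOM

# The first two bytes of u16 are one of the two UTF-16 byte order marks.
has_bom = u16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(has_bom)
print(u16_le)
print(u16.decode("utf-16") == text)
```

When the bytes are meant for another program, naming the byte order explicitly avoids depending on the platform.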

The relationship between Unicode, UTF and Latin-1

>>> print u8
Äpple             # printing utf-8 - because of the encoding we now know
                  # how to print the characters

>>> print u8.decode('utf-8') # printing unicode
Äpple

>>> print u16     # printing 'bytes' of u16
���pple

>>> print u16.decode('utf16')
Äpple             # printing unicode

>>> v == u8
False             # v is an iso8859-1 string; u8 is a utf-8 string

>>> v.decode('iso8859-1') == u8
False             # v.decode(...) returns unicode

>>> u8.decode('utf-8') == v.decode('latin1') == u16.decode('utf-16')
True              # all decode to the same unicode memory representation
                  # (latin1 is iso-8859-1)

Unicode Exceptions

>>> u8.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
  ordinal not in range(128)

>>> u16.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
  ordinal not in range(128)

>>> v.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
  ordinal not in range(128)

One gets around these by converting from the specific encoding (Latin-1, UTF-8, UTF-16) to Unicode first and then encoding, for example u8.decode('utf8').encode('latin1').
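
The tracebacks report a UnicodeDecodeError even though encode was called, because Python 2 first decodes a str implicitly with the ASCII codec before it can encode it. Python 3 has no such implicit step, so this sketch (Python 3 assumed, names illustrative) performs the failing ASCII decode explicitly and then takes the decode-then-encode route:

```python
u8 = b"\xc3\x84pple"  # UTF-8 encoded bytes

# The step Python 2 performed silently: decode the bytes as ASCII.
try:
    u8.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The working route: name the real source encoding, then encode.
latin1 = u8.decode("utf-8").encode("latin-1")

print(error.start, error.reason)
print(latin1)
```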

So perhaps one could draw the following principles and generalizations:

A value of type str is a sequence of bytes, which may be in one of a number of encodings such as Latin-1, UTF-8, and UTF-16.

A value of type unicode is a sequence of Unicode code points that can be encoded into any number of encodings, most commonly UTF-8 and Latin-1 (ISO8859-1).

The print statement has its own logic for encoding, which is set by sys.stdout.encoding (UTF-8 in these examples).

One must decode a str to unicode before converting it to another encoding.
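
The last principle can be sketched as a tiny helper. This assumes Python 3, and transcode is a hypothetical name introduced only for illustration, not a standard library function:

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes by passing through Unicode text (hypothetical helper)."""
    text = data.decode(source)   # bytes in the source encoding -> Unicode
    return text.encode(target)   # Unicode -> bytes in the target encoding


v = b"\xc4pple"                                # Latin-1 bytes
u8 = transcode(v, "iso-8859-1", "utf-8")       # Latin-1 -> UTF-8
back = transcode(u8, "utf-8", "iso-8859-1")    # UTF-8 -> Latin-1

print(u8)
print(back == v)
```

The helper raises UnicodeDecodeError if source does not match the bytes, and UnicodeEncodeError if target cannot represent some character, which is usually preferable to silently producing garbage.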

Of course, all of this changes in Python 3.x.
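
As a rough sketch of what changes (this is a simplified summary, not a full account): in Python 3 the str type holds Unicode text, bytes holds encoded data, and the two no longer convert into each other implicitly.

```python
text = "\xc4pple"             # str: Unicode text (Python 2's unicode)
data = text.encode("utf-8")   # bytes: encoded data (Python 2's str)

# Each type only offers the conversion that makes sense for it.
bytes_can_encode = hasattr(data, "encode")
str_can_decode = hasattr(text, "decode")

# Mixing the two types raises instead of guessing an encoding.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(bytes_can_encode, str_can_decode, mixing_allowed)
```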

Hope that is illuminating.

Further reading

Characters vs. Bytes, by Tim Bray.

And the very illustrative pieces by Armin Ronacher:

The Updated Guide to Unicode on Python (July 2, 2013)
More About Unicode in Python 2 and 3 (January 5, 2014)
UCS vs UTF-8 as Internal String Encoding (January 9, 2014)
Everything you did not want to know about Unicode in Python 3 (May 12, 2014)

How do you decode a value in Python?

decode() is a method defined on strings in Python 2 (and on bytes in Python 3). It takes the name of the encoding the string is in and returns the decoded Unicode text. It works as the opposite of encode().

What is decode('utf

Encoding refers to converting a Unicode string to bytes using an encoding scheme such as UTF-8. Decoding refers to converting encoded bytes back to a Unicode string, given the encoding they were written in.

What is UTF

UTF-8 encodes a character into a binary string of one, two, three, or four bytes. UTF-16 encodes a Unicode character into a string of either two or four bytes. This distinction is evident from their names. In UTF-8, the smallest binary representation of a character is one byte, or eight bits; in UTF-16 it is two bytes, or sixteen bits.
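
The byte counts can be checked directly. The following is a small sketch, assuming Python 3, with sample characters chosen only for illustration:

```python
samples = {
    "A": "A",               # U+0041
    "A-umlaut": "\u00c4",   # U+00C4
    "euro": "\u20ac",       # U+20AC
    "emoji": "\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
}

# utf-16-le is used so that no byte order mark inflates the counts.
sizes = {
    name: (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
    for name, ch in samples.items()
}

for name, (utf8_len, utf16_len) in sizes.items():
    print(name, utf8_len, utf16_len)
```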

Should I use UTF

Most libraries that do not hold much foreign-language material will be perfectly fine with the ISO8859-1 encoding (also called Latin-1 or extended ASCII), but if you do have a lot of foreign-language material you should choose UTF-8, since that provides access to many more foreign characters.
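
To make the trade-off concrete, here is a minimal sketch (Python 3 assumed, sample text chosen for illustration) of a string that Latin-1 cannot represent but UTF-8 can:

```python
text = "Vi\u1ec7t Nam"  # contains U+1EC7, which Latin-1 cannot represent

# Latin-1 only covers code points 0 to 255, so encoding fails.
try:
    text.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# UTF-8 can encode every Unicode character and round-trips cleanly.
utf8_bytes = text.encode("utf-8")

print(fits_latin1)
print(len(utf8_bytes))
print(utf8_bytes.decode("utf-8") == text)
```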