Python "ANSI" encoding guide

MS Notepad gives the user a choice of 4 encodings, expressed in clumsy, confusing terminology:

"Unicode" is UTF-16, written little-endian. "Unicode big endian" is UTF-16, written big-endian. In both UTF-16 cases, this means that the appropriate BOM will be written. Use utf-16 to decode such a file.

"UTF-8" is UTF-8; Notepad explicitly writes a "UTF-8 BOM". Use utf-8-sig to decode such a file.

"ANSI" is a shocker. This is MS terminology for "whatever the default legacy encoding is on this computer".

Here is a list of Windows encodings that I know of and the languages/scripts that they are used for:

cp874 Thai
cp932 Japanese
cp936 Unified Chinese (P.R. China, Singapore)
cp949 Korean
cp950 Traditional Chinese (Taiwan, Hong Kong, Macao(?))
cp1250 Central and Eastern Europe
cp1251 Cyrillic (Belarusian, Bulgarian, Macedonian, Russian, Serbian, Ukrainian)
cp1252 Western European languages
cp1253 Greek
cp1254 Turkish
cp1255 Hebrew
cp1256 Arabic script
cp1257 Baltic languages
cp1258 Vietnamese
cp???? languages/scripts of India
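To see why guessing wrong hurts, note that the same two bytes come out as quite different characters under two of those codepages (cp1252 gives "àá", cp1251 gives Cyrillic "аб"):

>>> '\xe0\xe1'.decode('cp1252')
u'\xe0\xe1'
>>> '\xe0\xe1'.decode('cp1251')
u'\u0430\u0431'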

If the file has been created on the computer where it is being read, then you can obtain the "ANSI" encoding with locale.getpreferredencoding(). Otherwise, if you know where the file came from, you can specify what encoding to use if it's not UTF-16. Failing that, guess.
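For instance, on a box set up for English or a Western European locale you would typically see:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'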

Be careful using codecs.open() to read files on Windows. The docs say: """Note Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.""" This means that your lines will end in \r\n and you will need/want to strip those off.
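In other words, something like this (the encoding name is just a placeholder):

import codecs
for line in codecs.open('somefile.txt', 'r', 'cp1252'):
    # no universal-newline translation was done, so do it ourselves
    line = line.rstrip(u'\r\n')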

Putting it all together:

Sample text file, saved with all 4 encoding choices, looks like this in Notepad:

The quick brown fox jumped over the lazy dogs. àáâãäå

Here is some demo code:

import locale

def guess_notepad_encoding(filepath, default_ansi_encoding=None):
    with open(filepath, 'rb') as f:
        data = f.read(3)
    if data[:2] in ('\xff\xfe', '\xfe\xff'):
        return 'utf-16'
    if data == u''.encode('utf-8-sig'):
        return 'utf-8-sig'
    # presumably "ANSI"
    return default_ansi_encoding or locale.getpreferredencoding()

if __name__ == "__main__":
    import sys, glob, codecs
    defenc = sys.argv[1]
    for fpath in glob.glob(sys.argv[2]):
        print
        print (fpath, defenc)
        with open(fpath, 'rb') as f:
            print "raw:", repr(f.read())
        enc = guess_notepad_encoding(fpath, defenc)
        print "guessed encoding:", enc
        with codecs.open(fpath, 'r', enc) as f:
            for lino, line in enumerate(f, 1):
                print lino, repr(line)
                print lino, repr(line.rstrip('\r\n'))

and here is the output when run in a Windows "Command Prompt" window using the command

\python27\python read_notepad.py "" t1-*.txt

('t1-ansi.txt', '')
raw: 'The quick brown fox jumped over the lazy dogs.\r\n\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
guessed encoding: cp1252
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-u8.txt', '')
raw: '\xef\xbb\xbfThe quick brown fox jumped over the lazy dogs.\r\n\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\r\n'
guessed encoding: utf-8-sig
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-uc.txt', '')
raw: '\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\x00e\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\x00.\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n\x00'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-ucb.txt', '')
raw: '\xfe\xff\x00T\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\x00e\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\x00.\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

Things to be aware of:

(1) "mbcs" is a file-system pseudo-encoding which has no relevance at all to decoding the contents of files. On a system where the default encoding is cp1252, it makes like latin1 (aarrgghh!!); see below

>>> all_bytes = "".join(map(chr, range(256)))
>>> u1 = all_bytes.decode('cp1252', 'replace')
>>> u2 = all_bytes.decode('mbcs', 'replace')
>>> u1 == u2
False
>>> [(i, u1[i], u2[i]) for i in xrange(256) if u1[i] != u2[i]]
[(129, u'\ufffd', u'\x81'), (141, u'\ufffd', u'\x8d'), (143, u'\ufffd', u'\x8f'), (144, u'\ufffd', u'\x90'), (157, u'\ufffd', u'\x9d')]
>>>

(2) chardet is very good at detecting encodings based on non-Latin scripts (Chinese/Japanese/Korean, Cyrillic, Hebrew, Greek) but not much good at Latin-based encodings (Western/Central/Eastern Europe, Turkish, Vietnamese) and doesn't grok Arabic at all.
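If you do end up guessing, chardet makes the attempt a one-liner (the file name is a stand-in, and the result shown in the comment is only an example):

import chardet
raw = open('mystery.txt', 'rb').read()
guess = chardet.detect(raw)
# guess is a dict, e.g. {'encoding': 'windows-1251', 'confidence': 0.98}
print guess['encoding'], guess['confidence']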
