Python requests response utf 8

When the server sends the header 'Content-Type: text/html' with no charset, requests.get() returns improperly decoded text.

However, if the header states the charset explicitly, as in 'Content-Type: text/html; charset=utf-8', the text is decoded properly.

Also, when we use urllib.urlopen(), it returns properly encoded data.

Has anyone noticed this before? Why does requests.get() behave like this?

APhillips


asked May 26, 2017 at 13:54

The "educated guesses" the documentation mentions are probably just a check of the Content-Type header sent by the server (a rather misleading use of "educated", IMHO).

For the response header Content-Type: text/html, the result is ISO-8859-1 (the HTML4 default), without any analysis of the content (so an HTML5 page, for which the default is UTF-8, is still decoded as ISO-8859-1).

For the response header Content-Type: text/html; charset=utf-8, the result is UTF-8.
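The header-only rule described above can be sketched in a few lines. This is an illustration built on the standard library, not the code requests actually runs, and the helper name `encoding_from_content_type` is hypothetical.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Return the encoding implied by a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased charset parameter, or None
    if charset:
        return charset
    # No charset parameter: fall back to the historical default for text/* types.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


print(encoding_from_content_type("text/html"))
print(encoding_from_content_type("text/html; charset=utf-8"))
```

Note that the body of the response is never inspected here, which is why a UTF-8 page served with a bare text/html header ends up decoded as ISO-8859-1.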

Luckily for us, requests uses the chardet library, which usually works quite well (see the attribute requests.Response.apparent_encoding), so you usually want to do:

    import requests

    r = requests.get("https://martin.slouf.name/")
    # override the header-based encoding with chardet's content-based guess
    r.encoding = r.apparent_encoding
    # access the decoded text
    r.text
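To see why a content-based guess helps, here is a minimal offline sketch. It does not use chardet. The hypothetical `guess_encoding` helper is a crude stand-in that only checks whether the bytes are valid UTF-8, and the sample bytes are made up for illustration.

```python
def guess_encoding(raw, fallback="ISO-8859-1"):
    """Crude stand-in for content-based detection (not chardet)."""
    try:
        raw.decode("utf-8")  # strict by default; raises on invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


body = "café".encode("utf-8")     # bytes as a server might send them
header_encoding = "ISO-8859-1"    # what a bare text/html header implies

wrong = body.decode(header_encoding)       # 'cafÃ©' (mojibake)
right = body.decode(guess_encoding(body))  # 'café'
print(wrong, right)
```

A real detector such as chardet weighs many encodings statistically; the point here is only that looking at the bytes can correct a misleading header.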

answered Oct 2, 2018 at 19:34

bubak


From the requests documentation:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

    >>> r.encoding
    'utf-8'
    >>> r.encoding = 'ISO-8859-1'

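Check which encoding requests used for your page and, if it is not the right one, set r.encoding to the one you need.

As for the difference between requests and urllib.urlopen: urlopen() hands back the raw bytes and does no decoding of its own, so the data looks right whenever it is displayed or decoded as UTF-8, whereas r.text is decoded with the encoding requests guessed from the headers.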
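The comparison with urllib can be made concrete with a small sketch using Python 3's `urllib.request.urlopen`. A `data:` URL stands in for a real server so the example runs offline; that substitution, and the UTF-8 fallback on the last line, are assumptions made for illustration.

```python
from urllib.request import urlopen

# A data: URL stands in for a real server so the sketch runs offline.
# The payload is the UTF-8 bytes of "café", percent-encoded.
with urlopen("data:text/html;charset=utf-8,caf%C3%A9") as resp:
    raw = resp.read()                             # bytes, not str
    charset = resp.headers.get_content_charset()  # charset parameter, or None

# urlopen() never decodes for you; the choice of encoding is yours.
text = raw.decode(charset or "utf-8")
```

Because the caller picks the encoding at the moment of decoding, there is no header-based default to get in the way.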

answered May 26, 2017 at 13:59

Dekel


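After getting the response, use response.content instead of response.text. It holds the raw bytes exactly as the server sent them, bypassing whatever encoding requests guessed, so if the page is UTF-8 the bytes are UTF-8.

    import requests

    # download_link, myUsername and myPassword are placeholders to fill in
    response = requests.get(download_link, auth=(myUsername, myPassword),
                            headers={'User-Agent': 'Mozilla'})
    print(response.encoding)
    if response.status_code == 200:
        body = response.content  # raw bytes, not yet decoded
    else:
        print("Unable to get response with Code : %d" % response.status_code)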

answered Jan 30, 2020 at 18:21

Hari_pb


The default assumed character encoding for text/html is ISO-8859-1, aka Latin-1 :( See RFC 2854. UTF-8 was too young to become the default: it was born in 1993, about the same time as HTML and HTTP.

Use .content to access the byte stream, or .text to access the decoded Unicode stream. If the HTTP server does not declare the correct encoding, the value of .text may come out garbled.
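The difference between the two attributes can be imitated without any network access. In this sketch the variables `content` and `text` merely play the roles of `.content` and `.text`; the sample string is an assumption.

```python
content = "café".encode("utf-8")     # plays the role of response.content (bytes)
text = content.decode("ISO-8859-1")  # what .text holds under the wrong default

# Text that was decoded with the wrong encoding can be repaired by undoing
# the decode and decoding again with the right one.
repaired = text.encode("ISO-8859-1").decode("utf-8")
```

The round trip works because ISO-8859-1 maps every byte to exactly one character, so no information is lost by the wrong decode.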

glhr


answered May 26, 2017 at 14:05

9000


What is UTF-8 encoding?

UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding.
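The "8-bit values" are the bytes each character is encoded into, and a character may need more than one of them. A short sketch, with arbitrarily chosen sample characters:

```python
# Sample characters: ASCII letter, e-acute, euro sign, an emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to the number of bytes UTF-8 needs for it.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s): {ch.encode('utf-8').hex()}")
```

ASCII characters take one byte, which is why plain English text looks identical in UTF-8 and ISO-8859-1 and the problem only shows up with accented or non-Latin characters.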

How do I decode a UTF-8 string?

To decode bytes encoded as UTF-8, use the decode() method defined on bytes objects. It accepts two arguments, encoding and errors: encoding names the encoding of the bytes to be decoded, and errors decides how errors that arise during decoding are handled.
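A brief sketch of decode() and its error-handling argument, using made-up sample bytes:

```python
good = b"caf\xc3\xa9"  # valid UTF-8 for "café"
bad = b"caf\xe9"       # Latin-1 bytes; not valid UTF-8

decoded = good.decode("utf-8")
replaced = bad.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD
ignored = bad.decode("utf-8", errors="ignore")    # invalid byte is dropped

try:
    bad.decode("utf-8")  # errors="strict" is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The strict default is usually the safest choice, since "replace" and "ignore" silently lose data.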

What is the "sig" in utf-8-sig?

"sig" in "utf-8-sig" is short for "signature" (that is, a UTF-8 file that starts with a signature, the byte order mark). Using utf-8-sig to read a file will treat the BOM as metadata that explains how to interpret the file, instead of as part of the file contents.

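The effect of the two codecs on a byte order mark can be checked directly. The sample bytes below are an assumption for illustration:

```python
data = b"\xef\xbb\xbfhello"  # UTF-8 BOM followed by "hello"

with_bom = data.decode("utf-8")         # BOM survives as U+FEFF
without_bom = data.decode("utf-8-sig")  # BOM is stripped
plain = b"hello".decode("utf-8-sig")    # no BOM present: same as utf-8
```

The same codec name can be passed to open(), for example open(path, encoding="utf-8-sig"), when reading files that may or may not start with a BOM.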