Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.
This document covers Beautiful Soup version 4.11.0. The examples in this document were written for Python 3.8.
You might be looking for the documentation for Beautiful Soup 3. If so, you should know that Beautiful Soup 3 is no longer being developed and that all support for it was dropped on December 31, 2020. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting code to BS4.
This documentation has been translated into other languages by Beautiful Soup users:
这篇文档当然还有中文版
このページは日本語で利用できます(外部リンク)
이 문서는 한국어 번역도 가능합니다
This document is also available in Brazilian Portuguese.
Эта документация доступна на русском языке
Getting help¶
If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.
Quick Start¶
Here's an HTML document I'll be using as an example throughout this document. It's part of a story from Alice in Wonderland:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
Running the "three sisters" document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
Here are some simple ways to navigate that data structure.
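As a hedged sketch of what such navigation can look like, the snippet below parses a shortened copy of the example document and reads a few nodes from it. The abbreviated `html_doc` and the variable names are assumptions made to keep the block self-contained.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the "three sisters" document (an assumption).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string         # text inside the first <title> tag
parent_name = soup.title.parent.name   # name of the tag enclosing <title>
first_p_class = soup.p["class"]        # class is multi-valued, so this is a list
third_link = soup.find(id="link3")     # look a tag up by its id attribute
hrefs = [a.get("href") for a in soup.find_all("a")]  # every link URL

print(title_text)     # The Dormouse's story
print(parent_name)    # head
print(first_p_class)  # ['title']
print(hrefs)
```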
Another common task is extracting all the text from a page:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
Does this look like what you need? If so, read on
Installing Beautiful Soup¶
If you're using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:
$ apt-get install python3-bs4
Beautiful Soup 4 is published through PyPI, so if you can't install it with the system packager, you can install it with pip:
$ pip install beautifulsoup4
The BeautifulSoup package is not what you want. That's the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it's still available, but if you're writing new code, you should install beautifulsoup4.
You can also copy the bs4 directory into your application's codebase, and use Beautiful Soup without installing it at all.
I use Python 3.8 to develop Beautiful Soup, but it should work with other recent versions.
Installing a parser¶
Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:
$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml
Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:
$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib
This table summarizes the advantages and disadvantages of each parser library.
If you can, I recommend you install and use lxml for speed. If you're using a very old version of Python (earlier than 3.2.2), it's essential that you install lxml or html5lib. Python's built-in HTML parser is just not very good in those old versions.
Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
Making the soup¶
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle.
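A minimal sketch of both input styles follows. The markup string and the temporary file are assumptions chosen so the block runs on its own; in practice the filehandle would usually come from a real file such as a saved page.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><body><p>a web page</p></body></html>"

# 1. Parse markup held in a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2. Parse markup read from an open filehandle. A temporary file stands in
#    for a real HTML file here.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(markup)
    path = tmp.name
try:
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")
finally:
    os.remove(path)  # clean up the temporary file

print(soup_from_string.p.string)  # a web page
print(soup_from_file.p.string)    # a web page
```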
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
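To illustrate the entity conversion, here is a small sketch. The markup is an assumption picked because it contains a single named entity.

```python
from bs4 import BeautifulSoup

# &eacute; is an HTML entity; after parsing it is an ordinary Unicode character.
soup = BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>",
                     "html.parser")
body_text = soup.body.string

print(body_text)  # Sacré bleu!
```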
Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
Kinds of objects¶
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
Tags have a lot of attributes and methods, and I'll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.
The tag <b id="boldest"> has an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary.
You can access that dictionary directly as .attrs.
You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary.
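The dictionary-style access described above can be sketched as follows. The one-tag document and the attribute name `another-attribute` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="boldest">bold</b>', "html.parser")
tag = soup.b

tag_id = tag["id"]            # dictionary-style read
all_attrs = dict(tag.attrs)   # copy of the underlying attribute mapping

tag["id"] = "verybold"            # modify an existing attribute
tag["another-attribute"] = "1"    # add a new attribute
modified = str(tag)

del tag["id"]                     # remove attributes again
del tag["another-attribute"]
stripped = str(tag)

missing = tag.get("id")  # .get() returns None instead of raising KeyError

print(tag_id)     # boldest
print(all_attrs)  # {'id': 'boldest'}
print(stripped)   # <b>bold</b>
print(missing)    # None
```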
Multi-valued attributes¶
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class. Beautiful Soup presents the value(s) of a multi-valued attribute as a list.
If an attribute looks like it has more than one value, but it's not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone.
When you turn a tag back into a string, multiple attribute values are consolidated.
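The three behaviours described in this section can be sketched together. The tiny documents below are assumptions chosen to isolate each case.

```python
from bs4 import BeautifulSoup

# class is multi-valued, so its value is split into a list.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
classes = css_soup.p["class"]

# id is not multi-valued, so the value is left alone even though it has a space.
id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")
id_value = id_soup.p["id"]

# A list assigned to a multi-valued attribute is joined when serialized.
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>',
                         "html.parser")
rel_soup.a["rel"] = ["index", "contents"]
merged = str(rel_soup.p)

print(classes)   # ['body', 'strikeout']
print(id_value)  # my id
print(merged)    # <p>Back to the <a rel="index contents">homepage</a></p>
```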
You can use get_attribute_list to get a value that's always a list, whether or not it's a multi-valued attribute.
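A short sketch of `get_attribute_list`, using an assumed one-tag document that carries both a single-valued and a multi-valued attribute:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="my id" class="body strikeout"></p>', "html.parser")

id_as_string = soup.p["id"]                       # plain string
id_as_list = soup.p.get_attribute_list("id")      # wrapped in a list
class_list = soup.p.get_attribute_list("class")   # already a list

print(id_as_string)  # my id
print(id_as_list)    # ['my id']
print(class_list)    # ['body', 'strikeout']
```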
If you parse a document as XML, there are no multi-valued attributes.
Again, you can configure this using the multi_valued_attributes argument.
You probably won't need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification.
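As a hedged illustration of the `multi_valued_attributes` argument, the sketch below switches list handling off for an HTML parse and then looks at the default rules. It uses the built-in `html.parser` rather than an XML parser so that no extra dependency is assumed.

```python
from bs4 import BeautifulSoup
from bs4.builder import HTMLTreeBuilder

# Passing None disables the splitting of multi-valued attributes.
no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser",
                             multi_valued_attributes=None)
raw_class = no_list_soup.p["class"]

# The default rules: a mapping from tag name ("*" means any tag) to the
# attribute names treated as multi-valued.
defaults = HTMLTreeBuilder.DEFAULT_CDATA_LIST_ATTRIBUTES
universal = defaults["*"]

print(raw_class)             # body strikeout
print("class" in universal)  # True
```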
A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with str().
NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the .contents or .string attributes, or the find() method.
If you want to use a NavigableString outside of Beautiful Soup, you should call str() on it to turn it into a normal Python Unicode string. If you don't, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory.
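A minimal sketch of that conversion, assuming a one-tag document:

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

text_node = soup.b.string   # a NavigableString, still attached to the tree
plain = str(text_node)      # an ordinary str with no link back to the tree

print(isinstance(text_node, NavigableString))  # True
print(text_node.parent.name)                   # b
print(type(plain) is str)                      # True
print(hasattr(plain, "parent"))                # False
```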
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like combine two parsed documents.
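One way to combine two parsed documents is sketched below. The placeholder text `INSERT FOOTER HERE` and both documents are assumptions; `replace_with()` is one of the tree-modification methods.

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup("<div><p>Body text</p>INSERT FOOTER HERE</div>", "html.parser")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "html.parser")

# Find the placeholder string and swap in the second parsed document.
placeholder = doc.find(string="INSERT FOOTER HERE")
placeholder.replace_with(footer)

combined = str(doc)
print(combined)
```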
I'll use the "three sisters" document as an example to show you how to move from one part of a document to another.
Going down¶
Tags may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag's children.
Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.
Navigating using tag names¶
The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head.
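Navigation by tag name can be sketched like this. The shortened document is an assumption; each attribute returns only the first tag with that name.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

head = soup.head             # the <head> tag
title = soup.title           # the <title> tag, found anywhere in the tree
first_bold = soup.body.b     # first <b> tag beneath <body>
first_link = soup.a          # only the FIRST <a> tag

print(head)              # <head><title>The Dormouse's story</title></head>
print(first_bold)        # <b>The Dormouse's story</b>
print(first_link["id"])  # link1
```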
The .descendants attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on.
The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag's child. The BeautifulSoup object only has one direct child (the <html> tag), but it has a whole lot of descendants.
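The difference between children and descendants can be checked with a small sketch. The document here is deliberately tiny (an assumption), so the counts are smaller than for the full example document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
head_tag = soup.head

head_children = list(head_tag.children)        # just the <title> tag
head_descendants = list(head_tag.descendants)  # <title> tag and its string
top_children = list(soup.children)             # just the <html> tag
all_descendants = list(soup.descendants)       # html, head, title, string

print(len(head_children), len(head_descendants))  # 1 2
print(len(top_children), len(all_descendants))    # 1 4
```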
If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.
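A brief sketch of that rule, with an assumed document in which <head> has a single child and <body> has two:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>One</p><p>Two</p></body></html>",
    "html.parser",
)

head_string = soup.head.string  # inherited from the only child, <title>
body_string = soup.body.string  # ambiguous (two children), so None

print(head_string)  # The Dormouse's story
print(body_string)  # None
```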
.strings and stripped_strings¶
If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator.
These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead.
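Both generators can be compared side by side. The paragraph below is an assumption written with deliberate line breaks and indentation.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="story">\n  Once upon a time: <a>Elsie</a>,\n  <a>Lacie</a>\n</p>',
    "html.parser",
)

raw = list(soup.p.strings)             # every string, whitespace included
clean = list(soup.p.stripped_strings)  # stripped, whitespace-only ones skipped

print(len(raw))  # 5
print(clean)     # ['Once upon a time:', 'Elsie', ',', 'Lacie']
```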
Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
Going up¶
Continuing the "family tree" analogy, every tag and every string has a parent: the tag that contains it.
.parent¶
You can access an element's parent with the .parent attribute. In the example "three sisters" document, the <head> tag is the parent of the <title> tag.
The title string itself has a parent: the <title> tag that contains it.
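The parent relationships described above can be sketched as follows, using an assumed document that contains only the <head> section.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
title_tag = soup.title

tag_parent = title_tag.parent            # the <head> tag
string_parent = title_tag.string.parent  # the <title> tag itself
html_parent = soup.html.parent           # the BeautifulSoup object
top_parent = soup.parent                 # nothing above the document

print(tag_parent.name)             # head
print(string_parent is title_tag)  # True
print(html_parent is soup)         # True
print(top_parent)                  # None
```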
.parents¶
You can iterate over all of an element's parents with .parents.
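As a sketch, the loop below walks from an <a> tag up to the top of an assumed small document and collects the name of each parent along the way.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>',
    "html.parser",
)
link = soup.a

# The last parent is the BeautifulSoup object, whose name is "[document]".
parent_names = [parent.name for parent in link.parents]

print(parent_names)  # ['p', 'body', 'html', '[document]']
```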
.next_siblings and .previous_siblings¶
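As an illustration of sibling navigation, here is a minimal sketch. The three-tag document is an assumption with no whitespace between tags, so every sibling is a tag rather than a string.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text1</b><c>text2</c><d>text3</d></a>", "html.parser")

next_one = soup.b.next_sibling       # the <c> tag
previous_one = soup.c.previous_sibling  # the <b> tag
after_b = [tag.name for tag in soup.b.next_siblings]       # forward iteration
before_d = [tag.name for tag in soup.d.previous_siblings]  # backward iteration

print(next_one)      # <c>text2</c>
print(previous_one)  # <b>text1</b>
print(after_b)       # ['c', 'd']
print(before_d)      # ['c', 'b']
```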
Going back and forth¶
Take a look at the beginning of the "three sisters" document:
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
An HTML parser takes this string of characters and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.
.next_element and .previous_element¶
The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one.
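The contrast between parse order and sibling order can be sketched with an assumed one-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link3">Tillie</a>; and they lived at the bottom of a well.</p>',
    "html.parser",
)
last_a = soup.a

sibling = last_a.next_sibling       # text that follows the closing </a>
parsed_next = last_a.next_element   # the string parsed right after <a> opened
parsed_prev = last_a.previous_element  # the <p> tag opened just before <a>

print(sibling)      # ; and they lived at the bottom of a well.
print(parsed_next)  # Tillie
print(parsed_prev.name)  # p
```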
.next_elements and .previous_elements¶
You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed.
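A short sketch of forward iteration in parse order, assuming a two-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link3">Tillie</a>; and they lived.</p><p>...</p>',
    "html.parser",
)

# Everything parsed after the <a> tag was opened, in document order:
# strings and tags are interleaved.
forward = [str(element) for element in soup.a.next_elements]

print(forward)  # ['Tillie', '; and they lived.', '<p>...</p>', '...']
```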
Searching the tree¶
Beautiful Soup defines a lot of methods for searching the parse tree, but they're all very similar. I'm going to spend a lot of time explaining the two most popular methods: find() and find_all().
Before talking in detail about find_all() and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag's name, on its attributes, on the text of a string, or on some combination of these.
A string¶
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the <b> tags in the document.
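A minimal sketch of the string filter, assuming a small document with a single <b> tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><b>The Dormouse's story</b> <i>once</i></p></body></html>",
    "html.parser",
)

bold_tags = soup.find_all("b")  # exact match on the tag name "b"

print(bold_tags)  # [<b>The Dormouse's story</b>]
```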
If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.
A regular expression¶
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter "b".
This code finds all the tags whose names contain the letter "t".
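Both regular-expression filters can be sketched against one assumed document. The tag names it contains are html, head, title, body, p, b, and a.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>Story</title></head>"
    '<body><p><b>Story</b> <a href="#">link</a></p></body></html>',
    "html.parser",
)

# Tag names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]
# Tag names that contain "t" anywhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)  # ['body', 'b']
print(contains_t)     # ['html', 'title']
```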
A list¶
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the <a> tags and all the <b> tags.
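A sketch of the list filter follows, with an assumed document holding one <b> tag and two <a> tags. Results come back in document order, not in the order of the list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><b>Story</b> <a id="link1">Elsie</a> <i>and</i> <a id="link2">Lacie</a></p>',
    "html.parser",
)

matches = soup.find_all(["a", "b"])  # a tag matches if its name is in the list
names = [tag.name for tag in matches]

print(names)  # ['b', 'a', 'a']
```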
find_all()¶
The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters. I gave several examples in Kinds of filters, but here are a few more.
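A few more `find_all()` calls are sketched below against a shortened copy of the example document (the abbreviation is an assumption).

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

titles = soup.find_all("title")          # by tag name
title_paras = soup.find_all("p", "title")  # by tag name and CSS class
link2 = soup.find_all(id="link2")        # by attribute value
first_two = soup.find_all("a", limit=2)  # stop after two matches

print(titles)                            # [<title>The Dormouse's story</title>]
print([a["id"] for a in first_two])      # ['link1', 'link2']
```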
The name argument¶
Pass in a value for name and you'll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don't match.
You can't use a keyword argument to search for HTML's "name" attribute, because Beautiful Soup uses the name argument to contain the name of the tag itself. Instead, you can give a value to "name" in the attrs argument.
Remember that a single tag can have multiple values for its "class" attribute. When you search for a tag that matches a certain CSS class, you're matching against any of its CSS classes.
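Both points can be sketched with two assumed one-tag documents: one for the "name" attribute and one for CSS classes. The `class_` keyword carries a trailing underscore because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Searching on the HTML "name" attribute.
name_soup = BeautifulSoup('<input name="email"/>', "html.parser")
by_keyword = name_soup.find_all(name="email")          # looks for <email> tags
by_attrs = name_soup.find_all(attrs={"name": "email"})  # looks at the attribute

# Searching on one of several CSS classes.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
strikeout = css_soup.find_all("p", class_="strikeout")
body = css_soup.find_all("p", class_="body")

print(by_keyword)     # []
print(len(by_attrs))  # 1
print(len(strikeout), len(body))  # 1 1
```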
The find_parents() and find_parent() methods are closely connected to the .parent and .parents attributes mentioned earlier. These search methods actually use .parents to iterate over all the parents, and check each one against the provided filter to see if it matches.
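A sketch of searching upward from a string, assuming a small nested document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>',
    "html.parser",
)
a_string = soup.find(string="Elsie")

link_parents = a_string.find_parents("a")         # every enclosing <a> tag
story_para = a_string.find_parent("p")            # nearest enclosing <p> tag
no_match = a_string.find_parents("p", class_="title")  # filter matches nothing

print(link_parents)         # [<a id="link1">Elsie</a>]
print(story_para["class"])  # ['story']
print(no_match)             # []
```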
find_next_siblings() and find_next_sibling()¶
Method signature: find_next_siblings(name, attrs, string, limit, **kwargs)
Method signature: find_next_sibling(name, attrs, string, **kwargs)
These methods use .next_siblings to iterate over the rest of an element's siblings in the tree. The find_next_siblings() method returns all the siblings that match, and find_next_sibling() only returns the first one.
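A minimal sketch of both methods, assuming an abbreviated copy of the three-sisters document (variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the example document (an assumption for this sketch).
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a
later_links = first_link.find_next_siblings("a")   # all matching siblings
print([a["id"] for a in later_links])  # ['link2', 'link3']

first_story_paragraph = soup.find("p", "story")
next_paragraph = first_story_paragraph.find_next_sibling("p")  # first match only
print(next_paragraph)  # <p class="story">...</p>
```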
find_previous_siblings() and find_previous_sibling()¶
Method signature: find_previous_siblings(name, attrs, string, limit, **kwargs)
Method signature: find_previous_sibling(name, attrs, string, **kwargs)
These methods use .previous_siblings to iterate over an element's siblings that precede it in the tree. The find_previous_siblings() method returns all the siblings that match, and find_previous_sibling() only returns the first one.
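A minimal sketch, again assuming an abbreviated copy of the three-sisters document. Note that the siblings come back closest first, because the search walks backward.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the example document (an assumption for this sketch).
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

last_link = soup.find("a", id="link3")
earlier_links = last_link.find_previous_siblings("a")
print([a["id"] for a in earlier_links])  # ['link2', 'link1']

first_story_paragraph = soup.find("p", "story")
previous_paragraph = first_story_paragraph.find_previous_sibling("p")
print(previous_paragraph)  # <p class="title"><b>The Dormouse's story</b></p>
```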
find_all_next() and find_next()¶
Method signature: find_all_next(name, attrs, string, limit, **kwargs)
Method signature: find_next(name, attrs, string, **kwargs)
These methods use .next_elements to iterate over whatever tags and strings come after an element in the document. The find_all_next() method returns all matches, and find_next() only returns the first match.
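A minimal sketch, assuming an abbreviated copy of the three-sisters document. The second call shows that a match does not have to be a sibling or descendant of the starting element; it only has to appear later in the document.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the example document (an assumption for this sketch).
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a
links_after = first_link.find_all_next("a")
print([a["id"] for a in links_after])  # ['link2', 'link3']

# The next <p> in document order is the last paragraph, which is
# not inside the paragraph that holds the starting link.
paragraph_after = first_link.find_next("p")
print(paragraph_after)  # <p class="story">...</p>
```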
find_all_previous() and find_previous()¶
Method signature: find_all_previous(name, attrs, string, limit, **kwargs)
Method signature: find_previous(name, attrs, string, **kwargs)
These methods use .previous_elements to iterate over the tags and strings that came before an element in the document. The find_all_previous() method returns all matches, and find_previous() only returns the first match.
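A minimal sketch, assuming an abbreviated copy of the three-sisters document and starting from the first link:

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the example document (an assumption for this sketch).
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a
paragraphs_before = first_link.find_all_previous("p")
print([p["class"] for p in paragraphs_before])  # [['story'], ['title']]

title_before = first_link.find_previous("title")
print(title_before)  # <title>The Dormouse's story</title>
```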
The call to find_all_previous("p") found the first paragraph in the document (the one with class="title"), but it also finds the second paragraph, the <p> tag that contains the <a> tag we started with.
CSS selectors¶
BeautifulSoup has a .select() method which uses the SoupSieve package to run a CSS selector against a parsed document and return all the matching elements. Tag has a similar method which runs a CSS selector against the contents of a single tag.
(The SoupSieve integration was added in Beautiful Soup 4.7.0. Earlier versions also have the .select() method, but only the most commonly used CSS selectors are supported. If you installed Beautiful Soup through pip, SoupSieve was installed at the same time, so you don't have to do anything extra.)
The SoupSieve documentation lists all the currently supported CSS selectors, but here are some of the basics.
You can find tags
Find tags beneath other tags
Find tags directly beneath other tags
Find the siblings of tags
Find tags by CSS class
Find tags by ID
Find tags that match any selector from a list of selectors
Test for the existence of an attribute
Find tags by attribute value
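The basic selectors listed above can be sketched as follows, assuming an abbreviated copy of the three-sisters document. The ids() helper is a hypothetical convenience for this sketch, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the example document (an assumption for this sketch).
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")


def ids(tags):
    """Hypothetical helper: list the id attribute of each tag."""
    return [tag.get("id") for tag in tags]


print(soup.select("title"))                  # find tags by name
print(ids(soup.select("body a")))            # tags beneath other tags
print(ids(soup.select("p > a")))             # tags directly beneath other tags
print(ids(soup.select("#link1 ~ .sister")))  # later siblings of a tag
print(ids(soup.select("#link1 + .sister")))  # the immediately following sibling
print(ids(soup.select(".sister")))           # by CSS class
print(ids(soup.select("#link1")))            # by ID
print(ids(soup.select("#link1,#link2")))     # any selector from a list
print(ids(soup.select("a[href]")))           # attribute exists
print(ids(soup.select('a[href$="tillie"]'))) # attribute value ends with "tillie"
```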
There's also a method called select_one(), which finds only the first tag that matches a selector.
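A minimal sketch of select_one(), using a made-up fragment with three links:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for this sketch.
markup = (
    '<p><a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="sister" id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

first_sister = soup.select_one(".sister")   # first match only
no_match = soup.select_one(".brother")      # nothing matches this selector

print(first_sister["id"])  # link1
print(no_match)            # None
```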
If you've parsed XML that defines namespaces, you can use them in CSS selectors.
When handling a CSS selector that uses namespaces, Beautiful Soup always tries to use namespace prefixes that make sense based on what it saw while parsing the document. You can always provide your own dictionary of abbreviations.
All this CSS selector stuff is a convenience for people who already know the CSS selector syntax. You can do all of this with the Beautiful Soup API. And if CSS selectors are all you need, you should parse the document with lxml: it's a lot faster. But this lets you combine CSS selectors with the Beautiful Soup API.
Modifying the tree¶
Beautiful Soup's main strength is in searching the parse tree, but you can also modify the tree and write your changes as a new HTML or XML document.
Changing tag names and attributes¶
I covered this earlier, in Attributes, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes.
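A minimal sketch of those four operations on a made-up one-tag document:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for this sketch.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"     # rename the tag
tag["class"] = "verybold"   # change an existing attribute
tag["id"] = "main"          # add a new attribute
after_changes = str(tag)
print(after_changes)
# <blockquote class="verybold" id="main">Extremely bold</blockquote>

del tag["class"]            # delete attributes
del tag["id"]
after_deletes = str(tag)
print(after_deletes)
# <blockquote>Extremely bold</blockquote>
```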
Modifying .string¶
If you set a tag's .string attribute to a new string, the tag's contents are replaced with that string.
Be careful: if the tag contained other tags, they and all their contents will be destroyed.
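A minimal sketch of that behavior, using a made-up link that contains a nested <i> tag:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for this sketch.
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Replaces everything inside <a>, including the nested <i> tag.
tag.string = "New link text."

print(tag)    # <a href="http://example.com/">New link text.</a>
print(tag.i)  # None
```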
append()¶
You can add to a tag's contents with Tag.append(). It works just like calling .append() on a Python list.
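A minimal sketch on a made-up one-tag document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>Foo</a>", "html.parser")
soup.a.append("Bar")   # the new string goes at the end of the tag's contents

print(soup)             # <a>FooBar</a>
print(soup.a.contents)  # ['Foo', 'Bar']
```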
extend()¶
Tag also supports a method called .extend(), which adds every element of a list to a Tag.
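A minimal sketch on a made-up one-tag document. It assumes a Beautiful Soup version recent enough to provide Tag.extend().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>Soup</a>", "html.parser")
soup.a.extend(["'s", " ", "on"])   # each list element is appended in turn

print(soup)             # <a>Soup's on</a>
print(soup.a.contents)  # ['Soup', "'s", ' ', 'on']
```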
NavigableString() and .new_tag()¶
If you need to add a string to a document, no problem: you can pass a Python string in to append().
If you want to create a comment or some other subclass of NavigableString, just call the constructor.
(This is a new feature in Beautiful Soup 4.4.0.)
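A minimal sketch of both approaches on a made-up empty tag. Comment and NavigableString are imported from bs4; the text being appended is illustrative.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<b></b>", "html.parser")
tag = soup.b

tag.append("Hello")                       # a plain Python string
tag.append(NavigableString(" there"))     # an explicit NavigableString
tag.append(Comment("Nice to see you."))   # a NavigableString subclass

print(tag)           # <b>Hello there<!--Nice to see you.--></b>
print(tag.contents)  # ['Hello', ' there', 'Nice to see you.']
```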
What if you need to create a whole new tag? The best solution is to call the factory method BeautifulSoup.new_tag().
Only the first argument, the tag name, is required.
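A minimal sketch of new_tag() on a made-up empty tag. The href value is illustrative; extra keyword arguments become attributes of the new tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")
original_tag = soup.b

# Only the tag name is required; href is an optional attribute.
new_tag = soup.new_tag("a", href="http://example.com")
original_tag.append(new_tag)
before_text = str(original_tag)
print(before_text)   # <b><a href="http://example.com"></a></b>

new_tag.string = "Link text."
print(original_tag)  # <b><a href="http://example.com">Link text.</a></b>
```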
insert()¶
Tag.insert() is just like Tag.append(), except the new element doesn't necessarily go at the end of its parent's .contents. It will be inserted at whatever numeric position you say. It works just like .insert() on a Python list.
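A minimal sketch on a made-up link, inserting a string at position 1 of the tag's contents:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for this sketch.
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

tag.insert(1, "but did not endorse ")   # position 1: between the text and <i>

print(tag)
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
print(tag.contents)
# ['I linked to ', 'but did not endorse ', <i>example.com</i>]
```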
insert_before() and insert_after()¶
The insert_before() method inserts tags or strings immediately before something else in the parse tree.
The insert_after() method inserts tags or strings immediately after something else in the parse tree.
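A minimal sketch of both methods on a made-up one-tag document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>leave</b>", "html.parser")

# Build a new <i> tag and put it right before the existing string.
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
after_insert_before = str(soup.b)
print(after_insert_before)  # <b><i>Don't</i>leave</b>

# Put a string right after the new <i> tag.
soup.b.i.insert_after(" you ")
print(soup.b)               # <b><i>Don't</i> you leave</b>
```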
clear()¶
Tag.clear() removes the contents of a tag.
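A minimal sketch on a made-up link. The tag itself and its attributes stay in place; only its contents go away.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for this sketch.
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

tag.clear()

print(tag)           # <a href="http://example.com/"></a>
print(tag.contents)  # []
```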
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
extract()¶
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
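A short sketch of extract(), again assuming html.parser and an illustrative markup string (names such as a_tag and i_tag are mine):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

# extract() detaches <i> from the tree and hands it back.
i_tag = soup.i.extract()

print(a_tag.get_text())  # text left behind in <a>
print(i_tag)             # <i>example.com</i>
print(i_tag.parent)      # None
```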
At this point you effectively have two parse trees: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call extract() on a child of the element you extracted.
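The two-tree situation can be sketched like this, assuming the same illustrative markup as a starting point (variable names are assumptions):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

# First extraction: <i> becomes the root of its own small tree.
i_tag = soup.i.extract()

# Second extraction: pull the string out of the already extracted tag.
my_string = i_tag.string.extract()

print(my_string)         # example.com
print(my_string.parent)  # None
print(i_tag)             # <i></i>
```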
decompose()¶
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.
The behavior of a decomposed Tag or NavigableString is not defined and you should not use it for anything. If you're not sure whether something has been decomposed, you can check its .decomposed property (new in Beautiful Soup 4.9.0).
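A hedged sketch of decompose() and the .decomposed check. It assumes Beautiful Soup 4.9.0 or later (for .decomposed), html.parser, and an illustrative markup string:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
i_tag = soup.i

# Destroy <i> and everything inside it.
i_tag.decompose()

remaining_text = a_tag.get_text()
print(remaining_text)

# Only inspect the flag; do not use a decomposed object for anything else.
print(i_tag.decomposed)  # True
print(a_tag.decomposed)  # False
```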
replace_with()¶
PageElement.replace_with() removes a tag or string from the tree and replaces it with one or more tags or strings of your choice.
replace_with() returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree.
The ability to pass multiple arguments into replace_with() is new in Beautiful Soup 4.10.0.
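Both forms of replace_with() can be sketched as follows. This assumes Beautiful Soup 4.10.0 or later for the multi-argument call, and the markup and tag names are illustrative:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

# Single replacement: swap <i> for a new <b>.
new_tag = soup.new_tag("b")
new_tag.string = "example.com"
old_tag = a_tag.i.replace_with(new_tag)
after_single = str(a_tag)

# Multiple replacements (4.10.0+): swap <b> for a tag, a string, and a tag.
bold_tag = soup.new_tag("b")
bold_tag.string = "example"
italic_tag = soup.new_tag("i")
italic_tag.string = "net"
a_tag.b.replace_with(bold_tag, ".", italic_tag)
after_multiple = str(a_tag)

print(old_tag)         # the tag that was replaced first
print(after_single)
print(after_multiple)
```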
wrap()¶
PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper.
This method is new in Beautiful Soup 4.0.5.
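A small sketch of wrap(), assuming html.parser and a one-paragraph illustrative document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# Wrap the string inside <p> in a new <b> tag; the wrapper is returned.
bold_wrapper = soup.p.string.wrap(soup.new_tag("b"))
inner = str(bold_wrapper)

# Wrap the whole <p> in a new <div>.
div_wrapper = soup.p.wrap(soup.new_tag("div"))
outer = str(div_wrapper)

print(inner)  # <b>I wish I was bold.</b>
print(outer)  # <div><p><b>I wish I was bold.</b></p></div>
```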
unwrap()¶
Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever is inside that tag, which makes it useful for stripping out markup.
Like replace_with(), unwrap() returns the tag that was replaced.
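A sketch of unwrap() on illustrative markup (html.parser assumed; the variable names are mine):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

# Replace <i> with its own contents; the now-empty tag is returned.
removed = a_tag.i.unwrap()
unwrapped = str(a_tag)

print(unwrapped)  # <a href="http://example.com/">I linked to example.com</a>
print(removed)    # <i></i>
```

Note that the unwrapped text becomes a second string next to the first one, which is the situation smooth() is meant to tidy up.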
smooth()¶
After calling a bunch of methods that modify the parse tree, you may end up with two or more NavigableString objects next to each other. Beautiful Soup has no problem with this, but since it can't happen in a freshly parsed document, you might not expect to see adjacent strings stored as separate objects.
You can call Tag.smooth() to clean up the parse tree by consolidating adjacent strings.
This method is new in Beautiful Soup 4.8.0.
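To make the adjacent-string situation concrete, here is a minimal sketch assuming Beautiful Soup 4.8.0 or later and an illustrative one-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>A one</p>", "html.parser")
soup.p.append(", a two")

# Two separate NavigableString objects now sit side by side.
before = list(soup.p.contents)

# smooth() merges adjacent strings in place.
soup.smooth()
after = list(soup.p.contents)

print(before)  # ['A one', ', a two']
print(after)   # ['A one, a two']
```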
Output¶
Pretty-printing¶
The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string.
You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects.
Since it adds whitespace (in the form of newlines), prettify() changes the meaning of an HTML document and should not be used to reformat one. The goal of prettify() is to help you visually understand the structure of the documents you work with.
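A brief sketch of prettify() on both the whole document and a single tag. The markup is illustrative, and the sketch compares stripped lines rather than exact indentation, since the indent width is a formatter detail:

```python
from bs4 import BeautifulSoup

markup = '<div><a href="http://example.com/">I linked to <i>example.com</i></a></div>'
soup = BeautifulSoup(markup, "html.parser")

whole = soup.prettify()    # the full document
part = soup.a.prettify()   # just the <a> subtree

print(whole)
print(part)

# One line per tag and per string, ignoring indentation.
part_lines = [line.strip() for line in part.strip().splitlines()]
whole_lines = [line.strip() for line in whole.strip().splitlines()]
```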
Non-pretty printing¶
If you just want a string, with no fancy formatting, you can call str() on a BeautifulSoup object, or on a Tag within it.
In Python 3, str() returns a Unicode string rather than UTF-8 encoded bytes. See Encodings for other options.
You can also call encode() to get a bytestring, and decode() to get Unicode.
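The three conversions can be compared side by side. This is a sketch on illustrative markup, assuming Python 3 and html.parser:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

as_text = str(soup)             # plain Python string, no added whitespace
tag_text = str(soup.i)          # works on any Tag as well
as_bytes = soup.encode("utf-8") # bytestring
as_unicode = soup.decode()      # Unicode string again

print(as_text)
print(tag_text)
print(type(as_bytes).__name__, type(as_unicode).__name__)
```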
Output formatters¶
If you give Beautiful Soup a document that contains HTML entities like “&ldquo;”, they'll be converted to Unicode characters.
If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back.
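A sketch of the entity round trip, assuming html.parser and an illustrative sentence containing the &ldquo; and &rdquo; entities:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")

# On parsing, the entities become Unicode curly quotes.
as_text = str(soup)

# On encoding, the quotes become UTF-8 bytes, not entities.
as_bytes = soup.encode("utf-8")

print(as_text)
print(as_bytes)
```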
By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into “&amp;”, “&lt;”, and “&gt;”, so that Beautiful Soup doesn't inadvertently generate invalid HTML or XML.
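The default escaping can be sketched with two illustrative snippets, one with a bare ampersand in text and one with a bare ampersand in an attribute value (html.parser assumed):

```python
from bs4 import BeautifulSoup

text_soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>", "html.parser")
link_soup = BeautifulSoup(
    '<a href="http://example.com/?foo=val1&bar=val2">A link</a>', "html.parser"
)

escaped_text = str(text_soup.p)
escaped_link = str(link_soup.a)

print(escaped_text)  # the bare & in the text is written as &amp;
print(escaped_link)  # the bare & in the href is written as &amp;
```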
You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode(). Beautiful Soup recognizes five possible values for formatter.
The default is formatter="minimal". Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML.
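A sketch of the default formatter on an illustrative French sentence (the markup and names are assumptions). Angle brackets are escaped, while the accented character is left as plain Unicode:

```python
from bs4 import BeautifulSoup

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, "html.parser")

# formatter="minimal" is the default, spelled out here for clarity.
minimal = soup.p.decode(formatter="minimal")
print(minimal)
```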
If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible.
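The same illustrative sentence with formatter="html", as a sketch; the accented character should come back as a named entity:

```python
from bs4 import BeautifulSoup

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, "html.parser")

as_entities = soup.p.decode(formatter="html")
print(as_entities)  # <p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>
```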
If you pass in formatter="html5", it's similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like “br”.
In addition, any attributes whose values are the empty string will become HTML-style boolean attributes.
(This behavior is new as of Beautiful Soup 4.10.0.)
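A sketch contrasting formatter="html" with formatter="html5", assuming Beautiful Soup 4.10.0 or later for the boolean-attribute behavior; the two tiny documents are illustrative:

```python
from bs4 import BeautifulSoup

br = BeautifulSoup("<br>", "html.parser").br
option = BeautifulSoup('<option selected=""></option>', "html.parser").option

# Void tag: "html" keeps the closing slash, "html5" omits it.
br_html = br.encode(formatter="html")
br_html5 = br.encode(formatter="html5")

# Empty-string attribute: "html5" writes it as a boolean attribute.
option_html = option.encode(formatter="html")
option_html5 = option.encode(formatter="html5")

print(br_html, br_html5)
print(option_html, option_html5)
```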
If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.
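To make the risk concrete, here is a small sketch that serializes the same paragraph with the default "minimal" formatter and with formatter=None. The markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# The parser turns the &lt; entity into a literal "<" inside the tree.
soup = BeautifulSoup("<p>a &lt; b</p>", "html.parser")

# The default formatter re-escapes the character on output.
escaped = soup.p.decode(formatter="minimal")
print(escaped)   # <p>a &lt; b</p>

# formatter=None writes the string untouched, leaving a bare "<" in the markup.
raw = soup.p.decode(formatter=None)
print(raw)       # <p>a < b</p>
```

The second result contains an unescaped "<" in text content, which is not valid markup, so formatter=None is best reserved for cases where you know the strings are already safe.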
If you need more sophisticated control over your output, you can use Beautiful Soup's Formatter class. A formatter can, for example, convert every string to uppercase, whether it occurs in a text node or in an attribute value.
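A sketch of such an uppercasing formatter follows. It assumes the HTMLFormatter class from bs4.formatter and passes the conversion function as entity_substitution; the function name uppercase and the sample markup are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

def uppercase(s):
    """Called for every text node and attribute value on output."""
    return s.upper()

formatter = HTMLFormatter(entity_substitution=uppercase)

link_soup = BeautifulSoup('<a href="http://example.com/">A link</a>', "html.parser")
shouted = link_soup.a.decode(formatter=formatter)
print(shouted)   # <a href="HTTP://EXAMPLE.COM/">A LINK</a>
```

Note that tag and attribute names are left alone; only strings and attribute values pass through the function.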
A formatter can also increase the indentation used when pretty-printing.
Subclassing HTMLFormatter or XMLFormatter will give you even more control over the output. For example, Beautiful Soup sorts the attributes in every tag by default.
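The default sorting is easy to observe with a tag whose attributes are written out of alphabetical order. This is a small sketch; the markup is made up for the purpose.

```python
from bs4 import BeautifulSoup

# Attributes appear in the source as z, m, a.
attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', "html.parser")

# On output, the default formatter emits them alphabetically.
sorted_output = attr_soup.p.encode()
print(sorted_output)   # b'<p a="3" m="2" z="1"></p>'
```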
To turn this off, you can override the Formatter.attributes() method in a subclass; this method controls which attributes are output and in what order. An override can also filter attributes, for example by dropping the attribute called "m" whenever it appears.
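One way to write such an override is sketched below. The class name UnsortedAttributes is illustrative; the method yields attributes in their original order and skips any attribute named "m".

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

class UnsortedAttributes(HTMLFormatter):
    def attributes(self, tag):
        # Yield (name, value) pairs in document order instead of sorted order.
        for name, value in tag.attrs.items():
            if name == "m":
                continue  # filter this attribute out of the output
            yield name, value

attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', "html.parser")
unsorted_output = attr_soup.p.encode(formatter=UnsortedAttributes())
print(unsorted_output)   # b'<p z="1" a="3"></p>'
```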
One last caveat: if you create a CData object, the text inside that object is always presented exactly as it appears, with no formatting. Beautiful Soup will still call your entity substitution function, in case you have written a custom function that counts all the strings in the document or something similar, but it will ignore the return value.
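A short sketch of the CData caveat: the "html" formatter would normally escape the "<" character, but the contents of a CData object are written out unchanged. The markup and string are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import CData

soup = BeautifulSoup("<a></a>", "html.parser")
soup.a.string = CData("one < three")

# The formatter is consulted, but its result is discarded for CData.
cdata_output = soup.a.decode(formatter="html")
print(cdata_output)   # <a><![CDATA[one < three]]></a>
```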
get_text()¶
If you want only the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.
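As a minimal sketch, get_text() can be called on the whole document or on a single tag. The markup below is a small sample chosen for illustration.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Text of the whole document, with tags removed and whitespace preserved.
all_text = soup.get_text()
print(repr(all_text))   # '\nI linked to example.com\n'

# Text beneath a single tag.
i_text = soup.i.get_text()
print(repr(i_text))     # 'example.com'
```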
You can specify a string to be used to join the bits of text together.
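For instance, passing "|" as the first argument places that separator between each text node. This sketch uses a small illustrative sample document.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# The separator is inserted between the three text nodes.
joined = soup.get_text("|")
print(repr(joined))   # '\nI linked to |example.com|\n'
```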
You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text.
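A sketch of the strip=True option, again on a small illustrative document. Each text node is stripped before joining, and nodes that become empty are dropped.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Whitespace around each piece is removed; the whitespace-only tail disappears.
stripped = soup.get_text("|", strip=True)
print(repr(stripped))   # 'I linked to|example.com'
```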
But at that point you might want to use the .stripped_strings generator instead, and process the text yourself.
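The generator yields each non-empty string with surrounding whitespace removed, so you can collect the pieces into a list and handle them however you like. The markup here is an illustrative sample.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Collect the stripped strings instead of joining them into one string.
pieces = [text for text in soup.stripped_strings]
print(pieces)   # ['I linked to', 'example.com']
```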
As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of