How to extract an HTML table in Python

HTML tables are a very common format for displaying information. When building scrapers, you often need to extract data from an HTML table and turn it into a different structured format, for example JSON, CSV, or Excel. In this article, we will talk about extracting data from an HTML table in Python and Scrapy.

But before we start, here are a few articles to brush up on your web scraping knowledge:

  1. What is web scraping
  2. What is the difference between web scraping and web crawling

Now that we’re clear on the basics, let’s get started!

The HTML table element represents tabular data, presenting information in a two-dimensional format made up of rows and columns. A table starts with the <table> tag, optionally followed by <thead> containing the header, <tbody> containing the body of the table, and <tfoot> containing the footer. Within the table, rows are marked by the <tr> tag, and inside them are cells marked with the <td> or <th> tag.
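
For reference, here is a minimal sketch of that structure (the cell contents are made up for illustration):

<table>
  <thead>
    <tr><th>Name</th><th>Price</th></tr>   <!-- header row with <th> cells -->
  </thead>
  <tbody>
    <tr><td>A Light in the Attic</td><td>£51.77</td></tr> <!-- body rows with <td> cells -->
  </tbody>
  <tfoot>
    <tr><td>Total</td><td>£51.77</td></tr> <!-- optional footer -->
  </tfoot>
</table>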

As our example, we will scrape a sample page from books.toscrape.com, an educational website maintained by Zyte for testing purposes: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

The table contains UPC, price, tax, and availability information.


To extract a table from HTML, you first need to open your browser's developer tools to see what the HTML looks like and to verify that it really is a table and not some other element. You can open the developer tools with the F12 key, switch to the “Elements” tab, and highlight the element you’re interested in. The HTML source of this table looks like this:

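Reconstructed from the values extracted later in this article, the markup is roughly the following; the live page may carry extra attributes, such as CSS classes:

<table>
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>£51.77</td></tr>
  <tr><th>Price (incl. tax)</th><td>£51.77</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
  <tr><th>Availability</th><td>In stock (22 available)</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>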

Now that you have verified that your element is indeed a table and you have seen what it looks like, you can extract this data into your expected format.

To achieve this, you first need to download the page and then parse the HTML. For downloading, you can use different tools, such as python-requests or Scrapy.

Parse table using requests and Beautiful Soup

Beautiful Soup is a Python package for parsing HTML, and python-requests is a popular and simple HTTP client library.

First, you download the page using requests by issuing an HTTP GET request. The response method raise_for_status() checks the response status to make sure it is a 200 code and not an error response (a 4xx or 5xx code); if there is something wrong with the response, it raises an exception. If all is good, you return the response text.

import requests
from bs4 import BeautifulSoup

def download_page(url):
    # Issue an HTTP GET request for the page
    response = requests.get(url)
    # Raise an exception on 4xx/5xx error responses
    response.raise_for_status()
    return response.text

Then you parse the table with BeautifulSoup, extracting the text content from each cell and storing the result in a JSON file:

import json

def main(url):
    content = download_page(url)
    soup = BeautifulSoup(content, 'html.parser')
    result = {}
    # Each row pairs a header cell (<th>) with a value cell (<td>)
    for row in soup.table.find_all('tr'):
        row_header = row.th.get_text()
        row_cell = row.td.get_text()
        result[row_header] = row_cell
    # Store the extracted data as JSON
    with open('book_table.json', 'w') as storage_file:
        storage_file.write(json.dumps(result))

Full sample:

import json
import requests
from bs4 import BeautifulSoup

def download_page(url):
    # Download the page, failing loudly on error responses
    response = requests.get(url)
    response.raise_for_status()
    return response.text

def main(url):
    content = download_page(url)
    soup = BeautifulSoup(content, 'html.parser')
    result = {}
    # Map each row's header cell to its value cell
    for row in soup.table.find_all('tr'):
        row_header = row.th.get_text()
        row_cell = row.td.get_text()
        result[row_header] = row_cell
    # Store the extracted data as JSON
    with open('book_table.json', 'w') as storage_file:
        storage_file.write(json.dumps(result))

if __name__ == "__main__":
    main("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

Parse HTML table using Scrapy 

You can scrape tables from websites using python-requests, and it might often work well for your needs, but in some cases you will need more powerful tools. For example, let’s say you have a thousand book pages with different tables, and you need to parse them fast. In this case, you may need to make requests concurrently, and you may need to utilize an asynchronous framework that won’t block the execution thread for each request.

You may also need to handle failed responses. Let’s say the site is temporarily down, and you need to retry your request if the response status is 503. If you’d like to do this with python-requests, you will have to wrap the download call in error-handling code, check the response status, and download the page again if an error happens. In Scrapy, you don’t have to write any code for this, because it is already handled by the retry downloader middleware: it will retry failed responses for you automatically, without any action needed on your side.
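
The retry behavior (and the concurrency mentioned above) can be tuned through Scrapy's built-in settings. A minimal sketch, showing the defaults of the retry middleware at the time of writing:

# settings.py (or custom_settings on the spider)
RETRY_ENABLED = True      # retry middleware is on by default
RETRY_TIMES = 2           # retries on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # statuses to retry, including 503
CONCURRENT_REQUESTS = 16  # how many requests Scrapy runs in parallel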

To extract table data with Scrapy, you need to download and install Scrapy. Once you have Scrapy installed, you create a simple spider:

scrapy genspider books books.toscrape.com

Then you edit the spider code, placing the HTML parsing logic inside the spider's parse method. The Scrapy response exposes a Selector object that allows you to extract data from the response content by calling its css() or xpath() methods.

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['toscrape.com']
    start_urls = ['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html']

    def parse(self, response):
        table = response.css('table')
        result = {}
        # Pair each row's header cell (<th>) with its value cell (<td>)
        for tr in table.css('tr'):
            row_header = tr.css('th::text').get()
            row_value = tr.css('td::text').get()
            result[row_header] = row_value
        yield result

You then run your spider using the runspider command, passing the -o argument to tell Scrapy to place the extracted data into an output.json file.

scrapy runspider books.py -o output.json

You will see quite a lot of log output, because Scrapy starts all of its built-in components: the ones handling download timeouts, the referrer header, redirects, cookies, and so on. In the output, you will also see your extracted item; it will look like this:

2021-11-25 09:16:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html> (referer: None)
2021-11-25 09:16:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html>
{'UPC': 'a897fe39b1053632', 'Product Type': 'Books', 'Price (excl. tax)': '£51.77', 'Price (incl. tax)': '£51.77', 'Tax': '£0.00', 'Availability': 'In stock (22 available)', 'Number of reviews': '0'}
2021-11-25 09:16:20 [scrapy.core.engine] INFO: Closing spider (finished)

Scrapy will create an output.json file in the directory where you run your spider and export your extracted data into it in JSON format.

Using Python Pandas to parse HTML tables

So far, we have extracted a simple HTML table, but tables in the real world are usually more complex. You may need to handle different layouts; occasionally there will be several tables available on a page, and you will need to write a selector that matches the right one. You may not want to write parser code for each table you see. For this, you can use different Python libraries that help you extract content from HTML tables.

One such method is available in the popular Python library pandas: it is called read_html(). The method accepts numerous arguments that allow you to customize how the table is parsed.

You can call this method with a URL, a file, or an actual HTML string. For example, you might do it like this:

>>> import pandas
>>> tables_on_page = pandas.read_html("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")
>>> table = tables_on_page[0]
>>> table.to_json("table.json", index=False, orient='table')

In the output, you can see that pandas generated not only the table data but also a schema. read_html() returns a list of pandas DataFrames, and it allows you to easily export each DataFrame to a preferred format such as CSV, XML, Excel, or JSON.
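
When several tables are present, read_html() can also select the right one for you via its match and attrs arguments. A short sketch; match filters tables by their text content, and the attrs value here is an assumption for illustration:

>>> url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
>>> # Keep only tables whose text matches this string or regex
>>> tables = pandas.read_html(url, match="UPC")
>>> # Or keep only tables carrying the given HTML attributes
>>> tables = pandas.read_html(url, attrs={"class": "table"})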

For a simple use case, this might be the easiest option for you, and you can also combine it with Scrapy. You can import pandas in a Scrapy callback and call read_html() with the response text. This allows you to have a powerful generic spider that handles different tables and extracts them from different types of websites.
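
A minimal sketch of that combination, assuming that yielding each table as a plain dict is enough for your export format (the spider name is arbitrary):

import pandas
import scrapy

class TablesSpider(scrapy.Spider):
    name = 'tables'
    start_urls = ['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html']

    def parse(self, response):
        # pandas finds and parses every table in the downloaded HTML
        for table in pandas.read_html(response.text):
            # Convert the DataFrame to a dict so Scrapy can export it
            yield table.to_dict()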

That’s it! Extracting an HTML table from a web page is that simple!

But if you’d like to leave the heavy-lifting of data extraction to experts and get simple access to data in your preferred format, we’re here for you. Just give us a shout!

Paweł is a Technical Team Lead in the Delivery Department at Zyte. He has several years of experience developing advanced crawling solutions using the Scrapy framework, contributes to open source, is one of the authors of the ScrapyRT framework, and has contributed to Splash.