How do you scrape data from local HTML files using Python?

The BeautifulSoup module in Python lets us scrape data from local HTML files. Web pages are sometimes saved locally for offline use, and their data may need to be extracted later, occasionally from several stored HTML files at once. HTML files usually contain tags such as <h1>, <h2>, … <p>, <div>, and so on. With BeautifulSoup we can scrape their contents and pull out the details we need.

Installation

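Install BeautifulSoup by running the command below in a terminal. Examples that pass 'lxml' as the parser also require the separate lxml package (pip install lxml).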

pip install beautifulsoup4

Getting Started

If an HTML file is stored locally and we need to scrape its content with BeautifulSoup, lxml is a good choice of parser, since it is built for parsing XML and HTML. It supports both one-step parsing and step-by-step parsing.

The prettify() method in BeautifulSoup shows the tags and how they are nested.
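As a minimal sketch of what prettify() does, the snippet below parses a small inline HTML string (made up for illustration) with Python's built-in 'html.parser' and prints the indented result. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# A tiny made-up document; any HTML string is handled the same way.
html = "<div><p>Hello <b>world</b></p></div>"

soup = BeautifulSoup(html, "html.parser")
pretty = soup.prettify()

# Each tag and each piece of text is placed on its own line,
# indented according to how deeply it is nested.
print(pretty)
```

The flat one-line input comes back as a multi-line string, which makes the nesting of <div>, <p> and <b> easy to read.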

Example: Let’s create a sample HTML file.

Python3

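import urllib.request

# Placeholder URL (an assumption; the original page address is not given).
url = "https://example.com"

# Download the page and keep the raw bytes of the response.
with urllib.request.urlopen(url) as webPageResponse:
    outputHtml = webPageResponse.read()

# Write the bytes unchanged so the saved file is valid HTML
# (printing a bytes object would store its b'...' representation instead).
with open('samplehtml.html', 'wb') as f:
    f.write(outputHtml)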

Now use the prettify() method to view the tags and content in a more readable form.

Python3

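from bs4 import BeautifulSoup

# Read the saved file (assuming it is UTF-8); "with" closes it afterwards.
with open("samplehtml.html", "r", encoding="utf-8") as HTMLFileToBeOpened:
    contents = HTMLFileToBeOpened.read()

# Parse with lxml (needs the lxml package) and print the indented <body>.
beautifulSoupText = BeautifulSoup(contents, 'lxml')
print(beautifulSoupText.body.prettify())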

Output :

https://media.geeksforgeeks.org/wp-content/uploads/20210419123712/gfg-priyaraj-article-page-scraped-to-offline-mode-html-and-printing-in-console.mp4

This is how the HTML data can be read. The next examples perform some operations on it to extract useful information.

Example 1:

We can use the find() method when we know which tag to look for, but HTML content changes dynamically and we may not know the exact tag names in advance. In that case, find_all(True) (findAll is the older name for the same method) returns every tag, so we can read the tag names first and then do any kind of processing. For example, we can print each tag name together with the length of the tag's text.

Python3

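from bs4 import BeautifulSoup

# Read the saved file (assuming it is UTF-8); "with" closes it afterwards.
with open("samplehtml.html", "r", encoding="utf-8") as HTMLFileToBeOpened:
    contents = HTMLFileToBeOpened.read()

beautifulSoupText = BeautifulSoup(contents, 'lxml')

# find_all(True) matches every tag in the document.
# len(tag.text) measures this tag's own text; find(tag.name) would
# always return the first tag with that name instead.
for tag in beautifulSoupText.find_all(True):
    print(tag.name, " : ", len(tag.text))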

Output:

https://media.geeksforgeeks.org/wp-content/uploads/20210419124258/gfg-scraped-data-and-get-the-tag-names-and-length.mp4
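To make the difference between find() and find_all(True) concrete, here is a small sketch on an inline HTML string (made up for illustration, parsed with the built-in 'html.parser'). It also tallies how often each tag name occurs; the Counter-based summary is an addition for illustration, not part of the example above.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Made-up sample document.
html = """
<html><body>
  <h1>Title</h1>
  <p>First paragraph</p>
  <p>Second, longer paragraph</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match; find_all(True) returns every tag.
first_p = soup.find("p")
tag_counts = Counter(tag.name for tag in soup.find_all(True))

print(first_p.get_text())
for name, count in tag_counts.items():
    print(name, ":", count)
```

Counting names first is a cheap way to learn the structure of an unfamiliar file before deciding which tags to extract.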

Example 2:

Now, instead of scraping one HTML file, we want to scrape all the HTML files in a directory. This is useful when, for example, a directory is filled with downloaded pages every day and the scraping has to run as a batch process.

We can use the functions of the "os" module for this. The example takes all HTML files in the current directory.

The task is to scrape every HTML file. The code below scrapes the files in the folder one by one and prints the tag names and text lengths for each file, as shown in the attached video.

Python3

import os
from bs4 import BeautifulSoup

# Scan the current working directory.
directory = os.getcwd()

for filename in os.listdir(directory):
    # Process only HTML files.
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        print("Current file name ..", os.path.abspath(fname))

        with open(fname, 'r') as file:
            beautifulSoupText = BeautifulSoup(file.read(), 'html.parser')

        # Print every tag name with the length of that tag's text.
        for tag in beautifulSoupText.find_all(True):
            print(tag.name, " : ", len(tag.text))

Output:

https://media.geeksforgeeks.org/wp-content/uploads/20210419125444/gfg-scraping-multiple-files.mp4
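For a batch job it is often handier to collect results than to print them. The sketch below is one possible variant using pathlib; the function name count_tags_in_directory and the demo file names are hypothetical, and the built-in 'html.parser' is assumed as the parser.

```python
import tempfile
from pathlib import Path
from bs4 import BeautifulSoup


def count_tags_in_directory(directory):
    """Return {file name: number of tags} for every .html file in directory."""
    results = {}
    for path in sorted(Path(directory).glob("*.html")):
        # errors="replace" keeps the batch running if a file has odd bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        soup = BeautifulSoup(text, "html.parser")
        results[path.name] = len(soup.find_all(True))
    return results


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.html").write_text("<p>one</p><p>two</p>", encoding="utf-8")
    Path(tmp, "b.html").write_text("<div><span>x</span></div>", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("ignored", encoding="utf-8")
    summary = count_tags_in_directory(tmp)

print(summary)
```

Returning a dictionary keeps the scraping step separate from reporting, so the same function can feed a log, a CSV file, or a test.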

How do I scrape HTML data in Python?

To extract data using web scraping with Python, follow these basic steps:

1. Find the URL that you want to scrape.

2. Inspect the page.

3. Find the data you want to extract.

4. Write the code.

5. Run the code and extract the data.

6. Store the data in the required format.
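The steps above can be sketched on a small scale as follows. This is only an illustration under stated assumptions: an inline HTML string stands in for the downloaded page, the data of interest is taken to be each <h2> heading with the paragraph that follows it, and the result is stored as CSV in an in-memory buffer.

```python
import csv
import io
from bs4 import BeautifulSoup

# Stand-in for a downloaded page (made up for illustration).
html = """
<html><body>
  <h2>Installation</h2><p>Run pip.</p>
  <h2>Getting Started</h2><p>Parse a file.</p>
</body></html>
"""

# "Find the data": every <h2> heading and the first <p> after it.
soup = BeautifulSoup(html, "html.parser")
rows = []
for heading in soup.find_all("h2"):
    paragraph = heading.find_next("p")
    rows.append([heading.get_text(strip=True), paragraph.get_text(strip=True)])

# "Store the data": write CSV to an in-memory buffer.
# To write to disk, pass a file opened with newline="" instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["heading", "text"])
writer.writerows(rows)
print(buffer.getvalue())
```

For a locally stored page, the URL step is simply replaced by reading the file from disk.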

How do I read a local HTML file in Python?

Use codecs.open() to open an HTML file within Python. Call codecs.open(filename, mode, encoding) with filename as the name of the HTML file, mode as "r", and encoding as "utf-8" to open an HTML file in read-only mode.
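A minimal sketch of that call, assuming a small UTF-8 file created on the spot for the demonstration (the file name sample.html is made up). It also reads the file with the built-in open(), which accepts an encoding argument in Python 3 and gives the same text.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small HTML file to read back (made up for illustration).
    filename = os.path.join(tmp_dir, "sample.html")
    with open(filename, "w", encoding="utf-8") as f:
        f.write("<html><body><p>Café</p></body></html>")

    # codecs.open(filename, mode, encoding), as described above.
    with codecs.open(filename, "r", "utf-8") as f:
        via_codecs = f.read()

    # The built-in open() takes the encoding as a keyword argument.
    with open(filename, "r", encoding="utf-8") as f:
        via_open = f.read()

print(via_codecs == via_open)
```

Stating the encoding explicitly avoids depending on the platform default, which matters for pages that contain non-ASCII characters.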

How do you scrape data from an HTML file?

The process has roughly the following steps:

1. Inspect the HTML of the website that you want to crawl.

2. Access the URL of the website using code and download all the HTML content on the page.

3. Format the downloaded content into a readable format.

4. Extract the useful information and save it in a structured format.
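As an illustration of the final extraction step, the sketch below turns an HTML table into a list of records and serialises it as JSON. The table content is made up, and the approach assumes a simple table whose first row holds <th> header cells.

```python
import json
from bs4 import BeautifulSoup

# Made-up table standing in for part of a downloaded page.
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.90</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# The first row holds the column names; the rest hold the values.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]

print(json.dumps(records, indent=2))
```

The values stay as strings here; converting prices to numbers would be a separate cleaning step.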

How do you process HTML files in Python?

BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a tree of Python objects, such as Tag, NavigableString, or Comment.
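The object types mentioned above can be observed directly. This short sketch parses a made-up one-line document and labels each direct child of the <p> element by its type.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = "<p>Hello<!-- a note --><b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")

# Walk the direct children of <p> and label each by its type.
# Comment is a subclass of NavigableString, so it is checked first.
kinds = []
for child in soup.p.children:
    if isinstance(child, Comment):
        kinds.append("Comment")
    elif isinstance(child, Tag):
        kinds.append("Tag")
    elif isinstance(child, NavigableString):
        kinds.append("NavigableString")

print(kinds)
```

Knowing which type a node has tells you which operations apply: a Tag can be searched and has attributes, while strings and comments only carry text.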