Requests-HTML: HTML Parsing for Humans (writing Python 3)!¶

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. When using this library you automatically get:

- Full JavaScript support!
- CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint of heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection–pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
- Async support.
Installation¶

$ pipenv install requests-html
✨🍰✨

Only Python 3.6 is supported.

Tutorial & Usage¶

Make a GET request to python.org, using Requests:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')

Or try our async session:

>>> from requests_html import AsyncHTMLSession
>>> asession = AsyncHTMLSession()
>>> r = await asession.get('https://python.org/')

Async is especially handy when fetching several sites at the same time:

>>> from requests_html import AsyncHTMLSession
>>> asession = AsyncHTMLSession()
>>> async def get_pythonorg():
...     r = await asession.get('https://python.org/')
>>> async def get_reddit():
...     r = await asession.get('https://reddit.com/')
>>> async def get_google():
...     r = await asession.get('https://google.com/')
>>> asession.run(get_pythonorg, get_reddit, get_google)

Grab a list of all links on the page, as–is (anchors excluded):

>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python',
'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/', 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/', 'http://wiki.python.org/moin/Languages', '/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/', '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/', 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss', '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org', '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/', '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government', 'http://python.org/dev/peps/', 'https://docs.python.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/', '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/', 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/', '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/', 'http://www.facebook.com/pythonlang?fref=ts', 
'/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/', '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/', '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', '//docs.python.org/3/tutorial/controlflow.html', '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org', 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', '/downloads/mac-osx/', '/about/success/#business', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'} Grab a list of all links on the page, in absolute form (anchors excluded): >>> r.html.absolute_links {'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/', 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/', 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps', 
'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 'http://wiki.python.org/moin/Languages', 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development', 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av', 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/', 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/', 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives', 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/', 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/', 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/', 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/', 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 
'https://docs.python.org', 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', 'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/', 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org', 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 'https://www.python.org/accounts/signup/', 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/', 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/', 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/', 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/', 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/', 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/', 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/', 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 
'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}

Select an Element with a CSS Selector:

>>> about = r.html.find('#about', first=True)

Grab an Element's text contents:

>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure

Introspect an Element's attributes:

>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}

Render out an Element's HTML:

>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'

Select Elements within an Element:

>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]

Search for links within an element:

>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}

Search for text on the page:

>>>
r.html.search('Python is a {} language')[0]
'programming'

More complex CSS Selector example (copied from Chrome dev tools):

>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.

XPath is also supported (learn more):

>>> r.html.xpath('a')
[<Element 'a' class='btn' href='https://help.github.com/articles/supported-browsers'>]

You can also select only elements containing certain text:

>>> r = session.get('http://python-requests.org/')
>>> r.html.find('a', containing='kenneth')
[<Element 'a' href='http://kennethreitz.com/pages/open-projects.html'>, <Element 'a' href='http://kennethreitz.org/'>, <Element 'a' href='https://twitter.com/kennethreitz' class=('twitter-follow-button',) data-show-count='false'>, <Element 'a' class=('reference', 'internal') href='dev/contributing/#kenneth-reitz-s-code-style'>]

JavaScript Support¶

Let's grab some text that's rendered by JavaScript:

>>> r = session.get('http://python-requests.org/')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'

Or you can do this async also:

>>> r = await asession.get('http://python-requests.org/')
>>> await r.html.arender()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'

Note, the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.

Using without Requests¶

You can also use this library without Requests:

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}

You can also render JavaScript pages without Requests:

# ^^ proceeding from above ^^
>>> script = """
        () => {
            return {
                width: document.documentElement.clientWidth,
                height: document.documentElement.clientHeight,
                deviceScaleFactor: window.devicePixelRatio,
            }
        }
"""
>>> val = html.render(script=script, reload=False)
>>> print(val)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}
>>> print(html.html)
<html><head></head><body><a href="https://httpbin.org"></a></body></html>

To use arender, just pass async_=True to HTML:

# ^^ using above script ^^
>>> html = HTML(html=doc, async_=True)
>>> val = await html.arender(script=script, reload=False)
>>> print(val)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}

Main Classes¶

These classes are the main interface to requests_html.

class requests_html.HTML(*, session: Union[HTMLSession, AsyncHTMLSession] = None, url: str = 'https://example.org/', html: Union[str, bytes], default_encoding: str = 'utf-8', async_: bool = False)[source]¶
An HTML document, ready for parsing.
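The links / absolute_links split shown earlier is plain URL resolution against the page's URL. A minimal stdlib sketch of the idea (the page URL and hrefs are made-up examples, not the library's internals):

```python
from urllib.parse import urljoin

# Hypothetical page URL plus a few as-is hrefs, like entries of r.html.links.
page_url = "https://www.python.org/"
hrefs = {"/about/apps/", "//docs.python.org/3/tutorial/", "https://kivy.org/"}

# Resolving each href against the page URL is essentially what
# absolute_links does for every entry in links.
absolute = {urljoin(page_url, href) for href in hrefs}
print(sorted(absolute))
```

Note that urljoin handles all three href shapes: relative paths, scheme-relative `//…` URLs, and already-absolute URLs.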
absolute_links¶
All found links on page, in absolute form (learn more).

arender(retries: int = 8, script: str = None, wait: float = 0.2, scrolldown=False, sleep: int = 0, reload: bool = True, timeout: Union[float, int] = 8.0, keep_page: bool = False, cookies: list = [{}], send_cookies_session: bool = False)[source]¶
Async version of render. Takes the same parameters.

base_url¶
The base URL for the page. Supports the <base> tag (learn more).

encoding¶
The encoding string to be used, extracted from the HTML and HTMLResponse headers.

find(selector: str = '*', *, containing: Union[str, List[str]] = None, clean: bool = False, first: bool = False, _encoding: str = None) → Union[List[requests_html.Element], requests_html.Element]¶
Given a CSS Selector, returns a list of Element objects or a single one.
Example CSS Selectors:

- a
- a.someClass
- a#someID
- a[target=_blank]

See W3School's CSS Selectors Reference for more details. If first is True, only returns the first Element found.

full_text¶
The full text content (including links) of the Element or HTML.

html¶
Unicode representation of the HTML content (learn more).

links¶
All found links on page, in as–is form.

lxml¶
lxml representation of the Element or HTML.

next(fetch: bool = False, next_symbol: List[str] = ['next', 'more', 'older']) → Union[requests_html.HTML, List[str]][source]¶
Attempts to find the next page, if there is one. If fetch is True, returns the HTML object of the next page. If fetch is False (the default), simply returns the next URL.

pq¶
PyQuery representation of the Element or HTML.

raw_html¶
Bytes representation of the HTML content (learn more).

render(retries: int = 8, script: str = None, wait: float = 0.2, scrolldown=False, sleep: int = 0, reload: bool = True, timeout: Union[float, int] = 8.0, keep_page: bool = False, cookies: list = [{}], send_cookies_session: bool = False)[source]¶
Reloads the response in Chromium, and replaces HTML content with an updated version, with JavaScript executed.
If scrolldown is specified, the page will scroll down the specified number of times, after waiting the specified amount of time (e.g. scrolldown=10, sleep=1). If just sleep is provided, the rendering will wait n seconds before returning. If script is specified, it will execute the provided JavaScript at runtime. Example:

script = """
    () => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    }
"""

Returns the return value of the executed script, if any is provided:

>>> r.html.render(script=script)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}

Warning: the first time you run this method, it will download Chromium into your home directory (~/.pyppeteer).

search(template: str) → parse.Result¶
Search the Element for the given Parse template.

search_all(template: str) → Union[List[parse.Result], parse.Result]¶
Search the Element (multiple times) for the given parse template.

text¶
The text content of the Element or HTML.

xpath(selector: str, *, clean: bool = False, first: bool = False, _encoding: str = None) → Union[List[str], List[requests_html.Element], str, requests_html.Element]¶
Given an XPath selector, returns a list of Element objects or a single one.

If a sub-selector is specified (e.g. //a/@href), a simple list of results is returned. See W3School's XPath Examples for more details. If first is True, only returns the first Element found.

class requests_html.Element(*, element, url: str, default_encoding: str = None)[source]¶
An element of HTML.
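To illustrate the shape of Element.attrs, here is a stdlib-only sketch that collects the attribute dictionary of the first tag with a given id. This is an illustrative stand-in, not the library's code: requests-html parses with lxml and additionally splits multi-valued attributes such as class into tuples, as the about.attrs example earlier shows.

```python
from html.parser import HTMLParser

class AttrGrabber(HTMLParser):
    """Record the attribute dict of the first tag carrying the wanted id."""
    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        attr_dict = dict(attrs)
        if self.attrs is None and attr_dict.get("id") == self.wanted_id:
            self.attrs = attr_dict

grabber = AttrGrabber("about")
grabber.feed('<li aria-haspopup="true" class="tier-1 element-1" id="about">About</li>')
print(grabber.attrs)
# → {'aria-haspopup': 'true', 'class': 'tier-1 element-1', 'id': 'about'}
```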
absolute_links¶
All found links on page, in absolute form (learn more).

attrs¶
Returns a dictionary of the attributes of the Element (learn more).

base_url¶
The base URL for the page. Supports the <base> tag (learn more).

encoding¶
The encoding string to be used, extracted from the HTML and HTMLResponse headers.

find(selector: str = '*', *, containing: Union[str, List[str]] = None, clean: bool = False, first: bool = False, _encoding: str = None) → Union[List[requests_html.Element], requests_html.Element]¶
Given a CSS Selector, returns a list of Element objects or a single one.

Example CSS Selectors:

- a
- a.someClass
- a#someID
- a[target=_blank]

See W3School's CSS Selectors Reference for more details. If first is True, only returns the first Element found.

full_text¶
The full text content (including links) of the Element or HTML.

html¶
Unicode representation of the HTML content (learn more).

links¶
All found links on page, in as–is form.

lxml¶
lxml representation of the Element or HTML.

pq¶
PyQuery representation of the Element or HTML.

raw_html¶
Bytes representation of the HTML content (learn more).

search(template: str) → parse.Result¶
Search the Element for the given Parse template.

search_all(template: str) → Union[List[parse.Result], parse.Result]¶
Search the Element (multiple times) for the given parse template.

text¶
The text content of the Element or HTML.

xpath(selector: str, *, clean: bool = False, first: bool = False, _encoding: str = None) → Union[List[str], List[requests_html.Element], str, requests_html.Element]¶
Given an XPath selector, returns a list of Element objects or a single one.

If a sub-selector is specified (e.g. //a/@href), a simple list of results is returned. See W3School's XPath Examples for more details. If first is True, only returns the first Element found.

Utility Functions¶

requests_html.user_agent(style=None) → str[source]¶
Returns an apparently legit user-agent, if not requested one of a specific style. Defaults to a Chrome-style User-Agent.

HTML Sessions¶

These sessions are for making HTTP requests:

class requests_html.HTMLSession(**kwargs)[source]¶
close()[source]¶
If a browser was created, close it first.

delete(url, **kwargs)¶
Sends a DELETE request. Returns Response object.

get(url, **kwargs)¶
Sends a GET request. Returns Response object.

get_adapter(url)¶
Returns the appropriate connection adapter for the given URL.

get_redirect_target(resp)¶
Receives a Response. Returns a redirect URI or None.

head(url, **kwargs)¶
Sends a HEAD request. Returns Response object.

merge_environment_settings(url, proxies, stream, verify, cert)¶
Check the environment and merge it with some settings.
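merge_environment_settings combines per-call arguments with environment-derived defaults, with the explicit arguments winning. A simplified sketch of that idea (the real method in requests also honours trust_env, NO_PROXY, and certificate settings; the function name and environment values here are illustrative):

```python
import os

def merge_env_settings(explicit_proxies, explicit_verify):
    """Explicit values win; otherwise fall back to the environment."""
    env_proxies = {}
    for scheme in ("http", "https"):
        value = os.environ.get(scheme + "_proxy")
        if value:
            env_proxies[scheme] = value
    # Explicit proxies override the environment ones key by key.
    proxies = {**env_proxies, **(explicit_proxies or {})}
    verify = explicit_verify
    if verify is None:
        verify = os.environ.get("REQUESTS_CA_BUNDLE", True)
    return proxies, verify

os.environ["http_proxy"] = "http://envproxy.local:3128"   # simulated environment
proxies, verify = merge_env_settings({"https": "http://explicit.local:8080"}, None)
print(proxies)
# → {'http': 'http://envproxy.local:3128', 'https': 'http://explicit.local:8080'}
```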
mount(prefix, adapter)¶
Registers a connection adapter to a prefix. Adapters are sorted in descending order by prefix length.

options(url, **kwargs)¶
Sends an OPTIONS request. Returns Response object.

patch(url, data=None, **kwargs)¶
Sends a PATCH request. Returns Response object.

post(url, data=None, json=None, **kwargs)¶
Sends a POST request. Returns Response object.

prepare_request(request)¶
Constructs a PreparedRequest for transmission and returns it.

put(url, data=None, **kwargs)¶
Sends a PUT request. Returns Response object.

rebuild_auth(prepared_request, response)¶
When being redirected we may want to strip authentication from the request to avoid leaking credentials. This method intelligently removes and reapplies authentication where possible to avoid credential loss.

rebuild_method(prepared_request, response)¶
When being redirected we may want to change the method of the request based on certain specs or browser behavior.

rebuild_proxies(prepared_request, proxies)¶
This method re-evaluates the proxy configuration by considering the environment variables. If we are redirected to a URL covered by NO_PROXY, we strip the proxy configuration. Otherwise, we set missing proxy keys for this URL (in case they were stripped by a previous redirect). This method also replaces the Proxy-Authorization header where necessary.
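The NO_PROXY part of rebuild_proxies can be sketched as follows (a simplified illustration, not requests' actual implementation, which also handles ports, IP addresses, and the Proxy-Authorization header):

```python
from urllib.parse import urlparse

def strip_proxies_for(url, proxies, no_proxy_hosts):
    """Drop the proxy configuration when the URL's host is covered by NO_PROXY."""
    host = urlparse(url).hostname or ""
    if any(host == h or host.endswith("." + h) for h in no_proxy_hosts):
        return {}  # direct connection wins
    return dict(proxies)

proxies = {"http": "http://proxy.local:8080", "https": "http://proxy.local:8080"}
print(strip_proxies_for("https://internal.example.com/x", proxies, ["example.com"]))
# → {}
print(strip_proxies_for("https://python.org/", proxies, ["example.com"]))
# → {'http': 'http://proxy.local:8080', 'https': 'http://proxy.local:8080'}
```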
request(method, url, params=None, data=None, headers=None, cookies=None, files=None, auth=None, timeout=None, allow_redirects=True, proxies=None, hooks=None, stream=None, verify=None, cert=None, json=None)¶
Constructs a Request, prepares it and sends it. Returns Response object.

resolve_redirects(resp, req, stream=False, timeout=None, verify=True, cert=None, proxies=None, yield_requests=False, **adapter_kwargs)¶
Receives a Response. Returns a generator of Responses or Requests.

response_hook(response, **kwargs) → requests_html.HTMLResponse¶
Change response encoding and replace it with an HTMLResponse.

send(request, **kwargs)¶
Send a given PreparedRequest.
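prepare_request, listed above, merges session-level settings with per-request ones before transmission. The header-merging rule can be sketched like this (a simplified take on requests' merge_setting behaviour, where an explicit None deletes a session default; requests actually uses a case-insensitive dict for headers, which this sketch ignores):

```python
def merge_headers(session_headers, request_headers):
    """Request-level headers win; an explicit None removes a session default."""
    merged = dict(session_headers)
    for key, value in request_headers.items():
        if value is None:
            merged.pop(key, None)
        else:
            merged[key] = value
    return merged

session_headers = {"User-Agent": "my-session/1.0", "Accept": "*/*"}
request_headers = {"Accept": "application/json", "User-Agent": None}
print(merge_headers(session_headers, request_headers))
# → {'Accept': 'application/json'}
```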
should_strip_auth(old_url, new_url)¶
Decide whether Authorization header should be removed when redirecting.

class requests_html.AsyncHTMLSession(loop=None, workers=None, mock_browser: bool = True, *args, **kwargs)[source]¶
An async consumable session.

close()[source]¶
If a browser was created, close it first.

delete(url, **kwargs)¶
Sends a DELETE request. Returns Response object.

get(url, **kwargs)¶
Sends a GET request. Returns Response object.

get_adapter(url)¶
Returns the appropriate connection adapter for the given URL.

get_redirect_target(resp)¶
Receives a Response. Returns a redirect URI or None.

head(url, **kwargs)¶
Sends a HEAD request. Returns Response object.

merge_environment_settings(url, proxies, stream, verify, cert)¶
Check the environment and merge it with some settings.

mount(prefix, adapter)¶
Registers a connection adapter to a prefix. Adapters are sorted in descending order by prefix length.

options(url, **kwargs)¶
Sends an OPTIONS request. Returns Response object.

patch(url, data=None, **kwargs)¶
Sends a PATCH request. Returns Response object.

post(url, data=None, json=None, **kwargs)¶
Sends a POST request. Returns Response object.

prepare_request(request)¶
Constructs a PreparedRequest for transmission and returns it.

put(url, data=None, **kwargs)¶
Sends a PUT request. Returns Response object.

rebuild_auth(prepared_request, response)¶
When being redirected we may want to strip authentication from the request to avoid leaking credentials. This method intelligently removes and reapplies authentication where possible to avoid credential loss.

rebuild_method(prepared_request, response)¶
When being redirected we may want to change the method of the request based on certain specs or browser behavior.

rebuild_proxies(prepared_request, proxies)¶
This method re-evaluates the proxy configuration by considering the environment variables. If we are redirected to a URL covered by NO_PROXY, we strip the proxy configuration. Otherwise, we set missing proxy keys for this URL (in case they were stripped by a previous redirect). This method also replaces the Proxy-Authorization header where necessary.

request(*args, **kwargs)[source]¶
Partials the original request function and runs it in a thread.

resolve_redirects(resp, req, stream=False, timeout=None, verify=True, cert=None, proxies=None, yield_requests=False, **adapter_kwargs)¶
Receives a Response. Returns a generator of Responses or Requests.

response_hook(response, **kwargs) → requests_html.HTMLResponse¶
Change response encoding and replace it with an HTMLResponse.

run(*coros)[source]¶
Pass in all the coroutines you want to run; it will wrap each one in a task, run it, and wait for the results. Returns a list of results in the same order the coroutines were passed in.

send(request, **kwargs)¶
Send a given PreparedRequest.
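run() as described above is essentially "wrap each coroutine in a task, wait, and return the results in call order". A stdlib asyncio sketch of the same idea, simulating requests with sleeps instead of network calls:

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for an HTTP request: sleep instead of hitting the network.
    await asyncio.sleep(delay)
    return name

async def run_all(*coros):
    # gather() preserves argument order even when tasks finish out of order.
    tasks = [asyncio.ensure_future(c) for c in coros]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all(fetch("python", 0.02), fetch("reddit", 0.01)))
print(results)  # → ['python', 'reddit'] (call order, not completion order)
```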
should_strip_auth(old_url, new_url)¶
Decide whether Authorization header should be removed when redirecting.
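The decision should_strip_auth makes can be sketched roughly as below (simplified; requests' actual rule additionally permits the special http-to-https upgrade on default ports without stripping):

```python
from urllib.parse import urlparse

def should_strip_auth(old_url, new_url):
    """Strip Authorization when the host changes or https downgrades to http."""
    old, new = urlparse(old_url), urlparse(new_url)
    if old.hostname != new.hostname:
        return True
    return old.scheme == "https" and new.scheme == "http"

print(should_strip_auth("https://example.com/a", "https://example.com/b"))  # → False
print(should_strip_auth("https://example.com/a", "https://evil.com/"))      # → True
print(should_strip_auth("https://example.com/a", "http://example.com/b"))   # → True
```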