Build a super fast web scraper with Python: 100x faster than BeautifulSoup

Web scraping is a technique for extracting structured information from web pages. With Python, you can build an efficient web scraper using BeautifulSoup, requests, and other libraries. However, these solutions are not fast enough. In this article, I will show you some tips for building a super fast web scraper with Python.

Don't use BeautifulSoup4 #

BeautifulSoup4 is friendly and easy to use, but it is not fast. Even if you use an external parser such as lxml for HTML parsing, or cchardet to detect the encoding, it is still slow.

Use selectolax instead of BeautifulSoup4 for HTML parsing #

selectolax is a Python binding to the Modest and Lexbor engines.

To install selectolax with pip:

pip install selectolax

The usage of selectolax is similar to BeautifulSoup4.

from selectolax.parser import HTMLParser

html = """
<body>
    <h1>Welcome to selectolax tutorial</h1>
    <div id="text">
        <p class='p3'>Lorem ipsum</p>
        <p class='p3'>Lorem ipsum 2</p>
    </div>
    <div>
        <p id='stext'>Lorem ipsum dolor sit amet, ea quo modus meliore platonem.</p>
    </div>
</body>
"""
# Select all elements with class 'p3'
parser = HTMLParser(html)
parser.css('p.p3')

# Select first match
parser.css_first('p.p3')

# Iterate over all nodes on the current level
for node in parser.css('div'):
    for cnode in node.iter():
        print(cnode.tag, cnode.html)

For more information, please visit the selectolax walkthrough tutorial.

Use httpx instead of requests #

Python requests is an HTTP client for humans. It is easy to use, but it is not fast: it supports only synchronous requests.

httpx is a fully featured HTTP client for Python 3 that provides both sync and async APIs, with support for HTTP/1.1 and HTTP/2. It offers a standard synchronous API by default, but also gives you the option of an async client when you need it. To install httpx with pip:

pip install httpx

httpx offers an API similar to that of requests:

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://httpbin.org/get')
        print(response.status_code)
        print(response.json())

asyncio.run(main())

For examples and usage, please visit the httpx home page.

Use aiofiles for file IO #

aiofiles is a Python library for asyncio-based file I/O. It provides a high-level API for working with files. To install aiofiles with pip:

pip install aiofiles

Basic usage:

import asyncio
import aiofiles

async def main():
    async with aiofiles.open('test.txt', 'w') as f:
        await f.write('Hello world!')

    async with aiofiles.open('test.txt', 'r') as f:
        print(await f.read())

asyncio.run(main())

For more information, please visit the aiofiles repository.

Tags:

Python