Python Async IO in Practice: Building a High-Performance Crawler with asyncio
In network-request-heavy applications, the traditional synchronous approach spends most of its time waiting on I/O. This article shows how to build a high-performance asynchronous crawler with Python's asyncio library; for I/O-bound workloads like this, overall throughput is often 5-10x higher than with sequential synchronous requests.
1. Basic asynchronous crawler implementation
First, a simple asynchronous HTTP request example:
import asyncio
import aiohttp

async def fetch_page(url):
    # Open a session, issue a GET request, and return the response body
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def main():
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]
    # Schedule all fetches concurrently and wait for every result
    tasks = [fetch_page(url) for url in urls]
    pages = await asyncio.gather(*tasks)
    for page in pages:
        print(len(page))  # print the length of each page's content

asyncio.run(main())
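Because asyncio.gather starts all the fetches concurrently, the total wall-clock time is roughly that of the slowest request rather than the sum of all of them. A minimal timing sketch to verify this, reusing the main() coroutine above (the timed_main wrapper is just an illustrative name, not part of the original example):

import time

async def timed_main():
    start = time.perf_counter()
    await main()  # runs the fetches from the example above concurrently
    print(f"Elapsed: {time.perf_counter() - start:.2f}s")

asyncio.run(timed_main())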
2. A complete asynchronous crawler example
Below is a complete crawler implementation with exception handling and concurrency throttling:
import asyncio
import aiohttp
from bs4 import BeautifulSoup

class AsyncCrawler:
    def __init__(self, concurrency=10):
        # Semaphore caps how many requests are in flight at once
        self.semaphore = asyncio.Semaphore(concurrency)

    async def parse_page(self, url):
        try:
            async with self.semaphore:
                async with aiohttp.ClientSession(
                    timeout=aiohttp.ClientTimeout(total=10)
                ) as session:
                    async with session.get(url) as response:
                        html = await response.text()
                        soup = BeautifulSoup(html, 'lxml')
                        title = soup.title.string if soup.title else None
                        return f"URL: {url}, Title: {title}"
        except Exception as e:
            return f"Error fetching {url}: {e}"

    async def crawl(self, urls):
        tasks = [self.parse_page(url) for url in urls]
        results = await asyncio.gather(*tasks)
        for result in results:
            print(result)

if __name__ == "__main__":
    urls = [f"https://example.com/page{i}" for i in range(1, 21)]
    crawler = AsyncCrawler(concurrency=5)
    asyncio.run(crawler.crawl(urls))
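Because parse_page catches its own exceptions, crawl always receives plain strings. If the per-task try/except were removed, asyncio.gather can collect failures itself via return_exceptions=True; a minimal sketch of that variant, changing only the crawl method:

    async def crawl(self, urls):
        tasks = [self.parse_page(url) for url in urls]
        # return_exceptions=True keeps one failed request from cancelling the rest;
        # failures show up in the results list as exception objects
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for url, result in zip(urls, results):
            if isinstance(result, Exception):
                print(f"Error fetching {url}: {result}")
            else:
                print(result)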
3. Performance optimization tips
- Connection pool reuse: create one ClientSession and share it across requests instead of opening a new session (and TCP connection) per page (see the sketch after this list)
- Throttling: use a Semaphore to cap the number of concurrent requests
- Timeouts: configure a sensible timeout for every request
- Retries: automatically retry failed requests a limited number of times
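A sketch of how these four points might be combined, reworking the crawler from section 2 so that a single session is shared and failed requests are retried. The class name TunedCrawler, the retry count, and the backoff are illustrative choices, not part of the original example:

import asyncio
import aiohttp

class TunedCrawler:
    def __init__(self, concurrency=5, retries=3, timeout=10):
        self.semaphore = asyncio.Semaphore(concurrency)   # throttling
        self.retries = retries                            # retry budget per URL
        self.timeout = aiohttp.ClientTimeout(total=timeout)

    async def fetch(self, session, url):
        async with self.semaphore:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()

    async def fetch_with_retry(self, session, url):
        for attempt in range(1, self.retries + 1):
            try:
                return await self.fetch(session, url)
            except Exception as e:
                if attempt == self.retries:
                    return f"Error fetching {url}: {e}"
                await asyncio.sleep(2 ** attempt)  # simple exponential backoff

    async def crawl(self, urls):
        # One shared session: connections are pooled and reused across requests
        async with aiohttp.ClientSession(timeout=self.timeout) as session:
            return await asyncio.gather(
                *(self.fetch_with_retry(session, url) for url in urls)
            )

if __name__ == "__main__":
    urls = [f"https://example.com/page{i}" for i in range(1, 21)]
    pages = asyncio.run(TunedCrawler(concurrency=5).crawl(urls))
    print(len(pages), "results")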
With asynchronous IO, a Python crawler can handle hundreds of requests per second in a single process, which makes it well suited to large-scale data collection.