Web Scraping with Python in Practice: A Complete Guide to BeautifulSoup and CSV Storage

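Why scrape web data?

In today's data-driven world, collecting information from the web has become a key step for many businesses and research projects. Web scraping can:

Automatically collect competitors' pricing information
Aggregate the latest reports from multiple news sources
Monitor social media trends and user feedback
Gather training data for machine learning projects

Thanks to its rich library ecosystem, Python is a popular choice for web scraping.

Environment setup: installing the required libraries

Before you start, make sure the following Python libraries are installed: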

pip install requests beautifulsoup4 pandas

requests: sends HTTP requests and retrieves page content
BeautifulSoup: parses HTML/XML documents
pandas: handles data processing and CSV file operations

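Hands-on example: scraping and storing book information

As an example, we will scrape book information from the Douban Books Top 250 list to demonstrate the complete workflow.

Step 1: Analyze the structure of the target page

Open the Douban Books Top 250 page and inspect it with the browser developer tools (F12):

Each book is wrapped in a <tr class="item"> element
The title is in the <a> tag inside <div class="pl2">
The rating is in <span class="rating_nums">
The number of ratings is in <span class="pl">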
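
Before touching the live site, it can help to test the selectors against a small offline fragment. The sketch below is only an illustration: SAMPLE_HTML is a hypothetical fragment written to imitate the structure listed above (it is not copied from the real page), and parse_rows is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure described above.
SAMPLE_HTML = """
<table>
  <tr class="item">
    <td>
      <div class="pl2"><a href="#">Sample Book</a></div>
      <p class="pl">Sample Author / Sample Press / 2020</p>
      <span class="rating_nums">9.1</span>
      <span class="pl">(12345 ratings)</span>
    </td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return one dict per <tr class="item"> row found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("tr.item"):
        title = item.select_one("div.pl2 a")
        rating = item.select_one("span.rating_nums")
        votes = item.select_one("span.pl")
        rows.append({
            "title": title.get_text(strip=True) if title else "",
            "rating": rating.get_text(strip=True) if rating else "",
            "votes": votes.get_text(strip=True) if votes else "",
        })
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

If the selectors return empty strings for a saved copy of the real page, the page structure has probably changed and the selectors need to be updated.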

Step 2: Send an HTTP request to fetch the page

import requests
from bs4 import BeautifulSoup

url = "https://book.douban.com/top250"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

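try:
    # A timeout keeps the script from hanging if the server never responds
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raise an exception for 4xx/5xx responses
    html_content = response.text
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
    raise SystemExit(1)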

Note: the User-Agent header makes the request look like it comes from a browser, which reduces the chance of the site blocking it.
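
When many pages are requested, one possible refinement is to reuse a requests.Session that retries transient failures. This is a minimal sketch rather than part of the walkthrough itself: the helper name build_session, the retry count, the backoff factor, and the list of status codes are all illustrative assumptions.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent, retries=3, backoff=1.0):
    """Return a Session with a fixed User-Agent that retries transient errors.

    The retry settings are illustrative defaults, not values any site requires.
    """
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    session.headers.update({"User-Agent": user_agent})
    return session


session = build_session("Mozilla/5.0 (compatible; example-scraper)")
# Nothing is sent until a request is made, for example:
# response = session.get("https://book.douban.com/top250", timeout=10)
```

A session also reuses the underlying connection, which tends to be friendlier to the server than opening a new connection for every page.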

Step 3: Parse the HTML and extract the data

soup = BeautifulSoup(html_content, 'html.parser')
books = []

# Find every book entry
book_items = soup.find_all('tr', class_='item')

for item in book_items:
    # Extract the title; guard against a missing <div class="pl2">
    title_div = item.find('div', class_='pl2')
    title_tag = title_div.a if title_div else None
    title = title_tag.get_text(strip=True) if title_tag else "Unknown title"

    # Extract the rating
    rating_tag = item.find('span', class_='rating_nums')
    rating = rating_tag.get_text(strip=True) if rating_tag else "No rating"

    # Extract the number of ratings
    comment_tag = item.find('span', class_='pl')
    comment_text = comment_tag.get_text(strip=True) if comment_tag else ""
    # Keep only the digits from the text
    comment_count = ''.join(filter(str.isdigit, comment_text)) or "0"

    # Extract the publication info (look the tag up only once)
    pub_tag = item.find('p', class_='pl')
    pub_info = pub_tag.get_text(strip=True) if pub_tag else ""

    # Append the record to the list
    books.append({
        'title': title,
        'rating': rating,
        'comment_count': comment_count,
        'publication_info': pub_info
    })
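
The digit filter in the loop above joins every digit in the string, so a text containing two separate numbers would be merged into one. A regular expression that takes only the first run of digits is one possible alternative. The helper name parse_vote_count and the sample strings below are hypothetical.

```python
import re


def parse_vote_count(text):
    """Return the first run of digits in text as an int, or 0 if there is none.

    Thousands separators such as "1,234" are removed before matching.
    """
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else 0


# Hypothetical inputs for illustration
for sample in ["( 12345 ratings )", "(1,234 ratings)", ""]:
    print(repr(sample), parse_vote_count(sample))
```

Returning an int directly also means the column needs no later conversion for these values.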

Step 4: Process the data and save it as CSV

import pandas as pd

# Build a DataFrame
df = pd.DataFrame(books)

# Data cleaning: convert column types (unparseable values become NaN)
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
df['comment_count'] = pd.to_numeric(df['comment_count'], errors='coerce')

# Add a rank column
df['rank'] = range(1, len(df) + 1)

# Save as a CSV file (utf-8-sig lets Excel detect the encoding)
df.to_csv('douban_top_books.csv', index=False, encoding='utf-8-sig')

print(f"Saved {len(df)} book records to douban_top_books.csv")

pandas makes data cleaning and type conversion straightforward.
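
If pandas is not available, the standard-library csv module can write the same kind of file. The sketch below is a substitute for the pandas call above, not the method used in the walkthrough. The rows are made-up sample data, write_books_csv is a hypothetical helper, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Made-up rows with the same keys as the records built in step 3
sample_books = [
    {"title": "Sample Book A", "rating": "9.1",
     "comment_count": "12345", "publication_info": "Author A / 2020"},
    {"title": "Sample Book B", "rating": "8.7",
     "comment_count": "678", "publication_info": "Author B / 2018"},
]

FIELDNAMES = ["title", "rating", "comment_count", "publication_info"]


def write_books_csv(rows, fileobj):
    """Write the rows to an open text file object as CSV with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_books_csv(sample_books, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the target with open(path, "w", newline="", encoding="utf-8-sig") and pass that file object to the helper.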

Advanced scraping techniques

1. Handling paginated data

Many sites spread their data across multiple pages, so you need to loop over them:

base_url = "https://book.douban.com/top250?start={}"
all_books = []

for page in range(0, 250, 25):  # 25 items per page, 10 pages in total
    url = base_url.format(page)
    # Send the request and parse the data here...
    # Append each page's records to the all_books list
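
The loop above can be fleshed out roughly as follows. This is a sketch under stated assumptions: page_urls and scrape_all are hypothetical helpers, and the fetching and parsing steps are passed in as functions so the loop can be tried without network access. In real use, fetch_html would wrap the request from step 2 and parse_page the extraction loop from step 3.

```python
import random
import time

BASE_URL = "https://book.douban.com/top250?start={}"


def page_urls(total=250, page_size=25):
    """Return the URL of every page, using the start offsets 0, 25, 50, ..."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]


def scrape_all(fetch_html, parse_page, delay_range=(1.0, 3.0)):
    """Fetch and parse each page in order, pausing between requests.

    fetch_html(url) must return HTML text; parse_page(html) must return a
    list of records. Both are supplied by the caller.
    """
    all_books = []
    for url in page_urls():
        page_books = parse_page(fetch_html(url))
        if not page_books:
            break  # Stop early when a page yields no records
        all_books.extend(page_books)
        time.sleep(random.uniform(*delay_range))
    return all_books


# Offline dry run with stand-in functions and no delay
demo = scrape_all(lambda url: url, lambda html: [{"source": html}],
                  delay_range=(0, 0))
print(len(demo))
```

When ranks are added afterwards, they should be computed over the combined list rather than page by page.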

2. Handling content loaded dynamically via AJAX

For content generated by JavaScript, you can use Selenium:

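from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://book.douban.com/top250"
driver = webdriver.Chrome()
try:
    driver.get(url)
    # Wait up to 10 seconds for the first book row to appear
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "tr.item"))
    )
    html = driver.page_source
finally:
    driver.quit()  # Always close the browser
# Then parse html with BeautifulSoup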

3. Strategies to avoid IP bans

Pause between requests: time.sleep(random.uniform(1, 3))
Use a pool of proxy IPs
Rotate the User-Agent header
Respect the site's robots.txt rules
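
Checking robots.txt can be automated with the standard-library urllib.robotparser module. The sketch below parses a hypothetical robots.txt text offline. SAMPLE_ROBOTS, the user agent name "example-scraper", and the example.com URLs are assumptions for illustration and say nothing about any real site's rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)
allowed = parser.can_fetch("example-scraper", "https://example.com/top250")
blocked = parser.can_fetch("example-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("example-scraper")
print(allowed, blocked, delay)
```

Against a live site, call parser.set_url(...) with the robots.txt URL followed by parser.read() instead of parse(), and use the reported crawl delay, if any, as a lower bound for the pause between requests.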

Data analysis and visualization example

A simple analysis of the collected data:

import matplotlib.pyplot as plt
import pandas as pd

# Load the CSV data
df = pd.read_csv('douban_top_books.csv')

# Rating distribution (rows without a numeric rating are dropped)
plt.figure(figsize=(10, 6))
plt.hist(df['rating'].dropna(), bins=20, color='skyblue', edgecolor='black')
plt.title('Rating distribution of Douban Top 250 books')
plt.xlabel('Rating')
plt.ylabel('Number of books')
plt.grid(axis='y', alpha=0.75)
plt.savefig('rating_distribution.png')
plt.show()

# Relationship between rating and number of ratings
plt.figure(figsize=(10, 6))
plt.scatter(df['rating'], df['comment_count'], alpha=0.6)
plt.title('Book rating vs. number of ratings')
plt.xlabel('Rating')
plt.ylabel('Number of ratings')
plt.grid(True)
plt.savefig('rating_vs_comments.png')
plt.show()