Python Web Scraping

I. Core Python Libraries for Web Scraping

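HTTP request libraries
- requests: a simple, easy-to-use HTTP library for GET/POST requests.
- aiohttp: an asynchronous HTTP client, suited to high-concurrency workloads.
HTML/XML parsing libraries
- BeautifulSoup: a parsing library that builds a navigable tree of the document and supports several parsers (such as lxml).
- lxml: a high-performance parser with XPath support.
Dynamic page handling
- Selenium: drives a real browser, so it can handle JavaScript-rendered pages.
- Playwright (recommended): a newer automation tool that supports multiple browsers.
Data storage
- pandas: data cleaning and export (CSV/Excel).
- SQLAlchemy: a database ORM toolkit (for example with MySQL or PostgreSQL).
Frameworks
- Scrapy: a high-performance scraping framework with middleware and item pipelines; distributed crawling is available through extensions such as scrapy-redis.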
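
The point about aiohttp and high concurrency can be illustrated with a small sketch of bounded concurrent fetching. To keep it runnable without network access, `fake_fetch` is a hypothetical stand-in for a real aiohttp request, and the names `fetch_all` and `max_concurrency` are assumptions made for this example only.

```python
import asyncio


async def fake_fetch(url):
    """Hypothetical stand-in for an aiohttp request; only simulates latency."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_all(urls, fetch=fake_fetch, max_concurrency=5):
    """Fetch all URLs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        # The semaphore caps how many fetches run at the same time
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(10)]
pages = asyncio.run(fetch_all(urls))
print(len(pages))
```

In a real scraper the stand-in would be replaced by a coroutine that issues the request through an aiohttp ClientSession; the semaphore pattern itself stays the same.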

II. Steps for Building a Scraper

1. Send an HTTP request

import requests

url = "https://example.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
# A timeout keeps the call from hanging on an unresponsive server
response = requests.get(url, headers=headers, timeout=10)
if response.status_code == 200:
    html = response.text  # or response.content for the raw bytes
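
Requests to real sites fail intermittently, so a retry wrapper is often useful around the call above. The following is a minimal sketch, not part of the original tutorial: `fetch_with_retry` and its parameters are hypothetical names, and the `fetch` argument is any callable that takes a URL (for example a small function wrapping `requests.get`).

```python
import time


def fetch_with_retry(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying failed attempts with exponential backoff.

    Waits backoff, then 2 * backoff, then 4 * backoff, ... between attempts.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:  # in real code, narrow this to the expected errors
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff * 2 ** attempt)
```

With requests, `fetch` could be `lambda u: requests.get(u, headers=headers, timeout=10)`, and the `except` clause could be narrowed to `requests.RequestException`.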

2. Parse the HTML

Using BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
titles = soup.find_all("h1", class_="title")
for title in titles:
    print(title.text.strip())

Using XPath (with lxml):

from lxml import etree

tree = etree.HTML(html)
items = tree.xpath('//div[@class="item"]/a/@href')
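
The href values returned by a query like the one above are often relative, repeated, or carry a fragment. As a sketch under those assumptions, the hypothetical helper `normalize_links` below resolves them against the page URL using only the standard library; the base URL and sample hrefs are made up for illustration.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin makes the link absolute; urldefrag strips any "#fragment"
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        # Keep only web links, skipping schemes such as mailto: or javascript:
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = normalize_links(
    "https://example.com/list/",
    ["/a", "b#top", "https://example.com/a", "mailto:someone@example.com"],
)
print(links)
```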

3. Handle dynamic pages (Selenium example)

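from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com")
# Wait up to 10 seconds for the JavaScript-rendered element to appear
locator = (By.CSS_SELECTOR, ".dynamic-element")
dynamic_content = WebDriverWait(driver, 10).until(EC.presence_of_element_located(locator)).text
driver.quit()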

4. Store the data

Save to CSV:

import csv

with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Link"])
    writer.writerow(["Example", "https://example.com"])

Save to a database (SQLAlchemy):

from sqlalchemy import create_engine, Column, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Article(Base):
    __tablename__ = "articles"
    title = Column(String(200), primary_key=True)
    url = Column(String(200))


engine = create_engine("sqlite:///data.db")
Base.metadata.create_all(engine)

# Insert data
Session = sessionmaker(bind=engine)
session = Session()
# merge() inserts the row or updates the existing one with the same primary key,
# so re-running the script does not fail on a duplicate title
session.merge(Article(title="Example", url="https://example.com"))
session.commit()
session.close()

III. Worked Example: Scraping the Douban Movie Top 250

import csv

import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/top250"
headers = {"User-Agent": "Mozilla/5.0"}


def get_movies():
    # This requests a single page, so it returns only the films listed on
    # that page, not all 250
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    movies = []
    for item in soup.find_all("div", class_="item"):
        title = item.find("span", class_="title").text
        rating = item.find("span", class_="rating_num").text
        movies.append((title, rating))
    return movies


def save_to_csv(movies):
    with open("douban_top250.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Title", "Rating"])
        writer.writerows(movies)


if __name__ == "__main__":
    movies = get_movies()
    save_to_csv(movies)
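
To cover the whole list rather than one page, the scraper needs to walk the paginated URLs. The sketch below assumes the site pages its results with a `start` offset query parameter in steps of 25; that parameter, the helper names `page_urls` and `crawl_all`, and the `fetch_page` callable are assumptions for illustration and should be checked against the live site.

```python
import random
import time
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Build one URL per page, assuming a `start` offset query parameter."""
    return [
        BASE_URL + "?" + urlencode({"start": start})
        for start in range(0, total, page_size)
    ]


def crawl_all(fetch_page, urls, delay_range=(1.0, 3.0), sleep=time.sleep):
    """Collect rows from every page, pausing a random interval between pages.

    fetch_page is a hypothetical callable that takes a URL and returns a list
    of (title, rating) tuples, for example a variant of get_movies() that
    accepts the URL as an argument.
    """
    rows = []
    for index, page_url in enumerate(urls):
        if index > 0:
            # Random pause so the requests are not fired back to back
            sleep(random.uniform(*delay_range))
        rows.extend(fetch_page(page_url))
    return rows
```

The result of `crawl_all(fetch_page, page_urls())` can be passed to `save_to_csv` unchanged, since it has the same list-of-tuples shape.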

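IV. Anti-Scraping Measures and Countermeasures

Common anti-scraping techniques
- User-Agent checks: send a browser-like User-Agent header (for example with the fake_useragent library).
- IP bans: route requests through a pool of proxy IPs (for example via the proxies parameter of requests).
- CAPTCHAs: use a CAPTCHA-solving service (such as Chaojiying, 超级鹰) or OCR.
- Rate limiting: wait a random interval between requests (for example time.sleep(random.uniform(1, 3))).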
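
Three of the countermeasures above (User-Agent rotation, proxy rotation, and random delays) can be combined in a small helper. This is a sketch under stated assumptions: the User-Agent strings and proxy addresses are placeholders, and `build_request_kwargs` and `polite_sleep` are hypothetical names, not part of any library.

```python
import itertools
import random
import time

# Placeholder pools; real values would come from a list you maintain
# or from a proxy provider
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]

_proxy_cycle = itertools.cycle(PROXIES)


def build_request_kwargs():
    """Return keyword arguments for requests.get with a rotated UA and proxy."""
    proxy = next(_proxy_cycle)  # round-robin over the proxy pool
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,
    }


def polite_sleep(low=1.0, high=3.0, sleep=time.sleep):
    """Pause for a random interval between low and high seconds."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

A request would then look like `requests.get(url, **build_request_kwargs())`, followed by `polite_sleep()` before the next one.
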
Recommended tools
- Proxy IPs: Kuaidaili (快代理), Zhima Proxy (芝麻代理).
- Distributed crawling: Scrapy + Redis (deduplication and task queue).
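
The role Redis plays in that setup (a shared set for deduplication plus a shared queue of pending URLs) can be sketched in memory. The `Frontier` class below is a hypothetical illustration of the idea only; it uses a Python set and deque where a distributed crawler would use Redis structures, and it is not how scrapy-redis is implemented.

```python
import hashlib
from collections import deque


class Frontier:
    """In-memory stand-in for a shared dedup set and task queue."""

    def __init__(self):
        self._seen = set()     # a Redis set would hold these fingerprints
        self._queue = deque()  # a Redis list would hold the pending URLs

    @staticmethod
    def fingerprint(url):
        # A fixed-length hash keeps the dedup set compact for long URLs
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def push(self, url):
        """Queue url unless it was seen before; return True if queued."""
        fp = self.fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the next pending URL, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


frontier = Frontier()
frontier.push("https://example.com/a")
frontier.push("https://example.com/a")  # duplicate, ignored
frontier.push("https://example.com/b")
```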