How to Fix Garbled Text (Mojibake) Caused by Encoding Problems in Python Web Crawlers

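Project scenario:

The project is a Python crawler that scrapes product information (product name, price, reviews, and so on) from an e-commerce site and saves it to a CSV file. It uses the requests library to fetch pages and Beautiful Soup to parse them, and it has to handle large amounts of data containing Chinese text, emoji, and other special characters.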

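Problem description:

Running the crawler exposed the following problems:

- The saved CSV file shows garbled text when opened
- Some product names appear as question marks or boxes
- Emoji do not display correctly
- Reading the CSV file back raises UnicodeDecodeError

Problematic code example: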

import requests
from bs4 import BeautifulSoup
import csv


def crawl_product_info():
    url = "http://example.com/products"
    response = requests.get(url)
    # response.text is decoded with whatever encoding requests inferred
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract product information
    products = []
    for item in soup.find_all('div', class_='product-item'):
        name = item.find('h2').text
        price = item.find('span', class_='price').text
        comments = item.find('div', class_='comments').text
        products.append([name, price, comments])

    # Save to CSV (no encoding given, so the system default is used)
    with open('products.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerows(products)

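Cause analysis:

Web page encoding:
- requests takes the encoding for response.text from the charset in the HTTP Content-Type header; when a text response declares no charset, it falls back to ISO-8859-1
- Some sites declare an encoding that does not match the bytes they actually send
- As a result, response.text may be decoded with the wrong encoding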
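
The mechanism behind these points can be reproduced without any network access. The following is a minimal sketch rather than part of the crawler: it assumes a page whose bytes are UTF-8 but which are decoded as ISO-8859-1, the fallback requests applies to text responses with no declared charset. The sample string and variable names are illustrative.

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text.
original = "商品名称"
raw = original.encode("utf-8")

# Decoding with the wrong codec does not raise; it silently yields mojibake.
garbled = raw.decode("iso-8859-1")

# ISO-8859-1 maps every byte to exactly one character, so the damage is
# reversible: re-encode with the wrong codec, then decode with the right one.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(recovered == original)
```

With requests, the corresponding fix is to assign the right codec to response.encoding before reading response.text.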
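CSV file encoding:
- open() without an encoding argument uses the system default (usually cp936 on Chinese-language Windows)
- No UTF-8 BOM is written
- Without a BOM, Excel may fail to recognize the file as UTF-8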
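
To make the BOM point concrete, here is a small sketch that compares the bytes produced by the utf-8 and utf-8-sig codecs. The sample text is made up, and the locale line only reports what this machine would typically use when open() is called without an encoding argument.

```python
import locale

text = "价格"

plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF

# Typically the encoding open() falls back to when none is specified.
print(locale.getpreferredencoding(False))
print(plain)
print(with_bom)

# Reading back: utf-8-sig strips the BOM, plain utf-8 keeps it as U+FEFF.
stripped = with_bom.decode("utf-8-sig")
kept = with_bom.decode("utf-8")
```

The three-byte prefix is the hint Excel uses to treat the file as UTF-8 rather than as the local code page.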
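Special characters:
- Emoji lie outside the Unicode Basic Multilingual Plane (above U+FFFF)
- Some symbols cannot be represented in legacy encodings such as cp936 and need a Unicode encoding
- Default encoding behavior may differ between Python versions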
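
The question marks described earlier can be reproduced with a short sketch. It assumes the text ends up being written through the cp936 codec, as it typically would be by default on Chinese-language Windows; the sample strings are illustrative.

```python
emoji = "😀"

# U+1F600 is above U+FFFF, so it lies outside the Basic Multilingual Plane.
code_point = ord(emoji)

# cp936 (GBK) has no mapping for it, so strict encoding fails.
try:
    emoji.encode("cp936")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# With errors="replace", the emoji is silently turned into "?".
replaced = "好评😀".encode("cp936", errors="replace").decode("cp936")

# UTF-8 can represent it, using four bytes.
utf8_bytes = emoji.encode("utf-8")

print(hex(code_point), encodable, replaced, len(utf8_bytes))
```

This is why writing the CSV in a Unicode encoding is needed for emoji to survive at all.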

Solution:

Complete improved code:

import requests
from bs4 import BeautifulSoup
import csv
from typing import List, Dict


class ProductCrawler:
    def __init__(self):
        self.session = requests.Session()
        # Set a default User-Agent, keeping the session's other default headers
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

    def get_page_content(self, url: str) -> str:
        """Fetch a page and make sure it is decoded correctly."""
        response = self.session.get(url)
        # Set the encoding explicitly from the encoding detected in the content
        response.encoding = response.apparent_encoding
        return response.text

    def parse_product(self, html: str) -> List[Dict]:
        """Parse product information."""
        soup = BeautifulSoup(html, 'html.parser')
        products = []
        for item in soup.find_all('div', class_='product-item'):
            try:
                product = {
                    'name': self._clean_text(item.find('h2').text),
                    'price': self._clean_text(item.find('span', class_='price').text),
                    'comments': self._clean_text(item.find('div', class_='comments').text)
                }
                products.append(product)
            except AttributeError as e:
                # find() returned None because an expected element is missing
                print(f"Failed to parse product: {e}")
                continue
        return products

    def _clean_text(self, text: str) -> str:
        """Clean up text data."""
        if text is None:
            return ""
        # Trim surrounding whitespace and flatten line breaks
        return text.strip().replace('\n', ' ').replace('\r', '')

    def save_to_csv(self, products: List[Dict], filename: str):
        """Save data to a CSV file."""
        try:
            # utf-8-sig writes a BOM so that Excel recognizes the file as UTF-8;
            # newline='' lets the csv module control line endings
            with open(filename, 'w', encoding='utf-8-sig', newline='') as f:
                writer = csv.DictWriter(f, fieldnames=['name', 'price', 'comments'])
                writer.writeheader()
                writer.writerows(products)
        except Exception as e:
            print(f"Failed to save CSV file: {e}")
            raise

    def crawl_and_save(self, url: str, output_file: str):
        """Main entry point: crawl and save the data."""
        try:
            html = self.get_page_content(url)
            products = self.parse_product(html)
            self.save_to_csv(products, output_file)
            print(f"Successfully crawled and saved {len(products)} products")
        except Exception as e:
            print(f"Error while crawling: {e}")

Usage example:

def main():
    crawler = ProductCrawler()
    url = "http://example.com/products"
    try:
        crawler.crawl_and_save(url, "products.csv")
    except Exception as e:
        print(f"Program failed: {e}")


if __name__ == "__main__":
    main()

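Key improvements:

- Use response.apparent_encoding to detect the page encoding from the content
- Save the CSV as utf-8-sig (with a BOM)
- Add a text-cleaning function that trims whitespace and flattens line breaks
- Use exception handling to make the program more robust
- Organize the code as a class to improve maintainability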
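
Content-based detection is a guess and can be wrong, especially on short pages. As a possible alternative, the sketch below decodes raw page bytes by trying explicit hints first and only then falling back to common encodings. The names decode_html and META_CHARSET are hypothetical helpers, not part of requests or Beautiful Soup, and the candidate order is an assumption that may need adjusting for a given site.

```python
import re
from typing import Optional

# Hypothetical helper pattern: finds charset=... inside a <meta> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_html(raw: bytes, header_charset: Optional[str] = None) -> str:
    """Decode page bytes, trying declared encodings before common fallbacks."""
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    match = META_CHARSET.search(raw[:2048])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates += ["utf-8", "gb18030"]

    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", errors="replace")


page = '<html><head><meta charset="gbk"></head><body>商品</body></html>'.encode("gbk")
print(decode_html(page, header_charset="utf-8"))
```

With requests, response.content would supply the raw bytes. Note that a single-byte hint such as ISO-8859-1 never fails to decode, so a wrong hint of that kind would still be accepted by this approach.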

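The correct way to read the CSV file:

import csv
from typing import List, Dict


def read_csv(filename: str) -> List[Dict]:
    """Read a CSV file, trying UTF-8 (with or without a BOM) first."""
    try:
        with open(filename, 'r', encoding='utf-8-sig', newline='') as f:
            reader = csv.DictReader(f)
            return list(reader)
    except UnicodeDecodeError:
        # Fall back to gb18030, a superset of GBK/cp936
        with open(filename, 'r', encoding='gb18030', newline='') as f:
            reader = csv.DictReader(f)
            return list(reader)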
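
As a quick check of the write and read settings together, the following sketch round-trips one made-up row containing Chinese text and an emoji through a temporary file. The row contents and the temporary path are illustrative only.

```python
import csv
import os
import tempfile

fieldnames = ["name", "price", "comments"]
rows = [{"name": "保温杯 😀", "price": "¥59.00", "comments": "很好用"}]

path = os.path.join(tempfile.mkdtemp(), "products.csv")

# Write with a BOM; newline="" leaves line endings to the csv module.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# The file starts with the three BOM bytes.
with open(path, "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with utf-8-sig strips the BOM and returns the original data.
with open(path, "r", encoding="utf-8-sig", newline="") as f:
    loaded = list(csv.DictReader(f))

# Reading with plain utf-8 leaves U+FEFF glued to the first column name.
with open(path, "r", encoding="utf-8", newline="") as f:
    first_header = next(csv.reader(f))[0]

print(starts_with_bom, loaded == rows, repr(first_header))
```

The last step shows why the reader should also use utf-8-sig: otherwise the first column would have to be looked up under a key that starts with an invisible character.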

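With these changes, the program handles the common encoding cases correctly and keeps the data complete and readable. In particular, Chinese text and special characters such as emoji should no longer come out garbled.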
