XPath Parsing in Practice: Scraping 《少有人走的路：心智成熟的旅程》 (The Road Less Traveled)

2025/7/8 16:56:29 Source: https://blog.csdn.net/2302_79795489/article/details/141258135

Code:

import os

import requests
from lxml import etree

# Create the output directory
if not os.path.exists('./book'):
    os.mkdir('./book')

# Fetch the index page
url = "https://wwww.xyyuedu.com/lizhishuji/shaoyourenzoudelu/"
resp = requests.get(url)
resp.encoding = resp.apparent_encoding  # let requests detect the encoding
text = resp.text

# Parse the index page and locate each chapter entry
tree = etree.HTML(text)
chapter_list = tree.xpath("/html/body/div[2]/div[3]/div/ul/li")

for li in chapter_list:
    # Chapter title
    title = li.xpath("./a/text()")[0].strip()
    # Chapter URL
    content_url = "https://wwww.xyyuedu.com/" + li.xpath("./a/@href")[0]
    # Fetch and parse the chapter page, then locate its text nodes
    resp1 = requests.get(content_url)
    resp1.encoding = resp1.apparent_encoding
    text1 = resp1.text
    tree1 = etree.HTML(text1)
    content = tree1.xpath("/html/body/div[2]/div[4]/div//text()")
    # Write the chapter to a file
    book_path = 'book/' + title + '.txt'
    with open(book_path, 'a', encoding='utf-8') as f:
        f.write(title)
        for i in content:
            f.write(i.strip())
    print(title + " scraped successfully!")

print("Over!")
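The script builds each file name directly from the chapter title. If a title ever contained a character such as / or ?, open() could fail or write to an unexpected path. A minimal sketch of a guard, assuming a hypothetical helper named safe_filename that is not part of the original script:

```python
import re


def safe_filename(title: str, replacement: str = "_") -> str:
    """Replace characters that are not allowed in Windows or POSIX file names.

    Hypothetical helper for illustration; the original script uses the title as-is.
    """
    # \ / : * ? " < > | and ASCII control characters are replaced
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, title)
    # Trailing dots and spaces are rejected by Windows, so trim them
    cleaned = cleaned.strip(" .")
    return cleaned or "untitled"


print(safe_filename("a/b:c"))
print(safe_filename("少有人走的路：心智成熟的旅程"))
```

The full-width colon ： in the book title is a legal file name character, so it is left untouched. In the script this would be used as book_path = 'book/' + safe_filename(title) + '.txt'.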

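A few questions:

1. Why does title=li.xpath("./a/text()")[0].strip() need the [0]?

Because li.xpath("./a/text()") returns a list. Even when the list holds a single element, you still need [0] to get the value out of it.

Note: a list cannot be concatenated with a string; only the string inside the list can.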

href_list = li.xpath("./a/@href")
# Using href_list itself as the URL causes problems
print(href_list)     # prints: ['***']
print(href_list[0])  # prints: ***
content_url = "https://www.xyyuedu.com/" + href_list[0]  # omitting [0] raises a TypeError
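The failure described above can be reproduced without touching the network. The sketch below uses a made-up href value (the real ones are not shown in the article) and also shows urllib.parse.urljoin as one possible alternative to plain concatenation, since it avoids a doubled slash when the href already starts with /:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.xyyuedu.com/"

# Hypothetical value standing in for li.xpath("./a/@href")
href_list = ["/lizhishuji/shaoyourenzoudelu/example.html"]

error_message = None
try:
    bad_url = BASE_URL + href_list  # str + list is not allowed
except TypeError as exc:
    error_message = str(exc)

# Index into the list first, then join the string with the base URL
content_url = urljoin(BASE_URL, href_list[0])

print(error_message)
print(content_url)
```

Whether the site's hrefs are absolute paths or relative ones is an assumption here; urljoin handles both cases.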

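Likewise, content = tree1.xpath("/html/body/div[2]/div[4]/div//text()") is also a list, so it has to be written to the file one item at a time:

for i in content:
    f.write(i.strip())

Writing f.write(content[0].strip()) instead also runs, but it writes only the first text node rather than the whole chapter.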
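To see the difference between the loop and content[0], here is a small sketch with made-up text nodes standing in for the XPath result. It also joins the non-empty nodes with newlines, which is a design choice of this sketch and not what the original script does (the original writes everything on one line):

```python
# Hypothetical text nodes, similar in shape to what //text() tends to return:
# real paragraphs mixed with whitespace-only nodes.
content = ["\r\n    ", "First paragraph.", "\r\n    ", "Second paragraph.  ", "\n"]

# Only the first node: here it is pure whitespace, so nothing useful remains
first_only = content[0].strip()

# All nodes: strip each one, drop the empty ones, keep one paragraph per line
paragraphs = [s.strip() for s in content if s.strip()]
full_text = "\n".join(paragraphs)

print(repr(first_only))
print(full_text)
```

With this approach the file write becomes a single f.write(full_text) call.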

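2. print(resp.text) shows garbled characters. How can this be fixed?

This is a character encoding problem: the encoding the page actually uses does not match the one set in resp.encoding.

You can let requests detect the encoding automatically:

resp.encoding = resp.apparent_encoding  # use the encoding that requests detects from the response body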
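The mismatch can be illustrated with plain bytes, without making a request. The sketch assumes a page served as GB2312 and decoded as ISO-8859-1, which is the fallback requests applies to text responses whose headers declare no charset:

```python
# Bytes as a GB2312-encoded page would deliver them
raw = "少有人走的路".encode("gb2312")

# Decoding with the wrong codec does not fail, it just produces garbage
garbled = raw.decode("iso-8859-1")

# Decoding with the page's real encoding recovers the text
correct = raw.decode("gb2312")

print(garbled)
print(correct)
```

Setting resp.encoding before reading resp.text has the same effect as choosing the codec in the second decode call.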

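The original page uses GB2312, a character encoding standard for Simplified Chinese that was widely used on Simplified Chinese computer systems.

Some older web pages and applications still use GB2312 for Simplified Chinese. Modern pages and systems usually use UTF-8, because it can encode every Unicode character and therefore covers far more scripts and languages.

Even so, the file-writing part of the script must set its encoding to utf-8:

The error UnicodeEncodeError: 'gb2312' codec can't encode character means that, while encoding a Unicode string as gb2312, Python hit a character that GB2312 cannot represent.

Writing with utf-8 ensures that every character can be written to the file correctly.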
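A short sketch of that error, using a made-up sample string. The non-breaking space \xa0 (&nbsp; in HTML) is a typical culprit in scraped text, because GB2312 has no code for it:

```python
import os
import tempfile

# Hypothetical sample: Chinese text containing a non-breaking space
text = "第一章\xa0纪律"

# Encoding as gb2312 fails on the \xa0 character
try:
    text.encode("gb2312")
    gb2312_ok = True
except UnicodeEncodeError:
    gb2312_ok = False

# Writing the same text with utf-8 works and reads back unchanged
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chapter.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

print(gb2312_ok)
print(round_trip == text)
```

The same failure would occur inside open(..., encoding="gb2312") at write time, which is why the script passes encoding='utf-8'.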
