Python Crawler: Scraping a Novel from a Website

2024/10/24 15:15:18  Source: https://blog.csdn.net/HSJ0170/article/details/141476676

This code is for learning and research only. Do not use it illegally!

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time    : 2024/8/23 12:41
# @Author  : 何胜金-heshengjin
# @Site    :
# @File    : http_test.py
# @Software: PyCharm
"""
Run inside a virtualenv:
pip install requests
pip install beautifulsoup4
"""
import time

import requests
from bs4 import BeautifulSoup

# Request headers. Replace User-Agent and Cookie with your own browser's
# values, otherwise the script may not work.
# Note: requests can only decode 'br' and 'zstd' responses when the optional
# brotli / zstandard packages are installed.
host = 'www.xdingdian.info'
host_http = 'https://www.xdingdian.info'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Referer': 'https://www.xdingdian.info/xs/11569/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36',
    'Cookie': "articlevisited=1; __vtins__KFWsxhk6w799qkMJ=%7B%22sid%22%3A%20%228eca31dc-a28b-5dde-a36c-8c77d5599689%22%2C%20%22vd%22%3A%201%2C%20%22stt%22%3A%200%2C%20%22dr%22%3A%200%2C%20%22expires%22%3A%201724422709694%2C%20%22ct%22%3A%201724420909694%7D; __51uvsct__KFWsxhk6w799qkMJ=1; __51vcke__KFWsxhk6w799qkMJ=3f439a1a-01c4-5644-90a2-46c2c38c280b; __51vuft__KFWsxhk6w799qkMJ=1724420909699",
    'Host': host,
    'Connection': 'keep-alive',
}
# Output file for the novel text
content_txt = "魅王宠妻鬼医纨绔妃.txt"
# Scratch file holding the most recently fetched page
tmp_html = "temp.html"
# Label of the link that led to the current page:
# '下一章' = "next chapter", '下一页' = "next page" (same chapter continues)
next_text = '下一章'
# First page of the novel
main_url = "https://www.xdingdian.info/txt/11569/1290242.html"

while True:
    print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
    # Fetch the page with a GET request
    source_html = requests.get(main_url, headers=headers)
    # Set the response encoding
    source_html.encoding = 'utf-8'
    # Overwrite temp.html with the page, then read it back
    with open(tmp_html, "w+", encoding="utf-8") as f:
        f.write(source_html.text)
        f.seek(0)
        html_handle = f.read()

    title_text = ''
    soup = BeautifulSoup(html_handle, "html.parser")
    # Only the first page of a chapter gets a title line
    if next_text == '下一章':
        title = soup.find('div', id='amain').find('h1').text
        title_text += '正文 '
        title_text += title
        # Print the title
        print(title_text)
        title_text += '\n'

    # Chapter body
    text = soup.find('dd', id='contents').text
    title_text += text
    # print(text)

    # The last link in the <h3> navigation points to the next page or chapter
    children = soup.find('div', id='amain').find('h3').find_all("a")
    last_children = children[-1]
    main_url = host_http + last_children['href']
    next_text = last_children.get_text()
    print(next_text + main_url + "\n")

    # Append to 魅王宠妻鬼医纨绔妃.txt
    if next_text == '下一页':
        title_text += '\n'
    with open(content_txt, "a+", encoding="utf-8") as fc:
        # Strip non-breaking spaces (NBSP)
        fc.write(title_text.replace('\xa0', ''))

    # Wait 30 s between requests
    time.sleep(30)
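The loop above never exits on its own: it keeps following the last link in the `<h3>` navigation whatever that link says. One possible way to stop at the end of the novel is sketched below using only the standard library. The helper name `resolve_next` is hypothetical, and the stop rule rests on an assumption the original post does not confirm: that on the final page the last navigation link carries a label other than '下一章' or '下一页'.

```python
from typing import Optional
from urllib.parse import urljoin

# Link labels the original script compares against:
# '下一章' = "next chapter", '下一页' = "next page".
CONTINUE_LABELS = {"下一章", "下一页"}


def resolve_next(current_url: str, href: str, label: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None to stop crawling.

    Assumption (not confirmed by the source): on the last page of the novel
    the final link in the <h3> navigation has some other label.
    """
    if label.strip() not in CONTINUE_LABELS:
        return None
    # urljoin resolves both root-relative ("/txt/...") and relative hrefs
    # against the page that was just fetched.
    return urljoin(current_url, href)
```

In the loop, this would be called with the URL that was just fetched, the `href` of the last link, and its text. Once the current page has been appended to the output file, a `None` result is the signal to `break`. Using `urljoin` rather than concatenating `host_http` and the `href` also covers the case where the site emits relative links.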
