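Python Web Crawler Basics

How a Crawler Works

Put simply, a crawler uses code to imitate how a browser (or app) accesses a site and collects the target data automatically. It typically sends requests to URLs over HTTP or HTTPS and then parses the response content.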

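The `robots.txt` Protocol

Before crawling a site, consider the legal side. The widely accepted convention is the robots.txt protocol, also known as the crawler protocol or robots exclusion protocol. Its core purpose is to tell crawlers which pages may be fetched and which may not, through rules that allow or disallow access to specific pages or directories. The robots.txt file normally sits in the site's root directory, and on most established sites you can open it directly from the browser's address bar. For example: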

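`User-Agent: *` The rules that follow apply to every crawler (search-engine bot). The asterisk (`*`) is a wildcard meaning "all".
`Allow: /` Crawlers may access everything under the site root. This is a permissive default; the more specific `Disallow` rules below carve out exceptions.
`Disallow: /?*` Blocks URLs that begin with `/?`, that is, the root path followed by a query string. Rules like this usually target dynamic pages, so the home page is not crawled once per query parameter.
`Disallow: /*/tag/*/?*` Blocks URLs whose path contains a `/tag/` segment followed by a further segment and then a query string.
`Disallow: /*/tag/*/default.html?*` Blocks URLs whose path contains a `/tag/` segment and ends in `default.html` followed by a query string.
`Disallow: /index.html*` Blocks any URL that begins with `/index.html`, presumably to keep crawlers away from alternate versions of the same page.
`Disallow: /default.aspx*` Blocks any URL that begins with `/default.aspx`. This pattern is common on ASP.NET sites and serves the same purpose as the previous rule.
`Sitemap: https://www.cnblogs.com/sitemap.xml` Gives the URL of the sitemap, an XML file listing the pages the site owner wants crawlers to discover and index.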

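Common robots.txt rules:

| Directive | Meaning | Example |
| --- | --- | --- |
| User-agent | Names the crawler the following rules apply to. It can be a specific crawler name such as `Googlebot`, or the wildcard `*` for all crawlers. | `*` |
| Disallow | Gives a URL path, or path pattern, that crawlers must not access. Under `User-agent: *` the rule applies to every crawler. | `/admin/` blocks every page under the `/admin/` directory. |
| Allow | Gives a URL path that crawlers may access. It is the opposite of `Disallow` and is usually used to carve out an exception to a `Disallow` rule. | `/public/` allows every page under the `/public/` directory. |
| Crawl-delay | Sets the delay, in seconds, between two consecutive requests, to limit crawl speed and protect the server from overload. It is optional and not universally supported; some crawlers, Googlebot among them, ignore it. | `5` means a crawler should wait at least 5 seconds before requesting the next page. |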
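A crawler can check these rules in code before requesting a page. The sketch below uses the standard library's `urllib.robotparser` on an in-memory rule set, so it needs no network access. The rules, the `example.com` URLs, and the `MyCrawler` user-agent name are assumptions made up for illustration. As far as I know, this parser matches plain path prefixes and does not expand `*` inside paths, so wildcard rules like the cnblogs ones above would need a different matcher.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, mirroring the sample values in the table above.
rules = """\
User-agent: *
Disallow: /admin/
Allow: /public/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # parse in-memory text; nothing is downloaded

# "MyCrawler" is a made-up user-agent name used only for this sketch.
can_admin = parser.can_fetch("MyCrawler", "https://example.com/admin/users")
can_public = parser.can_fetch("MyCrawler", "https://example.com/public/index.html")
delay = parser.crawl_delay("MyCrawler")

print(can_admin, can_public, delay)  # False True 5
```

For a live site, `set_url()` followed by `read()` downloads the real robots.txt instead of parsing a string.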

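Core Python Libraries for Crawling

The first step in crawling is fetching the target data, which is the job of a request library.

Request Libraries

Python code typically uses the requests library, aiohttp, or similar. This article is based on requests. Its API documentation: Developer Interface, Requests 2.18.1 documentation

Typically you call the get, post, and other request methods of requests, passing HTTP parameters such as cookies and headers, to request an API endpoint or a page. The call returns a Response object, and the body is usually read through response.text. For example: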

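import requests

# Baidu's robots.txt URL
robot_url = "https://www.baidu.com/robots.txt"

# Headers that imitate a browser. "br" and "zstd" are left out of accept-encoding
# because requests only decodes them when optional packages are installed.
header = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "accept-encoding": "gzip, deflate",
    "accept-language": "zh-CN,zh;q=0.9",
    "cache-control": "max-age=0",
}

# Send the request; the timeout keeps the call from hanging indefinitely
response = requests.get(robot_url, headers=header, timeout=10)

# Print the response body
print(response.text)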

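Parsing Libraries

After an endpoint or URL has been called, the returned data has to be parsed to extract what you need. That is the job of a parsing library.

json

json ships with Python's standard library. It is mainly used to parse data from fixed APIs, or any response whose body is JSON.

Its main functions are:

| Function | Purpose |
| --- | --- |
| json.dumps() | Encodes a Python object as a JSON string |
| json.loads() | Decodes a JSON string into a Python object |
| json.dump() | Serializes a Python object to JSON and writes it to a file |
| json.load() | Reads JSON text from a file and converts it to a Python object |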

dumps and loads example:

import json

print("-----------------dumps() example------------")
demo = {
    'name': '张三',
    'rel': [{'id': 1, 'patten': 'sdfwsdfssdf'}, {'id': 2, 'patten': 'sdfwsdfssdf'}],
    'cont': (7, 1, 2, 2, 2, 1),
}
# ensure_ascii=False keeps non-ASCII characters instead of escaping them
json_demo = json.dumps(demo, ensure_ascii=False)
print("Python object converted to JSON: {0}".format(json_demo))

print("-----------------loads() example------------")
demo_str_one = "{ \"name\":\"王五\"}"
print("JSON string converted to a Python object:")
print(json.loads(demo_str_one))

Output:

-----------------dumps() example------------
Python object converted to JSON: {"name": "张三", "rel": [{"id": 1, "patten": "sdfwsdfssdf"}, {"id": 2, "patten": "sdfwsdfssdf"}], "cont": [7, 1, 2, 2, 2, 1]}
-----------------loads() example------------
JSON string converted to a Python object:
{'name': '王五'}
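One detail in the output above is easy to miss: the tuple stored under `cont` comes back as a JSON array. JSON has fewer types than Python, so a round trip through `dumps()` and `loads()` does not always return an identical object. The short sketch below, using a made-up `record` dictionary, shows two such conversions.

```python
import json

# Made-up record: a tuple value and an integer key, neither of which exists in JSON.
record = {"name": "张三", "cont": (7, 1, 2), 1: "one"}

text = json.dumps(record, ensure_ascii=False)
restored = json.loads(text)

print(text)                # the tuple is written as a JSON array
print(restored["cont"])    # ...and comes back as a list: [7, 1, 2]
print(list(restored))      # the integer key 1 comes back as the string "1"
print(restored == record)  # False: the round trip changed both
```

If the original types matter, they have to be restored explicitly after loading.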

dump and load example:

import json

print("-----------------dump() example------------")
demo_obj = {'age': 12, 'sex': '男'}
print("Serialize a Python object to JSON and write it to a file")
with open('demo_obj.txt', 'w', encoding='utf-8') as f:
    # ensure_ascii=False keeps non-ASCII characters, indent=4 pretty-prints the
    # output, and sort_keys=True sorts the dictionary keys alphabetically.
    json.dump(demo_obj, f, ensure_ascii=False, indent=4, sort_keys=True)
# Sample content of demo_obj.txt:
#   {
#       "age": 12,
#       "sex": "男"
#   }

print("-----------------load() example------------")
# Sample content of jsonfile.txt (the file must already exist):
#   {
#       "name": "胡汉三",
#       "sex": "男",
#       "student": [
#           {
#               "id": 1
#           }
#       ]
#   }
with open('jsonfile.txt', 'r', encoding='utf-8') as file:
    # json.load reads JSON data from the open file
    data = json.load(file)
print("Content of jsonfile.txt:")
print(data)
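The load() half of the example above depends on a jsonfile.txt that has to exist beforehand. As a self-contained variant, the sketch below writes a file with `json.dump()` and reads the same file back with `json.load()`. The temporary directory and the `demo_obj.json` file name are assumptions chosen so the sketch leaves nothing behind.

```python
import json
import os
import tempfile

demo_obj = {"age": 12, "sex": "男"}

# Work in a throwaway directory that is deleted when the block exits.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_obj.json")

    # Write the object; encoding="utf-8" matters because ensure_ascii=False
    # puts the non-ASCII characters into the file unescaped.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(demo_obj, f, ensure_ascii=False, indent=4, sort_keys=True)

    # Read the same file back into a Python object.
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == demo_obj)  # True
```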

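The BeautifulSoup Library

BeautifulSoup is a Python library for parsing HTML and XML documents. It builds a parse tree from which data can be extracted easily. It supports several parsers: Python's built-in html.parser, plus the third-party lxml (which also supplies the XML parser) and html5lib.

Unlike json, BeautifulSoup has to be installed before use: pip install beautifulsoup4

For details on using BeautifulSoup, see the official site: Beautiful Soup documentation (Chinese edition)

Here is an example: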

from bs4 import BeautifulSoup

# Suppose we have the following HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

# Get every <a> tag
a_tags = soup.find_all('a')

# Print the text and href attribute of each <a> tag
for tag in a_tags:
    print(tag.get_text(), tag['href'])

# Use a CSS selector to get the first <a> tag
first_a_tag = soup.select_one('a.sister')
print(first_a_tag.get_text())

# Change the content of the first <p> tag
first_p = soup.find('p', class_='title')
first_p.string = "New Title"

# Print the modified HTML
print(soup.prettify())
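For comparison, the same link extraction can be done without any third-party package. This sketch swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`, which is event-based rather than tree-based. The `LinkCollector` class is a hypothetical helper written for this illustration, not part of any library. It also shows how much bookkeeping BeautifulSoup saves.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect (text, href) pairs for every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (text, href) pairs
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""

collector = LinkCollector()
collector.feed(html_doc)
for text, href in collector.links:
    print(text, href)
```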

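Storage Libraries

Once the crawled data has been parsed, it needs to be stored for later analysis and use. Two approaches are common: writing it to a database, or generating Excel files.

Database

SQLite, MySQL, or another database is typically used. SQLite serves as the example here.

SQLite is a lightweight database management system designed to run without a separate server process. Python's standard library already includes the sqlite3 module, so nothing extra has to be installed: any Python environment can use a SQLite database.

First, wrap SQLite in a utility class. The test code that follows imports it from sqlliteUtil, so save it as sqlliteUtil.py: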

import sqlite3


class SQLLite:
    """Thin helper around sqlite3 for storing and querying lists of dicts."""

    __db_conn = None

    def __init__(self, dbname=''):
        try:
            self.__db_conn = sqlite3.connect(dbname)
        except sqlite3.Error as exc:
            raise Exception("SQLLite not exists") from exc

    def __del__(self):
        # Close the connection only if it was actually opened
        if self.__db_conn is not None:
            self.__db_conn.close()

    def save_dict_objects(self, table_name='', objects=None):
        """Insert a list of dicts, creating the table from the first dict if needed."""
        objects = objects or []
        if table_name != '' and len(objects) > 0 and isinstance(objects[0], dict):
            # CREATE TABLE IF NOT EXISTS is a no-op when the table already exists,
            # so no separate lookup in sqlite_master is required.
            create_sql = ("CREATE TABLE IF NOT EXISTS " + table_name + " "
                          + self.__get_cloum_from_dict(objects[0]))
            self.excute_not_query_sql(sqls=create_sql)
            self.__save_data_not_sql(table_name=table_name, datas=objects)
            return len(objects)
        raise Exception("table_name must not empty!")

    def excute_not_query_sql(self, sqls):
        """Run a statement that returns no rows (DDL, INSERT, UPDATE, DELETE)."""
        if sqls == '':
            return ""
        cursor = self.__db_conn.cursor()
        try:
            cursor.execute(sqls)
            self.__db_conn.commit()
        except sqlite3.Error as exc:
            raise Exception("SQL is ERROR") from exc
        finally:
            cursor.close()
        return ""

    def excute_query_sql(self, sql):
        """Run a query and return the rows as a list of dicts keyed by column name."""
        if sql == '':
            return []
        cursor = self.__db_conn.cursor()
        try:
            cursor.execute(sql)
            columns = [field[0] for field in (cursor.description or [])]
            return [dict(zip(columns, row)) for row in cursor.fetchall()]
        except sqlite3.Error as exc:
            raise Exception("SQL is ERROR") from exc
        finally:
            cursor.close()

    def __save_data_not_sql(self, table_name, datas):
        cursor = self.__db_conn.cursor()
        try:
            for data in datas:
                values = tuple(data.values())
                # "?" placeholders let sqlite3 quote the values itself. The table
                # name cannot be a placeholder, so it must come from trusted code.
                placeholders = ", ".join("?" * len(values))
                sql = "INSERT INTO " + table_name + " VALUES (" + placeholders + ");"
                cursor.execute(sql, values)
            self.__db_conn.commit()
        except sqlite3.Error as exc:
            raise Exception("unknown error") from exc
        finally:
            cursor.close()
        return len(datas)

    def __get_cloum_from_dict(self, dict_data):
        """Build a column list such as "(id INTEGER,name TEXT)" from a sample dict."""
        columns = []
        for field, value in dict_data.items():
            if isinstance(value, int):
                col_type = "INTEGER"
            elif isinstance(value, float):
                col_type = "REAL"
            else:
                col_type = "TEXT"
            columns.append(str(field) + " " + col_type)
        return "(" + ",".join(columns) + ")"

Test call:

from sqlliteUtil import SQLLite

sql = SQLLite(dbname="db_test")
sql.excute_not_query_sql("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT);")
sql.excute_not_query_sql("INSERT INTO users (name) VALUES ('Alice');")
sql.excute_not_query_sql("INSERT INTO users (name) VALUES ('张三');")
ret = sql.excute_query_sql("SELECT * FROM users")
print(ret)

Output (first run; running it again inserts the same rows once more):

[{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': '张三'}]
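The test call above writes its values directly into the SQL text. That is fine for fixed strings, but crawled data can contain quotes or other characters that break such statements. The sketch below shows the alternative offered by the sqlite3 module itself: `?` placeholders, with `sqlite3.Row` returning rows that can be read by column name. The in-memory database and the sample names are assumptions for illustration.

```python
import sqlite3

# ":memory:" creates a temporary database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can then be read by column name

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# The last name contains a single quote on purpose.
names = ["Alice", "张三", "O'Brien"]
conn.executemany("INSERT INTO users (name) VALUES (?)", [(n,) for n in names])
conn.commit()

rows = [dict(row) for row in conn.execute("SELECT * FROM users ORDER BY id")]
print(rows)
conn.close()
```

Because the values travel separately from the SQL text, the quote in the last name needs no manual escaping.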

Excel

Data is also commonly stored as Excel files. Several popular Python libraries can work with Excel files, such as openpyxl, xlrd, xlwt, and pandas. See their official documentation for usage.

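A Simple Example

With the groundwork done, let's write a simple, basic demo. It takes the cnblogs (博客园) home page as the target and crawls the first 10 posts featured there each day.

Start by inspecting the home page: press F12 to view the source.

The analysis shows that the target data sits mainly in the a tags, which hold each post's title and link.

First, define the target structure for a blog record: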

{
    "art_name": "", // post title
    "art_desc": "", // post summary
    "art_url": "",  // post URL
    "art_time": ""  // publication time
}

Next, build the crawler: fetch the data, assemble it into the target structure, and store it.

import requests
from bs4 import BeautifulSoup
from sqlliteUtil import SQLLite

blog_url = "https://www.cnblogs.com"

# Headers that imitate a browser. "br" and "zstd" are left out of accept-encoding
# because requests only decodes them when optional packages are installed.
header = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "accept-encoding": "gzip, deflate",
    "accept-language": "zh-CN,zh;q=0.9",
    "cache-control": "max-age=0",
}

# Request the page
response = requests.get(blog_url, headers=header, timeout=10)

# Parse the content with BeautifulSoup and html.parser
soup = BeautifulSoup(response.text, 'html.parser')

# Collect the tags that hold each post's information
div_tags = soup.find_all('div', class_='post-item-text')
footer_tags = soup.find_all('footer', class_='post-item-foot')

# Assemble the target structure for the first 10 posts
ret = []
for div_tag, footer_tag in list(zip(div_tags, footer_tags))[:10]:
    blog_data = dict()
    title_tag = div_tag.find('a', class_='post-item-title')
    desc_tag = div_tag.find('p', class_='post-item-summary')
    publish_time = footer_tag.find('span', class_='post-meta-item').text
    blog_data['art_name'] = str(title_tag.text)
    blog_data['art_desc'] = str(desc_tag.text).replace("\n", "")
    blog_data['art_url'] = str(title_tag['href'])
    blog_data['art_time'] = str(publish_time).replace("\n", "")
    ret.append(blog_data)

# Store the records in the db_blog database
sql = SQLLite(dbname="db_blog")
sql.save_dict_objects(table_name='tb_artitle_data', objects=ret)

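Result: after the script runs, the posts are stored in the tb_artitle_data table of the db_blog database.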
