Python Web Scraping, Lesson 12: Setting Headers and Cookies

When writing a web scraper in Python, you often need to imitate browser behavior, which includes setting request headers and handling cookies. This lesson shows how to do both with the requests library.

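Setting Headers

Headers carry information the client sends to the server, such as the user agent (User-Agent), the accepted content types (Accept), and the language preference (Accept-Language). Setting them makes your requests look more like those of a real browser.

Example code: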

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

url = 'https://www.example.com'
response = requests.get(url, headers=headers)

# Check the response status code
if response.status_code == 200:
    print(response.text)
else:
    print(f"Request failed with status code {response.status_code}")
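To check which headers a request would actually carry, it can help to build the request without sending it. The sketch below assumes the headers are set once as session defaults with session.headers.update, so that every request made through the session reuses them, and it uses Session.prepare_request to inspect the merged result offline. The URL is a placeholder.

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5',
}

session = requests.Session()
# Session-level headers are merged into every request made through this session
session.headers.update(headers)

# Build the request without sending it (no network access needed)
request = requests.Request('GET', 'https://www.example.com')
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

The printed list also includes the defaults that requests adds on its own, such as Accept-Encoding, which is a quick way to see how far the request still differs from a browser's.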

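Handling Cookies

Cookies are small pieces of data that a server asks the client to store. They maintain session state, such as whether you are logged in. Once a site has set a cookie, you must send it back with later requests, or some pages may be inaccessible.

Example code: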

import requests

# Initial request, typically a login; the server responds with a cookie
login_url = 'https://www.example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password',
}

session = requests.Session()

# Send a POST request to log in
response = session.post(login_url, data=data)

# Check the response status code
# (many sites also return 200 for a failed login, so inspect the page content as well)
if response.status_code == 200:
    print("Login successful.")
else:
    print(f"Login failed with status code {response.status_code}")

# Send another request through the same session;
# the cookies set in the previous step are sent automatically
url = 'https://www.example.com/protected_page'
response = session.get(url)

if response.status_code == 200:
    print(response.text)
else:
    print(f"Request failed with status code {response.status_code}")

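The code above uses requests.Session() to create a session object, which shares cookies across requests. After login, the cookies are stored in the session object and sent automatically with every later request made through it.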
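To see what the session is holding, you can inspect session.cookies directly. The sketch below is an offline illustration: instead of logging in, it places a cookie into the jar by hand (the name sessionid and the value abc123 are made up), then prepares a request to confirm that the session would attach it.

```python
import requests

session = requests.Session()

# Stand-in for a cookie that a real login response would set (hypothetical name and value)
session.cookies.set('sessionid', 'abc123', domain='www.example.com', path='/')

# View the stored cookies as a plain dict
print(requests.utils.dict_from_cookiejar(session.cookies))

# Prepare (but do not send) a request to the same host and inspect its Cookie header
request = requests.Request('GET', 'https://www.example.com/protected_page')
prepared = session.prepare_request(request)
print(prepared.headers.get('Cookie'))
```

The same inspection works after a real session.post login, which makes it a convenient first check when a protected page still refuses access.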

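Notes

Security: when handling sensitive information such as passwords, make sure the connection is secure (HTTPS).
Site policies: follow the site's robots.txt file and privacy policy, and do not scrape data you are not allowed to scrape.
Rate limiting: do not send requests too frequently, or the site may ban your IP address or treat the traffic as an attack.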
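The rate-limiting advice can be sketched as a small helper that enforces a minimum gap between requests. Throttle is a hypothetical class written for this illustration and is not part of requests. The 2-second interval is an arbitrary assumption; the right value depends on the site.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


throttle = Throttle(min_interval=2.0)
# Call throttle.wait() immediately before each session.get(...) or session.post(...)
```

The clock and sleep parameters exist only so the helper can be exercised without real waiting; in normal use the defaults are enough.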

With the requests library you can set headers and manage cookies flexibly, which lets your scraper interact with sites more naturally.

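Next we combine the concepts covered so far: setting headers, handling cookies, using proxies, retrying failed requests, and logging. The example below shows how to implement them with the requests library and requests.Session: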

import logging

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO)

# Create a Session object to manage cookies for the whole session
session = requests.Session()

# Define the retry policy
retries = Retry(total=5, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])

# Mount adapters so the retry policy is applied
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

# Set the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

# Set proxies (if needed)
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}


def fetch_word_definition(word, url='https://www.example.com/word'):
    full_url = f"{url}/{word}"
    try:
        # Send the GET request
        response = session.get(full_url, headers=headers, proxies=proxies)
        response.raise_for_status()  # Raise an exception for 4xx/5xx status codes

        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Assume the definition is inside a specific <div> tag
        definition = soup.find('div', {'class': 'definition'})
        if definition:
            return definition.get_text(strip=True)
        else:
            logging.warning(f"No definition found for {word} on the page.")
            return None
    except requests.RequestException as e:
        logging.error(f"Request failed for {word}: {e}")
        return None


# Usage example
word = 'example'
definition = fetch_word_definition(word)

if definition:
    print(f"The definition of '{word}' is: {definition}")
else:
    print(f"Could not retrieve definition for '{word}'.")

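This code does the following:

Uses requests.Session to manage the session, including automatic cookie handling.
Automatically retries failed requests.
Sets request headers and proxies.
Parses the HTML with BeautifulSoup to extract the word's definition.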

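Remember to replace the URL, proxies, and HTML selector to match the site you are actually scraping. If the site has strict access rules or anti-scraping measures, you may need to adjust the code further to cope with them.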
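Before pointing the scraper at a live site, it can be useful to try a selector against a saved or hand-written HTML snippet. The HTML below is a made-up stand-in for a real page; the div with class definition matches the assumption made in the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for a page saved from the target site
html = """
<html><body>
  <h1>example</h1>
  <div class="definition">
    A representative instance of a group.
  </div>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
definition = soup.find('div', {'class': 'definition'})

if definition:
    print(definition.get_text(strip=True))
else:
    print("Selector matched nothing; adjust it to the real page structure.")
```

Testing selectors this way costs the target site no requests, and the same snippet can later serve as a regression check when the page layout changes.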

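The details can be improved in several areas:

Stronger exception handling: make sure the code handles failures gracefully, such as network outages, server errors, and unexpected HTML structure.
More detailed logging: record more information, such as the request URL, the response status code, and request and response timestamps, to help with debugging and monitoring.
Data persistence: store the scraped data in a database or file for later analysis or use.
Performance: use asynchronous requests or a thread pool to make the scraper more efficient.
Respecting robots.txt: make sure the scraper does not visit pages that are off limits.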
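As one way to approach the performance item, a thread pool from the standard library can run several fetches at once. In this sketch, fetch_all is a hypothetical helper and fake_fetch stands in for a real function such as fetch_word_definition. The worker count of 4 is an arbitrary assumption. Keep the pool small so that concurrency does not defeat the rate-limiting advice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(words, fetch, max_workers=4):
    """Call fetch(word) for each word concurrently; return {word: result} for non-empty results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch, word): word for word in words}
        for future in as_completed(futures):
            word = futures[future]
            try:
                result = future.result()
            except Exception as exc:  # one failed word should not abort the rest
                print(f"Fetching {word!r} failed: {exc}")
                continue
            if result:
                results[word] = result
    return results


def fake_fetch(word):
    # Placeholder for a real network call
    return f"definition of {word}"


print(fetch_all(['example', 'test', 'sample'], fake_fetch))
```

Results are collected as the futures finish, so the order of the entries can differ from the order of the input list.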

Below is an example that incorporates some of these improvements:

import json
import logging
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(message)s')

# Create a Session object
session = requests.Session()

# Define the retry policy
retries = Retry(total=5, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])

# Mount adapters so the retry policy is applied
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

# Set the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

# Set proxies (if needed)
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}


# Read robots.txt (simplified: collects every Disallow path and ignores User-agent groups)
def read_robots_txt(url):
    # robots.txt lives at the site root, not under the page path
    robots_url = urljoin(url, '/robots.txt')
    disallowed_paths = []
    try:
        response = session.get(robots_url)
        if response.status_code == 200:
            for line in response.text.splitlines():
                line = line.strip()
                if line.lower().startswith('disallow:'):
                    path = line.split(':', 1)[1].strip()
                    if path:  # an empty Disallow value blocks nothing
                        disallowed_paths.append(path)
    except requests.RequestException as e:
        logging.warning(f"Failed to read robots.txt at {robots_url}: {e}")
    return disallowed_paths


# Check whether the URL's path falls under a path disallowed by robots.txt
def check_robots(url, disallowed_paths):
    path = urlparse(url).path or '/'
    return any(path.startswith(disallowed) for disallowed in disallowed_paths)


# Fetch a word's definition
def fetch_word_definition(word, base_url='https://www.example.com/word'):
    full_url = f"{base_url}/{word}"
    disallowed_paths = read_robots_txt(base_url)
    if check_robots(full_url, disallowed_paths):
        logging.warning(f"Skipping URL {full_url} because it's disallowed by robots.txt.")
        return None
    try:
        start_time = time.time()
        response = session.get(full_url, headers=headers, proxies=proxies)
        response.raise_for_status()
        end_time = time.time()
        logging.info(f"Fetched {full_url} in {end_time - start_time:.2f} seconds.")

        soup = BeautifulSoup(response.content, 'html.parser')
        definition = soup.find('div', {'class': 'definition'})
        if definition:
            return definition.get_text(strip=True)
        else:
            logging.warning(f"No definition found for {word} on the page.")
            return None
    except requests.RequestException as e:
        logging.error(f"Request failed for {word}: {e}")
        return None


# Save the results to a file
def save_results(results, filename='definitions.json'):
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(results, file, indent=4, ensure_ascii=False)


# Program entry point
def main():
    words = ['example', 'test', 'sample']  # Replace with your own word list
    results = {}
    for word in words:
        definition = fetch_word_definition(word)
        if definition:
            results[word] = definition
    save_results(results)


if __name__ == "__main__":
    main()
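The hand-written robots.txt check above is deliberately simple. The standard library's urllib.robotparser also understands User-agent groups and Allow rules, so it may be a sturdier choice. The sketch below feeds it a made-up robots.txt so that it runs offline; MyCrawler is a hypothetical user-agent name. Against a real site you would call set_url and read instead of parse.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in ('https://www.example.com/word/example',
            'https://www.example.com/private/data'):
    allowed = parser.can_fetch('MyCrawler', url)
    print(f"{url}: {'allowed' if allowed else 'disallowed'}")
```

Parsing once and reusing the parser object for every URL also avoids downloading robots.txt again for each word.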