Using Basic Crawler Libraries: urllib

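In Python web crawler development, urllib is an important foundational library. It provides interfaces for sending HTTP requests and handling responses, which makes it a good fit for beginners and for developers who need to make HTTP requests quickly. This article walks through basic crawler development with urllib.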

1. Introduction to urllib

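urllib is a package in the Python standard library for working with URLs. It contains several modules, including urllib.request, urllib.parse, and urllib.error. urllib.request sends HTTP requests, urllib.parse parses and builds URLs, and urllib.error defines the exceptions raised while making requests.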

Sending a GET request

import urllib.request

# Target URL
url = 'http://example.com'

# Send a GET request
response = urllib.request.urlopen(url)

# Read the response body and decode it as UTF-8
html = response.read().decode('utf-8')

# Print the response content
print(html)

# Close the response object
response.close()

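
The example above hard-codes UTF-8, but a server may declare a different charset in its Content-Type header. The following is a minimal sketch, assuming a hypothetical helper named fetch_text (it is not part of urllib). It uses a with block so the response is closed automatically, and it decodes with the declared charset, falling back to UTF-8. The data: URLs are an assumption made only so the sketch runs without network access; an http:// URL would be passed the same way.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10.0, default_encoding='utf-8'):
    """Hypothetical helper: fetch a URL and return its body as text."""
    # The with block closes the response even if read() raises.
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # response.headers is an email.message.Message; get_content_charset()
        # returns the charset named in the Content-Type header, or None.
        charset = response.headers.get_content_charset() or default_encoding
    return raw.decode(charset)


# data: URLs are handled locally by urllib, so no network is needed here.
utf8_text = fetch_text('data:text/plain;charset=utf-8,%E4%B8%96%E7%95%8C')
latin1_text = fetch_text('data:text/plain;charset=iso-8859-1,caf%E9')
print(len(utf8_text), len(latin1_text))  # 2 4
```

Here utf8_text holds the two characters 世界 and latin1_text holds the four characters of café, each decoded with the charset its response declared.
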
Sending a POST request
import urllib.parse
import urllib.request

# Target URL
url = 'http://example.com/login'

# POST data
data = {
    'username': 'your_username',
    'password': 'your_password',
}

# Encode the data as a byte string
data = urllib.parse.urlencode(data).encode('utf-8')

# Create a request object
request = urllib.request.Request(url, data=data, method='POST')

# Add a request header
request.add_header('Content-Type', 'application/x-www-form-urlencoded')

# Send the request
response = urllib.request.urlopen(request)

# Read the response body and decode it as UTF-8
html = response.read().decode('utf-8')

# Print the response content
print(html)

# Close the response object
response.close()
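
It can help to inspect a request before sending it. The sketch below builds the same form-encoded request but never calls urlopen, so it runs offline. It relies on the behaviour that a Request carrying data reports POST when no method is given. The URL and credentials are the placeholder values from the example above; the variable names are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://example.com/login'
form = {'username': 'your_username', 'password': 'your_password'}

# urlencode() returns a str; Request expects the body as bytes.
body = urlencode(form).encode('utf-8')

post_request = Request(
    url,
    data=body,
    headers={'Content-Type': 'application/x-www-form-urlencoded'},
)
get_request = Request(url)

# When data is supplied and no method is given, get_method() reports POST.
print(post_request.get_method())  # POST
print(get_request.get_method())   # GET
print(post_request.data)
# Header names are stored in str.capitalize() form, e.g. 'Content-type'.
print(post_request.header_items())
```

Because header names are normalized this way, a later lookup with get_header needs the 'Content-type' spelling rather than 'Content-Type'.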

2. Handling exceptions

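Sending an HTTP request can fail because of network problems or server errors, so requests should be wrapped in a try...except statement that catches and handles those exceptions.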

from urllib.request import urlopen
from urllib.error import URLError, HTTPError

response = None
try:
    url = 'http://example.com'
    response = urlopen(url)
    html = response.read().decode('utf-8')
    print(html)
except HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print('HTTP Error:', e.code, e.reason)
except URLError as e:
    print('URL Error:', e.reason)
finally:
    # Close the response only if the request actually succeeded
    if response is not None:
        response.close()
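
The same pattern can be wrapped in a small function. This is a minimal sketch, assuming a hypothetical helper named safe_fetch (not part of urllib) that returns None on failure. The data: and file: URLs are assumptions chosen only so the success and failure paths can both be exercised without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def safe_fetch(url, timeout=5.0):
    """Hypothetical helper: return the body as bytes, or None on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print('HTTP Error:', e.code, e.reason)
    except URLError as e:
        # No usable response: DNS failure, refused connection, missing file.
        print('URL Error:', e.reason)
    return None


# HTTPError is a subclass of URLError, so it has to be caught first.
print(issubclass(HTTPError, URLError))  # True

ok = safe_fetch('data:text/plain,hello')
missing = safe_fetch('file:///this-path-should-not-exist-urllib-demo')
print(ok)       # b'hello'
print(missing)  # None
```

Returning None keeps the caller simple, at the cost of hiding why the request failed; code that needs the reason would re-raise or return the exception instead.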

3. Parsing URLs with urllib.parse

3.1 urlparse and urlsplit

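The urlparse and urlsplit functions break a URL into its components. The main difference is that urlparse returns six components (scheme, netloc, path, params, query, fragment) and separates the path parameters that follow a semicolon into params, while urlsplit returns five components and leaves any parameters as part of path. In both cases the query string is returned as a single string, not as a dictionary.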

from urllib.parse import urlparse, urlsplit

url = 'http://www.example.com:80/path?query=string#fragment'
parsed_url = urlparse(url)
split_url = urlsplit(url)

print(parsed_url)
print(split_url)
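
The difference only shows up when the path carries parameters. The sketch below uses an illustrative URL with a ;type=a segment (an assumption added for demonstration) to compare the two results.

```python
from urllib.parse import urlparse, urlsplit

url = 'http://www.example.com:80/path;type=a?query=string#fragment'

parsed = urlparse(url)  # scheme, netloc, path, params, query, fragment
split = urlsplit(url)   # scheme, netloc, path, query, fragment

print(len(parsed), len(split))  # 6 5
print(parsed.path)              # /path
print(parsed.params)            # type=a
print(split.path)               # /path;type=a

# Both results also expose convenience attributes derived from netloc.
print(parsed.hostname)          # www.example.com
print(parsed.port)              # 80
print(parsed.query)             # query=string
```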

3.2 urlunparse and urlunsplit

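These two functions are the inverses of urlparse and urlsplit: they combine the components of a URL back into a complete URL string. urlunparse expects six components and urlunsplit expects five.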

from urllib.parse import urlunparse, urlunsplit

# Assume the components have already been parsed:
# (scheme, netloc, path, params, query, fragment)
components = ('http', 'www.example.com', '/path', '', 'query=string', 'fragment')
reconstructed_url = urlunparse(components)

# urlunsplit takes five components; there is no separate params element
split_components = ('http', 'www.example.com', '/path', 'query=string', 'fragment')
reconstructed_split_url = urlunsplit(split_components)

print(reconstructed_url)
print(reconstructed_split_url)
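
A common use of the inverse functions is to change one component of an existing URL. This is a small sketch with illustrative names; it relies on the parse result being a named tuple, so _replace returns a modified copy.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.example.com/path?query=string#fragment'
parts = urlparse(url)

# Parsing and unparsing this URL gives back the original string.
print(urlunparse(parts) == url)  # True

# ParseResult is a named tuple, so _replace() returns a modified copy.
next_page = parts._replace(query='page=2', fragment='')
next_page_url = urlunparse(next_page)
print(next_page_url)  # http://www.example.com/path?page=2
```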

3.3 urljoin

The urljoin function combines a base URL with another URL (usually a relative path) into a complete URL.

from urllib.parse import urljoin

base_url = 'http://www.example.com/path'
relative_url = 'newpath/file.html'
full_url = urljoin(base_url, relative_url)

print(full_url)  # Output: http://www.example.com/newpath/file.html
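
The result depends on whether the base ends with a slash and on the form of the second argument, which matters when resolving links scraped from a page. The sketch below runs a few illustrative cases; the URLs are assumptions added for demonstration.

```python
from urllib.parse import urljoin

cases = [
    # Base without a trailing slash: its last segment is replaced.
    ('http://www.example.com/path', 'newpath/file.html'),
    # Base with a trailing slash: the relative path is appended below it.
    ('http://www.example.com/path/', 'newpath/file.html'),
    # A leading slash resolves against the site root.
    ('http://www.example.com/path', '/other'),
    # '..' climbs one directory level.
    ('http://www.example.com/a/b/c', '../d'),
    # An absolute URL replaces the base entirely.
    ('http://www.example.com/path', 'http://other.example.org/x'),
]

results = [urljoin(base, link) for base, link in cases]
for joined in results:
    print(joined)
# http://www.example.com/newpath/file.html
# http://www.example.com/path/newpath/file.html
# http://www.example.com/other
# http://www.example.com/a/d
# http://other.example.org/x
```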

3.4 urlencode

The urlencode function converts a dictionary, or a list of two-element (key, value) tuples, into a URL-encoded query string.

from urllib.parse import urlencode

params = {'query': 'string', 'limit': 10}
encoded_params = urlencode(params)

print(encoded_params)  # Output: query=string&limit=10
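
Values with spaces, non-ASCII text, or repeated keys are where urlencode does real work. The sketch below uses illustrative parameter names and a hypothetical search URL; doseq=True is the documented way to expand a sequence value into repeated keys.

```python
from urllib.parse import urlencode

# Spaces become '+' and non-ASCII text is percent-encoded as UTF-8.
simple = urlencode({'q': 'python urllib', 'city': '世界'})
print(simple)  # q=python+urllib&city=%E4%B8%96%E7%95%8C

# doseq=True turns a list value into one key=value pair per item.
repeated = urlencode({'tag': ['a', 'b'], 'limit': 10}, doseq=True)
print(repeated)  # tag=a&tag=b&limit=10

# A query string is attached to a URL after a '?'.
search_url = 'http://www.example.com/search?' + repeated
print(search_url)  # http://www.example.com/search?tag=a&tag=b&limit=10
```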

3.5 parse_qs and parse_qsl

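These two functions parse a query string into Python data structures. parse_qs returns a dictionary whose keys are the parameter names and whose values are lists of parameter values (because the same name may appear more than once). parse_qsl returns a list of (name, value) tuples.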

from urllib.parse import parse_qs, parse_qsl

query_string = 'query=string&limit=10&limit=20'
parsed_qs = parse_qs(query_string)
parsed_qsl = parse_qsl(query_string)

print(parsed_qs)   # Output: {'query': ['string'], 'limit': ['10', '20']}
print(parsed_qsl)  # Output: [('query', 'string'), ('limit', '10'), ('limit', '20')]
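
In practice the query string usually comes from a full URL. This sketch combines urlsplit with the two parsers; the URL, including its empty debug parameter, is an assumption added to show how blank values are handled.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit

url = 'http://www.example.com/search?query=string&limit=10&limit=20&debug='
query_string = urlsplit(url).query

# Blank values are dropped unless keep_blank_values=True is passed.
default_result = parse_qs(query_string)
with_blanks = parse_qs(query_string, keep_blank_values=True)
print(default_result)  # {'query': ['string'], 'limit': ['10', '20']}
print(with_blanks)     # {'query': ['string'], 'limit': ['10', '20'], 'debug': ['']}

# parse_qsl() keeps order and duplicates, so urlencode() can rebuild the string.
pairs = parse_qsl(query_string, keep_blank_values=True)
print(urlencode(pairs) == query_string)  # True
```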

3.6 quote and unquote

The quote function percent-encodes non-ASCII characters and certain special characters in a URL, and unquote decodes a percent-encoded string.

from urllib.parse import quote, unquote

encoded_string = quote('Hello, 世界!')
decoded_string = unquote(encoded_string)

print(encoded_string)  # Output: Hello%2C%20%E4%B8%96%E7%95%8C%21
print(decoded_string)  # Output: Hello, 世界!
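
Two details often matter when encoding by hand: which characters count as safe, and whether a space becomes %20 or +. The sketch below uses illustrative strings to compare quote with its safe argument against quote_plus, which follows the form-encoding convention.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

path = '/docs/my file.html'

# '/' is treated as safe by default, so path separators survive.
default_quoted = quote(path)
print(default_quoted)  # /docs/my%20file.html

# Passing safe='' encodes the slashes as well.
fully_quoted = quote(path, safe='')
print(fully_quoted)    # %2Fdocs%2Fmy%20file.html

# quote_plus() uses '+' for spaces, as in HTML form submissions.
plus_quoted = quote_plus('a b&c=d')
print(plus_quoted)                # a+b%26c%3Dd
print(unquote_plus(plus_quoted))  # a b&c=d

# unquote() leaves '+' untouched; only unquote_plus() maps it to a space.
print(unquote('a+b'))             # a+b
```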

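urllib is a capable tool for making HTTP requests and parsing URLs in Python. With urllib.request we can send GET and POST requests and handle the responses; with urllib.parse we can parse, combine, and encode URLs.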