Mastering BeautifulSoup4 for Python Web Scraping: From Environment Setup to a Hands-On Demo

🔸 Environment Setup

First, install the BeautifulSoup4 and requests libraries by running the following command in a terminal:

pip install beautifulsoup4 requests

🔹 requests sends the HTTP requests and BeautifulSoup4 parses the returned HTML, which together form the foundation of our scraper.
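To confirm that the installation worked, a quick check like the one below can help. It is only a sketch: the helper name installed_versions is made up for this post, and it relies on the standard-library importlib.metadata module (Python 3.8+) rather than on anything from bs4 or requests.

```python
from importlib import metadata


def installed_versions(packages=("beautifulsoup4", "requests")):
    """Return a dict mapping each package name to its version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions


status = installed_versions()
for name, version in status.items():
    print(f"{name}: {version or 'not installed'}")
```

If either line reports "not installed", rerun the pip command in the same environment that runs your script.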

🔸 bs4 Node Selectors

BeautifulSoup4 offers several ways to select HTML nodes. The two most commonly used are the find and find_all methods.

import requests
from bs4 import BeautifulSoup

# Send an HTTP request to fetch the page content
url = 'http://example.com'
response = requests.get(url, timeout=10)
html_content = response.content

# Parse the HTML document
soup = BeautifulSoup(html_content, 'html.parser')

# Use find to select the first matching node
title = soup.find('h1').text
print(f"Title: {title}")

# Use find_all to select every matching node
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

🔹 In this example, find returns the first <h1> node and find_all returns every <p> node on the page.
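One detail worth seeing in isolation is what the two methods return when nothing matches: find gives None, while find_all gives an empty result. The sketch below uses a small inline HTML string (SAMPLE_HTML is invented for illustration) so it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML so the example runs offline.
SAMPLE_HTML = """
<html><body>
  <h1>Sample Title</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first matching Tag, or None when nothing matches.
h1 = soup.find("h1")
title = h1.get_text(strip=True) if h1 is not None else None

# find_all always returns a list-like ResultSet, possibly empty.
paragraph_texts = [p.get_text(strip=True) for p in soup.find_all("p")]

# A tag that is absent from the document.
missing_tag = soup.find("h2")
missing_tags = soup.find_all("h2")

print(title)
print(paragraph_texts)
print(missing_tag, len(missing_tags))
```

Because find can return None, calling .text on its result directly raises AttributeError when the page lacks that tag, so a None check is a sensible habit on real pages.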

🔸 bs4 Attribute Selectors

BeautifulSoup4 can also select nodes by their attributes, for example every node with a particular class or id.

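import requests
from bs4 import BeautifulSoup

# Send an HTTP request to fetch the page content
url = 'http://example.com'
response = requests.get(url, timeout=10)
html_content = response.content

# Parse the HTML document
soup = BeautifulSoup(html_content, 'html.parser')

# Select a node by class (assumes the page has a <div class="content">;
# find returns None otherwise)
content_div = soup.find('div', class_='content')
print(f"Content: {content_div.text}")

# Select a node by id (assumes the page has an element with id="header")
header = soup.find(id='header')
print(f"Header: {header.text}")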

🔹 In this example, we select the <div> node whose class is content and the node whose id is header.
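The same idea extends to other attributes. As a rough sketch on invented inline HTML (SAMPLE_HTML and the data-role attribute are illustrative only), the attrs dictionary handles attribute names that are not valid Python keyword arguments, and select_one accepts an equivalent CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML so the example runs offline.
SAMPLE_HTML = """
<div id="header">Site Header</div>
<div class="content main">Main text</div>
<div class="sidebar" data-role="nav">Side text</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ matches a tag if any one of its classes equals the given value.
by_class = soup.find("div", class_="content")

# id can be passed directly as a keyword argument.
by_id = soup.find(id="header")

# Attributes such as data-role contain a hyphen, so pass them through attrs.
by_attrs = soup.find("div", attrs={"data-role": "nav"})

# The same class lookup expressed as a CSS selector.
by_css = soup.select_one("div.content")

print(by_class.text, by_id.text, by_attrs.text, by_css.text, sep=" | ")
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.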

🔸 bs4 Hierarchy Selectors

BeautifulSoup4 also lets you navigate the document tree, moving from a node to its children, its parent, and its siblings.

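import requests
from bs4 import BeautifulSoup

# Send an HTTP request to fetch the page content
url = 'http://example.com'
response = requests.get(url, timeout=10)
html_content = response.content

# Parse the HTML document
soup = BeautifulSoup(html_content, 'html.parser')

# Select nested nodes (assumes the page has a <div class="content">)
content_div = soup.find('div', class_='content')
# href=True skips <a> tags without an href, avoiding a KeyError below
all_links = content_div.find_all('a', href=True)
for link in all_links:
    print(f"Link: {link['href']}")

# Select the parent node (assumes the page has a <div id="footer">)
footer = soup.find('div', id='footer')
parent = footer.parent
print(f"Footer's parent tag: {parent.name}")

# Select the next sibling node (None if the footer is the last element)
sibling = footer.find_next_sibling()
print(f"Footer's next sibling tag: {sibling.name}")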

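🔹 In this example, we select every <a> node nested inside the content <div> (find_all searches all descendants, not only direct children), then the parent of the footer <div>, and finally the footer's next sibling node.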
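The difference between direct children and all descendants is easy to miss, so here is a minimal sketch on invented inline HTML (SAMPLE_HTML and its ids are illustrative). It also shows that find_next_sibling returns None when there is no following sibling.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical inline HTML, written without whitespace between tags
# so that the children contain only Tag objects.
SAMPLE_HTML = (
    '<div id="main">'
    '<div class="content"><a href="/a">A</a><p>Text <a href="/b">B</a></p></div>'
    '<div id="footer">Footer</div>'
    '<div id="after">After</div>'
    '</div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
content_div = soup.find("div", class_="content")

# find_all searches every descendant by default.
descendant_links = [a["href"] for a in content_div.find_all("a")]

# recursive=False restricts the search to direct children.
direct_links = [a["href"] for a in content_div.find_all("a", recursive=False)]

# .children iterates over direct children; keep only the Tag objects.
child_tag_names = [c.name for c in content_div.children if isinstance(c, Tag)]

footer = soup.find("div", id="footer")
parent_id = footer.parent.get("id")
next_id = footer.find_next_sibling().get("id")
previous_classes = footer.find_previous_sibling().get("class")

# The last sibling has nothing after it, so the result is None.
after_last = soup.find("div", id="after").find_next_sibling()

print(descendant_links, direct_links, child_tag_names)
print(parent_id, next_id, previous_classes, after_last)
```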

🔸 Hands-On Demo

Let's combine everything above into a practical scraper that fetches a page and extracts its title, author, paragraphs, and links.

import requests
from bs4 import BeautifulSoup

# Send an HTTP request to fetch the page content
url = 'https://example.com/article'
response = requests.get(url, timeout=10)
html_content = response.content

# Parse the HTML document
soup = BeautifulSoup(html_content, 'html.parser')

# Use a node selector to extract the title
title = soup.find('h1').text
print(f"Title: {title}")

# Use an attribute selector to extract the author
author = soup.find('span', class_='author').text
print(f"Author: {author}")

# Use hierarchy navigation to extract the article body
content_div = soup.find('div', class_='article-content')
paragraphs = content_div.find_all('p')
for p in paragraphs:
    print(p.text)

# Extract every link in the article (href=True skips anchors without an href)
links = content_div.find_all('a', href=True)
for link in links:
    print(f"Link: {link['href']}")
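Real pages often lack one of the expected elements, so it can help to separate parsing from fetching and to guard against missing nodes. The sketch below is one possible arrangement, not the only one: the function name parse_article, the SAMPLE markup, and the choice to resolve relative links with urljoin are all assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_article(html, base_url):
    """Extract title, author, paragraphs, and absolute links from article HTML."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(tag):
        # find returns None when nothing matches, so guard before reading text.
        return tag.get_text().strip() if tag is not None else None

    paragraphs, links = [], []
    content_div = soup.find("div", class_="article-content")
    if content_div is not None:
        paragraphs = [p.get_text().strip() for p in content_div.find_all("p")]
        # href=True skips anchors without an href attribute.
        links = [
            urljoin(base_url, a["href"])
            for a in content_div.find_all("a", href=True)
        ]

    return {
        "title": text_of(soup.find("h1")),
        "author": text_of(soup.find("span", class_="author")),
        "paragraphs": paragraphs,
        "links": links,
    }


# Hypothetical article markup so the example runs offline.
SAMPLE = """
<h1>Hello</h1>
<span class="author">Ann</span>
<div class="article-content">
  <p>Intro with <a href="/next">a link</a>.</p>
  <p>No href here: <a name="x">anchor</a></p>
</div>
"""

result = parse_article(SAMPLE, "https://example.com/article")
print(result)
```

To use it on a live page, pass it the body of a response obtained with requests.get(url, timeout=10), ideally after calling response.raise_for_status() so that HTTP errors surface before parsing.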