Data Analysis Series: The beautifulsoup4 Module

🌈 Author's homepage: 羽晨同学

💫 Personal motto: "Become the master of your own future~"

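BeautifulSoup4 is a Python module for extracting data from HTML or XML files.

With the BeautifulSoup module, you can pull out whatever information a document contains, whether it sits in a tag, an attribute, or a piece of text.

BeautifulSoup4 is the fourth major version of the BeautifulSoup family of modules.

Before using the module, make sure we already have the page's source code. How to get the source code was covered in the previous article, so go back and take a look if you are not sure how.

Suppose the source code we got is this: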

html_str = """
<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>
"""

How do we use the BeautifulSoup module to parse this data?

First, we import the module.

from bs4 import BeautifulSoup

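Here is the syntax of BeautifulSoup:

BeautifulSoup(page_source, parser) ----> calls the BeautifulSoup constructor to parse the page source as a document

It returns a BeautifulSoup object (essentially a tree structure). The parsing step needs a parser, which is passed as the second argument.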

soup = BeautifulSoup(html_str,'html.parser')

Parsing the document really just means converting the HTML source into a tree structure, which makes the later lookups easier.

# print(soup,type(soup)) # <class 'bs4.BeautifulSoup'>
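To make the "tree structure" idea a little more concrete, here is a minimal sketch that walks a few nodes of a parsed document. The shortened `html_str` and the variable names `title_tag` and `b_tag` are only for illustration; they are not part of the original example.

```python
from bs4 import BeautifulSoup

# A shortened version of the sample document, used only for this sketch.
html_str = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>
"""

soup = BeautifulSoup(html_str, 'html.parser')

# Every tag is a node in the tree, with a parent and (possibly) children.
title_tag = soup.title             # the first <title> tag in the tree
print(title_tag.name)              # title
print(title_tag.parent.name)       # head
print(title_tag.text)              # The Dormouse's story

b_tag = soup.p.b                   # walk down: first <p>, then its first <b>
print(b_tag.parent['class'])       # ['title'] (class is multi-valued, so it is a list)
```

Attribute-style access such as `soup.title` only ever gives the first matching tag, which is why the CSS selector methods are handy for anything more specific.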

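Next, let's learn the methods and attributes used to extract content from the tree structure.

select: takes a CSS selector (tag selector, id selector, class selector, child selector, descendant selector, nth-of-type selector, and so on), traverses the tree, and returns every result that matches the selector in a list.

select_one: takes the same kinds of CSS selector, but returns only the first matching result as a single tag object rather than a list. If nothing matches, it returns None.

text: gets the text content inside a tag.

attrs: a dictionary of the tag's attributes. Index it with an attribute name to get the corresponding attribute value.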
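As a quick side-by-side of the four items above, the sketch below runs them on a two-link snippet. The snippet and the names `links`, `first`, and `missing` are made up for this illustration.

```python
from bs4 import BeautifulSoup

# A small two-link snippet, used only for this sketch.
html_str = ('<p class="story">'
            '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and '
            '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>')
soup = BeautifulSoup(html_str, 'html.parser')

links = soup.select('a')           # always a list (empty if nothing matches)
first = soup.select_one('a')       # a single tag, or None if nothing matches
missing = soup.select_one('img')   # there is no <img> in this snippet

print(len(links))                  # 2
print(first.text)                  # Elsie
print(first.attrs['href'])         # http://example.com/elsie
print(first.get('title'))          # None (get avoids a KeyError for absent attributes)
print(missing)                     # None
```

Because select_one can return None, it is worth checking the result before reading .text or .attrs from it.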

Extracting the p tags:

Tag selector: write only the tag name, and you get every tag with that name in the whole HTML source.

p_list = soup.select('p')
print(p_list)

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>, <p class="story">...</p>]

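Child selector: write from the outermost tag to the innermost tag, joined with > (leave a space on each side of >). Each tag must be a direct child of the tag before it.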

p_list2 = soup.select('html > body > p')
print(p_list2)

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>, <p class="story">...</p>]

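Descendant selector: write from the outer tag to the inner tag, joined with a space (the tag to the right of the space is a descendant, at any depth, of the tag to its left).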

p_list3 = soup.select('html p')
print(p_list3)

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>, <p class="story">...</p>]
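For the p tags above, the child selector and the descendant selector happen to return the same list, so the difference is easy to miss. Here is a small sketch, on a cut-down document invented for this purpose, where the two give different results:

```python
from bs4 import BeautifulSoup

# A cut-down document: the <a> tags sit inside <p>, not directly inside <body>.
html_str = ('<html><body><p class="story">'
            '<a id="link1">Elsie</a><a id="link2">Lacie</a>'
            '</p></body></html>')
soup = BeautifulSoup(html_str, 'html.parser')

direct = soup.select('body > a')       # <a> must be a direct child of <body>
nested = soup.select('body a')         # <a> may be at any depth under <body>
via_p = soup.select('body > p > a')    # every level spelled out

print(len(direct), len(nested), len(via_p))   # 0 2 2
```

The child selector finds nothing here because there is a p tag between body and a, while the descendant selector does not care how many levels lie in between.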

Get the three a tags whose class attribute value is sister.

Class selector: use . to reference the value of the tag's class attribute.

a_list = soup.select('a.sister')
print(a_list)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
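Since select returns a list of tags, a common next step is to loop over it and pull out the pieces we care about. The sketch below builds a dictionary from link text to link address; the dictionary name `sisters` is just an illustrative choice.

```python
from bs4 import BeautifulSoup

# Only the paragraph with the three links, used for this sketch.
html_str = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>"""
soup = BeautifulSoup(html_str, 'html.parser')

# Map each link's text to its href attribute value.
sisters = {a.text: a.attrs['href'] for a in soup.select('a.sister')}

for name, url in sisters.items():
    print(name, url)
```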

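Get the second a tag.

id selector: use # to reference the value of the tag's id attribute. Note that select_one returns the tag itself, while select returns a list, even when only one tag matches.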

a_tag2 = soup.select_one('a#link2')  # a single tag (or None)
print(a_tag2)
a_list2 = soup.select('a#link2')  # a list of tags
print(a_list2)

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
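When the id might not exist on the page, select_one returns None, and reading an attribute from None raises an error. One possible way to guard against that is a small helper like the one below. `href_by_id` is a hypothetical function written for this sketch, not something bs4 provides, and it assumes simple ids that need no CSS escaping.

```python
from bs4 import BeautifulSoup

# A one-link snippet, used only for this sketch.
html_str = '<p><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_str, 'html.parser')


def href_by_id(soup, element_id):
    """Hypothetical helper: return the href of the tag with this id, or None."""
    tag = soup.select_one(f'#{element_id}')   # id selector without a tag name
    return tag.get('href') if tag is not None else None


print(href_by_id(soup, 'link2'))   # http://example.com/lacie
print(href_by_id(soup, 'link9'))   # None
```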

Get the href attribute value and the text content of the second a tag.

a = soup.select_one('html > body a#link2')
print(a,a.text,a.attrs['href'])

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> Lacie http://example.com/lacie

nth-of-type(N): gets the tag that is the Nth of its type among its sibling tags (counting starts at 1).

a = soup.select_one('html > body a:nth-of-type(2)')
print(a,a.text,a.attrs['href'])

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> Lacie http://example.com/lacie
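One detail worth seeing in action is that nth-of-type counts separately for each tag name, and only among tags that share the same parent. A minimal sketch on a cut-down document (the document and variable names are invented for this illustration):

```python
from bs4 import BeautifulSoup

# Two <p> siblings; the second one holds three <a> siblings.
html_str = ('<body><p class="title"><b>Title</b></p>'
            '<p class="story"><a id="link1">Elsie</a>'
            '<a id="link2">Lacie</a><a id="link3">Tillie</a></p></body>')
soup = BeautifulSoup(html_str, 'html.parser')

second_p = soup.select_one('p:nth-of-type(2)')   # second <p> under <body>
third_a = soup.select_one('a:nth-of-type(3)')    # third <a> under its parent <p>
fourth_a = soup.select_one('a:nth-of-type(4)')   # there are only three <a> tags

print(second_p['class'])   # ['story']
print(third_a['id'])       # link3
print(fourth_a)            # None
```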

That's all for today. See you tomorrow.
