Introduction to BeautifulSoup, a Python Scraping Library, with Simple Usage Examples

2023-08-02 15:38:06 76

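1. Introduction

BeautifulSoup is a flexible, convenient, and efficient library for parsing web pages. It supports several parsers and lets you extract information from a page without writing regular expressions.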

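Parsers commonly used with BeautifulSoup

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires installing a C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser in this list that supports XML	Requires installing a C library
html5lib	BeautifulSoup(markup, "html5lib")	Most tolerant; parses the document the way a browser does; produces valid HTML5	Slow; requires installing the separate html5lib package (pure Python, no C extension)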
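Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea, not part of the library itself: the helper name make_soup and the preference order are assumptions made for illustration. It relies on bs4 raising FeatureNotFound when a requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup and the default preference order are hypothetical choices
    for this sketch.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p>Hello<b>world")
print(used)           # name of the parser that was actually used
print(soup.b.string)  # text inside the <b> tag
```

Since html.parser ships with Python, keeping it last in the tuple means the helper always finds a parser.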

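2. Quick Start

Given an HTML document, create a BeautifulSoup object:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')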

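Print the whole document, pretty-printed:

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>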

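Navigating the structured data:

print(soup.title)  # the tag and its content
print(soup.title.name)  # the name of the <title> tag
print(soup.title.string)  # the string inside <title>
print(soup.title.parent.name)  # the name of <title>'s parent tag (head)
print(soup.p)  # the first <p> tag
print(soup.p['class'])  # the class of the first <p> tag
print(soup.a)  # the first <a> tag
print(soup.find_all('a'))  # all <a> tags
print(soup.find(id="link3"))  # the first tag with id="link3"

<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>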

Extract the links from all <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
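Real pages often mix absolute and relative links, and some <a> tags have no href at all. The sketch below shows one way to handle both cases; the function name extract_links, the sample markup, and the base URL are made up for illustration, and html.parser is used only to avoid extra dependencies.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(markup, base_url):
    """Collect absolute URLs from every <a> tag that has an href."""
    soup = BeautifulSoup(markup, "html.parser")
    links = []
    # href=True matches only tags that actually carry an href attribute.
    for a in soup.find_all("a", href=True):
        # urljoin resolves relative paths against the page URL.
        links.append(urljoin(base_url, a["href"]))
    return links


# Hypothetical sample: one relative link, one anchor without href, one absolute link.
sample = (
    '<a href="/elsie">Elsie</a> '
    '<a name="top">no href</a> '
    '<a href="http://example.com/tillie">Tillie</a>'
)
links = extract_links(sample, "http://example.com/index.html")
print(links)
```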

Get all the text content:

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
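The raw output of get_text() keeps whatever whitespace the markup contains. When cleaner text is needed, the separator and strip arguments, or the stripped_strings generator, can help. A small sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# A tiny illustrative fragment, parsed with the built-in parser.
soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")

raw = soup.get_text()                              # text nodes joined as-is
joined = soup.get_text(separator="|", strip=True)  # strip each piece, then join
words = list(soup.stripped_strings)                # the stripped pieces as a list

print(repr(raw))
print(joined)
print(words)
```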

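Automatically completing tags and formatting

# <body> and <html> are never closed below; the parser fills them in.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.prettify())  # format the markup; missing closing tags are filled in
print(soup.title.string)  # get the content of the title tag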
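To see the tag completion in isolation, it helps to parse a very small broken fragment. The sketch below uses html.parser (an assumption made to avoid extra installs); unlike lxml it does not wrap the result in <html> and <body>, but it still closes the tags that were left open.

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in this illustrative fragment.
broken = "<p>unclosed <b>bold"
soup = BeautifulSoup(broken, "html.parser")

fixed = str(soup)  # serialize the repaired tree back to markup
print(fixed)
print(soup.b.string)  # the text is still reachable through the repaired <b> tag
```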

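Tag selectors

Selecting elements

# The value of the name attribute on the first <p> is illustrative.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title)  # select the title tag
print(type(soup.title))  # check its type
print(soup.head)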

Getting the tag name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title.name)

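Getting tag attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.attrs['name'])  # get the value of the name attribute on the p tag
print(soup.p['name'])  # a more direct way to write the same thing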
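Indexing a tag with a missing attribute raises KeyError, while get() returns None or a default. Multi-valued attributes such as class come back as a list. A short sketch on a made-up tag:

```python
from bs4 import BeautifulSoup

# Illustrative markup: a <p> with two classes and an id, but no name attribute.
soup = BeautifulSoup('<p class="title big" id="intro">text</p>', "html.parser")
p = soup.p

tag_id = p["id"]                    # single-valued attribute: a string
classes = p["class"]                # multi-valued attribute: a list of strings
missing = p.get("name")             # absent attribute: None instead of an error
fallback = p.get("name", "n/a")     # or a default of your choice

try:
    p["name"]                       # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(tag_id, classes, missing, fallback, raised)
```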

Getting tag content

print(soup.p.string)
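One thing to watch for: .string only works when a tag has a single child (possibly nested). If the tag holds several children, .string is None, and get_text() is the usual alternative. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

markup = "<p><b>only child</b></p><div>text <i>and</i> more</div>"
soup = BeautifulSoup(markup, "html.parser")

single = soup.p.string        # <p> has one child, <b>, which has one string
several = soup.div.string     # <div> has three children, so .string is None
all_text = soup.div.get_text()  # concatenates every string under <div>

print(single)
print(several)
print(all_text)
```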

Nested tag selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.head.title.string)

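Children and descendants

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.contents)  # get the tag's direct children as a list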

Another approach, children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.children)  # an iterator over the tag's direct children
for i, child in enumerate(soup.p.children):  # i is the index, child is the node
    print(i, child)

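The output is the same as above, with an index added. Note that children returns an iterator rather than a list, so you have to iterate over it (for example with a loop) to see the child nodes.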
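If a list is more convenient than a loop, the iterator can be passed to list(). Whitespace between tags shows up as text nodes, so filtering on Tag is a common follow-up step. A sketch on a small made-up list:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = "<ul>\n<li>a</li>\n<li>b</li>\n</ul>"
soup = BeautifulSoup(markup, "html.parser")

all_children = list(soup.ul.children)  # materialize the iterator
# Keep only real tags, dropping the newline text nodes between them.
tag_children = [c for c in soup.ul.children if isinstance(c, Tag)]

print(len(all_children))
print([t.name for t in tag_children])
```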

Getting descendants:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.descendants)  # a generator over all of the tag's descendants
for i, child in enumerate(soup.p.descendants):  # i is the index, child is the node
    print(i, child)
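The difference between children and descendants is easiest to see on a tiny tree: children stops at the first level, while descendants also walks into nested tags. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a<b>c</b></p>", "html.parser")

children = list(soup.p.children)        # the string 'a' and the <b> tag
descendants = list(soup.p.descendants)  # 'a', the <b> tag, and the string 'c' inside it

print(len(children), len(descendants))
```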

Parent and ancestor nodes

parent

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.a.parent)  # get the tag's parent node

parents

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.parents)))  # get all of the tag's ancestor nodes
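Printing whole ancestor nodes gets verbose quickly, since each ancestor contains everything below it. Collecting only the tag names is often enough to understand where an element sits. A sketch with made-up markup; the last entry is the BeautifulSoup object itself, whose name is '[document]':

```python
from bs4 import BeautifulSoup

markup = "<html><body><p><a href='#'>x</a></p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# Walk upward from the <a> tag, recording each ancestor's name.
names = [parent.name for parent in soup.a.parents]
# find_parent jumps straight to the nearest ancestor with the given name.
body = soup.a.find_parent("body")

print(names)
print(body.name)
```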

Sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.next_siblings)))  # siblings that come after the tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings that come before the tag
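Siblings include text nodes, so the node right after a tag is often a string such as a comma or a newline rather than the next tag. When only tags are of interest, find_next_sibling and find_previous_sibling skip over the text. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

markup = (
    '<p><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")
first = soup.a

raw_next = first.next_sibling                    # the text node ", "
next_tag = first.find_next_sibling("a")          # the next <a> tag
later_ids = [a["id"] for a in first.find_next_siblings("a")]
prev_tag = soup.find(id="link3").find_previous_sibling("a")

print(repr(raw_next), next_tag["id"], later_ids, prev_tag["id"])
```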

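Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content. In recent bs4 releases the text argument is called string, and text remains as an older alias.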
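As a rough illustration of how these arguments behave, the sketch below runs find_all with recursive, limit, attrs, and string on a small made-up document. The markup and the data-kind attribute are hypothetical, and html.parser is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

markup = """
<div id="main">
  <p class="story">one</p>
  <section><p class="story">two</p></section>
  <p class="note" data-kind="x">three</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")
main = soup.find(id="main")

all_p = main.find_all("p")                         # every <p> at any depth
direct_p = main.find_all("p", recursive=False)     # only direct children of <div>
first_only = main.find_all("p", limit=1)           # stop after the first match
by_attrs = main.find_all(attrs={"data-kind": "x"}) # attribute names that are not valid keywords
by_string = main.find_all(string="two")            # matches text nodes, not tags

print(len(all_p), len(direct_p), len(first_only), len(by_attrs), by_string)
```

The attrs dictionary is useful for attribute names such as data-kind that cannot be written as Python keyword arguments.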

name

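html = '''
<div class="panel">
<div class="panel-heading">Hello</div>
<ul id="list-1"><li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li></ul>
<ul id="list-2"><li class="element">Foo</li> <li class="element">Bar</li></ul>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))  # id can be passed directly as a keyword argument
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_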
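The name argument itself accepts more than a plain tag name. The sketch below, on made-up markup, shows the filter types that find_all accepts for name: a string, a list, a compiled regular expression, and True.

```python
import re

from bs4 import BeautifulSoup

markup = "<div><h1>Title</h1><h2>Sub</h2><p>text</p><b>bold</b></div>"
soup = BeautifulSoup(markup, "html.parser")

by_string = [t.name for t in soup.find_all("p")]                   # exact tag name
by_list = [t.name for t in soup.find_all(["p", "b"])]              # any name in the list
by_regex = [t.name for t in soup.find_all(re.compile("^h[1-6]$"))] # names matching a pattern
every_tag = [t.name for t in soup.find_all(True)]                  # every tag, no text nodes

print(by_string, by_list, by_regex, every_tag)
```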

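select() accepts CSS selectors:

html = '''
<div class="panel">
<div class="panel-heading">Hello</div>
<ul id="list-1"><li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li></ul>
<ul id="list-2"><li class="element">Foo</li> <li class="element">Bar</li></ul>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # . stands for class; a space separates ancestor from descendant
print(soup.select('ul li'))  # select the li tags under ul tags
print(soup.select('#list-2 .element'))  # '#' stands for id: elements with class="element" under the tag whose id is "list-2"
print(type(soup.select('ul')[0]))  # print the node type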
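select() always returns a list, even for a single match. When only the first match is wanted, select_one() returns the element directly, or None when nothing matches. The sketch below reuses the same kind of panel markup; the exact tags and the #missing selector are assumptions for illustration.

```python
from bs4 import BeautifulSoup

markup = """
<div class="panel">
  <div class="panel-heading">Hello</div>
  <ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul>
  <ul id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

heading = soup.select_one(".panel-heading").get_text(strip=True)  # first match only
items = [li.get_text() for li in soup.select("ul li")]            # text of every <li>
second_list = soup.select("#list-2 .element")                     # scoped to one <ul>
first_ul_id = soup.select("ul")[0]["id"]                          # results are ordinary tags
nothing = soup.select_one("#missing")                             # no match gives None

print(heading, items, len(second_list), first_ul_id, nothing)
```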

