The select() method
We can also extract data with CSS selectors, but note that this requires some familiarity with CSS selector syntax. A reference is available here:
https://www.w3school.com.cn/cssref/css_selectors.asp
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
- Find the title tag
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select('title'))
Output:
[<title>The Dormouse's story</title>]
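select() always returns a list of matches, which is why the examples below index it with [0]. As a side note (not part of the original example), bs4 also provides select_one(), which returns the first match directly; a minimal sketch reusing the html_doc defined above:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
# select() -> list of Tag objects; select_one() -> first match or None
print(soup.select_one('title'))               # <title>The Dormouse's story</title>
print(soup.select_one('p.title').get_text())  # The Dormouse's story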
- Get the text of the title tag
soup = BeautifulSoup(html_doc, 'lxml')
tie = soup.select('title')[0].string
print(tie)
Output:
The Dormouse's story
- Find the element with class="sister" and read its id (here id="link1")
In CSS, class="sister" is written as: .sister
soup = BeautifulSoup(html_doc, 'lxml')
tie = soup.select('.sister')[0].get('id')
print(tie)
Output:
link1
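.select('.sister') actually matches all three links; [0] keeps only the first one. If every id value is wanted, a short sketch (my own extension, reusing the soup object from above) can loop over the whole result list:
ids = [a.get('id') for a in soup.select('.sister')]
print(ids)  # ['link1', 'link2', 'link3']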
- Find the element with id="link2"
In CSS, id="link2" is written as: #link2
soup = BeautifulSoup(html_doc, 'lxml')
tie = soup.select('#link2')[0].string
print(tie)
Output:
Lacie
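Selectors can also be combined. A few more patterns that select() supports (extra illustrative examples, again using the html_doc from above):
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select('p.story a'))      # descendant selector: all <a> inside <p class="story">
print(soup.select('head > title'))   # direct child selector
print(soup.select('a[href="http://example.com/lacie"]'))  # attribute selector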
Extended select() example
from bs4 import BeautifulSoup
html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
<tbody>
<tr class="h">
<td class="l" width="374">职位名称</td>
<td>职位类别</td>
<td>人数</td>
<td>地点</td>
<td>发布时间</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师(深圳)</a></td>
<td>技术类</td>
<td>4</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a id="test" class="test" target="_blank" href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, 'lxml')
trs = soup.select('tr')[1:]
for tr in trs:
    jobs = tr.select('td')[0]
    jobs_work = list(jobs.stripped_strings)[0]
    print(jobs_work)
Output:
22989-金融云区块链高级研发工程师(深圳)
22989-金融云高级后台开发
SNG16-腾讯音乐运营开发工程师(深圳)
SNG16-腾讯音乐业务运维工程师(深圳)
TEG03-高级研发工程师(深圳)
TEG03-高级图像算法研发工程师(深圳)
TEG11-高级AI开发工程师(深圳)
15851-后台开发工程师
15851-后台开发工程师
SNG11-高级业务运维工程师(深圳)
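Each row contains more than the job title. A minimal sketch (my own extension of the example, using the same html string) that pulls every column of a row; stripped_strings is used for the first cell because the title sits inside a nested <a> tag:
soup = BeautifulSoup(html, 'lxml')
for tr in soup.select('tr')[1:]:
    tds = tr.select('td')
    title = list(tds[0].stripped_strings)[0]              # job title inside the <a>
    category, count, place, date = [td.string for td in tds[1:]]
    print(title, category, count, place, date)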
bs4 case study
- Goal: scrape the job listings from https://pt.597.com/zhaopin, including the position, company, salary, location, experience requirement, and so on.
- Page structure analysis:
https://pt.597.com/zhaopin/?q=%E6%96%87%E5%91%98&page=1   page 1
https://pt.597.com/zhaopin/?q=%E6%96%87%E5%91%98&page=2   page 2
https://pt.597.com/zhaopin/?q=%E6%96%87%E5%91%98&page=3   page 3
Only the page number changes, so we can take a user-defined job keyword with input() and fetch multiple pages:
workjob = input('请你输入要查找的工作岗位:')
for x in range(31):
    url = 'https://pt.597.com/zhaopin/?q=%s&page={}'.format(x + 1) % workjob
- Source analysis: the raw page source matches what the Elements panel shows, so the page is static and can be parsed with regular expressions, XPath, or bs4; this case study uses bs4.
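Incidentally, the q= value in the sample URLs above is just the search keyword percent-encoded as UTF-8; a quick check with the standard library (an aside, not part of the original code) shows that 文员 encodes to exactly the value seen above:
from urllib.parse import quote, unquote

print(quote('文员'))                  # %E6%96%87%E5%91%98
print(unquote('%E6%96%87%E5%91%98'))  # 文员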
- Open https://pt.597.com/zhaopin/?q=%E6%96%87%E5%91%98&page=2, select the job title "仓库文员", right-click and choose Inspect. This gives <a href="/job-4b2a4d5132951.html" data-jid="4b2a4d5132951" data-act="1" target="blank" class="fb des_title" style="" rel="">仓库文员</a>. Its parent tag is <li class="firm-l">, and the <div class="firm-item"> above it corresponds to one complete row of the listing, while <div class="firm_box" id="firm_box"> holds all of the listings on the page, so we first locate it with bs4:
soup = BeautifulSoup(res, 'lxml')
firm_box = soup.find('div', class_="firm_box")
To narrow things down further, we find all the <div class="firm-item"> elements, which gives every row except the header:
firm_items = firm_box.find_all('div', class_="firm-item")[1:]
The position and the other fields live in the li tags under each firm_item:
for firm_item in firm_items:
    lis = firm_item.find_all('li')
lis is a list: the company name is in lis[1].string, and so on for the other fields.
- Collect each parsed row in a list lst and add it to the overall result with append().
- Write the collected rows to a CSV file:
def WriteData(self, job_lst):
    hr = ['职位', '发布公司', '薪资', '求职地区', '经验要求']
    with open('招聘统计表.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(hr)
        writer.writerows(job_lst)
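One practical aside (an assumption about how the file will be viewed, not part of the original code): if the CSV is opened directly in Excel, plain utf-8 often displays the Chinese headers as garbled text; writing with encoding='utf-8-sig' adds a BOM that Excel recognizes:
import csv

# Variant of the open() call above: 'utf-8-sig' prepends a BOM so Excel detects UTF-8.
with open('招聘统计表.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['职位', '发布公司', '薪资', '求职地区', '经验要求'])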
The complete program:
import requests
import csv
from bs4 import BeautifulSoup

class Sprider():
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'}

    def UrlSourse(self, url):
        # Fetch a page and return its decoded HTML text
        response = requests.get(url, headers=self.headers)
        response.encoding = 'utf-8'
        res = response.text
        return res

    def parserurl(self, res):
        # Parse one page of listings into a list of rows
        soup = BeautifulSoup(res, 'lxml')
        firm_box = soup.find('div', class_="firm_box")
        firm_items = firm_box.find_all('div', class_="firm-item")[1:]
        job_lst = []
        for firm_item in firm_items:
            lis = firm_item.find_all('li')
            target = lis[1].string   # company
            sary = lis[2].string     # salary
            diqu = lis[3].string     # location
            jyan = lis[4].string     # experience requirement
            jobs = firm_item.find_all('a')
            job = jobs[0].string     # position title
            lst = [job, target, sary, diqu, jyan]
            job_lst.append(lst)
        return job_lst

    def WriteData(self, job_lst):
        # Write the header row and all collected rows to a CSV file
        hr = ['职位', '发布公司', '薪资', '求职地区', '经验要求']
        with open('招聘统计表.csv', 'w', encoding='utf-8', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(hr)
            writer.writerows(job_lst)

    def main(self):
        workjob = input('请你输入要查找的工作岗位:')
        job_lst = []
        for x in range(31):
            url = 'https://pt.597.com/zhaopin/?q=%s&page={}'.format(x + 1) % workjob
            res = self.UrlSourse(url)
            job_lst += self.parserurl(res)
        self.WriteData(job_lst)

if __name__ == '__main__':
    s = Sprider()
    s.main()
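Since main() requests 30 pages in a row, it can be considerate to pause briefly between requests. A small sketch of a drop-in replacement for Sprider.main() (my own suggestion, not in the original code) that adds a delay with time.sleep:
import time

def main(self):
    workjob = input('请你输入要查找的工作岗位:')
    job_lst = []
    for x in range(31):
        url = 'https://pt.597.com/zhaopin/?q=%s&page={}'.format(x + 1) % workjob
        res = self.UrlSourse(url)
        job_lst += self.parserurl(res)
        time.sleep(1)  # wait one second between pages to avoid stressing the site
    self.WriteData(job_lst)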
Advanced bs4 case study
- Goal: scrape weather information for cities across China.
- Page structure:
http://www.weather.com.cn/textFC/hb.shtml   North China
http://www.weather.com.cn/textFC/xn.shtml   Southwest China
The region pages do not follow a pattern that range() could generate, so we list the URLs explicitly and loop over them to fetch each response:
urls = ['http://www.weather.com.cn/textFC/hb.shtml', 'http://www.weather.com.cn/textFC/db.shtml', 'http://www.weather.com.cn/textFC/hd.shtml',
        'http://www.weather.com.cn/textFC/hz.shtml', 'http://www.weather.com.cn/textFC/hn.shtml', 'http://www.weather.com.cn/textFC/xb.shtml',
        'http://www.weather.com.cn/textFC/xn.shtml', 'http://www.weather.com.cn/textFC/gat.shtml']
for url in urls:
- Taking http://www.weather.com.cn/textFC/hb.shtml as an example, the whole page of data sits under the div with class="conMidtab", which contains all of the data for Beijing, Tianjin and Hebei:
conMidtab = soup.find('div', class_="conMidtab")
- Next, find the table tag for each provincial capital or municipality:
tables = conMidtab.find_all('table')
- Then find each city's row of weather data in the tr tags inside each table (note that the first two tr rows are headers and must be skipped):
for table in tables:
    trs = table.find_all('tr')[2:]
- Inside each tr, find the td tags (index 0 is the city, the second-to-last is the temperature):
    for index, tr in enumerate(trs):
        tds = tr.find_all('td')
        city = list(tds[0].stripped_strings)[0]
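In the full program below, the first row of each province's table gets special handling (if index == 0, the city is read from tds[1]). The reason is the page layout: in the first data row the first td holds the province name via rowspan, so the city name sits in the second td. A minimal sketch with made-up markup (hypothetical, just to illustrate the pattern):
from bs4 import BeautifulSoup

html_demo = """
<table>
  <tr><td rowspan="2">河北</td><td>石家庄</td><td>晴</td><td>5</td></tr>
  <tr><td>唐山</td><td>多云</td><td>3</td></tr>
</table>
"""
soup = BeautifulSoup(html_demo, 'html5lib')
for index, tr in enumerate(soup.find_all('tr')):
    tds = tr.find_all('td')
    city = list(tds[0].stripped_strings)[0]
    if index == 0:                 # first row: tds[0] is the province, not a city
        city = list(tds[1].stripped_strings)[0]
    print(index, city)             # 0 石家庄 / 1 唐山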
The complete program:
import requests
import csv
from bs4 import BeautifulSoup

class Sprider():
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'}

    def urlsourse(self, url):
        # Fetch one region page and return its decoded HTML text
        response = requests.get(url, headers=self.headers)
        response.encoding = 'utf-8'
        res = response.text
        return res

    def parseurl(self, res):
        # Parse one region page into a list of {'city': ..., 'temp': ...} dicts
        soup = BeautifulSoup(res, 'html5lib')
        conMidtab = soup.find('div', class_="conMidtab")
        tables = conMidtab.find_all('table')
        weather_list = []
        for table in tables:
            trs = table.find_all('tr')[2:]   # skip the two header rows
            for index, tr in enumerate(trs):
                tds = tr.find_all('td')
                city = list(tds[0].stripped_strings)[0]
                if index == 0:
                    # first data row: tds[0] is the province name, the city is in tds[1]
                    city = list(tds[1].stripped_strings)[0]
                lst_dict = {}
                temp = list(tds[-2].stripped_strings)[0]
                lst_dict['city'] = city
                lst_dict['temp'] = temp
                weather_list.append(lst_dict)
        return weather_list

    def writerdata(self, weather_list):
        # Write the list of dicts to a CSV file with DictWriter
        hr = ['city', 'temp']
        with open('天气预报.csv', 'w', encoding='utf-8', newline='') as f:
            writer = csv.DictWriter(f, hr)
            writer.writeheader()
            writer.writerows(weather_list)

    def main(self):
        urls = ['http://www.weather.com.cn/textFC/hb.shtml', 'http://www.weather.com.cn/textFC/db.shtml', 'http://www.weather.com.cn/textFC/hd.shtml',
                'http://www.weather.com.cn/textFC/hz.shtml', 'http://www.weather.com.cn/textFC/hn.shtml', 'http://www.weather.com.cn/textFC/xb.shtml',
                'http://www.weather.com.cn/textFC/xn.shtml', 'http://www.weather.com.cn/textFC/gat.shtml']
        lst = []
        for url in urls:
            res = self.urlsourse(url)
            lst += self.parseurl(res)
        self.writerdata(lst)

if __name__ == '__main__':
    s = Sprider()
    s.main()
New concepts
- The enumerate() function
In the weather case above we need a check on one particular row, and enumerate() gives us each item's index so we can tell which row needs the special handling.
trs = [1, 2, 3]
trs
[1, 2, 3]
for index, tr in enumerate(trs):
    print(index, tr)
Output:
0 1
1 2
2 3
lst = [4, 2, 25, 35, 24, 39]
for index, n in enumerate(lst):
    if index == 0:
        n = 100
    print(index, n)
Output:
0 100
1 2
2 25
3 35
4 24
5 39
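enumerate() also accepts an optional start argument that offsets the index; a small extra example (my own addition) for completeness:
lst = ['a', 'b', 'c']
for index, item in enumerate(lst, start=1):  # count from 1 instead of 0
    print(index, item)
# 1 a
# 2 b
# 3 c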