[TOC]
20200812
【Introduction】Prelude to web crawling
Mainly covers Requests, Beautiful Soup, Re, and Scrapy.
"The website is the API."
Course guide for "Web Crawling"
Choosing a Python development tool
Text-editor IDEs
| Text editor | Notes |
|---|---|
| IDLE | Beginner-friendly, convenient |
| Sublime Text | Professional; free to use |
| Notepad++ | |
| Vim & Emacs | |
| Atom | |
| Komodo Edit | |
Integrated IDEs
| Integrated IDE | Notes |
|---|---|
| PyCharm | |
| Anaconda & Spyder | Scientific computing and data analysis; recommended |
| Wing | Commercial; supports team collaboration |
| Visual Studio | Microsoft |
| PyDev & Eclipse | |
【Week 1】Web crawling: rules
Unit 1: Getting started with the Requests library
```python
import requests

url = 'http://www.baidu.com'
r = requests.get(url)
r.raise_for_status()       # raise an exception if the status code is not 200
r.encoding = 'utf-8'
print(r.text)
```
The 7 main methods of Requests
| Method | Description |
|---|---|
| requests.request() | Constructs a request; the base method underlying all of the following |
| requests.get() | Fetches an HTML page; corresponds to HTTP GET |
| requests.head() | Fetches only the headers of a page; corresponds to HTTP HEAD |
| requests.post() | Submits a POST request to a page; corresponds to HTTP POST |
| requests.put() | Submits a PUT request to a page; corresponds to HTTP PUT |
| requests.patch() | Submits a partial-modification request; corresponds to HTTP PATCH |
| requests.delete() | Submits a delete request to a page; corresponds to HTTP DELETE |
requests.get
r = requests.get(url, params=None, **kwargs)
url: the URL to fetch
params: extra parameters appended to the URL; dict or byte sequence, optional
**kwargs: 12 optional access-control parameters
Attributes of the Response object (important; 5 of them)
| Attribute | Description |
|---|---|
| r.status_code | HTTP status code of the request; 200 means success, e.g. 404 means failure |
| r.text | Response body as a string, i.e. the page content at the URL |
| r.encoding | Response encoding guessed from the HTTP headers |
| r.apparent_encoding | Response encoding inferred from the content itself (fallback) |
| r.content | Response body as bytes |
A general code framework for crawling web pages
Requests exceptions (6 of them)
| Exception | Description |
|---|---|
| requests.ConnectionError | Network connection error, e.g. DNS lookup failure or connection refused |
| requests.HTTPError | HTTP error |
| requests.URLRequired | URL missing |
| requests.TooManyRedirects | Exceeded the maximum number of redirects |
| requests.ConnectTimeout | Timed out while connecting to the remote server |
| requests.Timeout | The request as a whole timed out |
r.raise_for_status(): raises requests.HTTPError if the status code is not 200
```python
import requests

def getHTMLError(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()                  # raise HTTPError for non-200 responses
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "exception raised"

if __name__ == '__main__':
    url = 'http://www.baidu.com'
    print(getHTMLError(url))
```
URL format: http://host[:port][path]
host: a valid Internet host name or IP address
port: port number; defaults to 80
path: path of the requested resource
Understanding the difference between PATCH and PUT
PATCH modifies only the specified fields, which saves bandwidth.
PUT resubmits the entire resource; fields not included are dropped.
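A minimal sketch of the difference, using the public httpbin.org test endpoints (which simply echo the request back; the payload values are made up for illustration):
```python
import requests

payload = {'name': 'alice', 'city': 'Ningbo'}
r = requests.put('https://httpbin.org/put', data=payload)                   # PUT: resubmit every field
r = requests.patch('https://httpbin.org/patch', data={'city': 'Hangzhou'})  # PATCH: send only the changed field
print(r.json()['form'])   # httpbin echoes the submitted form data
```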
requests.request(method, url, **kwargs)
method: one of the 7 request types such as GET/POST
url: the URL to fetch
**kwargs: access-control parameters, 13 in total
```python
r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)
```
**kwargs
| # | **kwargs | Description | Importance |
|---|---|---|---|
| 1 | params | Dict or byte sequence, appended to the URL as query parameters | important |
| 2 | data | Dict, byte sequence or file object, sent as the request body | important |
| 3 | json | JSON data, sent as the request body | important |
| 4 | headers | Dict of custom HTTP headers | important |
| 5 | cookies | Dict or CookieJar, cookies for the request | |
| 6 | auth | Tuple, for HTTP authentication | |
| 7 | files | Dict, for uploading files | |
| 8 | timeout | Timeout in seconds | important |
| 9 | proxies | Dict of proxy servers; may include login credentials | important |
| 10 | allow_redirects | True/False, default True; redirect switch | |
| 11 | stream | True/False, default False; when True the body is not downloaded immediately (streaming) | |
| 12 | verify | True/False, default True; SSL certificate verification switch | |
| 13 | cert | Path to a local SSL client certificate | |
```python
# params
import requests
kv = {'key1': 'value1', 'key2': 'value2'}
url = 'http://python123.io/ws'
r = requests.request('GET', url, params=kv)
print(r.url)
# https://python123.io/ws?key1=value1&key2=value2

# data
kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.request('POST', url, data=kv)
body = 'main body content'
r = requests.request('POST', url, data=body)

# headers
headers = {'user-agent': 'Chrome/10'}
r = requests.request('POST', url, headers=headers)

# files
files = {'file': open('data.xls', 'rb')}
r = requests.request('POST', url, files=files)

# proxies
proxies = {'http': 'http://user:pass@10.10.10.1:1234',
           'https': 'https://10.10.10.1:4321'}
r = requests.get(url, proxies=proxies)
```
Call signatures of the shortcut methods
| # | Method | Signature |
|---|---|---|
| 1 | get | requests.get(url, params=None, **kwargs) |
| 2 | head | requests.head(url, **kwargs) |
| 3 | post | requests.post(url, data=None, json=None, **kwargs) |
| 4 | put | requests.put(url, data=None, **kwargs) |
| 5 | patch | requests.patch(url, data=None, **kwargs) |
| 6 | delete | requests.delete(url, **kwargs) |
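For example, head() fetches only the response headers, which is a cheap way to inspect a resource. A small sketch against the httpbin.org test service:
```python
import requests

r = requests.head('http://httpbin.org/get')
print(r.headers)   # response headers only
print(r.text)      # empty string: HEAD returns no body
```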
Unit 2: The "ethics" of web crawling
Unit 3: Requests crawling in practice (5 examples)
JD product page
```python
import requests

url = 'https://item.yiyaojd.com/100006214779.html#crumb-wrap'
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print('Wrong')
```
The fetched content looks wrong; JD redirects anonymous requests to its login page:
```
<script>window.location.href='https://passport.jd.com/uc/login?ReturnUrl=http%3A%2F%2Fitem.jd.com%2F100006214779.html'</script>
```
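A commonly suggested workaround (an assumption on my part, not from the course) is to send a browser-like User-Agent plus the Cookie of a logged-in browser session; the values below are placeholders:
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': '<paste cookies from a logged-in jd.com session>',  # placeholder, not a real cookie
}
r = requests.get('https://item.jd.com/100006214779.html', headers=headers)
print(r.status_code, len(r.text))
```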
Crawling an Amazon page
```python
import requests

url = 'https://www.amazon.cn/gp/product/B01CDEXB56/ref=cn_ags_s9_asin?pf_rd_p=33e63d50-addd-4d44-a917-c9479c457e1a&pf_rd_s=merchandised-search-3&pf_rd_t=101&pf_rd_i=1403206071&pf_rd_m=A1AJ19PSB66TGU&pf_rd_r=MPPKY8VGVD6X9PDDPMSB&pf_rd_r=MPPKY8VGVD6X9PDDPMSB&pf_rd_p=33e63d50-addd-4d44-a917-c9479c457e1a&ref=cn_ags_s9_asin_1403206071_merchandised-search-3'
try:
    # add a browser-like User-Agent header
    headers = {'user-agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except:
    print(r.status_code)
```
Without the custom header, the request identifies itself as python-requests:
```python
r.request.headers
# {'User-Agent': 'python-requests/2.23.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
```
Submitting a search keyword to Baidu/360
```python
# Baidu search
import requests

keyword = 'python'
try:
    params = {'wd': keyword}
    r = requests.get('http://www.baidu.com/s', params=params)
    print(r.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print('crawl failed')

# 360 search
keyword = 'python'
try:
    params = {'q': keyword}
    r = requests.get('http://www.so.com/s', params=params)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print('crawl failed')
```
Fetching and saving an image
```python
import requests

try:
    headers = {'user-agent': 'Mozilla/5.0'}
    url = 'http://placekitten.com/200/300'
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    cat_img = r.content                    # binary content of the response
    with open('c:/Users/Administrator/desktop/cat_200_300.jpg', 'wb') as f:
        f.write(cat_img)
except:
    print('something went wrong')
```
Automatic lookup of an IP address's home location
```python
# Kept for reference; this version still has problems
import requests

def ipsearch(keyword):
    try:
        headers = {'user-agent': 'Mozilla/5.0'}
        params = {'ip': keyword, 'action': '2'}
        url = 'https://ip138.com/iplookup.asp'
        r = requests.get(url, params=params, headers=headers)
        r.raise_for_status()
        print(r.text[-500:])
    except:
        print('something went wrong')

if __name__ == '__main__':
    keyword = '218.75.83.182'
    ipsearch(keyword)
```
【Week 2】Web crawling: extraction
```python
import requests
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())   # soup.prettify() pretty-prints the tag tree
```
Basic elements of BeautifulSoup
The BeautifulSoup library parses, traverses and maintains a "tag tree".
https://www.jianshu.com/p/424e037c5dd8
The page above explains this clearly.
BeautifulSoup parsers
| Parser | Usage | Requirement |
|---|---|---|
| bs4's HTML parser | BeautifulSoup(mk, 'html.parser') | install bs4 |
| lxml's HTML parser | BeautifulSoup(mk, 'lxml') | pip install lxml |
| lxml's XML parser | BeautifulSoup(mk, 'xml') | pip install lxml |
| html5lib parser | BeautifulSoup(mk, 'html5lib') | pip install html5lib |
Basic elements of the BeautifulSoup class
| Element | Description |
|---|---|
| Tag | A tag, the basic unit of information, delimited by <> and </>. A Tag has two important attributes: name and attrs |
| Name | The tag's name; the name of <p>…</p> is 'p'; accessed as <tag>.name |
| Attributes | The tag's attributes, organized as a dict; accessed as <tag>.attrs |
| NavigableString | The non-attribute string inside a tag, i.e. the text between <> and </>; accessed as <tag>.string |
| Comment | A comment string inside a tag; a special Comment type |
Three ways to traverse the tag tree:
- downward traversal;
- upward traversal;
- sideways (sibling) traversal.
Traversing HTML content with bs4
Downward traversal of the tag tree
| Attribute | Description |
|---|---|
| .contents | List of child nodes; all children of <tag> stored in a list |
| .children | Iterator over child nodes, similar to .contents, for looping over children |
| .descendants | Iterator over all descendant nodes, for looping over the whole subtree |
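A small example of downward traversal on the demo page used above:
```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
soup = BeautifulSoup(r.text, 'html.parser')

print(soup.head.contents)           # list of the <head> tag's children
for child in soup.body.children:    # direct children of <body>; text nodes print as None
    print(child.name)
```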
Upward traversal of the tag tree
| Attribute | Description |
|---|---|
| .parent | The node's parent tag |
| .parents | Iterator over the node's ancestor tags, for looping over ancestors |
```python
import requests
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# output:
# p
# body
# html
# [document]
```
Sideways (sibling) traversal of the tag tree
| Attribute | Description |
|---|---|
| .next_sibling | The next sibling node in HTML document order |
| .previous_sibling | The previous sibling node in HTML document order |
| .next_siblings | Iterator over all following sibling nodes in document order |
| .previous_siblings | Iterator over all preceding sibling nodes in document order |
soup.a.next_sibling
soup.a.previous_sibling
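Note that siblings include NavigableString nodes (plain text between tags), so a type check is often useful. A small sketch on the same demo page:
```python
import bs4
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://python123.io/ws/demo.html').text, 'html.parser')
for sibling in soup.a.next_siblings:            # siblings can be text nodes, not only tags
    if isinstance(sibling, bs4.element.Tag):
        print(sibling.name, sibling.attrs)
```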
The full set of search methods on a soup/tag object (the camelCase names are legacy aliases of the corresponding snake_case methods):
soup.find, soup.findAll, soup.findAllNext, soup.findAllPrevious, soup.findChild, soup.findChildren, soup.findNext, soup.findNextSibling, soup.findNextSiblings, soup.findParent, soup.findParents, soup.findPrevious, soup.findPreviousSibling, soup.findPreviousSiblings, soup.find_all, soup.find_all_next, soup.find_all_previous, soup.find_next, soup.find_next_sibling, soup.find_next_siblings, soup.find_parent, soup.find_parents, soup.find_previous, soup.find_previous_sibling, soup.find_previous_siblings
Formatted HTML output with bs4 (prettify)
Information organization and extraction
#### Three ways of marking information
XML, JSON, YAML
JSON
"key": "value"
"key": ["value1", "value2"]
"key": {"subkey": "subvalue"}
General approach to information extraction
Example: extract all URL links from a page
```python
import requests
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
```
Content search with bs4: find_all
<>.find_all(name, attrs, recursive, string, **kwargs)
soup('a') is equivalent to soup.find_all('a')
soup(id='link1') is equivalent to soup.find_all(id='link1')
```python
import re

soup.find_all(['a', 'b'])                    # pass a list to search for several tag names
for tag in soup.find_all(True):              # True matches every tag
    print(tag.name)
for tag in soup.find_all(re.compile('b')):   # regex: tag names containing 'b' (b, body)
    print(tag)
for s in soup.find_all(string=re.compile('python')):   # search strings instead of tags
    print(s)
soup.find_all(id=re.compile('link'))         # attribute value matching a regex
soup.find_all('p', 'course')                 # <p> tags with class "course"
soup.find_all(id='link1')
```
Other soup accessors
```python
soup.title
soup.a.name
soup.a.parent.name
soup.a.parent.parent.name
soup.a.attrs
soup.prettify()           # pretty-print the whole document
soup.a.string             # bs4 uses .string for a tag's text
soup(class_='course')     # the HTML class attribute is queried as class_ (trailing underscore)
```
soup.a.text vs soup.a.string (for a simple tag they give the same result)
In general .text is the more robust choice.
https://zhuanlan.zhihu.com/p/30911642
The page above explains the difference between .text and .string.
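The difference shows up when a tag has more than one child: .string becomes None, while .text concatenates all descendant text. A tiny sketch:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')
print(soup.p.string)   # None: <p> has two children (text + <b>), so .string is undefined
print(soup.p.text)     # 'Hello world': all descendant strings concatenated
print(soup.b.string)   # 'world': a single child, so .string works
```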
The lxml library
```python
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">第一个</a></li>
        <li class="aaa item-1"><a href="link2.html">second item</a></li>
        <li class="item-0"><a href="link5.html">a属性</a>
    </ul>
</div>
'''
html = etree.HTML(text, etree.HTMLParser())    # parse the HTML text
result = html.xpath('//li/a/text()')           # returns a list
print(result)    # ['第一个', 'second item', 'a属性']
result = html.xpath('//li[contains(@class,"aaa")]/a/text()')
# contains() matches part of an attribute value; the value must be in double quotes ""
result = html.xpath('//li[@class="aaa item-1"]//text()')
print(result)    # ['second item']
```
```python
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">第一个</a></li>
        <li class="aaa item-1"><a href="link2.html">second item</a></li>
        <li class="item-0"><a href="link5.html">a属性</a>
    </ul>
</div>
'''
parse_etree = etree.HTML(text, etree.HTMLParser())
dd_xpath = parse_etree.xpath('//ul')
for i in dd_xpath:
    # note: '//li/a/text()' would search from the document root;
    # use the relative path './/li/a/text()' to search within each <ul>
    print(i.xpath('.//li/a/text()'))
```
```python
# lxml + XPath on a small sample document
from lxml import etree

text = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister1" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
xpath = '//p/a[@class="sister1"]/text()'   # note: the value inside [] needs double quotes
html = etree.HTML(text, etree.HTMLParser())
result = html.xpath(xpath)
print(result)    # ['Lacie']
```
```python
# Two ways to get the text of matched elements
# (these absolute paths assume the <div><ul><li> sample text from above)
html = etree.HTML(text)
html_data = html.xpath('/html/body/div/ul/li/a')          # match the <a> elements themselves
for i in html_data:
    print(i.text)                                         # .text on each element

html = etree.HTML(text)
html_data = html.xpath('/html/body/div/ul/li/a/text()')   # text() inside the XPath
for i in html_data:
    print(i)
```
XPath expressions
```python
xpath = '//p/a/@href'        # href attribute of <a> under <p>; note (2021-01-29): some sources write // before @href, but a single / also works, since @href is an attribute rather than a child node
xpath = '//p/a/text()'
xpath = '//*/a/text()'                        # * is a wildcard
xpath = '//p/a[@class="sister"]/text()'       # attribute values must use double quotes ""
xpath = '//p/a[2]//@href'                     # the second <a>
xpath = '//p/a[last()]//@href'                # the last <a>
xpath = '//p/a[last()-1]//@href'              # the second-to-last <a>
xpath = '//li[contains(@class,"inac")]/a/text()'    # class attribute containing a fragment
xpath = '//div|//table'                             # <div> or <table>
xpath = '//div[contains(@id,"ma") and contains(@id,"in")]'            # <div> whose id contains both "ma" and "in"
xpath = '//li[contains(@class,"item-0") and @name="one"]/a/text()'    # combine conditions with and
xpath = '//a[contains(text(),"sec")]//@href'                          # href of <a> whose text contains "sec"
xpath = '//li[contains(@class,"item")][position()=1]/a/text()'        # first matching <li>
xpath = '//li[contains(@class,"item")][position()<=2]/a/text()'       # first two matching <li>
xpath = '//a[@href="https://hao.360.cn/?a1004"]/../@class'            # class of the parent of the matching <a>
```
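A quick self-contained check of a few of these expressions, using a trimmed version of the "Dormouse" sample from above:
```python
from lxml import etree

text = '''<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister1" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>'''
html = etree.HTML(text)
print(html.xpath('//p/a/@href'))                          # all three href values
print(html.xpath('//p/a[last()]/@href'))                  # ['http://example.com/tillie']
print(html.xpath('//p/a[contains(text(),"Lac")]/@href'))  # ['http://example.com/lacie']
```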
The XPath summary is above; the examples below are kept for backup reference.
```python
from lxml import etree

text = '''
<div>
    <ul>
        <li class="sp item-0" name="one"><a href="www.baidu.com">baidu</a>
        <li class="sp item-1" name="two"><a href="https://blog.csdn.net/qq_25343557">myblog</a>
        <li class="sp item-2" name="two"><a href="https://www.csdn.net/">csdn</a>
        <li class="sp item-3" name="four"><a href="https://hao.360.cn/?a1004">hao123</a>
'''
html = etree.HTML(text)   # lxml tolerates the unclosed tags above

result = html.xpath('//li[1]/ancestor::*')            # ancestor:: selects all ancestor nodes; * means any node
print(result)
result = html.xpath('//li[1]/ancestor::div')          # only <div> ancestors
print(result)
result = html.xpath('//li[1]/ancestor-or-self::*')    # all ancestors plus the node itself
print(result)
result = html.xpath('//li[1]/attribute::*')           # all attribute values of the current node
print(result)
result = html.xpath('//li[1]/attribute::name')        # only the value of the name attribute
print(result)
result = html.xpath('//ul/child::*')                  # child:: selects all direct children of <ul>
print(result)
result = html.xpath('//ul/child::li[@name="two"]')    # direct <li> children whose name attribute is "two"
print(result)
result = html.xpath('//ul/descendant::*')             # descendant:: selects all descendants (children, grandchildren, ...)
print(result)
result = html.xpath('//ul/descendant::a/text()')      # text of every descendant <a>
print(result)
result = html.xpath('//li[1]/following::*')           # following:: all nodes after the closing tag of the first <li>
print(result)
result = html.xpath('//li[1]/following-sibling::*')   # following-sibling:: later siblings of the first <li>
print(result)
result = html.xpath('//li[1]/parent::*')              # the parent node (exactly one, unlike ancestors)
print(result)
result = html.xpath('//li[3]/preceding::*')           # preceding:: sibling nodes before the third <li>'s start tag and their descendants
print(result)
result = html.xpath('//li[3]/preceding-sibling::*')   # preceding-sibling:: earlier siblings of the third <li>
print(result)
result = html.xpath('//li[3]/self::*')                # the node itself
print(result)
```
Differences between bs4 and lxml
```python
# bs4
import re
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.python123.io/ws/demo.html'
res = requests.get(url)
soup = bs(res.text, 'html.parser')
for i in soup('a'):
    print(i.string)                              # bs4 uses .string for text
# for i in soup('a', class_='py2'):              # the class attribute needs the trailing underscore
# for i in soup('a', class_=re.compile('py')):   # regex matching via the re module
```
```python
# lxml
import requests
from lxml import etree

url = 'https://www.python123.io/ws/demo.html'
res = requests.get(url)
html = etree.HTML(res.text, etree.HTMLParser())
xpath = '//*/a'
# xpath = '//a[@class="py2"]'               # attribute values inside XPath need double quotes
# xpath = '//a[contains(@class,"py2")]'     # partial matching via contains()
text1 = html.xpath(xpath)
for i in text1:
    print(i.text)                           # lxml uses .text for text
```
Summary of the differences between bs4 and lxml:
1. Getting text: bs4 uses .string, lxml uses .text;
2. Pattern matching: bs4 uses re, lxml uses XPath.
CDATA on government websites
<![CDATA[ ]]> is XML syntax: everything inside a CDATA section is ignored by the parser.
If the text contains many "<" or "&" characters, as program code does, it is best to wrap it in a CDATA section.
Approach: find the key tag, apply a regex, then feed the fragment back into BeautifulSoup:
```python
import re
import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.nbrd.gov.cn/col/col3422/index.html')
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'lxml')
for link in soup('div'):
    for i in re.findall(r'<li>(.*?)</li>', str(link)):
        if len(i) > 0:
            soup2 = BeautifulSoup(i, 'lxml')
            for string in soup2('a'):
                print(string)
```
Using re to crawl data from the Xiangshan People's Congress site
```python
import re
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0'}
response = requests.get('http://rd.xiangshan.gov.cn/col/col115466/index.html', headers=headers)
response.encoding = 'utf-8'
html = response.text
# the list data sits between the "headers" and "year" markers in the page's embedded script
res = re.findall(r'headers(.*?)year', html, re.I | re.M | re.S)
# strip the site-specific markup around each entry (patterns as recorded in the original notes)
resNew = [i.replace("[i]='·", '').replace("';", '') for i in res]
print(resNew)
```
Example 1: crawler for the Chinese university ranking
```python
import bs4
import requests
from bs4 import BeautifulSoup

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, 'html.parser')
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):      # skip NavigableString children
            tds = tr('td')
            ulist.append([tds[0].string.strip(), tds[1].a.string.strip(), tds[2].string.strip(),
                          tds[3].string.strip(), tds[4].string.strip(), tds[5].string.strip()])

def printUnivList(ulist, num):
    tplt = "{0:^5} {1:^15} {2:^10} {3} {4} {5}"
    print(tplt.format('No.', 'Rank', 'University', 'Province', 'Score', 'Level'))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], u[3], u[4], u[5]))

def main():
    uinfo = []
    url = 'http://www.shanghairanking.cn/rankings/bcur/2020'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)    # top 20 universities

main()
```
【Week 3】Web crawling in practice
Unit 7: Introduction to the Re (regular expression) library
Main functions of the Re library
1. re.search(pattern, string, flags=0)
flags: re.I (ignore case), re.M (multi-line: ^ matches at the start of every line), re.S (dot also matches newline)
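A small sketch of what the flags change:
```python
import re

s = 'Python\npython'
print(re.findall('^python', s))               # []: ^ only matches the very start, and the case differs
print(re.findall('^python', s, re.I))         # ['Python']: re.I ignores case
print(re.findall('^python', s, re.M))         # ['python']: re.M lets ^ match at each line start
print(re.findall('Python.python', s, re.S))   # ['Python\npython']: re.S lets . match the newline
```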
```python
import re

match = re.search(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))    # 100081
```
```python
import re

content = 'Hello 123456789 Word_This is just a test 666 Test'
result = re.search(r'(\d+).*?(\d+).*', content)
print(result)
print(result.group())     # same as result.group(0): the whole match
print(result.groups())
print(result.group(1))
print(result.group(2))
# <_sre.SRE_Match object; span=(6, 49), match='123456789 Word_This is just a test 666 Test'>
# 123456789 Word_This is just a test 666 Test
# ('123456789', '666')
# 123456789
# 666
```
```python
import re

S = re.search(r"(?P<name>[a-z]*)(?P<age>\d+)", "zhouke18kelaiji19keke16")
print(S.group())          # zhouke18
print(S.group("name"))    # named group: zhouke
print(S.group("age"))     # 18
```
2. re.match(pattern, string, flags=0)
3. re.findall(pattern, string, flags=0)
```python
import re
ls = re.findall(r'[1-9]\d{5}', '123456 BIT 987654 TES')   # returns a list
```
4. re.split(pattern, string, maxsplit=0, flags=0)
```python
import re
ls = re.split(r'[1-9]\d{5}', '123456 BIT 987654 TES')
```
5. re.finditer(pattern, string, flags=0)
```python
import re
for m in re.finditer(r'[1-9]\d{5}', '123456 BIT 987654 TES'):
    if m:
        print(m.group(0))
```
6. re.sub(pattern, repl, string, count=0, flags=0)
```python
import re
print(re.sub(r'[1-9]\d{5}', ':zipcode', '123456 BIT 987654 TES'))
```
Using re.compile
```python
import re
pat = re.compile(r'[1-9]\d{5}')      # compile once, then reuse with search/match/findall/...
ps = pat.search('123456 BIT 987654 TES')
```
Unit 8: Example 2 – Taobao product price-comparison crawler
(This one did not succeed.)
```python
# coding: utf-8
import re
import requests

def get_html(url):
    """Fetch the page source."""
    try:
        r = requests.get(url=url, timeout=10)
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("fetch failed")

def get_data(html, goodlist):
    """Parse product names and prices with re.
    tlist: list of product names
    plist: list of product prices"""
    tlist = re.findall(r'"raw_title":".*?"', html)
    plist = re.findall(r'"view_price":"[\d.]*"', html)
    for i in range(len(tlist)):
        title = eval(tlist[i].split(':')[1])     # eval() strips the surrounding quotes
        price = eval(plist[i].split(':')[1])
        goodlist.append([title, price])

def write_data(list, num):
    for i in range(num):                          # num controls how many items are written out
        u = list[i]
        with open('c:/users/Administrator/desktop/taob.txt', 'a') as data:
            print(u, file=data)

def main():
    goods = '水杯'          # search keyword ("water cup")
    depth = 3               # crawl depth, i.e. number of result pages
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)   # Taobao shows 44 items per page: s=0,44,88,...
            html = get_html(url)
            get_data(html, infoList)
        except:
            continue
    write_data(infoList, len(infoList))

if __name__ == '__main__':
    main()
```
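The likely reason this fails is that Taobao's search now redirects anonymous requests to a login page, so the raw_title/view_price fields never appear in the HTML. A commonly suggested workaround (an assumption, not verified here) is to copy the Cookie and User-Agent of a logged-in browser session into the request headers:
```python
import requests

# Hypothetical fix: reuse a logged-in browser session's headers (values below are placeholders)
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': '<cookies copied from a logged-in taobao.com session>',
}
r = requests.get('https://s.taobao.com/search?q=' + '水杯', headers=headers, timeout=10)
print(len(r.text))
```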
Unit 9: Example 3 – targeted crawler for stock data
```python
import re
import traceback

import requests
from bs4 import BeautifulSoup

def getHtmlText(url, code='utf-8'):                 # fetch a page
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    # lst: list that collects stock codes; stockURL: the page listing all stocks
    html = getHtmlText(stockURL, 'GB2312')          # East Money uses GB2312, so pass the encoding directly
    soup = BeautifulSoup(html, 'html.parser')       # parse the page and find all <a> tags
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']                  # take the link from each href attribute
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])   # sh/sz prefixes: Shanghai/Shenzhen codes
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    # lst: stock codes; stockURL: per-stock info site; fpath: output file path
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"            # site prefix + stock code = per-stock page
        html = getHtmlText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
            count = count + 1
            print('\rProgress: {:.2f}%'.format(count * 100 / len(lst)), end='')
        except:
            count = count + 1
            print('\rProgress: {:.2f}%'.format(count * 100 / len(lst)), end='')
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'C:/BaiduStockInfo.txt'
    slist = []                                       # stores the stock codes
    getStockList(slist, stock_list_url)              # get the stock list
    getStockInfo(slist, stock_info_url, output_file) # look up each stock and save to the local file

main()
```
Unit 9.1: crawling hospital names with re (worked on 2021-08-10)
```python
import re
import requests

url = "https://yyk.99.com.cn/sanjia/shanghai/"
# pretend to be a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0'}
res = requests.get(url, headers=headers)
if res.status_code == 200:
    # 1. get the page source
    raw_text = res.text
    # 2. first pass: a regex matches a single line by default, so add re.S (DOTALL) for multi-line source;
    #    re.findall() returns a list
    re_res = re.findall(r'<div class="province-box">(.*)<div class="wrap-right">', raw_text, re.M | re.S)
    # second pass over re_res[0]: hospital names end with 院/心/部
    res = re.findall(r'title="(.*?[院心部])"', re_res[0])
    # inspect what was captured
    print(res)
    # write the names to a file
    read = open("上海医院名单", "w", encoding='utf-8')
    for i in res:
        read.write(i)
        read.write("\n")
    read.close()
else:
    print("error")
```
Using a Python web crawler to see which movies are showing in cinemas soon
【Week 4】Web crawling frameworks
Unit 10: The Scrapy crawler framework
Unit 11: Basic use of Scrapy
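The notes name Scrapy but contain no Scrapy code; for orientation, a minimal spider might look like the sketch below (hypothetical file/spider names; run it with `scrapy runspider demo_spider.py -o out.json`):
```python
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        # response.css()/response.xpath() are Scrapy's built-in selectors
        for a in response.css('a'):
            yield {
                'text': a.css('::text').get(),
                'href': a.attrib.get('href'),
            }
```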
Unit 12: Example 4 – stock data crawler with Scrapy
(The code below is actually a requests + lxml spider for upcoming movies on Maoyan, kept here as Week 4 practice.)
```python
import random
import time

import requests
from fake_useragent import UserAgent
from lxml import etree

class MaoyanSpider(object):
    def __init__(self):
        self.url = 'https://maoyan.com/films?showType=2&offset={}'
        ua = UserAgent(verify_ssl=False)
        self.headers = {'User-Agent': ua.random}   # a random browser User-Agent
        # page counter
        self.page = 1

    # fetch a page
    def get_page(self, url):
        res = requests.get(url, headers=self.headers)
        res.encoding = 'utf-8'
        html = res.text
        self.parse_page(html)

    # parse a page
    def parse_page(self, html):
        # build the parse tree
        parse_html = etree.HTML(html)
        # base XPath: list of <dd> node objects
        dd_list = parse_html.xpath('//dl[@class="movie-list"]//dd')
        print(len(dd_list))
        # visit each node object and extract the data
        for dd in dd_list:
            name = dd.xpath('.//div[@class="movie-hover-title"]//span[@class="name noscore"]/text()')[0].strip()
            star = dd.xpath('.//div[@class="movie-hover-info"]//div[@class="movie-hover-title"][3]/text()')[1].strip()
            type = dd.xpath('.//div[@class="movie-hover-info"]//div[@class="movie-hover-title"][2]/text()')[1].strip()
            dowld = dd.xpath('.//div[@class="movie-item-hover"]/a/@href')[0].strip()
            movie = ('【Coming soon】\n'
                     'Title: %s\n'
                     'Stars: %s\n'
                     'Genre: %s\n'
                     'Link: https://maoyan.com%s\n'
                     '=========================================================') % (name, star, type, dowld)
            print(movie)

    # main loop
    def main(self):
        for offset in range(0, 90, 30):
            url = self.url.format(str(offset))
            self.get_page(url)
            print(url)
            print('page %d done' % self.page)
            time.sleep(random.randint(1, 3))   # random delay between requests
            self.page += 1

if __name__ == '__main__':
    spider = MaoyanSpider()
    spider.main()
```
```python
import requests
from bs4 import BeautifulSoup

url = 'http://www.shanghairanking.cn/rankings/bcur/2020'
r = requests.get(url)
r.encoding = r.apparent_encoding
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
for link in soup('a'):
    print(link.string)
```
Temporary: list the top-100 universities
```python
import bs4
import requests
from bs4 import BeautifulSoup

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, 'html.parser')
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[2].string])

def printUnivList(ulist, num):
    tplt = "{0:^10} {1:^6} {2:^10}"
    print(tplt.format('Rank', 'University', 'Score'))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2]))

def main():
    uinfo = []
    url = 'http://www.shanghairanking.cn/rankings/bcur/2020'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)    # top 20 universities

main()
```
Python网络爬虫从入门到精通 (book: "Python Web Crawling from Beginner to Master")
20210128
```python
# hello.py — print "The Zen of Python" from PEP 20
# using lxml.etree
import lxml.etree        # import lxml.etree explicitly; importing lxml alone is not enough
import requests

url = 'https://www.python.org/dev/peps/pep-0020/'
res = requests.get(url)
ht = lxml.etree.HTML(res.text, lxml.etree.HTMLParser())
xpath = '//*[@id="the-zen-of-python"]/pre/text()'
text = ht.xpath(xpath)
print(''.join(text))
```
```python
# lxml.html also works
import lxml.html
import requests

url = 'https://www.python.org/dev/peps/pep-0020/'
res = requests.get(url)
ht = lxml.html.fromstring(res.text)
xpath = '//*[@id="the-zen-of-python"]/pre/text()'
text = ht.xpath(xpath)
print(''.join(text))
```
```python
# with bs4
import requests
from bs4 import BeautifulSoup

url = 'https://www.python.org/dev/peps/pep-0020/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
for i in soup.pre.strings:      # or: print(soup.find('pre').string)
    print(i)
```