Scraping Ajax Dynamically Loaded Web Pages with Python: A Walkthrough

2023-08-15 06:10:04

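Common anti-scraping mechanisms and how to handle them

1. Header checks: Cookie, Referer, User-Agent

Solution: copy the request headers from the browser's developer tools (F12) and pass them to requests.get() through the headers argument

2. IP limits: the site tracks request frequency per IP address and blocks an IP that sends too many requests in a short time

Solution:

(1) Build your own proxy IP pool, pick a proxy at random for each request, and refresh the pool regularly

(2) Buy open or private proxy IPs

(3) Lower the crawl rate

3. User-Agent limits: similar to IP limits

Solution: build your own User-Agent pool and pick one at random for each request

4. Validation of query parameters or form data (salt, sign)

Solution: locate the JS file, work out how it computes the values, and reproduce the same computation in Python

5. Post-processing of the response content

Solution: print and inspect the response, then extract the data with XPath or regular expressions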
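The header, proxy, and rate-limiting measures above can be combined in one small helper. The following is a minimal sketch rather than a complete solution: the User-Agent pool, the proxy addresses, and the function names build_request_options and polite_delay are illustrative assumptions. The returned dictionary is shaped so that it could be passed on as requests.get(url, **options).

```python
import random
import time

# Hypothetical pools: replace them with values you maintain yourself.
UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
]
PROXY_POOL = ['http://127.0.0.1:8888', 'http://127.0.0.1:8889']  # placeholder addresses


def build_request_options(referer=None, use_proxy=False):
    """Return keyword arguments suitable for requests.get(url, **options)."""
    headers = {'User-Agent': random.choice(UA_POOL)}
    if referer:
        headers['Referer'] = referer
    options = {'headers': headers, 'timeout': 10}
    if use_proxy:
        proxy = random.choice(PROXY_POOL)
        options['proxies'] = {'http': proxy, 'https': proxy}
    return options


def polite_delay(low=1.0, high=3.0):
    """Sleep for a random interval to lower the request rate; return the delay."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

Calling polite_delay() between requests and build_request_options() before each one gives every request a randomly chosen User-Agent and, optionally, a randomly chosen proxy.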

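Using regular expressions to turn copied headers and form data into a dict

1. In PyCharm, open the Replace dialog with Ctrl+R and enable Regex

2. Convert the copied headers or form data

Search: (.*?): (.*)

Replace: "$1": "$2",

3. Click Replace All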
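The same conversion can be done in Python itself instead of the editor. This is a small sketch under assumptions: the function name headers_to_dict and the sample header text are made up for illustration, and the input is assumed to consist of plain "Key: value" lines as copied from the browser.

```python
import re

# Sample text in the form shown by the browser's developer tools.
RAW_HEADERS = """\
Accept: application/json
Referer: https://movie.douban.com/typerank
User-Agent: Mozilla/5.0
"""


def headers_to_dict(raw):
    """Turn 'Key: value' lines copied from the browser into a dict."""
    # Non-greedy key, so a colon inside the value (e.g. in a URL) is kept.
    pattern = re.compile(r'^(.*?):[ \t]*(.*)$', re.M)
    return {key.strip(): value.strip() for key, value in pattern.findall(raw)}


headers = headers_to_dict(RAW_HEADERS)
print(headers)
```

The key is matched non-greedily so that the split happens at the first colon; a greedy key would break values that themselves contain a colon followed by a space.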

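Scraping data from the Ministry of Civil Affairs website

Goal: scrape the latest administrative division codes of the People's Republic of China at and above the county level

URL: http://www.mca.gov.cn/article/sj/xzqh/2019/ (Civil Affairs Data > Administrative Division Codes)

Implementation steps

1. Extract the link to the latest administrative division codes from the Civil Affairs Data page

The newest entry is listed first. Titles follow the pattern: 2019年X月中华人民共和国县以上行政区划代码 (administrative division codes at and above the county level, month X of 2019)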

import requests
from lxml import etree
import re

url = 'http://www.mca.gov.cn/article/sj/xzqh/2019/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
html = requests.get(url, headers=headers).text
parse_html = etree.HTML(html)
article_list = parse_html.xpath('//a[@class="artitlelist"]')

for article in article_list:
    title = article.xpath('./@title')[0]
    # Keep only links whose title ends with '代码' ("codes")
    if title.endswith('代码'):
        # Stop at the first match: the first entry is always the newest link
        two_link = 'http://www.mca.gov.cn' + article.xpath('./@href')[0]
        print(two_link)
        break

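2. Extract the real link from the second-level page (anti-scraping measure: the response embeds JS that redirects to a new URL)

Request the second-level link and inspect the JS embedded in the response

Extract the real second-level link with a regular expression

# Fetch the second-level "fake" link
two_html = requests.get(two_link, headers=headers).text
# Extract the real link from the response (the address the embedded JS redirects to)
new_two_link = re.findall(r'window.location.href="(.*?)"', two_html, re.S)[0]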
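The extraction step can be tried offline against a saved response. The sketch below assumes a response body of roughly the shape described above; the sample HTML, the example.html address inside it, and the function name extract_redirect are hypothetical. Unlike indexing [0] on the findall result, it returns None when no redirect is present.

```python
import re

# Hypothetical response body; the real page embeds a similar script.
SAMPLE_HTML = '''
<html><head><script>
    window.location.href="http://www.mca.gov.cn/article/sj/xzqh/2019/example.html";
</script></head><body></body></html>
'''


def extract_redirect(html):
    """Return the URL assigned to window.location.href, or None if absent."""
    match = re.search(r'window\.location\.href\s*=\s*"(.*?)"', html, re.S)
    return match.group(1) if match else None


print(extract_redirect(SAMPLE_HTML))
print(extract_redirect('<html></html>'))
```

Escaping the dots and allowing whitespace around the equals sign makes the pattern slightly stricter about the property name and slightly more tolerant of formatting changes in the script.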

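3. Check the database to see whether this link has already been crawled, which makes the crawler incremental

Create a version table in the database to store the crawled links

On every run, compare against the records in the version table to see whether the link has been crawled before

cursor.execute('select * from version')
result = cursor.fetchall()
if result and result[-1][0] == two_link:
    print('Already up to date')
else:
    # There is an update: start scraping,
    # then insert the link into the version table
    pass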
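The incremental check can be tried without a MySQL server. The sketch below swaps in the standard-library sqlite3 module purely for demonstration (an assumption, not the article's setup, which uses pymysql and MySQL); the function name is_new_link and the example URL are also illustrative. Note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3


def is_new_link(conn, link):
    """Return True and record the link if it has not been crawled before."""
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS version (url TEXT PRIMARY KEY)')
    cur.execute('SELECT url FROM version WHERE url = ?', (link,))
    if cur.fetchone():
        return False
    # In a real spider, scrape the page here and record the link only on success.
    cur.execute('INSERT INTO version VALUES (?)', (link,))
    conn.commit()
    return True


conn = sqlite3.connect(':memory:')  # in-memory database, for demonstration only
print(is_new_link(conn, 'http://example.com/a.html'))  # first visit
print(is_new_link(conn, 'http://example.com/a.html'))  # second visit
```

Querying for the specific URL, rather than selecting the whole table and comparing the last row, keeps the check correct even when the table holds more than one record.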

4. Full implementation

import requests
from lxml import etree
import re
import pymysql


class GovementSpider(object):
    def __init__(self):
        self.url = 'http://www.mca.gov.cn/article/sj/xzqh/2019/'
        self.headers = {'User-Agent': 'Mozilla/5.0'}
        # Create the connection and cursor objects
        # (keyword arguments: PyMySQL 1.0+ no longer accepts positional ones)
        self.db = pymysql.connect(host='127.0.0.1', user='root', password='123456',
                                  database='govdb', charset='utf8')
        self.cursor = self.db.cursor()

    # Get the fake link
    def get_false_link(self):
        html = requests.get(url=self.url, headers=self.headers).text
        # The real second-level URL is hidden: it is generated by a JS script
        # inside the response of the fake link. The fake link opens fine in a
        # browser, but the content a crawler gets from it is not what we want.
        parse_html = etree.HTML(html)
        a_list = parse_html.xpath('//a[@class="artitlelist"]')
        false_link = None
        for a in a_list:
            # get() returns the value of an attribute
            title = a.get('title')
            if title and title.endswith('代码'):
                # Stop at the first match: the first entry is always the newest link
                false_link = 'http://www.mca.gov.cn' + a.get('href')
                print('Second-level "fake" link:', false_link)
                break
        # Extract the real link
        if false_link:
            self.incr_spider(false_link)

    # Incremental crawl
    def incr_spider(self, false_link):
        self.cursor.execute('select url from version where url=%s', [false_link])
        # fetchall returns e.g. (('http://xxxx.html',),)
        result = self.cursor.fetchall()

        # not result: this link is not in the version table yet
        if not result:
            self.get_true_link(false_link)
            # Optional: keep only the latest record in the version table
            self.cursor.execute('delete from version')

            # Insert the crawled url into the version table
            self.cursor.execute('insert into version values(%s)', [false_link])
            self.db.commit()
        else:
            print('Data is already up to date, nothing to crawl')

    # Get the real link
    def get_true_link(self, false_link):
        # Fetch the response of the fake link, then extract the real link from it
        html = requests.get(url=false_link, headers=self.headers).text
        # The real link is the address the embedded JS redirects to
        re_bds = r'window.location.href="(.*?)"'
        pattern = re.compile(re_bds, re.S)
        true_link = pattern.findall(html)[0]

        self.save_data(true_link)  # extract the data behind the real link

    # Extract the data directly with XPath
    def save_data(self, true_link):
        html = requests.get(url=true_link, headers=self.headers).text

        # Base XPath: the list of row nodes, one per record
        parse_html = etree.HTML(html)
        tr_list = parse_html.xpath('//tr[@height="19"]')
        for tr in tr_list:
            code = tr.xpath('./td[2]/text()')[0].strip()  # administrative division code
            name = tr.xpath('./td[3]/text()')[0].strip()  # unit name
            print(name, code)

    # Entry point
    def main(self):
        self.get_false_link()


if __name__ == '__main__':
    spider = GovementSpider()
    spider.main()

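Scraping dynamically loaded data (Ajax)

Characteristics

Right-click -> View Page Source shows no actual data

The data loads when you scroll the mouse wheel or perform some other action

How to scrape

Press F12 to open the developer tools, select the XHR filter in the Network panel, and perform the page action to capture the network requests

Get the URL of the JSON data from XHR --> Headers --> General --> Request URL

Get the query parameters from XHR --> Headers --> Query String Parameters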
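Once a Request URL has been copied from the XHR panel, it can be split into the base URL and the query parameters programmatically. The sketch below uses only the standard library; the URL is the Douban request analysed in the next section, and the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qsl

# A Request URL as copied from the XHR panel.
request_url = ('https://movie.douban.com/j/chart/top_list'
               '?type=11&interval_id=100%3A90&action=&start=0&limit=20')

parts = urlsplit(request_url)
base_url = f'{parts.scheme}://{parts.netloc}{parts.path}'
# keep_blank_values=True keeps empty parameters such as action=
params = dict(parse_qsl(parts.query, keep_blank_values=True))

print(base_url)
print(params)
```

The resulting dict has the same shape as the params argument of requests.get(), and percent-encoded values such as 100%3A90 are decoded back to 100:90.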

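Example: scraping Douban movie data

Goal

Address: Douban Movies > Rankings > Drama (剧情)

https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85&type=11&interval_id=100:90&action=

Goal: scrape movie titles and ratings

Capturing the requests with F12 (XHR)

1. Request URL (base URL): https://movie.douban.com/j/chart/top_list?

2. Query String Parameters

# The query parameters are:
type: 11              # movie type
interval_id: 100:90
action: '[{},{},{}]'
start: 0              # start index of the movies loaded by each request
limit: 20             # number of movies loaded per request

The JSON data is located at:

base URL + query parameters

'https://movie.douban.com/j/chart/top_list?' + 'type=11&interval_id=100%3A90&action=&start=20&limit=20'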
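The "base URL + query parameters" rule can be turned into a generator that yields one URL per page. This is a sketch under assumptions: the function name page_urls and the total of 45 are illustrative, and the action parameter is sent empty as in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/j/chart/top_list?'


def page_urls(movie_type, total, limit=20):
    """Yield one JSON URL per page of `limit` movies."""
    for start in range(0, total, limit):
        params = {
            'type': movie_type,
            'interval_id': '100:90',
            'action': '',
            'start': start,
            'limit': limit,
        }
        # urlencode percent-encodes the colon in 100:90 as %3A
        yield BASE_URL + urlencode(params)


urls = list(page_urls('11', total=45))
for url in urls:
    print(url)
```

Only start changes between pages, advancing by limit each time, so 45 movies need three requests with start values 0, 20, and 40.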

Implementation

import requests
import time
from fake_useragent import UserAgent


class DoubanSpider(object):
    def __init__(self):
        self.base_url = 'https://movie.douban.com/j/chart/top_list?'
        self.i = 0

    def get_html(self, params):
        headers = {'User-Agent': UserAgent().random}
        res = requests.get(url=self.base_url, params=params, headers=headers)
        res.encoding = 'utf-8'
        html = res.json()  # parse the JSON string into Python data
        self.parse_html(html)  # hand the data straight to the parser

    def parse_html(self, html):
        # html: [{movie 1}, {movie 2}, ...]
        for one in html:
            item = {}
            item['name'] = one['title']  # movie title
            item['score'] = one['score']  # rating
            item['time'] = one['release_date']  # release date
            # Print for inspection
            print(item)
            self.i += 1

    # Get the total number of movies of a type
    def get_total(self, typ):
        # Asynchronously loaded data can always be found among the XHR requests
        url = 'https://movie.douban.com/j/chart/top_list_count?type={}&interval_id=100%3A90'.format(typ)
        ua = UserAgent()
        html = requests.get(url=url, headers={'User-Agent': ua.random}).json()
        total = html['total']

        return total

    def main(self):
        typ = input('Enter a movie type (drama|comedy|action): ')
        typ_dict = {'drama': '11', 'comedy': '24', 'action': '5'}
        typ = typ_dict[typ]
        total = self.get_total(typ)  # total number of movies of this type

        for page in range(0, int(total), 20):
            params = {
                'type': typ,
                'interval_id': '100:90',
                'action': '',
                'start': str(page),
                'limit': '20'}
            self.get_html(params)
            time.sleep(1)
        print('Number of movies scraped:', self.i)


if __name__ == '__main__':
    spider = DoubanSpider()
    spider.main()

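Scraping Tencent job postings (Ajax)

Target URL and goal

URL: search for Tencent Careers (腾讯招聘) on Baidu and open the job listings: https://careers.tencent.com/search.html

Goal: job title, responsibilities, requirements

Requirements and analysis

The page source shows that all the required data is loaded dynamically via Ajax

Capture the network requests with F12 and analyse them

Data from the first-level page: job title

Data from the second-level page: responsibilities, requirements

First-level JSON URL (pageIndex changes; timestamp does not appear to be checked)

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1563912271089&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn

Second-level URL (postId changes; it is available from the first-level response)

https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1563912374645&postId={}&language=zh-cn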
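The two URL templates can be filled in by small helpers. The sketch below is illustrative: it assumes the 13-digit timestamp is a Unix time in milliseconds and generates a fresh one per request, which is optional since the captured value is reused in the article. The helper names millis, list_url, detail_url, and page_count are made up for this example.

```python
import math
import time

ONE_URL = ('https://careers.tencent.com/tencentcareer/api/post/Query'
           '?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId='
           '&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10'
           '&language=zh-cn&area=cn')
TWO_URL = ('https://careers.tencent.com/tencentcareer/api/post/ByPostId'
           '?timestamp={}&postId={}&language=zh-cn')


def millis():
    """Current Unix time in milliseconds (assumed format of the timestamp)."""
    return int(time.time() * 1000)


def list_url(page_index):
    """First-level URL for the given page of job postings."""
    return ONE_URL.format(millis(), page_index)


def detail_url(post_id):
    """Second-level URL for one posting."""
    return TWO_URL.format(millis(), post_id)


def page_count(total_posts, page_size=10):
    """Number of pages needed to list total_posts postings."""
    return math.ceil(total_posts / page_size)
```

Rounding up with math.ceil avoids requesting an extra, empty page when the number of postings is an exact multiple of the page size.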

The useragents.py file

ua_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
]

import time
import json
import random
import requests
from useragents import ua_list


class TencentSpider(object):
    def __init__(self):
        self.one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1563912271089&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
        self.two_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1563912374645&postId={}&language=zh-cn'
        # 'w' so that a rerun does not append a second JSON array to the file
        self.f = open('tencent.json', 'w', encoding='utf-8')
        self.item_list = []  # holds the scraped item dicts

    # Fetch a URL and return the parsed response
    def get_page(self, url):
        headers = {'User-Agent': random.choice(ua_list)}
        html = requests.get(url=url, headers=headers).text
        html = json.loads(html)  # JSON string to Python data
        return html

    # Main routine: collect all the data
    def parse_page(self, one_url):
        html = self.get_page(one_url)
        for job in html['Data']['Posts']:
            item = {}  # a new dict per job, so list entries do not overwrite each other
            item['name'] = job['RecruitPostName']  # job title
            post_id = job['PostId']  # needed to build the second-level URL
            # Build the second-level URL and fetch responsibilities and requirements
            two_url = self.two_url.format(post_id)
            item['duty'], item['require'] = self.parse_two_page(two_url)
            print(item)
            self.item_list.append(item)  # add to the overall list

    # Parse the second-level page
    def parse_two_page(self, two_url):
        html = self.get_page(two_url)
        duty = html['Data']['Responsibility']  # responsibilities
        duty = duty.replace('\r\n', '').replace('\n', '')  # strip line breaks
        require = html['Data']['Requirement']  # requirements
        require = require.replace('\r\n', '').replace('\n', '')  # strip line breaks
        return duty, require

    # Get the total number of pages
    def get_numbers(self):
        url = self.one_url.format(1)
        html = self.get_page(url)
        numbers = (int(html['Data']['Count']) + 9) // 10  # 10 postings per page, rounded up
        return numbers

    def main(self):
        number = self.get_numbers()
        # Only the first two pages are fetched here; use range(1, number + 1) for all pages
        for page in range(1, min(number, 2) + 1):
            one_url = self.one_url.format(page)
            self.parse_page(one_url)
        # Save to a local JSON file with json.dump
        json.dump(self.item_list, self.f, ensure_ascii=False)
        self.f.close()


if __name__ == '__main__':
    start = time.time()
    spider = TencentSpider()
    spider.main()
    end = time.time()
    print('Elapsed time: %.2f' % (end - start))

