An In-Depth Look at Python's urllib2 Module

2024-03-25 18:53:03 38

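The Python standard library contains many useful tools, but its documentation often leaves the details of using them unclear. The HTTP client library urllib2 is one example. This article collects some practical details of working with urllib2.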

Proxy settings
Timeout settings
Adding specific headers to an HTTP request
Redirect
Cookie
Using the HTTP PUT and DELETE methods
Getting the HTTP status code
Debug log

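Proxy settings

By default, urllib2 uses the http_proxy environment variable to configure its HTTP proxy. To control the proxy explicitly in your program, independent of the environment, you can do the following: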

import urllib2
enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)

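One detail to note: urllib2.install_opener() sets the global opener for urllib2. That makes later calls convenient, but it rules out finer-grained control, such as using two different proxy settings in the same program. A better practice is to leave the global setting alone and call the opener's open method directly instead of the global urlopen.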
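As a minimal sketch of that practice, the snippet below keeps two independent openers side by side and never calls install_opener. The helper name fetch is hypothetical, the proxy address is the placeholder used earlier, and the import fallback to urllib.request (urllib2's successor in Python 3) is an addition so the sketch can run on either version.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 successor (assumption, not in the original)

# Placeholder proxy address, for illustration only.
proxied_opener = urllib2.build_opener(
    urllib2.ProxyHandler({'http': 'http://some-proxy.com:8080'}))

# An empty mapping means "no proxy", regardless of the http_proxy variable.
direct_opener = urllib2.build_opener(urllib2.ProxyHandler({}))


def fetch(opener, url, timeout=10):
    """Fetch url through the given opener; the global opener is left untouched."""
    response = opener.open(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()
```

Each call then picks its opener explicitly, for example fetch(proxied_opener, url) for one request and fetch(direct_opener, url) for another.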

Timeout settings

In older versions of Python, the urllib2 API did not expose a timeout setting. The only way to set a timeout was to change the global socket timeout.

import socket
import urllib2

socket.setdefaulttimeout(10)  # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way to do the same thing

Since Python 2.6, the timeout can be set directly through the timeout parameter of urllib2.urlopen().

import urllib2
response = urllib2.urlopen('http://www.google.com', timeout=10)
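Because socket.setdefaulttimeout() changes a process-wide value, code that has to use it may want to restore the previous value afterwards. The following is only a sketch of one way to do that; the helper name default_socket_timeout is hypothetical.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_socket_timeout(seconds):
    """Temporarily change the process-wide default socket timeout."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore the earlier value even if the body raised an exception.
        socket.setdefaulttimeout(previous)
```

Sockets created inside a `with default_socket_timeout(10):` block get a 10 second timeout, and the earlier default applies again once the block exits. The setting is still global while the block runs, so other threads are affected during that time.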

Adding specific headers to an HTTP request

To add a header, use a Request object:

import urllib2
# uri is the target URL, defined elsewhere
request = urllib2.Request(uri)
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)

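Pay special attention to certain headers, because servers check them:

User-Agent: some servers or proxies use this value to decide whether a request comes from a browser.

Content-Type: when you use a REST interface, the server checks this value to decide how to parse the HTTP body. Common values include:

application/xml: used for XML RPC calls, such as RESTful/SOAP calls
application/json: used for JSON RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form

When you call a RESTful or SOAP service, a wrong Content-Type can cause the server to reject the request.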
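To illustrate the Content-Type point, here is a small sketch that builds a JSON POST request. The helper name build_json_request, the URL, and the payload are all hypothetical, and the import fallback to urllib.request (urllib2's Python 3 successor) is an addition so the sketch runs on either version. Nothing is sent until the request is passed to urlopen or to an opener.

```python
import json

try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 successor (assumption, not in the original)


def build_json_request(url, payload):
    """Build a POST request whose body is JSON and is labelled as such."""
    body = json.dumps(payload).encode('utf-8')
    request = urllib2.Request(url, data=body)  # supplying data makes it a POST
    request.add_header('Content-Type', 'application/json')
    return request


# Hypothetical endpoint and payload, for illustration only.
request = build_json_request('http://www.example.com/api', {'name': 'value'})
```

Note that add_header stores the header name in capitalized form, so it reads back as 'Content-type' through request.get_header().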

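Redirect

By default, urllib2 automatically follows redirects for HTTP 3XX status codes, with no configuration needed. To detect whether a redirect happened, check whether the URL of the response matches the URL of the request.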

import urllib2
response = urllib2.urlopen('http://www.google.cn')
# A redirect happened if the final URL differs from the requested one.
redirected = response.geturl() != 'http://www.google.cn'

If you do not want automatic redirects, you can either drop down to the lower-level httplib library or define a custom HTTPRedirectHandler class.

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Returning None (via pass) means the redirect is not followed.
    def http_error_301(self, req, fp, code, msg, headers):
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        pass

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://www.google.cn')
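A possible alternative, sketched here as an assumption rather than the approach shown above, is to override redirect_request, the hook that HTTPRedirectHandler calls before following a redirect. Returning None declines the redirect, so the opener should fall back to its default error handling and raise HTTPError carrying the 3XX code. The names NoRedirectHandler and open_without_redirect are hypothetical, and the import fallback to urllib.request (urllib2's Python 3 successor) is an addition.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 successor (assumption, not in the original)


class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    """Decline every redirect instead of following it."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # None means "do not build a follow-up request"


# Passing the subclass replaces the default redirect handler in this opener.
opener = urllib2.build_opener(NoRedirectHandler)


def open_without_redirect(url):
    """Return (status_code, location); location is None unless redirected."""
    try:
        response = opener.open(url)
    except urllib2.HTTPError as error:
        if 300 <= error.code < 400:
            return error.code, error.headers.get('Location')
        raise
    try:
        return response.getcode(), None
    finally:
        response.close()
```

One hook covers all the redirect codes the handler deals with, so there is no need to override a separate method per status code.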

Cookie

urllib2 also handles cookies automatically. If you need the value of a particular cookie, you can do this:

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.google.com')
for item in cookie:
    if item.name == 'some_cookie_item_name':
        print item.value
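The lookup loop can be wrapped in a small helper so that a missing cookie yields a default instead of nothing. This is just a sketch: the name get_cookie_value is hypothetical, and the import fallback to http.cookiejar (the Python 3 name of cookielib) is an addition.

```python
try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3 name (assumption, not in the original)


def get_cookie_value(jar, name, default=None):
    """Return the value of the first cookie called name, or default."""
    for cookie in jar:
        if cookie.name == name:
            return cookie.value
    return default
```

After a request made through an opener built with HTTPCookieProcessor(jar), calling get_cookie_value(jar, 'some_cookie_item_name') returns the value if the server set that cookie and None otherwise.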

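Using the HTTP PUT and DELETE methods

urllib2 only supports the HTTP GET and POST methods out of the box. For PUT and DELETE you would normally have to use the lower-level httplib library. Even so, the following trick lets urllib2 send a PUT or DELETE request: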

import urllib2

# uri and data are the target URL and request body, defined elsewhere
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
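The same trick can be packaged as a Request subclass, which avoids patching each instance with a lambda. This is a sketch only; the class name MethodRequest and the example URL are hypothetical, and the import fallback to urllib.request (urllib2's Python 3 successor) is an addition.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 successor (assumption, not in the original)


class MethodRequest(urllib2.Request):
    """A Request that reports whatever HTTP verb it was constructed with."""

    def __init__(self, url, method, **kwargs):
        urllib2.Request.__init__(self, url, **kwargs)
        self._verb = method

    def get_method(self):
        # urllib2 calls get_method() to decide which verb to send.
        return self._verb


# Hypothetical resource URL, for illustration only.
delete_request = MethodRequest('http://www.example.com/items/1', 'DELETE')
put_request = MethodRequest('http://www.example.com/items/1', 'PUT', data=b'payload')
```

Either request can then be passed to urllib2.urlopen() or to an opener's open method like any other Request.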

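Getting the HTTP status code

For 200 OK, calling getcode() on the response object returned by urlopen gives you the HTTP status code. For status codes outside the 2XX range (once redirects have been handled), urlopen raises an exception instead, and you need to check the code attribute of the exception object: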

import urllib2
try:
    response = urllib2.urlopen('http://restrict.web.com')
except urllib2.HTTPError, e:
    print e.code
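Both cases can be folded into one helper that always returns the status code. The sketch below assumes a hypothetical function name, status_of, and takes the opening function as a parameter so it can be swapped for an opener's open method or for a stub. The import fallback to urllib.request (urllib2's Python 3 successor) is an addition.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 successor (assumption, not in the original)


def status_of(url, open_func=urllib2.urlopen):
    """Return the HTTP status code for url, whether or not urlopen raises."""
    try:
        response = open_func(url)
    except urllib2.HTTPError as error:
        return error.code  # error statuses arrive as exceptions
    try:
        return response.getcode()
    finally:
        response.close()
```

Other failures, such as a connection that cannot be established, are not HTTP statuses and still propagate as exceptions.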
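
Debug log

When using urllib2, you can turn on the debug log as shown below. The content of the packets sent and received is then printed to the screen, which makes debugging easier and can sometimes save you from capturing packets.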

import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)

urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')

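PS: Using urllib2 to scrape a site and generate RSS
Looking at the OsChina blog pages, I realized they could be scraped with Python. I remembered seeing someone generate RSS with PyRSS2Gen, a Python RSS module, so I could not resist trying it myself. Fortunately it worked, and I am sharing the code below.
First install the PyRSS2Gen and BeautifulSoup modules. A pip install is enough, so I will not go into detail.
Here is the code: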

# -*- coding: utf-8 -*-


from bs4 import BeautifulSoup
import urllib2

import datetime
import time
import PyRSS2Gen
from email.Utils import formatdate
import re
import sys
import os

# Python 2 only: make utf-8 the default for implicit str/unicode conversion.
reload(sys)
sys.setdefaultencoding('utf-8')


class RssSpider():
    def __init__(self):
        self.myrss = PyRSS2Gen.RSS2(title='OSChina',
                                    link='http://my.oschina.net',
                                    description=str(datetime.date.today()),
                                    pubDate=datetime.datetime.now(),
                                    lastBuildDate=datetime.datetime.now(),
                                    items=[]
                                    )
        self.xmlpath = r'/var/www/myrss/oschina.xml'

        self.baseurl = "http://www.oschina.net/blog"
        # if os.path.isfile(self.xmlpath):
        #     os.remove(self.xmlpath)

    def useragent(self, url):
        # Fetch url with browser-like headers and return the raw HTML.
        i_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                                   "Chrome/36.0.1985.125 Safari/537.36",
                     "Referer": 'http://baidu.com/'}
        req = urllib2.Request(url, headers=i_headers)
        html = urllib2.urlopen(req).read()
        return html

    def enterpage(self, url):
        # Build one RSS item from a single blog post page.
        # The pattern matches a timestamp of the form YYYY?MM?DD HH?MM.
        pattern = re.compile(r'\d{4}\S\d{2}\S\d{2}\s\d{2}\S\d{2}')
        rsp = self.useragent(url)
        soup = BeautifulSoup(rsp)
        timespan = soup.find('div', {'class': 'BlogStat'})
        timespan = str(timespan).strip().replace('\n', '').decode('utf-8')
        match = pattern.search(timespan)
        # Fall back to today's date when no timestamp is found.
        timestr = str(datetime.date.today())
        if match:
            timestr = match.group()
            # print timestr
        ititle = soup.title.string
        div = soup.find('div', {'class': 'BlogContent'})
        rss = PyRSS2Gen.RSSItem(
            title=ititle,
            link=url,
            description=str(div),
            pubDate=timestr
        )

        return rss

    def getcontent(self):
        # Walk the list of recent blog posts and collect an item for each.
        rsp = self.useragent(self.baseurl)
        soup = BeautifulSoup(rsp)
        ul = soup.find('div', {'id': 'RecentBlogs'})
        for li in ul.findAll('li'):
            div = li.find('div')
            if div is not None:
                alink = div.find('a')
                if alink is not None:
                    link = alink.get('href')
                    print link
                    html = self.enterpage(link)
                    self.myrss.items.append(html)

    def SaveRssFile(self, filename):
        # Note: filename is not used; the feed is always written to self.xmlpath.
        finallxml = self.myrss.to_xml(encoding='utf-8')
        with open(self.xmlpath, 'w') as f:
            f.write(finallxml)


if __name__ == '__main__':
    rssSpider = RssSpider()
    rssSpider.getcontent()
    rssSpider.SaveRssFile('oschina.xml')

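As you can see, the script mainly uses BeautifulSoup to scrape the site, then uses PyRSS2Gen to generate the RSS and save it as an XML file.
While I am at it, here is the address of the RSS feed I generate: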

http://104.224.129.109/myrss/oschina.xml

If you do not want to set this up yourself, just subscribe with feedly.
I run the script every 10 minutes.
