Python Scrapy Framework: A Simple Example of the Generic CrawlSpider

2023-07-27 08:00:04

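This article walks through a simple example of using CrawlSpider, the generic spider class in the Python Scrapy framework. It is shared here for reference; the details are as follows: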

Step 01: Create the spider project

scrapy startproject quotes

Step 02: Create the spider from a template

scrapy genspider -t crawl quotes quotes.toscrape.com

Step 03: Configure the spider file quotes.py

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class Quotes(CrawlSpider):
    # Spider name, used on the command line with "scrapy crawl"
    name = "get_quotes"
    # Scrapy reads "allowed_domains"; a misspelled attribute is silently ignored
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    # Crawling rules
    rules = (
        # For quote listing page URLs, call parse_quotes,
        # and keep following the links extracted by this rule
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_quotes', follow=True),
        # For author detail page URLs, call parse_author to extract the data
        Rule(LinkExtractor(allow=r'/author/\w+'), callback='parse_author'),
    )

    # Extract the quote data from a listing page
    def parse_quotes(self, response):
        for quote in response.css(".quote"):
            yield {
                'content': quote.css('.text::text').extract_first(),
                'author': quote.css('.author::text').extract_first(),
                'tags': quote.css('.tag::text').extract(),
            }

    # Extract the author data from an author page
    def parse_author(self, response):
        name = response.css('.author-title::text').extract_first()
        author_born_date = response.css('.author-born-date::text').extract_first()
        author_born_location = response.css('.author-born-location::text').extract_first()
        author_description = response.css('.author-description::text').extract_first()

        return {
            'name': name,
            'author_born_date': author_born_date,
            'author_born_location': author_born_location,
            'author_description': author_description,
        }
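To see which links each rule would pick up, the two allow patterns can be tried out with the standard re module, without running a crawl. This is only a rough sketch, not how Scrapy is implemented internally: the helper pick_callback and the sample URLs are made up for illustration, and it assumes that an allow pattern is applied as a regular-expression search on the full URL and that, when several rules match a link, the first rule in the tuple is used.

```python
import re

# Patterns copied from the two Rule definitions in quotes.py.
PAGE_PATTERN = re.compile(r'/page/\d+')
AUTHOR_PATTERN = re.compile(r'/author/\w+')


def pick_callback(url):
    """Return the callback name of the first rule matching url, or None.

    Hypothetical helper: it only imitates the rule order of the spider.
    """
    if PAGE_PATTERN.search(url):
        return 'parse_quotes'
    if AUTHOR_PATTERN.search(url):
        return 'parse_author'
    return None


# Illustrative URLs only; they are not taken from a real crawl log.
SAMPLE_URLS = [
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/author/Albert-Einstein',
    'http://quotes.toscrape.com/tag/love/',
]

for url in SAMPLE_URLS:
    print(url, '->', pick_callback(url))
```

A URL that matches neither pattern, such as the tag page above, returns None here, which corresponds to a link that no rule extracts.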

Step 04: Run the spider

scrapy crawl get_quotes
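Because both callbacks return plain dictionaries, quote records and author records end up mixed in the same output. One possible way to separate them afterwards is sketched below, assuming the spider was run with a JSON Lines feed, for example by adding -o items.jl to the crawl command (the file name items.jl is an assumption). The function split_items and the sample lines are hypothetical; the only keys it relies on are 'content' and 'name', which the two callbacks above produce.

```python
import json


def split_items(lines):
    """Separate quote records from author records in a JSON Lines feed."""
    quotes, authors = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        if 'content' in item:
            # Dictionaries yielded by parse_quotes carry a 'content' key.
            quotes.append(item)
        elif 'name' in item:
            # Dictionaries returned by parse_author carry a 'name' key.
            authors.append(item)
    return quotes, authors


# Hypothetical sample lines standing in for a real feed file.
SAMPLE_FEED = [
    '{"content": "sample quote text", "author": "Sample Author", "tags": ["sample"]}',
    '{"name": "Sample Author", "author_description": "sample description"}',
]

quotes, authors = split_items(SAMPLE_FEED)
print(len(quotes), 'quotes,', len(authors), 'authors')
```

With a real feed, the open file object can be passed directly, since iterating over a text file yields one line at a time.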

I hope this article is helpful for Python programming with the Scrapy framework.
