Cite this article: YANG Jian, CHEN Wei. Research on Three Web Crawler Technologies Based on Python[J]. Software Engineering, 2023, 26(2): 24-27.
Research on Three Web Crawler Technologies Based on Python
YANG Jian, CHEN Wei
(Zhuji Public Security Bureau, Shaoxing, Zhejiang 311800, China)
716291923@qq.com; 396293104@qq.com
Abstract: Many web crawler technologies are available, and the choice among them affects crawling efficiency and accuracy. To address this, the paper analyzes three mainstream Python-based crawler technologies: Requests, Scrapy, and Selenium. First, the development environment is installed and configured, and single-threaded and multithreaded crawlers are developed. Second, the crawlers fetch 10, 100, 500, and 1,000 pages of resume data from the "Home of Webmasters" (站长之家) website, and the crawling time is measured. Finally, the ability to get past anti-crawler mechanisms is verified by crawling data from "China Judgements Online" (中国裁判文书网). The results show that a Requests crawler can fetch data with a single line of code and is flexible to develop and customize; the Scrapy crawler averages 0.02 s per page, showing outstanding concurrency performance; and the Selenium crawler is strong at getting past website anti-crawler mechanisms. Crawler development should therefore weigh business requirements against the technical characteristics of each tool to achieve the best data-crawling results.
Keywords: web crawler; Requests; Scrapy; Selenium
CLC number: TP302.7    Document code: A
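
The timing comparison described in the abstract (single-threaded versus multithreaded crawling, reported as time per page) can be illustrated with a minimal sketch. This is not the authors' code: the helper names `fetch_with_requests` and `timed_crawl`, the 10-second timeout, and the thread-pool design are assumptions made for illustration. The only Requests call used is `requests.get(url, timeout=...)`, which is the one-line fetch the abstract refers to.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_requests(url):
    """Fetch one page with Requests and return its HTML text."""
    # Third-party package, imported lazily so timed_crawl also works
    # with any other fetch function when Requests is not installed.
    import requests

    return requests.get(url, timeout=10).text  # the "one line" fetch


def timed_crawl(urls, fetch=fetch_with_requests, workers=1):
    """Fetch every URL using `workers` threads.

    Returns (pages, seconds_per_page). workers=1 behaves like a
    single-threaded crawler; larger values fetch pages concurrently.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetch, urls))  # map keeps the input order
    elapsed = time.perf_counter() - start
    return pages, elapsed / max(len(urls), 1)
```

Calling `timed_crawl(urls, workers=1)` and then `timed_crawl(urls, workers=8)` on the same list of page URLs gives two per-page averages that can be compared directly. Measured values depend on the network and the target site, so they will not necessarily match the figures reported in the paper.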


