Cite this article: HU Xuejun, LI Jiacheng. Distributed Crawling of Dangdang Book Data based on Scrapy-Redis[J]. Software Engineering, 2022, 25(10): 8-11.
CLC number: TP391.1    Document code: A
Distributed Crawling of Dangdang Book Data based on Scrapy-Redis
HU Xuejun, LI Jiacheng
(School of Mechanical Engineering, University of Shanghai for Science and Technology, Shanghai 200082, China)
HXJ20161645@163.com; jiujiuniansheng@163.com
Abstract: A single-machine web crawler collects data inefficiently, whereas a distributed web crawler can effectively improve crawling efficiency. This paper adopts the comparatively easy-to-use Scrapy-Redis framework to design a distributed web crawler system with a master-slave architecture that crawls book information from Dangdang. The Bloom Filter algorithm is also studied, the parameters that affect its performance are analyzed, and the algorithm is integrated into the deduplication module of the Scrapy-Redis Scheduler. With one machine acting as the Master and two machines acting as Slaves, the system collected more than 18,000 book records after running for one hour.
Keywords: web crawler; Scrapy framework; Scrapy-Redis framework; Bloom Filter algorithm
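As a concrete illustration of the master-slave arrangement described in the abstract, the following is a minimal sketch of how a Scrapy project can be pointed at a shared Redis instance through Scrapy-Redis. The Redis address, spider name, key names, and CSS selectors are illustrative assumptions, not details taken from the paper.

# settings.py (excerpt) -- route scheduling and deduplication through Redis
# so that all Slave nodes share one request queue and one fingerprint set.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                  # keep the queue and fingerprints across restarts
REDIS_URL = "redis://192.168.1.100:6379"  # Redis runs on the Master node (assumed address)
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,  # optionally collect scraped items in Redis
}

# dangdang_spider.py -- a RedisSpider blocks on a Redis list until the Master
# seeds it with start URLs, e.g.  redis-cli lpush dangdang:start_urls <category URL>
from scrapy_redis.spiders import RedisSpider

class DangdangBookSpider(RedisSpider):
    name = "dangdang_books"
    redis_key = "dangdang:start_urls"     # shared list that the Slaves consume

    def parse(self, response):
        # Placeholder selectors for a Dangdang category listing page.
        for book in response.css("ul.bigimg > li"):
            yield {
                "title": book.css("a.pic::attr(title)").get(),
                "price": book.css("span.search_now_price::text").get(),
                "link": book.css("a.pic::attr(href)").get(),
            }

Because both the request queue and the deduplication data live in Redis on the Master, additional Slave machines can join the crawl simply by running the same spider with the same REDIS_URL.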
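The abstract also mentions integrating a Bloom filter into the deduplication module of the Scrapy-Redis Scheduler. The class below is a rough sketch of one common way to do this on top of Scrapy-Redis, using a Redis bitmap and MurmurHash (the mmh3 package); the bit-array size and hash seeds are assumptions chosen for illustration, following the usual trade-off that with m bits, n URLs, and k hash functions the false-positive rate is roughly (1 - e^(-kn/m))^k, minimized at k = (m/n)·ln 2.

# bloom_dupefilter.py -- sketch of a Bloom-filter request filter for Scrapy-Redis.
import mmh3  # MurmurHash3 bindings: pip install mmh3
from scrapy_redis.dupefilter import RFPDupeFilter

class BloomRFPDupeFilter(RFPDupeFilter):
    BIT_SIZE = 1 << 27                  # 2^27-bit Redis bitmap (~16 MB), assumed size
    SEEDS = (5, 7, 11, 13, 31, 37, 61)  # seven hash functions, assumed seeds

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        offsets = [mmh3.hash(fp, seed) % self.BIT_SIZE for seed in self.SEEDS]
        # The request counts as seen only if every probed bit is already set;
        # a Bloom filter may return rare false positives but never false negatives.
        if all(self.server.getbit(self.key, off) for off in offsets):
            return True
        pipe = self.server.pipeline()
        for off in offsets:
            pipe.setbit(self.key, off, 1)
        pipe.execute()
        return False

Switching the crawler over to it is then a one-line change in settings.py, e.g. DUPEFILTER_CLASS = "myproject.bloom_dupefilter.BloomRFPDupeFilter" (module path assumed).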

