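Cite this article: HUO Ying, LI Xiaofan, QIU Zhimin, LI Yanting. Research and Practice of Network Data Acquisition based on Big Data[J]. Software Engineering, 2023, 26(4): 28-32.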
Research and Practice of Network Data Acquisition based on Big Data
HUO Ying1, LI Xiaofan1, QIU Zhimin2, LI Yanting1
(1. School of Information Engineering, Shaoguan University, Shaoguan, Guangdong 512005, China;
2. School of Intelligent Engineering, Shaoguan University, Shaoguan, Guangdong 512005, China)
huoying@sgu.edu.cn; 14929099@qq.com; 250437325@qq.com; kidi@qq.com
Abstract: In the context of Weibo big data, this paper presents the design and implementation of a crawler-based data acquisition system, with public opinion data collection and user behavior analysis as the application background. The scheme combines a focused crawler with an incremental crawler and applies a content-evaluation-based crawling strategy: it searches for the keywords given by the user and updates the related content whenever changes occur, so that the collected data are timely and valid. In actual data collection, a single machine gathers about 880,000 records per day. In practice, users can set the crawling speed according to their needs, and can raise both the volume and the speed of collection by increasing the number of distributed crawlers.
Keywords: big data; data acquisition; network crawler
CLC number: TP319    Document code: A
Funding: Guangdong Provincial Philosophy and Social Sciences Planning Discipline Co-construction Project (GD18XXW07); Natural Science Foundation of Guangdong Province (2021A1515011803).
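The abstract gives no implementation details, so the following is only a minimal sketch of how a focused crawler and an incremental crawler might be combined. The names `relevance`, `fingerprint`, and `IncrementalFocusedStore` are hypothetical. The keyword-fraction score is a simple stand-in for the paper's content-evaluation strategy, not the authors' actual function. Fetching from Weibo is left out; the sketch only decides what to do with a post that has already been fetched.

```python
import hashlib
from dataclasses import dataclass, field


def relevance(text: str, keywords: list[str]) -> float:
    """Assumed content-evaluation score: fraction of keywords found in the text."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords)


def fingerprint(text: str) -> str:
    """Stable digest of the post text, used to detect changes between crawls."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class IncrementalFocusedStore:
    """Hypothetical in-memory store combining both crawling ideas."""
    keywords: list[str]
    threshold: float = 0.5  # assumed cut-off for "relevant enough"
    seen: dict[str, str] = field(default_factory=dict)  # post_id -> digest

    def offer(self, post_id: str, text: str) -> str:
        """Classify a fetched post as 'skipped', 'new', 'updated', or 'unchanged'."""
        # Focused step: drop posts whose content is not relevant to the keywords.
        if relevance(text, self.keywords) < self.threshold:
            return "skipped"
        # Incremental step: store the post only if it is new or its text changed.
        digest = fingerprint(text)
        previous = self.seen.get(post_id)
        if previous == digest:
            return "unchanged"
        self.seen[post_id] = digest
        return "new" if previous is None else "updated"


if __name__ == "__main__":
    store = IncrementalFocusedStore(keywords=["flood", "rescue"])
    print(store.offer("p1", "Flood rescue teams arrive"))
    print(store.offer("p1", "Flood rescue teams arrive"))
    print(store.offer("p1", "Flood rescue teams arrive, roads reopen"))
    print(store.offer("p2", "Lunch photos"))
```

A real system would persist the digests rather than hold them in memory, and would rate-limit requests in line with the user-configurable crawling speed mentioned in the abstract.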

