Cite this article: DONG Zongran, WEN Baizhi, ZHU Yi. Design of a New Efficient Full-text Search Engine[J]. Software Engineering (软件工程), 2024, (2): 44-48.
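Design of a New Efficient Full-text Search Engine
DONG Zongran1,2, WEN Baizhi3, ZHU Yi1,2
[1. School of Software, Dalian University of Foreign Languages, Dalian, Liaoning 116044, China;
2. Big Data Library and Information Research Center, Dalian University of Foreign Languages, Dalian, Liaoning 116044, China;
3. China Unicom (Liaoning) Industrial Internet Co., Ltd., Shenyang, Liaoning 110041, China]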
dongzongran@163.com; wbz1234569@126.com; zhuyidl@163.com
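Abstract: To address the poor fuzzy-query performance of conventional storage, an efficient fuzzy-query method for large text documents is proposed. An inverted index is built over the documents, and the index together with part of the document information is loaded into memory to reduce disk input/output (I/O). Data is queried through the mapping formed by the in-memory inverted index and the primary keys in the database, the results are then sorted by a relevance algorithm, and a trie provides search suggestions, achieving efficient full-text search. Experimental results show that, when using the same word set as ElasticSearch, the query efficiency of the designed full-text search engine is 80 to 1,200 times that of ElasticSearch as the amount of test data varies, and its efficiency advantage varies in inverse proportion as the data volume increases. With 17,919 documents, its memory usage does not exceed 2.5 GB, making it suitable for retrieval over massive document data.
Keywords: inverted index; full-text search; search engine; fuzzy query; trie
CLC number: TP391.3    Document code: A
Funding: 2022 Basic Scientific Research Project of Higher Education Institutions of Liaoning Province (LJKMZ20221547)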
Design of a New Efficient Full-text Search Engine
DONG Zongran1,2, WEN Baizhi3, ZHU Yi1,2
[1. School of Software, Dalian University of Foreign Languages, Dalian 116044, China;
2. Big Data Library and Information Research Center, Dalian University of Foreign Languages, Dalian 116044, China;
3.China Unicom (Liaoning) Industrial Internet Co., Ltd., Shenyang 110041, China]

Abstract: To address the low performance of fuzzy queries in conventional storage, an efficient fuzzy-query method for large text documents is proposed. Inverted indexes are built over the documents, and the indexes and some document information are loaded into memory to reduce disk I/O. Data is queried through the mapping formed by the in-memory inverted indexes and the primary keys in the database, the results are then sorted by a relevance algorithm, and a trie is used to provide search suggestions, achieving efficient full-text search. The experimental results show that, when using the same word set as ElasticSearch, the query efficiency of the designed full-text search engine is 80 to 1,200 times that of ElasticSearch, depending on the amount of test data. With 17,919 documents, its memory usage does not exceed 2.5 GB, making it suitable for massive document data retrieval.
Keywords: inverted index; full-text search; search engine; fuzzy query; trie
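
The abstract names three building blocks (an in-memory inverted index keyed to database primary keys, relevance-based sorting, and a trie for search suggestions) but gives no implementation details. The following is a minimal sketch of how such pieces could fit together, not the authors' implementation. It assumes whitespace tokenization, integer IDs standing in for database primary keys, and a plain TF-IDF score as the relevance measure, since the abstract does not specify the relevance algorithm. The class names `InvertedIndex` and `Trie` and the sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


class InvertedIndex:
    """In-memory inverted index: term -> {primary key: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, tokens):
        """Index one document, identified by its (assumed) primary key."""
        self.doc_count += 1
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query_tokens):
        """Return primary keys ranked by an assumed TF-IDF score, best first."""
        scores = defaultdict(float)
        for term in query_tokens:
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        # Ties are broken by primary key so the order is deterministic.
        return sorted(scores, key=lambda d: (-scores[d], d))


class Trie:
    """Character trie used to suggest indexed terms for a typed prefix."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def suggest(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`, sorted."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results, stack = [], [(node, prefix)]
        while stack and len(results) < limit:
            current, word = stack.pop()
            if current.is_word:
                results.append(word)
            # Push children in reverse order so they pop in ascending order.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], word + ch))
        return results


# Hypothetical sample rows: primary key -> document text.
documents = {
    1: "full text search engine",
    2: "inverted index for text search",
    3: "trie based search suggestions",
}

index, trie = InvertedIndex(), Trie()
for doc_id, text in documents.items():
    tokens = text.split()
    index.add(doc_id, tokens)
    for token in tokens:
        trie.insert(token)

ranked_ids = index.search(["inverted", "text"])  # keys to fetch from the database
suggestions = trie.suggest("in")                 # completions for a typed prefix
```

In this sketch only the ranked primary keys leave the index, so full document bodies would be fetched from the database by key afterwards, which is one way to read the abstract's goal of reducing disk I/O.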

