Software Engineering (软件工程)

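Cite this article:

GUO Tao, BA Yuanjie, LI Shaoang. Similarity of Long Novels Based on Common Word Sets (基于公共词集对长篇小说相似度的研究) [J]. Software Engineering (软件工程), 2018, 21(10): 11-13.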


Similarity of Long Novels Based on Common Word Sets

GUO Tao (郭涛), BA Yuanjie (霸元婕), LI Shaoang (李绍昂)

(School of Computer Science, Jilin University, Changchun 130012, China)

Abstract: Traditional text similarity computation is based on the Vector Space Model (VSM), which maps a text to a vector of independent, mutually unrelated terms. Long novels have more complex constituent elements and tighter contextual links than ordinary texts, yet the traditional algorithm ignores the contextual relations between terms and produces high-dimensional vectors, so its efficiency and accuracy are unsatisfactory. This paper therefore computes the similarity of long novels from their common word sets and applies a context constraint check to those sets, yielding sets of closely related words that serve as the main features of each novel. Experimental results show a large improvement for some types of novels.

Keywords: common word set; novel similarity; context constraint

CLC number: TP391.1    Document code: A
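
The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the context constraint check, so the sliding-window co-occurrence rule, the Jaccard-style score, and every function and parameter name below (`constrained_common_words`, `window`, and so on) are assumptions made for illustration. Inputs are assumed to be already segmented into word tokens.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """VSM baseline: cosine of term-frequency vectors, word order ignored."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def positions(tokens):
    """Map each word to the list of positions where it occurs."""
    index = {}
    for i, word in enumerate(tokens):
        index.setdefault(word, []).append(i)
    return index


def has_near_neighbor(word, pos_index, common, window):
    """True if some occurrence of `word` is within `window` tokens
    of an occurrence of a different common word (assumed constraint)."""
    for p in pos_index[word]:
        for other in common:
            if other != word and any(abs(p - q) <= window for q in pos_index[other]):
                return True
    return False


def constrained_common_words(tokens_a, tokens_b, window=3):
    """Common words that pass the assumed context check in both texts."""
    pos_a, pos_b = positions(tokens_a), positions(tokens_b)
    common = pos_a.keys() & pos_b.keys()
    return {
        w for w in common
        if has_near_neighbor(w, pos_a, common, window)
        and has_near_neighbor(w, pos_b, common, window)
    }


def common_set_similarity(tokens_a, tokens_b, window=3):
    """Hypothetical score: constrained common words over the joint vocabulary."""
    kept = constrained_common_words(tokens_a, tokens_b, window)
    vocabulary = set(tokens_a) | set(tokens_b)
    return len(kept) / len(vocabulary) if vocabulary else 0.0


a = "the knight drew his sword and rode north".split()
b = "north of the river a merchant sold a sword to the knight".split()
print(cosine_similarity(a, b))
print(sorted(constrained_common_words(a, b, window=1)))
print(common_set_similarity(a, b, window=3))
```

In this sketch a tighter `window` keeps only common words that appear next to other common words in both texts, which is one plausible reading of "closely related word sets"; the paper's actual constraint and scoring may differ.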
