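Cite this article: XIA Yifan, WANG Duanhong, LI Jilong, JIANG Feng. Multi-Scale Adaptable Vision Transformer and Its Application in Animal Image Recognition[J]. Software Engineering, 2024, 27(5): 27-31.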
CLC number: TP39    Document code: A
Multi-Scale Adaptable Vision Transformer and Its Application in Animal Image Recognition
XIA Yifan, WANG Duanhong, LI Jilong, JIANG Feng
(Taizhou Institute of Sci. & Tech., NJUST, Taizhou 225300, China)
1115000760@qq.com; 1530891210@qq.com; 2419267020@qq.com; jf@nustti.edu.cn
Abstract: This paper proposes a multi-scale adaptable ViT (Vision Transformer) image recognition model for animal images, which are characterized by complex and variable backgrounds, small inter-class feature differences, and large intra-class feature differences. Building on the ViT model, it fuses multi-layer feature maps from a convolutional neural network and introduces an adaptable attention mechanism, so that the model effectively integrates the local and global features of an image and recognizes animals at various scales. An animal dataset of 90 categories and 21,142 images is constructed; experiments on this dataset show that the proposed model achieves Top-1 and Top-5 accuracies of 90.34% and 97.59%, respectively.
Keywords: animal image; ViT; adaptable attention mechanism; multi-layer feature map
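
The Top-1 and Top-5 accuracies reported in the abstract follow the usual definition: a prediction counts as correct if the true class is among the k highest-scoring classes. The sketch below illustrates that definition in plain Python. It is not the authors' evaluation code; the function name `top_k_accuracy` and the toy scores and labels are assumptions made for illustration only.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices by descending score; ties keep the lower index first.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Hypothetical toy data: 4 samples, 3 classes (not taken from the paper).
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
]
toy_labels = [0, 1, 2, 1]

if __name__ == "__main__":
    print(top_k_accuracy(toy_scores, toy_labels, k=1))  # 0.5
    print(top_k_accuracy(toy_scores, toy_labels, k=2))  # 0.75
```

For a 90-class model such as the one described here, each row of `scores` would hold 90 class scores and `k` would be 1 or 5.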


