Nondestructive Near-infrared Identification of Hawthorn Fruit Cultivars Based on Natural Language Processing

邓志扬; 廖强; 邵淑娟; 刘军

doi:10.13386/j.issn1002-0306.2023010132

Abstract

Hawthorn fruits of different cultivars differ in nutritional composition and sensory quality, and therefore suit different processing methods in industrial production. Traditional analytical methods are time-consuming, destructive, and costly, so large-scale production of hawthorn fruit products calls for nondestructive cultivar identification. In this study, near-infrared spectra were collected from 240 hawthorn fruit samples of four cultivars, preprocessed with several algorithms, and then analyzed with natural language processing (NLP) models, including long short-term memory (LSTM) and gated recurrent unit (GRU) neural networks, logistic regression, naive Bayes, decision trees, and k-nearest neighbors. The LSTM and GRU models performed best on spectra preprocessed by principal component analysis (PCA): both reached an accuracy of 99.46%±0.00% on the validation set and 100%±0.00% on the test set. The logistic regression model also discriminated hawthorn fruit spectra very well: its accuracy on the validation and test sets reached or came very close to 100% for every preprocessing method except the second-order difference (D2), for which it fell to 96.65% on the validation set and 89.58% on the test set. The naive Bayes model performed comparatively well on PCA-processed spectra, with an accuracy of 95.65% on the validation set and 95.83% on the test set. These results confirm that applying NLP to the nondestructive near-infrared identification of hawthorn fruit cultivars is feasible.
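The two preprocessing steps named in the abstract, the second-order difference (D2) and PCA, followed by one of the listed classifiers (k-nearest neighbors), can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The spectra are synthetic, and the function names, the balanced 60-per-cultivar layout, the 192/48 split, the five principal components, and k = 3 are all assumptions chosen for illustration; the abstract does not state any of these settings.

```python
import numpy as np


def second_order_difference(spectra):
    """D2 preprocessing: second discrete difference along the wavelength axis."""
    return np.diff(spectra, n=2, axis=1)


def pca_fit(train, n_components):
    """Fit PCA via SVD of the mean-centered training spectra."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(spectra, mean, components):
    """Project spectra onto the principal axes learned from the training set."""
    return (spectra - mean) @ components.T


def knn_predict(train_x, train_y, test_x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    dists = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(row).argmax() for row in train_y[nearest]])


def make_synthetic_spectra(n_classes=4, per_class=60, n_wavelengths=200, seed=0):
    """Hypothetical stand-in data: one Gaussian band per class plus noise."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, n_wavelengths)
    spectra, labels = [], []
    for c in range(n_classes):
        band = np.exp(-((grid - (0.2 + 0.2 * c)) ** 2) / 0.005)
        noise = rng.normal(scale=0.05, size=(per_class, n_wavelengths))
        spectra.append(band + noise)
        labels.append(np.full(per_class, c))
    return np.vstack(spectra), np.concatenate(labels)


x, y = make_synthetic_spectra()

# Assumed split: 192 training spectra, 48 held-out spectra.
order = np.random.default_rng(1).permutation(len(y))
train_idx, test_idx = order[:192], order[192:]

# PCA is fitted on the training spectra only, then applied to both sets.
mean, components = pca_fit(x[train_idx], n_components=5)
train_scores = pca_transform(x[train_idx], mean, components)
test_scores = pca_transform(x[test_idx], mean, components)

pred = knn_predict(train_scores, y[train_idx], test_scores, k=3)
accuracy = float((pred == y[test_idx]).mean())
print(f"PCA + k-NN accuracy on synthetic spectra: {accuracy:.2%}")

# D2 shortens each spectrum by two points.
d2 = second_order_difference(x)
print("D2 output shape:", d2.shape)
```

Fitting the PCA mean and axes on the training spectra alone keeps information from the held-out spectra out of the preprocessing step. The accuracy printed here reflects only the synthetic data and says nothing about the figures reported in the abstract.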