In an article recently submitted to the arXiv* preprint server, researchers explored the feasibility of using large language models (LLMs), including the generative pre-trained transformer (GPT) and finetuned language net (FLAN) series, for generating pest management advice in agriculture.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
The authors introduced an innovative evaluation approach using GPT-4 to assess the quality of the generated content across various metrics. Results demonstrated the LLMs' effectiveness, with GPT-3.5 and GPT-4 outperforming the FLAN models and showcasing their potential as tools for providing pest management suggestions in agriculture.
Background
LLMs are pivotal in natural language processing, utilizing vast text data to understand and generate human language. Models like GPT-3.5 and GPT-4, with billions of parameters, have transformed the field, yet their application in agriculture remains limited. Previous research has explored LLMs in domains such as finance, medicine, and education, demonstrating their potential in specialized tasks. However, few studies have investigated their efficacy in agriculture, highlighting a significant research gap.
The present paper bridged this gap by conducting a feasibility study on using LLMs to generate pest management advice in agriculture, leveraging GPT-4 for evaluation. In doing so, it pioneered the assessment of LLMs, particularly GPT-4, in the agricultural domain, with a focus on pest management.
The authors introduced an innovative evaluation methodology using GPT-4 for multi-dimensional assessment of generated pest management suggestions. Additionally, they demonstrated the effectiveness of instruction-based prompting techniques in achieving a 72% accuracy rate in LLM-driven pest management decisions.
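In practice, such an evaluator can be implemented as an ordinary chat-completion call that asks GPT-4 to score a candidate answer against a rubric. The snippet below is a minimal sketch of that idea rather than the authors' code; the rubric dimensions, prompt wording, and the `gpt-4` model string are illustrative assumptions.

```python
# Minimal sketch of an LLM-as-evaluator call (illustrative; not the authors' code).
# Assumes the OpenAI Python SDK (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Rubric dimensions are assumptions based on the article's description of a
# multi-dimensional quality assessment.
RUBRIC = ["accuracy", "relevance", "fluency", "completeness"]

def evaluate_advice(question: str, advice: str) -> str:
    """Ask GPT-4 to score a pest management suggestion on each rubric dimension (1-5)."""
    prompt = (
        "You are grading pest management advice for farmers.\n"
        f"Question: {question}\n"
        f"Advice: {advice}\n"
        f"Score the advice from 1 to 5 on each of: {', '.join(RUBRIC)}. "
        "Return one line per dimension as 'dimension: score - short justification'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content
```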
Experiment Design
The experiment focused on evaluating two leading LLMs, GPT and FLAN-T5. GPT, based on the transformer architecture, was represented by iterations such as GPT-3.5 and GPT-4, with the latter exhibiting improved reasoning capabilities. FLAN-T5, a variant of Google's T5 model, was known for its text-to-text framework, achieving state-of-the-art results in natural language processing tasks.
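For context, FLAN-T5 checkpoints are openly available via Hugging Face's transformers library, while the GPT models are accessed through an API. The sketch below shows how a pest management question could be posed to a FLAN-T5 checkpoint; the checkpoint size and the scenario are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of querying a FLAN-T5 checkpoint (checkpoint size is an assumption).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = ("A field of winter wheat has 12 aphids per plant and the action threshold "
          "is 5 per plant. Should the farmer take pest management action? Explain briefly.")

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```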
The experiment's baseline was established using an expert system sourced from the Agriculture and Horticulture Development Board's (AHDB) encyclopedia of pests and natural enemies in field crops. This system provided datasets detailing pest-crop associations, thresholds, and management suggestions, serving as a benchmark for assessing LLMs' factual accuracy in determining pest management actions.
Generating labeled pest samples involved extracting data from the expert system and creating scenarios representing various pest management situations. Positive samples paired affected crops with density values exceeding thresholds, while negative samples represented scenarios where no action was required. These samples were augmented with random parameters for diversity.
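The following sketch illustrates how such labeled scenarios could be constructed from threshold records; the pest, crop, and threshold values are hypothetical placeholders rather than figures from the AHDB encyclopedia.

```python
# Illustrative sketch of building positive/negative pest scenarios from threshold records.
# The pest, crop, and threshold values are hypothetical placeholders, not AHDB data.
import random
from dataclasses import dataclass

@dataclass
class PestRecord:
    pest: str
    crop: str
    threshold: float  # density per plant above which action is advised

RECORDS = [
    PestRecord("grain aphid", "winter wheat", 5.0),
    PestRecord("cabbage stem flea beetle", "oilseed rape", 2.0),
]

def make_sample(record: PestRecord, positive: bool) -> dict:
    """Create one labeled scenario: density above the threshold => action required."""
    if positive:
        density = record.threshold * random.uniform(1.1, 3.0)   # exceeds threshold
    else:
        density = record.threshold * random.uniform(0.1, 0.9)   # below threshold
    return {
        "pest": record.pest,
        "crop": record.crop,
        "observed_density": round(density, 1),
        "threshold": record.threshold,
        "label": "action_required" if positive else "no_action",
    }

samples = [make_sample(r, positive) for r in RECORDS for positive in (True, False)]
```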
Experiment prompting techniques included zero-shot, few-shot, instruction-based, and self-consistency prompting. Zero-shot and few-shot prompting relied on the model's pre-training and example guidance, respectively. Instruction-based prompting provided structured instructions and context to guide the model, while self-consistency prompting aggregated responses from various prompts for consistency. Overall, the experiment aimed to evaluate LLMs' performance in providing accurate and comprehensive pest management advice using different prompting techniques, considering contextual factors for improved accuracy.
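As a concrete illustration of how these styles differ, the sketch below builds a zero-shot prompt and an instruction-based prompt for the same scenario (a dict like the one produced in the previous sketch), and aggregates repeated answers in the spirit of self-consistency; the wording is hypothetical and not reproduced from the paper.

```python
# Illustrative prompt templates for the same pest scenario (wording is hypothetical).

def zero_shot_prompt(sample: dict) -> str:
    # Relies purely on the model's pre-training: no context, examples, or instructions.
    return (f"{sample['observed_density']} {sample['pest']} per plant were observed on "
            f"{sample['crop']}. Should pest management action be taken?")

def instruction_prompt(sample: dict) -> str:
    # Supplies explicit instructions plus contextual facts (affected crop, threshold).
    return (
        "You are an agronomy assistant. Decide whether pest management action is needed.\n"
        f"Pest: {sample['pest']}\n"
        f"Crop: {sample['crop']}\n"
        f"Observed density: {sample['observed_density']} per plant\n"
        f"Action threshold: {sample['threshold']} per plant\n"
        "Answer 'yes' or 'no', then give a one-sentence justification."
    )

def self_consistent_answer(answers: list[str]) -> str:
    # Self-consistency in spirit: sample several answers and keep the majority vote.
    yes = sum(a.strip().lower().startswith("yes") for a in answers)
    return "yes" if yes > len(answers) / 2 else "no"
```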
Results
The evaluation conducted using GPT-4 revealed significant differences in linguistic quality and performance metrics among FLAN, GPT-3.5, and GPT-4 models across various prompting methods. FLAN consistently scored lower in linguistic quality dimensions compared to GPT-3.5 and GPT-4, indicating its limitations in handling complex language tasks without specific training or guidance.
On the other hand, both GPT-3.5 and GPT-4 exhibited superior performance, with GPT-4 achieving perfect scores on fluency and consistently high scores across all dimensions. In terms of performance metrics, GPT-3.5 and GPT-4 outperformed FLAN, particularly in accuracy, precision, recall, and F1 scores. Interestingly, while GPT-4 appeared "smarter" in understanding prompts, it occasionally made inaccurate judgments, leading to misclassifications of negative samples.
GPT-3.5, however, adhered strictly to thresholds specified in prompts, resulting in more accurate judgments on negative samples. The instruction-based prompting method consistently demonstrated the best performance across models, particularly with GPT-3.5, indicating the importance of providing specific information like affected crops and threshold levels in prompts. Conversely, the self-consistency prompting method exhibited poorer performance, especially in misclassifying negative scenarios as positive.
Overall, FLAN's performance remained inferior to GPT-3.5 and GPT-4 across all prompting methods, highlighting its limitations in the agricultural domain. Meanwhile, the GPT series models showed consistently high performance, with the instruction-based method yielding the best results, emphasizing the significance of contextual information in guiding model responses. Zero-shot and few-shot methods generally scored lower, suggesting the importance of detailed prompts for accurate and relevant advice generation.
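For reference, the reported metrics follow from treating each scenario as a binary action/no-action decision; a brief sketch of computing them with scikit-learn (the label vectors below are made-up examples, not the study's data) is:

```python
# Sketch of computing the reported classification metrics for binary
# action/no-action decisions; the labels below are made-up examples.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 0, 0, 1, 0]   # 1 = action required, 0 = no action
y_pred = [1, 1, 1, 0, 1, 0]   # model decisions parsed from its answers

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```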
Conclusion
In conclusion, the authors demonstrated the feasibility of utilizing LLMs like GPT-3.5 and GPT-4 for generating pest management advice in agriculture. By introducing an innovative evaluation methodology and experimenting with different prompting techniques, the study showcased the effectiveness of instruction-based prompting in improving LLM-driven pest management decisions.
While FLAN models lagged in performance, GPT-3.5 and GPT-4 exhibited superior capabilities, especially when provided with specific contextual information in prompts. Overall, this research underscored the potential of LLMs to revolutionize pest management practices in agriculture through accurate and comprehensive advice generation.
Journal reference:
- Preliminary scientific report. Yang, S., Yuan, Z., Li, S., Peng, R., Liu, K., & Yang, P. (2024, March 18). GPT-4 as Evaluator: Evaluating Large Language Models on Pest Management in Agriculture. arXiv. https://doi.org/10.48550/arXiv.2403.11858