Enhancing Large Language Models with Multimodality and Knowledge Graphs for Hallucination-free Open-set Object Recognition

Abstract

Open-set object recognition plays a significant role in today’s industrial production and daily life, in applications such as surface defect detection, biometric identification, and object recognition for autonomous driving. However, because unknown categories are diverse and real-world scenarios are complex, existing methods often perform poorly, so open-set object recognition remains an important and active research topic. Recently, the collaborative use of multiple pre-trained Large Language Models (LLMs) has emerged rapidly as a new research focus for open-set object recognition. A core challenge in this line of work is amplifying the strengths of individual LLMs while mitigating their weaknesses. In this paper, we propose a novel joint framework tailored to open-set object recognition that harnesses the capabilities of diverse LLMs and Knowledge Graphs (KGs) more efficiently. First, we use Wikipedia to correct and complete the text generated by textual LLMs. We then design a text-image multimodal fusion method that further corrects and completes this text by exploiting the implicit semantic information in images. In addition, we propose several novel designs that alleviate the hallucination issue of LLMs and reduce their instability. Extensive experiments demonstrate that our approach outperforms all compared methods.

Publication
VLDB Workshops
Xinfu Liu
Ph.D. student
Yirui Wu
Young Professor, CCF Senior Member

My research interests include Computer Vision, Artificial Intelligence, Multimedia Computing, and Intelligent Water Conservancy.

Yuting Zhou
M.E. student