[Audio] Welcome, everyone, and thank you for being here. Today, I'll present our comprehensive survey on the factuality of Large Language Models, or L-L-Ms. My name is Yuxia Wang, and this research was conducted in collaboration with colleagues from MBZUAI, Monash University, Google DeepMind, and Sofia University. We'll explore how L-L-Ms handle factual accuracy, which is vital as they increasingly serve as everyday digital assistants.
[Audio] In today's talk, we'll begin by discussing our motivation for focusing on factuality in L-L-Ms. I'll provide some background on critical concepts, including hallucination and trustworthiness, before reviewing how factuality is evaluated. We'll then explore various methods for enhancing factual accuracy at different stages: pretraining, finetuning, and inference. We'll also cover factuality issues specific to multimodal L-L-Ms and conclude with an overview of current challenges and promising future directions.
[Audio] As L-L-Ms become increasingly integrated into our lives as digital assistants, ensuring they deliver factually accurate information is crucial. While many surveys on L-L-Ms exist, most lack depth in covering essential aspects of factuality and its unique challenges, especially for multimodal L-L-Ms. Our survey aims to address these gaps, providing a more detailed exploration of the issues.
[Audio] Our motivation also stems from understanding that factuality is complex, involving multiple factors. As we illustrate in Table 1, few previous surveys tackle this topic comprehensively, and even fewer discuss multimodal L-L-Ms, which integrate visual, auditory, and textual data.
[Audio] Our survey makes several key contributions. First, we distinguish between factuality, hallucination, and trustworthiness, helping to clarify these often-confused terms. Second, we provide an in-depth analysis of current evaluation and mitigation techniques. Lastly, we focus on factuality in multimodal L-L-Ms, which present unique challenges due to their complex inputs and outputs.
[Audio] To understand factuality in L-L-Ms, we need to distinguish it from hallucination. Hallucination occurs when models generate content that isn't grounded in the provided context or source, leading to unreliable information. Factuality, on the other hand, refers to the model's ability to learn and produce information accurately.
[Audio] Another essential distinction is between factuality and trustworthiness. While factuality is about accuracy, trustworthiness includes other dimensions like safety, fairness, and ethics. These broader dimensions impact how users perceive the model's outputs beyond just their factual accuracy.
[Audio] Evaluating L-L-M factuality requires various datasets and metrics, ranging from open-ended generation tasks to structured question-answer formats. Current benchmarks assess how well L-L-Ms can distinguish factual from non-factual content and produce reliable responses. Common evaluation metrics include accuracy, entailment ratios, and human annotation.
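To make these metrics concrete, here is a minimal Python sketch of QA accuracy and an entailment ratio. The `nli_entails` helper is a hypothetical placeholder for an NLI model or automatic fact-checker, not part of any specific benchmark.

```python
# Minimal sketch of two common factuality metrics: accuracy on QA-style
# benchmarks and the entailment ratio for free-form generations.
# `nli_entails` is a toy placeholder for an NLI model or fact-checker.

def nli_entails(evidence: str, claim: str) -> bool:
    """Placeholder: a real system would call an NLI model and check for 'entailment'."""
    return claim.strip().lower() in evidence.lower()  # toy heuristic only

def qa_accuracy(predictions: list[str], gold_answers: list[str]) -> float:
    """Fraction of predictions that exactly match the gold answer."""
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

def entailment_ratio(claims: list[str], evidence: str) -> float:
    """Fraction of generated claims supported by the reference evidence."""
    supported = sum(nli_entails(evidence, c) for c in claims)
    return supported / len(claims)

print(qa_accuracy(["Paris", "Berlin"], ["Paris", "Rome"]))            # 0.5
print(entailment_ratio(
    ["paris is the capital of france.", "paris is in germany."],
    "Paris is the capital of France. It lies on the Seine.",
))                                                                    # 0.5
```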
[Audio] Methods for enhancing factuality in L-L-Ms are typically categorized by the model stage where they're applied. We'll discuss three main stages where factuality improvements can be implemented: pretraining, finetuning, and inference. Each stage offers distinct ways to improve the model's knowledge reliability.
[Audio] Pretraining provides the foundational knowledge for L-L-Ms, derived from vast amounts of data. Improving factuality at this stage includes careful data selection and retrieval-augmented techniques, which add relevant information during training. However, these methods come with limitations, such as the computational demands of retrieval-augmented generation (R-A-G) methods, which increase processing time and require frequent updates.
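As a rough illustration of the retrieval-augmented idea, here is a minimal sketch with a toy corpus; `CORPUS`, `score`, `retrieve`, and `generate` are hypothetical placeholders rather than any specific method from the survey.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant passages for a query, then condition the model on them.

CORPUS = [
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
]

def score(query: str, passage: str) -> int:
    """Toy relevance score: shared-word count (a real system would use BM25 or dense retrieval)."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the top-k passages for the query."""
    return sorted(CORPUS, key=lambda p: score(query, p), reverse=True)[:k]

def generate(prompt: str) -> str:
    """Placeholder for an L-L-M call; here we just echo the prompt."""
    return f"[model output conditioned on]\n{prompt}"

def rag_answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)

print(rag_answer("When was the Eiffel Tower completed?"))
```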
[Audio] Supervised finetuning is essential for aligning models with user expectations and improving domain-specific knowledge. S-F-T can also teach models to abstain from answering when they lack sufficient knowledge, reducing factual errors. However, finetuning requires curated, domain-specific data and can sometimes impact the retention of core parametric knowledge.
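One way such abstention behavior could be encoded in S-F-T data is sketched below; `model_knows` is a hypothetical knowledge probe (in practice it might be approximated by sampling the model and checking answer consistency), and the example format is illustrative only.

```python
# Sketch of building SFT examples that teach a model to abstain when it
# lacks the knowledge to answer.

ABSTAIN = "I'm not sure about that, so I'd rather not guess."

def model_knows(question: str) -> bool:
    """Placeholder knowledge probe; a real pipeline would estimate this empirically."""
    return "capital of France" in question

def build_sft_example(question: str, gold_answer: str) -> dict:
    # For questions outside the model's knowledge, train toward an abstention target.
    target = gold_answer if model_knows(question) else ABSTAIN
    return {"prompt": question, "completion": target}

examples = [
    build_sft_example("What is the capital of France?", "Paris"),
    build_sft_example("Who won the 2031 World Cup?", "unknown"),
]
for ex in examples:
    print(ex)
```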
[Audio] Preference tuning, commonly through reinforcement learning from human feedback, enhances response quality and reduces harmful outputs. Feedback from automatic fact-checkers or model confidence is increasingly used to refine responses, particularly for specialized topics. Nevertheless, tuning methods struggle to generalize across different domains.
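To illustrate how fact-checker feedback can feed preference tuning, here is a minimal sketch that ranks two sampled responses by a hypothetical `factuality_score` and emits a chosen/rejected pair, the kind of record typically consumed by RLHF- or DPO-style trainers.

```python
# Sketch: turn fact-checker scores into preference pairs for tuning.
# `factuality_score` is a hypothetical stand-in for an automatic fact-checker
# or a model-confidence estimate.

def factuality_score(response: str) -> float:
    """Placeholder: fraction of claims a fact-checker would mark as supported."""
    return 1.0 if "1889" in response else 0.3

def to_preference_pair(prompt: str, response_a: str, response_b: str) -> dict:
    # The more factual response becomes "chosen", the other "rejected".
    chosen, rejected = sorted(
        [response_a, response_b], key=factuality_score, reverse=True
    )
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = to_preference_pair(
    "When was the Eiffel Tower completed?",
    "It was completed in 1889.",
    "It was completed in 1925.",
)
print(pair)
```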
[Audio] At inference, factuality can be improved using optimized decoding and in-context learning strategies. Techniques like greedy decoding, context-aware decoding, and DoLa decoding, which contrasts the output distributions of later and earlier layers, all help maintain factual accuracy. Additionally, self-reasoning approaches, such as multi-agent debate, allow models to validate responses collectively, further reducing factual errors.
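As a concrete example of one such strategy, here is a small numpy sketch of context-aware decoding, which amplifies the probability shift induced by the provided context. The logits and vocabulary are made-up numbers; the weighting follows the commonly cited (1 + alpha) formulation.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def context_aware_decode(logits_with_ctx, logits_without_ctx, alpha=0.5):
    """
    Context-aware decoding: the next-token distribution is proportional to
    softmax((1 + alpha) * logits_with_context - alpha * logits_without_context),
    upweighting tokens whose probability rises when the context is present.
    """
    adjusted = (1 + alpha) * np.asarray(logits_with_ctx) \
               - alpha * np.asarray(logits_without_ctx)
    return softmax(adjusted)

# Toy 4-token vocabulary; values are illustrative only.
with_ctx    = [2.0, 1.0, 0.5, 0.1]   # logits when the context is in the prompt
without_ctx = [0.5, 1.2, 0.5, 0.1]   # logits from the prompt alone
probs = context_aware_decode(with_ctx, without_ctx)
print(probs, "-> greedy pick:", int(np.argmax(probs)))
```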
[Audio] Multimodal L-L-Ms add complexity to factuality evaluation, which spans three main categories: existence factuality, attribute factuality, and relationship factuality. Tools like the C-H-A-I-R and P-O-P-E benchmarks help assess object hallucination, but human evaluations are often necessary. Mitigation techniques include finetuning with R-L-H-F on specific instructions and refining feature representations to improve accuracy across modalities.
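To make the object-hallucination metrics concrete, here is a small sketch of the C-H-A-I-R scores: the instance-level score is the fraction of mentioned objects that do not appear in the image annotations, and the sentence-level score is the fraction of captions containing at least one such object. The captions and object sets below are invented for illustration.

```python
# Sketch of the CHAIR object-hallucination metrics for image captioning.
# Each item pairs the objects mentioned in a generated caption with the
# ground-truth objects annotated for the image (toy data below).

samples = [
    {"mentioned": ["dog", "frisbee"],        "ground_truth": {"dog", "frisbee"}},
    {"mentioned": ["cat", "sofa", "remote"], "ground_truth": {"cat", "sofa"}},
]

def chair_scores(samples):
    total_mentions, hallucinated_mentions, hallucinated_captions = 0, 0, 0
    for s in samples:
        wrong = [obj for obj in s["mentioned"] if obj not in s["ground_truth"]]
        total_mentions += len(s["mentioned"])
        hallucinated_mentions += len(wrong)
        hallucinated_captions += bool(wrong)
    chair_i = hallucinated_mentions / total_mentions   # instance-level score
    chair_s = hallucinated_captions / len(samples)     # sentence-level score
    return chair_i, chair_s

print(chair_scores(samples))  # (0.2, 0.5): 1 of 5 mentions, 1 of 2 captions
```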
[Audio] Several challenges persist, including the difficulty of automatically evaluating open-ended generation factuality, the limitations of language modeling for factual consistency, and latency in retrieval-augmented generation systems. Future research should focus on improving retrieval algorithms, enhancing fact-checking efficiency, and exploring ways to streamline automated evaluation processes.