GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification

Scene 1 (0s)

[Audio] Hello everyone, today I am presenting GlobalDoc: a cross-modal vision-language framework for real-world document image retrieval and classification.

Scene 2 (14s)

[Audio] Despite significant advancements in Visual Document Understanding (VDU) through multimodal language models, real-world deployment in industrial settings remains challenging. State-of-the-art (SoTA) VDU models struggle to maintain performance and efficiency due to the following critical limitations. Most existing approaches, such as the LayoutLM series and DocFormer, heavily rely on OCR-based text extraction to learn local positional information with designed pretext tasks like masked vision-language modeling. This dependence reduces flexibility, limits the model's ability to capture global document structures, and hinders generalization in real-world industrial settings, as OCR outputs can be noisy, domain-specific, and error-prone, leading to performance degradation across diverse document types.

Scene 3 (1m 12s)

[Audio] Additionally, many self-supervised learning (SSL) methods, such as SelfDoc and UniDoc, require region-level feature extraction, utilizing pre-trained object detectors to process visual elements for table understanding and document layout analysis. This results in increased computational overhead, making real-time or large-scale processing inefficient.

Scene 4 (1m 45s)

[Audio] Pre-trained VDU models often rely on offline datasets that fail to capture the diversity of real-world industrial documents, limiting their ability to generalize to new formats and layouts. Additionally, many models are restricted to joint vision-language processing, whereas practical applications require flexible solutions that can efficiently handle both uni-modal (vision-only or text-only) and multi-modal inputs without compromising performance.

Scene 5 (2m 17s)

[Audio] Our approach introduces GlobalDoc, a page-level pre-training framework designed to capture global relationships between the vision and language modalities at the page level, rather than relying solely on local word- or region-level features. Unlike traditional VDU models that depend on OCR-based 2D positional encodings, GlobalDoc enhances cross-modal representation learning by proposing three novel pretext tasks. Furthermore, we introduce two document-level downstream tasks, Few-Shot Document Image Classification (Few-Shot DIC) and Content-Based Document Image Retrieval (DIR), to evaluate the model's generalizability in industrial scenarios.

Scene 6 (3m 4s)

[Audio] We humans naturally learn by forming associations between new and previously encountered concepts, enabling quick adaptation to novel information. For example, when shown a technical article, a person can instinctively relate it to a scientific publication rather than an advertisement. Mimicking this ability in VDU can enhance semantic representation learning, leading to more generalizable and flexible models. This raises our key research question: How can the ability to identify cross-modal similarities within previously encountered document samples enhance semantic representation learning for VDU models? By integrating this principle, we aim to develop robust VDU models capable of adapting to new document types without extensive labeled data, making them more efficient and human-like in perception.

Scene 8 (4m 53s)

[Audio] We propose a two-step self-supervised representation learning approach that distinguishes itself from recent VDU methods. The first step introduces two novel pre-training objectives. The first, L2M (Learning-to-Mine), enriches latent representations by leveraging nearest-neighbor relationships to create diverse positive pairs, drawing on a representative support set of embeddings. This objective helps align image and text features, enabling cross-modal learning and improving the model's understanding of semantic content. It also builds a shared low-dimensional space for both modalities.
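
To make the idea concrete, here is a minimal PyTorch sketch (not the authors' implementation) of projecting image and text features into a shared, normalized low-dimensional space and mining nearest-neighbor positives from a support set of embeddings; names such as `project_to_shared_space`, `mine_cross_modal_positives`, and `support_bank` are illustrative assumptions.

```python
# Minimal sketch of nearest-neighbor positive mining over a support set of
# embeddings (illustrative names; not the paper's actual implementation).
import torch
import torch.nn.functional as F

def project_to_shared_space(img_feats, txt_feats, img_head, txt_head):
    """Map image and text features into a shared, L2-normalized low-dim space."""
    z_img = F.normalize(img_head(img_feats), dim=-1)
    z_txt = F.normalize(txt_head(txt_feats), dim=-1)
    return z_img, z_txt

def mine_cross_modal_positives(z_img, support_bank, k=1):
    """For each image embedding, retrieve its k nearest embeddings from a
    support set (memory bank) to serve as additional positive pairs."""
    sims = z_img @ support_bank.t()      # cosine similarity (both normalized)
    _, nn_idx = sims.topk(k, dim=-1)     # indices of nearest neighbors
    return support_bank[nn_idx]          # (batch, k, dim) mined positives

# Toy usage with random features and linear projection heads.
img_head = torch.nn.Linear(768, 256)
txt_head = torch.nn.Linear(768, 256)
img_feats, txt_feats = torch.randn(8, 768), torch.randn(8, 768)
support_bank = F.normalize(torch.randn(1024, 256), dim=-1)

z_img, z_txt = project_to_shared_space(img_feats, txt_feats, img_head, txt_head)
positives = mine_cross_modal_positives(z_img, support_bank, k=3)
print(positives.shape)  # torch.Size([8, 3, 256])
```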

Scene 9 (5m 37s)

[Audio] This shared space allows the second objective, L2U, to predict whether a document image and its corresponding text form a matching pair, enabling the model to match more informative sample pairs.
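
As a rough illustration only, the following sketch shows what an L2U-style image-text matching head could look like: it classifies a fused (image, text) pair as match or no-match. The concatenation-based fusion, the shuffled-negative sampling, and names like `MatchHead` are assumptions for the sketch, not the paper's exact design.

```python
# Illustrative image-text matching head: given a fused (image, text) pair,
# predict whether they come from the same document.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))  # match / no-match

    def forward(self, z_img, z_txt):
        return self.classifier(torch.cat([z_img, z_txt], dim=-1))

def matching_loss(head, z_img, z_txt):
    """Pair each image with its own text (label 1) and with a shuffled text
    (label 0); harder negatives could instead be mined from the shared space."""
    neg_txt = z_txt[torch.randperm(z_txt.size(0))]
    logits = torch.cat([head(z_img, z_txt), head(z_img, neg_txt)], dim=0)
    labels = torch.cat([torch.ones(z_img.size(0)), torch.zeros(z_img.size(0))]).long()
    return F.cross_entropy(logits, labels)

head = MatchHead(dim=256)
z_img = F.normalize(torch.randn(8, 256), dim=-1)
z_txt = F.normalize(torch.randn(8, 256), dim=-1)
print(matching_loss(head, z_img, z_txt).item())
```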

Scene 10 (5m 51s)

[Audio] In the second step, we mine nearest neighbors based on feature similarity and integrate these semantically meaningful neighbors into a learnable framework. Using L2R (Learning-to-Reorganize), we encourage consistent and discriminative predictions for both image and text, along with their nearest neighbors, further enhancing the model's representation learning.
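
For intuition, here is a hedged sketch of a neighbor-consistency objective in the spirit of L2R, borrowing a SCAN-style agreement-plus-entropy formulation; the exact loss used by GlobalDoc may differ.

```python
# Sketch of a neighbor-consistency objective: encourage a sample and its mined
# nearest neighbors to receive similar (and confident) cluster assignments.
import torch
import torch.nn.functional as F

def neighbor_consistency_loss(logits, neighbor_logits, entropy_weight=2.0):
    """logits: (B, C) predictions for anchors; neighbor_logits: (B, K, C)
    predictions for each anchor's K mined neighbors."""
    p = logits.softmax(dim=-1)                # (B, C)
    p_nb = neighbor_logits.softmax(dim=-1)    # (B, K, C)

    # Consistency: maximize agreement between anchor and neighbor assignments.
    agreement = torch.einsum('bc,bkc->bk', p, p_nb).clamp_min(1e-8)
    consistency = -agreement.log().mean()

    # Entropy regularizer: avoid collapsing all samples into a single cluster.
    mean_p = p.mean(dim=0).clamp_min(1e-8)
    entropy = -(mean_p * mean_p.log()).sum()

    return consistency - entropy_weight * entropy

logits = torch.randn(8, 10)              # anchor predictions over 10 clusters
neighbor_logits = torch.randn(8, 5, 10)  # predictions for 5 neighbors each
print(neighbor_consistency_loss(logits, neighbor_logits).item())
```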

Scene 11 (6m 17s)

[Audio] As a downstream application, Few-Shot DIC better simulates real-world industrial settings, where new document categories frequently emerge with limited labeled data. The task challenges the model to quickly adapt its pre-trained embeddings to task changes, enabling efficient learning even in low-data scenarios. This aligns with the need for flexible and scalable VDU systems capable of handling new, unseen document types.
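
The following toy example illustrates one common way such a k-shot episode can be evaluated on top of frozen embeddings, using a nearest-prototype classifier; this protocol is an assumption for illustration, not necessarily the paper's exact meta-testing setup.

```python
# Minimal k-shot classification episode on frozen document embeddings,
# using a nearest-prototype classifier.
import torch
import torch.nn.functional as F

def prototype_episode(support_emb, support_labels, query_emb, query_labels):
    """support_emb: (N*K, D) embeddings of K labeled shots for N classes;
    query_emb: (Q, D) unlabeled queries drawn from the same N classes."""
    classes = support_labels.unique()
    # Class prototype = mean of its support embeddings.
    prototypes = torch.stack([support_emb[support_labels == c].mean(0) for c in classes])
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    preds = classes[sims.argmax(dim=-1)]
    return (preds == query_labels).float().mean().item()

# Toy 5-way 1-shot episode with random embeddings.
support_emb = torch.randn(5, 256)
support_labels = torch.arange(5)
query_emb = torch.randn(25, 256)
query_labels = torch.arange(5).repeat_interleave(5)
print(prototype_episode(support_emb, support_labels, query_emb, query_labels))
```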

Scene 12 (6m 45s)

[Audio] The Content-Based DIR task assesses the model's ability to retrieve relevant document images based on both uni-modal (vision-only or text-only) and cross-modal (vision-to-language or language-to-vision) inputs. This task ensures that our model can handle complex queries and retrieve information from diverse document formats.
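
A minimal sketch of how such retrieval can be run in a shared embedding space is shown below: the same cosine-similarity top-K routine serves uni-modal (image-to-image, text-to-text) and cross-modal (image-to-text, text-to-image) queries. Function and variable names are illustrative assumptions.

```python
# Sketch of top-K content-based retrieval in a shared embedding space.
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb, gallery_emb, k=5):
    """Return indices of the k most similar gallery items for each query."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.t()                       # cosine similarities
    return sims.topk(k, dim=-1).indices    # (num_queries, k)

# Toy example: cross-modal retrieval (image queries against a text gallery).
image_queries = torch.randn(4, 256)
text_gallery = torch.randn(100, 256)
print(retrieve_top_k(image_queries, text_gallery, k=5))
```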

Scene 13 (7m 8s)

[Audio] We conduct a thorough investigation of how effective the designed pretext objectives are. Through an ablation study, we experiment with various settings, combining our three proposed objectives (L2M, L2U, L2R) and meta-training techniques to evaluate performance across 1-shot, 5-shot, and 20-shot settings. The results, averaged over 600 experiments, show that our two-step pre-training approach significantly boosts semantic representation learning, improving both the uni-modal vision and language modalities. We evaluate GlobalDoc in a multi-modal setting with both offline and online stages, ensuring a fair comparison with the LayoutLMv3 baseline. Our results demonstrate that GlobalDoc significantly outperforms LayoutLMv3 across k-shot classification tasks, proving its ability to function effectively in an industrial online setting. This challenges the assumption that massive pre-training on millions of document samples is necessary, highlighting GlobalDoc's efficiency and adaptability in real-world applications.
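
As a hedged illustration of the evaluation protocol, the snippet below aggregates per-episode accuracies (for example, over 600 sampled episodes) into a mean and a 95% confidence interval, a common convention in few-shot evaluation; the sampling details here are assumptions.

```python
# Aggregate k-shot accuracies collected over many randomly sampled episodes.
import torch

def summarize_episodes(episode_accuracies):
    """Return mean accuracy and a 95% confidence interval over episodes."""
    accs = torch.tensor(episode_accuracies)
    mean = accs.mean()
    ci95 = 1.96 * accs.std() / (len(accs) ** 0.5)
    return mean.item(), ci95.item()

# e.g., accuracies collected from 600 episodes of prototype_episode(...)
fake_accs = torch.rand(600).tolist()
print(summarize_episodes(fake_accs))
```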

Scene 14 (8m 21s)

[Audio] We also evaluate GlobalDoc for Content-Based Document Image Retrieval (DIR) across uni-modal and cross-modal settings. In uni-modal retrieval, it performs well in retrieving the top-K relevant documents. However, cross-modal retrieval initially suffers performance drops due to mismatches between vision and language features. By integrating the CMAE module and the L2U pretext task, we enhance cross-modal alignment, resulting in a significant improvement in both uni-modal and cross-modal tasks. When further combined with L2R, GlobalDoc achieves better results across all uni-modal and cross-modal tasks. It also outperforms LayoutLMv3 in the multi-modal retrieval setting across all top-K retrieval scores, where LayoutLMv3 suffers a drastic performance drop. This highlights its ability to create semantically rich embeddings and handle complex document retrieval in real-world settings.
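
For reference, the sketch below shows one plausible way to score top-K retrieval, Precision@K with class labels defining relevance; whether this is the exact metric reported is an assumption.

```python
# Illustrative Precision@K: a retrieved item counts as relevant if it shares
# the query's class label.
import torch

def precision_at_k(retrieved_idx, query_labels, gallery_labels):
    """retrieved_idx: (Q, K) indices returned by a top-K retrieval routine."""
    retrieved_labels = gallery_labels[retrieved_idx]            # (Q, K)
    relevant = (retrieved_labels == query_labels.unsqueeze(1))  # (Q, K) bool
    return relevant.float().mean().item()

retrieved_idx = torch.randint(0, 100, (4, 5))
query_labels = torch.randint(0, 16, (4,))
gallery_labels = torch.randint(0, 16, (100,))
print(precision_at_k(retrieved_idx, query_labels, gallery_labels))
```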

Scene 15 (9m 27s)

[Audio] On the standard Document Image Classification (DIC) task on the RVL-CDIP dataset, GlobalDoc achieves SoTA performance in the uni-modal setting (92.58% accuracy) while pre-training on only 1.4M document samples, outperforming DiT_Base, which was pre-trained on 42M document images. It also excels in text-based classification (93.82%), narrows the performance gap in the Text+Layout setting, and outperforms some of the vision+text+layout models that rely heavily on extensive pre-training and commercial OCR engines. Regarding the Visual Information Extraction (VIE) task on the FUNSD dataset, unlike traditional VDU models that depend on OCR-based positional encodings, GlobalDoc leverages page-level information and achieves competitive performance despite lacking token-level layout features.

Scene 16 (10m 29s)

[Audio] Here we show some qualitative results in different settings of content-based document image retrieval. We analyze retrieval performance using challenging queries from different categories. While successful retrievals occur in both uni-modal and cross-modal settings, failure cases emerge due to inter-class similarity; for example, "specification" documents resemble "forms" or "publications" because of their structured tables. However, strong retrieval results are observed in categories like "advertisement," "news article," and "publication," showcasing GlobalDoc's ability to capture meaningful semantic relationships.

Scene 17 (11m 10s)

[Audio] We introduced GlobalDoc, a page-level pre-training framework that captures global relationships between the vision and language modalities, outperforming traditional VDU models without requiring massive datasets. Our two-step pre-training strategy (L2M, L2U, L2R) enhances semantic representation learning, achieving compelling performance in Few-Shot Document Image Classification (Few-Shot DIC) and Content-Based Document Image Retrieval (DIR). Looking ahead, we aim to improve zero-shot adaptability, enhance cross-modal learning, optimize for real-time industrial applications, and explore lightweight multimodal integration for richer document understanding. GlobalDoc paves the way for more generalizable, flexible, and human-like multimodal learning in VDU.

Scene 18 (12m 5s)

[Audio] Thank you for your attention. Informatique Image Interaction, Univ. La Rochelle.