Lumos Enhances Multimodal AI with On-Device STR


By Dr. Silpaja Chandrasekar, Ph.D. | Reviewed by Susha Cheriyedath, M.Sc. | Aug 28, 2024

In an article recently posted to the Meta Research website, researchers introduced Lumos, which they describe as the first end-to-end multimodal question-answering system with text-understanding capabilities. Lumos integrates a scene text recognition (STR) component that extracts text from first-person images, enriching the input to a multimodal large language model (MM-LLM).

*Study: Lumos Enhances Multimodal AI with On-Device STR. Image Credit: Ken stocker/Shutterstock.com*

The paper addressed STR quality, latency, and model inference challenges. It detailed the system architecture, design choices, techniques, and a comprehensive evaluation demonstrating high quality and efficiency.

Background

Visual question answering has attracted growing attention with recent progress in LLMs and vision-language pre-training, and industry forecasts suggest that smart assistants will soon achieve a human-like understanding of scenes and images. Prior approaches often relied on MM-LLMs alone for understanding text in images, without a standalone STR component. Such implementations typically either transferred high-resolution images to the cloud for processing, which added latency, or fell back to low-resolution thumbnails, which degraded text-understanding performance.

Lumos System Architecture

The architecture of Lumos involves a streamlined process for handling multimodal queries. Upon triggering the system, the device captures and processes an image at two resolutions: 3K × 4K (full resolution) and 450 × 600 (thumbnail). Concurrently, automatic speech recognition (ASR) processes the voice query while image capture, compression, and transfer to the cloud proceed. STR begins as soon as the full-resolution image is available, with the system designed to parallelize time-consuming tasks such as STR inference and image transfer to minimize latency.
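The parallel execution described above can be illustrated with a minimal Python sketch. It is not the Lumos implementation: the functions `run_str`, `upload_thumbnail`, and `run_asr` are hypothetical stand-ins that only sleep to simulate work, and the use of a thread pool is an assumption made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_str(full_res_image: bytes) -> str:
    """Hypothetical stand-in for on-device STR on the full-resolution image."""
    time.sleep(0.2)  # simulated inference time
    return "recognized text"


def upload_thumbnail(thumbnail: bytes) -> str:
    """Hypothetical stand-in for compressing and sending the thumbnail."""
    time.sleep(0.2)  # simulated network transfer
    return "upload-ok"


def run_asr(audio: bytes) -> str:
    """Hypothetical stand-in for speech recognition on the voice query."""
    time.sleep(0.1)  # simulated transcription time
    return "what does this sign say"


def handle_query(full_res_image: bytes, thumbnail: bytes, audio: bytes) -> dict:
    # Submit the three slow steps at once so they overlap instead of
    # running back to back; wall time is roughly that of the slowest step.
    with ThreadPoolExecutor(max_workers=3) as pool:
        str_future = pool.submit(run_str, full_res_image)
        upload_future = pool.submit(upload_thumbnail, thumbnail)
        asr_future = pool.submit(run_asr, audio)
        return {
            "str_text": str_future.result(),
            "upload": upload_future.result(),
            "query": asr_future.result(),
        }


result = handle_query(b"full", b"thumb", b"audio")
print(result)
```

Run sequentially, the three simulated steps would take about 0.5 seconds; overlapped, the total is close to the 0.2 seconds of the slowest one, which is the effect the design aims for.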

On the cloud side, a proprietary MM-LLM takes the low-resolution thumbnail, the text recognized by STR, and the user query transcribed by ASR, and generates a response. Text-to-speech (TTS) then converts the response into voice and sends it back to the user. The design splits the work between device and cloud: text recognition, which needs the full-resolution image, runs on the device, while the cloud receives only the thumbnail and the recognized text.
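As a rough illustration of how the ASR query and the STR output might be merged into a single text input for the MM-LLM, consider the sketch below. The prompt layout, the function name `build_text_input`, and the coordinate format are all assumptions made for this example; the article does not describe the actual format Lumos uses.

```python
from typing import List, Tuple


def build_text_input(
    query: str,
    str_lines: List[Tuple[str, Tuple[int, int]]],
    include_positions: bool = False,
) -> str:
    """Combine the user query and recognized text into one prompt string.

    The layout is hypothetical. Each STR entry is (text, (x, y)), where
    (x, y) is an assumed top-left pixel position of the text line.
    """
    parts = []
    if str_lines:
        parts.append("Text recognized in the image:")
        for text, (x, y) in str_lines:
            if include_positions:
                parts.append(f"- {text} (x={x}, y={y})")
            else:
                parts.append(f"- {text}")
    parts.append(f"User question: {query}")
    return "\n".join(parts)


lines = [("EXIT AHEAD", (10, 10)), ("2 miles", (10, 50))]
print(build_text_input("How far is the exit?", lines, include_positions=True))
```

When no text is recognized, the function returns only the question, which mirrors the idea that the MM-LLM still receives generic, non-text queries.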

The architecture incorporates three key design choices. First, STR is performed on-device using the full-resolution image to ensure high-quality text recognition. Second, STR latency is minimized by using hardware acceleration and by executing STR and image transfer in parallel. These strategies help maintain efficiency despite the high computational demands of on-device STR.

Finally, the system also covers queries for which STR is not needed. Rather than deciding on the device whether recognized text is relevant, Lumos defers that decision to the MM-LLM, which can handle both text-heavy and generic questions.

Although on-device STR imposes constraints on model architecture, latency, memory, and battery life, the performance of the on-device STR model remains competitive with cloud-based solutions, thanks to significant optimizations and efficient hardware use.

On-Device STR

Lumos employs an on-device STR pipeline with four key components. Region-of-interest (ROI) detection identifies and extracts the relevant portion of the image, reducing computational cost and noise. Text detection identifies word bounding boxes, while text recognition converts the contents of these boxes into readable text.

Reading-order reconstruction, the fourth component, organizes recognized words into coherent paragraphs based on their layout. The pipeline addresses challenges specific to on-device STR, including hardware constraints and the variability of in-the-wild text, by using efficient models such as FBNetv2, a network family found through neural architecture search, and techniques such as keypoint detection for ROI and curriculum learning for text recognition. The result is an STR pipeline that balances accuracy and performance under practical constraints.
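To give a feel for what reading-order reconstruction involves, the sketch below groups word boxes into lines and orders them left to right. This is a deliberately simple heuristic written for illustration, not the method used in Lumos; the `Word` class, the `reading_order` function, and the `line_tol` threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x: float  # left edge of the bounding box
    y: float  # top edge of the bounding box
    w: float  # box width
    h: float  # box height


def center_y(word: Word) -> float:
    return word.y + word.h / 2


def reading_order(words: List[Word], line_tol: float = 0.5) -> List[str]:
    """Group words into lines by vertical position, then sort each line by x.

    A word joins the current line if its vertical center is within
    line_tol * (mean word height of that line) of the line's mean center.
    """
    lines: List[List[Word]] = []
    for word in sorted(words, key=center_y):
        if lines:
            last = lines[-1]
            mean_cy = sum(center_y(wd) for wd in last) / len(last)
            mean_h = sum(wd.h for wd in last) / len(last)
            if abs(center_y(word) - mean_cy) <= line_tol * mean_h:
                last.append(word)
                continue
        lines.append([word])
    return [" ".join(wd.text for wd in sorted(line, key=lambda wd: wd.x))
            for line in lines]


words = [
    Word("AHEAD", 120, 12, 90, 20),
    Word("EXIT", 10, 10, 100, 20),
    Word("miles", 50, 52, 60, 20),
    Word("2", 10, 50, 30, 20),
]
print(reading_order(words))  # ['EXIT AHEAD', '2 miles']
```

Real scenes with rotated text, multiple columns, or curved layouts would need more than this heuristic, which is one reason reading order is treated as its own pipeline stage.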

Enhanced Text Recognition

The experimental evaluation covered two areas: end-to-end question answering, and the quality, efficiency, and hardware usage of the on-device STR solution. For question answering, the experiments compared three variants of Lumos: the MM-LLM alone, the MM-LLM with on-device STR, and the MM-LLM with on-device STR plus positional information from the reading-order reconstruction module.

The results show that integrating on-device STR raises question answering (QA) accuracy from 52% to 78%, a gain of 26 percentage points, with summarization tasks benefiting in particular. Including positional information improves accuracy further to 79.6%, indicating that the system handles spatial relationships between words better when word positions are available.

The performance metrics also indicate that Lumos' on-device STR achieves a competitive word error rate (WER) compared with established STR systems, with the on-device model being notably efficient at the cost of a slight trade-off in quality.
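For readers unfamiliar with the metric, WER is commonly computed as the number of word-level substitutions, deletions, and insertions needed to turn the recognized text into the reference, divided by the number of reference words. The sketch below implements that standard sequence-based definition; the exact matching protocol used in the Lumos evaluation may differ, and the function name `word_error_rate` is chosen for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("no parking after 6 pm", "no parking after 8 pm"))  # 0.2
```

A lower WER is better, and values above 1.0 are possible when the hypothesis contains many extra words.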

In terms of efficiency, Lumos' on-device STR solution shows substantial gains. With an export size of approximately 8 MB, the model achieves up to a 9x reduction in latency and a 3x decrease in energy consumption when run on a hardware accelerator rather than a central processing unit (CPU).

The ROI detection component significantly enhances performance by reducing image size while maintaining high word recall, and advanced techniques like data augmentation and model quantization contribute to overall efficiency. These results underscore the effectiveness of Lumos' approach in balancing high-quality text recognition with optimized performance and resource usage on edge devices.

Conclusion

To sum up, the paper introduced Lumos as a smart multimodal assistant with text-understanding capabilities designed to work within on-device constraints. The evaluation demonstrated that the hybrid approach, combining on-device STR with an on-cloud MM-LLM, achieved higher accuracy than the MM-LLM alone while meeting the on-device requirements.

This work marks a notable step in integrating MM-LLMs into real-world text recognition applications. As future work, the researchers plan to further optimize the on-device models and explore end-to-end text recognition with the MM-LLM.

Journal reference:

Shenoy A et al. (2024) Lumos: Empowering Multimodal LLMs with Scene Text Recognition. Meta.com. https://ai.meta.com/research/publications/lumos-empowering-multimodal-llms-with-scene-text-recognition/

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.


