New research shows how the right UI element order can dramatically enhance LM agents' ability to navigate web and desktop environments, setting a new benchmark for task success rates in challenging pixel-only settings.
Research: The Impact of Element Ordering on LM Agent Performance
In an article recently submitted to the arXiv preprint* server, researchers at Carnegie Mellon University investigated how the ordering of user interface (UI) elements affects language model (LM) agents navigating virtual environments. They found that randomizing element order significantly degraded performance, causing up to a 50% drop in task success rates across various models, an effect comparable to removing all visible text from the agent's input.
In pixel-only environments, dimensionality reduction was identified as an effective method for ordering elements. The t-SNE dimensionality reduction technique, which preserves local spatial structure, allowed agents to detect functionally associated elements more easily, improving their task success rate. Applying this approach to the OmniACT benchmark, their method completed over twice as many tasks as the previous state of the art.
Related Work
Past work introduced benchmarks for evaluating agents navigating web, desktop, and mobile environments, such as World of Bits for web graphical UI (GUI) navigation. Despite advancements, agents still struggle with realistic benchmarks, completing only 12-15% of tasks. Most approaches rely on underlying structures such as the document object model (DOM) or accessibility trees, while newer methods focus on navigating graphical-only environments. Unlike previous models that depend on accessibility features, the current approach trains an object detection model to detect interactable UI elements directly from pixels.
Advancements in Agent Navigation
Benchmarks have been developed to evaluate agents that navigate web, desktop, and mobile environments. World of Bits is a notable example for web GUI navigation. This benchmark was one of the earliest attempts to test agents' abilities to interact with web-based environments, offering foundational insights into how agents process and respond to graphical interfaces.
However, agents struggle to perform well in realistic settings, often completing only 12-15% of tasks. This performance gap indicates the difficulties these systems face when applied to more complex and dynamic environments. Despite improvements in agent design, navigating these real-world benchmarks remains a significant challenge.
Many traditional methods rely on underlying structures like the DOM or accessibility trees to facilitate agent navigation. These structures provide detailed representations of the environment's layout, allowing agents to easily identify and interact with specific components. However, such underlying structures are not available in all environments, limiting the applicability of these approaches.
In contrast, newer techniques focus on navigating environments that provide only graphical representations, where agents must interpret pixels directly. Recent methods train object detection models specifically to identify interactable UI elements without relying on predefined structural features like accessibility trees. This approach enhances agents' ability to generalize across environments, even when conventional hierarchical structures like DOM trees are unavailable.
Enhancing LM Agents' Interaction
The study addressed the challenge of enabling LM agents to operate in pixel-only environments, such as VisualWebArena, where hierarchical representations like the DOM are inconsistent. Many mobile applications also lack proper labeling, complicating the detection of interactive elements. The authors focused on pixel-only scenarios, using VisualWebArena and the OmniACT benchmark for testing.
OmniACT offers agents both web and desktop environments, containing 177 application screenshots and 2,021 tasks in its test set. The agents aim to generate PyAutoGUI (a Python GUI automation library) code to navigate the application screenshots. Since OmniACT operates under the premise of only having access to pixel information, the authors employed an object detection model to identify interactive UI elements within the screenshots.
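For context, the expected agent output is a short PyAutoGUI script. The snippet below is a hypothetical example of such an output for an invented task ("open the File menu and click Save"); the coordinates are made up, whereas a real agent would target the centres of detected elements.

```python
# Hypothetical example of the kind of PyAutoGUI code an OmniACT agent emits.
# Coordinates are illustrative placeholders, not from the paper or benchmark.
import pyautogui

pyautogui.click(120, 45)   # click the "File" menu (assumed location)
pyautogui.click(140, 210)  # click the "Save" entry in the opened menu
```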
In addition, they used easy optical character recognition (EasyOCR) to extract visible text from the images, a key step in helping agents recognize interactive elements when limited to pixel information. To train the object detection model, they collected a dataset of 67,530 interactable elements from 1,468 Common Crawl webpages. Despite the shift from webpage data to desktop applications, the model performed reasonably well on the OmniACT benchmark and was publicly released for further research.
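As a rough sketch of this preprocessing step, the following snippet shows how EasyOCR can pull visible text and bounding boxes from a screenshot. The file name is a placeholder, and the merge with the released detector's boxes is only indicated in a comment, since that model's exact API is not described here.

```python
# Minimal sketch: extract visible text and box centres from a screenshot
# with EasyOCR. "screenshot.png" and the confidence cutoff are assumptions.
import easyocr

reader = easyocr.Reader(['en'], gpu=False)        # load the English OCR model
ocr_results = reader.readtext('screenshot.png')   # -> [(bbox, text, conf), ...]

elements = []
for bbox, text, confidence in ocr_results:
    if confidence < 0.4:                          # drop low-confidence reads
        continue
    xs = [p[0] for p in bbox]
    ys = [p[1] for p in bbox]
    elements.append({
        'text': text,
        'center': (sum(xs) / 4, sum(ys) / 4),     # centre of the quad box
    })

# `elements` would then be merged with the UI-element detector's boxes and
# ordered (e.g., via t-SNE) before being serialized into the agent's prompt.
```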
Evaluation Metrics
The primary evaluation metric used in this research was the task success rate, a standard for agent evaluations, while OmniACT introduces two metrics: sequence score and action score. The sequence score assesses whether the output contains the correct high-level action, while the action score evaluates both the action and its associated parameters; because it checks parameters as well, the action score closely mirrors the task success rate (see the toy illustration below). The researchers reported that t-SNE ordering outperformed random and raster methods, improving action scores by more than 10% in some tests.
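To make the distinction concrete, here is a toy illustration; this is not OmniACT's official scoring implementation, just a simplified sketch of what each metric checks.

```python
# Toy illustration of sequence score vs. action score (NOT OmniACT's
# official scoring code): sequence credit needs only the right action type,
# action credit needs the right parameters too.
def sequence_match(pred: str, gold: str) -> bool:
    """Credit the prediction if it calls the same high-level action."""
    return pred.split("(")[0].strip() == gold.split("(")[0].strip()

def action_match(pred: str, gold: str) -> bool:
    """Stricter check: the action AND its parameters must match."""
    return pred.replace(" ", "") == gold.replace(" ", "")

gold = "pyautogui.click(120, 45)"
print(sequence_match("pyautogui.click(300, 90)", gold))  # True: same action
print(action_match("pyautogui.click(300, 90)", gold))    # False: wrong coords
print(action_match("pyautogui.click(120,45)", gold))     # True: exact match
```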
The authors explored several ordering methods: random ordering as a baseline, raster scanning (reading elements left to right, top to bottom), and t-distributed stochastic neighbor embedding (t-SNE), which reduces each element's two-dimensional position to a single ordering dimension to improve element organization and agent performance.
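The following minimal sketch shows one way such t-SNE ordering could be implemented with scikit-learn, reducing each element's 2D bounding-box centre to a single dimension and sorting along it. The function name and parameter choices are illustrative assumptions, not taken from the paper's code.

```python
# Minimal sketch of t-SNE-based element ordering, assuming each detected
# UI element is represented by the (x, y) centre of its bounding box.
import numpy as np
from sklearn.manifold import TSNE

def tsne_order(centers: np.ndarray, seed: int = 0) -> np.ndarray:
    """Project 2D element centres onto one t-SNE dimension and return
    the indices that sort the elements along that dimension."""
    perplexity = min(30, len(centers) - 1)  # must stay below sample count
    embedding = TSNE(
        n_components=1,        # reduce (x, y) -> a single ordering axis
        perplexity=perplexity,
        random_state=seed,     # fixed seed for a deterministic ordering
    ).fit_transform(centers)
    return np.argsort(embedding[:, 0])

# Example: bounding-box centres (in pixels) for five detected elements.
centers = np.array([[40, 30], [45, 35], [300, 32], [42, 200], [310, 210]], float)
print(tsne_order(centers))  # nearby elements end up adjacent in the order
```

Because t-SNE preserves local neighborhoods, elements that sit close together on screen, and thus tend to be functionally related, end up adjacent in the resulting one-dimensional order.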
Experimental results highlighted the significant impact of ordering on performance across various models. Random ordering led to considerable drops in performance compared to structured ordering methods. For instance, when tested with human annotations, raster ordering performed well but was limited by the scarcity and specificity of the annotations. When using ground-truth data from VisualWebArena, t-SNE ordering outperformed both raster and random ordering.
The findings indicated that raster ordering performed best when utilizing human-annotated elements; however, these annotations were limited in quantity and scope, underscoring the challenge of scaling high-quality annotations across different applications.
The research demonstrated state-of-the-art performance on OmniACT by combining the best-performing element features with t-SNE ordering, highlighting the benefits of multimodal representations. The study showed significant improvements in agent interactions with applications based solely on pixel information by identifying interactive elements through an object detection model and employing effective ordering strategies.
Broader Impact
This research not only enhances LM agents’ navigation abilities but also raises critical questions about the ethical and practical implications of such advancements. The potential for LM agents to automate complex tasks in real-world environments opens up significant opportunities in areas such as accessibility, where agents could assist users with disabilities. However, there are also risks associated with privacy and security, as these agents often need access to sensitive user data or system interfaces. Furthermore, the ability of agents to perform actions autonomously could raise concerns about accountability, particularly in cases where errors occur or malicious use is involved, such as automated cyber-attacks.
Conclusion
To sum up, the study conducted thorough ablations demonstrating that element ordering significantly impacts agent performance. The team provided a method for ordering elements through dimensionality reduction, which yielded the best results in realistic, pixel-only environments.
A UI element detection model was trained on Common Crawl data and shared publicly. The research demonstrated an end-to-end method enabling an LM agent to operate solely on pixel information, achieving state-of-the-art performance on OmniACT.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.
Journal reference:
- Preliminary scientific report.
Chi, W., et al. (2024). The Impact of Element Ordering on LM Agent Performance. arXiv. DOI: 10.48550/arXiv.2409.12089, https://arxiv.org/abs/2409.12089v2