26 Sept. 2024 - OSWORLD: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (cited: 56)

Description:

Autonomous agents that accomplish complex computer tasks with minimal human
interventions can significantly enhance accessibility and productivity of human-computer interactions. Existing benchmarks either lack interactive environments
or are limited to specific applications/domains, failing to reflect the diversity and
complexity of real-world computer use and limiting agent scalability. We introduce
OSWORLD, the first-of-its-kind scalable real computer environment for multimodal
agents, supporting task setup, interactive learning, and execution-based evaluation
of open-ended computer tasks across arbitrary applications in Ubuntu, Windows,
and macOS. Using OSWORLD, we create a benchmark of 369 tasks involving
real web and desktop apps in open domains, OS file I/O, and multi-app workflows.
Each example derives from real-world use cases and includes detailed setup and
execution-based evaluation for reproducibility. Extensive evaluation of state-of-the-art LLM/VLM agents on OSWORLD reveals deficiencies in their ability to serve
as computer assistants. While humans accomplish 72.4% of the tasks, the best
agents achieve <12.2%, struggling with GUI grounding and operational knowledge.
Comprehensive analysis using OSWORLD provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks.
Implementation and experiments are at https://os-world.github.io
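
To make the idea of execution-based evaluation concrete, here is a minimal sketch in Python. It is not the OSWORLD API: every name in it (`Task`, `run_task`, `success_rate`, the toy rename task, and both agents) is a hypothetical stand-in, and a temporary directory stands in for a real operating system. The sketch only illustrates the pattern the abstract describes: set up an initial state, let an agent act, then judge success by inspecting the resulting state rather than the agent's output text.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Sequence

# All names below are hypothetical illustrations, not part of OSWORLD.


@dataclass
class Task:
    instruction: str                  # natural-language goal given to the agent
    setup: Callable[[Path], None]     # prepares the initial environment state
    check: Callable[[Path], bool]     # inspects the final state for success


Agent = Callable[[str, Path], None]


def run_task(task: Task, agent: Agent) -> bool:
    """Create a fresh workspace, let the agent act, then check the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        agent(task.instruction, workdir)
        return task.check(workdir)


def success_rate(tasks: Sequence[Task], agent: Agent) -> float:
    """Fraction of tasks whose final state passes its check."""
    results = [run_task(task, agent) for task in tasks]
    return sum(results) / len(results)


def setup_rename(workdir: Path) -> None:
    (workdir / "draft.txt").write_text("hello")


def check_rename(workdir: Path) -> bool:
    # Success is defined by the file system state, not by what the agent says.
    return (workdir / "final.txt").exists() and not (workdir / "draft.txt").exists()


rename_task = Task(
    instruction="Rename draft.txt to final.txt",
    setup=setup_rename,
    check=check_rename,
)


def scripted_agent(instruction: str, workdir: Path) -> None:
    # Toy agent that happens to perform the right action.
    (workdir / "draft.txt").rename(workdir / "final.txt")


def idle_agent(instruction: str, workdir: Path) -> None:
    # Toy agent that does nothing, so the check should fail.
    return None
```

Calling `success_rate([rename_task], scripted_agent)` and then the same with `idle_agent` shows how one checker separates a completed task from an untouched one, which is the sense in which a percentage such as the reported human and agent success rates is computed over a task set.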

https://openreview.net/pdf?id=tN61DTr4Ed

Added to timeline:

Articles

By cecilia

4 months ago

Date:

26 Sept. 2024
