Benchmark Construction
We assigned three annotators (selected from the authors) to each of the four simulated websites. To ensure consistency in task quality across different websites, one annotator was assigned to all four websites. In total, ten annotators were involved in the task creation process. This annotation process was both meticulous and labor-intensive, totaling over 300 hours.
1. Emphasis on Memory-intensive Analytical Tasks.
We deliberately focused on collecting tasks that require memory—that is, tasks in which information from past observations is essential to reach the correct answer. Such tasks are common in real-world scenarios but remain largely underrepresented in existing benchmarks such as WebArena.
To avoid overly simplistic tasks, we first prototyped early task ideas and evaluated them using a Claude-based agent to identify model limitations and refine the task designs. This process ensured that our final tasks were both meaningful and appropriately challenging.
2. Reducing Ambiguity in Task Specification and Evaluation.
We explicitly instructed annotators to eliminate ambiguity in both task descriptions and evaluation criteria. While handling ambiguous instructions is important for agents aiming to operate flexibly in real-world human interactions, we prioritized clear evaluability, since reliable evaluation is essential for measuring progress.
In WebArena, vague instructions often lead to scenarios where agents produce reasonable answers that are nevertheless marked as failures. In addition, we observed that WebArena's evaluation protocol can fail to assess answers reliably because output format expectations are left vague. To mitigate ambiguity in answer evaluation, we standardized the required output format (e.g., "Provide only the answer without any additional words.") whenever the answer is evaluated by exact match against the ground truth.
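To illustrate how a standardized output format enables straightforward exact-match scoring, consider the minimal sketch below. The function names and the light normalization rules are illustrative assumptions, not the benchmark's actual evaluation code.

```python
# Minimal sketch of exact-match evaluation, assuming the task instruction
# already constrains the agent to output only the answer string.
# Function names and normalization rules are illustrative, not the
# benchmark's actual evaluator.

def normalize(text: str) -> str:
    """Apply light normalization so formatting noise does not mask a correct answer."""
    return " ".join(text.strip().lower().split())

def exact_match(agent_answer: str, ground_truth: str) -> bool:
    """Return True if the agent's answer matches the ground truth after normalization."""
    return normalize(agent_answer) == normalize(ground_truth)

# Example: the instruction ends with
# "Provide only the answer without any additional words."
assert exact_match("  42 ", "42")
assert not exact_match("The answer is 42.", "42")
```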
3. Template-based Task Construction and Extension.
Following WebArena, we instructed annotators to create task templates and instantiate each template into multiple task instances. Annotators were also responsible for providing several instantiations for each template variable. This templated design enables a more robust and systematic evaluation of agent performance across tasks that share semantic similarity but exhibit diverse execution traces.
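As a concrete illustration, the sketch below instantiates a task template by substituting values for its variables. The template text, variable names, and values are hypothetical examples rather than actual benchmark data.

```python
from itertools import product

# Hypothetical task template with placeholder variables; the template text,
# variable names, and values are illustrative, not actual benchmark data.
template = "How many {status} issues labeled '{label}' were opened in {repo} last month?"
variables = {
    "status": ["open", "closed"],
    "label": ["bug", "enhancement"],
    "repo": ["gitlab-org/gitlab"],
}

# Each combination of variable values yields one task instance that shares the
# template's intent but requires a different execution trace.
instances = [
    template.format(**dict(zip(variables, combo)))
    for combo in product(*variables.values())
]

for task in instances:
    print(task)
```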
We created a total of 117 task templates: 25 for Shopping, 29 for Shopping Admin, 20 for Reddit, 28 for GitLab, and 15 for Cross-site tasks. On average, each template yielded approximately 4.5 task instances. Note that WebArena additionally includes several tasks based on the map website (OpenStreetMap).