- Opening remarks
- Alex Shaw and Ryan Marten - Evaluating Agents with Harbor and TerminalBench 3.0
- Coffee break
- Alex Dimakis - Evaluating and Optimizing Agents with RL Environments
  We are moving from "What does AI know?" to "What can AI do?". To evaluate and train autonomous agents, we are moving from datasets of questions and answers to the creation of environments, which are the new type of data. We will discuss how Terminal Bench environments can serve as the paradigm for training and optimizing agents, from prompt optimization and harness optimization (with GEPA or other evolutionary algorithms) to weight updates with RL. We'll also cover recent work on adaptive taxonomies and how GEPA-type evolution techniques can be used for environment building.
- Ram Sampath - RL-ADA: Co-Evolutionary Adversarial Training for Self-Improving Customer Support Agents Without Human Feedback
- Victor Ojewale - What Benchmarks Don’t Measure: The Case for Evaluating Abstention Competence in Autonomous Agents
- Ayush Sawarni - CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
ACM CAIS 2026 Workshop
RLEval: Methods and Reinforcement Learning Environments for Evaluating AI Agents
We invite submissions of papers on AI agent evaluation: methods, RL environment design, benchmarks, and real-world case studies.
Workshop Focus
Trillions are being invested in LLM-powered AI Agents, but many open questions remain regarding how to effectively evaluate them.
Held at the ACM Conference on AI and Agentic Systems, this workshop provides the first-ever research venue for such questions:
- What principles make reinforcement learning environments useful for evaluating (and training) AI agents?
- What are effective methods for agent evaluation (particularly interventional/counterfactual techniques and other causal methodologies)?
- What benchmarks should we trust to measure and guide meaningful agentic progress?
- What is and is not working in real-world/enterprise agent deployments?
Agenda
- Alex Smola - Submodular Benchmark Selection
- Lunch break
- Corby Rosset - The Art of Building Verifiers for Computer Use Agents
- Rasool Fakoor - Don’t Let Systems Swallow the Algorithm: Rethinking RL for Large Models
- Max Lamparth - Reward Bias Substitution: Single-Axis Mitigations Shift Optimization Pressure
- Poster session (in a different room: Carmel/Monterey; ends at 17:00)
Invited Speakers
Accepted Papers
- RL-ADA: Co-Evolutionary Adversarial Training for Self-Improving Customer Support Agents Without Human Feedback
  Ram Narayanan Ananthakrishnapuram Sampath, Harshit Rajgarhia, Abhishek Mukherji
- What Benchmarks Don’t Measure: The Case for Evaluating Abstention Competence in Autonomous Agents
  Victor Ojewale, Suresh Venkatasubramanian
- CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
  Ayush Sawarni, Jiyuan Tan, Vasilis Syrgkanis
- Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents
  Matthew Turk
- Reward Bias Substitution: Single-Axis Mitigations Shift Optimization Pressure
  Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing, Mykel Kochenderfer
- Verifying Agents in Rubric-Graded Environments
  Markus Dücker, Vaibhav Kumar, Yi Liu, Ronak Chaudhary, Andreas Plesner, Francisco Guzmán, Anish Athalye
- Stochastic Collapse Environments: A Benchmark for Agentic AI Research
  Advait Parulekar, Alex Dimakis
- BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
  Elaine Lau, Markus Dücker, Ronak Chaudhary, Hui Wen Goh, Rosemary Wei, Vaibhav Kumar, Saed Qunbar, Guram Gogia, Yi Liu, Scott Millslagle, Nasim Borazjanizadeh, Ulyana Tkachenko, Samuel Eshun Danquah, Collin Schweiker, Vijay Karumathil, Asrith Devalaraju, Andrew Peter Martin, Varsha Sandadi, Haemi Nam, Punit Arani, Ray Epps, Abdullah Arif, Sahil Bhaiwala, Curtis Northcutt, Skyler Wang, Anish Athalye, Jonas Mueller, Francisco Guzmán
- Anchor: Mitigating Artifact Drift in Agent Benchmark Generation
  Maksim Ivanov, Abhijay Rana
- How Reliable Are Agent Leaderboards? A Variance-Decomposition Analysis
  Michael Hardy, Anka Reuel, Ruhana Azam, Mykel Kochenderfer, Sanmi Koyejo
- TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
  Pei Yang, Wanyi Chen, Tongyun Yang, Pengbin Feng, Jiarong Xing, Wentao Guo, Yuhang Yao, Yuhang Han, Hanchen Li, Xu Wang, Zeyu Wang, Jie Xiao, Anjie Yang, Lynn Ai, Eric Yang, Tianyu Shi
- Rethinking MedAgentBench: A Framework for Fair Medical LLM Agent Evaluation
  Ananya Mantravadi, Prasanna Desikan, Abhishek Mukherji
- Beyond Pass@1: K-Sample Behavioral Equivalence for Code-Agent Evaluation
  Ahmad A Rushdi
- Organizational Control Layer: Governance Infrastructure for Mixed Human-AI Economic Systems
  Tianyu Shi, Yang Mo, Zhuonan Hao, Zhoumeng, Wenzhuo Hu, Nan Yu, Fucheng Deng, Jiangbo Yu
- Beyond Leaderboards: Tokenomics of Agentic Small Language Model Ensembles
  Alexei Skurikhin, Emily Taylor, Nathan A. DeBardeleben
- MERA: Model Evolution and Routing with Skill Adaptation for Agentic Systems at Scale
  Yuhang Yao, Zeyu Wang, Tongyun Yang, Wanyi Chen, Yuhang Han, Jie Xiao, Chengke Bao, Tianyu Shi
- Beyond Static Evaluation: Building Simulation Environments for Scalable Agentic Reinforcement Learning
  Akshay Arora, Ishan Nigam, Ashutosh Aggarwal, Shefali Bansal, Krishna Kumar Singh, Sweta Kumari, Nikhil Mittal, Md Shariq Farhan, Siddarth Malreddy
- ClawBot-Matching: Bidirectional, Explainable, and Learnable Collaboration Matching in Mixed Human-Agent Networks
  Jiayao Gu, Kexin Chu, Peidong Liu, Yue Yang, Lynn Ai, Ling Yang, Tianyu Shi
- Adaptive Adversarial Evaluation of Agentic Email Graders
  Karthikeya Aditya Vissa, Ajay Krishna Anugu, Prasanna Desikan, Abhishek Mukherji
- MultiHop-Tool-N1: Long-Horizon Tool Calling in Language Models Via Interleaved Reasoning and Tool-Use
  Anushka Deshpande
Organizers
- Jonas Mueller (Handshake): Director of AI Research
- Anish Athalye (Handshake): Director of AI Research
- Priyaranjan Pattnayak (Oracle AI): Senior Principal Scientist
- Aziza Mirsaidova (Oracle AI): Applied Scientist
- Natasha Jaques (University of Washington, Google DeepMind): Assistant Professor and Staff Research Scientist
- Andi Peng (humans&): Co-Founder
- Rasool Fakoor (Boson AI): Member of Technical Staff
- Ahmed Elgohary (Microsoft Research): Senior Research Scientist
- Alina Gavrilov (Stanford University): Graduate Student
- Aparna Elangovan (Collate): Head of AI
- Miguel Ballesteros (Oracle AI): AI Architect
- Graham Horwood (Oracle AI): Senior Director of Applied Science
Additional Reviewers
Shikhar Gupta, Arjun Chakraborty, Nikhil Reddy Pallepati, Sivakumar Selvaraj, Meher Gitika Karumuri, Thomas Brink, Joseph Axisa, Kalyani Limaye
Conference Venue
This workshop is co-located with the ACM Conference on AI and Agentic Systems (ACM CAIS 2026), held from May 26–29, 2026 in San Jose, California.
Address: DoubleTree by Hilton San Jose, 2050 Gateway Place, San Jose, CA 95110
Rooms: San Juan for the general workshop presentations, Carmel/Monterey for afternoon poster presentations
Call for Papers
We invite submissions of 4-page short papers on new agent evaluation methods, RL environment design, agentic benchmarks, and real-world case studies. Work-in-progress submissions are encouraged!
✨ Topics of Interest
- Agent evaluation methods, in particular interventional and causal/counterfactual techniques.
- RL environments: design principles, software frameworks, synthetic data, tool design.
- Automated graders: LLM-as-a-judge, verifiers, rubrics, reward hacking, human feedback.
- Benchmarks: new benchmarks, analyses of existing benchmarks.
- Enterprise agent case studies: production evaluation and deployment lessons.
- Considerations within the above topics for agents with particular capabilities: code execution, computer use, multimodal I/O, NL2SQL, skills, memory, web search.
📝 Submission Details
- Paper length: 4 pages main text; additional pages allowed for references and appendix
- Formatting: ACM acmart/sigconf template
- Review process: Single-blind (no anonymization required)
- Visibility: Reviews and paper decisions will not be made public
- Workshop format: interactive poster session + selected Contributed Talks + Best Paper/Poster Award
- Policy: Papers under review elsewhere are allowed; already-published papers are not
- At least one author of each accepted paper must register and attend
Links
Submit Paper: OpenReview
Conference Website: caisconf.org
Contact Us: rl-eval@googlegroups.com
Key Dates
FAQ
Will accepted papers be archival?
Accepted papers are not archival.
Can previously published work be submitted?
Submissions under review elsewhere are allowed, but already published papers are not.
Who should attend?
Researchers and practitioners working on agentic evaluation, RL environments, benchmarks, and enterprise deployments.