ACM CAIS 2026 Workshop
We invite submissions of 4-page short papers on methods, RL environment design principles, benchmarks, and real-world case studies for evaluating AI agents. Work-in-progress research is welcome.
Agentic systems increasingly rely on autonomous reasoning, tool use, and multi-step planning. Traditional LLM and RL benchmarks rarely capture high-value professional workflows, especially under realistic constraints.
This workshop asks: what makes a reinforcement learning environment actually useful for training and evaluating AI agents that deliver measurable economic impact after deployment?
Topics of interest include:
Interventional evaluations, causal frameworks, counterfactual test construction.
Verifiers, rubric systems, LLM-as-a-judge reliability and calibration studies (a toy verifier sketch follows this list).
New agent benchmarks, failure analyses, benchmark contamination and drift.
Environment design, tool interfaces, synthetic trajectories, software frameworks (a minimal environment sketch also follows this list).
Applied case studies connecting evaluation design to production outcomes.
Code execution, NL2SQL and structured queries, computer use, multimodal I/O, MCP tools, skills, memory, and web search.
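As a concrete, deliberately toy illustration of the verifier and rubric topic above, the Python sketch below scores an agent's NL2SQL-style result against a reference result set. The RubricItem and Verdict classes, the criteria, and the weights are hypothetical choices made for this example, not artifacts of the workshop.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    criterion: str        # what is being checked, e.g. "row set matches reference"
    weight: float         # contribution to the final score
    passed: bool = False  # set by a programmatic check or an LLM judge


@dataclass
class Verdict:
    items: list

    @property
    def score(self) -> float:
        # Weighted fraction of rubric items that passed.
        total = sum(i.weight for i in self.items)
        return sum(i.weight for i in self.items if i.passed) / total if total else 0.0


def verify_sql_answer(agent_rows, reference_rows) -> Verdict:
    """Deterministic verifier: does the agent's query result match the
    reference result set, ignoring row order? A fuller verifier would
    also inspect the SQL itself or consult a calibrated LLM judge."""
    return Verdict([
        RubricItem("row multiset matches reference", weight=0.8,
                   passed=sorted(agent_rows) == sorted(reference_rows)),
        RubricItem("result is non-empty", weight=0.2,
                   passed=bool(agent_rows)),
    ])


print(verify_sql_answer([("acme", 42)], [("acme", 42)]).score)  # -> 1.0
```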
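Likewise, as a sketch of the environment design and tool interface topic, here is a toy tool-using environment loosely following the Gymnasium reset()/step() convention. The single calculator tool, the fixed task, and the reward logic are invented for illustration only.

```python
class CalculatorEnv:
    """Episode: the agent must report the value of a fixed arithmetic
    expression. Actions are (tool_name, argument) pairs; tool calls are
    free, and the episode ends when the agent submits an answer."""

    def reset(self):
        # New episode: fixed task here; a real environment would sample one.
        self.expression = "17 * 3"
        self.target = 51
        return {"task": f"What is {self.expression}? Use the calculator, then answer."}

    def step(self, action):
        tool, arg = action
        if tool == "calculator":
            # Tool call: return an observation, zero reward, episode continues.
            result = eval(arg, {"__builtins__": {}})  # toy only; never eval untrusted input
            return {"tool_result": result}, 0.0, False
        if tool == "answer":
            # Terminal action: binary reward from a programmatic check.
            return {}, float(arg == self.target), True
        return {"error": f"unknown tool {tool!r}"}, 0.0, False


env = CalculatorEnv()
print(env.reset()["task"])
obs, reward, done = env.step(("calculator", "17 * 3"))
obs, reward, done = env.step(("answer", obs["tool_result"]))
print(reward, done)  # -> 1.0 True
```

A binary, programmatically verifiable reward at episode end is one common pattern the verifier topic above targets; richer environments replace the calculator with real tool interfaces such as SQL engines, browsers, or MCP servers.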
Call for Papers - RLEval Workshop @ ACM CAIS 2026
RLEval: Methods and Reinforcement Learning Environments for Evaluating AI Agents
Date: May 26, 2026 · San Jose, CA
We are organizing a workshop at the ACM Conference on AI and Agentic Systems (ACM CAIS), and we would love your submission. We invite submissions of 4-page short papers on new methods, RL environment design principles, benchmarks, and real-world case studies for evaluating AI agents. This workshop brings together researchers and practitioners working on rigorous evaluation for agentic systems, with particular interest in reinforcement learning environments and techniques that enable systematic measurement of agent behavior. Work-in-progress submissions are strongly encouraged.
Submit through the official OpenReview workshop portal.
Workshop: rl-eval.github.io
Conference: caisconf.org
Questions: rl-eval@googlegroups.com
Industry Keynote
Large-scale agent evaluation and reliability under production constraints.
Research Keynote
Frontiers in RL environments for tool-using language agents.
Panel Moderator
Cross-sector perspective on benchmark quality and economic outcomes.
Handshake
Leads a group advancing Data and Evaluations for frontier AI. Previously at Cleanlab, where he pioneered methods to detect and remediate issues in enterprise AI agents. Earlier, he was a Senior Scientist at AWS building AutoML and deep learning services. He completed his PhD at MIT CSAIL and undergraduate studies at UC Berkeley.
Handshake
Leads research on RL environments and post-training. Previously co-founder and CTO at Cleanlab (acquired by Handshake). He completed his PhD at MIT CSAIL in the PDOS group, advised by Frans Kaashoek and Nickolai Zeldovich.
Oracle AI
Senior Principal Scientist at Oracle Cloud, leading research initiatives in Generative AI, NLP, and AI robustness for enterprise-scale multilingual and multimodal systems. His work emphasizes production-grade agentic frameworks, low-resource language modeling, evaluation, and trustworthy AI deployment.
Oracle AI
Works on GenAI evaluations across multimodal, text, and code generation. Also a 2026 Ethics and Technology Practitioner Fellow at Stanford. Previously worked on AI safety in Microsoft's Responsible AI research team.
University of Washington, Google DeepMind
Focuses on social reinforcement learning in multi-agent and human-AI interactions. During her PhD at MIT, she developed foundational RLHF techniques for language models. Her work has received awards at NeurIPS and ICML and has been widely featured in research and media venues.
humans&
Previously worked on RL and post-training at Anthropic across Claude 3.5 to 4.5, and completed her PhD at MIT CSAIL advised by Jacob Andreas and Julie Shah. Her work centers on agents that learn representations from rich human knowledge.
Boson AI
Leads reinforcement learning and modeling efforts for robust agentic systems across changing tasks and environments. Previously a researcher and tech lead at AWS, and a founding member of post-training efforts for Amazon’s early RLHF models.
Microsoft Research
Ahmed is a Senior Researcher at Microsoft Research AI Frontiers, focusing on reasoning and agentic models. Until recently, he was part of the Microsoft Responsible AI team, where he worked on the safety alignment and evaluation of LLMs and agents.
Stanford University
Alina is a graduate student in Computational and Mathematical Engineering at Stanford University, researching women's health modeling using wearables. Her previous experience is in structural civil engineering and project management in local government. She is interested in designing AI systems for public good that are measurable, interpretable, and robust to uncertainty.
Collate
Holds a PhD in natural language processing from the University of Melbourne. Her work focuses on evaluation and reliability of machine learning and large language models, including human-centered and uncertainty-aware approaches.
Oracle AI
Miguel is currently an AI Architect (Applied Scientist) at Oracle Cloud Infrastructure (OCI). Previously, he worked as a Principal Applied Scientist at AWS, and has also held positions at IBM, Carnegie Mellon University (CMU), and Universitat Pompeu Fabra (UPF). He has taught undergraduate and graduate courses at both CMU and UPF.
Oracle AI
Graham is Senior Director of Applied Science at Oracle’s OCI AI Science team, leading applied science efforts at the intersection of large-scale ML systems and enterprise cloud AI. Prior to Oracle, he served as Senior Director of Applied Science at Amazon Web Services (AWS).
Co-located with ACM Conference on AI and Agentic Systems (ACM CAIS 2026), held from May 26–29, 2026 in San Jose, California.
Main venue: DoubleTree by Hilton San Jose, 2050 Gateway Place, San Jose, CA 95110.
Publication mode is being finalized with the workshop and ACM CAIS organizing teams.
Submissions concurrently under review elsewhere are allowed; already published papers are not.
Researchers and practitioners in evaluation, benchmarks, RL environments, and deployment.