ACM CAIS 2026 Workshop

RLEval: Methods and Reinforcement Learning Environments for Evaluating AI Agents

We invite submissions of 4-page short papers on methods, RL environment design principles, benchmarks, and real-world case studies for evaluating AI agents. Work-in-progress research is welcome.

Interventional Evaluation
Causal / Counterfactual Methods
Automated Graders and Verifiers
Benchmarks for Agentic Systems
Synthetic Data + RLE Design
Enterprise Deployment Case Studies

Workshop Focus

Agentic systems increasingly rely on autonomous reasoning, tool use, and multi-step planning. Traditional LLM and RL benchmarks rarely capture high-value professional workflows, especially under realistic constraints.

This workshop asks: what makes a reinforcement learning environment actually useful for training and evaluating AI agents that deliver measurable economic impact after deployment?

RLE Visual Landscape

Illustrations: a reinforcement learning loop, an evaluation matrix for agent benchmarking, and the components of an agentic system.

Core Tracks

Evaluation Methods

Interventional evaluations, causal frameworks, counterfactual test construction.

Automated Grading

Verifiers, rubric systems, LLM-as-a-judge reliability and calibration studies.
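
To make this track concrete, here is a minimal sketch of a programmatic rubric grader of the kind submissions might study; the `RubricItem` structure and the example checks are hypothetical illustrations, not a prescribed interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    """One rubric criterion: a named, weighted programmatic check."""
    name: str
    check: Callable[[str], bool]  # deterministic verifier over the agent's output
    weight: float = 1.0

def grade(output: str, rubric: list[RubricItem]) -> float:
    """Weighted fraction of rubric criteria the output satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check(output))
    return earned / total

# Hypothetical rubric for a SQL-generation task.
sql_rubric = [
    RubricItem("is_select", lambda out: out.lstrip().lower().startswith("select")),
    RubricItem("queries_orders_table", lambda out: "orders" in out.lower()),
]
print(grade("SELECT count(*) FROM orders;", sql_rubric))  # -> 1.0
```

An LLM-as-judge grader would replace the deterministic checks with model calls, which is where the reliability and calibration questions in this track arise.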

Data and Benchmarks

New agent benchmarks, failure analyses, benchmark contamination and drift.

RLE Engineering

Environment design, tool interfaces, synthetic trajectories, software frameworks.
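
As an illustration of the scope, below is a minimal sketch of an episodic tool-use environment with a verifier-based terminal reward; the `ToolUseEnv` interface (a loosely Gymnasium-style reset/step loop) is a hypothetical example, not a required format:

```python
class ToolUseEnv:
    """Hypothetical minimal environment for a tool-using agent.

    Actions are (tool_name, argument) pairs; the environment executes
    the tool and returns (observation, reward, done). Reward comes from
    a programmatic verifier applied to the submitted answer.
    """

    def __init__(self, task: str, answer: str, max_steps: int = 8):
        self.task, self.answer, self.max_steps = task, answer, max_steps
        self.steps = 0

    def reset(self) -> str:
        self.steps = 0
        return self.task  # initial observation: the task description

    def step(self, action: tuple[str, str]) -> tuple[str, float, bool]:
        self.steps += 1
        tool, arg = action
        if tool == "submit":  # terminal action: verify the answer
            return "done", 1.0 if arg.strip() == self.answer else 0.0, True
        obs = f"result of {tool}({arg})"  # stub; a real env would run the tool
        return obs, 0.0, self.steps >= self.max_steps

env = ToolUseEnv(task="What is 2 + 2?", answer="4")
obs = env.reset()
obs, reward, done = env.step(("submit", "4"))  # reward = 1.0
```

Work in this track might ask how such interfaces scale to realistic tool suites, how synthetic trajectories are generated within them, and what makes their rewards economically meaningful.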

Enterprise Evidence

Applied case studies connecting evaluation design to production outcomes.

Agent Capability Evaluation

Code execution, NL2SQL and structured queries, computer use, multimodal I/O, MCP tools, skills, memory, and web search.

Call for Papers

Call for Papers - RLEval Workshop @ ACM CAIS 2026
RLEval: Methods and Reinforcement Learning Environments for Evaluating AI Agents
📅 May 26, 2026 · San Jose, CA

We are organizing a workshop at the ACM Conference on AI and Agentic Systems (ACM CAIS), and we would love your submission. We invite submissions of 4-page short papers on new methods, RL environment design principles, benchmarks, and real-world case studies for evaluating AI agents. This workshop brings together researchers and practitioners working on rigorous evaluation for agentic systems, with particular interest in reinforcement learning environments and techniques that enable systematic measurement of agent behavior. Work-in-progress submissions are strongly encouraged.

✨ Topics of Interest

  • Evaluation methods: interventional, causal, and counterfactual approaches; verifiers, rubrics, and LLM-as-judge
  • Data and benchmarks: new benchmarks, analyses of existing benchmarks, design validity
  • RL environments: synthetic data, tool design, software frameworks, effective environment principles
  • Enterprise case studies: production evaluation and deployment lessons
  • Agent capability evaluation: code execution, NL2SQL, computer use, multimodal I/O, MCP tools, skills, memory, web search

📝 Submission Details

  • Paper length: 4 pages main text; additional pages allowed for references and appendix
  • Formatting: ACM acmart/sigconf template (template link)
  • Review process: Single-blind (no anonymization required)
  • Visibility: Reviews and paper decisions will not be made public
  • Workshop format: interactive poster session + selected contributed talks + best paper/poster award
  • Policy: Under-review papers elsewhere are allowed; already-published papers are not
  • At least one author of each accepted paper must register and attend

Important Dates

  • Submission deadline (AoE): May 11, 2026
  • Accept/reject notification: May 18, 2026
  • Camera-ready deadline: May 22, 2026
  • Workshop day at ACM CAIS 2026: May 26, 2026

Agenda (Draft)

Morning

  • Opening remarks and workshop scope
  • Keynote 1: Agent evaluation in production
  • Paper session I: Evaluation and verifier design
  • Lightning talks and discussion
  • Lunch break

Afternoon

  • Keynote 2: Building economically valid RLEs
  • Paper session II: Data and environment engineering
  • Panel: Measuring practical deployment value
  • Open roadmap and closing discussion

Keynotes and Panel

Industry Keynote

TBA

Large-scale agent evaluation and reliability under production constraints.

Research Keynote

TBA

Frontiers in RL environments for tool-using language agents.

Panel Moderator

TBA

Cross-sector perspective on benchmark quality and economic outcomes.

Organizers

Jonas Mueller

Handshake

Director of AI Research

Leads a group advancing Data and Evaluations for frontier AI. Previously at Cleanlab, where he pioneered methods to detect and remediate issues in enterprise AI agents. Earlier, he was a Senior Scientist at AWS building AutoML and deep learning services. He completed his PhD at MIT CSAIL and undergraduate studies at UC Berkeley.

Anish Athalye

Handshake

Director of AI Research

Leads research on RL environments and post-training. Previously co-founder and CTO at Cleanlab (acquired by Handshake). He completed his PhD at MIT CSAIL in the PDOS group, advised by Frans Kaashoek and Nickolai Zeldovich.

Priyaranjan Pattnayak

Oracle AI

Senior Principal Scientist

Leads research initiatives at Oracle Cloud in Generative AI, NLP, and AI robustness for enterprise-scale multilingual and multimodal systems. His work emphasizes production-grade agentic frameworks, low-resource language modeling, evaluation, and trustworthy AI deployment.

Aziza Mirsaidova

Oracle AI

Applied Scientist

Works on GenAI evaluations across multimodal, text, and code generation. Also a 2026 Ethics and Technology Practitioner Fellow at Stanford. Previously worked on AI safety in Microsoft's Responsible AI research organization.

Natasha Jaques

University of Washington, Google DeepMind

Assistant Professor and Staff Research Scientist

Focuses on social reinforcement learning in multi-agent and human-AI interactions. During her PhD at MIT, she developed foundational RLHF techniques for language models. Her work has received awards at NeurIPS and ICML and has been widely covered in research venues and the media.

Andi Peng

humans&

Co-Founder

Previously worked on RL and post-training at Anthropic across Claude 3.5 to 4.5, and completed her PhD at MIT CSAIL advised by Jacob Andreas and Julie Shah. Her work centers on agents that learn representations from rich human knowledge.

Rasool Fakoor

Boson AI

Member of Technical Staff

Leads reinforcement learning and modeling efforts for robust agentic systems across changing tasks and environments. Previously a researcher and tech lead at AWS, and a founding member of post-training efforts for Amazon’s early RLHF models.

Ahmed Elgohary

Microsoft Research

Senior Research Scientist

Ahmed is a Senior Researcher at Microsoft Research AI Frontiers, focusing on reasoning and agentic models. Until recently, he was part of the Microsoft Responsible AI team, where he worked on safety alignment and evaluation of LLMs and agents.

Alina Gavrilov

Stanford University

Graduate Student

Alina is a graduate student in Computational and Mathematical Engineering at Stanford University, researching women's health modeling using wearables. Her previous experience is in structural civil engineering and project management in local government. She is interested in designing AI systems for public good that are measurable, interpretable, and robust to uncertainty.

Aparna Elangovan

Collate

Head of AI

Holds a PhD in natural language processing from the University of Melbourne. Her work focuses on evaluation and reliability of machine learning and large language models, including human-centered and uncertainty-aware approaches.

Miguel Ballesteros

Oracle AI

Organizer

Miguel is currently an AI Architect (Applied Scientist) at Oracle Cloud Infrastructure (OCI). Previously, he worked as a Principal Applied Scientist at AWS, and has also held positions at IBM, Carnegie Mellon University (CMU), and Universitat Pompeu Fabra (UPF). He has taught undergraduate and graduate courses at both CMU and UPF.

Graham Horwood

Oracle AI

Organizer

Graham is Senior Director of Applied Science on Oracle’s OCI AI Science team, leading applied science efforts at the intersection of large-scale ML systems and enterprise cloud AI. Prior to Oracle, he was Senior Director of Applied Science at Amazon Web Services (AWS).

Venue and Conference Context

Co-located with the ACM Conference on AI and Agentic Systems (ACM CAIS 2026), held May 26–29, 2026 in San Jose, California.

Main venue: DoubleTree by Hilton San Jose, 2050 Gateway Place, San Jose, CA 95110.

FAQ

Will accepted papers be archival?

The publication mode is being finalized with the workshop and ACM CAIS organizing teams.

Can previously published work be submitted?

Submissions under review elsewhere are allowed, but already published papers are not.

Who should attend?

Researchers and practitioners in evaluation, benchmarks, RL environments, and deployment.