ACM CAIS 2026 Workshop
We invite submissions of 4-page short papers on methods, RL environment design principles, benchmarks, and real-world case studies for evaluating AI agents. Work-in-progress research is welcome.
Agentic systems increasingly rely on autonomous reasoning, tool use, and multi-step planning. Traditional LLM and RL benchmarks rarely capture high-value professional workflows, especially under realistic constraints.
This workshop asks: what makes a reinforcement learning environment actually useful for training and evaluating AI agents that deliver measurable economic impact after deployment?
Topics of interest include:
Interventional evaluations, causal frameworks, counterfactual test construction.
Verifiers, rubric systems, LLM-as-a-judge reliability and calibration studies (a toy verifier sketch follows this list).
New agent benchmarks, failure analyses, benchmark contamination and drift.
Environment design, tool interfaces, synthetic trajectories, software frameworks (a minimal environment sketch also follows this list).
Applied case studies connecting evaluation design to production outcomes.
Code execution, NL2SQL and structured queries, computer use, multimodal I/O, MCP tools, skills, memory, and web search.
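As a concrete, deliberately toy illustration of the verifier and rubric topic above, the Python sketch below scores an agent's NL2SQL-style result against a reference result set. The RubricItem and Verdict classes, the criteria, and the weights are hypothetical choices made for this example, not artifacts of the workshop.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    criterion: str        # what is being checked, e.g. "row set matches reference"
    weight: float         # contribution to the final score
    passed: bool = False  # set by a programmatic check or an LLM judge


@dataclass
class Verdict:
    items: list

    @property
    def score(self) -> float:
        # Weighted fraction of rubric items that passed.
        total = sum(i.weight for i in self.items)
        return sum(i.weight for i in self.items if i.passed) / total if total else 0.0


def verify_sql_answer(agent_rows, reference_rows) -> Verdict:
    """Deterministic verifier: does the agent's query result match the
    reference result set, ignoring row order? A fuller verifier would
    also inspect the SQL itself or consult a calibrated LLM judge."""
    return Verdict([
        RubricItem("row multiset matches reference", weight=0.8,
                   passed=sorted(agent_rows) == sorted(reference_rows)),
        RubricItem("result is non-empty", weight=0.2,
                   passed=bool(agent_rows)),
    ])


print(verify_sql_answer([("acme", 42)], [("acme", 42)]).score)  # -> 1.0
```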
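Likewise, as a sketch of the environment design and tool interface topic, here is a toy tool-using environment loosely following the Gymnasium reset()/step() convention. The single calculator tool, the fixed task, and the reward logic are invented for illustration only.

```python
class CalculatorEnv:
    """Episode: the agent must report the value of a fixed arithmetic
    expression. Actions are (tool_name, argument) pairs; tool calls are
    free, and the episode ends when the agent submits an answer."""

    def reset(self):
        # New episode: fixed task here; a real environment would sample one.
        self.expression = "17 * 3"
        self.target = 51
        return {"task": f"What is {self.expression}? Use the calculator, then answer."}

    def step(self, action):
        tool, arg = action
        if tool == "calculator":
            # Tool call: return an observation, zero reward, episode continues.
            result = eval(arg, {"__builtins__": {}})  # toy only; never eval untrusted input
            return {"tool_result": result}, 0.0, False
        if tool == "answer":
            # Terminal action: binary reward from a programmatic check.
            return {}, float(arg == self.target), True
        return {"error": f"unknown tool {tool!r}"}, 0.0, False


env = CalculatorEnv()
print(env.reset()["task"])
obs, reward, done = env.step(("calculator", "17 * 3"))
obs, reward, done = env.step(("answer", obs["tool_result"]))
print(reward, done)  # -> 1.0 True
```

A binary, programmatically verifiable reward at episode end is one common pattern the verifier topic above targets; richer environments replace the calculator with real tool interfaces such as SQL engines, browsers, or MCP servers.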
Call for Papers - RLEval Workshop @ ACM CAIS 2026
RLEval: Methods and Reinforcement Learning Environments for Evaluating AI Agents
Date: May 26, 2026 · San Jose, CA
We are organizing a workshop at the ACM Conference on AI and Agentic Systems (ACM CAIS), and we would love your submission. We invite submissions of 4-page short papers on new methods, RL environment design principles, benchmarks, and real-world case studies for evaluating AI agents. This workshop brings together researchers and practitioners working on rigorous evaluation for agentic systems, with particular interest in reinforcement learning environments and techniques that enable systematic measurement of agent behavior. Work-in-progress submissions are strongly encouraged.
Submit through the official OpenReview workshop portal.
Workshop: rl-eval.github.io
Conference: caisconf.org
Questions: rl-eval@googlegroups.com
Industry Keynote
Large-scale agent evaluation and reliability under production constraints.
Research Keynote
Frontiers in RL environments for tool-using language agents.
Panel Moderator
Cross-sector perspective on benchmark quality and economic outcomes.
Handshake
Leads a group advancing Data and Evaluations for frontier AI. Previously at Cleanlab, where he pioneered methods to detect and remediate issues in enterprise AI agents. Earlier, he was a Senior Scientist at AWS building AutoML and deep learning services. He completed his PhD at MIT CSAIL and undergraduate studies at UC Berkeley.
Handshake
Leads research on RL environments and post-training. Previously co-founder and CTO at Cleanlab (acquired by Handshake). He completed his PhD at MIT CSAIL in the PDOS group, advised by Frans Kaashoek and Nickolai Zeldovich.
Oracle AI
Senior Principal Scientist at Oracle Cloud, leading research initiatives in Generative AI, NLP, and AI robustness for enterprise-scale multilingual and multimodal systems. His work emphasizes production-grade agentic frameworks, low-resource language modeling, evaluation, and trustworthy AI deployment.
Oracle AI
Works on GenAI evaluations across multimodal, text, and code generation. Also a 2026 Ethics and Technology Practitioner Fellow at Stanford. Previously worked on AI safety in Microsoft's Responsible AI research team.
University of Washington, Google DeepMind
Focuses on social reinforcement learning in multi-agent and human-AI interactions. During her PhD at MIT, she developed foundational RLHF techniques for language models. Her work has received awards at NeurIPS and ICML and has been widely featured in research and media venues.
humans&
Previously worked on RL and post-training at Anthropic across Claude 3.5 to 4.5, and completed her PhD at MIT CSAIL advised by Jacob Andreas and Julie Shah. Her work centers on agents that learn representations from rich human knowledge.
Boson AI
Leads reinforcement learning and modeling efforts for robust agentic systems across changing tasks and environments. Previously a researcher and tech lead at AWS, and a founding member of post-training efforts for Amazon’s early RLHF models.
Microsoft Research
Ahmed is a Senior Researcher at Microsoft Research AI Frontiers, focusing on reasoning and agentic models. Until recently, he was part of the Microsoft Responsible AI team, where he worked on the safety alignment and evaluation of LLMs and agents.
Stanford University
Alina is a graduate student in Computational and Mathematical Engineering at Stanford University, researching women's health modeling using wearables. Her previous experience is in structural civil engineering and project management in local government. She is interested in designing AI systems for public good that are measurable, interpretable, and robust to uncertainty.
Collate
Holds a PhD in natural language processing from the University of Melbourne. Her work focuses on evaluation and reliability of machine learning and large language models, including human-centered and uncertainty-aware approaches.
Oracle AI
Miguel is currently an AI Architect (Applied Scientist) at Oracle Cloud Infrastructure (OCI). Previously, he worked as a Principal Applied Scientist at AWS, and has also held positions at IBM, Carnegie Mellon University (CMU), and Universitat Pompeu Fabra (UPF). He has taught undergraduate and graduate courses at both CMU and UPF.
Oracle AI
Graham is Senior Director of Applied Science at Oracle’s OCI AI Science team, leading applied science efforts at the intersection of large-scale ML systems and enterprise cloud AI. Prior to Oracle, he served as Senior Director of Applied Science at Amazon Web Services (AWS).
Co-located with ACM Conference on AI and Agentic Systems (ACM CAIS 2026), held from May 26–29, 2026 in San Jose, California.
Main venue: DoubleTree by Hilton San Jose, 2050 Gateway Place, San Jose, CA 95110.
Publication mode is being finalized with the workshop and ACM CAIS organizing teams.
Submissions concurrently under review elsewhere are allowed; already published papers are not.
Researchers and practitioners in evaluation, benchmarks, RL environments, and deployment.