✨ Topics of Interest
- Agent evaluation methods, particularly interventional and causal or counterfactual techniques.
- RL environments: design principles, software frameworks, synthetic data, tool design.
- Automated graders: LLM-as-a-judge, verifiers, rubrics, reward hacking, human feedback.
- Benchmarks: new benchmarks, analyses of existing ones.
- Enterprise agent case studies: production evaluation and deployment lessons.
- Considerations across the above topics for agents with particular capabilities: code execution, computer use, multimodal I/O, NL2SQL, skills, memory, web search.