CJA AI Assistant Agent Evaluation

As part of the CJA AI Assistant project, evaluating AI Agent responses has been core to collaborating on improvements and ensuring that LLM changes have an overall positive impact on AI Assistant response accuracy. As we’ve tried different techniques to measure accuracy, we have come to prioritize automated methods over manual ones while combining the strengths of each approach to inform the team.

To be efficient, we must limit expensive, time-consuming methods while investing in automation that provides continuous and relatively cheap feedback on changes to the system over time. We must also be able to evaluate LLMs quickly as new versions become available. LLMs, as a core component of the system, represent an entirely new variable in development that is impossible to evaluate with traditional QA practices.
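
One way to picture this kind of continuous, automated feedback: the sketch below scores a fixed prompt set against a candidate model and compares the result to a stored baseline. The case format, scoring rule, and baseline number are illustrative assumptions rather than the project’s actual harness; real scoring would likely use semantic similarity or an LLM-as-judge rubric instead of a substring check.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    prompt: str
    expected: str  # key fact the response must contain to count as correct

def score_response(response: str, expected: str) -> bool:
    # Naive substring check; stands in for a more nuanced scoring method.
    return expected.lower() in response.lower()

def run_eval(cases: List[EvalCase], generate: Callable[[str], str]) -> float:
    """Return the fraction of cases the model under test answers correctly."""
    passed = sum(score_response(generate(case.prompt), case.expected) for case in cases)
    return passed / len(cases)

if __name__ == "__main__":
    # Hypothetical golden prompt set; in practice this would live in a versioned file.
    cases = [
        EvalCase("How many visits did the site get last week?", "visits"),
        EvalCase("Create a filter for mobile users.", "filter"),
    ]

    def candidate_model(prompt: str) -> str:
        # Placeholder: call the candidate LLM / AI Assistant backend here.
        return "Stub response mentioning visits and a mobile filter."

    baseline_accuracy = 0.80  # assumed accuracy of the current production configuration
    accuracy = run_eval(cases, candidate_model)
    print(f"candidate accuracy: {accuracy:.0%} vs. baseline {baseline_accuracy:.0%}")
    if accuracy < baseline_accuracy:
        raise SystemExit("regression: candidate scores below baseline")
```

A harness like this can run on every change to the prompt chain or on each new LLM version, giving the cheap, repeatable signal described above.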

As real-world users interact with the AI Assistant, we must collect example prompts, along with the conditions of the system at the time of use, that can help the team improve the user experience. This requires the ability to collect and annotate user-generated prompts as a team and to correlate accuracy information with automated signals.
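
As a rough illustration, the sketch below defines a hypothetical record that pairs a captured prompt and its system context with team annotations and automated signals; the field names and labels are assumptions for illustration rather than the project’s actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class PromptRecord:
    """One captured interaction, ready for team annotation."""
    prompt: str                            # the user-generated prompt, as submitted
    response: str                          # the AI Assistant's response
    captured_at: datetime                  # when the interaction occurred
    system_context: dict                   # conditions at time of use, e.g. model version, feature flags
    automated_signals: dict = field(default_factory=dict)  # e.g. latency, tool-call errors
    annotator: Optional[str] = None        # team member who reviewed the record
    accuracy_label: Optional[str] = None   # e.g. "correct", "partially correct", "incorrect"
    notes: Optional[str] = None            # free-form reviewer comments

# Example: capture an interaction, then annotate it and correlate with automated signals.
record = PromptRecord(
    prompt="Break down revenue by marketing channel for last month.",
    response="Here is the revenue breakdown by channel...",
    captured_at=datetime(2024, 5, 1, 14, 30),
    system_context={"model_version": "v2", "retrieval_enabled": True},
    automated_signals={"latency_ms": 2400, "tool_errors": 0},
)
record.annotator = "reviewer@example.com"
record.accuracy_label = "correct"
```

Keeping the human accuracy label and the automated signals on the same record is what makes it possible to check how well the automated measures track human judgment.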

All of these techniques must be combined in a way that informs development, research, and leadership about the accuracy of the overall user experience we are delivering.