
I use lots and lots of domain-specific test cases at several layers, numbering in the hundreds or thousands. The score is the number of test cases that pass, so it requires a different approach than all-or-nothing tests. The layers depend on your RAG “architecture”, but I test the RAG query generation and scoring (comparing ordered lists is the simplest, but I also include a lot of fuzzy comparisons), the LLM scoring the relevance of retrieved snippets before feeding them into the final answering prompt, and the final answer. The most annoying part is the prompt to score the final answer, since it tends to come out looking like a College Board AP test scoring rubric.
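
A rough sketch of what the retrieval-layer scoring can look like (the names, the fuzzy threshold, and the rank discount are all illustrative, not my actual code): each expected section found near the top of the retrieved list earns partial credit, and a loose string comparison handles citations that don't match exactly.

    from difflib import SequenceMatcher

    def fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
        # Loose comparison for citations that differ only in formatting.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    def retrieval_score(expected: list[str], retrieved: list[str], k: int = 20) -> float:
        # Partial credit in [0, 1]: each expected section found in the top k
        # earns rank-discounted credit (earlier positions are worth more).
        topk = retrieved[:k]
        if not expected:
            return 1.0
        credit = 0.0
        for section in expected:
            for rank, hit in enumerate(topk):
                if fuzzy_match(section, hit):
                    credit += 1.0 / (1.0 + rank / k)
                    break
        return credit / len(expected)

The point of the partial credit is that the suite reports "712 of 940 points" rather than a wall of red Xs, which makes regressions in the retrieval layer visible even when the final answers still look fine.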

This requires a lot of domain-specific work. For example, two of my test cases are “Is it [il]legal to build an atomic bomb?” run against the entire US Code [1], so I have a list of sections that are relevant to the question and that I’ve scored, before eventually getting an answer of “it is illegal”, followed by several prompts that evaluate nuance in the answer (“it’s illegal except for…”). I have hundreds of these test cases, approaching a thousand. It’s a slog.

[1] 42 U.S.C. 2122 is one of the “right” sections, in case anyone is wondering. Another step tests whether 2121 is pulled in based on the mention in 2122.
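
For concreteness, here's roughly how one of those cases could be represented. The section numbers are the ones from the footnote; the field names, the judge prompt, and `ask_llm` (any callable that sends a prompt to a model and returns its text reply) are hypothetical.

    ATOMIC_BOMB_CASE = {
        "question": "Is it legal to build an atomic bomb?",
        "corpus": "uscode",
        # Sections the retrieval layer should surface, roughly in order of relevance.
        "expected_sections": ["42 U.S.C. 2122", "42 U.S.C. 2121"],
        # Nuance checks run against the final answer, each graded pass/fail by a judge prompt.
        "answer_checks": [
            "States that building an atomic weapon is prohibited.",
            "Notes the exception for activity authorized under 42 U.S.C. 2121.",
        ],
    }

    JUDGE_PROMPT = """You are grading an answer about U.S. law.
    Criterion: {criterion}
    Answer: {answer}
    Reply with PASS if the answer satisfies the criterion, otherwise FAIL."""

    def score_answer(answer: str, case: dict, ask_llm) -> float:
        # Fraction of nuance checks the final answer passes.
        passed = sum(
            ask_llm(JUDGE_PROMPT.format(criterion=c, answer=answer))
                .strip().upper().startswith("PASS")
            for c in case["answer_checks"]
        )
        return passed / len(case["answer_checks"])

Each criterion is a single sentence on purpose: the more you pack into one check, the more the judge prompt drifts back toward that AP-rubric shape.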


