<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LLM-as-a-Judge on Horse with a Pointy Hat</title><link>https://www.horsewithapointyhat.com/tags/llm-as-a-judge/</link><description>Recent content in LLM-as-a-Judge on Horse with a Pointy Hat</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 05 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://www.horsewithapointyhat.com/tags/llm-as-a-judge/index.xml" rel="self" type="application/rss+xml"/><item><title>It's LLMs All the Way Down: A Practical Guide to GenAI Evals</title><link>https://www.horsewithapointyhat.com/posts/its-llms-all-the-way-down/</link><pubDate>Fri, 05 Jun 2026 00:00:00 +0000</pubDate><guid>https://www.horsewithapointyhat.com/posts/its-llms-all-the-way-down/</guid><description>&lt;p&gt;&lt;strong&gt;This is a repost of a blog post I wrote for the &lt;a href="https://technology.complyadvantage.com/its-llms-all-the-way-down-a-practical-guide-to-genai-evals/"&gt;ComplyAdvantage Tech Blog&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Generative AI (GenAI) systems and Large Language Models (LLMs) are empowering us to tackle new types of problems and enabling the implementation of smart, autonomous (or semi-autonomous) systems. However, the history of responsible machine learning and data science is rooted in the need to quantify and monitor the performance of the models we use.&lt;/p&gt;
&lt;p&gt;In the brave new world of GenAI, new challenges arise from more complex models that are inherently non-deterministic and whose evaluation is much more nuanced, given the nature of their outputs. In this scenario, we cannot rely purely on classic numerical metrics such as Recall and Precision, which are often derived from exact string matching. GenAI solutions can range from simple single-shot calls to an LLM to complex Agentic AI workflows that incorporate Retrieval-Augmented Generation (RAG), deterministic tools, and sub-agents; each component needs its own form of evaluation in addition to an end-to-end performance measurement.&lt;/p&gt;</description></item></channel></rss>