Steady the Course: Navigating the Evaluation of LLM-based Applications | by Stijn Goossens | Nov, 2023



Why evaluating LLM apps matters and how to get started

Stijn Goossens

Towards Data Science

A pirate with a hurt knee asks his LLM-based first aid assistant for advice. Image generated by the author with DALL·E 3.

Large Language Models (LLMs) are all the rage, and many teams are incorporating them into their applications. Chatbots that answer questions over relational databases, assistants that help programmers write code more efficiently, and copilots that take actions on your behalf are some examples. The powerful capabilities of LLMs allow you to start projects with rapid initial success. Nevertheless, as you transition from a prototype towards a mature LLM app, a robust evaluation framework becomes essential. Such an evaluation framework helps your LLM app reach optimal performance and ensures consistent and reliable results. In this blog post, we will cover:

  1. The difference between evaluating an LLM vs. an LLM-based application
  2. The importance of LLM app evaluation
  3. The challenges of LLM app evaluation
  4. Getting started
    a. Collecting data and building a test set
    b. Measuring performance
  5. The LLM app evaluation framework

Using the fictional example of FirstAidMatey, a first-aid assistant for pirates, we will navigate through the seas of evaluation techniques, challenges, and strategies. We’ll wrap up with key takeaways and insights. So, let’s set sail on this enlightening journey!

The evaluation of individual Large Language Models (LLMs) like OpenAI’s GPT-4, Google’s PaLM 2 and Anthropic’s Claude is typically done with benchmark tests like MMLU. In this blog post, however, we’re interested in evaluating LLM-based applications. These are applications that are powered by an LLM and contain other components like an orchestration framework that manages a sequence of LLM calls. Often Retrieval Augmented Generation (RAG) is used to provide context to the LLM and avoid hallucinations. In short, RAG requires the context documents to be embedded into a vector store from which the relevant snippets can be retrieved and shared with the LLM. In contrast to an LLM, an LLM-based application (or LLM app) is built to execute one or more specific tasks really well. Finding the right setup often involves some experimentation and iterative improvement. RAG, for example, can be implemented in many different ways. An evaluation framework as discussed in this blog post can help you find the best setup for your use case.
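To make the RAG idea concrete, here is a minimal retrieval sketch. It stands in for a real vector store with a toy bag-of-words "embedding" and cosine similarity; the function names (`embed`, `retrieve`) and the sample documents are illustrative assumptions, and a production setup would use a learned embedding model and a dedicated vector database instead.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real RAG setup would
    # use a learned embedding model and a vector store instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    # Rank the context documents by similarity to the question and return
    # the top-k snippets to include in the LLM prompt.
    ranked = sorted(documents, key=lambda d: cosine(embed(question), embed(d)),
                    reverse=True)
    return ranked[:k]

docs = [
    "Treat a swollen hand by resting it and applying a cold compress.",
    "A scurvy diet should include citrus fruits like limes and oranges.",
]
print(retrieve("My hand is swollen, what should I do?", docs))
# → ['Treat a swollen hand by resting it and applying a cold compress.']
```

Swapping in a different retriever (different embedding model, chunk size, or number of snippets) is exactly the kind of setup experiment an evaluation framework lets you compare objectively.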

An LLM becomes even more powerful when being used in the context of an LLM-based application.

FirstAidMatey is an LLM-based application that helps pirates with questions like “Me hand got caught in the ropes and it’s now swollen, what should I do, mate?”. In its simplest form, the Orchestrator consists of a single prompt that feeds the user question to the LLM and asks it to provide helpful answers. It can also instruct the LLM to answer in Pirate Lingo for optimal understanding. As an extension, a vector store with embedded first aid documentation could be added. Based on the user question, the relevant documentation can be retrieved and included in the prompt, so that the LLM can provide more accurate answers.
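In its simplest form, that orchestrator could look like the sketch below. The prompt template and function names are hypothetical, and the LLM call is stubbed with a fake callable so the snippet runs without an API key; in practice `llm` would wrap a hosted model API.

```python
FIRST_AID_PROMPT = """You are FirstAidMatey, a first-aid assistant for pirates.
Answer in Pirate Lingo. Use the context below if it is relevant.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, context_snippets: list[str]) -> str:
    # The orchestrator in its simplest form: fill one prompt template with
    # the user question and any retrieved documentation snippets.
    context = "\n".join(context_snippets) if context_snippets else "(none)"
    return FIRST_AID_PROMPT.format(context=context, question=question)

def answer(question: str, context_snippets: list[str], llm) -> str:
    # `llm` is any callable mapping a prompt string to a completion.
    return llm(build_prompt(question, context_snippets))

# Stub LLM so the sketch runs end to end without a model behind it.
fake_llm = lambda prompt: "Arr, rest yer hand and cool it, matey!"
print(answer("Me hand got caught in the ropes and it's swollen!", [], fake_llm))
# → Arr, rest yer hand and cool it, matey!
```

Keeping the orchestration behind a small interface like `answer` makes it easy to swap prompts, models, or retrievers and re-run the same evaluation against each variant.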

Before we get into the how, let’s look at why you should set up a system to evaluate your LLM-based application. The main goals are threefold:

  • Consistency: Ensure stable and reliable LLM app outputs across all scenarios and discover regressions when they occur. For example, when you improve your LLM app performance on a specific scenario, you want to be warned in case you compromise the performance on another scenario. When using proprietary models like OpenAI’s GPT-4, you are also subject to their update schedule. As new versions get released, your current version might be deprecated over time. Research shows that switching to a newer GPT version isn’t always for the better. Thus, it’s important to be able to assess how this new version affects the performance of your LLM app.
  • Insights: Understand where the LLM app performs well and where there is room for improvement.
  • Benchmarking: Establish performance standards for the LLM app, measure the effect of experiments and release new versions confidently.

As a result, you will achieve the following outcomes:

  • Gain user trust and satisfaction because your LLM app will perform consistently.
  • Increase stakeholder confidence because you can show how well the LLM app is performing and how new versions improve upon older ones.
  • Boost your competitive advantage as you can quickly iterate, make improvements and confidently deploy new versions.

Having read the above benefits, it’s clear why adopting an evaluation framework for your LLM-based application can be advantageous. But before we can do so, we must solve the following two main challenges:

  • Lack of labelled data: Unlike traditional machine learning applications, LLM-based ones don’t need labelled data to get started. LLMs can do many tasks (like text classification, summarization, generation and more) out of the box, without having to show specific examples. This is great because we don’t have to wait for data and labels, but on the other hand, it also means we don’t have data to check how well the application is performing.
  • Multiple valid answers: In an LLM app, the same input can often have more than one right answer. For instance, a chatbot might provide various responses with similar meanings, or code might be generated with identical functionality but different structures.
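Because exact string comparison fails when several phrasings are valid, scoring has to tolerate rephrasings. A simple illustration is a token-overlap F1 score, sketched below; this is my own illustrative choice, and real evaluations often use embedding similarity or an LLM-as-judge instead.

```python
import re

def token_f1(generated: str, target: str) -> float:
    # Overlap-based score that tolerates rephrasings better than exact
    # string equality. Precision/recall are computed over unique tokens.
    gen = set(re.findall(r"[a-z']+", generated.lower()))
    tgt = set(re.findall(r"[a-z']+", target.lower()))
    common = len(gen & tgt)
    if common == 0:
        return 0.0
    precision, recall = common / len(gen), common / len(tgt)
    return 2 * precision * recall / (precision + recall)

# Two answers with the same meaning but different word order score highly,
# whereas exact-match comparison would give 0.
print(token_f1("Rest the hand and apply a cold compress.",
               "Apply a cold compress and rest your hand."))
# → 0.875
```

Whatever metric you pick, the key point is that it should reward semantic agreement with the target rather than character-level identity.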

To address these challenges, we must define the appropriate data and metrics. We’ll do that in the next section.

Collecting data and building a test set

For evaluating an LLM-based application, we use a test set consisting of test cases, each with specific inputs and targets. What these contain depends on the application’s purpose. For example, a code generation application expects natural-language instructions as input and outputs code in return. During evaluation, the inputs will be provided to the LLM app and the generated output can be compared to the reference target. Here are a few test cases for FirstAidMatey:
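A few such test cases might look like the following sketch. The field names, questions, and target answers are hypothetical examples, and the trivial `echo_app` stands in for the real LLM app purely to show the evaluation loop.

```python
# Hypothetical test cases for FirstAidMatey; the contents are illustrative.
test_set = [
    {
        "input": "Me hand got caught in the ropes and it's now swollen, "
                 "what should I do, mate?",
        "target": "Rest the hand, apply a cold compress and keep it "
                  "elevated, matey.",
    },
    {
        "input": "What be the signs of scurvy, mate?",
        "target": "Watch fer fatigue, bleedin' gums and joint pain, and eat "
                  "yer citrus, matey.",
    },
]

def run_evaluation(app, test_set, score):
    # Feed each test input to the LLM app and score its output against
    # the reference target.
    return [score(app(case["input"]), case["target"]) for case in test_set]

# Demo with a stub app and an exact-match score.
echo_app = lambda question: ("Rest the hand, apply a cold compress and "
                             "keep it elevated, matey.")
exact = lambda out, target: float(out == target)
print(run_evaluation(echo_app, test_set, exact))  # → [1.0, 0.0]
```

In practice `score` would be one of the softer metrics discussed above, and the per-case results would be aggregated into a single benchmark number per app version.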


