Prompt Experiments
Prompt Experiments lets you test a prompt version from Prompt Management against a Dataset of inputs and expected outputs. This way, you can verify that a change yields the expected outputs and does not cause regressions, and you can analyze the results of different prompt experiments side-by-side.
Optionally, you can use LLM-as-a-Judge Evaluators to automatically score the responses against the expected outputs and analyze the results on an aggregate level.
This is a no-code feature within Langfuse. You can run more complex experiments via the Langfuse SDKs/API. Follow this guide to get started.
Key benefits
- Feedback loop: Quickly iterate on prompts by running experiments and directly comparing evaluation results side-by-side.
- Regression prevention: When changing a prompt, run an experiment to confirm that the new version does not degrade outputs on your test cases.
Overview
Requirements
For prompt experiments to work correctly, you must ensure:
- Your prompt contains at least one variable using the {{variableName}} syntax
- Variable names in your prompt must exactly match the keys in your dataset items' input
- Dataset items must have their input formatted as valid JSON
Variable Mapping Example
The following example demonstrates how prompt variables are mapped to dataset item inputs:
Prompt:
You are a Langfuse expert. Answer based on:
{{documentation}}
Question: {{question}}
Dataset Item:
{
  "documentation": "Langfuse is an LLM Engineering Platform",
  "question": "What is Langfuse?"
}
In this example:
- The prompt variable {{documentation}} maps to the JSON key "documentation"
- The prompt variable {{question}} maps to the JSON key "question"
- Both keys must exist in the dataset item's input JSON for the experiment to run successfully
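To make the mapping concrete, the following sketch shows how a dataset item's input JSON fills the prompt variables. The substitute_variables helper is a hypothetical illustration of the {{variableName}} substitution that Langfuse performs internally; it is not part of the Langfuse SDK.

import re

def substitute_variables(prompt_template: str, item_input: dict) -> str:
    # Replace each {{variableName}} with the matching key from the dataset item input
    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key not in item_input:
            raise KeyError(f"Dataset item input is missing key: {key}")
        return str(item_input[key])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, prompt_template)

prompt_template = "You are a Langfuse expert. Answer based on:\n{{documentation}}\nQuestion: {{question}}"
item_input = {
    "documentation": "Langfuse is an LLM Engineering Platform",
    "question": "What is Langfuse?",
}
print(substitute_variables(prompt_template, item_input))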
Setup
If you already have a dataset and a prompt, you can skip the following steps.
In Prompt Experiments, the items of a dataset are mapped to the variables of the prompt. In the following example, the variables (documentation and question) are mapped to the input of the dataset item, which is a JSON object. The expected output contains a reference answer for the given dataset item.
Configure LLM connection
Prompt Experiments executes LLM calls within Langfuse, so you need to configure an LLM connection in the project settings.
Supported LLM providers
- OpenAI or OpenAI-compatible providers (e.g. LiteLLM, Google Vertex AI)
- Anthropic
- Azure OpenAI
- AWS Bedrock
Create a dataset
Create a dataset with the inputs and expected outputs that you want to test your prompt on.
langfuse.create_dataset(
    name="<dataset_name>",
    # optional description
    description="My first dataset",
    # optional metadata
    metadata={
        "author": "Alice",
        "date": "2022-01-01",
        "type": "benchmark"
    }
)
See low-level SDK docs for details on how to initialize the Python client.
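As a minimal sketch, assuming the langfuse Python package is installed and the standard API key environment variables are set, initializing the client looks like this:

from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the environment
langfuse = Langfuse()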
Create dataset items with test cases
Dataset items include the input variables that should be inserted into the prompt.
The input must be a JSON object where each key exactly matches a variable name in your prompt. For example, if your prompt contains {{question}}, your dataset item's input JSON must have a "question" key.
Example Dataset Item with variables
Input:
{
  "question": "What is Langfuse?",
  "documentation": "Langfuse - the LLM Engineering Platform"
}
Expected output:
Langfuse is the LLM Engineering Platform.
langfuse.create_dataset_item(
    dataset_name="<dataset_name>",
    # input must be a JSON object whose keys match the prompt variables
    input={
        "question": "What is Langfuse?",
        "documentation": "Langfuse - the LLM Engineering Platform"
    },
    # optional reference answer used for evaluation
    expected_output="Langfuse is the LLM Engineering Platform.",
    # metadata, optional
    metadata={
        "model": "llama3",
    }
)
See low-level SDK docs for details on how to initialize the Python client.
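If you keep your test cases in a list, you can loop over them to create the dataset items. A sketch, assuming an initialized langfuse client and an existing dataset named "<dataset_name>":

test_cases = [
    {
        "input": {
            "question": "What is Langfuse?",
            "documentation": "Langfuse - the LLM Engineering Platform",
        },
        "expected_output": "Langfuse is the LLM Engineering Platform.",
    },
    # add further test cases here
]

for case in test_cases:
    langfuse.create_dataset_item(
        dataset_name="<dataset_name>",
        input=case["input"],
        expected_output=case["expected_output"],
    )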
Create a prompt with variables
Use {{variables}} to insert the dataset variables into the prompt during experiments.
Each {{variableName}} in your prompt must have a corresponding key in your dataset items' input JSON. The names must match exactly (case-sensitive).
Example Prompt
You are a Langfuse expert. Please answer questions based on the following documentation:
DOCUMENTATION
{{documentation}}
{{question}}
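You can create this prompt in the Langfuse UI or via the SDK. A sketch, assuming an initialized langfuse client; the prompt name "qa-langfuse-docs" is only an example:

langfuse.create_prompt(
    name="qa-langfuse-docs",
    prompt=(
        "You are a Langfuse expert. Please answer questions based on the following documentation:\n"
        "DOCUMENTATION\n"
        "{{documentation}}\n"
        "{{question}}"
    ),
    labels=["production"],
)

# Optionally verify that the variables resolve as expected
prompt = langfuse.get_prompt("qa-langfuse-docs")
print(prompt.compile(
    documentation="Langfuse is an LLM Engineering Platform",
    question="What is Langfuse?",
))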
Run a prompt experiment
Now that we have set up a prompt version and a dataset, we can run a prompt experiment in Langfuse for each prompt version that we want to test.
When viewing the prompt details or a dataset, you can start a prompt experiment directly from the UI.
Select the prompt version, dataset, and model configuration that you want to test. Before running the experiment, you will see whether the prompt variables match the dataset variables.
Troubleshooting: If you see a warning about mismatched variables, ensure that:
- Every {{variable}} in your prompt has a matching key in your dataset items' input JSON
- The names match exactly (including case sensitivity)
- Your dataset input is valid JSON
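If you want to check the mapping yourself before running an experiment, a small helper like the following can compare the prompt's variables against a dataset item's input keys. Both functions are hypothetical illustrations, not part of the Langfuse SDK:

import re

def extract_prompt_variables(prompt_template: str) -> set:
    # Collect all {{variableName}} placeholders from the prompt template
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", prompt_template))

def check_item_against_prompt(prompt_template: str, item_input: dict) -> None:
    # Report prompt variables that have no matching key in the dataset item input
    missing = extract_prompt_variables(prompt_template) - set(item_input.keys())
    if missing:
        print(f"Missing keys in dataset item input: {sorted(missing)}")
    else:
        print("All prompt variables are covered by the dataset item input.")

check_item_against_prompt(
    "You are a Langfuse expert. Answer based on:\n{{documentation}}\nQuestion: {{question}}",
    {"question": "What is Langfuse?"},  # missing "documentation"
)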