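Prompt Engineering Test Cases

⚠️ Timeliness note: this chapter covers frontier models, pricing, leaderboards, and similar information that can change quickly between versions; treat the original papers, official release pages, and API documentation as authoritative.

Test goal: verify the effectiveness of prompt engineering techniques
Test types: functional testing, quality testing, A/B testing
Techniques covered: prompt design, few-shot learning, chain-of-thought

📋 Test Overview

Test Goals

Functional testing: verify that different prompt strategies work
Quality testing: evaluate the quality of the generated content
Robustness testing: test the stability of prompts
Comparison testing: compare the performance of different prompt strategies

Test Environment

Python version: 3.9+ (the code uses built-in generics such as list[str] in annotations)
LLM: OpenAI GPT-4 family (the code defaults to gpt-4o) / Tongyi Qianwen (Qwen)
Test framework: pytest
Evaluation tools: human evaluation + automated metrics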
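
The test functions in this chapter pass the literal placeholder "your-api-key". A minimal sketch of reading the key from the environment instead is shown here; the helper name `load_api_key` is hypothetical, and `OPENAI_API_KEY` is assumed to be the variable your environment sets.

```python
import os
from typing import Mapping, Optional


def load_api_key(
    env: Mapping[str, str] = os.environ,
    var: str = "OPENAI_API_KEY",  # assumed variable name
) -> Optional[str]:
    """Return the API key from the environment, or None if it is unset or blank."""
    value = env.get(var, "").strip()
    return value or None
```

Inside a pytest fixture, a `None` result could be turned into `pytest.skip(...)` so that the suite is skipped rather than failing on machines without credentials.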

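🧪 Test Case List

1. Basic Prompt Tests

Test Case 1.1: Zero-Shot Prompt

Test goal: verify the effectiveness of zero-shot prompts

Test code:

Python

import openai

class PromptTester:
    """Sends prompts to a chat model so the replies can be tested."""

    def __init__(self, api_key: str, model: str = "gpt-4o"):
        """Create the API client and remember which model to call."""
        self.client = openai.OpenAI(api_key=api_key)
        self.model = model

    def zero_shot_test(
        self,
        task: str,
        input_text: str,
    ) -> str:
        """
        Zero-shot test.

        Args:
            task: Task description
            input_text: Input text

        Returns:
            Generated result
        """
        # Build a zero-shot prompt: task description plus input, with no examples
        # ("输入" = input, "输出" = output)
        prompt = f"""{task}

输入: {input_text}
输出:"""

        # Call the LLM API
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,   # Sampling randomness; lower values give more deterministic output
            max_tokens=500,    # Upper bound on the number of output tokens
        )

        # Extract the text of the model's reply
        return response.choices[0].message.content

def test_zero_shot_sentiment():
    """Test zero-shot sentiment analysis."""
    tester = PromptTester(api_key="your-api-key")  # placeholder; supply a real key

    # Test cases: (input text, expected label)
    test_cases = [
        ("这部电影太棒了！", "positive"),        # "This movie is great!"
        ("服务态度很差，不推荐。", "negative"),  # "Poor service, not recommended."
        ("还可以，一般般吧。", "neutral"),       # "It's okay, just average."
    ]

    # "Classify the sentiment of the following text (positive/negative/neutral)"
    task = "判断以下文本的情感倾向（positive/negative/neutral）"

    # Run every test case and count the correct ones
    correct = 0
    for text, expected in test_cases:
        result = tester.zero_shot_test(task, text)
        result = result.strip().lower()  # Trim whitespace and lowercase for comparison

        # Check whether the expected label appears in the model output
        if expected in result:
            correct += 1
            print(f"✓ Correct: {text} -> {result}")
        else:
            print(f"✗ Wrong: {text} -> {result} (expected: {expected})")

    # Compute accuracy and assert a minimum threshold
    accuracy = correct / len(test_cases)
    print(f"\nAccuracy: {accuracy:.2%}")
    assert accuracy >= 0.6, "Zero-shot accuracy too low"  # Raises AssertionError when the condition is False

Expected result: accuracy ≥ 60%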
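
The substring check above (`expected in result`) also passes when the model hedges, for example by mentioning both "positive" and "negative". A stricter check could look like the following sketch; `extract_label` is a hypothetical helper, not part of the original test suite, and it assumes the labels are plain ASCII words.

```python
import re
from typing import Optional, Sequence


def extract_label(output: str, labels: Sequence[str]) -> Optional[str]:
    """Return the one label mentioned in output, or None if zero or several appear."""
    text = output.strip().lower()
    found = [
        label
        for label in labels
        # ASCII lookarounds instead of \b, because CJK characters count as
        # word characters and would hide the boundary next to a label.
        if re.search(rf"(?<![a-z]){re.escape(label.lower())}(?![a-z])", text)
    ]
    return found[0] if len(found) == 1 else None
```

With this helper, the comparison in the test becomes `extract_label(result, labels) == expected`, which treats ambiguous replies as wrong.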

Test Case 1.2: One-Shot Prompt

Test goal: verify the effectiveness of one-shot prompts

Test code:

Python

# Method of PromptTester; place it inside the class body.
def one_shot_test(
    self,
    task: str,
    example_input: str,
    example_output: str,
    test_input: str,
) -> str:
    """
    One-shot test.

    Args:
        task: Task description
        example_input: Example input
        example_output: Example output
        test_input: Test input

    Returns:
        Generated result
    """
    # Build a one-shot prompt: a single example shows the model the task format
    # ("示例" = example, "现在请处理" = now process)
    prompt = f"""{task}

示例:
输入: {example_input}
输出: {example_output}

现在请处理:
输入: {test_input}
输出:"""

    # Call the LLM to generate the result
    response = self.client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=500,
    )

    return response.choices[0].message.content

def test_one_shot_translation():
    """Test one-shot translation."""
    tester = PromptTester(api_key="your-api-key")

    task = "将以下中文翻译成英文"  # "Translate the following Chinese into English"
    example_input = "你好，世界！"
    example_output = "Hello, World!"

    # (Chinese source sentence, English keyword expected in the translation)
    test_cases = [
        ("今天天气很好。", "good"),
        ("我喜欢编程。", "like"),
        ("这是一个测试。", "test"),
    ]

    correct = 0
    for text, keyword in test_cases:
        result = tester.one_shot_test(
            task,
            example_input,
            example_output,
            text,
        )

        # Check that the translation contains the expected English keyword.
        # This is a loose proxy: a valid translation may use a synonym.
        if keyword in result.lower():
            correct += 1
            print(f"✓ Correct: {text} -> {result}")
        else:
            print(f"✗ Wrong: {text} -> {result}")

    accuracy = correct / len(test_cases)
    print(f"\nAccuracy: {accuracy:.2%}")
    assert accuracy >= 0.7, "One-shot accuracy too low"

Expected result: accuracy ≥ 70%

Test Case 1.3: Few-Shot Prompt

Test goal: verify the effectiveness of few-shot prompts

Test code:

Python

# Method of PromptTester; place it inside the class body.
def few_shot_test(
    self,
    task: str,
    examples: list[dict[str, str]],
    test_input: str,
) -> str:
    """
    Few-shot test.

    Args:
        task: Task description
        examples: List of examples, each with "input" and "output" keys
        test_input: Test input

    Returns:
        Generated result
    """
    # Build a few-shot prompt: task description first, then the examples one by one
    prompt_parts = [task, "\n示例:"]

    # Append each input/output example so the model can pick up the task pattern
    for i, example in enumerate(examples):  # enumerate yields index and element together
        prompt_parts.append(f"\n示例 {i+1}:")
        prompt_parts.append(f"输入: {example['input']}")
        prompt_parts.append(f"输出: {example['output']}")

    # Finally append the test input to be processed
    prompt_parts.append("\n现在请处理:")
    prompt_parts.append(f"输入: {test_input}")
    prompt_parts.append("输出:")

    # Join all parts with newlines into the full prompt
    prompt = "\n".join(prompt_parts)

    response = self.client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=500,
    )

    return response.choices[0].message.content

def test_few_shot_classification():
    """Test few-shot classification."""
    tester = PromptTester(api_key="your-api-key")

    # "Classify the following text as: technology (科技), sports (体育), entertainment (娱乐)"
    task = "将以下文本分类为：科技、体育、娱乐"

    examples = [
        {"input": "人工智能技术发展迅速", "output": "科技"},
        {"input": "足球比赛精彩纷呈", "output": "体育"},
        {"input": "新电影即将上映", "output": "娱乐"},
    ]

    test_cases = [
        ("5G网络即将商用", "科技"),
        ("篮球比赛进入决赛", "体育"),
        ("明星发布新专辑", "娱乐"),
    ]

    correct = 0
    for text, expected in test_cases:
        result = tester.few_shot_test(task, examples, text)

        if expected in result:
            correct += 1
            print(f"✓ Correct: {text} -> {result}")
        else:
            print(f"✗ Wrong: {text} -> {result} (expected: {expected})")

    accuracy = correct / len(test_cases)
    print(f"\nAccuracy: {accuracy:.2%}")
    assert accuracy >= 0.8, "Few-shot accuracy too low"

Expected result: accuracy ≥ 80%
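
The few-shot test above packs all examples into a single user message. With a chat-style API, an alternative is to present each example as a prior user/assistant turn. The sketch below only builds the message list; `build_few_shot_messages` is a hypothetical helper, and whether this layout outperforms the single-message prompt is something to measure rather than assume.

```python
from typing import Dict, List


def build_few_shot_messages(
    task: str,
    examples: List[Dict[str, str]],
    test_input: str,
) -> List[Dict[str, str]]:
    """Encode few-shot examples as alternating user/assistant chat turns."""
    messages = [{"role": "system", "content": task}]
    for example in examples:
        messages.append({"role": "user", "content": example["input"]})
        messages.append({"role": "assistant", "content": example["output"]})
    # The real query goes last, as the turn the model must answer.
    messages.append({"role": "user", "content": test_input})
    return messages
```

The returned list can be passed as the `messages` argument of the same `chat.completions.create` call used throughout this chapter.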

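2. Chain-of-Thought Prompt Tests

Test Case 2.1: Zero-Shot Chain-of-Thought

Test goal: verify the effectiveness of zero-shot chain-of-thought prompts

Test code:

Python

# Method of PromptTester; place it inside the class body.
def zero_shot_cot_test(
    self,
    question: str,
) -> str:
    """
    Zero-shot chain-of-thought test.

    Args:
        question: Question

    Returns:
        Generated result
    """
    # Zero-shot CoT: append "让我们一步步思考" ("Let's think step by step")
    # to the question so the model shows its reasoning
    prompt = f"""{question}

让我们一步步思考:"""

    # A lower temperature (0.3) keeps the reasoning more stable across runs
    response = self.client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.3,    # Low temperature for more consistent reasoning
        max_tokens=1000,    # Reasoning is longer, so allow more tokens
    )

    return response.choices[0].message.content

def test_zero_shot_cot_math():
    """Test zero-shot chain-of-thought math reasoning."""
    tester = PromptTester(api_key="your-api-key")

    # (question, expected answer)
    test_cases = [
        # "Apples cost 3 yuan each. How much do 5 apples cost?"
        ("一个商店卖苹果，每个3元。如果买5个苹果，需要多少钱？", "15"),
        # "Xiao Ming has 10 apples, eats 3, then buys 5 more. How many now?"
        ("小明有10个苹果，吃了3个，又买了5个，现在有多少个？", "12"),
    ]

    failures = []
    for question, answer in test_cases:
        result = tester.zero_shot_cot_test(question)

        if answer in result:
            print(f"✓ Correct: {question}")
            print(f"  Reasoning: {result}")
        else:
            failures.append(question)
            print(f"✗ Wrong: {question}")
            print(f"  Answer: {result}")
            print(f"  Expected: {answer}")

    # Without this assertion the test would pass even when every answer is wrong
    assert not failures, f"Wrong answers for: {failures}"

Expected result: the model shows its reasoning and reaches the correct answer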
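
Checking `answer in result` against a full reasoning trace can match by accident: "15" is a substring of "150", and intermediate steps may contain the expected digits. One possible refinement, sketched under the assumption that the final answer is the last number the model writes, is to compare only that number. `extract_final_number` is a hypothetical helper.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[str]:
    """Return the last integer or decimal that appears in text, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None
```

The assumption does not always hold (a reply may end with a restated quantity from the question), so asking the model to finish with a fixed marker line and parsing that line is a sturdier option when the prompt can be changed.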

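Test Case 2.2: Few-Shot Chain-of-Thought

Test goal: verify the effectiveness of few-shot chain-of-thought prompts

Test code:

Python

# Method of PromptTester; place it inside the class body.
def few_shot_cot_test(
    self,
    examples: list[dict[str, str]],
    question: str,
) -> str:
    """
    Few-shot chain-of-thought test.

    Args:
        examples: List of examples with "question", "reasoning", and "answer" keys
        question: Question

    Returns:
        Generated result
    """
    prompt_parts = []

    # Append the few-shot CoT examples: each has a question, reasoning, and answer
    # ("问题" = question, "思考" = reasoning, "答案" = answer)
    for i, example in enumerate(examples):
        prompt_parts.append(f"问题 {i+1}: {example['question']}")
        prompt_parts.append(f"思考: {example['reasoning']}")   # Show the reasoning
        prompt_parts.append(f"答案: {example['answer']}\n")

    # Append the question to answer, ending with only the "思考:" prefix
    # so the model continues with its own reasoning
    prompt_parts.append(f"问题: {question}")
    prompt_parts.append("思考:")

    prompt = "\n".join(prompt_parts)

    response = self.client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.3,
        max_tokens=1000,
    )

    return response.choices[0].message.content

def test_few_shot_cot_logic():
    """Test few-shot chain-of-thought logical reasoning."""
    tester = PromptTester(api_key="your-api-key")

    examples = [
        {
            # "If A is taller than B and B is taller than C, who is taller, A or C?"
            "question": "如果A比B高，B比C高，那么A和C谁高？",
            "reasoning": "已知A比B高，B比C高。根据传递性，A比C高。",
            "answer": "A比C高"
        },
        {
            # "If all birds can fly and a penguin is a bird, can a penguin fly?"
            "question": "如果所有的鸟都会飞，企鹅是鸟，那么企鹅会飞吗？",
            "reasoning": "虽然企鹅是鸟，但企鹅实际上不会飞。这说明'所有的鸟都会飞'这个前提是错误的。",
            "answer": "不会飞"
        },
    ]

    # "If all cats like fish and Xiaohua is a cat, does Xiaohua like fish?"
    test_question = "如果所有的猫都喜欢鱼，小花是猫，那么小花喜欢鱼吗？"

    result = tester.few_shot_cot_test(examples, test_question)

    print(f"Question: {test_question}")
    print(f"Answer: {result}")

    # Heuristic check that the answer contains reasoning: look for the
    # connectives "因为" (because), "所以" (therefore), or "根据" (according to)
    assert "因为" in result or "所以" in result or "根据" in result

Expected result: the answer contains a reasoning process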

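3. Role-Playing Prompt Tests

Test Case 3.1: Professional Role-Playing

Test goal: verify the effectiveness of professional role-playing prompts

Test code:

Python

# Method of PromptTester; place it inside the class body.
def role_playing_test(
    self,
    role: str,
    task: str,
    user_input: str,
) -> str:
    """
    Role-playing test.

    Args:
        role: Role
        task: Task
        user_input: User input

    Returns:
        Generated result
    """
    # Build a role-playing prompt: role identity, task requirements, and user input
    # ("你现在是一个{role}" = "You are now a {role}", "用户" = user)
    prompt = f"""你现在是一个{role}。

{task}

用户: {user_input}
{role}:"""

    # Call the LLM so it answers in the given role
    response = self.client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=500,
    )

    return response.choices[0].message.content

def test_role_playing_teacher():
    """Test role-playing as a teacher."""
    tester = PromptTester(api_key="your-api-key")

    role = "经验丰富的数学老师"  # "an experienced math teacher"
    task = "请用简单易懂的语言解释数学概念，并举例说明。"  # "Explain math concepts in simple language, with examples."

    user_input = "请解释什么是质数"  # "Please explain what a prime number is"

    result = tester.role_playing_test(role, task, user_input)

    print(f"Teacher's answer: {result}")

    # Check that the answer mentions the concept and is reasonably detailed
    assert "质数" in result or "prime" in result.lower()
    assert len(result) > 50  # The answer should be detailed enough

Expected result: the answer fits the role and is detailed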
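
The role-playing prompt above places the role description and the user's question in one user message. Chat APIs also accept a separate system message, which is a common place for role instructions. A minimal sketch of that layout follows; `build_role_messages` is a hypothetical helper, and the chapter does not claim either layout is better.

```python
from typing import Dict, List


def build_role_messages(role: str, task: str, user_input: str) -> List[Dict[str, str]]:
    """Put the role and task in a system message and the question in a user message."""
    system_prompt = f"你现在是一个{role}。\n\n{task}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]
```

Running the same assertions against both layouts would turn this into one more A/B comparison of prompt strategies.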

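Test Case 3.2: Stylized Role-Playing

Test goal: verify the effectiveness of stylized role-playing prompts

Test code:

Python

def test_role_playing_style():
    """Test stylized role-playing."""
    tester = PromptTester(api_key="your-api-key")

    # Several roles with their styles, to see whether the model adapts its tone
    roles = [
        ("诗人", "用优美的诗歌语言回答"),      # poet: "answer in elegant poetic language"
        ("程序员", "用代码和术语回答"),        # programmer: "answer with code and jargon"
        ("小朋友", "用简单可爱的语言回答"),    # child: "answer in simple, cute language"
    ]

    user_input = "描述一下春天"  # "Describe spring"

    # Test the answer style of each role in turn
    for role, style in roles:
        result = tester.role_playing_test(role, style, user_input)

        print(f"\nAnswer from {role}:")
        print(result)

        # Check that the answer has a reasonable length; judging the style
        # itself is left to human evaluation
        assert len(result) > 20

Expected result: different roles produce answers in different styles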

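4. Structured Prompt Tests

Test Case 4.1: JSON Output Format

Test goal: verify the effectiveness of JSON output format prompts

Test code:

Python

import json
from typing import Optional

# Method of PromptTester; place it inside the class body.
def json_output_test(
    self,
    task: str,
    input_text: str,
    schema: dict,
) -> Optional[dict]:
    """
    JSON output test.

    Args:
        task: Task
        input_text: Input text
        schema: Field template describing the expected JSON keys

    Returns:
        Parsed JSON, or None if the output cannot be parsed
    """
    # Serialize the template as real JSON for the prompt. str(schema) would
    # give a Python repr with single quotes, which invites invalid JSON.
    schema_str = json.dumps(schema, ensure_ascii=False, indent=2)

    # Build a structured-output prompt that explicitly asks for JSON
    # ("请以JSON格式输出，格式如下" = "Output JSON in the following format")
    prompt = f"""{task}

输入: {input_text}

请以JSON格式输出，格式如下:
{schema_str}

输出:"""

    response = self.client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.3,
        max_tokens=500,
    )

    result = response.choices[0].message.content

    # Try to parse the model output as JSON
    try:
        return json.loads(result)        # Parsed successfully
    except (json.JSONDecodeError, TypeError):
        # Catch only parsing errors; a bare except would also hide unrelated bugs
        print(f"JSON parsing failed: {result}")
        return None

def test_json_output():
    """Test JSON output."""
    tester = PromptTester(api_key="your-api-key")

    task = "从以下文本中提取关键信息"  # "Extract the key information from the following text"
    # "Zhang San, male, 30, software engineer, lives in Beijing"
    input_text = "张三，男，30岁，软件工程师，居住在北京"

    schema = {
        "name": "姓名",
        "gender": "性别",
        "age": "年龄",
        "occupation": "职业",
        "city": "城市"
    }

    result = tester.json_output_test(task, input_text, schema)

    print(f"Extracted: {result}")

    # Check that the output is valid JSON with the expected content
    assert result is not None
    assert isinstance(result, dict)  # isinstance checks the type
    assert "name" in result
    assert result["name"] == "张三"

Expected result: the output is valid JSON and contains the correct information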
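
Models often wrap JSON replies in a Markdown code fence, which makes a direct `json.loads` call fail even though the payload is valid. A tolerant parser could look like the sketch below; `parse_json_reply` is a hypothetical helper, and it only handles a single fence around the whole reply.

```python
import json
import re
from typing import Optional


def parse_json_reply(reply: str) -> Optional[dict]:
    """Parse a model reply as a JSON object, tolerating one surrounding code fence."""
    text = reply.strip()
    # Remove an opening fence (three backticks plus an optional language tag)
    # and a closing fence, if present.
    text = re.sub(r"^`{3}[a-zA-Z]*\s*", "", text)
    text = re.sub(r"\s*`{3}$", "", text)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Only a JSON object satisfies the test's isinstance(result, dict) check.
    return parsed if isinstance(parsed, dict) else None
```

Some providers also offer a structured-output or JSON mode on the API side; check the current API documentation before relying on it.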

Test Case 4.2: List Output Format

Test goal: verify the effectiveness of list output format prompts

Test code:

Python

import re

# Method of PromptTester; place it inside the class body.
def list_output_test(
    self,
    task: str,
    input_text: str,
) -> list[str]:
    """
    List output test.

    Args:
        task: Task
        input_text: Input text

    Returns:
        List of items
    """
    # "请以列表形式输出，每行一项" = "Output as a list, one item per line"
    prompt = f"""{task}

输入: {input_text}

请以列表形式输出，每行一项:"""

    response = self.client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=500,
    )

    result = response.choices[0].message.content

    # Split on newlines and drop blank lines to get the raw items
    lines = [line.strip() for line in result.split('\n') if line.strip()]

    # Clean the items: strip numbering and bullet prefixes from the model output
    cleaned_lines = []
    for line in lines:
        line = re.sub(r'^\d+[\.\)]\s*', '', line)  # Remove numbering such as "1." or "2)"
        line = re.sub(r'^[-*]\s*', '', line)         # Remove "-" or "*" bullets
        cleaned_lines.append(line)

    return cleaned_lines

def test_list_output():
    """Test list output."""
    tester = PromptTester(api_key="your-api-key")

    task = "列出Python的优点"  # "List the advantages of Python"
    input_text = ""

    result = tester.list_output_test(task, input_text)

    print("Advantages of Python:")
    for i, item in enumerate(result, 1):
        print(f"{i}. {item}")

    # Check that the output is a list with enough items
    assert isinstance(result, list)
    assert len(result) >= 3  # At least 3 items

Expected result: the output is a list containing several items

5. Quality Evaluation Tests

Test Case 5.1: Consistency Test

Test goal: test the consistency of a prompt

Test code:

Python

from difflib import SequenceMatcher

# Method of PromptTester; place it inside the class body.
def consistency_test(
    self,
    prompt: str,
    num_trials: int = 5,
) -> list[str]:
    """
    Consistency test.

    Args:
        prompt: Prompt
        num_trials: Number of trials

    Returns:
        List of results
    """
    results = []

    # Call the model several times with the same prompt and collect every result
    for _ in range(num_trials):
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,  # A lower temperature reduces random variation
            max_tokens=500,
        )

        result = response.choices[0].message.content
        results.append(result)

    return results

def test_prompt_consistency():
    """Test prompt consistency."""
    tester = PromptTester(api_key="your-api-key")

    prompt = "用一句话描述人工智能"  # "Describe artificial intelligence in one sentence"

    results = tester.consistency_test(prompt, num_trials=5)

    print("Consistency test results:")
    for i, result in enumerate(results, 1):
        print(f"{i}. {result}")

    # Use SequenceMatcher to compute the text similarity of every pair of results
    similarities = []
    for i in range(len(results)):
        for j in range(i + 1, len(results)):
            similarity = SequenceMatcher(None, results[i], results[j]).ratio()
            similarities.append(similarity)

    # The average similarity measures how consistent the prompt's output is
    avg_similarity = sum(similarities) / len(similarities)
    print(f"\nAverage similarity: {avg_similarity:.2%}")

    # Threshold assertion: below 50% the prompt's output is considered unstable
    assert avg_similarity >= 0.5, "Prompt consistency too low"

Expected result: the similarity between repeated generations is ≥ 50%
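
The average in the test above divides by the number of pairs, which is zero when fewer than two results are collected (for example with `num_trials=1`). A small reusable helper with that guard might look like this sketch; the name `mean_pairwise_similarity` and the choice to return 1.0 for fewer than two texts are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Sequence


def mean_pairwise_similarity(texts: Sequence[str]) -> float:
    """Average SequenceMatcher ratio over all unordered pairs of texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        # Fewer than two texts: nothing to disagree with (assumed convention).
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```

Note that `SequenceMatcher` compares characters, so two replies with the same meaning but different wording can still score low; the 50% threshold measures surface similarity only.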

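Test Case 5.2: Robustness Test

Test goal: test how robust a prompt is to variations in the input

Test code:

Python

# Method of PromptTester; place it inside the class body.
def robustness_test(
    self,
    base_prompt: str,
    variations: list[str],
) -> dict[str, str]:
    """
    Robustness test.

    Args:
        base_prompt: Base prompt containing an "{input}" placeholder
        variations: List of input variations

    Returns:
        Dictionary mapping each variation to its result
    """
    results = {}

    # Substitute each variation in turn to see how the same prompt template
    # copes with different phrasings
    for variation in variations:
        prompt = base_prompt.replace("{input}", variation)  # Fill the placeholder with the actual input

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=500,
        )

        result = response.choices[0].message.content
        results[variation] = result

    return results

def test_prompt_robustness():
    """Test prompt robustness."""
    tester = PromptTester(api_key="your-api-key")

    base_prompt = "请总结以下内容: {input}"  # "Please summarize the following: {input}"

    # The same topic in formal Chinese, colloquial Chinese, and English
    variations = [
        "人工智能是计算机科学的一个分支，它企图了解智能的实质，并生产出一种新的能以人类智能相似的方式做出反应的智能机器。",
        "AI，也就是人工智能，是计算机科学的一个重要领域。它的目标是创造能够像人类一样思考和行动的机器。",
        "Artificial Intelligence (AI) is a branch of computer science that aims to create machines that can think and act like humans.",
    ]

    results = tester.robustness_test(base_prompt, variations)

    print("Robustness test results:")
    for input_text, result in results.items():
        print(f"\nInput: {input_text[:50]}...")
        print(f"Output: {result}")

    # Check that every output is non-trivial (a length check only; it does not
    # verify the content of the summary)
    for result in results.values():
        assert len(result) > 10, "Output too short"

Expected result: every input variation yields a reasonable output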

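6. A/B Tests

Test Case 6.1: Prompt Comparison Test

Test goal: compare the performance of different prompt strategies

Test code:

Python

from typing import Callable

# Method of PromptTester; place it inside the class body.
def ab_test(
    self,
    prompt_a: str,
    prompt_b: str,
    test_inputs: list[str],
    evaluation_func: Callable[[str], float],
) -> dict[str, object]:
    """
    A/B test.

    Args:
        prompt_a: Prompt A, containing an "{input}" placeholder
        prompt_b: Prompt B, containing an "{input}" placeholder
        test_inputs: List of test inputs
        evaluation_func: Function that scores one model output

    Returns:
        Dictionary with the average and per-input scores of both prompts
    """
    scores_a = []   # Scores for Prompt A
    scores_b = []   # Scores for Prompt B

    # For each test input, generate a result with each prompt and score it
    for test_input in test_inputs:
        # Prompt A: fill the placeholder with the actual input and call the LLM
        response_a = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "user", "content": prompt_a.replace("{input}", test_input)}
            ],
            temperature=0.7,
            max_tokens=500,
        )
        result_a = response_a.choices[0].message.content
        score_a = evaluation_func(result_a)
        scores_a.append(score_a)

        # Prompt B
        response_b = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "user", "content": prompt_b.replace("{input}", test_input)}
            ],
            temperature=0.7,
            max_tokens=500,
        )
        result_b = response_b.choices[0].message.content
        score_b = evaluation_func(result_b)
        scores_b.append(score_b)

    # Return the averages and the detailed scores for comparison
    return {
        "prompt_a_avg": sum(scores_a) / len(scores_a),  # Average score of A
        "prompt_b_avg": sum(scores_b) / len(scores_b),  # Average score of B
        "prompt_a_scores": scores_a,
        "prompt_b_scores": scores_b,
    }

def test_prompt_comparison():
    """Test prompt comparison."""
    tester = PromptTester(api_key="your-api-key")

    # Define the two prompts
    prompt_a = "请翻译成英文: {input}"  # "Translate into English: {input}"
    # Prompt B adds a professional-translator role and asks for idiomatic
    # English that preserves the tone and style of the source
    prompt_b = """你是一个专业的翻译官。请将以下中文翻译成地道的英文，注意保持原文的语气和风格。

中文: {input}
英文:"""

    test_inputs = [
        "你好，世界！",
        "今天天气很好。",
        "我喜欢编程。",
    ]

    # Placeholder evaluation function: output length. Length does not measure
    # translation quality; replace it with a task-specific metric in practice.
    def evaluate(result: str) -> float:
        return float(len(result))

    results = tester.ab_test(prompt_a, prompt_b, test_inputs, evaluate)

    print(f"Prompt A average score: {results['prompt_a_avg']:.2f}")
    print(f"Prompt B average score: {results['prompt_b_avg']:.2f}")

    if results['prompt_b_avg'] > results['prompt_a_avg']:
        print("Prompt B performed better")
    elif results['prompt_b_avg'] < results['prompt_a_avg']:
        print("Prompt A performed better")
    else:
        print("Prompt A and Prompt B tied")

Expected result: the test exposes performance differences between prompts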
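
The evaluation function in the test above scores outputs by length, which says little about quality. As an illustration of a slightly more meaningful setup, the sketch below scores an output by how many expected keywords it contains and summarizes an A/B run as a win rate. Both function names and the keyword lists are assumptions, not part of the original suite.

```python
from typing import Sequence


def keyword_coverage(output: str, keywords: Sequence[str]) -> float:
    """Fraction of expected keywords found in output (case-insensitive)."""
    if not keywords:
        return 0.0
    text = output.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)


def win_rate(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Share of inputs on which prompt B scores strictly higher than prompt A."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    wins = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    return wins / len(scores_a)
```

Because `ab_test` calls `evaluation_func` with the output only, using per-input keywords would require passing the input along as well or binding the keywords per call. With only three inputs, any difference between the two prompts should be read as anecdotal rather than statistically meaningful.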

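📊 Test Execution

Run All Tests

Bash

# Run all prompt engineering test cases; -v prints verbose output
pytest tests/test_prompt_engineering.py -v

# Run only the zero-shot sentiment analysis test (:: selects a specific test function)
pytest tests/test_prompt_engineering.py::test_zero_shot_sentiment -v

# Run the tests and generate an HTML report (requires the pytest-html plugin)
pytest tests/test_prompt_engineering.py --html=report.html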

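✅ Verification Methods

1. Automated Verification

Run all test cases
Check that the assertions pass
Record the test results

2. Human Evaluation

Evaluate the quality of the generated content
Check that it meets expectations
Record subjective assessments

3. Performance Baseline

Establish a performance baseline
Monitor changes in prompt performance
Optimize the prompt strategy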
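
Monitoring against a baseline can be as simple as comparing the current metrics with stored ones and flagging drops beyond a tolerance. The sketch below assumes metrics are kept as a name-to-score mapping; the function name, the metric names, and the 0.05 tolerance are all hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,  # assumed allowance for run-to-run noise
) -> List[str]:
    """Return the metrics whose current value fell more than tolerance below baseline."""
    return sorted(
        name
        for name, base_value in baseline.items()
        # A metric missing from the current run is treated as 0.0, i.e. a regression.
        if current.get(name, 0.0) < base_value - tolerance
    )
```

For example, with a baseline of `{"zero_shot": 0.67, "few_shot": 1.0}` and a current run of `{"zero_shot": 0.66, "few_shot": 0.67}`, only `few_shot` is reported, since the zero-shot drop stays within the tolerance.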

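📝 Test Report

A test report should include:

Test overview
  Number of test cases
  Pass/fail statistics
  Accuracy and other metrics
Detailed results
  Result of each test case
  Comparison of prompt strategies
  Problems and suggestions
Best practices
  Effective prompt patterns
  Common pitfalls and solutions
  Optimization suggestions

Completion criteria: all test cases pass, with accuracy ≥ 60%
Recommended test frequency: on every prompt update
Test maintenance cycle: weekly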