第九章评估（上）——存在一个简单的正确答案时¶

一、环境配置
- 1.1 加载 API 密钥和一些 Python 库。
- 1.2 获取相关产品和类别
二、找出相关产品和类别名称
三、在一些查询上进行评估
四、更难的测试用例
五、修改指令以处理难测试用例
六、在难测试用例上评估修改后的指令
七、回归测试：验证模型在以前的测试用例上仍然有效
八、收集开发集进行自动化测试
九、通过与理想答案比较来评估测试用例
十、在所有测试用例上运行评估，并计算正确的用例比例

在之前的章节中，我们展示了如何使用 LLM 构建应用程序，包括评估输入、处理输入以及在向用户显示输出之前进行最终输出检查。

构建这样的系统后，如何知道它的工作情况？甚至在部署后并让用户使用它时，如何跟踪它的运行情况，发现任何缺陷，并持续改进系统的答案质量？

在本章中，我们想与您分享一些最佳实践，用于评估 LLM 的输出。

构建基于 LLM 的应用程序与传统的监督学习应用程序有所不同。由于可以快速构建基于 LLM 的应用程序，因此评估方法通常不从测试集开始。相反，通常会逐渐建立一组测试示例。

在传统的监督学习环境中，需要收集训练集、开发集或保留交叉验证集，然后在整个开发过程中使用它们。

然而，如果能够在几分钟内指定 Prompt，并在几个小时内得到相应结果，那么暂停很长时间去收集一千个测试样本将是一件极其痛苦的事情。因为现在，可以在零个训练样本的情况下获得这个成果。

因此，在使用 LLM 构建应用程序时，您将体会到如下的过程：

首先，您会在只有一到三个样本的小样本中调整 Prompt，并尝试让 Prompt 在它们身上起作用。

然后，当系统进行进一步的测试时，您可能会遇到一些棘手的例子。Prompt 在它们身上不起作用，或者算法在它们身上不起作用。

这就是使用 ChatGPT API 构建应用程序的开发者所经历的挑战。

在这种情况下，您可以将这些额外的几个示例添加到您正在测试的集合中，以机会主义地添加其他棘手的示例。

最终，您已经添加了足够的这些示例到您缓慢增长的开发集中，以至于通过手动运行每个示例来测试 Prompt 变得有些不方便。

然后，您开始开发在这些小示例集上用于衡量性能的指标，例如平均准确性。

这个过程的一个有趣方面是，如果您觉得您的系统已经足够好了，您可以随时停在那里，不再改进它。事实上，许多已部署的应用程序停在第一或第二个步骤，并且运行得非常好。

需要注意的是，有很多大模型的应用程序没有实质性的风险，即使它没有给出完全正确的答案。

但是，对于部分高风险应用，如果存在偏见或不适当的输出可能对某人造成伤害，那么收集测试集、严格评估系统的性能、确保在使用之前它能够做正确的事情，就变得更加重要。

但是，如果您只是使用它来总结文章供自己阅读，而不是给别人看，那么可能造成的危害风险更小，您可以在这个过程中早早停止，而不必去花费更大的代价去收集更大的数据集。

一、环境配置¶

1.1 加载 API 密钥和一些 Python 库。¶

同上一章，我们首先需要配置使用 OpenAI API 的环境

In [1]:

import os
import openai
import sys
import time
sys.path.append('../..')
import utils_en
import utils_zh

openai.api_key  = "sk-..."
# 设置 API_KEY, 请替换成您自己的 API_KEY

# 以下为基于环境变量的配置方法示例，这样更加安全。仅供参考，后续将不再涉及。
# import openai
# import os
# OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
# openai.api_key = OPENAI_API_KEY

In [2]:

def get_completion_from_messages(messages, 
                                 model="gpt-3.5-turbo", 
                                 temperature=0, 
                                 max_tokens=500):
    '''
    封装一个访问 OpenAI GPT3.5 的函数

    参数: 
    messages: 这是一个消息列表，每个消息都是一个字典，包含 role(角色）和 content(内容)。角色可以是'system'、'user' 或 'assistant’，内容是角色的消息。
    model: 调用的模型，默认为 gpt-3.5-turbo(ChatGPT)，有内测资格的用户可以选择 gpt-4
    temperature: 这决定模型输出的随机程度，默认为0，表示输出将非常确定。增加温度会使输出更随机。
    max_tokens: 这决定模型输出的最大的 token 数。
    '''
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, # 这决定模型输出的随机程度
        max_tokens=max_tokens, # 这决定模型输出的最大的 token 数
    )
    return response.choices[0].message["content"]

1.2 获取相关产品和类别¶

我们要获取前几章中提到的产品目录中的产品和类别列表。

In [5]:

products_and_category = utils_en.get_products_and_category()
products_and_category

Out[5]:

{'Computers and Laptops': ['TechPro Ultrabook',
  'BlueWave Gaming Laptop',
  'PowerLite Convertible',
  'TechPro Desktop',
  'BlueWave Chromebook'],
 'Smartphones and Accessories': ['SmartX ProPhone',
  'MobiTech PowerCase',
  'SmartX MiniPhone',
  'MobiTech Wireless Charger',
  'SmartX EarBuds'],
 'Televisions and Home Theater Systems': ['CineView 4K TV',
  'SoundMax Home Theater',
  'CineView 8K TV',
  'SoundMax Soundbar',
  'CineView OLED TV'],
 'Gaming Consoles and Accessories': ['GameSphere X',
  'ProGamer Controller',
  'GameSphere Y',
  'ProGamer Racing Wheel',
  'GameSphere VR Headset'],
 'Audio Equipment': ['AudioPhonic Noise-Canceling Headphones',
  'WaveSound Bluetooth Speaker',
  'AudioPhonic True Wireless Earbuds',
  'WaveSound Soundbar',
  'AudioPhonic Turntable'],
 'Cameras and Camcorders': ['FotoSnap DSLR Camera',
  'ActionCam 4K',
  'FotoSnap Mirrorless Camera',
  'ZoomMaster Camcorder',
  'FotoSnap Instant Camera']}

二、找出相关产品和类别名称¶

In [4]:

def find_category_and_product_v1(user_input, products_and_category):
    """
    从用户输入中获取到产品和类别

    参数：
    user_input：用户的查询
    products_and_category：产品类型和对应产品的字典
    """

    # 分隔符
    delimiter = "####"
    # 定义的系统信息，陈述了需要 GPT 完成的工作
    system_message = f"""
    You will be provided with customer service queries. \
    The customer service query will be delimited with {delimiter} characters.
    Output a Python list of json objects, where each object has the following format:
        'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \
    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,
    AND
        'products': <a list of products that must be found in the allowed products below>


    Where the categories and products must be found in the customer service query.
    If a product is mentioned, it must be associated with the correct category in the allowed products list below.
    If no products or categories are found, output an empty list.
    

    List out all products that are relevant to the customer service query based on how closely it relates
    to the product name and product category.
    Do not assume, from the name of the product, any features or attributes such as relative quality or price.

    The allowed products are provided in JSON format.
    The keys of each item represent the category.
    The values of each item is a list of products that are within that category.
    Allowed products: {products_and_category}
    

    """
    # 给出几个示例
    few_shot_user_1 = """I want the most expensive computer."""
    few_shot_assistant_1 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    messages =  [  
    {'role':'system', 'content': system_message},    
    {'role':'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_1 },
    {'role':'user', 'content': f"{delimiter}{user_input}{delimiter}"},  
    ] 
    return get_completion_from_messages(messages)

In [3]:

def find_category_and_product_v1(user_input,products_and_category):
    """
    从用户输入中获取到产品和类别

    参数：
    user_input：用户的查询
    products_and_category：产品类型和对应产品的字典
    """
    
    delimiter = "####"
    system_message = f"""
    您将提供客户服务查询。\
    客户服务查询将用{delimiter}字符分隔。
    输出一个 Python 列表，列表中的每个对象都是 Json 对象，每个对象的格式如下：
        'category': <Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \
    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders中的一个>,
    以及
        'products': <必须在下面允许的产品中找到的产品列表>
    
    其中类别和产品必须在客户服务查询中找到。
    如果提到了一个产品，它必须与下面允许的产品列表中的正确类别关联。
    如果没有找到产品或类别，输出一个空列表。
    
    根据产品名称和产品类别与客户服务查询的相关性，列出所有相关的产品。
    不要从产品的名称中假设任何特性或属性，如相对质量或价格。
    
    允许的产品以 JSON 格式提供。
    每个项目的键代表类别。
    每个项目的值是该类别中的产品列表。
    允许的产品：{products_and_category}
    
    """
    
    few_shot_user_1 = """我想要最贵的电脑。"""
    few_shot_assistant_1 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    messages =  [  
    {'role':'system', 'content': system_message},    
    {'role':'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_1 },
    {'role':'user', 'content': f"{delimiter}{user_input}{delimiter}"},  
    ] 
    return get_completion_from_messages(messages)

三、在一些查询上进行评估¶

In [5]:

# 第一个评估的查询
customer_msg_0 = f"""Which TV can I buy if I'm on a budget?"""

products_by_category_0 = find_category_and_product_v1(customer_msg_0,
                                                      products_and_category)
print(products_by_category_0)

    [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]

In [ ]:

# 第一个评估的查询
customer_msg_0 = f"""如果我预算有限，我可以买哪款电视？"""

products_by_category_0 = find_category_and_product_v1(customer_msg_0,
                                                      products_and_category)
print(products_by_category_0)

    [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'SoundMax Soundbar', 'CineView OLED TV']}]

In [6]:

# 第二个评估的查询
customer_msg_1 = f"""I need a charger for my smartphone"""

products_by_category_1 = find_category_and_product_v1(customer_msg_1,
                                                      products_and_category)
print(products_by_category_1)

    [{'category': 'Smartphones and Accessories', 'products': ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']}]

In [ ]:

customer_msg_1 = f"""我需要一个智能手机的充电器"""

products_by_category_1 = find_category_and_product_v1(customer_msg_1,
                                                      products_and_category)
print(products_by_category_1)

    [{'category': 'Smartphones and Accessories', 'products': ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']}]

In [7]:

# 第三个评估查询
customer_msg_2 = f"""
What computers do you have?"""

products_by_category_2 = find_category_and_product_v1(customer_msg_2,
                                                      products_and_category)
products_by_category_2

Out[7]:

"    [{'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]"

In [ ]:

customer_msg_2 = f"""
你们有哪些电脑？"""

products_by_category_2 = find_category_and_product_v1(customer_msg_2,
                                                      products_and_category)
products_by_category_2

"    [{'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]"

In [10]:

# 第四个查询，更复杂
customer_msg_3 = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs do you have?"""

products_by_category_3 = find_category_and_product_v1(customer_msg_3,
                                                      products_and_category)
print(products_by_category_3)

    [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']},
     {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']},
     {'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]
     
    Note: The query mentions "smartx pro phone" and "fotosnap camera, the dslr one", so the output includes the relevant categories and products. The query also asks about TVs, so the relevant category is included in the output.

In [ ]:

customer_msg_3 = f"""
告诉我关于smartx pro手机和fotosnap相机的信息，那款DSLR的。
另外，你们有哪些电视？"""

products_by_category_3 = find_category_and_product_v1(customer_msg_3,
                                                      products_and_category)
print(products_by_category_3)

    [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}]
    
    {'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}

它看起来像是输出了正确的数据，但它也输出了一堆文本，这些是多余的。这使得将其解析为 Python 字典列表更加困难。

四、更难的测试用例¶

找出一些在实际使用中，模型表现不如预期的查询。

In [9]:

customer_msg_4 = f"""
tell me about the CineView TV, the 8K one, Gamesphere console, the X one.
I'm on a budget, what computers do you have?"""

products_by_category_4 = find_category_and_product_v1(customer_msg_4,
                                                      products_and_category)
print(products_by_category_4)

    [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 8K TV']},
     {'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X']},
     {'category': 'Computers and Laptops', 'products': ['BlueWave Chromebook']}]
     
    Note: The CineView TV mentioned is the 8K one, and the Gamesphere console mentioned is the X one. 
    For the computer category, since the customer mentioned being on a budget, we cannot determine which specific product to recommend. 
    Therefore, we have included all the products in the Computers and Laptops category in the output.

In [10]:

customer_msg_4 = f"""
告诉我关于CineView电视的信息，那款8K的，还有Gamesphere游戏机，X款的。
我预算有限，你们有哪些电脑？"""

products_by_category_4 = find_category_and_product_v1(customer_msg_4,products_and_category)
print(products_by_category_4)

    [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 8K TV']}, {'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X']}, {'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    
    具体来说，CineView 8K电视是一款高端电视，具有8K分辨率和OLED显示屏。GameSphere X是一款游戏机，具有高性能和多种游戏选择。对于预算有限的电脑，您可以考虑TechPro Chromebook或TechPro Ultrabook，它们都是较为经济实惠的选择。

五、修改指令以处理难测试用例¶

我们在提示中添加了以下内容，不要输出任何不在 JSON 格式中的附加文本，并添加了第二个示例，使用用户和助手消息进行 few-shot 提示。

In [11]:

def find_category_and_product_v2(user_input, products_and_category):
    """
    从用户输入中获取到产品和类别
    添加：不要输出任何不符合 JSON 格式的额外文本。
    添加了第二个示例（用于 few-shot 提示），用户询问最便宜的计算机。
    在这两个 few-shot 示例中，显示的响应只是 JSON 格式的完整产品列表。

    参数：
    user_input：用户的查询
    products_and_category：产品类型和对应产品的字典
    """
    delimiter = "####"
    system_message = f"""
    You will be provided with customer service queries. \
    The customer service query will be delimited with {delimiter} characters.
    Output a Python list of JSON objects, where each object has the following format:
        'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \
    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,
    AND
        'products': <a list of products that must be found in the allowed products below>
    Do not output any additional text that is not in JSON format.
    Do not write any explanatory text after outputting the requested JSON.


    Where the categories and products must be found in the customer service query.
    If a product is mentioned, it must be associated with the correct category in the allowed products list below.
    If no products or categories are found, output an empty list.
    

    List out all products that are relevant to the customer service query based on how closely it relates
    to the product name and product category.
    Do not assume, from the name of the product, any features or attributes such as relative quality or price.

    The allowed products are provided in JSON format.
    The keys of each item represent the category.
    The values of each item is a list of products that are within that category.
    Allowed products: {products_and_category}
    

    """
    
    few_shot_user_1 = """I want the most expensive computer. What do you recommend?"""
    few_shot_assistant_1 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    few_shot_user_2 = """I want the most cheapest computer. What do you recommend?"""
    few_shot_assistant_2 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    messages =  [  
    {'role':'system', 'content': system_message},    
    {'role':'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_1 },
    {'role':'user', 'content': f"{delimiter}{few_shot_user_2}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_2 },
    {'role':'user', 'content': f"{delimiter}{user_input}{delimiter}"},  
    ] 
    return get_completion_from_messages(messages)

In [11]:

def find_category_and_product_v2(user_input,products_and_category):
    """
    从用户输入中获取到产品和类别

    添加：不要输出任何不符合 JSON 格式的额外文本。
    添加了第二个示例（用于 few-shot 提示），用户询问最便宜的计算机。
    在这两个 few-shot 示例中，显示的响应只是 JSON 格式的完整产品列表。

    参数：
    user_input：用户的查询
    products_and_category：产品类型和对应产品的字典    
    """
    delimiter = "####"
    system_message = f"""
    您将提供客户服务查询。\
    客户服务查询将用{delimiter}字符分隔。
    输出一个 Python列表，列表中的每个对象都是 JSON 对象，每个对象的格式如下：
        'category': <Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \
    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders中的一个>,
    AND
        'products': <必须在下面允许的产品中找到的产品列表>
    不要输出任何不是 JSON 格式的额外文本。
    输出请求的 JSON 后，不要写任何解释性的文本。
    
    其中类别和产品必须在客户服务查询中找到。
    如果提到了一个产品，它必须与下面允许的产品列表中的正确类别关联。
    如果没有找到产品或类别，输出一个空列表。
    
    根据产品名称和产品类别与客户服务查询的相关性，列出所有相关的产品。
    不要从产品的名称中假设任何特性或属性，如相对质量或价格。
    
    允许的产品以 JSON 格式提供。
    每个项目的键代表类别。
    每个项目的值是该类别中的产品列表。
    允许的产品：{products_and_category}
    
    """
    
    few_shot_user_1 = """我想要最贵的电脑。你推荐哪款？"""
    few_shot_assistant_1 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    few_shot_user_2 = """我想要最便宜的电脑。你推荐哪款？"""
    few_shot_assistant_2 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    messages =  [  
    {'role':'system', 'content': system_message},    
    {'role':'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_1 },
    {'role':'user', 'content': f"{delimiter}{few_shot_user_2}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_2 },
    {'role':'user', 'content': f"{delimiter}{user_input}{delimiter}"},  
    ] 
    return get_completion_from_messages(messages)

六、在难测试用例上评估修改后的指令¶

In [12]:

customer_msg_3 = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs do you have?"""

products_by_category_3 = find_category_and_product_v2(customer_msg_3,
                                                      products_and_category)
print(products_by_category_3)

    [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}, {'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]

In [12]:

customer_msg_3 = f"""
告诉我关于smartx pro手机和fotosnap相机的信息，那款DSLR的。
另外，你们有哪些电视？"""

products_by_category_3 = find_category_and_product_v2(customer_msg_3,
                                                      products_and_category)
print(products_by_category_3)

    [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}, {'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]

七、回归测试：验证模型在以前的测试用例上仍然有效¶

检查并修复模型以提高难以测试的用例效果，同时确保此修正不会对先前的测试用例性能造成负面影响。

In [13]:

customer_msg_0 = f"""Which TV can I buy if I'm on a budget?"""

products_by_category_0 = find_category_and_product_v2(customer_msg_0,
                                                      products_and_category)
print(products_by_category_0)

    [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]

In [13]:

customer_msg_0 = f"""如果我预算有限，我可以买哪款电视？"""

products_by_category_0 = find_category_and_product_v2(customer_msg_0,
                                                      products_and_category)
print(products_by_category_0)

 

    [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]
    
    如果您的预算有限，我们建议您购买CineView 4K电视或SoundMax家庭影院。

八、收集开发集进行自动化测试¶

当您要调整的开发集不仅仅是一小部分示例时，开始自动化测试过程就变得有用了。

In [14]:

msg_ideal_pairs_set = [
    
    # eg 0
    {'customer_msg':"""Which TV can I buy if I'm on a budget?""",
     'ideal_answer':{
        'Televisions and Home Theater Systems':set(
            ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']
        )}
    },

    # eg 1
    {'customer_msg':"""I need a charger for my smartphone""",
     'ideal_answer':{
        'Smartphones and Accessories':set(
            ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']
        )}
    },
    # eg 2
    {'customer_msg':f"""What computers do you have?""",
     'ideal_answer':{
           'Computers and Laptops':set(
               ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook'
               ])
                }
    },

    # eg 3
    {'customer_msg':f"""tell me about the smartx pro phone and \
    the fotosnap camera, the dslr one.\
    Also, what TVs do you have?""",
     'ideal_answer':{
        'Smartphones and Accessories':set(
            ['SmartX ProPhone']),
        'Cameras and Camcorders':set(
            ['FotoSnap DSLR Camera']),
        'Televisions and Home Theater Systems':set(
            ['CineView 4K TV', 'SoundMax Home Theater','CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV'])
        }
    }, 
    
    # eg 4
    {'customer_msg':"""tell me about the CineView TV, the 8K one, Gamesphere console, the X one.
I'm on a budget, what computers do you have?""",
     'ideal_answer':{
        'Televisions and Home Theater Systems':set(
            ['CineView 8K TV']),
        'Gaming Consoles and Accessories':set(
            ['GameSphere X']),
        'Computers and Laptops':set(
            ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook'])
        }
    },
    
    # eg 5
    {'customer_msg':f"""What smartphones do you have?""",
     'ideal_answer':{
           'Smartphones and Accessories':set(
               ['SmartX ProPhone', 'MobiTech PowerCase', 'SmartX MiniPhone', 'MobiTech Wireless Charger', 'SmartX EarBuds'
               ])
                    }
    },
    # eg 6
    {'customer_msg':f"""I'm on a budget.  Can you recommend some smartphones to me?""",
     'ideal_answer':{
        'Smartphones and Accessories':set(
            ['SmartX EarBuds', 'SmartX MiniPhone', 'MobiTech PowerCase', 'SmartX ProPhone', 'MobiTech Wireless Charger']
        )}
    },

    # eg 7 # this will output a subset of the ideal answer
    {'customer_msg':f"""What Gaming consoles would be good for my friend who is into racing games?""",
     'ideal_answer':{
        'Gaming Consoles and Accessories':set([
            'GameSphere X',
            'ProGamer Controller',
            'GameSphere Y',
            'ProGamer Racing Wheel',
            'GameSphere VR Headset'
     ])}
    },
    # eg 8
    {'customer_msg':f"""What could be a good present for my videographer friend?""",
     'ideal_answer': {
        'Cameras and Camcorders':set([
        'FotoSnap DSLR Camera', 'ActionCam 4K', 'FotoSnap Mirrorless Camera', 'ZoomMaster Camcorder', 'FotoSnap Instant Camera'
        ])}
    },
    
    # eg 9
    {'customer_msg':f"""I would like a hot tub time machine.""",
     'ideal_answer': []
    }
    
]

九、通过与理想答案比较来评估测试用例¶

In [16]:

import json
def eval_response_with_ideal(response,
                              ideal,
                              debug=False):
    """
    评估回复是否与理想答案匹配
    
    参数：
    response: 回复的内容
    ideal: 理想的答案
    debug: 是否打印调试信息
    """
    if debug:
        print("回复：")
        print(response)
    
    # json.loads() 只能解析双引号，因此此处将单引号替换为双引号
    json_like_str = response.replace("'",'"')
    
    # 解析为一系列的字典
    l_of_d = json.loads(json_like_str)
    
    # 当响应为空，即没有找到任何商品时
    if l_of_d == [] and ideal == []:
        return 1
    
    # 另外一种异常情况是，标准答案数量与回复答案数量不匹配
    elif l_of_d == [] or ideal == []:
        return 0
    
    # 统计正确答案数量
    correct = 0    
    
    if debug:
        print("l_of_d is")
        print(l_of_d)

    # 对每一个问答对  
    for d in l_of_d:

        # 获取产品和目录
        cat = d.get('category')
        prod_l = d.get('products')
        # 有获取到产品和目录
        if cat and prod_l:
            # convert list to set for comparison
            prod_set = set(prod_l)
            # get ideal set of products
            ideal_cat = ideal.get(cat)
            if ideal_cat:
                prod_set_ideal = set(ideal.get(cat))
            else:
                if debug:
                    print(f"没有在标准答案中找到目录 {cat}")
                    print(f"标准答案: {ideal}")
                continue
                
            if debug:
                print("产品集合：\n",prod_set)
                print()
                print("标准答案的产品集合：\n",prod_set_ideal)

            # 查找到的产品集合和标准的产品集合一致
            if prod_set == prod_set_ideal:
                if debug:
                    print("正确")
                correct +=1
            else:
                print("错误")
                print(f"产品集合: {prod_set}")
                print(f"标准的产品集合: {prod_set_ideal}")
                if prod_set <= prod_set_ideal:
                    print("回答是标准答案的一个子集")
                elif prod_set >= prod_set_ideal:
                    print("回答是标准答案的一个超集")

    # 计算正确答案数
    pc_correct = correct / len(l_of_d)
        
    return pc_correct

In [16]:

print(f'用户提问: {msg_ideal_pairs_set[7]["customer_msg"]}')
print(f'标准答案: {msg_ideal_pairs_set[7]["ideal_answer"]}')

用户提问: What Gaming consoles would be good for my friend who is into racing games?
标准答案: {'Gaming Consoles and Accessories': {'GameSphere VR Headset', 'GameSphere X', 'ProGamer Controller', 'ProGamer Racing Wheel', 'GameSphere Y'}}

In [17]:

response = find_category_and_product_v2(msg_ideal_pairs_set[7]["customer_msg"],
                                         products_and_category)
print(f'回答: {response}')

eval_response_with_ideal(response,
                              msg_ideal_pairs_set[7]["ideal_answer"])

回答:     [{'category': 'Gaming Consoles and Accessories', 'products': ['ProGamer Controller', 'ProGamer Racing Wheel', 'GameSphere VR Headset']}]
错误
产品集合: {'ProGamer Racing Wheel', 'ProGamer Controller', 'GameSphere VR Headset'}
标准的产品集合: {'GameSphere VR Headset', 'GameSphere X', 'ProGamer Racing Wheel', 'ProGamer Controller', 'GameSphere Y'}
回答是标准答案的一个子集

Out[17]:

0.0

In [17]:

# 调用中文 Prompt
response = find_category_and_product_v2(msg_ideal_pairs_set[7]["customer_msg"],
                                         products_and_category)
print(f'回答: {response}')

eval_response_with_ideal(response,
                              msg_ideal_pairs_set[7]["ideal_answer"])

回答:     [{'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X', 'ProGamer Controller', 'GameSphere Y', 'ProGamer Racing Wheel', 'GameSphere VR Headset']}]

Out[17]:

0.0

十、在所有测试用例上运行评估，并计算正确的用例比例¶

注意：如果任何 API 调用超时，将无法运行

In [20]:

score_accum = 0
for i, pair in enumerate(msg_ideal_pairs_set):
    time.sleep(20)
    print(f"示例 {i}")
    
    customer_msg = pair['customer_msg']
    ideal = pair['ideal_answer']
    
    # print("Customer message",customer_msg)
    # print("ideal:",ideal)
    response = find_category_and_product_v2(customer_msg,
                                                      products_and_category)

    
    # print("products_by_category",products_by_category)
    score = eval_response_with_ideal(response,ideal,debug=False)
    print(f"{i}: {score}")
    score_accum += score
    

n_examples = len(msg_ideal_pairs_set)
fraction_correct = score_accum / n_examples
print(f"正确比例为 {n_examples}: {fraction_correct}")

示例 0
0: 1.0
示例 1
1: 1.0
示例 2
2: 1.0
示例 3
3: 1.0
示例 4
4: 1.0
示例 5
5: 1.0
示例 6
6: 1.0
示例 7
错误
产品集合: {'ProGamer Racing Wheel', 'ProGamer Controller', 'GameSphere VR Headset'}
标准的产品集合: {'GameSphere VR Headset', 'GameSphere X', 'ProGamer Racing Wheel', 'ProGamer Controller', 'GameSphere Y'}
回答是标准答案的一个子集
7: 0.0
示例 8
8: 1.0
示例 9
9: 1
正确比例为 10: 0.9

使用 Prompt 构建应用程序的工作流程与使用监督学习构建应用程序的工作流程非常不同。

因此，我们认为这是需要记住的一件好事，当您正在构建监督学习模型时，会感觉到迭代速度快了很多。

如果您并未亲身体验，可能会惊叹于仅有手动构建的极少样本，就可以产生高效的评估方法。您可能会认为，仅有 10 个样本是不具备统计意义的。但当您真正运用这种方式时，您可能会对向开发集中添加一些复杂样本所带来的效果提升感到惊讶。

这对于帮助您和您的团队找到有效的 Prompt 和有效的系统非常有帮助。

在本章中，输出可以被定量评估，就像有一个期望的输出一样，您可以判断它是否给出了这个期望的输出。

在下一章中，我们将探讨如何在更加模糊的情况下评估我们的输出。即正确答案可能不那么明确的情况。

学习资源站

19-必修二-使用 ChatGPT API 构建客服助手-评估（上）

第九章评估（上）——存在一个简单的正确答案时¶

一、环境配置¶

1.1 加载 API 密钥和一些 Python 库。¶

1.2 获取相关产品和类别¶

二、找出相关产品和类别名称¶

三、在一些查询上进行评估¶

四、更难的测试用例¶

五、修改指令以处理难测试用例¶

六、在难测试用例上评估修改后的指令¶

七、回归测试：验证模型在以前的测试用例上仍然有效¶

八、收集开发集进行自动化测试¶

九、通过与理想答案比较来评估测试用例¶

十、在所有测试用例上运行评估，并计算正确的用例比例¶

19-必修二-使用 ChatGPT API 构建客服助手-评估（上）

第九章 评估（上）——存在一个简单的正确答案时¶

一、环境配置¶

1.1 加载 API 密钥和一些 Python 库。¶

1.2 获取相关产品和类别¶

二、找出相关产品和类别名称¶

三、在一些查询上进行评估¶

四、更难的测试用例¶

五、修改指令以处理难测试用例¶

六、在难测试用例上评估修改后的指令¶

七、回归测试：验证模型在以前的测试用例上仍然有效¶

八、收集开发集进行自动化测试¶

九、通过与理想答案比较来评估测试用例¶

十、在所有测试用例上运行评估，并计算正确的用例比例¶

第九章评估（上）——存在一个简单的正确答案时¶