常见的应用场景是使用大语言模型来构建一个能够回答关于给定文档和文档集合的问答系统。对于给定的文档, 比如从PDF、网页、公司主页中提取构建的内部文档集合,我们可以使用大语言模型来回答关于这些文档内容的问题,以帮助用户更有效地获取和使用他们所需要的信息。这种方式非常有效且灵活地适用于实际应用场景,因为它不仅仅利用大语言模型已有的训练集数据信息,它还能使用外部信息。
这个过程,我们会涉及LongChain中的其他组件,比如:表征模型(Embedding Models)和向量储存(Vector Stores)
一、设置OpenAI API Key¶
import os
import openai
from dotenv import load_dotenv, find_dotenv
# 读取本地/项目的环境变量。
# find_dotenv()寻找并定位.env文件的路径
# load_dotenv()读取该.env文件,并将其中的环境变量加载到当前的运行环境中
# 如果你设置的是全局的环境变量,这行代码则没有任何作用。
_ = load_dotenv(find_dotenv())
# 获取环境变量 OPENAI_API_KEY
openai.api_key = os.environ['OPENAI_API_KEY']
二、 结合表征模型和向量存储¶
使用语言模型与文档结合使用时,语言模型一次只能使用几千个单词的信息。如果我们文档比较长,如何让语言模型回答关于其中所有内容的问题呢?我们通过几个通过向量表征(Embeddings)和向量存储(Vector Store)实现
文本表征(Embeddings)是对文本语义的向量表征,相似内容的文本具有相似的表征向量。这使我们可以在向量空间中比较文本的相似性。
向量数据库(Vector Database)用来存储文档的文本块。对于给定的文档,我们首先将其分成较小的文本块(chunks),然后获取每个小文本块的文本表征,并将这些表征储存在向量数据库中。这个流程正是创建索引(index)的过程。将文档分成小文本块的原因在于我们可能无法将整个文档传入语言模型进行处理。
索引创建完成后,我们可以使用索引来查找与传入的查询(Query)最相关的文本片段 - 首先我们为查询获取向量表征,然后我们将其与向量数据库中的所有向量进行比较并选择最相似的n个文本块,最后我们将这些相似的文本块构建提示,并输入到语言模型,从而得到最终答案。
2.1 导入数据¶
from langchain.chains.retrieval_qa.base import RetrievalQA #检索QA链,在文档上进行检索
from langchain_community.chat_models import ChatOpenAI #openai模型
from langchain.document_loaders import CSVLoader #文档加载器,采用csv格式存储
from langchain.vectorstores import DocArrayInMemorySearch #向量存储
from IPython.display import display, Markdown #在jupyter显示信息的工具
#创建一个文档加载器,通过csv格式加载
file = 'data/OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
docs = loader.load()
#查看单个文档,每个文档对应于CSV中的一行数据
docs[0]
Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'data/OutdoorClothingCatalog_1000.csv', 'row': 0})
2.2 文本向量表征模型¶
#使用OpenAIEmbedding类
from langchain_community.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model='text-embedding-3-small') #初始化
/Users/lta/anaconda3/envs/cookbook/lib/python3.10/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `OpenAIEmbeddings` was deprecated in LangChain 0.0.9 and will be removed in 0.3.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAIEmbeddings`. warn_deprecated(
#因为文档比较短了,所以这里不需要进行任何分块,可以直接进行向量表征
#使用初始化OpenAIEmbedding实例上的查询方法embed_query为文本创建向量表征
embed = embeddings.embed_query("Hi my name is Harrison")
#查看得到向量表征的长度
print(len(embed))
1536
#每个元素都是不同的数字值,组合起来就是文本的向量表征
print(embed[:5])
[0.005708176707077206, -0.005911823428831226, -0.06263797055840839, 0.022638179617308362, -0.05140398873287338]
2.3 基于向量表征创建向量存储¶
# 将刚才创建文本向量表征(embeddings)存储在向量存储(vector store)中
# 使用DocArrayInMemorySearch类的from_documents方法来实现
# 该方法接受文档列表以及向量表征模型作为输入
db = DocArrayInMemorySearch.from_documents(docs, embeddings)
/Users/lta/anaconda3/envs/cookbook/lib/python3.10/site-packages/pydantic/_migration.py:283: UserWarning: `pydantic.error_wrappers:ValidationError` has been moved to `pydantic:ValidationError`.
warnings.warn(f'`{import_path}` has been moved to `{new_location}`.')
import langchain
langchain.__version__
'0.2.3'
2.4 查询创建的向量存储¶
query = "Please suggest a shirt with sunblocking"
#使用上面的向量存储来查找与传入查询类似的文本,得到一个相似文档列表
docs = db.similarity_search(query)
#打印返回文档的个数
len(docs)
4
#打印返回的第一个文档
docs[0]
Document(page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.', metadata={'source': 'data/OutdoorClothingCatalog_1000.csv', 'row': 255})
我们可以看到一个返回了四个结果。输出的第一结果是一件关于防晒的衬衫,满足我们查询的要求:请推荐一件防晒功能的衬衫(Please suggest a shirt with sunblocking)
2.5 使用向量储存回答文档的相关问题¶
#基于向量储存,创建检索器
retriever = db.as_retriever()
#导入大语言模型, 这里使用默认模型gpt-3.5-turbo会出现504服务器超时,因此使用gpt-3.5-turbo-0301
llm = ChatOpenAI(model_name="gpt-3.5-turbo-0301",temperature = 0.0)
/Users/lta/anaconda3/envs/cookbook/lib/python3.10/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `ChatOpenAI` was deprecated in LangChain 0.0.10 and will be removed in 0.3.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import ChatOpenAI`. warn_deprecated(
#合并获得的相似文档内容
qdocs = "".join([docs[i].page_content for i in range(len(docs))])
#将合并的相似文档内容后加上问题(question)输入到 `llm.call_as_llm`中
#这里问题是:以Markdown表格的方式列出所有具有防晒功能的衬衫并总结
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.")
/Users/lta/anaconda3/envs/cookbook/lib/python3.10/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The method `BaseChatModel.call_as_llm` was deprecated in langchain-core 0.1.7 and will be removed in 0.3.0. Use invoke instead. warn_deprecated(
display(Markdown(response))
| Shirt Name | Description |
|---|---|
| Sun Shield Shirt | A high-performance sun shirt that provides UPF 50+ sun protection, blocks 98% of the sun's harmful rays, and is made of 78% nylon and 22% Lycra Xtra Life fiber. It wicks moisture for quick-drying comfort, fits comfortably over swimsuits, and is abrasion-resistant. |
| Men's Tropical Plaid Short-Sleeve Shirt | The lightest hot-weather shirt that is rated UPF 50+ for superior protection from the sun's UV rays. It is made of 100% polyester, is wrinkle-resistant, and has front and back cape venting that lets in cool breezes and two front bellows pockets. |
| Men's Plaid Tropic Shirt, Short-Sleeve | An ultracomfortable sun protection shirt that is rated UPF 50+ and made with 52% polyester and 48% nylon. It offers UPF 50+ coverage, blocks 98% of the sun's harmful UV rays, and is wrinkle-free and quickly evaporates perspiration. It also has front and back cape venting and two front bellows pockets. |
| Tropical Breeze Shirt | A lightweight, breathable long-sleeve men’s UPF shirt that offers superior SunSmart™ protection from the sun’s harmful rays. It is made of 71% nylon and 29% polyester, has UPF 50+ coverage, and is wrinkle-resistant and moisture-wicking. It also has front and back cape venting and two front bellows pockets. |
在此处打印响应,我们可以看到我们得到了一个表格,正如我们所要求的那样
'''
通过LangChain链封装起来
创建一个检索QA链,对检索到的文档进行问题回答,要创建这样的链,我们将传入几个不同的东西
1、语言模型,在最后进行文本生成
2、传入链类型,这里使用stuff,将所有文档塞入上下文并对语言模型进行一次调用
3、传入一个检索器
'''
qa_stuff = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
verbose=True
)
query = "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."#创建一个查询并在此查询上运行链
response = qa_stuff.run(query)
/Users/lta/anaconda3/envs/cookbook/lib/python3.10/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The method `Chain.run` was deprecated in langchain 0.1.0 and will be removed in 0.3.0. Use invoke instead. warn_deprecated(
> Entering new RetrievalQA chain... > Finished chain.
display(Markdown(response))#使用 display 和 markdown 显示它
| Shirt Name | Description |
|---|---|
| Sun Shield Shirt | High-performance sun shirt made of 78% nylon and 22% Lycra Xtra Life fiber. UPF 50+ rated. Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion-resistant. Provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays. |
| Men's Tropical Plaid Short-Sleeve Shirt | Lightest hot-weather shirt made of 100% polyester. Rated UPF 50+ for superior protection from the sun's UV rays. Traditional fit that is relaxed through the chest, sleeve, and waist. Wrinkle-resistant. Front and back cape venting that lets in cool breezes. Two front bellows pockets. Provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays. |
| Men's Plaid Tropic Shirt, Short-Sleeve | Ultracomfortable sun protection rated to UPF 50+. Made with 52% polyester and 48% nylon. Originally designed for fishing. Lightest hot-weather shirt that offers UPF 50+ coverage. SunSmart technology blocks 98% of the sun's harmful UV rays. High-performance fabric is wrinkle-free and quickly evaporates perspiration. Front and back cape venting. Two front bellows pockets. Provides UPF 50+ coverage, limiting sun exposure and providing the highest rated sun protection available. |
| Tropical Breeze Shirt | Lightweight, breathable long-sleeve men’s UPF shirt. Made of 71% nylon and 29% polyester. UPF 50+ rated. Wrinkle-resistant and moisture-wicking fabric keeps you cool and comfortable. Traditional fit that is relaxed through the chest, sleeve, and waist. Originally designed for fishing. Innovative SunSmart technology blocks 98% of the sun's harmful UV rays. Front and back cape venting lets in cool breezes. Two front bellows pockets. |
Summary:
- Sun Shield Shirt: High-performance sun shirt made of nylon and Lycra Xtra Life fiber. Provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays.
- Men's Tropical Plaid Short-Sleeve Shirt: Lightest hot-weather shirt made of polyester. Rated UPF 50+. Provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays.
- Men's Plaid Tropic Shirt, Short-Sleeve: Ultracomfortable sun protection rated to UPF 50+. Made with polyester and nylon. Provides UPF 50+ coverage, limiting sun exposure and providing the highest rated sun protection available.
- Tropical Breeze Shirt: Lightweight, breathable long-sleeve men’s UPF shirt made of nylon and polyester. Provides UPF 50+ coverage. Innovative SunSmart technology blocks 98% of the sun's harmful UV rays.
这两个方式返回相同的结果
想在许多不同类型的块上执行相同类型的问答,该怎么办?之前的实验中只返回了4个文档,如果有多个文档,那么我们可以使用几种不同的方法
- Map Reduce
将所有块与问题一起传递给语言模型,获取回复,使用另一个语言模型调用将所有单独的回复总结成最终答案,它可以在任意数量的文档上运行。可以并行处理单个问题,同时也需要更多的调用。它将所有文档视为独立的 - Refine
用于循环许多文档,实际上它是用迭代实现的,它建立在先前文档的答案之上,非常适合用于合并信息并随时间逐步构建答案,由于依赖于先前调用的结果,因此它通常需要更长的时间,并且基本上需要与Map Reduce一样多的调用 - Map Re-rank
对每个文档进行单个语言模型调用,要求它返回一个分数,选择最高分,这依赖于语言模型知道分数应该是什么,需要告诉它,如果它与文档相关,则应该是高分,并在那里精细调整说明,可以批量处理它们相对较快,但是更加昂贵 - Stuff
将所有内容组合成一个文档