第五章检索(Retrieval)¶

一、向量数据库检索
二、结合各种技术
三、其他类型的检索

检索是我们的检索增强生成(RAG)流程的核心。

让我们获得前面课程存储的向量数据库(VectorDB)。

一、向量数据库检索¶

在当前文件夹下新建.env文件，内容为OPENAI_API_KEY = "sk-..."

本章节需要使用lark包，运行以下命令安装

In [1]:

import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

1.1 相似性搜索¶

In [2]:

from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/cs229_lectures/'
persist_directory_chinese = 'docs/chroma/matplotlib/'

将上节课所保存的向量数据库(VectorDB)加载进来

In [3]:

embedding = OpenAIEmbeddings(model='text-embedding-3-small')

/Users/lta/anaconda3/envs/chat_data/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `OpenAIEmbeddings` was deprecated in LangChain 0.0.9 and will be removed in 0.3.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAIEmbeddings`.
  warn_deprecated(

In [4]:

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [5]:

print(vectordb._collection.count())

In [6]:

vectordb_chinese = Chroma(
    persist_directory=persist_directory_chinese,
    embedding_function=embedding
)

In [7]:

print(vectordb_chinese._collection.count())

让我们现在来看看最大边际相关性的例子。因此，我们将从下面示例中加载有关蘑菇的信息。

让我们现在运行它与MMR。让我们传入k等于2。我们仍然希望返回两个文档，但让我们设置获取k等于3，其中我们最初获取所有三个文档。我们现在可以看到，我们检索的文档中返回了有毒的信息。

In [8]:

texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [9]:

texts_chinese = [
    """毒鹅膏菌（Amanita phalloides）具有大型且引人注目的地上（epigeous）子实体（basidiocarp）""",
    """一种具有大型子实体的蘑菇是毒鹅膏菌（Amanita phalloides）。某些品种全白。""",
    """A. phalloides，又名死亡帽，是已知所有蘑菇中最有毒的一种。""",
]

对于这个例子，我们将创建一个小数据库，我们可以作为一个示例来使用。

In [10]:

smalldb = Chroma.from_texts(texts, embedding=embedding)

In [11]:

smalldb_chinese = Chroma.from_texts(texts_chinese, embedding=embedding)

下面是我们对于这个示例所提出的问题

In [12]:

question = "Tell me about all-white mushrooms with large fruiting bodies"

In [13]:

question_chinese = "告诉我关于具有大型子实体的全白色蘑菇的信息"

现在，我们可以运行一个相似性搜索，设置k=2，只返回两个最相关的文档。

我们可以看到，没有提到它是有毒的事实。

In [14]:

smalldb.similarity_search(question, k=2)

Out[14]:

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [15]:

smalldb_chinese.similarity_search(question_chinese, k=2)

Out[15]:

[Document(page_content='一种具有大型子实体的蘑菇是毒鹅膏菌（Amanita phalloides）。某些品种全白。'),
 Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.')]

现在，让我们运行最大边际相关性(MMR)。

设置k=2，因为我们仍然希望返回两个文档。设置fetch_k=3，fetch_k是我们最初获取的所有文档(3个)。

In [16]:

smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3)

Out[16]:

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='一种具有大型子实体的蘑菇是毒鹅膏菌（Amanita phalloides）。某些品种全白。')]

In [17]:

smalldb_chinese.max_marginal_relevance_search(question,k=2, fetch_k=3)

Out[17]:

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='一种具有大型子实体的蘑菇是毒鹅膏菌（Amanita phalloides）。某些品种全白。')]

我们现在可以看到，我们检索的文档中返回了有毒的信息。

1.2 解决多样性：最大边际相关性(MMR)¶

我们刚刚通过一个示例引出了一个问题：如何加强搜索结果的多样性。

最大边际相关性(Maximum marginal relevance)试图在查询的相关性和结果的多样性之间实现两全其美。

让我们回到上一节课的一个例子，当我们通过问题对向量数据库进行相似性搜索后

我们可以看看前两个文档，只看前几个字符，可以看到它们是相同的。

In [18]:

question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)

In [19]:

docs_ss[0].page_content[:100]

Out[19]:

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [20]:

docs_ss[1].page_content[:100]

Out[20]:

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [21]:

question_chinese = "Matplotlib是什么？"
docs_ss_chinese = vectordb_chinese.similarity_search(question_chinese,k=3)

In [22]:

docs_ss_chinese[0].page_content[:100]

Out[22]:

'第⼀回：Matplotlib 初相识\n⼀、认识matplotlib\nMatplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种'

In [23]:

docs_ss_chinese[1].page_content[:100]

Out[23]:

'第⼀回：Matplotlib 初相识\n⼀、认识matplotlib\nMatplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种'

注意：使用MMR所得出结果的差异。

In [24]:

docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [25]:

docs_mmr_chinese = vectordb_chinese.max_marginal_relevance_search(question_chinese,k=3)

当我们运行MMR后得到结果时，我们可以看到第一个与之前的相同，因为那是最相似的。

In [26]:

docs_mmr[0].page_content[:100]

Out[26]:

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [27]:

docs_mmr_chinese[0].page_content[:100]

Out[27]:

'第⼀回：Matplotlib 初相识\n⼀、认识matplotlib\nMatplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种'

但是当我们进行到第二个时，我们可以看到它是不同的。

它在回应中获得了一些多样性。

In [28]:

docs_mmr[1].page_content[:100]

Out[28]:

"mathematical work, he feels like he's disc overing truth and beauty in the universe. And \nhe says it"

In [29]:

docs_mmr_chinese[1].page_content[:100]

Out[29]:

'fig, ax = plt.subplots(figsize=(4,3))\nax.grid(True)\n使⽤ set_xscale 可以设置坐标轴的规度（指对数坐标等）\nfig, axs = plt.'

1.3 解决特殊性：使用元数据¶

在上一节课中，我们展示了一个问题，是询问了关于文档中某一讲的问题，但得到的结果中也包括了来自其他讲的结果。

为了解决这一问题，很多向量数据库都支持对metadata的操作。

metadata为每个嵌入的块(embedded chunk)提供上下文。

In [30]:

question = "what did they say about regression in the third lecture?"

In [31]:

question_chinese = "他们在第二讲中对Figure说了些什么？"

现在，我们以手动的方式来解决这个问题，我们会指定一个元数据过滤器filter

In [32]:

docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

In [33]:

docs_chinese = vectordb_chinese.similarity_search(
    question_chinese,
    k=3,
    filter={"source":"docs/matplotlib/第二回：艺术画笔见乾坤.pdf"}
)

接下来，我们可以看到结果都来自对应的章节

In [34]:

for d in docs:
    print(d.metadata)

{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 14, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 6, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}

In [35]:

for d in docs_chinese:
    print(d.metadata)

{'page': 9, 'source': 'docs/matplotlib/第二回：艺术画笔见乾坤.pdf'}
{'page': 0, 'source': 'docs/matplotlib/第二回：艺术画笔见乾坤.pdf'}
{'page': 3, 'source': 'docs/matplotlib/第二回：艺术画笔见乾坤.pdf'}

当然，我们不能每次都采用手动的方式来解决这个问题，这会显得不够智能

下一小节将要展示通过LLM来解决这个问题

1.4 解决特殊性：在元数据中使用自查询检索器¶

我们有一个有趣的挑战：我们通常希望从查询本身来推断元数据。

为了解决这个问题，我们可以使用SelfQueryRetriever，它使用LLM来提取：

用于向量搜索的查询(query)字符串，即：问题
要一起传入的元数据过滤器

大多数向量数据库支持元数据过滤器，因此不需要任何新的数据库及索引。

In [36]:

from langchain_community.llms.openai import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [37]:

llm = OpenAI(temperature=0)

/Users/lta/anaconda3/envs/chat_data/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `OpenAI` was deprecated in LangChain 0.0.10 and will be removed in 0.3.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAI`.
  warn_deprecated(

AttributeInfo是我们可以指定元数据中的不同字段以及它们对应的位置。

在元数据中，我们只有两个字段，source和page。

我们将填写每个属性的名称、描述和类型的描述。

这些信息实际上将被传递给LLM，所以需要尽可能详细地描述。

In [38]:

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [39]:

metadata_field_info_chinese = [
    AttributeInfo(
        name="source",
        description="讲义来源于 `docs/matplotlib/第一回：Matplotlib初相识.pdf`, `docs/matplotlib/第二回：艺术画笔见乾坤.pdf`, or `docs/matplotlib/第三回：布局格式定方圆.pdf` 的其中之一",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="讲义的那一页",
        type="integer",
    ),
]

In [40]:

document_content_description = "Lecture notes"
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [41]:

document_content_description_chinese = "课堂讲义"
retriever_chinese = SelfQueryRetriever.from_llm(
    llm,
    vectordb_chinese,
    document_content_description_chinese,
    metadata_field_info_chinese,
    verbose=True
)

In [42]:

question = "what did they say about regression in the third lecture?"

In [43]:

question_chinese = "他们在第二讲中对Figure说了些什么？"

In [44]:

docs = retriever.get_relevant_documents(question)

/Users/lta/anaconda3/envs/chat_data/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The method `BaseRetriever.get_relevant_documents` was deprecated in langchain-core 0.1.46 and will be removed in 0.3.0. Use invoke instead.
  warn_deprecated(

In [45]:

docs_chinese = retriever_chinese.get_relevant_documents(question_chinese)

In [46]:

for d in docs:
    print(d.metadata)

{'page': 10, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 11, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 6, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}

In [47]:

for d in docs_chinese:
    print(d.metadata)

1.5 其他技巧：压缩¶

另一种提高检索到的文档质量的方法是压缩。

与查询最相关的信息可能隐藏在具有大量不相关文本的文档中。

在应用程序中传递完整的文档可能会导致更昂贵的LLM调用和更差的响应。

上下文压缩就是为了解决这个问题。

In [48]:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [49]:

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [50]:

llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)  # 压缩器

In [51]:

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [52]:

compression_retriever_chinese = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb_chinese.as_retriever()
)

In [53]:

question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- "those homeworks will be done in either MATLA B or in Octave"
- "I know some people call it a free ve rsion of MATLAB"
- "MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data."
- "there's also a software package called Octave that you can download for free off the Internet."
- "it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything."
- "once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course."
----------------------------------------------------------------------------------------------------
Document 2:

- "those homeworks will be done in either MATLA B or in Octave"
- "I know some people call it a free ve rsion of MATLAB"
- "MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data."
- "there's also a software package called Octave that you can download for free off the Internet."
- "it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything."
- "once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course."
----------------------------------------------------------------------------------------------------
Document 3:

"Oh, it was the MATLAB."
----------------------------------------------------------------------------------------------------
Document 4:

"Oh, it was the MATLAB."

In [54]:

question_chinese = "Matplotlib是什么？"
compressed_docs_chinese = compression_retriever_chinese.get_relevant_documents(question_chinese)
pretty_print_docs(compressed_docs_chinese)

Document 1:

Matplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种静态，动态，交互式的图表。
----------------------------------------------------------------------------------------------------
Document 2:

Matplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种静态，动态，交互式的图表。
----------------------------------------------------------------------------------------------------
Document 3:

Axes：matplotlib 宇宙的核⼼，容纳了⼤量元素⽤来构造⼀幅幅⼦图，⼀个 figure 可以由⼀个或多个⼦图组成
Axis：axes的下属层级，⽤于处理所有和坐标轴，⽹格有关的元素
Tick：axis的下属层级，⽤来处理所有和刻度有关的元素
四、两种绘图接⼝
matplotlib 提供了两种最常⽤的绘图接⼝
----------------------------------------------------------------------------------------------------
Document 4:

Axes：matplotlib 宇宙的核⼼，容纳了⼤量元素⽤来构造⼀幅幅⼦图，⼀个 figure 可以由⼀个或多个⼦图组成
Axis：axes的下属层级，⽤于处理所有和坐标轴，⽹格有关的元素
Tick：axis的下属层级，⽤来处理所有和刻度有关的元素
四、两种绘图接⼝
matplotlib 提供了两种最常⽤的绘图接⼝

现在当我们提出问题后，查看结果文档

我们可以看到两件事。

它们比正常文档短很多
仍然有一些重复的东西，这是因为在底层我们使用的是语义搜索算法。

这就是我们在本课程前面使用MMR解决的问题。

这是一个很好的例子，你可以结合各种技术得到最好的可能结果。

二、结合各种技术¶

为了做到这一点，我们在从向量数据库创建检索器时，可以将搜索类型设置为MMR。

然后我们可以重新运行这个过程，看到我们返回的是一个过滤过的结果集，其中不包含任何重复的信息。

In [55]:

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [56]:

compression_retriever_chinese = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb_chinese.as_retriever(search_type = "mmr")
)

In [57]:

question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- "those homeworks will be done in either MATLA B or in Octave"
- "I know some people call it a free ve rsion of MATLAB"
- "MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data."
- "there's also a software package called Octave that you can download for free off the Internet."
- "it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything."
- "once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course."
----------------------------------------------------------------------------------------------------
Document 2:

"Oh, it was the MATLAB."

In [58]:

question_chinese = "Matplotlib是什么？"
compressed_docs_chinese = compression_retriever_chinese.get_relevant_documents(question_chinese)
pretty_print_docs(compressed_docs_chinese)

Document 1:

Matplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种静态，动态，交互式的图表。
----------------------------------------------------------------------------------------------------
Document 2:

Matplotlib是什么？
fig, ax = plt.subplots(figsize=(4,3))
ax.grid(True)
使⽤ set_xscale 可以设置坐标轴的规度（指对数坐标等）
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
for j in range(2):
    axs[j].plot(list('abcd'), [10**i for i in range(4)])
    if j==0:
        axs[j].set_yscale('log')
    else:
        pass
fig.tight_layout()
----------------------------------------------------------------------------------------------------
Document 3:

matplotlib.pyplot.pie(x, explode=None, labels=None, colors=None, autopct=None, pctdistance=0.6, shadow=False,
labeldistance=1.1, startangle=0, radius=1, counterclock=True, wedgeprops=None, textprops=None, center=0, 0,
frame=False, rotatelabels=False, *, normalize=None, data=None)
制作数据x 的饼图，每个楔⼦的⾯积⽤ x/sum(x) 表示。
其中最主要的参数是前 4 个：
x：楔型的形状，⼀维数组。
explode：如果不是等于 None ，则是⼀个 len(x) 数组，它指定⽤于偏移每个楔形块的半径的分数。
labels：⽤于指定每个楔型块的标记，取值是列表或为 None 。
colors：饼图循环使⽤的颜⾊序列。如果取值为 None ，将使⽤当前
----------------------------------------------------------------------------------------------------
Document 4:

Matplotlib是什么？

三、其他类型的检索¶

值得注意的是，vetordb并不是唯一一种检索文档的工具。

LangChain检索器抽象包括其他检索文档的方式，如：TF-IDF 或 SVM。

In [59]:

from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [60]:

# 加载PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# 分割文本
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)

In [61]:

# 加载PDF
loader_chinese = PyPDFLoader("docs/matplotlib/第一回：Matplotlib初相识.pdf")
pages_chinese = loader_chinese.load()
all_page_text_chinese = [p.page_content for p in pages_chinese]
joined_page_text_chinese = " ".join(all_page_text_chinese)

# 分割文本
text_splitter_chinese = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits_chinese = text_splitter_chinese.split_text(joined_page_text_chinese)

In [62]:

# 检索
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [63]:

question = "What are major topics for this class?"  # 这门课的主要主题是什么？
docs_svm = svm_retriever.get_relevant_documents(question)
docs_svm[0]

Out[63]:

Document(page_content="or find project partners or discuss homework problems and so on, and it's not monitored \nby the TAs and me. So feel free to ta lk trash about this class there.  \nIf you want to contact the teaching staff, pl ease use the email address written down here, \ncs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework probl ems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me  appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thi ng that I think will help you to succeed and \ndo well in this class and even help you to enjoy this cla ss more is if you form a study \ngroup.  \nSo start looking around where you' re sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to form study groups \nand sort of have a group of people to study with and have a group of your fellow students \nto talk over these concepts with. You can also  post on the class news group if you want to \nuse that to try to form a study group.")

In [64]:

question = "what did they say about matlab?"  # 他们关于Matlab说了些什么？
docs_tfidf = tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Out[64]:

Document(page_content="Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a \npicture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and \ngroup the picture into regions. Let me actually blow that up so that you can see it more \nclearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, \ngrouping the image into [inaudible] regions.  \nAnd what Ashutosh and Min did was they then  applied the learning algorithm to say can \nwe take this clustering and us e it to build a 3D model of the world? And so using the \nclustering, they then had a lear ning algorithm try to learn what the 3D structure of the \nworld looks like so that they could come up with a 3D model that you can sort of fly \nthrough, okay? Although many people used to th ink it's not possible to take a single \nimage and build a 3D model, but using a lear ning algorithm and that sort of clustering \nalgorithm is the first step. They were able to.  \nI'll just show you one more example. I like this  because it's a picture of Stanford with our \nbeautiful Stanford campus. So again, taking th e same sort of clustering algorithms, taking \nthe same sort of unsupervised learning algor ithm, you can group the pixels into different \nregions. And using that as a pre-processing step, they eventually built this sort of 3D model of Stanford campus in a single picture.  You can sort of walk  into the ceiling, look")

In [65]:

# 中文版
svm_retriever_chinese = SVMRetriever.from_texts(splits_chinese, embedding)
tfidf_retriever_chinese = TFIDFRetriever.from_texts(splits_chinese)

In [66]:

question_chinese = "这门课的主要主题是什么？" 
docs_svm_chinese = svm_retriever_chinese.get_relevant_documents(question_chinese)
docs_svm_chinese[0]

Out[66]:

Document(page_content='第⼀回：Matplotlib 初相识\n⼀、认识matplotlib\nMatplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种静态，动态，\n交互式的图表。\nMatplotlib 可⽤于 Python 脚本， Python 和 IPython Shell 、 Jupyter notebook ， Web 应⽤程序服务器和各种图形⽤户界⾯⼯具包等。\nMatplotlib 是 Python 数据可视化库中的泰⽃，它已经成为 python 中公认的数据可视化⼯具，我们所熟知的 pandas 和 seaborn 的绘图接⼝\n其实也是基于 matplotlib 所作的⾼级封装。\n为了对matplotlib 有更好的理解，让我们从⼀些最基本的概念开始认识它，再逐渐过渡到⼀些⾼级技巧中。\n⼆、⼀个最简单的绘图例⼦\nMatplotlib 的图像是画在 figure （如 windows ， jupyter 窗体）上的，每⼀个 figure ⼜包含了⼀个或多个 axes （⼀个可以指定坐标系的⼦区\n域）。最简单的创建 figure 以及 axes 的⽅式是通过 pyplot.subplots命令，创建 axes 以后，可以使⽤ Axes.plot绘制最简易的折线图。\nimport matplotlib.pyplot as plt\nimport matplotlib as mpl\nimport numpy as np\nfig, ax = plt.subplots()  # 创建⼀个包含⼀个 axes 的 figure\nax.plot([1, 2, 3, 4], [1, 4, 2, 3]);  # 绘制图像\nTrick： 在jupyter notebook 中使⽤ matplotlib 时会发现，代码运⾏后⾃动打印出类似 <matplotlib.lines.Line2D at 0x23155916dc0>\n这样⼀段话，这是因为 matplotlib 的绘图代码默认打印出最后⼀个对象。如果不想显示这句话，有以下三种⽅法，在本章节的代码示例\n中你能找到这三种⽅法的使⽤。\n\x00. 在代码块最后加⼀个分号 ;\n\x00. 在代码块最后加⼀句 plt.show()\n\x00. 在绘图时将绘图对象显式赋值给⼀个变量，如将 plt.plot([1, 2, 3, 4]) 改成 line =plt.plot([1, 2, 3, 4])\n和MATLAB 命令类似，你还可以通过⼀种更简单的⽅式绘制图像， matplotlib.pyplot⽅法能够直接在当前 axes 上绘制图像，如果⽤户\n未指定axes ， matplotlib 会帮你⾃动创建⼀个。所以上⾯的例⼦也可以简化为以下这⼀⾏代码。\nline =plt.plot([1, 2, 3, 4], [1, 4, 2, 3]) \n三、Figure 的组成\n现在我们来深⼊看⼀下 figure 的组成。通过⼀张 figure 解剖图，我们可以看到⼀个完整的 matplotlib 图像通常会包括以下四个层级，这些\n层级也被称为容器（ container ），下⼀节会详细介绍。在 matplotlib 的世界中，我们将通过各种命令⽅法来操纵图像中的每⼀个部分，\n从⽽达到数据可视化的最终效果，⼀副完整的图像实际上是各类⼦元素的集合。\nFigure：顶层级，⽤来容纳所有绘图元素')

In [67]:

question_chinese = "Matplotlib是什么？"
docs_tfidf_chinese = tfidf_retriever_chinese.get_relevant_documents(question_chinese)
docs_tfidf_chinese[0]

Out[67]:

Document(page_content='fig, ax = plt.subplots()  \n# step4 绘制图像，  这⼀模块的扩展参考第⼆章进⼀步学习\nax.plot(x, y, label=\'linear\')  \n# step5 添加标签，⽂字和图例，这⼀模块的扩展参考第四章进⼀步学习\nax.set_xlabel(\'x label\') \nax.set_ylabel(\'y label\') \nax.set_title("Simple Plot")  \nax.legend() ;\n思考题\n请思考两种绘图模式的优缺点和各⾃适合的使⽤场景\n在第五节绘图模板中我们是以 OO 模式作为例⼦展示的，请思考并写⼀个 pyplot 绘图模式的简单模板')

学习资源站

34-必修4-自有知识库RAG向量检索和问答-检索

第五章检索(Retrieval)¶

一、向量数据库检索¶

1.1 相似性搜索¶

1.2 解决多样性：最大边际相关性(MMR)¶

1.3 解决特殊性：使用元数据¶

1.4 解决特殊性：在元数据中使用自查询检索器¶

1.5 其他技巧：压缩¶

二、结合各种技术¶

三、其他类型的检索¶

34-必修4-自有知识库RAG向量检索和问答-检索

第五章 检索(Retrieval)¶

一、向量数据库检索¶

1.1 相似性搜索¶

1.2 解决多样性：最大边际相关性(MMR)¶

1.3 解决特殊性：使用元数据¶

1.4 解决特殊性：在元数据中使用自查询检索器¶

1.5 其他技巧：压缩¶

二、结合各种技术¶

三、其他类型的检索¶

第五章检索(Retrieval)¶