Chapter 4: Vectorstores and Embeddings

1. Environment Setup
2. Loading Documents
3. Embeddings
4. Vectorstores
- 4.1 Initializing Chroma
- 4.2 Similarity Search
5. Failure Modes

Let's first recall the overall workflow of retrieval-augmented generation (RAG).

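1. Environment Setup

Create a .env file in the current folder with the content OPENAI_API_KEY = "sk-..."

This chapter uses PyPDFLoader and Chroma, so the pypdf and chromadb packages need to be installed.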

In [1]:

import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read the local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']
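
If the key is missing, os.environ['OPENAI_API_KEY'] raises a bare KeyError. A minimal sketch of a friendlier check follows; the helper name require_env and the DEMO_API_KEY variable are hypothetical and exist only so the example runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


# Demonstration with a placeholder variable (hypothetical, not a real key).
os.environ["DEMO_API_KEY"] = "sk-placeholder"
demo_key = require_env("DEMO_API_KEY")
```

In the notebook, the same helper could be called with 'OPENAI_API_KEY' after load_dotenv has run.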

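In the previous two lessons we discussed Document Loading and Splitting.

Below we use what we learned there to load and split the documents.

2. Loading Documents

The following documents come from the course at https://see.stanford.edu/Course/CS229 ; the lecture materials can be downloaded from that site.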

In [2]:

from langchain.document_loaders import PyPDFLoader

# Load the PDFs
loaders = [
    # Deliberately add a duplicate document to make the data messy
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

The following documents come from the open-source matplotlib tutorial published by Datawhale at https://datawhalechina.github.io/fantastic-matplotlib/index.html ; the tutorial can be downloaded from that site.

In [3]:

from langchain.document_loaders import PyPDFLoader

# Load the PDFs
loaders_chinese = [
    # Deliberately add a duplicate document to make the data messy
    PyPDFLoader("docs/matplotlib/第一回：Matplotlib初相识.pdf"),
    PyPDFLoader("docs/matplotlib/第一回：Matplotlib初相识.pdf"),
    PyPDFLoader("docs/matplotlib/第二回：艺术画笔见乾坤.pdf"),
    PyPDFLoader("docs/matplotlib/第三回：布局格式定方圆.pdf")
]
docs_chinese = []
for loader in loaders_chinese:
    docs_chinese.extend(loader.load())

After the documents are loaded, we can use RecursiveCharacterTextSplitter to create chunks.

In [4]:

# Split the text
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,  # target maximum size of each chunk, in characters
    chunk_overlap = 150  # number of characters shared between consecutive chunks
)
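
To make chunk_size and chunk_overlap concrete, here is a deliberately simplified sketch: a fixed-size sliding window over a string. It is not how RecursiveCharacterTextSplitter works internally (that class splits on separators recursively); the function name chunk_with_overlap is hypothetical and only illustrates what the two parameters mean.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


demo_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk begins with the last character of the previous one, which is the overlap that helps keep context across chunk boundaries.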

In [5]:

splits = text_splitter.split_documents(docs)

In [6]:

len(splits)

Out[6]:

In [7]:

splits_chinese = text_splitter.split_documents(docs_chinese)

In [8]:

len(splits_chinese)

Out[8]:

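3. Embeddings

What are embeddings?

In machine learning and natural language processing (NLP), embeddings are a technique for converting categorical data, such as words, sentences, or entire documents, into vectors of real numbers. These vectors are easier for a computer to process. The main idea behind embeddings is that similar or related objects should lie close together in the embedding space.

For example, we can use word embeddings to represent text data. In a word embedding, each word is converted into a vector that captures the word's semantic information. For instance, "king" and "queen" will be very close to each other in the embedding space because their meanings are similar. "apple" and "orange" will also be close because both are fruits. "king" and "apple", on the other hand, will be farther apart because their meanings differ.

Let's take our splits and embed them.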

In [9]:

from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings(model='text-embedding-3-small')

/Users/lta/anaconda3/envs/chat_data/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `OpenAIEmbeddings` was deprecated in LangChain 0.0.9 and will be removed in 0.3.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAIEmbeddings`.
  warn_deprecated(

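Before working with the real document data, let's try a few test sentences to get a feel for embeddings.

Below are a few example sentences: the first two are very similar, and the third is unrelated. We can then use the embedding class to create an embedding for each sentence.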

In [10]:

sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [11]:

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [12]:

sentence1_chinese = "我喜欢狗"
sentence2_chinese = "我喜欢犬科动物"
sentence3_chinese = "外面的天气很糟糕"

In [13]:

embedding1_chinese = embedding.embed_query(sentence1_chinese)
embedding2_chinese = embedding.embed_query(sentence2_chinese)
embedding3_chinese = embedding.embed_query(sentence3_chinese)

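We can then use numpy to compare them and see which are most similar.

We expect the first two sentences to be very similar.

The first and second should then differ considerably from the third.

We will use the dot product to compare two embeddings.

If you don't know what a dot product is, that's fine. The important thing to know is that the higher the score, the more similar the sentences.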
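
For readers who want to see what the dot product computes, here is a minimal pure-Python sketch with toy 2-dimensional vectors (the vectors and helper names are made up for illustration). It also shows that for vectors of length 1 the dot product equals cosine similarity; OpenAI documents its embeddings as normalized to length 1, which is presumably why a plain dot product works as a similarity score here.

```python
import math


def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to length 1."""
    n = norm(a)
    return [x / n for x in a]


# Toy vectors, not real embeddings.
u = [3.0, 4.0]
v = [4.0, 3.0]

raw_dot = dot(u, v)                        # 3*4 + 4*3 = 24
cos_uv = cosine_similarity(u, v)           # 24 / (5 * 5) = 0.96
unit_dot = dot(normalize(u), normalize(v)) # equals cos_uv for unit vectors
```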

In [14]:

import numpy as np

In [15]:

np.dot(embedding1, embedding2)

Out[15]:

0.8338763861124505

In [16]:

np.dot(embedding1, embedding3)

Out[16]:

0.21898928790384764

In [17]:

np.dot(embedding2, embedding3)

Out[17]:

0.1850211777650424

We can see that the first two embeddings score quite high, about 0.83.

Comparing the first embedding with the third, the score is markedly lower, about 0.22.

Comparing the second embedding with the third, the score is about 0.19.

In [18]:

np.dot(embedding1_chinese, embedding2_chinese)

Out[18]:

0.7232575326539132

In [19]:

np.dot(embedding1_chinese, embedding3_chinese)

Out[19]:

0.18710999861622954

In [20]:

np.dot(embedding2_chinese, embedding3_chinese)

Out[20]:

0.13899606496112304

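For the Chinese sentences, the first two embeddings also score quite high, about 0.72.

Comparing the first embedding with the third, the score is markedly lower, about 0.19.

Comparing the second embedding with the third, the score is about 0.14.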

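4. Vectorstores

4.1 Initializing Chroma

LangChain integrates with more than 30 different vector stores. We chose Chroma because it is lightweight and keeps its data in memory, which makes it very easy to get up and running.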

In [21]:

from langchain.vectorstores import Chroma

In [22]:

persist_directory = 'docs/chroma/cs229_lectures/'

In [23]:

!rm -rf './docs/chroma/cs229_lectures'  # remove old database files, if any; on Windows, delete the folder manually

In [24]:

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory  # directory on disk where the database is saved
)

In [25]:

print(vectordb._collection.count())

In [26]:

persist_directory_chinese = 'docs/chroma/matplotlib/'

In [27]:

!rm -rf './docs/chroma/matplotlib'  # remove old database files, if any

In [28]:

vectordb_chinese = Chroma.from_documents(
    documents=splits_chinese,
    embedding=embedding,
    persist_directory=persist_directory_chinese  # directory on disk where the database is saved
)

In [29]:

print(vectordb_chinese._collection.count())

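We can see that the English collection contains 209 entries and the Chinese collection contains 30, the same as the number of splits we created earlier. Now let's start using them.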

4.2 Similarity Search

In [30]:

question = "is there an email i can ask for help"

In [31]:

docs = vectordb.similarity_search(question,k=3)

In [32]:

len(docs)

Out[32]:

In [33]:

docs[0].page_content

Out[33]:

"So all right, online resources. The class has a home page, so it's in on the handouts. I \nwon't write on the chalkboard — http:// cs229.stanford.edu. And so when there are \nhomework assignments or things like that, we  usually won't sort of — in the mission of \nsaving trees, we will usually not give out many handouts in class. So homework \nassignments, homework solutions will be posted online at the course home page.  \nAs far as this class, I've also written, a nd I guess I've also revised every year a set of \nfairly detailed lecture notes that cover the te chnical content of this  class. And so if you \nvisit the course homepage, you'll also find the detailed lecture notes that go over in detail \nall the math and equations and so on  that I'll be doing in class.  \nThere's also a newsgroup, su.class.cs229, also written on the handout. This is a \nnewsgroup that's sort of a forum for people in  the class to get to  know each other and \nhave whatever discussions you want to ha ve amongst yourselves. So the class newsgroup \nwill not be monitored by the TAs and me. But this is a place for you to form study groups \nor find project partners or discuss homework problems and so on, and it's not monitored \nby the TAs and me. So feel free to ta lk trash about this class there.  \nIf you want to contact the teaching staff, pl ease use the email address written down here, \ncs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So"

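Looking at the content of the first document, we can see that it is indeed about an email address, cs229-qa@cs.stanford.edu.

This is the address we can send questions to, and it is read by all the TAs.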
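
Conceptually, a similarity search embeds the question and returns the k stored chunks whose vectors are closest to it. The following is a brute-force sketch of that idea under stated assumptions: the 3-dimensional vectors and the names top_k and toy_index are invented for illustration, and a real vector store such as Chroma relies on an index rather than scanning every vector like this.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, items, k):
    """Return the texts of the k items most similar to query_vec.

    `items` is a list of (text, vector) pairs.
    """
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "index": hand-made vectors standing in for real embeddings.
toy_index = [
    ("contact email for the teaching staff", [0.9, 0.1, 0.0]),
    ("MATLAB and Octave for homework", [0.1, 0.9, 0.0]),
    ("locally weighted regression", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]  # pretend embedding of a question about an email

results = top_k(toy_query, toy_index, k=2)
```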

In [34]:

question_chinese = "Matplotlib是什么？"

In [35]:

docs_chinese = vectordb_chinese.similarity_search(question_chinese,k=3)

In [36]:

len(docs_chinese)

Out[36]:

In [37]:

docs_chinese[0].page_content

Out[37]:

'第⼀回：Matplotlib 初相识\n⼀、认识matplotlib\nMatplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种静态，动态，\n交互式的图表。\nMatplotlib 可⽤于 Python 脚本， Python 和 IPython Shell 、 Jupyter notebook ， Web 应⽤程序服务器和各种图形⽤户界⾯⼯具包等。\nMatplotlib 是 Python 数据可视化库中的泰⽃，它已经成为 python 中公认的数据可视化⼯具，我们所熟知的 pandas 和 seaborn 的绘图接⼝\n其实也是基于 matplotlib 所作的⾼级封装。\n为了对matplotlib 有更好的理解，让我们从⼀些最基本的概念开始认识它，再逐渐过渡到⼀些⾼级技巧中。\n⼆、⼀个最简单的绘图例⼦\nMatplotlib 的图像是画在 figure （如 windows ， jupyter 窗体）上的，每⼀个 figure ⼜包含了⼀个或多个 axes （⼀个可以指定坐标系的⼦区\n域）。最简单的创建 figure 以及 axes 的⽅式是通过 pyplot.subplots命令，创建 axes 以后，可以使⽤ Axes.plot绘制最简易的折线图。\nimport matplotlib.pyplot as plt\nimport matplotlib as mpl\nimport numpy as np\nfig, ax = plt.subplots()  # 创建⼀个包含⼀个 axes 的 figure\nax.plot([1, 2, 3, 4], [1, 4, 2, 3]);  # 绘制图像\nTrick： 在jupyter notebook 中使⽤ matplotlib 时会发现，代码运⾏后⾃动打印出类似 <matplotlib.lines.Line2D at 0x23155916dc0>\n这样⼀段话，这是因为 matplotlib 的绘图代码默认打印出最后⼀个对象。如果不想显示这句话，有以下三种⽅法，在本章节的代码示例\n中你能找到这三种⽅法的使⽤。\n\x00. 在代码块最后加⼀个分号 ;\n\x00. 在代码块最后加⼀句 plt.show()\n\x00. 在绘图时将绘图对象显式赋值给⼀个变量，如将 plt.plot([1, 2, 3, 4]) 改成 line =plt.plot([1, 2, 3, 4])\n和MATLAB 命令类似，你还可以通过⼀种更简单的⽅式绘制图像， matplotlib.pyplot⽅法能够直接在当前 axes 上绘制图像，如果⽤户\n未指定axes ， matplotlib 会帮你⾃动创建⼀个。所以上⾯的例⼦也可以简化为以下这⼀⾏代码。\nline =plt.plot([1, 2, 3, 4], [1, 4, 2, 3]) \n三、Figure 的组成\n现在我们来深⼊看⼀下 figure 的组成。通过⼀张 figure 解剖图，我们可以看到⼀个完整的 matplotlib 图像通常会包括以下四个层级，这些\n层级也被称为容器（ container ），下⼀节会详细介绍。在 matplotlib 的世界中，我们将通过各种命令⽅法来操纵图像中的每⼀个部分，\n从⽽达到数据可视化的最终效果，⼀副完整的图像实际上是各类⼦元素的集合。\nFigure：顶层级，⽤来容纳所有绘图元素'

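Looking at the content of the first document, we can see that it is indeed an introduction to Matplotlib.

After this, we call vectordb.persist to persist the vector database so that we can use it in later lessons. With Chroma 0.4.x and later, documents are persisted automatically, as the warning printed by the next cell notes, so on those versions the call is no longer needed.

Let's save it for later use!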

In [38]:

vectordb.persist()

/Users/lta/anaconda3/envs/chat_data/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: Since Chroma 0.4.x the manual persistence method is no longer supported as docs are automatically persisted.
  warn_deprecated(

In [39]:

vectordb_chinese.persist()

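5. Failure Modes

This looks good: basic similarity search easily gets you 80% of the way there.

However, there are cases where similarity search fails.

Here are some edge cases that can come up; we will fix them in the next lesson.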

In [40]:

question = "what did they say about matlab?"

In [41]:

docs = vectordb.similarity_search(question,k=5)

In [42]:

question_chinese = "Matplotlib是什么？"

In [43]:

docs_chinese = vectordb_chinese.similarity_search(question_chinese,k=5)

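Note that we get duplicate chunks, because the index contains MachineLearning-Lecture01.pdf and 第一回：Matplotlib初相识.pdf twice.

Semantic search retrieves all similar documents but does not enforce diversity.

docs[0] and docs[1] are identical, and so are docs_chinese[0] and docs_chinese[1].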
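
Exact duplicates like these could be dropped with a simple post-filter on the search results. This is a minimal sketch, assuming exact text matching is enough; Doc is a hypothetical stand-in for LangChain's Document so the example runs on its own, and this is not necessarily the approach taken in the next lesson.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_exact_duplicates(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Toy results: the first two entries are the same chunk indexed twice.
toy_results = [
    Doc("those homeworks will be done in either MATLAB or in Octave", {"page": 8}),
    Doc("those homeworks will be done in either MATLAB or in Octave", {"page": 8}),
    Doc("MATLAB makes it easy to write code using matrices", {"page": 9}),
]
unique_results = drop_exact_duplicates(toy_results)
```

Note that filtering after the search shrinks the result list below k; it removes identical chunks but does nothing to make the remaining results more diverse.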

In [44]:

docs[0]

Out[44]:

Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class, it will work for just about \neverything.  \nSo actually I, well, so yeah, just a side comment for those of you that haven\'t seen \nMATLAB before I guess, once a colleague of mine at a different university, not at \nStanford, actually teaches another machine l earning course. He\'s taught it for many years. \nSo one day, he was in his office, and an old student of his from, lik e, ten years ago came \ninto his office and he said, "Oh, professo r, professor, thank you so much for your', metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'})

In [45]:

docs[1]

Out[45]:

Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class, it will work for just about \neverything.  \nSo actually I, well, so yeah, just a side comment for those of you that haven\'t seen \nMATLAB before I guess, once a colleague of mine at a different university, not at \nStanford, actually teaches another machine l earning course. He\'s taught it for many years. \nSo one day, he was in his office, and an old student of his from, lik e, ten years ago came \ninto his office and he said, "Oh, professo r, professor, thank you so much for your', metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'})

In [46]:

docs_chinese[0]

Out[46]:

Document(page_content='第⼀回：Matplotlib 初相识\n⼀、认识matplotlib\nMatplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种静态，动态，\n交互式的图表。\nMatplotlib 可⽤于 Python 脚本， Python 和 IPython Shell 、 Jupyter notebook ， Web 应⽤程序服务器和各种图形⽤户界⾯⼯具包等。\nMatplotlib 是 Python 数据可视化库中的泰⽃，它已经成为 python 中公认的数据可视化⼯具，我们所熟知的 pandas 和 seaborn 的绘图接⼝\n其实也是基于 matplotlib 所作的⾼级封装。\n为了对matplotlib 有更好的理解，让我们从⼀些最基本的概念开始认识它，再逐渐过渡到⼀些⾼级技巧中。\n⼆、⼀个最简单的绘图例⼦\nMatplotlib 的图像是画在 figure （如 windows ， jupyter 窗体）上的，每⼀个 figure ⼜包含了⼀个或多个 axes （⼀个可以指定坐标系的⼦区\n域）。最简单的创建 figure 以及 axes 的⽅式是通过 pyplot.subplots命令，创建 axes 以后，可以使⽤ Axes.plot绘制最简易的折线图。\nimport matplotlib.pyplot as plt\nimport matplotlib as mpl\nimport numpy as np\nfig, ax = plt.subplots()  # 创建⼀个包含⼀个 axes 的 figure\nax.plot([1, 2, 3, 4], [1, 4, 2, 3]);  # 绘制图像\nTrick： 在jupyter notebook 中使⽤ matplotlib 时会发现，代码运⾏后⾃动打印出类似 <matplotlib.lines.Line2D at 0x23155916dc0>\n这样⼀段话，这是因为 matplotlib 的绘图代码默认打印出最后⼀个对象。如果不想显示这句话，有以下三种⽅法，在本章节的代码示例\n中你能找到这三种⽅法的使⽤。\n\x00. 在代码块最后加⼀个分号 ;\n\x00. 在代码块最后加⼀句 plt.show()\n\x00. 在绘图时将绘图对象显式赋值给⼀个变量，如将 plt.plot([1, 2, 3, 4]) 改成 line =plt.plot([1, 2, 3, 4])\n和MATLAB 命令类似，你还可以通过⼀种更简单的⽅式绘制图像， matplotlib.pyplot⽅法能够直接在当前 axes 上绘制图像，如果⽤户\n未指定axes ， matplotlib 会帮你⾃动创建⼀个。所以上⾯的例⼦也可以简化为以下这⼀⾏代码。\nline =plt.plot([1, 2, 3, 4], [1, 4, 2, 3]) \n三、Figure 的组成\n现在我们来深⼊看⼀下 figure 的组成。通过⼀张 figure 解剖图，我们可以看到⼀个完整的 matplotlib 图像通常会包括以下四个层级，这些\n层级也被称为容器（ container ），下⼀节会详细介绍。在 matplotlib 的世界中，我们将通过各种命令⽅法来操纵图像中的每⼀个部分，\n从⽽达到数据可视化的最终效果，⼀副完整的图像实际上是各类⼦元素的集合。\nFigure：顶层级，⽤来容纳所有绘图元素', metadata={'page': 0, 'source': 'docs/matplotlib/第一回：Matplotlib初相识.pdf'})

In [47]:

docs_chinese[1]

Out[47]:

Document(page_content='第⼀回：Matplotlib 初相识\n⼀、认识matplotlib\nMatplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种静态，动态，\n交互式的图表。\nMatplotlib 可⽤于 Python 脚本， Python 和 IPython Shell 、 Jupyter notebook ， Web 应⽤程序服务器和各种图形⽤户界⾯⼯具包等。\nMatplotlib 是 Python 数据可视化库中的泰⽃，它已经成为 python 中公认的数据可视化⼯具，我们所熟知的 pandas 和 seaborn 的绘图接⼝\n其实也是基于 matplotlib 所作的⾼级封装。\n为了对matplotlib 有更好的理解，让我们从⼀些最基本的概念开始认识它，再逐渐过渡到⼀些⾼级技巧中。\n⼆、⼀个最简单的绘图例⼦\nMatplotlib 的图像是画在 figure （如 windows ， jupyter 窗体）上的，每⼀个 figure ⼜包含了⼀个或多个 axes （⼀个可以指定坐标系的⼦区\n域）。最简单的创建 figure 以及 axes 的⽅式是通过 pyplot.subplots命令，创建 axes 以后，可以使⽤ Axes.plot绘制最简易的折线图。\nimport matplotlib.pyplot as plt\nimport matplotlib as mpl\nimport numpy as np\nfig, ax = plt.subplots()  # 创建⼀个包含⼀个 axes 的 figure\nax.plot([1, 2, 3, 4], [1, 4, 2, 3]);  # 绘制图像\nTrick： 在jupyter notebook 中使⽤ matplotlib 时会发现，代码运⾏后⾃动打印出类似 <matplotlib.lines.Line2D at 0x23155916dc0>\n这样⼀段话，这是因为 matplotlib 的绘图代码默认打印出最后⼀个对象。如果不想显示这句话，有以下三种⽅法，在本章节的代码示例\n中你能找到这三种⽅法的使⽤。\n\x00. 在代码块最后加⼀个分号 ;\n\x00. 在代码块最后加⼀句 plt.show()\n\x00. 在绘图时将绘图对象显式赋值给⼀个变量，如将 plt.plot([1, 2, 3, 4]) 改成 line =plt.plot([1, 2, 3, 4])\n和MATLAB 命令类似，你还可以通过⼀种更简单的⽅式绘制图像， matplotlib.pyplot⽅法能够直接在当前 axes 上绘制图像，如果⽤户\n未指定axes ， matplotlib 会帮你⾃动创建⼀个。所以上⾯的例⼦也可以简化为以下这⼀⾏代码。\nline =plt.plot([1, 2, 3, 4], [1, 4, 2, 3]) \n三、Figure 的组成\n现在我们来深⼊看⼀下 figure 的组成。通过⼀张 figure 解剖图，我们可以看到⼀个完整的 matplotlib 图像通常会包括以下四个层级，这些\n层级也被称为容器（ container ），下⼀节会详细介绍。在 matplotlib 的世界中，我们将通过各种命令⽅法来操纵图像中的每⼀个部分，\n从⽽达到数据可视化的最终效果，⼀副完整的图像实际上是各类⼦元素的集合。\nFigure：顶层级，⽤来容纳所有绘图元素', metadata={'page': 0, 'source': 'docs/matplotlib/第一回：Matplotlib初相识.pdf'})

We can see another failure mode.

The question below asks about the third lecture, but the results also include chunks from other lectures.

In [48]:

question = "what did they say about regression in the third lecture?"

In [49]:

docs = vectordb.similarity_search(question,k=5)

In [50]:

for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 14, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 6, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf'}
{'page': 2, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
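
The metadata printed above shows one hit from Lecture02 even though the question names the third lecture. As a rough sketch of one way to handle this, the results can be filtered on the source field; the helper filter_by_source is hypothetical, the list simply copies the metadata printed above, and filtering after the search is only an approximation of restricting the search itself.

```python
def filter_by_source(metadatas, source):
    """Keep only the metadata entries whose 'source' equals `source`."""
    return [m for m in metadatas if m.get("source") == source]


# Metadata of the five hits, copied from the output above.
hits = [
    {"page": 0, "source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 14, "source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 6, "source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 0, "source": "docs/cs229_lectures/MachineLearning-Lecture02.pdf"},
    {"page": 2, "source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
]

lecture3_hits = filter_by_source(
    hits, "docs/cs229_lectures/MachineLearning-Lecture03.pdf"
)
lecture3_pages = [m["page"] for m in lecture3_hits]
```

The embedding of the question captures its overall meaning, not the structured constraint "third lecture", which is why an explicit metadata condition is needed.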

In [51]:

print(docs[4].page_content)

regression problem like this. What I want to do today is talk about a class of algorithms 
called non-parametric learning algorithms that will help to alleviate the need somewhat 
for you to choose features very carefully. Okay ? And this leads us in to our discussion of 
locally weighted regression. And just to de fine the term, linear regression, as we’ve 
defined it so far, is an example of a parame tric learning algorithm. Parametric learning 
algorithm is one that’s defined as an algorithm that has a fixed number of parameters that 
fit to the data. Okay? So in linear regression we  have a fix set of parameters theta, right? 
That must fit to the data. In contrast, what  I’m gonna talk about now is our first non-
parametric learning algorithm. The formal defi nition, which is not very  intuitive, so I’ve 
replaced it with a second, say, more intuitive. The, sort of, formal definition of the non-
parametric learning algorithm is that it’s an  algorithm where the number of parameters 
goes with M, with the size of the training se t. And usually it’s de fined as a number of 
parameters grows linearly with the size of the training set. Th is is the formal definition. A 
slightly less formal definition is that th e amount of stuff that your learning algorithm 
needs to keep around will grow linearly with th e training sets or, in another way of saying 
it, is that this is an algorithm that we’ll n eed to keep around an entire training set, even

In [52]:

question_chinese = "他们在第二讲中对Figure说了些什么？"

In [53]:

docs_chinese = vectordb_chinese.similarity_search(question_chinese,k=5)

In [54]:

for doc_chinese in docs_chinese:
    print(doc_chinese.metadata)

{'page': 9, 'source': 'docs/matplotlib/第二回：艺术画笔见乾坤.pdf'}
{'page': 0, 'source': 'docs/matplotlib/第一回：Matplotlib初相识.pdf'}
{'page': 0, 'source': 'docs/matplotlib/第一回：Matplotlib初相识.pdf'}
{'page': 0, 'source': 'docs/matplotlib/第三回：布局格式定方圆.pdf'}
{'page': 0, 'source': 'docs/matplotlib/第二回：艺术画笔见乾坤.pdf'}

In [55]:

print(docs_chinese[2].page_content)

第⼀回：Matplotlib 初相识
⼀、认识matplotlib
Matplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种静态，动态，
交互式的图表。
Matplotlib 可⽤于 Python 脚本， Python 和 IPython Shell 、 Jupyter notebook ， Web 应⽤程序服务器和各种图形⽤户界⾯⼯具包等。
Matplotlib 是 Python 数据可视化库中的泰⽃，它已经成为 python 中公认的数据可视化⼯具，我们所熟知的 pandas 和 seaborn 的绘图接⼝
其实也是基于 matplotlib 所作的⾼级封装。
为了对matplotlib 有更好的理解，让我们从⼀些最基本的概念开始认识它，再逐渐过渡到⼀些⾼级技巧中。
⼆、⼀个最简单的绘图例⼦
Matplotlib 的图像是画在 figure （如 windows ， jupyter 窗体）上的，每⼀个 figure ⼜包含了⼀个或多个 axes （⼀个可以指定坐标系的⼦区
域）。最简单的创建 figure 以及 axes 的⽅式是通过 pyplot.subplots命令，创建 axes 以后，可以使⽤ Axes.plot绘制最简易的折线图。
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
fig, ax = plt.subplots()  # 创建⼀个包含⼀个 axes 的 figure
ax.plot([1, 2, 3, 4], [1, 4, 2, 3]);  # 绘制图像
Trick： 在jupyter notebook 中使⽤ matplotlib 时会发现，代码运⾏后⾃动打印出类似 <matplotlib.lines.Line2D at 0x23155916dc0>
这样⼀段话，这是因为 matplotlib 的绘图代码默认打印出最后⼀个对象。如果不想显示这句话，有以下三种⽅法，在本章节的代码示例
中你能找到这三种⽅法的使⽤。
. 在代码块最后加⼀个分号 ;
. 在代码块最后加⼀句 plt.show()
. 在绘图时将绘图对象显式赋值给⼀个变量，如将 plt.plot([1, 2, 3, 4]) 改成 line =plt.plot([1, 2, 3, 4])
和MATLAB 命令类似，你还可以通过⼀种更简单的⽅式绘制图像， matplotlib.pyplot⽅法能够直接在当前 axes 上绘制图像，如果⽤户
未指定axes ， matplotlib 会帮你⾃动创建⼀个。所以上⾯的例⼦也可以简化为以下这⼀⾏代码。
line =plt.plot([1, 2, 3, 4], [1, 4, 2, 3]) 
三、Figure 的组成
现在我们来深⼊看⼀下 figure 的组成。通过⼀张 figure 解剖图，我们可以看到⼀个完整的 matplotlib 图像通常会包括以下四个层级，这些
层级也被称为容器（ container ），下⼀节会详细介绍。在 matplotlib 的世界中，我们将通过各种命令⽅法来操纵图像中的每⼀个部分，
从⽽达到数据可视化的最终效果，⼀副完整的图像实际上是各类⼦元素的集合。
Figure：顶层级，⽤来容纳所有绘图元素

The methods discussed in the next lesson can be used to address both of these problems!
