
LLM Frameworks: Vector Database

wangzf / 2024-09-23



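Introduction to Vector Databases

A vector database is a solution for efficiently computing over and managing large volumes of vector data. It is a database system built specifically to store and retrieve vector data (embeddings). Unlike traditional databases based on the relational model, it focuses on the properties of vector data and on the similarity between vectors.

In a vector database, data is represented as vectors, and each vector stands for one data item. The underlying items can be numbers, text, images, or other kinds of data. Vector databases use efficient indexing and query algorithms to speed up storing and retrieving vectors.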
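To make the store-and-retrieve idea concrete, here is a minimal sketch of an in-memory vector store. The `TinyVectorStore` class and its sample vectors are hypothetical and exist only for illustration. It finds neighbors by scanning every stored vector, whereas a real vector database would typically rely on an index to avoid that full scan.

```python
import math


class TinyVectorStore:
    """Toy in-memory store: exact nearest-neighbor search by linear scan."""

    def __init__(self):
        self._items = {}  # id -> (vector, payload)

    def add(self, item_id, vector, payload=None):
        self._items[item_id] = (list(vector), payload)

    def query(self, vector, n_results=1):
        # Score every stored vector by Euclidean distance (smaller = closer).
        scored = [
            (math.dist(vector, stored), item_id)
            for item_id, (stored, _) in self._items.items()
        ]
        scored.sort()
        return [item_id for _, item_id in scored[:n_results]]


store = TinyVectorStore()
store.add("pineapple", [0.9, 0.1, 0.0])
store.add("orange", [0.8, 0.3, 0.1])
store.add("bicycle", [0.0, 0.2, 0.9])

nearest = store.query([1.0, 0.0, 0.0], n_results=2)
print(nearest)  # ['pineapple', 'orange']
```

The linear scan costs time proportional to the number of stored vectors per query, which is the cost that indexing is meant to reduce.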

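Principles and Advantages of Vector Databases

A vector database treats the vector as its basic unit of data and stores, processes, and retrieves vectors. It measures similarity to a target vector by computing metrics such as cosine distance, dot product, or squared L2 distance. When handling large or very large volumes of vector data, its indexing and query algorithms are considerably more efficient than those of a traditional database.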
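The three metrics named above can be written out in a few lines of plain Python. This is a sketch for intuition only. The function names are chosen here for illustration, and a production system would use vectorized or native implementations.

```python
import math


def dot(a, b):
    """Dot product: larger means more similar for same-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance: smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity: 0 for same direction, 1 for orthogonal."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return 1.0 - dot(a, b) / (norm_a * norm_b)


target = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 1.0]

print(dot(target, same_direction))              # 2.0
print(squared_l2(target, same_direction))       # 1.0
print(cosine_distance(target, same_direction))  # 0.0
print(cosine_distance(target, orthogonal))      # 1.0
```

Note that cosine distance ignores vector length, while squared L2 and the dot product are both sensitive to it.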

Popular Vector Databases

Chroma

Introduction to Chroma

Chroma is the AI-native open-source vector database. Chroma makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs.


Chroma gives you the tools to:

Chroma prioritizes:

Installing Chroma

$ pip install chromadb

Using Chroma

  1. Create a Chroma client
import chromadb

chroma_client = chromadb.Client()
  2. Create a collection

Collections are where you’ll store your embeddings, documents, and any additional metadata.

collection = chroma_client.create_collection(name = "my_collection")
  3. Add text documents
collection.add(
    documents = [
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    ids = ["id1", "id2"],
)
  4. Query the collection

You can query the collection with a list of query texts, and Chroma will return the n most similar results.

results = collection.query(
    query_texts = ["This is a query document about hawaii"],  # Chroma will embed this for you
    n_results = 2,  # how many results to return
)
print(results)
{
  'documents': [[
      'This is a document about pineapple',
      'This is a document about oranges'
  ]],
  'ids': [['id1', 'id2']],
  'distances': [[1.0404009819030762, 1.243080496788025]],
  'uris': None,
  'data': None,
  'metadatas': [[None, None]],
  'embeddings': None,
}
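The result is nested: each key holds one inner list per query text, so index 0 refers to the first (here the only) query. A small sketch of reading the hits back out, using a literal copy of the relevant keys from the sample output above rather than a live Chroma call:

```python
# Literal copy of the relevant keys from the sample output above.
results = {
    "documents": [[
        "This is a document about pineapple",
        "This is a document about oranges",
    ]],
    "ids": [["id1", "id2"]],
    "distances": [[1.0404009819030762, 1.243080496788025]],
}

# Pair up id, document, and distance for the first query text.
hits = list(zip(results["ids"][0], results["documents"][0], results["distances"][0]))
for item_id, document, distance in hits:
    print(f"{item_id}  {distance:.3f}  {document}")

best_id = hits[0][0]  # in the sample output the closest match is listed first
```

In the sample output, the pineapple document has the smaller distance to the Hawaii query, so it is listed first.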

Chroma API

Creating a Chroma Client

import chromadb

client = chromadb.PersistentClient(path = "/path/to/save/to")

# ------------------------------
# useful convenience method
# ------------------------------
# returns a nanosecond heartbeat. 
# Useful for making sure the client remains connected.
client.heartbeat()

# Empties and completely resets the database. 
# ⚠️ This is destructive and not reversible.
client.reset()

Running in Client-Server Mode

Start a Chroma server locally:

$ chroma run --path /db_path

Connect with an HTTP client:

import chromadb

chroma_client = chromadb.HttpClient(host = "localhost", port = 8000)

Connect with an asynchronous (async) HTTP client:

import asyncio
import chromadb

async def main():
    client = await chromadb.AsyncHttpClient()
    collection = await client.create_collection(name = "my_collection")

    await collection.add(
        documents = ["hello world"],
        ids = ["id1"]
    )

asyncio.run(main())

Working with Collections

Creating, Inspecting, and Deleting

collection = client.create_collection(name="my_collection", embedding_function=emb_fn)
collection = client.get_collection(name="my_collection", embedding_function=emb_fn)
# Get a collection object from an existing collection, by name. Will raise an exception if it's not found.
collection = client.get_collection(name="test") 
# Get a collection object from an existing collection, by name. If it doesn't exist, create it.
collection = client.get_or_create_collection(name="test") 
# Delete a collection and all associated embeddings, documents, and metadata. 
# ⚠️ This is destructive and not reversible
client.delete_collection(name="my_collection") 

Changing the Distance Function

collection = client.create_collection(
    name = "collection_name",
    metadata = {"hnsw:space": "cosine"} # l2 is the default
)
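Why the choice of distance function matters can be seen without calling Chroma at all. The sketch below uses plain Python and hypothetical candidate vectors to show that the same query can have a different nearest neighbor under squared L2 than under cosine distance.

```python
import math


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


query = [1.0, 0.0]
candidates = {
    "long_same_direction": [10.0, 0.0],  # same direction, much larger magnitude
    "short_off_axis": [1.0, 1.0],        # close in space, 45 degrees away
}

nearest_l2 = min(candidates, key=lambda name: squared_l2(query, candidates[name]))
nearest_cosine = min(candidates, key=lambda name: cosine_distance(query, candidates[name]))
print(nearest_l2)      # short_off_axis
print(nearest_cosine)  # long_same_direction
```

Squared L2 favors the vector that is physically close, while cosine distance favors the one pointing in the same direction regardless of length.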

Add data

# Add documents with metadata; Chroma embeds the documents for you
collection.add(
    documents = ["lorem ipsum...", "doc2", "doc3", ...],
    metadatas = [
        {"chapter": "3", "verse": "16"}, 
        {"chapter": "3", "verse": "5"}, 
        {"chapter": "29", "verse": "11"}, 
        ...
    ],
    ids = ["id1", "id2", "id3", ...]
)
# Add documents together with precomputed embeddings
collection.add(
    documents = ["doc1", "doc2", "doc3", ...],
    embeddings = [
        [1.1, 2.3, 3.2], 
        [4.5, 6.9, 4.4], 
        [1.1, 2.3, 3.2], 
        ...
    ],
    metadatas = [
        {"chapter": "3", "verse": "16"}, 
        {"chapter": "3", "verse": "5"}, 
        {"chapter": "29", "verse": "11"}, 
        ...
    ],
    ids = ["id1", "id2", "id3", ...]
)
# Add embeddings and metadata only, without documents
collection.add(
    embeddings = [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas = [
        {"chapter": "3", "verse": "16"}, 
        {"chapter": "3", "verse": "5"}, 
        {"chapter": "29", "verse": "11"}, 
        ...
    ],
    ids = ["id1", "id2", "id3", ...]
)
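The `add` calls above take parallel lists, where position i in `ids`, `documents`, and `metadatas` all describe the same item. When data starts out as one record per item, a small helper can build those lists. `build_add_payload` and the record field names (`id`, `text`, `metadata`) are hypothetical, and rejecting duplicate ids is an assumption of this sketch rather than a statement about Chroma's behavior.

```python
def build_add_payload(records):
    """Turn a list of record dicts into parallel lists keyed like collection.add's arguments."""
    ids = [record["id"] for record in records]
    if len(set(ids)) != len(ids):
        # Assumption of this sketch: treat duplicate ids as a caller error.
        raise ValueError("ids must be unique")
    return {
        "ids": ids,
        "documents": [record["text"] for record in records],
        "metadatas": [record["metadata"] for record in records],
    }


records = [
    {"id": "id1", "text": "lorem ipsum...", "metadata": {"chapter": "3", "verse": "16"}},
    {"id": "id2", "text": "doc2", "metadata": {"chapter": "3", "verse": "5"}},
]
payload = build_add_payload(records)
print(payload["ids"])  # ['id1', 'id2']
```

With a real collection object, the result could then be passed along as `collection.add(**payload)`.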

Query

collection.query(
    query_embeddings = [
        [11.1, 12.1, 13.1],
        [1.1, 2.3, 3.2], 
        ...
    ],
    n_results = 10,
    where = {
        "metadata_field": "is_equal_to_this"
    },
    where_document = {
        "$contains": "search_string"
    },
)

Where

where filters
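A `where` filter is a dictionary that is matched against each item's metadata. The toy evaluator below illustrates the semantics of a few common filter shapes. It is not Chroma's implementation, and the operator names (`$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$and`, `$or`) are written as I recall them from Chroma's documentation, so check them against the version you use. The metadata values here are integers so that range comparisons are meaningful.

```python
import operator

COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Form: {"field": {"$op": value}}
            for op, expected in condition.items():
                if key not in metadata or not COMPARATORS[op](metadata[key], expected):
                    return False
        elif metadata.get(key) != condition:
            # Form: {"field": value}, shorthand for equality
            return False
    return True


item = {"chapter": 3, "verse": 16}
print(matches(item, {"chapter": 3}))                                        # True
print(matches(item, {"verse": {"$gt": 20}}))                                # False
print(matches(item, {"$and": [{"chapter": 3}, {"verse": {"$lte": 16}}]}))   # True
```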

Updating data

# Update existing items by id
collection.update(
    ids = ["id1", "id2", "id3", ...],
    embeddings = [
        [1.1, 2.3, 3.2], 
        [4.5, 6.9, 4.4], 
        [1.1, 2.3, 3.2], 
        ...
    ],
    metadatas = [
        {"chapter": "3", "verse": "16"}, 
        {"chapter": "3", "verse": "5"}, 
        {"chapter": "29", "verse": "11"}, 
        ...
    ],
    documents = ["doc1", "doc2", "doc3", ...],
)
# Update items that exist and add those that do not
collection.upsert(
    ids = ["id1", "id2", "id3", ...],
    embeddings = [
        [1.1, 2.3, 3.2], 
        [4.5, 6.9, 4.4], 
        [1.1, 2.3, 3.2], 
        ...
    ],
    metadatas = [
        {"chapter": "3", "verse": "16"}, 
        {"chapter": "3", "verse": "5"}, 
        {"chapter": "29", "verse": "11"}, 
        ...
    ],
    documents = ["doc1", "doc2", "doc3", ...],
)
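The difference between `update` and `upsert` is what happens when an id is not already present. The dictionary-based sketch below models that difference only. The `update` and `upsert` functions here are toy stand-ins, not Chroma calls, and they assume that an update for a missing id is skipped while an upsert creates the item.

```python
def update(store, item_id, document):
    """Modify an existing item; ids that are not present are skipped."""
    if item_id not in store:
        return False
    store[item_id] = document
    return True


def upsert(store, item_id, document):
    """Modify the item if it exists, otherwise create it."""
    store[item_id] = document


store = {"id1": "doc1"}
updated_missing = update(store, "id2", "doc2")  # id2 does not exist: skipped
upsert(store, "id2", "doc2")                    # id2 is created
upsert(store, "id1", "doc1 (revised)")          # id1 is overwritten
print(updated_missing)  # False
print(store)            # {'id1': 'doc1 (revised)', 'id2': 'doc2'}
```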

Deleting data

collection.delete(
    ids = [
        "id1", "id2", "id3", ...
    ],
    where = {"chapter": "20"}
)

Deploying Chroma

Chroma Server:

Containers:

Cloud Providers:

Client-Server Mode

Run Chroma in client/server mode by using CLI:

$ chroma run --path /db_path

Connect to Server using Chroma HttpClient

import chromadb

chroma_client = chromadb.HttpClient(host = 'localhost', port = 8000)

Using AsyncHttpClient asynchronously

import asyncio
import chromadb

async def main():
    client = await chromadb.AsyncHttpClient()
    collection = await client.create_collection(name="my_collection")
    await collection.add(
        documents=["hello world"],
        ids=["id1"]
    )

asyncio.run(main())

Chroma’s Thin-Client

$ pip install chromadb-client
import chromadb

# Example setup of the client to connect to your chroma server
client = chromadb.HttpClient(host = 'localhost', port = 8000)

# Or for async usage:
import asyncio

async def main():
    client = await chromadb.AsyncHttpClient(host='localhost', port=8000)

asyncio.run(main())

Weaviate

Qdrant

References