使用向量嵌入項目搜尋

本頁面說明如何使用 Cloud Firestore,透過下列技術執行 K 近鄰 (KNN) 向量搜尋:

  • 儲存向量值
  • 建立及管理 KNN 向量索引
  • 使用其中一種支援的向量距離測量方式,進行 K 近鄰 (KNN) 查詢

事前準備

將嵌入項目儲存在 Cloud Firestore 之前,您必須先生成向量嵌入。Cloud Firestore 不會生成嵌入。您可以使用 Vertex AI 等服務建立向量值,例如來自 Cloud Firestore 資料的文字嵌入。接著,您可以將這些嵌入內容儲存回 Cloud Firestore 文件。

如要進一步瞭解嵌入項目,請參閱「什麼是嵌入項目?」一文。

如要瞭解如何使用 Vertex AI 取得文字嵌入,請參閱「取得文字嵌入」。

儲存向量嵌入

下列範例說明如何在 Cloud Firestore 中儲存向量嵌入。

使用向量嵌入的寫入作業

以下範例說明如何在 Cloud Firestore 文件中儲存向量嵌入:

Python
from google.cloud import firestore
from google.cloud.firestore_v1.vector import Vector

firestore_client = firestore.Client()
collection = firestore_client.collection("coffee-beans")
doc = {
    "name": "Kahawa coffee beans",
    "description": "Information about the Kahawa coffee beans.",
    "embedding_field": Vector([0.18332680, 0.24160706, 0.3416704]),
}

collection.add(doc)
Node.js
import {
  Firestore,
  FieldValue,
} from "@google-cloud/firestore";

const db = new Firestore();
const coll = db.collection('coffee-beans');
await coll.add({
  name: "Kahawa coffee beans",
  description: "Information about the Kahawa coffee beans.",
  embedding_field: FieldValue.vector([1.0 , 2.0, 3.0])
});
Go
import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/firestore"
)

type CoffeeBean struct {
	Name           string             `firestore:"name,omitempty"`
	Description    string             `firestore:"description,omitempty"`
	EmbeddingField firestore.Vector32 `firestore:"embedding_field,omitempty"`
	Color          string             `firestore:"color,omitempty"`
}

func storeVectors(w io.Writer, projectID string) error {
	ctx := context.Background()

	// Create client
	client, err := firestore.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("firestore.NewClient: %w", err)
	}
	defer client.Close()

	// Vector can be represented by Vector32 or Vector64
	doc := CoffeeBean{
		Name:           "Kahawa coffee beans",
		Description:    "Information about the Kahawa coffee beans.",
		EmbeddingField: []float32{1.0, 2.0, 3.0},
		Color:          "red",
	}
	ref := client.Collection("coffee-beans").NewDoc()
	if _, err = ref.Set(ctx, doc); err != nil {
		fmt.Fprintf(w, "failed to upsert: %v", err)
		return err
	}

	return nil
}
Java
import com.google.cloud.firestore.CollectionReference;
import com.google.cloud.firestore.DocumentReference;
import com.google.cloud.firestore.FieldValue;
import com.google.cloud.firestore.VectorQuery;

CollectionReference coll = firestore.collection("coffee-beans");

Map<String, Object> docData = new HashMap<>();
docData.put("name", "Kahawa coffee beans");
docData.put("description", "Information about the Kahawa coffee beans.");
docData.put("embedding_field", FieldValue.vector(new double[] {1.0, 2.0, 3.0}));

ApiFuture<DocumentReference> future = coll.add(docData);
DocumentReference documentReference = future.get();

使用 Cloud Function 計算向量嵌入

如要在文件更新或建立時計算及儲存向量嵌入,可以設定 Cloud 函式

Python
@functions_framework.cloud_event
def store_embedding(cloud_event) -> None:
  """Triggers by a change to a Firestore document.
  """
  firestore_payload = firestore.DocumentEventData()
  payload = firestore_payload._pb.ParseFromString(cloud_event.data)

  collection_id, doc_id = from_payload(payload)
  # Call a function to calculate the embedding
  embedding = calculate_embedding(payload)
  # Update the document
  doc = firestore_client.collection(collection_id).document(doc_id)
  doc.set({"embedding_field": embedding}, merge=True)
Node.js
/**
 * A vector embedding will be computed from the
 * value of the `content` field. The vector value
 * will be stored in the `embedding` field. The
 * field names `content` and `embedding` are arbitrary
 * field names chosen for this example.
 */
async function storeEmbedding(event: FirestoreEvent<any>): Promise<void> {
  // Get the previous value of the document's `content` field.
  const previousDocumentSnapshot = event.data.before as QueryDocumentSnapshot;
  const previousContent = previousDocumentSnapshot.get("content");

  // Get the current value of the document's `content` field.
  const currentDocumentSnapshot = event.data.after as QueryDocumentSnapshot;
  const currentContent = currentDocumentSnapshot.get("content");

  // Don't update the embedding if the content field did not change
  if (previousContent === currentContent) {
    return;
  }

  // Call a function to calculate the embedding for the value
  // of the `content` field.
  const embeddingVector = calculateEmbedding(currentContent);

  // Update the `embedding` field on the document.
  await currentDocumentSnapshot.ref.update({
    embedding: embeddingVector,
  });
}
Go
  // Not yet supported in the Go client library
Java
  // Not yet supported in the Java client library

建立及管理向量索引

您必須先建立相應的索引,才能使用向量嵌入項目執行最鄰近搜尋。下列範例示範如何使用 Google Cloud CLI 建立及管理向量索引。您也可以使用 Firebase CLI 和 Terraform 管理向量索引

建立向量索引

建立向量索引前,請先升級至最新版 Google Cloud CLI

gcloud components update

如要建立向量索引,請使用 gcloud firestore indexes composite create

gcloud
gcloud firestore indexes composite create \
--collection-group=collection-group \
--query-scope=COLLECTION \
--field-config field-path=vector-field,vector-config='vector-configuration' \
--database=database-id

其中:

  • collection-group 是集合群組的 ID。
  • vector-field 是包含向量嵌入的欄位名稱。
  • database-id 是資料庫的 ID。
  • vector-configuration 包含向量 dimension 和索引類型。 dimension 是整數,上限為 2048。索引類型必須是 flat。 索引設定格式如下:{"dimension":"DIMENSION", "flat": "{}"}

以下範例會建立複合式索引,包括欄位 vector-field 的向量索引,以及欄位 color 的遞增索引。您可以使用這類索引,在最鄰近搜尋前預先篩選資料

gcloud
gcloud firestore indexes composite create \
--collection-group=collection-group \
--query-scope=COLLECTION \
--field-config=order=ASCENDING,field-path="color" \
--field-config field-path=vector-field,vector-config='{"dimension":"1024", "flat": "{}"}' \
--database=database-id

列出所有向量索引

gcloud
gcloud firestore indexes composite list --database=database-id

database-id 替換為資料庫的 ID。

刪除向量索引

gcloud
gcloud firestore indexes composite delete index-id --database=database-id

其中:

  • index-id 是要刪除的索引 ID。 使用 indexes composite list 擷取索引 ID。
  • database-id 是資料庫的 ID。

說明向量索引

gcloud
gcloud firestore indexes composite describe index-id --database=database-id

其中:

  • index-id 是要說明的索引 ID。使用 或 indexes composite list 擷取索引 ID。
  • database-id 是資料庫的 ID。

提出最鄰近查詢

您可以執行相似度搜尋,找出向量嵌入的最鄰近項目。相似度搜尋需要向量索引。 如果沒有索引,Cloud Firestore 會建議使用 gcloud CLI 建立索引。

以下範例會找出查詢向量的 10 個最近鄰。

Python
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
from google.cloud.firestore_v1.vector import Vector

collection = db.collection("coffee-beans")

# Requires a single-field vector index
vector_query = collection.find_nearest(
    vector_field="embedding_field",
    query_vector=Vector([0.3416704, 0.18332680, 0.24160706]),
    distance_measure=DistanceMeasure.EUCLIDEAN,
    limit=5,
)
Node.js
import {
  Firestore,
  FieldValue,
  VectorQuery,
  VectorQuerySnapshot,
} from "@google-cloud/firestore";

// Requires a single-field vector index
const vectorQuery: VectorQuery = coll.findNearest({
  vectorField: 'embedding_field',
  queryVector: [3.0, 1.0, 2.0],
  limit: 10,
  distanceMeasure: 'EUCLIDEAN'
});

const vectorQuerySnapshot: VectorQuerySnapshot = await vectorQuery.get();
Go
import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/firestore"
)

func vectorSearchBasic(w io.Writer, projectID string) error {
	ctx := context.Background()

	// Create client
	client, err := firestore.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("firestore.NewClient: %w", err)
	}
	defer client.Close()

	collection := client.Collection("coffee-beans")

	// Requires a vector index
	// https://firebase.google.com/docs/firestore/vector-search#create_and_manage_vector_indexes
	vectorQuery := collection.FindNearest("embedding_field",
		[]float32{3.0, 1.0, 2.0},
		5,
		// More info: https://firebase.google.com/docs/firestore/vector-search#vector_distances
		firestore.DistanceMeasureEuclidean,
		nil)

	docs, err := vectorQuery.Documents(ctx).GetAll()
	if err != nil {
		fmt.Fprintf(w, "failed to get vector query results: %v", err)
		return err
	}

	for _, doc := range docs {
		fmt.Fprintln(w, doc.Data()["name"])
	}
	return nil
}
Java
import com.google.cloud.firestore.VectorQuery;
import com.google.cloud.firestore.VectorQuerySnapshot;

VectorQuery vectorQuery = coll.findNearest(
        "embedding_field",
        new double[] {3.0, 1.0, 2.0},
        /* limit */ 10,
        VectorQuery.DistanceMeasure.EUCLIDEAN);

ApiFuture<VectorQuerySnapshot> future = vectorQuery.get();
VectorQuerySnapshot vectorQuerySnapshot = future.get();

向量距離

最近鄰查詢支援下列向量距離選項:

  • EUCLIDEAN:測量向量之間的歐氏距離。 詳情請參閱「歐幾里得」。
  • COSINE:根據向量間的角度比較向量,可讓您測量不以向量大小為依據的相似度。建議您搭配使用 DOT_PRODUCT 與單位正規化向量,而不要使用餘弦距離,因為兩者在數學上等價,但前者效能較佳。詳情請參閱「餘弦相似度」。
  • DOT_PRODUCT:與 COSINE 類似,但會受到向量大小的影響。詳情請參閱「點積」。

選擇距離測量方式

視所有向量嵌入是否已正規化而定,您可以決定要使用哪種距離測量方式來找出距離測量結果。正規化向量嵌入的量值 (長度) 剛好為 1.0。

此外,如果您知道模型訓練時使用的距離測量方式,請使用該方式計算向量嵌入之間的距離。

正規化資料

如果資料集中所有向量嵌入都經過正規化,則三種距離測量方式都會提供相同的語意搜尋結果。基本上,雖然每種距離測量方式會傳回不同的值,但這些值會以相同方式排序。當嵌入內容經過正規化處理後,DOT_PRODUCT 通常是運算效率最高的做法,但大多數情況下,兩者差異微不足道。不過,如果應用程式對效能非常敏感,DOT_PRODUCT 可能有助於調整效能。

未標準化的資料

如果資料集中的向量嵌入未經過正規化,則使用 DOT_PRODUCT 做為距離測量值在數學上並不正確,因為點積並非用來測量距離。視嵌入的生成方式和偏好的搜尋類型而定,COSINEEUCLIDEAN 距離測量產生的搜尋結果,在主觀上會優於其他距離測量。您可能需要透過 COSINEEUCLIDEAN 進行實驗,才能判斷哪一個最適合您的用途。

不確定資料是否經過正規化

如果您不確定資料是否已正規化,但想使用 DOT_PRODUCT,建議改用 COSINECOSINE 類似於 DOT_PRODUCT,但內建正規化功能。使用 COSINE 測量的距離範圍為 02。結果越接近 0,表示向量越相似。

預先篩選文件

如要在尋找最鄰近項目之前預先篩選文件,可以將相似性搜尋與其他查詢運算子結合使用。支援 andor 複合篩選器。如要進一步瞭解支援的欄位篩選器,請參閱「查詢運算子」。

Python
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
from google.cloud.firestore_v1.vector import Vector

collection = db.collection("coffee-beans")

# Similarity search with pre-filter
# Requires a composite vector index
vector_query = collection.where("color", "==", "red").find_nearest(
    vector_field="embedding_field",
    query_vector=Vector([0.3416704, 0.18332680, 0.24160706]),
    distance_measure=DistanceMeasure.EUCLIDEAN,
    limit=5,
)
Node.js
// Similarity search with pre-filter
// Requires composite vector index
const preFilteredVectorQuery: VectorQuery = coll
    .where("color", "==", "red")
    .findNearest({
      vectorField: "embedding_field",
      queryVector: [3.0, 1.0, 2.0],
      limit: 5,
      distanceMeasure: "EUCLIDEAN",
    });

const vectorQueryResults = await preFilteredVectorQuery.get();
Go
import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/firestore"
)

func vectorSearchPrefilter(w io.Writer, projectID string) error {
	ctx := context.Background()

	// Create client
	client, err := firestore.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("firestore.NewClient: %w", err)
	}
	defer client.Close()

	collection := client.Collection("coffee-beans")

	// Similarity search with pre-filter
	// Requires a composite vector index
	vectorQuery := collection.Where("color", "==", "red").
		FindNearest("embedding_field",
			[]float32{3.0, 1.0, 2.0},
			5,
			// More info: https://firebase.google.com/docs/firestore/vector-search#vector_distances
			firestore.DistanceMeasureEuclidean,
			nil)

	docs, err := vectorQuery.Documents(ctx).GetAll()
	if err != nil {
		fmt.Fprintf(w, "failed to get vector query results: %v", err)
		return err
	}

	for _, doc := range docs {
		fmt.Fprintln(w, doc.Data()["name"])
	}
	return nil
}
Java
import com.google.cloud.firestore.VectorQuery;
import com.google.cloud.firestore.VectorQuerySnapshot;

VectorQuery preFilteredVectorQuery = coll
        .whereEqualTo("color", "red")
        .findNearest(
                "embedding_field",
                new double[] {3.0, 1.0, 2.0},
                /* limit */ 10,
                VectorQuery.DistanceMeasure.EUCLIDEAN);

ApiFuture<VectorQuerySnapshot> future = preFilteredVectorQuery.get();
VectorQuerySnapshot vectorQuerySnapshot = future.get();

擷取計算出的向量距離

您可以在 FindNearest 查詢中指派 distance_result_field 輸出屬性名稱,擷取計算出的向量距離,如下列範例所示:

Python
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
from google.cloud.firestore_v1.vector import Vector

collection = db.collection("coffee-beans")

vector_query = collection.find_nearest(
    vector_field="embedding_field",
    query_vector=Vector([0.3416704, 0.18332680, 0.24160706]),
    distance_measure=DistanceMeasure.EUCLIDEAN,
    limit=10,
    distance_result_field="vector_distance",
)

docs = vector_query.stream()

for doc in docs:
    print(f"{doc.id}, Distance: {doc.get('vector_distance')}")
Node.js
const vectorQuery: VectorQuery = coll.findNearest(
    {
      vectorField: 'embedding_field',
      queryVector: [3.0, 1.0, 2.0],
      limit: 10,
      distanceMeasure: 'EUCLIDEAN',
      distanceResultField: 'vector_distance'
    });

const snapshot: VectorQuerySnapshot = await vectorQuery.get();

snapshot.forEach((doc) => {
  console.log(doc.id, ' Distance: ', doc.get('vector_distance'));
});
Go
import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/firestore"
)

func vectorSearchDistanceResultField(w io.Writer, projectID string) error {
	ctx := context.Background()

	client, err := firestore.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("firestore.NewClient: %w", err)
	}
	defer client.Close()

	collection := client.Collection("coffee-beans")

	// Requires a vector index
	// https://firebase.google.com/docs/firestore/vector-search#create_and_manage_vector_indexes
	vectorQuery := collection.FindNearest("embedding_field",
		[]float32{3.0, 1.0, 2.0},
		10,
		firestore.DistanceMeasureEuclidean,
		&firestore.FindNearestOptions{
			DistanceResultField: "vector_distance",
		})

	docs, err := vectorQuery.Documents(ctx).GetAll()
	if err != nil {
		fmt.Fprintf(w, "failed to get vector query results: %v", err)
		return err
	}

	for _, doc := range docs {
		fmt.Fprintf(w, "%v, Distance: %v\n", doc.Data()["name"], doc.Data()["vector_distance"])
	}
	return nil
}
Java
import com.google.cloud.firestore.VectorQuery;
import com.google.cloud.firestore.VectorQueryOptions;
import com.google.cloud.firestore.VectorQuerySnapshot;

VectorQuery vectorQuery = coll.findNearest(
        "embedding_field",
        new double[] {3.0, 1.0, 2.0},
        /* limit */ 10,
        VectorQuery.DistanceMeasure.EUCLIDEAN,
        VectorQueryOptions.newBuilder().setDistanceResultField("vector_distance").build());

ApiFuture<VectorQuerySnapshot> future = vectorQuery.get();
VectorQuerySnapshot vectorQuerySnapshot = future.get();

for (DocumentSnapshot document : vectorQuerySnapshot.getDocuments()) {
    System.out.println(document.getId() + " Distance: " + document.get("vector_distance"));
}

如要使用欄位遮罩傳回部分文件欄位和 distanceResultField,您也必須在欄位遮罩中加入 distanceResultField 的值,如下例所示:

Python
vector_query = collection.select(["color", "vector_distance"]).find_nearest(
    vector_field="embedding_field",
    query_vector=Vector([0.3416704, 0.18332680, 0.24160706]),
    distance_measure=DistanceMeasure.EUCLIDEAN,
    limit=10,
    distance_result_field="vector_distance",
)
Node.js
const vectorQuery: VectorQuery = coll
    .select('name', 'description', 'vector_distance')
    .findNearest({
      vectorField: 'embedding_field',
      queryVector: [3.0, 1.0, 2.0],
      limit: 10,
      distanceMeasure: 'EUCLIDEAN',
      distanceResultField: 'vector_distance'
    });
Go
import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/firestore"
)

func vectorSearchDistanceResultFieldMasked(w io.Writer, projectID string) error {
	ctx := context.Background()

	client, err := firestore.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("firestore.NewClient: %w", err)
	}
	defer client.Close()

	collection := client.Collection("coffee-beans")

	// Requires a vector index
	// https://firebase.google.com/docs/firestore/vector-search#create_and_manage_vector_indexes
	vectorQuery := collection.Select("color", "vector_distance").
		FindNearest("embedding_field",
			[]float32{3.0, 1.0, 2.0},
			10,
			firestore.DistanceMeasureEuclidean,
			&firestore.FindNearestOptions{
				DistanceResultField: "vector_distance",
			})

	docs, err := vectorQuery.Documents(ctx).GetAll()
	if err != nil {
		fmt.Fprintf(w, "failed to get vector query results: %v", err)
		return err
	}

	for _, doc := range docs {
		fmt.Fprintf(w, "%v, Distance: %v\n", doc.Data()["color"], doc.Data()["vector_distance"])
	}
	return nil
}
Java
import com.google.cloud.firestore.VectorQuery;
import com.google.cloud.firestore.VectorQueryOptions;
import com.google.cloud.firestore.VectorQuerySnapshot;

VectorQuery vectorQuery = coll
        .select("name", "description", "vector_distance")
        .findNearest(
          "embedding_field",
          new double[] {3.0, 1.0, 2.0},
          /* limit */ 10,
          VectorQuery.DistanceMeasure.EUCLIDEAN,
          VectorQueryOptions.newBuilder()
            .setDistanceResultField("vector_distance")
            .build());

ApiFuture<VectorQuerySnapshot> future = vectorQuery.get();
VectorQuerySnapshot vectorQuerySnapshot = future.get();

for (DocumentSnapshot document : vectorQuerySnapshot.getDocuments()) {
    System.out.println(document.getId() + " Distance: " + document.get("vector_distance"));
}

指定距離門檻

您可以指定相似度門檻,只傳回門檻內的文件。門檻欄位的行為取決於您選擇的距離測量方式:

  • EUCLIDEANCOSINE 距離會將門檻限制為距離小於或等於指定門檻的文件。向量越相似,這些距離測量值就越小。
  • DOT_PRODUCT距離限制會將門檻設為大於或等於指定門檻的文件。向量越相似,點積距離就越大。

以下範例說明如何指定距離門檻,使用 EUCLIDEAN 距離指標傳回最多 10 份最接近的文件,且距離最多 4.5 個單位:

Python
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
from google.cloud.firestore_v1.vector import Vector

collection = db.collection("coffee-beans")

vector_query = collection.find_nearest(
    vector_field="embedding_field",
    query_vector=Vector([0.3416704, 0.18332680, 0.24160706]),
    distance_measure=DistanceMeasure.EUCLIDEAN,
    limit=10,
    distance_threshold=4.5,
)

docs = vector_query.stream()

for doc in docs:
    print(f"{doc.id}")
Node.js
const vectorQuery: VectorQuery = coll.findNearest({
  vectorField: 'embedding_field',
  queryVector: [3.0, 1.0, 2.0],
  limit: 10,
  distanceMeasure: 'EUCLIDEAN',
  distanceThreshold: 4.5
});

const snapshot: VectorQuerySnapshot = await vectorQuery.get();

snapshot.forEach((doc) => {
  console.log(doc.id);
});
Go
import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/firestore"
)

func vectorSearchDistanceThreshold(w io.Writer, projectID string) error {
	ctx := context.Background()

	client, err := firestore.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("firestore.NewClient: %w", err)
	}
	defer client.Close()

	collection := client.Collection("coffee-beans")

	// Requires a vector index
	// https://firebase.google.com/docs/firestore/vector-search#create_and_manage_vector_indexes
	vectorQuery := collection.FindNearest("embedding_field",
		[]float32{3.0, 1.0, 2.0},
		10,
		firestore.DistanceMeasureEuclidean,
		&firestore.FindNearestOptions{
			DistanceThreshold: firestore.Ptr[float64](4.5),
		})

	docs, err := vectorQuery.Documents(ctx).GetAll()
	if err != nil {
		fmt.Fprintf(w, "failed to get vector query results: %v", err)
		return err
	}

	for _, doc := range docs {
		fmt.Fprintln(w, doc.Data()["name"])
	}
	return nil
}
Java
import com.google.cloud.firestore.VectorQuery;
import com.google.cloud.firestore.VectorQueryOptions;
import com.google.cloud.firestore.VectorQuerySnapshot;

VectorQuery vectorQuery = coll.findNearest(
        "embedding_field",
        new double[] {3.0, 1.0, 2.0},
        /* limit */ 10,
        VectorQuery.DistanceMeasure.EUCLIDEAN,
        VectorQueryOptions.newBuilder()
          .setDistanceThreshold(4.5)
          .build());

ApiFuture<VectorQuerySnapshot> future = vectorQuery.get();
VectorQuerySnapshot vectorQuerySnapshot = future.get();

for (DocumentSnapshot document : vectorQuerySnapshot.getDocuments()) {
    System.out.println(document.getId());
}

限制

使用向量嵌入時,請注意下列限制:

  • 支援的嵌入維度上限為 2048。如要儲存較大的索引,請使用降維
  • 從最鄰近查詢傳回的文件數量上限為 1000 份。
  • 向量搜尋不支援即時快照監聽器
  • 只有 Python、Node.js、Go 和 Java 用戶端程式庫支援向量搜尋。

後續步驟