Beta: Firebase Genkit is in Beta, which means that it is not subject to any SLA or deprecation policy and could change in backwards-incompatible ways. Throughout the Beta period, Firebase Genkit and its documentation will be updated and improved.

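Writing a Genkit evaluator

You can extend Firebase Genkit to support custom evaluation, using either an LLM as a judge or programmatic (heuristic) evaluation.

Evaluator definition

Evaluators are functions that assess an LLM's response. There are two main approaches to automated evaluation: heuristic evaluation and LLM-based evaluation. In the heuristic approach, you define a deterministic function. In an LLM-based evaluation, by contrast, the content is fed back to an LLM, which is asked to score the output according to criteria set in a prompt.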
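To make the distinction concrete, here is a minimal, self-contained sketch that does not use Genkit. The `Score` shape, the `wordCountScore` and `judgeScore` functions, and the `Judge` callback type are hypothetical names chosen for illustration, and the stub judge stands in for a real model call.

```typescript
// Hypothetical score shape, for illustration only.
type Score = { score: number | string | boolean; reasoning: string };

// Heuristic evaluation: a deterministic function of the output.
// The same output always yields the same score.
function wordCountScore(output: string, maxWords: number): Score {
  const words = output.trim().split(/\s+/).filter((w) => w.length > 0);
  const ok = words.length <= maxWords;
  return {
    score: ok,
    reasoning: `Output has ${words.length} word(s); limit is ${maxWords}`,
  };
}

// LLM-based evaluation: the output is embedded in a prompt that states the
// grading criteria, and a judge produces the verdict.
// `Judge` is a stand-in for a model call.
type Judge = (prompt: string) => { verdict: string; reason: string };

function judgeScore(output: string, judge: Judge): Score {
  const prompt = `Rate whether the following output is concise. Answer "yes" or "no".\nOutput: ${output}`;
  const { verdict, reason } = judge(prompt);
  return { score: verdict, reasoning: reason };
}

// A stub judge so the sketch runs without a model.
// A real judge model is not guaranteed to be deterministic.
const stubJudge: Judge = (prompt) => ({
  verdict: prompt.length < 200 ? 'yes' : 'no',
  reason: 'Stubbed verdict based on prompt length',
});
```

The heuristic function is cheap and repeatable but can only check what you can express in code. The judge-based function can apply fuzzier criteria at the cost of a model call per test case.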

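The ai.defineEvaluator method, which defines an evaluator action in Genkit, supports either approach. This document walks through examples of using this method for both heuristic and LLM-based evaluation.

LLM-based evaluators

An LLM-based evaluator uses an LLM to evaluate the input, context, and output of your generative AI feature.

LLM-based evaluators in Genkit are made up of 3 components:

A prompt
A scoring function
An evaluator action

Define the prompt

In this example, the evaluator uses an LLM to determine whether a food (the output) is delicious. First, provide context to the LLM, then describe what you want it to do, and finally give it a few examples to base its response on.

Genkit's definePrompt utility provides an easy way to define prompts with input and output validation. The following code is an example of setting up an evaluation prompt with definePrompt.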

import { Genkit, z } from 'genkit';

const DELICIOUSNESS_VALUES = ['yes', 'no', 'maybe'] as const;

const DeliciousnessDetectionResponseSchema = z.object({
  reason: z.string(),
  verdict: z.enum(DELICIOUSNESS_VALUES),
});

function getDeliciousnessPrompt(ai: Genkit) {
  return ai.definePrompt({
      name: 'deliciousnessPrompt',
      input: {
        schema: z.object({
          responseToTest: z.string(),
        }),
      },
      output: {
        schema: DeliciousnessDetectionResponseSchema,
      }
    },
    `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict.

    Examples:
    Output: Chicken parm sandwich
    Response: { "reason": "A classic and beloved dish.", "verdict": "yes" }

    Output: Boston Logan Airport tarmac
    Response: { "reason": "Not edible.", "verdict": "no" }

    Output: A juicy piece of gossip
    Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" }

    New Output: {{ responseToTest }}
    Response:
    `
  );
}
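To illustrate what output validation buys you here, the sketch below hand-rolls the same check that `DeliciousnessDetectionResponseSchema` expresses: the response must be an object with a string `reason` and a `verdict` drawn from the allowed values. This is only an illustration of the effect of the check, written without Zod or Genkit; `parseDeliciousnessResponse` is a hypothetical helper, not part of any library.

```typescript
const DELICIOUSNESS_VALUES = ['yes', 'no', 'maybe'] as const;
type Verdict = (typeof DELICIOUSNESS_VALUES)[number];
type DeliciousnessResponse = { reason: string; verdict: Verdict };

// Hypothetical stand-in for schema validation: returns the parsed response,
// or null when the raw text does not match the expected shape.
function parseDeliciousnessResponse(raw: string): DeliciousnessResponse | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof value !== 'object' || value === null) return null;
  const { reason, verdict } = value as Record<string, unknown>;
  if (typeof reason !== 'string') return null;
  if (typeof verdict !== 'string') return null;
  // Reject verdicts outside the allowed set, such as "delicious".
  if ((DELICIOUSNESS_VALUES as readonly string[]).indexOf(verdict) === -1) {
    return null;
  }
  return { reason, verdict: verdict as Verdict };
}
```

Constraining the verdict to a small fixed set keeps downstream scoring simple, because any free-form answer from the model is rejected instead of being silently recorded.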

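Define the scoring function

Define a function that takes an example containing the output that the prompt requires and scores the result. Genkit test cases include input as a required field, with output and context as optional fields. The evaluator is responsible for validating that all fields required for evaluation are present.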

import { ModelArgument, z } from 'genkit';
import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

/**
 * Score an individual test case for deliciousness.
 */
export async function deliciousnessScore<
  CustomModelOptions extends z.ZodTypeAny,
>(
  judgeLlm: ModelArgument<CustomModelOptions>,
  dataPoint: BaseEvalDataPoint,
  judgeConfig?: z.infer<CustomModelOptions>
): Promise<Score> {
  const d = dataPoint;
  // Validate the input has required fields
  if (!d.output) {
    throw new Error('Output is required for Deliciousness detection');
  }

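  // Hydrate the prompt and generate an evaluation result.
  // Note: `ai` is assumed to be a Genkit instance that is in scope here
  // (for example, one created at module level or passed in by the caller).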
  const deliciousnessPrompt = getDeliciousnessPrompt(ai);
  const response = await deliciousnessPrompt(
    {
      responseToTest: d.output as string,
    },
    {
      model: judgeLlm,
      config: judgeConfig,
    }
  );

  // Parse the output
  const parsedResponse = response.output;
  if (!parsedResponse) {
    throw new Error(`Unable to parse evaluator response: ${response.text}`);
  }

  // Return a scored response
  return {
    score: parsedResponse.verdict,
    details: { reasoning: parsedResponse.reason },
  };
}
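The field check at the top of the scoring function can be factored into a small reusable helper when several evaluators need it. The sketch below is self-contained and uses a local `DataPoint` type rather than Genkit's `BaseEvalDataPoint`; `requireFields` and `OptionalField` are hypothetical names.

```typescript
// Hypothetical local type mirroring the fields described above:
// `input` is required; `output` and `context` are optional.
type DataPoint = {
  testCaseId: string;
  input: unknown;
  output?: unknown;
  context?: unknown[];
};

type OptionalField = 'output' | 'context';

// Throws a descriptive error if any field the evaluator depends on is
// undefined or null.
function requireFields(d: DataPoint, fields: OptionalField[]): void {
  const missing = fields.filter((f) => d[f] === undefined || d[f] === null);
  if (missing.length > 0) {
    throw new Error(
      `Test case ${d.testCaseId} is missing required field(s): ${missing.join(', ')}`
    );
  }
}
```

Note one difference from the `!d.output` check used above: this helper lets an empty string through, because an empty output is present even if it is unlikely to score well. Which behavior is preferable depends on the evaluator.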

Define the evaluator action

The final step is to write a function that defines the EvaluatorAction.

import { Genkit, ModelArgument, z } from 'genkit';
import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';

/**
 * Create the Deliciousness evaluator action.
 */
export function createDeliciousnessEvaluator<
  ModelCustomOptions extends z.ZodTypeAny,
>(
  ai: Genkit,
  judge: ModelArgument<ModelCustomOptions>,
  judgeConfig?: z.infer<ModelCustomOptions>
): EvaluatorAction {
  return ai.defineEvaluator(
    {
      name: `myCustomEvals/deliciousnessEvaluator`,
      displayName: 'Deliciousness',
      definition: 'Determines if output is considered delicious.',
      isBilled: true,
    },
    async (datapoint: BaseEvalDataPoint) => {
      const score = await deliciousnessScore(judge, datapoint, judgeConfig);
      return {
        testCaseId: datapoint.testCaseId,
        evaluation: score,
      };
    }
  );
}

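The defineEvaluator method is similar to other Genkit constructors such as defineFlow and defineRetriever. It requires an EvaluatorFn to be provided as a callback. The EvaluatorFn accepts a BaseEvalDataPoint object, which corresponds to a single entry in the dataset under evaluation, along with an optional custom-options parameter if one is specified. The function processes the datapoint and returns an EvalResponse object.

The Zod schemas for BaseEvalDataPoint and EvalResponse are as follows.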

`BaseEvalDataPoint`

export const BaseEvalDataPoint = z.object({
  testCaseId: z.string(),
  input: z.unknown(),
  output: z.unknown().optional(),
  context: z.array(z.unknown()).optional(),
  reference: z.unknown().optional(),
  traceIds: z.array(z.string()).optional(),
});

`EvalResponse`

export const EvalResponse = z.object({
  sampleIndex: z.number().optional(),
  testCaseId: z.string(),
  traceId: z.string().optional(),
  spanId: z.string().optional(),
  evaluation: z.union([ScoreSchema, z.array(ScoreSchema)]),
});

`ScoreSchema`

const ScoreSchema = z.object({
  id: z.string().describe('Optional ID to differentiate multiple scores').optional(),
  score: z.union([z.number(), z.string(), z.boolean()]).optional(),
  error: z.string().optional(),
  details: z
    .object({
      reasoning: z.string().optional(),
    })
    .passthrough()
    .optional(),
});
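As the schemas show, `evaluation` can hold either one score or an array of scores, with the optional `id` distinguishing entries in the array. The sketch below uses plain TypeScript types that mirror the Zod schemas above, so it runs without Genkit; the sample values and the `scoresOf` helper are hypothetical.

```typescript
// Local TypeScript types that mirror the Zod schemas above (illustrative only).
type Score = {
  id?: string;
  score?: number | string | boolean;
  error?: string;
  details?: { reasoning?: string; [key: string]: unknown };
};

type EvalResponse = {
  sampleIndex?: number;
  testCaseId: string;
  traceId?: string;
  spanId?: string;
  evaluation: Score | Score[];
};

// A response carrying a single score.
const single: EvalResponse = {
  testCaseId: 'example_case',
  evaluation: { score: 'yes', details: { reasoning: 'Sounds sweet and juicy.' } },
};

// A response carrying several scores; `id` tells them apart.
const multiple: EvalResponse = {
  testCaseId: 'example_case',
  evaluation: [
    { id: 'verdict', score: 'yes' },
    { id: 'confidence', score: 0.9, details: { reasoning: 'Hypothetical value.' } },
  ],
};

// Normalizes either form to an array so downstream code handles one shape.
function scoresOf(response: EvalResponse): Score[] {
  return Array.isArray(response.evaluation)
    ? response.evaluation
    : [response.evaluation];
}
```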

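The defineEvaluator object lets you provide a name, a user-readable display name, and a definition for the evaluator. The display name and definition are shown alongside the evaluation results in the Dev UI. The object also has an optional isBilled field that marks whether the evaluator can incur charges (for example, because it uses a billed LLM or API). If an evaluator is billed, the user is prompted for confirmation in the CLI before the evaluation can run. This helps guard against unintended expenses.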

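Heuristic evaluators

A heuristic evaluator can be any function used to evaluate the input, context, or output of your generative AI feature.

Heuristic evaluators in Genkit are made up of 2 components:

A scoring function
An evaluator action

Define the scoring function

As with the LLM-based evaluator, define the scoring function. In this case, the scoring function does not need a judge LLM.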

import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

const US_PHONE_REGEX =
  /[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}/i;

/**
 * Scores whether a datapoint output contains a US Phone number.
 */
export async function usPhoneRegexScore(
  dataPoint: BaseEvalDataPoint
): Promise<Score> {
  const d = dataPoint;
  if (!d.output || typeof d.output !== 'string') {
    throw new Error('String output is required for regex matching');
  }
  const matches = US_PHONE_REGEX.test(d.output as string);
  const reasoning = matches
    ? `Output matched US_PHONE_REGEX`
    : `Output did not match US_PHONE_REGEX`;
  return {
    score: matches,
    details: { reasoning },
  };
}
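It helps to know exactly what the pattern accepts before relying on it as a score. The sketch below runs the same regular expression against a few hypothetical sample outputs. Because the pattern has no `g` or `y` flag, `test` keeps no state between calls and is safe to reuse across datapoints.

```typescript
// Same pattern as in the scoring function above.
const US_PHONE_REGEX =
  /[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}/i;

// Hypothetical sample outputs, chosen to show what the pattern accepts.
const samples: string[] = [
  'Call us at (555) 123-4567 today.', // formatted number inside a sentence
  '555.123.4567', // dot separators
  '5551234567', // no separators
  'Order #12345678901234', // long digit run that is not a phone number
  '555-1234', // too few digits
];

const results = samples.map((s) => US_PHONE_REGEX.test(s));
// results: [true, true, true, true, false]
```

The fourth sample matches because the pattern is unanchored: any run of ten digits inside a longer string satisfies it. If that is too permissive for your use case, one option is to add boundaries around the pattern, at the cost of possibly rejecting some valid formats.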

Define the evaluator action

import { Genkit } from 'genkit';
import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';

/**
 * Configures a regex evaluator to match a US phone number.
 */
export function createUSPhoneRegexEvaluator(ai: Genkit): EvaluatorAction {
  return ai.defineEvaluator(
    {
      name: `myCustomEvals/usPhoneRegexEvaluator`,
      displayName: "Regex Match for US PHONE NUMBER",
      definition: "Uses Regex to check if output matches a US phone number",
      isBilled: false,
    },
    async (datapoint: BaseEvalDataPoint) => {
      const score = await usPhoneRegexScore(datapoint);
      return {
        testCaseId: datapoint.testCaseId,
        evaluation: score,
      };
    }
  );
}
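The scoring functions in this document throw when a required field is missing. Since the score schema also has an optional `error` field, another option is to catch the failure and record it on the score, so that one bad test case is reported rather than propagated. The sketch below is a self-contained illustration of that design choice, not a description of how Genkit itself handles errors; `evaluateSafely` and `phoneScore` are hypothetical, and the types are local stand-ins for the schemas shown earlier.

```typescript
// Local types mirroring the schemas in this document (illustrative only).
type DataPoint = { testCaseId: string; input: unknown; output?: unknown };
type Score = {
  score?: number | string | boolean;
  error?: string;
  details?: { reasoning?: string };
};
type EvalResponse = { testCaseId: string; evaluation: Score };

// Hypothetical wrapper: runs a scoring function and records a thrown error
// in the score's `error` field instead of letting it propagate.
function evaluateSafely(
  datapoint: DataPoint,
  scoreFn: (d: DataPoint) => Score
): EvalResponse {
  try {
    return { testCaseId: datapoint.testCaseId, evaluation: scoreFn(datapoint) };
  } catch (e) {
    const message = e instanceof Error ? e.message : String(e);
    return { testCaseId: datapoint.testCaseId, evaluation: { error: message } };
  }
}

// A synchronous, simplified stand-in for the phone-number scoring function.
function phoneScore(d: DataPoint): Score {
  if (typeof d.output !== 'string') {
    throw new Error('String output is required for regex matching');
  }
  const matches = /[0-9]{3}[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}/.test(d.output);
  return { score: matches };
}
```

For example, `evaluateSafely({ testCaseId: 't2', input: 'q' }, phoneScore)` returns a response whose `evaluation.error` holds the thrown message, while a datapoint with a string output returns a normal score.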

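Putting it together

Plugin definition

Plugins are registered with the framework by installing them when Genkit is initialized. To define a new plugin, use the genkitPlugin helper method to instantiate all Genkit actions within the plugin context.

This code sample shows two evaluators: the LLM-based deliciousness evaluator and the regex-based US phone number evaluator. Instantiating these evaluators within the plugin context registers them with the plugin.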

import { Genkit, ModelArgument, z } from 'genkit';
import { GenkitPlugin, genkitPlugin } from 'genkit/plugin';

export function myCustomEvals<
  ModelCustomOptions extends z.ZodTypeAny
>(options: {
  judge: ModelArgument<ModelCustomOptions>;
  judgeConfig?: z.infer<ModelCustomOptions>;
}): GenkitPlugin {
  // Define the new plugin
  return genkitPlugin("myCustomEvals", async (ai: Genkit) => {
    const { judge, judgeConfig } = options;

    // The plugin instantiates our custom evaluators within the context
    // of the `ai` object, making them available
    // throughout our Genkit application.
    createDeliciousnessEvaluator(ai, judge, judgeConfig);
    createUSPhoneRegexEvaluator(ai);
  });
}
export default myCustomEvals;

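Configure Genkit

Add the myCustomEvals plugin to your Genkit configuration.

For evaluation with Gemini, disable safety settings so that the evaluator can accept, detect, and score potentially harmful content.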

import { genkit } from 'genkit';
import { gemini15Pro, googleAI } from '@genkit-ai/googleai';

const ai = genkit({
  plugins: [
    googleAI(),
    ...
    myCustomEvals({
      judge: gemini15Pro,
    }),
  ],
  ...
});
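The configuration above does not show the safety settings mentioned earlier. One possible shape is sketched below, assuming the Gemini model config accepts a `safetySettings` array of `category` and `threshold` pairs. The option names and accepted values depend on the plugin version, so treat them as assumptions and check them against your installed plugin.

```typescript
// Assumed config shape for a Gemini judge; verify against your plugin version.
const judgeConfig = {
  safetySettings: [
    { category: 'HARM_CATEGORY_HATE_SPEECH', threshold: 'BLOCK_NONE' },
    { category: 'HARM_CATEGORY_DANGEROUS_CONTENT', threshold: 'BLOCK_NONE' },
    { category: 'HARM_CATEGORY_HARASSMENT', threshold: 'BLOCK_NONE' },
    { category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT', threshold: 'BLOCK_NONE' },
  ],
};

// Intended use: myCustomEvals({ judge: gemini15Pro, judgeConfig })
```

The plugin forwards `judgeConfig` to the scoring function, which passes it as the `config` of the prompt call, so the relaxed settings apply only to the judge model.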

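Using your custom evaluators

Once you instantiate your custom evaluators within the Genkit app context (either through a plugin or directly), they are ready to use. The following example shows how to try out the deliciousness evaluator with a few sample inputs and outputs.

1. Create a JSON file named `deliciousness_dataset.json` with the following content: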

[
  {
    "testCaseId": "delicious_mango",
    "input": "What is a super delicious fruit",
    "output": "A perfectly ripe mango – sweet, juicy, and with a hint of tropical sunshine."
  },
  {
    "testCaseId": "disgusting_soggy_cereal",
    "input": "What is something that is tasty when fresh but less tasty after some time?",
    "output": "Stale, flavorless cereal that's been sitting in the box too long."
  }
]

2. Use the Genkit CLI to run the evaluator against these test cases.

# Start your genkit runtime
genkit start -- <command to start your app>
genkit eval:run deliciousness_dataset.json --evaluators=myCustomEvals/deliciousnessEvaluator

3. Navigate to `localhost:4000/evaluate` to view your results in the Genkit UI.
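Because the deliciousness evaluator requires an `output` for every test case, it can be useful to check the dataset before running the CLI command in step 2. The sketch below is a hypothetical pre-flight check, not part of Genkit. It parses an inline JSON string so that it is self-contained, and the sample data is deliberately broken to show the reported problems.

```typescript
// Hypothetical pre-flight check for a dataset like the one in step 1.
type TestCase = { testCaseId?: string; input?: unknown; output?: unknown };

function findProblems(cases: TestCase[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  cases.forEach((c, i) => {
    // Fall back to the array position when a test case has no ID.
    const label = c.testCaseId ?? `#${i}`;
    if (c.input === undefined) problems.push(`${label}: missing input`);
    if (typeof c.output !== 'string') {
      problems.push(`${label}: output must be a string`);
    }
    if (c.testCaseId !== undefined) {
      if (seen.has(c.testCaseId)) problems.push(`${label}: duplicate testCaseId`);
      seen.add(c.testCaseId);
    }
  });
  return problems;
}

// Deliberately broken sample: the second entry reuses an ID and has no output.
const dataset: TestCase[] = JSON.parse(`[
  { "testCaseId": "a", "input": "What is a super delicious fruit", "output": "A ripe mango." },
  { "testCaseId": "a", "input": "Name a dessert" }
]`);

const problems = findProblems(dataset);
// problems: ['a: output must be a string', 'a: duplicate testCaseId']
```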

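Note that confidence in a custom evaluator increases as you benchmark it against standard datasets or approaches. Iterate on the results of such benchmarks to improve the evaluator's performance until it reaches the targeted level of quality.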
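One simple way to benchmark an evaluator is to compare its verdicts with human-assigned labels on the same test cases and track the agreement rate over iterations. The sketch below assumes you have collected such labels yourself. The `Labeled` type, both helper functions, and the sample rows are hypothetical.

```typescript
// Hypothetical labeled benchmark: each entry pairs the evaluator's verdict
// with a human-assigned reference label for the same test case.
type Labeled = {
  testCaseId: string;
  evaluatorVerdict: string;
  humanLabel: string;
};

// Fraction of test cases where the evaluator agrees with the human label.
function agreementRate(rows: Labeled[]): number {
  if (rows.length === 0) {
    throw new Error('Benchmark must contain at least one labeled test case');
  }
  const agreed = rows.filter((r) => r.evaluatorVerdict === r.humanLabel).length;
  return agreed / rows.length;
}

// IDs of test cases where the two disagree, for manual review.
function disagreements(rows: Labeled[]): string[] {
  return rows
    .filter((r) => r.evaluatorVerdict !== r.humanLabel)
    .map((r) => r.testCaseId);
}

const benchmark: Labeled[] = [
  { testCaseId: 'case_1', evaluatorVerdict: 'yes', humanLabel: 'yes' },
  { testCaseId: 'case_2', evaluatorVerdict: 'no', humanLabel: 'no' },
  { testCaseId: 'case_3', evaluatorVerdict: 'maybe', humanLabel: 'no' },
  { testCaseId: 'case_4', evaluatorVerdict: 'yes', humanLabel: 'yes' },
];

const rate = agreementRate(benchmark); // 0.75
const toReview = disagreements(benchmark); // ['case_3']
```

Reviewing the disagreeing cases is usually where prompt changes come from, for example by adding a few-shot example that covers the pattern the judge got wrong.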