# Evaluation Criteria

This page outlines the evaluation criteria provided by ADK to assess agent performance, including tool use trajectory, response quality, and safety.
| Criterion | Description | Reference-Based | Requires Rubrics | LLM-as-a-Judge | Supports User Simulation |
|---|---|---|---|---|---|
| tool_trajectory_avg_score | Exact match of tool call trajectory | Yes | No | No | No |
| response_match_score | ROUGE-1 similarity to reference response | Yes | No | No | No |
| final_response_match_v2 | LLM-judged semantic match to reference response | Yes | No | Yes | No |
| rubric_based_final_response_quality_v1 | LLM-judged final response quality based on custom rubrics | No | Yes | Yes | Yes |
| rubric_based_tool_use_quality_v1 | LLM-judged tool usage quality based on custom rubrics | No | Yes | Yes | Yes |
| hallucinations_v1 | LLM-judged groundedness of agent response against context | No | No | Yes | Yes |
| safety_v1 | Safety/harmlessness of agent response | No | No | Yes | Yes |
| per_turn_user_simulator_quality_v1 | LLM-judged user simulator quality | No | No | Yes | Yes |

## tool_trajectory_avg_score

This criterion compares the sequence of tools called by the agent against a list
of expected calls and computes an average score based on one of the match types:
EXACT, IN_ORDER, or ANY_ORDER.

### When To Use This Criterion?

This criterion is ideal for scenarios where agent correctness depends on tool
calls. Depending on how strictly tool calls need to be followed, you can choose
from one of three match types: EXACT, IN_ORDER, and ANY_ORDER.
This metric is particularly valuable for:

- Regression testing: Ensuring that agent updates do not unintentionally alter tool call behavior for established test cases.
- Workflow validation: Verifying that agents correctly follow predefined workflows that require specific API calls in a specific order.
- High-precision tasks: Evaluating tasks where slight deviations in tool parameters or call order can lead to significantly different or incorrect outcomes.

Use EXACT match when you need to enforce a specific tool execution path and consider any deviation, whether in tool name, arguments, or order, as a failure.

Use IN_ORDER match when you want to ensure certain key tool calls occur in a specific order, but allow other tool calls to happen in between. This option is useful when you need to confirm that certain key actions or tool calls happen, and in a certain order, while still leaving room for additional tool calls.

Use ANY_ORDER match when you want to ensure certain key tool calls occur, but do not care about their order, and allow other tool calls to happen in between. This match type is helpful when multiple tool calls about the same concept occur, for example when your agent issues five search queries: you do not care about the order in which the queries are issued, as long as they all occur.

### Details

For each invocation that is being evaluated, this criterion compares the list of tool calls produced by the agent against the list of expected tool calls using one of three match types. If the tool calls match based on the selected match type, a score of 1.0 is awarded for that invocation; otherwise the score is 0.0. The final value is the average of these scores across all invocations in the eval case.

The comparison can be done using one of the following match types:

- EXACT: Requires a perfect match between actual and expected tool calls, with no extra or missing tool calls.
- IN_ORDER: Requires all tool calls from the expected list to be present in the actual list, in the same order, but allows other tool calls to appear in between.
- ANY_ORDER: Requires all tool calls from the expected list to be present in the actual list, in any order, and allows other tool calls to appear in between.
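
As a hypothetical illustration, suppose an invocation expects the trajectory [get_weather_forecast, send_notification] and the agent actually calls [get_weather_forecast, log_event, send_notification]. EXACT scores that invocation 0.0 because of the extra call, while IN_ORDER and ANY_ORDER both score it 1.0, since every expected call is present and in the expected order. If the agent instead called [send_notification, get_weather_forecast] with no extra calls, only ANY_ORDER would score it 1.0.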

### How To Use This Criterion?

By default, tool_trajectory_avg_score uses the EXACT match type. For EXACT match, you can specify just a threshold for this criterion in EvalConfig under the criteria dictionary. The value should be a float between 0.0 and 1.0, which represents the minimum acceptable score for the eval case to pass. If you expect tool trajectories to match exactly in all invocations, you should set the threshold to 1.0.

Example EvalConfig entry for EXACT match:
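A minimal sketch, assuming the shorthand form described above in which the criterion name maps directly to a threshold value:

```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0
  }
}
```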
Or you could specify the match_type explicitly:
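A sketch of the explicit form, assuming the criterion accepts an object with the threshold and match_type fields described in this section:

```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "EXACT"
    }
  }
}
```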
If you want to use the IN_ORDER or ANY_ORDER match type, you can specify it via the match_type field along with the threshold.

Example EvalConfig entry for IN_ORDER match:
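A sketch under the same assumptions, with match_type set to IN_ORDER:

```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "IN_ORDER"
    }
  }
}
```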
Example EvalConfig entry for ANY_ORDER match:
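A sketch under the same assumptions, with match_type set to ANY_ORDER:

```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "ANY_ORDER"
    }
  }
}
```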

### Output And How To Interpret

The output is a score between 0.0 and 1.0, where 1.0 indicates a perfect match between actual and expected tool trajectories for all invocations, and 0.0 indicates a complete mismatch for all invocations. Higher scores are better. A score below 1.0 means that for at least one invocation, the agent's tool call trajectory deviated from the expected one.

## response_match_score

This criterion evaluates whether the agent's final response matches a golden/expected final response, using ROUGE-1.

### When To Use This Criterion?

Use this criterion when you need a quantitative measure of how closely the agent's output matches the expected output in terms of content overlap.

### Details

ROUGE-1 specifically measures the overlap of unigrams (single words) between the system-generated text (candidate summary) and a reference text. It essentially checks how many individual words from the reference text are present in the candidate text. For example, if the reference is "the weather is sunny today" and the candidate is "today the weather looks sunny", four of the five reference unigrams appear in the candidate, yielding a ROUGE-1 recall of 0.8. To learn more, see details on ROUGE-1.

### How To Use This Criterion?

You can specify a threshold for this criterion in EvalConfig under the criteria dictionary. The value should be a float between 0.0 and 1.0, which represents the minimum acceptable score for the eval case to pass.

Example EvalConfig entry:
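A minimal sketch, assuming the shorthand form in which the criterion name maps directly to a threshold value (0.8 here is only an illustrative choice):

```json
{
  "criteria": {
    "response_match_score": 0.8
  }
}
```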

### Output And How To Interpret

The value range for this criterion is [0, 1], with values closer to 1 being more desirable.

## final_response_match_v2

This criterion evaluates whether the agent's final response matches a golden/expected final response, using an LLM as a judge.

### When To Use This Criterion?

Use this criterion when you need to evaluate the correctness of an agent's final
response against a reference, but require flexibility in how the answer is
presented. It is suitable for cases where different phrasings or formats are
acceptable, as long as the core meaning and information match the reference.
This criterion is a good choice for evaluating question-answering,
summarization, or other generative tasks where semantic equivalence is more
important than exact lexical overlap, making it a more sophisticated alternative
to response_match_score.

### Details

This criterion uses a Large Language Model (LLM) as a judge to determine if the
agent's final response is semantically equivalent to the provided reference
response. It is designed to be more flexible than lexical matching metrics (like
response_match_score), as it focuses on whether the agent's response contains
the correct information, while tolerating differences in formatting, phrasing,
or inclusion of additional correct details.
For each invocation, the criterion prompts a judge LLM to rate the agent's
response as "valid" or "invalid" compared to the reference. This is repeated
multiple times for robustness (configurable via num_samples), and a majority
vote determines if the invocation receives a score of 1.0 (valid) or 0.0
(invalid). The final criterion score is the fraction of invocations deemed valid
across the entire eval case.

### How To Use This Criterion?

This criterion uses LlmAsAJudgeCriterion, allowing you to configure the
evaluation threshold, judge model, and number of samples per invocation.
Example EvalConfig entry:

```json
{
  "criteria": {
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      }
    }
  }
}
```

### Output And How To Interpret

The criterion returns a score between 0.0 and 1.0. A score of 1.0 means the LLM judge considered the agent's final response to be valid for all invocations, while a score closer to 0.0 indicates that many responses were judged as invalid when compared to the reference responses. Higher values are better.

## rubric_based_final_response_quality_v1

This criterion assesses the quality of an agent's final response against a user-defined set of rubrics using LLM as a judge.

### When To Use This Criterion?

Use this criterion when you need to evaluate aspects of response quality that go beyond simple correctness or semantic equivalence with a reference. It is ideal for assessing nuanced attributes like tone, style, helpfulness, or adherence to specific conversational guidelines defined in your rubrics. This criterion is particularly useful when no single reference response exists, or when quality depends on multiple subjective factors.

### Details

This criterion provides a flexible way to evaluate response quality based on specific criteria that you define as rubrics. For example, you could define rubrics to check if a response is concise, if it correctly infers user intent, or if it avoids jargon.
The criterion uses an LLM-as-a-judge to evaluate the agent's final response
against each rubric, producing a yes (1.0) or no (0.0) verdict for each.
Like other LLM-based metrics, it samples the judge model multiple times per
invocation and uses a majority vote to determine the score for each rubric in
that invocation. The overall score for an invocation is the average of its
rubric scores. The final criterion score for the eval case is the average of
these overall scores across all invocations.

### How To Use This Criterion?

This criterion uses RubricsBasedCriterion, which requires a list of rubrics to
be provided in the EvalConfig. Each rubric should be defined with a unique ID
and its content.
Example EvalConfig entry:

```json
{
  "criteria": {
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      },
      "rubrics": [
        {
          "rubric_id": "conciseness",
          "rubric_content": {
            "text_property": "The agent's response is direct and to the point."
          }
        },
        {
          "rubric_id": "intent_inference",
          "rubric_content": {
            "text_property": "The agent's response accurately infers the user's underlying goal from ambiguous queries."
          }
        }
      ]
    }
  }
}
```

### Output And How To Interpret

The criterion outputs an overall score between 0.0 and 1.0, where 1.0 indicates that the agent's responses satisfied all rubrics across all invocations, and 0.0 indicates that no rubrics were satisfied. The results also include detailed per-rubric scores for each invocation. Higher values are better.

## rubric_based_tool_use_quality_v1

This criterion assesses the quality of an agent's tool usage against a user-defined set of rubrics using LLM as a judge.

### When To Use This Criterion?

Use this criterion when you need to evaluate how an agent uses tools, rather than just if the final response is correct. It is ideal for assessing whether the agent selected the right tool, used the correct parameters, or followed a specific sequence of tool calls. This is useful for validating agent reasoning processes, debugging tool-use errors, and ensuring adherence to prescribed workflows, especially in cases where multiple tool-use paths could lead to a similar final answer but only one path is considered correct.

### Details

This criterion provides a flexible way to evaluate tool usage based on specific rules that you define as rubrics. For example, you could define rubrics to check if a specific tool was called, if its parameters were correct, or if tools were called in a particular order.
The criterion uses an LLM-as-a-judge to evaluate the agent's tool calls and
responses against each rubric, producing a yes (1.0) or no (0.0) verdict for
each. Like other LLM-based metrics, it samples the judge model multiple times
per invocation and uses a majority vote to determine the score for each rubric
in that invocation. The overall score for an invocation is the average of its
rubric scores. The final criterion score for the eval case is the average of
these overall scores across all invocations.

### How To Use This Criterion?

This criterion uses RubricsBasedCriterion, which requires a list of rubrics to
be provided in the EvalConfig. Each rubric should be defined with a unique ID
and its content, describing a specific aspect of tool use to evaluate.
Example EvalConfig entry:

```json
{
  "criteria": {
    "rubric_based_tool_use_quality_v1": {
      "threshold": 1.0,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      },
      "rubrics": [
        {
          "rubric_id": "geocoding_called",
          "rubric_content": {
            "text_property": "The agent calls the GeoCoding tool before calling the GetWeather tool."
          }
        },
        {
          "rubric_id": "getweather_called",
          "rubric_content": {
            "text_property": "The agent calls the GetWeather tool with coordinates derived from the user's location."
          }
        }
      ]
    }
  }
}
```

### Output And How To Interpret

The criterion outputs an overall score between 0.0 and 1.0, where 1.0 indicates that the agent's tool usage satisfied all rubrics across all invocations, and 0.0 indicates that no rubrics were satisfied. The results also include detailed per-rubric scores for each invocation. Higher values are better.

## hallucinations_v1

This criterion assesses whether a model response contains any false, contradictory, or unsupported claims.

### When To Use This Criterion?

Use this criterion to ensure that the agent's response is grounded in the provided context (e.g., tool outputs, user query, instructions) and does not contain hallucinations.

### Details

This criterion assesses whether a model response contains any false, contradictory, or unsupported claims, based on context that includes developer instructions, the user prompt, tool definitions, and tool invocations and their results. It uses LLM-as-a-judge and follows a two-step process:

- Segmenter: Segments the agent response into individual sentences.
- Sentence Validator: Evaluates each segmented sentence against the provided context for grounding. Each sentence is labeled as supported, unsupported, contradictory, disputed, or not_applicable.

The metric computes an Accuracy Score: the percentage of sentences that are supported or not_applicable. For example, if 8 of 10 sentences in a response are labeled supported or not_applicable, the score is 0.8. By default, only the final response is evaluated. If evaluate_intermediate_nl_responses is set to true in the criterion, intermediate natural language responses from agents are also evaluated.

### How To Use This Criterion?

This criterion uses HallucinationsCriterion, allowing you to configure the
evaluation threshold, judge model, number of samples per invocation and
whether to evaluate intermediate natural language responses.
Example EvalConfig entry:

```json
{
  "criteria": {
    "hallucinations_v1": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash"
      },
      "evaluate_intermediate_nl_responses": true
    }
  }
}
```

### Output And How To Interpret

The criterion returns a score between 0.0 and 1.0. A score of 1.0 means all sentences in the agent's response are grounded in the context, while a score closer to 0.0 indicates that many sentences are false, contradictory, or unsupported. Higher values are better.

## safety_v1

This criterion evaluates the safety (harmlessness) of an agent's response.

### When To Use This Criterion?

This criterion should be used when you need to ensure that agent responses comply with safety guidelines and do not produce harmful or inappropriate content. It is essential for user-facing applications or any system where response safety is a priority.

### Details

This criterion assesses whether the agent's response contains any harmful
content, such as hate speech, harassment, or dangerous information. Unlike other
metrics implemented natively within ADK, safety_v1 delegates the evaluation to
the Vertex AI General AI Eval SDK.

### How To Use This Criterion?

Using this criterion requires a Google Cloud Project. You must have the GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_LOCATION environment variables set, typically in a .env file in your agent's directory, for the Vertex AI SDK to function correctly.

You can specify a threshold for this criterion in EvalConfig under the criteria dictionary. The value should be a float between 0.0 and 1.0, representing the minimum safety score for a response to be considered passing.

Example EvalConfig entry:
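A minimal sketch, assuming the shorthand form in which the criterion name maps directly to a threshold value (0.8 here is only an illustrative choice):

```json
{
  "criteria": {
    "safety_v1": 0.8
  }
}
```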

### Output And How To Interpret

The criterion returns a score between 0.0 and 1.0. Scores closer to 1.0 indicate that the response is safe, while scores closer to 0.0 indicate potential safety issues.

## per_turn_user_simulator_quality_v1

This criterion evaluates whether a user simulator is faithful to a conversation plan.

### When To Use This Criterion?

Use this criterion when you need to evaluate a user simulator in a multi-turn
conversation. It is designed to assess whether the simulator follows
the conversation plan defined in ConversationScenario.

### Details

This criterion determines whether a user simulator follows a defined
ConversationScenario in a multi-turn conversation.
For the first turn, this criterion checks if the user simulator response matches
the starting_prompt in ConversationScenario. For subsequent turns, it uses
LLM-as-a-judge to evaluate if the user response follows the conversation_plan
in ConversationScenario.

### How To Use This Criterion?

This criterion allows you to configure the evaluation threshold, judge model
and number of samples per invocation. The criterion also lets you specify a
stop_signal, which signals the LLM judge that the conversation was completed.
For best results, use the stop signal in LlmBackedUserSimulator.
Example EvalConfig entry:

```json
{
  "criteria": {
    "per_turn_user_simulator_quality_v1": {
      "threshold": 1.0,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      },
      "stop_signal": "</finished>"
    }
  }
}
```

### Output And How To Interpret

The criterion returns a score between 0.0 and 1.0, representing the fraction of turns in which the user simulator's response was judged to be valid according to the conversation scenario. A score of 1.0 indicates that the simulator behaved as expected in all turns, while a score closer to 0.0 indicates that the simulator deviated in many turns. Higher values are better.