Amazon Bedrock Model Evaluation LLM-as-a-judge is now generally available

This article is a machine translation of an English article published on March 20, 2025, in What’s New with AWS? on the official AWS website.

News

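The LLM-as-a-judge capability of Amazon Bedrock Model Evaluation is now generally available. Amazon Bedrock Model Evaluation lets you evaluate, compare, and select the right model for your use case. You can pick your judge from several LLMs available on Bedrock, so that the evaluator model and the model being evaluated are an appropriate pairing. You can select quality metrics such as correctness, completeness, and professional style and tone, as well as responsible AI metrics such as harmfulness and answer refusal. You can evaluate all models available on Amazon Bedrock, including serverless models, Bedrock Marketplace models compatible with the Converse API, customized and distilled models, imported models, and model routers. You can also compare results across evaluation jobs.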

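Brand new: more flexibility! You can now evaluate a model or system hosted anywhere by including inference responses you have already collected in the input prompt dataset for the evaluation job ("bring your own inference responses"). These responses can come from an Amazon Bedrock model or from any model or application hosted outside Amazon Bedrock. As a result, the evaluation job does not need to call an Amazon Bedrock model, and the final responses you submit can reflect all the intermediate steps of your application.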
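The "bring your own inference responses" workflow above comes down to preparing a dataset in which each prompt already carries its response. The following is a minimal sketch of that preparation step, not the official tooling. The field names (`prompt`, `referenceResponse`, `modelResponses`, `response`, `modelIdentifier`) are assumptions about the JSON Lines schema and should be checked against the current Amazon Bedrock documentation; the helper names `build_byoi_record` and `to_jsonl` are hypothetical.

```python
import json
from typing import Iterable, Optional


def build_byoi_record(
    prompt: str,
    response: str,
    model_identifier: str,
    reference_response: Optional[str] = None,
) -> dict:
    """Build one dataset record that carries a precomputed response.

    The key names below are assumed, not confirmed by the announcement.
    """
    if not prompt or not response:
        raise ValueError("prompt and response must be non-empty")
    record = {
        "prompt": prompt,
        # The response was produced elsewhere (any model or application),
        # so the evaluation job would not need to invoke a Bedrock model.
        "modelResponses": [
            {"response": response, "modelIdentifier": model_identifier}
        ],
    }
    if reference_response is not None:
        # Optional ground-truth answer for metrics that compare against one.
        record["referenceResponse"] = reference_response
    return record


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Example: one response collected from a hypothetical external application.
rows = [
    build_byoi_record(
        prompt="What is the capital of France?",
        response="Paris.",
        model_identifier="my-external-app",  # hypothetical label
        reference_response="Paris",
    )
]
print(to_jsonl(rows), end="")
```

The resulting text would typically be uploaded to an S3 location that the evaluation job can read. Because the response is whatever your application finally returned, any retrieval, tool use, or post-processing steps are already reflected in it.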

With LLM-as-a-judge, you can get human-like evaluation quality at lower cost, while saving weeks of time.

To learn more, see the Amazon Bedrock Evaluations page and the documentation. To get started, sign in to the AWS Console or use the Amazon Bedrock APIs.
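For readers starting from the API rather than the console, the sketch below assembles a request for an LLM-as-a-judge evaluation job over precomputed responses. It only builds the payload and makes no AWS call. The structure is an assumption modeled on the boto3 `create_evaluation_job` operation of the `bedrock` client, and the metric identifiers (`Builtin.Correctness` and so on) are likewise assumed; verify both against the API reference before use. The function name, dataset name, ARN, bucket paths, and model identifiers are all hypothetical placeholders.

```python
from typing import List, Optional

# Assumed identifiers for the metrics named in the announcement.
QUALITY_METRICS = [
    "Builtin.Correctness",
    "Builtin.Completeness",
    "Builtin.ProfessionalStyleAndTone",
]
RESPONSIBLE_AI_METRICS = ["Builtin.Harmfulness", "Builtin.Refusal"]


def build_judge_job_request(
    job_name: str,
    role_arn: str,
    dataset_s3_uri: str,
    output_s3_uri: str,
    judge_model_id: str,
    inference_source_id: str,
    metric_names: Optional[List[str]] = None,
) -> dict:
    """Assemble keyword arguments for an evaluation job (assumed shape)."""
    if metric_names is None:
        metric_names = QUALITY_METRICS + RESPONSIBLE_AI_METRICS
    return {
        "jobName": job_name,
        "roleArn": role_arn,
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [
                    {
                        "taskType": "General",
                        "dataset": {
                            "name": "byoi-dataset",  # hypothetical name
                            "datasetLocation": {"s3Uri": dataset_s3_uri},
                        },
                        "metricNames": list(metric_names),
                    }
                ],
                # The LLM that acts as the judge.
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [
                        {"modelIdentifier": judge_model_id}
                    ]
                },
            }
        },
        # Precomputed responses: no Bedrock model is invoked for inference.
        # Assumed to match the identifier stored with each dataset record.
        "inferenceConfig": {
            "models": [
                {
                    "precomputedInferenceSource": {
                        "inferenceSourceIdentifier": inference_source_id
                    }
                }
            ]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }


request = build_judge_job_request(
    job_name="judge-eval-demo",
    role_arn="arn:aws:iam::123456789012:role/ExampleEvalRole",  # placeholder
    dataset_s3_uri="s3://example-bucket/input/dataset.jsonl",  # placeholder
    output_s3_uri="s3://example-bucket/output/",  # placeholder
    judge_model_id="JUDGE_MODEL_ID",  # placeholder, not a real model ID
    inference_source_id="my-external-app",  # hypothetical label
)
```

If the shape matches the current API, the dictionary could be passed as `boto3.client("bedrock").create_evaluation_job(**request)`. Keeping payload construction separate from the call makes it easy to review or unit-test the configuration without AWS credentials.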

Original text

Amazon Bedrock Model Evaluation’s LLM-as-a-judge capability is now generally available. Amazon Bedrock Model Evaluation allows you to evaluate, compare, and select the right models for your use case. You can choose an LLM as your judge from several available on Bedrock to ensure you have the right combination of evaluator models and models being evaluated. You can select quality metrics such as correctness, completeness, and professional style and tone, as well as responsible AI metrics such as harmfulness and answer refusal. You can evaluate all available models on Amazon Bedrock, including serverless models, Bedrock Marketplace models compatible with Converse API, customized and distilled models, imported models, and model routers. You can also compare results across evaluation jobs.

*Brand new – more flexibility!* Today, you can evaluate any model or system hosted anywhere by bringing your own inference responses you already fetched into your input prompt dataset for the evaluation job (“bring your own inference responses”). These responses can be from an Amazon Bedrock model or from any model or application hosted outside of Amazon Bedrock, enabling you to bypass calling an Amazon Bedrock model in the evaluation job, and allowing you to incorporate all the intermediate steps of your application into your final responses.

With LLM-as-a-judge, you can get human-like evaluation quality at lower cost, while saving weeks of time.

To learn more, visit the Amazon Bedrock Evaluations page and documentation. To get started, sign in to the AWS Console or use Amazon Bedrock APIs.

Source: Amazon Bedrock Model Evaluation LLM-as-a-judge is now generally available