VIEScore

`VIEScore` assesses the performance of image generation and editing tasks by evaluating the quality of synthesized images based on semantic consistency and perceptual quality. `deepeval`'s `VIEScore` metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.
With GPT-4v as the evaluation model, `VIEScore` achieves scores comparable to human ratings in text-to-image generation tasks, and it is especially good at detecting undesirable artifacts.
Required Arguments
To use the `VIEScore` metric, you'll have to provide the following arguments when creating an `MLLMTestCase`:

- `input`
- `actual_output`
Example
```python
from deepeval import evaluate
from deepeval.metrics import VIEScore, VIEScoreTask
from deepeval.types import Image
from deepeval.test_case import MLLMTestCase

# Replace this with the actual output of your MLLM application
actual_output = [Image(url="https://shoe-images.com/edited-shoes", local=False)]

metric = VIEScore(
    threshold=0.7,
    model="gpt-4o",
    include_reason=True,
    task=VIEScoreTask.TEXT_TO_IMAGE_EDITING
)
test_case = MLLMTestCase(
    input=["Change the color of the shoes to blue.", Image(url="./shoes.png", local=True)],
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])
```
There are seven optional parameters when creating a `VIEScore` (see the configuration sketch after this list):
- [Optional] `threshold`: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, OR any custom MLLM model of type `DeepEvalBaseMLLM`. Defaulted to 'gpt-4o'.
- [Optional] `include_reason`: a boolean which when set to `True`, will include a reason for its evaluation score. Defaulted to `True`.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables concurrent execution within the `measure()` method. Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to `False`.
- [Optional] `task`: a `VIEScoreTask` enum indicating whether the task is image generation or image editing. Defaulted to `VIEScoreTask.TEXT_TO_IMAGE_GENERATION`.
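As a minimal configuration sketch (the parameter values below are illustrative choices, not recommendations):

```python
from deepeval.metrics import VIEScore, VIEScoreTask

# Illustrative configuration: strict_mode forces a binary 0/1 score
# (and overrides the threshold), while verbose_mode prints the
# intermediate SC and PQ evaluation steps to the console.
metric = VIEScore(
    model="gpt-4o",
    include_reason=True,
    strict_mode=True,
    verbose_mode=True,
    task=VIEScoreTask.TEXT_TO_IMAGE_EDITING
)
```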
`VIEScoreTask` is an enumeration that includes two types of tasks (see the sketch after this list):

- `TEXT_TO_IMAGE_GENERATION`: the input should contain exactly 0 images, and the output should contain exactly 1 image.
- `TEXT_TO_IMAGE_EDITING`: both the input and the output should each contain exactly 1 image.
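As a sketch of the expected test case shape for each task (the image URLs below are placeholders):

```python
from deepeval.test_case import MLLMTestCase
from deepeval.types import Image

# TEXT_TO_IMAGE_GENERATION: 0 images in the input, exactly 1 in the output
generation_case = MLLMTestCase(
    input=["A photo of blue running shoes on a white background."],
    actual_output=[Image(url="https://example.com/generated-shoes.png", local=False)]
)

# TEXT_TO_IMAGE_EDITING: exactly 1 image in the input and 1 in the output
editing_case = MLLMTestCase(
    input=["Change the color of the shoes to blue.", Image(url="./shoes.png", local=True)],
    actual_output=[Image(url="https://example.com/edited-shoes.png", local=False)]
)
```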
How Is It Calculated?
The `VIEScore` is calculated according to the following equation:

$$\text{VIEScore} = \sqrt{\min(\text{SC scores}) \times \min(\text{PQ scores})}$$

The `VIEScore` combines Semantic Consistency (SC) and Perceptual Quality (PQ) sub-scores to provide a comprehensive evaluation of the synthesized image. The final overall score is derived by taking the square root of the product of the minimum SC score and the minimum PQ score.
1. SC Scores: These scores assess aspects such as alignment with the prompt and resemblance to concepts. The minimum value among these sub-scores represents the SC score. During the SC evaluation, both the input conditions and the synthesized image are used.
2. PQ Scores: These scores evaluate the naturalness and absence of artifacts in the image. The minimum value among these sub-scores represents the PQ score. For the PQ evaluation, only the synthesized image is used to prevent confusion from the input conditions.
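Putting these together, here is a minimal sketch of the final aggregation step (the sub-score values and the 0-to-10 scale are assumptions for illustration; in practice the sub-scores come from the evaluation model):

```python
import math

# Hypothetical SC sub-scores (e.g. prompt alignment, concept resemblance)
sc_scores = [8, 6]
# Hypothetical PQ sub-scores (e.g. naturalness, absence of artifacts)
pq_scores = [9, 7]

# Overall score: square root of the product of the minimum SC score
# and the minimum PQ score.
overall = math.sqrt(min(sc_scores) * min(pq_scores))
print(overall)  # ~6.48 on the assumed 0-10 scale
```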