
Core

The core module provides foundational utilities for configuration, LLM management, and grading logic.


EvaluatorConfig

truthfulness_evaluator.core.config.EvaluatorConfig

Bases: BaseSettings

Configuration for the truthfulness evaluator.

Source code in src/truthfulness_evaluator/core/config.py
class EvaluatorConfig(BaseSettings):
    """Configuration for the truthfulness evaluator."""

    model_config = SettingsConfigDict(
        env_prefix="TRUTH_",
        env_file=".env",
        env_file_encoding="utf-8",
    )

    # Model configuration
    extraction_model: str = "gpt-4o-mini"
    verification_models: list[str] = ["gpt-4o", "claude-sonnet-4-5"]
    consensus_method: Literal["simple", "weighted", "ice"] = "weighted"
    confidence_threshold: float = 0.7

    # Search configuration
    enable_web_search: bool = True
    enable_filesystem_search: bool = True
    max_evidence_items: int = 5

    # ICE configuration
    ice_max_rounds: int = 3

    # Output configuration
    output_format: Literal["json", "markdown"] = "json"
    include_explanations: bool = True
    include_model_votes: bool = True

    # Human-in-the-loop
    enable_human_review: bool = False
    human_review_threshold: float = 0.6

    # API Keys (loaded from environment)
    openai_api_key: str = ""
    anthropic_api_key: str = ""

    def model_post_init(self, __context: Any) -> None:
        """Fallback to standard env vars if TRUTH_ prefix not used."""
        if not self.openai_api_key:
            self.openai_api_key = os.getenv("OPENAI_API_KEY", "")
        if not self.anthropic_api_key:
            self.anthropic_api_key = os.getenv("ANTHROPIC_API_KEY", "")

model_post_init(__context)

Fallback to standard env vars if TRUTH_ prefix not used.

Source code in src/truthfulness_evaluator/core/config.py
def model_post_init(self, __context: Any) -> None:
    """Fallback to standard env vars if TRUTH_ prefix not used."""
    if not self.openai_api_key:
        self.openai_api_key = os.getenv("OPENAI_API_KEY", "")
    if not self.anthropic_api_key:
        self.anthropic_api_key = os.getenv("ANTHROPIC_API_KEY", "")

Usage Example:

from truthfulness_evaluator.core.config import EvaluatorConfig

config = EvaluatorConfig(
    extraction_model="gpt-4o-mini",
    verification_models=["gpt-4o", "claude-sonnet-4-5"],
    consensus_method="weighted",
    confidence_threshold=0.7,
    enable_web_search=True,
    max_evidence_items=5,
)

Environment Variables:

All configuration can be set via environment variables with the TRUTH_ prefix:

export TRUTH_EXTRACTION_MODEL=gpt-4o-mini
export TRUTH_CONSENSUS_METHOD=weighted
export TRUTH_CONFIDENCE_THRESHOLD=0.7
export TRUTH_ENABLE_WEB_SEARCH=true
export TRUTH_MAX_EVIDENCE_ITEMS=5
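
API keys follow the same prefix convention (TRUTH_OPENAI_API_KEY, TRUTH_ANTHROPIC_API_KEY), but model_post_init falls back to the standard variables when the prefixed ones are empty. A minimal sketch of that fallback, assuming no TRUTH_-prefixed key is set (the key value is a placeholder):

import os

from truthfulness_evaluator.core.config import EvaluatorConfig

# Only the standard variable is set; the fallback in model_post_init picks it up.
os.environ["OPENAI_API_KEY"] = "sk-placeholder"

config = EvaluatorConfig()
assert config.openai_api_key == "sk-placeholder"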

create_chat_model

truthfulness_evaluator.llm.factory.create_chat_model(model_name, temperature=0, **kwargs)

Create a chat model instance from a model name string.

Centralizes the OpenAI vs. Anthropic routing logic that was previously scattered across 5+ files.

Parameters:

  • model_name (str, required): Model identifier (e.g., "gpt-4o", "claude-sonnet-4-5").
  • temperature (float, default 0): Sampling temperature.
  • **kwargs (Any): Additional model-specific parameters. Pass base_url for OpenAI-compatible providers (Ollama, vLLM, etc.).

Returns:

  • BaseChatModel: Configured BaseChatModel instance.

Raises:

  • ValueError: If provider cannot be determined from model name.

Source code in src/truthfulness_evaluator/llm/factory.py
def create_chat_model(
    model_name: str,
    temperature: float = 0,
    **kwargs: Any,
) -> BaseChatModel:
    """Create a chat model instance from a model name string.

    Centralizes the OpenAI vs. Anthropic routing logic that was
    previously scattered across 5+ files.

    Args:
        model_name: Model identifier (e.g., "gpt-4o", "claude-sonnet-4-5").
        temperature: Sampling temperature.
        **kwargs: Additional model-specific parameters. Pass ``base_url``
            for OpenAI-compatible providers (Ollama, vLLM, etc.).

    Returns:
        Configured BaseChatModel instance.

    Raises:
        ValueError: If provider cannot be determined from model name.
    """
    provider = _detect_provider(model_name, **kwargs)
    logger.debug("Creating %s model: %s", provider, model_name)

    if provider == "anthropic":
        from langchain_anthropic import ChatAnthropic

        return ChatAnthropic(model=model_name, temperature=temperature, **kwargs)

    from langchain_openai import ChatOpenAI

    return ChatOpenAI(model=model_name, temperature=temperature, **kwargs)

Usage Example:

from truthfulness_evaluator.llm.factory import create_chat_model

# Provider is detected from the model name
llm = create_chat_model("gpt-4o")

# Anthropic model with a non-default temperature
llm = create_chat_model("claude-sonnet-4-5", temperature=0.2)

Usage Note: This is the centralized LLM factory function. All chains and adapters use this function to instantiate LLM instances, ensuring consistent configuration and provider routing.
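
Because **kwargs is forwarded to the underlying client, base_url can point the factory at an OpenAI-compatible server. A minimal sketch, assuming a locally hosted endpoint (the URL and model name below are placeholders, not part of the library):

from truthfulness_evaluator.llm.factory import create_chat_model

# Routes to ChatOpenAI with a custom endpoint for an OpenAI-compatible server.
llm = create_chat_model(
    "llama3.1",                            # placeholder local model name
    temperature=0,
    base_url="http://localhost:11434/v1",  # placeholder endpoint (e.g., Ollama)
)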


Grading

The grading module provides utilities for calculating truthfulness grades and building final reports.

truthfulness_evaluator.core.grading

Grading and summary logic for truthfulness reports.

build_report(source_document, claims, verifications, *, confidence_threshold=0.7, grade=None, summary=None)

Build a complete TruthfulnessReport with computed fields.

This is the primary way to construct a report. It computes grade, statistics, confidence, and summary from the raw data.

Parameters:

  • source_document (str, required): Path/URL of source document.
  • claims (list[Claim], required): Extracted claims.
  • verifications (list[VerificationResult], required): Verification results.
  • confidence_threshold (float, default 0.7): Threshold for considering claims verified.
  • grade (str | None, default None): Override grade (if None, computed from verifications).
  • summary (str | None, default None): Override summary (if None, generated automatically).

Returns:

  • TruthfulnessReport: TruthfulnessReport instance.

Source code in src/truthfulness_evaluator/core/grading.py
def build_report(
    source_document: str,
    claims: list[Claim],
    verifications: list[VerificationResult],
    *,
    confidence_threshold: float = 0.7,
    grade: str | None = None,
    summary: str | None = None,
) -> TruthfulnessReport:
    """Build a complete TruthfulnessReport with computed fields.

    This is the primary way to construct a report. It computes
    grade, statistics, confidence, and summary from the raw data.

    Args:
        source_document: Path/URL of source document.
        claims: Extracted claims.
        verifications: Verification results.
        confidence_threshold: Threshold for considering claims verified.
        grade: Override grade (if None, computed from verifications).
        summary: Override summary (if None, generated automatically).

    Returns:
        TruthfulnessReport instance.
    """
    verified_ids = {v.claim_id for v in verifications}
    unvalidated = [c for c in claims if c.id not in verified_ids]

    stats = calculate_statistics(claims, verifications)
    computed_grade = grade or calculate_grade(verifications, confidence_threshold)
    overall_confidence = (
        sum(v.confidence for v in verifications) / len(verifications) if verifications else 0.0
    )
    computed_summary = summary or generate_summary(computed_grade, stats)

    return TruthfulnessReport(
        source_document=source_document,
        overall_grade=computed_grade,
        overall_confidence=overall_confidence,
        summary=computed_summary,
        claims=claims,
        verifications=verifications,
        unvalidated_claims=unvalidated,
        statistics=stats,
    )

calculate_grade(verifications, confidence_threshold=0.7)

Calculate letter grade from verification results.

Parameters:

  • verifications (list[VerificationResult], required): List of verification results.
  • confidence_threshold (float, default 0.7): Minimum confidence to consider a claim verified.

Returns:

  • str: Letter grade string (A+ through F).

Source code in src/truthfulness_evaluator/core/grading.py
def calculate_grade(
    verifications: list[VerificationResult],
    confidence_threshold: float = 0.7,
) -> str:
    """Calculate letter grade from verification results.

    Args:
        verifications: List of verification results.
        confidence_threshold: Minimum confidence to consider a claim verified.

    Returns:
        Letter grade string (A+ through F).
    """
    if not verifications:
        return "F"

    verified = [v for v in verifications if is_verified(v, confidence_threshold)]
    if not verified:
        return "F"

    support_ratio = sum(1 for v in verified if v.verdict == "SUPPORTS") / len(verified)
    confidence = sum(v.confidence for v in verified) / len(verified)

    score = round(support_ratio * confidence, 10)

    if score >= 0.9:
        return "A+"
    elif score >= 0.85:
        return "A"
    elif score >= 0.8:
        return "A-"
    elif score >= 0.75:
        return "B+"
    elif score >= 0.7:
        return "B"
    elif score >= 0.65:
        return "B-"
    elif score >= 0.6:
        return "C+"
    elif score >= 0.55:
        return "C"
    elif score >= 0.5:
        return "C-"
    elif score >= 0.4:
        return "D"
    else:
        return "F"

calculate_statistics(claims, verifications)

Calculate statistics from claims and verifications.

Parameters:

  • claims (list[Claim], required): List of claims.
  • verifications (list[VerificationResult], required): List of verification results.

Returns:

  • TruthfulnessStatistics: TruthfulnessStatistics instance.

Source code in src/truthfulness_evaluator/core/grading.py
def calculate_statistics(
    claims: list[Claim],
    verifications: list[VerificationResult],
) -> TruthfulnessStatistics:
    """Calculate statistics from claims and verifications.

    Args:
        claims: List of claims.
        verifications: List of verification results.

    Returns:
        TruthfulnessStatistics instance.
    """
    total = len(claims)
    supported = sum(1 for v in verifications if v.verdict == "SUPPORTS")
    refuted = sum(1 for v in verifications if v.verdict == "REFUTES")
    not_enough_info = sum(1 for v in verifications if v.verdict == "NOT_ENOUGH_INFO")
    unverifiable = sum(1 for v in verifications if v.verdict == "UNVERIFIABLE")

    verified_count = supported + refuted
    verification_rate = (verified_count / total) if total > 0 else 0.0
    accuracy_score = (supported / verified_count) if verified_count > 0 else 0.0

    return TruthfulnessStatistics(
        total_claims=total,
        supported=supported,
        refuted=refuted,
        not_enough_info=not_enough_info,
        unverifiable=unverifiable,
        verification_rate=verification_rate,
        accuracy_score=accuracy_score,
    )
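
A worked example of the derived rates (counts chosen for illustration):

total_claims = 10
supported, refuted = 6, 2
not_enough_info, unverifiable = 1, 1

verified_count = supported + refuted                # 8
verification_rate = verified_count / total_claims   # 0.8
accuracy_score = supported / verified_count         # 0.75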

generate_summary(grade, statistics)

Generate human-readable summary of evaluation results.

Parameters:

  • grade (str, required): Letter grade.
  • statistics (TruthfulnessStatistics, required): TruthfulnessStatistics instance.

Returns:

  • str: Summary string.

Source code in src/truthfulness_evaluator/core/grading.py
def generate_summary(
    grade: str,
    statistics: TruthfulnessStatistics,
) -> str:
    """Generate human-readable summary of evaluation results.

    Args:
        grade: Letter grade.
        statistics: TruthfulnessStatistics instance.

    Returns:
        Summary string.
    """
    stats = statistics

    if stats.total_claims == 0:
        return "No claims were extracted from the document."

    summary = f"Document received grade {grade}. "
    summary += f"Of {stats.total_claims} claims, "
    summary += f"{stats.supported} were supported, "
    summary += f"{stats.refuted} were refuted, and "
    summary += f"{stats.not_enough_info + stats.unverifiable} could not be verified."

    if stats.verification_rate < 0.5:
        summary += " Many claims lacked sufficient evidence for verification."
    elif stats.accuracy_score < 0.7:
        summary += " Several claims were found to be inaccurate."
    else:
        summary += " The document appears to be largely accurate."

    return summary
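
A minimal sketch of the summary text this produces, reusing the counts from the example above. The field names match what calculate_statistics itself passes to TruthfulnessStatistics; the import path for the model is an assumption:

from truthfulness_evaluator.core.grading import generate_summary
from truthfulness_evaluator.models import TruthfulnessStatistics  # assumed path

stats = TruthfulnessStatistics(
    total_claims=10,
    supported=6,
    refuted=2,
    not_enough_info=1,
    unverifiable=1,
    verification_rate=0.8,
    accuracy_score=0.75,
)

print(generate_summary("B", stats))
# Document received grade B. Of 10 claims, 6 were supported, 2 were refuted,
# and 2 could not be verified. The document appears to be largely accurate.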

is_verified(result, confidence_threshold=0.7)

Whether a verification result meets the verification criteria.

Parameters:

  • result (VerificationResult, required): The verification result to check.
  • confidence_threshold (float, default 0.7): Minimum confidence to consider verified.

Returns:

  • bool: True if verdict is SUPPORTS/REFUTES and confidence meets threshold.

Source code in src/truthfulness_evaluator/core/grading.py
def is_verified(
    result: VerificationResult,
    confidence_threshold: float = 0.7,
) -> bool:
    """Whether a verification result meets the verification criteria.

    Args:
        result: The verification result to check.
        confidence_threshold: Minimum confidence to consider verified.

    Returns:
        True if verdict is SUPPORTS/REFUTES and confidence meets threshold.
    """
    return result.verdict in ("SUPPORTS", "REFUTES") and result.confidence >= confidence_threshold

Functions:

  • calculate_grade(verifications: list[VerificationResult], confidence_threshold: float = 0.7) -> str: Calculate letter grade (A+ through F) from verification results
  • calculate_statistics(claims: list[Claim], verifications: list[VerificationResult]) -> TruthfulnessStatistics: Compute verdict counts and derived rates
  • generate_summary(grade: str, statistics: TruthfulnessStatistics) -> str: Generate a human-readable summary
  • is_verified(result: VerificationResult, confidence_threshold: float = 0.7) -> bool: Check whether a result counts as verified (SUPPORTS or REFUTES at or above the threshold)
  • build_report(source_document: str, claims: list[Claim], verifications: list[VerificationResult], ...) -> TruthfulnessReport: Build complete report with computed grade, confidence, statistics, and summary

Usage Example:

from truthfulness_evaluator.core.grading import build_report, calculate_grade, is_verified

# extracted_claims and verification_results come from the extraction and
# verification steps of the pipeline.
grade = calculate_grade(verification_results)            # e.g. "B"
flags = [is_verified(v) for v in verification_results]   # per-claim verification checks

report = build_report(
    source_document="docs/report.md",
    claims=extracted_claims,
    verifications=verification_results,
)
print(report.overall_grade, report.summary)