
llm-evaluator


Stars: 1,933 · Forks: 367 · Updated: March 4, 2026
SKILL.md Frontmatter (read-only)

name: llm-evaluator
version: 1.0.0
description: LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace scoring, batch backfill, and test mode. Integrates with Langfuse dashboard for observability. Triggers: evaluate trace, score quality, check accuracy, backfill scores, test evaluator, LLM judge.
license: MIT

LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

When to Use

  • Evaluating quality of search results or AI responses
  • Scoring traces for relevance, accuracy, and hallucination detection
  • Batch scoring recent unscored traces
  • Quality assurance on agent outputs

Usage

# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
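Under the hood, the `score` command maps a trace's query/response pair to per-evaluator 0–1 scores via a judge prompt. A minimal sketch of that prompt-and-parse step, assuming an illustrative template and JSON reply shape (the prompt wording and function names here are assumptions, not the script's actual internals):

```python
import json

# Hypothetical judge template; {{ }} escapes literal braces for str.format.
JUDGE_PROMPT = """You are an impartial judge. Rate the RESPONSE to the QUERY
on "{criterion}" with a score between 0 and 1 and a one-sentence reason.
Reply as JSON: {{"score": <float>, "reason": "<text>"}}

QUERY: {query}
RESPONSE: {response}"""

def build_prompt(criterion: str, query: str, response: str) -> str:
    """Fill the judge template for one evaluator (e.g. "relevance")."""
    return JUDGE_PROMPT.format(criterion=criterion, query=query, response=response)

def parse_verdict(raw: str) -> dict:
    """Parse the judge model's JSON reply, clamping the score into [0, 1]."""
    verdict = json.loads(raw)
    verdict["score"] = min(1.0, max(0.0, float(verdict["score"])))
    return verdict

# Example: parse a (hypothetical) judge reply for the relevance evaluator.
verdict = parse_verdict('{"score": 0.92, "reason": "Directly answers the query."}')
```

Clamping the parsed score keeps a misbehaving judge model from writing out-of-range values back to the Langfuse dashboard.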

Evaluators

| Evaluator     | Measures                      | Scale |
| ------------- | ----------------------------- | ----- |
| relevance     | Response relevance to query   | 0–1   |
| accuracy      | Factual correctness           | 0–1   |
| hallucination | Made-up information detection | 0–1   |
| helpfulness   | Overall usefulness            | 0–1   |
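When combining the four evaluators into a single quality number, scale direction matters: a `hallucination` detector's score may run opposite to the other three. A hedged sketch of such an aggregate (the assumption that `hallucination` means "1 = hallucination detected", and is therefore inverted, is illustrative, not documented behavior of this skill):

```python
def overall_quality(scores: dict[str, float]) -> float:
    """Average per-evaluator 0-1 scores into one quality number.

    Assumes hallucination is scored as "1 = hallucination detected"
    (an assumption), so it is flipped before averaging; the other
    evaluators are already "higher is better".
    """
    aligned = {
        name: (1.0 - value if name == "hallucination" else value)
        for name, value in scores.items()
    }
    return sum(aligned.values()) / len(aligned)

# Example trace: strong answer with little detected hallucination.
scores = {"relevance": 0.9, "accuracy": 0.8, "hallucination": 0.1, "helpfulness": 0.8}
```

Verify the scale direction against your judge prompt before aggregating; averaging a misaligned evaluator silently distorts dashboard trends.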

Credits

Built by M. Abidi | agxntsix.ai | YouTube | GitHub. Part of the AgxntSix Skill Suite for OpenClaw agents.

📅 Need help setting up OpenClaw for your business? Book a free consultation