Add script to evaluate Khoj on Google's FRAMES benchmark

Google's FRAMES benchmark evaluates multi-step retrieval and reasoning
capabilities of an agent.

The script uses Gemini as an LLM judge to grade Khoj's responses to
the FRAMES benchmark prompts against the ground-truth answers that
the benchmark provides.
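
The judging flow described above can be sketched roughly as follows. This is a minimal illustration, not the script added in this commit: the `ask_agent` and `ask_judge` callables, the `JUDGE_TEMPLATE` wording, and the TRUE/FALSE verdict format are all hypothetical, and the `Prompt` and `Answer` keys are assumed column names for the benchmark rows. In practice `ask_agent` would call Khoj and `ask_judge` would call Gemini.

```python
from typing import Callable, Iterable, Mapping

# Hypothetical judge prompt; the wording used by the real script may differ.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Ground truth answer: {truth}\n"
    "Agent response: {response}\n"
    "Does the agent response match the ground truth? Answer TRUE or FALSE."
)


def judge_is_correct(verdict: str) -> bool:
    """Interpret the judge's free-text verdict as a boolean."""
    return verdict.strip().upper().startswith("TRUE")


def evaluate(
    rows: Iterable[Mapping[str, str]],
    ask_agent: Callable[[str], str],
    ask_judge: Callable[[str], str],
) -> float:
    """Return the fraction of rows the judge marks as correct."""
    rows = list(rows)
    if not rows:
        return 0.0
    correct = 0
    for row in rows:
        # "Prompt" and "Answer" are assumed column names, not confirmed here.
        response = ask_agent(row["Prompt"])
        judge_prompt = JUDGE_TEMPLATE.format(
            question=row["Prompt"], truth=row["Answer"], response=response
        )
        if judge_is_correct(ask_judge(judge_prompt)):
            correct += 1
    return correct / len(rows)
```

Passing the agent and the judge in as plain callables keeps the scoring loop testable with stubs, without network access to either service.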
Debanjum
2024-11-02 02:38:26 -07:00
parent 31b5fde163
commit 96904e0769
2 changed files with 186 additions and 0 deletions

@@ -120,6 +120,8 @@ dev = [
     "black >= 23.1.0",
     "pre-commit >= 3.0.4",
     "gitpython ~= 3.1.43",
+    "datasets",
+    "pandas",
 ]
 
 [tool.hatch.version]