Add script to evaluate Khoj on Google's FRAMES benchmark

Google's FRAMES benchmark evaluates the multi-step retrieval and reasoning capabilities of an agent. The script uses Gemini as an LLM judge to grade Khoj's responses to the FRAMES prompts against the ground-truth answers that the benchmark provides.
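
For readers unfamiliar with the LLM-as-judge pattern, the grading loop can be sketched roughly as below. This is an illustrative sketch, not the script added in this commit: the prompt wording and the names `build_judge_prompt`, `parse_verdict`, and `accuracy` are all hypothetical, and the agent (Khoj) and the judge (Gemini) are passed in as plain callables so that no particular client API is assumed.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge prompt; the wording used by the real script may differ.
JUDGE_PROMPT = (
    "You are grading an answer to a question.\n"
    "Question: {question}\n"
    "Ground truth: {truth}\n"
    "Agent response: {response}\n"
    "Reply with exactly one word: TRUE if the response agrees with the "
    "ground truth, otherwise FALSE."
)


def build_judge_prompt(question: str, truth: str, response: str) -> str:
    """Fill the judge prompt template for one benchmark row."""
    return JUDGE_PROMPT.format(question=question, truth=truth, response=response)


def parse_verdict(judge_reply: str) -> bool:
    """Treat any reply that starts with TRUE (case-insensitive) as a pass."""
    return judge_reply.strip().upper().startswith("TRUE")


def accuracy(
    rows: Iterable[Tuple[str, str]],
    agent: Callable[[str], str],
    judge: Callable[[str], str],
) -> float:
    """Fraction of (question, ground truth) rows the judge marks as correct.

    `agent` stands in for a call to Khoj and `judge` for a call to the
    judge model; both are assumptions made for this sketch.
    """
    verdicts = []
    for question, truth in rows:
        response = agent(question)
        reply = judge(build_judge_prompt(question, truth, response))
        verdicts.append(parse_verdict(reply))
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

Keeping the agent and the judge behind plain callables lets the loop be exercised with stubs before any real model is wired in.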
2026-03-02 13:18:18 +00:00 · 2024-11-02 02:38:26 -07:00
parent 31b5fde163
commit 96904e0769
2 changed files with 186 additions and 0 deletions
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -120,6 +120,8 @@ dev = [
    "black >= 23.1.0",
    "pre-commit >= 3.0.4",
    "gitpython ~= 3.1.43",
+    "datasets",
+    "pandas",
 ]

 [tool.hatch.version]