Debanjum
4f3fdaf19d
Increase Khoj API response timeout on evals call. Handle no decision
2025-05-18 19:14:49 -07:00
Debanjum
8050173ee1
Time out calls to Khoj API in evals to continue to next question
2025-05-17 17:37:11 -07:00
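A minimal sketch of the "time out and continue" behaviour described in the commit above. It is an illustration, not the code from the commit. The names `run_batch`, `ask` and `timeout_s` are hypothetical, and `ask` stands in for whatever function calls the Khoj API and raises `TimeoutError` when the call takes too long.

```python
from typing import Callable, Dict, Iterable, Optional


def run_batch(
    questions: Iterable[str],
    ask: Callable[[str, float], str],  # hypothetical: (question, timeout in seconds) -> answer
    timeout_s: float = 300.0,          # assumed default, not taken from the repo
) -> Dict[str, Optional[str]]:
    """Ask every question; record None for a timed-out call instead of aborting the batch."""
    results: Dict[str, Optional[str]] = {}
    for question in questions:
        try:
            results[question] = ask(question, timeout_s)
        except TimeoutError:
            # Skip this question and move on to the next one.
            results[question] = None
    return results
```

With an HTTP client, `ask` would typically pass `timeout_s` to the client's own timeout option and translate the client's timeout exception into `TimeoutError`.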
Debanjum
e0352cd8e1
Handle unset ttft in metadata of failed chat response. Fixes evals.
...
This was causing evals to stop processing rest of batch as well.
2025-05-17 15:06:22 -07:00
Debanjum
911e1bf981
Use gemini 2.0 flash as evaluator. Set seed for it to reduce eval variance.
...
Gemini 2.0 flash model is cheaper and better than Gemini 1.5 pro
2025-04-04 20:11:00 +05:30
Debanjum
94ca458639
Set default chat model to KHOJ_CHAT_MODEL env var if set
...
Simplify code logic to set default_use_model during init for readability
2025-03-09 18:23:30 +05:30
Debanjum
b4183c7333
Default to gemini 2.0 flash instead of 1.5 flash on Gemini setup
...
Add price of gemini 2.0 flash for cost calculations
2025-03-07 13:48:15 +05:30
Debanjum
f13bdc5135
Log eval run progress percentage for orientation
2025-03-07 13:48:15 +05:30
Debanjum
dc0bc5bcca
Evaluate information retrieval quality using eval script
...
- Encode article URLs in the filenames indexed in the Khoj KB.
  This makes it easier for humans to compare and trace retrieval
  performance from the logs than a content hash would (which was
  previously explored)
2025-01-06 13:19:52 +07:00
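One way to encode an article URL into a filename, so that it can be read back from logs, is percent-encoding. The sketch below is an assumption about how this could be done, not the scheme used in the commit. The helper names `url_to_filename` and `filename_to_url` and the `.txt` suffix are hypothetical.

```python
from urllib.parse import quote, unquote


def url_to_filename(url: str, suffix: str = ".txt") -> str:
    """Percent-encode the whole URL (including '/' and ':') into one filename component."""
    return quote(url, safe="") + suffix


def filename_to_url(filename: str, suffix: str = ".txt") -> str:
    """Recover the original URL from a filename produced by url_to_filename."""
    if suffix and filename.endswith(suffix):
        filename = filename[: -len(suffix)]
    return unquote(filename)
```

Unlike a content hash, the encoded name is reversible, so a retrieved reference seen in a log can be mapped straight back to its source article.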
Debanjum
daeba66c0d
Optionally pass references used by agent for response to eval scorers
...
This will allow the eval framework to evaluate retrieval quality too
2025-01-06 13:19:52 +07:00
Debanjum
8231f4bb6e
Return accuracy as decision to generalize across IR & standard scorers
2025-01-06 13:19:52 +07:00
Debanjum
fc6be543bd
Improve GPQA eval prompt to improve parsing of answer from Khoj response
2024-11-30 17:21:09 -08:00
Debanjum
29e801c381
Add MATH500 dataset to eval
...
Evaluate simpler MATH500 responses with gemini 1.5 flash
This improves both the speed and cost of running this eval
2024-11-28 12:48:25 -08:00
Debanjum
22aef9bf53
Add GPQA (diamond) dataset to eval
2024-11-28 12:48:25 -08:00
Debanjum
ed364fa90e
Track running costs & accuracy of eval runs in progress
...
Collect, display and store running costs & accuracy of eval run.
This provides more insight into eval runs during execution instead of
having to wait until the eval run completes.
2024-11-20 12:40:51 -08:00
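A minimal sketch of tracking running accuracy and cost while an eval run is in progress, as the commit above describes. The class `RunningStats` and its fields are hypothetical and only illustrate the idea. The actual eval script may collect and store these values differently.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Accumulates results so accuracy and cost can be reported mid-run."""

    total: int = 0
    correct: int = 0
    cost: float = 0.0  # cumulative cost in whatever unit the caller uses

    def update(self, is_correct: bool, cost: float) -> None:
        """Record the outcome and cost of one evaluated question."""
        self.total += 1
        self.correct += int(is_correct)
        self.cost += cost

    @property
    def accuracy(self) -> float:
        """Fraction of questions answered correctly so far (0.0 before any result)."""
        return self.correct / self.total if self.total else 0.0
```

Calling `update` after each scored question keeps `accuracy` and `cost` current, so they can be logged at any point rather than only once the run completes.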
Debanjum
a2ccf6f59f
Fix github workflow to start Khoj, connect to PG and upload results
...
- Do not trigger tests to run in CI on updates to evals
2024-11-18 04:25:15 -08:00
Debanjum
7c0fd71bfd
Add GitHub workflow to quiz Khoj across modes and specified evals (#982)
...
- Evaluate Khoj on 200 random questions from each of the Google FRAMES and OpenAI SimpleQA benchmarks across *general*, *default* and *research* modes
- Run eval with Gemini 1.5 Flash as the test giver model and Gemini 1.5 Pro as the test evaluator model
- Trigger eval workflow on release or manually
- Make dataset, khoj mode and sample size configurable when triggered via manual workflow
- Enable web search and webpage read tools during evaluation
2024-11-18 02:19:30 -08:00
Debanjum
41d9011a26
Move evaluation script into tests/evals directory
...
This should give more space for eval scripts, results and readme
2024-11-17 02:08:20 -08:00