klbr/khoj

mirror of https://github.com/khoaliber/khoj.git synced 2026-03-02 13:18:18 +00:00

Author	SHA1	Message	Date
Debanjum	8490f2826b	Reduce evaluator llm verbosity during eval	2025-11-16 10:50:30 -08:00
Debanjum	52b1928023	Make gpqa answer evaluator more versatile at extracting mcq answers	2025-08-31 23:40:09 -07:00
Debanjum	703e189979	Deterministically shuffle dataset for consistent data in an eval run. Previously, eval runs across modes would use different dataset shuffles. This change enables a strict apples-to-apples perf comparison of the different khoj modes across the same (random) subset of questions by using a dataset seed per workflow run to sample questions	2025-08-31 23:40:08 -07:00
Debanjum	2823c84bb4	Default to gemini 2.5 model series on init and for eval	2025-08-22 20:34:38 -07:00
Debanjum	c8e07e86e4	Format server code with ruff recommendations	2025-08-01 00:28:17 -07:00
Debanjum	4f3fdaf19d	Increase khoj api response timeout on evals call. Handle no decision	2025-05-18 19:14:49 -07:00
Debanjum	8050173ee1	Timeout calls to khoj api in evals to continue to next question	2025-05-17 17:37:11 -07:00
Debanjum	e0352cd8e1	Handle unset ttft in metadata of failed chat response. Fixes evals. This was causing evals to stop processing rest of batch as well.	2025-05-17 15:06:22 -07:00
Debanjum	911e1bf981	Use gemini 2.0 flash as evaluator. Set seed for it to reduce eval variance. Gemini 2.0 flash model is cheaper and better than Gemini 1.5 pro	2025-04-04 20:11:00 +05:30
Debanjum	94ca458639	Set default chat model to KHOJ_CHAT_MODEL env var if set. Simplify code log to set default_use_model during init for readability	2025-03-09 18:23:30 +05:30
Debanjum	b4183c7333	Default to gemini 2.0 flash instead of 1.5 flash on Gemini setup. Add price of gemini 2.0 flash for cost calculations	2025-03-07 13:48:15 +05:30
Debanjum	f13bdc5135	Log eval run progress percentage for orientation	2025-03-07 13:48:15 +05:30
Debanjum	dc0bc5bcca	Evaluate information retrieval quality using eval script. - Encode article urls in filename indexed in Khoj KB. Makes it easier for humans to compare, trace retrieval performance by looking at logs than using content hash (which was previously explored)	2025-01-06 13:19:52 +07:00
Debanjum	daeba66c0d	Optionally pass references used by agent for response to eval scorers. This will allow the eval framework to evaluate retrieval quality too	2025-01-06 13:19:52 +07:00
Debanjum	8231f4bb6e	Return accuracy as decision to generalize across IR & standard scorers	2025-01-06 13:19:52 +07:00
Debanjum	fc6be543bd	Improve GPQA eval prompt to improve parsing answer from Khoj response	2024-11-30 17:21:09 -08:00
Debanjum	29e801c381	Add MATH500 dataset to eval. Evaluate simpler MATH500 responses with gemini 1.5 flash. This improves both the speed and cost of running this eval	2024-11-28 12:48:25 -08:00
Debanjum	22aef9bf53	Add GPQA (diamond) dataset to eval	2024-11-28 12:48:25 -08:00
Debanjum	ed364fa90e	Track running costs & accuracy of eval runs in progress. Collect, display and store running costs & accuracy of eval run. This provides more insight into eval runs during execution instead of having to wait until the eval run completes.	2024-11-20 12:40:51 -08:00
Debanjum	a2ccf6f59f	Fix github workflow to start Khoj, connect to PG and upload results - Do not trigger tests to run in ci on update to evals	2024-11-18 04:25:15 -08:00
Debanjum	7c0fd71bfd	Add GitHub workflow to quiz Khoj across modes and specified evals (#982) - Evaluate khoj on random 200 questions from each of google frames and openai simpleqa benchmarks across general, default and research modes - Run eval with Gemini 1.5 Flash as test giver and Gemini 1.5 Pro as test evaluator models - Trigger eval workflow on release or manually - Make dataset, khoj mode and sample size configurable when triggered via manual workflow - Enable Web search, webpage read tools during evaluation	2024-11-18 02:19:30 -08:00
Debanjum	41d9011a26	Move evaluation script into tests/evals directory. This should give more space for eval scripts, results and readme	2024-11-17 02:08:20 -08:00

22 Commits
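
Commit 703e189979 describes sampling the same random subset of questions for every khoj mode by using one dataset seed per workflow run. The actual eval script is not shown on this page, so the following is only a minimal sketch of that idea. The function name `sample_questions`, the placeholder question list, and the seed value are all hypothetical; only the mode names (general, default, research) and the sample size of 200 come from the commit messages above.

```python
import random


def sample_questions(questions, sample_size, seed):
    """Return the same pseudo-random subset every time for a given seed.

    Hypothetical helper: illustrates seeded sampling, not the real eval code.
    """
    rng = random.Random(seed)  # local generator, global random state is untouched
    shuffled = list(questions)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    return shuffled[:sample_size]


# Placeholder dataset standing in for a benchmark's question list.
questions = [f"question-{i}" for i in range(1000)]

# Assumed to be chosen once per workflow run and shared by all modes.
run_seed = 42

# Each mode draws its sample independently, yet receives identical questions.
subsets = {
    mode: sample_questions(questions, 200, run_seed)
    for mode in ("general", "default", "research")
}
```

Because every mode shuffles with the same seed, accuracy differences between modes in one run cannot come from differing question samples. Changing the seed between workflow runs would still vary which questions get sampled over time.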