Commit Graph

449 Commits

Debanjum
2f4160e24b Use single extract questions method across all LLMs for doc search
Using a model-specific extract questions method was an artifact from older
times, when models were less guidable.

The new changes collate and reuse logic:
- Rely on send_message_to_model_wrapper for model specific formatting.
- Use the same prompt, context for all LLMs, as they can handle prompt variation.
- Use a response schema enforcer to ensure response consistency across models.

Extract questions (because of its age) was the only tool directly within
each provider's code. Put it into helpers to have all the (mini) tools
in one place.
2025-06-06 13:28:18 -07:00
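The response schema enforcement mentioned above could look roughly like this minimal sketch; the `ExtractedQuestions` shape and helper name are hypothetical illustrations, not Khoj's actual code:

```python
import json
from dataclasses import dataclass

# Hypothetical schema for the extract-questions response; Khoj's actual
# schema enforcer and field names may differ.
@dataclass
class ExtractedQuestions:
    queries: list[str]

def enforce_questions_schema(raw_response: str) -> ExtractedQuestions:
    """Parse a model's raw JSON response and coerce it into one shared
    schema, so every provider returns the same shape."""
    data = json.loads(raw_response)
    queries = data.get("queries")
    if not isinstance(queries, list) or not all(isinstance(q, str) for q in queries):
        raise ValueError("Model response did not match expected schema")
    return ExtractedQuestions(queries=queries)
```

Validating against one schema after parsing is what lets a single prompt serve all providers: any model whose output deviates fails fast instead of producing provider-specific drift.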
Debanjum
7d59688729 Move document search tool into helpers module with other tools
Document search (because of its age) was the only tool directly within
an api router. Put it into helpers to have all the (mini) tools in one
place.
2025-06-06 13:28:18 -07:00
Debanjum
05d4e19cb8 Pass deep typed chat history for more ergonomic, readable, safe code
The chat dictionary is an artifact from earlier non-db chat history
storage. We've been ensuring new chat messages have a valid type before
being written to the DB for more than 6 months now.

Moving to the deeply typed chat history helps avoid null refs and
makes the code more readable and easier to reason about.

Next Steps:
The current update entangles chat_history written to DB
with any virtual chat history message generated for intermediate
steps. The chat message type written to DB should be decoupled from
type that can be passed to AI model APIs (maybe?).

For now we've made the ChatMessage.message type looser to allow
the list[dict] type (apart from string). But it may later be a good idea
to decouple the chat_history received by send_message_to_model from
the chat_history saved to DB (which can then have its stricter type check)
2025-06-04 00:03:14 -07:00
Debanjum
dca17591f3 Handle parsing json from string with plain text suffix 2025-05-23 19:44:02 -07:00
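One way to handle a JSON payload followed by a plain-text suffix is `json.JSONDecoder.raw_decode`, which parses the leading JSON document and reports where it ends. A sketch of the idea, not necessarily Khoj's implementation:

```python
import json

def parse_json_prefix(raw: str) -> dict:
    """Parse the JSON object at the start of a string, ignoring any
    plain-text suffix the model appended after it."""
    decoder = json.JSONDecoder()
    # raw_decode returns (parsed_object, index_where_json_ended)
    obj, _end = decoder.raw_decode(raw.strip())
    return obj
```

For example, `parse_json_prefix('{"answer": 42} Hope that helps!')` returns `{"answer": 42}` where a plain `json.loads` would raise.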
Debanjum
4f3fdaf19d Increase khoj api response timeout on evals call. Handle no decision 2025-05-18 19:14:49 -07:00
Debanjum
fd591c6e6c Upgrade tenacity to respect min time for exponential backoff
The fix for this issue is in tenacity 9.0.0, but older langchain required
tenacity <9.0.0.

Explicitly pin version of langchain sub packages to avoid indexing
and doc parsing breakage.
2025-05-17 17:37:15 -07:00
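For orientation, a min-respecting exponential backoff behaves roughly like this stdlib sketch; it approximates tenacity's `wait_exponential(multiplier, min, max)` but is not tenacity's actual code:

```python
def exponential_backoff_wait(attempt: int, multiplier: float = 1.0,
                             min_wait: float = 4.0, max_wait: float = 60.0) -> float:
    """Wait time before retry `attempt` (1-indexed): exponential growth,
    clamped so early retries still respect the minimum wait."""
    wait = multiplier * (2 ** (attempt - 1))
    return max(min_wait, min(wait, max_wait))
```

The clamp to `min_wait` is the behavior at stake here: without it, the first few retries fire almost immediately instead of waiting at least the configured minimum.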
Debanjum
2694734d22 Update truncation logic to handle multi-part message content 2025-05-17 17:37:15 -07:00
Debanjum
8050173ee1 Timeout calls to khoj api in evals to continue to next question 2025-05-17 17:37:11 -07:00
Debanjum
e0352cd8e1 Handle unset ttft in metadata of failed chat response. Fixes evals.
This was causing evals to stop processing the rest of the batch as well.
2025-05-17 15:06:22 -07:00
Debanjum
911e1bf981 Use gemini 2.0 flash as evaluator. Set seed for it to reduce eval variance.
Gemini 2.0 flash model is cheaper and better than Gemini 1.5 pro
2025-04-04 20:11:00 +05:30
Debanjum
94ca458639 Set default chat model to KHOJ_CHAT_MODEL env var if set
Simplify code logic to set default_use_model during init for readability
2025-03-09 18:23:30 +05:30
Debanjum
b4183c7333 Default to gemini 2.0 flash instead of 1.5 flash on Gemini setup
Add price of gemini 2.0 flash for cost calculations
2025-03-07 13:48:15 +05:30
Debanjum
f13bdc5135 Log eval run progress percentage for orientation 2025-03-07 13:48:15 +05:30
Debanjum
dc0bc5bcca Evaluate information retrieval quality using eval script
- Encode article urls in filename indexed in Khoj KB
  Makes it easier for humans to compare and trace retrieval performance
  by looking at logs than by using a content hash (which was previously
  explored)
2025-01-06 13:19:52 +07:00
Debanjum
daeba66c0d Optionally pass references used by agent for response to eval scorers
This will allow the eval framework to evaluate retrieval quality too
2025-01-06 13:19:52 +07:00
Debanjum
8231f4bb6e Return accuracy as decision to generalize across IR & standard scorers 2025-01-06 13:19:52 +07:00
Debanjum
c4bb92076e Convert async create automation api endpoints to sync 2024-12-26 21:59:55 -08:00
Debanjum
01bc6d35dc Rename Chat Model Options table to Chat Model as short & readable (#1003)
- Previous name was incorrectly plural though it defined only a single model
- Rename chat model table field to name
- Update documentation
- Update referenced functions and variables to match new name
2024-12-12 11:24:16 -08:00
sabaimran
6c8007e23b Improve handling of multiple output modes
- Use the generated descriptions / inferred queries to supply context to the model about what it's created and give a richer response
- Stop sending the generated image in user message. This seemed to be confusing the model more than helping.
- Also, rename the open ai converse method to converse_openai to follow patterns with other providers
2024-12-10 16:54:21 -08:00
Debanjum
9dd3782f5c Rename OpenAIProcessorConversationConfig DB model to more apt AiModelApi (#998)
* Rename OpenAIProcessorConversationConfig to more apt AiModelAPI

The DB model name had drifted from what it is used for:
a general chat api provider that supports chat api providers like
anthropic and google apart from openai based chat models.

This change renames the DB model and updates the docs to remove this
confusion.

With the Ai Model Api name we cover most use-cases, including chat, stt, image generation etc.
2024-12-08 18:02:29 -08:00
sabaimran
886fe4a0c9 Merge branch 'master' of github.com:khoj-ai/khoj into features/allow-multi-outputs-in-chat 2024-12-03 21:37:00 -08:00
Debanjum
fc6be543bd Improve GPQA eval prompt to better parse the answer from Khoj's response 2024-11-30 17:21:09 -08:00
sabaimran
c5329d76ba Merge branch 'master' of github.com:khoj-ai/khoj into features/allow-multi-outputs-in-chat 2024-11-29 14:12:03 -08:00
sabaimran
d91935c880 Initial commit of a functional but not yet elegant prototype for this concept 2024-11-28 17:28:23 -08:00
Debanjum
29e801c381 Add MATH500 dataset to eval
Evaluate simpler MATH500 responses with gemini 1.5 flash

This improves both the speed and cost of running this eval
2024-11-28 12:48:25 -08:00
Debanjum
22aef9bf53 Add GPQA (diamond) dataset to eval 2024-11-28 12:48:25 -08:00
Debanjum
70b7e7c73a Improve load of complex json objects. Use it to pick tool, run code
Gemini doesn't work well when trying to output json objects. Using it
to output raw json strings with complex, multi-line structures
requires more intensive clean-up of the raw json string before parsing
2024-11-26 17:37:57 -08:00
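The kind of clean-up described above — stripping markdown code fences and surrounding prose before parsing — can be sketched like this (illustrative, not Khoj's exact helper):

```python
import json

def clean_json(raw: str) -> dict:
    """Strip markdown code fences and surrounding prose from a model's
    raw response before parsing the JSON object inside it."""
    text = raw.strip()
    # Drop a leading code fence like ```json and a trailing ```
    if text.startswith("```"):
        text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
    # Fall back to the outermost braces if prose surrounds the object
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in response")
    return json.loads(text[start:end + 1])
```

This tolerates the common failure modes: a fenced ```json block, leading chatter before the object, or trailing commentary after it.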
Debanjum
ed364fa90e Track running costs & accuracy of eval runs in progress
Collect, display and store running costs & accuracy of eval run.

This provides more insight into eval runs during execution instead of
having to wait until the eval run completes.
2024-11-20 12:40:51 -08:00
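Tracking running cost and accuracy mid-run can be as simple as this sketch; class and method names are hypothetical, not the eval script's actual code:

```python
class RunningEvalStats:
    """Accumulate cost and accuracy while an eval run is in progress,
    instead of waiting for the run to complete."""
    def __init__(self) -> None:
        self.cost = 0.0
        self.correct = 0
        self.total = 0

    def record(self, is_correct: bool, cost: float) -> None:
        """Fold one evaluated example into the running totals."""
        self.total += 1
        self.correct += int(is_correct)
        self.cost += cost

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0
```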
Debanjum
45c623f95c Dedupe, organize chat actor, director tests
- Move Chat actor tests that were previously in chat director tests file
- Dedupe online, offline io selector chat actor tests
2024-11-18 16:10:50 -08:00
Debanjum
2a76c69d0d Run online, offline chat actor, director tests for any supported provider
- Previously online chat actor, director tests only worked with openai.
  This change allows running them for any supported online provider
  including Google, Anthropic and OpenAI.

- Enable online/offline chat actor, director in two ways:
  1. Explicitly setting KHOJ_TEST_CHAT_PROVIDER environment variable to
     google, anthropic, openai, offline
  2. Implicitly by the first API key found from openai, google or anthropic.

- Default offline chat provider to use Llama 3.1 3B for faster, lower
  compute test runs
2024-11-18 15:11:37 -08:00
Debanjum
653127bf1d Improve data source, output mode selection
- Set output mode to single string. Specify output schema in prompt
  - Both these should encourage the model to select only 1 output mode
    instead of encouraging it in the prompt too many times
  - Output schema should also improve schema following in general
- Standardize variable, func name of io selector for readability
- Fix chat actors to test the io selector chat actor
- Make chat actor return sources, output separately for better
  disambiguation, at least during tests, for now
2024-11-18 15:11:37 -08:00
Debanjum
a2ccf6f59f Fix github workflow to start Khoj, connect to PG and upload results
- Do not trigger tests to run in ci on update to evals
2024-11-18 04:25:15 -08:00
Debanjum
7c0fd71bfd Add GitHub workflow to quiz Khoj across modes and specified evals (#982)
- Evaluate khoj on random 200 questions from each of google frames and openai simpleqa benchmarks across *general*, *default* and *research* modes
- Run eval with Gemini 1.5 Flash as test giver and Gemini 1.5 Pro as test evaluator models
- Trigger eval workflow on release or manually
- Make dataset, khoj mode and sample size configurable when triggered via manual workflow
- Enable Web search, webpage read tools during evaluation
2024-11-18 02:19:30 -08:00
sabaimran
0eba6ce315 When diagram generation fails, save to conversation log
- Update tool name when choosing tools to execute
2024-11-17 13:23:12 -08:00
sabaimran
7e662a05f8 Merge branch 'master' of github.com:khoj-ai/khoj into features/improve-tool-selection 2024-11-17 12:26:55 -08:00
Debanjum
41d9011a26 Move evaluation script into tests/evals directory
This should give more space for eval scripts, results and readme
2024-11-17 02:08:20 -08:00
Debanjum
d9d5884958 Enable evaluating Khoj on the OpenAI SimpleQA bench using eval script
- Just load the raw csv from OpenAI bucket. Normalize it into FRAMES format
- Improve docstring for frames datasets as well
- Log the load dataset perf timer at info level
2024-11-17 02:08:20 -08:00
Debanjum
eb5bc6d9eb Remove Talc search bench from Khoj eval script 2024-11-17 02:08:20 -08:00
sabaimran
c77dc84a68 Remove output_modes function reference in chat tests 2024-11-15 14:03:07 -08:00
Debanjum
9fc44f1a7f Enable evaluating Khoj on the Talc Search Bench using eval script
- Just load the raw jsonl from Github and normalize it into FRAMES format
- Color printed accuracy in eval script to blue for readability
2024-11-13 22:50:14 -08:00
Debanjum
f4e37209a2 Improve error handling, display and configurability of eval script
- Default to evaluation decision of None when either agent or
  evaluator llm fails. This fixes accuracy calculations on errors
- Fix showing color for decision True
- Enable arg flags to specify output results file paths
2024-11-13 14:32:22 -08:00
Debanjum
ff5c10c221 Do not CRUD on entries, files & conversations in DB for null user
Increase defense-in-depth by reducing paths to create, read, update or
delete entries, files and conversations in DB when user is unset.
2024-11-11 12:20:07 -08:00
sabaimran
8805e731fd Merge branch 'master' of github.com:khoj-ai/khoj into features/include-full-file-in-convo-with-filter 2024-11-10 19:24:11 -08:00
Debanjum
f967bdf702 Show correct example index being currently processed in frames eval
Previously the batch start index wasn't being passed, so all batches
started in parallel showed the same example index being processed

This change doesn't impact the evaluation itself, just the index shown
of the example currently being evaluated
2024-11-10 14:49:51 -08:00
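The fix amounts to threading each batch's global start index through to the progress log, as in this sketch (function name is illustrative):

```python
def iter_with_global_index(batch: list, batch_start: int = 0):
    """Pair each example with its global index so parallel batches
    log their true position instead of all starting from 0."""
    for index, example in enumerate(batch, start=batch_start):
        yield index, example
```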
Debanjum
84a8088c2b Only evaluate non-empty responses to reduce eval script latency, cost
Empty responses from Khoj will always be incorrect, so there is no
need to call an evaluator agent to check them
2024-11-10 14:49:51 -08:00
sabaimran
623a97a9ee Merge branch 'master' of github.com:khoj-ai/khoj into features/include-full-file-in-convo-with-filter 2024-11-07 17:18:23 -08:00
sabaimran
cf0bcec0e7 Revert SKIP_TESTS flag in offline chat director tests 2024-11-04 19:06:54 -08:00
sabaimran
1f372bf2b1 Update file summarization unit tests now that multiple files are allowed 2024-11-04 17:45:54 -08:00
Debanjum
1ccbf72752 Use logger instead of print to track eval 2024-11-04 00:40:26 -08:00
Debanjum
791eb205f6 Run prompt batches in parallel for faster eval runs 2024-11-02 04:58:03 -07:00
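Running prompt batches in parallel can be sketched with `concurrent.futures`; the worker count and batching in the actual eval script may differ:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batches_in_parallel(batches, evaluate, max_workers: int = 4):
    """Evaluate batches concurrently instead of sequentially.
    pool.map preserves input order in its results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate, batches))
```

Threads suit this workload because each batch spends most of its time waiting on API responses, not on CPU.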