- Cache last anthropic message. Given research mode now uses function
calling paradigm and not the old research mode structure.
- Cache tool definitions passed to anthropic models
- Stop dropping first message if by assistant as seems like Anthropic
API doesn't complain about it any more.
- Drop tool result when tool call is truncated as invalid state
- Do not truncate tool use message content, just drop the whole tool
use message.
AI model APIs need tool use assistant message content in specific
form (e.g with thinking etc.). So dropping content items breaks
expected tool use message content format.
Handle tool use scenarios where iteration query isn't set for retry
The chat actor (and director) tests haven't been looked into in a long
while. They'd gone stale in how they were calling thee functions. And
what was required to run them. Now the online chat actor tests work
again.
Using model specific extract questions was an artifact from older
times, with less guidable models.
New changes collate and reuse logic
- Rely on send_message_to_model_wrapper for model specific formatting.
- Use same prompt, context for all LLMs as can handle prompt variation.
- Use response schema enforcer to ensure response consistency across models.
Extract questions (because of its age) was the only tool directly within
each provider code. Put it into helpers to have all the (mini) tools
in one place.
The chat dictionary is an artifact from earlier non-db chat history
storage. We've been ensuring new chat messages have valid type before
being written to DB for more than 6 months now.
Move to using the deeply typed chat history helps avoids null refs,
makes code more readable and easier to reason about.
Next Steps:
The current update entangles chat_history written to DB
with any virtual chat history message generated for intermediate
steps. The chat message type written to DB should be decoupled from
type that can be passed to AI model APIs (maybe?).
For now we've made the ChatMessage.message type looser to allow
for list[dict] type (apart from string). But later maybe a good idea
to decouple the chat_history recieved by send_message_to_model from
the chat_history saved to DB (which can then have its stricter type check)
Fix for issue is in tenacity 9.0.0. But older langchain required
tenacity <0.9.0.
Explicitly pin version of langchain sub packages to avoid indexing
and doc parsing breakage.
- Encode article urls in filename indexed in Khoj KB
Makes it easier for humans to compare, trace retrieval performance
by looking at logs than using content hash (which was previously
explored)
- Previous was incorrectly plural but was defining only a single model
- Rename chat model table field to name
- Update documentation
- Update references functions and variables to match new name
- Use the generated descriptions / inferred queries to supply context to the model about what it's created and give a richer response
- Stop sending the generated image in user message. This seemed to be confusing the model more than helping.
- Also, rename the open ai converse method to converse_openai to follow patterns with other providers
* Rename OpenAIProcessorConversationConfig to more apt AiModelAPI
The DB model name had drifted from what it is being used for,
a general chat api provider that supports other chat api providers like
anthropic and google chat models apart from openai based chat models.
This change renames the DB model and updates the docs to remove this
confusion.
Using Ai Model Api we catch most use-cases including chat, stt, image generation etc.
Gemini doesn't work well when trying to output json objects. Using it
to output raw json strings with complex, multi-line structures
requires more intense clean-up of raw json string for parsing
Collect, display and store running costs & accuracy of eval run.
This provides more insight into eval runs during execution instead of
having to wait until the eval run completes.
- Previously online chat actors, director tests only worked with openai.
This change allows running them for any supported onlnie provider
including Google, Anthropic and Openai.
- Enable online/offline chat actor, director in two ways:
1. Explicitly setting KHOJ_TEST_CHAT_PROVIDER environment variable to
google, anthropic, openai, offline
2. Implicitly by the first API key found from openai, google or anthropic.
- Default offline chat provider to use Llama 3.1 3B for faster, lower
compute test runs
- Set output mode to single string. Specify output schema in prompt
- Both thesee should encourage model to select only 1 output mode
instead of encouraging it in prompt too many times
- Output schema should also improve schema following in general
- Standardize variable, func name of io selector for readability
- Fix chat actors to test the io selector chat actor
- Make chat actor return sources, output separately for better
disambiguation, at least during tests, for now
- Evaluate khoj on random 200 questions from each of google frames and openai simpleqa benchmarks across *general*, *default* and *research* modes
- Run eval with Gemini 1.5 Flash as test giver and Gemini 1.5 Pro as test evaluator models
- Trigger eval workflow on release or manually
- Make dataset, khoj mode and sample size configurable when triggered via manual workflow
- Enable Web search, webpage read tools during evaluation
- Just load the raw csv from OpenAI bucket. Normalize it into FRAMES format
- Improve docstring for frames datasets as well
- Log the load dataset perf timer at info level
- Default to evaluation decision of None when either agent or
evaluator llm fails. This fixes accuracy calculations on errors
- Fix showing color for decision True
- Enable arg flags to specify output results file paths
Previously the batch start index wasn't being passed so all batches
started in parallel were showing the same processing example index
This change doesn't impact the evaluation itself, just the index shown
of the example currently being evaluated