- Add more color to personality and communication style
- Split prompt into capabilities and style sections
- Remove directives in personality meant for older, less capable models.
- Discourage model from unnecessarily sharing code snippets in final
response unless explicitly requested.
- Ack websocket interrupt even when no task running
Otherwise the chat UX isn't updated to indicate the query has stopped
processing in this edge case
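A minimal sketch of the intended flow, with hypothetical handler and websocket names (not the actual server code):

```python
import asyncio

# Hypothetical sketch: ack the interrupt unconditionally so the client UI
# always stops its "processing" state, even when no task was running.

async def handle_interrupt(websocket, running_task):
    if running_task is not None and not running_task.done():
        running_task.cancel()
    # Ack even in the edge case where there was nothing to cancel.
    await websocket.send_json({"type": "interrupt_ack"})

class _FakeWebsocket:
    """Stand-in websocket that records sent messages."""
    def __init__(self):
        self.sent = []
    async def send_json(self, message):
        self.sent.append(message)

ws = _FakeWebsocket()
asyncio.run(handle_interrupt(ws, None))  # edge case: no task running
```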
- Mark chat request as not being processed on server-side error
It is already being passed in model_kwargs, so it doesn't need to be
passed explicitly as well.
This code path isn't currently used, but better to fix it for if/when
it is
- Set the agent of the current conversation in the agent dropdown when a new conversation with a non-default agent is initialized. This was unset previously.
- Pass the current selected agent in the dropdown when creating new chat
- Correctly select the `khoj-header-agent-select` element
- A regression had stopped indicating to the user that the websocket
connection had broken. Now the interrupt has some visual indication.
- Websocket disconnects from the client didn't trigger the partial
research to be saved. Now we use an interrupt signal to save partial
research before closing the task.
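The interrupt handling can be sketched with asyncio task cancellation; the state shape and save target below are illustrative stand-ins, not the actual research code:

```python
import asyncio

# Hypothetical sketch: persist partial research when the task is
# interrupted (e.g. on websocket disconnect) before the task closes.

saved = {}

async def research_task(state):
    try:
        while True:
            await asyncio.sleep(3600)  # stand-in for a research iteration
    except asyncio.CancelledError:
        # Interrupt signal received: save whatever was gathered so far.
        saved["partial"] = state
        raise

async def main():
    task = asyncio.create_task(research_task({"notes": ["draft finding"]}))
    await asyncio.sleep(0)  # let the task start its first iteration
    task.cancel()           # simulate the client-disconnect interrupt
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(main())
```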
Although we had handling in place to retry with Gemini's suggested
backoff on hitting rate limits, the actual rate limit exception was
getting caught to render a friendly message, so the retry wasn't
actually getting triggered.
This change allows both
- Retry on hitting 429 rate limit exceptions
- Return friendly message if rate limit triggered retry eventually fails
Related:
- Changes to retry with gemini suggested backoff time in 0f953f9
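A sketch of the intended control flow, with a stand-in RateLimitError in place of the actual Gemini client exception and an illustrative retry budget:

```python
import time

# Hypothetical sketch: retry on 429 rate limit exceptions with the
# suggested backoff, and only render the friendly message once retries
# are exhausted.

class RateLimitError(Exception):
    """Stand-in for the real client's rate limit exception."""
    def __init__(self, retry_after):
        self.retry_after = retry_after

def chat_with_backoff(call_model, max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            return call_model()
        except RateLimitError as e:
            if attempt == max_retries:
                # Retries exhausted: only now show the friendly message.
                return "The model is rate limited right now. Please try again later."
            time.sleep(e.retry_after)  # honor the suggested backoff

# Simulated model call that hits 429 twice, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError(retry_after=0)
    return "ok"
```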
Issue: The chosen_io variable was accessed before initialization when a
ValueError was raised.
Fix: Set chosen_io to fallback values on failure to select default
chat tools
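A minimal sketch of the fix, with illustrative names and fallback values:

```python
# Hypothetical sketch: initialize chosen_io to fallback values before the
# selection logic that may raise, so it is never read uninitialized.

def select_chat_tools(raw_selection, available_tools, fallback=("notes",)):
    chosen_io = list(fallback)  # set fallback up front, not only on success
    try:
        chosen_io = [tool for tool in raw_selection if tool in available_tools]
        if not chosen_io:
            raise ValueError("No valid tools selected")
    except ValueError:
        chosen_io = list(fallback)  # fall back instead of crashing
    return chosen_io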
Make the researcher handle ambiguous requests better by working with
reasonable assumptions (clearly communicated to the user in the
response) instead of burdening the user with clarification requests.
Fix portions of the researcher prompt that had gone stale since moving
to tool use and making the researcher more task (vs. Q&A) oriented
Previously the researcher was passing the whole code to execute in its
queries to the tool AI instead of asking it to write the code and
limiting its query to a natural language request (with the required data).
This division of responsibility lets the researcher focus on
constructing a request with all the required details instead of also
worrying about writing correct code.
The model's tool call response may not strictly follow the expected
format. Let the researcher handle incorrect arguments to the code tool
(e.g. arguments that trigger a TypeError)
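A sketch of this defensive handling, with a hypothetical run_code tool standing in for the actual code tool:

```python
# Hypothetical sketch: return malformed code-tool arguments to the
# researcher as an error result instead of crashing the turn.

def run_code(query: str, context: str = "") -> str:
    """Stand-in for the actual code tool."""
    return f"ran code for: {query}"

def call_code_tool(arguments: dict) -> str:
    try:
        return run_code(**arguments)
    except TypeError as e:
        # e.g. the model passed an unexpected keyword argument; surface
        # the error so the researcher can retry with corrected arguments.
        return f"Tool error: {e}. Retry with arguments matching run_code(query, context)."
```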
What
- Get reasoning of OpenAI reasoning models from the Responses API to show
- Improves cache hits and reasoning reuse for iterative agents like
research mode.
This should improve speed, quality, cost and transparency of using
openai reasoning models.
More cache hits and better reasoning, as reasoning blocks are included
while the model is researching (reasoning interspersed with tool calls)
when using the Responses API.
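A simplified sketch of the reuse pattern; the item shapes below are assumptions for illustration, not the exact Responses API SDK types:

```python
# Hypothetical sketch: carry a prior turn's output items (reasoning
# interspersed with tool calls) into the next request's input so the
# reasoning stays in context and prefix caching can hit.

def extend_input_with_output(input_items, output_items):
    return input_items + list(output_items)

# Simulated first-turn output: a reasoning block followed by a tool call.
turn1_output = [
    {"type": "reasoning", "summary": "Need to search notes first"},
    {"type": "function_call", "name": "search_notes", "arguments": '{"q": "x"}'},
]
next_input = extend_input_with_output(
    [{"role": "user", "content": "Research topic x"}], turn1_output
)
```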
Previously the line start/end anchors would only work if the whole file
started or ended with the regex pattern rather than matching per line.
Fix it to work like a standard grep tool and match at line start/end.
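The fixed behavior matches standard multiline regex semantics, e.g. Python's `re.MULTILINE`:

```python
import re

# With re.MULTILINE, ^ and $ anchor at each line's start/end instead of
# only at the whole string's start/end.
text = "alpha\nbeta\ngamma\n"

whole_string = re.findall(r"^beta$", text)            # anchors bind to the file
per_line = re.findall(r"^beta$", text, re.MULTILINE)  # anchors bind to each line
```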
Reduce usage of boolean operators like "hello OR bye OR see you", which
don't work and reduce search quality. The model is trying to stuff
multiple different queries into a single search query.
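For illustration, an OR-stuffed query really encodes several distinct queries that should be issued separately (a naive, hypothetical splitter):

```python
# Hypothetical sketch: split an OR-stuffed query into the separate
# searches the model actually intended.

def split_or_query(query: str) -> list[str]:
    return [part.strip() for part in query.split(" OR ") if part.strip()]
```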
## Overview
Speed up app install and development using a faster, modern development
toolchain
## Details
### Major
- Use [uv](https://docs.astral.sh/uv/) for faster server install (vs
pip)
- Use [bun](https://bun.sh/) for faster web app install (vs yarn)
- Use [ruff](https://docs.astral.sh/ruff/) for faster formatting of
server code (vs black, isort)
- Fix devcontainer builds. See if uv and bun can speed up server and
client installs
### Minor
- Format web app with prettier and server with ruff. This is most of the
file changes in this PR.
- Simplify copying web app built files in pypi workflow to make it less
flaky.
- CI runners don't have GPUs
- PyTorch-related Nvidia CUDA packages are not required for testing,
evals or pre-commit checks.
- Avoiding these massive downloads should speed up workflow run.
### Overview
Make server leaner to increase development speed.
Remove old indexing code and the native offline chat which was hard to
maintain.
- The native offline chat module was written when the local AI model API
ecosystem wasn't mature. Now it is, so reuse it.
- Offline chat requires a GPU for usable speeds. Decoupling offline chat
from the Khoj server is the recommended way to get practical inference
speeds (e.g. Ollama on machine, Khoj in Docker etc.)
### Details
- Drop old code to index files on server filesystem. Clean cli, init
paths.
- Drop native offline chat support with llama-cpp-python.
Use established local AI APIs like Llama.cpp Server, Ollama, vLLM etc.
- Drop old pre 1.0 khoj config migration scripts
- Update test setup to index test data now that the old indexing code is removed.