### Overview
Make server leaner to increase development speed.
Remove old indexing code and the native offline chat which was hard to
maintain.
- The native offline chat module was written when the local ai model api
ecosystem wasn't mature. Now it is. Reuse that.
- Offline chat requires GPU for usable speeds. Decoupling offline chat
from Khoj server is the recommended way to go for practical inference
speeds (e.g Ollama on machine, Khoj in docker etc.)
### Details
- Drop old code to index files on server filesystem. Clean cli, init
paths.
- Drop native offline chat support with llama-cpp-python.
Use established local ai APIs like Llama.cpp Server, Ollama, vLLM etc.
- Drop old pre 1.0 khoj config migration scripts
- Update test setup to index test data after old indexing code removed.
- Delete tests testing deprecated server side indexing flows
- Delete `Local(Plaintext|Org|Markdown|Pdf)Config' methods, files and
references in tests
- Index test data via new helper method, `get_index_files'
- It is modelled after the old `get_org_files' variants in main app
- It passes the test data in required format to `configure_content'
Allows maintaining the more realistic tests from before while
using new indexing mechanism (rather than the deprecated server
side indexing mechanism
This stale code was originally used to index files on server file
system directly by server. We currently push files to sync via API.
Server side syncing of remote content like Github and Notion is still
supported. But old, unused code for server side sync of files on
server fs is being cleaned out.
New --log-file cli args allows specifying where khoj server should
store logs on fs. This replaces the --config-file cli arg that was
only being used as a proxy for deciding where to store the log file.
- TODO
- Tests are broken. They were relying on the server side content
syncing for test setup
It is recommended to chat with open-source models by running an
open-source server like Ollama, Llama.cpp on your GPU powered machine
or use a commercial provider of open-source models like DeepInfra or
OpenRouter.
These chat model serving options provide a mature Openai compatible
API that already works with Khoj.
Directly using offline chat models only worked reasonably with pip
install on a machine with GPU. Docker setup of khoj had trouble with
accessing GPU. And without GPU access offline chat is too slow.
Deprecating support for an offline chat provider directly from within
Khoj will reduce code complexity and increase developement velocity.
Offline models are subsumed to use existing Openai ai model provider.
Clarify that the tool AI will perform a maximum of X sub-queries for
each query passed to it by the manager AI.
Avoids the manager AI from trying to directly pass a list of queries
to the search tool AI. It should just pass just a single query.
Send larger thought chunks to improve streaming efficiency and
reduce rendering load on web client.
This rendering load was most evident when using high throughput
models or low compute clients.
The server side message buffering should result in fewer re-renders,
faster streaming and lower compute load on client.
Related commit to buffer message content in fc99f8b37
- Ask both manager and code gen AI to not run or write
unsafe code for some safety improvement (over code exec in sandbox).
- Disallow custom agent prompts instructing unsafe code gen
## PR Summary
This PR resolves the deprecation warnings of the Pydantic library, which
you can find in the [CI
logs](https://github.com/khoj-ai/khoj/actions/runs/16528997676/job/46749452047#step:9:142):
```python
PydanticDeprecatedSince20: The `copy` method is deprecated; use `model_copy` instead. See the docstring of `BaseModel.copy` for details about how to handle `include` and `exclude`. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
```
Save to conversation in normal flow should only be done if
interrupt wasn't triggered.
Saving conversations on interrupt is handled completely by the
disconnect monitor since the improvements to interrupt.
This abort is handled correctly for steps before final response. But
not if interrupt occurs while final response is being sent. This
changes checks for cancellation after final response send attempt and
avoids duplicate chat turn save.
- Extract llm thoughts from more openai compatible ai api providers
like llama.cpp server vllm and litellm.
- Try structured thought extraction by default
- Try in-stream thought extraction for specific model families like
qwen and deepseek.
- Show thoughts with tool use. For intermediate steps like research
mode from openai compatible models
Some consensus on thought in model response is being reached with
using deepseek style thoughts in structured response (via
"reasoning_content" field) or qwen style thoughts in main
response (i.e <think></think> tags).
Default to try deepseek style structured thought extraction. So the
previous default stream processor isn't required.
A previous regression resulted in the start llm response event being
sent with every (non-thought) message chunk. It should only be sent
once after thoughts and before first normal message chunk is streamed.
Regression probably introduced with changes to stream thoughts.
This should fix the chat streaming latency logs.
Send larger message chunks to improve streaming efficiency and
reduce rendering load on web client.
This rendering load was most evident when using high throughput
models, low compute clients and message with images. As message
content was rerendered on every token sent to the web app.
The server side message buffering should result in fewer re-renders
and lower compute load on client.
Fixes calling websocket rate limiter from async chat_ws method.
Not sure why the issue did not trigger in local setups. Maybe has to
do with gunicorn vs uvicorn / multi-workers setup in prod vs local.
- Add a websocket api endpoint for chat. Reuse most of the existing chat
logic.
- Communicate from web app using the websocket chat api endpoint.
- Pass interrupt messages using websocket to guide research, operator
trajectory
Previously we were using the abort and send new POST /api/chat
mechanism.
This didn't scale well to multi-worker setups as a different worker
could pick up the new interrupt message request.
Using websocket to send messages in the middle of long running tasks
should work more naturally.
- Chat history is retrieved and updated with new messages just before
write. This is to reduce chance of message loss due to conflicting
writes making last to save to conversation win conflict.
- This was problematic artifact of old code. Removing it should reduce
conflict surface area.
- Interrupts and live chat could hit this issue due to different reasons
- Use websocket library to handle setup, reconnection from web app
Use react-use-websocket library to handle websocket connection and
reconnection logic. Previously connection wasn't re-established on
disconnects.
- Send interrupt messages with ws to update research, operator trajectory
Previously we were using the abort and send new POST /api/chat
mechanism.
But now we can use the websocket's bi-directional messaging capability
to send users messages in the middle of a research, operator run.
This change should
1. Allow for a faster, more interactive interruption to shift the
research direction without breaking the conversation flow. As
previously we were using the DB to communicate interrupts across
workers, this would take time and feel sluggish on the UX.
2. Be a more robust interrupt mechanism that'll work in multi worker
setups. As same worker is interacted with to send interrupt messages
instead of potentially new worker receiving the POST /api/chat with
the interrupt user message.
On the server we're using an asyncio Queue to pass messages down from
websocket api to researcher via event generator. This can be extended
to pass to other iterative agents like operator.
Fix using research tool names instead of slash command tool names
(exposed to user) in research mode conversation history construction.
Map agent input tools to relevant research tools. Previously
using agents with a limited set of tools in research mode reduces
tools available to agent in research mode.
Fix checks to skip tools if not configured.
The chat model friendly name field was introduced in a8c47a70f. But
we weren't setting the friendly name for ollama models, which get
automatically loaded on first run.
This broke setting chat model options, server admin settings and
creating new chat pages (at least) as they display the chat model's
friendly name.
This change ensures the friendly name for auto loaded chat models is
set to resolve these issues. We also add a null ref check to web app
model selector as an additional safeguard to prevent new chat page
crash due to missing friendly name going forward.
Resolves#1208