Drop native offline chat support with llama-cpp-python

It is recommended to chat with open-source models by running an open-source server like Ollama, Llama.cpp on your GPU powered machine or use a commercial provider of open-source models like DeepInfra or OpenRouter. These chat model serving options provide a mature Openai compatible API that already works with Khoj. Directly using offline chat models only worked reasonably with pip install on a machine with GPU. Docker setup of khoj had trouble with accessing GPU. And without GPU access offline chat is too slow. Deprecating support for an offline chat provider directly from within Khoj will reduce code complexity and increase developement velocity. Offline models are subsumed to use existing Openai ai model provider.
2026-03-02 13:18:18 +00:00 · 2025-07-03 01:49:18 -07:00
parent 3f8cc71aca
commit b1f2737c9a
28 changed files with 71 additions and 1945 deletions
--- a/documentation/docs/advanced/admin.md
+++ b/documentation/docs/advanced/admin.md
@@ -20,7 +20,7 @@ Add all the agents you want to use for your different use-cases like Writer, Res
 ### Chat Model Options
 Add all the chat models you want to try, use and switch between for your different use-cases. For each chat model you add:
 - `Chat model`: The name of an [OpenAI](https://platform.openai.com/docs/models), [Anthropic](https://docs.anthropic.com/en/docs/about-claude/models#model-names), [Gemini](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models) or [Offline](https://huggingface.co/models?pipeline_tag=text-generation&library=gguf) chat model.
- `Model type`: The chat model provider like `OpenAI`, `Offline`.
+- `Model type`: The chat model provider like `OpenAI`, `Google`.
 - `Vision enabled`: Set to `true` if your model supports vision. This is currently only supported for vision capable OpenAI models like `gpt-4o`
 - `Max prompt size`, `Subscribed max prompt size`: These are optional fields. They are used to truncate the context to the maximum context size that can be passed to the model. This can help with accuracy and cost-saving.<br />
 - `Tokenizer`: This is an optional field. It is used to accurately count tokens and truncate context passed to the chat model to stay within the models max prompt size.
--- a/documentation/docs/get-started/setup.mdx
+++ b/documentation/docs/get-started/setup.mdx
@@ -18,10 +18,6 @@ import TabItem from '@theme/TabItem';
 These are the general setup instructions for self-hosted Khoj.
 You can install the Khoj server using either [Docker](?server=docker) or [Pip](?server=pip).

-:::info[Offline Model + GPU]
-To use the offline chat model with your GPU, we recommend using the Docker setup with Ollama . You can also use the local Khoj setup via the Python package directly.
-:::
-
 :::info[First Run]
 Restart your Khoj server after the first run to ensure all settings are applied correctly.
 :::
@@ -225,10 +221,6 @@ To start Khoj automatically in the background use [Task scheduler](https://www.w
 You can now open the web app at http://localhost:42110 and start interacting!<br />
 Nothing else is necessary, but you can customize your setup further by following the steps below.

-:::info[First Message to Offline Chat Model]
-The offline chat model gets downloaded when you first send a message to it. The download can take a few minutes! Subsequent messages should be faster.
-:::
-
 ### Add Chat Models
 <h4>Login to the Khoj Admin Panel</h4>
 Go to http://localhost:42110/server/admin and login with the admin credentials you setup during installation.
@@ -301,13 +293,14 @@ Offline chat stays completely private and can work without internet using any op
 - A Nvidia, AMD GPU or a Mac M1+ machine would significantly speed up chat responses
 :::

-1. Get the name of your preferred chat model from [HuggingFace](https://huggingface.co/models?pipeline_tag=text-generation&library=gguf). *Most GGUF format chat models are supported*.
-2. Open the [create chat model page](http://localhost:42110/server/admin/database/chatmodel/add/) on the admin panel
-3. Set the `chat-model` field to the name of your preferred chat model
-   - Make sure the `model-type` is set to `Offline`
-4. Set the newly added chat model as your preferred model in your [User chat settings](http://localhost:42110/settings) and [Server chat settings](http://localhost:42110/server/admin/database/serverchatsettings/).
-5. Restart the Khoj server and [start chatting](http://localhost:42110) with your new offline model!
-  </TabItem>
+1. Install any Openai API compatible local ai model server like [llama-cpp-server](https://github.com/ggml-org/llama.cpp/tree/master/tools/server), Ollama, vLLM etc.
+2. Add an [ai model api](http://localhost:42110/server/admin/database/aimodelapi/add/) on the admin panel
+   - Set the `api url` field to the url of your local ai model provider like `http://localhost:11434/v1/` for Ollama
+3. Restart the Khoj server to load models available on your local ai model provider
+   - If that doesn't work, you'll need to manually add available [chat model](http://localhost:42110/server/admin/database/chatmodel/add) in the admin panel.
+4. Set the newly added chat model as your preferred model in your [User chat settings](http://localhost:42110/settings)
+5. [Start chatting](http://localhost:42110) with your local AI!
+</TabItem>
 </Tabs>

 :::tip[Multiple Chat Models]