Drop native offline chat support with llama-cpp-python
It is recommended to chat with open-source models by running an open-source server like Ollama or llama.cpp on your GPU-powered machine, or by using a commercial provider of open-source models like DeepInfra or OpenRouter. These chat model serving options provide a mature, OpenAI-compatible API that already works with Khoj. Directly using offline chat models only worked reasonably with a pip install on a machine with a GPU; the Docker setup of Khoj had trouble accessing the GPU, and without GPU access offline chat is too slow. Deprecating support for an offline chat provider directly within Khoj reduces code complexity and increases development velocity. Offline models are subsumed under the existing OpenAI-compatible model provider.
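
As a rough illustration of that OpenAI-compatible path (a sketch, not part of this change; the base URL, placeholder API key, and model tag are assumptions about a local Ollama setup), any standard OpenAI client can talk to a self-hosted open-source model server directly:

from openai import OpenAI

# Point a standard OpenAI client at a locally running Ollama server, which
# exposes an OpenAI-compatible API under /v1. The API key is a placeholder;
# Ollama does not validate it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Request a chat completion from any model pulled into Ollama.
response = client.chat.completions.create(
    model="llama3.2:3b",  # assumed example model tag
    messages=[{"role": "user", "content": "Summarize my notes on quantum computing."}],
)
print(response.choices[0].message.content)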
@@ -19,7 +19,7 @@ from khoj.database.models import (
 from khoj.processor.conversation.utils import message_to_log


-def get_chat_provider(default: ChatModel.ModelType | None = ChatModel.ModelType.OFFLINE):
+def get_chat_provider(default: ChatModel.ModelType | None = ChatModel.ModelType.GOOGLE):
     provider = os.getenv("KHOJ_TEST_CHAT_PROVIDER")
     if provider and provider in ChatModel.ModelType:
         return ChatModel.ModelType(provider)
@@ -93,7 +93,7 @@ class ChatModelFactory(factory.django.DjangoModelFactory):

     max_prompt_size = 20000
     tokenizer = None
-    name = "bartowski/Meta-Llama-3.2-3B-Instruct-GGUF"
+    name = "gemini-2.0-flash"
     model_type = get_chat_provider()
     ai_model_api = factory.LazyAttribute(lambda obj: AiModelApiFactory() if get_chat_api_key() else None)
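
For context (a small sketch, assuming "openai" is a valid ChatModel.ModelType value), tests can still be run against a provider other than the new Google default by setting the environment variable that get_chat_provider reads:

import os

# Hypothetical override: make the test factories use the OpenAI provider
# instead of the default Google Gemini provider.
os.environ["KHOJ_TEST_CHAT_PROVIDER"] = "openai"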