Commit Graph

4631 Commits

Author SHA1 Message Date
Debanjum
4f3fdaf19d Increase khoj api response timeout on evals call. Handle no decision 2025-05-18 19:14:49 -07:00
Debanjum
31dcc44c20 Output tokens >> reasoning tokens to avoid early response termination. 2025-05-18 14:45:23 -07:00
Debanjum
73e28666b5 Fix to set default chat model for all user tiers via env var 2025-05-18 14:45:23 -07:00
Debanjum
06dcd4426d Improve Research Mode Context Management (#1179)
### Major
* Do more granular truncation on hitting context limits
* Pack research iterations as list of message content instead of
separate messages
* Update message truncation logic to truncate items in message content
list
* Make researcher aware of number of web, doc queries allowed per
iteration

### Minor
* Prompt web page reader to extract quantitative data as is from pages
* Track gemini 2.0 flash lite cost. Reduce max prompt size for 4o-mini
* Ensure time to first token logged only once per chat response
* Upgrade tenacity to respect min_time passed to exponential backoff
with jitter function
2025-05-17 17:38:31 -07:00
Debanjum
fd591c6e6c Upgrade tenacity to respect min time for exponential backoff
The fix for this issue is in tenacity 9.0.0. But older langchain required
tenacity <9.0.0.

Explicitly pin version of langchain sub packages to avoid indexing
and doc parsing breakage.
2025-05-17 17:37:15 -07:00
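The backoff semantics this commit relies on can be sketched in plain Python. This mirrors the behavior of tenacity's `wait_exponential_jitter` (a minimum initial wait, doubling per attempt, random jitter, capped maximum); the function name and parameters below are illustrative, not Khoj's actual code:

```python
import random

def backoff_wait(attempt, initial=1.0, max_wait=10.0, jitter=1.0):
    """Wait starts at `initial` (the minimum time), doubles each attempt,
    gains up to `jitter` seconds of random noise, and is capped at `max_wait`."""
    wait = initial * (2 ** attempt) + random.uniform(0, jitter)
    return max(initial, min(wait, max_wait))
```

The key property the upgrade restores is that no retry ever waits less than `initial`, even on the first attempt.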
Debanjum
988bde651c Make researcher aware of no. of web, doc queries allowed per iteration
- Construct tool description dynamically based on configurable query
  count
- Inform the researcher how many webpage reads, online searches and
  document searches it can perform per iteration when it has to decide
  which next tool to use and the query to send to the tool AI.
- Pass the query counts to perform from the research AI down to the
  tool AIs
2025-05-17 17:37:15 -07:00
Debanjum
417ab42206 Track gemini 2.0 flash lite cost. Reduce max prompt size for 4o-mini 2025-05-17 17:37:15 -07:00
Debanjum
e125e299a7 Ensure time to first token logged only once per chat response
Time to first token log lines were shown multiple times if the new chunk
being streamed was empty for some reason.

This change makes the logic robust to empty chunks being received.
2025-05-17 17:37:15 -07:00
Debanjum
2694734d22 Update truncation logic to handle multi-part message content 2025-05-17 17:37:15 -07:00
Debanjum
a337d9e4b8 Structure research iteration msgs for more granular context management
Previously research iterations and conversation logs were added to a
single user message. This prevented truncating each past iteration
separately on hitting context limits. So the whole past research
context had to be dropped on hitting context limits.

This change splits each research iteration into a separate item in a
message content list.

It uses the ability for message content to be a list, which is
supported by all major ai model apis like openai, anthropic and gemini.

The change in message format seen by pick next tool chat actor:
- New Format
  - System: System Message
  - User/Assistant: Chat History
  - User: Raw Query
  - Assistant: Iteration History
    - Iteration 1
    - Iteration 2
  - User: Query with Pick Next Tool Nudge

- Old Format
  - User: System + Chat History + Previous Iterations Message
  - User: Query

- Collateral Changes
The construct_structured_message function has been updated to always
return a list[dict[str, Any]].

Previously it'd only use a list if attached_file_context was set or a
vision model with images was used, for wider compatibility with other
openai compatible apis.
2025-05-17 17:37:15 -07:00
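The packing described above can be sketched as follows. The function names and the content-part shape are illustrative stand-ins, not Khoj's actual code:

```python
def build_iteration_message(iterations):
    """Pack each research iteration as its own item in one message's content list."""
    return {
        "role": "assistant",
        "content": [{"type": "text", "text": it} for it in iterations],
    }

def truncate_iterations(message, max_parts):
    """On hitting context limits, drop the oldest iteration parts individually
    instead of discarding the whole research history."""
    message["content"] = message["content"][-max_parts:]
    return message
```

Because each iteration is a separate content item, truncation can trim the oldest iterations one at a time rather than dropping all past research context at once.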
Debanjum
0f53a67837 Prompt web page reader to extract quantitative data as is from pages
Previously the research agent would have a hard time getting
quantitative data extracted by the web page reader tool AI.

This change aims to encourage the web page reader tool to extract
relevant data in verbatim form for higher granularity research and
responses.
2025-05-17 17:37:15 -07:00
Debanjum
99a2305246 Improve tool chat history constructor and fix its usage during research.
Code tool should see code context and webpage tool should see online
context during research runs

Fix to include code context from past conversations to answer queries.

Add all queries to the tool chat history when no specific tool is
provided to limit extracting inferred queries for.
2025-05-17 17:37:15 -07:00
Debanjum
8050173ee1 Timeout calls to khoj api in evals to continue to next question 2025-05-17 17:37:11 -07:00
Debanjum
442c7b6153 Retry running code on more request exception 2025-05-17 17:37:11 -07:00
Debanjum
10a5d68a2c Improve retry, increase timeouts of gemini api calls
- Catch specific retryable exceptions for retry
- Increase httpx timeout from default of 5s to 20s
2025-05-17 16:38:55 -07:00
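The retry pattern of catching only specific retryable exceptions can be sketched like this, using plain-Python exception types as hypothetical stand-ins for the gemini client's transient errors:

```python
def call_with_retries(fn, retryable=(TimeoutError, ConnectionError), max_attempts=3):
    """Retry only on known-transient exceptions; anything else surfaces immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # exhausted retries; re-raise the transient error
```

Catching a narrow tuple of exceptions avoids masking genuine bugs (e.g. a `ValueError`) behind retries meant for flaky network calls.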
Debanjum
20f08ca564 Reduce timeouts on calling local and online llms via openai api
- Use much larger read, connect timeout if llm served over local url
- Use larger timeout duration than default (5s) for online llms too
  This matches the timeout increase for calls to the gemini api
2025-05-17 16:38:55 -07:00
Debanjum
e0352cd8e1 Handle unset ttft in metadata of failed chat response. Fixes evals.
This was causing evals to stop processing the rest of the batch as well.
2025-05-17 15:06:22 -07:00
Debanjum
673a15b6eb Upgrade hf hub package to include hf_xet for faster downloads 2025-05-17 15:06:22 -07:00
Debanjum
d867dca310 Fix send_message_to_model_wrapper by using sync is_user_subscribed check
Calling an async function from a sync function wouldn't work.
2025-05-17 15:06:22 -07:00
Sajjad Baloch
a4ab498aec Update README for better contributions (#1170)
- Improve overall flow of the contribute section of Readme
- Fix where to look for good first issues. The contributors board is outdated. Easier to maintain and view good-first-issue with issue tags directly.

Co-authored-by: Debanjum <debanjum@gmail.com>
2025-05-12 09:51:01 -06:00
Debanjum
2feed544a6 Add Gemini 2.0 flash back to default gemini chat models list
Remove once gemini 2.5 flash is GA
2025-05-11 19:05:09 -06:00
Debanjum
2e290ea690 Pass conversation history to generate non-streaming chat model responses
Allows send_message_to_model_wrapper func to also use conversation
logs as context to generate response. This is an optional parameter
2025-05-09 00:02:14 -06:00
Debanjum
8787586e7e Dedupe code to format messages before sending to appropriate chat model
Fallback to assume not a subscribed user if user not passed.
This allows user arg to be actually optional in the async
send_message_to_model_wrapper function
2025-05-09 00:02:14 -06:00
Debanjum
e94bf00e1e Add cancellation support to research mode via asyncio.Event 2025-05-09 00:01:45 -06:00
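Cooperative cancellation with `asyncio.Event` can be sketched like this; the loop body and names are illustrative, not Khoj's actual research loop:

```python
import asyncio

async def research_loop(cancel_event, max_iterations=10):
    """Run research steps, checking for cancellation between iterations."""
    completed = 0
    for _ in range(max_iterations):
        if cancel_event.is_set():
            break  # user aborted; stop cleanly without raising
        completed += 1
        await asyncio.sleep(0)  # stand-in for one awaited research step
    return completed

async def main():
    cancel = asyncio.Event()
    task = asyncio.create_task(research_loop(cancel))
    await asyncio.sleep(0)  # let the loop run at least one iteration
    cancel.set()            # signal cancellation mid-run
    return await task
```

Unlike `task.cancel()`, setting an event lets the loop finish its current iteration and exit at a well-defined checkpoint.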
Debanjum
1572781946 Parse and show reasoning model thoughts (#1172)
### Major
All reasoning models return thoughts differently due to lack of
standardization.
We normalize thoughts across reasoning models and providers to ease
handling within Khoj.

The model thoughts are parsed during research mode when generating the
final response.
These model thoughts are returned by the chat API and shown in the train
of thought on the web app.

Thoughts are enabled for Deepseek, Anthropic, Grok and Qwen3 reasoning
models served via API.
Gemini and Openai reasoning models do not show their thoughts via
standard APIs.

### Minor
- Fix ability to use Deepseek reasoner for intermediate stages of chat
- Enable handling Qwen3 reasoning models
2025-05-02 20:29:38 -06:00
Debanjum
2cd7302966 Parse Grok reasoning model thoughts returned by API 2025-05-02 19:59:17 -06:00
Debanjum
8cadb0dbc0 Parse Anthropic reasoning model thoughts returned by API 2025-05-02 19:59:13 -06:00
Debanjum
ae4e352b42 Fix formatting to use Deepseek reasoner for completion via OpenAI API
Previously the Deepseek reasoner couldn't be used via API for completion
because the additional formatting constraints it requires were not being
applied in this function.

The formatting fix was being applied in the chat completion endpoint.
2025-05-02 19:11:16 -06:00
Debanjum
61a50efcc3 Parse DeepSeek reasoning model thoughts served via OpenAI compatible API
DeepSeek reasoners return reasoning in the reasoning_content field.

Create an async stream processor to parse the reasoning out when using
the deepseek reasoner model.
2025-05-02 19:11:16 -06:00
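The parsing can be sketched as below, with streamed deltas simplified to plain dicts rather than the OpenAI SDK's response objects:

```python
def split_reasoning_stream(deltas):
    """Accumulate DeepSeek-style streamed deltas, where thoughts arrive in a
    separate reasoning_content field alongside the usual content field."""
    thoughts, response = [], []
    for delta in deltas:
        if delta.get("reasoning_content"):
            thoughts.append(delta["reasoning_content"])
        if delta.get("content"):
            response.append(delta["content"])
    return "".join(thoughts), "".join(response)
```

Keeping the two streams separate lets the client render thoughts in the train of thought while streaming only the content as the answer.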
Debanjum
16f3c85dde Handle thinking by reasoning models. Show in train of thought on web client 2025-05-02 19:11:16 -06:00
Debanjum
d10dcc83d4 Only enable reasoning by qwen3 models in deepthought mode 2025-05-02 18:36:49 -06:00
Debanjum
6eaf54eb7a Parse Qwen3 reasoning model thoughts served via OpenAI compatible API
The Qwen3 reasoning models return thoughts within <think></think> tags
before response.

This change parses the thoughts out from final response from the
response stream and returns as structured response with thoughts.

These thoughts aren't passed to the client yet.
2025-05-02 18:36:45 -06:00
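A simplified, non-streaming sketch of the tag parsing (Khoj's actual parser works incrementally on the response stream):

```python
import re

def split_think_tags(text):
    """Split a Qwen3-style response into (thoughts, answer), where thoughts
    arrive inside <think></think> tags before the final answer."""
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", text.strip()  # no thoughts emitted; whole text is the answer
```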
Debanjum
7b9f2c21c7 Parse thoughts from thinking models served via OpenAI compatible API
OpenAI API doesn't support thoughts via chat completion by default.
But there are thinking models served via OpenAI compatible APIs like
deepseek and qwen3.

Add stream handlers and modified response types that can contain
thoughts in addition to the content returned by a model.

This can be used to instantiate stream handlers for different model
types like deepseek, qwen3 etc served over an OpenAI compatible API.
2025-05-02 17:49:16 -06:00
Debanjum
6843db1647 Use conversation specific chat model to respond to free tier users
Recent changes enabled free tier users to switch among free tier chat
models per conversation or as their default.

This change enables free tier users to generate responses with their
conversation specific chat model.

Related: #725, #1151
2025-05-02 17:48:48 -06:00
Debanjum
5b5efe463d Remove inline base64 images from webpages read with Firecrawl 2025-05-02 14:11:27 -06:00
Debanjum
559b323475 Support attaching jupyter/ipython notebooks from the web app to chat 2025-05-02 14:11:27 -06:00
sabaimran
dab6977fed add number 1 repo of day badge 2025-04-23 16:49:12 -07:00
Debanjum
964a784acf Release Khoj version 1.41.0 2025-04-23 19:01:27 +05:30
Debanjum
23dae72420 Update default models: Gemini models to 2.5 series, Gpt 4o to 4.1 2025-04-23 18:40:38 +05:30
Debanjum
d84a0f6e2c Use latest node base image to build web app for khoj docker image 2025-04-23 17:53:33 +05:30
Debanjum
dd46bcabc2 Track gpt-4.1 model costs. Set prompt size of new gemini, openai models 2025-04-23 17:53:33 +05:30
Debanjum
87262d15bb Save conversation to DB in the background, as an asyncio task 2025-04-22 17:42:33 +05:30
Debanjum
f929ff8438 Simplify AI Chat Response Streaming (#1167)
Reason
---
- Simplify code and logic to stream chat response by solely relying on
asyncio event loop.
- Reduce overhead of managing threads to increase efficiency and
throughput (where possible).

Details
---
- Use async/await with no threading when generating chat response via
OpenAI, Gemini, Anthropic AI model APIs
- Use threading for offline chat model as llama-cpp doesn't support
async streaming yet
2025-04-21 14:28:02 +05:30
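The thread-free streaming approach can be sketched with an async generator; the model call and chunking below are faked for illustration:

```python
import asyncio

async def model_stream():
    """Stand-in for an async streaming call to an AI model API."""
    for chunk in ["Hel", "lo ", "world"]:
        await asyncio.sleep(0)  # yield to the event loop between chunks
        yield chunk

async def stream_response():
    """Consume the stream on the event loop directly, with no producer thread."""
    parts = []
    async for chunk in model_stream():
        parts.append(chunk)  # each chunk would be forwarded to the client here
    return "".join(parts)
```

With `async for`, chunks flow through the single event loop, removing the thread-plus-queue handoff (the `ThreadedGenerator` removed in the next commit) except for llama-cpp, which lacks async streaming.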
Debanjum
a4b5842ac3 Remove ThreadedGenerator class, previously used to stream chat response 2025-04-21 14:16:40 +05:30
Debanjum
763fa2fa79 Refactor Offline chat response to stream async, with separate thread 2025-04-21 10:48:38 +05:30
Debanjum
932a9615ef Refactor Anthropic chat response to stream async, no separate thread 2025-04-21 10:46:07 +05:30
Debanjum
a557031447 Refactor Gemini chat response to stream async, no separate thread 2025-04-21 10:46:07 +05:30
Debanjum
0751f2ea30 Refactor Openai chat response to stream async, no separate thread
- Refactor chat API to use async/await for Openai streaming
- Fix and clean Openai chat response async streaming
2025-04-21 10:44:49 +05:30
Debanjum
c93c0d982e Create async get anthropic, openai client funcs, move to reusable package
This is the package where the get openai client functions also reside.
2025-04-21 09:30:26 +05:30
Debanjum
973aded6c5 Fix system prompt to make openai reasoning models md format response 2025-04-20 20:33:45 +05:30