- This operator works with models served over an OpenAI-compatible API
- It uses separate vision models to reason about and ground actions.
This improves the flexibility of the operator agents that can be created.
Our operator agent no longer needs to rely on a monolithic model that
can both reason over visual data and ground its actions.
We can create an operator agent from two separate models:
1. One to reason over screenshots and suggest the next action in natural language
2. One to ground those suggestions into visually grounded actions
This allows us to create fully local operators, or operators that
combine the best visual reasoner with the best visual grounder model.
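A minimal sketch of the two-model composition described above; the class and method names are hypothetical stand-ins, not Khoj's actual interfaces:

```python
class VisualReasoner:
    """Hypothetical reasoner: suggests the next action in natural language."""

    def suggest_next_action(self, screenshot: bytes, goal: str) -> str:
        return f"click the button labelled '{goal}'"


class VisualGrounder:
    """Hypothetical grounder: maps a suggestion to a concrete UI action."""

    def ground(self, screenshot: bytes, suggestion: str) -> dict:
        return {"type": "click", "x": 120, "y": 240, "reason": suggestion}


def next_action(reasoner, grounder, screenshot: bytes, goal: str) -> dict:
    # Stage 1: reason over the screenshot in natural language.
    suggestion = reasoner.suggest_next_action(screenshot, goal)
    # Stage 2: ground the suggestion into a visually grounded action.
    return grounder.ground(screenshot, suggestion)
```

Either stage can be swapped independently, e.g. a local grounder with a hosted reasoner.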
Inform the agent that it can only control a single Playwright browser page.
Previously it assumed it was operating a whole browser, so it would
have trouble navigating to different pages.
Improve handling of errors in action parsing
Stop OperatorAgent-specific code from leaking into the operator. The
operator now just calls the standard OperatorAgent functions.
Any OperatorAgent-specific logic is handled by the OperatorAgent
internally.
This improves the separation of responsibilities between the operator,
the OperatorAgent and the Environment.
- Make the environment pass screenshot data in an agent-agnostic format
- Have operator agent providers format image data into their AI model's
specific format
- Add environment step type to distinguish image vs text content
- Clearly mark major steps in the operator iteration loop
- Handle Anthropic models returning computer tool actions as normal
tool calls by normalizing next-action retrieval from their responses
- Remove unused ActionResults fields
- Remove unnecessary placeholders in the content of action results,
e.g. for screenshot data
Decouple applying an action on the Environment from the next action
decision by the OperatorAgent:
- Create an abstract Environment class with a `step` method
and a standardized set of supported actions for each concrete Environment
- Wrap the Playwright page in a concrete Environment class
- Create an abstract OperatorAgent class with an abstract `act` method
- Wrap the OpenAI computer operator in a concrete OperatorAgent class
- Wrap the Claude computer operator in a concrete OperatorAgent class
Handle agent actions that open links in a new tab.
Previously some link clicks would open in a new tab. This is outside
the browser operator's context, so the new page could not be interacted
with by the browser operator.
This change catches new page opens and opens them in the context page
instead.
Give the Khoj browser operator access to a browser with existing
context (auth, cookies etc.) by starting it with CDP enabled.
Process:
1. Start the browser with CDP enabled:
`edge/chromium/chrome --remote-debugging-port=9222`
2. Set the KHOJ_CDP_URL env var to the CDP URL of the browser to use.
3. Start Khoj and ask it to get browser-based work done with operator
+ research mode
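For example, on Linux with Chromium (the binary name and port are illustrative; any Chromium-based browser works):

```shell
# 1. Launch the browser with the Chrome DevTools Protocol enabled
#    (run this manually in your desktop session):
#    chromium --remote-debugging-port=9222

# 2. Point Khoj at the browser's CDP endpoint
export KHOJ_CDP_URL="http://localhost:9222"
echo "$KHOJ_CDP_URL"

# 3. Start Khoj as usual; the operator attaches to the running browser
```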
The GitHub run_eval workflow sets OPENAI_BASE_URL to an empty string.
The AI model API client created during initialization for OpenAI models
then gets its base URL set to an empty string rather than None or the
actual OpenAI base URL.
It then tries to call the LLM at the empty string base URL instead of
the default OpenAI API base URL, which obviously fails.
The fix is to map empty base URLs to the actual OpenAI API base URL.
Previously all exceptions were being caught, so retry logic wasn't
being triggered.
The exception catching had been added to close the LLM thread when
threads instead of async were used for final response generation.
This isn't required anymore since moving to async, so we can now
re-enable retry on failures.
Raise an error if the response is empty to retry the LLM completion.
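The raise-on-empty behaviour can be sketched without the tenacity decorator the project actually uses; `call_llm` below is a stand-in that returns empty output twice before succeeding:

```python
def retry_on_empty(call, attempts: int = 3) -> str:
    """Retry an LLM call, treating an empty response as a failure."""
    last_error = None
    for _ in range(attempts):
        try:
            response = call()
            if not response:
                # Raising here is what lets retry logic trigger on empty output.
                raise ValueError("Empty LLM response")
            return response
        except ValueError as error:
            last_error = error
    raise last_error


calls = {"count": 0}


def call_llm() -> str:
    # Stand-in LLM call: empty output twice, then a real response.
    calls["count"] += 1
    return "" if calls["count"] < 3 else "ok"
```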
Send images as PNG to non-OpenAI models served via an OpenAI-compatible
API, as more models support PNG than WebP.
Continue storing images as WebP on the server for efficiency.
Convert to PNG at the OpenAI API layer, and only for non-OpenAI models
served via an OpenAI-compatible API.
This enables using vision models like UI-TARS (via llama.cpp server) and Grok.
### Major
* Do more granular truncation on hitting context limits
* Pack research iterations as a list of message content items instead
of separate messages
* Update message truncation logic to truncate items in the message
content list
* Make the researcher aware of the number of web and doc queries
allowed per iteration
### Minor
* Prompt the web page reader to extract quantitative data verbatim from pages
* Track Gemini 2.0 Flash Lite cost. Reduce max prompt size for 4o-mini
* Ensure time to first token is logged only once per chat response
* Upgrade tenacity to respect min_time passed to the exponential
backoff with jitter function
The fix for the issue is in tenacity 9.0.0, but older langchain
required tenacity <9.0.0.
Explicitly pin the versions of langchain sub-packages to avoid indexing
and doc parsing breakage.
- Construct the tool description dynamically based on the configurable
query count
- Inform the researcher how many webpage reads, online searches and
document searches it can perform per iteration when it decides
which tool to use next and the query to send to the tool AI
- Pass the query counts down from the research AI to the tool AIs
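Constructing a tool description from the configured query budget might look like this (the function and wording are illustrative, not Khoj's actual prompt):

```python
def describe_tool(name: str, purpose: str, max_queries: int) -> str:
    """Build a tool description that tells the researcher its
    per-iteration query budget for this tool."""
    return (
        f"{name}: {purpose} "
        f"You may send up to {max_queries} queries to this tool per iteration."
    )
```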
Time to first token log lines were shown multiple times if a new chunk
being streamed was empty for some reason.
This change makes the logic robust to receiving empty chunks.
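The empty-chunk guard can be sketched as below (`log` is a stand-in list for the actual logger):

```python
def stream_with_ttft(chunks, log):
    """Yield non-empty chunks, logging time to first token exactly once."""
    first_token_seen = False
    for chunk in chunks:
        if not chunk:
            # Empty chunks must neither trigger nor repeat the log line.
            continue
        if not first_token_seen:
            log.append("time to first token")
            first_token_seen = True
        yield chunk
```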
Previously research iterations and conversation logs were added to a
single user message. This prevented truncating each past iteration
separately on hitting context limits. So the whole past research
context had to be dropped on hitting context limits.
This change splits each research iteration into a separate item in a
message content list.
It uses the ability for message content to be a list, which is
supported by all major AI model APIs like OpenAI, Anthropic and Gemini.
The change in message format seen by pick next tool chat actor:
- New Format
- System: System Message
- User/Assistant: Chat History
- User: Raw Query
- Assistant: Iteration History
- Iteration 1
- Iteration 2
- User: Query with Pick Next Tool Nudge
- Old Format
- User: System + Chat History + Previous Iterations Message
- User: Query
- Collateral Changes
The construct_structured_message function has been updated to always
return a list[dict[str, Any]].
Previously it would only use a list if attached_file_context was set or
a vision model with images was used, for wider compatibility with other
OpenAI-compatible APIs.
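Packing iterations as content list items can be sketched as below; the truncation policy shown, dropping the oldest items first, is an assumption for illustration:

```python
def pack_iterations(iterations: list[str]) -> dict:
    """Pack each research iteration as a separate item in one
    assistant message's content list."""
    return {
        "role": "assistant",
        "content": [{"type": "text", "text": it} for it in iterations],
    }


def truncate_content(message: dict, max_items: int) -> dict:
    """On hitting context limits, drop the oldest content items instead
    of dropping the whole research context."""
    return {**message, "content": message["content"][-max_items:]}
```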
Previously the research agent would have a hard time getting
quantitative data extracted by the web page reader tool AI.
This change encourages the web page reader tool to extract relevant
data verbatim for higher granularity research and responses.
The code tool should see code context and the webpage tool should see
online context during research runs.
Fix to include code context from past conversations to answer queries.
Add all queries to the tool chat history when no specific tool is
provided to limit inferred query extraction to.
- Use a much larger read and connect timeout if the LLM is served over
a local URL
- Use a larger timeout duration than the default (5s) for online LLMs
too. This matches the timeout duration increase for calls to the
Gemini API
- Improve the overall flow of the contribute section of the Readme
- Fix where to look for good first issues. The contributors board is
outdated. It is easier to maintain and view good first issues with
issue tags directly.
Co-authored-by: Debanjum <debanjum@gmail.com>
Fall back to assuming the user is not subscribed if no user is passed.
This makes the user arg actually optional in the async
send_message_to_model_wrapper function.
### Major
All reasoning models return thoughts differently due to the lack of
standardization.
We normalize thoughts across reasoning models and providers to ease
handling within Khoj.
The model thoughts are parsed in research mode and when generating the
final response.
These model thoughts are returned by the chat API and shown in the
train of thought on the web app.
Thoughts are enabled for DeepSeek, Anthropic, Grok and Qwen3 reasoning
models served via API.
Gemini and OpenAI reasoning models do not expose their thoughts via
their standard APIs.
### Minor
- Fix the ability to use the DeepSeek reasoner for intermediate stages
of chat
- Enable handling of Qwen3 reasoning models
Previously the DeepSeek reasoner couldn't be used via the API for
completions because the additional formatting constraints it requires
were being applied in this function.
The formatting fix is applied in the chat completion endpoint instead.
DeepSeek reasoners return reasoning in the reasoning_content field.
Create an async stream processor to parse the reasoning out when using
the DeepSeek reasoner model.
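A sketch of such a stream processor over OpenAI-style chunks, where `delta.reasoning_content` carries the DeepSeek reasoning (the chunk shape follows the OpenAI chat completion stream; the function name is illustrative):

```python
async def split_reasoning_stream(stream):
    """Collect DeepSeek reasoning and answer text from a completion stream."""
    thoughts, answer = [], []
    async for chunk in stream:
        delta = chunk.choices[0].delta
        # DeepSeek reasoners put thoughts in reasoning_content, not content.
        if getattr(delta, "reasoning_content", None):
            thoughts.append(delta.reasoning_content)
        elif getattr(delta, "content", None):
            answer.append(delta.content)
    return "".join(thoughts), "".join(answer)
```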
The Qwen3 reasoning models return thoughts within <think></think> tags
before the response.
This change parses the thoughts out of the final response in the
response stream and returns them as a structured response with
thoughts.
These thoughts aren't passed to the client yet.
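Parsing the thoughts out of a complete Qwen3 response can be sketched with a regex; stream-aware parsing needs more state than shown here, and the function name is illustrative:

```python
import re


def split_think_tags(text: str) -> tuple[str, str]:
    """Split Qwen3-style <think>...</think> reasoning from the answer."""
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    # No think tags: treat the whole text as the answer.
    return "", text.strip()
```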
The OpenAI API doesn't support thoughts via chat completion by default,
but there are thinking models served via OpenAI-compatible APIs, like
DeepSeek and Qwen3.
Add stream handlers and modified response types that can contain
thoughts as well as the content returned by a model.
These can be used to instantiate stream handlers for different model
types like DeepSeek and Qwen3 served over an OpenAI-compatible API.