- This operator works with models served over an OpenAI-compatible API
- It uses separate vision models to reason about and ground actions.
This improves the flexibility of the operator agents that can be created.
Our operator agent no longer needs to rely on a monolithic model that
can both reason over visual data and ground its actions.
We can create an operator agent from two separate models:
1. One to reason over screenshots and suggest the next action in natural language
2. One to ground those suggestions into visually grounded actions
This allows us to create fully local operators, or operators that
combine the best visual reasoner with the best visual grounder model.
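A minimal sketch of the two-model composition described above; the class and method names are hypothetical stand-ins, not Khoj's actual interfaces:

```python
class VisualReasoner:
    """Hypothetical reasoner: suggests the next action in natural language."""

    def suggest_next_action(self, screenshot: bytes, goal: str) -> str:
        return f"click the button labelled '{goal}'"


class VisualGrounder:
    """Hypothetical grounder: maps a suggestion to a concrete UI action."""

    def ground(self, screenshot: bytes, suggestion: str) -> dict:
        return {"type": "click", "x": 120, "y": 240, "reason": suggestion}


def next_action(reasoner, grounder, screenshot: bytes, goal: str) -> dict:
    # Stage 1: reason over the screenshot in natural language.
    suggestion = reasoner.suggest_next_action(screenshot, goal)
    # Stage 2: ground the suggestion into a visually grounded action.
    return grounder.ground(screenshot, suggestion)
```

Either stage can be swapped independently, e.g. a local grounder with a hosted reasoner.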
Inform the agent that it can only control a single Playwright browser page.
Previously it assumed it was operating a whole browser, so it would
have trouble navigating to different pages.
Improve handling of errors in action parsing
Stop OperatorAgent-specific code from leaking into the operator. The
operator now just calls the standard OperatorAgent functions.
Any OperatorAgent-specific logic is handled by the OperatorAgent
internally.
This improves the separation of responsibilities between the operator,
the OperatorAgent and the Environment.
- Make the environment pass screenshot data in an agent-agnostic format
- Have operator agent providers format image data into their AI model's
specific format
- Add environment step type to distinguish image vs text content
- Clearly mark major steps in the operator iteration loop
- Handle Anthropic models returning computer tool actions as normal
tool calls by normalizing next-action retrieval from their responses
- Remove unused ActionResults fields
- Remove unnecessary placeholders in the content of action results,
e.g. for screenshot data
Decouple applying an action on the Environment from the next action
decision by the OperatorAgent:
- Create an abstract Environment class with a `step` method
and a standardized set of supported actions for each concrete Environment
- Wrap the Playwright page in a concrete Environment class
- Create an abstract OperatorAgent class with an abstract `act` method
- Wrap the OpenAI computer operator in a concrete OperatorAgent class
- Wrap the Claude computer operator in a concrete OperatorAgent class
Handle agent actions that open links in a new tab.
Previously some link clicks would open in a new tab. This is outside
the browser operator's context, so the new page could not be interacted
with by the browser operator.
This change catches new page opens and opens them in the context page
instead.
Give the Khoj browser operator access to a browser with existing
context (auth, cookies etc.) by starting it with CDP enabled.
Process:
1. Start the browser with CDP enabled:
`edge/chromium/chrome --remote-debugging-port=9222`
2. Set the KHOJ_CDP_URL env var to the CDP URL of the browser to use.
3. Start Khoj and ask it to get browser-based work done with operator
+ research mode
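For example, on Linux with Chromium (the binary name and port are illustrative; any Chromium-based browser works):

```shell
# 1. Launch the browser with the Chrome DevTools Protocol enabled
#    (run this manually in your desktop session):
#    chromium --remote-debugging-port=9222

# 2. Point Khoj at the browser's CDP endpoint
export KHOJ_CDP_URL="http://localhost:9222"
echo "$KHOJ_CDP_URL"

# 3. Start Khoj as usual; the operator attaches to the running browser
```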
The GitHub run_eval workflow sets OPENAI_BASE_URL to an empty string.
The AI model API client created during initialization for OpenAI models
then gets its base URL set to an empty string rather than None or the
actual OpenAI base URL.
It then tries to call the LLM at the empty string base URL instead of
the default OpenAI API base URL, which obviously fails.
The fix is to map empty base URLs to the actual OpenAI API base URL.
Previously all exceptions were being caught, so retry logic wasn't
being triggered.
The exception catching had been added to close the LLM thread when
threads instead of async were used for final response generation.
This isn't required anymore since moving to async, so we can now
re-enable retry on failures.
Raise an error if the response is empty to retry the LLM completion.
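The raise-on-empty behaviour can be sketched without the tenacity decorator the project actually uses; `call_llm` below is a stand-in that returns empty output twice before succeeding:

```python
def retry_on_empty(call, attempts: int = 3) -> str:
    """Retry an LLM call, treating an empty response as a failure."""
    last_error = None
    for _ in range(attempts):
        try:
            response = call()
            if not response:
                # Raising here is what lets retry logic trigger on empty output.
                raise ValueError("Empty LLM response")
            return response
        except ValueError as error:
            last_error = error
    raise last_error


calls = {"count": 0}


def call_llm() -> str:
    # Stand-in LLM call: empty output twice, then a real response.
    calls["count"] += 1
    return "" if calls["count"] < 3 else "ok"
```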
Send images as PNG to non-OpenAI models served via an OpenAI-compatible
API, as more models support PNG than WebP.
Continue storing images as WebP on the server for efficiency.
Convert to PNG at the OpenAI API layer, and only for non-OpenAI models
served via an OpenAI-compatible API.
This enables using vision models like UI-TARS (via llama.cpp server) and Grok.
### Major
* Do more granular truncation on hitting context limits
* Pack research iterations as a list of message content items instead
of separate messages
* Update message truncation logic to truncate items in the message
content list
* Make the researcher aware of the number of web and doc queries
allowed per iteration
### Minor
* Prompt the web page reader to extract quantitative data verbatim from pages
* Track Gemini 2.0 Flash Lite cost. Reduce max prompt size for 4o-mini
* Ensure time to first token is logged only once per chat response
* Upgrade tenacity to respect min_time passed to the exponential
backoff with jitter function
The fix for the issue is in tenacity 9.0.0, but older langchain
required tenacity <9.0.0.
Explicitly pin the versions of langchain sub-packages to avoid indexing
and doc parsing breakage.
- Construct the tool description dynamically based on the configurable
query count
- Inform the researcher how many webpage reads, online searches and
document searches it can perform per iteration when it decides
which tool to use next and the query to send to the tool AI
- Pass the query counts down from the research AI to the tool AIs
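Constructing a tool description from the configured query budget might look like this (the function and wording are illustrative, not Khoj's actual prompt):

```python
def describe_tool(name: str, purpose: str, max_queries: int) -> str:
    """Build a tool description that tells the researcher its
    per-iteration query budget for this tool."""
    return (
        f"{name}: {purpose} "
        f"You may send up to {max_queries} queries to this tool per iteration."
    )
```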
Time to first token log lines were shown multiple times if a new chunk
being streamed was empty for some reason.
This change makes the logic robust to receiving empty chunks.
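The empty-chunk guard can be sketched as below (`log` is a stand-in list for the actual logger):

```python
def stream_with_ttft(chunks, log):
    """Yield non-empty chunks, logging time to first token exactly once."""
    first_token_seen = False
    for chunk in chunks:
        if not chunk:
            # Empty chunks must neither trigger nor repeat the log line.
            continue
        if not first_token_seen:
            log.append("time to first token")
            first_token_seen = True
        yield chunk
```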
Previously research iterations and conversation logs were added to a
single user message. This prevented truncating each past iteration
separately on hitting context limits. So the whole past research
context had to be dropped on hitting context limits.
This change splits each research iteration into a separate item in a
message content list.
It uses the ability for message content to be a list, which is
supported by all major AI model APIs like OpenAI, Anthropic and Gemini.
The change in message format seen by pick next tool chat actor:
- New Format
- System: System Message
- User/Assistant: Chat History
- User: Raw Query
- Assistant: Iteration History
- Iteration 1
- Iteration 2
- User: Query with Pick Next Tool Nudge
- Old Format
- User: System + Chat History + Previous Iterations Message
- User: Query
- Collateral Changes
The construct_structured_message function has been updated to always
return a list[dict[str, Any]].
Previously it would only use a list if attached_file_context was set or
a vision model with images was used, for wider compatibility with other
OpenAI-compatible APIs.
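Packing iterations as content list items can be sketched as below; the truncation policy shown, dropping the oldest items first, is an assumption for illustration:

```python
def pack_iterations(iterations: list[str]) -> dict:
    """Pack each research iteration as a separate item in one
    assistant message's content list."""
    return {
        "role": "assistant",
        "content": [{"type": "text", "text": it} for it in iterations],
    }


def truncate_content(message: dict, max_items: int) -> dict:
    """On hitting context limits, drop the oldest content items instead
    of dropping the whole research context."""
    return {**message, "content": message["content"][-max_items:]}
```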
Previously the research agent would have a hard time getting
quantitative data extracted by the web page reader tool AI.
This change encourages the web page reader tool to extract relevant
data verbatim for higher granularity research and responses.
The code tool should see code context and the webpage tool should see
online context during research runs.
Fix to include code context from past conversations to answer queries.
Add all queries to the tool chat history when no specific tool is
provided to limit inferred query extraction to.
- Use a much larger read and connect timeout if the LLM is served over
a local URL
- Use a larger timeout duration than the default (5s) for online LLMs
too. This matches the timeout duration increase for calls to the
Gemini API
- Improve the overall flow of the contribute section of the Readme
- Fix where to look for good first issues. The contributors board is
outdated. It is easier to maintain and view good first issues with
issue tags directly.
Co-authored-by: Debanjum <debanjum@gmail.com>
Fall back to assuming the user is not subscribed if no user is passed.
This makes the user arg actually optional in the async
send_message_to_model_wrapper function.
### Major
All reasoning models return thoughts differently due to the lack of
standardization.
We normalize thoughts across reasoning models and providers to ease
handling within Khoj.
The model thoughts are parsed in research mode and when generating the
final response.
These model thoughts are returned by the chat API and shown in the
train of thought on the web app.
Thoughts are enabled for DeepSeek, Anthropic, Grok and Qwen3 reasoning
models served via API.
Gemini and OpenAI reasoning models do not expose their thoughts via
their standard APIs.
### Minor
- Fix the ability to use the DeepSeek reasoner for intermediate stages
of chat
- Enable handling of Qwen3 reasoning models
Previously the DeepSeek reasoner couldn't be used via the API for
completions because the additional formatting constraints it requires
were being applied in this function.
The formatting fix is applied in the chat completion endpoint instead.
DeepSeek reasoners return reasoning in the reasoning_content field.
Create an async stream processor to parse the reasoning out when using
the DeepSeek reasoner model.
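A sketch of such a stream processor over OpenAI-style chunks, where `delta.reasoning_content` carries the DeepSeek reasoning (the chunk shape follows the OpenAI chat completion stream; the function name is illustrative):

```python
async def split_reasoning_stream(stream):
    """Collect DeepSeek reasoning and answer text from a completion stream."""
    thoughts, answer = [], []
    async for chunk in stream:
        delta = chunk.choices[0].delta
        # DeepSeek reasoners put thoughts in reasoning_content, not content.
        if getattr(delta, "reasoning_content", None):
            thoughts.append(delta.reasoning_content)
        elif getattr(delta, "content", None):
            answer.append(delta.content)
    return "".join(thoughts), "".join(answer)
```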
The Qwen3 reasoning models return thoughts within <think></think> tags
before the response.
This change parses the thoughts out of the final response in the
response stream and returns them as a structured response with
thoughts.
These thoughts aren't passed to the client yet.
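Parsing the thoughts out of a complete Qwen3 response can be sketched with a regex; stream-aware parsing needs more state than shown here, and the function name is illustrative:

```python
import re


def split_think_tags(text: str) -> tuple[str, str]:
    """Split Qwen3-style <think>...</think> reasoning from the answer."""
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    # No think tags: treat the whole text as the answer.
    return "", text.strip()
```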
The OpenAI API doesn't support thoughts via chat completion by default,
but there are thinking models served via OpenAI-compatible APIs, like
DeepSeek and Qwen3.
Add stream handlers and modified response types that can contain
thoughts as well as the content returned by a model.
These can be used to instantiate stream handlers for different model
types like DeepSeek and Qwen3 served over an OpenAI-compatible API.