Commit Graph

4807 Commits

Debanjum
d8bc6239f8 Bifurcate visual grounder into a ui-tars specific & generic grounder
Split visual grounder into two implementations:

- A ui-tars specific visual grounder agent. This uses the canonical
  implementation of ui-tars with specialized system prompt and action
  parsing.

- Fallback to a generic visual grounder utilizing tool-use, served over
  any openai compatible api. This was previously being used for our
  ui-tars implementation as well.
2025-05-19 16:28:55 -07:00
Debanjum
c3bfb15fab Support KeyUp, KeyDown operator actions. Make coordinates into floats 2025-05-19 16:28:55 -07:00
Debanjum
b279060e2c Enable using Operator with Gemini models 2025-05-19 16:28:55 -07:00
Debanjum
0d8fb667ec Add action results for multiple actions similar to other operator agents
Adds the results of each action in a separate item in message content.
Previously we were adding this as a single larger text blob. This
change adds structure to simplify post processing (e.g. truncation).

The updated add_action_results should also require less work to
generalize if we pass tool call history to the grounding model as
action results in valid openai format.
2025-05-19 16:28:55 -07:00
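A minimal sketch of the per-action structure described above, assuming hypothetical action dicts and an illustrative add_action_results signature (not the actual Khoj implementation):

```python
# Hypothetical sketch: structure each action's result as a separate
# item in the message content list, instead of one large text blob.
# Separate items simplify post processing, e.g. truncating only the
# oldest results on hitting context limits.

def add_action_results(actions: list[dict]) -> list[dict]:
    """Convert executed actions into per-action content items."""
    return [
        {"type": "text", "text": f"Action: {a['name']}\nResult: {a['result']}"}
        for a in actions
    ]

results = add_action_results([
    {"name": "click", "result": "ok"},
    {"name": "type", "result": "entered 'hello'"},
])
```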
Debanjum
e17c06b798 Set operator query on init. Pass summarize prompt to summarize func
The initial user query isn't updated during an operator run. So set it
when initializing the operator agent instead of passing it on every
call to act.

Pass summarize prompt directly to the summarize function. Let it
construct the summarize message to query vision model with.
Previously it was being passed to the add_action_results func, as the
previous implementation did not use a separate summarize func.

Also rename chat_model to vision_model for a more pertinent var name.

These changes make the code cleaner and implementation more readable.
2025-05-19 16:28:55 -07:00
Debanjum
38bcba2f4b Make back action in browser environment use goto to avoid timeouts
For some reason the page.go_back() action in playwright had a much
higher propensity to timeout. Use goto instead to reduce these page
traversal timeouts.

This requires tracking navigation history.
2025-05-19 16:28:55 -07:00
Debanjum
fd139d4708 Improve termination on task completion for binary operator agent
Only let the visual reasoner handle terminating the operator run.
Previously the grounder was also able to trigger termination.

Make catching the termination by the reasoner more robust
2025-05-19 16:28:55 -07:00
Debanjum
680c226137 Use any supported vision model as reasoner for binary operator agent 2025-05-19 16:28:55 -07:00
Debanjum
3839d83b90 Modularize operator into separate files for agent, action, environment etc
The previous browser_operator.py file had become pretty massive and
unwieldy. This change breaks it apart into separate files for
- the abstract environment and operator agent base
- the concrete agents: anthropic, openai and binary
- the concrete environment browser operator
- the operator actions used by agents and environment
2025-05-19 16:28:55 -07:00
Debanjum
833c8ed150 Add a flexible operator agent using separate reasoning, grounder models
- This operator works with models served over an openai compatible api
- It uses separate vision models to reason and ground actions.

This improves flexibility in the operator agents that can be created.
We no longer need our operator agent to rely on monolithic models that
can both reason over visual data and ground their actions.

We can create an operator agent from 2 separate models:
1. To reason over screenshots and suggest the next action in natural language
2. To ground those suggestions into visually grounded actions

This allows us to create fully local operators or operators combining
the best visual reasoner with the best visual grounder models.
2025-05-19 16:28:55 -07:00
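The reason-then-ground composition can be sketched with stand-in callables for the two vision models (all names here are illustrative, not the actual implementation):

```python
# Hypothetical sketch of one iteration of the two-model operator: a
# visual reasoner proposes the next action in natural language, and a
# separate grounder turns that suggestion into a concrete,
# coordinate-level action.

def operate_step(screenshot: bytes, reason, ground) -> dict:
    """One iteration: reason over the screenshot, then ground the suggestion."""
    suggestion = reason(screenshot)        # natural language next action
    return ground(screenshot, suggestion)  # visually grounded action

# Stub callables standing in for the reasoner and grounder vision models.
reasoner = lambda image: "click the Submit button"
grounder = lambda image, text: {"type": "click", "x": 412.0, "y": 637.5}

action = operate_step(b"<png bytes>", reasoner, grounder)
```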
Debanjum
773d20a26f Improve instructions to the openai operator agent.
Inform it that it can only control a single playwright browser page.
Previously it was assuming it is operating a whole browser, so would
have trouble navigating to different pages.

Improve handling of errors in action parsing
2025-05-19 16:28:55 -07:00
Debanjum
4db888cd62 Simplify operator loop. Make each OperatorAgent manage state internally.
Stop OperatorAgent specific code from leaking out into the
operator. The Operator just calls the standard OperatorAgent functions.

Each OperatorAgent's specific logic is handled by the OperatorAgent
internally.

This improves the separation of responsibility between the Operator,
OperatorAgent and the Environment.

- Make environment pass screenshot data in agent agnostic format
  - Have operator agent providers format image data into their AI model
    specific format
  - Add environment step type to distinguish image vs text content
- Clearly mark major steps in the operator iteration loop
- Handle anthropic models returning computer tool actions as normal
  tool calls by normalizing next action retrieval from response for it
- Remove unused ActionResults fields
- Remove unnecessary placeholders in the content of action results,
  like for screenshot data
2025-05-19 16:28:55 -07:00
Debanjum
a1c9c6b2e3 Add pages visited via browser operator to references returned to clients 2025-05-19 16:28:55 -07:00
Debanjum
e71575ad1a Render screenshot in train of thought on openai agent screenshot action 2025-05-19 16:28:55 -07:00
Debanjum
78e052bfcb Decouple environment from operator agent to improve modularity
Decouple applying action on Environment from next action decision by
OperatorAgent

- Create an abstract Environment class with a `step` method
  and a standardized set of supported actions for each concrete Environment
  - Wrap the playwright page into a concrete Environment class

- Create an abstract OperatorAgent class with an abstract `act` method
  - Wrap the Openai computer Operator into a concrete OperatorAgent class
  - Wrap the Claude computer Operator into a concrete OperatorAgent class

Handle the interaction between the Agent's actions and the Environment
2025-05-19 16:28:55 -07:00
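A minimal sketch of the abstract split, with toy concrete classes in place of the real browser environment and openai/claude agents; the signatures are assumptions:

```python
# Hypothetical sketch of the Environment/OperatorAgent decoupling:
# the agent decides actions via act(), the environment applies them
# via step(), and neither needs to know the other's internals.

from abc import ABC, abstractmethod

class Environment(ABC):
    @abstractmethod
    def step(self, action: dict) -> dict:
        """Apply an action and return the observation (e.g. a screenshot)."""

class OperatorAgent(ABC):
    @abstractmethod
    def act(self, observation: dict) -> dict:
        """Decide the next action from the latest observation."""

# Toy concrete classes to show the shape of the interaction loop.
class EchoEnvironment(Environment):
    def step(self, action: dict) -> dict:
        return {"observed": action}

class FixedAgent(OperatorAgent):
    def act(self, observation: dict) -> dict:
        return {"type": "screenshot"}

env, agent = EchoEnvironment(), FixedAgent()
obs = env.step(agent.act({}))
```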
Debanjum
7c60e04efb Pull out common iteration loop into main browser operator method 2025-05-19 16:28:54 -07:00
Debanjum
08e93c64ab Render screenshot in train of thought on browser screenshot action
Update web app to render screenshot image when screenshot action taken
by browser operator
2025-05-19 16:28:54 -07:00
Debanjum
188b3c85ae Force open links in current page to stay in operator page context
Previously some link clicks would open in a new tab. This is outside
the browser operator's context, so the operator cannot interact with
the new page.

This change catches new page opens and opens them in the context page
instead.
2025-05-19 16:28:54 -07:00
Debanjum
20f87542e5 Add cancellation support to browser operator via asyncio.Event 2025-05-19 16:28:54 -07:00
Debanjum
9f75622346 Allow browser operator to use browser with existing context over CDP
Give the Khoj browser operator access to a browser with existing
context (auth, cookies etc.) by starting it with CDP enabled.

Process:
1. Start Browser with CDP enabled:
  `Edge/Chromium/Chrome --remote-debugging-port=9222`
2. Set the KHOJ_CDP_URL env var to the CDP url of the browser to use.
3. Start Khoj and ask it to get browser based work done with operator
   + research mode
2025-05-19 16:28:54 -07:00
Debanjum
b9ea538b02 Support operating web browser with Anthropic models
- Add back() and goto(url) helper functions to operate browser
- Cache operator messages to Anthropic API for speed and cost savings
2025-05-19 16:28:54 -07:00
Debanjum
2e86141575 Enable Khoj to use a GUI web browser. Operate it with Openai models 2025-05-19 16:28:54 -07:00
Debanjum
ab5d0b5878 Upgrade server dependencies 2025-05-19 16:28:21 -07:00
Debanjum
22cd638add Fix handling unset openai_base_url to run eval with openai chat models
The github run_eval workflow sets OPENAI_BASE_URL to empty string.

The ai model api client created during initialization for openai
models gets its base url set to empty string rather than None or the
actual openai base url.

This makes llm calls go to the empty string base url instead of the
default openai api base url, which obviously fails.

Fix is to map empty base urls to the actual openai api base url.
2025-05-19 16:19:43 -07:00
Debanjum
cf55582852 Retry on empty response or error in chat completion by llm over api
Previously all exceptions were being caught. So retry logic wasn't
getting triggered.

Exception catching had been added to close the llm thread when
threads, instead of async, were being used for final response
generation.

This isn't required anymore since moving to async. And we can now
re-enable retry on failures.

Raise error if response is empty to retry llm completion.
2025-05-19 11:27:19 -07:00
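A sketch of the retry behaviour described above, with a pure-python loop standing in for the actual retry library; the exception and function names are hypothetical:

```python
# Hypothetical sketch: catch only retryable errors (a blanket
# `except Exception` swallowed everything and defeated the retry logic,
# which is the bug this commit fixes), and treat an empty completion as
# an error so it is retried like any failure.

class EmptyResponseError(RuntimeError):
    pass

def complete_with_retry(call, max_attempts: int = 3) -> str:
    last_error = None
    for _ in range(max_attempts):
        try:
            response = call()
            if not response:
                # Raise so an empty completion triggers a retry too.
                raise EmptyResponseError("llm returned empty response")
            return response
        except (EmptyResponseError, TimeoutError) as e:  # retryable only
            last_error = e
    raise last_error

# A flaky stub: empty on the first call, real content on the second.
replies = iter(["", "final answer"])
result = complete_with_retry(lambda: next(replies))  # → "final answer"
```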
Debanjum
7827d317b4 Widen vision support for chat models served via openai compatible api
Send images as png to non-openai models served via an openai
compatible api, since more models support png than webp.

Continue storing images as webp on server for efficiency.

Convert to png at the openai api layer and only for non-openai models
served via an openai compatible api.

Enable using vision models like ui-tars (via llama.cpp server), grok.
2025-05-19 11:27:19 -07:00
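The format decision can be sketched as a provider check; the function name and provider strings are illustrative, not the actual Khoj code:

```python
# Hypothetical sketch of the wire-format decision: keep webp for openai
# itself (and for server-side storage), transcode to png at the api
# layer for non-openai models served via an openai compatible api.

def wire_image_format(model_provider: str) -> str:
    """Image format to send over the api for a given model provider."""
    return "webp" if model_provider == "openai" else "png"

assert wire_image_format("openai") == "webp"
assert wire_image_format("llama.cpp") == "png"  # e.g. ui-tars served locally
```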
Debanjum
4f3fdaf19d Increase khoj api response timeout on evals call. Handle no decision 2025-05-18 19:14:49 -07:00
Debanjum
31dcc44c20 Output tokens >> reasoning tokens to avoid early response termination. 2025-05-18 14:45:23 -07:00
Debanjum
73e28666b5 Fix to set default chat model for all user tiers via env var 2025-05-18 14:45:23 -07:00
Debanjum
06dcd4426d Improve Research Mode Context Management (#1179)
### Major
* Do more granular truncation on hitting context limits
* Pack research iterations as list of message content instead of
separate messages
* Update message truncation logic to truncate items in message content
list
* Make researcher aware of number of web, doc queries allowed per
iteration

### Minor
* Prompt web page reader to extract quantitative data as is from pages
* Track gemini 2.0 flash lite cost. Reduce max prompt size for 4o-mini
* Ensure time to first token logged only once per chat response
* Upgrade tenacity to respect min_time passed to exponential backoff
with jitter function
2025-05-17 17:38:31 -07:00
Debanjum
fd591c6e6c Upgrade tenacity to respect min time for exponential backoff
Fix for the issue is in tenacity 9.0.0. But older langchain required
tenacity <9.0.0.

Explicitly pin version of langchain sub packages to avoid indexing
and doc parsing breakage.
2025-05-17 17:37:15 -07:00
Debanjum
988bde651c Make researcher aware of no. of web, doc queries allowed per iteration
- Construct tool description dynamically based on configurable query
  count
- Inform the researcher how many webpage reads, online searches and
  document searches it can perform per iteration when it has to decide
  which next tool to use and the query to send to the tool AI.
- Pass the query counts to perform from the research AI down to the
  tool AIs
2025-05-17 17:37:15 -07:00
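A hypothetical sketch of building the tool description from a configurable query count; the wording and names are illustrative:

```python
# Hypothetical sketch: construct the tool description dynamically so
# the researcher knows its per-iteration query budget when picking the
# next tool and the queries to send it.

def describe_tool(name: str, max_queries: int) -> str:
    """Tool description that states the per-iteration query budget."""
    noun = "query" if max_queries == 1 else "queries"
    return f"{name}: you may send up to {max_queries} {noun} to this tool per iteration."

desc = describe_tool("online_search", 3)
# → "online_search: you may send up to 3 queries to this tool per iteration."
```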
Debanjum
417ab42206 Track gemini 2.0 flash lite cost. Reduce max prompt size for 4o-mini 2025-05-17 17:37:15 -07:00
Debanjum
e125e299a7 Ensure time to first token logged only once per chat response
Time to first token log lines were shown multiple times if the new
chunk being streamed was empty for some reason.

This change makes the logic robust to empty chunks being received.
2025-05-17 17:37:15 -07:00
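The once-only guard can be sketched with a flag that ignores empty chunks; the class and field names are hypothetical:

```python
# Hypothetical sketch: record time to first token only when the first
# *non-empty* chunk arrives, so empty chunks can no longer cause the
# ttft log line to fire multiple times.

import time

class TtftTracker:
    def __init__(self) -> None:
        self.start = time.monotonic()
        self.ttft: float | None = None

    def on_chunk(self, chunk: str) -> None:
        # Ignore empty chunks; record ttft exactly once.
        if chunk and self.ttft is None:
            self.ttft = time.monotonic() - self.start

tracker = TtftTracker()
tracker.on_chunk("")       # empty chunk: ignored
tracker.on_chunk("Hello")  # first real token: ttft recorded
first = tracker.ttft
tracker.on_chunk("world")  # later chunks do not overwrite it
```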
Debanjum
2694734d22 Update truncation logic to handle multi-part message content 2025-05-17 17:37:15 -07:00
Debanjum
a337d9e4b8 Structure research iteration msgs for more granular context management
Previously research iterations and conversation logs were added to a
single user message. This prevented truncating each past iteration
separately on hitting context limits. So the whole past research
context had to be dropped on hitting context limits.

This change splits each research iteration into a separate item in a
message content list.

It uses the ability for message content to be a list, that is
supported by all major ai model apis like openai, anthropic and gemini.

The change in message format seen by pick next tool chat actor:
- New Format
  - System: System Message
  - User/Assistant: Chat History
  - User: Raw Query
  - Assistant: Iteration History
    - Iteration 1
    - Iteration 2
  - User: Query with Pick Next Tool Nudge

- Old Format
  - User: System + Chat History + Previous Iterations Message
  - User: Query

- Collateral Changes
The construct_structured_message function has been updated to always
return a list[dict[str, Any]].

Previously it'd only use a list if attached_file_context was set or a
vision model with images was used, for wider compatibility with other
openai compatible apis.
2025-05-17 17:37:15 -07:00
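A minimal sketch of the new packing, assuming illustrative helper names: each iteration becomes its own content item, so truncation can drop the oldest items individually instead of losing the whole research context:

```python
# Hypothetical sketch: pack research iterations as separate items in
# one assistant message's content list (supported by the openai,
# anthropic and gemini apis), enabling per-iteration truncation.

def pack_iterations(iterations: list[str]) -> dict:
    return {
        "role": "assistant",
        "content": [{"type": "text", "text": it} for it in iterations],
    }

def truncate_oldest(message: dict, max_items: int) -> dict:
    """Drop the oldest iteration items until the content list fits."""
    return {**message, "content": message["content"][-max_items:]}

msg = pack_iterations(["Iteration 1: searched docs", "Iteration 2: read page"])
trimmed = truncate_oldest(msg, max_items=1)
```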
Debanjum
0f53a67837 Prompt web page reader to extract quantitative data as is from pages
Previously the research agent would have a hard time getting
quantitative data extracted by the web page reader tool AI.

This change aims to encourage the web page reader tool to extract
relevant data in verbatim form for higher granularity research and
responses.
2025-05-17 17:37:15 -07:00
Debanjum
99a2305246 Improve tool chat history constructor and fix its usage during research.
Code tool should see code context and webpage tool should see online
context during research runs

Fix to include code context from past conversations to answer queries.

Add all queries to the tool chat history when no specific tool is
provided to limit extracting inferred queries for.
2025-05-17 17:37:15 -07:00
Debanjum
8050173ee1 Timeout calls to khoj api in evals to continue to next question 2025-05-17 17:37:11 -07:00
Debanjum
442c7b6153 Retry running code on more request exceptions 2025-05-17 17:37:11 -07:00
Debanjum
10a5d68a2c Improve retry, increase timeouts of gemini api calls
- Catch specific retryable exceptions for retry
- Increase httpx timeout from default of 5s to 20s
2025-05-17 16:38:55 -07:00
Debanjum
20f08ca564 Reduce timeouts on calling local and online llms via openai api
- Use much larger read, connect timeout if llm served over local url
- Use a larger timeout duration than the default (5s) for online llms too.
  This matches the timeout duration increase for calls to the gemini api
2025-05-17 16:38:55 -07:00
Debanjum
e0352cd8e1 Handle unset ttft in metadata of failed chat response. Fixes evals.
This was causing evals to stop processing the rest of the batch as well.
2025-05-17 15:06:22 -07:00
Debanjum
673a15b6eb Upgrade hf hub package to include hf_xet for faster downloads 2025-05-17 15:06:22 -07:00
Debanjum
d867dca310 Fix send_message_to_model_wrapper by using sync is_user_subscribed check
Calling an async function from a sync function wouldn't work.
2025-05-17 15:06:22 -07:00
Sajjad Baloch
a4ab498aec Update README for better contributions (#1170)
- Improve overall flow of the contribute section of Readme
- Fix where to look for good first issues. The contributors board is outdated. Easier to maintain and view good-first-issue with issue tags directly.

Co-authored-by: Debanjum <debanjum@gmail.com>
2025-05-12 09:51:01 -06:00
Debanjum
2feed544a6 Add Gemini 2.0 flash back to default gemini chat models list
Remove once gemini 2.5 flash is GA
2025-05-11 19:05:09 -06:00
Debanjum
2e290ea690 Pass conversation history to generate non-streaming chat model responses
Allows the send_message_to_model_wrapper func to also use conversation
logs as context to generate the response. This is an optional parameter
2025-05-09 00:02:14 -06:00
Debanjum
8787586e7e Dedupe code to format messages before sending to appropriate chat model
Fallback to assuming the user is not subscribed if no user is passed.
This makes the user arg actually optional in the async
send_message_to_model_wrapper function
2025-05-09 00:02:14 -06:00
Debanjum
e94bf00e1e Add cancellation support to research mode via asyncio.Event 2025-05-09 00:01:45 -06:00