Commit Graph

4779 Commits

Author SHA1 Message Date
Debanjum
07e33994f0 Reduce scroll amount to have previous page stay a bit on screen 2025-05-20 00:31:56 -07:00
Debanjum
e2c1b1fcd3 Add dev container config to ease setup for remote development 2025-05-19 23:34:31 -07:00
Debanjum
fdb681ca0e Only install desktop, obsidian app from dev_setup.sh with --full flag 2025-05-19 23:34:31 -07:00
Debanjum
33dd4c8c33 Handle gemini returning simple string in response candidates 2025-05-19 19:45:10 -07:00
Debanjum
626ced8b8b Fix adding code results to chatml messages context 2025-05-19 19:45:10 -07:00
Debanjum
ded753ff9a Improve parsing tool use coordinate returned by claude operator agent
It sometimes outputs coordinates as a string rather than a list. Make the
parser more robust to those kinds of errors.

Share error with operator agent to fix/iterate on instead of exiting
the operator loop.
2025-05-19 16:28:55 -07:00
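A minimal sketch of the kind of tolerant coordinate parsing described in the commit above. The function name `parse_coordinate` and the accepted input shapes are assumptions for illustration, not the project's actual parser. The raised `ValueError` message is the sort of text that could be shared back with the operator agent instead of exiting the loop.

```python
import ast
from typing import Sequence, Tuple, Union


def parse_coordinate(raw: Union[str, Sequence[float]]) -> Tuple[float, float]:
    """Normalize a coordinate given as a list or as its string form (hypothetical helper)."""
    value = raw
    if isinstance(raw, str):
        try:
            # Safely evaluate strings such as "[120, 340]" or "(120, 340)".
            value = ast.literal_eval(raw.strip())
        except (ValueError, SyntaxError) as error:
            raise ValueError(f"Could not parse coordinate: {raw!r}") from error
    if not isinstance(value, (list, tuple)) or len(value) != 2:
        raise ValueError(f"Expected an [x, y] pair, got: {raw!r}")
    x, y = value
    return float(x), float(y)
```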
Debanjum
473dd006d5 Remove unnecessary images conversion to png in binary operator agent.
It's handled by the ai model interaction handlers in khoj server core.
2025-05-19 16:28:55 -07:00
Debanjum
9f3fbf9021 Encourage reasoner, grounder to work better together in binary operator
- Encourage grounder to adhere to the reasoner's action instruction
- Encourage reasoner to explore other actions when stuck in a loop
  Previously seemed to be forcing it too strongly to choose
  "single most important" next action. So may not have been exploring
  other actions to achieve objective on initial failure.
2025-05-19 16:28:55 -07:00
Debanjum
ac19f6d336 Improve operator exception handling
- Do not catch errors just to re-throw them. It results in a
  confusing "exception happened during handling of an exception"
  stacktrace and makes it harder to debug.

- Log error when action_results.content isn't set or empty to
  debug this operator run error
2025-05-19 16:28:55 -07:00
Debanjum
59e0e092b0 Remove deprecated prompt for grounding model to choose goto, back func
Goto and back functions are chosen by the visual reasoning model for
increased reliability in selecting those tools. The ui-tars grounding
model seems too tuned to use a specific set of tools.
2025-05-19 16:28:55 -07:00
Debanjum
1442a4f6fb Handle reasoning messages returned by openai cua model
Documentation about this is currently limited and confusing. But it seems
a reasoning item should be kept if a computer_call follows it, else dropped.

Add noop placeholder for reasoning item to prevent termination of
operator run on response with just reasoning.
2025-05-19 16:28:55 -07:00
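The keep-or-drop rule in the commit above can be sketched as below. This assumes response items are plain dicts with a `type` field and that "after" means the immediately following item. Both are assumptions for illustration, and `filter_reasoning_items` is a hypothetical name.

```python
from typing import Any, Dict, List


def filter_reasoning_items(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep a reasoning item only when a computer_call directly follows it."""
    kept = []
    for index, item in enumerate(items):
        if item.get("type") == "reasoning":
            next_item = items[index + 1] if index + 1 < len(items) else None
            if next_item is None or next_item.get("type") != "computer_call":
                continue  # Drop reasoning that has no computer_call after it.
        kept.append(item)
    return kept
```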
Debanjum
95f211d03c Resolve mypy typing errors in operator code 2025-05-19 16:28:55 -07:00
Debanjum
33689feb91 Handle more openai response types for better rendering and error avoidance
The reasoning messages in openai cua need to be passed back or some
such. Else it throws a missing response with required id error.

Folks are confused about expected behavior for this online as well.
The documentation to handle this seems to be sparse, unclear.
2025-05-19 16:28:55 -07:00
Debanjum
3a75cd3c3d Only trigger claude, openai monolithic operators with specific models
To use Anthropic monolithic operator, set chat model to claude-3.7-sonnet
To use Openai monolithic operator, set chat model to gpt-4o
2025-05-19 16:28:55 -07:00
Debanjum
258b5a0372 Show operator screenshots with reasoning in train of thought on web app 2025-05-19 16:28:55 -07:00
Debanjum
21a9556b06 Show formatted action, env screenshot after action on each operator step
Show natural language, formatted text for each action. Previously we
were just showing json dumps of the actions taken.

Pass screenshot at each step for openai, anthropic and binary operator agents
Use text and image field in json passed to client for rendering both.

Show actions, env screenshot after actions applied in train of thought.
Showing the post action application screenshot seems more intuitive.

Previously we were showing the screenshot used to decide next action.
This pre action application screenshot was being shown after next
action decided (in train of thought). This misrepresented the actual
ordering of events.

Rendered response is now a structured payload (dict) passing image
and text to be rendered up from operator to clients for rendering of
train of thought.
2025-05-19 16:28:55 -07:00
Debanjum
a1d712e031 Add current cursor position to browser screenshots for ai, human view 2025-05-19 16:28:55 -07:00
Debanjum
1be3986537 Require explicit switch to enable operator locally for now
Operator is still early in development. To enable it:
- Set KHOJ_OPERATOR_ENABLE environment variable to true
- Run any one of the commands below:
  - `pip install khoj[local]'
  - `pip install khoj[dev]'
  - `pip install playwright'
2025-05-19 16:28:55 -07:00
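A small sketch of how an explicit opt-in switch like the one above could be read. Only the `KHOJ_OPERATOR_ENABLE` variable name comes from the commit message. The helper name and the exact set of accepted values are assumptions.

```python
import os
from typing import Mapping, Optional


def is_operator_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when KHOJ_OPERATOR_ENABLE is explicitly set to true."""
    env = os.environ if env is None else env
    # Assumption: only a case-insensitive "true" enables the operator.
    return env.get("KHOJ_OPERATOR_ENABLE", "false").strip().lower() == "true"
```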
Debanjum
b395a438d0 Fix handling multiple actions requested by grounding agent in an iteration 2025-05-19 16:28:55 -07:00
Debanjum
e5415bdaee Only reasoning agent should terminate run, not the grounding agent.
Grounding agent does not have the full context and capabilities to
make this call. Only let reasoning agent make termination decision.

Add a wait action instead when grounder requests termination.
2025-05-19 16:28:55 -07:00
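One way to read the change above: any termination requested by the grounder is swapped for a wait action before the actions are applied. The action shape (dicts with a `type` key) and the names `terminate` and `wait` are hypothetical here.

```python
from typing import Dict, List

TERMINATE = "terminate"  # hypothetical action name
WAIT = "wait"  # hypothetical action name


def replace_grounder_termination(actions: List[Dict]) -> List[Dict]:
    """Swap grounder-requested termination for a wait action."""
    return [
        {"type": WAIT} if action.get("type") == TERMINATE else action
        for action in actions
    ]
```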
Debanjum
ffe58d2ec1 Parse goto, back actions directly from instruction for uitars grounder
UI tars grounder doesn't like calling non-standard functions like
goto, back.

Directly parse visual reasoner instruction to bypass uitars grounder
model.

At least for goto and back functions grounding isn't necessary, so
this works well.
2025-05-19 16:28:55 -07:00
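A rough sketch of parsing goto and back actions straight from the reasoner's instruction, bypassing the grounder. The instruction syntax assumed here (`goto(<url>)` and `back()`) and the returned dict shape are illustrative guesses, not the format the project uses.

```python
import re
from typing import Dict, Optional

# Assumed instruction syntax: goto(<url>) with optional quotes, and back().
GOTO_PATTERN = re.compile(r"\bgoto\(\s*['\"]?(?P<url>[^'\")\s]+)['\"]?\s*\)", re.IGNORECASE)
BACK_PATTERN = re.compile(r"\bback\(\s*\)", re.IGNORECASE)


def parse_navigation_action(instruction: str) -> Optional[Dict[str, str]]:
    """Return a goto or back action if the instruction names one, else None."""
    goto = GOTO_PATTERN.search(instruction)
    if goto:
        return {"type": "goto", "url": goto.group("url")}
    if BACK_PATTERN.search(instruction):
        return {"type": "back"}
    return None  # Anything else still needs visual grounding.
```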
Debanjum
7395af3c3a Allow visual grounder of binary operator agent to see past actions
Previously the grounding agent would be reset on every call. So it
only saw the most recent instruction and screenshot to make its next
action suggestion.

This change allows the visual grounders to see past instructions and
actions to prevent looping and encourage more exploratory action
suggestions when it is stuck or sees errors.
2025-05-19 16:28:55 -07:00
Debanjum
d8bc6239f8 Bifurcate visual grounder into a ui-tars specific & generic grounder
Split visual grounder into two implementations:

- A ui-tars specific visual grounder agent. This uses the canonical
  implementation of ui-tars with specialized system prompt and action
  parsing.

- Fallback to generic visual grounder utilizing tool-use and served over
  any openai compatible api. This was previously being used for our
  ui-tars implementation as well.
2025-05-19 16:28:55 -07:00
Debanjum
c3bfb15fab Support KeyUp, KeyDown operator actions. Make coordinates into floats 2025-05-19 16:28:55 -07:00
Debanjum
b279060e2c Enable using Operator with Gemini models 2025-05-19 16:28:55 -07:00
Debanjum
0d8fb667ec Add action results for multiple actions similar to other operator agents
Adds the results of each action in a separate item in message content.
Previously we were adding this as a single larger text blob. This
change adds structure to simplify post processing (e.g. truncation).

The updated add_action_results should also require less work to
generalize if we pass tool call history to grounding model as
action results in valid openai format.
2025-05-19 16:28:55 -07:00
Debanjum
e17c06b798 Set operator query on init. Pass summarize prompt to summarize func
The initial user query isn't updated during an operator run. So set it
when initializing the operator agent. Instead of passing it on every
call to act.

Pass summarize prompt directly to the summarize function. Let it
construct the summarize message to query vision model with.
Previously it was being passed to the add_action_results func, a holdover
from the previous implementation that did not use a separate summarize func.

Also rename chat_model to vision_model for a more pertinent var name.

These changes make the code cleaner and implementation more readable.
2025-05-19 16:28:55 -07:00
Debanjum
38bcba2f4b Make back action in browser environment use goto to avoid timeouts
For some reason the page.go_back() action in playwright had a much
higher propensity to timeout. Use goto instead to reduce these page
traversal timeouts.

This requires tracking navigation history.
2025-05-19 16:28:55 -07:00
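The navigation history tracking mentioned above could look roughly like this. `NavigationHistory` is a hypothetical helper, not the project's class. It only records urls and reports which one to go back to.

```python
from typing import List, Optional


class NavigationHistory:
    """Track visited urls so a back action can be done with goto (hypothetical helper)."""

    def __init__(self) -> None:
        self._visited: List[str] = []

    def record(self, url: str) -> None:
        # Skip consecutive duplicates such as reloads of the same page.
        if not self._visited or self._visited[-1] != url:
            self._visited.append(url)

    def previous(self) -> Optional[str]:
        """Drop the current page and return the url to go back to, if any."""
        if len(self._visited) < 2:
            return None
        self._visited.pop()
        return self._visited[-1]
```

With playwright, the url returned by `previous()` would then be passed to `page.goto(url)` in place of calling `page.go_back()`.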
Debanjum
fd139d4708 Improve termination on task completion for binary operator agent
Only let the visual reasoner handle terminating the operator run.
Previously the grounder was also able to trigger termination.

Make catching the termination by the reasoner more robust
2025-05-19 16:28:55 -07:00
Debanjum
680c226137 Use any supported vision model as reasoner for binary operator agent 2025-05-19 16:28:55 -07:00
Debanjum
3839d83b90 Modularize operator into separate files for agent, action, environment etc
The previous browser_operator.py file had become pretty massive and
unwieldy. This change breaks it apart into separate files for
- the abstract environment and operator agent base
- the concrete agents: anthropic, openai and binary
- the concrete environment browser operator
- the operator actions used by agents and environment
2025-05-19 16:28:55 -07:00
Debanjum
833c8ed150 Add a flexible operator agent using separate reasoning, grounder models
- This operator works with model served over an openai compatible api
- It uses separate vision models to reason and ground actions.

This improves flexibility in the operator agents that can be created.
We no longer need our operator agent to rely on monolithic models that
can both reason over visual data and ground their actions.

We can create operator agent from 2 separate models:
1. To reason over screenshots to suggest natural language next action
2. To ground those suggestions into visually grounded actions

This allows us to create fully local operators or operators combining
the best visual reasoner with the best visual grounder models.
2025-05-19 16:28:55 -07:00
Debanjum
773d20a26f Improve instructions to the openai operator agent.
Inform it can only control a single playwright browser page.
Previously it was assuming it is operating a whole browser, so would
have trouble navigating to different pages.

Improve handling of error in action parsing
2025-05-19 16:28:55 -07:00
Debanjum
4db888cd62 Simplify operator loop. Make each OperatorAgent manage state internally.
Stop OperatorAgent-specific code from leaking out into the
operator. The Operator just calls the standard OperatorAgent functions.

Each OperatorAgent's specific logic is handled by the OperatorAgent
internally.

This improves the separation of responsibility between the operator,
OperatorAgent and the Environment.

- Make environment pass screenshot data in agent agnostic format
  - Have operator agent providers format image data to their AI model
    specific format
  - Add environment step type to distinguish image vs text content
- Clearly mark major steps in the operator iteration loop
- Handle anthropic models returning computer tool actions as normal
  tool calls by normalizing next action retrieval from response for it
- Remove unused ActionResults fields
- Remove unnecessary placeholders to content of action results like
  for screenshot data
2025-05-19 16:28:55 -07:00
Debanjum
a1c9c6b2e3 Add pages visited via browser operator to references returned to clients 2025-05-19 16:28:55 -07:00
Debanjum
e71575ad1a Render screenshot in train of thought on openai agent screenshot action 2025-05-19 16:28:55 -07:00
Debanjum
78e052bfcb Decouple environment from operator agent to improve modularity
Decouple applying action on Environment from next action decision by
OperatorAgent

- Create an abstract Environment class with a `step' method
  and a standardized set of supported actions for each concrete Environment
  - Wrap playwright page into a concrete Environment class

- Create abstract OperatorAgent class with an abstract `act' method
  - Wrap Openai computer Operator into concrete OperatorAgent class
  - Wrap Claude computer Operator into a concrete OperatorAgent class

Handle interaction between Agent's action
2025-05-19 16:28:55 -07:00
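The decoupling described above can be sketched with two abstract base classes and a loop that connects them. Only the names `Environment`, `OperatorAgent`, `step` and `act` come from the commit message. The `Action` type, the synchronous signatures, the toy concrete classes and `run_operator` are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Action:
    type: str  # e.g. "click" or "type" (hypothetical action shape)


class Environment(ABC):
    @abstractmethod
    def step(self, action: Action) -> str:
        """Apply the action and return the resulting observation."""


class OperatorAgent(ABC):
    @abstractmethod
    def act(self, observation: str) -> Optional[Action]:
        """Decide the next action, or return None when done."""


class EchoEnvironment(Environment):
    """Toy environment that only reports the action applied."""

    def step(self, action: Action) -> str:
        return f"applied {action.type}"


class ScriptedAgent(OperatorAgent):
    """Toy agent that replays a fixed list of actions."""

    def __init__(self, script: List[Action]) -> None:
        self._script = list(script)

    def act(self, observation: str) -> Optional[Action]:
        return self._script.pop(0) if self._script else None


def run_operator(agent: OperatorAgent, env: Environment, max_steps: int = 10) -> List[str]:
    """Alternate agent decisions and environment steps until the agent stops."""
    observations: List[str] = []
    observation = "start"
    for _ in range(max_steps):
        action = agent.act(observation)
        if action is None:
            break
        observation = env.step(action)
        observations.append(observation)
    return observations
```

The loop only touches the two abstract methods, so a different agent or environment can be swapped in without changing it.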
Debanjum
7c60e04efb Pull out common iteration loop into main browser operator method 2025-05-19 16:28:54 -07:00
Debanjum
08e93c64ab Render screenshot in train of thought on browser screenshot action
Update web app to render screenshot image when screenshot action taken
by browser operator
2025-05-19 16:28:54 -07:00
Debanjum
188b3c85ae Force open links in current page to stay in operator page context
Previously some link clicks would open in new tab. This is out of the
browser operator's context and so the new page cannot be interacted
with by the browser operator.

This change catches new page opens and opens them in the context page
instead.
2025-05-19 16:28:54 -07:00
Debanjum
20f87542e5 Add cancellation support to browser operator via asyncio.Event 2025-05-19 16:28:54 -07:00
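A minimal sketch of cooperative cancellation with `asyncio.Event`, as named in the commit above. The loop body and the function names `run_steps` and `demo` are placeholders, not the operator's actual code.

```python
import asyncio


async def run_steps(cancel_event: asyncio.Event, max_steps: int = 5) -> int:
    """Run up to max_steps iterations, stopping early once cancellation is requested."""
    completed = 0
    for _ in range(max_steps):
        if cancel_event.is_set():
            break  # Stop before starting the next step.
        await asyncio.sleep(0)  # Stand-in for one operator step.
        completed += 1
    return completed


async def demo() -> int:
    cancel_event = asyncio.Event()
    task = asyncio.create_task(run_steps(cancel_event, max_steps=1000))
    await asyncio.sleep(0)  # Let the loop start running.
    cancel_event.set()  # Request cancellation from outside the loop.
    return await task
```

Checking the event between steps lets the current step finish cleanly, unlike cancelling the task outright.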
Debanjum
9f75622346 Allow browser operator to use browser with existing context over CDP
Give the Khoj browser operator access to browser with existing
context (auth, cookies etc.) by starting it with CDP enabled.

Process:
1. Start Browser with CDP enabled:
  `Edge/Chromium/Chrome --remote-debugging-port=9222'
2. Set the KHOJ_CDP_URL env var to the CDP url of the browser to use.
3. Start Khoj and ask it to get browser based work done with operator
   + research mode
2025-05-19 16:28:54 -07:00
Debanjum
b9ea538b02 Support operating web browser with Anthropic models
- Add back() and goto(url) helper functions to operate browser
- Cache operator messages to Anthropic API for speed and cost savings
2025-05-19 16:28:54 -07:00
Debanjum
2e86141575 Enable Khoj to use a GUI web browser. Operate it with Openai models 2025-05-19 16:28:54 -07:00
Debanjum
ab5d0b5878 Upgrade server dependencies 2025-05-19 16:28:21 -07:00
Debanjum
22cd638add Fix handling unset openai_base_url to run eval with openai chat models
The github run_eval workflow sets OPENAI_BASE_URL to empty string.

The ai model api created during initialization for openai models gets
set to empty string rather than None or the actual openai base url

This tries to call the llm at an empty string base url instead of the
default openai api base url, which obviously fails.

Fix is to map empty base urls to the actual openai api base url.
2025-05-19 16:19:43 -07:00
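The fix described above amounts to a small normalization step, sketched below. `resolve_base_url` is a hypothetical helper name. The default shown is the standard OpenAI API base url.

```python
from typing import Optional

DEFAULT_OPENAI_BASE_URL = "https://api.openai.com/v1"


def resolve_base_url(base_url: Optional[str]) -> str:
    """Map unset or empty base urls to the default openai api base url."""
    if base_url is None or not base_url.strip():
        return DEFAULT_OPENAI_BASE_URL
    return base_url.strip()
```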
Debanjum
cf55582852 Retry on empty response or error in chat completion by llm over api
Previously all exceptions were being caught. So retry logic wasn't
getting triggered.

Exception catching had been added to close the llm thread when threads
instead of async were being used for final response generation.

This isn't required anymore since moving to async. And we can now
re-enable retry on failures.

Raise error if response is empty to retry llm completion.
2025-05-19 11:27:19 -07:00
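The retry behavior above can be approximated with a plain loop. The commit does not show which retry mechanism is used, so `complete_with_retry`, `EmptyResponseError` and the backoff parameters are all assumptions. The point is that an empty response is raised as an error, so it retries like any other failure.

```python
import time
from typing import Callable, Optional


class EmptyResponseError(Exception):
    """Raised when the llm returns no content (hypothetical error type)."""


def complete_with_retry(call_llm: Callable[[], str], max_attempts: int = 3, base_delay: float = 0.0) -> str:
    """Call the llm, retrying on exceptions and on empty responses."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            response = call_llm()
            if not response or not response.strip():
                raise EmptyResponseError("LLM returned an empty response")
            return response
        except Exception as error:
            last_error = error
            if attempt < max_attempts - 1:
                # Exponential backoff between attempts.
                time.sleep(base_delay * (2 ** attempt))
    assert last_error is not None
    raise last_error
```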
Debanjum
7827d317b4 Widen vision support for chat models served via openai compatible api
Send image as png to non-openai models served via an openai compatible
api, as more models support png than webp.

Continue storing images as webp on server for efficiency.

Convert to png at the openai api layer and only for non-openai models
served via an openai compatible api.

Enable using vision models like ui-tars (via llama.cpp server), grok.
2025-05-19 11:27:19 -07:00
Debanjum
4f3fdaf19d Increase khoj api response timeout on evals call. Handle no decision 2025-05-18 19:14:49 -07:00
Debanjum
31dcc44c20 Output tokens >> reasoning tokens to avoid early response termination. 2025-05-18 14:45:23 -07:00