Commit Graph

4698 Commits

Author SHA1 Message Date
Debanjum
6cb512d9cf Support natural interrupt and send query behavior from web app
- Just send your new query. If a query was running previously, it'd
be interrupted and the new query would start processing. This improves on
the previous 2-click interrupt and send UX.

- Utilizes partial research for interrupted query, so you can now
redirect khoj's research direction. This is useful if you need to
share more details, change khoj's research direction in any way or
complete research. Khoj's train of thought can be helpful for this.
2025-05-26 00:35:10 -07:00
Debanjum
2b7dd7401b Continue interrupt queries only after previous query written to DB 2025-05-26 00:35:10 -07:00
Debanjum
3cd6e1a9a6 Save and restore research from partial state 2025-05-26 00:35:09 -07:00
Debanjum
a83c36fa05 Validate operator, research, context.query fields of ChatMessage
- Track operator, research context in ChatMessage
- Track query field in (document) context field of ChatMessage

This allows validating chat message before inserting into DB
2025-05-26 00:03:59 -07:00
Debanjum
02ee4e90a2 Pass doc/web/code/operator context as list[dict] of message content 2025-05-26 00:03:59 -07:00
Debanjum
98b56316e4 Support constructing chat message as a list of dictionaries
Research mode recently started passing iteration as list of message
content dicts. This change extends to storing it as is in DB.
2025-05-26 00:03:59 -07:00
Debanjum
df9ab51fd0 Track research results as iteration list instead of iteration summaries 2025-05-26 00:03:59 -07:00
Debanjum
5d65fa8698 Use Django timezone funcs to make datetimes in DB timezone aware
These seem to be a new class of errors showing up. Explicitly using
django timezone functions to add awareness to datetime fields stored
in DB seems to mitigate the issue.

Related #1180
2025-05-25 23:43:06 -07:00
Debanjum
231aa1c0df Support claude 4 models. Engage reasoning, operator. Track costs etc.
- Engage reasoning when using claude 4 models
- Allow claude 4 models as monolithic operator agents
- Ease identifying which anthropic models can reason, operate GUIs
- Track costs, set default context window of claude 4 models
- Handle stop reason on calls to new claude 4 models
2025-05-25 23:43:06 -07:00
Debanjum
dca17591f3 Handle parsing json from string with plain text suffix 2025-05-23 19:44:02 -07:00
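
A minimal sketch of one way to parse JSON that is followed by plain text, using the standard library's `json.JSONDecoder.raw_decode`. This is an illustration only, not necessarily how the commit implements it; the function name `parse_json_prefix` and the sample string are hypothetical.

```python
import json


def parse_json_prefix(text: str):
    """Parse the JSON value at the start of text and return it with the leftover suffix."""
    decoder = json.JSONDecoder()
    # raw_decode does not skip leading whitespace itself, so strip it first.
    stripped = text.lstrip()
    # raw_decode returns the parsed value and the index where the JSON ended.
    value, end = decoder.raw_decode(stripped)
    return value, stripped[end:].strip()


# Hypothetical model output: a JSON object followed by a plain text remark.
value, suffix = parse_json_prefix('{"tool": "search", "query": "khoj"} Hope this helps!')
```

Unlike `json.loads`, which rejects trailing data, `raw_decode` stops at the end of the first complete JSON value, so the suffix can be dropped or logged separately.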
Debanjum
acebb90643 Mention keys expected in prompt to next research tool selector 2025-05-23 19:44:02 -07:00
Debanjum
e968cca273 Clean usage of conversation_id in chat API function
- Normalize conversation_id type to str instead of str or UUID
- Do not pass conversation_id to agenerate_chat_response as
  the associated conversation is also being passed. So can get its id
  directly.
2025-05-23 19:44:02 -07:00
Debanjum
a76032522e Add type hints to function args calling anthropic model api 2025-05-22 15:02:45 -07:00
Debanjum
97c5222b04 Set type hints and reorder args of all converse_[provider] methods
- Query is more important and should be passed before references
- Add type hints to user query and references for code readability
2025-05-22 15:02:45 -07:00
Debanjum
2ea16298aa Create Operator Framework. Enable Khoj to Operate Web Browser (#1174)
## Overview

1. Create base framework to compose different operators and environments
for Khoj to operate.
2. Enable Khoj to operate a web browser using anthropic, openai, gemini
or open-source models

**Note**: *This is an alpha level feature release. It is meant for local
testing by contributors and self-hosters.*

## Capabilities
- Have Khoj operate a web browser to complete tasks that require actions
and visual feedback.
- Experiment with any vision model as operator. Khoj supports monolithic
and binary operators
- Monolithic operators rely on a single model like claude, openai to
both reason and ground operator actions
- Binary operators allow bootstrapping a fully local operator. It can
use any vision model for visual reasoning when paired with a capable
visual grounding model.

## Limitations
- In general, it is slower, more expensive and less comprehensive than
standard Khoj for research

## Setup
1. Install Khoj with playwright by either
   - running `pip install khoj[local]`
   - installing playwright separately via `pip install playwright` and
     `playwright install chromium`
2. Set `KHOJ_OPERATOR_ENABLED` env var to true (i.e.
`KHOJ_OPERATOR_ENABLED=true`)
3. Start Khoj (e.g. `USE_EMBEDDED_DB="true" khoj --anonymous-mode -vv`)
4. Add the necessary chat model(s) with `vision enabled` via your [Khoj
Admin Panel](http://localhost:42110/server/admin)
   - To use Anthropic claude: `claude-3.7-sonnet*` chat model is required
     with vision enabled
   - To use Openai operator: `gpt-4o` chat model is required with vision
     enabled
   - For other operator configurations: a chat model named `ui-tars-1.5` is
     required with vision enabled.
     This can technically be any visual grounding model served via an openai
     compatible api. I've just tested with ui-tars-1.5-7b deployed to an HF
     inference endpoint for now. See [deployment
     instructions](https://github.com/bytedance/UI-TARS/blob/main/README_deploy.md)
5. Set your desired vision chat model via [user
settings](http://localhost:42110/settings) to use as operator.
6. Run your queries with either the `/operator` slash command or by just
asking Khoj in your query to use the operator tool. You can run
operator in research mode as well

### Advanced Usage
- Reuse Browser Session
  - Why: Have Khoj operate web services you've logged into. E.g. manage
    your gmail, github, social media etc.
  - Setup
    1. Start Chromium or Edge in Remote Debugging mode. For example, on Mac
       you can start Edge by running the following in your terminal:
       `/Applications/Microsoft\ Edge.app/Contents/MacOS/Microsoft\ Edge
       --remote-debugging-port=9222`
    2. Connect Khoj to that browser instance by setting the environment
       variable `KHOJ_CDP_URL` to its URL.
       By default you'd set `KHOJ_CDP_URL="http://localhost:9222"`

## Architecture
### Operator Agents
| Type | Design |
|----- |-----|
| Monolithic | <img src="https://github.com/user-attachments/assets/7a96440f-1732-482b-9bd9-0920cb0c60890" width=400> |
| Binary | <img src="https://github.com/user-attachments/assets/c5d101c0-3475-43c2-a301-daa943cde190" width=400> |
2025-05-20 01:30:36 -07:00
Debanjum
19b4c18b69 Configure max iterations per operator run via environment variable 2025-05-20 01:03:11 -07:00
Debanjum
06a1a22e3b Align generic grounding agent's interface with uitars grounding agent
The generic grounding agent has not been tested properly but at least
it should be aligned with the interface being used by the ui-tars
grounding agent which has been tested.
2025-05-20 00:31:56 -07:00
Debanjum
0ce74e0329 Show operator context when use operator in default and research mode 2025-05-20 00:31:56 -07:00
Debanjum
cc355f93fc Use operator context consistently as a dict[str, str] of query, result 2025-05-20 00:31:56 -07:00
Debanjum
07e33994f0 Reduce scroll amount to have previous page stay a bit on screen 2025-05-20 00:31:56 -07:00
Debanjum
e2c1b1fcd3 Add dev container config to ease setup for remote development 2025-05-19 23:34:31 -07:00
Debanjum
fdb681ca0e Only install desktop, obsidian app from dev_setup.sh with --full flag 2025-05-19 23:34:31 -07:00
Debanjum
33dd4c8c33 Handle gemini returning simple string in response candidates 2025-05-19 19:45:10 -07:00
Debanjum
626ced8b8b Fix adding code results to chatml messages context 2025-05-19 19:45:10 -07:00
Debanjum
ded753ff9a Improve parsing tool use coordinate returned by claude operator agent
It sometimes outputs coordinates as a string rather than a list. Make
the parser more robust to those kinds of errors.

Share error with operator agent to fix/iterate on instead of exiting
the operator loop.
2025-05-19 16:28:55 -07:00
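
A rough sketch of the idea in the commit above: accept a coordinate as either a list or its string form, and report a parse failure back as a result message rather than exiting the loop. The names `parse_coordinate` and `run_step` and the message formats are hypothetical; this is not the actual Khoj parser.

```python
import ast
from typing import Sequence, Union


def parse_coordinate(raw: Union[str, Sequence[float]]) -> tuple[float, float]:
    """Return an (x, y) pair from a list/tuple or from its string form."""
    if isinstance(raw, str):
        try:
            # literal_eval only evaluates Python literals such as "[100, 200]".
            raw = ast.literal_eval(raw.strip())
        except (ValueError, SyntaxError) as e:
            raise ValueError(f"Could not parse coordinate {raw!r}") from e
    if not isinstance(raw, (list, tuple)) or len(raw) != 2:
        raise ValueError(f"Expected two numbers for coordinate, got {raw!r}")
    x, y = raw
    return float(x), float(y)


def run_step(raw_coordinate) -> str:
    """Turn a parse failure into a message the agent can iterate on."""
    try:
        x, y = parse_coordinate(raw_coordinate)
    except ValueError as e:
        return f"Error: {e}"
    return f"Clicked at ({x}, {y})"
```

Returning the error text as the step result lets the model see what went wrong and retry with a corrected tool call.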
Debanjum
473dd006d5 Remove unnecessary images conversion to png in binary operator agent.
It's handled by the ai model interaction handlers in khoj server core.
2025-05-19 16:28:55 -07:00
Debanjum
9f3fbf9021 Encourage reasoner, grounder to work better together in binary operator
- Encourage grounder to adhere to the reasoner's action instruction
- Encourage reasoner to explore other actions when stuck in a loop
  Previously seemed to be forcing it too strongly to choose
  "single most important" next action. So may not have been exploring
  other actions to achieve objective on initial failure.
2025-05-19 16:28:55 -07:00
Debanjum
ac19f6d336 Improve operator exception handling
- Do not catch error messages just to re-throw them. It results in a
  confusing "exception happened during handling of an exception"
  stacktrace, which makes it harder to debug

- Log error when action_results.content isn't set or empty to
  debug this operator run error
2025-05-19 16:28:55 -07:00
Debanjum
59e0e092b0 Remove deprecated prompt for grounding model to choose goto, back func
Goto and back functions are chosen by the visual reasoning model for
increased reliability in selecting those tools. The ui-tars grounding
models seem too tuned to use a specific set of tools.
2025-05-19 16:28:55 -07:00
Debanjum
1442a4f6fb Handle reasoning messages returned by openai cua model
Documentation about this is currently limited and confusing. But it seems
like the reasoning item should be kept if a computer_call follows it, else dropped.

Add noop placeholder for reasoning item to prevent termination of
operator run on response with just reasoning.
2025-05-19 16:28:55 -07:00
Debanjum
95f211d03c Resolve mypy typing errors in operator code 2025-05-19 16:28:55 -07:00
Debanjum
33689feb91 Handle more openai response types for better rendering and error avoidance
The reasoning messages in openai cua need to be passed back or some
such. Else it throws a missing response with required id error.

Folks are confused about expected behavior for this online as well.
The documentation to handle this seems to be sparse, unclear.
2025-05-19 16:28:55 -07:00
Debanjum
3a75cd3c3d Only trigger claude, openai monolithic operators with specific models
To use Anthropic monolithic operator, set chat model to claude-3.7-sonnet
To use Openai monolithic operator, set chat model to gpt-4o
2025-05-19 16:28:55 -07:00
Debanjum
258b5a0372 Show operator screenshots with reasoning in train of thought on web app 2025-05-19 16:28:55 -07:00
Debanjum
21a9556b06 Show formatted action, env screenshot after action on each operator step
Show natural language, formatted text for each action. Previously we
were just showing json dumps of the actions taken.

Pass screenshot at each step for openai, anthropic and binary operator agents
Use text and image field in json passed to client for rendering both.

Show actions, env screenshot after actions applied in train of thought.
Showing the post action application screenshot seems more intuitive.

Previously we were showing the screenshot used to decide next action.
This pre action application screenshot was being shown after next
action decided (in train of thought). This was anyway misleading about
the actual ordering of events.

Rendered response is now a structured payload (dict) passing image
and text to be rendered up from operator to clients for rendering of
train of thought.
2025-05-19 16:28:55 -07:00
Debanjum
a1d712e031 Add current cursor position to browser screenshots for ai, human view 2025-05-19 16:28:55 -07:00
Debanjum
1be3986537 Require explicit switch to enable operator locally for now
Operator is still early in development. To enable it:
- Set KHOJ_OPERATOR_ENABLED environment variable to true
- Run any one of the commands below:
  - `pip install khoj[local]`
  - `pip install khoj[dev]`
  - `pip install playwright`
2025-05-19 16:28:55 -07:00
Debanjum
b395a438d0 Fix handling multiple actions requested by grounding agent in an iteration 2025-05-19 16:28:55 -07:00
Debanjum
e5415bdaee Only reasoning agent should terminate run, not the grounding agent.
Grounding agent does not have the full context and capabilities to
make this call. Only let reasoning agent make termination decision.

Add a wait action instead when grounder requests termination.
2025-05-19 16:28:55 -07:00
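
An illustrative sketch of the rule described in the commit above: a termination request from the grounder is swapped for a wait action, so only the reasoner can end the run. The `Action` dataclass, the action type strings and the one second duration are all assumptions, not Khoj's actual types.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical operator action; real action types will differ."""
    type: str
    duration: float = 0.0


def sanitize_grounder_action(action: Action) -> Action:
    """Replace a grounder's terminate request with a short wait."""
    if action.type == "terminate":
        # The grounder lacks full context, so defer the decision to the reasoner.
        return Action(type="wait", duration=1.0)
    return action
```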
Debanjum
ffe58d2ec1 Parse goto, back actions directly from instruction for uitars grounder
UI tars grounder doesn't like calling non-standard functions like
goto, back.

Directly parse visual reasoner instruction to bypass uitars grounder
model.

At least for goto and back functions grounding isn't necessary, so
this works well.
2025-05-19 16:28:55 -07:00
Debanjum
7395af3c3a Allow visual grounder of binary operator agent to see past actions
Previously the grounding agent would be reset on every call. So it
only saw the most recent instruction and screenshot to make its next
action suggestion.

This change allows the visual grounders to see past instructions and
actions. This helps prevent looping and encourages more exploratory
action suggestions when a grounder is stuck or sees errors.
2025-05-19 16:28:55 -07:00
Debanjum
d8bc6239f8 Bifurcate visual grounder into a ui-tars specific & generic grounder
Split visual grounder into two implementations:

- A ui-tars specific visual grounder agent. This uses the canonical
  implementation of ui-tars with specialized system prompt and action
  parsing.

- Fallback to generic visual grounder utilizing tool-use and served over
  any openai compatible api. This was previously being used for our
  ui-tars implementation as well.
2025-05-19 16:28:55 -07:00
Debanjum
c3bfb15fab Support KeyUp, KeyDown operator actions. Make coordinates into floats 2025-05-19 16:28:55 -07:00
Debanjum
b279060e2c Enable using Operator with Gemini models 2025-05-19 16:28:55 -07:00
Debanjum
0d8fb667ec Add action results for multiple actions similar to other operator agents
Adds the results of each action in a separate item in message content.
Previously we were adding this as a single larger text blob. This
change adds structure to simplify post processing (e.g. truncation).

The updated add_action_results should also require less work to
generalize if we pass tool call history to grounding model as
action results in valid openai format.
2025-05-19 16:28:55 -07:00
Debanjum
e17c06b798 Set operator query on init. Pass summarize prompt to summarize func
The initial user query isn't updated during an operator run. So set it
when initializing the operator agent. Instead of passing it on every
call to act.

Pass summarize prompt directly to the summarize function. Let it
construct the summarize message to query vision model with.
Previously it was being passed to the add_action_results func, as the
previous implementation did not use a separate summarize func.

Also rename chat_model to vision_model for a more pertinent var name.

These changes make the code cleaner and implementation more readable.
2025-05-19 16:28:55 -07:00
Debanjum
38bcba2f4b Make back action in browser environment use goto to avoid timeouts
For some reason the page.go_back() action in playwright had a much
higher propensity to timeout. Use goto instead to reduce these page
traversal timeouts.

This requires tracking navigation history.
2025-05-19 16:28:55 -07:00
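
A minimal sketch of tracking navigation history so that "back" becomes a goto to the previously visited URL. The `NavigationHistory` class and its `goto` callable are hypothetical stand-ins for the browser environment's page navigation; the actual implementation may differ.

```python
from typing import Callable, List, Optional


class NavigationHistory:
    """Track visited URLs so 'back' can be done as a goto to the previous URL."""

    def __init__(self, goto: Callable[[str], None]):
        # goto is a hypothetical stand-in for the environment's page navigation.
        self._goto = goto
        self._visited: List[str] = []

    def visit(self, url: str) -> None:
        """Navigate to url and record it."""
        self._goto(url)
        self._visited.append(url)

    def back(self) -> Optional[str]:
        """Go to the previous URL, or return None if there is none."""
        if len(self._visited) < 2:
            return None
        self._visited.pop()  # drop the current page
        previous = self._visited[-1]
        self._goto(previous)
        return previous
```

For example, after `visit("a")` and `visit("b")`, calling `back()` issues a goto to `"a"` and leaves `"a"` as the current page in the history.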
Debanjum
fd139d4708 Improve termination on task completion for binary operator agent
Only let the visual reasoner handle terminating the operator run.
Previously the grounder was also able to trigger termination.

Make catching the termination by the reasoner more robust
2025-05-19 16:28:55 -07:00
Debanjum
680c226137 Use any supported vision model as reasoner for binary operator agent 2025-05-19 16:28:55 -07:00
Debanjum
3839d83b90 Modularize operator into separate files for agent, action, environment etc
The previous browser_operator.py file had become pretty massive and
unwieldy. This change breaks it apart into separate files for
- the abstract environment and operator agent base
- the concrete agents: anthropic, openai and binary
- the concrete environment browser operator
- the operator actions used by agents and environment
2025-05-19 16:28:55 -07:00