Commit Graph

4925 Commits

Author SHA1 Message Date
Debanjum
ded1db642c Get max context for user, operator model pair for context compression 2025-05-31 20:51:08 -07:00
Debanjum
7eaf0e80c5 Get max prompt size for given user, model via reusable functions 2025-05-31 20:51:08 -07:00
Debanjum
3797f03625 Log ai model usage on every call to get_chat_usage_metrics in debug mode 2025-05-31 20:51:08 -07:00
Debanjum
4cb900658d Cache system prompt, tools of anthropic operator agent for efficiency 2025-05-31 20:51:08 -07:00
Debanjum
928e5ee8ad Cache messages to anthropic models from chat actors for efficiency 2025-05-31 20:51:08 -07:00
Debanjum
0d1e6b0d53 Do not overwrite system_prompt for idempotent AI API calls retry
Previously on tenacity retry the system_prompt could get overwritten
2025-05-31 20:51:08 -07:00
Debanjum
e0ea151f20 Implement file editor and terminal tools, in-built in claude
This should improve viewing, editing files and viewing terminal
command outputs by anthropic operator
2025-05-31 20:51:08 -07:00
Debanjum
21bf7f1d6d Continue interrupted operator run with new query and previous context
Track research and operator results at each nested iteration step
using python object references + async events bubbled up from nested
iterators.

Instantiates operator with interrupted operator messages from research
or normal mode.

Reflects actual interaction trajectory as closely as possible to agent
including conversation history, partial operator trajectory and new
query for fine grained, corrigible steerability.

Research mode continues with operator tool directly if previous
iteration was an interrupted operator run.
2025-05-31 20:51:08 -07:00
Debanjum
de35d91e1d Pass previous trajectory to operator agents for context 2025-05-31 20:51:08 -07:00
Debanjum
864e0ac8b5 Simplify research iteration and main research function names 2025-05-31 20:51:08 -07:00
Debanjum
6c9d569a22 Fix to get user questions in chat history from user not khoj message
Since partial state reload after interrupt drops Khoj messages. The
assumption that there will always be a Khoj message after a user
message is broken. That is, there can now be multiple user messages
preceding a Khoj user message now.

This change allow for user queries to still be extracted for chat
history even if no khoj message follow.
2025-05-31 20:51:08 -07:00
Debanjum
b6aa77a6f5 Lookback 3 previous turns to select next tool, for questions history 2025-05-31 20:50:03 -07:00
Debanjum
d511cbfa34 Extract constructing question history into shared function for reuse
Minor logic update to only include non image inferred queries for
gemini, anthropic models as well instead of just for openai models.

Apart from that the extracted function should be functionally same.
2025-05-31 16:50:26 -07:00
Debanjum
da663e184c Type operator results. Enable storing, loading operator trajectories.
We were passing operator results as a simple dictionary. Strongly
typing it makes sense as operator results becomes more complex.

Storing operator results with trajectory on interrupts will allow
restarting interrupted operator run with agent messages of interrupted
trajectory loaded into operator agents
2025-05-31 16:50:26 -07:00
Debanjum
675fc0ad05 Decouple trajectory compression from `act'. Reuse func to call llm api 2025-05-31 16:50:26 -07:00
Debanjum
b027024c42 Handle failed operator agent calls to anthropic api more gracefully
Add anthropic operator api call errors to trajectory instead of
erroring out of current operator run
2025-05-31 16:50:26 -07:00
Debanjum
d54bfc19e5 Add trajectory compression to anthropic operator agent
- Add compression parameters to base operator agent for reuse
- Increase default operator iterations
2025-05-31 16:50:26 -07:00
Debanjum
cb451fa67c Put default summarize prompt into operator agent
This allows:
- Each operator agent to own its summarization prompt. That it can
  tune if it wants
- The outer operator loop to pass an override summarize prompt when it
  invokes the summarize func but it does not have to
2025-05-31 16:50:26 -07:00
Debanjum
99fdd91a01 Latch to bottom instantly and well when auto scroll chat stream on web 2025-05-31 16:50:26 -07:00
Debanjum
253656b634 Fix engaging anthropic api cache for operator trajectories.
It had become broken at some point due to refactoring. The cache
control was getting added and removed right after in add_action_results

What we actually wanted to do is clear the old cache breakpoint and
put a new one at the latest operator tool result message.

This should improve operator speed and lower costs with anthropic
models.
2025-05-31 16:50:26 -07:00
Debanjum
faecbdb7d8 Enable operators to use computers 2025-05-31 16:50:25 -07:00
Debanjum
771909f76a Implement docker computer environment for operator
- Generalize building pyautogui into executable python code snippet.
  This should work across docker and local. And should be easier to
  extend to operate a remote computer over the network as well.

- Create dockerfile for pyautogui operate-able containerized computer
2025-05-28 17:40:32 -07:00
Debanjum
e117f57f64 Implement local computer environment for operator 2025-05-28 17:40:32 -07:00
Debanjum
7eab87bfdf Generalize operator to operate multiple types of environment
Previously it could only operate a (playwright) browser. Now
- The operator logic and naming has been updated assuming
  multiple environment types can be operated
- The operator entrypoint is now at __init__.py to simplify imports
  and the entrypoint function is called operate_environment
- All operator agents have been updated to select their system prompts
  and tools based on the environment they'll operate
2025-05-27 19:01:36 -07:00
Debanjum
c0689b2740 Easily interrupt and redirect khoj's research direction via chat
- Khoj can now save and restore research from partial state
  This triggers an interrupt that saves the partial research, then
  when a new query is sent it loads the previous partial research as
  context and continues utilizing with the new user query to orient
  its future research
- Support natural interrupt and send query behavior from web app
  This triggers an abort and send when a user sends a chat message
  while khoj is in the middle of some previous research.

This interrupt mechanism enables a more natural, interactive
research flow
2025-05-27 17:57:21 -07:00
Debanjum
c9e6b8e88d Align expected types to actual returned types by AI APIs, operator 2025-05-26 00:39:06 -07:00
Debanjum
c1c1fc6265 Make send message validation more robust on web app 2025-05-26 00:35:10 -07:00
Debanjum
6cb512d9cf Support natural interrupt and send query behavior from web app
- Just send your new query. If a query was running previously it'd
be interrupted and new query would start processing. This improves on
the previous 2 click interrupt and send ux.

- Utilizes partial research for interrupted query, so you can now
redirect khoj's research direction. This is useful if you need to
share more details, change khoj's research direction in anyway or
complete research. Khoj's train of thought can be helpful for this.
2025-05-26 00:35:10 -07:00
Debanjum
2b7dd7401b Continue interrupt queries only after previous query written to DB 2025-05-26 00:35:10 -07:00
Debanjum
3cd6e1a9a6 Save and restore research from partial state 2025-05-26 00:35:09 -07:00
Debanjum
a83c36fa05 Validate operator, research, context.query fields of ChatMessage
- Track operator, research context in ChatMessage
- Track query field in (document) context field of ChatMessage

This allows validating chat message before inserting into DB
2025-05-26 00:03:59 -07:00
Debanjum
02ee4e90a2 Pass doc/web/code/operator context as list[dict] of message content 2025-05-26 00:03:59 -07:00
Debanjum
98b56316e4 Support constructing chat message as a list of dictionaries
Research mode recently started passing iteration as list of message
content dicts. This change extends to storing it as is in DB.
2025-05-26 00:03:59 -07:00
Debanjum
df9ab51fd0 Track research results as iteration list instead of iteration summaries 2025-05-26 00:03:59 -07:00
Debanjum
5d65fa8698 Use Django timezone funcs to make datetimes in DB timezone aware
These seem to be a new class of errors showing up. Explicitly using
django timezone functions to add awareness to date time files stored
in DB seems to mitigate the issue.

Related #1180
2025-05-25 23:43:06 -07:00
Debanjum
231aa1c0df Support claude 4 models. Engage reasoning, operator. Track costs etc.
- Engage reasoning when using claude 4 models
- Allow claude 4 models as monolithic operator agents
- Ease identifying which anthropic models can reason, operate GUIs
- Track costs, set default context window of claude 4 models
- Handle stop reason on calls to new claude 4 models
2025-05-25 23:43:06 -07:00
Debanjum
dca17591f3 Handle parsing json from string with plain text suffix 2025-05-23 19:44:02 -07:00
Debanjum
acebb90643 Mention keys expected in prompt to next research tool selector 2025-05-23 19:44:02 -07:00
Debanjum
e968cca273 Clean usage of conversation_id in chat API function
- Normalize conversation_id type to str instead of str or UUID
- Do not pass conversation_id to agenerate_chat_response as
  the associated conversation is also being passed. So can get its id
  directly.
2025-05-23 19:44:02 -07:00
Debanjum
a76032522e Add type hints to function args calling anthropic model api 2025-05-22 15:02:45 -07:00
Debanjum
97c5222b04 Set type hints and reorder args of all converse_[provider] methods
- Query is more important and should be passed before references
- Add type hints to user query and references for code readability
2025-05-22 15:02:45 -07:00
Debanjum
2ea16298aa Create Operator Framework. Enable Khoj to Operate Web Browser (#1174)
## Overview

1. Create base framework to compose different operators and environments
for Khoj to operate.
2. Enable Khoj to operate a web browser using anthropic, openai, gemini
or open-source models

**Note**: *This is an alpha level feature release. It is meant for local
testing by contributors and self-hosters.*

## Capabilities
- Have Khoj operate a web browser to complete tasks that require actions
and visual feedback.
- Experiment with any vision model as operator. Khoj supports monolithic
and binary operator
- Monolithic operators rely on a single models like claude, openai to
both reason and ground operator actions
- Binary operators allow bootstrapping a fully local operator. It can
use any vision model for visual reasoning when paired with a capable
visual grounding model.

## Limitations
- In general, it is slower, more expensive and less comprehensive than
standard Khoj for research

## Setup
1. Install Khoj with playwright by either 
   - running `pip install khoj[local]`
- installing playwright separately via `pip install playwright` and
`playwright install chromium`
2. Set `KHOJ_OPERATOR_ENABLED` env var to true (i.e
`KHOJ_OPERATOR_ENABLED=true`)
3. Start Khoj (e.g `USE_EMBEDDED_DB="true" khoj --anonymous-mode -vv`)
4. Add the necessary chat model(s) with `vision enabled` via your [Khoj
Admin Panel](http://localhost:42110/server/admin)
- To use Anthropic claude: `claude-3.7-sonnet*` chat model is required
with vision enabled
- To use Openai operator: `gpt-4o` chat model is required with vision
enabled
- For other operator configurations: a chat model named `ui-tars-1.5` is
required with vision enabled
This can technically be any visual grounding model served via an openai
compatible api. I've just tested with ui-tars-1.5-7b deployed to an HF
inference endpoint for now. See [deployment
instructions](https://github.com/bytedance/UI-TARS/blob/main/README_deploy.md)
5. Set your desired vision chat model via [user
settings](http://localhost:42110/settings) to use as operator.
6. Run your queries with either the `/operator` slash command or by just
asking Khoj in your query to use the operator tool. You can combine run
operator in research mode a well

### Advanced Usage
- Reuse Browser Session
- Why: Have Khoj operate web services you've logged into. E.g manage
your gmail, github, social media etc.
  - Setup
1. Start Chromium or Edge in Remote Debugging mode. For example, on Mac
you can start Edge by running the following in your terminal:
`/Applications/Microsoft\ Edge.app/Contents/MacOS/Microsoft\ Edge
--remote-debugging-port=9222`
4. Connect Khoj to that browser instance by setting the environment
variable `KHOJ_CDP_URL` to its URL.
      By default you'd set `KHOJ_CDP_URL="http://localhost:9222"`

## Architecture
### Operator Agents
| Type | Design |
|----- |-----|
| Monolithic | <img
src="https://github.com/user-attachments/assets/7a96440f-1732-482b-9bd9-0920cb0c60890"
width=400> |
| Binary | <img
src="https://github.com/user-attachments/assets/c5d101c0-3475-43c2-a301-daa943cde190"
width=400> |
2025-05-20 01:30:36 -07:00
Debanjum
19b4c18b69 Configure max iterations per operator run via environment variable 2025-05-20 01:03:11 -07:00
Debanjum
06a1a22e3b Align generic grounding agent's interface with uitars grounding agent
The generic grounding agent has not been tested properly but at least
it should be aligned with the interface being used by the ui-tars
grounding agent which has been tested.
2025-05-20 00:31:56 -07:00
Debanjum
0ce74e0329 Show operator context when use operator in default and research mode 2025-05-20 00:31:56 -07:00
Debanjum
cc355f93fc Use operator context consistently as a dict[str, str] of query, result 2025-05-20 00:31:56 -07:00
Debanjum
07e33994f0 Reduce scroll amount to have previous page stay a bit on screen 2025-05-20 00:31:56 -07:00
Debanjum
e2c1b1fcd3 Add dev container config to ease setup for remote development 2025-05-19 23:34:31 -07:00
Debanjum
fdb681ca0e Only install desktop, obsidian app from dev_setup.sh with --full flag 2025-05-19 23:34:31 -07:00
Debanjum
33dd4c8c33 Handle gemini returning simple string in response candidates 2025-05-19 19:45:10 -07:00