## Summary
- Enable Khoj to operate computers: Add experimental computer operator
functionality that allows Khoj to interact with desktop environments,
browsers, and terminals to accomplish complex tasks
- Multi-environment support: Implement computer environments with GUI,
file system, and terminal access. Can control host computer or Docker
container computer
## Key Features
### Computer Operation Capabilities
- Desktop control (screenshots, clicking, typing, keyboard shortcuts)
- File editing and management
- Terminal/bash command execution
- Web browser automation
- Visual feedback via train-of-thought video playback
### Infrastructure & Architecture:
- Docker container (ghcr.io/khoj-ai/computer:latest) with Ubuntu 24.04,
XFCE desktop, VNC access
- Local computer environment support with pyautogui
- Modular operator agent system supporting multiple environment types
- Trajectory compression and context management for long-running tasks
### Model Integration:
- Anthropic models only (Claude Sonnet 4, Claude 3.7 Sonnet, Claude Opus
4)
- OpenAI and binary operator agents temporarily disabled
- Enhanced caching and context management for operator conversations
### User Experience:
- `/operator` command or just ask Khoj to use operator tool to invoke
computer operation
- Integrate with research mode for extended 30+ minute task execution
- Video of computer operation in train of thought for transparency
### Configuration
- Set `KHOJ_OPERATOR_ENABLED=True` in `docker-compose.yml`
- Requires Anthropic API key
- Computer container runs on port 5900 (VNC)
- You can seek through the train of thought video of computer operation or
follow it in live mode.
- Interleaves video with normal text thoughts.
- Video available of old interactions and currently streaming message.
- Add type guards for action.path in drag vs text editor actions
- Added type guards for Union type attribute access
- Fixed variable naming conflicts between drag and text editor cases
- Resolved remaining typing issues in OpenAI, Anthropic agents
- Type guard without requiring another code indent level
- Create reusable method to call model
- Fix to summarize messages on operator run.
- Mark assistant tool calls with role = assistant, not environment
- Try fix message format when load after interrupts.
Does not work well yet
Previously CTRL+A would get triggered instead of ctrl+a. CTRL+A is
equivalent to ctrl+shift+a. This isn't intended and should be
called directly when required.
Now key combos like ctrl+a on computer firefox etc. work as expected
Track research and operator results at each nested iteration step
using python object references + async events bubbled up from nested
iterators.
Instantiates operator with interrupted operator messages from research
or normal mode.
Reflects actual interaction trajectory as closely as possible to agent
including conversation history, partial operator trajectory and new
query for fine grained, corrigible steerability.
Research mode continues with operator tool directly if previous
iteration was an interrupted operator run.
Since partial state reload after interrupt drops Khoj messages. The
assumption that there will always be a Khoj message after a user
message is broken. That is, there can now be multiple user messages
preceding a Khoj user message now.
This change allow for user queries to still be extracted for chat
history even if no khoj message follow.
Minor logic update to only include non image inferred queries for
gemini, anthropic models as well instead of just for openai models.
Apart from that the extracted function should be functionally same.
We were passing operator results as a simple dictionary. Strongly
typing it makes sense as operator results becomes more complex.
Storing operator results with trajectory on interrupts will allow
restarting interrupted operator run with agent messages of interrupted
trajectory loaded into operator agents
This allows:
- Each operator agent to own its summarization prompt. That it can
tune if it wants
- The outer operator loop to pass an override summarize prompt when it
invokes the summarize func but it does not have to
It had become broken at some point due to refactoring. The cache
control was getting added and removed right after in add_action_results
What we actually wanted to do is clear the old cache breakpoint and
put a new one at the latest operator tool result message.
This should improve operator speed and lower costs with anthropic
models.
- Generalize building pyautogui into executable python code snippet.
This should work across docker and local. And should be easier to
extend to operate a remote computer over the network as well.
- Create dockerfile for pyautogui operate-able containerized computer
Previously it could only operate a (playwright) browser. Now
- The operator logic and naming has been updated assuming
multiple environment types can be operated
- The operator entrypoint is now at __init__.py to simplify imports
and the entrypoint function is called operate_environment
- All operator agents have been updated to select their system prompts
and tools based on the environment they'll operate
- Khoj can now save and restore research from partial state
This triggers an interrupt that saves the partial research, then
when a new query is sent it loads the previous partial research as
context and continues utilizing with the new user query to orient
its future research
- Support natural interrupt and send query behavior from web app
This triggers an abort and send when a user sends a chat message
while khoj is in the middle of some previous research.
This interrupt mechanism enables a more natural, interactive
research flow
- Just send your new query. If a query was running previously it'd
be interrupted and new query would start processing. This improves on
the previous 2 click interrupt and send ux.
- Utilizes partial research for interrupted query, so you can now
redirect khoj's research direction. This is useful if you need to
share more details, change khoj's research direction in anyway or
complete research. Khoj's train of thought can be helpful for this.
- Track operator, research context in ChatMessage
- Track query field in (document) context field of ChatMessage
This allows validating chat message before inserting into DB
These seem to be a new class of errors showing up. Explicitly using
django timezone functions to add awareness to date time files stored
in DB seems to mitigate the issue.
Related #1180
- Engage reasoning when using claude 4 models
- Allow claude 4 models as monolithic operator agents
- Ease identifying which anthropic models can reason, operate GUIs
- Track costs, set default context window of claude 4 models
- Handle stop reason on calls to new claude 4 models
- Normalize conversation_id type to str instead of str or UUID
- Do not pass conversation_id to agenerate_chat_response as
the associated conversation is also being passed. So can get its id
directly.