- Convert the response schema into an Anthropic tool call definition.
- Works with simple enums without relying on $defs and $refs, which
the Anthropic API does not support.
- Do not force use of a specific tool, as that is not supported with
deep thought.
This puts Anthropic models on par with OpenAI and Gemini models for
response schema following. It reduces the need for complex JSON response
parsing on the Khoj end.
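A minimal sketch of the conversion described above, not the actual Khoj code. The helper names `inline_refs` and `to_anthropic_tool`, the tool name `respond`, and the example schema are hypothetical. It assumes only local, non-recursive `#/$defs/...` references, and relies on the `name`, `description`, `input_schema` shape of Anthropic tool definitions.

```python
import copy
from typing import Any


def inline_refs(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a JSON schema with local $ref pointers replaced by their $defs targets."""
    defs = schema.get("$defs", {})

    def resolve(node: Any) -> Any:
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/$defs/"):
                # Replace the pointer with a resolved copy of the definition.
                return resolve(copy.deepcopy(defs[ref.split("/")[-1]]))
            # Drop the $defs table itself; everything it held is now inlined.
            return {key: resolve(value) for key, value in node.items() if key != "$defs"}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    return resolve(schema)


def to_anthropic_tool(name: str, description: str, schema: dict[str, Any]) -> dict[str, Any]:
    """Wrap a response schema as a tool definition (hypothetical helper)."""
    return {"name": name, "description": description, "input_schema": inline_refs(schema)}


# Hypothetical response schema with a simple enum behind a $ref.
response_schema = {
    "type": "object",
    "properties": {"mode": {"$ref": "#/$defs/Mode"}},
    "required": ["mode"],
    "$defs": {"Mode": {"type": "string", "enum": ["default", "research"]}},
}
tool = to_anthropic_tool("respond", "Return the structured response.", response_schema)

# Leave the tool choice automatic instead of forcing a specific tool.
tool_choice = {"type": "auto"}
```

Inlining keeps the schema self-contained, at the cost of duplicating a definition wherever it is referenced.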
There seems to be a more standard mechanism for specifying launch.json
params for devcontainers. The previous mechanism of writing launch.json
to .vscode/launch.json in the post-creation step does not work.
Improve the default launch.json to include the Khoj admin username and
password with placeholder values to get started with local development faster.
Define a Dockerfile for the devcontainer to pre-build server and web app
dependencies during the dev container image creation stage. This speeds
up dev container startup, as dependencies no longer need to be installed then.
## Description
This PR introduces significant improvements to the Obsidian Khoj
plugin's chat interface and editing capabilities, enhancing the overall
user experience and content management functionality.
## Features
### 🔍 Enhanced Communication Mode
I've implemented radio buttons below the chat window for easier
communication mode selection. The modes are now displayed as emojis in
the conversation for a cleaner interface, replacing the previous
text-based system (e.g., /default, /research). I've also documented the
search mode functionality in the help command.
#### Screenshots
- Radio buttons for mode selection
- Emoji display in conversations

### 💬 Revamped Message Interaction
I've redesigned the message buttons with improved spacing and color
coding for better visual differentiation. The new edit button allows
quick message modifications - clicking it removes the conversation up to
that point and copies the message to the input field for easy editing or
retrying questions.
#### Screenshots
- New message styling and color scheme

- Edit button functionality

### 🤖 Advanced Agent Selection System
I've added a new chat creation button with agent selection capability.
Users can now choose from their available agents when starting a new
chat. While agents can't be switched mid-conversation to maintain
context, users can easily start fresh conversations with different
agents.
#### Screenshots
- Agent selection dropdown

### 👁️ Real-Time Context Awareness
I've added a button that gives Khoj access to read open Obsidian tabs.
This allows Khoj to read open notes and track changes in real-time,
maintaining a history of previous versions to provide more contextual
assistance.
#### Screenshots
- Window access toggle

### ✏️ Smart Document Editing
Inspired by Cursor IDE's intelligent editing and ChatGPT's Canvas
functionality, I've implemented a first version of a content creation
system we've been discussing. Using a JSON-based modification system,
Khoj can now make precise changes to specific parts of files, with
changes previewed in yellow highlighting before application.
Modification code blocks are neatly organized in collapsible sections
with clear action summaries. While this is just a first step, it's
working remarkably well and I have several ideas for expanding this
functionality to make Khoj an even more powerful content creation
assistant.
#### Screenshots
- JSON modification preview
- Change highlighting system
- Collapsible code blocks
- Accept/cancel controls
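
A rough sketch of how a JSON-described edit could be applied to a note, for readers unfamiliar with the idea. The `find`/`replace` fields and the `apply_edit` helper are hypothetical and do not reflect the plugin's actual format; the plugin itself is not written in Python.

```python
import json


def apply_edit(text: str, edit_json: str) -> str:
    """Apply one edit block of the hypothetical form {"find": ..., "replace": ...}."""
    edit = json.loads(edit_json)
    find, replace = edit["find"], edit["replace"]
    # Require a unique match so the change targets one precise part of the file.
    if text.count(find) != 1:
        raise ValueError("edit target must match exactly once")
    return text.replace(find, replace)


note = "# Plan\nShip on Monday.\n"
edited = apply_edit(note, '{"find": "Monday", "replace": "Friday"}')
```

Rejecting ambiguous matches is one way to keep a previewed change identical to the applied one.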

---------
Co-authored-by: Debanjum <debanjum@gmail.com>
## Summary
- Enable Khoj to operate computers: Add experimental computer operator
functionality that allows Khoj to interact with desktop environments,
browsers, and terminals to accomplish complex tasks
- Multi-environment support: Implement computer environments with GUI,
file system, and terminal access. Khoj can control either the host
computer or a computer running in a Docker container
## Key Features
### Computer Operation Capabilities
- Desktop control (screenshots, clicking, typing, keyboard shortcuts)
- File editing and management
- Terminal/bash command execution
- Web browser automation
- Visual feedback via train-of-thought video playback
### Infrastructure & Architecture
- Docker container (ghcr.io/khoj-ai/computer:latest) with Ubuntu 24.04,
XFCE desktop, VNC access
- Local computer environment support with pyautogui
- Modular operator agent system supporting multiple environment types
- Trajectory compression and context management for long-running tasks
### Model Integration
- Anthropic models only (Claude Sonnet 4, Claude 3.7 Sonnet, Claude Opus
4)
- OpenAI and binary operator agents temporarily disabled
- Enhanced caching and context management for operator conversations
### User Experience
- Use the `/operator` command, or just ask Khoj to use the operator tool,
to invoke computer operation
- Integrates with research mode for extended task execution of 30+ minutes
- Video of computer operation in train of thought for transparency
### Configuration
- Set `KHOJ_OPERATOR_ENABLED=True` in `docker-compose.yml`
- Requires Anthropic API key
- Computer container runs on port 5900 (VNC)
- You can seek through the train of thought video of the computer operation
or follow it in live mode.
- Video is interleaved with normal text thoughts.
- Video is available for old interactions and for the currently streaming message.
- Add type guards for action.path in drag vs text editor actions
- Add type guards for Union type attribute access
- Fix variable naming conflicts between drag and text editor cases
- Resolve remaining typing issues in OpenAI, Anthropic agents
- Type guard without requiring another code indent level
- Create reusable method to call model
- Fix summarization of messages on operator run.
- Mark assistant tool calls with role = assistant, not environment
- Try to fix the message format when loading after interrupts.
Does not work well yet
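The type guard notes above (guarding `action.path` on a Union type without adding an indent level) can be illustrated with a small sketch. The `DragAction` and `TextEditorAction` classes here are simplified stand-ins, not the real operator action types.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class DragAction:
    path: list[tuple[int, int]]  # screen points to drag through


@dataclass
class TextEditorAction:
    path: str  # file path to edit


Action = Union[DragAction, TextEditorAction]


def describe(action: Action) -> str:
    # Guard clause: handle one variant and return early, so the code
    # below it needs no extra indentation.
    if isinstance(action, DragAction):
        start, end = action.path[0], action.path[-1]
        return f"drag from {start} to {end}"
    # Past the guard, type checkers narrow `action` to TextEditorAction,
    # so `action.path` is known to be a string here.
    return f"edit file {action.path}"
```

Both variants expose `path` with different types, which is why the attribute access needs narrowing before use.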
Previously CTRL+A would get triggered instead of ctrl+a. CTRL+A is
equivalent to ctrl+shift+a. This isn't intended; ctrl+shift+a should be
called directly when required.
Now key combos like ctrl+a in Firefox on the computer etc. work as expected.
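One way to get this behaviour is to lowercase every key name before sending the combo, sketched below. `normalize_key_combo` is a hypothetical helper, not the function used in the fix.

```python
def normalize_key_combo(combo: str) -> list[str]:
    """Split a combo such as 'CTRL+A' into lowercase key names.

    Lowercasing avoids an uppercase letter being treated as an implicit
    shift press; shift has to be named explicitly, e.g. 'ctrl+shift+a'.
    """
    return [key.strip().lower() for key in combo.split("+") if key.strip()]


keys = normalize_key_combo("CTRL+A")
```

The resulting key names could then be passed to a hotkey call such as `pyautogui.hotkey(*keys)` in the local computer environment, assuming that is where the combo is dispatched.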
Track research and operator results at each nested iteration step
using python object references + async events bubbled up from nested
iterators.
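A simplified sketch of this pattern, assuming hypothetical names (`Tracker`, `operator_run`, `research_run`): nested async generators bubble events up to the caller while recording progress on a shared object, so partial results survive an interrupt.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator


@dataclass
class Tracker:
    """Shared object; every nested iterator holds a reference to the same instance."""
    steps: list[str] = field(default_factory=list)


async def operator_run(tracker: Tracker) -> AsyncIterator[str]:
    for step in ("open browser", "click search"):
        tracker.steps.append(f"operator: {step}")  # record progress before yielding
        yield step  # event bubbles up to the enclosing iterator


async def research_run(tracker: Tracker) -> AsyncIterator[str]:
    tracker.steps.append("research: plan")
    yield "plan"
    async for event in operator_run(tracker):
        yield f"operator/{event}"  # re-emit nested events to the caller


async def main() -> tuple[list[str], list[str]]:
    tracker = Tracker()
    events = []
    async for event in research_run(tracker):
        events.append(event)
        if event == "operator/open browser":
            break  # simulate a user interrupt mid-run
    # The tracker still holds everything recorded before the interrupt.
    return events, tracker.steps


events, steps = asyncio.run(main())
```

Because the tracker is mutated in place, the caller does not depend on the generators finishing to see what was done.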
Instantiates the operator with interrupted operator messages from research
or normal mode.
Reflects the actual interaction trajectory to the agent as closely as
possible, including conversation history, partial operator trajectory and
new query, for fine-grained, corrigible steerability.
Research mode continues with the operator tool directly if the previous
iteration was an interrupted operator run.
Partial state reload after an interrupt drops Khoj messages, so the
assumption that there will always be a Khoj message after a user
message is broken. That is, there can now be multiple user messages
preceding a Khoj message.
This change allows user queries to still be extracted for chat
history even if no Khoj message follows.
Minor logic update to only include non-image inferred queries for
Gemini and Anthropic models as well, instead of just for OpenAI models.
Apart from that, the extracted function should be functionally the same.
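A minimal sketch of query extraction that does not assume user and Khoj messages alternate. The message shape (`by`, `message` keys) and the `extract_user_queries` name are assumptions for illustration only.

```python
def extract_user_queries(chat_history: list[dict]) -> list[str]:
    """Collect every user query, whether or not a Khoj message follows it."""
    return [entry["message"] for entry in chat_history if entry.get("by") == "you"]


history = [
    {"by": "you", "message": "Book a flight"},
    # The Khoj reply to the first query was dropped by an interrupt.
    {"by": "you", "message": "Actually, make it a train"},
    {"by": "khoj", "message": "Searching for trains."},
    {"by": "you", "message": "Morning departures only"},  # no Khoj reply yet
]
queries = extract_user_queries(history)
```

Filtering by sender instead of pairing adjacent messages keeps consecutive and trailing user queries in the extracted history.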
We were passing operator results as a simple dictionary. Strongly
typing them makes sense as operator results become more complex.
Storing operator results with the trajectory on interrupts will allow
restarting an interrupted operator run with the agent messages of the
interrupted trajectory loaded into operator agents.
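As an illustration of what strongly typed operator results could look like, here is a sketch using dataclasses. The class and field names (`AgentMessage`, `OperatorResult`, `query`, `response`, `trajectory`) are hypothetical, not the types introduced by this change.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class AgentMessage:
    role: str
    content: str


@dataclass
class OperatorResult:
    query: str
    response: Optional[str] = None  # None while the run is unfinished or interrupted
    trajectory: list[AgentMessage] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        """Serialize for storage, e.g. when a run is interrupted."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "OperatorResult":
        """Rebuild the typed result, including its trajectory, from stored data."""
        messages = [AgentMessage(**message) for message in data.get("trajectory", [])]
        return cls(query=data["query"], response=data.get("response"), trajectory=messages)


interrupted = OperatorResult(
    query="find train times",
    trajectory=[AgentMessage(role="assistant", content="Opening browser")],
)
restored = OperatorResult.from_dict(interrupted.to_dict())
```

Round-tripping through a plain dictionary keeps the stored form simple while the running code works with typed fields.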