Debanjum 2ea16298aa Create Operator Framework. Enable Khoj to Operate Web Browser (#1174)
## Overview

1. Create base framework to compose different operators and environments
for Khoj to operate.
2. Enable Khoj to operate a web browser using Anthropic, OpenAI, Gemini
or open-source models

**Note**: *This is an alpha level feature release. It is meant for local
testing by contributors and self-hosters.*

## Capabilities
- Have Khoj operate a web browser to complete tasks that require actions
  and visual feedback.
- Experiment with any vision model as operator. Khoj supports monolithic
  and binary operators.
  - Monolithic operators rely on a single model, such as Claude or an
    OpenAI model, to both reason and ground operator actions.
  - Binary operators allow bootstrapping a fully local operator. They can
    use any vision model for visual reasoning when paired with a capable
    visual grounding model.
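
The difference between the two operator types can be sketched in a few lines of Python. This is an illustrative sketch only, not Khoj's actual classes: `Action`, `MonolithicOperator`, `BinaryOperator` and the callable signatures are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    """A single browser action (hypothetical shape, for illustration only)."""
    kind: str  # e.g. "click" or "type"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class MonolithicOperator:
    """One vision model both reasons about the task and grounds the action."""
    model: Callable[[str, bytes], Action]

    def next_action(self, task: str, screenshot: bytes) -> Action:
        return self.model(task, screenshot)


@dataclass
class BinaryOperator:
    """A reasoning model plans in words; a grounding model maps words to an action."""
    reasoner: Callable[[str, bytes], str]
    grounder: Callable[[str, bytes], Action]

    def next_action(self, task: str, screenshot: bytes) -> Action:
        # Step 1: the reasoner describes what to do next in natural language.
        instruction = self.reasoner(task, screenshot)
        # Step 2: the grounder resolves that instruction to a concrete action.
        return self.grounder(instruction, screenshot)
```

Both classes expose the same `next_action` method, so an environment such as a browser could drive either one without knowing which kind it holds.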

## Limitations
- In general, operating a browser is slower, more expensive and less
  comprehensive than standard Khoj for research.

## Setup
1. Install Khoj with Playwright by either
   - running `pip install khoj[local]`, or
   - installing Playwright separately via `pip install playwright` and
     `playwright install chromium`
2. Set the `KHOJ_OPERATOR_ENABLED` env var to true (i.e.
   `KHOJ_OPERATOR_ENABLED=true`)
3. Start Khoj (e.g. `USE_EMBEDDED_DB="true" khoj --anonymous-mode -vv`)
4. Add the necessary chat model(s) with `vision enabled` via your [Khoj
   Admin Panel](http://localhost:42110/server/admin)
   - To use Anthropic Claude: a `claude-3.7-sonnet*` chat model is required
     with vision enabled
   - To use the OpenAI operator: a `gpt-4o` chat model is required with
     vision enabled
   - For other operator configurations: a chat model named `ui-tars-1.5` is
     required with vision enabled.
     This can technically be any visual grounding model served via an
     OpenAI-compatible API. I've only tested with ui-tars-1.5-7b deployed to
     an HF inference endpoint for now. See [deployment
     instructions](https://github.com/bytedance/UI-TARS/blob/main/README_deploy.md)
5. Set your desired vision chat model via [user
   settings](http://localhost:42110/settings) to use as operator.
6. Run your queries with either the `/operator` slash command or by just
   asking Khoj in your query to use the operator tool. You can also run the
   operator in research mode.
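
Before starting Khoj, steps 1 and 2 can be sanity-checked with a small preflight script. This is a hedged sketch, not part of Khoj: the helper names `operator_enabled` and `preflight` are hypothetical, and the set of values treated as "true" is an assumption, since how Khoj parses `KHOJ_OPERATOR_ENABLED` is not described here.

```python
import importlib.util
import os

# Assumption: values treated as "enabled". Khoj's own parsing may differ.
TRUTHY = {"1", "true", "yes", "on"}


def operator_enabled(env=None) -> bool:
    """Return True if KHOJ_OPERATOR_ENABLED looks like it is switched on."""
    env = os.environ if env is None else env
    return env.get("KHOJ_OPERATOR_ENABLED", "").strip().lower() in TRUTHY


def preflight(env=None) -> list:
    """Collect human-readable problems with the local operator setup."""
    problems = []
    if not operator_enabled(env):
        problems.append("KHOJ_OPERATOR_ENABLED is not set to true")
    # find_spec only checks that the package is importable; it does not
    # verify that `playwright install chromium` has been run.
    if importlib.util.find_spec("playwright") is None:
        problems.append("playwright is not installed")
    return problems
```

Calling `preflight()` with no arguments reads the real environment; passing a plain dict makes the check easy to try out without touching your shell.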

### Advanced Usage
- Reuse Browser Session
  - Why: Have Khoj operate web services you've logged into, e.g. manage
    your Gmail, GitHub, social media etc.
  - Setup
    1. Start Chromium or Edge in Remote Debugging mode. For example, on Mac
       you can start Edge by running the following in your terminal:
       `/Applications/Microsoft\ Edge.app/Contents/MacOS/Microsoft\ Edge --remote-debugging-port=9222`
    2. Connect Khoj to that browser instance by setting the environment
       variable `KHOJ_CDP_URL` to its URL.
       By default you'd set `KHOJ_CDP_URL="http://localhost:9222"`

## Architecture
### Operator Agents
| Type | Design |
|----- |-----|
| Monolithic | <img src="https://github.com/user-attachments/assets/7a96440f-1732-482b-9bd9-0920cb0c60890" width=400> |
| Binary | <img src="https://github.com/user-attachments/assets/c5d101c0-3475-43c2-a301-daa943cde190" width=400> |
2025-05-20 01:30:36 -07:00
2025-04-23 19:01:27 +05:30


Your AI second brain

📑 Docs   •   🌐 Web   •   🔥 App   •   💬 Discord   •   ✍🏽 Blog


🎁 New

  • Start any message with /research to try out the experimental research mode with Khoj.
  • Anyone can now create custom agents with tunable personality, tools and knowledge bases.
  • Read about Khoj's excellent performance on modern retrieval and reasoning benchmarks.

Overview

Khoj is a personal AI app to extend your capabilities. It smoothly scales up from an on-device personal AI to a cloud-scale enterprise AI.

  • Chat with any local or online LLM (e.g. llama3, qwen, gemma, mistral, gpt, claude, gemini, deepseek).
  • Get answers from the internet and your docs (including image, pdf, markdown, org-mode, word, notion files).
  • Access it from your Browser, Obsidian, Emacs, Desktop, Phone or WhatsApp.
  • Create agents with custom knowledge, persona, chat model and tools to take on any role.
  • Automate away repetitive research. Get personal newsletters and smart notifications delivered to your inbox.
  • Find relevant docs quickly and easily using our advanced semantic search.
  • Generate images, talk out loud, play your messages.
  • Khoj is open-source, self-hostable. Always.
  • Run it privately on your computer or try it on our cloud app.

See it in action

Go to https://app.khoj.dev to see Khoj live.

Full feature list

You can see the full feature list here.

Self-Host

To get started with self-hosting Khoj, read the docs.

Enterprise

Khoj is available as a cloud service, on-premises, or as a hybrid solution. To learn more about Khoj Enterprise, visit our website.

Frequently Asked Questions (FAQ)

Q: Can I use Khoj without self-hosting?

Yes! You can use Khoj right away at https://app.khoj.dev — no setup required.

Q: What kinds of documents can Khoj read?

Khoj supports a wide variety: PDFs, Markdown, Notion, Word docs, org-mode files, and more.

Q: How can I make my own agent?

Check out this blog post for a step-by-step guide to custom agents. For more questions, head over to our Discord!

Contributors

Cheers to our awesome contributors! 🎉

Made with contrib.rocks.

Interested in Contributing?

Khoj is open source. It is sustained by the community and we'd love for you to join it! Whether you're a coder, designer, writer, or enthusiast, there's a place for you.

Why Contribute?

  • Make an Impact: Help build, test and improve a tool used by thousands to boost productivity.
  • Learn & Grow: Work on cutting-edge AI, LLMs, and semantic search technologies.

You can help us build new features, improve the project documentation, report issues and fix bugs. If you're a developer, please see our Contributing Guidelines and check out good first issues to work on.

Readme AGPL-3.0 116 MiB
Languages
Python 51%
TypeScript 36.1%
CSS 4.1%
HTML 3.2%
Emacs Lisp 2.4%
Other 3.1%