It sometimes outputs coordinates as a string rather than a list. Make
the parser more robust to that kind of error.
Share the error with the operator agent to fix and iterate on, instead
of exiting the operator loop.
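A minimal sketch of the kind of defensive parsing this implies. The function name and the specific string formats handled are illustrative assumptions, not the actual implementation:

```python
import ast
import re

def parse_coordinates(raw):
    """Normalize grounder coordinate output to an [x, y] list.

    Accepts a list/tuple, or strings like "[320, 240]", "(320, 240)"
    or "320,240" (hypothetical formats the model might emit).
    """
    if isinstance(raw, (list, tuple)) and len(raw) == 2:
        return [int(raw[0]), int(raw[1])]
    if isinstance(raw, str):
        # Try literal_eval first for "[320, 240]" / "(320, 240)" forms.
        try:
            parsed = ast.literal_eval(raw)
            if isinstance(parsed, (list, tuple)) and len(parsed) == 2:
                return [int(parsed[0]), int(parsed[1])]
        except (ValueError, SyntaxError):
            pass
        # Fall back to pulling out the first two integers in the string.
        nums = re.findall(r"-?\d+", raw)
        if len(nums) >= 2:
            return [int(nums[0]), int(nums[1])]
    raise ValueError(f"Could not parse coordinates from {raw!r}")
```

On a `ValueError` here, the error text would be fed back to the operator agent to retry rather than exiting the loop.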
- Encourage grounder to adhere to the reasoner's action instruction
- Encourage reasoner to explore other actions when stuck in a loop
  Previously we seemed to be forcing it too strongly to choose the
  "single most important" next action, so it may not have been exploring
  other actions to achieve the objective after an initial failure.
- Do not catch error messages just to re-throw them. That produces a
  confusing "exception happened during handling of an exception" stack
  trace and makes it harder to debug.
- Log an error when action_results.content isn't set or is empty, to
  help debug this operator run error
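To illustrate why catch-and-rethrow hurts debugging: re-raising a new exception inside an `except` block implicitly chains it onto the original, producing the "During handling of the above exception, another exception occurred" traceback. The function names below are illustrative:

```python
def risky():
    # Stand-in for a failing operator step.
    return 1 / 0

# Anti-pattern: catch just to wrap and re-throw. The new exception is
# implicitly chained onto the original (via __context__), yielding a
# confusing two-part stack trace.
def wrapped_call():
    try:
        return risky()
    except ZeroDivisionError as e:
        raise RuntimeError(f"operator step failed: {e}")

# Better: let the original exception propagate (or use a bare `raise`)
# so the traceback points straight at the failure.
def direct_call():
    return risky()
```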
Goto and back functions are chosen by the visual reasoning model for
increased reliability in selecting those tools. The ui-tars grounding
model seems too tuned to use a specific set of tools.
Documentation about this is currently limited and confusing. But it
seems a reasoning item should be kept if a computer_call follows it,
and dropped otherwise.
Add a noop placeholder for the reasoning item to prevent termination of
the operator run on a response with just reasoning.
The reasoning messages in openai cua need to be passed back or some
such. Else it throws a missing response with required id error.
Folks online are confused about the expected behavior as well.
The documentation on handling this seems sparse and unclear.
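A sketch of the keep/drop rule described above, operating on response output items as plain dicts. The item shapes and the noop placeholder shape are assumptions for illustration, not the exact OpenAI payloads:

```python
def filter_reasoning_items(output_items):
    """Keep a reasoning item only when a computer_call immediately
    follows it; drop it otherwise. If nothing survives, add a noop
    placeholder so a reasoning-only response doesn't terminate the
    operator run.
    """
    kept = []
    for i, item in enumerate(output_items):
        if item.get("type") == "reasoning":
            nxt = output_items[i + 1] if i + 1 < len(output_items) else None
            if nxt and nxt.get("type") == "computer_call":
                kept.append(item)
        else:
            kept.append(item)
    if not kept:
        # Noop placeholder (illustrative shape) to keep the loop alive.
        kept.append({"type": "message", "content": "noop"})
    return kept
```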
Show natural-language, formatted text for each action. Previously we
were just showing JSON dumps of the actions taken.
Pass a screenshot at each step for the openai, anthropic and binary
operator agents.
Use the text and image fields in the JSON passed to the client for
rendering both.
Show actions and the environment screenshot after the actions are
applied in the train of thought.
Showing the post-action screenshot seems more intuitive. Previously we
were showing the screenshot used to decide the next action. This
pre-action screenshot was shown after the next action was decided (in
the train of thought), which misrepresented the actual ordering of
events anyway.
The rendered response is now a structured payload (dict) passing the
image and text to be rendered up from the operator to clients for
rendering the train of thought.
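A minimal sketch of what such a structured payload might look like. The function name, field names and base64 encoding are illustrative assumptions:

```python
import base64

def render_response(action_text: str, screenshot_png: bytes) -> dict:
    """Structured train-of-thought payload passed up from the operator
    to clients: natural-language action text plus the post-action
    screenshot, instead of a raw JSON dump of the action."""
    return {
        "text": action_text,
        "image": base64.b64encode(screenshot_png).decode("utf-8"),
    }
```

The client can then render the text and image fields independently.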
Operator is still early in development. To enable it:
- Set KHOJ_OPERATOR_ENABLE environment variable to true
- Run any one of the commands below:
  - `pip install khoj[local]`
  - `pip install khoj[dev]`
  - `pip install playwright`
The grounding agent does not have the full context and capabilities to
make this call. Only let the reasoning agent make the termination
decision. Add a wait action instead when the grounder requests
termination.
The UI-TARS grounder doesn't like calling non-standard functions like
goto and back.
Directly parse the visual reasoner's instruction to bypass the UI-TARS
grounder model.
Grounding isn't necessary for the goto and back functions at least, so
this works well.
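A sketch of what directly parsing navigation actions out of the reasoner's natural-language instruction could look like. The phrasing patterns and returned action shapes are assumptions for illustration:

```python
import re

def parse_navigation_action(instruction: str):
    """Pull goto/back actions straight out of the visual reasoner's
    instruction, bypassing the grounding model. Returns None for
    anything else, which would then be sent to the grounder."""
    lowered = instruction.strip().lower()
    if lowered.startswith("back") or "go back" in lowered:
        return {"type": "back"}
    match = re.search(
        r"(?:goto|go to|navigate to)\s+(https?://\S+)",
        instruction,
        re.IGNORECASE,
    )
    if match:
        # Strip trailing punctuation the reasoner may append.
        return {"type": "goto", "url": match.group(1).rstrip(".,)")}
    return None
```

Since goto/back need no visual grounding (no coordinates), skipping the grounder here avoids its bias toward its standard tool set.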
Previously the grounding agent was reset on every call, so it only saw
the most recent instruction and screenshot when making its next action
suggestion.
This change lets the visual grounders see past instructions and
actions, to prevent looping and encourage more exploratory action
suggestions when they are stuck or see errors.
Split visual grounder into two implementations:
- A ui-tars specific visual grounder agent. This uses the canonical
implementation of ui-tars with specialized system prompt and action
parsing.
- Fall back to a generic visual grounder utilizing tool use, served
  over any openai-compatible API. This was previously being used for
  our ui-tars implementation as well.
Add the results of each action as a separate item in the message
content.
Previously we were adding these as one large text blob. This change
adds structure to simplify post-processing (e.g. truncation).
The updated add_action_results should also require less work to
generalize if we pass tool call history to the grounding model as
action results in valid openai format.
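A sketch of the structured message content this describes, roughly following the OpenAI chat content-parts shape. The result dict fields here are illustrative assumptions, not the actual add_action_results signature:

```python
def add_action_results(messages, results):
    """Append each action result as its own content item instead of one
    large text blob, so post-processing (e.g. truncating the oldest
    results) can operate on individual items."""
    content = []
    for result in results:
        if result.get("screenshot"):
            content.append({
                "type": "image_url",
                "image_url": {"url": result["screenshot"]},
            })
        if result.get("text"):
            content.append({"type": "text", "text": result["text"]})
    messages.append({"role": "user", "content": content})
    return messages
```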
The initial user query isn't updated during an operator run, so set it
when initializing the operator agent instead of passing it on every
call to act.
Pass the summarize prompt directly to the summarize function and let it
construct the summarize message to query the vision model with.
Previously it was being passed to the add_action_results func, a
holdover from the earlier implementation that did not use a separate
summarize func.
Also rename chat_model to vision_model for a more pertinent variable
name.
These changes make the code cleaner and the implementation more
readable.
For some reason the page.go_back() action in playwright had a much
higher propensity to time out. Use goto instead to reduce these page
traversal timeouts.
This requires tracking navigation history.
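A minimal sketch of the navigation history tracking this requires. The class and method names are illustrative, not the actual implementation:

```python
class NavigationHistory:
    """Track visited URLs so 'back' can be implemented with page.goto()
    (which timed out less often for us than page.go_back())."""

    def __init__(self):
        self._stack = []

    def visit(self, url: str):
        # Record a navigation; skip consecutive duplicates (e.g. reloads).
        if not self._stack or self._stack[-1] != url:
            self._stack.append(url)

    def back(self):
        """Return the URL to goto for a back action, or None when
        already at the start of history."""
        if len(self._stack) < 2:
            return None
        self._stack.pop()
        return self._stack[-1]
```

The operator would call `visit()` after each successful goto/click navigation, and implement the back action as `page.goto(history.back())` when a previous URL exists.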
Only let the visual reasoner terminate the operator run.
Previously the grounder was also able to trigger termination.
Make catching the termination signal from the reasoner more robust.
The previous browser_operator.py file had become pretty massive and
unwieldy. This change breaks it apart into separate files for
- the abstract environment and operator agent base classes
- the concrete agents: anthropic, openai and binary
- the concrete browser operator environment
- the operator actions used by agents and the environment
- This operator works with models served over an openai-compatible API
- It uses separate vision models to reason about and ground actions.
This improves the flexibility of the operator agents that can be
created. We no longer need our operator agent to rely on monolithic
models that can both reason over visual data and ground their actions.
We can create an operator agent from 2 separate models:
1. One to reason over screenshots and suggest the next action in
   natural language
2. One to ground those suggestions into visually grounded actions
This allows us to create fully local operators, or operators combining
the best visual reasoner with the best visual grounder models.
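A sketch of one step of this two-model split. The method names and action shapes are assumptions for illustration; any reasoner/grounder pair with these roles would do:

```python
def operate_step(reasoner, grounder, screenshot: bytes, objective: str) -> dict:
    """One iteration of a two-model operator:
    1. The visual reasoner looks at the screenshot and proposes the
       next action in natural language.
    2. The visual grounder turns that suggestion into a concrete,
       coordinate-grounded action.
    """
    suggestion = reasoner.suggest_action(screenshot, objective)
    return grounder.ground_action(screenshot, suggestion)
```

This is what lets the reasoner and grounder be entirely different models (e.g. a strong hosted reasoner with a local ui-tars grounder, or both local).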
Inform it that it can only control a single playwright browser page.
Previously it assumed it was operating a whole browser, so it would
have trouble navigating to different pages.
Improve handling of errors in action parsing
Keep OperatorAgent-specific code from leaking out into the operator.
The Operator just calls the standard OperatorAgent functions. Each
piece of OperatorAgent-specific logic is handled by the OperatorAgent
internally.
This improves the separation of responsibility between the Operator,
OperatorAgent and the Environment.
- Make the environment pass screenshot data in an agent-agnostic format
- Have operator agent providers format image data into their AI model's
  specific format
- Add an environment step type to distinguish image vs text content
- Clearly mark the major steps in the operator iteration loop
- Handle anthropic models returning computer tool actions as normal
  tool calls by normalizing next-action retrieval from the response
- Remove unused ActionResults fields
- Remove unnecessary placeholders in action results content, like for
  screenshot data
Decouple applying actions on the Environment from the next action
decision by the OperatorAgent
- Create an abstract Environment class with a `step` method and a
  standardized set of supported actions for each concrete Environment
- Wrap the playwright page into a concrete Environment class
- Create an abstract OperatorAgent class with an abstract `act` method
- Wrap the OpenAI computer operator into a concrete OperatorAgent class
- Wrap the Claude computer operator into a concrete OperatorAgent class
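A minimal sketch of the abstract base classes this decoupling implies. Only the `step` and `act` method names come from the text above; the signatures and docstrings are assumptions:

```python
from abc import ABC, abstractmethod

class Environment(ABC):
    """Abstract environment with a standardized set of supported
    actions. A concrete subclass (e.g. a playwright-page browser
    environment) implements `step`."""

    @abstractmethod
    def step(self, action: dict) -> dict:
        """Apply one action and return its result (text and/or screenshot)."""

class OperatorAgent(ABC):
    """Abstract agent. Concrete OpenAI / Claude (and later binary)
    agents implement `act`."""

    @abstractmethod
    def act(self, observation: dict) -> dict:
        """Decide the next action from the latest observation."""
```

The operator loop then only talks to these two interfaces: ask the agent to `act`, apply the result with `env.step`, feed the result back.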
Handle interactions between the agent's actions and new tabs.
Previously some link clicks would open in a new tab. That is outside
the browser operator's context, so the new page could not be
interacted with by the browser operator.
This change catches new page opens and opens them in the context page
instead.
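A sketch of the redirect-to-context-page idea, written against stand-in page objects so the logic is clear; in playwright this would be wired up via the page's popup event (and the real implementation would need to wait for the popup to load before reading its URL):

```python
def handle_popup(context_page, popup_page):
    """When a link click opens a new tab, capture its URL, close the
    tab, and navigate the operator's own page there instead, keeping
    everything inside the page the operator can see and control."""
    url = popup_page.url
    popup_page.close()
    if url and url != "about:blank":
        context_page.goto(url)
    return url
```

Illustrative wiring (assumed, not verbatim): `page.on("popup", lambda popup: handle_popup(page, popup))`.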
Give the Khoj browser operator access to browser with existing
context (auth, cookies etc.) by starting it with CDP enabled.
Process:
1. Start Browser with CDP enabled:
`Edge/Chromium/Chrome --remote-debugging-port=9222`
2. Set the KHOJ_CDP_URL env var to the CDP url of the browser to use.
3. Start Khoj and ask it to get browser based work done with operator
+ research mode
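On the Khoj side, attaching to that browser could look roughly like this. Reading `KHOJ_CDP_URL` comes from the steps above; `connect_over_cdp` is playwright's real API for attaching to a CDP endpoint, but the exact wiring here is a sketch:

```python
import os

def get_cdp_url(default=None):
    """Read the CDP endpoint Khoj should attach to. Returns None when
    unset, in which case Khoj would launch its own browser instead."""
    return os.environ.get("KHOJ_CDP_URL") or default

# Illustrative attach (requires playwright installed):
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       browser = p.chromium.connect_over_cdp(get_cdp_url())
#       # Reuse the existing context so auth, cookies etc. carry over.
#       page = browser.contexts[0].pages[0]
```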
The github run_eval workflow sets OPENAI_BASE_URL to an empty string.
The AI model API created during initialization for openai models then
gets its base URL set to an empty string rather than None or the actual
openai base URL.
This makes it try to call the LLM at the empty-string base URL instead
of the default openai API base URL, which obviously fails.
The fix is to map empty base URLs to the actual openai API base URL.
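The fix reduces to a small normalization step; a sketch (the function name is illustrative, the default URL is OpenAI's documented API base):

```python
DEFAULT_OPENAI_BASE_URL = "https://api.openai.com/v1"

def normalize_base_url(base_url):
    """Map empty or whitespace-only base URLs (e.g. OPENAI_BASE_URL=""
    set by CI) to the default openai API base URL, instead of letting
    an empty string through as a request target."""
    if base_url and base_url.strip():
        return base_url.strip()
    return DEFAULT_OPENAI_BASE_URL
```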
Previously all exceptions were being caught, so the retry logic wasn't
getting triggered.
The exception catching had been added to close the LLM thread back when
threads, instead of async, were used for final response generation.
This isn't required anymore since moving to async, so we can re-enable
retry on failures.
Raise an error if the response is empty to retry the LLM completion.
Send images as png to non-openai models served via an openai-compatible
API, as more models support png than webp.
Continue storing images as webp on the server for efficiency.
Convert to png at the openai API layer, and only for non-openai models
served via an openai-compatible API.
This enables using vision models like ui-tars (via llama.cpp server)
and grok.
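A sketch of the format-selection rule this describes. The function name and the api.openai.com heuristic for "is this an openai model" are assumptions for illustration:

```python
def image_mime_for_model(model_name: str, api_base_url: str) -> str:
    """Pick the image format to send: keep webp for models served by
    openai itself, send png to other models behind an openai-compatible
    API, since more of them accept png than webp."""
    if "api.openai.com" in (api_base_url or ""):
        return "image/webp"
    return "image/png"

# The actual webp -> png re-encode would happen at the API layer only
# when png is selected; with Pillow (assumed installed) roughly:
#   from io import BytesIO
#   from PIL import Image
#   buf = BytesIO()
#   Image.open(BytesIO(webp_bytes)).save(buf, format="PNG")
```

Server-side storage stays webp either way; conversion is deferred to request time.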