It sometimes outputs coordinates as a string rather than a list. Make
the parser more robust to that kind of error.
Share the error with the operator agent to fix and iterate on, instead
of exiting the operator loop.
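A minimal sketch of the kind of defensive parsing this implies. The function name and the specific string formats handled are illustrative assumptions, not the actual implementation:

```python
import ast
import re

def parse_coordinates(raw):
    """Normalize grounder coordinate output to an [x, y] list.

    Accepts a list/tuple, or strings like "[320, 240]", "(320, 240)"
    or "320,240" (hypothetical formats the model might emit).
    """
    if isinstance(raw, (list, tuple)) and len(raw) == 2:
        return [int(raw[0]), int(raw[1])]
    if isinstance(raw, str):
        # Try literal_eval first for "[320, 240]" / "(320, 240)" forms.
        try:
            parsed = ast.literal_eval(raw)
            if isinstance(parsed, (list, tuple)) and len(parsed) == 2:
                return [int(parsed[0]), int(parsed[1])]
        except (ValueError, SyntaxError):
            pass
        # Fall back to pulling out the first two integers in the string.
        nums = re.findall(r"-?\d+", raw)
        if len(nums) >= 2:
            return [int(nums[0]), int(nums[1])]
    raise ValueError(f"Could not parse coordinates from {raw!r}")
```

On a `ValueError` here, the error text would be fed back to the operator agent to retry rather than exiting the loop.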
- Encourage grounder to adhere to the reasoner's action instruction
- Encourage reasoner to explore other actions when stuck in a loop
  Previously we seemed to be forcing it too strongly to choose the
  "single most important" next action, so it may not have been exploring
  other actions to achieve the objective after an initial failure.
- Do not catch error messages just to re-throw them. That produces a
  confusing "exception happened during handling of an exception" stack
  trace and makes it harder to debug.
- Log an error when action_results.content isn't set or is empty, to
  help debug this operator run error
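To illustrate why catch-and-rethrow hurts debugging: re-raising a new exception inside an `except` block implicitly chains it onto the original, producing the "During handling of the above exception, another exception occurred" traceback. The function names below are illustrative:

```python
def risky():
    # Stand-in for a failing operator step.
    return 1 / 0

# Anti-pattern: catch just to wrap and re-throw. The new exception is
# implicitly chained onto the original (via __context__), yielding a
# confusing two-part stack trace.
def wrapped_call():
    try:
        return risky()
    except ZeroDivisionError as e:
        raise RuntimeError(f"operator step failed: {e}")

# Better: let the original exception propagate (or use a bare `raise`)
# so the traceback points straight at the failure.
def direct_call():
    return risky()
```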
Goto and back functions are chosen by the visual reasoning model for
increased reliability in selecting those tools. The ui-tars grounding
model seems too tuned to use a specific set of tools.
Documentation about this is currently limited and confusing. But it
seems a reasoning item should be kept if a computer_call follows it,
and dropped otherwise.
Add a noop placeholder for the reasoning item to prevent termination of
the operator run on a response with just reasoning.
The reasoning messages in openai cua need to be passed back or some
such. Else it throws a missing response with required id error.
Folks online are confused about the expected behavior as well.
The documentation on handling this seems sparse and unclear.
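A sketch of the keep/drop rule described above, operating on response output items as plain dicts. The item shapes and the noop placeholder shape are assumptions for illustration, not the exact OpenAI payloads:

```python
def filter_reasoning_items(output_items):
    """Keep a reasoning item only when a computer_call immediately
    follows it; drop it otherwise. If nothing survives, add a noop
    placeholder so a reasoning-only response doesn't terminate the
    operator run.
    """
    kept = []
    for i, item in enumerate(output_items):
        if item.get("type") == "reasoning":
            nxt = output_items[i + 1] if i + 1 < len(output_items) else None
            if nxt and nxt.get("type") == "computer_call":
                kept.append(item)
        else:
            kept.append(item)
    if not kept:
        # Noop placeholder (illustrative shape) to keep the loop alive.
        kept.append({"type": "message", "content": "noop"})
    return kept
```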
Show natural-language, formatted text for each action. Previously we
were just showing JSON dumps of the actions taken.
Pass a screenshot at each step for the openai, anthropic and binary
operator agents.
Use the text and image fields in the JSON passed to the client for
rendering both.
Show actions and the environment screenshot after the actions are
applied in the train of thought.
Showing the post-action screenshot seems more intuitive. Previously we
were showing the screenshot used to decide the next action. This
pre-action screenshot was shown after the next action was decided (in
the train of thought), which misrepresented the actual ordering of
events anyway.
The rendered response is now a structured payload (dict) passing the
image and text to be rendered up from the operator to clients for
rendering the train of thought.
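A minimal sketch of what such a structured payload might look like. The function name, field names and base64 encoding are illustrative assumptions:

```python
import base64

def render_response(action_text: str, screenshot_png: bytes) -> dict:
    """Structured train-of-thought payload passed up from the operator
    to clients: natural-language action text plus the post-action
    screenshot, instead of a raw JSON dump of the action."""
    return {
        "text": action_text,
        "image": base64.b64encode(screenshot_png).decode("utf-8"),
    }
```

The client can then render the text and image fields independently.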
Operator is still early in development. To enable it:
- Set KHOJ_OPERATOR_ENABLE environment variable to true
- Run any one of the commands below:
  - `pip install khoj[local]`
  - `pip install khoj[dev]`
  - `pip install playwright`
The grounding agent does not have the full context and capabilities to
make this call. Only let the reasoning agent make the termination
decision. Add a wait action instead when the grounder requests
termination.
The UI-TARS grounder doesn't like calling non-standard functions like
goto and back.
Directly parse the visual reasoner's instruction to bypass the UI-TARS
grounder model.
Grounding isn't necessary for the goto and back functions at least, so
this works well.
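A sketch of what directly parsing navigation actions out of the reasoner's natural-language instruction could look like. The phrasing patterns and returned action shapes are assumptions for illustration:

```python
import re

def parse_navigation_action(instruction: str):
    """Pull goto/back actions straight out of the visual reasoner's
    instruction, bypassing the grounding model. Returns None for
    anything else, which would then be sent to the grounder."""
    lowered = instruction.strip().lower()
    if lowered.startswith("back") or "go back" in lowered:
        return {"type": "back"}
    match = re.search(
        r"(?:goto|go to|navigate to)\s+(https?://\S+)",
        instruction,
        re.IGNORECASE,
    )
    if match:
        # Strip trailing punctuation the reasoner may append.
        return {"type": "goto", "url": match.group(1).rstrip(".,)")}
    return None
```

Since goto/back need no visual grounding (no coordinates), skipping the grounder here avoids its bias toward its standard tool set.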
Previously the grounding agent was reset on every call, so it only saw
the most recent instruction and screenshot when making its next action
suggestion.
This change lets the visual grounders see past instructions and
actions, to prevent looping and encourage more exploratory action
suggestions when they are stuck or see errors.
Split visual grounder into two implementations:
- A ui-tars specific visual grounder agent. This uses the canonical
implementation of ui-tars with specialized system prompt and action
parsing.
- Fall back to a generic visual grounder utilizing tool use, served
  over any openai-compatible API. This was previously being used for
  our ui-tars implementation as well.
Add the results of each action as a separate item in the message
content.
Previously we were adding these as one large text blob. This change
adds structure to simplify post-processing (e.g. truncation).
The updated add_action_results should also require less work to
generalize if we pass tool call history to the grounding model as
action results in valid openai format.
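A sketch of the structured message content this describes, roughly following the OpenAI chat content-parts shape. The result dict fields here are illustrative assumptions, not the actual add_action_results signature:

```python
def add_action_results(messages, results):
    """Append each action result as its own content item instead of one
    large text blob, so post-processing (e.g. truncating the oldest
    results) can operate on individual items."""
    content = []
    for result in results:
        if result.get("screenshot"):
            content.append({
                "type": "image_url",
                "image_url": {"url": result["screenshot"]},
            })
        if result.get("text"):
            content.append({"type": "text", "text": result["text"]})
    messages.append({"role": "user", "content": content})
    return messages
```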
The initial user query isn't updated during an operator run, so set it
when initializing the operator agent instead of passing it on every
call to act.
Pass the summarize prompt directly to the summarize function and let it
construct the summarize message to query the vision model with.
Previously it was being passed to the add_action_results func, a
holdover from the earlier implementation that did not use a separate
summarize func.
Also rename chat_model to vision_model for a more pertinent variable
name.
These changes make the code cleaner and the implementation more
readable.
For some reason the page.go_back() action in playwright had a much
higher propensity to time out. Use goto instead to reduce these page
traversal timeouts.
This requires tracking navigation history.
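A minimal sketch of the navigation history tracking this requires. The class and method names are illustrative, not the actual implementation:

```python
class NavigationHistory:
    """Track visited URLs so 'back' can be implemented with page.goto()
    (which timed out less often for us than page.go_back())."""

    def __init__(self):
        self._stack = []

    def visit(self, url: str):
        # Record a navigation; skip consecutive duplicates (e.g. reloads).
        if not self._stack or self._stack[-1] != url:
            self._stack.append(url)

    def back(self):
        """Return the URL to goto for a back action, or None when
        already at the start of history."""
        if len(self._stack) < 2:
            return None
        self._stack.pop()
        return self._stack[-1]
```

The operator would call `visit()` after each successful goto/click navigation, and implement the back action as `page.goto(history.back())` when a previous URL exists.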
Only let the visual reasoner terminate the operator run.
Previously the grounder was also able to trigger termination.
Make catching the termination signal from the reasoner more robust.
The previous browser_operator.py file had become pretty massive and
unwieldy. This change breaks it apart into separate files for
- the abstract environment and operator agent base classes
- the concrete agents: anthropic, openai and binary
- the concrete browser operator environment
- the operator actions used by agents and the environment
- This operator works with models served over an openai-compatible API
- It uses separate vision models to reason about and ground actions.
This improves the flexibility of the operator agents that can be
created. We no longer need our operator agent to rely on monolithic
models that can both reason over visual data and ground their actions.
We can create an operator agent from 2 separate models:
1. One to reason over screenshots and suggest the next action in
   natural language
2. One to ground those suggestions into visually grounded actions
This allows us to create fully local operators, or operators combining
the best visual reasoner with the best visual grounder models.
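A sketch of one step of this two-model split. The method names and action shapes are assumptions for illustration; any reasoner/grounder pair with these roles would do:

```python
def operate_step(reasoner, grounder, screenshot: bytes, objective: str) -> dict:
    """One iteration of a two-model operator:
    1. The visual reasoner looks at the screenshot and proposes the
       next action in natural language.
    2. The visual grounder turns that suggestion into a concrete,
       coordinate-grounded action.
    """
    suggestion = reasoner.suggest_action(screenshot, objective)
    return grounder.ground_action(screenshot, suggestion)
```

This is what lets the reasoner and grounder be entirely different models (e.g. a strong hosted reasoner with a local ui-tars grounder, or both local).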
Inform it that it can only control a single playwright browser page.
Previously it assumed it was operating a whole browser, so it would
have trouble navigating to different pages.
Improve handling of errors in action parsing
Keep OperatorAgent-specific code from leaking out into the operator.
The Operator just calls the standard OperatorAgent functions. Each
piece of OperatorAgent-specific logic is handled by the OperatorAgent
internally.
This improves the separation of responsibility between the Operator,
OperatorAgent and the Environment.
- Make the environment pass screenshot data in an agent-agnostic format
- Have operator agent providers format image data into their AI model's
  specific format
- Add an environment step type to distinguish image vs text content
- Clearly mark the major steps in the operator iteration loop
- Handle anthropic models returning computer tool actions as normal
  tool calls by normalizing next-action retrieval from the response
- Remove unused ActionResults fields
- Remove unnecessary placeholders in action results content, like for
  screenshot data
Decouple applying actions on the Environment from the next action
decision by the OperatorAgent
- Create an abstract Environment class with a `step` method and a
  standardized set of supported actions for each concrete Environment
- Wrap the playwright page into a concrete Environment class
- Create an abstract OperatorAgent class with an abstract `act` method
- Wrap the OpenAI computer operator into a concrete OperatorAgent class
- Wrap the Claude computer operator into a concrete OperatorAgent class
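A minimal sketch of the abstract base classes this decoupling implies. Only the `step` and `act` method names come from the text above; the signatures and docstrings are assumptions:

```python
from abc import ABC, abstractmethod

class Environment(ABC):
    """Abstract environment with a standardized set of supported
    actions. A concrete subclass (e.g. a playwright-page browser
    environment) implements `step`."""

    @abstractmethod
    def step(self, action: dict) -> dict:
        """Apply one action and return its result (text and/or screenshot)."""

class OperatorAgent(ABC):
    """Abstract agent. Concrete OpenAI / Claude (and later binary)
    agents implement `act`."""

    @abstractmethod
    def act(self, observation: dict) -> dict:
        """Decide the next action from the latest observation."""
```

The operator loop then only talks to these two interfaces: ask the agent to `act`, apply the result with `env.step`, feed the result back.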
Handle interactions between the agent's actions and new tabs.
Previously some link clicks would open in a new tab. That is outside
the browser operator's context, so the new page could not be
interacted with by the browser operator.
This change catches new page opens and opens them in the context page
instead.
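A sketch of the redirect-to-context-page idea, written against stand-in page objects so the logic is clear; in playwright this would be wired up via the page's popup event (and the real implementation would need to wait for the popup to load before reading its URL):

```python
def handle_popup(context_page, popup_page):
    """When a link click opens a new tab, capture its URL, close the
    tab, and navigate the operator's own page there instead, keeping
    everything inside the page the operator can see and control."""
    url = popup_page.url
    popup_page.close()
    if url and url != "about:blank":
        context_page.goto(url)
    return url
```

Illustrative wiring (assumed, not verbatim): `page.on("popup", lambda popup: handle_popup(page, popup))`.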
Give the Khoj browser operator access to browser with existing
context (auth, cookies etc.) by starting it with CDP enabled.
Process:
1. Start Browser with CDP enabled:
`Edge/Chromium/Chrome --remote-debugging-port=9222`
2. Set the KHOJ_CDP_URL env var to the CDP url of the browser to use.
3. Start Khoj and ask it to get browser based work done with operator
+ research mode
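On the Khoj side, attaching to that browser could look roughly like this. Reading `KHOJ_CDP_URL` comes from the steps above; `connect_over_cdp` is playwright's real API for attaching to a CDP endpoint, but the exact wiring here is a sketch:

```python
import os

def get_cdp_url(default=None):
    """Read the CDP endpoint Khoj should attach to. Returns None when
    unset, in which case Khoj would launch its own browser instead."""
    return os.environ.get("KHOJ_CDP_URL") or default

# Illustrative attach (requires playwright installed):
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       browser = p.chromium.connect_over_cdp(get_cdp_url())
#       # Reuse the existing context so auth, cookies etc. carry over.
#       page = browser.contexts[0].pages[0]
```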
The github run_eval workflow sets OPENAI_BASE_URL to an empty string.
The AI model API created during initialization for openai models then
gets its base URL set to an empty string rather than None or the actual
openai base URL.
This makes it try to call the LLM at the empty-string base URL instead
of the default openai API base URL, which obviously fails.
The fix is to map empty base URLs to the actual openai API base URL.
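The fix reduces to a small normalization step; a sketch (the function name is illustrative, the default URL is OpenAI's documented API base):

```python
DEFAULT_OPENAI_BASE_URL = "https://api.openai.com/v1"

def normalize_base_url(base_url):
    """Map empty or whitespace-only base URLs (e.g. OPENAI_BASE_URL=""
    set by CI) to the default openai API base URL, instead of letting
    an empty string through as a request target."""
    if base_url and base_url.strip():
        return base_url.strip()
    return DEFAULT_OPENAI_BASE_URL
```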
Previously all exceptions were being caught, so the retry logic wasn't
getting triggered.
The exception catching had been added to close the LLM thread back when
threads, instead of async, were used for final response generation.
This isn't required anymore since moving to async, so we can re-enable
retry on failures.
Raise an error if the response is empty to retry the LLM completion.
Send images as png to non-openai models served via an openai-compatible
API, as more models support png than webp.
Continue storing images as webp on the server for efficiency.
Convert to png at the openai API layer, and only for non-openai models
served via an openai-compatible API.
This enables using vision models like ui-tars (via llama.cpp server)
and grok.
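A sketch of the format-selection rule this describes. The function name and the api.openai.com heuristic for "is this an openai model" are assumptions for illustration:

```python
def image_mime_for_model(model_name: str, api_base_url: str) -> str:
    """Pick the image format to send: keep webp for models served by
    openai itself, send png to other models behind an openai-compatible
    API, since more of them accept png than webp."""
    if "api.openai.com" in (api_base_url or ""):
        return "image/webp"
    return "image/png"

# The actual webp -> png re-encode would happen at the API layer only
# when png is selected; with Pillow (assumed installed) roughly:
#   from io import BytesIO
#   from PIL import Image
#   buf = BytesIO()
#   Image.open(BytesIO(webp_bytes)).save(buf, format="PNG")
```

Server-side storage stays webp either way; conversion is deferred to request time.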