- Adds support for multiple users to be connected to the same Khoj instance using their Google login credentials
- Moves storage solution from in-memory json data to a Postgres db. This stores all relevant information, including accounts, embeddings, chat history, server side chat configuration
- Adds the concept of a Khoj server admin for configuring instance-wide settings regarding search model, and chat configuration
- Miscellaneous updates and fixes to the UX, including chat references, colors, and an updated config page
- Adds billing to allow users to subscribe to the cloud service easily
- Adds a separate GitHub action for building the dockerized production (tag `prod`) and dev (tag `dev`) images, separate from the image used for local building. The production image uses `gunicorn` with multiple workers to run the server.
- Updates all clients (Obsidian, Emacs, Desktop) to follow the client/server architecture. The server no longer reads from the file system at all; it only accepts data via the indexer API. In line with that, removes the functionality to configure org, markdown, plaintext, or other file-specific settings in the server. Only leaves GitHub and Notion for server-side configuration.
- Changes license to GNU AGPLv3
Resolves#467Resolves#488Resolves#303Resolves#345Resolves#195Resolves#280Resolves#461Closes#259Resolves#351Resolves#301Resolves#296
- Link to Django admin panel for user to create Chat Models on their
Khoj server
- This should only get hit when user is not using Khoj cloud, as Khoj
cloud would already have Chat models configured
- While sigmoid normalization isn't required for reranking.
Normalizing score to distance metrics for both encoder and cross
encoder scores is useful to reason about them
- Softmax wasn't required as don't need probabilities, sigmoid is good
enough to get distance metric
- Expose ability to modify search model via Django admin interface
- Previously the bi_encoder and cross_encoder models to use were set
in code
- Now it's user configurable but with a default config generated by
default
- During the migration, the confidence score stopped being used. It
was being passed down from API to some point and went unused
- Remove score thresholding for images as image search confidence
score different from text search model distance score
- Default score threshold of 0.15 is experimentally determined by
manually looking at search results vs distance for a few queries
- Use distance instead of confidence as metric for search result quality
Previously we'd moved text search to a distance metric from a
confidence score.
Now convert even cross encoder, image search scores to distance metric
for consistent results sorting
Remove the Results Count button from the web app. It's hanging weirdly
with not much context to its purpose.
Reintroduce it in the Search card when created under the Features section
Reduce user confusion by joining config update with index updation for
each content type.
So only a single click required to configure any content type instead
of two clicks on two separate pages
- Notes prompt doesn't need to be so tuned to question answering. User
could just want to talk about life. The notes need to be used to
response to those, not necessarily only retrieve answers from notes
- System and notes prompts were forcing asking follow-up questions a
little too much. Reduce strength of follow-up question asking
The Chat models sometime output reference notes directly in the chat
body in unformatted form, specifically as Notes:\n['. Prevent that.
Reference notes are shown in clean, formatted form anyway