
klbr/khoj

mirror of https://github.com/khoaliber/khoj.git synced 2026-03-03 05:29:12 +00:00


Debanjum Singh Solanky b89fc2f4ac Add /reload API to reload model embeddings and entries from file

- The reload API adds the ability to separate out the loading of
  embeddings from file without having to restart app or (re-)generate embeddings

- Before this, the only way to load the model from file was by restarting the app
- The other way to reload the model embeddings, by regenerating them,
  was too expensive for larger datasets

- This unlocks at least 1 use-case, where
  - we regenerate model via an app instance running on a separate server and
  - just reload the generated embeddings on the client device

  - This allows us to offload the expensive embedding generation
    compute to a background server while letting

  - This avoids having to restart the application on the client device or
    be forced to generate embeddings on the client device itself

  - But it requires the model-relevant files to be synced to the client device.
    This can be done with any file syncing application, like Syncthing

  - We can then call /regenerate on the server and /reload on the client on a
    regular schedule to keep our semantic search data up to date

2022-06-29 23:47:17 +04:00
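
The server/client refresh cycle described in the commit message above could be scripted roughly as follows. This is a minimal sketch, not the project's own tooling: the host names, port 8000, plain HTTP GET requests and the `refresh` helper are all assumptions made for illustration. It also does not wait for the file sync (e.g. Syncthing) to finish between the two calls, which a real schedule would need to allow for.

```python
from urllib.parse import urlunsplit
from urllib.request import urlopen


def endpoint_url(host: str, path: str, port: int = 8000) -> str:
    """Build an HTTP URL for an API path. Port 8000 is an assumed default."""
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))


def refresh(server_host: str, client_host: str, fetch=urlopen) -> list:
    """Call /regenerate on the server, then /reload on the client.

    `fetch` is injectable so the call order can be checked without a network.
    Returns the URLs that were requested, in order.
    """
    urls = [
        endpoint_url(server_host, "/regenerate"),  # expensive step, runs on the server
        endpoint_url(client_host, "/reload"),      # cheap step, loads embeddings from file
    ]
    for url in urls:
        response = fetch(url)
        # Close the HTTP response if the fetcher returned one
        close = getattr(response, "close", None)
        if close is not None:
            close()
    return urls
```

A scheduler such as cron could run `refresh("server.local", "localhost")` at a regular interval, where both host names are placeholders.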

.github/workflows

Set PORT arg when building docker image in the build workflow

2022-01-29 18:11:47 -05:00

config

Minor fix to notes jsonl file extension in sample_config.yml

2022-01-29 04:13:36 -05:00

src

Add /reload API to reload model embeddings and entries from file

2022-06-29 23:47:17 +04:00

tests

Normalize org notes path to be relative to home directory

2022-06-28 19:16:11 +04:00

views

Fix input text behavior for null/empty value fields

2021-12-04 10:45:48 -05:00

.dockerignore

Make Docker ignore unnecessary files

2022-06-29 22:29:34 +04:00

.gitignore

Improve test data organization and update corresponding conftests

2022-01-29 02:03:17 -05:00

demo.mp4

Add demo of semantic search to repository

2022-05-14 04:29:25 -04:00

docker-compose.yml

Mount embeddings to /data/embeddings for directory naming consistency

2022-01-29 03:24:02 -05:00

LICENSE

Add Readme, License. Update .gitignore

2021-08-15 22:52:37 -07:00

README.org

Show Demo of Semantic Search in Readme

2022-05-14 01:29:13 -07:00

README.org

Semantic Search


Allow natural language search on user content like notes, images, transactions using transformer ML models

Users can interface with semantic-search via the API or Emacs. All search is done locally*

Demo

Setup

1. Clone

  git clone https://github.com/debanjum/semantic-search && cd semantic-search

2. Configure

[Required] Update docker-compose.yml to mount your images, org-mode notes and beancount directories
[Optional] Edit application configuration in sample_config.yml

3. Run

docker-compose up -d

Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings

Use

Semantic Search via API
Semantic Search via Emacs
- Install semantic-search.el
- Run M-x semantic-search <user-query>
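
A query against the API might look like the sketch below. The README does not spell out the endpoint here, so the base URL `http://localhost:8000`, the `/search` path, the `q`, `t` and `n` parameter names and the JSON response are assumptions; check them against the running application before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the application listens on localhost port 8000
BASE_URL = "http://localhost:8000"


def search_url(query: str, content_type: str = "notes", count: int = 5,
               base_url: str = BASE_URL) -> str:
    """Build a search URL; the parameter names q, t and n are assumed."""
    params = urlencode({"q": query, "t": content_type, "n": count})
    return f"{base_url}/search?{params}"


def search(query: str, **kwargs):
    """Send the query and decode the response, assuming it is JSON."""
    with urlopen(search_url(query, **kwargs)) as response:
        return json.load(response)
```

For example, `search("what did I eat", content_type="notes")` would request the top notes matching that phrase, provided the server is up and the assumed endpoint matches.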

Upgrade

  docker-compose build

Troubleshooting

Symptom: Errors out with "Killed" in error message
- Fix: Increase RAM available to Docker Containers in Docker Settings
- Refer: StackOverflow Solution, Configure Resources on Docker for Mac
Symptom: Errors out complaining about Tensors mismatch, null etc
- Mitigation: Delete content-type > image section from docker_sample_config.yml

Miscellaneous

The experimental chat API endpoint uses the OpenAI API
- It is disabled by default
- To use it add your openai-api-key to config.yml

Development Setup

Setup on Local Machine

1. Install Dependencies

Install Python3 [Required]
Install Conda [Required]

Install Exiftool [Optional]

sudo apt-get -y install libimage-exiftool-perl

2. Install Semantic Search

git clone https://github.com/debanjum/semantic-search && cd semantic-search
conda env create -f config/environment.yml
conda activate semantic-search

3. Configure

Configure files/directories to search in content-type section of sample_config.yml
To run application on test data, update file paths containing /data/ to tests/data/ in sample_config.yml
- Example: replace /data/notes/*.org with tests/data/notes/*.org
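
The path update for test data can also be scripted rather than done by hand. The following is a minimal sketch that treats the config as plain text; the helper names and the `config/test_config.yml` output path are hypothetical, chosen only for illustration.

```python
import re
from pathlib import Path

# Match "/data/" only where it starts a path: not preceded by a word
# character, a dot or a tilde. This leaves "/home/me/data/" and an
# already rewritten "tests/data/" untouched.
_DATA_PREFIX = re.compile(r"(?<![\w.~])/data/")


def use_test_data(config_text: str) -> str:
    """Point file paths starting with /data/ at tests/data/ instead."""
    return _DATA_PREFIX.sub("tests/data/", config_text)


def rewrite_config(source: str, target: str) -> None:
    """Read the config at `source` and write the rewritten copy to `target`."""
    Path(target).write_text(use_test_data(Path(source).read_text()))
```

Calling `rewrite_config("config/sample_config.yml", "config/test_config.yml")` leaves the sample config unchanged; the application can then be started with `-c=config/test_config.yml`.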

4. Run

Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML

python3 -m src.main -c=config/sample_config.yml -vv

Upgrade On Local Machine

  cd semantic-search
  git pull origin master
  conda deactivate
  conda env update -f config/environment.yml
  conda activate semantic-search

Acknowledgments

MiniLM Model for Asymmetric Text Search. See SBert Documentation
OpenAI CLIP Model for Image Search. See SBert Documentation
Charles Cave for OrgNode Parser
Sven Marnach for PyExifTool

Languages

Python 51%

TypeScript 36.1%

CSS 4.1%

HTML 3.2%

Emacs Lisp 2.4%

Other 3.1%