klbr/khoj

mirror of https://github.com/khoaliber/khoj.git synced 2026-03-02 21:19:12 +00:00

Debanjum 589bfa9424 Run Explicit Filter on Entries, Embeddings before Semantic Search for Query

## Issue
  - Explicit filtering ran after the bi-encoder search
     but before re-ranking by the cross-encoder

  - This limited the quality of results for queries with explicit filters:
     the bi-encoder returned results that were then filtered out,
     leaving the cross-encoder to improve the few remaining results
     by re-ranking them to best match the query

## Fix
  - An entry and its embedding sit at the same index in their respective lists,
     so we know which entry maps to which embedding tensor.
     We can therefore run the filter for blocked and required words before the bi-encoder search,
     limiting the entries and embeddings considered for the current query
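
A minimal sketch of the pre-filtering described above, in plain Python. It is not the repository's implementation: the `+word` / `-word` query syntax, the function names, and the toy two-dimensional embeddings are assumptions made for illustration, and a real bi-encoder would supply the embedding tensors.

```python
import math


def parse_query(query):
    """Split a raw query into search text, required words and blocked words.

    Assumption: '+word' marks a required word and '-word' a blocked word.
    """
    required, blocked, terms = set(), set(), []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:].lower())
        elif token.startswith("-") and len(token) > 1:
            blocked.add(token[1:].lower())
        else:
            terms.append(token)
    return " ".join(terms), required, blocked


def explicit_filter(entries, embeddings, required, blocked):
    """Keep only the entries, and their embeddings, that pass the word filters.

    Relies on entries[i] and embeddings[i] describing the same item.
    """
    kept_entries, kept_embeddings = [], []
    for entry, embedding in zip(entries, embeddings):
        words = set(entry.lower().split())
        if required <= words and not (blocked & words):
            kept_entries.append(entry)
            kept_embeddings.append(embedding)
    return kept_entries, kept_embeddings


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_embedding, entries, embeddings, top_k=3):
    """Stand-in for the bi-encoder search over the already filtered entries."""
    scored = sorted(
        zip(entries, embeddings),
        key=lambda pair: cosine(query_embedding, pair[1]),
        reverse=True,
    )
    return [entry for entry, _ in scored[:top_k]]


# Toy data: hand-written vectors, not model output.
entries = ["buy milk today", "milk prices rose", "call the bank"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

text, required, blocked = parse_query("dairy +milk -prices")
filtered_entries, filtered_embeddings = explicit_filter(
    entries, embeddings, required, blocked
)
results = rank_by_similarity([1.0, 0.0], filtered_entries, filtered_embeddings)
```

Because the filter drops an entry and its embedding together, the two lists stay index-aligned, and the similarity ranking only ever sees candidates that can appear in the final results.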

## Result
  - Semantic search by the bi-encoder returns the most relevant results
     for the query, since none of them will be filtered out afterwards.
     The cross-encoder therefore shoulders less of the burden of improving the results

## Corollary
  - This pre-filtering technique lets us apply other explicit filters
     to the entries relevant to the current query before calling search
     - E.g. limit search to entries within the date/time range specified in the query
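
As a hedged illustration of the date example, the sketch below limits the candidates to a date range before any search runs. It assumes each entry is a dict with a `date` field and that the range has already been parsed from the query; both the entry shape and the `date_filter` name are hypothetical, not taken from the repository.

```python
from datetime import date


def date_filter(entries, embeddings, start, end):
    """Keep (entry, embedding) pairs whose date falls within [start, end]."""
    kept = [
        (entry, embedding)
        for entry, embedding in zip(entries, embeddings)
        if start <= entry["date"] <= end
    ]
    kept_entries = [entry for entry, _ in kept]
    kept_embeddings = [embedding for _, embedding in kept]
    return kept_entries, kept_embeddings


# Toy entries with hand-written placeholder embeddings.
entries = [
    {"text": "plan trip", "date": date(2022, 7, 1)},
    {"text": "book hotel", "date": date(2022, 7, 12)},
    {"text": "file taxes", "date": date(2022, 8, 1)},
]
embeddings = [[1.0], [2.0], [3.0]]

july_entries, july_embeddings = date_filter(
    entries, embeddings, date(2022, 7, 1), date(2022, 7, 31)
)
```

The filtered lists remain index-aligned, so they can be handed to the search step unchanged.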

2022-07-12 13:12:22 -07:00

| Name | Last commit | Date |
|------|-------------|------|
| .github/workflows | Run build on PR | 2022-07-04 18:09:47 -04:00 |
| config | Add specific version for Python packages and downgrade miniconda Docker image to potentially fix build issues | 2022-07-04 18:01:55 -04:00 |
| src | Run Explicit Filter on Entries, Embeddings before Semantic Search for Query | 2022-07-12 18:25:42 +04:00 |
| tests | Fix asymmetric search test to pass entries returned by query to collate_results | 2022-07-12 18:48:49 +04:00 |
| views | Fix input text behavior for null/empty value fields | 2021-12-04 10:45:48 -05:00 |
| .dockerignore | Make Docker ignore unnecessary files | 2022-06-29 22:29:34 +04:00 |
| .gitignore | Improve test data organization and update corresponding conftests | 2022-01-29 02:03:17 -05:00 |
| demo.mp4 | Add demo of semantic search to repository | 2022-05-14 04:29:25 -04:00 |
| docker-compose.yml | Correct syntax of memory limit in docker-compose.yml | 2022-07-06 20:07:11 -04:00 |
| Dockerfile | Add specific version for Python packages and downgrade miniconda Docker image to potentially fix build issues | 2022-07-04 18:01:55 -04:00 |
| LICENSE | Add Readme, License. Update .gitignore | 2021-08-15 22:52:37 -07:00 |
| README.org | Fix formatting for pytest command | 2022-07-08 10:18:26 -04:00 |

README.org

Semantic Search

Allow natural language search on user content like notes, images, transactions using transformer ML models

Users can interact with semantic-search via the API or Emacs. All search is done locally*

Demo

Setup

1. Clone

  git clone https://github.com/debanjum/semantic-search && cd semantic-search

2. Configure

[Required] Update docker-compose.yml to mount your images, org-mode notes and beancount directories
[Optional] Edit application configuration in sample_config.yml

3. Run

docker-compose up -d

Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings

Use

Semantic Search via API
Semantic Search via Emacs
- Install semantic-search.el
- Run M-x semantic-search <user-query>
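
For the API route, a request might look like the sketch below. The base URL, the `/search` path, and the `q` (query) and `t` (content type) parameters are assumptions, since this README does not spell out the endpoint; check the running server's API documentation for the actual route and parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(query, content_type, base_url="http://localhost:8000"):
    """Build a search URL. The path and parameter names are assumed."""
    params = urlencode({"q": query, "t": content_type})
    return f"{base_url}/search?{params}"


def query_api(query, content_type="notes", base_url="http://localhost:8000", timeout=10):
    """Send the query to a running server and decode the JSON response."""
    url = build_search_url(query, content_type, base_url)
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the container running, `query_api("how to bake bread")` would return the decoded JSON response, provided the assumed route matches the server.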

Run Unit tests

pytest

Upgrade

  docker-compose build --pull

Troubleshooting

Symptom: Errors out with "Killed" in error message
- Fix: Increase RAM available to Docker Containers in Docker Settings
- Refer: StackOverflow Solution, Configure Resources on Docker for Mac
Symptom: Errors out complaining about tensor mismatch, null values, etc.
- Mitigation: Delete content-type > image section from docker_sample_config.yml

Miscellaneous

The experimental chat API endpoint uses the OpenAI API
- It is disabled by default
- To use it, add your openai-api-key to config.yml

Development Setup

Setup on Local Machine

1. Install Dependencies

Install Python3 [Required]
Install Conda [Required]

Install Exiftool [Optional]

sudo apt-get -y install libimage-exiftool-perl

2. Install Semantic Search

git clone https://github.com/debanjum/semantic-search && cd semantic-search
conda env create -f config/environment.yml
conda activate semantic-search

3. Configure

Configure the files/directories to search in the content-type section of sample_config.yml
To run the application on test data, update file paths containing /data/ to tests/data/ in sample_config.yml
- Example: replace /data/notes/*.org with tests/data/notes/*.org
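
The path substitution above can also be scripted. This is a small sketch, not part of the project: the helper names and the `config/test_config.yml` target in the usage note are hypothetical. It rewrites only paths that start at `/data/`, so text that already points at `tests/data/` is left unchanged.

```python
import re
from pathlib import Path

# Match "/data/" only when it starts a path, i.e. it is not preceded by
# a word character or dot (as it is in "tests/data/").
_DATA_PATH = re.compile(r"(?<![\w.])/data/")


def use_test_data(config_text):
    """Point /data/ paths in the config text at tests/data/ instead."""
    return _DATA_PATH.sub("tests/data/", config_text)


def write_test_config(source, target):
    """Write a copy of the config at source with test data paths to target."""
    text = Path(source).read_text(encoding="utf-8")
    Path(target).write_text(use_test_data(text), encoding="utf-8")
```

For example, `write_test_config("config/sample_config.yml", "config/test_config.yml")` leaves the sample file untouched and produces a copy that can be passed to the application with `-c`.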

4. Run

Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML

python3 -m src.main -c=config/sample_config.yml -vv

Upgrade On Local Machine

  cd semantic-search
  git pull origin master
  conda deactivate
  conda env update -f config/environment.yml
  conda activate semantic-search

Acknowledgments

MiniLM Model for Asymmetric Text Search. See SBert Documentation
OpenAI CLIP Model for Image Search. See SBert Documentation
Charles Cave for OrgNode Parser
Sven Marnach for PyExifTool

Languages

Python 51%, TypeScript 36.1%, CSS 4.1%, HTML 3.2%, Emacs Lisp 2.4%, Other 3.1%