- Issue
- Explicit filtering was previously done after the bi-encoder search
but before re-ranking by the cross-encoder
- This limited the quality of the results returned, because the
bi-encoder returned results that were then going to be excluded. The
burden of improving the limited results left after filtering fell on
the cross-encoder, which re-ranked them against the query
- Fix
- The embedding for an entry sits at the same index as the entry in
their respective lists, so we can run the blocked and required word
filters before the bi-encoder search and limit the entries and
embeddings considered for the current query
- Result
- Semantic search by the bi-encoder now returns the most relevant
results for the query, since none of them will be filtered out
afterwards. The cross-encoder shoulders less of the burden of
improving results
- Corollary
- This pre-filtering technique allows us to apply other explicit
filters to the entries relevant for the current query
- E.g. limit search to entries within the date/time specified in the query
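The pre-filtering idea described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (prefilter, cosine, search), the whitespace word matching and the hand-written 2-d vectors are all assumptions made for the example, and a plain cosine-similarity ranking stands in for the bi-encoder search.

```python
import math


def prefilter(entries, embeddings, required=(), blocked=()):
    """Keep only (entry, embedding) pairs that pass the explicit word filters.

    Relies on entries[i] and embeddings[i] describing the same entry.
    Hypothetical helper, written only for this illustration.
    """
    required = {word.lower() for word in required}
    blocked = {word.lower() for word in blocked}
    kept_entries, kept_embeddings = [], []
    for entry, embedding in zip(entries, embeddings):
        words = set(entry.lower().split())
        # Keep the entry if it has every required word and no blocked word.
        if required <= words and not (blocked & words):
            kept_entries.append(entry)
            kept_embeddings.append(embedding)
    return kept_entries, kept_embeddings


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_embedding, entries, embeddings, top_k=2):
    """Stand-in for the bi-encoder search: rank entries by cosine similarity."""
    scored = sorted(
        zip(entries, embeddings),
        key=lambda pair: cosine(query_embedding, pair[1]),
        reverse=True,
    )
    return [entry for entry, _ in scored[:top_k]]


# Toy data: hand-written 2-d vectors stand in for real model embeddings.
entries = ["buy milk today", "milk prices report", "call the bank"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

# Filter first, then search only the surviving entries and embeddings.
kept_entries, kept_embeddings = prefilter(entries, embeddings, blocked=["today"])
results = search([1.0, 0.0], kept_entries, kept_embeddings, top_k=2)
```

Because the filter runs before the search, every one of the top_k slots is spent on an entry that can actually be returned. Other explicit filters, such as a date range, could presumably be added as further conditions inside prefilter in the same way.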
Semantic Search
Allow natural language search on user content like notes, images, transactions using transformer ML models
Users can interact with semantic-search via the API or Emacs. All search is done locally*
Demo
Setup
1. Clone
git clone https://github.com/debanjum/semantic-search && cd semantic-search
2. Configure
- [Required] Update docker-compose.yml to mount your images, org-mode notes and beancount directories
- [Optional] Edit application configuration in sample_config.yml
3. Run
docker-compose up -d
Note: The first run will take time. Let it run; it's most likely not hung, just generating embeddings
Use
- Semantic Search via API
- Semantic Search via Emacs
  - Install semantic-search.el
  - Run
    M-x semantic-search <user-query>
Upgrade
docker-compose build
Troubleshooting
- Symptom: Errors out with "Killed" in error message
  - Fix: Increase RAM available to Docker Containers in Docker Settings
  - Refer: StackOverflow Solution, Configure Resources on Docker for Mac
- Symptom: Errors out complaining about Tensors mismatch, null etc
  - Mitigation: Delete content-type > image section from docker_sample_config.yml
Miscellaneous
- The experimental chat API endpoint uses the OpenAI API
  - It is disabled by default
  - To use it, add your openai-api-key to config.yml
Development Setup
Setup on Local Machine
1. Install Dependencies
- Install Python3 [Required]
- Install Conda [Required]
- Install Exiftool [Optional]
  sudo apt-get -y install libimage-exiftool-perl
2. Install Semantic Search
git clone https://github.com/debanjum/semantic-search && cd semantic-search
conda env create -f config/environment.yml
conda activate semantic-search
3. Configure
- Configure files/directories to search in the content-type section of sample_config.yml
- To run application on test data, update file paths containing /data/ to tests/data/ in sample_config.yml
  - Example: replace /data/notes/*.org with tests/data/notes/*.org
4. Run
Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML
python3 -m src.main -c=config/sample_config.yml -vv
Upgrade On Local Machine
cd semantic-search
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate semantic-search
Acknowledgments
- MiniLM Model for Asymmetric Text Search. See SBert Documentation
- OpenAI CLIP Model for Image Search. See SBert Documentation
- Charles Cave for OrgNode Parser
- Sven Marnach for PyExifTool