[[https://github.com/debanjum/khoj/actions/workflows/test.yml/badge.svg]] [[https://github.com/debanjum/khoj/actions/workflows/build.yml/badge.svg]]

* Khoj
/A natural language search engine for your personal notes, transactions and images/

** Table of Contents
- [[https://github.com/debanjum/khoj#Features][Features]]
- [[https://github.com/debanjum/khoj#Demo][Demo]]
  - [[https://github.com/debanjum/khoj#Description][Description]]
  - [[https://github.com/debanjum/khoj#Analysis][Analysis]]
- [[https://github.com/debanjum/khoj#Architecture][Architecture]]
- [[https://github.com/debanjum/khoj#Setup][Setup]]
  - [[https://github.com/debanjum/khoj#Clone][Clone]]
  - [[https://github.com/debanjum/khoj#Configure][Configure]]
  - [[https://github.com/debanjum/khoj#Run][Run]]
- [[https://github.com/debanjum/khoj#Use][Use]]
- [[https://github.com/debanjum/khoj#Upgrade][Upgrade]]
- [[https://github.com/debanjum/khoj#Troubleshooting][Troubleshooting]]
- [[https://github.com/debanjum/khoj#Miscellaneous][Miscellaneous]]
- [[https://github.com/debanjum/khoj#Development-setup][Development Setup]]
  - [[https://github.com/debanjum/khoj#Setup-on-local-machine][Setup on Local Machine]]
  - [[https://github.com/debanjum/khoj#Upgrade-on-local-machine][Upgrade on Local Machine]]
  - [[https://github.com/debanjum/khoj#Run-unit-tests][Run Unit Tests]]
- [[https://github.com/debanjum/khoj#Performance][Performance]]
  - [[https://github.com/debanjum/khoj#Query-performance][Query Performance]]
  - [[https://github.com/debanjum/khoj#Indexing-performance][Indexing Performance]]
  - [[https://github.com/debanjum/khoj#Miscellaneous-1][Miscellaneous]]
- [[https://github.com/debanjum/khoj#Acknowledgments][Acknowledgments]]

** Features
- *Natural*: Advanced Natural language understanding using Transformer based ML Models
- *Local*: Your personal data stays local.
  All search, indexing is done on your machine[[https://github.com/debanjum/khoj#miscellaneous][*]]
- *Incremental*: Incremental search for a fast, search-as-you-type experience
- *Pluggable*: Modular architecture makes it relatively easy to plug in new data sources, frontends and ML models
- *Multiple Sources*: Search your Org-mode and Markdown notes, Beancount transactions and Photos
- *Multiple Interfaces*: Search using a [[./src/interface/web/index.html][Web Browser]], [[./src/interface/emacs/khoj.el][Emacs]] or the [[http://localhost:8000/docs][API]]

** Demo
https://user-images.githubusercontent.com/6413477/181664862-31565b0a-0e64-47e1-a79a-599dfc486c74.mp4

*** Description
- User searches for "/Setup editor/"
- The demo looks for the most relevant section in this readme and the [[https://github.com/debanjum/khoj/tree/master/src/interface/emacs][khoj.el readme]]
- Top result is what we are looking for, the [[https://github.com/debanjum/khoj/tree/master/src/interface/emacs#installation][section to Install Khoj.el on Emacs]]

*** Analysis
- The results do not have any words used in the query
  - /Based on the top result it seems the re-ranking model understands that Emacs is an editor?/
- The results incrementally update as the query is entered
- The results are re-ranked, for better accuracy, once user is idle

** Architecture
[[https://github.com/debanjum/khoj/blob/master/docs/khoj_architecture.png]]

** Setup
*** 1. Clone
#+begin_src shell
git clone https://github.com/debanjum/khoj && cd khoj
#+end_src

*** 2. Configure
- *Required*: Update [[./docker-compose.yml][docker-compose.yml]] to mount your images, (org-mode or markdown) notes and beancount directories
- *Optional*: Edit application configuration in [[./config/sample_config.yml][sample_config.yml]]

*** 3. Run
#+begin_src shell
docker-compose up -d
#+end_src

/Note: The first run will take time.
Let it run; it is most likely not hung, just generating embeddings/

** Use
- *Khoj via Web*
  - Go to [[http://localhost:8000/]] or open [[./src/interface/web/index.html][index.html]] in your browser
- *Khoj via Emacs*
  - [[https://github.com/debanjum/khoj/tree/master/src/interface/emacs#installation][Install]] [[./src/interface/emacs/khoj.el][khoj.el]]
  - Run ~M-x khoj~
- *Khoj via API*
  - See [[http://localhost:8000/docs][Khoj FastAPI Docs]]
  - [[http://localhost:8000/search?q=%22what%20is%20the%20meaning%20of%20life%22][Query]]
  - [[http://localhost:8000/regenerate?t=ledger][Regenerate Embeddings]]
  - [[http://localhost:8000/ui][Configure Application]]

** Upgrade
#+begin_src shell
docker-compose build --pull
#+end_src

** Troubleshooting
- Symptom: Errors out with "Killed" in error message
  - Fix: Increase RAM available to Docker Containers in Docker Settings
  - Refer: [[https://stackoverflow.com/a/50770267][StackOverflow Solution]], [[https://docs.docker.com/desktop/mac/#resources][Configure Resources on Docker for Mac]]
- Symptom: Errors out complaining about Tensors mismatch, null etc
  - Mitigation: Delete content-type > image section from docker_sample_config.yml

** Miscellaneous
- The experimental [[http://localhost:8000/chat][chat]] API endpoint uses the [[https://openai.com/api/][OpenAI API]]
  - It is disabled by default
  - To use it, add your ~openai-api-key~ to config.yml

** Development Setup
*** Setup on Local Machine
**** 1. Install Dependencies
1. Install Python3 [Required]
2. [[https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html][Install Conda]] [Required]
3. Install Exiftool [Optional]
   #+begin_src shell
   sudo apt-get -y install libimage-exiftool-perl
   #+end_src

**** 2. Install Khoj
#+begin_src shell
git clone https://github.com/debanjum/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj
#+end_src

**** 3. Configure
- Configure the files and directories to search in the ~content-type~ section of ~sample_config.yml~
- To run the application on test data, update file paths containing ~/data/~ to ~tests/data/~ in ~sample_config.yml~
  - Example: replace ~/data/notes/*.org~ with ~tests/data/notes/*.org~

**** 4. Run
Load the ML model, generate embeddings and expose the API to query the notes, images, transactions etc. specified in the config YAML

#+begin_src shell
python3 -m src.main -c=config/sample_config.yml -vv
#+end_src

*** Upgrade On Local Machine
#+begin_src shell
cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj
#+end_src

*** Run Unit Tests
#+begin_src shell
pytest
#+end_src

** Performance
*** Query performance
- Semantic search using the bi-encoder is fairly fast at <5 ms
- Reranking using the cross-encoder is slower at <2 s on 15 results. Tweak ~top_k~ to trade off speed for accuracy of results
- Applying explicit filters is currently very slow at about 6 s. This is because the filters are rudimentary. Considerable speed-ups can be achieved using indexes etc.

*** Indexing performance
- Indexing is more strongly impacted by the size of the source data
- Indexing a 100K+ line corpus of notes takes 6 minutes
- Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
- Once https://github.com/debanjum/khoj/issues/36 is implemented, it should only take this long on first run

*** Miscellaneous
- Testing done on a Mac M1 and a >100K line corpus of notes
- Search, indexing on a GPU has not been tested yet

** Acknowledgments
- [[https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1][Multi-QA MiniLM Model]], [[https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2][All MiniLM Model]] for Text Search. See [[https://www.sbert.net/examples/applications/retrieve_rerank/README.html][SBert Documentation]]
- [[https://github.com/openai/CLIP][OpenAI CLIP Model]] for Image Search.
  See [[https://www.sbert.net/examples/applications/image-search/README.html][SBert Documentation]]
- Charles Cave for [[http://members.optusnet.com.au/~charles57/GTD/orgnode.html][OrgNode Parser]]
- [[https://mooz.github.io/org-js/][Org.js]] to render Org-mode results on the Web interface
- [[https://github.com/markdown-it/markdown-it][Markdown-it]] to render Markdown results on the Web interface
- Sven Marnach for [[https://github.com/smarnach/pyexiftool/blob/master/exiftool.py][PyExifTool]]
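
** Example: Calling the API from a Shell
The endpoints linked under /Khoj via API/ can also be called from a terminal. The block below is a minimal sketch and not part of Khoj: the function names ~khoj_search~ and ~khoj_regenerate~ and the variables ~KHOJ_URL~ and ~CURL~ are hypothetical. It assumes the server is listening on ~localhost:8000~ and that ~curl~ is installed. Only the ~/search?q=~ and ~/regenerate?t=~ paths are taken from the links in this readme.

#+begin_src shell
# Base URL of the Khoj server (assumption: the default address used in this readme).
KHOJ_URL="${KHOJ_URL:-http://localhost:8000}"
# Command used to send requests; override it to log requests instead of sending them.
CURL="${CURL:-curl}"

# Search for a query. -G moves the data into the URL query string and
# --data-urlencode percent-encodes it, so spaces and quotes are safe.
khoj_search() {
    "$CURL" -sG "$KHOJ_URL/search" --data-urlencode "q=$1"
}

# Regenerate embeddings for one content type, e.g. ledger.
khoj_regenerate() {
    "$CURL" -s "$KHOJ_URL/regenerate?t=$1"
}
#+end_src

After sourcing these definitions, ~khoj_search "what is the meaning of life"~ sends the same kind of request as the /Query/ link, and ~khoj_regenerate ledger~ mirrors the /Regenerate Embeddings/ link. Letting ~curl~ do the encoding avoids hand-writing ~%20~ and ~%22~ escapes in the URL.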