Debanjum a7b4d58865 Fix Image Search and Improve Desktop App
### Fix Image Search
  - Do not use XMP metadata by default for image search
    - It seems to be buggy currently. The returned results do not make sense with XMP metadata enabled

### Fix Image Search using Desktop App
  - Fix configuring Image Search via Desktop GUI
    - Set `input-directories`, instead of unused `input-files` for `content-type.image` in `khoj.yml`
  - Fix running Image Search via Desktop apps
    - Previously the `transformers` package wasn't getting packaged into the app by PyInstaller
    - Image search requires it to run, so the desktop apps would fail to start when image search was enabled
    - Resolves #68
  - Append selected files, directories via "Add" button in Desktop GUI
    - This allows selecting multiple files, directories using Desktop GUI
    - Previously, multiple image directories had to be entered manually

### Improve Desktop App
  - Show Splash Screen to Desktop on App Initialization
    - The app takes a while to load during first run 
    - A splash screen signals that the app is loading, not unresponsive
    - Note: _Pyinstaller only supports splash screens on Windows, Linux. Not on Macs._
  - Add Khoj icon to the Windows, Linux app. Windows expects a `.ico` icon type
  - Only exclude `libtorch_{cuda, cpu, python}` on Linux machine
    - Seems those libraries are being used on Mac (and maybe Windows). 
    - Linux is where removing these reduces the app size the most anyway
  - Fix PyInstaller Warnings on App Start
    - The warnings show up as annoying error popups on Windows
2022-08-19 17:37:09 +00:00
2021-08-15 22:52:37 -07:00

Khoj 🦅

A natural language search engine for your personal notes, transactions and images

Features

  • Natural: Advanced natural language understanding using Transformer based ML Models
  • Local: Your personal data stays local. All search, indexing is done on your machine*
  • Incremental: Incremental search for a fast, search-as-you-type experience
  • Pluggable: Modular architecture makes it easy to plug in new data sources, frontends and ML models
  • Multiple Sources: Search your Org-mode and Markdown notes, Beancount transactions and Photos
  • Multiple Interfaces: Search using a Web Browser, Emacs or the API

Demo

https://user-images.githubusercontent.com/6413477/184735169-92c78bf1-d827-4663-9087-a1ea194b8f4b.mp4

Description

  • Install Khoj via pip
  • Start Khoj app
  • Add this readme and khoj.el readme as org-mode for Khoj to index
  • Search "Setup editor" on the Web and Emacs. Re-rank the results for better accuracy
  • Top result is what we are looking for, the section to Install Khoj.el on Emacs

Analysis

  • The results do not have any words used in the query
    • Based on the top result it seems the re-ranking model understands that Emacs is an editor?
  • The results incrementally update as the query is entered
  • The results are re-ranked, for better accuracy, once user hits enter

Interfaces

Architecture

Setup

1. Install

pip install khoj-assistant

2. Start App

khoj

3. Configure

  1. Enable content types and point to files to search in the First Run Screen that pops up on app start
  2. Click configure and wait. The app will load the ML model, generate embeddings and expose the search API

Use

Upgrade

pip install --upgrade khoj-assistant

Troubleshoot

  • Symptom: Errors out complaining about Tensors mismatch, null etc
    • Mitigation: Disable image search on the desktop GUI
  • Symptom: Errors out with "Killed" in error message in Docker

Miscellaneous

  • The beta chat and search API endpoints use the OpenAI API
    • They are disabled by default
    • To use them, add your openai-api-key via the app configure screen
    • Warning: If you use the above beta APIs, your query and top result(s) will be sent to OpenAI for processing

Performance

Query performance

  • Semantic search using the bi-encoder is fairly fast at <50 ms
  • Reranking using the cross-encoder is slower at <2s on 15 results. Tweak top_k to trade off speed against accuracy of results
  • Applying explicit filters is very slow currently at ~6s. This is because the filters are rudimentary. Considerable speed-ups can be achieved using indexes etc
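
The two query stages above can be sketched as follows. This is a minimal illustration of the retrieve-then-rerank pattern, not Khoj's actual code: the `search` and `cosine` helpers, the toy 2-dimensional embeddings and the `toy_rerank` scorer are all hypothetical stand-ins for the real bi-encoder and cross-encoder models. It shows why `top_k` controls the cost of the slow stage: the reranker is only ever called on the `top_k` best candidates.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, entries, rerank_score, top_k=15):
    """entries is a list of (text, embedding) pairs; returns reranked texts."""
    # Stage 1 (bi-encoder style): one cheap similarity per entry, keep the best top_k
    candidates = sorted(
        entries, key=lambda entry: cosine(query_vec, entry[1]), reverse=True
    )[:top_k]
    # Stage 2 (cross-encoder style): the expensive scorer runs on top_k texts only
    return sorted((text for text, _ in candidates), key=rerank_score, reverse=True)

# Toy data: hypothetical 2-d embeddings standing in for real model output
entries = [
    ("Beancount transactions", [0.1, 0.9]),
    ("Markdown notes", [0.7, 0.3]),
    ("Install Khoj.el on Emacs", [0.9, 0.1]),
]

rerank_calls = []

def toy_rerank(text):
    """Hypothetical stand-in for a cross-encoder score of (query, text)."""
    rerank_calls.append(text)
    return 1.0 if "Emacs" in text else 0.0

results = search([1.0, 0.0], entries, toy_rerank, top_k=2)
print(results)
```

Lowering `top_k` shrinks the number of reranker calls and so the latency, at the risk of dropping a relevant entry before the more accurate stage ever sees it.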

Indexing performance

  • Indexing is more strongly impacted by the size of the source data
  • Indexing 100K+ line corpus of notes takes 6 minutes
  • Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
  • Once https://github.com/debanjum/khoj/issues/36 is implemented, it should only take this long on first run
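
The source does not say how the linked issue makes later runs faster. One common approach, sketched here purely as an assumption, is to cache embeddings keyed by a hash of each entry's content, so only new or changed entries are re-embedded. The `update_index` function and the `fake_embed` callable are hypothetical names for illustration.

```python
import hashlib

def update_index(entries, cache, embed):
    """Build an index for entries, reusing cached embeddings where possible.

    cache maps content hash -> embedding and is mutated in place.
    Returns the index and the number of entries that had to be embedded.
    """
    index = {}
    recomputed = 0
    for text in entries:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            # Only new or changed content pays the embedding cost
            cache[key] = embed(text)
            recomputed += 1
        index[key] = cache[key]
    return index, recomputed

def fake_embed(text):
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(text))]

cache = {}
_, first_run = update_index(["note one", "note two"], cache, fake_embed)
_, second_run = update_index(["note one", "note two", "note three"], cache, fake_embed)
print(first_run, second_run)
```

In this sketch the first run embeds every entry, while the second run embeds only the newly added one.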

Miscellaneous

  • Testing done on a Mac M1 and a >100K line corpus of notes
  • Search, indexing on a GPU has not been tested yet

Development

Setup

Using Pip

1. Install
git clone https://github.com/debanjum/khoj && cd khoj
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
2. Configure
  • Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
  • Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
    • Set input-directories field in image content-type section
  • Delete content-type and processor sub-section(s) irrelevant for your use-case
3. Run
khoj -vv

Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML

4. Upgrade
# To Upgrade To Latest Stable Release
# Maps to the latest tagged version of khoj on master branch
pip install --upgrade khoj-assistant

# To Upgrade To Latest Pre-Release
# Maps to the latest commit on the master branch
pip install --upgrade --pre khoj-assistant

# To Upgrade To Specific Development Release.
# Useful to test, review a PR.
# Note: khoj-assistant is published to test PyPi on creating a PR
pip install -i https://test.pypi.org/simple/ khoj-assistant==0.1.5.dev57166025766
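
The Configure step above can be sanity-checked with a small script. The sketch below works on the configuration as a plain Python dict, as it might look after parsing `~/.khoj/khoj.yml`; it handles only the `content-type` section and leaves `processor` untouched. The key names `content-type`, `image`, `input-files`, `input-filter` and `input-directories` come from this readme. The `org` and `ledger` section names, the example paths and the helper functions are assumptions made for illustration.

```python
def prune_content_types(config, keep):
    """Return a copy of config with only the content-type sections named in keep."""
    pruned = dict(config)
    pruned["content-type"] = {
        name: section
        for name, section in config.get("content-type", {}).items()
        if name in keep
    }
    return pruned

def missing_inputs(config):
    """List content-type sections that do not point at anything to index."""
    problems = []
    for name, section in config.get("content-type", {}).items():
        if name == "image":
            # Image search reads directories, not individual files
            configured = bool(section.get("input-directories"))
        else:
            configured = bool(section.get("input-files") or section.get("input-filter"))
        if not configured:
            problems.append(name)
    return problems

# Hypothetical parsed config; section names other than "image" are assumed
config = {
    "content-type": {
        "org": {"input-files": ["~/notes/todo.org"], "input-filter": None},
        "ledger": {"input-files": None, "input-filter": None},
        "image": {"input-directories": ["~/photos"]},
    },
}

config = prune_content_types(config, keep={"org", "image"})
print(sorted(config["content-type"]), missing_inputs(config))
```

An image section that sets only `input-files` would be reported by `missing_inputs`, which matches the note above that the image content-type expects `input-directories`.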

Using Docker

1. Clone
git clone https://github.com/debanjum/khoj && cd khoj
2. Configure
  • Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes and beancount directories
  • Optional: Edit application configuration in khoj_docker.yml
3. Run
docker-compose up -d

Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings

4. Upgrade
docker-compose build --pull

Using Conda

1. Install Dependencies
  • Install Conda [Required]
  • Install Exiftool [Optional]
    sudo apt -y install libimage-exiftool-perl
    
2. Install Khoj
git clone https://github.com/debanjum/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj
3. Configure
  • Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
  • Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
    • Set input-directories field in image content-type section
  • Delete content-type, processor sub-sections irrelevant for your use-case
4. Run
python3 -m src.main -vv

Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML

5. Upgrade
cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj

Test

pytest

Credits
