- Previously heading entries were not indexed to maintain search quality
- However, there are use-cases for indexing entries with no body
- So add a configurable `index_heading_entries` field to index heading entries
- This `TextContentConfig` field is currently only used for OrgMode content
- Stop passing verbose flag around app methods
- Minor remap of verbosity levels to match python logging framework levels
- verbose = 0 maps to logging.WARN
- verbose = 1 maps to logging.INFO
- verbose >= 2 maps to logging.DEBUG
- Minor clean-up of app: unused modules, conversation file opening
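
The verbosity remap above can be sketched as below. This is a minimal illustration, not the app's actual code; the function name `verbosity_to_log_level` is hypothetical.

```python
import logging


def verbosity_to_log_level(verbose: int) -> int:
    """Map a verbosity count to a Python logging level.

    Hypothetical helper: 0 -> WARN, 1 -> INFO, 2 or more -> DEBUG.
    """
    if verbose <= 0:
        return logging.WARN  # alias of logging.WARNING
    if verbose == 1:
        return logging.INFO
    return logging.DEBUG


# Configure the root logger once instead of passing a verbose flag around
logging.basicConfig(level=verbosity_to_log_level(1))
```

With the level set once at startup, app methods can call `logging.getLogger(__name__)` rather than accept a verbose argument.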
- 5e6625a Fix file browser to not add empty line when no file/dir selected
- 8098b8c Bring main window to Top when opened from System Tray
- 1c122a8 Place window near top so buttons are not hidden by OS bottom bar
- dfe2546 Set Khoj Icon on Main Desktop Window
- 1b1f8f9 Move Splash screen text below icon. Set the text color to black
- 450f644 Fix path to remove shared libraries when packaging the Windows app
- When no file is selected in the file browser, an empty line/entry gets added
to the input entries list
- Bug was introduced by an incomplete update when changing the code to add
instead of insert
- Update is_none_or_empty helper method to also check for empty string
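
A minimal sketch of what such a helper might look like, assuming it should treat `None`, the empty string and empty containers alike. The body is illustrative and may differ from the app's implementation.

```python
def is_none_or_empty(item) -> bool:
    """Return True for None, an empty string, or an empty sized container."""
    if item is None:
        return True
    if isinstance(item, str):
        return item == ""
    try:
        return len(item) == 0
    except TypeError:
        # Objects without a length (e.g. numbers) are not considered empty
        return False


# Only keep the selected paths that are not empty
selected = ["notes.org", "", None]
entries = [path for path in selected if not is_none_or_empty(path)]
```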
- Note: Support for MPS in PyTorch is currently in v1.13.0 nightly builds
- Users will have to wait for PyTorch MPS support to land in stable builds
- Until then the code can be tweaked and tested to make use of the GPU
acceleration on newer Macs
- Pass device to load models onto from app state.
- SentenceTransformer models accept device to load models onto during initialization
- Pass device to load corpus embeddings onto from app state
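
One way to pick the device stored in app state is sketched below. The function name `select_device` is hypothetical, and the MPS check is guarded because older PyTorch builds do not ship the `torch.backends.mps` module.

```python
def select_device() -> str:
    """Pick the best available device name, falling back to the CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed

    if torch.cuda.is_available():
        return "cuda"

    # torch.backends.mps only exists in PyTorch builds with MPS support
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


device = select_device()
```

The resulting name can then be passed as the `device` argument when initializing a SentenceTransformer model.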
- CLIP Image score and XMP metadata score are not combining well.
When combined they give nonsensical results. Enable only once we
figure out how best to combine the two.
- Show scores with higher precision for image search
- Image search scores seem to mostly be between 0.2 - 0.3 for some reason
- Higher precision scores make it easier to understand the quality
of returned results perceived by the model itself
- Avoid having to pass the khoj_sample.yml data file into pip, native apps
- Packaging data files into python packages is annoying.
- There's `MANIFEST.in`, `data_files` and `package_data` in setup.py
- Bdist, wheel, generated source tarball use different set of these fields
and put the data files in different locations
- Rather, just hard-code the default config into a constant. This also
avoids pointless file reads
- Assume path is absolute in yaml util module while saving, loading file
- This follows the same convention as jsonl, which just operates on the
passed file path, assuming it is of appropriate form.
Responsibility to put it in appropriate form is on the caller, for now
- Include khoj_sample.yml in pip package to load default config from
- Create khoj config directory if it doesn't exist
- Load config from khoj_sample.yml if khoj.yml config doesn't exist
- Track current (saved/loaded) config separate from the new config (to
be written) when user clicks Start
- Fallback to using default config when no config for the specific
content type or processor is specified in khoj.yml
- Earlier, the default config was only loaded on first run, not on later runs
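
The fallback behaviour can be sketched as a per-entry merge. The section names, keys and paths below are placeholders for illustration, not the real contents of khoj_sample.yml.

```python
from copy import deepcopy
from typing import Any, Dict

# Hypothetical defaults; real values would come from khoj_sample.yml
DEFAULT_CONFIG: Dict[str, Dict[str, Any]] = {
    "content-type": {
        "org": {"compressed-jsonl": "org.jsonl.gz"},
        "markdown": {"compressed-jsonl": "markdown.jsonl.gz"},
    },
    "processor": {"conversation": {"conversation-logfile": "conversation.json"}},
}


def with_defaults(user_config: Dict[str, Any], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Use the default entry wherever the user config leaves one unset."""
    merged = deepcopy(defaults)
    for section, entries in user_config.items():
        for name, value in (entries or {}).items():
            if value is not None:
                merged.setdefault(section, {})[name] = value
    return merged


user_config = {"content-type": {"org": {"compressed-jsonl": "my.jsonl.gz"}, "markdown": None}}
config = with_defaults(user_config, DEFAULT_CONFIG)
```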
- Create Child CheckBox, LineEdit classes for Processor Widgets
- Create ProcessorType, similar to SearchType
- Track ProcessorType the widgets are associated with
- Simplify update, save, load of config based on type
- Make config_file an optional arg. It defaults to default khoj config dir
- Return args.config as None if no config_file explicitly passed by user
- Parent can use args.config = None as signal to trigger first run experience
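
A rough sketch of this behaviour with argparse follows. The `--config-file`/`-c` flag names, the default path and the stand-in for the parsed config are assumptions made for illustration.

```python
import argparse
from pathlib import Path
from typing import List, Optional

# Assumed default location; the real default config dir may differ
DEFAULT_CONFIG_FILE = Path("~/.khoj/khoj.yml")


def cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Sketch of an optional config file argument")
    parser.add_argument("--config-file", "-c", type=Path, default=None,
                        help="YAML file to configure the app")
    args = parser.parse_args(argv)

    if args.config_file is None:
        # Nothing passed explicitly: remember the default path, but leave
        # config as None so the caller can trigger the first run experience
        args.config_file = DEFAULT_CONFIG_FILE
        args.config = None
    else:
        # Stand-in for parsing the YAML file into a config object
        args.config = {"loaded_from": str(args.config_file)}
    return args


args = cli([])
first_run = args.config is None
```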
- main.py was becoming too big to manage. It had both
controllers/routers and component configurations (search, processors)
in it
- Now that the native app GUI code is also getting added to the main
path, it is a good time to split/modularize/clean main.py
- Put global state into a separate file to share across modules
- Test invalid config file path throws. Remove redundant cli test
- Simplify cli parser code
- No need to explicitly check whether args.config_file is set.
argparse checks for positional arguments automatically
- Use standard semantics for cli args
- All positional args are required. Non positional args are optional
- Improve command line --help description
- Add custom validator to throw if neither input_filter nor
input_<files|directories> are specified
- Set fields expecting paths to type Path
- Now that default_config isn't used in code, we can update
fields in rawconfig to specify whether they're required or not.
This lets pydantic validate config file and throw appropriate error
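
The validation rule can be illustrated with a standard-library stand-in. The commit itself relies on pydantic; the sketch below swaps in a dataclass with `__post_init__` so it runs without extra dependencies, and the class name `TextContentSketch` is hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class TextContentSketch:
    """Stand-in for a config model; not the app's pydantic class."""
    input_files: Optional[List[Path]] = None
    input_filter: Optional[str] = None

    def __post_init__(self) -> None:
        # Throw if neither input source is specified
        if not self.input_files and not self.input_filter:
            raise ValueError("Set at least one of input_files or input_filter")
        # Fields expecting paths are coerced to Path
        if self.input_files:
            self.input_files = [Path(path) for path in self.input_files]


content = TextContentSketch(input_files=["notes/todo.org"])
```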
- Reason
- Simplifies code. No merge_dict required
- 1 place for user to see all configurables, defaults and required values
- Details
- Remove default_config from code. Set defaults in khoj_sample.yml itself
- Keep fields required to be set by user empty in khoj_sample.yml
- Set defaults for fields not requiring configuration by user
- Setting up default compressed-jsonl, embeddings-file was only required
for org search_type, while org-files and org-filter were allowed to be
passed as command line argument
- This avoided having to set compressed-jsonl and embeddings-file via
command line argument as well for org search type
- Now that all search types are only configurable via config file, we
can default all search types to None. The default config for the
rest of the search types wasn't being used anyway
- Previously org-files were configurable via cmdline args,
whereas none of the other search types are
- This is an artifact of how the application grew
- It can be removed for better consistency and
equal preference given to all search types
- Reason:
Allow natural search on markdown based notes, documentation,
websites etc
- Details:
- Create markdown processor to extract Markdown entries (identified by
Heading) into standard jsonl format required by text_search
- Update API, Configs to support interfacing with new markdown type
- Update Emacs, Web clients to support interfacing with new markdown
type via API
- Update Readme to mention that markdown is also supported
Closes #35
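
The heading-based extraction described above can be sketched as follows. This is a simplified illustration: the function names and the `entry` key are assumptions, and the regex would also match `#` lines inside fenced code blocks.

```python
import json
import re
from typing import List

# A Markdown heading: one or more '#' at line start, followed by whitespace
HEADING = re.compile(r"^#+\s", re.MULTILINE)


def extract_markdown_entries(text: str) -> List[str]:
    """Split Markdown text into entries, each starting at a heading."""
    starts = [match.start() for match in HEADING.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[start:end].strip() for start, end in zip(starts, ends)]


def entries_to_jsonl(entries: List[str]) -> str:
    """Serialize entries as one JSON object per line."""
    return "".join(json.dumps({"entry": entry}) + "\n" for entry in entries)


sample = "# A\nbody a\n\n## B\nbody b\n"
entries = extract_markdown_entries(sample)
jsonl = entries_to_jsonl(entries)
```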
- The code for both the text search types was mostly the same
It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
text_search type
- This simplifies the app and makes it easier to process other
text types
Now that the logic to compile entries is in the processor layer, the
extract_entries method is standard across (text) search_types
Extract the load_jsonl method as a utility helper method.
Use it in (a)symmetric search types
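
A shared helper along these lines might look like the sketch below. It assumes the file is either plain or gzip-compressed jsonl, chosen by file suffix; the app's actual signature and behaviour may differ.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Load one JSON object per line from a plain or gzipped jsonl file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as jsonl_file:
        return [json.loads(line) for line in jsonl_file if line.strip()]


# Usage: write a small compressed sample, then read it back
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "entries.jsonl.gz"
    with gzip.open(sample, "wt", encoding="utf-8") as f:
        f.write('{"entry": "first"}\n{"entry": "second"}\n')
    entries = load_jsonl(sample)
```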
- The all-MiniLM-L6-v2 is more accurate
- The exact previous model isn't benchmarked, but based on the
performance of the closest model to it, the new model
may be similar in speed and size
- On very preliminary evaluation of the model, the new model seems
faster, with pretty decent results
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On Hugging Face it has far more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
- It doubles the encoding speed for all entries (time down from ~8 min to 4 min)
- It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)
[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers