- Assume path is absolute in yaml util module while saving, loading file
- This follows same convention as jsonl. Which just operates on
passed file path, assuming it is of appropriate form.
Responsibility to put it in appropriate form is on the caller, for now
- Make config_file an optional arg. It defaults to default khoj config dir
- Return args.config as None if no config_file explicitly passed by user
- Parent can use args.config = None as signal to trigger first run experience
- Test invalid config file path throws. Remove redundant cli test
- Simplify cli parser code
- Do not need to explicitly check if args.config_file set.
argparser checks for positional arguments automatically
- Use standard semantics for cli args
- All positional args are required. Non positional args are optional
- Improve command line --help description
- Reason
- Simplifies code. No merge_dict required
- 1 place for user to see all configurables, defaults and required values
- Details
- Remove default_config from code. Set defaults in khoj_sample.yml itself
- Keep fields required to be set by user as empty in khoj_sample to YAML
- Set defaults for fields not requiring configuration by user
- Setting up default compressed-jsonl, embeddings-file was only required
for org search_type, while org-files and org-filter were allowed to be
passed as command line argument
- This avoided having to set compressed-jsonl and embeddings-file via
command line argument as well for org search type
- Now that all search types are only configurable via config file, We
can default all search types to None. The default config for the
rest of the search types wasn't being used anyway
- Previously org-files were configurable via cmdline args.
Where as none of the other search types are
- This is an artifact of how the application grew
- It can be removed for better consistency and
equal preference given all search types
- The all-MiniLM-L6-v2 is more accurate
- The exact previous model isn't benchmarked but based on the
performance of the closest model to it. Seems like the new model
maybe similar in speed and size
- On very preliminary evaluation of the model, the new model seems
faster, with pretty decent results
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
- It doubles the encoding speed of all entries (down from ~8min to 4mins)
- It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)
[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
- Break the compute embeddings method into separate methods:
compute_image_embeddings and compute_metadata_embeddings
- If image_metadata_embeddings isn't defined, do not use it to enhance
search results. Given image_metadata_embeddings wouldn't be defined
if use_xmp_metadata is False, we can avoid unnecessary addition of
args to query method