klbr/khoj

mirror of https://github.com/khoaliber/khoj.git synced 2026-03-03 13:19:16 +00:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	17c38b526a	Default config for each search type to None - Setting up default compressed-jsonl, embeddings-file was only required for org search_type, while org-files and org-filter were allowed to be passed as command line arguments - This avoided having to set compressed-jsonl and embeddings-file via command line argument as well for org search type - Now that all search types are only configurable via config file, we can default all search types to None. The default config for the rest of the search types wasn't being used anyway	2022-07-31 22:23:57 +03:00
Debanjum Singh Solanky	b83021a723	Improve code readability of merge_dicts helper method	2022-07-31 22:07:56 +03:00
Debanjum Singh Solanky	38aede68f2	Only configure org via config file for consistency across search types - Previously org-files were configurable via cmdline args, whereas none of the other search types are - This is an artifact of how the application grew - It can be removed for better consistency and equal preference given to all search types	2022-07-31 22:02:03 +03:00
Debanjum Singh Solanky	65fea7681a	Rename notes search type to org search, now that markdown notes supported	2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky	d4d7dbaca6	Support Natural Search on Markdown Files - Reason: Allow natural search on markdown based notes, documentation, websites etc - Details: - Create markdown processor to extract Markdown entries (identified by Heading) into standard jsonl format required by text_search - Update API, Configs to support interfacing with new markdown type - Update Emacs, Web clients to support interfacing with new markdown type via API - Update Readme to mention markdown is also supported Closes #35	2022-07-21 22:07:05 +04:00
Debanjum Singh Solanky	0602d018c0	Merge Symmetric, Asymmetric Search Types into a single Text Search Type - The code for both the text search types was mostly the same. It was earlier done this way for expedience while experimenting - The minor differences were reconciled and merged into a single text_search type - This simplifies the app and makes it easier to process other text types	2022-07-21 21:19:52 +04:00
Debanjum Singh Solanky	0917f1574d	Consolidate jsonl helper methods in a single file under utils module	2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky	de726c4b6c	Minor fixes to unused installer utility script	2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky	5aad297286	Reuse logic to extract entries across symmetric, asymmetric search Now that the logic to compile entries is in the processor layer, the extract_entries method is standard across (text) search_types Extract the load_jsonl method as a utility helper method. Use it in (a)symmetric search types	2022-07-21 02:53:18 +04:00
Debanjum Singh Solanky	6c9ffdba57	Allow indexing multiple image directories for image search	2022-07-20 02:56:01 +04:00
Debanjum Singh Solanky	732b2d287f	Give the project a short, less generic name. Rename it to Khoj - Semantic Search was just a placeholder used to test the idea out Didn't want to get into naming at that point of time	2022-07-19 18:26:16 +04:00
Debanjum Singh Solanky	989526ae54	Use a more accurate model for symmetric semantic search - The all-MiniLM-L6-v2 is more accurate - The exact previous model isn't benchmarked but based on the performance of the closest model to it, the new model may be similar in speed and size - On very preliminary evaluation of the model, the new model seems faster, with pretty decent results	2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky	4a90972e38	Use a better model for asymmetric semantic search - The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1] - It has the right mix of model query speed, size and performance on benchmarks - On hugging face it has way more downloads and likes than the msmarco model[2] - On very preliminary evaluation of the model - It doubles the encoding speed of all entries (down from ~8min to 4mins) - It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier) [1]: https://www.sbert.net/docs/pretrained_models.html [2]: https://huggingface.co/sentence-transformers	2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky	f5d6d1e752	Tiny style fix to separate functions by 2 newlines	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	3d8a07f252	Extract empty line escape sequences var into constants file for reuse	2022-02-27 19:01:49 -05:00
Saba	33bc62dc19	Fix type of use_xmp_metadata to be bool, rather than str	2022-01-24 21:53:26 -05:00
Debanjum Singh Solanky	179153dc5a	Rename RawConfig Types for Consistency - Naming convention - [ContentType][ConfigType]Config - Where [ConfigType] ~ Content, Search, Processor - Where [ContentType] ~ Text, Image, Asymmetric, Symmetric, Conversation - Current Configs: - Content: - Org Notes - Org Music - Image - Ledger/Beancount - Search: - Asymmetric - Symmetric - Image - Processor: - Conversation	2022-01-14 20:54:38 -05:00
Debanjum Singh Solanky	c64e0c2965	Load model from HuggingFace if model_directory unset in config YAML - Do not save/load the model to/from disk when model_directory unset in config.yml - Add symmetric search default config to cli.py	2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky	510faa1904	Save Image Search Model to Disk	2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky	934ec233b0	Add Search Config for Symmetric Model. Save Model to Disk	2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky	b63026d97c	Save Asymmetric Search Model to Disk - Improve application load time - Remove dependence on internet to startup application and perform semantic search	2022-01-14 17:36:27 -05:00
Debanjum Singh Solanky	ea28897cdd	Remove deprecated conversation_history field from config	2022-01-12 10:35:52 -05:00
Debanjum Singh Solanky	6dc2a99d35	Merge branch 'master' of github.com:debanjum/semantic-search into add-summarize-capability-to-chat-bot - Fix openai_api_key being set in ConfigProcessorConfig - Merge addition of config UI and config instantiation updates	2021-12-20 13:30:42 +05:30
Debanjum Singh Solanky	65da7daf1f	Load, Save Conversation Session Summaries to Log. s/chat_log/chat_session Conversation logs structure now has session info too instead of just chat info Session info will allow loading past conversation summaries as context for AI in new conversations { "session": [ { "summary": <chat_session_summary>, "session-start": <session_start_index_in_chat_log>, "session-end": <session_end_index_in_chat_log> }], "chat": [ { "intent": <intent-object> "trigger-emotion": <emotion-triggered-by-message> "by": <AI\|Human> "message": <chat_message> "created": <message_created_date> }] }	2021-12-15 10:17:07 +05:30
Saba	d65190c3ee	Update unit tests, files with removing model suffix to config types	2021-12-09 08:50:38 -05:00
Saba	76e9e9da2f	Update unit tests to use the new BaseModel types	2021-12-05 09:31:39 -05:00
Saba	10e4065e05	Consolidate the search config models and pass verbose as a top level flag	2021-12-04 11:43:48 -05:00
Saba	43e647835b	Append Model Suffix to config models	2021-12-04 10:51:21 -05:00
Saba	e068968b35	Update imports for raw config models in config.py	2021-12-04 10:44:55 -05:00
Saba	4d6284b0af	Remove Test suffix from Config models	2021-12-04 10:44:13 -05:00
Saba	7ca4fc3453	Resolve merge conflicts with updated processor conversation data model	2021-11-28 16:22:52 -05:00
Saba	87a6c2d716	Use parse_obj instead of parse_raw as incoming data is in dict	2021-11-28 14:34:32 -05:00
Saba	da52433d89	Update to re-use the raw config base models in config.py as well	2021-11-28 11:57:33 -05:00
Saba	6292fe4481	Update to re-use the raw config base models in config.py as well	2021-11-28 11:57:13 -05:00
Saba	311c4b7e7b	Working API request body parsing to /post config!	2021-11-28 11:16:33 -05:00
Saba	66183cc298	Working API request body parsing to /post config!	2021-11-28 11:12:26 -05:00
Debanjum Singh Solanky	67c3cd7372	Wire up GPT understand method to /chat API. Log conversation metadata too	2021-11-28 00:04:39 +05:30
Debanjum Singh Solanky	a99b4b3434	Make conversation processor configurable	2021-11-27 18:12:01 +05:30
Debanjum Singh Solanky	c47a8cdf16	Allow configuring host, port or unix socket of server via CLI	2021-10-02 16:16:33 -07:00
Debanjum Singh Solanky	d2905c4be6	Move tests out to project root. Use absolute import in project - tests/ directory in project root is more standard. Just had to use absolute path for internal module imports to get it to work	2021-09-30 04:12:14 -07:00
Debanjum Singh Solanky	d5597442f4	Modularize Code. Wrap Search, Model Config in Classes. Add Tests Details - Rename method query_* to query in search_types for standardization - Wrapping Config code in classes simplified mocking test config - Reduce args being passed to a function by passing it as single argument wrapped in a class - Minimize setup in main.py:__main__. Put most of it into functions These functions can be mocked if required in tests later too Setup Flow: CLI_Args\|Config_YAML -> (Text\|Image)SearchConfig -> (Text\|Image)SearchModel	2021-09-30 02:04:04 -07:00
Debanjum Singh Solanky	f4dd9cd117	Use type specific model for other search types too. Expose them via SearchModels - Wrap Image, Music, Ledger search into the type of SearchModel they use Similar to what was done for notes model by wrapping its config into an AsymmetricSearchModel. - Use the uber wrapper class to expose all type specific search models	2021-09-29 21:09:42 -07:00
Debanjum Singh Solanky	e22e0b41e3	Wrap asymmetric search model into SearchModels. Test notes search end-to-end - Wrap asymmetric search model parameters into AsymmetricSearchModel class - Create wrapper for all search type models. Put notes search model into it - Test notes search end-to-end from client API layer to results. Use model built on test data	2021-09-29 20:47:35 -07:00
Debanjum Singh Solanky	cde11a2331	Wrap search type enablement status in a search settings class - Cleaner, more idiomatic usage of a global variable - Simplifies mocking when testing client in pytest as setting wrapped in object rather than a simple type. So passed around by reference	2021-09-29 19:18:33 -07:00
Debanjum Singh Solanky	81ce0cacc3	Only allow supported search types to /search, /regenerate APIs - Use a SearchType to limit types that can be passed by user - FastAPI automatically validates type passed in query param - Available type options show up in Swagger UI, FastAPI docs - controller code looks neater instead of doing string comparisons for type - Test invalid, valid search types via pytest	2021-09-29 19:12:56 -07:00
Debanjum Singh Solanky	169ddcc8c6	Make Using XMP Metadata to Enhance Image Search Optional, Configurable - Break the compute embeddings method into separate methods: compute_image_embeddings and compute_metadata_embeddings - If image_metadata_embeddings isn't defined, do not use it to enhance search results. Given image_metadata_embeddings wouldn't be defined if use_xmp_metadata is False, we can avoid unnecessary addition of args to query method	2021-09-16 12:01:05 -07:00
Debanjum Singh Solanky	3afe054312	Make image batch size to encode configurable via config.yml	2021-09-16 10:52:31 -07:00
Debanjum Singh Solanky	d8abbc0552	Use XMP metadata in images to improve image search - Details - The CLIP model can represent images, text in the same vector space - Enhance CLIP's image understanding by augmenting the plain image with its text based metadata. Specifically with any subject, description XMP tags on the image - Improve results by combining plain image similarity score with metadata similarity scores for the highest ranked images - Minor Fixes - Convert verbose to integer from bool in image_search. It's already passed as integer from the main program entrypoint - Process images with ".jpeg" extensions too	2021-09-16 08:55:20 -07:00
Debanjum Singh Solanky	0263d4d068	Enable semantic search for songs in org-music Org-Music: https://github.com/debanjum/org-music	2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky	fd7888f3d4	Resolve relative file paths to config YAML file in cli.py	2021-08-29 03:03:37 -07:00
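
Commit 65da7daf1f above describes a conversation log with a "session" list of summaries alongside the "chat" list of messages. The following is a minimal sketch of that structure, not the project's actual code: the key names come from the commit message, while the helper names (`new_conversation_log`, `add_message`, `close_session`), the ISO timestamp format, and the half-open interpretation of "session-start"/"session-end" as indexes into the chat list are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def new_conversation_log():
    """Return an empty log with the two top-level keys named in the commit message."""
    return {"session": [], "chat": []}


def add_message(log, by, message, intent=None, trigger_emotion=None):
    """Append one chat entry; `by` must be "AI" or "Human"."""
    if by not in ("AI", "Human"):
        raise ValueError(f"unexpected speaker: {by!r}")
    log["chat"].append({
        "intent": intent,
        "trigger-emotion": trigger_emotion,
        "by": by,
        "message": message,
        # Assumption: the commit does not specify a date format; ISO 8601 is used here.
        "created": datetime.now(timezone.utc).isoformat(),
    })


def close_session(log, summary):
    """Record a summary covering the chat entries added since the previous session.

    Assumption: start is inclusive and end is exclusive, both indexing into log["chat"].
    """
    start = log["session"][-1]["session-end"] if log["session"] else 0
    end = len(log["chat"])
    log["session"].append({
        "summary": summary,
        "session-start": start,
        "session-end": end,
    })


def to_json(log):
    """Serialize the log; every value used above is JSON-compatible."""
    return json.dumps(log, indent=2)
```

Keeping the summaries in their own list means a new conversation could load only the "session" entries as context, which matches the motivation stated in the commit message.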

55 Commits
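
Commit 81ce0cacc3 in the log describes using a SearchType to limit the types a user can pass to the /search and /regenerate APIs. As a rough sketch of that idea using only the standard library, an enum can reject unsupported values at the boundary. The member names below are hypothetical, chosen from the search types the log mentions around that date, and `parse_search_type` is an illustrative helper rather than anything from the repository.

```python
from enum import Enum


class SearchType(str, Enum):
    # Hypothetical members, based on search types named in the commit log.
    notes = "notes"
    music = "music"
    ledger = "ledger"
    image = "image"


def parse_search_type(raw):
    """Return the matching SearchType, or None when `raw` is not supported."""
    try:
        return SearchType(raw)
    except ValueError:
        return None
```

Because the members also subclass `str`, controller code can compare against `SearchType.image` instead of raw strings, which is the neatness the commit message points to.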