klbr/khoj - khoj - Gitea: Git with a cup of tea

klbr/khoj

mirror of https://github.com/khoaliber/khoj.git synced 2026-03-06 13:22:12 +00:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	afc84de234	Make word filter regex explicit. Allow hyphen in word filters Helps with #88	2022-09-12 17:05:29 +03:00
Debanjum Singh Solanky	ebd5039bd1	Merge branch 'master' into support-incremental-updates-of-embeddings	2022-09-10 22:37:13 +03:00
Debanjum Singh Solanky	c17a0fd05b	Do not store word filters index to file. Not necessary for now - It's more of a hassle to not let word filter go stale on entry updates - Generating index on 120K lines of notes takes 1s. Loading from file takes 0.2s. For less content load time difference will be even smaller - Let go of startup time improvement for simplicity for now	2022-09-10 21:01:54 +03:00
Debanjum Singh Solanky	e00bb53336	Init word filter dictionary with default value as set to simplify code	2022-09-10 12:19:09 +03:00
Debanjum Singh Solanky	490157cafa	Setup File Filter for Markdown and Ledger content types - Pass file associated with entries in markdown, beancount to json converters - Add File, Word, Date Filters to Ledger, Markdown Types - Word, Date Filters were accidently removed from the above types yesterday - File Filter is the only filter that newly got added	2022-09-06 15:31:26 +03:00
Debanjum Singh Solanky	3707a4cdd4	Improve date filter perf. Precompute date to entry map, Cache results - Precompute date to entry map - Cache results for faster recall - Log preformance timers in date filter	2022-09-05 18:21:29 +03:00
Debanjum Singh Solanky	31503e7afd	Do not pass embeddings as argument to filter.apply method	2022-09-05 15:46:54 +03:00
Debanjum Singh Solanky	965bd052f1	Make search filters return entry ids satisfying filter - Filter entries, embeddings by ids satisfying all filters in query func, after each filter has returned entry ids satisfying their individual acceptance criteria - Previously each filter would return a filtered list of entries. Each filter would be applied on entries filtered by previous filters. This made the filtering order dependent - Benefits - Filters can be applied independent of their order of execution - Precomputed indexes for each filter is not in danger of running into index out of bound errors, as filters run on original entries instead of on entries filtered by filters that have run before it - Extract entries satisfying filter only once instead of doing this for each filter - Costs - Each filter has to process all entries even if previous filters may have already marked them as non-satisfactory	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	7dd20d764c	Pre-compute file to entry map in file filter to mark ids to include faster	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	2890b4cd44	Simplify extracting entries satisfying file filter	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	7e083d3e96	Cache results for file filters passed in query for faster filtering	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	f634399f23	Convert simple file filters with no path separator into regex - Specify just file name to get all notes associated with file at path - E.g `query` with `file:"file1.org"` will return `entry1` if `entry1` is in `file1.org` at `~/notes/file.org` - Test - Test converting simple file name filter to regex for path match - Test file filter with space in file name	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	1f9fd28b34	Create File Filter to filter files to query. Add tests for file filter	2022-09-05 01:09:20 +03:00
Debanjum Singh Solanky	e4418746f2	Create Abstract Base Class for Filters. Make Word, Date Filter Child of BaseFilter	2022-09-04 18:48:16 +03:00
Debanjum Singh Solanky	f930324350	Rename explicit filter to word filter to be more specific	2022-09-04 17:18:47 +03:00
Debanjum Singh Solanky	6087862521	Use LRU helper class for explicit filter cache	2022-09-04 16:42:28 +03:00
Debanjum Singh Solanky	191a656ed7	Use word to entry map, list comprehension to speed up explicit filter - Code Changes - Use list comprehension and `torch.index_select' methods - to speed selection of entries, embedding tensors satisfying filter - avoid deep copy of entries, embeddings - avoid updating existing lists (of entries, embeddings) - Use word to entry map and set operations to mark entries satisfying inclusion, exclusion filters - Results - Speed up explicit filtering by two orders of magnitude - Improve consistency of speed up across inclusion and exclusion filtering	2022-09-04 15:22:35 +03:00
Debanjum Singh Solanky	28d3dc1434	Deep copy entries, embeddings in filters. Defer till actual filtering - Only the filter knows when entries, embeddings are to be manipulated. So move the responsibility to deep copy before manipulating entries, embeddings to the filters - Create deep copy in filters. Avoids creating deep copy of entries, embeddings when filter results are being loaded from cache etc	2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky	3308e68edf	Cache explicitly filtered entries, embeddings by required, blocked words	2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky	cdcee89ae5	Wrap words in quotes to trigger explicit filter from query - Do not run the more expensive explicit filter until the word to be filtered is completed by user. This requires an end sequence marker to identify end of explicit word filter to trigger filtering - Space isn't a good enough delimiter as the explicit filter could be at the end of the query in which case no space	2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky	8d9f507df3	Load entries_by_word_set from file only once on first load of explicit filter	2022-09-04 00:37:37 +03:00
Debanjum Singh Solanky	858d86075b	Use regexes to check if any explicit filters in query. Test can_filter	2022-09-03 23:47:28 +03:00
Debanjum Singh Solanky	546fad570d	Use regex to extract include, exclude filter words from query	2022-09-03 23:41:43 +03:00
Debanjum Singh Solanky	ffb8e3988e	Use Python Logging Framework to Time Performance of Explicit Filter	2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky	c7de57b8ea	Pre-compute entry word sets to improve explicit filter query performance	2022-09-03 16:16:31 +03:00
Debanjum Singh Solanky	38df727ef4	Fix escape sequence usage in strings. Remove unneeded import of os Rename /config API method to config to match it's purpose. UI is anyway too generic, and not what it is doing	2022-08-03 18:51:55 +03:00
Debanjum Singh Solanky	1b55462fb0	Convert search_filter, conversation dir to proper modules Add __init__.py files to their directories	2022-08-02 20:23:42 +03:00
Debanjum Singh Solanky	b1e64fd4a8	Improve search speed. Only apply filter if filter keywords in query - Formalize filters into class with can_filter() and filter() methods - Use can_filter() method to decide whether to apply filter and create deep copies of entries and embeddings for it - Improve search speed for queries with no filters as deep copying entries, embeddings takes the most time after cross-encodes scoring when calling the /search API Earlier we would create deep copies of entries, embeddings even if the query did not contain any filter keywords	2022-07-26 22:47:26 +04:00
Debanjum Singh Solanky	b673d26a12	Extract Entries in a standardized format across text search types Issue: - Had different schema of extracted entries for symmetric_ledger vs asymmetric - Entry extraction for asymmetric was dirty, relying on cryptic indices to store raw entry vs cleaned entry meant to be passed to embeddings - This was pushing the load of figuring out what property to extract from each entry to downstream processes like the filters - This limited the filters to only work for asymmetric search, not for symmetric_ledger - Fix - Use consistent format for extracted entries { 'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings, 'raw' : raw_entry_string_meant_to_be_passed_to_use } - Result - Now filters can be applied across search types, and the specific field they should be applied on can be configured by each search type	2022-07-19 20:52:25 +04:00
Debanjum Singh Solanky	85077bc1d1	Handle unparseable date range passed via date filter in query - Do not reuse the same list - Just create new list, so only parsed data is in it	2022-07-14 22:47:23 +04:00
Debanjum Singh Solanky	c3b3e8959d	Put entry splitting regex in explicit filter into a variable for code readability	2022-07-14 22:00:10 +04:00
Debanjum Singh Solanky	3aac3c7d52	Run explicit filter on raw entry, add more terms to split entries by - With \t Last Word in Headings was suffixed by \t and so couldn't be filtered by - User interacts with raw entries, so run explicit filters on raw entry - For semantic search using the filtered entry is cleaner, still	2022-07-14 21:54:04 +04:00
Debanjum Singh Solanky	7640e2ab0c	Wrap attempt to extract dates from entry in try/catch - Not all YYYY-MM-DD strings in entry are necessarily dates	2022-07-14 21:38:00 +04:00
Debanjum Singh Solanky	9de2097182	Fix date filter usage with multi word queries. Simplify date regex	2022-07-14 21:34:33 +04:00
Debanjum Singh Solanky	dcb6fe479e	Fix date_filter query, entry in query range check. Add tests for it - Fix date_filter date_in_entry within query range check - Extracted_date_range is in [included_date, excluded_date) format - But check was checking for date_in_entry <= excluded_date - Fixed it to do date_in_entry < excluded_date - Fix removal of date filter from query - Add tests for date_filter	2022-07-14 20:01:35 +04:00
Debanjum Singh Solanky	011f81fac5	Fix date_filter to handle non overlapping date ranges	2022-07-14 18:53:38 +04:00
Debanjum Singh Solanky	70ac35b2a5	Compute Date Range to filter entries to, from Comparators, Dates in Query	2022-07-14 18:20:09 +04:00
Debanjum Singh Solanky	e6db3e3d00	Prefer Dates From Future only when specific words in date string - Default to looking at dates from past, as most notes are from past - Look for dates in future for cases where it's obvious query is for dates in the future but dateparser's parse doesn't parse it at all. E.g parse('5 months from now') returns nothing - Setting PREFER_DATES_FROM_FUTURE in this case and passing just parse('5 months') to dateparser.parse works as expected	2022-07-14 18:13:12 +04:00
Debanjum Singh Solanky	4a201d52af	Add, test date filter regex and date parsing to get natural date range	2022-07-14 16:47:32 +04:00
Debanjum Singh Solanky	b54588717f	Filter for entries with dates specified by user in query - Create Date filter - Users can pass dates in YYYY-MM-DD format in their query - Use it to filter asymmetric search to user specified dates	2022-07-14 00:51:02 +04:00
Debanjum Singh Solanky	c92789d20a	Extract explicit pre-search filter function into a separate module Details -- - Move explicit_filters function into separate module under search_filter - Update signature of explicit filter to take and return query, entries, embeddings - Use this explicit_filter func from search_filters module in query Reason -- Abstraction will simplify adding other pre-search filters. E.g datetime filter	2022-07-13 16:20:04 +04:00

41 Commits