mirror of
https://github.com/khoaliber/khoj.git
synced 2026-03-06 13:22:12 +00:00
Extract Entries in a standardized format across text search types
Issue:
- Extracted entries had a different schema for symmetric_ledger vs asymmetric search
- Entry extraction for asymmetric search was messy, relying on cryptic
  indices to store the raw entry vs the cleaned entry meant to be passed to embeddings
- This pushed the load of figuring out which property to extract
  from each entry onto downstream processes like the filters
- This limited the filters to working only for asymmetric search, not for
  symmetric_ledger
Fix:
- Use a consistent format for extracted entries
  {
    'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
    'raw'  : raw_entry_string_meant_to_be_passed_to_user
  }
Result:
- Now filters can be applied across search types, and the specific
  field they should be applied on can be configured by each search
  type
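A minimal sketch of what the result could look like in practice, assuming a simple substring filter. The names `make_entry`, `filter_entries` and `entry_key` are hypothetical and do not come from the repository; only the `'embed'` and `'raw'` keys are taken from the commit message.

```python
from typing import Dict, List

Entry = Dict[str, str]


def make_entry(raw: str, embed: str) -> Entry:
    # Build an entry in the standardized format described in the commit message.
    return {'embed': embed, 'raw': raw}


def filter_entries(entries: List[Entry], required_term: str, entry_key: str = 'raw') -> List[Entry]:
    # Hypothetical filter: keep entries whose configured field contains the term.
    # Each search type can choose which field (entry_key) the filter inspects.
    return [entry for entry in entries if required_term in entry[entry_key]]


entries = [
    make_entry(raw='* Entry with date:1984-04-02', embed='Entry with date'),
    make_entry(raw='* Entry with no date', embed='Entry with no date'),
]

# Filter on the raw text, where the date is still present.
dated = filter_entries(entries, '1984-04-02', entry_key='raw')
```

Because every search type emits the same two keys, the same filter can be reused and only `entry_key` needs to change.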
@@ -13,9 +13,9 @@ from src.search_filter import date_filter
 def test_date_filter():
     embeddings = torch.randn(3, 10)
     entries = [
-        ['', 'Entry with no date'],
-        ['', 'April Fools entry: 1984-04-01'],
-        ['', 'Entry with date:1984-04-02']]
+        {'embed': '', 'raw': 'Entry with no date'},
+        {'embed': '', 'raw': 'April Fools entry: 1984-04-01'},
+        {'embed': '', 'raw': 'Entry with date:1984-04-02'}]
 
     q_with_no_date_filter = 'head tail'
     ret_query, ret_entries, ret_emb = date_filter.date_filter(q_with_no_date_filter, entries.copy(), embeddings)
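The test change above swaps two-item lists for dictionaries, with position 0 becoming `'embed'` and position 1 becoming `'raw'`. A small sketch of that mapping, assuming the old list layout shown in the test; `to_entry_dict` is a hypothetical helper, not part of the repository:

```python
from typing import Dict, List, Sequence


def to_entry_dict(legacy_entry: Sequence[str]) -> Dict[str, str]:
    # Old layout (as in the removed test lines): [embed_text, raw_text].
    embed, raw = legacy_entry
    return {'embed': embed, 'raw': raw}


legacy_entries: List[List[str]] = [
    ['', 'Entry with no date'],
    ['', 'April Fools entry: 1984-04-01'],
    ['', 'Entry with date:1984-04-02'],
]

# Convert every legacy entry to the standardized dictionary format.
converted = [to_entry_dict(entry) for entry in legacy_entries]
```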