Extract Entries in a standardized format across text search types

Issue:
- Had different schema of extracted entries for symmetric_ledger vs asymmetric
- Entry extraction for asymmetric was dirty, relying on cryptic indices to store raw entry vs cleaned entry meant to be passed to embeddings
- This was pushing the load of figuring out what property to extract from each entry to downstream processes like the filters
- This limited the filters to only work for asymmetric search, not for symmetric_ledger

Fix:
- Use consistent format for extracted entries
  {
    'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
    'raw'  : raw_entry_string_meant_to_be_passed_to_use
  }

Result:
- Now filters can be applied across search types, and the specific field they should be applied on can be configured by each search type
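
A minimal sketch of the idea, not the code from this commit: every entry is a dict with 'embed' and 'raw' keys, and a filter receives the name of the field it should inspect. The names make_entry, keep_entries_with_date and DATE_PATTERN are hypothetical and used only for illustration; the real date_filter in src/search_filter has a different signature and also handles the query and embeddings.

```python
import re
from typing import Dict, List

# Assumed shape of a standardized entry: {'embed': str, 'raw': str}
Entry = Dict[str, str]

# Hypothetical pattern: matches ISO-style dates such as 1984-04-01
DATE_PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}')


def make_entry(embed_text: str, raw_text: str) -> Entry:
    """Build an entry in the standardized format (hypothetical helper)."""
    return {'embed': embed_text, 'raw': raw_text}


def keep_entries_with_date(entries: List[Entry], field: str = 'raw') -> List[Entry]:
    """Keep entries whose chosen field contains a date.

    The field name is a parameter, so each search type can decide
    whether the filter runs on 'raw' or on 'embed'.
    """
    return [entry for entry in entries if DATE_PATTERN.search(entry[field])]


entries = [
    make_entry('', 'Entry with no date'),
    make_entry('', 'April Fools entry: 1984-04-01'),
    make_entry('', 'Entry with date:1984-04-02'),
]

# Filter on the raw text; the 'embed' field is empty in this sample data
dated = keep_entries_with_date(entries, field='raw')
```

Because the filter looks fields up by key rather than by list position, the same function would work for any search type that produces entries in this format.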
2026-03-06 13:22:12 +00:00 · 2022-07-19 20:52:25 +04:00
parent e66cd5bf59
commit b673d26a12
5 changed files with 18 additions and 18 deletions
--- a/tests/test_date_filter.py
+++ b/tests/test_date_filter.py
@@ -13,9 +13,9 @@ from src.search_filter import date_filter
 def test_date_filter():
    embeddings = torch.randn(3, 10)
    entries = [
-        ['', 'Entry with no date'],
-        ['', 'April Fools entry: 1984-04-01'],
-        ['', 'Entry with date:1984-04-02']]
+        {'embed': '', 'raw': 'Entry with no date'},
+        {'embed': '', 'raw': 'April Fools entry: 1984-04-01'},
+        {'embed': '', 'raw': 'Entry with date:1984-04-02'}]

    q_with_no_date_filter = 'head tail'
    ret_query, ret_entries, ret_emb = date_filter.date_filter(q_with_no_date_filter, entries.copy(), embeddings)