Chunk text in preference order of para, sentence, word, character

- The previous simplistic strategy of splitting text on spaces
  didn't handle notes that have newlines but no spaces. See #620
  for an example

- The new strategy first tries to chunk the text at more natural
  points: paragraph, then sentence, then word. If none of those
  work, it splits at the character level to fit within the max
  token limit

- Drop long words while preserving original delimiters
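
The preference order described above can be sketched as a small recursive splitter. This is only an illustration under stated assumptions, not the code from this commit: `chunk_text`, `SEPARATORS`, and `max_size` are hypothetical names, size is measured in characters as a stand-in for tokens, and the long-word dropping step is not modelled.

```python
from typing import List

# Hypothetical separator order: paragraph, line, sentence, word, character.
# Assumption: the list must end with "" so a hard split is always available.
SEPARATORS = ["\n\n", "\n", ". ", " ", ""]


def chunk_text(text: str, max_size: int, separators: List[str] = SEPARATORS) -> List[str]:
    """Split text into chunks of at most max_size characters,
    preferring the coarsest separator that makes the pieces fit."""
    if len(text) <= max_size:
        return [text] if text else []

    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard split at character boundaries.
        return [text[i : i + max_size] for i in range(0, len(text), max_size)]

    # Keep each separator attached to the piece before it, so that
    # joining all chunks reproduces the original text.
    parts = text.split(sep)
    pieces = [part + sep for part in parts[:-1]] + [parts[-1]]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if len(current) + len(piece) <= max_size:
            # Piece still fits: merge it into the chunk being built.
            current += piece
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_size:
            current = piece
        else:
            # Piece alone is too big: retry it with the next, finer separator.
            chunks.extend(chunk_text(piece, max_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

With this sketch, a note with no spaces or newlines such as `"a" * 50` and `max_size=20` falls through to the character split and comes back as three chunks of 20, 20 and 10 characters, while `"aaaa\n\nbbbb"` with `max_size=6` is split at the paragraph break.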

Resolves #620
Debanjum Singh Solanky
2024-01-29 05:03:29 +05:30
parent a627f56a64
commit 86575b2946
3 changed files with 46 additions and 17 deletions

@@ -192,7 +192,7 @@ def test_entry_chunking_by_max_tokens(org_config_with_only_new_file: LocalOrgCon
     # Assert
     assert (
-        "Deleted 0 entries. Created 2 new entries for user " in caplog.records[-1].message
+        "Deleted 0 entries. Created 3 new entries for user " in caplog.records[-1].message
     ), "new entry not split by max tokens"
@@ -250,7 +250,7 @@ conda activate khoj
     # Assert
     assert (
-        "Deleted 0 entries. Created 2 new entries for user " in caplog.records[-1].message
+        "Deleted 0 entries. Created 3 new entries for user " in caplog.records[-1].message
     ), "new entry not split by max tokens"