mirror of
https://github.com/khoaliber/khoj.git
synced 2026-03-02 21:19:12 +00:00
Snip prepended heading to avoid crossing model max_token limits
Otherwise, if the heading exceeds max_tokens, the search models will see only a heading (with repeated filename) for each compiled entry and not the actual content. 100 characters should be sufficient to include the filename (not the path) and the entry heading. If the heading is longer, truncate it so that the entry's unique text still reaches the model as search context.
@@ -44,7 +44,10 @@ class TextToJsonl(ABC):
 
     # Prepend heading to all other chunks, the first chunk already has heading from original entry
     if chunk_index > 0:
-        compiled_entry_chunk = f"{entry.heading}.\n{compiled_entry_chunk}"
+        # Snip heading to avoid crossing max_tokens limit
+        # Keep last 100 characters of heading as entry heading more important than filename
+        snipped_heading = entry.heading[-100:]
+        compiled_entry_chunk = f"{snipped_heading}.\n{compiled_entry_chunk}"
 
     chunked_entries.append(
         Entry(
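
The effect of the change can be sketched in isolation. This is a minimal illustration, not the project's actual code: the function name `prepend_snipped_heading` and its parameters are hypothetical, and chunking is assumed to have already produced a list of strings. Only the `[-100:]` slice and the `"{heading}.\n{chunk}"` format come from the diff.

```python
# Hypothetical standalone sketch of the heading-snipping step.
# Only the 100-character tail slice and the "heading.\nchunk" format
# are taken from the commit; everything else is assumed for illustration.

MAX_HEADING_CHARS = 100  # limit named in the commit message


def prepend_snipped_heading(heading, chunks, max_heading_chars=MAX_HEADING_CHARS):
    """Prepend the tail of `heading` to every chunk except the first.

    The first chunk is left alone because it already carries the heading
    from the original entry. The tail is kept (not the start) because the
    entry heading comes after the filename and matters more for search.
    """
    result = []
    for chunk_index, chunk in enumerate(chunks):
        if chunk_index > 0:
            # Keep only the last `max_heading_chars` characters
            snipped_heading = heading[-max_heading_chars:]
            chunk = f"{snipped_heading}.\n{chunk}"
        result.append(chunk)
    return result


if __name__ == "__main__":
    long_heading = "x" * 150 + "Entry Title"
    for piece in prepend_snipped_heading(long_heading, ["first part", "second part"]):
        print(repr(piece))
```

Slicing from the end means a heading shorter than the limit is passed through unchanged, so short entries behave exactly as before the commit.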