Support incremental update of org-mode entries and embeddings

- What
  - Hash the entries and compare to find new/updated entries
  - Reuse embeddings encoded for existing entries
  - Only encode embeddings for updated or new entries
  - Merge the existing and new entries and embeddings to get the updated
    entries and embeddings

- Why
  - Most note text entries are expected to stay unchanged over time,
    so reusing their previously encoded embeddings should
    significantly speed up embedding updates
  - Previously we were regenerating embeddings for all entries,
    even if they had existed in previous runs
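
The steps listed under "What" can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the names `entry_hash` and `incremental_update`, the use of SHA-256, and the `encode` callable (any function mapping a list of strings to a list of embeddings) are all assumptions made for the example.

```python
import hashlib


def entry_hash(entry: str) -> str:
    # Hypothetical helper: a stable content hash used to detect unchanged entries
    return hashlib.sha256(entry.encode("utf-8")).hexdigest()


def incremental_update(current_entries, previous_entries, previous_embeddings, encode):
    """Reuse embeddings for unchanged entries; encode only new or updated ones.

    `encode` is an assumed callable: list[str] -> list of embeddings.
    Returns (entries, embeddings, number_of_newly_encoded_entries).
    """
    # Map the hash of every previously seen entry to its stored embedding
    previous_by_hash = {
        entry_hash(entry): embedding
        for entry, embedding in zip(previous_entries, previous_embeddings)
    }

    kept_entries, kept_embeddings, new_entries = [], [], []
    for entry in current_entries:
        digest = entry_hash(entry)
        if digest in previous_by_hash:
            # Unchanged entry: reuse the embedding from the previous run
            kept_entries.append(entry)
            kept_embeddings.append(previous_by_hash[digest])
        else:
            # New or modified entry: needs a fresh embedding
            new_entries.append(entry)

    # Only the new or updated entries are passed to the encoder
    new_embeddings = list(encode(new_entries)) if new_entries else []

    # Merge: reused entries first, then newly encoded ones, in matching order
    return (
        kept_entries + new_entries,
        kept_embeddings + new_embeddings,
        len(new_entries),
    )
```

Entries that no longer appear in `current_entries` are dropped from the merged result, and an edited entry counts as new because its hash changes.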
Debanjum Singh Solanky
2022-09-07 00:16:48 +03:00
parent 762607fc9f
commit 2f7a6af56a
5 changed files with 80 additions and 30 deletions

@@ -39,7 +39,7 @@ def markdown_to_jsonl(markdown_files, markdown_file_filter, output_file):
     elif output_file.suffix == ".jsonl":
         dump_jsonl(jsonl_data, output_file)
 
-    return entries
+    return list(enumerate(entries))
 
 
 def get_markdown_files(markdown_files=None, markdown_file_filter=None):