Commit Graph

17 Commits

Author SHA1 Message Date
Debanjum Singh Solanky
f209e30a3b Split entries by max tokens while converting Markdown entries To JSONL 2022-12-26 13:14:15 -03:00
Debanjum Singh Solanky
7e9298f315 Use new Text Entry class to track text entries in Intermediate Format
- Context
  - The app maintains all text content in a standard, intermediate format
  - The intermediate format was loaded, passed around as a dictionary
    for easier, faster updates to the intermediate format schema initially
  - The intermediate format is reasonably stable now, given it's usage
    by all 3 text content types currently implemented

- Changes
  - Concretize text entries into `Entries' class instead of using dictionaries
    - Code is updated to load, pass around entries as `Entries' objects
      instead of as dictionaries
    - `text_search' and `text_to_jsonl' methods are annotated with
       type hints for the new `Entries' type
    - Code and Tests referencing entries are updated to use class style
      access patterns instead of the previous dictionary access patterns

  - Move `mark_entries_for_update' method into `TextToJsonl' base class
    - This is a more natural location for the method as it is only
      (to be) used by `text_to_jsonl' classes
    - Avoid circular reference issues on importing `Entries' class
2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky
02d944030f Use Base TextToJsonl class to standardize <text>_to_jsonl processors
- Start standardizing implementation of the `text_to_jsonl' processors
  - `text_to_jsonl; scripts already had a shared structure
  - This change starts to codify that implicit structure

- Benefits
  - Ease adding more `text_to_jsonl; processors
  - Allow merging shared functionality
  - Help with type hinting

- Drawbacks
  - Lower agility to change. But this was already an implicit issue as
    the text_to_jsonl processors got more deeply wired into the app
2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky
8f57a62675 Remove unused imports. Fix typing and indentation
- Typing issues discovered using `mypy'. Fixed manually
- Unused imports discovered and fixed using `autoflake'
- Fix indentation in `org_to_jsonl' manually
2022-09-14 04:56:52 +03:00
Debanjum Singh Solanky
0109c7bd91 Disable ability to call <text>_to_jsonl, <type>_search packages directly
- This code is de-synced with expected args by above scripts
- Better to remove unused capabilitity that needlessly increases
  maintainance burden
2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky
536f03af8f Process text content files in sorted order for stable indexing
- Image search already uses a sorted list of images to process
- Prevents index of entries to desync when entries, embeddings
  generated by a separate server/app instance
2022-09-12 11:09:40 +03:00
Debanjum Singh Solanky
a701ad08b9 Support multiple input-filters to configure content to index via khoj.yml
- Update existings code, tests to process input-filters as list
  instead of str
- Test `text_to_jsonl' get files methods to work with combination of
  `input-files' and `input-filters'

Resolves #84
2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky
52e3dd9835 Pass the whole TextContentConfig as argument to text_to_jsonl methods
- Let the specific text_to_jsonl method decide which of the
  TextContentConfig fields it needs to convert <text> type to jsonl
- This simplifies extending TextContentConfig for a specific type without
  modifying all text_to_jsonl methods
- It keeps the number of args being passed to the `text_to_jsonl'
  methods in check
2022-09-11 12:49:56 +03:00
Debanjum Singh Solanky
2e1bbe0cac Fix striping empty escape sequences from strings
- Fix log message on jsonl write
2022-09-10 23:57:05 +03:00
Debanjum Singh Solanky
a7cf6c8458 Use dictionary instead of list to track entry to file maps 2022-09-10 23:08:30 +03:00
Debanjum Singh Solanky
3e1323971b Stack function calls in jsonl converters to avoid unneeded variables 2022-09-10 22:56:06 +03:00
Debanjum Singh Solanky
4eb84c7f51 Log performance metrics for beancount, markdown to jsonl conversion 2022-09-10 22:47:54 +03:00
Debanjum Singh Solanky
030fab9bb2 Support incremental update of Markdown entries, embeddings 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
2f7a6af56a Support incremental update of org-mode entries and embeddings
- What
  - Hash the entries and compare to find new/updated entries
  - Reuse embeddings encoded for existing entries
  - Only encode embeddings for updated or new entries
  - Merge the existing and new entries and embeddings to get the updated
    entries, embeddings

- Why
  - Given most note text entries are expected to be unchanged
    across time. Reusing their earlier encoded embeddings should
    significantly speed up embeddings updates
  - Previously we were regenerating embeddings for all entries,
    even if they had existed in previous runs
2022-09-10 20:58:33 +03:00
Debanjum Singh Solanky
490157cafa Setup File Filter for Markdown and Ledger content types
- Pass file associated with entries in markdown, beancount to json converters
- Add File, Word, Date Filters to Ledger, Markdown Types
  - Word, Date Filters were accidently removed from the above types yesterday
  - File Filter is the only filter that newly got added
2022-09-06 15:31:26 +03:00
Debanjum Singh Solanky
094bd18e57 Use python standard logging framework for app logs
- Stop passing verbose flag around app methods
- Minor remap of verbosity levels to match python logging framework levels
  - verbose = 0 maps to logging.WARN
  - verbose = 1 maps to logging.INFO
  - verbose >=2 maps to logging.DEBUG
- Minor clean-up of app: unused modules, conversation file opening
2022-09-03 14:43:32 +03:00
Debanjum Singh Solanky
d4d7dbaca6 Support Natural Search on Markdown Files
- Reason:
  Allow natural search on markdown based notes, documentation,
  websites etc

- Details:
  - Create markdown processor to extract Markdown entries (identified by
    Heading) into standard jsonl format required by text_search
  - Update API, Configs to support interfacing with new markdown type
  - Update Emacs, Web clients to support interfacing with new markdown
    type via API
  - Update Readme to mentiond markdown is also supported

Closes #35
2022-07-21 22:07:05 +04:00