Commit Graph

6 Commits

Author SHA1 Message Date
Debanjum Singh Solanky
aa22d83172 Create and use a context manager to time code
Use the timer context manager in all places where code was being timed

- Benefits
  - Deduplicate timing code scattered across the codebase
  - Provide a single place to manage perf timing code
  - Use consistent timing log patterns
2023-01-09 19:48:16 -03:00
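
A timing context manager of the kind this commit describes could look roughly like the sketch below. The name `timer`, its parameters, and the log message format are assumptions made for illustration. They are not taken from the commit itself.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def timer(message, log=logger):
    """Log how long the wrapped block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so every timing is logged
        elapsed = time.perf_counter() - start
        log.debug(f"{message}: {elapsed:.3f} seconds")
```

Call sites would then wrap the timed code, for example `with timer("Load entries"): ...`, so each site no longer records its own start and end times.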
Debanjum Singh Solanky
826f9dc054 Drop long words from compiled entries to be within max token limit of models
Long words (>500 characters) provide less useful context to models.

Dropping very long words allows models to create better embeddings by
passing more of the useful context from the entry to the model
2023-01-07 23:13:56 -03:00
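
As a minimal sketch of the idea, assuming words are whitespace-separated and using the 500 character threshold from the message above, the filter might look like this. The function name `drop_long_words` and its parameter are hypothetical.

```python
def drop_long_words(text, max_word_length=500):
    """Remove words longer than max_word_length characters (hypothetical helper)."""
    # Splitting on whitespace is an assumption about how words are delimited
    return " ".join(word for word in text.split() if len(word) <= max_word_length)
```

Note that rejoining with single spaces normalizes the original whitespace, which may or may not matter for the compiled entry.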
Debanjum Singh Solanky
e057c8e208 Add method to split entries by specified max tokens limit
- Issue
   ML Models truncate entries exceeding some max token limit.
   This lowers the quality of search results

- Fix
  Split entries by max tokens before indexing.
  This should improve searching for content in longer entries.

- Miscellaneous
  - Test method to split entries by max tokens
2022-12-23 16:24:04 -03:00
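
The fix above could be sketched as follows. This is a simplification that treats each whitespace-separated word as one token, which only approximates a model's real tokenizer. The function name `split_by_max_tokens` and the default limit of 256 are assumptions for illustration.

```python
def split_by_max_tokens(text, max_tokens=256):
    """Split text into chunks of at most max_tokens words (hypothetical helper)."""
    # Assumption: one whitespace-separated word counts as one token
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]
```

Each chunk can then be indexed as its own entry, so content near the end of a long entry is no longer truncated away before embedding.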
Debanjum Singh Solanky
a9cfd8b800 Extract hash func for incremental text indexing into separate method
2022-10-26 13:56:58 +05:30
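
The commit message gives no detail on the hash function, so the following is only a sketch of how hashing can support incremental indexing in general: hash each entry's text, then compare against the hashes from the previous run to find entries that still need indexing. The names `hash_entry` and `split_new_and_existing`, and the choice of SHA-256, are assumptions.

```python
import hashlib


def hash_entry(text):
    """Return a stable hex digest for an entry's text (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_new_and_existing(current_texts, previous_hashes):
    """Separate entries seen in the previous run from entries that are new."""
    new, existing = [], []
    for text in current_texts:
        if hash_entry(text) in previous_hashes:
            existing.append(text)
        else:
            new.append(text)
    return new, existing
```

Only the entries in `new` would need fresh embeddings, while those in `existing` can reuse what was computed earlier.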
Debanjum Singh Solanky
7e9298f315 Use new Text Entry class to track text entries in Intermediate Format
- Context
  - The app maintains all text content in a standard, intermediate format
  - The intermediate format was loaded, passed around as a dictionary
    for easier, faster updates to the intermediate format schema initially
  - The intermediate format is reasonably stable now, given its usage
    by all 3 text content types currently implemented

- Changes
  - Concretize text entries into `Entries' class instead of using dictionaries
    - Code is updated to load, pass around entries as `Entries' objects
      instead of as dictionaries
    - `text_search' and `text_to_jsonl' methods are annotated with
       type hints for the new `Entries' type
    - Code and Tests referencing entries are updated to use class style
      access patterns instead of the previous dictionary access patterns

  - Move `mark_entries_for_update' method into `TextToJsonl' base class
    - This is a more natural location for the method as it is only
      (to be) used by `text_to_jsonl' classes
    - Avoid circular reference issues on importing `Entries' class
2022-10-08 12:06:05 +03:00
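
To illustrate the move from dictionary access to class style access described above, here is a minimal sketch. The class name `Entry` and the fields `raw`, `compiled`, and `file` are hypothetical. The actual schema of the intermediate format is not given in the commit message.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Entry:
    """One text entry in the intermediate format (field names are assumptions)."""
    raw: str
    compiled: str
    file: str = ""

    @classmethod
    def from_dict(cls, data):
        # Bridge from the earlier dictionary representation
        return cls(raw=data["raw"], compiled=data["compiled"], file=data.get("file", ""))

    def to_json(self):
        # Serialize one entry as a single JSON line
        return json.dumps(asdict(self), ensure_ascii=False)
```

With a class like this, code reads `entry.compiled` rather than `entry["compiled"]`, and type hints such as `list[Entry]` become possible on the processing methods.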
Debanjum Singh Solanky
02d944030f Use Base TextToJsonl class to standardize <text>_to_jsonl processors
- Start standardizing implementation of the `text_to_jsonl' processors
  - `text_to_jsonl' scripts already had a shared structure
  - This change starts to codify that implicit structure

- Benefits
  - Ease adding more `text_to_jsonl' processors
  - Allow merging shared functionality
  - Help with type hinting

- Drawbacks
  - Lower agility to change. But this was already an implicit issue as
    the text_to_jsonl processors got more deeply wired into the app
2022-09-16 00:53:11 +03:00
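
A base class that codifies a shared processor structure might be sketched with Python's `abc` module as below. Only the name `TextToJsonl` comes from the commit message. The `process` method, its signature, the `config` attribute, and the `MarkdownToJsonl` subclass are assumptions for illustration.

```python
from abc import ABC, abstractmethod


class TextToJsonl(ABC):
    """Shared interface for text_to_jsonl processors (method names are assumptions)."""

    def __init__(self, config=None):
        self.config = config

    @abstractmethod
    def process(self, previous_entries=None):
        """Convert source files to entries. Each subclass must implement this."""


class MarkdownToJsonl(TextToJsonl):
    """Hypothetical processor for one content type."""

    def process(self, previous_entries=None):
        # A real processor would parse files named in self.config
        return list(previous_entries or [])
```

Because `process` is abstract, instantiating `TextToJsonl` directly raises `TypeError`, which makes the previously implicit structure explicit for any new processor.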