mirror of
https://github.com/khoaliber/khoj.git
synced 2026-03-02 21:19:12 +00:00
Read Markdown file as utf8 instead of the default encoding used by OS
- Background
1. Obsidian stores markdown notes as utf8[1]
2. By default, Python's `open' function uses the OS locale encoding[2]
This was causing the `UnicodeDecodeError: <locale_encoding> codec can't decode byte' error
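A minimal sketch of how the error can be reproduced, not code from khoj. It assumes "ascii" as a stand-in for a locale encoding that cannot decode the note. The names note_text and note_path are hypothetical.

```python
import os
import tempfile

# Write a note containing non-ASCII text as UTF-8, the way Obsidian stores it.
note_text = "# Café\nNaïve résumé ✓\n"
with tempfile.NamedTemporaryFile("w", suffix=".md", encoding="utf8", delete=False) as f:
    f.write(note_text)
    note_path = f.name

# Assumption: "ascii" stands in for an OS locale encoding that cannot decode the file.
try:
    with open(note_path, "r", encoding="ascii") as f:
        f.read()
    locale_read_failed = False
except UnicodeDecodeError:
    locale_read_failed = True

# Passing the encoding explicitly makes the read independent of the OS locale.
with open(note_path, "r", encoding="utf8") as f:
    utf8_content = f.read()

os.remove(note_path)
```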
- Fix
- Read markdown files as utf8
The Obsidian plugin is the main use-case for markdown files in
khoj currently and that stores md files as utf8.
Do not assume utf8 for other content types like org-mode, beancount for now.
- Fail if error in reading file as utf8, instead of ignoring errors.
Would rather have user realize that their files are not going to
get indexed correctly.
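A small sketch of the trade-off described above, under the assumption that a non-utf8 file looks like Latin-1 bytes on disk. The path and variable names are hypothetical. With errors="ignore" the undecodable byte is silently dropped, while the default strict mode raises so the user notices.

```python
import os
import tempfile

# A file that is NOT valid UTF-8: "café" encoded as Latin-1 (hypothetical bad input).
with tempfile.NamedTemporaryFile("wb", suffix=".md", delete=False) as f:
    f.write("café".encode("latin-1"))
    bad_path = f.name

# errors="ignore" silently drops the byte that cannot be decoded as utf8.
with open(bad_path, "r", encoding="utf8", errors="ignore") as f:
    lossy_content = f.read()

# The default error mode ("strict") raises instead, surfacing the problem.
try:
    with open(bad_path, "r", encoding="utf8") as f:
        f.read()
    strict_read_failed = False
except UnicodeDecodeError:
    strict_read_failed = True

os.remove(bad_path)
```

Here the lossy read returns text with the accented character missing, which would then be indexed as if it were correct.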
[1]: https://forum.obsidian.md/t/better-handle-md-files-not-stored-in-utf8-format/13524/3
[2]: https://docs.python.org/3/library/functions.html#open
@@ -97,7 +97,7 @@ class MarkdownToJsonl(TextToJsonl):
         entries = []
         entry_to_file_map = []
         for markdown_file in markdown_files:
-            with open(markdown_file) as f:
+            with open(markdown_file, 'r', encoding='utf8') as f:
                 markdown_content = f.read()
                 markdown_entries_per_file = []
                 for entry in re.split(markdown_heading_regex, markdown_content, flags=re.MULTILINE):