mirror of
https://github.com/khoaliber/khoj.git
synced 2026-03-07 05:40:17 +00:00
Read Markdown file as utf8 instead of the default encoding used by OS
### Background
1. Obsidian stores markdown notes as `utf8`[1]
2. By default, Python's built-in `open` function uses the OS locale encoding[2]
### Issue
Given the background above, if the OS locale encoding isn't `utf8`, reading a markdown note can fail with the `UnicodeDecodeError: <locale_encoding> codec can't decode byte` error
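
The failure can be reproduced without changing the OS locale. The sketch below is illustrative only: `try_read` is a hypothetical helper, and `ascii` stands in for a non-`utf8` locale encoding because it rejects every non-ASCII byte deterministically.

```python
import os
import tempfile


def try_read(path, encoding):
    """Return (ok, result): the decoded text on success, the error message on failure."""
    try:
        with open(path, "r", encoding=encoding) as f:
            return True, f.read()
    except UnicodeDecodeError as e:
        return False, str(e)


with tempfile.TemporaryDirectory() as tmp:
    # A note with a non-ASCII character, stored as UTF-8 bytes (as Obsidian does).
    note_path = os.path.join(tmp, "note.md")
    with open(note_path, "wb") as f:
        f.write("# Café notes\n".encode("utf8"))

    # "ascii" plays the role of a locale encoding that is not utf8 (assumption).
    ascii_ok, ascii_result = try_read(note_path, "ascii")
    # Passing the encoding explicitly makes the read independent of the locale.
    utf8_ok, utf8_result = try_read(note_path, "utf8")

print(ascii_ok, ascii_result)
print(utf8_ok, repr(utf8_result))
```

Note that some locale encodings map most byte values to a character, so they may decode UTF-8 bytes into garbled text instead of raising.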
### Fix
- Read markdown files as `utf8`.
  The Obsidian plugin is currently the main use case for markdown files in khoj, and it stores md files as `utf8`.
  Do not assume `utf8` for other content types, such as org-mode or beancount, for now.
- Fail if a file cannot be read as `utf8`, instead of ignoring errors.
  It is better for users to realize that their files are not going to be indexed correctly.
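
The difference between failing and ignoring errors can be sketched as follows. This is a minimal illustration, not khoj's code: `read_markdown_strict` and `read_markdown_lenient` are hypothetical names, and the test file is deliberately written as Latin-1 so that it is not valid `utf8`.

```python
import os
import tempfile


def read_markdown_strict(path):
    # errors defaults to "strict": bytes that are not valid utf8 raise UnicodeDecodeError.
    with open(path, "r", encoding="utf8") as f:
        return f.read()


def read_markdown_lenient(path):
    # errors="ignore" silently drops undecodable bytes, so content is lost without warning.
    with open(path, "r", encoding="utf8", errors="ignore") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # Latin-1 encodes "é" as the single byte 0xE9, which is not valid utf8 here.
    bad_path = os.path.join(tmp, "latin1_note.md")
    with open(bad_path, "wb") as f:
        f.write("# Café\n".encode("latin-1"))

    lenient_text = read_markdown_lenient(bad_path)

    try:
        read_markdown_strict(bad_path)
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

print(repr(lenient_text), strict_failed)
```

With the lenient reader the heading loses its accented character and the entry would be indexed incorrectly; the strict reader surfaces the problem to the user instead.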
[1]: https://forum.obsidian.md/t/better-handle-md-files-not-stored-in-utf8-format/13524/3
[2]: https://docs.python.org/3/library/functions.html#open
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -24,13 +24,20 @@ jobs:
   test:
     name: Run Tests
     runs-on: ubuntu-latest
-    steps:
+    strategy:
+      fail-fast: false
+      matrix:
+        python_version:
+          - 3.8
+          - 3.9
+          - 3.10
+    steps:
       - uses: actions/checkout@v3
 
-      - name: Set up Python 3.10
+      - name: Set up Python
         uses: actions/setup-python@v4
         with:
-          python-version: '3.10'
+          python-version: ${{ matrix.python_version }}
 
       - name: Install Dependencies
         run: |
@@ -97,7 +97,7 @@ class MarkdownToJsonl(TextToJsonl):
         entries = []
         entry_to_file_map = []
         for markdown_file in markdown_files:
-            with open(markdown_file) as f:
+            with open(markdown_file, 'r', encoding='utf8') as f:
                 markdown_content = f.read()
                 markdown_entries_per_file = []
                 for entry in re.split(markdown_heading_regex, markdown_content, flags=re.MULTILINE):