From fbb7747dcc2d03a3b137a7d882412281409cee6c Mon Sep 17 00:00:00 2001
From: Debanjum Singh Solanky <debanjum@gmail.com>
Date: Mon, 6 Feb 2023 18:45:57 -0300
Subject: [PATCH] Read Markdown file as utf8 instead of the default encoding
 used by OS

- Background
  1. Obsidian stores markdown notes as utf8[1]
  2. By default, Python's built-in `open' function uses the OS locale encoding[2]

  When the locale encoding is not utf8, reading these notes can fail with
  `UnicodeDecodeError: <locale_encoding> codec can't decode byte'

- Fix
  - Read markdown files as utf8
    The Obsidian plugin is currently the main use-case for markdown
    files in khoj, and Obsidian stores md files as utf8.
    Do not assume utf8 for other content types, like org-mode and beancount, for now.
  - Fail if a file cannot be read as utf8, instead of ignoring decode errors.
    It is better for the user to realize that their files are not going to
    be indexed correctly.
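
  A minimal sketch of the behavior described above, not the khoj code itself.
  The names `read_markdown' and `write_bytes' are hypothetical helpers
  introduced only for illustration. It assumes `errors' is left at its
  default ('strict'), so bytes that are not valid utf8 raise an error
  instead of being dropped.

```python
import os
import tempfile


def read_markdown(path):
    """Read a markdown file as utf8; undecodable bytes raise UnicodeDecodeError."""
    # errors is left at its default ('strict'), so bad bytes are not silently ignored
    with open(path, 'r', encoding='utf8') as f:
        return f.read()


def write_bytes(directory, name, data):
    """Hypothetical helper: write raw bytes to a file and return its path."""
    path = os.path.join(directory, name)
    with open(path, 'wb') as f:
        f.write(data)
    return path


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as directory:
        # A utf8 encoded note is read the same way regardless of the OS locale
        note = write_bytes(directory, 'note.md', '# Café\n'.encode('utf8'))
        print(read_markdown(note))

        # A latin-1 encoded note is not valid utf8, so the read fails loudly
        legacy = write_bytes(directory, 'legacy.md', '# Café\n'.encode('latin-1'))
        try:
            read_markdown(legacy)
        except UnicodeDecodeError as error:
            print(f'Cannot index {legacy}: {error}')
```

  Passing errors='ignore' would make the second read succeed, but with the
  accented character dropped, which is the silent mis-indexing this change
  avoids.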

[1]: https://forum.obsidian.md/t/better-handle-md-files-not-stored-in-utf8-format/13524/3
[2]: https://docs.python.org/3/library/functions.html#open
---
 src/processor/markdown/markdown_to_jsonl.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/processor/markdown/markdown_to_jsonl.py b/src/processor/markdown/markdown_to_jsonl.py
index 9b326884..822cad0c 100644
--- a/src/processor/markdown/markdown_to_jsonl.py
+++ b/src/processor/markdown/markdown_to_jsonl.py
@@ -97,7 +97,7 @@ class MarkdownToJsonl(TextToJsonl):
         entries = []
         entry_to_file_map = []
         for markdown_file in markdown_files:
-            with open(markdown_file) as f:
+            with open(markdown_file, 'r', encoding='utf8') as f:
                 markdown_content = f.read()
                 markdown_entries_per_file = []
                 for entry in re.split(markdown_heading_regex, markdown_content, flags=re.MULTILINE):