Fix extracting Markdown Entries with Top Level Headings

- Previously, top level headings would get stripped of the
  space between the heading text and the prefix # symbols. That is,
  `# Top Level Heading' would get converted to `#Top Level Heading'
- This would mess up their rendering as a heading in search results

- Add unit tests to text_to_jsonl processors to prevent regression
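
A minimal sketch of how this kind of bug can arise, assuming entries are split on a `#` at the start of a line and the prefix is then re-added. The function names and the regex here are hypothetical illustrations, not the project's actual code:

```python
import re

# Hypothetical splitter (an assumption, not the project's implementation):
# split the text on a "#" found at the start of a line.
HEADING_START = re.compile(r"^#", flags=re.MULTILINE)


def split_entries_buggy(text):
    # strip() also removes the space left behind after the consumed "#",
    # so a top level heading loses the gap between "#" and its text.
    return ["#" + part.strip() for part in HEADING_START.split(text) if part.strip()]


def split_entries_fixed(text):
    # Strip only newlines, carriage returns and tabs, keeping the space
    # that separates the "#" prefix from the heading text.
    return ["#" + part.strip("\n\r\t") for part in HEADING_START.split(text) if part.strip()]


sample = "\n# Heading 1\n## Heading 2\n"
print(split_entries_buggy(sample))  # ['#Heading 1', '## Heading 2']
print(split_entries_fixed(sample))  # ['# Heading 1', '## Heading 2']
```

In this sketch only the top level heading is affected: for deeper headings the text after the split still begins with a `#`, so the space survives the strip.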
Debanjum Singh Solanky
2023-01-17 12:42:36 -03:00
parent 1a296518c5
commit 7b4f78776c
3 changed files with 44 additions and 4 deletions

@@ -103,6 +103,25 @@ def test_get_markdown_files(tmp_path):
    assert extracted_org_files == expected_files


def test_extract_entries_with_different_level_headings(tmp_path):
    "Extract markdown entries with different level headings."
    # Arrange
    entry = f'''
# Heading 1
## Heading 2
'''
    markdownfile = create_file(tmp_path, entry)

    # Act
    # Extract Entries from specified Markdown files
    entries, _ = MarkdownToJsonl.extract_markdown_entries(markdown_files=[markdownfile])

    # Assert
    assert len(entries) == 2
    assert entries[0] == "# Heading 1"
    assert entries[1] == "## Heading 2"


# Helper Functions
def create_file(tmp_path, entry=None, filename="test.md"):
    markdown_file = tmp_path / filename