Access internal links directly via a simple GET request

The other webpage scrapers will not work for internal webpages. Try
accessing those URLs directly if they are visible to the Khoj server
over the network.
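
A minimal sketch of what a direct scraper could look like, assuming only
the Python standard library. The names `read_webpage_directly` and
`html_to_text` are hypothetical and are not taken from the Khoj codebase;
the actual implementation in this commit may differ.

```python
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce an HTML document to its visible text, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def read_webpage_directly(url: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: fetch a URL with a plain GET and return its text.

    This only works if the URL is reachable from the machine running the code,
    which is the point for internal pages visible to the server.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

Calling `read_webpage_directly` with an intranet URL (for example a
placeholder such as `http://wiki.internal.example/page`) would return the
page text when the host resolves from the server, and raise
`urllib.error.URLError` otherwise.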

Only enable this by default for self-hosted, single-user setups.
Otherwise, the ability to scan the internal network would be a liability!
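
The gating described above can be sketched as follows. This is an
illustration only: `ServerState`, its fields, and `default_scrapers` are
hypothetical names, and the assumption that Jina is the baseline default
comes from the `default=WebScraperType.JINA` field in the diff.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ServerState:
    """Hypothetical stand-in for however the server describes its deployment."""

    self_hosted: bool
    multi_user: bool


def default_scrapers(state: ServerState, admin_configured: Sequence[str] = ()) -> List[str]:
    """Return the scraper types to try, in order.

    Scrapers explicitly configured by the admin always win. Otherwise the
    direct scraper is added only for self-hosted, single-user setups, so a
    shared server cannot be used to probe its internal network.
    """
    if admin_configured:
        return list(admin_configured)

    scrapers = ["jina"]  # assumed baseline default
    if state.self_hosted and not state.multi_user:
        scrapers.append("direct")
    return scrapers
```

With this sketch, a multi-user or hosted deployment gets the direct
scraper only if the admin lists it explicitly.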

For use-cases where it makes sense, the Khoj server admin can
explicitly add the direct webpage scraper via the admin panel.
Debanjum Singh Solanky
2024-10-16 02:57:51 -07:00
parent d94abba2dc
commit 20b6f0c2f4
4 changed files with 75 additions and 5 deletions

@@ -249,6 +249,7 @@ class WebScraper(BaseModel):
         FIRECRAWL = "firecrawl", gettext_lazy("Firecrawl")
         OLOSTEP = "olostep", gettext_lazy("Olostep")
         JINA = "jina", gettext_lazy("Jina")
+        DIRECT = "direct", gettext_lazy("Direct")
 
     name = models.CharField(max_length=200, default=None, null=True, blank=True, unique=True)
     type = models.CharField(max_length=20, choices=WebScraperType.choices, default=WebScraperType.JINA)