Migrate to using docusaurus, rather than docsify for documentation (#603)

* Add docusaurus documentation (to replace the docsify setup)
* Remove older docs
* Specify documentation as the gh pages build action working directory
2026-03-05 21:29:11 +00:00 · 2024-01-07 20:28:15 +05:30
parent 98081bc0d3
commit 9b991eb4fe
65 changed files with 15749 additions and 398 deletions
--- a/documentation/docs/miscellaneous/_category_.json
+++ b/documentation/docs/miscellaneous/_category_.json
@@ -0,0 +1,8 @@
+{
+  "label": "Miscellaneous",
+  "position": 6,
+  "link": {
+    "type": "generated-index",
+    "description": "Additional resources for learning about Khoj"
+  }
+}
--- a/documentation/docs/miscellaneous/advanced.md
+++ b/documentation/docs/miscellaneous/advanced.md
@@ -0,0 +1,32 @@
+---
+sidebar_position: 3
+---
+
+# Advanced Usage
+
+### Search across Different Languages (Self-Hosting)
+To search for notes in multiple, different languages, you can use a [multi-lingual model](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models).<br />
+For example, the [paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) model supports [50+ languages](https://www.sbert.net/docs/pretrained_models.html#:~:text=we%20used%20the%20following%2050%2B%20languages) and has good search quality and speed. To use it:
+1. Manually update the search config in the server's admin settings page. Go to [the search config](http://localhost:42110/server/admin/database/searchmodelconfig/). Either create a new one, if none exists, or update the existing one. Set the bi_encoder to `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` and the cross_encoder to `cross-encoder/ms-marco-MiniLM-L-6-v2`.
+2. Regenerate your content index from all the relevant clients. This step is very important, as you'll need to re-encode all your content with the new model.
+
+### Query Filters
+
+Use structured query syntax to filter the entries from your knowledge base that are used for search results or chat responses.
+
+- **Word Filter**: Get entries that include/exclude a specified term
+  - Entries that contain term_to_include: `+"term_to_include"`
+  - Entries that do not contain term_to_exclude: `-"term_to_exclude"`
+- **Date Filter**: Get entries containing dates in YYYY-MM-DD format from specified date (range)
+  - Entries from April 1st 1984: `dt:"1984-04-01"`
+  - Entries after March 31st 1984: `dt>="1984-04-01"`
+  - Entries before April 2nd 1984: `dt<="1984-04-01"`
+- **File Filter**: Get entries from a specified file
+  - Entries from incoming.org file: `file:"incoming.org"`
+- **Combined Example**
+  - `what is the meaning of life? file:"1984.org" dt>="1984-01-01" dt<="1985-01-01" -"big" -"brother"`
+  - Adds all filters to the natural language query. It should return entries
+    - from the file *1984.org*
+    - containing dates from the year *1984*
+    - excluding words *"big"* and *"brother"*
+    - that best match the natural language query *"what is the meaning of life?"*
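The filter syntax above can be made concrete with a small parser. This is a minimal sketch, not Khoj's own implementation: the regular expressions, the `ParsedQuery` class and the `parse_query` function are hypothetical names introduced here for illustration, and only the operators listed above (`+`, `-`, `dt:`, `dt>=`, `dt<=`, `file:`) are handled.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns mirroring the documented filter syntax.
# This is an illustration, not Khoj's own parser.
WORD_FILTER = re.compile(r'(?<!\S)([+-])"([^"]+)"')
DATE_FILTER = re.compile(r'\bdt(>=|<=|:)"(\d{4}-\d{2}-\d{2})"')
FILE_FILTER = re.compile(r'\bfile:"([^"]+)"')


@dataclass
class ParsedQuery:
    text: str                                     # natural language part of the query
    include: list = field(default_factory=list)   # terms that must appear
    exclude: list = field(default_factory=list)   # terms that must not appear
    dates: list = field(default_factory=list)     # (operator, date) pairs
    files: list = field(default_factory=list)     # file names to search in


def parse_query(query: str) -> ParsedQuery:
    """Split a query into its filters and the remaining natural language text."""
    parsed = ParsedQuery(text="")
    for sign, term in WORD_FILTER.findall(query):
        (parsed.include if sign == "+" else parsed.exclude).append(term)
    parsed.dates = DATE_FILTER.findall(query)
    parsed.files = FILE_FILTER.findall(query)
    # Strip every filter, then collapse the leftover whitespace.
    remainder = query
    for pattern in (WORD_FILTER, DATE_FILTER, FILE_FILTER):
        remainder = pattern.sub(" ", remainder)
    parsed.text = " ".join(remainder.split())
    return parsed
```

Calling `parse_query` on the combined example separates the natural language question from the file, date and word filters, which a search backend could then apply before ranking entries.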
--- a/documentation/docs/miscellaneous/credits.md
+++ b/documentation/docs/miscellaneous/credits.md
@@ -0,0 +1,13 @@
+---
+sidebar_position: 4
+---
+
+# Credits
+Many open source projects are used to power Khoj. Here are a few of them:
+
+- [Multi-QA MiniLM Model](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1), [All MiniLM Model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) for Text Search. See [SBert Documentation](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
+- [OpenAI CLIP Model](https://github.com/openai/CLIP) for Image Search. See [SBert Documentation](https://www.sbert.net/examples/applications/image-search/README.html)
+- Charles Cave for [OrgNode Parser](http://members.optusnet.com.au/~charles57/GTD/orgnode.html)
+- [Org.js](https://mooz.github.io/org-js/) to render Org-mode results on the Web interface
+- [Markdown-it](https://github.com/markdown-it/markdown-it) to render Markdown results on the Web interface
+- [GPT4All](https://github.com/nomic-ai/gpt4all) to chat with local LLM
--- a/documentation/docs/miscellaneous/performance.md
+++ b/documentation/docs/miscellaneous/performance.md
@@ -0,0 +1,25 @@
+---
+sidebar_position: 2
+---
+
+# Performance
+
+Here are some top-level performance metrics for Khoj. These are rough estimates and will vary based on your hardware and data.
+
+### Search performance
+
+- Semantic search using the bi-encoder is fairly fast at \<100 ms across all content types
+- Reranking using the cross-encoder is slower at \<2s on 15 results. Tweak `top_k` to trade off speed for accuracy of results
+- Filters in query (e.g., by file, word or date) usually add \<20ms to query latency
+
+### Indexing performance
+
+- Indexing time depends more strongly on the size of the source data than search latency does
+- Indexing a 100K+ line corpus of notes takes about 10 minutes
+- Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
+- Note: *It should only take this long on the first run* as the index is incrementally updated
+
+### Miscellaneous
+
+- Testing done on a Mac M1 and a \>100K line corpus of notes
+- Search and indexing on a GPU have not been tested yet
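The speed versus accuracy trade-off behind `top_k` in the search performance notes can be sketched with a toy two-stage ranker. This is an illustration only: it assumes `top_k` caps how many candidates reach the slower reranking stage, and `retrieve_and_rerank`, `word_overlap` and `phrase_bonus` are hypothetical stand-ins for the bi-encoder and cross-encoder, not Khoj's code.

```python
from typing import Callable, List, Tuple


def word_overlap(query: str, entry: str) -> float:
    """Cheap scorer (bi-encoder stand-in): number of shared words."""
    return float(len(set(query.lower().split()) & set(entry.lower().split())))


def phrase_bonus(query: str, entry: str) -> float:
    """Costly scorer (cross-encoder stand-in): also rewards an exact phrase match."""
    bonus = 5.0 if query.lower() in entry.lower() else 0.0
    return word_overlap(query, entry) + bonus


def retrieve_and_rerank(
    query: str,
    entries: List[str],
    cheap_score: Callable[[str, str], float],
    costly_score: Callable[[str, str], float],
    top_k: int,
) -> Tuple[List[str], int]:
    """Return the reranked candidates and how many entries the costly scorer saw."""
    # Stage 1: rank every entry with the cheap scorer and keep the best top_k.
    candidates = sorted(entries, key=lambda e: cheap_score(query, e), reverse=True)[:top_k]
    # Stage 2: rescore only those candidates with the costly scorer.
    reranked = sorted(candidates, key=lambda e: costly_score(query, e), reverse=True)
    return reranked, len(candidates)
```

A small `top_k` keeps the number of costly scoring calls low but can drop the entry the reranker would have preferred; a larger `top_k` gives the reranker more candidates at the price of more calls.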
--- a/documentation/docs/miscellaneous/telemetry.md
+++ b/documentation/docs/miscellaneous/telemetry.md
@@ -0,0 +1,22 @@
+---
+sidebar_position: 1
+---
+
+# Telemetry
+
+We collect some high-level, anonymized metadata about usage of Khoj. This includes:
+- Client (Web, Emacs, Obsidian)
+- API usage (Search, Chat)
+- Configured content types (GitHub, Org, etc.)
+- Request metadata (e.g., host, referrer)
+
+We don't send any personal information or any information from/about your content. We only send the above metadata. This helps us prioritize feature development and understand how people are using Khoj. Don't just take our word for it -- you can see [the code here](https://github.com/khoj-ai/khoj/tree/master/src/telemetry).
+
+## Disable Telemetry
+
+You can opt out of telemetry at any time. To do so:
+1. Open `~/.khoj/khoj.yml`
+2. Set `should-log-telemetry` to `false`
+3. Save the file and restart Khoj
+
+If you have any questions or concerns, please reach out to us on [Discord](https://discord.gg/BDgyabRM6e).
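The opt-out steps in the Disable Telemetry section can also be scripted. The sketch below is a hedged example rather than an official tool: `disable_telemetry` and `disable_telemetry_in_file` are hypothetical helper names, and the code rewrites whichever line holds the `should-log-telemetry` key without assuming where in `khoj.yml` that key is nested. Restarting Khoj afterwards is still required.

```python
import re
from pathlib import Path

# Matches a line holding the `should-log-telemetry` key at any indentation.
# Group 1 is everything up to the value, group 2 is anything after the value
# (for example a trailing comment).
TELEMETRY_LINE = re.compile(
    r"^([ \t]*should-log-telemetry[ \t]*:[ \t]*)\S+(.*)$", re.MULTILINE
)


def disable_telemetry(config_text: str) -> str:
    """Return config_text with should-log-telemetry set to false."""
    return TELEMETRY_LINE.sub(r"\g<1>false\g<2>", config_text)


def disable_telemetry_in_file(path: Path) -> None:
    """Apply disable_telemetry to a config file in place (hypothetical helper)."""
    path.write_text(disable_telemetry(path.read_text()))
```

For the default location mentioned above, the call would be `disable_telemetry_in_file(Path.home() / ".khoj" / "khoj.yml")`. If the key is absent, the file content is left unchanged.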