creation

2026-04-19 09:05:01 +00:00 · 2026-03-15 12:00:53 +08:00
parent 084884ba67
commit 8e3769f606
1 changed files with 766 additions and 0 deletions
--- a/OpenAI-13964/readme-13964.md
+++ b/OpenAI-13964/readme-13964.md
@@ -0,0 +1,766 @@
+Scrape and ingest web pages into a Pinecone RAG stack with Firecrawl and OpenAI
+
+https://n8nworkflows.xyz/workflows/scrape-and-ingest-web-pages-into-a-pinecone-rag-stack-with-firecrawl-and-openai-13964
+
+
+# Scrape and ingest web pages into a Pinecone RAG stack with Firecrawl and OpenAI
+
+## 1. Workflow Overview
+
+This workflow serves two related purposes:
+
+1. **Ingestion pipeline:** accept a URL through an HTTP webhook, validate and normalize it, scrape the page content with Firecrawl, generate OpenAI embeddings, and store the resulting vectors in a Pinecone index.
+2. **RAG chat interface:** expose a chat entry point that lets users query the indexed knowledge base using an agent backed by an OpenRouter chat model, Pinecone retrieval, OpenAI embeddings, and Cohere reranking.
+
+The workflow is organized into the following logical blocks.
+
+### 1.1 Ingestion Entry and URL Validation
+Receives a POST request containing a URL, checks whether the input is present and valid, and either forwards a normalized URL for scraping or returns a validation error response.
+
+### 1.2 Web Scraping and Vector Ingestion
+Uses Firecrawl to scrape the target page into structured content, transforms that content into documents, generates embeddings, and inserts them into Pinecone.
+
+### 1.3 Ingestion Response Handling
+Returns the final HTTP response to the original webhook caller after ingestion completes, or a 422 response when validation fails.
+
+### 1.4 Chat-Based Retrieval Interface
+Provides a second entry point through an n8n chat trigger, then routes the user query into an agent that can retrieve knowledge from Pinecone.
+
+### 1.5 LLM, Memory, Retrieval, Embeddings, and Reranking
+Supplies the chat agent with its required supporting components: OpenRouter as the language model, buffer memory for conversation continuity, Pinecone retrieval as a tool, OpenAI embeddings for query vectorization, and Cohere reranking for retrieval quality improvement.
+
+---
+
+## 2. Block-by-Block Analysis
+
+## 2.1 Ingestion Entry and URL Validation
+
+**Overview:**  
+This block accepts a URL via webhook and ensures it is present and syntactically acceptable before any external API call is made. It is the main safeguard against invalid ingestion requests.
+
+**Nodes Involved:**  
+- Receive URL
+- Validate and normalize URL
+- Return URL validation error
+
+### Node Details
+
+#### Receive URL
+- **Type and technical role:** `n8n-nodes-base.webhook`  
+  HTTP entry point for ingestion requests.
+- **Configuration choices:**  
+  - HTTP method: `POST`
+  - Response mode: `responseNode`, meaning a dedicated Respond to Webhook node must send the HTTP response.
+  - Webhook path is a generated UUID-like path.
+- **Key expressions or variables used:**  
+  None internally; downstream nodes read `body.url`.
+- **Input and output connections:**  
+  - No input
+  - Outputs to: `Validate and normalize URL`
+- **Version-specific requirements:**  
+  Uses webhook node version `2.1`.
+- **Edge cases or potential failure types:**  
+  - If the workflow is inactive, production webhook calls will fail.
+  - If the caller sends malformed JSON or omits `url`, downstream validation will reject it.
+  - Because the response is sent by a separate node, any execution path that never reaches a Respond to Webhook node can leave the request hanging until it times out.
+- **Sub-workflow reference:**  
+  None
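As a rough illustration of the request this node expects, the sketch below builds the arguments for a `fetch()` call without sending anything. The helper name `buildIngestRequest` and the webhook URL are placeholders, not values from the workflow; the only detail taken from the source is the `{"url": ...}` JSON body sent via `POST`.

```javascript
// Hypothetical helper: returns [url, options] suitable for fetch(url, options).
function buildIngestRequest(webhookUrl, pageUrl) {
  return [
    webhookUrl,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // The workflow reads body.url downstream, so the payload carries one key.
      body: JSON.stringify({ url: pageUrl }),
    },
  ];
}

// Placeholder webhook URL; use the production URL n8n displays for the node.
const [target, options] = buildIngestRequest(
  "https://n8n.example.com/webhook/<your-webhook-path>",
  "https://example.com"
);
console.log(target, options.body);
// To actually send it: const res = await fetch(target, options);
```

Keeping the request construction separate from the network call makes the payload shape easy to inspect before the workflow is activated.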
+
+#### Validate and normalize URL
+- **Type and technical role:** `n8n-nodes-base.code`  
+  Custom JavaScript validation and normalization layer.
+- **Configuration choices:**  
+  - `onError` is set to `continueErrorOutput`, so an item that throws is routed to the node's error output instead of stopping the workflow.
+  - Reads the first incoming item’s request body.
+  - Normalizes the URL by stripping protocol and path, validating only the domain, then rebuilding a normalized `https://domain` URL.
+- **Key expressions or variables used:**  
+  - `const body = $input.first().json.body;`
+  - `const raw = body?.url?.trim();`
+  - Domain extraction:
+    - remove protocol with `^https?://`
+    - remove path with `/.*$`
+  - Validation regex:
+    - `/^[a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]{2,})+$/`
+- **Input and output connections:**  
+  - Input from: `Receive URL`
+  - Outputs to:
+    - `Scrape page with Firecrawl`
+    - `Return URL validation error`
+- **Version-specific requirements:**  
+  Uses code node version `2`.
+- **Edge cases or potential failure types:**  
+  - Missing `url` returns a JSON object with `status: 422`, but does not throw.
+  - Invalid domains throw an error.
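+  - The regex validates bare hostnames only. It allows digits and hyphens in the first label but letters only in every later label, so hosts such as `www.my-site.com` or `api.v2.example.com` are rejected, along with IP addresses, `localhost`, hosts with a port, punycode TLDs, and a query string attached directly to the host.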
+  - Any submitted path is discarded intentionally.
+  - The connection list shows both downstream nodes attached to this node. With `continueErrorOutput`, the node has a success output and an error output, so confirm after import which output feeds each downstream node.
+- **Sub-workflow reference:**  
+  None
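The validation steps listed for this node can be sketched as a standalone function. This is a reconstruction from the expressions quoted above, not the workflow's actual code: the function name, the error text, and the exact shape of the returned objects are assumptions, and inside an n8n Code node the result would need to be wrapped as `[{ json: ... }]`.

```javascript
// Regex quoted in the node description above.
const DOMAIN_RE = /^[a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]{2,})+$/;

// Hypothetical standalone version of the node logic.
function validateAndNormalize(body) {
  const raw = body?.url?.trim();
  if (!raw) {
    // Missing URL: return a payload rather than throwing (message text is assumed).
    return { status: 422, message: "Missing url" };
  }
  // Strip the protocol, then everything from the first slash onward.
  const domain = raw.replace(/^https?:\/\//, "").replace(/\/.*$/, "");
  if (!DOMAIN_RE.test(domain)) {
    // Invalid domains throw, which n8n would route to the error output.
    throw new Error(`Invalid domain: ${domain}`);
  }
  return { url: `https://${domain}` };
}

console.log(validateAndNormalize({ url: " http://example.com/a/b?x=1 " }));
console.log(validateAndNormalize({}));
```

Note that the path is dropped on purpose, so a request for a deep link is ingested as the site's root page.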
+
+#### Return URL validation error
+- **Type and technical role:** `n8n-nodes-base.respondToWebhook`  
+  Sends an HTTP 422 response back to the ingestion caller.
+- **Configuration choices:**  
+  - Response code: `422`
+  - Response key set from `={{ $json.error }}`
+- **Key expressions or variables used:**  
+  - `{{$json.error}}`
+- **Input and output connections:**  
+  - Input from: `Validate and normalize URL`
+  - No output
+- **Version-specific requirements:**  
+  Uses Respond to Webhook version `1.5`.
+- **Edge cases or potential failure types:**  
+  - The code node returns `message` for missing URL, but this response node reads `error`; this mismatch may produce an empty or incorrect response body.
+  - If the code node succeeds, this node may still be reachable depending on flow behavior and should be tested.
+  - If the webhook execution has already been answered elsewhere, n8n may reject a second response.
+- **Sub-workflow reference:**  
+  None
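One possible way to work around the `message` versus `error` mismatch noted above is to build the response body from whichever key is present. The helper below is hypothetical and only illustrates the fallback order; the default text is an assumption.

```javascript
// Hypothetical helper, not part of the workflow: prefer `error`, fall back to `message`.
function validationErrorBody(item) {
  const text = item?.error ?? item?.message ?? "Invalid or missing url";
  return { error: text };
}

console.log(validationErrorBody({ status: 422, message: "Missing url" }));
console.log(validationErrorBody({ error: "Invalid domain: localhost" }));
```

The simpler alternative is to make the Code node always emit an `error` key, so that the response node's existing expression keeps working unchanged.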
+
+---
+
+## 2.2 Web Scraping and Vector Ingestion
+
+**Overview:**  
+This block takes a normalized URL, scrapes the page with Firecrawl, prepares the scraped content as documents with metadata, generates embeddings through OpenAI, and inserts everything into Pinecone.
+
+**Nodes Involved:**  
+- Scrape page with Firecrawl
+- Generate OpenAI embeddings
+- Load scraped content
+- Store embeddings in Pinecone
+
+### Node Details
+
+#### Scrape page with Firecrawl
+- **Type and technical role:** `@mendable/n8n-nodes-firecrawl.firecrawl`  
+  External scraping connector that fetches the target page and converts it into clean content.
+- **Configuration choices:**  
+  - Operation: `scrape`
+  - URL sourced from the validation node
+  - Scrape options keep the default output format, which the workflow's sticky note describes as clean markdown.
+- **Key expressions or variables used:**  
+  - `={{ $('Validate and normalize URL').item.json.url }}`
+- **Input and output connections:**  
+  - Input from: `Validate and normalize URL`
+  - Output to: `Store embeddings in Pinecone`
+- **Version-specific requirements:**  
+  Uses Firecrawl node version `1`.
+- **Edge cases or potential failure types:**  
+  - Firecrawl authentication failure
+  - Unreachable site, rate limiting, blocked scraping, or timeout
+  - Target page may return empty or partial content
+  - Some sites may require JavaScript rendering or anti-bot handling beyond this configuration
+- **Sub-workflow reference:**  
+  None
+
+#### Generate OpenAI embeddings
+- **Type and technical role:** `@n8n/n8n-nodes-langchain.embeddingsOpenAi`  
+  Embedding model provider used by the Pinecone vector store node during ingestion.
+- **Configuration choices:**  
+  - No explicit model is set, so the model is whatever this node version uses by default.
+  - Sticky note indicates Pinecone must be configured for `text-embedding-3-small` with 1536 dimensions.
+- **Key expressions or variables used:**  
+  None
+- **Input and output connections:**  
+  - No main input
+  - AI embedding output to: `Store embeddings in Pinecone`
+- **Version-specific requirements:**  
+  Uses embeddings node version `1.2`.
+- **Edge cases or potential failure types:**  
+  - OpenAI credential errors
+  - Model/dimension mismatch with Pinecone index
+  - Token or content-size limitations if documents are very large and chunking is not handled elsewhere
+- **Sub-workflow reference:**  
+  None
+
+#### Load scraped content
+- **Type and technical role:** `@n8n/n8n-nodes-langchain.documentDefaultDataLoader`  
+  Converts the incoming scrape result into LangChain-compatible document objects and enriches them with metadata.
+- **Configuration choices:**  
+  - Adds metadata field `url`
+  - URL metadata references the normalized URL from the validation step
+- **Key expressions or variables used:**  
+  - `={{ $('Validate and normalize URL').item.json.url }}`
+- **Input and output connections:**  
+  - No main input; the node attaches to the vector store as an AI document sub-node.
+  - AI document output to: `Store embeddings in Pinecone`
+- **Version-specific requirements:**  
+  Uses document loader version `1.1`.
+- **Edge cases or potential failure types:**  
+  - If Firecrawl output shape is incompatible or empty, document extraction may fail or produce no documents.
+  - Metadata references the validated base URL, not necessarily the final redirected URL.
+- **Sub-workflow reference:**  
+  None
+
+#### Store embeddings in Pinecone
+- **Type and technical role:** `@n8n/n8n-nodes-langchain.vectorStorePinecone`  
+  Central vector store node that inserts documents and embeddings into Pinecone.
+- **Configuration choices:**  
+  - Mode: `insert`
+  - Target Pinecone index: `firecrawl`
+  - Receives:
+    - main input from scrape result
+    - document input from the loader
+    - embedding model from OpenAI embeddings node
+- **Key expressions or variables used:**  
+  None directly in the node config beyond selected resource list values.
+- **Input and output connections:**  
+  - Main input from: `Scrape page with Firecrawl`
+  - AI document input from: `Load scraped content`
+  - AI embedding input from: `Generate OpenAI embeddings`
+  - Main output to: `Return ingestion result`
+- **Version-specific requirements:**  
+  Uses Pinecone vector store node version `1.3`.
+- **Edge cases or potential failure types:**  
+  - Pinecone auth or index-not-found errors
+  - Index dimension mismatch
+  - Region/environment mismatch in credentials
+  - Duplicate content ingestion if same page is sent repeatedly and no deduplication strategy exists
+  - If scrape content is too large and no chunking occurs, insert behavior may be suboptimal
+- **Sub-workflow reference:**  
+  None
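The description above flags large, unchunked content as a risk. A minimal sketch of fixed-size chunking with overlap is shown below to make the idea concrete. It is not what the workflow does; the function name and the default sizes are illustrative assumptions, and n8n's own text splitter sub-nodes would likely be the more natural fit inside the workflow.

```javascript
// Hypothetical chunker: splits text into windows of `size` characters,
// each sharing `overlap` characters with the previous window.
function chunkText(text, size = 1000, overlap = 200) {
  if (overlap >= size) {
    throw new Error("overlap must be smaller than size");
  }
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    // Stop once the current window already reaches the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}

console.log(chunkText("abcdefghij", 4, 1));
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.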
+
+---
+
+## 2.3 Ingestion Response Handling
+
+**Overview:**  
+This block sends the HTTP response for successful ingestion. It closes the webhook request lifecycle after Pinecone insertion completes.
+
+**Nodes Involved:**  
+- Return ingestion result
+
+### Node Details
+
+#### Return ingestion result
+- **Type and technical role:** `n8n-nodes-base.respondToWebhook`  
+  Sends a success response to the original ingestion webhook caller.
+- **Configuration choices:**  
+  - HTTP status code: `200`
+  - Response type: JSON
+  - Executes once
+  - Response body contains a message with the number of items processed
+- **Key expressions or variables used:**  
+  - `{{$input.all().length}}`
+  - Response body text: `"Added {{$input.all().length}} items to Supabase"`
+- **Input and output connections:**  
+  - Input from: `Store embeddings in Pinecone`
+  - No output
+- **Version-specific requirements:**  
+  Uses Respond to Webhook version `1.5`.
+- **Edge cases or potential failure types:**  
+  - The message says **Supabase**, but the workflow stores data in **Pinecone**. This is a bug in the response text and should be corrected.
+  - If no items are inserted, response may still claim success with a low count.
+  - If upstream fails and no error handler exists, the caller may get a generic execution failure instead.
+- **Sub-workflow reference:**  
+  None
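A corrected response body could look like the sketch below. The function name and the `message` key are assumptions; only the wording pattern "Added N items to ..." comes from the node configuration described above.

```javascript
// Hypothetical helper: builds the success payload with the correct store name.
function ingestionResponse(count) {
  return { message: `Added ${count} items to Pinecone` };
}

console.log(ingestionResponse(3));
```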
+
+---
+
+## 2.4 Chat-Based Retrieval Interface
+
+**Overview:**  
+This block provides an interactive chat entry point. It forwards incoming user messages into a tool-using AI agent connected to the indexed knowledge base.
+
+**Nodes Involved:**  
+- Receive chat message
+- Answer query from knowledge base
+
+### Node Details
+
+#### Receive chat message
+- **Type and technical role:** `@n8n/n8n-nodes-langchain.chatTrigger`  
+  Entry point for chat conversations.
+- **Configuration choices:**  
+  - Uses default options
+  - Generates a separate webhook/chat endpoint for interactive use
+- **Key expressions or variables used:**  
+  None
+- **Input and output connections:**  
+  - No input
+  - Main output to: `Answer query from knowledge base`
+- **Version-specific requirements:**  
+  Uses chat trigger version `1.4`.
+- **Edge cases or potential failure types:**  
+  - Workflow must be active for production chat endpoint use.
+  - If chat payload format differs from what the trigger expects, execution may fail.
+- **Sub-workflow reference:**  
+  None
+
+#### Answer query from knowledge base
+- **Type and technical role:** `@n8n/n8n-nodes-langchain.agent`  
+  AI agent that handles user questions and can call the Pinecone retrieval tool.
+- **Configuration choices:**  
+  - Uses default options
+  - Wired to:
+    - language model
+    - memory
+    - retrieval tool
+- **Key expressions or variables used:**  
+  None explicitly configured.
+- **Input and output connections:**  
+  - Main input from: `Receive chat message`
+  - AI language model input from: `OpenRouter LLM`
+  - AI memory input from: `Chat memory`
+  - AI tool input from: `Retrieve documents from Pinecone`
+- **Version-specific requirements:**  
+  Uses agent node version `3.1`.
+- **Edge cases or potential failure types:**  
+  - If the retrieval tool is unavailable, the agent may answer poorly or fail.
+  - LLM auth/model issues from OpenRouter can stop the chain.
+  - Poor retrieval quality may occur if the Pinecone index is empty or embeddings are inconsistent with ingestion.
+- **Sub-workflow reference:**  
+  None
+
+---
+
+## 2.5 LLM, Memory, Retrieval, Embeddings, and Reranking
+
+**Overview:**  
+This support block equips the chat agent with retrieval-augmented generation capabilities. It provides the language model, short-term memory, query embedding, vector search, and reranking.
+
+**Nodes Involved:**  
+- OpenRouter LLM
+- Chat memory
+- Retrieve documents from Pinecone
+- Generate OpenAI embeddings1
+- Rerank results with Cohere
+
+### Node Details
+
+#### OpenRouter LLM
+- **Type and technical role:** `@n8n/n8n-nodes-langchain.lmChatOpenRouter`  
+  Chat model used by the agent for final answer generation.
+- **Configuration choices:**  
+  - Model: `anthropic/claude-sonnet-4.6`
+- **Key expressions or variables used:**  
+  None
+- **Input and output connections:**  
+  - AI language model output to: `Answer query from knowledge base`
+- **Version-specific requirements:**  
+  Uses node version `1`.
+- **Edge cases or potential failure types:**  
+  - OpenRouter credential issues
+  - Model availability changes
+  - Provider-side rate limits or latency
+- **Sub-workflow reference:**  
+  None
+
+#### Chat memory
+- **Type and technical role:** `@n8n/n8n-nodes-langchain.memoryBufferWindow`  
+  Maintains recent conversation state for the agent.
+- **Configuration choices:**  
+  - Defaults are used; no explicit memory window parameters are shown.
+- **Key expressions or variables used:**  
+  None
+- **Input and output connections:**  
+  - AI memory output to: `Answer query from knowledge base`
+- **Version-specific requirements:**  
+  Uses node version `1.3`.
+- **Edge cases or potential failure types:**  
+  - Default memory window size may be too small or too large depending on use case.
+  - Buffer window memory is keyed by a session ID. If the caller does not supply a stable session ID (the chat trigger normally provides one), earlier turns may not be recalled.
+- **Sub-workflow reference:**  
+  None
+
+#### Retrieve documents from Pinecone
+- **Type and technical role:** `@n8n/n8n-nodes-langchain.vectorStorePinecone`  
+  Exposes Pinecone search as a callable tool for the agent.
+- **Configuration choices:**  
+  - Mode: `retrieve-as-tool`
+  - Pinecone index: `firecrawl`
+  - Tool description: `Retrieve data for the AI Agent.`
+  - Reranking enabled
+- **Key expressions or variables used:**  
+  None directly.
+- **Input and output connections:**  
+  - AI embedding input from: `Generate OpenAI embeddings1`
+  - AI reranker input from: `Rerank results with Cohere`
+  - AI tool output to: `Answer query from knowledge base`
+- **Version-specific requirements:**  
+  Uses Pinecone vector store version `1.3`.
+- **Edge cases or potential failure types:**  
+  - Pinecone auth/index errors
+  - Empty index leads to irrelevant or no results
+  - Query embedding model must remain compatible with indexed vectors
+  - Tool description is minimal; richer guidance can improve agent tool usage behavior
+- **Sub-workflow reference:**  
+  None
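To illustrate why the query embedding must match the indexed vectors, here is a small sketch of cosine similarity (the metric named in the Pinecone setup note) and a naive top-k search over in-memory vectors. It is a conceptual model only: Pinecone performs this search server-side, and the function names and data shapes below are assumptions.

```javascript
// Cosine similarity between two equal-length numeric arrays.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    // Mirrors the dimension-mismatch failure mode described above.
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory search: scores every document and keeps the best k.
function topK(query, docs, k) {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const docs = [
  { id: "a", vector: [0, 1] },
  { id: "b", vector: [1, 0] },
];
console.log(topK([1, 0], docs, 1));
```

A reranker such as the Cohere node then reorders the shortlisted results using the query and document text rather than the vectors.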
+
+#### Generate OpenAI embeddings1
+- **Type and technical role:** `@n8n/n8n-nodes-langchain.embeddingsOpenAi`  
+  Embeds the user query before retrieval from Pinecone.
+- **Configuration choices:**  
+  - Default options only
+- **Key expressions or variables used:**  
+  None
+- **Input and output connections:**  
+  - AI embedding output to: `Retrieve documents from Pinecone`
+- **Version-specific requirements:**  
+  Uses embeddings node version `1.2`.
+- **Edge cases or potential failure types:**  
+  - Must match Pinecone vector dimension expectations
+  - Credential and rate-limit failures are possible
+- **Sub-workflow reference:**  
+  None
+
+#### Rerank results with Cohere
+- **Type and technical role:** `@n8n/n8n-nodes-langchain.rerankerCohere`  
+  Reranks the retrieved Pinecone results before they are returned to the agent.
+- **Configuration choices:**  
+  - Default settings
+- **Key expressions or variables used:**  
+  None
+- **Input and output connections:**  
+  - AI reranker output to: `Retrieve documents from Pinecone`
+- **Version-specific requirements:**  
+  Uses reranker node version `1`.
+- **Edge cases or potential failure types:**  
+  - Cohere auth or quota errors
+  - Added latency on retrieval
+  - If result set is very small, reranking may have limited impact
+- **Sub-workflow reference:**  
+  None
+
+---
+
+## 3. Summary Table
+
+| Node Name | Node Type | Functional Role | Input Node(s) | Output Node(s) | Sticky Note |
+|---|---|---|---|---|---|
+| Sticky Note | n8n-nodes-base.stickyNote | Documentation note for overall workflow behavior and setup |  |  | ## How it works<br>1. A webhook receives a URL via POST request<br>2. The URL is validated and normalized, returning a 422 error if invalid<br>3. Firecrawl scrapes the page and converts it to clean markdown<br>4. OpenAI generates 1536-dimensional vector embeddings from the content<br>5. The content and embeddings are stored in Pinecone<br>6. A built-in RAG chat agent lets you query the knowledge base using natural language, with Cohere reranking for better retrieval<br>## Setup steps<br>1. Create a Pinecone index with the settings from the "Pinecone setup" sticky<br>2. Add your Firecrawl API key<br>3. Add your OpenAI API key (for embeddings)<br>4. Add your OpenRouter API key (for the chat agent)<br>5. Add your Cohere API key (for reranking)<br>6. Activate the workflow and send a POST request with `{"url": "https://example.com"}` to the webhook |
+| Receive URL | n8n-nodes-base.webhook | Receives POST ingestion requests |  | Validate and normalize URL | Same note as the Sticky Note row ("How it works" and "Setup steps") |
+| Validate and normalize URL | n8n-nodes-base.code | Validates request payload and normalizes domain into https URL | Receive URL | Scrape page with Firecrawl; Return URL validation error | Same note as the Sticky Note row ("How it works" and "Setup steps") |
+| Scrape page with Firecrawl | @mendable/n8n-nodes-firecrawl.firecrawl | Scrapes and converts page content for ingestion | Validate and normalize URL | Store embeddings in Pinecone | Same note as the Sticky Note row ("How it works" and "Setup steps") |
+| Return URL validation error | n8n-nodes-base.respondToWebhook | Returns 422 response for invalid input | Validate and normalize URL |  |  |
+| Store embeddings in Pinecone | @n8n/n8n-nodes-langchain.vectorStorePinecone | Inserts documents and vectors into Pinecone | Scrape page with Firecrawl; Load scraped content; Generate OpenAI embeddings | Return ingestion result |  |
+| Return ingestion result | n8n-nodes-base.respondToWebhook | Returns success response after ingestion | Store embeddings in Pinecone |  |  |
+| Generate OpenAI embeddings | @n8n/n8n-nodes-langchain.embeddingsOpenAi | Provides embedding model for ingestion |  | Store embeddings in Pinecone |  |
+| Load scraped content | @n8n/n8n-nodes-langchain.documentDefaultDataLoader | Converts scrape output into documents with metadata |  | Store embeddings in Pinecone |  |
+| Receive chat message | @n8n/n8n-nodes-langchain.chatTrigger | Receives user chat requests for RAG querying |  | Answer query from knowledge base |  |
+| Answer query from knowledge base | @n8n/n8n-nodes-langchain.agent | Runs the chat agent with tool access | Receive chat message; OpenRouter LLM; Chat memory; Retrieve documents from Pinecone |  |  |
+| OpenRouter LLM | @n8n/n8n-nodes-langchain.lmChatOpenRouter | Chat model used by the RAG agent |  | Answer query from knowledge base |  |
+| Chat memory | @n8n/n8n-nodes-langchain.memoryBufferWindow | Maintains short conversation history |  | Answer query from knowledge base |  |
+| Retrieve documents from Pinecone | @n8n/n8n-nodes-langchain.vectorStorePinecone | Retrieval tool for knowledge base search | Generate OpenAI embeddings1; Rerank results with Cohere | Answer query from knowledge base |  |
+| Generate OpenAI embeddings1 | @n8n/n8n-nodes-langchain.embeddingsOpenAi | Embeds user queries for retrieval |  | Retrieve documents from Pinecone |  |
+| Rerank results with Cohere | @n8n/n8n-nodes-langchain.rerankerCohere | Reranks retrieved documents |  | Retrieve documents from Pinecone |  |
+| Sticky Note1 | n8n-nodes-base.stickyNote | Documentation note for Pinecone configuration |  |  | **Pinecone setup:** your Pinecone index must use 1536 dimensions to match the `text-embedding-3-small` OpenAI model. 1. Go to your Pinecone console and open your index settings. 2. Select `text-embedding-3-small` as the embedding model. 3. Confirm these settings: Modality = Text; Vector type = Dense; Dimension = 1536; Metric = cosine. |
+
+---
+
+## 4. Reproducing the Workflow from Scratch
+
+1. **Create a new workflow** in n8n and name it something like:  
+   `Scrape and ingest web content into Pinecone with Firecrawl`.
+
+2. **Add a Webhook node** named `Receive URL`.
+   - Node type: `Webhook`
+   - HTTP method: `POST`
+   - Response mode: `Using Respond to Webhook node`
+   - Leave options default unless you need authentication or custom response settings.
+   - This node will be the ingestion entry point.
+
+3. **Add a Code node** named `Validate and normalize URL`.
+   - Connect `Receive URL` → `Validate and normalize URL`.
+   - In the node settings, set **On Error** to `Continue (using error output)` so that invalid input can be routed to a response node instead of stopping the workflow.
+   - Paste logic equivalent to:
+     - read `body.url`
+     - trim the string
+     - if missing, return status `422` and an explanatory message
+     - strip protocol and path
+     - validate domain format
+     - if invalid, throw an error
+     - if valid, return:
+       - `status: 200`
+       - `domain`
+       - `url: https://<domain>`
+
+4. **Implement this behavior in the Code node's JavaScript:**
+   - Read from: first input item, request body
+   - Normalize:
+     - `https://example.com/path?q=1` → `https://example.com`
+     - `firecrawl.dev` → `https://firecrawl.dev`
+   - Domain regex should only accept standard hostnames.
+
+5. **Add a Respond to Webhook node** named `Return URL validation error`.
+   - Connect it from `Validate and normalize URL`.
+   - Response code: `422`
+   - Configure it to return an error field or message field from the incoming JSON.
+   - Recommended improvement: return `{{$json.message || $json.error || 'Invalid URL'}}`
+   - This avoids the field mismatch in the source workflow, whose response node does not read the `message` field.
+
+6. **Add the Firecrawl node** named `Scrape page with Firecrawl`.
+   - Connect `Validate and normalize URL` → `Scrape page with Firecrawl`.
+   - Node type: Firecrawl
+   - Operation: `scrape`
+   - URL field expression:
+     - `{{$('Validate and normalize URL').item.json.url}}`
+   - Keep scrape format set to markdown/default text output.
+   - Create and attach a **Firecrawl API credential**.
+
+7. **Create Firecrawl credentials.**
+   - In n8n credentials, add your Firecrawl API key.
+   - Assign it to the Firecrawl node.
+   - Test against a known public page.
+
+8. **Add an OpenAI Embeddings node** named `Generate OpenAI embeddings`.
+   - This node will provide embeddings for ingestion into Pinecone.
+   - Use OpenAI credentials.
+   - If the model can be chosen explicitly, set it to `text-embedding-3-small` to match the stated Pinecone dimension requirement.
+
+9. **Create OpenAI credentials.**
+   - Add an OpenAI API credential in n8n.
+   - Attach it to both embeddings nodes used in this workflow.
+
+10. **Add a Document Default Data Loader node** named `Load scraped content`.
+    - Configure metadata to include:
+      - `url` = `{{$('Validate and normalize URL').item.json.url}}`
+    - This node converts scrape content into documents suitable for vector storage.
+
+11. **Add a Pinecone Vector Store node** named `Store embeddings in Pinecone`.
+    - Mode: `insert`
+    - Select your Pinecone index: `firecrawl` or another equivalent index
+    - Connect:
+      - main input from `Scrape page with Firecrawl`
+      - AI embedding input from `Generate OpenAI embeddings`
+      - AI document input from `Load scraped content`
+
+12. **Create Pinecone credentials.**
+    - Add a Pinecone API credential in n8n.
+    - Ensure region/environment matches your Pinecone project.
+    - Attach credentials to the Pinecone node.
+
+13. **Prepare the Pinecone index before running ingestion.**
+    - Create an index named `firecrawl` or update the workflow to your preferred index name.
+    - Required characteristics:
+      - Modality: Text
+      - Vector type: Dense
+      - Dimension: 1536
+      - Metric: cosine
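+    - The dimension must be 1536 because `text-embedding-3-small` produces 1536-dimensional vectors by default; an index with a different dimension will reject the inserts.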
+
+14. **Add a Respond to Webhook node** named `Return ingestion result`.
+    - Connect `Store embeddings in Pinecone` → `Return ingestion result`.
+    - Response code: `200`
+    - Respond with JSON.
+    - Recommended response body:
+      - `{"message":"Added {{$input.all().length}} items to Pinecone"}`
+    - This corrects the inaccurate “Supabase” message from the source workflow.
+
+15. **Add a Chat Trigger node** named `Receive chat message`.
+    - This creates the chat entry point for querying the knowledge base.
+    - Keep default settings unless you need session customization.
+
+16. **Add an AI Agent node** named `Answer query from knowledge base`.
+    - Connect `Receive chat message` → `Answer query from knowledge base`.
+    - Leave default options unless you want a custom system prompt.
+    - This node will orchestrate retrieval and answer generation.
+
+17. **Add an OpenRouter Chat Model node** named `OpenRouter LLM`.
+    - Connect its AI language model output to `Answer query from knowledge base`.
+    - Set model to:
+      - `anthropic/claude-sonnet-4.6`
+    - Create and attach an OpenRouter API credential.
+
+18. **Create OpenRouter credentials.**
+    - Add your OpenRouter API key in n8n.
+    - Attach it to the `OpenRouter LLM` node.
+
+19. **Add a Buffer Window Memory node** named `Chat memory`.
+    - Connect its AI memory output to `Answer query from knowledge base`.
+    - Defaults are acceptable initially.
+    - Optionally tune the memory window later based on chat length.
+
+20. **Add a second OpenAI Embeddings node** named `Generate OpenAI embeddings1`.
+    - This node is used for retrieval query embeddings.
+    - Attach the same OpenAI credentials.
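+    - Use the same embedding model as the ingestion node (`text-embedding-3-small` if you set it in step 8). Query vectors from a different model are not comparable with the stored vectors, so retrieval quality would break down.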
+
+21. **Add a Cohere Reranker node** named `Rerank results with Cohere`.
+    - Create and attach a Cohere API credential.
+    - Keep default settings unless you need a specific reranking model or custom parameters.
+
+22. **Create Cohere credentials.**
+    - Add a Cohere API key in n8n.
+    - Attach it to the reranker node.
+
+23. **Add a second Pinecone Vector Store node** named `Retrieve documents from Pinecone`.
+    - Mode: `retrieve as tool`
+    - Select the same Pinecone index used for ingestion
+    - Enable reranking
+    - Tool description:
+      - `Retrieve data for the AI Agent.`
+    - Connect:
+      - AI embedding input from `Generate OpenAI embeddings1`
+      - AI reranker input from `Rerank results with Cohere`
+      - AI tool output to `Answer query from knowledge base`
+
+24. **Confirm all connections.**
+    - Ingestion branch:
+      - `Receive URL` → `Validate and normalize URL`
+      - `Validate and normalize URL` → `Scrape page with Firecrawl`
+      - `Validate and normalize URL` → `Return URL validation error`
+      - `Scrape page with Firecrawl` → `Store embeddings in Pinecone`
+      - `Generate OpenAI embeddings` → `Store embeddings in Pinecone` (AI embedding)
+      - `Load scraped content` → `Store embeddings in Pinecone` (AI document)
+      - `Store embeddings in Pinecone` → `Return ingestion result`
+    - Chat branch:
+      - `Receive chat message` → `Answer query from knowledge base`
+      - `OpenRouter LLM` → `Answer query from knowledge base` (AI language model)
+      - `Chat memory` → `Answer query from knowledge base` (AI memory)
+      - `Generate OpenAI embeddings1` → `Retrieve documents from Pinecone` (AI embedding)
+      - `Rerank results with Cohere` → `Retrieve documents from Pinecone` (AI reranker)
+      - `Retrieve documents from Pinecone` → `Answer query from knowledge base` (AI tool)
+
+25. **Optionally add sticky notes** for operational clarity.
+    Include:
+    - high-level workflow steps
+    - required credentials
+    - Pinecone index settings
+
+26. **Test the ingestion webhook in manual mode.**
+    - Example POST body:
+      ```json
+      {
+        "url": "firecrawl.dev"
+      }
+      ```
+    - Also test:
+      ```json
+      {
+        "url": "https://example.com/docs"
+      }
+      ```
+    - Confirm that the code normalizes these into a base HTTPS domain.
+
+27. **Validate error handling.**
+    - Send an empty body or invalid domain.
+    - Confirm the caller receives HTTP 422.
+    - If not, adjust the code node’s error path or the response node field mapping.
+
+28. **Test Pinecone insertion.**
+    - After a successful call, verify vectors/documents exist in the selected Pinecone index.
+    - Check metadata contains the `url` field.
+
+29. **Test the chat branch.**
+    - Use the chat trigger interface or endpoint.
+    - Ask a question related to the scraped content.
+    - Confirm the agent retrieves relevant content and produces an answer.
+
+30. **Activate the workflow** once both branches work in test mode.
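+
+The validation behavior described in steps 3 and 4 can be sketched as a standalone JavaScript function. This is a minimal sketch, not the source workflow's actual code: the function name `normalizeUrl`, the hostname regex, and the error message wording are all assumptions chosen for illustration.
+
+```javascript
+// Hypothetical hostname pattern (an assumption): dot-separated labels of
+// letters, digits, and hyphens, ending in an alphabetic TLD.
+const DOMAIN_RE = /^(?=.{1,253}$)(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$/i;
+
+// Hypothetical helper name; returns the same fields that step 3 lists.
+function normalizeUrl(rawUrl) {
+  const input = typeof rawUrl === 'string' ? rawUrl.trim() : '';
+  if (!input) {
+    // Missing input: return a 422 payload rather than throwing.
+    return { status: 422, message: 'Missing "url" in request body' };
+  }
+  const domain = input
+    .replace(/^[a-z][a-z0-9+.-]*:\/\//i, '') // strip the protocol
+    .split(/[/?#]/)[0]                        // strip path, query, and fragment
+    .toLowerCase();
+  if (!DOMAIN_RE.test(domain)) {
+    // Invalid input: throw so the node's error path can handle it.
+    throw new Error(`Invalid domain: ${domain}`);
+  }
+  return { status: 200, domain, url: `https://${domain}` };
+}
+
+// Inside an n8n Code node the wiring would look roughly like:
+// return [{ json: normalizeUrl($input.first().json.body.url) }];
+```
+
+With this sketch, both `https://example.com/path?q=1` and `example.com` normalize to `https://example.com`. Hosts with a port or an IP address are rejected by the assumed regex, so adjust the pattern if you need them.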
+
+### Expected Inputs and Outputs
+
+#### Ingestion webhook input
+- Method: `POST`
+- JSON body:
+  ```json
+  {
+    "url": "https://example.com"
+  }
+  ```
+
+#### Ingestion success output
+- HTTP 200
+- JSON message confirming processed item count
+
+#### Ingestion error output
+- HTTP 422
+- JSON error/message indicating missing or invalid URL
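+
+A client for the ingestion webhook could handle the success and error outputs above roughly as follows. This is a sketch under stated assumptions: `WEBHOOK_URL` is a placeholder, the helper names are hypothetical, and the global `fetch` requires Node.js 18 or later.
+
+```javascript
+// Placeholder (an assumption): replace with the URL shown on the Receive URL node.
+const WEBHOOK_URL = 'https://your-n8n-host.example/webhook/your-path';
+
+// Build the POST request options for the ingestion webhook.
+function buildIngestRequest(pageUrl) {
+  return {
+    method: 'POST',
+    headers: { 'Content-Type': 'application/json' },
+    body: JSON.stringify({ url: pageUrl }),
+  };
+}
+
+// Map the documented status codes to a simple result object.
+function interpretIngestResponse(status, payload) {
+  if (status === 200) return { ok: true, detail: payload.message };
+  if (status === 422) {
+    return { ok: false, detail: payload.message || payload.error || 'Invalid URL' };
+  }
+  return { ok: false, detail: `Unexpected status ${status}` };
+}
+
+// Send the request and interpret the response; tolerates a non-JSON body.
+async function ingest(pageUrl) {
+  const response = await fetch(WEBHOOK_URL, buildIngestRequest(pageUrl));
+  const payload = await response.json().catch(() => ({}));
+  return interpretIngestResponse(response.status, payload);
+}
+```
+
+Reading both `message` and `error` on a 422 keeps the client working whichever field the response node returns.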
+
+#### Chat input
+- User message through the Chat Trigger interface
+
+#### Chat output
+- Natural-language answer grounded in retrieved Pinecone documents
+
+### Recommended Improvements While Rebuilding
+- Fix `Return URL validation error` to use `message` as well as `error`.
+- Fix success response text from “Supabase” to “Pinecone”.
+- Add chunking before embeddings if you expect long pages.
+- Add deduplication using URL hash or canonical URL metadata.
+- Add explicit model selection for both embedding nodes.
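+
+The chunking and deduplication suggestions can be illustrated with a standalone Node.js sketch. Everything here is an assumption rather than part of the source workflow: the helper names, the character-based chunk size and overlap, and the choice of SHA-256. n8n also provides text splitter sub-nodes that can be attached to the data loader, which may be the simpler route inside the workflow itself.
+
+```javascript
+const crypto = require('crypto');
+
+// Split text into fixed-size character chunks with overlap.
+// The default sizes are illustrative assumptions, not tuned values.
+function chunkText(text, chunkSize = 1000, overlap = 200) {
+  if (overlap >= chunkSize) {
+    throw new Error('overlap must be smaller than chunkSize');
+  }
+  const chunks = [];
+  for (let start = 0; start < text.length; start += chunkSize - overlap) {
+    chunks.push(text.slice(start, start + chunkSize));
+    if (start + chunkSize >= text.length) break; // last chunk reached the end
+  }
+  return chunks;
+}
+
+// Stable hash of the normalized URL, usable as a deduplication key.
+// Lowercasing is safe here because the workflow keeps only the domain.
+function urlHash(url) {
+  return crypto.createHash('sha256').update(url.trim().toLowerCase()).digest('hex');
+}
+
+// Hypothetical per-chunk identifiers derived from the URL hash.
+function chunkIds(url, chunks) {
+  const hash = urlHash(url);
+  return chunks.map((_, index) => `${hash}-${index}`);
+}
+```
+
+Storing the hash as document metadata lets a later run check whether a URL was already ingested. Whether the Pinecone node accepts custom vector IDs is not confirmed here, so treat `chunkIds` as one possible design.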
+
+---
+
+## 5. General Notes & Resources
+
+| Note Content | Context or Link |
+|---|---|
+| Overall workflow note: A webhook receives a URL, validates it, scrapes content with Firecrawl, creates OpenAI embeddings, stores them in Pinecone, and exposes a RAG chat agent with Cohere reranking. | Workflow purpose |
+| Setup note: Required credentials are Firecrawl, OpenAI, OpenRouter, Cohere, and Pinecone. | Environment setup |
+| Pinecone requirement: index must use 1536 dimensions to match `text-embedding-3-small`. | Pinecone configuration |
+| Pinecone settings: Modality = Text, Vector type = Dense, Dimension = 1536, Metric = cosine. | Pinecone configuration |
+| Example ingestion call: `{"url": "https://example.com"}` | Webhook test payload |
+
+### Additional Implementation Notes
+- The workflow contains **two entry points**:
+  1. `Receive URL` for ingestion
+  2. `Receive chat message` for querying
+- There are **no sub-workflow invocation nodes** in this workflow.
+- The pinned sample input on the webhook node shows a real test case using:
+  - `firecrawl.dev`
+- The workflow is currently **inactive** in the provided JSON, so production endpoints would not respond until activated.
+- The ingestion branch and chat branch both rely on the **same Pinecone index**, which is essential for the RAG loop to work correctly.