creation

2026-04-19 17:14:37 +00:00 · 2026-03-10 12:13:50 +08:00
parent acfd2eacd0
commit 4772b5e914
1 changed files with 129 additions and 0 deletions
--- a/Gmail-13798/readme-13798.md
+++ b/Gmail-13798/readme-13798.md
@@ -0,0 +1,129 @@
+Turn websites into a Google Sheets database with MrScraper and Gmail
+
+https://n8nworkflows.xyz/workflows/turn-websites-into-a-google-sheets-database-with-mrscraper-and-gmail-13798
+
+
+# Turn websites into a Google Sheets database with MrScraper and Gmail
+
+# Workflow Reference: Turn Internet Into Database
+
+This document provides a technical breakdown of an n8n workflow designed to automate the extraction of data from websites and store it in Google Sheets, with notifications sent via Gmail. It utilizes **MrScraper** as the core scraping engine.
+
+## 1. Workflow Overview
+
+The workflow automates a multi-stage web scraping pipeline that transitions from high-level domain discovery to deep data extraction. It is designed to turn unstructured web content into a structured Google Sheets database.
+
+The logic is divided into four main functional phases:
+*   **1.1 Discovery (Phase 1):** Crawling a target domain to identify relevant listing or category pages.
+*   **1.2 Listing Extraction (Phase 2):** Processing discovered listing pages to extract specific detail page URLs.
+*   **1.3 Data Scraping (Phase 3):** Navigating to each detail page to extract structured data fields (e.g., prices, descriptions).
+*   **1.4 Export & Notification (Phase 4):** Flattening the extracted data, saving it to Google Sheets, and sending a summary via Gmail.
+
+---
+
+## 2. Block-by-Block Analysis
+
+### 2.1 Phase 1: Discover URL (Crawling)
+**Overview:** This block identifies the structure of the target website and finds all relevant URLs based on inclusion/exclusion patterns.
+*   **Nodes Involved:** `When clicking ‘Execute workflow’`, `Run map agent scraper`.
+*   **Node Details:**
+    *   **Run map agent scraper (MrScraper Node):** 
+        *   **Role:** Performs a "Map" operation to crawl the domain.
+        *   **Configuration:** Requires a `scraperId` and a starting `url`. Supports `includePatterns` (to target specific paths like `/category/`) and `limit` (to control the number of pages found).
+        *   **Edge Cases:** Rate limiting on the target domain or overly restrictive include patterns resulting in zero URLs.
+
+### 2.2 Phase 2: Scrape Listing Page
+**Overview:** Iterates through the URLs found in Phase 1 to identify specific item/product links.
+*   **Nodes Involved:** `Looping Listing Page url`, `Run listing agent scraper`.
+*   **Node Details:**
+    *   **Looping Listing Page url (Split In Batches):**
+        *   **Role:** Manages the iteration over the list of URLs returned by the Map Agent.
+    *   **Run listing agent scraper (MrScraper Node):**
+        *   **Role:** Uses a "Listing Agent" configuration to extract multiple links (URLs) from a single search/category page.
+        *   **Configuration:** Requires a specific `scraperId` optimized for list-view pages.
+
+### 2.3 Phase 3: Scrape Detail Data
+**Overview:** The core extraction phase where individual product or post data is gathered and prepared for storage.
+*   **Nodes Involved:** `Extract All Url`, `Looping Detail Page url`, `Run general agent scraper`, `Flatten Object`.
+*   **Node Details:**
+    *   **Extract All Url (Code Node - Python):**
+        *   **Role:** Parses the JSON response from the Listing Agent to extract a clean, deduplicated list of detail URLs using a Python `set`.
+    *   **Looping Detail Page url (Split In Batches):**
+        *   **Role:** Handles the batch processing of individual detail pages.
+    *   **Run general agent scraper (MrScraper Node):**
+        *   **Role:** Navigates to the specific item page to scrape structured data (title, price, description, etc.).
+    *   **Flatten Object (Code Node - JavaScript):**
+        *   **Role:** Recursively flattens nested JSON objects into a single-level key-value structure suitable for spreadsheet columns (e.g., `details.price` becomes `details_price`).
+
+### 2.4 Phase 4: Export & Notify
+**Overview:** Finalizes the process by updating the database and alerting the user.
+*   **Nodes Involved:** `Append or update row in sheet`, `Send a message`.
+*   **Node Details:**
+    *   **Append or update row in sheet (Google Sheets Node):**
+        *   **Role:** Commits the flattened data to a specific Google Sheet.
+        *   **Configuration:** Uses `appendOrUpdate` operation. It is recommended to use a unique URL as the matching key to avoid duplicates.
+    *   **Send a message (Gmail Node):**
+        *   **Role:** Sends a notification once the processing is complete.
+        *   **Configuration:** Configurable subject and body to provide a summary of the run.
+
+---
+
+## 3. Summary Table
+
+| Node Name | Node Type | Functional Role | Input Node(s) | Output Node(s) | Sticky Note |
+| :--- | :--- | :--- | :--- | :--- | :--- |
+| When clicking ‘Execute workflow’ | Manual Trigger | Entry Point | None | Run map agent scraper | |
+| Run map agent scraper | MrScraper | URL Discovery | Manual Trigger | Looping Listing Page url | Find listing/search pages automatically from a domain. |
+| Looping Listing Page url | Split In Batches | Loop Controller | Run map agent scraper, Run listing agent scraper | Extract All Url, Run listing agent scraper | Extract detail page URLs from each listing/search page. |
+| Run listing agent scraper | MrScraper | List Scraping | Looping Listing Page url | Looping Listing Page url | Extract detail page URLs from each listing/search page. |
+| Extract All Url | Code (Python) | Data Parsing | Looping Listing Page url | Looping Detail Page url | |
+| Looping Detail Page url | Split In Batches | Loop Controller | Extract All Url, Run general agent scraper | Flatten Object, Run general agent scraper | Extract structured fields from each detail page. |
+| Run general agent scraper | MrScraper | Detail Scraping | Looping Detail Page url | Looping Detail Page url | Extract structured fields from each detail page. |
+| Flatten Object | Code (JS) | Data Cleaning | Looping Detail Page url | Append or update row in sheet | Normalize the output: flatten nested JSON, format arrays. |
+| Append or update row in sheet | Google Sheets | Data Storage | Flatten Object | Send a message | Save results into Google Sheets (Append/Upsert). |
+| Send a message | Gmail | Notification | Append or update row in sheet | None | Send a Gmail summary/alert. |
+
+---
+
+## 4. Reproducing the Workflow from Scratch
+
+1.  **Preparation:**
+    *   Create 3 agents in MrScraper: **Map Agent** (discovery), **Listing Agent** (link extraction), and **General Agent** (data extraction). Note their IDs.
+    *   Prepare a Google Sheet with headers matching your expected data.
+
+2.  **Initial Discovery:**
+    *   Add a **Manual Trigger** node.
+    *   Connect it to a **MrScraper** node (Map Agent). Set the `url` and `scraperId`.
+
+3.  **The First Loop (Listings):**
+    *   Add a **Split In Batches** node ("Looping Listing Page url").
+    *   Connect it to another **MrScraper** node (Listing Agent).
+    *   Loop the output of the Listing Agent back into the Batch node.
+
+4.  **Data Extraction:**
+    *   Add a **Code Node** (Python) to extract URLs from the Listing Agent's output. Use a `set()` to ensure uniqueness.
+    *   Add a second **Split In Batches** node ("Looping Detail Page url").
+    *   Connect it to a **MrScraper** node (General Agent) and loop it back.
+
+5.  **Data Transformation:**
+    *   Add a **Code Node** (JavaScript) to flatten the JSON objects returned by the General Agent. This ensures the data fits into Google Sheets columns.
+
+6.  **Integration & Notification:**
+    *   Add a **Google Sheets** node. Set the operation to `Append or Update`. Map the fields from the Flatten node.
+    *   Add a **Gmail** node to send a final "Workflow Complete" notification.
+
+7.  **Credentials:**
+    *   Configure **MrScraper API** (API Token).
+    *   Configure **Google Sheets OAuth2**.
+    *   Configure **Gmail OAuth2**.
+
+---
+
+## 5. General Notes & Resources
+
+| Note Content | Context or Link |
+| :--- | :--- |
+| Ensure MrScraper API access is enabled | Required for n8n communication |
+| Google Sheets Upsert Strategy | Use `source_url` as a unique key to prevent data duplication |
+| MrScraper Documentation | [https://docs.mrscraper.com](https://docs.mrscraper.com) |
+| n8n Split In Batches Guide | Handling large data sets without timeouts |