From e966d8f6803178fc3019f1cd83ddc2d92e9d4770 Mon Sep 17 00:00:00 2001
From: nusquama
Date: Wed, 12 Nov 2025 20:09:29 +0100
Subject: [PATCH] creation

---
 .../readme-8195.md | 335 ++++++++++++++++++
 1 file changed, 335 insertions(+)
 create mode 100644 workflows/Crawl Website Blog Content and Save to Google Sheets with Dumpling AI-8195/readme-8195.md

diff --git a/workflows/Crawl Website Blog Content and Save to Google Sheets with Dumpling AI-8195/readme-8195.md b/workflows/Crawl Website Blog Content and Save to Google Sheets with Dumpling AI-8195/readme-8195.md
new file mode 100644
index 000000000..cbeb2e90a
--- /dev/null
+++ b/workflows/Crawl Website Blog Content and Save to Google Sheets with Dumpling AI-8195/readme-8195.md
@@ -0,0 +1,335 @@
Crawl Website Blog Content and Save to Google Sheets with Dumpling AI

https://n8nworkflows.xyz/workflows/crawl-website-blog-content-and-save-to-google-sheets-with-dumpling-ai-8195


# Crawl Website Blog Content and Save to Google Sheets with Dumpling AI

### 1. Workflow Overview

This workflow automates crawling a client's website for blog content, extracting the relevant blog URLs, scraping their content, and saving the structured data into a newly created Google Sheets document. It leverages Dumpling AI for the crawling and scraping tasks and Google Sheets for data storage and auditing.

The workflow is logically organized into the following blocks:

- **1.1 Input Reception:** Captures client input via a form trigger.
- **1.2 Google Sheets Setup:** Creates a new spreadsheet, sets headers, and prepares it to receive data.
- **1.3 Website Crawl and URL Extraction:** Uses Dumpling AI to crawl the submitted website URL, then extracts and filters blog-related URLs.
- **1.4 Blog Content Scraping:** Scrapes content from each filtered blog URL using Dumpling AI.
- **1.5 Data Preparation and Storage:** Formats the scraped data and appends it to the Google Sheet for audit and review.

---

### 2. Block-by-Block Analysis

#### 2.1 Input Reception

- **Overview:**
  This block triggers the workflow when a client submits a URL through a form. It initiates the entire process by capturing the target website URL.

- **Nodes Involved:**
  - Form Submission

- **Node Details:**

  - **Form Submission**
    - Type: Form Trigger
    - Role: Starts the workflow by receiving a client-submitted URL.
    - Configuration:
      - Form titled "blog content strategy"
      - Single required field labeled "Client URL"
    - Expressions: None beyond form field capture.
    - Input: External user form submission
    - Output: JSON object containing `{ "Client URL": <submitted URL> }`
    - Version: 2.2
    - Potential failures: Form submission timeout, invalid or empty URL input, webhook connectivity issues.

---

#### 2.2 Google Sheets Setup

- **Overview:**
  Creates a new Google Sheet named after the client URL domain, initializes a single sheet tab titled "Blog content audit", defines the header row, formats it, and inserts it into the sheet.

- **Nodes Involved:**
  - Create Blog Audit Sheet
  - Set Sheet Headers
  - Format Header Row
  - Insert Headers into Sheet

- **Node Details:**

  - **Create Blog Audit Sheet**
    - Type: Google Sheets
    - Role: Creates a new Google Sheets document.
    - Configuration:
      - Title derived by trimming the submitted URL and keeping the portion before the first dot (see the sketch below). For example, "https://crane-baby.com" → "https://crane-baby"; note that the URL scheme is kept.
      - Creates one sheet tab titled "Blog content audit"
      - Credentials: OAuth2 with Google Sheets account
    - Input: Client URL from Form Submission
    - Output: JSON with spreadsheet ID, URL, and sheet properties
    - Version: 4.6
    - Potential failures: Google OAuth permission issues, API rate limits, invalid spreadsheet creation parameters.
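    - Illustrative sketch: a minimal plain-JavaScript version of the title derivation described above (the node itself uses an inline n8n expression; the input value here is a made-up example):
      ```js
      // Derive the spreadsheet title from the submitted URL: trim whitespace,
      // then keep everything before the first dot. The scheme is kept, which
      // matches the behavior of the workflow's expression.
      const clientUrl = " https://crane-baby.com "; // hypothetical form input
      const title = clientUrl.trim().split(".")[0];
      console.log(title); // "https://crane-baby"
      ```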
  - **Set Sheet Headers**
    - Type: Set Node
    - Role: Defines the header row as the CSV string "Url,Crawled_pages,website_content" in a variable named `rows`.
    - Input: Output of "Create Blog Audit Sheet"
    - Output: JSON with the header string
    - Version: 1
    - Potential failures: None significant (simple data setting).

  - **Format Header Row**
    - Type: Code (JavaScript)
    - Role: Converts the CSV string into a 2D array (an array of arrays) suitable for the Google Sheets API.
    - Key code: Splits the string on commas and wraps the result in an outer array:
      ```js
      return [
        {
          json: {
            data: [ $json.rows.split(',') ]
          }
        }
      ];
      ```
    - Input: From Set Sheet Headers
    - Output: JSON with a `data` property containing the array of headers
    - Version: 2
    - Potential failures: Fails if `rows` is missing or malformed.

  - **Insert Headers into Sheet**
    - Type: HTTP Request
    - Role: Calls the Google Sheets API v4 to insert the header row into the created sheet.
    - Configuration:
      - HTTP Method: PUT
      - URL constructed dynamically from the spreadsheet ID and sheet title in the "Create Blog Audit Sheet" output.
      - Request body includes `range` (sheet name plus columns) and `values` (the header array).
      - Query parameter `valueInputOption=RAW` to insert values as-is.
      - Uses the Google Sheets OAuth2 credential.
    - Input: Formatted headers from "Format Header Row"; spreadsheet info from "Create Blog Audit Sheet"
    - Output: API response confirming the update
    - Version: 4.1
    - Potential failures: OAuth token expiration, API quota limits, invalid spreadsheet ID or range.

---

#### 2.3 Website Crawl and URL Extraction

- **Overview:**
  Crawls the submitted client website URL using the Dumpling AI API to discover URLs, then extracts and filters URLs related to blog content using pattern matching.

- **Nodes Involved:**
  - Dumpling AI: Crawl Website
  - Extract Blog URLs

- **Node Details:**

  - **Dumpling AI: Crawl Website**
    - Type: HTTP Request
    - Role: Sends a POST request to Dumpling AI's crawl API endpoint.
    - Configuration:
      - URL: `https://app.dumplingai.com/api/v1/crawl`
      - Body parameters:
        - `url` set to the client URL from the form input
        - `limit` set to 10 (maximum pages to crawl)
      - Authentication: HTTP Header Auth using the stored Dumpling AI API key
    - Input: Client URL from Form Submission
    - Output: JSON containing crawled page data and URLs
    - Version: 4.2
    - Potential failures: Invalid or expired API key, Dumpling AI service downtime, network timeouts.

  - **Extract Blog URLs**
    - Type: Code (JavaScript)
    - Role: Parses the JSON crawl results to extract all URLs, then filters them to blog-related URLs using inclusion and exclusion regex patterns (see the sketch below).
    - Logic highlights:
      - Extracts all URLs from the JSON as strings using a regex.
      - Deduplicates the URLs.
      - Keeps URLs matching typical blog patterns (`/blog/`, `/post/`, `/article/`, date-based URLs, guides, tips, etc.).
      - Excludes URLs with file extensions (images, CSS, JS, PDF, ZIP) or ecommerce/cart/account-related paths.
      - Returns a sorted list of unique blog-related URLs.
    - Input: Dumpling AI crawl output
    - Output: Array of JSON objects, each with a `blogUrl` property, or a message if none were found
    - Version: 2
    - Edge cases: No blog URLs found, malformed URLs, encoding issues.
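    - Illustrative sketch: the node's full script is not reproduced in this document; a minimal n8n Code-node version of the described logic might look like the following (the regex patterns are assumptions, not the node's exact code):
      ```js
      // Serialize the crawl response and pull out every absolute URL,
      // then deduplicate with a Set.
      const raw = JSON.stringify($input.first().json);
      const urls = [...new Set(raw.match(/https?:\/\/[^"'\s\\]+/g) || [])];

      // Keep typical blog paths (including date-based URLs); drop static
      // assets and ecommerce/account pages.
      const include = /\/(blog|post|article|guide|tip)s?\/|\/\d{4}\/\d{2}\//i;
      const exclude = /\.(png|jpe?g|gif|svg|css|js|pdf|zip)(\?|$)|\/(cart|checkout|account)\//i;

      const blogUrls = urls.filter(u => include.test(u) && !exclude.test(u)).sort();

      // Return one item per blog URL, or a message item if none were found.
      return blogUrls.length
        ? blogUrls.map(blogUrl => ({ json: { blogUrl } }))
        : [{ json: { message: "No blog URLs found" } }];
      ```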
---

#### 2.4 Blog Content Scraping

- **Overview:**
  For each extracted blog URL, scrapes the page content using Dumpling AI's scraping API.

- **Nodes Involved:**
  - Dumpling AI: Scrape Blog Pages

- **Node Details:**

  - **Dumpling AI: Scrape Blog Pages**
    - Type: HTTP Request
    - Role: Sends a POST request to Dumpling AI's scrape API for each blog URL.
    - Configuration:
      - URL: `https://app.dumplingai.com/api/v1/scrape`
      - Body parameter `url` set dynamically per blog URL from the previous node
      - Authentication: HTTP Header Auth with the Dumpling AI API key
    - Input: Blog URLs from Extract Blog URLs
    - Output: JSON containing the scraped content of each blog page
    - Version: 4.2
    - Potential failures: API errors for invalid URLs, rate limiting, network issues, content not found.

---

#### 2.5 Data Preparation and Storage

- **Overview:**
  Structures the data into a row format with the original URL, the crawled page URL, and the page content, then appends it to the Google Sheet.

- **Nodes Involved:**
  - Prepare Row Data
  - Save Blog Data to Google Sheets

- **Node Details:**

  - **Prepare Row Data**
    - Type: Set Node
    - Role: Maps and constructs objects with three fields:
      - `Url`: original client URL from the form input
      - `Crawled_pages`: the blog URL currently being processed
      - `website_content`: scraped content from the Dumpling AI scrape response
    - Input: Output from the Dumpling AI scrape node and the original form submission
    - Output: JSON object with the three fields for each blog page
    - Version: 3.4
    - Potential failures: Missing fields in the scraped content, null content.

  - **Save Blog Data to Google Sheets**
    - Type: Google Sheets
    - Role: Appends rows of structured blog data to the audit sheet created earlier (a raw-API equivalent is sketched below).
    - Configuration:
      - Operation: Append
      - Columns automatically mapped from the input fields (Url, Crawled_pages, website_content)
      - Target sheet specified by the sheet ID from the "Create Blog Audit Sheet" output
      - Document specified by the URL from "Create Blog Audit Sheet"
      - Credentials: OAuth2 Google Sheets account
    - Input: Structured row data from "Prepare Row Data"
    - Output: Confirmation of the appended rows
    - Version: 4.7
    - Potential failures: API limits, invalid sheet ID, OAuth failures.
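    - Illustrative sketch: the workflow uses n8n's built-in Google Sheets node for this step; for reference, a roughly equivalent raw Sheets API v4 `values.append` call is sketched below (the spreadsheet ID, access token, and row values are placeholders):
      ```js
      // Append one structured row to the audit sheet via the Sheets API v4.
      const spreadsheetId = "SPREADSHEET_ID"; // placeholder
      const range = encodeURIComponent("Blog content audit!A:C");

      await fetch(
        `https://sheets.googleapis.com/v4/spreadsheets/${spreadsheetId}/values/${range}:append?valueInputOption=RAW`,
        {
          method: "POST",
          headers: {
            Authorization: "Bearer ACCESS_TOKEN", // placeholder OAuth2 token
            "Content-Type": "application/json",
          },
          body: JSON.stringify({
            // One row: Url, Crawled_pages, website_content (example values)
            values: [["https://crane-baby.com", "https://crane-baby.com/blog/example-post", "Scraped page content…"]],
          }),
        }
      );
      ```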
---

### 3. Summary Table

| Node Name | Node Type | Functional Role | Input Node(s) | Output Node(s) | Sticky Note |
|-----------|-----------|-----------------|---------------|----------------|-------------|
| Form Submission | Form Trigger | Entry point to receive the client URL | - | Create Blog Audit Sheet | ## Workflow Overview 1. Trigger: Form Submission (Client URL) — Starts the workflow when a client URL is entered. |
| Create Blog Audit Sheet | Google Sheets | Creates a new Google Sheets document | Form Submission | Set Sheet Headers | 2. Create Blog Audit Sheet — Creates a new Google Sheet for the audit. |
| Set Sheet Headers | Set Node | Defines the header row as a CSV string | Create Blog Audit Sheet | Format Header Row | 3. Set Sheet Headers — Defines the columns (URL, Crawled Pages, Website Content). |
| Format Header Row | Code Node | Converts CSV headers into an array for the Sheets API | Set Sheet Headers | Insert Headers into Sheet | 4. Format Header Row — Prepares the headers in the right format for Google Sheets. |
| Insert Headers into Sheet | HTTP Request | Inserts the header row into the Google Sheet | Format Header Row | Dumpling AI: Crawl Website | 5. Insert Headers into Sheet — Updates the sheet with the headers. |
| Dumpling AI: Crawl Website | HTTP Request | Crawls the submitted website to discover URLs | Insert Headers into Sheet | Extract Blog URLs | 6. Dumpling AI: Crawl Website — Crawls the submitted URL to discover pages. |
| Extract Blog URLs | Code Node | Extracts and filters blog-related URLs | Dumpling AI: Crawl Website | Dumpling AI: Scrape Blog Pages | 7. Extract Blog URLs — Filters the crawl results to keep only blog-related links. |
| Dumpling AI: Scrape Blog Pages | HTTP Request | Scrapes content from each blog page | Extract Blog URLs | Prepare Row Data | 8. Dumpling AI: Scrape Blog Pages — Scrapes the content from each blog page. |
| Prepare Row Data | Set Node | Maps the URL, crawled page, and content into fields | Dumpling AI: Scrape Blog Pages | Save Blog Data to Google Sheets | 9. Prepare Row Data — Maps the URL, crawled page, and content into structured fields. |
| Save Blog Data to Google Sheets | Google Sheets | Appends blog data rows to the audit sheet | Prepare Row Data | - | 10. Save Blog Data to Google Sheets — Appends the results into the audit sheet for review. |
| Sticky Note | Sticky Note | Provides a workflow overview and step summary | - | - | ## Workflow Overview (describes all main steps from Form Submission to saving data in Sheets) |

---

### 4. Reproducing the Workflow from Scratch

1. **Create a Form Trigger Node**
   - Type: Form Trigger (version 2.2)
   - Configure a form titled "blog content strategy"
   - Add one required field labeled "Client URL"
   - This node triggers the workflow on form submission with the client URL.

2. **Add a Google Sheets Node to Create the Spreadsheet**
   - Type: Google Sheets (version 4.6)
   - Operation: Create spreadsheet
   - Title expression:
     ```
     ={{ $json["Client URL"].trim().split(/›|>|»/)[0].trim().split(".")[0] }}
     ```
   - Add one sheet named "Blog content audit"
   - Use Google Sheets OAuth2 credentials.

3. **Add a Set Node to Define the Header Row**
   - Type: Set (version 1)
   - Set the field `rows` to the string `"Url,Crawled_pages,website_content"`

4. **Add a Code Node to Format the Header Row**
   - Type: Code (version 2)
   - JS code:
     ```js
     return [
       {
         json: {
           data: [ $json.rows.split(',') ]
         }
       }
     ];
     ```

5. **Add an HTTP Request Node to Insert Headers into the Sheet**
   - Type: HTTP Request (version 4.1)
   - Method: PUT
   - URL expression:
     ```
     =https://sheets.googleapis.com/v4/spreadsheets/{{ $('Create Blog Audit Sheet').first().json.spreadsheetId }}/values/{{ $('Create Blog Audit Sheet').first().json.sheets[0].properties.title }}!A:Z
     ```
   - Body parameters:
     - `range`: sheet title + `!A:Z`
     - `values`: from the `data` property of the previous node
   - Query parameter: `valueInputOption=RAW`
   - Authentication: Google Sheets OAuth2 credentials.

6. **Add an HTTP Request Node for the Dumpling AI Crawl**
   - Type: HTTP Request (version 4.2)
   - Method: POST
   - URL: `https://app.dumplingai.com/api/v1/crawl`
   - Body parameters:
     - `url` from the form submission: `={{ $('Form Submission ').item.json["Client URL"] }}`
     - `limit`: `10`
   - Authentication: HTTP Header Auth with the Dumpling AI API key.

7. **Add a Code Node to Extract Blog URLs**
   - Type: Code (version 2)
   - Paste the provided JavaScript code that extracts and filters blog-related URLs using regex patterns and returns the list (a hedged sketch of equivalent logic is given in section 2.3).

8. **Add an HTTP Request Node for the Dumpling AI Scrape**
   - Type: HTTP Request (version 4.2)
   - Method: POST
   - URL: `https://app.dumplingai.com/api/v1/scrape`
   - Body parameter: `url` set dynamically for each `blogUrl` from the previous node (see the request sketch below)
   - Authentication: HTTP Header Auth with the Dumpling AI API key.
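   For reference, a roughly equivalent raw request outside n8n might look like this; the API-key header name is an assumption, so check your Dumpling AI credential configuration:
   ```js
   // Scrape one blog page via the Dumpling AI API (sketch; auth header assumed).
   const res = await fetch("https://app.dumplingai.com/api/v1/scrape", {
     method: "POST",
     headers: {
       "Content-Type": "application/json",
       Authorization: "Bearer DUMPLING_API_KEY", // assumed header scheme
     },
     body: JSON.stringify({ url: "https://crane-baby.com/blog/example-post" }),
   });
   const data = await res.json(); // contains the scraped page content
   ```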
9. **Add a Set Node to Prepare Row Data**
   - Type: Set (version 3.4)
   - Assign fields:
     - `Url`: `={{ $('Form Submission ').item.json["Client URL"] }}`
     - `Crawled_pages`: `={{ $('Extract Blog URLs').item.json.blogUrl }}`
     - `website_content`: `={{ $json.content }}`

10. **Add a Google Sheets Node to Append Data**
    - Type: Google Sheets (version 4.7)
    - Operation: Append
    - Document ID: `={{ $('Create Blog Audit Sheet').item.json.spreadsheetUrl }}` (mode: URL)
    - Sheet Name: `={{ $('Create Blog Audit Sheet').item.json.sheets[0].properties.sheetId }}` (mode: ID)
    - Columns: auto-mapped for `Url`, `Crawled_pages`, `website_content`
    - Use Google Sheets OAuth2 credentials.

11. **Connect the Nodes in the Following Order:**
    - Form Submission → Create Blog Audit Sheet → Set Sheet Headers → Format Header Row → Insert Headers into Sheet → Dumpling AI: Crawl Website → Extract Blog URLs → Dumpling AI: Scrape Blog Pages → Prepare Row Data → Save Blog Data to Google Sheets

12. **Credential Setup:**
    - Google Sheets OAuth2 API credential with permissions to create and edit spreadsheets.
    - Dumpling AI HTTP Header Auth credential containing the API key.

---

### 5. General Notes & Resources

| Note Content | Context or Link |
|--------------|-----------------|
| This workflow uses the Dumpling AI APIs for crawling and scraping web content. A valid API key must be set up in the n8n credentials. | Dumpling AI API documentation (https://app.dumplingai.com/) |
| The Google Sheets operations require OAuth2 credentials with scopes to create and edit spreadsheets. | Google Sheets API documentation (https://developers.google.com/sheets/api) |
| The blog URL extraction step uses custom regex patterns tailored to common blog URL structures and exclusions. | Patterns can be adjusted in the Extract Blog URLs node code. |
| The workflow is designed for blog content auditing and can be adapted to other content types with regex tuning. | |

---

**Disclaimer:** The provided text is derived exclusively from an automated workflow created with n8n, an integration and automation tool. All data handled is legal and public, and content policies are strictly respected.
\ No newline at end of file