nusquama
2026-03-15 12:01:10 +08:00
parent 6322a9320c
commit 499e669fa0

@@ -0,0 +1,847 @@
Build a Reddit no-API weekly digest with ScrapeOps and Google Sheets
https://n8nworkflows.xyz/workflows/build-a-reddit-no-api-weekly-digest-with-scrapeops-and-google-sheets-13992
# Build a Reddit no-API weekly digest with ScrapeOps and Google Sheets
# 1. Workflow Overview
This workflow creates a weekly Reddit industry digest without using the Reddit API. It scrapes public subreddit listing pages through ScrapeOps, extracts post metadata, enriches posts with Reddit JSON post details, deduplicates against a Google Sheet, stores only new posts, then compiles a weekly digest and optionally emails it.
Typical use cases:
- Weekly monitoring of technical communities such as `selfhosted`, `devops`, `programming`, and `webdev`
- Building a content pipeline for newsletters, internal trend reports, or research tracking
- Persisting scraped community content into Google Sheets for later analysis
## 1.1 Trigger & Runtime Configuration
The workflow starts on a weekly schedule. It calculates the current week range, resets workflow-level static memory, and emits one item per subreddit with shared configuration such as timeframe, per-subreddit post limit, and Google Sheet ID.
## 1.2 Subreddit Listing Scraping
Each subreddit is processed one at a time via batching. For each subreddit, the workflow requests the “Top of Week” page from `old.reddit.com` through ScrapeOps Proxy and inserts a randomized delay of 1 to 3 seconds.
## 1.3 Listing Parsing
The returned HTML is parsed in a Code node. The parser extracts metadata including post title, canonical Reddit URL, author, flair, score, comment count, timestamp, and a generated SHA-1 content hash.
## 1.4 Post Enrichment
For every parsed listing item, the workflow fetches the corresponding Reddit `.json` endpoint through ScrapeOps, extracts `selftext` and inferred post type, then merges that data back into the listing metadata and normalizes the final post fields.
## 1.5 Deduplication & Persistence
While the scraping branch runs, a parallel branch reads existing rows from the `posts` tab in Google Sheets. Scraped posts are compared against those rows using both `content_hash` and normalized `post_url`, and only unseen posts are marked as new. The intent is to append only new posts to the `posts` sheet, but as wired the append node receives every item, including those flagged `is_new = false` (see 2.5).
## 1.6 Weekly Digest Generation & Delivery
After batch processing completes, the workflow builds a digest from the in-memory collection of newly discovered posts. It derives lightweight topics from repeated words, selects top posts by score and comment count, writes a summary row into the `weekly_digest` sheet, and optionally emails the digest.
---
# 2. Block-by-Block Analysis
## 2.1 Block: Trigger & Configuration
### Overview
This block launches the workflow weekly and prepares one execution item per target subreddit. It also initializes workflow static data used later for deduplication and digest generation.
### Nodes Involved
- Weekly Schedule Trigger
- Configure Subreddits & Week Range
### Node Details
#### Weekly Schedule Trigger
- **Type and role:** `n8n-nodes-base.scheduleTrigger`; workflow entry point
- **Configuration choices:** Uses a basic interval rule. The exported JSON contains only a minimal rule, so the weekly cadence is the stated intent rather than something the export spells out; confirm the interval after importing.
- **Key expressions or variables used:** None
- **Input and output connections:**
- Input: none
- Output: `Configure Subreddits & Week Range`
- **Version-specific requirements:** Type version 1
- **Edge cases / failure types:**
- Misconfigured schedule may cause it to run too often or not at all
- Timezone interpretation depends on instance settings
- **Sub-workflow reference:** None
#### Configure Subreddits & Week Range
- **Type and role:** `n8n-nodes-base.code`; emits runtime configuration records
- **Configuration choices:**
- Resets static global arrays:
- `global.seen = []`
- `global.newPosts = []`
- Calculates:
- `run_id` from current ISO timestamp
- `run_date` as `YYYY-MM-DD`
- `week_range` from Monday to Sunday in UTC
- Defines subreddit list:
- `selfhosted`
- `devops`
- `programming`
- `webdev`
- Sets fixed parameters per emitted item:
- `sort = "top"`
- `time_range = "week"`
- `limit = 20`
- `sheet_id = "1rKuVREV4pedie7uAbuEcLvghNrAEeAjIPUBE6cQPleI"`
- **Key expressions or variables used:**
- Workflow static data via `$getWorkflowStaticData('global')`
- UTC week calculations
- **Input and output connections:**
- Input: `Weekly Schedule Trigger`
- Outputs:
- `Split Subreddits Into Batches`
- `Read Existing Posts from Sheet`
- **Version-specific requirements:** Code node type version 2
- **Edge cases / failure types:**
- If static data is unavailable in a given runtime, fallback behavior relies on `globalThis`
- Hardcoded subreddit list requires manual editing for changes
- Hardcoded sheet ID may drift from the IDs configured in Google Sheets nodes
- **Sub-workflow reference:** None
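
The configuration logic described above can be sketched as plain JavaScript. This is a minimal sketch, not the workflow's actual code: the helper names `weekRangeUtc` and `buildConfigItems`, the `"YYYY-MM-DD to YYYY-MM-DD"` range format, and the placeholder sheet ID are assumptions made for illustration. The static-data reset (`global.seen = []`, `global.newPosts = []`) is left out because `$getWorkflowStaticData` only exists inside n8n.

```javascript
// Hypothetical helpers; names and the week_range string format are assumptions.
function weekRangeUtc(now) {
  // getUTCDay(): 0 = Sunday ... 6 = Saturday; shift so that Monday = 0.
  const daysSinceMonday = (now.getUTCDay() + 6) % 7;
  const monday = new Date(Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() - daysSinceMonday // Date.UTC rolls over month boundaries
  ));
  const sunday = new Date(monday.getTime() + 6 * 24 * 60 * 60 * 1000);
  const fmt = (d) => d.toISOString().slice(0, 10);
  return `${fmt(monday)} to ${fmt(sunday)}`;
}

function buildConfigItems(now, subreddits, sheetId) {
  const shared = {
    run_id: now.toISOString(),
    run_date: now.toISOString().slice(0, 10),
    week_range: weekRangeUtc(now),
    sort: 'top',
    time_range: 'week',
    limit: 20,
    sheet_id: sheetId,
  };
  // n8n Code nodes return an array of { json: ... } items, one per subreddit here.
  return subreddits.map((subreddit) => ({ json: { subreddit, ...shared } }));
}

// Example call with a placeholder sheet ID.
const configItems = buildConfigItems(
  new Date(),
  ['selfhosted', 'devops', 'programming', 'webdev'],
  'YOUR_SHEET_ID'
);
```

Inside an n8n Code node the final step would be to return the array, so each subreddit becomes its own item for the batching node.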
---
## 2.2 Block: Scrape Subreddit Listings
### Overview
This block processes subreddits sequentially, fetches each subreddit's weekly top page through ScrapeOps, and deliberately slows the request cadence to reduce scraping pressure.
### Nodes Involved
- Split Subreddits Into Batches
- ScrapeOps: Fetch Subreddit Listing
- Polite Delay (13s)
### Node Details
#### Split Subreddits Into Batches
- **Type and role:** `n8n-nodes-base.splitInBatches`; controls iteration over subreddit items
- **Configuration choices:**
- `batchSize = 1`
- Processes one subreddit at a time
- **Key expressions or variables used:** None
- **Input and output connections:**
- Input:
- `Configure Subreddits & Week Range`
- loop-back from `Append New Posts to Sheet`
- Outputs:
- `ScrapeOps: Fetch Subreddit Listing`
- `Build Weekly Digest`
- **Version-specific requirements:** Type version 2
- **Edge cases / failure types:**
- Because `Build Weekly Digest` is connected to the second output, the digest runs when batching completes; if no new posts were collected, the digest may still execute with empty data
- Loop behavior depends on n8n batch semantics; incorrect downstream termination can cause partial processing
- **Sub-workflow reference:** None
#### ScrapeOps: Fetch Subreddit Listing
- **Type and role:** `@scrapeops/n8n-nodes-scrapeops.ScrapeOps`; performs proxied HTTP fetch of subreddit HTML
- **Configuration choices:**
- URL expression:
- `https://old.reddit.com/r/{{$json.subreddit}}/top/?t=week`
- Uses ScrapeOps account credential
- No advanced options explicitly set
- **Key expressions or variables used:**
- `$json.subreddit`
- **Input and output connections:**
- Input: `Split Subreddits Into Batches`
- Output: `Polite Delay (13s)`
- **Version-specific requirements:** ScrapeOps node type version 1; requires installed ScrapeOps n8n node package and valid API credentials
- **Edge cases / failure types:**
- Invalid or missing ScrapeOps credential
- Reddit returning blocked, challenge, or alternate HTML
- Network timeout or proxy errors
- If the subreddit does not exist, the parsing stage may return zero posts
- **Sub-workflow reference:** None
#### Polite Delay (13s)
- **Type and role:** `n8n-nodes-base.wait`; rate-control pause
- **Configuration choices:**
- Wait duration in seconds
- Randomized expression: `Math.floor(Math.random()*3)+1`
- **Key expressions or variables used:**
- Dynamic wait duration expression
- **Input and output connections:**
- Input: `ScrapeOps: Fetch Subreddit Listing`
- Output: `Parse Listing HTML → Post Metadata`
- **Version-specific requirements:** Type version 1
- **Edge cases / failure types:**
- Wait node resumes execution asynchronously; environment must support wait/resume properly
- Very large runs can accumulate runtime overhead
- **Sub-workflow reference:** None
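
To make the range of the wait expression concrete, here is the same arithmetic wrapped in a small function. The function name `politeDelaySeconds` and the injectable `rand` parameter are assumptions added so the bounds can be checked deterministically; the node itself only uses the inline expression.

```javascript
// Same arithmetic as the Wait node expression; rand is injectable for testing.
function politeDelaySeconds(rand = Math.random) {
  // rand() is in [0, 1), so rand() * 3 is in [0, 3) and the floor is 0, 1, or 2.
  return Math.floor(rand() * 3) + 1;
}
```

Because `Math.random()` never returns exactly 1, the result is always a whole number of seconds from 1 to 3 inclusive.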
---
## 2.3 Block: Parse Post Metadata
### Overview
This block converts scraped subreddit listing HTML into structured post objects. It also generates fallback metadata, detects subreddit names, and computes stable hashes for deduplication.
### Nodes Involved
- Parse Listing HTML → Post Metadata
### Node Details
#### Parse Listing HTML → Post Metadata
- **Type and role:** `n8n-nodes-base.code`; HTML parser and record builder
- **Configuration choices:**
- Reads HTML from `$json.data`, `$json.body`, or raw `$json`
- Uses `limit` from input, defaulting to 20
- Calculates fallback values:
- `run_id`
- `run_date`
- `week_range`
- Parses old Reddit listing blocks by splitting on `<div class="thing`
- Extracts:
- `post_id`
- `post_url`
- `post_title`
- `post_text` from listing `data-selftext-html` if present
- `author`
- `created_utc`
- `score`
- `num_comments`
- `flair`
- `subreddit`
- Normalizes URLs to `www.reddit.com`
- Builds `content_hash` as SHA-1 of `subreddit + title + url`
- Sets `is_new = true` initially
- `alwaysOutputData = true`
- **Key expressions or variables used:**
- `$json.data`
- `$json.body`
- `$json.limit`
- `require('crypto')`
- **Input and output connections:**
- Input: `Polite Delay (13s)`
- Outputs:
- `ScrapeOps: Fetch Post Details (JSON)`
- `Merge Post Metadata + Text` (input 0)
- **Version-specific requirements:** Code node type version 2; runtime must permit `require('crypto')`
- **Edge cases / failure types:**
- HTML structure changes on Reddit can break regex parsing
- Encoded characters may not be fully normalized
- If no posts are detected, downstream merge may receive empty data
- If `old.reddit.com` changes attributes like `data-permalink` or `data-fullname`, extraction can fail silently
- **Sub-workflow reference:** None
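
The split-and-extract approach described for this node could look roughly like the following. It is a simplified sketch under stated assumptions, not the node's real parser: the helper names `attr` and `parseListing` are hypothetical, only a few fields are extracted, the `data-subreddit` attribute and the title anchor pattern are assumed to be present in the markup, and the hash input is a plain concatenation of subreddit, title, and URL.

```javascript
const crypto = require('crypto');

// Read one HTML attribute value from a chunk of markup ('' if absent).
function attr(chunk, name) {
  const match = chunk.match(new RegExp(`${name}="([^"]*)"`));
  return match ? match[1] : '';
}

function parseListing(html, limit = 20) {
  // Everything before the first marker is page chrome, so drop it.
  const blocks = String(html || '').split('<div class="thing').slice(1);
  const posts = [];
  for (const block of blocks) {
    if (posts.length >= limit) break;
    const permalink = attr(block, 'data-permalink');
    if (!permalink) continue; // skip blocks that do not look like posts
    const titleMatch = block.match(/<a[^>]*class="title[^"]*"[^>]*>([^<]*)<\/a>/);
    const post_title = titleMatch ? titleMatch[1].trim() : '';
    const subreddit = attr(block, 'data-subreddit');
    // Normalize to www.reddit.com regardless of which host served the page.
    const post_url = 'https://www.reddit.com' + permalink;
    const content_hash = crypto
      .createHash('sha1')
      .update(subreddit + post_title + post_url)
      .digest('hex');
    posts.push({
      json: {
        post_id: attr(block, 'data-fullname'),
        subreddit,
        post_title,
        post_url,
        content_hash,
        is_new: true,
      },
    });
  }
  return posts;
}
```

Regex parsing like this is brittle by nature, which is why the edge cases above call out silent failures when Reddit changes its attributes.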
---
## 2.4 Block: Post Enrichment & Finalization
### Overview
This block fetches each post's JSON representation, extracts full text where available, merges enrichment data with listing data, and produces the final normalized post record.
### Nodes Involved
- ScrapeOps: Fetch Post Details (JSON)
- Extract Selftext & Post Type
- Merge Post Metadata + Text
- Finalize & Normalize Post Fields
### Node Details
#### ScrapeOps: Fetch Post Details (JSON)
- **Type and role:** `@scrapeops/n8n-nodes-scrapeops.ScrapeOps`; fetches per-post Reddit JSON endpoint
- **Configuration choices:**
- URL expression:
- `($json.post_url || '').replace(/\?.*$/, '').replace(/\/$/, '') + '.json?raw_json=1'`
- `returnType = "json"`
- Uses ScrapeOps account credential
- **Key expressions or variables used:**
- `$json.post_url`
- **Input and output connections:**
- Input: `Parse Listing HTML → Post Metadata`
- Output: `Extract Selftext & Post Type`
- **Version-specific requirements:** ScrapeOps node type version 1
- **Edge cases / failure types:**
- Post URL may be malformed or blank
- Some posts may return HTML instead of JSON due to blocking or redirects
- Deleted or removed posts can yield incomplete data
- **Sub-workflow reference:** None
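
The URL expression above strips any query string and trailing slash before appending the JSON suffix. Wrapped in a function (the name `toPostJsonUrl` is an assumption for illustration), its behavior is easy to check in isolation:

```javascript
// Same transformation as the node's URL expression.
function toPostJsonUrl(postUrl) {
  return (postUrl || '')
    .replace(/\?.*$/, '') // drop the query string, if any
    .replace(/\/$/, '')   // drop a single trailing slash
    + '.json?raw_json=1';
}

// A blank input yields only the suffix, which is not a fetchable URL,
// so a caller may want to skip such items before the request.
const example = toPostJsonUrl('https://www.reddit.com/r/devops/comments/abc/hello/?utm_source=x');
```

The blank-input case is the "malformed or blank" edge case listed above: the expression never throws, it just produces a URL that cannot succeed.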
#### Extract Selftext & Post Type
- **Type and role:** `n8n-nodes-base.code`; robust JSON extractor for Reddit post details
- **Configuration choices:**
- Searches for response text in `body`, `data`, `response`, or longest string field
- Decodes common HTML entities
- Detects and rejects HTML responses
- Parses either array or object JSON payload
- Extracts:
- `post_text_extracted`
- `post_type`
- `post_title`
- `post_id`
- `post_url`
- `subreddit`
- `score`
- `num_comments`
- `author`
- `created_utc`
- Returns `extracted_ok: false` with diagnostic fields on failure
- **Key expressions or variables used:**
- `$json`
- nested Reddit JSON path `data.children[0].data`
- **Input and output connections:**
- Input: `ScrapeOps: Fetch Post Details (JSON)`
- Output: `Merge Post Metadata + Text` (input 1)
- **Version-specific requirements:** Code node type version 2
- **Edge cases / failure types:**
- Response body absent
- JSON parse errors
- HTML challenge pages
- Alternate Reddit JSON structures not matching expected path
- `selftext` legitimately empty for link, image, and many media posts
- **Sub-workflow reference:** None
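
A reduced version of this extraction, covering the HTML rejection, the array-or-object payload, and the `data.children[0].data` path, might look like this. The function name `extractPost`, the `reason` diagnostic field, and the rule that infers `post_type` from `is_self` are assumptions for this sketch; the real node handles more response shapes and returns more fields.

```javascript
function extractPost(raw) {
  // Accept either a raw string body or an already-parsed object.
  const text = typeof raw === 'string' ? raw.trim() : JSON.stringify(raw ?? '');

  // A challenge or error page starts with markup rather than JSON.
  if (text.startsWith('<')) {
    return { extracted_ok: false, reason: 'html_response' };
  }

  let payload;
  try {
    payload = JSON.parse(text);
  } catch (err) {
    return { extracted_ok: false, reason: 'json_parse_error' };
  }

  // Post endpoints usually return [postListing, commentListing]; tolerate a bare object too.
  const listing = Array.isArray(payload) ? payload[0] : payload;
  const post = listing?.data?.children?.[0]?.data;
  if (!post) {
    return { extracted_ok: false, reason: 'unexpected_structure' };
  }

  return {
    extracted_ok: true,
    post_title: post.title ?? '',
    post_text_extracted: post.selftext ?? '', // legitimately empty for link posts
    post_type: post.is_self ? 'self' : 'link', // assumed inference rule
  };
}
```

Returning a failure object instead of throwing keeps the item count stable, which matters because the next node merges by position.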
#### Merge Post Metadata + Text
- **Type and role:** `n8n-nodes-base.merge`; combines listing metadata with enriched post data
- **Configuration choices:**
- `mode = combine`
- `combinationMode = mergeByPosition`
- **Key expressions or variables used:** None
- **Input and output connections:**
- Inputs:
- Input 0: `Parse Listing HTML → Post Metadata`
- Input 1: `Extract Selftext & Post Type`
- Output: `Finalize & Normalize Post Fields`
- **Version-specific requirements:** Merge node type version 2
- **Edge cases / failure types:**
- Merge-by-position assumes both branches emit items in exactly the same order and count
- If the JSON-fetch branch drops or adds items, records can become misaligned
- **Sub-workflow reference:** None
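
One way to avoid the misalignment risk is to join the two branches on a shared key instead of on position. The sketch below illustrates that alternative; it is not what the workflow does. It assumes both branches carry a `post_id` in the same format, and the name `mergeByKey` is hypothetical.

```javascript
// Join enrichment data onto listing items by a shared key (assumed: post_id).
function mergeByKey(listingItems, detailItems, key = 'post_id') {
  const detailsByKey = new Map(detailItems.map((d) => [d[key], d]));
  return listingItems.map((item) => ({
    // Detail fields go first so listing fields win on any name clash.
    ...(detailsByKey.get(item[key]) || {}),
    ...item,
  }));
}

const listing = [
  { post_id: 'a', post_title: 'A' },
  { post_id: 'b', post_title: 'B' },
];
// The detail branch dropped post 'a', so a positional merge would attach
// the text for 'b' to post 'A'.
const details = [{ post_id: 'b', post_text_extracted: 'text b' }];
const merged = mergeByKey(listing, details);
```

With a keyed join, a missing detail item simply leaves that post without extracted text rather than shifting every later record.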
#### Finalize & Normalize Post Fields
- **Type and role:** `n8n-nodes-base.code`; final field cleanup
- **Configuration choices:**
- If `post_text_extracted` is non-empty, it overwrites `post_text`
- Otherwise keeps existing `post_text`
- Deletes `post_text_extracted`
- **Key expressions or variables used:**
- `item.json.post_text_extracted`
- **Input and output connections:**
- Input: `Merge Post Metadata + Text`
- Output: `Merge Scraped + Existing Posts`
- **Version-specific requirements:** Code node type version 2
- **Edge cases / failure types:**
- If merge misalignment occurred earlier, the wrong text can be assigned to a record
- **Sub-workflow reference:** None
---
## 2.5 Block: Read Existing Posts, Deduplicate, and Save
### Overview
This block loads existing post history from Google Sheets, compares scraped posts against historical content, flags only unseen posts as new, and appends them to the `posts` worksheet.
### Nodes Involved
- Read Existing Posts from Sheet
- Merge Scraped + Existing Posts
- Deduplicate New Posts
- Append New Posts to Sheet
### Node Details
#### Read Existing Posts from Sheet
- **Type and role:** `n8n-nodes-base.googleSheets`; loads existing saved posts
- **Configuration choices:**
- Reads from spreadsheet ID `1rKuVREV4pedie7uAbuEcLvghNrAEeAjIPUBE6cQPleI`
- Targets `gid=0`, cached as sheet name `posts`
- `alwaysOutputData = true`
- **Key expressions or variables used:** None
- **Input and output connections:**
- Input: `Configure Subreddits & Week Range`
- Output: `Merge Scraped + Existing Posts` (input 1)
- **Version-specific requirements:** Google Sheets node type version 3; requires OAuth2 credentials
- **Edge cases / failure types:**
- OAuth token expiration or missing scopes
- Spreadsheet or sheet tab renamed/deleted
- Large sheets may increase runtime
- **Sub-workflow reference:** None
#### Merge Scraped + Existing Posts
- **Type and role:** `n8n-nodes-base.merge`; synchronization point between scraped items and sheet-read branch
- **Configuration choices:**
- `mode = combine`
- `combinationMode = mergeByPosition`
- **Key expressions or variables used:** None
- **Input and output connections:**
- Inputs:
- Input 0: `Finalize & Normalize Post Fields`
- Input 1: `Read Existing Posts from Sheet`
- Output: `Deduplicate New Posts`
- **Version-specific requirements:** Merge node type version 2
- **Edge cases / failure types:**
- The deduplication code does not rely on merged row pairing; it separately calls `$items('Read Existing Posts from Sheet')`, so this merge mainly acts as a gating/join node
- Merge-by-position here is not semantically ideal because counts will differ greatly between scraped posts and historical sheet rows
- **Sub-workflow reference:** None
#### Deduplicate New Posts
- **Type and role:** `n8n-nodes-base.code`; deduplicates against sheet data and workflow static cache
- **Configuration choices:**
- Loads workflow static global store
- Ensures `global.seen` and `global.newPosts` arrays exist
- Normalizes URLs by replacing `old.reddit.com` with `www.reddit.com`
- Reads all rows from `Read Existing Posts from Sheet` using `$items(...)`
- Adds both `content_hash` and `post_url` to a lookup set
- For each incoming scraped item:
- marks `is_new = true` if unseen
- pushes new items into `global.newPosts`
- otherwise sets `is_new = false`
- Returns all items regardless of newness
- `alwaysOutputData = true`
- **Key expressions or variables used:**
- `$getWorkflowStaticData('global')`
- `$items('Read Existing Posts from Sheet')`
- **Input and output connections:**
- Input: `Merge Scraped + Existing Posts`
- Output: `Append New Posts to Sheet`
- **Version-specific requirements:** Code node type version 2
- **Edge cases / failure types:**
- Because all items are returned, the next append node will write duplicates too unless separately filtered; this is a major behavioral issue
- Static global cache is reset at the start of each run, so cross-run memory comes mainly from Google Sheets, not static memory
- Column fallback references like `r[15]`, `r.Q`, `r[6]`, `r.G` assume possible alternate formats and may not always map correctly
- **Sub-workflow reference:** None
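
The lookup-set logic, together with the filter that the current wiring lacks, can be sketched as follows. The function names `normalizeUrl`, `flagNewPosts`, and `onlyNew` are assumptions, existing rows are assumed to expose `content_hash` and `post_url` by name, and the within-run tracking is a simplification of what the node does with `global.seen`.

```javascript
const normalizeUrl = (url) =>
  String(url || '').replace('old.reddit.com', 'www.reddit.com');

function flagNewPosts(scraped, existingRows) {
  // Seed the lookup set with every hash and normalized URL already in the sheet.
  const seen = new Set();
  for (const row of existingRows) {
    if (row.content_hash) seen.add(row.content_hash);
    if (row.post_url) seen.add(normalizeUrl(row.post_url));
  }

  return scraped.map((post) => {
    const url = normalizeUrl(post.post_url);
    const hash = post.content_hash;
    const isDuplicate = (hash && seen.has(hash)) || (url && seen.has(url));
    // Record the keys so a repeat later in the same run is also caught.
    if (hash) seen.add(hash);
    if (url) seen.add(url);
    return { ...post, post_url: url, is_new: !isDuplicate };
  });
}

// The step the current wiring is missing: drop already-seen posts before appending.
const onlyNew = (posts) => posts.filter((post) => post.is_new);
```

In n8n the filter could equally be an IF or Filter node on `is_new` placed between `Deduplicate New Posts` and `Append New Posts to Sheet`.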
#### Append New Posts to Sheet
- **Type and role:** `n8n-nodes-base.googleSheets`; appends scraped post rows to the `posts` tab
- **Configuration choices:**
- Operation: `append`
- Spreadsheet ID: `1rKuVREV4pedie7uAbuEcLvghNrAEeAjIPUBE6cQPleI`
- Sheet name: `posts`
- Explicit column mapping for:
- `run_id`
- `run_date`
- `subreddit`
- `sort`
- `time_range`
- `post_id`
- `post_url`
- `post_title`
- `post_text`
- `author`
- `created_utc`
- `score`
- `num_comments`
- `flair`
- `extracted_at`
- `content_hash`
- `is_new`
- **Key expressions or variables used:**
- Per-column `{{$json.field}}` expressions
- **Input and output connections:**
- Input: `Deduplicate New Posts`
- Output: `Split Subreddits Into Batches` (loop-back)
- **Version-specific requirements:** Google Sheets node type version 4.5; requires OAuth2 credentials
- **Edge cases / failure types:**
- As currently wired, this node appends all items passed from `Deduplicate New Posts`, including those with `is_new = false`
- Sheet schema mismatch can cause blank cells or append errors
- Numeric fields are not force-converted; Google Sheets may infer unexpected types
- **Sub-workflow reference:** None
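
If the type inference noted above becomes a problem, numeric fields can be coerced before the append. This is an optional sketch rather than part of the workflow; `toNumber` and `coerceNumericFields` are hypothetical helper names, and falling back to `0` is an assumption that may not suit every sheet.

```javascript
// Convert scraped text such as " 1,204 " to a number; use fallback when not numeric.
function toNumber(value, fallback = 0) {
  const n = Number(String(value ?? '').replace(/,/g, '').trim());
  return Number.isFinite(n) ? n : fallback;
}

function coerceNumericFields(post) {
  return {
    ...post,
    score: toNumber(post.score),
    num_comments: toNumber(post.num_comments),
  };
}
```

Running each item through `coerceNumericFields` in a Code node before the append gives the sheet consistent numeric cells for `score` and `num_comments`.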
---
## 2.6 Block: Build Weekly Digest and Send Email
### Overview
This block compiles the run's newly discovered posts into a lightweight digest, stores the digest in Google Sheets, and optionally sends it by email.
### Nodes Involved
- Build Weekly Digest
- Append Weekly Digest to Sheet
- Send Weekly Digest Email
### Node Details
#### Build Weekly Digest
- **Type and role:** `n8n-nodes-base.code`; creates summary metrics, topic clusters, and a text digest
- **Configuration choices:**
- Reads `global.newPosts`
- Computes:
- `total_posts`
- combined `subreddits`
- frequency-based topic words excluding common stopwords
- up to 8 topic clusters, padded to at least 5 when possible
- top 10 posts sorted by score then comments
- Produces:
- `top_topics_json`
- `top_posts_json`
- `weekly_brief_text`
- `created_at`
- `run_id`
- `week_range`
- **Key expressions or variables used:**
- `$getWorkflowStaticData('global')`
- word tokenization using `/[^a-z0-9+#]+/`
- **Input and output connections:**
- Input: `Split Subreddits Into Batches` (completion path)
- Output: `Append Weekly Digest to Sheet`
- **Version-specific requirements:** Code node type version 2
- **Edge cases / failure types:**
- If `global.newPosts` is empty, it still generates a digest row with `total_posts = 0`
- Topic extraction is simplistic and may produce weak topic labels
- Stopword list is English-only and minimal
- **Sub-workflow reference:** None
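
The two core computations, word-frequency topics and score-ordered top posts, can be sketched as below. This is a simplified illustration: the stopword list is a small stand-in, the minimum word length of 3 is an assumption, topics are derived from titles only, and the padding of topic clusters to at least 5 is not implemented. The names `topTopics` and `topPosts` are hypothetical.

```javascript
// Small illustrative stopword list; the node's actual list is not reproduced here.
const STOPWORDS = new Set(['the', 'and', 'for', 'with', 'how', 'what', 'that', 'this']);

function topTopics(posts, maxTopics = 8) {
  const counts = new Map();
  for (const post of posts) {
    // Same tokenizer pattern as the node: keeps tokens such as "c++" and "c#".
    const words = String(post.post_title || '').toLowerCase().split(/[^a-z0-9+#]+/);
    for (const word of words) {
      if (word.length < 3 || STOPWORDS.has(word)) continue;
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0])) // count desc, then A-Z
    .slice(0, maxTopics)
    .map(([topic, count]) => ({ topic, count }));
}

function topPosts(posts, maxPosts = 10) {
  // Sort a copy by score, breaking ties by comment count.
  return [...posts]
    .sort((a, b) => (b.score - a.score) || (b.num_comments - a.num_comments))
    .slice(0, maxPosts);
}
```

Both functions return empty arrays for an empty input, which matches the edge case above where a digest row is still written with `total_posts = 0`.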
#### Append Weekly Digest to Sheet
- **Type and role:** `n8n-nodes-base.googleSheets`; stores weekly digest output
- **Configuration choices:**
- Operation: `append`
- Spreadsheet ID: `1rKuVREV4pedie7uAbuEcLvghNrAEeAjIPUBE6cQPleI`
- Sheet name: `weekly_digest`
- Column mapping:
- `run_id`
- `created_at`
- `subreddits`
- `week_range`
- `total_posts`
- `top_posts_json`
- `top_topics_json`
- `weekly_brief_text`
- **Key expressions or variables used:**
- Per-column `{{$json.field}}` expressions
- **Input and output connections:**
- Input: `Build Weekly Digest`
- Output: `Send Weekly Digest Email`
- **Version-specific requirements:** Google Sheets node type version 4.5
- **Edge cases / failure types:**
- Missing `weekly_digest` tab causes failure
- Large JSON strings may make sheet cells difficult to inspect
- **Sub-workflow reference:** None
#### Send Weekly Digest Email
- **Type and role:** `n8n-nodes-base.emailSend`; sends the digest via email
- **Configuration choices:**
- Subject expression:
- `Weekly Developer Tools Digest (Reddit) {{$json.week_range}}`
- The exported JSON shows mojibake in the dash character, which should be corrected manually
- Email body: `{{$json.weekly_brief_text}}`
- To: `user@example.com`
- From: `you@example.com`
- `executeOnce = true`
- **Key expressions or variables used:**
- `$json.weekly_brief_text`
- `$json.week_range`
- **Input and output connections:**
- Input: `Append Weekly Digest to Sheet`
- Output: none
- **Version-specific requirements:** Email Send node type version 2; requires SMTP or email transport configuration depending on n8n setup
- **Edge cases / failure types:**
- Placeholder email addresses must be replaced
- Missing email transport credentials causes failure
- `executeOnce` means only one email is sent even if upstream emits multiple items
- **Sub-workflow reference:** None
---
# 3. Summary Table
| Node Name | Node Type | Functional Role | Input Node(s) | Output Node(s) | Sticky Note |
|---|---|---|---|---|---|
| Weekly Schedule Trigger | Schedule Trigger | Starts the workflow on a schedule | | Configure Subreddits & Week Range | ## 1. Trigger & Configuration<br>Fires weekly and sets runtime config — subreddit list, week range, batch size, and Google Sheet IDs. |
| Configure Subreddits & Week Range | Code | Builds per-subreddit runtime items and resets static state | Weekly Schedule Trigger | Split Subreddits Into Batches; Read Existing Posts from Sheet | ## 1. Trigger & Configuration<br>Fires weekly and sets runtime config — subreddit list, week range, batch size, and Google Sheet IDs. |
| Split Subreddits Into Batches | Split In Batches | Iterates through subreddits one at a time | Configure Subreddits & Week Range; Append New Posts to Sheet | ScrapeOps: Fetch Subreddit Listing; Build Weekly Digest | ## 2. Scrape Subreddit Listings<br>Batch through each subreddit and scrape the "Top of Week" page via [ScrapeOps Proxy](https://scrapeops.io/docs/n8n/proxy-api/) with a polite delay between requests. |
| ScrapeOps: Fetch Subreddit Listing | ScrapeOps | Fetches subreddit top-of-week HTML via proxy | Split Subreddits Into Batches | Polite Delay (13s) | ## 2. Scrape Subreddit Listings<br>Batch through each subreddit and scrape the "Top of Week" page via [ScrapeOps Proxy](https://scrapeops.io/docs/n8n/proxy-api/) with a polite delay between requests. |
| Polite Delay (13s) | Wait | Adds random delay between requests | ScrapeOps: Fetch Subreddit Listing | Parse Listing HTML → Post Metadata | ## 2. Scrape Subreddit Listings<br>Batch through each subreddit and scrape the "Top of Week" page via [ScrapeOps Proxy](https://scrapeops.io/docs/n8n/proxy-api/) with a polite delay between requests. |
| Parse Listing HTML → Post Metadata | Code | Parses old Reddit listing HTML into structured posts | Polite Delay (13s) | ScrapeOps: Fetch Post Details (JSON); Merge Post Metadata + Text | ## 3. Parse Post Metadata<br>Extract title, URL, score, comment count, author, and timestamps from listing HTML into structured JSON. |
| ScrapeOps: Fetch Post Details (JSON) | ScrapeOps | Fetches per-post Reddit JSON | Parse Listing HTML → Post Metadata | Extract Selftext & Post Type | ## 4. Enrich & Finalize Posts<br>Fetch each post as JSON to extract `selftext`, merge with listing metadata, and normalize all fields into the final record. |
| Extract Selftext & Post Type | Code | Extracts selftext and post characteristics from Reddit JSON | ScrapeOps: Fetch Post Details (JSON) | Merge Post Metadata + Text | ## 4. Enrich & Finalize Posts<br>Fetch each post as JSON to extract `selftext`, merge with listing metadata, and normalize all fields into the final record. |
| Merge Post Metadata + Text | Merge | Merges listing metadata with post JSON extraction | Parse Listing HTML → Post Metadata; Extract Selftext & Post Type | Finalize & Normalize Post Fields | ## 4. Enrich & Finalize Posts<br>Fetch each post as JSON to extract `selftext`, merge with listing metadata, and normalize all fields into the final record. |
| Finalize & Normalize Post Fields | Code | Chooses best post text and cleans fields | Merge Post Metadata + Text | Merge Scraped + Existing Posts | ## 4. Enrich & Finalize Posts<br>Fetch each post as JSON to extract `selftext`, merge with listing metadata, and normalize all fields into the final record. |
| Read Existing Posts from Sheet | Google Sheets | Loads existing saved posts for deduplication | Configure Subreddits & Week Range | Merge Scraped + Existing Posts | ## 5. Deduplicate & Save<br>Compare against existing Sheet rows by hash and URL, then append only new posts to the `posts` tab. |
| Merge Scraped + Existing Posts | Merge | Synchronizes scraped branch and sheet-read branch before deduplication | Finalize & Normalize Post Fields; Read Existing Posts from Sheet | Deduplicate New Posts | ## 5. Deduplicate & Save<br>Compare against existing Sheet rows by hash and URL, then append only new posts to the `posts` tab. |
| Deduplicate New Posts | Code | Flags duplicates using hash and URL and stores new posts in static memory | Merge Scraped + Existing Posts | Append New Posts to Sheet | ## 5. Deduplicate & Save<br>Compare against existing Sheet rows by hash and URL, then append only new posts to the `posts` tab. |
| Append New Posts to Sheet | Google Sheets | Appends post rows to the `posts` sheet and loops batch execution | Deduplicate New Posts | Split Subreddits Into Batches | ## 5. Deduplicate & Save<br>Compare against existing Sheet rows by hash and URL, then append only new posts to the `posts` tab. |
| Build Weekly Digest | Code | Builds digest summary from newly found posts | Split Subreddits Into Batches | Append Weekly Digest to Sheet | ## 6. Weekly Digest & Email<br>Generate topic clusters and top post summaries, write to `weekly_digest` tab, and optionally send by email. |
| Append Weekly Digest to Sheet | Google Sheets | Stores weekly digest in sheet | Build Weekly Digest | Send Weekly Digest Email | ## 6. Weekly Digest & Email<br>Generate topic clusters and top post summaries, write to `weekly_digest` tab, and optionally send by email. |
| Send Weekly Digest Email | Email Send | Emails the final digest text | Append Weekly Digest to Sheet | | ## 6. Weekly Digest & Email<br>Generate topic clusters and top post summaries, write to `weekly_digest` tab, and optionally send by email. |
| Overview (Sticky) | Sticky Note | Workspace documentation | | | # 📰 Reddit Industry Digest (Weekly) → Google Sheets<br>This workflow builds a weekly industry digest by collecting top posts from selected subreddits — no Reddit API needed. It scrapes public Reddit pages via **ScrapeOps Proxy**, enriches each post with full text using Reddit's JSON endpoint, deduplicates against your Google Sheet, and generates a weekly summary that can optionally be emailed.<br>### How it works<br>1. ⏰ **Weekly Schedule Trigger** fires automatically once a week.<br>2. ⚙️ **Configure Subreddits & Week Range** sets the subreddit list, week range, and Sheet IDs.<br>3. 📦 **Split Subreddits Into Batches** processes each subreddit one at a time.<br>4. 🌐 **ScrapeOps: Fetch Subreddit Listing** scrapes the top-of-week page from `old.reddit.com`.<br>5. ⏳ **Polite Delay** adds a 1–3s pause between requests.<br>6. 🔍 **Parse Listing HTML** extracts title, URL, score, comments, author, and timestamps.<br>7. 📡 **ScrapeOps: Fetch Post Details** retrieves each post as JSON to extract `selftext`.<br>8. 🔀 **Merge & Normalize** combines listing data with post body text into a final record.<br>9. 🧹 **Deduplicate New Posts** filters posts already in the Sheet by hash and URL.<br>10. 💾 **Append New Posts** saves only new posts to the `posts` tab.<br>11. 📊 **Build Weekly Digest** generates topic clusters and top post summaries.<br>12. 📧 **Send Digest Email** optionally emails the weekly summary.<br>### Setup steps<br>- Register for a free ScrapeOps API key: https://scrapeops.io/app/register/n8n<br>- Add ScrapeOps credentials in n8n. Docs: https://scrapeops.io/docs/n8n/overview/<br>- Duplicate [this sheet](https://docs.google.com/spreadsheets/d/1rKuVREV4pedie7uAbuEcLvghNrAEeAjIPUBE6cQPleI/edit?usp=sharing) to copy Columns and Spreadsheet ID.<br>- Connect Google Sheets and set your Spreadsheet ID in the Sheet nodes.<br>- Update your subreddit list in **Configure Subreddits & Week Range**.<br>- Optional: enable **Send Digest Email** and configure credentials.<br>### Customization<br>- Add or remove subreddits in the configure node.<br>- Change timeframe from `week` to `month` in the fetch URL.<br>- Add a Slack node to post the digest to a channel. |
| Section: Trigger & Inputs | Sticky Note | Visual section label | | | ## 1. Trigger & Configuration<br>Fires weekly and sets runtime config — subreddit list, week range, batch size, and Google Sheet IDs. |
| Section: Scrape Listings | Sticky Note | Visual section label | | | ## 2. Scrape Subreddit Listings<br>Batch through each subreddit and scrape the "Top of Week" page via [ScrapeOps Proxy](https://scrapeops.io/docs/n8n/proxy-api/) with a polite delay between requests. |
| Section: Post Enrichment | Sticky Note | Visual section label | | | ## 3. Parse Post Metadata<br>Extract title, URL, score, comment count, author, and timestamps from listing HTML into structured JSON. |
| Section: Post Enrichment1 | Sticky Note | Visual section label | | | ## 4. Enrich & Finalize Posts<br>Fetch each post as JSON to extract `selftext`, merge with listing metadata, and normalize all fields into the final record. |
| Section: Post Enrichment2 | Sticky Note | Visual section label | | | ## 5. Deduplicate & Save<br>Compare against existing Sheet rows by hash and URL, then append only new posts to the `posts` tab. |
| Section: Post Enrichment3 | Sticky Note | Visual section label | | | ## 6. Weekly Digest & Email<br>Generate topic clusters and top post summaries, write to `weekly_digest` tab, and optionally send by email. |
---
# 4. Reproducing the Workflow from Scratch
1. **Create a new workflow**
- Name it something like: `Reddit Industry Digest with ScrapeOps and Google Sheets`.
2. **Add a Schedule Trigger node**
- Node type: `Schedule Trigger`
- Configure it to run weekly.
- Choose the desired weekday and time in your n8n instance timezone.
3. **Add a Code node named `Configure Subreddits & Week Range`**
- Connect it after the trigger.
- Paste logic that:
- resets workflow static global arrays `seen` and `newPosts`
- computes:
- `run_id`
- `run_date`
- Monday-to-Sunday `week_range`
- defines a subreddit list such as:
- `selfhosted`
- `devops`
- `programming`
- `webdev`
- emits one item per subreddit with:
- `subreddit`
- `sort = top`
- `time_range = week`
- `limit = 20`
- `sheet_id = your spreadsheet ID`
4. **Add a Google Sheets credential**
- Use OAuth2 for Google Sheets.
- Ensure access to the destination spreadsheet.
5. **Prepare the spreadsheet**
- Create or duplicate a spreadsheet with two tabs:
- `posts`
- `weekly_digest`
- The `posts` tab should contain columns:
- `run_id`
- `run_date`
- `subreddit`
- `sort`
- `time_range`
- `post_id`
- `post_url`
- `post_title`
- `post_text`
- `author`
- `created_utc`
- `score`
- `num_comments`
- `flair`
- `extracted_at`
- `content_hash`
- `is_new`
- The `weekly_digest` tab should contain columns:
- `run_id`
- `week_range`
- `subreddits`
- `total_posts`
- `top_topics_json`
- `weekly_brief_text`
- `top_posts_json`
- `created_at`
6. **Add a Google Sheets node named `Read Existing Posts from Sheet`**
- Connect it from `Configure Subreddits & Week Range`.
- Configure it to read from your spreadsheet.
- Select the `posts` tab.
- Enable `Always Output Data` in the node settings, if your node version offers it, so the node still emits an item when the `posts` tab is empty.
7. **Add a `Split In Batches` node**
- Name it ` Split Subreddits Into Batches`.
- Connect it from `Configure Subreddits & Week Range`.
- Set `Batch Size` to `1`.
8. **Install and configure ScrapeOps**
- Install the ScrapeOps n8n node package if it is not already installed.
- Create ScrapeOps credentials with your API key.
- Reference:
- https://scrapeops.io/app/register/n8n
- https://scrapeops.io/docs/n8n/overview/
9. **Add a ScrapeOps node named `ScrapeOps: Fetch Subreddit Listing`**
- Connect it from ` Split Subreddits Into Batches`.
- Set URL to:
- `https://old.reddit.com/r/{{$json.subreddit}}/top/?t=week`
- Use the ScrapeOps credential.
- Keep response as HTML/text.
10. **Add a Wait node named ` Polite Delay (1–3s)`**
- Connect it after the listing fetch.
- Set unit to `seconds`.
- Set amount expression to the following, which evaluates to a whole number of 1, 2, or 3 seconds:
- `{{ Math.floor(Math.random()*3)+1 }}`
11. **Add a Code node named `Parse Listing HTML → Post Metadata`**
- Connect it after the wait node.
- Implement logic that:
- reads listing HTML from `data` or `body`
- parses each Reddit post block from `old.reddit.com`
- extracts title, author, permalink, score, comments, flair, and timestamp
- normalizes Reddit URLs to `https://www.reddit.com/...`
- computes `content_hash` using SHA-1
- emits one item per post
- honors a `limit` from input, default `20`
- Enable `Always Output Data`.
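A parser along the lines of step 11 might look like the sketch below. It assumes that each post on `old.reddit.com` is a `<div>` whose class contains `thing` and which carries `data-fullname`, `data-permalink`, `data-author`, `data-score`, `data-comments-count`, and `data-timestamp` attributes, and that the hash is computed over the URL and title. Those selectors, the hash input, and the helper names are assumptions rather than the workflow's confirmed logic. In an n8n Code node, `require('crypto')` may also need to be allowed by the instance configuration.

```javascript
const crypto = require('crypto');

// Minimal entity decoding; &amp; is handled last to avoid double-decoding.
function decodeEntities(s) {
  return s.replace(/&lt;/g, '<').replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"').replace(/&#0?39;/g, "'").replace(/&amp;/g, '&');
}

// Reads one attribute value out of an opening tag.
function attr(tag, name) {
  const m = tag.match(new RegExp(`\\s${name}="([^"]*)"`));
  return m ? m[1] : '';
}

// Drops nested markup and decodes entities in a captured fragment.
function stripTags(s) {
  return decodeEntities(s.replace(/<[^>]*>/g, '')).trim();
}

function parseListing(html, limit = 20) {
  // Assumption: every post is a <div class="... thing ..." data-fullname="t3_..."> block.
  const openTags = [...html.matchAll(/<div\s[^>]*class="[^"]*\bthing\b[^"]*"[^>]*>/g)];
  const posts = [];
  for (let i = 0; i < openTags.length && posts.length < limit; i++) {
    const tag = openTags[i][0];
    const fullname = attr(tag, 'data-fullname');
    const permalink = attr(tag, 'data-permalink');
    if (!fullname.startsWith('t3_') || !permalink) continue; // not a post block
    const end = i + 1 < openTags.length ? openTags[i + 1].index : html.length;
    const block = html.slice(openTags[i].index, end);
    const title = block.match(/<a\s[^>]*class="[^"]*\btitle\b[^"]*"[^>]*>([\s\S]*?)<\/a>/);
    const flair = block.match(/<span\s[^>]*class="[^"]*\blinkflairlabel\b[^"]*"[^>]*>([\s\S]*?)<\/span>/);
    const postUrl = 'https://www.reddit.com' + permalink;
    const postTitle = title ? stripTags(title[1]) : '';
    posts.push({
      post_id: fullname.slice(3),
      post_url: postUrl,
      post_title: postTitle,
      author: attr(tag, 'data-author'),
      flair: flair ? stripTags(flair[1]) : '',
      score: Number(attr(tag, 'data-score')) || 0,
      num_comments: Number(attr(tag, 'data-comments-count')) || 0,
      created_utc: Math.floor(Number(attr(tag, 'data-timestamp')) / 1000) || 0,
      // Assumed hash input: normalized URL plus title.
      content_hash: crypto.createHash('sha1').update(`${postUrl}|${postTitle}`).digest('hex'),
    });
  }
  return posts;
}

// Hand-written sample in the assumed markup, not a real Reddit response.
const sampleHtml = `
<div class=" thing id-t3_abc123 odd link " data-fullname="t3_abc123" data-author="alice" data-permalink="/r/selfhosted/comments/abc123/my_setup/" data-score="42" data-comments-count="7" data-timestamp="1773500000000">
  <p class="title"><a class="title may-blank " href="/r/selfhosted/comments/abc123/my_setup/">My &amp; setup</a>
  <span class="linkflairlabel " title="Guide">Guide</span></p>
</div>`;
const posts = parseListing(sampleHtml);
console.log(posts[0].post_title, posts[0].score);
```

Regex-based parsing is brittle by nature; it is used here only because a Code node may not have an HTML parser available. If the instance allows external modules, a real parser would be the safer choice.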
12. **Add a ScrapeOps node named ` ScrapeOps: Fetch Post Details (JSON)`**
- Connect it from `Parse Listing HTML → Post Metadata`.
- Set URL expression to:
- `{{ ($json.post_url || '').replace(/\?.*$/, '').replace(/\/$/, '') + '.json?raw_json=1' }}`
- Set return type to `json`.
- Use the same ScrapeOps credential.
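To see what the URL expression in step 12 produces, the same string operations can be run outside n8n. The wrapper function `postJsonUrl` and the sample URL are illustrative only.

```javascript
// Same transformation as the node's URL expression, wrapped in a function for testing.
function postJsonUrl(postUrl) {
  return (postUrl || '').replace(/\?.*$/, '').replace(/\/$/, '') + '.json?raw_json=1';
}

const example = postJsonUrl(
  'https://www.reddit.com/r/devops/comments/abc123/some_title/?utm_source=share'
);
console.log(example);
// https://www.reddit.com/r/devops/comments/abc123/some_title.json?raw_json=1

// A missing post_url does not throw; it yields an unusable URL instead.
console.log(postJsonUrl(undefined)); // .json?raw_json=1
```

Because an empty `post_url` silently becomes `.json?raw_json=1`, it may be worth dropping items without a URL before they reach the fetch node.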
13. **Add a Code node named `Extract Selftext & Post Type`**
- Connect it after the post-details node.
- Implement logic that:
- looks for the raw response in `body`, `data`, or `response`, falling back to the longest string field on the item
- decodes HTML entities
- rejects HTML responses
- parses JSON
- extracts post data from `data.children[0].data`
- emits fields including:
- `post_text_extracted`
- `post_type`
- `post_title`
- `post_id`
- `post_url`
- `subreddit`
- `score`
- `num_comments`
- `author`
- `created_utc`
- returns diagnostic data on parse failure
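The extraction steps above can be sketched as a pure function. The `findRaw` and `extractPost` names, the `parse_error` and `raw_preview` diagnostic fields, and the rule used to infer `post_type` (`is_self`, then `post_hint`, then `link`) are assumptions made for illustration. The field names read from the response (`selftext`, `is_self`, `post_hint`, `permalink`, and so on) are standard Reddit post JSON fields.

```javascript
// Decode a small set of HTML entities; &amp; goes last to avoid double-decoding.
function decodeEntities(s) {
  return s.replace(/&lt;/g, '<').replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"').replace(/&amp;/g, '&');
}

// Pick the raw response: known fields first, then the longest string field.
function findRaw(item) {
  for (const key of ['body', 'data', 'response']) {
    if (item[key] !== undefined && item[key] !== null) return item[key];
  }
  const strings = Object.values(item).filter((v) => typeof v === 'string');
  return strings.sort((a, b) => b.length - a.length)[0];
}

function extractPost(item) {
  const raw = findRaw(item);
  let parsed = raw;
  if (typeof raw === 'string') {
    const text = decodeEntities(raw).trim();
    if (/^<(!doctype|html)/i.test(text)) {
      // A block page or error page came back instead of JSON.
      return { parse_error: 'html_response', raw_preview: text.slice(0, 200) };
    }
    try {
      parsed = JSON.parse(text);
    } catch (err) {
      return { parse_error: 'invalid_json', raw_preview: text.slice(0, 200) };
    }
  }
  // The post endpoint returns [postListing, commentListing]; accept a bare listing too.
  const listing = Array.isArray(parsed) ? parsed[0] : parsed;
  const children = listing && listing.data && listing.data.children;
  const d = children && children[0] && children[0].data;
  if (!d) return { parse_error: 'missing_post_data' };
  return {
    post_text_extracted: d.selftext || '',
    post_type: d.is_self ? 'self' : d.post_hint || 'link', // assumed inference rule
    post_title: d.title || '',
    post_id: d.id || '',
    post_url: d.permalink ? 'https://www.reddit.com' + d.permalink : '',
    subreddit: d.subreddit || '',
    score: Number(d.score) || 0,
    num_comments: Number(d.num_comments) || 0,
    author: d.author || '',
    created_utc: Number(d.created_utc) || 0,
  };
}

// Hand-built sample response, not real Reddit data.
const sampleBody = JSON.stringify([
  { data: { children: [{ data: {
    id: 'abc123', title: 'Backup strategy', selftext: 'Hello', is_self: true,
    permalink: '/r/devops/comments/abc123/backup_strategy/', subreddit: 'devops',
    score: 5, num_comments: 2, author: 'bob', created_utc: 1773500000,
  } }] } },
  { data: { children: [] } },
]);
console.log(extractPost({ body: sampleBody }).post_type);
```

Returning a diagnostic object instead of throwing keeps the item count stable, which matters when the next node merges by position.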
14. **Add a Merge node named `Merge Post Metadata + Text`**
- Connect input 0 from `Parse Listing HTML → Post Metadata`
- Connect input 1 from `Extract Selftext & Post Type`
- Set:
- Mode: `Combine`
- Combination mode: `Merge By Position`
15. **Add a Code node named `Finalize & Normalize Post Fields`**
- Connect it after the merge.
- Configure it to:
- overwrite `post_text` with `post_text_extracted` when non-empty
- otherwise keep the existing `post_text`
- remove `post_text_extracted`
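The normalization in step 15 is small enough to show in full. The function name `finalizePost` is hypothetical, and treating whitespace-only text as empty is an assumption.

```javascript
// Prefer the extracted selftext when it has content; otherwise keep the existing post_text.
function finalizePost(post) {
  const { post_text_extracted, ...rest } = post;
  const extracted = typeof post_text_extracted === 'string' ? post_text_extracted.trim() : '';
  const existing = typeof rest.post_text === 'string' ? rest.post_text : '';
  return { ...rest, post_text: extracted !== '' ? extracted : existing };
}

// In an n8n Code node this would be applied to every incoming item:
//   return $input.all().map((item) => ({ json: finalizePost(item.json) }));
const finalized = finalizePost({
  post_id: 'abc123',
  post_text: '',
  post_text_extracted: 'Full body text',
});
console.log(finalized);
```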
16. **Add a Merge node named ` Merge Scraped + Existing Posts`**
- Connect input 0 from `Finalize & Normalize Post Fields`
- Connect input 1 from `Read Existing Posts from Sheet`
- Set:
- Mode: `Combine`
- Combination mode: `Merge By Position`
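- Note: this node mainly acts as a synchronization point, ensuring that `Read Existing Posts from Sheet` has run before `Deduplicate New Posts` reads its rows.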
17. **Add a Code node named `Deduplicate New Posts`**
- Connect it after the merge.
- Implement logic that:
- loads workflow static global data
- reads all rows from `Read Existing Posts from Sheet` with `$items(...)`
- builds a set of existing `content_hash` and normalized `post_url`
- checks each scraped item against that set
- sets `is_new` true or false
- pushes only new posts into `global.newPosts`
- returns items for downstream use
- Enable `Always Output Data`.
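The deduplication logic in step 17 might be sketched like this. The `normalizeUrl` rule (lowercase, no query string, no trailing slash) and the in-run tracking of already-seen posts are assumptions added for the sketch. In the Code node, `existingRows` would come from the sheet-reading node and `staticData` from `$getWorkflowStaticData('global')`.

```javascript
// Assumed normalization: lowercase, strip query string and trailing slashes.
function normalizeUrl(url) {
  return String(url || '').trim().toLowerCase().replace(/\?.*$/, '').replace(/\/+$/, '');
}

function deduplicate(scraped, existingRows, staticData) {
  const seenHashes = new Set(existingRows.map((r) => r.content_hash).filter(Boolean));
  const seenUrls = new Set(existingRows.map((r) => normalizeUrl(r.post_url)).filter(Boolean));
  staticData.newPosts = staticData.newPosts || [];

  return scraped.map((post) => {
    const url = normalizeUrl(post.post_url);
    const hashSeen = Boolean(post.content_hash) && seenHashes.has(post.content_hash);
    const urlSeen = url !== '' && seenUrls.has(url);
    const isNew = !hashSeen && !urlSeen;
    const flagged = { ...post, is_new: isNew };
    if (isNew) {
      staticData.newPosts.push(flagged);
      // Also guard against the same post appearing twice within one run.
      if (post.content_hash) seenHashes.add(post.content_hash);
      if (url !== '') seenUrls.add(url);
    }
    return flagged; // every item is returned; filtering happens downstream
  });
}

const staticData = {};
const flaggedPosts = deduplicate(
  [
    { content_hash: 'h1', post_url: 'https://www.reddit.com/r/a/comments/1/x' },
    { content_hash: 'h2', post_url: 'https://www.reddit.com/r/a/comments/2/y/' },
    { content_hash: 'h3', post_url: 'https://www.reddit.com/r/a/comments/1/x/?ref=share' },
  ],
  [{ content_hash: 'h1', post_url: 'https://www.reddit.com/r/a/comments/1/x/' }],
  staticData
);
console.log(flaggedPosts.map((p) => p.is_new));
```

Matching on either the hash or the normalized URL means a post whose title was edited (and whose hash therefore changed) is still recognized as already stored.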
18. **Important correction: filter before appending**
- The provided workflow claims to append only new posts, but as wired, `Deduplicate New Posts` passes every item, new or not, to the append node.
- To reproduce the intended behavior safely, add an `IF` node or a Code filter after `Deduplicate New Posts`:
- condition: `{{$json.is_new}}` is true
- Send only the true branch to the append node.
- If reproducing the JSON exactly, omit this filter; if reproducing the intended logic, include it.
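If a Code node is preferred over an `IF` node for the filter in step 18, a sketch could be as short as the following. The function name `onlyNew` is hypothetical, and the strict boolean comparison assumes `is_new` is still a boolean at this point rather than a string read back from a sheet.

```javascript
// Keep only items explicitly flagged as new by the deduplication step.
function onlyNew(items) {
  return items.filter((item) => item.json && item.json.is_new === true);
}

// In an n8n Code node: return onlyNew($input.all());
const kept = onlyNew([
  { json: { post_id: 'a', is_new: true } },
  { json: { post_id: 'b', is_new: false } },
  { json: { post_id: 'c' } },
]);
console.log(kept.map((item) => item.json.post_id));
```

When no item is new, the filter returns an empty list and the append node does not run for that batch, so the loop-back connection in step 20 should be checked to confirm that the next subreddit still gets processed.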
19. **Add a Google Sheets node named `Append New Posts to Sheet`**
- Connect it from:
- ideally the filtered `true` branch from step 18
- or directly from `Deduplicate New Posts` if you want to mirror the provided wiring
- Configure:
- Operation: `Append`
- Spreadsheet: your spreadsheet
- Sheet: `posts`
- Map the columns explicitly to the fields listed in step 5
20. **Loop batch execution**
- Connect `Append New Posts to Sheet` back to ` Split Subreddits Into Batches`.
- This continues processing the next subreddit.
21. **Add a Code node named `Build Weekly Digest`**
- Connect it to the second output of ` Split Subreddits Into Batches`, which runs when batching completes.
- Implement logic that:
- reads `global.newPosts`
- counts total new posts
- creates a subreddit summary
- tokenizes title + post text
- excludes common stopwords
- derives top keywords and simple topic clusters
- sorts top posts by score, then comment count
- creates:
- `top_topics_json`
- `top_posts_json`
- `weekly_brief_text`
- `created_at`
- `run_id`
- `week_range`
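One way to approach the digest logic in step 21 is sketched below. The stopword list, the token pattern, the cut-offs (10 keywords, 5 posts), and the wording of `weekly_brief_text` are assumptions, and "topic clusters" are reduced here to a count of posts mentioning each keyword. The original node may cluster differently.

```javascript
// Small illustrative stopword list; a real one would be longer.
const STOPWORDS = new Set(['the', 'and', 'for', 'with', 'that', 'this', 'from', 'you', 'are', 'how']);

function buildDigest(newPosts, runId, weekRangeText, now = new Date()) {
  const perSubreddit = {};
  const keywordCounts = new Map();
  for (const p of newPosts) {
    perSubreddit[p.subreddit] = (perSubreddit[p.subreddit] || 0) + 1;
    const text = `${p.post_title || ''} ${p.post_text || ''}`.toLowerCase();
    const words = text.match(/[a-z][a-z0-9+#-]{2,}/g) || [];
    // Count each keyword once per post so long posts do not dominate.
    for (const w of new Set(words)) {
      if (!STOPWORDS.has(w)) keywordCounts.set(w, (keywordCounts.get(w) || 0) + 1);
    }
  }
  const topTopics = [...keywordCounts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, 10)
    .map(([keyword, posts]) => ({ keyword, posts }));
  const num = (v) => Number(v) || 0;
  const topPosts = [...newPosts]
    .sort((a, b) => num(b.score) - num(a.score) || num(b.num_comments) - num(a.num_comments))
    .slice(0, 5)
    .map((p) => ({ title: p.post_title, url: p.post_url, subreddit: p.subreddit,
      score: num(p.score), num_comments: num(p.num_comments) }));
  const brief = [
    `Weekly digest for ${weekRangeText}`,
    `Total new posts: ${newPosts.length}`,
    `Top keywords: ${topTopics.map((t) => t.keyword).join(', ')}`,
    ...topPosts.map((p, i) => `${i + 1}. [r/${p.subreddit}] ${p.title} (${p.score} points) ${p.url}`),
  ].join('\n');
  return {
    run_id: runId,
    week_range: weekRangeText,
    subreddits: Object.keys(perSubreddit).join(', '),
    total_posts: newPosts.length,
    top_topics_json: JSON.stringify(topTopics),
    top_posts_json: JSON.stringify(topPosts),
    weekly_brief_text: brief,
    created_at: now.toISOString(),
  };
}

// Hand-made sample posts, not real data.
const digest = buildDigest(
  [
    { subreddit: 'devops', post_title: 'Kubernetes upgrade tips', score: 50, num_comments: 10, post_url: 'u1' },
    { subreddit: 'selfhosted', post_title: 'Docker and Kubernetes at home', score: 80, num_comments: 5, post_url: 'u2' },
    { subreddit: 'devops', post_title: 'Docker networking', score: 80, num_comments: 9, post_url: 'u3' },
  ],
  'run_1',
  '2026-03-09 to 2026-03-15'
);
console.log(digest.weekly_brief_text);
```

In the Code node, `newPosts` would be read from `$getWorkflowStaticData('global').newPosts`, and the function's result would be returned as a single item so that the digest row and the email are produced once per run.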
22. **Add a Google Sheets node named `Append Weekly Digest to Sheet`**
- Connect it after `Build Weekly Digest`.
- Configure:
- Operation: `Append`
- Sheet: `weekly_digest`
- Explicitly map:
- `run_id`
- `created_at`
- `subreddits`
- `week_range`
- `total_posts`
- `top_posts_json`
- `top_topics_json`
- `weekly_brief_text`
23. **Add an Email Send node named `Send Weekly Digest Email`**
- Connect it after `Append Weekly Digest to Sheet`.
- Configure:
- To: your recipient address
- From: a valid sender address
- Subject: `Weekly Developer Tools Digest (Reddit) {{$json.week_range}}`
- Text body: `{{$json.weekly_brief_text}}`
- Enable `Execute Once`.
24. **Configure email credentials**
- Depending on your n8n environment, configure SMTP or the supported email transport.
- Replace placeholder addresses.
25. **Test with manual execution**
- Run the workflow manually.
- Verify:
- subreddit pages are fetched
- posts are parsed
- per-post JSON is readable
- `posts` tab receives rows
- `weekly_digest` tab receives one digest row
- email sends correctly if enabled
26. **Validate edge conditions**
- Test with:
- a nonexistent subreddit
- an empty `posts` tab
- a repeated run within the same week
- one or more link/image posts with empty `selftext`
27. **Recommended hardening improvements**
- Add a filter before `Append New Posts to Sheet` so only `is_new = true` rows are appended
- Replace merge-by-position with a safer key-based join where practical
- Add error handling for blocked HTML, bad JSON, and credential failures
- Move the hardcoded subreddit list and spreadsheet ID into environment variables or workflow variables
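As an illustration of the key-based join recommended above, the sketch below joins listing metadata and extracted details on `post_id` instead of relying on item order. The function name `joinByPostId` is hypothetical, and it assumes both sides carry a matching `post_id`, which holds only when the detail fetch succeeded for that post.

```javascript
// Join listing metadata with extracted details on post_id rather than by position.
function joinByPostId(metadata, details) {
  const byId = new Map();
  for (const d of details) {
    if (d && d.post_id) byId.set(d.post_id, d);
  }
  return metadata.map((m) => {
    const d = byId.get(m.post_id);
    return {
      ...m,
      // Missing details leave the post intact with empty enrichment fields.
      post_text_extracted: d ? d.post_text_extracted || '' : '',
      post_type: d ? d.post_type || '' : '',
    };
  });
}

// The details list is shorter and out of order, as happens when one fetch fails.
const joined = joinByPostId(
  [{ post_id: 'a', post_title: 'A' }, { post_id: 'b', post_title: 'B' }, { post_id: 'c', post_title: 'C' }],
  [
    { post_id: 'c', post_text_extracted: 'text for c', post_type: 'self' },
    { post_id: 'a', post_text_extracted: 'text for a', post_type: 'self' },
  ]
);
console.log(joined.map((p) => `${p.post_id}:${p.post_text_extracted}`));
```

With merge-by-position, the same input would attach the text of post `c` to post `a`. The keyed join avoids that misalignment at the cost of one extra Code node.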
---
# 5. General Notes & Resources
| Note Content | Context or Link |
|---|---|
| Register for a free ScrapeOps API key | https://scrapeops.io/app/register/n8n |
| ScrapeOps n8n documentation | https://scrapeops.io/docs/n8n/overview/ |
| ScrapeOps Proxy API documentation | https://scrapeops.io/docs/n8n/proxy-api/ |
| Duplicate the sample Google Sheet template | https://docs.google.com/spreadsheets/d/1rKuVREV4pedie7uAbuEcLvghNrAEeAjIPUBE6cQPleI/edit?usp=sharing |
| Customization note: add or remove subreddits in the configuration Code node | Workflow setup note |
| Customization note: change timeframe from `week` to `month` in the listing fetch URL | Workflow setup note |
| Customization note: add a Slack node to send the digest to a channel | Workflow setup note |
## Additional implementation observations
- The workflow has a single entry point: `Weekly Schedule Trigger`.
- There are no sub-workflows or workflow-execution nodes in this workflow.
- The current implementation does **not fully enforce** “append only new posts” because `Deduplicate New Posts` returns all items and `Append New Posts to Sheet` receives them directly.
- The digest is based only on posts collected during the current run and stored in `global.newPosts`, not on all posts in the spreadsheet.
- The workflow depends on `old.reddit.com` HTML structure; if Reddit changes markup, the parser will need updates.