Automate Scraping Y Combinator Startups with Apify & Google Sheets
1. Workflow Overview
This workflow automates the scraping of startup company data from the Y Combinator directory using Apify actors and logs the extracted information into a Google Sheet. It is designed for users seeking to collect, update, and maintain structured data about startups filtered by specific criteria such as industry, location, and company size.
The workflow is logically divided into the following blocks:
- 1.1 Prerequisites Setup: Preparation and configuration necessary before initiating the scraping process.
- 1.2 Manual Trigger: User-controlled start point to invoke the scraping workflow on demand.
- 1.3 Scraping Data with Apify Actor: Execution of an Apify actor to scrape company data based on specified filters.
- 1.4 Retrieving Scraped Dataset: Fetching the structured dataset generated by the Apify actor.
- 1.5 Logging Data into Google Sheets: Adding or updating rows in a Google Sheet to store the retrieved company data.
2. Block-by-Block Analysis
2.1 Prerequisites Setup
Overview:
This preparatory block defines the required configurations to ensure smooth execution of the workflow, including integration credentials and Google Sheet setup. It clarifies necessary parameters before running the scraping process.
Nodes Involved:
- Sticky Note4
Node Details:
- Sticky Note4
- Type: Sticky Note (informational)
- Role: Outlines prerequisites and setup instructions for Apify and Google Sheets integrations.
- Content Highlights:
- Apify account connection and selecting the correct actor ("Y Combinator Directory Scraper").
- Setting the Y Combinator filtered search URL and the `maxCompanies` parameter.
- Google Sheets OAuth2 authentication with proper scopes enabled.
- Ensuring the target Google Sheet has predefined column headers (case-sensitive).
- Input/Output: None (informational node)
- Edge Cases: Misconfiguration of credentials or missing Google Sheet columns may cause workflow failure downstream.
2.2 Manual Trigger
Overview:
Allows the user to manually start the workflow whenever desired, providing full control over data scraping timing.
Nodes Involved:
- Start Workflow
- Sticky Note (4eea9bab-911c-4480-9073-831b8ac46571)
Node Details:
- Start Workflow
- Type: Manual Trigger
- Role: Initiates the workflow execution on user command.
- Configuration: No parameters; activated manually in n8n UI.
- Input: None
- Output: Triggers the next node ("Run an Actor").
- Edge Cases: If not triggered, no scraping occurs.
- Sticky Note (4eea9bab-911c-4480-9073-831b8ac46571)
- Type: Sticky Note
- Role: Explains the purpose of the manual trigger node.
- Content: Emphasizes manual start for controlled scraping.
- Input/Output: None
2.3 Scraping Data with Apify Actor
Overview:
Executes an Apify actor to scrape startup data from the Y Combinator directory filtered by user-defined criteria.
Nodes Involved:
- Run an Actor
- Sticky Note1
Node Details:
- Run an Actor
- Type: Apify Node (Actor execution)
- Role: Runs the "Y Combinator Directory Scraper" Apify actor to scrape company data.
- Configuration:
- Actor ID: `XXsXDaNQLjoF4lgmU` (Y Combinator Directory Scraper)
- Input JSON includes:
  - `maxCompanies`: 5 (limits companies scraped per run)
  - `startUrls`: Y Combinator companies search URL with filters (industry=Fintech, regions=America/Canada, team_size=[1,25])
  - Proxy: Uses Apify Proxy for reliable scraping
- Credentials: Apify API key configured
- Input: Triggered by "Start Workflow" node
- Output: Returns dataset ID and other metadata used for fetching scraped data
- Edge Cases:
- Network errors or proxy failures may cause scraping to fail.
- Invalid actor ID or API credentials will cause authentication errors.
- The `maxCompanies` parameter limits results; increase it for larger datasets.
- Sticky Note1
- Type: Sticky Note
- Role: Describes the role of the Apify actor node and how to configure the search URL and filters.
- Content: Details on applying filters to the Y Combinator URL and scraping structured company data.
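The actor's `startUrls` entry is simply the YC directory URL with filter query parameters appended. As a sanity check when changing filters, the URL can be assembled programmatically; a minimal Python sketch (the `yc_search_url` helper is illustrative, not part of the workflow itself):

```python
from urllib.parse import quote, urlencode

def yc_search_url(industry=None, regions=None, team_size=None):
    """Build a filtered Y Combinator directory URL for the actor's startUrls input."""
    params = []
    if industry:
        params.append(("industry", industry))
    if regions:
        params.append(("regions", regions))
    if team_size:
        lo, hi = team_size
        # The directory encodes team size as a JSON-style string range, e.g. ["1","25"]
        params.append(("team_size", f'["{lo}","{hi}"]'))
    # quote (not the default quote_plus) so spaces become %20, matching this workflow's URL
    return "https://www.ycombinator.com/companies?" + urlencode(params, quote_via=quote)

actor_input = {
    "maxCompanies": 5,  # raise this to scrape more companies per run
    "startUrls": [yc_search_url("Fintech", "America / Canada", ("1", "25"))],
    "proxyConfiguration": {"useApifyProxy": True},
}
```

With the filters shown, this reproduces the exact `startUrls` value used in the actor configuration below.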
2.4 Retrieving Scraped Dataset
Overview:
Fetches the dataset items generated by the Apify actor, providing detailed structured data for further processing.
Nodes Involved:
- Get dataset items
- Sticky Note2
Node Details:
- Get dataset items
- Type: Apify Node (Dataset retrieval)
- Role: Retrieves the scraped company data using the dataset ID returned by the actor node.
- Configuration:
- Resource: Datasets
- Dataset ID: Dynamically assigned from the previous node's output (`{{ $json.defaultDatasetId }}`)
- Credentials: Uses same Apify API credentials as "Run an Actor"
- Input: From "Run an Actor" node
- Output: Outputs array of company data objects (name, description, website, location, sector, etc.)
- Edge Cases:
- If dataset ID is missing or invalid, fetch will fail.
- API rate limits or connectivity issues may cause errors.
- Sticky Note2
- Type: Sticky Note
- Role: Explains the purpose of fetching dataset items and the expected output structure.
- Content: Highlights the details retrieved and preparation for Google Sheets logging.
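Under the hood, this node reads from Apify's dataset-items REST endpoint. A hedged sketch of fetching the same data outside n8n with only the standard library (the endpoint path follows Apify's public API v2; the `APIFY_TOKEN` environment variable and `fetch_items` helper are assumptions for this example):

```python
import json
import os
import urllib.request

API_BASE = "https://api.apify.com/v2"

def dataset_items_url(dataset_id: str, clean: bool = True) -> str:
    """Build the Apify dataset-items endpoint URL for a run's default dataset."""
    flag = "true" if clean else "false"
    return f"{API_BASE}/datasets/{dataset_id}/items?format=json&clean={flag}"

def fetch_items(dataset_id: str, token: str) -> list:
    """Fetch all dataset items as a list of dicts (company records in this workflow)."""
    url = dataset_items_url(dataset_id) + f"&token={token}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Only attempt a live call when a token is actually configured.
if os.environ.get("APIFY_TOKEN"):
    items = fetch_items("YOUR_DATASET_ID", os.environ["APIFY_TOKEN"])
```

The `defaultDatasetId` returned by the "Run an Actor" node is what would be passed as `dataset_id` here.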
2.5 Logging Data into Google Sheets
Overview:
Logs the retrieved company data into a specified Google Sheet, adding new rows or updating existing entries to maintain an up-to-date database.
Nodes Involved:
- Add data to Google Sheet
- Sticky Note3
Node Details:
- Add data to Google Sheet
- Type: Google Sheets Node (Add or Update Row)
- Role: Inserts or updates rows in a Google Sheet with scraped company details.
- Configuration:
- Operation: `appendOrUpdate` (adds new rows or updates matched rows)
- Document ID: Target Google Sheet ID (`1AEOYMIRNgxYN3gihT1bIrGswnkCzuWbFljX2ac4XjUU`)
- Sheet Name: `gid=0` (Sheet1)
- Matching Column: "Company" (to identify existing rows)
- Columns Mapped:
  - Company: `{{ $json.company_name }}`
  - Location: `{{ $json.company_location }}`
  - Website: `{{ $json.website }}`
  - LinkedIn: `{{ $json.company_linkedin }}`
  - Founded: `{{ $json.year_founded }}`
  - Description: `{{ $json.long_description }}`
  - Industry Tags: Concatenation of up to four tags (`tags/0` to `tags/3`)
  - Founder 1 Name and LinkedIn: `founders/0/name` and `founders/0/linkedin`
  - Founder 2 Name and LinkedIn: `founders/1/name` and `founders/1/linkedin`
- Credentials: Google Sheets OAuth2 credentials with required scopes
- Input: Dataset items from "Get dataset items" node
- Output: Confirmation of added/updated rows
- Edge Cases:
- Missing or incorrectly named columns in Google Sheet cause mapping failures.
- Authentication errors if Google OAuth token expires.
- Large datasets may hit rate limits.
- Empty or missing JSON fields may result in blank cells.
- Sticky Note3
- Type: Sticky Note
- Role: Details the Google Sheets node setup and column requirements.
- Content: Lists mandatory columns (case-sensitive) and explains the append or update mechanism.
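The column mapping above amounts to a simple projection from the actor's flattened output keys onto the sheet's headers. A Python sketch of the same mapping (field names follow the expressions listed above; empty strings stand in for missing values, matching the blank-cell edge case):

```python
def to_sheet_row(item: dict) -> dict:
    """Map one flattened dataset item to the sheet's column headers."""
    # Up to four industry tags are concatenated, skipping any that are absent.
    tags = [item.get(f"tags/{i}") for i in range(4)]
    return {
        "Company": item.get("company_name", ""),
        "Location": item.get("company_location", ""),
        "Website": item.get("website", ""),
        "LinkedIn": item.get("company_linkedin", ""),
        "Founded": item.get("year_founded", ""),
        "Description": item.get("long_description", ""),
        "Industry Tags": " ".join(t for t in tags if t),
        "Founder 1 Name": item.get("founders/0/name", ""),
        "Founder 1 LinkedIn": item.get("founders/0/linkedin", ""),
        "Founder 2 Name": item.get("founders/1/name", ""),
        "Founder 2 LinkedIn": item.get("founders/1/linkedin", ""),
    }
```

Because the "Company" value is also the matching column, an item whose `company_name` already appears in the sheet updates that row instead of appending a new one.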
3. Summary Table
| Node Name | Node Type | Functional Role | Input Node(s) | Output Node(s) | Sticky Note |
|---|---|---|---|---|---|
| Sticky Note4 | Sticky Note | Prerequisites and setup instructions | - | - | Describes Apify and Google Sheets prerequisites, credential setup, and sheet column requirements. |
| Start Workflow | Manual Trigger | Starts the workflow manually | - | Run an Actor | Explains manual triggering for controlled scraping. |
| Sticky Note | Sticky Note | Explains Manual Trigger node | - | - | Emphasizes manual start for controlled scraping. |
| Run an Actor | Apify Actor Node | Executes scraping actor on Y Combinator | Start Workflow | Get dataset items | Details actor configuration and input URL filters for scraping. |
| Sticky Note1 | Sticky Note | Explains Apify Actor usage | - | - | Describes setting filters and actor operation. |
| Get dataset items | Apify Dataset Retrieval Node | Fetches scraped data from Apify dataset | Run an Actor | Add data to Google Sheet | Explains fetching dataset items and output structure. |
| Sticky Note2 | Sticky Note | Explains dataset retrieval | - | - | Details dataset fetch and data for logging. |
| Add data to Google Sheet | Google Sheets (Add or Update) | Logs company data into Google Sheet | Get dataset items | - | Lists required sheet columns and operation details. |
| Sticky Note3 | Sticky Note | Explains Google Sheets node setup | - | - | Emphasizes column headers and append/update logic. |
4. Reproducing the Workflow from Scratch
- Create a new workflow in n8n.
- Add a Manual Trigger node:
  - Name: `Start Workflow`
  - No parameters needed.
  - This node acts as the workflow entry point.
- Add an Apify "Run an Actor" node:
  - Name: `Run an Actor`
  - Connect it from the `Start Workflow` node.
  - Configure:
    - Set Actor ID to `XXsXDaNQLjoF4lgmU` (Y Combinator Directory Scraper).
    - In the JSON body input, set:

      ```json
      {
        "maxCompanies": 5,
        "startUrls": ["https://www.ycombinator.com/companies?industry=Fintech&regions=America%20%2F%20Canada&team_size=%5B%221%22%2C%2225%22%5D"],
        "proxyConfiguration": { "useApifyProxy": true }
      }
      ```

  - Credentials: Set Apify API credentials (create or select existing).
  - Version: Use type version 1.
- Add an Apify "Get dataset items" node:
  - Name: `Get dataset items`
  - Connect output from the `Run an Actor` node.
  - Configure:
    - Resource: `Datasets`
    - Dataset ID: Set expression to `{{ $json["defaultDatasetId"] }}` (from previous node output).
  - Use the same Apify API credentials as above.
  - Version: Use type version 1.
- Add a Google Sheets node:
  - Name: `Add data to Google Sheet`
  - Connect output from `Get dataset items`.
  - Configure:
    - Operation: `appendOrUpdate`
    - Document ID: `1AEOYMIRNgxYN3gihT1bIrGswnkCzuWbFljX2ac4XjUU` (replace with your sheet ID)
    - Sheet Name: `gid=0` (Sheet1)
    - Matching Columns: `Company` (to update existing rows)
    - Columns mapping: Define columns as below with expressions:
      - Company: `{{ $json.company_name }}`
      - Location: `{{ $json.company_location }}`
      - Website: `{{ $json.website }}`
      - LinkedIn: `{{ $json.company_linkedin }}`
      - Founded: `{{ $json.year_founded }}`
      - Description: `{{ $json.long_description }}`
      - Industry Tags: Concatenate up to four tags, e.g., `{{ $json['tags/0'] }} {{ $json['tags/1'] }} {{ $json['tags/2'] }} {{ $json['tags/3'] }}`
      - Founder 1 Name: `{{ $json['founders/0/name'] }}`
      - Founder 1 LinkedIn: `{{ $json['founders/0/linkedin'] }}`
      - Founder 2 Name: `{{ $json['founders/1/name'] }}`
      - Founder 2 LinkedIn: `{{ $json['founders/1/linkedin'] }}`
  - Credentials: Add or use existing Google Sheets OAuth2 credentials with Google Sheets and Drive API scopes enabled.
  - Version: Use type version 4.7.
- Create the Google Sheet:
  - Before running, ensure the Google Sheet exists and has the following columns as exact headers (case-sensitive; note that "LinkedIn" is required by the column mapping above):
    - Company
    - Location
    - Website
    - LinkedIn
    - Founded
    - Description
    - Industry Tags
    - Founder 1 Name
    - Founder 1 LinkedIn
    - Founder 2 Name
    - Founder 2 LinkedIn
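Since header matching is exact and case-sensitive, it can help to verify the sheet's first row before the first run. A small illustrative check (the header list mirrors the column mapping used by the Google Sheets node; actually reading the first row from the sheet is left to your Sheets client of choice):

```python
REQUIRED_HEADERS = [
    "Company", "Location", "Website", "LinkedIn", "Founded", "Description",
    "Industry Tags", "Founder 1 Name", "Founder 1 LinkedIn",
    "Founder 2 Name", "Founder 2 LinkedIn",
]

def missing_headers(first_row: list) -> list:
    """Return required headers absent from the sheet's first row (exact, case-sensitive)."""
    present = set(first_row)
    return [h for h in REQUIRED_HEADERS if h not in present]
```

An empty return value means every mapped column will land in the sheet; anything else will surface as a mapping failure in the Google Sheets node.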
- Validate credentials:
  - Confirm the Apify API key is valid and authorized.
  - Confirm the Google OAuth2 credentials are valid and authorized.
- Run the workflow:
  - Manually trigger the `Start Workflow` node.
  - The workflow will execute the actor, fetch the data, and populate the Google Sheet accordingly.
5. General Notes & Resources
| Note Content | Context or Link |
|---|---|
| The Apify actor used is "Y Combinator Directory Scraper" by fatihtahta, available at: https://console.apify.com/actors/XXsXDaNQLjoF4lgmU | Actor details and pricing information linked in the workflow nodes. |
| Ensure Google OAuth2 credentials have both Google Sheets and Google Drive scopes enabled to allow sheet access and modification. | Google Cloud Console OAuth2 configuration for n8n integrations. |
| Column headers in the Google Sheet are case-sensitive and must exactly match those defined in the workflow for mapping to work correctly. | Important for the Google Sheets node to correctly append or update data. |
| The use of Apify Proxy within the actor configuration helps avoid IP-based scraping blocks and improves reliability. | Proxy usage recommended for web scraping jobs. |
| The maximum number of companies scraped can be adjusted via the `maxCompanies` parameter in the actor's input JSON. | Controls data volume; values that are too high may impact execution time and API limits. |
Disclaimer:
The text provided is exclusively derived from an automated workflow created with n8n, an integration and automation tool. This processing strictly complies with applicable content policies and contains no illegal, offensive, or protected elements. All data handled is legal and publicly available.