Automate Scraping Y Combinator Startups with Apify & Google Sheets
1. Workflow Overview
This workflow automates the scraping of startup company data from the Y Combinator directory using Apify actors and logs the extracted information into a Google Sheet. It is designed for users seeking to collect, update, and maintain structured data about startups filtered by specific criteria such as industry, location, and company size.
The workflow is logically divided into the following blocks:
- 1.1 Prerequisites Setup: Preparation and configuration necessary before initiating the scraping process.
- 1.2 Manual Trigger: User-controlled start point to invoke the scraping workflow on demand.
- 1.3 Scraping Data with Apify Actor: Execution of an Apify actor to scrape company data based on specified filters.
- 1.4 Retrieving Scraped Dataset: Fetching the structured dataset generated by the Apify actor.
- 1.5 Logging Data into Google Sheets: Adding or updating rows in a Google Sheet to store the retrieved company data.
2. Block-by-Block Analysis
2.1 Prerequisites Setup
Overview:
This preparatory block defines the required configurations to ensure smooth execution of the workflow, including integration credentials and Google Sheet setup. It clarifies necessary parameters before running the scraping process.
Nodes Involved:
- Sticky Note4
Node Details:
- Sticky Note4
- Type: Sticky Note (informational)
- Role: Outlines prerequisites and setup instructions for Apify and Google Sheets integrations.
- Content Highlights:
- Apify account connection and selecting the correct actor ("Y Combinator Directory Scraper").
- Setting the Y Combinator filtered search URL and the `maxCompanies` parameter.
- Google Sheets OAuth2 authentication with proper scopes enabled.
- Ensuring the target Google Sheet has predefined column headers (case-sensitive).
- Input/Output: None (informational node)
- Edge Cases: Misconfiguration of credentials or missing Google Sheet columns may cause workflow failure downstream.
2.2 Manual Trigger
Overview:
Allows the user to manually start the workflow whenever desired, providing full control over data scraping timing.
Nodes Involved:
- Start Workflow
- Sticky Note (4eea9bab-911c-4480-9073-831b8ac46571)
Node Details:
- Start Workflow
- Type: Manual Trigger
- Role: Initiates the workflow execution on user command.
- Configuration: No parameters; activated manually in n8n UI.
- Input: None
- Output: Triggers the next node ("Run an Actor").
- Edge Cases: If not triggered, no scraping occurs.
- Sticky Note (4eea9bab-911c-4480-9073-831b8ac46571)
- Type: Sticky Note
- Role: Explains the purpose of the manual trigger node.
- Content: Emphasizes manual start for controlled scraping.
- Input/Output: None
2.3 Scraping Data with Apify Actor
Overview:
Executes an Apify actor to scrape startup data from the Y Combinator directory filtered by user-defined criteria.
Nodes Involved:
- Run an Actor
- Sticky Note1
Node Details:
- Run an Actor
- Type: Apify Node (Actor execution)
- Role: Runs the "Y Combinator Directory Scraper" Apify actor to scrape company data.
- Configuration:
- Actor ID: `XXsXDaNQLjoF4lgmU` (Y Combinator Directory Scraper)
- Input JSON includes:
  - `maxCompanies`: 5 (limits companies scraped per run)
  - `startUrls`: Y Combinator companies search URL with filters (industry=Fintech, regions=America/Canada, team_size=[1,25])
  - Proxy: Uses Apify Proxy for reliable scraping
- Credentials: Apify API key configured
- Input: Triggered by "Start Workflow" node
- Output: Returns dataset ID and other metadata used for fetching scraped data
- Edge Cases:
- Network errors or proxy failures may cause scraping to fail.
- Invalid actor ID or API credentials will cause authentication errors.
- The `maxCompanies` parameter limits results; increase it for larger datasets.
- Sticky Note1
- Type: Sticky Note
- Role: Describes the role of the Apify actor node and how to configure the search URL and filters.
- Content: Details on applying filters to the Y Combinator URL and scraping structured company data.
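The actor's `startUrls` entry is simply the YC directory URL with filter query parameters appended. As a sanity check when changing filters, the URL can be assembled programmatically; a minimal Python sketch (the `yc_search_url` helper is illustrative, not part of the workflow itself):

```python
from urllib.parse import quote, urlencode

def yc_search_url(industry=None, regions=None, team_size=None):
    """Build a filtered Y Combinator directory URL for the actor's startUrls input."""
    params = []
    if industry:
        params.append(("industry", industry))
    if regions:
        params.append(("regions", regions))
    if team_size:
        lo, hi = team_size
        # The directory encodes team size as a JSON-style string range, e.g. ["1","25"]
        params.append(("team_size", f'["{lo}","{hi}"]'))
    # quote (not the default quote_plus) so spaces become %20, matching this workflow's URL
    return "https://www.ycombinator.com/companies?" + urlencode(params, quote_via=quote)

actor_input = {
    "maxCompanies": 5,  # raise this to scrape more companies per run
    "startUrls": [yc_search_url("Fintech", "America / Canada", ("1", "25"))],
    "proxyConfiguration": {"useApifyProxy": True},
}
```

With the filters shown, this reproduces the exact `startUrls` value used in the actor configuration below.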
2.4 Retrieving Scraped Dataset
Overview:
Fetches the dataset items generated by the Apify actor, providing detailed structured data for further processing.
Nodes Involved:
- Get dataset items
- Sticky Note2
Node Details:
- Get dataset items
- Type: Apify Node (Dataset retrieval)
- Role: Retrieves the scraped company data using the dataset ID returned by the actor node.
- Configuration:
- Resource: Datasets
- Dataset ID: Dynamically assigned from the previous node's output (`{{ $json.defaultDatasetId }}`)
- Credentials: Uses same Apify API credentials as "Run an Actor"
- Input: From "Run an Actor" node
- Output: Outputs array of company data objects (name, description, website, location, sector, etc.)
- Edge Cases:
- If dataset ID is missing or invalid, fetch will fail.
- API rate limits or connectivity issues may cause errors.
- Sticky Note2
- Type: Sticky Note
- Role: Explains the purpose of fetching dataset items and the expected output structure.
- Content: Highlights the details retrieved and preparation for Google Sheets logging.
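Under the hood, this node reads from Apify's dataset-items REST endpoint. A hedged sketch of fetching the same data outside n8n with only the standard library (the endpoint path follows Apify's public API v2; the `APIFY_TOKEN` environment variable and `fetch_items` helper are assumptions for this example):

```python
import json
import os
import urllib.request

API_BASE = "https://api.apify.com/v2"

def dataset_items_url(dataset_id: str, clean: bool = True) -> str:
    """Build the Apify dataset-items endpoint URL for a run's default dataset."""
    flag = "true" if clean else "false"
    return f"{API_BASE}/datasets/{dataset_id}/items?format=json&clean={flag}"

def fetch_items(dataset_id: str, token: str) -> list:
    """Fetch all dataset items as a list of dicts (company records in this workflow)."""
    url = dataset_items_url(dataset_id) + f"&token={token}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Only attempt a live call when a token is actually configured.
if os.environ.get("APIFY_TOKEN"):
    items = fetch_items("YOUR_DATASET_ID", os.environ["APIFY_TOKEN"])
```

The `defaultDatasetId` returned by the "Run an Actor" node is what would be passed as `dataset_id` here.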
2.5 Logging Data into Google Sheets
Overview:
Logs the retrieved company data into a specified Google Sheet, adding new rows or updating existing entries to maintain an up-to-date database.
Nodes Involved:
- Add data to Google Sheet
- Sticky Note3
Node Details:
- Add data to Google Sheet
- Type: Google Sheets Node (Add or Update Row)
- Role: Inserts or updates rows in a Google Sheet with scraped company details.
- Configuration:
- Operation: `appendOrUpdate` (adds new rows or updates matched rows)
- Document ID: Target Google Sheet ID (`1AEOYMIRNgxYN3gihT1bIrGswnkCzuWbFljX2ac4XjUU`)
- Sheet Name: `gid=0` (Sheet1)
- Matching Column: "Company" (to identify existing rows)
- Columns Mapped:
  - Company: `{{ $json.company_name }}`
  - Location: `{{ $json.company_location }}`
  - Website: `{{ $json.website }}`
  - LinkedIn: `{{ $json.company_linkedin }}`
  - Founded: `{{ $json.year_founded }}`
  - Description: `{{ $json.long_description }}`
  - Industry Tags: Concatenation of up to four tags (`tags/0` to `tags/3`)
  - Founder 1 Name and LinkedIn: `founders/0/name` and `founders/0/linkedin`
  - Founder 2 Name and LinkedIn: `founders/1/name` and `founders/1/linkedin`
- Credentials: Google Sheets OAuth2 credentials with required scopes
- Input: Dataset items from "Get dataset items" node
- Output: Confirmation of added/updated rows
- Edge Cases:
- Missing or incorrectly named columns in Google Sheet cause mapping failures.
- Authentication errors if Google OAuth token expires.
- Large datasets may hit rate limits.
- Empty or missing JSON fields may result in blank cells.
- Sticky Note3
- Type: Sticky Note
- Role: Details the Google Sheets node setup and column requirements.
- Content: Lists mandatory columns (case-sensitive) and explains the append or update mechanism.
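The column mapping above amounts to a simple projection from the actor's flattened output keys onto the sheet's headers. A Python sketch of the same mapping (field names follow the expressions listed above; empty strings stand in for missing values, matching the blank-cell edge case):

```python
def to_sheet_row(item: dict) -> dict:
    """Map one flattened dataset item to the sheet's column headers."""
    # Up to four industry tags are concatenated, skipping any that are absent.
    tags = [item.get(f"tags/{i}") for i in range(4)]
    return {
        "Company": item.get("company_name", ""),
        "Location": item.get("company_location", ""),
        "Website": item.get("website", ""),
        "LinkedIn": item.get("company_linkedin", ""),
        "Founded": item.get("year_founded", ""),
        "Description": item.get("long_description", ""),
        "Industry Tags": " ".join(t for t in tags if t),
        "Founder 1 Name": item.get("founders/0/name", ""),
        "Founder 1 LinkedIn": item.get("founders/0/linkedin", ""),
        "Founder 2 Name": item.get("founders/1/name", ""),
        "Founder 2 LinkedIn": item.get("founders/1/linkedin", ""),
    }
```

Because the "Company" value is also the matching column, an item whose `company_name` already appears in the sheet updates that row instead of appending a new one.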
3. Summary Table
| Node Name | Node Type | Functional Role | Input Node(s) | Output Node(s) | Sticky Note |
|---|---|---|---|---|---|
| Sticky Note4 | Sticky Note | Prerequisites and setup instructions | - | - | Describes Apify and Google Sheets prerequisites, credential setup, and sheet column requirements. |
| Start Workflow | Manual Trigger | Starts the workflow manually | - | Run an Actor | Explains manual triggering for controlled scraping. |
| Sticky Note | Sticky Note | Explains Manual Trigger node | - | - | Emphasizes manual start for controlled scraping. |
| Run an Actor | Apify Actor Node | Executes scraping actor on Y Combinator | Start Workflow | Get dataset items | Details actor configuration and input URL filters for scraping. |
| Sticky Note1 | Sticky Note | Explains Apify Actor usage | - | - | Describes setting filters and actor operation. |
| Get dataset items | Apify Dataset Retrieval Node | Fetches scraped data from Apify dataset | Run an Actor | Add data to Google Sheet | Explains fetching dataset items and output structure. |
| Sticky Note2 | Sticky Note | Explains dataset retrieval | - | - | Details dataset fetch and data for logging. |
| Add data to Google Sheet | Google Sheets (Add or Update) | Logs company data into Google Sheet | Get dataset items | - | Lists required sheet columns and operation details. |
| Sticky Note3 | Sticky Note | Explains Google Sheets node setup | - | - | Emphasizes column headers and append/update logic. |
4. Reproducing the Workflow from Scratch
- Create a new workflow in n8n.
- Add a Manual Trigger node:
  - Name: `Start Workflow`
  - No parameters needed.
  - This node acts as the workflow entry point.
- Add an Apify "Run an Actor" node:
  - Name: `Run an Actor`
  - Connect it from the `Start Workflow` node.
  - Configure:
    - Set Actor ID to `XXsXDaNQLjoF4lgmU` (Y Combinator Directory Scraper).
    - In the JSON body input, set:

      ```json
      {
        "maxCompanies": 5,
        "startUrls": ["https://www.ycombinator.com/companies?industry=Fintech&regions=America%20%2F%20Canada&team_size=%5B%221%22%2C%2225%22%5D"],
        "proxyConfiguration": { "useApifyProxy": true }
      }
      ```

  - Credentials: Set Apify API credentials (create or select existing).
  - Version: Use type version 1.
- Add an Apify "Get dataset items" node:
  - Name: `Get dataset items`
  - Connect output from the `Run an Actor` node.
  - Configure:
    - Resource: `Datasets`
    - Dataset ID: Set expression to `{{ $json["defaultDatasetId"] }}` (from previous node output).
  - Use the same Apify API credentials as above.
  - Version: Use type version 1.
- Add a Google Sheets node:
  - Name: `Add data to Google Sheet`
  - Connect output from `Get dataset items`.
  - Configure:
    - Operation: `appendOrUpdate`
    - Document ID: `1AEOYMIRNgxYN3gihT1bIrGswnkCzuWbFljX2ac4XjUU` (replace with your sheet ID)
    - Sheet Name: `gid=0` (Sheet1)
    - Matching Columns: `Company` (to update existing rows)
    - Columns mapping: Define columns as below with expressions:
      - Company: `{{ $json.company_name }}`
      - Location: `{{ $json.company_location }}`
      - Website: `{{ $json.website }}`
      - LinkedIn: `{{ $json.company_linkedin }}`
      - Founded: `{{ $json.year_founded }}`
      - Description: `{{ $json.long_description }}`
      - Industry Tags: Concatenate up to four tags, e.g., `{{ $json['tags/0'] }} {{ $json['tags/1'] }} {{ $json['tags/2'] }} {{ $json['tags/3'] }}`
      - Founder 1 Name: `{{ $json['founders/0/name'] }}`
      - Founder 1 LinkedIn: `{{ $json['founders/0/linkedin'] }}`
      - Founder 2 Name: `{{ $json['founders/1/name'] }}`
      - Founder 2 LinkedIn: `{{ $json['founders/1/linkedin'] }}`
  - Credentials: Add or use existing Google Sheets OAuth2 credentials with Google Sheets and Drive API scopes enabled.
  - Version: Use type version 4.7.
- Create the Google Sheet:
  - Before running, ensure the Google Sheet exists and has the following columns as exact headers (case-sensitive; note that "LinkedIn" is required by the column mapping above):
    - Company
    - Location
    - Website
    - LinkedIn
    - Founded
    - Description
    - Industry Tags
    - Founder 1 Name
    - Founder 1 LinkedIn
    - Founder 2 Name
    - Founder 2 LinkedIn
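Since header matching is exact and case-sensitive, it can help to verify the sheet's first row before the first run. A small illustrative check (the header list mirrors the column mapping used by the Google Sheets node; actually reading the first row from the sheet is left to your Sheets client of choice):

```python
REQUIRED_HEADERS = [
    "Company", "Location", "Website", "LinkedIn", "Founded", "Description",
    "Industry Tags", "Founder 1 Name", "Founder 1 LinkedIn",
    "Founder 2 Name", "Founder 2 LinkedIn",
]

def missing_headers(first_row: list) -> list:
    """Return required headers absent from the sheet's first row (exact, case-sensitive)."""
    present = set(first_row)
    return [h for h in REQUIRED_HEADERS if h not in present]
```

An empty return value means every mapped column will land in the sheet; anything else will surface as a mapping failure in the Google Sheets node.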
- Validate credentials:
  - Confirm the Apify API key is valid and authorized.
  - Confirm the Google OAuth2 credentials are valid and authorized.
- Run the workflow:
  - Manually trigger the `Start Workflow` node.
  - The workflow will execute the actor, fetch the data, and populate the Google Sheet accordingly.
5. General Notes & Resources
| Note Content | Context or Link |
|---|---|
| The Apify actor used is "Y Combinator Directory Scraper" by fatihtahta, available at: https://console.apify.com/actors/XXsXDaNQLjoF4lgmU | Actor details and pricing information linked in the workflow nodes. |
| Ensure Google OAuth2 credentials have both Google Sheets and Google Drive scopes enabled to allow sheet access and modification. | Google Cloud Console OAuth2 configuration for n8n integrations. |
| Column headers in the Google Sheet are case-sensitive and must exactly match those defined in the workflow for mapping to work correctly. | Important for the Google Sheets node to correctly append or update data. |
| The use of Apify Proxy within the actor configuration helps avoid IP-based scraping blocks and improves reliability. | Proxy usage recommended for web scraping jobs. |
| The maximum number of companies scraped can be adjusted via the `maxCompanies` parameter in the actor's input JSON. | Controls data volume; values that are too high may impact execution time and API limits. |
Disclaimer:
The text provided is exclusively derived from an automated workflow created with n8n, an integration and automation tool. This processing strictly complies with applicable content policies and contains no illegal, offensive, or protected elements. All data handled is legal and publicly available.