creation

2026-04-28 00:29:22 +00:00 · 2025-11-13 14:08:43 +01:00
parent 9c93aacd8c
commit d5edcdd95c
1 changed files with 288 additions and 0 deletions
@@ -0,0 +1,288 @@
+Analyse papers from Hugging Face with AI and store them in Notion
+
+https://n8nworkflows.xyz/workflows/analyse-papers-from-hugging-face-with-ai-and-store-them-in-notion-2804
+
+
+# Analyse papers from Hugging Face with AI and store them in Notion
+
+### 1. Workflow Overview
+
+This workflow automates the retrieval, analysis, and storage of academic paper summaries from Hugging Face into Notion. It is designed to run automatically on weekdays at 8 AM, fetching the latest papers, analyzing their abstracts with OpenAI’s GPT model, and saving structured insights into a Notion database. The workflow is logically divided into the following blocks:
+
+- **1.1 Scheduled Trigger:** Initiates the workflow automatically on weekdays at 8 AM.
+- **1.2 Fetching Paper Data:** Retrieves the list of recent papers from Hugging Face using their API.
+- **1.3 Extracting Paper URLs:** Parses the HTML response to extract individual paper URLs.
+- **1.4 Looping Over Papers:** Processes each paper URL one by one.
+- **1.5 Duplicate Check in Notion:** Checks if the paper already exists in the Notion database to avoid duplicates.
+- **1.6 Fetching Paper Details:** For new papers, fetches detailed HTML content of the paper page.
+- **1.7 Extracting Paper Abstract and Title:** Extracts the abstract and title from the paper’s HTML.
+- **1.8 OpenAI Content Analysis:** Sends the abstract to OpenAI GPT-4o for detailed analysis and structured extraction.
+- **1.9 Storing Results in Notion:** Saves the analyzed data and metadata into the Notion database.
+
+---
+
+### 2. Block-by-Block Analysis
+
+#### 2.1 Scheduled Trigger
+
+- **Overview:** Automatically triggers the workflow at 8 AM on weekdays (Monday to Friday).
+- **Nodes Involved:** `Schedule Trigger`
+- **Node Details:**
+  - Type: Schedule Trigger (Cron-like)
+  - Configuration: Runs weekly on days 1-5 (Mon-Fri) at hour 8.
+  - Inputs: None (trigger node)
+  - Outputs: Initiates the HTTP request to fetch Hugging Face papers.
+  - Edge Cases: Misconfiguration could cause missed runs or runs on weekends.
+  - Version: 1.2
+
+#### 2.2 Fetching Paper Data
+
+- **Overview:** Sends an HTTP GET request to Hugging Face’s papers endpoint, querying papers from the previous day.
+- **Nodes Involved:** `Request Hugging Face Paper`
+- **Node Details:**
+  - Type: HTTP Request
+  - Configuration: URL `https://huggingface.co/papers` with query parameter `date` set dynamically to yesterday’s date in `yyyy-MM-dd` format.
+  - Inputs: Trigger from Schedule Trigger
+  - Outputs: Raw HTML response containing paper listings.
+  - Edge Cases: API downtime, rate limiting, or changes in Hugging Face’s API/HTML structure.
+  - Version: 4.2
+
+#### 2.3 Extracting Paper URLs
+
+- **Overview:** Parses the HTML response to extract URLs of individual papers using CSS selectors.
+- **Nodes Involved:** `Extract Hugging Face Paper`
+- **Node Details:**
+  - Type: HTML Extract
+  - Configuration: Extracts `href` attributes from elements matching `.line-clamp-3` CSS selector, returning an array of URLs.
+  - Inputs: Output of HTTP Request node
+  - Outputs: List of paper URLs for further processing.
+  - Edge Cases: Changes in HTML structure or CSS classes could break extraction.
+  - Version: 1.2
+
+#### 2.4 Splitting and Looping Over Papers
+
+- **Overview:** Splits the list of paper URLs into individual items and processes them sequentially.
+- **Nodes Involved:** `Split Out`, `Loop Over Items`
+- **Node Details:**
+  - `Split Out`:
+    - Type: Split Out
+    - Configuration: Splits out the `url` field from the extracted URLs.
+    - Inputs: Extracted URLs
+    - Outputs: Individual paper URLs
+    - Version: 1
+  - `Loop Over Items`:
+    - Type: Split In Batches
+    - Configuration: Processes one paper URL at a time without resetting.
+    - Inputs: Output from Split Out
+    - Outputs: Single paper URL per iteration
+    - Edge Cases: Large number of papers could slow processing.
+    - Version: 3
+
+#### 2.5 Duplicate Check in Notion
+
+- **Overview:** Checks if the current paper URL already exists in the Notion database to avoid duplicate entries.
+- **Nodes Involved:** `Check Paper URL Existed`, `If`
+- **Node Details:**
+  - `Check Paper URL Existed`:
+    - Type: Notion (Get All Pages)
+    - Configuration: Queries Notion database filtering pages where the `URL` property equals the full Hugging Face paper URL.
+    - Inputs: Single paper URL from Loop Over Items
+    - Outputs: List of matching pages (empty if none)
+    - Credentials: Notion API token required
+    - Edge Cases: Notion API rate limits, incorrect database ID or property keys.
+    - Version: 2.2
+  - `If`:
+    - Type: Conditional
+    - Configuration: Checks if the output from Notion query is empty (paper not found).
+    - Inputs: Output from Notion check
+    - Outputs: If true (paper new) proceeds to fetch details; if false loops to next paper.
+    - Edge Cases: Expression errors if data format changes.
+    - Version: 2.2
+
+#### 2.6 Fetching Paper Details
+
+- **Overview:** For new papers, fetches the detailed HTML page of the paper from Hugging Face.
+- **Nodes Involved:** `Request Hugging Face Paper Detail`
+- **Node Details:**
+  - Type: HTTP Request
+  - Configuration: URL dynamically constructed by concatenating `https://huggingface.co` with the paper URL.
+  - Inputs: Paper URL from If node (new papers only)
+  - Outputs: HTML content of the paper detail page.
+  - Edge Cases: Network errors, page not found, or HTML structure changes.
+  - Version: 4.2
+
+#### 2.7 Extracting Paper Abstract and Title
+
+- **Overview:** Extracts the abstract and title text from the paper’s HTML detail page.
+- **Nodes Involved:** `Extract Hugging Face Paper Abstract`
+- **Node Details:**
+  - Type: HTML Extract
+  - Configuration: Extracts text content from `.text-gray-700` (abstract) and `.text-2xl` (title) CSS selectors.
+  - Inputs: HTML content from paper detail request
+  - Outputs: JSON with `abstract` and `title` fields.
+  - Edge Cases: Changes in CSS selectors or page layout.
+  - Version: 1.2
+
+#### 2.8 OpenAI Content Analysis
+
+- **Overview:** Sends the extracted abstract to OpenAI GPT-4o for detailed structured analysis, extracting key insights in JSON format.
+- **Nodes Involved:** `OpenAI Analysis Abstract`
+- **Node Details:**
+  - Type: OpenAI (Langchain node)
+  - Configuration:
+    - Model: GPT-4o-2024-11-20
+    - System prompt instructs the model to extract:
+      - Core Introduction
+      - Keywords (2-5)
+      - Key Data and Results
+      - Technical Details
+      - Classification (academic categories)
+    - User prompt: The paper abstract text.
+    - Output: JSON with structured fields.
+  - Inputs: Abstract text from HTML extraction
+  - Outputs: JSON with analyzed content
+  - Credentials: OpenAI API key required
+  - Edge Cases: API rate limits, prompt failures, or unexpected output format.
+  - Version: 1.8
+
+#### 2.9 Storing Results in Notion
+
+- **Overview:** Saves the analyzed paper data and metadata into a Notion database for future reference.
+- **Nodes Involved:** `Store Abstract Notion`
+- **Node Details:**
+  - Type: Notion (Create Database Page)
+  - Configuration:
+    - Database ID: Specified Notion database for Hugging Face abstracts.
+    - Properties mapped:
+      - URL: Full Hugging Face paper URL
+      - Title: Extracted paper title
+      - Abstract: Abstract text truncated to 2000 characters
+      - Scrap Date: Current date (no time)
+      - Classification, Technical Details, Data and Results, Keywords, Core Introduction: From OpenAI JSON output
+  - Inputs: Output from OpenAI analysis node
+  - Credentials: Notion API token required
+  - Edge Cases: API limits, property mismatches, data truncation issues.
+  - Version: 2.2
+
+---
+
+### 3. Summary Table
+
+| Node Name                   | Node Type                     | Functional Role                      | Input Node(s)                   | Output Node(s)                   | Sticky Note                                                                                         |
+|-----------------------------|-------------------------------|------------------------------------|---------------------------------|---------------------------------|---------------------------------------------------------------------------------------------------|
+| Schedule Trigger            | Schedule Trigger               | Initiates workflow on weekdays 8AM | None                            | Request Hugging Face Paper       |                                                                                                   |
+| Request Hugging Face Paper  | HTTP Request                  | Fetches paper list HTML from API  | Schedule Trigger                | Extract Hugging Face Paper       |                                                                                                   |
+| Extract Hugging Face Paper  | HTML Extract                  | Extracts paper URLs from HTML      | Request Hugging Face Paper      | Split Out                       |                                                                                                   |
+| Split Out                  | Split Out                     | Splits URLs into individual items  | Extract Hugging Face Paper      | Loop Over Items                 |                                                                                                   |
+| Loop Over Items            | Split In Batches              | Processes each paper URL sequentially | Split Out                      | Check Paper URL Existed          |                                                                                                   |
+| Check Paper URL Existed    | Notion (Get All Pages)        | Checks if paper URL exists in Notion | Loop Over Items                | If                             |                                                                                                   |
+| If                         | If                           | Conditional: proceed if paper new | Check Paper URL Existed         | Request Hugging Face Paper Detail (true), Loop Over Items (false) |                                                                                                   |
+| Request Hugging Face Paper Detail | HTTP Request              | Fetches detailed paper HTML page  | If (true branch)                | Extract Hugging Face Paper Abstract |                                                                                                   |
+| Extract Hugging Face Paper Abstract | HTML Extract              | Extracts abstract and title from HTML | Request Hugging Face Paper Detail | OpenAI Analysis Abstract        |                                                                                                   |
+| OpenAI Analysis Abstract   | OpenAI (Langchain)            | Analyzes abstract with GPT-4o      | Extract Hugging Face Paper Abstract | Store Abstract Notion           |                                                                                                   |
+| Store Abstract Notion      | Notion (Create Database Page) | Stores analyzed data in Notion     | OpenAI Analysis Abstract        | Loop Over Items                 |                                                                                                   |
+
+---
+
+### 4. Reproducing the Workflow from Scratch
+
+1. **Create a Schedule Trigger node:**
+   - Type: Schedule Trigger
+   - Set to run weekly on Monday to Friday at 8 AM.
+   - No credentials needed.
+
+2. **Add an HTTP Request node named "Request Hugging Face Paper":**
+   - URL: `https://huggingface.co/papers`
+   - Method: GET
+   - Query Parameter: `date` = `={{ $now.minus(1,'days').format('yyyy-MM-dd') }}`
+   - Connect Schedule Trigger output to this node.
+
+3. **Add an HTML Extract node named "Extract Hugging Face Paper":**
+   - Operation: Extract HTML content
+   - Extraction Values:
+     - Key: `url`
+     - CSS Selector: `.line-clamp-3`
+     - Attribute: `href`
+     - Return Array: true
+   - Connect "Request Hugging Face Paper" output to this node.
+
+4. **Add a Split Out node named "Split Out":**
+   - Field to split out: `url`
+   - Connect "Extract Hugging Face Paper" output to this node.
+
+5. **Add a Split In Batches node named "Loop Over Items":**
+   - Default batch size (process one item at a time)
+   - Connect "Split Out" output to this node.
+
+6. **Add a Notion node named "Check Paper URL Existed":**
+   - Resource: Database Page
+   - Operation: Get All
+   - Database ID: Your Notion database ID for storing papers
+   - Filter: Property `URL` equals `={{ 'https://huggingface.co'+$json.url }}`
+   - Credentials: Configure Notion API credentials
+   - Connect "Loop Over Items" output to this node.
+
+7. **Add an If node named "If":**
+   - Condition: Check if the output from "Check Paper URL Existed" is empty (no existing page)
+   - Expression: `={{ $json.length === 0 }}`
+   - Connect "Check Paper URL Existed" output to this node.
+
+8. **Add an HTTP Request node named "Request Hugging Face Paper Detail":**
+   - URL: `={{ 'https://huggingface.co'+$('Split Out').item.json.url }}`
+   - Method: GET
+   - Connect "If" node’s true output to this node.
+
+9. **Add an HTML Extract node named "Extract Hugging Face Paper Abstract":**
+   - Operation: Extract HTML content
+   - Extraction Values:
+     - Key: `abstract`, CSS Selector: `.text-gray-700`
+     - Key: `title`, CSS Selector: `.text-2xl`
+   - Connect "Request Hugging Face Paper Detail" output to this node.
+
+10. **Add an OpenAI node named "OpenAI Analysis Abstract":**
+    - Model: GPT-4o-2024-11-20 (or latest GPT-4 variant)
+    - System Prompt: Provide instructions to extract Core Introduction, Keywords, Data and Results, Technical Details, Classification in JSON format.
+    - User Prompt: `={{ $json.abstract }}`
+    - Enable JSON output parsing.
+    - Credentials: Configure OpenAI API credentials.
+    - Connect "Extract Hugging Face Paper Abstract" output to this node.
+
+11. **Add a Notion node named "Store Abstract Notion":**
+    - Resource: Database Page
+    - Operation: Create
+    - Database ID: Same as in step 6
+    - Map properties:
+      - URL: `={{ 'https://huggingface.co'+$('Split Out').item.json.url }}`
+      - Title: `={{ $('Extract Hugging Face Paper Abstract').item.json.title }}`
+      - Abstract: `={{ $('Extract Hugging Face Paper Abstract').item.json.abstract.substring(0,2000) }}`
+      - Scrap Date: `={{ $today.format('yyyy-MM-dd') }}`
+      - Classification: `={{ $json.message.content.Classification.join(',') }}`
+      - Technical Details: `={{ $json.message.content.Technical_Details }}`
+      - Data and Results: `={{ $json.message.content.Data_and_Results }}`
+      - Keywords: `={{ $json.message.content.Keywords.join(',') }}`
+      - Core Introduction: `={{ $json.message.content.Core_Introduction }}`
+    - Credentials: Notion API credentials
+    - Connect "OpenAI Analysis Abstract" output to this node.
+
+12. **Connect the false output of the "If" node back to "Loop Over Items" to continue processing remaining papers.**
+
+13. **Ensure all credentials (OpenAI, Notion) are properly set up in n8n.**
+
+14. **Activate the workflow.**
+
+---
+
+### 5. General Notes & Resources
+
+| Note Content                                                                                                   | Context or Link                                                                                   |
+|---------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
+| The workflow uses GPT-4o-2024-11-20, a GPT-4 variant optimized for academic and technical content analysis.    | OpenAI model selection in the OpenAI node.                                                      |
+| Notion database must have properties matching the keys used in the workflow (URL, title, abstract, classification, etc.) | Notion database schema setup.                                                                    |
+| Hugging Face papers page HTML structure is critical for CSS selectors `.line-clamp-3`, `.text-gray-700`, `.text-2xl` | Changes in Hugging Face website may require updates to HTML extraction nodes.                    |
+| Scheduled trigger uses n8n’s built-in cron-like scheduling for weekday runs at 8 AM.                            | n8n Schedule Trigger documentation: https://docs.n8n.io/nodes/n8n-nodes-base.scheduleTrigger/    |
+| OpenAI prompt is designed to output JSON for easy parsing and storage.                                         | Prompt engineering is key to consistent output format.                                          |
+| Image illustrating workflow result is referenced as `huggingface.png` (fileId:918)                             | Visual asset for documentation or UI display.                                                    |
+
+---
+
+This document fully describes the workflow structure, node configurations, and setup instructions to enable reproduction, modification, and troubleshooting by advanced users or automation agents.