diff --git a/workflows/Parse CVs from Emails with OCR & GPT for Notion Database-6326/readme-6326.md b/workflows/Parse CVs from Emails with OCR & GPT for Notion Database-6326/readme-6326.md new file mode 100644 index 000000000..b2f6fa382 --- /dev/null +++ b/workflows/Parse CVs from Emails with OCR & GPT for Notion Database-6326/readme-6326.md @@ -0,0 +1,303 @@ +Parse CVs from Emails with OCR & GPT for Notion Database + +https://n8nworkflows.xyz/workflows/parse-cvs-from-emails-with-ocr---gpt-for-notion-database-6326 + + +# Parse CVs from Emails with OCR & GPT for Notion Database + +### 1. Workflow Overview + +This workflow automates the processing of CVs (resumes) received via email attachments, extracting candidate information using OCR and AI, and storing it in a Notion database. It is designed for HR teams or recruiters who want to streamline candidate data collection from emails. + +The workflow is logically divided into the following blocks: + +- **1.1 Email Input Reception** + Watches a Gmail inbox for incoming emails with attachments. + +- **1.2 PDF Attachment Filtering and OCR Processing** + Filters PDF attachments from emails and extracts their text content using the OCR.space API. + +- **1.3 AI-Powered CV Parsing** + Sends the extracted OCR text to OpenAIโ€™s GPT model to detect if the document is a CV and to extract structured candidate fields. + +- **1.4 Data Mapping and Duplicate Verification** + Maps extracted fields to the workflow's internal structure and queries the Notion database to check for existing entries by email to avoid duplicates. + +- **1.5 Notion Database Entry Creation** + Creates a new candidate record in Notion if no duplicate is found. + +--- + +### 2. Block-by-Block Analysis + +#### 2.1 Email Input Reception + +**Overview:** +This block triggers the workflow upon receiving a new email with attachments in a Gmail inbox. + +**Nodes Involved:** +- Gmail Trigger +- Code (Attachment Filter) + +**Node Details:** + +- **Gmail Trigger** + - Type: Trigger node for Gmail + - Configuration: Filters emails with attachments (`has:attachment`), downloads attachments, polls every minute. + - Input: None (trigger) + - Output: Email data including attachments in binary form. + - Edge Cases: OAuth2 token expiration, Gmail API rate limits. + +- **Code (Attachment Filter)** + - Type: Function (JavaScript) node + - Role: Filters email attachments to only PDF files. + - Key Logic: Iterates over binary attachments and outputs only those with mime type `application/pdf`. + - Input: Gmail Trigger output + - Output: One item per PDF attachment with binary data. + - Edge Cases: Emails without PDFs lead to empty output; code assumes `binary` data exists; malformed emails might cause errors. + +#### 2.2 PDF Attachment Filtering and OCR Processing + +**Overview:** +This block sends the PDF attachment binary data to OCR.space to extract textual content. + +**Nodes Involved:** +- HTTP Request (OCR API) +- If1 (HTTP status check) + +**Node Details:** + +- **HTTP Request** + - Type: HTTP Request node + - Role: Sends PDF binary data as multipart/form-data to OCR.space API for OCR text extraction. + - Configuration: + - URL: `https://api.ocr.space/parse/image` + - Method: POST + - Headers: API key dynamically read from variable `OCR_SPACE_API_KEY` + - Timeout: 40 seconds + - Input: PDF binary from Code node + - Output: OCR response including parsed text. + - Edge Cases: API timeouts, invalid API key, malformed PDF files, OCR errors. + +- **If1** + - Type: Conditional node + - Role: Checks HTTP response status code is between 200 and 299 (success). + - Input: HTTP Request output + - Output: Proceeds to AI parsing if success, terminates otherwise. + - Edge Cases: Handles non-2xx responses gracefully by skipping downstream. + +#### 2.3 AI-Powered CV Parsing + +**Overview:** +Sends OCR text content to a GPT model to classify if the document is a CV and extract structured candidate data fields. + +**Nodes Involved:** +- Message a model (OpenAI GPT node) +- If (CV check) + +**Node Details:** + +- **Message a model** + - Type: OpenAI GPT node (via LangChain integration) + - Role: Provides instructions and OCR text to GPT-4.1-NANO model to parse CV fields. + - Key prompt details: + - Check if document is a CV + - Extract fields: full_name, email, phone, location, job_title, years_experience, skills (array), education (array), languages, certifications, linkedin, availability + - Return JSON or `{ "is_cv": false }` if not a CV + - Input: OCR parsed text (`ParsedResults[0].ParsedText`) from HTTP Request + - Output: JSON with candidate data or `is_cv: false` + - Credentials: OpenAI API key + - Edge Cases: Model misinterpretation, rate limiting, incomplete data extraction. + +- **If** + - Type: Conditional node + - Role: Checks if the OpenAI response contains a non-null email field (indicating a CV was detected). + - Input: GPT output + - Output: Continues if CV detected, stops otherwise. + - Edge Cases: False negatives if email missing; expression errors if response malformed. + +#### 2.4 Data Mapping and Duplicate Verification + +**Overview:** +Maps GPT-extracted fields into a structured format and queries Notion to check for existing candidate entries by email. + +**Nodes Involved:** +- Map Fields (Set node) +- Get many entries (Notion query) +- If2 (Duplicate check) + +**Node Details:** + +- **Map Fields** + - Type: Set node + - Role: Assigns GPT-extracted fields to a structured JSON format under `message.content` namespace, e.g., `full_name`, `email`, `phone`, `location`, `years_experience`, `education` (array), `languages` (array), `linkedin`. + - Input: GPT output JSON + - Output: Mapped structured data for downstream use + - Edge Cases: Null or missing fields handled with defaults (`null` or empty arrays). + +- **Get many entries** + - Type: Notion node (database query) + - Role: Queries Notion "Candidates" database filtering by email to find duplicates. + - Input: Mapped data from Map Fields + - Output: List of matching entries (empty if none) + - Credentials: Notion API integration + - Edge Cases: Notion API errors, rate limits, incorrect database ID. + +- **If2** + - Type: Conditional node + - Role: Checks if No duplicates exist (result length = 0) to proceed with creation. + - Input: Notion query output + - Output: Creates new entry if no duplicates found, skips otherwise. + - Edge Cases: Duplicate detection logic relies on exact email match; partial duplicates may be missed. + +#### 2.5 Notion Database Entry Creation + +**Overview:** +Creates a new candidate record in the Notion database with the parsed and mapped candidate data. + +**Nodes Involved:** +- Create entry (Notion create database page) + +**Node Details:** + +- **Create entry** + - Type: Notion node (create database page) + - Role: Inserts a new page in the "Candidates" database with mapped candidate properties. + - Properties mapped: + - Name (title) + - Email (email) + - Phone (phone_number) + - Location (rich_text) + - Years of Experience (rich_text) + - Education (joined string from array) + - Languages (joined string) + - LinkedIn (url) + - Application Date (current date/time in Europe/Madrid timezone) + - Credentials: Notion API integration + - Edge Cases: Notion API errors, improper data formatting, missing required fields. + +--- + +### 3. Summary Table + +| Node Name | Node Type | Functional Role | Input Node(s) | Output Node(s) | Sticky Note | +|-------------------|---------------------------------|-----------------------------------------|-----------------------|----------------------|--------------------------------------------------------------------------------------------------------------| +| Gmail Trigger | Gmail Trigger | Trigger workflow on new email with attachments | None | Code | ### ๐Ÿง  CV Parser - Email to Notion: Workflow overview and purpose | +| Code | Function (Code) | Filter attachments to PDFs only | Gmail Trigger | HTTP Request | ### ๐Ÿ“Œ Notes: Only processes emails with PDF attachments; skips non-CV docs automatically | +| HTTP Request | HTTP Request | Send PDF binary to OCR.space API for text extraction | Code | If1 | | +| If1 | If Condition | Check OCR API HTTP status (200-299) | HTTP Request | Message a model | | +| Message a model | OpenAI GPT (LangChain node) | Parse OCR text to detect CV and extract candidate data | If1 | If | | +| If | If Condition | Check if AI detected a CV (email present) | Message a model | Map Fields | | +| Map Fields | Set | Map AI-extracted candidate fields to structured format | If | Get many entries | | +| Get many entries | Notion (Database query) | Query Notion database for duplicates by email | Map Fields | If2 | | +| If2 | If Condition | Proceed only if no duplicate found | Get many entries | Create entry | | +| Create entry | Notion (Create database page) | Create new candidate record in Notion | If2 | None | | +| Sticky Note | Sticky Note | Documentation and workflow overview | None | None | ### ๐Ÿง  CV Parser - Email to Notion: This workflow automates extraction and organization of candidate data. | +| Sticky Note1 | Sticky Note | Additional notes on processing and duplicates | None | None | ### ๐Ÿ“Œ Notes: Only processes emails with PDF attachments; skips non-CV; checks duplicates by email. | +| Sticky Note2 | Sticky Note | Requirements and setup notes | None | None | ### โœ… Requirements: Gmail OAuth2, OCR.space API Key, OpenAI API Key, Notion integration with proper fields. | + +--- + +### 4. Reproducing the Workflow from Scratch + +1. **Create Gmail Trigger Node** + - Type: Gmail Trigger + - Configure OAuth2 credentials for Gmail. + - Set filter query: `has:attachment` + - Enable attachment download. + - Set polling interval to every minute. + +2. **Add Code Node to Filter PDF Attachments** + - Type: Function (Code) node + - Paste JavaScript code to filter attachments with mime type `application/pdf`. + - Connect Gmail Trigger output to this node. + +3. **Add HTTP Request Node to OCR.space API** + - Type: HTTP Request + - Set URL: `https://api.ocr.space/parse/image` + - Method: POST + - Content-Type: multipart/form-data + - Add header parameter `apikey` with value from variable `OCR_SPACE_API_KEY` (create this variable in n8n or set manually). + - Configure body to send binary data under field `file`. + - Set timeout to 40000 ms. + - Connect Code node output to this node. + +4. **Add If Node to Check OCR API Response Status** + - Type: If node (version 2) + - Condition: `statusCode >= 200 AND statusCode < 300` + - Connect HTTP Request node output to this node. + +5. **Add OpenAI GPT Node (LangChain OpenAi)** + - Type: OpenAi (LangChain) node + - Model: `gpt-4.1-nano` + - Credentials: Configure with OpenAI API Key. + - Message content: + ``` + You're an HR assistant. I will provide you with text from a PDF document. + + 1. First, determine if it's a CV/resume. + 2. If it is, extract the following fields: + - full_name + - email + - phone + - location + - job_title + - years_experience + - skills (as an array) + - education (as an array of strings) + - languages + - certifications + - linkedin + - availability + + 3. Return the result as a JSON object with those fields. + If it's not a CV, just return: { "is_cv": false } + Here is the document text: + + {{ $json.body.ParsedResults[0].ParsedText }} + ``` + - Connect the successful branch of the If node (OCR response check) to this node. + +6. **Add If Node to Check if Email Field Exists in GPT Output** + - Type: If node (version 2) + - Condition: Check if `message.content.email !== null` + - Connect GPT node output to this node. + +7. **Add Set Node to Map Fields** + - Type: Set node + - Map fields from GPT output JSON to structured keys under `message.content` for: full_name, email, phone, location, years_experience, education, languages, linkedin, etc. + - Connect positive branch of previous If node. + +8. **Add Notion Node to Query Database for Existing Entries** + - Type: Notion (Database Page - Get All) + - Credentials: Connect to Notion integration with access to your Candidates database. + - Set database ID to your Notion Candidates DB. + - Filter: `Email` equals `={{ $json.message.content.email }}` + - Connect Map Fields node output to this node. + +9. **Add If Node to Check for Duplicates** + - Type: If node (version 2) + - Condition: Check if the number of returned entries is zero (no duplicates). + - Connect Notion query output to this node. + +10. **Add Notion Node to Create New Candidate Entry** + - Type: Notion (Create Database Page) + - Credentials: Same Notion integration. + - Database ID: Candidates DB. + - Map properties from mapped fields to Notion properties: Name, Email, Phone, Location, Years of Experience, Education (joined by `/`), Languages (joined by `, `), LinkedIn URL, Application Date (set current date/time, timezone Europe/Madrid). + - Connect positive branch of previous If node. + +--- + +### 5. General Notes & Resources + +| Note Content | Context or Link | +|-----------------------------------------------------------------------------------------------------------|-----------------------------------------------------------| +| Workflow automates CV extraction from Gmail attachments, including OCR and GPT-powered parsing into Notion | Sticky Note on workflow overview | +| Only processes PDF attachments; skips non-CV documents automatically; checks duplicates by email | Sticky Note1 | +| Requires Gmail OAuth2 credentials, OCR.space API key, OpenAI API key, and Notion integration with proper database fields | Sticky Note2 | +| Notion database should contain fields: Name, Email, Phone, Location, Experience, Education, Languages, LinkedIn | Configured in Notion create and query nodes | + +--- + +This structured documentation provides a comprehensive understanding of the workflow, enabling reproduction, modification, and anticipation of potential errors or integration issues. \ No newline at end of file