creation

2026-04-28 00:29:22 +00:00 · 2025-11-12 19:13:35 +01:00
parent 8b2f1fe4c3
commit 355ac00794
1 changed files with 273 additions and 0 deletions
@@ -0,0 +1,273 @@
+Extract and Format PDF Data from Google Drive
+
+https://n8nworkflows.xyz/workflows/extract-and-format-pdf-data-from-google-drive-9061
+
+
+# Extract and Format PDF Data from Google Drive
+
+---
+
+## 1. Workflow Overview
+
+This workflow automates the process of extracting and cleaning text data from PDF files stored in a Google Drive folder. It is designed for use cases such as archiving, data extraction from reports or invoices, and any scenario requiring automated PDF text processing.
+
+The workflow is logically divided into four main blocks:
+
+- **1.1 Input Reception (File Discovery):** Triggering the workflow manually and locating PDF files in a specified Google Drive folder.
+- **1.2 Retrieval Stage (File Download):** Downloading the identified PDF files from Google Drive.
+- **1.3 Processing Stage (Data Extraction):** Extracting raw text content from the downloaded PDF files.
+- **1.4 Formatting Stage (Data Parsing & Cleaning):** Cleaning and formatting the extracted text using custom JavaScript code to prepare it for downstream use.
+
+---
+
+## 2. Block-by-Block Analysis
+
+### 2.1 Input Reception (File Discovery)
+
+**Overview:**  
+This block starts the workflow on demand and searches a designated Google Drive folder for PDF files, identifying all files that match the `.pdf` extension.
+
+**Nodes Involved:**  
+- Start  
+- Get PDF Files/File
+
+**Node Details:**
+
+- **Start**  
+  - Type: Manual Trigger  
+  - Role: Initiates the workflow manually by user action.  
+  - Configuration: Default manual trigger with no parameters.  
+  - Inputs: None  
+  - Outputs: Connected to "Get PDF Files/File" node.  
+  - Potential Failures: None, unless user neglects to trigger workflow.
+
+- **Get PDF Files/File**  
+  - Type: Google Drive  
+  - Role: Searches for PDF files within a specified Google Drive folder.  
+  - Configuration:  
+    - Search Query: `*.pdf` to target PDF files only.  
+    - Filter: Folder filter applied to restrict search to a specific folder (folder ID must be set).  
+    - Fields Requested: `id`, `name` of files.  
+    - Return All: True (returns all matching files).  
+  - Expressions: Folder ID set dynamically (must be configured).  
+  - Inputs: Connected from "Start".  
+  - Outputs: Sends file metadata (id, name) to "Download Retrieval Files/File".  
+  - Credential: Uses Google Drive OAuth2 credentials named "Template".  
+  - Potential Failures:  
+    - Invalid or expired Google credentials (authentication error).  
+    - Incorrect or empty folder ID (no files found).  
+    - No PDF files present in folder.  
+  - Edge Cases: Empty folder, permission restrictions.
+
+---
+
+### 2.2 Retrieval Stage (File Download)
+
+**Overview:**  
+This block downloads each PDF file found in the previous step, converting it to plain text format to facilitate extraction.
+
+**Nodes Involved:**  
+- Download Retrieval Files/File
+
+**Node Details:**
+
+- **Download Retrieval Files/File**  
+  - Type: Google Drive  
+  - Role: Downloads files by their ID from Google Drive, converting PDFs to plain text.  
+  - Configuration:  
+    - Operation: Download  
+    - File ID: Dynamically set via expression `{{$json.id}}` from previous node output.  
+    - Google File Conversion: Converts Google Docs to `text/plain` format (applied for PDFs).  
+  - Inputs: Receives file metadata from "Get PDF Files/File".  
+  - Outputs: Sends binary file data to "Extract Files/File's Data".  
+  - Credential: Uses Google Drive OAuth2 credentials named "Template".  
+  - Potential Failures:  
+    - File ID invalid or deleted.  
+    - Permission denied for file access.  
+    - Conversion errors if file is corrupted or unsupported.  
+  - Edge Cases: Large files causing timeout, network issues.
+
+---
+
+### 2.3 Processing Stage (Data Extraction)
+
+**Overview:**  
+Extracts raw text content from the downloaded PDF binary data, preparing it for cleaning and formatting.
+
+**Nodes Involved:**  
+- Extract Files/File's Data  
+- Get PDF Data Only
+
+**Node Details:**
+
+- **Extract Files/File's Data**  
+  - Type: Extract From File  
+  - Role: Extracts text content from PDF files.  
+  - Configuration:  
+    - Operation: PDF extraction mode enabled.  
+  - Inputs: Receives binary PDF data from "Download Retrieval Files/File".  
+  - Outputs: Provides extracted text under the field `text` along with other file data.  
+  - Potential Failures:  
+    - Malformed PDF that cannot be parsed.  
+    - Extraction failure due to unsupported PDF features.  
+  - Edge Cases: Very large PDFs, encrypted PDFs.
+
+- **Get PDF Data Only**  
+  - Type: Set  
+  - Role: Isolates the extracted text field (`text`) from the full extraction output, simplifying data for the next step.  
+  - Configuration:  
+    - Sets a new JSON property `text` equal to the extracted text from the previous node (`{{$json.text}}`).  
+  - Inputs: From "Extract Files/File's Data".  
+  - Outputs: Passes cleaned data structure to "Data Parser & Cleaner".  
+  - Potential Failures:  
+    - Missing `text` field if extraction failed or empty PDF.  
+    - Expression evaluation errors if input data structure changes.
+
+---
+
+### 2.4 Formatting Stage (Data Parsing & Cleaning)
+
+**Overview:**  
+Cleans and formats the raw extracted text to remove unwanted characters such as newlines and prepare the data for further use or export.
+
+**Nodes Involved:**  
+- Data Parser & Cleaner  
+- Done !
+
+**Node Details:**
+
+- **Data Parser & Cleaner**  
+  - Type: Code (JavaScript)  
+  - Role: Processes the raw text string by removing newline characters and optionally other cleanup tasks.  
+  - Configuration:  
+    - Custom JavaScript code that:  
+      - Checks if input is a string.  
+      - Replaces all newline characters (`\n`) with spaces.  
+      - Logs original and cleaned text for debugging.  
+      - Returns an object containing the cleaned text as `cleanedText`.  
+  - Key Expressions:  
+    - Input accessed via `$input.first().json.text`  
+  - Inputs: Receives JSON with raw text from "Get PDF Data Only".  
+  - Outputs: Sends cleaned text JSON to "Done !".  
+  - Potential Failures:  
+    - Input is not a string, causing errors or empty output.  
+    - JavaScript syntax errors in code node.  
+    - Input path changes breaking the code.  
+  - Version Specific: Uses n8n Code node v2 syntax.
+
+- **Done !**  
+  - Type: No Operation (NoOp)  
+  - Role: Terminal node indicating workflow completion.  
+  - Configuration: None.  
+  - Inputs: Receives cleaned text from "Data Parser & Cleaner".  
+  - Outputs: None.  
+  - Purpose: Can be used to inspect final output or for future extensions.
+
+---
+
+## 3. Summary Table
+
+| Node Name                  | Node Type             | Functional Role                  | Input Node(s)            | Output Node(s)           | Sticky Note                                                                                                  |
+|----------------------------|-----------------------|---------------------------------|--------------------------|--------------------------|--------------------------------------------------------------------------------------------------------------|
+| Start                      | Manual Trigger        | Initiate workflow manually       | —                        | Get PDF Files/File        |                                                                                                              |
+| Get PDF Files/File          | Google Drive          | Search PDF files in folder       | Start                    | Download Retrieval Files/File | See Sticky Note6 for setup instructions on folder, creds, and search configuration.                          |
+| Download Retrieval Files/File | Google Drive        | Download PDF files               | Get PDF Files/File        | Extract Files/File's Data | See Sticky Note5 for download node configuration details.                                                   |
+| Extract Files/File's Data   | Extract From File     | Extract text from PDF            | Download Retrieval Files/File | Get PDF Data Only         |                                                                                                              |
+| Get PDF Data Only           | Set                   | Isolate extracted text           | Extract Files/File's Data | Data Parser & Cleaner     |                                                                                                              |
+| Data Parser & Cleaner       | Code (JavaScript)     | Clean and format extracted text  | Get PDF Data Only         | Done !                   | See Sticky Note5 for code node usage and troubleshooting.                                                   |
+| Done !                     | No Operation (NoOp)  | Workflow completion marker       | Data Parser & Cleaner     | —                        |                                                                                                              |
+| Sticky Note2                | Sticky Note           | Thank you and feedback request   | —                        | —                        | Expresses gratitude and invites feedback on workflow improvements.                                          |
+| Sticky Note3                | Sticky Note           | Troubleshooting tips             | —                        | —                        | Provides common fixes and debug checklist for credential and node issues.                                   |
+| Sticky Note4                | Sticky Note           | Step-by-step setup guide placeholder | —                      | —                        | Contains placeholders for detailed setup instructions.                                                     |
+| Sticky Note5                | Sticky Note           | Detailed configuration notes     | —                        | —                        | Explains download node and code node configurations, plus testing instructions.                             |
+| Sticky Note6                | Sticky Note           | Google Drive preparation steps   | —                        | —                        | Advises on folder preparation, credential connection, and search node setup.                                |
+| Sticky Note7                | Sticky Note           | Customization suggestions        | —                        | —                        | Suggests modifying data fields and parser code for customization.                                          |
+| Sticky Note8                | Sticky Note           | Workflow flow explanation        | —                        | —                        | Explains the four main stages of the workflow in detail.                                                   |
+| Sticky Note9                | Sticky Note           | Quick demo and use case overview | —                        | —                        | Summarizes input/output and use cases of the workflow.                                                     |
+
+---
+
+## 4. Reproducing the Workflow from Scratch
+
+1. **Create Manual Trigger Node ("Start")**  
+   - Type: Manual Trigger  
+   - No special configuration needed.
+
+2. **Add Google Drive Node ("Get PDF Files/File")**  
+   - Set Operation: Search  
+   - Resource: File/Folder  
+   - Search Query: `*.pdf`  
+   - Add Filter: Folder  
+     - Operation: In Folder  
+     - Select the Google Drive folder containing PDFs.  
+   - Fields to Return: `id`, `name`  
+   - Return All: Enabled  
+   - Connect Manual Trigger output to this node input.  
+   - Configure Google Drive OAuth2 credentials (create new if not existing):  
+     - Provide Client ID and Client Secret from Google Cloud Console.  
+     - Authenticate and authorize access to Google Drive.
+
+3. **Add Google Drive Node ("Download Retrieval Files/File")**  
+   - Set Operation: Download  
+   - File ID: Use expression `{{$json.id}}` to get the file ID dynamically from previous node.  
+   - Google File Conversion: Enable conversion to `text/plain` format for PDFs.  
+   - Connect output of "Get PDF Files/File" to this node's input.  
+   - Use the same Google Drive OAuth2 credentials as above.
+
+4. **Add Extract From File Node ("Extract Files/File's Data")**  
+   - Operation: PDF extraction enabled.  
+   - Connect output of "Download Retrieval Files/File" to this node.  
+   - No credentials needed.
+
+5. **Add Set Node ("Get PDF Data Only")**  
+   - In Parameters, assign one field:  
+     - Name: `text`  
+     - Type: String  
+     - Value: Expression `{{$json.text}}` to extract the text field from previous node's output.  
+   - Connect output of "Extract Files/File's Data" to this node.
+
+6. **Add Code Node ("Data Parser & Cleaner")**  
+   - Language: JavaScript  
+   - Paste the following script:  
+     ```javascript
+     function removeNewlines(text) {
+       if (typeof text !== 'string') {
+         console.error("Input must be a string.");
+         return "";
+       }
+       return text.replace(/\n/g, ' ');
+     }
+     const inputText = $input.first().json.text;
+     const cleanedText = removeNewlines(inputText);
+     return { cleanedText };
+     ```  
+   - Connect output of "Get PDF Data Only" to this node.
+
+7. **Add No Operation Node ("Done !")**  
+   - This node serves as the workflow endpoint.  
+   - Connect output of "Data Parser & Cleaner" to this node.
+
+8. **Testing the Workflow**  
+   - Save the workflow.  
+   - Ensure Google Drive folder contains PDF files.  
+   - Trigger the workflow manually using the "Start" node.  
+   - Verify nodes execute without errors; inspect the output in the "Done !" node for cleaned text results.
+
+---
+
+## 5. General Notes & Resources
+
+| Note Content                                                                                                                                                                                                             | Context or Link                         |
+|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------|
+| 🙏 Thank you for trying this workflow! Feedback is welcome on improvements, new features, or additional workflows.                                                                                                       | Sticky Note2                           |
+| 🔍 Troubleshooting tips include checking Google credentials, folder selection, file presence, and verifying outputs at each node.                                                                                        | Sticky Note3                           |
+| 🛠️ Step-by-step setup guide placeholders are included to assist with quick deployment and configuration.                                                                                                               | Sticky Note4                           |
+| 📋 The workflow operates in four stages: Input (file search), Retrieval (download), Processing (extraction), and Formatting (parsing and cleaning).                                                                       | Sticky Note8                           |
+| 💾 Customization options: Modify the "Get PDF Data Only" node to extract more metadata fields; adjust the "Data Parser & Cleaner" JavaScript code to suit specific formatting or parsing needs.                            | Sticky Note7                           |
+| 📁 Use case overview: Automates extraction of text from PDFs stored in Google Drive folders, ideal for archiving or data processing tasks involving PDFs such as invoices or reports.                                    | Sticky Note9                           |
+
+---
+
+**Disclaimer:** The content described derives exclusively from an n8n automated workflow. It complies fully with content policies, containing no illegal or offensive materials. All handled data is public and legal.
+
+---