klbr/n8nworkflows.xyz

mirror of https://github.com/khoaliber/n8nworkflows.xyz.git synced 2026-04-19 17:14:37 +00:00

nusquama abf504fbe1 creation

2025-11-14 10:39:44 +01:00

Process Documents with OCR, Analytics & Google Drive using PDF Vector

https://n8nworkflows.xyz/workflows/process-documents-with-ocr--analytics---google-drive-using-pdf-vector-8505


1. Workflow Overview

This workflow processes documents stored in a specified Google Drive folder. It runs Optical Character Recognition (OCR) and parsing on PDF, Word, and image files through the PDF Vector node, then compiles a detailed analytics report on processing performance and quality. The workflow is started manually and is designed for batch processing and for monitoring document-processing metrics.

Logical blocks included:

1.1 Input Reception: Triggering and fetching documents from Google Drive.
1.2 Validation & Prioritization: Filtering and prioritizing files based on type and size.
1.3 Batch Processing: Splitting files into manageable batches and preparing individual items.
1.4 Document Processing: Processing each document or image via PDF Vector node.
1.5 Result Tracking: Analyzing processing outcomes and quality metrics.
1.6 Analytics Reporting: Aggregating results into a comprehensive analytics report.
1.7 Informational Notes: Sticky notes providing contextual information on analytics and metrics.

2. Block-by-Block Analysis

2.1 Input Reception

Overview:
Starts the workflow manually and lists the documents in a specified Google Drive folder for downstream handling.

Nodes Involved:

Manual Trigger
List Documents

Node Details:

Manual Trigger
- Type: Trigger node
- Role: Initiates the workflow manually for batch processing.
- Configuration: No parameters; user manually starts the workflow.
- Inputs: None
- Outputs: Connected to List Documents node
- Edge Cases: None expected; manual start avoids unintended execution.

List Documents
- Type: Google Drive node
- Role: Lists up to 100 files in a given Google Drive folder (requires folder ID replacement).
- Configuration:
  - Operation: list
  - Query: Files in folder 'FOLDER_ID_HERE' that are not trashed
  - Fields retrieved: id, name, mimeType, size, webViewLink, createdTime
- Inputs: Manual Trigger
- Outputs: Validate & Queue Files
- Credentials: Requires Google Drive OAuth2 credentials
- Edge Cases:
  - Folder ID must be replaced with a valid folder.
  - API quota or permission errors possible.
  - Empty folder returns no files.
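
The folder query described above can be sketched as a small helper. This is only an illustration: `buildFolderQuery` is a hypothetical name, not part of the workflow, and the quote escaping is an assumption about how a folder ID should be made safe inside the quoted literal.

```javascript
// Hypothetical helper: builds the search query used by the List Documents node.
function buildFolderQuery(folderId) {
  if (typeof folderId !== 'string' || folderId.trim() === '' || folderId === 'FOLDER_ID_HERE') {
    throw new Error('Replace FOLDER_ID_HERE with a real Google Drive folder ID');
  }
  // Escape backslashes and single quotes so the ID stays inside the quoted literal.
  const escaped = folderId.trim().replace(/\\/g, '\\\\').replace(/'/g, "\\'");
  return `'${escaped}' in parents and trashed=false`;
}
```

Rejecting the unmodified placeholder early surfaces the most common setup mistake before any API call is made.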

2.2 Validation & Prioritization

Overview:
Validates file types and sizes, categorizes files into valid or invalid queues, and prioritizes processing based on file size.

Nodes Involved:

Validate & Queue Files

Node Details:

Validate & Queue Files
- Type: Code (JavaScript) node
- Role: Applies business logic to determine valid files for processing and assign priorities.
- Configuration:
  - Supported formats: PDF, Word (doc/docx), and common images (jpeg, png, gif).
  - Size limit: 50 MB maximum; larger files are marked invalid.
  - Assigns priority:
    - High if under 5 MB
    - Medium if between 5 MB and 20 MB
    - Low if above 20 MB (up to the 50 MB limit)
  - Estimates credits: 2 credits per MB for PDFs; a flat 1 credit for all other file types.
  - Outputs: An object with arrays of valid and invalid files, plus processing stats.
- Inputs: List Documents
- Outputs: Process in Batches
- Edge Cases:
  - Files with unsupported mime types or oversized files go to invalid queue.
  - Potential for incorrect MIME type detection.
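  - The Google Drive API returns size as a string, so it must be converted to a number before comparison; at a 50 MB limit, floating-point precision is not a practical concern.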
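
A minimal sketch of the validation rules above, written as plain functions so it can run outside n8n. It is not the workflow's actual script. The MIME type list, the 1024-based MB, the rounding of credits, and the treatment of exactly 20 MB as medium priority are all assumptions, and `classifyFile` and `validateAndQueue` are hypothetical names.

```javascript
const MB = 1024 * 1024; // assumption: binary megabytes
const MAX_SIZE_BYTES = 50 * MB;

// Assumed MIME types for "PDF, Word (doc/docx), and common images (jpeg, png, gif)".
const SUPPORTED_TYPES = new Map([
  ['application/pdf', 'pdf'],
  ['application/msword', 'word'],
  ['application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'word'],
  ['image/jpeg', 'image'],
  ['image/png', 'image'],
  ['image/gif', 'image'],
]);

function classifyFile(file) {
  const sizeBytes = Number(file.size); // size may arrive as a string
  const category = SUPPORTED_TYPES.get(file.mimeType);
  if (!category) return { ...file, valid: false, reason: 'unsupported type' };
  if (!Number.isFinite(sizeBytes)) return { ...file, valid: false, reason: 'unknown size' };
  if (sizeBytes > MAX_SIZE_BYTES) return { ...file, valid: false, reason: 'too large' };

  const sizeMB = sizeBytes / MB;
  const priority = sizeMB < 5 ? 'high' : sizeMB <= 20 ? 'medium' : 'low';
  // 2 credits per MB for PDFs (rounded up, at least 1: an assumption), 1 credit otherwise.
  const estimatedCredits = category === 'pdf' ? Math.max(1, Math.ceil(sizeMB * 2)) : 1;
  return { ...file, valid: true, category, priority, estimatedCredits };
}

function validateAndQueue(files) {
  const classified = files.map(classifyFile);
  const valid = classified.filter((f) => f.valid);
  const invalid = classified.filter((f) => !f.valid);
  return {
    valid,
    invalid,
    stats: { total: files.length, valid: valid.length, invalid: invalid.length },
  };
}
```

Inside an n8n Code node, the input list would come from the node's incoming items rather than a function argument.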

2.3 Batch Processing

Overview:
Splits the valid files into batches of 5 for manageable processing, then prepares individual file items for detailed processing.

Nodes Involved:

Process in Batches
Split Out Files
Split Items

Node Details:

Process in Batches
- Type: SplitInBatches node
- Role: Manages batch size for downstream processing to prevent overload.
- Configuration: Batch size set to 5 files per batch.
- Inputs: Validate & Queue Files
- Outputs: Split Out Files and Generate Analytics Report (for analytics after batching)
- Edge Cases:
  - Batch size too large may cause timeouts or API rate limiting.

Split Out Files
- Type: Set node
- Role: Wraps the batch object in a single field (processingBatch) so that it can be split downstream.
- Configuration: Assigns entire batch JSON to a single field.
- Inputs: Process in Batches
- Outputs: Split Items
- Edge Cases: None significant.

Split Items
- Type: SplitOut node
- Role: Splits the batch into individual file items for per-file processing.
- Configuration: Field to split: processingBatch.valid (array of valid files).
- Inputs: Split Out Files
- Outputs: PDF Vector - Process Document/Image
- Edge Cases: Empty batch arrays produce no output items.
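
The batching idea can be illustrated with a short chunking function. This is a sketch of the general technique, not the SplitInBatches node's internals; `splitIntoBatches` is a hypothetical name, and the default of 5 mirrors the batch size stated above.

```javascript
// Splits a list into consecutive batches of at most batchSize items.
function splitIntoBatches(items, batchSize = 5) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches; // an empty input yields no batches
}
```

A smaller batch size lowers the risk of timeouts and rate limiting at the cost of more iterations.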

2.4 Document Processing

Overview:
Processes each individual document or image using PDF Vector’s OCR and NLP capabilities, supporting automatic LLM usage.

Nodes Involved:

PDF Vector - Process Document/Image

Node Details:

PDF Vector - Process Document/Image
- Type: PDF Vector node (custom integration)
- Role: Parses documents/images from URL, performs OCR, and optionally uses Large Language Models (LLMs).
- Configuration:
  - Resource: document
  - Operation: parse
  - Input type: URL (uses Google Drive webViewLink)
  - LLM usage: auto (automatic decision to use LLM)
- Inputs: Split Items
- Outputs: Track Processing Results
- Continue On Fail: true (workflow continues even if processing fails)
- Edge Cases:
  - Network issues or invalid URLs cause errors.
  - OCR failures or unsupported document contents.
  - LLM API rate limits or authentication errors.
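
The effect of "Continue On Fail" can be sketched as per-item error capture: a failure on one file is recorded and the remaining files are still processed. In this illustration `parseDocument` is a hypothetical stand-in for the PDF Vector parse call (its real interface is not shown in this document), and it is synchronous only to keep the example short.

```javascript
// Processes every file; failures are recorded per item instead of aborting the run.
// parseDocument is a hypothetical callback standing in for the PDF Vector node.
function processWithContinueOnFail(files, parseDocument) {
  return files.map((file) => {
    try {
      return { file: file.name, success: true, result: parseDocument(file.webViewLink) };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { file: file.name, success: false, error: message };
    }
  });
}
```

Because every input produces exactly one output item, the downstream tracking step can count failures without losing files.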

2.5 Result Tracking

Overview:
Analyzes each processed file’s results, evaluating success, processing time, credits used, content quality, and error details.

Nodes Involved:

Track Processing Results

Node Details:

Track Processing Results
- Type: Code node
- Role: Extracts processing metadata and performs quality checks on the output content.
- Configuration:
  - Measures processing time based on execution timestamps
  - Determines success based on absence of errors
  - Calculates quality checks: content presence, reasonable word count, encoding correctness, credit efficiency
  - Computes overall quality score (percentage)
  - Returns a detailed summary object per file.
- Inputs: PDF Vector - Process Document/Image
- Outputs: Collect Batch Results
- Edge Cases:
  - Missing timestamps or content can skew metrics.
  - Files with minimal content or encoding anomalies can produce misleading quality scores.
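
One way the four quality checks could be combined into a percentage score is sketched below. The thresholds are assumptions chosen for illustration (at least 10 words, no Unicode replacement characters, at least 100 words per credit), and `scoreQuality`, `content`, and `creditsUsed` are hypothetical names rather than the workflow's actual fields.

```javascript
// Scores one processing result; every threshold here is an illustrative assumption.
function scoreQuality(result) {
  const content = typeof result.content === 'string' ? result.content.trim() : '';
  const wordCount = content === '' ? 0 : content.split(/\s+/).length;
  const creditsUsed = Number(result.creditsUsed) || 0;

  const checks = {
    hasContent: content.length > 0,
    reasonableWordCount: wordCount >= 10,
    // U+FFFD usually indicates text that was decoded with the wrong encoding.
    cleanEncoding: content.length > 0 && !content.includes('\uFFFD'),
    creditEfficient: wordCount > 0 && creditsUsed * 100 <= wordCount,
  };

  const passed = Object.values(checks).filter(Boolean).length;
  const qualityScore = Math.round((passed / Object.keys(checks).length) * 100);
  return { success: !result.error, wordCount, checks, qualityScore };
}
```

Treating missing content as a failed check, rather than throwing, keeps failed items in the dataset so they still count in the final report.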

2.6 Analytics Reporting

Overview:
Aggregates all batch results, computes comprehensive metrics, success rates, error counts, performance highlights, and generates a formatted markdown report with actionable recommendations.

Nodes Involved:

Collect Batch Results
Generate Analytics Report

Node Details:

Collect Batch Results
- Type: Aggregate node
- Role: Aggregates all individual processed file results into a single dataset for reporting.
- Configuration: Aggregate all item data together.
- Inputs: Track Processing Results
- Outputs: Generate Analytics Report
- Edge Cases: Empty input produces empty aggregate.

Generate Analytics Report
- Type: Code node
- Role: Processes aggregated results and initial validation stats to produce detailed analytics and a human-readable report.
- Configuration:
  - Calculates overview stats (files processed, success, failure, time, credits, quality scores)
  - Breaks down by file type (pdf, word, image) with averages and success rates
  - Tracks error types and counts
  - Identifies fastest/slowest and most/least credit-efficient files
  - Generates markdown report with recommendations based on thresholds (e.g., success rate < 90%)
- Inputs: Collect Batch Results and initial validation stats (from Validate & Queue Files)
- Outputs: None (end of the workflow); the node emits the analytics data and the formatted report.
- Edge Cases: Division by zero on empty datasets; unexpected error message formats.
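
The aggregation step, including the division-by-zero guard and the 90% success-rate threshold mentioned above, might look roughly like this. It is a reduced sketch, not the workflow's report script: the field names (`success`, `error`, `processingMs`) and the report layout are assumptions.

```javascript
// Average that returns 0 for an empty list instead of NaN.
function average(values) {
  return values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;
}

function summarize(results) {
  const total = results.length;
  const succeeded = results.filter((r) => r.success).length;
  const successRate = total === 0 ? 0 : (succeeded / total) * 100;

  const errorCounts = Object.create(null); // no inherited keys to collide with
  for (const r of results) {
    if (r.success) continue;
    const key = r.error || 'unknown error';
    errorCounts[key] = (errorCounts[key] || 0) + 1;
  }

  const recommendations = [];
  if (total > 0 && successRate < 90) {
    recommendations.push('Success rate is below 90%: review the most frequent errors.');
  }

  return {
    total,
    succeeded,
    failed: total - succeeded,
    successRate,
    avgProcessingMs: average(results.map((r) => r.processingMs || 0)),
    errorCounts,
    recommendations,
  };
}

function renderReport(summary) {
  const lines = [
    '# Document Processing Report',
    '',
    `- Files processed: ${summary.total}`,
    `- Succeeded: ${summary.succeeded}`,
    `- Failed: ${summary.failed}`,
    `- Success rate: ${summary.successRate.toFixed(1)}%`,
    `- Average processing time: ${Math.round(summary.avgProcessingMs)} ms`,
  ];
  for (const [error, count] of Object.entries(summary.errorCounts)) {
    lines.push(`- Error "${error}": ${count}`);
  }
  for (const rec of summary.recommendations) {
    lines.push(`- Recommendation: ${rec}`);
  }
  return lines.join('\n');
}
```

Keeping `summarize` separate from `renderReport` lets the numeric results be passed to other outputs without parsing the markdown.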

2.7 Informational Notes (Sticky Notes)

Overview:
Provides contextual information about the workflow’s analytics capabilities, tracked metrics, and output destinations.

Nodes Involved:

Analytics Overview
Metrics Tracked
Dashboard Output

Node Details:

Analytics Overview
- Type: Sticky Note
- Content: Describes real-time analytics features such as tracking workflows, calculating KPIs every 30 minutes, monitoring success/failure, analyzing trends, and updating dashboards automatically.

Metrics Tracked
- Type: Sticky Note
- Content: Lists key metrics tracked (documents/hour, processing time, error rates, API usage, cost) over a 30-day rolling window.

Dashboard Output
- Type: Sticky Note
- Content: Lists output channels for analytics (Google Sheets, Tableau, Power BI, Slack alerts) with real-time update emphasis.

3. Summary Table

| Node Name | Node Type | Functional Role | Input Node(s) | Output Node(s) | Sticky Note |
|---|---|---|---|---|---|
| Manual Trigger | Manual Trigger | Initiates workflow manually | None | List Documents | Start batch processing |
| List Documents | Google Drive | Lists files in specified folder | Manual Trigger | Validate & Queue Files | Replace FOLDER_ID_HERE with your Google Drive folder ID |
| Validate & Queue Files | Code | Validates files and prioritizes | List Documents | Process in Batches | Validate and prioritize files |
| Process in Batches | SplitInBatches | Splits files into batches of 5 | Validate & Queue Files | Split Out Files, Generate Analytics Report | Process 5 files at a time |
| Split Out Files | Set | Prepares batch object for splitting | Process in Batches | Split Items | Prepare individual files |
| Split Items | SplitOut | Splits batch into individual files | Split Out Files | PDF Vector - Process Document/Image | |
| PDF Vector - Process Document/Image | PDF Vector | Processes document/image OCR & NLP | Split Items | Track Processing Results | Process document or image |
| Track Processing Results | Code | Analyzes processing result quality | PDF Vector - Process Document/Image | Collect Batch Results | Analyze results |
| Collect Batch Results | Aggregate | Aggregates batch results | Track Processing Results | Generate Analytics Report | Aggregate batch results |
| Generate Analytics Report | Code | Generates detailed analytics report | Collect Batch Results, Process in Batches | None | Create analytics dashboard |
| Analytics Overview | Sticky Note | Overview of analytics capabilities | None | None | 📊 Real-Time Analytics. Document processing metrics: tracks all workflows in database; calculates KPIs every 30 minutes; monitors success/failure rates; analyzes trends & patterns; updates dashboards automatically |
| Metrics Tracked | Sticky Note | Lists key tracked metrics | None | None | 📈 Key Metrics. Tracking: documents/hour; processing time; error rates; API usage; cost analysis. 💡 30-day rolling window |
| Dashboard Output | Sticky Note | Lists analytics output destinations | None | None | 📊 Visualizations. Outputs to: Google Sheets; Tableau; Power BI; Slack alerts. ✨ Real-time updates! |

4. Reproducing the Workflow from Scratch

1. Create Manual Trigger Node
   - Type: Manual Trigger
   - No parameters needed
   - Position on canvas: start of the flow
2. Create Google Drive Node (List Documents)
   - Type: Google Drive
   - Operation: List
   - Limit: 100
   - Fields: id, name, mimeType, size, webViewLink, createdTime
   - Query: 'FOLDER_ID_HERE' in parents and trashed=false (replace FOLDER_ID_HERE with the actual folder ID)
   - Credentials: Set Google Drive OAuth2 credentials
   - Connect Manual Trigger → List Documents
3. Create Code Node (Validate & Queue Files)
   - Type: Code
   - Language: JavaScript
   - Paste the provided validation script (validates file types and size, assigns priority, and estimates credits)
   - Connect List Documents → Validate & Queue Files
4. Create SplitInBatches Node (Process in Batches)
   - Type: SplitInBatches
   - Batch Size: 5
   - Connect Validate & Queue Files → Process in Batches
5. Create Set Node (Split Out Files)
   - Type: Set
   - Add assignment: processingBatch = ={{ $json }} (assigns the entire batch object; the leading = marks the value as an n8n expression)
   - Connect Process in Batches → Split Out Files
6. Create SplitOut Node (Split Items)
   - Type: SplitOut
   - Field To Split Out: processingBatch.valid
   - Connect Split Out Files → Split Items
7. Create PDF Vector Node (PDF Vector - Process Document/Image)
   - Type: PDF Vector (custom node)
   - Resource: Document
   - Operation: Parse
   - Input Type: URL
   - URL: ={{ $json.webViewLink }} (the leading = marks the value as an n8n expression)
   - Use LLM: Auto
   - Enable “Continue On Fail”
   - Connect Split Items → PDF Vector - Process Document/Image
   - Credentials: Configure API credentials as needed for PDF Vector
8. Create Code Node (Track Processing Results)
   - Type: Code
   - JavaScript code: Paste the provided result tracking script (calculates success, quality, timing, credits)
   - Connect PDF Vector - Process Document/Image → Track Processing Results
9. Create Aggregate Node (Collect Batch Results)
   - Type: Aggregate
   - Operation: Aggregate All Item Data
   - Connect Track Processing Results → Collect Batch Results
10. Create Code Node (Generate Analytics Report)
    - Type: Code
    - Paste the provided analytics report generation script
    - Connect Collect Batch Results → Generate Analytics Report
    - Also connect Process in Batches → Generate Analytics Report (to pass initial stats)
11. Add Sticky Notes for Documentation
    - Create three sticky notes with the provided content:
      - Analytics Overview (near start)
      - Metrics Tracked (near bottom left)
      - Dashboard Output (near bottom right)
12. Verify Credential Configurations
    - Google Drive node requires OAuth2 credentials with read permissions on the target folder.
    - PDF Vector node requires API credentials for OCR and LLM services.
    - No other external credentials needed.
13. Test Workflow
    - Manually trigger the workflow.
    - Confirm that files are fetched, validated, and processed, and that the analytics report is generated.
    - Watch for errors caused by unsupported file types or oversized files.

5. General Notes & Resources

| Note Content | Context or Link |
|---|---|
| Real-time analytics update dashboards every 30 minutes with KPIs, error rates, and trend analysis. | Sticky note "Analytics Overview" |
| Key metrics tracked include documents/hour, processing time, error rates, API usage, and cost. | Sticky note "Metrics Tracked" |
| Outputs analytics to Google Sheets, Tableau, Power BI, and Slack alerts with real-time updates. | Sticky note "Dashboard Output" |
| Replace placeholder folder ID in Google Drive node query with your actual Google Drive folder ID. | Node "List Documents" note |
| Batch processing size is set to 5 to balance throughput and rate limits. | Node "Process in Batches" note |

This structured documentation enables a comprehensive understanding of the entire workflow, facilitates reproduction or modification, and highlights critical error handling points and integration dependencies.
