Files
nusquama abf504fbe1 creation
2025-11-14 10:39:44 +01:00

18 KiB
Raw Permalink Blame History

Process Documents with OCR, Analytics & Google Drive using PDF Vector

https://n8nworkflows.xyz/workflows/process-documents-with-ocr--analytics---google-drive-using-pdf-vector-8505

Process Documents with OCR, Analytics & Google Drive using PDF Vector

1. Workflow Overview

This workflow automates the processing of documents stored in a specified Google Drive folder. It performs Optical Character Recognition (OCR) and analytics on PDF, Word, and image files using PDF Vector technology and compiles detailed real-time analytics reports on processing performance and quality. The workflow is designed for batch processing and continuous monitoring of document processing metrics for operational insights.

Logical blocks included:

  • 1.1 Input Reception: Triggering and fetching documents from Google Drive.
  • 1.2 Validation & Prioritization: Filtering and prioritizing files based on type and size.
  • 1.3 Batch Processing: Splitting files into manageable batches and preparing individual items.
  • 1.4 Document Processing: Processing each document or image via PDF Vector node.
  • 1.5 Result Tracking: Analyzing processing outcomes and quality metrics.
  • 1.6 Analytics Reporting: Aggregating results into a comprehensive analytics report.
  • 1.7 Informational Notes: Sticky notes providing contextual information on analytics and metrics.

2. Block-by-Block Analysis

2.1 Input Reception

Overview:
Starts processing manually via trigger and lists documents in a specified Google Drive folder for subsequent handling.

Nodes Involved:

  • Manual Trigger
  • List Documents

Node Details:

  • Manual Trigger

    • Type: Trigger node
    • Role: Initiates the workflow manually for batch processing.
    • Configuration: No parameters; user manually starts the workflow.
    • Inputs: None
    • Outputs: Connected to List Documents node
    • Edge Cases: None expected; manual start avoids unintended execution.
  • List Documents

    • Type: Google Drive node
    • Role: Lists up to 100 files in a given Google Drive folder (requires folder ID replacement).
    • Configuration:
      • Operation: list
      • Query: Files in folder 'FOLDER_ID_HERE' that are not trashed
      • Fields retrieved: id, name, mimeType, size, webViewLink, createdTime
    • Inputs: Manual Trigger
    • Outputs: Validated & queued files node
    • Credentials: Requires Google Drive OAuth2 credentials
    • Edge Cases:
      • Folder ID must be replaced with a valid folder.
      • API quota or permission errors possible.
      • Empty folder returns no files.

2.2 Validation & Prioritization

Overview:
Validates file types and sizes, categorizes files into valid or invalid queues, and prioritizes processing based on file size.

Nodes Involved:

  • Validate & Queue Files

Node Details:

  • Validate & Queue Files
    • Type: Code (JavaScript) node
    • Role: Applies business logic to determine valid files for processing and assign priorities.
    • Configuration:
      • Supported formats: PDF, Word (doc/docx), and common images (jpeg, png, gif).
      • Size limit: 50MB max; files above are invalidated.
      • Assigns priority:
        • High if <5MB
        • Medium if between 5MB and 20MB
        • Low otherwise
      • Calculates estimated credits based on size for PDFs (2 credits per MB), flat 1 credit otherwise.
      • Outputs: An object with arrays of valid and invalid files, plus processing stats.
    • Inputs: List Documents
    • Outputs: Process in Batches
    • Edge Cases:
      • Files with unsupported mime types or oversized files go to invalid queue.
      • Potential for incorrect MIME type detection.
      • Large file size numbers might cause float precision issues.

2.3 Batch Processing

Overview:
Splits the valid files into batches of 5 for manageable processing, then prepares individual file items for detailed processing.

Nodes Involved:

  • Process in Batches
  • Split Out Files
  • Split Items

Node Details:

  • Process in Batches

    • Type: SplitInBatches node
    • Role: Manages batch size for downstream processing to prevent overload.
    • Configuration: Batch size set to 5 files per batch.
    • Inputs: Validate & Queue Files
    • Outputs: Split Out Files and Generate Analytics Report (for analytics after batching)
    • Edge Cases:
      • Batch size too large may cause timeouts or API rate limiting.
  • Split Out Files

    • Type: Set node
    • Role: Converts the batch object into a single attribute (processingBatch) for splitting.
    • Configuration: Assigns entire batch JSON to a single field.
    • Inputs: Process in Batches
    • Outputs: Split Items
    • Edge Cases: None significant.
  • Split Items

    • Type: SplitOut node
    • Role: Splits the batch into individual file items for per-file processing.
    • Configuration: Field to split: processingBatch.valid (array of valid files).
    • Inputs: Split Out Files
    • Outputs: PDF Vector - Process Document/Image
    • Edge Cases: Empty batch arrays produce no output items.

2.4 Document Processing

Overview:
Processes each individual document or image using PDF Vectors OCR and NLP capabilities, supporting automatic LLM usage.

Nodes Involved:

  • PDF Vector - Process Document/Image

Node Details:

  • PDF Vector - Process Document/Image
    • Type: PDF Vector node (custom integration)
    • Role: Parses documents/images from URL, performs OCR, and optionally uses Large Language Models (LLMs).
    • Configuration:
      • Resource: document
      • Operation: parse
      • Input type: URL (uses Google Drive webViewLink)
      • LLM usage: auto (automatic decision to use LLM)
    • Inputs: Split Items
    • Outputs: Track Processing Results
    • Continue On Fail: true (workflow continues even if processing fails)
    • Edge Cases:
      • Network issues or invalid URLs cause errors.
      • OCR failures or unsupported document contents.
      • LLM API rate limits or authentication errors.

2.5 Result Tracking

Overview:
Analyzes each processed files results, evaluating success, processing time, credits used, content quality, and error details.

Nodes Involved:

  • Track Processing Results

Node Details:

  • Track Processing Results
    • Type: Code node
    • Role: Extracts processing metadata and performs quality checks on the output content.
    • Configuration:
      • Measures processing time based on execution timestamps
      • Determines success based on absence of errors
      • Calculates quality checks: content presence, reasonable word count, encoding correctness, credit efficiency
      • Computes overall quality score (percentage)
      • Returns a detailed summary object per file.
    • Inputs: PDF Vector - Process Document/Image
    • Outputs: Collect Batch Results
    • Edge Cases:
      • Missing timestamps or content can skew metrics.
      • Edge cases for files with minimal content or encoding anomalies.

2.6 Analytics Reporting

Overview:
Aggregates all batch results, computes comprehensive metrics, success rates, error counts, performance highlights, and generates a formatted markdown report with actionable recommendations.

Nodes Involved:

  • Collect Batch Results
  • Generate Analytics Report

Node Details:

  • Collect Batch Results

    • Type: Aggregate node
    • Role: Aggregates all individual processed file results into a single dataset for reporting.
    • Configuration: Aggregate all item data together.
    • Inputs: Track Processing Results
    • Outputs: Generate Analytics Report
    • Edge Cases: Empty input produces empty aggregate.
  • Generate Analytics Report

    • Type: Code node
    • Role: Processes aggregated results and initial validation stats to produce detailed analytics and a human-readable report.
    • Configuration:
      • Calculates overview stats (files processed, success, failure, time, credits, quality scores)
      • Breaks down by file type (pdf, word, image) with averages and success rates
      • Tracks error types and counts
      • Identifies fastest/slowest and most/least credit-efficient files
      • Generates markdown report with recommendations based on thresholds (e.g., success rate < 90%)
    • Inputs: Collect Batch Results and initial validation stats (from Validate & Queue Files)
    • Outputs: End of processing data with analytics and report
    • Edge Cases: Divisions by zero, empty datasets, unexpected error messages.

2.7 Informational Notes (Sticky Notes)

Overview:
Provides contextual information about the workflows analytics capabilities, tracked metrics, and output destinations.

Nodes Involved:

  • Analytics Overview
  • Metrics Tracked
  • Dashboard Output

Node Details:

  • Analytics Overview

    • Type: Sticky Note
    • Content: Describes real-time analytics features such as tracking workflows, calculating KPIs every 30 minutes, monitoring success/failure, analyzing trends, and updating dashboards automatically.
  • Metrics Tracked

    • Type: Sticky Note
    • Content: Lists key metrics tracked (documents/hour, processing time, error rates, API usage, cost) over a 30-day rolling window.
  • Dashboard Output

    • Type: Sticky Note
    • Content: Lists output channels for analytics (Google Sheets, Tableau, Power BI, Slack alerts) with real-time update emphasis.

3. Summary Table

Node Name Node Type Functional Role Input Node(s) Output Node(s) Sticky Note
Manual Trigger Manual Trigger Initiates workflow manually None List Documents Start batch processing
List Documents Google Drive Lists files in specified folder Manual Trigger Validate & Queue Files Replace FOLDER_ID_HERE with your Google Drive folder ID
Validate & Queue Files Code Validates files and prioritizes List Documents Process in Batches Validate and prioritize files
Process in Batches SplitInBatches Splits files into batches of 5 Validate & Queue Files Split Out Files, Generate Analytics Report Process 5 files at a time
Split Out Files Set Prepares batch object for splitting Process in Batches Split Items Prepare individual files
Split Items SplitOut Splits batch into individual files Split Out Files PDF Vector - Process Document/Image
PDF Vector - Process Document/Image PDF Vector Processes document/image OCR & NLP Split Items Track Processing Results Process document or image
Track Processing Results Code Analyzes processing result quality PDF Vector - Process Document/Image Collect Batch Results Analyze results
Collect Batch Results Aggregate Aggregates batch results Track Processing Results Generate Analytics Report Aggregate batch results
Generate Analytics Report Code Generates detailed analytics report Collect Batch Results None Create analytics dashboard
Analytics Overview Sticky Note Overview of analytics capabilities None None ## 📊 Real-Time Analytics\n\nDocument processing metrics:\n• Tracks all workflows in database\n• Calculates KPIs every 30 minutes\n• Monitors success/failure rates\n• Analyzes trends & patterns\n• Updates dashboards automatically
Metrics Tracked Sticky Note Lists key tracked metrics None None ## 📈 Key Metrics\n\nTracking:\n• Documents/hour\n• Processing time\n• Error rates\n• API usage\n• Cost analysis\n\n💡 30-day rolling window
Dashboard Output Sticky Note Lists analytics output destinations None None ## 📊 Visualizations\n\nOutputs to:\n• Google Sheets\n• Tableau\n• Power BI\n• Slack alerts\n\n Real-time updates!

4. Reproducing the Workflow from Scratch

  1. Create Manual Trigger Node

    • Type: Manual Trigger
    • No parameters needed
    • Position on canvas: start of the flow
  2. Create Google Drive Node (List Documents)

    • Type: Google Drive
    • Operation: List
    • Limit: 100
    • Fields: id, name, mimeType, size, webViewLink, createdTime
    • Query: 'FOLDER_ID_HERE' in parents and trashed=false (replace FOLDER_ID_HERE with actual folder ID)
    • Credentials: Set Google Drive OAuth2 credentials
    • Connect Manual Trigger → List Documents
  3. Create Code Node (Validate & Queue Files)

    • Type: Code
    • Language: JavaScript
    • Paste provided validation script (validates file types, size, priority, and estimated credits)
    • Connect List Documents → Validate & Queue Files
  4. Create SplitInBatches Node (Process in Batches)

    • Type: SplitInBatches
    • Batch Size: 5
    • Connect Validate & Queue Files → Process in Batches
  5. Create Set Node (Split Out Files)

    • Type: Set
    • Add assignment: processingBatch = ={{ $json }} (assign entire batch object)
    • Connect Process in Batches → Split Out Files
  6. Create SplitOut Node (Split Items)

    • Type: SplitOut
    • Field To Split Out: processingBatch.valid
    • Connect Split Out Files → Split Items
  7. Create PDF Vector Node (PDF Vector - Process Document/Image)

    • Type: PDF Vector (custom node)
    • Resource: Document
    • Operation: Parse
    • Input Type: URL
    • URL: ={{ $json.webViewLink }}
    • Use LLM: Auto
    • Enable “Continue On Fail”
    • Connect Split Items → PDF Vector - Process Document/Image
    • Credentials: Configure API credentials as needed for PDF Vector
  8. Create Code Node (Track Processing Results)

    • Type: Code
    • JavaScript code: Provided result tracking code (calculates success, quality, timing, credits)
    • Connect PDF Vector - Process Document/Image → Track Processing Results
  9. Create Aggregate Node (Collect Batch Results)

    • Type: Aggregate
    • Operation: Aggregate All Item Data
    • Connect Track Processing Results → Collect Batch Results
  10. Create Code Node (Generate Analytics Report)

    • Type: Code
    • Paste the provided analytics report generation script
    • Connect Collect Batch Results → Generate Analytics Report
    • Also connect Process in Batches → Generate Analytics Report (to pass initial stats)
  11. Add Sticky Notes for Documentation:

    • Create three sticky notes with the provided content:
      • Analytics Overview (near start)
      • Metrics Tracked (near bottom left)
      • Dashboard Output (near bottom right)
  12. Verify Credential Configurations:

    • Google Drive node requires OAuth2 credentials with read permissions on the target folder.
    • PDF Vector node requires API credentials for OCR and LLM services.
    • No other external credentials needed.
  13. Test Workflow:

    • Manually trigger the workflow.
    • Confirm files are fetched, validated, processed, and analytics generated.
    • Monitor for errors in unsupported file types or large files.

5. General Notes & Resources

Note Content Context or Link
Real-time analytics update dashboards every 30 minutes with KPIs, error rates, and trend analysis. Sticky note "Analytics Overview"
Key metrics tracked include documents/hour, processing time, error rates, API usage, and cost. Sticky note "Metrics Tracked"
Outputs analytics to Google Sheets, Tableau, Power BI, and Slack alerts with real-time updates. Sticky note "Dashboard Output"
Replace placeholder folder ID in Google Drive node query with your actual Google Drive folder ID. Node "List Documents" note
Batch processing size is set to 5 to balance throughput and rate limits. Node "Process in Batches" note

This structured documentation enables a comprehensive understanding of the entire workflow, facilitates reproduction or modification, and highlights critical error handling points and integration dependencies.