22 KiB
Multi-Format Document Processing for RAG Chatbot with Google Drive & Supabase
Multi-Format Document Processing for RAG Chatbot with Google Drive & Supabase
1. Workflow Overview
This workflow automates the ingestion and processing of multiple document formats uploaded to a specific Google Drive folder for use in a Retrieval-Augmented Generation (RAG) chatbot system. It monitors newly created files, detects their type, extracts text content accordingly, optionally converts certain documents to Google Docs format, splits large texts for better processing, generates embeddings using OpenAI, and finally stores the processed document vectors in a Supabase vector database. The workflow is organized into the following logical blocks:
- 1.1 Input Reception: Watches a designated Google Drive folder for new file uploads.
- 1.2 File Metadata and Looping: Captures file metadata, loops through files for individual processing.
- 1.3 File Type Detection and Conditional Processing: Uses a switch node to route files based on MIME type for appropriate extraction or conversion.
- 1.4 Text Extraction and Processing: Extracts text from PDFs, Excel, Google Docs, Word documents, and plain text files.
- 1.5 Document Aggregation and Summarization: Aggregates and summarizes extracted data when necessary.
- 1.6 Text Splitting and Embedding Generation: Splits large text documents recursively and generates OpenAI embeddings.
- 1.7 Data Insertion: Inserts processed document vectors into Supabase vectorstore.
- 1.8 Cleanup: Deletes temporary files in Google Drive after conversion.
- 1.9 Informational and Credit Notes: Provides workflow comments and credits.
2. Block-by-Block Analysis
1.1 Input Reception
Overview:
This block triggers the workflow when new files are created in a specified Google Drive folder.
Nodes Involved:
- File Created
Node Details:
- File Created
- Type: Google Drive Trigger
- Role: Watches for newly created files in a specific Google Drive folder (ID:
1M9h9OnDSWa0kV7_Yj7fdSFHhM57pMPX8) - Configuration: Trigger on
fileCreatedevent, polling every minute, limited to specified folder - Inputs: None (trigger node)
- Outputs: Emits metadata about new files (ID, MIME type, owner, timestamps)
- Potential Failures: Auth errors if Google Drive OAuth2 credentials expire; rate limiting from Drive API
1.2 File Metadata and Looping
Overview:
Sets file metadata in variables and loops over multiple items for individual processing.
Nodes Involved:
- Set File ID1
- Loop Over Items
Node Details:
-
Set File ID1
- Type: Set
- Role: Extracts and stores
file_idandfile_typefrom the created file metadata for downstream use - Key Expressions:
file_id={{$('File Created').item.json.id}}file_type={{$('File Created').item.json.mimeType}}
- Inputs: File Created node output
- Outputs: Single item with file metadata
- Edge Cases: Missing or malformed file metadata may cause failures downstream
-
Loop Over Items
- Type: SplitInBatches
- Role: Processes each file individually in batch mode, useful if multiple files triggered simultaneously
- Inputs: Output from Set File ID1
- Outputs: Single file item streamed to downstream nodes
- Edge Cases: Large batch sizes may cause timeouts; single-file operation expected in normal use
1.3 File Type Detection and Conditional Processing
Overview:
Routes files according to MIME type to the appropriate extraction or conversion flow.
Nodes Involved:
- Switch2
Node Details:
- Switch2
- Type: Switch
- Role: Checks
file_typeto route files to PDF extraction, text extraction, Excel extraction, or Google Doc conversion - Conditions & Outputs:
- PDF:
application/pdf - Google Docs:
application/vnd.google-apps.document - Excel:
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet - Windows Word Doc (multiple MIME types)
- PDF:
- Inputs: Single file metadata item from Loop Over Items
- Outputs: Routes to corresponding extraction or conversion nodes
- Edge Cases: Unknown MIME types fall through to fallback output, possibly unhandled
1.4 Text Extraction and Processing
Overview:
Extracts text content from files based on their format.
Nodes Involved:
- Extract PDF Text
- Extract from Text File
- Extract from Excel
- Convert to Google Doc1
Node Details:
-
Extract PDF Text
- Type: ExtractFromFile (PDF)
- Role: Extracts text content from PDF files
- Inputs: Routed from Switch2 PDF output
- Outputs: Text data for embedding
- Errors: Corrupt PDFs, large files causing timeouts
-
Extract from Text File
- Type: ExtractFromFile (Text)
- Role: Extracts plain text from Google Docs files
- Inputs: Routed from Switch2 Google Docs output
- Outputs: Text data for embedding
- Edge Cases: Empty documents; unsupported encodings
-
Extract from Excel
- Type: ExtractFromFile (XLSX)
- Role: Extracts text or cell data from Excel spreadsheets
- Inputs: Routed from Switch2 Excel output
- Outputs: Rows or concatenated data
- Connected to: Aggregate1 (aggregation of extracted rows)
- Edge Cases: Complex spreadsheets, formulas, or large files may cause extraction issues
-
Convert to Google Doc1
- Type: HTTP Request (Google Drive API copy file)
- Role: Converts Word documents to Google Docs format by copying with MIME type change
- Inputs: Routed from Switch2 Windows Doc outputs
- Outputs: New Google Doc file ID and name
- Credentials: Google Drive OAuth2
- Edge Cases: API quota limits, conversion failures, unsupported Word formats
- Followed by: Delete File node to remove original file
1.5 Document Aggregation and Summarization
Overview:
Aggregates multiple extracted data items and summarizes text content before embedding.
Nodes Involved:
- Aggregate1
- Summarize1
Node Details:
-
Aggregate1
- Type: Aggregate
- Role: Combines all extracted data items from Excel extraction into one dataset
- Inputs: Extract from Excel output
- Outputs: Aggregated data to Summarize1
-
Summarize1
- Type: Summarize
- Role: Concatenates aggregated data fields into a single text string for embedding insertion
- Inputs: Aggregate1 output
- Outputs: Summarized concatenated text for insertion
1.6 Text Splitting and Embedding Generation
Overview:
Splits large text documents recursively for manageable chunks and generates vector embeddings using OpenAI.
Nodes Involved:
- Recursive Character Text Splitter
- Enhanced Default Data Loader1
- Embeddings OpenAI1
Node Details:
-
Recursive Character Text Splitter
- Type: LangChain Text Splitter (Recursive Character)
- Role: Splits text into chunks of 2000 characters with 200 characters overlap using markdown splitting
- Inputs: Text from previous extraction/summarization steps (implied connection)
- Outputs: Text chunks for embedding generation
-
Enhanced Default Data Loader1
- Type: LangChain Document Data Loader
- Role: Wraps split text chunks with metadata (file ID, version, creator, timestamps, folder path, file name, extension)
- Inputs: Text chunks from Recursive Character Text Splitter
- Outputs: Fully prepared document objects for embedding insertion
-
Embeddings OpenAI1
- Type: LangChain OpenAI Embeddings
- Role: Generates vector embeddings for documents
- Credentials: OpenAI API key required
- Inputs: Document objects from Data Loader
- Outputs: Embeddings for vector store insertion
- Edge Cases: API rate limits, token limits, network failures
1.7 Data Insertion
Overview:
Inserts processed document embeddings and summarized text into Supabase vectorstore.
Nodes Involved:
- Insert into Supabase Vectorstore1
Node Details:
- Insert into Supabase Vectorstore1
- Type: LangChain Vector Store (Supabase)
- Role: Inserts embeddings and document metadata into Supabase vector database table named
documents - Credentials: Supabase API key
- Inputs: From Summarize1, Extract PDF Text, Extract from Text File, Embeddings OpenAI1, and Enhanced Default Data Loader1 nodes via different input types (
main,ai_embedding,ai_document) - Edge Cases: Database connectivity issues, quota limits, schema mismatches
1.8 Cleanup
Overview:
Deletes temporary converted files from Google Drive to maintain storage hygiene.
Nodes Involved:
- Delete File
Node Details:
- Delete File
- Type: Google Drive
- Role: Deletes original files after conversion to Google Docs to prevent duplication
- Inputs: Output from Convert to Google Doc1 (file ID)
- Credentials: Google Drive OAuth2
- Edge Cases: Permissions errors, file already deleted, API rate limits
1.9 Informational and Credit Notes
Overview:
Provides descriptive and credit information about the workflow.
Nodes Involved:
- Sticky Note1
- Sticky Note
Node Details:
-
Sticky Note1
- Type: Sticky Note
- Content: Describes the workflow trigger and processing logic in Indonesian language
- Position: Top left for overview
-
Sticky Note
- Type: Sticky Note
- Content: Credits Nate Herk as the original author and provides a link to his community:
https://www.skool.com/ai-automation-society-plus/about
3. Summary Table
| Node Name | Node Type | Functional Role | Input Node(s) | Output Node(s) | Sticky Note |
|---|---|---|---|---|---|
| File Created | Google Drive Trigger | Watches for new files in designated Google Drive folder | None | Set File ID1 | # Watch Trigger (Drive) - Files Created. Multiple files uploaded to monitored folder -> processing |
| Set File ID1 | Set | Extracts file_id and file_type from trigger output | File Created | Loop Over Items | # Watch Trigger (Drive) - Files Created. Multiple files uploaded to monitored folder -> processing |
| Loop Over Items | SplitInBatches | Iterates over files individually | Set File ID1 | Switch2, Download File1 | # Watch Trigger (Drive) - Files Created. Multiple files uploaded to monitored folder -> processing |
| Switch2 | Switch | Routes files by MIME type to specific extraction paths | Loop Over Items | Extract PDF Text, Extract from Text File, Extract from Excel, Convert to Google Doc1 | # Watch Trigger (Drive) - Files Created. Multiple files uploaded to monitored folder -> processing |
| Extract PDF Text | ExtractFromFile (PDF) | Extracts text from PDF files | Switch2 (PDF output) | Insert into Supabase Vectorstore1 | |
| Extract from Text File | ExtractFromFile (Text) | Extracts text from Google Docs files | Switch2 (Text File output) | Insert into Supabase Vectorstore1 | |
| Extract from Excel | ExtractFromFile (XLSX) | Extracts data from Excel spreadsheets | Switch2 (Excel output) | Aggregate1 | |
| Convert to Google Doc1 | HTTP Request (Google Drive API) | Converts Word docs to Google Docs | Switch2 (Windows Doc outputs) | Delete File | |
| Delete File | Google Drive | Deletes original Word files after conversion | Convert to Google Doc1 | Loop Over Items | |
| Aggregate1 | Aggregate | Aggregates extracted Excel data | Extract from Excel | Summarize1 | |
| Summarize1 | Summarize | Concatenates aggregated data for embedding insertion | Aggregate1 | Insert into Supabase Vectorstore1 | |
| Download File1 | Google Drive | Downloads files with Google Docs conversion option | Loop Over Items | Loop Over Items (feedback) | # Watch Trigger (Drive) - Files Created. Multiple files uploaded to monitored folder -> processing |
| Recursive Character Text Splitter | LangChain Text Splitter (Recursive) | Splits large texts into manageable chunks | (Implied from extraction) | Enhanced Default Data Loader1 | |
| Enhanced Default Data Loader1 | LangChain Document Data Loader | Prepares documents with metadata for embedding | Recursive Character Text Splitter | Insert into Supabase Vectorstore1 | |
| Embeddings OpenAI1 | LangChain OpenAI Embeddings | Generates vector embeddings | Enhanced Default Data Loader1 | Insert into Supabase Vectorstore1 | |
| Insert into Supabase Vectorstore1 | LangChain VectorStore Supabase | Inserts vectors and metadata into Supabase | Summarize1, Extract PDF Text, Extract from Text File, Embeddings OpenAI1, Enhanced Default Data Loader1 | None | |
| Sticky Note1 | Sticky Note | Workflow overview note | None | None | # Watch Trigger (Drive) - Files Created. Multiple files uploaded to monitored folder -> processing |
| Sticky Note | Sticky Note | Workflow credit note | None | None | Credits Nate Herk. Link: https://www.skool.com/ai-automation-society-plus/about |
4. Reproducing the Workflow from Scratch
-
Create Google Drive Trigger node:
- Name:
File Created - Event:
fileCreated - Poll interval: every minute
- Folder to watch: Google Drive folder ID
1M9h9OnDSWa0kV7_Yj7fdSFHhM57pMPX8 - Credentials: Google Drive OAuth2 account
- Name:
-
Add Set node:
- Name:
Set File ID1 - Parameters:
file_id={{$('File Created').item.json.id}}file_type={{$('File Created').item.json.mimeType}}
- Connect
File Created→Set File ID1
- Name:
-
Add SplitInBatches node:
- Name:
Loop Over Items - Default batch size
- Connect
Set File ID1→Loop Over Items
- Name:
-
Add Switch node:
- Name:
Switch2 - Field to evaluate:
{{$('Set File ID1').item.json.file_type}} - Rules:
- PDF: equals
application/pdf - Google Docs: equals
application/vnd.google-apps.document - Excel: equals
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet - Windows Word Doc (3 rules for MIME types):
application/vnd.openxmlformats-officedocument.wordprocessingml.documentapplication/mswordapplication/vnd.ms-word
- PDF: equals
- Connect
Loop Over Items→Switch2
- Name:
-
Add ExtractFromFile nodes for each type:
Extract PDF Text(operation: pdf)Extract from Text File(operation: text)Extract from Excel(operation: xlsx)- Connect respective outputs from
Switch2to these extraction nodes
-
Add HTTP Request node for conversion:
- Name:
Convert to Google Doc1 - Method: POST
- URL:
https://www.googleapis.com/drive/v3/files/{{ $('Set File ID1').item.json.file_id }}/copy - Body parameters:
name={{ $('Set File ID1').item.json.name }}mimeType=application/vnd.google-apps.document
- Credentials: Google Drive OAuth2
- Connect Windows Doc outputs from
Switch2to this node
- Name:
-
Add Google Drive node:
- Name:
Delete File - Operation:
deleteFile - File ID:
{{$('Set File ID1').item.json.file_id}} - Credentials: Google Drive OAuth2
- Connect
Convert to Google Doc1→Delete File
- Name:
-
Add Aggregate node:
- Name:
Aggregate1 - Operation: aggregate all item data
- Connect
Extract from Excel→Aggregate1
- Name:
-
Add Summarize node:
- Name:
Summarize1 - Field to summarize: concatenate
datafields - Connect
Aggregate1→Summarize1
- Name:
-
Add LangChain Recursive Character Text Splitter node:
- Name:
Recursive Character Text Splitter - Chunk size: 2000 characters
- Chunk overlap: 200 characters
- Split code type: markdown
- Connect extracted text outputs (implied connection)
- Name:
-
Add LangChain Document Data Loader node:
- Name:
Enhanced Default Data Loader1 - Metadata fields:
- file_id, version (v1), creator, created_at, last_modified, folder_path ("DOCUMENTS"), file_name, file_extension
- JSON data: Expression to use extracted or concatenated data
- Connect
Recursive Character Text Splitter→Enhanced Default Data Loader1
- Name:
-
Add LangChain OpenAI Embeddings node:
- Name:
Embeddings OpenAI1 - Credentials: OpenAI API key
- Connect
Enhanced Default Data Loader1→Embeddings OpenAI1
- Name:
-
Add LangChain VectorStore Supabase node:
- Name:
Insert into Supabase Vectorstore1 - Mode: insert
- Table name:
documents - Credentials: Supabase API
- Connect outputs from:
Summarize1Extract PDF TextExtract from Text FileEmbeddings OpenAI1Enhanced Default Data Loader1
- Name:
-
Add Google Drive Download node (optional):
- Name:
Download File1 - Operation: download with Google Docs conversion to plain text
- Credentials: Google Drive OAuth2
- Connect
Loop Over Items→Download File1→ back toLoop Over Items(feedback loop)
- Name:
-
Add Sticky Notes for documentation:
Sticky Note1with workflow overview content near input nodesSticky Notecrediting Nate Herk with link: https://www.skool.com/ai-automation-society-plus/about
5. General Notes & Resources
| Note Content | Context or Link |
|---|---|
| Workflow triggers on new files uploaded to a designated Google Drive folder and processes them | Workflow Trigger Description (Sticky Note1) |
| Original workflow by Nate Herk — Join his community for more AI automation resources | https://www.skool.com/ai-automation-society-plus/about |
| Requires Google Drive OAuth2 credentials configured with access to monitored folders | Credential setup note |
| Requires valid OpenAI API key for embeddings generation | Credential setup note |
Requires Supabase API key with permissions to write to documents vectorstore table |
Credential setup note |
| Supports multiple document formats: PDF, Google Docs, Excel, Word (Office Open XML and legacy) | File type handling note |
| Converts Word documents to Google Docs format for uniform text extraction | Conversion node rationale |
| Text splitting uses recursive character method with markdown split code | Optimizes input size for embedding; reduces token overflow risk |
| Aggregation and summarization applied for Excel extraction to concatenate cell data | Data preparation step |
| Deletes original Word files after conversion to avoid duplicates in Google Drive | Cleanup step to reduce clutter |
Disclaimer:
The text provided is derived solely from an automated n8n workflow implementation. The workflow complies fully with current content policies and contains no illegal, offensive, or protected elements. All data handled is legal and publicly accessible.