Create RAG Vector Database from Google Drive Documents using Gemini & Supabase
1. Workflow Overview
This workflow automates the process of creating a Retrieval-Augmented Generation (RAG) vector database from documents stored in a specified Google Drive folder. It is designed for semantic search and knowledge retrieval applications leveraging embeddings generated by Google Gemini and stored in a Supabase-managed Postgres vector database.
Target Use Cases:
- Building vector embeddings from Google Drive documents for RAG applications.
- Enabling semantic search over company or personal documents stored in Google Drive.
- Automating document ingestion and vector storage for AI-powered retrieval.
Logical Blocks:
- 1.1 Input Reception: Trigger the workflow with a Google Drive folder URL input.
- 1.2 Drive Folder ID Extraction: Parse the folder or file ID from the input URL to use with Google Drive API.
- 1.3 Database Initialization: Set up Supabase vector storage by creating the necessary Postgres table and vector search function, resetting any previous data.
- 1.4 Fetch Drive Files: List all files and folders inside the specified Drive folder using the extracted ID.
- 1.5 Iterative Processing Loop: Process each file one-by-one through a batch splitter.
- 1.6 Download and Document Loading: Download each file and load its content for embedding.
- 1.7 Embedding Generation: Generate 768-dimensional embeddings for each document using Google Gemini embeddings API.
- 1.8 Vector Storage: Insert the document content, metadata, and embeddings into the Supabase vector table for semantic search.
2. Block-by-Block Analysis
1.1 Input Reception
Overview:
Starts the workflow when called externally, accepting a JSON input containing the Google Drive folder URL.
Nodes Involved:
- When Executed by Another Workflow
- Sticky Note (Trigger explanation)
Node Details:
- When Executed by Another Workflow
  - Type: Execute Workflow Trigger
  - Configuration: Accepts JSON input example with key Drive_Folder_link holding the Google Drive folder URL.
  - Input/Output: Receives external trigger input → outputs JSON with folder URL.
  - Edge Cases: Missing or malformed input JSON will cause downstream failures. No validation on input structure.
- Sticky Note
  - Content: Explains trigger node usage and input format.
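Because the trigger performs no validation, a guard placed before the ID extraction step could reject bad payloads early. The following is a minimal sketch, not part of the original workflow: the helper name validateTriggerInput and the restriction to the drive.google.com host are assumptions made for illustration.

```javascript
// Hypothetical helper (not in the original workflow): checks the trigger
// payload and returns the normalized URL, or throws a descriptive error.
function validateTriggerInput(payload) {
  const link = payload && payload.Drive_Folder_link;
  if (typeof link !== 'string' || link.trim() === '') {
    throw new Error('Drive_Folder_link must be a non-empty string');
  }

  let url;
  try {
    url = new URL(link.trim());
  } catch (err) {
    throw new Error(`Drive_Folder_link is not a valid URL: ${link}`);
  }

  // Assumption: only drive.google.com links are expected by this workflow.
  if (url.hostname !== 'drive.google.com') {
    throw new Error(`Unexpected host: ${url.hostname}`);
  }
  return url.href;
}
```

Inside an n8n Code node, the payload would typically be read from the incoming item's json field before calling a helper like this.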
1.2 Drive Folder ID Extraction
Overview:
Parses the provided Google Drive URL to extract the folder or file ID required for Google Drive API calls.
Nodes Involved:
- Code in JavaScript
- Sticky Note (Regex extraction explanation)
Node Details:
- Code in JavaScript
  - Type: Code Node (JavaScript)
  - Configuration:
    - Extracts folder or file ID from the URL using regex patterns matching /folders/{id} or /file/d/{id}.
    - Outputs JSON with folderId and driveId.
  - Key Expressions: Custom regex parsing logic within the JavaScript code.
  - Connections: Input from trigger node → output to initialize DB node.
  - Edge Cases:
    - If URL format is unexpected or missing the ID, output will be null, potentially causing failure downstream.
    - No validation or error handling for invalid IDs.
- Sticky Note
  - Content: Explains the regex extraction logic for folder/file ID.
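The null-output edge case could be turned into an explicit failure. The sketch below uses the same two URL patterns as the workflow's code but throws when neither matches; the function name extractDriveIdOrThrow is hypothetical, and failing fast is a design choice for illustration rather than the workflow's actual behavior.

```javascript
// Same patterns the workflow matches: /folders/{id} and /file/d/{id}.
const DRIVE_ID_PATTERNS = [
  /\/folders\/([a-zA-Z0-9_-]+)/,
  /\/file\/d\/([a-zA-Z0-9_-]+)/,
];

// Hypothetical strict variant: returns the first matching ID, or throws
// instead of passing null on to the database and Drive nodes.
function extractDriveIdOrThrow(url) {
  for (const pattern of DRIVE_ID_PATTERNS) {
    const match = String(url).match(pattern);
    if (match) return match[1];
  }
  throw new Error(`No Drive folder or file ID found in: ${url}`);
}
```

Failing here stops the run before the database initialization step, which matters because that step drops the existing table.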
1.3 Database Initialization
Overview:
Resets and prepares the Supabase Postgres database by dropping and recreating the documents table, enabling the pgvector extension, and defining a vector similarity search function.
Nodes Involved:
- Execute a SQL query
- Sticky Note (DB initialization explanation)
Node Details:
- Execute a SQL query
  - Type: Postgres node
  - Configuration:
    - Executes multi-statement SQL: drops the documents table if exists, installs vector extension, creates documents table with id, content, metadata, and 768-dimensional vector column embedding.
    - Defines a SQL function match_documents to perform similarity search using the <=> operator for vector distance.
  - Credentials: Uses configured Postgres/Supabase credentials.
  - Input/Output: Receives input from JS code node → outputs to file/folder search node.
  - Edge Cases:
    - Drops existing table, so all previous data is lost on each run.
    - SQL command failures (permissions, extension availability) can break workflow.
  - Version: Requires Postgres version supporting pgvector extension.
- Sticky Note
  - Content: Warns about table drop and explains database setup.
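In pgvector, the <=> operator is cosine distance, so the similarity value that match_documents returns, 1 - (embedding <=> query_embedding), is the cosine similarity between the stored vector and the query vector. The plain JavaScript sketch below shows that quantity for illustration only; the function name is hypothetical, and the database computes this itself.

```javascript
// Illustrative only: the value match_documents reports as "similarity".
// Returns a number in [-1, 1]; higher means closer in direction.
// Note: a zero-length (all-zero) vector yields NaN in this sketch.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Ordering by the distance ascending, as the function's ORDER BY clause does, is equivalent to ordering by this similarity descending.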
1.4 Fetch Drive Files
Overview:
Retrieves the list of files and folders inside the specified Google Drive folder using the Drive API and the extracted folder ID.
Nodes Involved:
- Search files and folders (Google Drive node)
- Sticky Notes (List files explanation)
Node Details:
- Search files and folders
  - Type: Google Drive node
  - Configuration:
    - Resource: fileFolder
    - Filter: By folderId extracted from JS code node (dynamic expression)
    - Operation: List files/folders within the target folder.
  - Credentials: Google Drive OAuth2 credentials.
  - Input/Output: Receives from DB init node → outputs to batch loop node.
  - Edge Cases:
    - Empty folders produce no items, causing empty processing loops.
    - API authentication failures or permission errors may halt workflow.
    - Large folder contents may require pagination (not explicitly handled).
- Sticky Notes
  - Content: Explains that this node lists files in the specified Drive folder.
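Since the listing contains folders as well as files, and a folder cannot be downloaded like a file, a filter between this block and the loop may be useful. The sketch below assumes each listed item exposes a mimeType field, as the Drive API's file resource does; whether the node's output includes that field depends on its configuration, and the helper name is hypothetical.

```javascript
// MIME type that Google Drive uses for folders.
const FOLDER_MIME_TYPE = 'application/vnd.google-apps.folder';

// Hypothetical helper: keeps only entries that are not folders.
// Assumes each entry looks like { id, name, mimeType }.
function keepDownloadableFiles(entries) {
  return entries.filter((entry) => entry.mimeType !== FOLDER_MIME_TYPE);
}
```

Subfolders are skipped rather than traversed in this sketch; recursing into them would need an additional listing call per folder.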
1.5 Iterative Processing Loop
Overview:
Processes each file in batches (default batch size 1) to sequentially download and embed the content.
Nodes Involved:
- Loop Over Items (SplitInBatches node)
- Sticky Note (none specific)
Node Details:
- Loop Over Items
  - Type: SplitInBatches
  - Configuration: Default batch size (1) to process files serially.
  - Input/Output: Input from Google Drive file list → outputs to Download File node.
  - Edge Cases: Large number of files may slow down processing; no parallelization configured.
1.6 Download and Document Loading
Overview:
Downloads each file from Google Drive, converting Google Docs to plain text, then prepares document content for vector embedding.
Nodes Involved:
- Download File (Google Drive download node)
- Default Data Loader2 (Langchain document loader)
- Sticky Notes (Download and loader explanations)
Node Details:
- Download File
  - Type: Google Drive node
  - Configuration:
    - Operation: Download file by ID
    - Google Docs conversion: Converts docs to text/plain for embedding.
  - Credentials: Google Drive OAuth2 credentials.
  - Input/Output: Input from batch loop node → output to document loader.
  - Edge Cases:
    - Unsupported file formats or corrupted files may cause download or conversion failure.
    - Large files may timeout or cause memory issues.
- Default Data Loader2
  - Type: Langchain Document Default Data Loader
  - Configuration: Extracts text from binary data to feed into embeddings node.
  - Input/Output: Input from Download File node → outputs to embeddings node.
  - Edge Cases:
    - Unsupported binary types may cause extraction failure.
1.7 Embedding Generation
Overview:
Converts the extracted document text into a 768-dimensional vector embedding using the Google Gemini embeddings API.
Nodes Involved:
- Embeddings Google Gemini4 (Langchain embeddings node)
- Sticky Note (Embedding explanation)
Node Details:
- Embeddings Google Gemini4
  - Type: Langchain Google Gemini embeddings node
  - Configuration: Uses default parameters targeting Google Gemini text-embedding-004 model.
  - Credentials: Google Gemini API key (Google Palm API).
  - Input/Output: Input from document loader → output to vector storage node.
  - Edge Cases:
    - API rate limits or key invalidation can cause failures.
    - Text exceeding model limits may be truncated or rejected.
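One common way to stay under an embedding model's input limit is to split long text into overlapping chunks before embedding. The sketch below is a generic illustration, not something the workflow is documented to do: the function name is hypothetical, and it measures size in characters as a rough stand-in, whereas model limits are usually expressed in tokens.

```javascript
// Hypothetical helper: splits text into chunks of at most maxChars
// characters, with `overlap` characters shared between neighbours.
function chunkText(text, maxChars, overlap = 0) {
  if (maxChars <= 0) throw new Error('maxChars must be positive');
  if (overlap < 0 || overlap >= maxChars) {
    throw new Error('overlap must be at least 0 and less than maxChars');
  }
  const chunks = [];
  const step = maxChars - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    // Stop once the current chunk already reaches the end of the text.
    if (start + maxChars >= text.length) break;
  }
  return chunks;
}
```

For example, chunkText('abcdefghij', 4, 1) yields the chunks 'abcd', 'defg', and 'ghij'. Each chunk would then be embedded and stored as its own row.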
1.8 Vector Storage
Overview:
Inserts the document content, metadata, and vector embedding into the Supabase Postgres vector table for future semantic search queries.
Nodes Involved:
- Insert into Supabase Vectorstore (Langchain vector store node)
- Sticky Note (Storage explanation)
Node Details:
- Insert into Supabase Vectorstore
  - Type: Langchain vector store Supabase node
  - Configuration:
    - Mode: Insert
    - Table: documents
    - Query function for search: match_documents
  - Credentials: Supabase API credentials.
  - Input/Output: Accepts embeddings and document data from embedding and loader nodes → loops back to batch node for next item.
  - Edge Cases:
    - Insert failures due to DB connection issues or schema mismatches.
    - Handling duplicates or failed inserts is not explicitly handled.
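One schema mismatch worth guarding against is an embedding whose length differs from the vector(768) column. The check below is a hedged sketch with a hypothetical helper name; the vector store node may already surface such errors from the database, so this only illustrates failing earlier with a clearer message.

```javascript
// Matches the vector(768) column defined during database initialization.
const EXPECTED_DIMENSIONS = 768;

// Hypothetical pre-insert check: returns the embedding unchanged if it is
// an array of finite numbers with the expected length, otherwise throws.
function assertEmbeddingShape(embedding, expected = EXPECTED_DIMENSIONS) {
  if (!Array.isArray(embedding)) {
    throw new Error('Embedding must be an array');
  }
  if (embedding.length !== expected) {
    throw new Error(`Expected ${expected} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every((value) => Number.isFinite(value))) {
    throw new Error('Embedding contains non-finite values');
  }
  return embedding;
}
```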
3. Summary Table
| Node Name | Node Type | Functional Role | Input Node(s) | Output Node(s) | Sticky Note |
|---|---|---|---|---|---|
| When Executed by Another Workflow | Execute Workflow Trigger | Workflow trigger and input reception | - | Code in JavaScript | Trigger Node - Starts workflow when called from another n8n workflow. Accepts Drive folder URL as input. |
| Code in JavaScript | Code (JavaScript) | Extract folder/file ID from Drive URL | When Executed by Another Workflow | Execute a SQL query | Extract Folder ID - Parses Google Drive URL using regex to extract folder/file ID for API calls. |
| Execute a SQL query | Postgres | Initialize database & vector storage | Code in JavaScript | Search files and folders | Initialize Database - Creates Supabase vector table with pgvector extension and match_documents search function. ⚠️ Drops existing table! |
| Search files and folders | Google Drive | List all files/folders in Drive folder | Execute a SQL query | Loop Over Items | List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID. |
| Loop Over Items | SplitInBatches | Loop over each file for processing | Search files and folders | Download File | |
| Download File | Google Drive | Download each file and convert to text | Loop Over Items | Default Data Loader2 | List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID. |
| Default Data Loader2 | Langchain Document Loader | Extract text content from binary file | Download File | Embeddings Google Gemini4 | Store Embeddings - Generates 768-dim vectors via Gemini and inserts documents into Supabase for semantic search. |
| Embeddings Google Gemini4 | Langchain Embeddings Google Gemini | Generate 768-dim embeddings | Default Data Loader2 | Insert into Supabase Vectorstore | AI Embeddings - Converts text to 768-dimensional vectors using Google Gemini text-embedding-004 model. |
| Insert into Supabase Vectorstore | Langchain Vector Store Supabase | Insert document and embedding into DB | Embeddings Google Gemini4, Download File | Loop Over Items | Store Embeddings - Generates 768-dim vectors via Gemini and inserts documents into Supabase for semantic search. |
| Sticky Note1 | Sticky Note | Documentation overview | - | - | # 📁 Drive to Supabase Vector Store for Study RAG - Processes Google Drive folder files into Supabase vector embeddings for RAG applications. |
| Sticky Note | Sticky Note | Documentation Trigger explanation | - | - | Trigger Node - Starts workflow when called from another n8n workflow. Accepts Drive folder URL as input. |
| Sticky Note2 | Sticky Note | Documentation folder ID extraction | - | - | Extract Folder ID - Parses Google Drive URL using regex to extract folder/file ID for API calls. |
| Sticky Note3 | Sticky Note | Documentation DB initialization | - | - | Initialize Database - Creates Supabase vector table with pgvector extension and match_documents search function. ⚠️ Drops existing table! |
| Sticky Note4 | Sticky Note | Documentation Drive files listing | - | - | List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID. |
| Sticky Note5 | Sticky Note | Documentation Drive files listing | - | - | List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID. |
| Sticky Note6 | Sticky Note | Documentation embedding and storage | - | - | Store Embeddings - Generates 768-dim vectors via Gemini and inserts documents into Supabase for semantic search. |
4. Reproducing the Workflow from Scratch
- Create Trigger Node:
  - Add "Execute Workflow Trigger" node named When Executed by Another Workflow.
  - Configure with JSON example input:

        { "Drive_Folder_link": "https://drive.google.com/drive/folders/example" }

  - This node will start the workflow and accept the Drive folder URL.
- Add JavaScript Code Node:
  - Name: Code in JavaScript
  - Purpose: Extract folderId and driveId from provided URL.
  - Use the following JS logic:

        // Drive URL passed in by the trigger node.
        const driveUrl = $input.first().json.Drive_Folder_link;

        // Returns the ID from a /folders/{id} or /file/d/{id} URL,
        // or null when neither pattern matches.
        function getDriveId(url) {
          const folderMatch = url.match(/\/folders\/([a-zA-Z0-9_-]+)/);
          const fileMatch = url.match(/\/file\/d\/([a-zA-Z0-9_-]+)/);
          return folderMatch ? folderMatch[1] : (fileMatch ? fileMatch[1] : null);
        }

        // One output item per input item; a chatInput field on the item,
        // if present, takes precedence over the trigger's Drive_Folder_link.
        return items.map(item => {
          const chatInput = item.json.chatInput || driveUrl || '';
          const driveId = getDriveId(chatInput);
          return {
            json: {
              originalInput: chatInput,
              folderId: driveId,
              driveId: driveId
            }
          };
        });

  - Connect When Executed by Another Workflow → Code in JavaScript.
- Add Postgres Node for DB Initialization:
  - Name: Execute a SQL query
  - Credentials: Configure with Supabase or Postgres credentials.
  - Query (multi-statement):

        DROP TABLE IF EXISTS documents CASCADE;
        CREATE EXTENSION IF NOT EXISTS vector;

        CREATE TABLE IF NOT EXISTS documents (
          id bigserial PRIMARY KEY,
          content text,
          metadata jsonb,
          embedding vector(768)
        );

        CREATE OR REPLACE FUNCTION match_documents(
          query_embedding vector(768),
          match_count int DEFAULT NULL,
          filter jsonb DEFAULT '{}'::jsonb
        )
        RETURNS TABLE (
          id bigint,
          content text,
          metadata jsonb,
          similarity double precision
        )
        LANGUAGE sql
        AS $$
          SELECT d.id, d.content, d.metadata,
                 1 - (d.embedding <=> query_embedding) AS similarity
          FROM documents d
          WHERE (filter = '{}'::jsonb OR d.metadata @> filter)
          ORDER BY d.embedding <=> query_embedding
          LIMIT match_count;
        $$;

  - Connect Code in JavaScript → Execute a SQL query.
- Add Google Drive Node to List Files:
  - Name: Search files and folders
  - Resource: fileFolder
  - Filter: Set folderId with expression referencing Code in JavaScript output folderId field.
  - Credentials: Configure Google Drive OAuth2 credentials.
  - Connect Execute a SQL query → Search files and folders.
- Add SplitInBatches Node:
  - Name: Loop Over Items
  - Default batch size (1) suffices.
  - Connect Search files and folders → Loop Over Items.
- Add Google Drive Download Node:
  - Name: Download File
  - Operation: Download file by ID.
  - File ID set to current item id dynamically.
  - Enable Google Docs conversion to text/plain.
  - Credentials: Google Drive OAuth2 credentials (same as above).
  - Connect Loop Over Items → Download File.
- Add Langchain Document Loader Node:
  - Name: Default Data Loader2
  - Data Type: binary
  - Connect Download File → Default Data Loader2 (to AI document input).
- Add Langchain Google Gemini Embeddings Node:
  - Name: Embeddings Google Gemini4
  - Credentials: Google Palm API key configured for Google Gemini embeddings.
  - Connect Default Data Loader2 → Embeddings Google Gemini4 (AI document → AI embedding).
- Add Langchain Supabase Vector Store Node:
  - Name: Insert into Supabase Vectorstore
  - Mode: Insert
  - Table Name: documents
  - Query Name: match_documents
  - Credentials: Supabase API credentials.
  - Connect Embeddings Google Gemini4 → Insert into Supabase Vectorstore (AI embedding input).
  - Also connect Download File → Insert into Supabase Vectorstore (main input) to pass document content and metadata.
  - Connect Insert into Supabase Vectorstore → Loop Over Items (to continue loop).
- Add Sticky Notes (Optional):
  - Add sticky notes as described for documentation and clarity within the workflow editor.
5. General Notes & Resources
| Note Content | Context or Link |
|---|---|
| Workflow creates vector embeddings for RAG applications from Google Drive files. | Overview sticky note in workflow titled "📁 Drive to Supabase Vector Store for Study RAG" |
| Requires Google Drive OAuth2 credentials and Google Gemini API key for embedding generation. | Credential requirements across Google Drive and Google Palm API nodes. |
| Supabase Postgres must support pgvector extension for vector storage and search functionality. | Database initialization node setup and SQL code. |
| Workflow input format: { "Drive_Folder_link": "your_drive_url" } | Input JSON example in trigger node. |
| Embeddings generated are 768-dimensional vectors using Google Gemini text-embedding-004 model. | Embeddings node and sticky note documentation. |
| Google Docs files are automatically converted to plain text for embedding. | Download File node Google Docs conversion setting. |
| Workflow drops and recreates the documents table on each execution - data is not persisted. | Important caution in DB init sticky note. |
| Potential failure points include invalid Drive URLs, API authentication failures, and DB errors. | Consider adding error handling and retries in production environments. |
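For the retries suggested in the last row, exponential backoff is a common pattern for transient failures such as API rate limits. The sketch below is generic JavaScript under stated assumptions: backoffDelays and withRetry are hypothetical helpers, and the delay values are placeholders rather than recommendations from the workflow's author.

```javascript
// Hypothetical helper: delays that double on each attempt, capped at maxMs.
function backoffDelays(attempts, baseMs, maxMs) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(baseMs * 2 ** i, maxMs));
  }
  return delays;
}

// Hypothetical wrapper: runs an async task, waiting delays[i] after the
// i-th failure; rethrows the last error once all delays are used up.
async function withRetry(
  task,
  delays,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
) {
  let lastError;
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < delays.length) await sleep(delays[attempt]);
    }
  }
  throw lastError;
}
```

With backoffDelays(4, 500, 3000), a task would be tried up to five times, waiting 500, 1000, 2000, and 3000 ms between attempts.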
Disclaimer:
The provided text is derived exclusively from an automated workflow created with n8n, a tool for integration and automation. This processing strictly adheres to content policies and contains no illegal, offensive, or protected elements. All manipulated data is legal and public.