klbr/n8nworkflows.xyz

mirror of https://github.com/khoaliber/n8nworkflows.xyz.git synced 2026-04-27 00:31:06 +00:00

nusquama db2e018811 creation

2025-11-12 15:57:48 +01:00

Create RAG Vector Database from Google Drive Documents using Gemini & Supabase

https://n8nworkflows.xyz/workflows/create-rag-vector-database-from-google-drive-documents-using-gemini---supabase-10651

1. Workflow Overview

This workflow automates the process of creating a Retrieval-Augmented Generation (RAG) vector database from documents stored in a specified Google Drive folder. It is designed for semantic search and knowledge retrieval applications leveraging embeddings generated by Google Gemini and stored in a Supabase-managed Postgres vector database.

Target Use Cases:

Building vector embeddings from Google Drive documents for RAG applications.
Enabling semantic search over company or personal documents stored in Google Drive.
Automating document ingestion and vector storage for AI-powered retrieval.

Logical Blocks:

1.1 Input Reception: Trigger the workflow with a Google Drive folder URL input.
1.2 Drive Folder ID Extraction: Parse the folder or file ID from the input URL to use with Google Drive API.
1.3 Database Initialization: Set up Supabase vector storage by creating the necessary Postgres table and vector search function, resetting any previous data.
1.4 Fetch Drive Files: List all files and folders inside the specified Drive folder using the extracted ID.
1.5 Iterative Processing Loop: Process each file one-by-one through a batch splitter.
1.6 Download and Document Loading: Download each file and load its content for embedding.
1.7 Embedding Generation: Generate 768-dimensional embeddings for each document using Google Gemini embeddings API.
1.8 Vector Storage: Insert the document content, metadata, and embeddings into the Supabase vector table for semantic search.

2. Block-by-Block Analysis

1.1 Input Reception

Overview:
Starts the workflow when called externally, accepting a JSON input containing the Google Drive folder URL.
Nodes Involved:
- When Executed by Another Workflow
- Sticky Note (Trigger explanation)
Node Details:
- When Executed by Another Workflow
  - Type: Execute Workflow Trigger
  - Configuration: Accepts JSON input example with key Drive_Folder_link holding the Google Drive folder URL.
  - Input/Output: Receives external trigger input → outputs JSON with folder URL.
  - Edge Cases: The node does not validate the input structure, so missing or malformed input JSON only surfaces as a failure in a downstream node.
- Sticky Note
  - Content: Explains trigger node usage and input format.
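
Because the trigger performs no validation, a small guard at the start of the workflow can fail fast with a readable message. The following is a minimal sketch, not part of the original workflow: `validateTriggerInput` is a hypothetical helper, and the restriction to the `drive.google.com` host is an assumption about which links callers send.

```javascript
// Hypothetical helper (not in the original workflow): checks the trigger payload
// before any Drive or database call is made.
function validateTriggerInput(payload) {
  const link = payload && payload.Drive_Folder_link;
  if (typeof link !== 'string' || link.trim() === '') {
    throw new Error('Missing "Drive_Folder_link" in workflow input');
  }

  let parsed;
  try {
    parsed = new URL(link.trim());
  } catch (err) {
    throw new Error(`"Drive_Folder_link" is not a valid URL: ${link}`);
  }

  // Assumption: only drive.google.com links are expected by this workflow.
  if (parsed.hostname !== 'drive.google.com') {
    throw new Error(`Expected a drive.google.com link, got host "${parsed.hostname}"`);
  }
  return parsed.href;
}
```

In an n8n Code node this could be called as `validateTriggerInput($input.first().json)`, assuming the global `URL` class is available in that runtime.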

1.2 Drive Folder ID Extraction

Overview:
Parses the provided Google Drive URL to extract the folder or file ID required for Google Drive API calls.
Nodes Involved:
- Code in JavaScript
- Sticky Note (Regex extraction explanation)
Node Details:
- Code in JavaScript
  - Type: Code Node (JavaScript)
  - Configuration:
    - Extracts folder or file ID from the URL using regex patterns matching /folders/{id} or /file/d/{id}.
    - Outputs JSON with originalInput, folderId, and driveId; both ID fields hold the same extracted value.
  - Key Expressions: Custom regex parsing logic within the JavaScript code.
  - Connections: Input from trigger node → output to initialize DB node.
  - Edge Cases:
    - If the URL format is unexpected or the ID is missing, folderId and driveId are null, which is likely to cause failures downstream.
    - The node performs no validation or error handling for invalid IDs.
- Sticky Note
  - Content: Explains the regex extraction logic for folder/file ID.
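
The null-ID edge case above can be turned into an explicit error. This sketch reuses the two regex patterns from the workflow's Code node; `requireDriveId` is a hypothetical wrapper added here for illustration and does not exist in the original workflow.

```javascript
// Same patterns as the workflow's Code node: /folders/{id} or /file/d/{id}.
function getDriveId(url) {
  const folderMatch = url.match(/\/folders\/([a-zA-Z0-9_-]+)/);
  const fileMatch = url.match(/\/file\/d\/([a-zA-Z0-9_-]+)/);
  return folderMatch ? folderMatch[1] : (fileMatch ? fileMatch[1] : null);
}

// Hypothetical wrapper: throws instead of passing null to later nodes.
function requireDriveId(url) {
  const id = getDriveId(String(url || ''));
  if (id === null) {
    throw new Error(`Could not extract a Drive folder or file ID from: ${url}`);
  }
  return id;
}
```

Query strings such as `?usp=sharing` are not part of the captured ID, because the character class stops at the first character outside `[a-zA-Z0-9_-]`.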

1.3 Database Initialization

Overview:
Resets and prepares the Supabase Postgres database by dropping and recreating the documents table, enabling the pgvector extension, and defining a vector similarity search function.
Nodes Involved:
- Execute a SQL query
- Sticky Note (DB initialization explanation)
Node Details:
- Execute a SQL query
  - Type: Postgres node
  - Configuration:
    - Executes multi-statement SQL: drops the documents table if exists, installs vector extension, creates documents table with id, content, metadata, and 768-dimensional vector column embedding.
    - Defines a SQL function match_documents to perform similarity search using the <=> operator for vector distance.
  - Credentials: Uses configured Postgres/Supabase credentials.
  - Input/Output: Receives input from JS code node → outputs to file/folder search node.
  - Edge Cases:
    - Drops existing table, so all previous data is lost on each run.
    - SQL command failures (permissions, extension availability) can break workflow.
  - Version: Requires Postgres version supporting pgvector extension.
- Sticky Note
  - Content: Warns about table drop and explains database setup.

1.4 Fetch Drive Files

Overview:
Retrieves the list of files and folders inside the specified Google Drive folder using the Drive API and the extracted folder ID.
Nodes Involved:
- Search files and folders (Google Drive node)
- Sticky Notes (List files explanation)
Node Details:
- Search files and folders
  - Type: Google Drive node
  - Configuration:
    - Resource: fileFolder
    - Filter: By folderId extracted from JS code node (dynamic expression)
    - Operation: List files/folders within the target folder.
  - Credentials: Google Drive OAuth2 credentials.
  - Input/Output: Receives from DB init node → outputs to batch loop node.
  - Edge Cases:
    - Empty folders produce no items, causing empty processing loops.
    - API authentication failures or permission errors may halt workflow.
    - Large folder contents may require pagination (not explicitly handled).
- Sticky Notes
  - Content: Explains that this node lists files in the specified Drive folder.
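
Since the listing returns folders as well as files, it may help to drop folder entries before the download loop. A minimal sketch, assuming each item carries a `mimeType` field (this may need to be requested explicitly in the node's field options); `keepDownloadableFiles` is a hypothetical helper, while the folder MIME type is the one Google Drive uses.

```javascript
// MIME type that Google Drive assigns to folders.
const FOLDER_MIME_TYPE = 'application/vnd.google-apps.folder';

// Hypothetical helper: keeps items that have an id and are not folders.
// Items without a mimeType field pass through unchanged.
function keepDownloadableFiles(items) {
  return items.filter(
    (item) => item.json && item.json.id && item.json.mimeType !== FOLDER_MIME_TYPE
  );
}
```

This does not address pagination, which would still need to be handled by the listing node itself.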

1.5 Iterative Processing Loop

Overview:
Processes each file in batches (default batch size 1) to sequentially download and embed the content.
Nodes Involved:
- Loop Over Items (SplitInBatches node)
- Sticky Note: none specific to this block
Node Details:
- Loop Over Items
  - Type: SplitInBatches
  - Configuration: Default batch size (1) to process files serially.
  - Input/Output: Input from Google Drive file list → outputs to Download File node.
  - Edge Cases: Files are processed serially with no parallelization configured, so a large number of files slows the run.
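
Conceptually, the loop hands items downstream in fixed-size groups. The sketch below illustrates that idea in plain JavaScript; `splitIntoBatches` is a hypothetical stand-in and is not how the SplitInBatches node is implemented internally.

```javascript
// Hypothetical illustration of batch splitting; the default size of 1 mirrors
// the workflow's configuration.
function splitIntoBatches(items, batchSize = 1) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

With a batch size of 1, three files yield three batches, and each batch goes through download, embedding, and insert before the next one starts.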

1.6 Download and Document Loading

Overview:
Downloads each file from Google Drive, converting Google Docs to plain text, then prepares document content for vector embedding.
Nodes Involved:
- Download File (Google Drive download node)
- Default Data Loader2 (Langchain document loader)
- Sticky Notes (Download and loader explanations)
Node Details:
- Download File
  - Type: Google Drive node
  - Configuration:
    - Operation: Download file by ID
    - Google Docs conversion: Converts docs to text/plain for embedding.
  - Credentials: Google Drive OAuth2 credentials.
  - Input/Output: Input from batch loop node → output to document loader.
  - Edge Cases:
    - Unsupported file formats or corrupted files may cause download or conversion failure.
    - Large files may timeout or cause memory issues.
- Default Data Loader2
  - Type: Langchain Document Default Data Loader
  - Configuration: Extracts text from binary data to feed into embeddings node.
  - Input/Output: Input from Download File node → outputs to embeddings node.
  - Edge Cases:
    - Unsupported binary types may cause extraction failure.

1.7 Embedding Generation

Overview:
Converts the extracted document text into a 768-dimensional vector embedding using the Google Gemini embeddings API.
Nodes Involved:
- Embeddings Google Gemini4 (Langchain embeddings node)
- Sticky Note (Embedding explanation)
Node Details:
- Embeddings Google Gemini4
  - Type: Langchain Google Gemini embeddings node
  - Configuration: Uses default parameters targeting Google Gemini text-embedding-004 model.
  - Credentials: Google Gemini API key (Google Palm API).
  - Input/Output: Input from document loader → output to vector storage node.
  - Edge Cases:
    - API rate limits or key invalidation can cause failures.
    - Text exceeding model limits may be truncated or rejected.
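
The `embedding` column is declared as `vector(768)`, so a vector of any other length would be rejected at insert time. A minimal sketch of a pre-insert check follows; `checkEmbedding` is a hypothetical helper and is not part of the workflow, which relies on the embeddings node returning the right shape.

```javascript
// Matches the vector(768) column defined in the database initialization step.
const EXPECTED_DIMENSIONS = 768;

// Hypothetical helper: verifies length and numeric content before storage.
function checkEmbedding(vector, expected = EXPECTED_DIMENSIONS) {
  if (!Array.isArray(vector)) {
    throw new TypeError('Embedding must be an array of numbers');
  }
  if (vector.length !== expected) {
    throw new RangeError(`Expected ${expected} dimensions, got ${vector.length}`);
  }
  if (!vector.every(Number.isFinite)) {
    throw new TypeError('Embedding contains a non-finite value');
  }
  return vector;
}
```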

1.8 Vector Storage

Overview:
Inserts the document content, metadata, and vector embedding into the Supabase Postgres vector table for future semantic search queries.
Nodes Involved:
- Insert into Supabase Vectorstore (Langchain vector store node)
- Sticky Note (Storage explanation)
Node Details:
- Insert into Supabase Vectorstore
  - Type: Langchain vector store Supabase node
  - Configuration:
    - Mode: Insert
    - Table: documents
    - Query function for search: match_documents
  - Credentials: Supabase API credentials.
  - Input/Output: Accepts embeddings and document data from embedding and loader nodes → loops back to batch node for next item.
  - Edge Cases:
    - Insert failures due to DB connection issues or schema mismatches.
    - Duplicates and failed inserts are not explicitly handled.

3. Summary Table

Node Name	Node Type	Functional Role	Input Node(s)	Output Node(s)	Sticky Note
When Executed by Another Workflow	Execute Workflow Trigger	Workflow trigger and input reception	-	Code in JavaScript	Trigger Node - Starts workflow when called from another n8n workflow. Accepts Drive folder URL as input.
Code in JavaScript	Code (JavaScript)	Extract folder/file ID from Drive URL	When Executed by Another Workflow	Execute a SQL query	Extract Folder ID - Parses Google Drive URL using regex to extract folder/file ID for API calls.
Execute a SQL query	Postgres	Initialize database & vector storage	Code in JavaScript	Search files and folders	Initialize Database - Creates Supabase vector table with pgvector extension and match_documents search function. ⚠️ Drops existing table!
Search files and folders	Google Drive	List all files/folders in Drive folder	Execute a SQL query	Loop Over Items	List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID.
Loop Over Items	SplitInBatches	Loop over each file for processing	Search files and folders	Download File
Download File	Google Drive	Download each file and convert to text	Loop Over Items	Default Data Loader2	List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID.
Default Data Loader2	Langchain Document Loader	Extract text content from binary file	Download File	Embeddings Google Gemini4	Store Embeddings - Generates 768-dim vectors via Gemini and inserts documents into Supabase for semantic search.
Embeddings Google Gemini4	Langchain Embeddings Google Gemini	Generate 768-dim embeddings	Default Data Loader2	Insert into Supabase Vectorstore	AI Embeddings - Converts text to 768-dimensional vectors using Google Gemini text-embedding-004 model.
Insert into Supabase Vectorstore	Langchain Vector Store Supabase	Insert document and embedding into DB	Embeddings Google Gemini4, Download File	Loop Over Items	Store Embeddings - Generates 768-dim vectors via Gemini and inserts documents into Supabase for semantic search.
Sticky Note1	Sticky Note	Documentation overview	-	-	# 📁 Drive to Supabase Vector Store for Study RAG - Processes Google Drive folder files into Supabase vector embeddings for RAG applications.
Sticky Note	Sticky Note	Documentation Trigger explanation	-	-	Trigger Node - Starts workflow when called from another n8n workflow. Accepts Drive folder URL as input.
Sticky Note2	Sticky Note	Documentation folder ID extraction	-	-	Extract Folder ID - Parses Google Drive URL using regex to extract folder/file ID for API calls.
Sticky Note3	Sticky Note	Documentation DB initialization	-	-	Initialize Database - Creates Supabase vector table with pgvector extension and match_documents search function. ⚠️ Drops existing table!
Sticky Note4	Sticky Note	Documentation Drive files listing	-	-	List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID.
Sticky Note5	Sticky Note	Documentation Drive files listing	-	-	List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID.
Sticky Note6	Sticky Note	Documentation embedding and storage	-	-	Store Embeddings - Generates 768-dim vectors via Gemini and inserts documents into Supabase for semantic search.

4. Reproducing the Workflow from Scratch

Create Trigger Node:
- Add "Execute Workflow Trigger" node named When Executed by Another Workflow.
- Configure with JSON example input:
```
{
  "Drive_Folder_link": "https://drive.google.com/drive/folders/example"
}
```
- This node will start the workflow and accept the Drive folder URL.

Add JavaScript Code Node:

- Name: Code in JavaScript
- Purpose: Extract folderId and driveId from the provided URL.
- Use the following JS logic:

```
// Folder URL passed in by the trigger node.
const driveUrl = $input.first().json.Drive_Folder_link;

// Returns the ID from /folders/{id} or /file/d/{id}, or null if neither matches.
function getDriveId(url) {
  const folderMatch = url.match(/\/folders\/([a-zA-Z0-9_-]+)/);
  const fileMatch = url.match(/\/file\/d\/([a-zA-Z0-9_-]+)/);
  return folderMatch ? folderMatch[1] : (fileMatch ? fileMatch[1] : null);
}

return items.map(item => {
  // Prefer a chatInput field if present, otherwise fall back to the trigger URL.
  const chatInput = item.json.chatInput || driveUrl || '';
  const driveId = getDriveId(chatInput);
  return {
    json: {
      originalInput: chatInput,
      folderId: driveId,
      driveId: driveId
    }
  };
});
```

Connect When Executed by Another Workflow → Code in JavaScript.

Add Postgres Node for DB Initialization:

- Name: Execute a SQL query
- Credentials: Configure with Supabase or Postgres credentials.
- Query (multi-statement):

```
DROP TABLE IF EXISTS documents CASCADE;
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
  id bigserial PRIMARY KEY,
  content text,
  metadata jsonb,
  embedding vector(768)
);
CREATE OR REPLACE FUNCTION match_documents(
  query_embedding vector(768),
  match_count int DEFAULT NULL,
  filter jsonb DEFAULT '{}'::jsonb
)
RETURNS TABLE (
  id bigint,
  content text,
  metadata jsonb,
  similarity double precision
)
LANGUAGE sql
AS $$
  SELECT
    d.id,
    d.content,
    d.metadata,
    1 - (d.embedding <=> query_embedding) AS similarity
  FROM documents d
  WHERE (filter = '{}'::jsonb OR d.metadata @> filter)
  ORDER BY d.embedding <=> query_embedding
  LIMIT match_count;
$$;
```

Connect Code in JavaScript → Execute a SQL query.
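
To make the query logic concrete, the sketch below mirrors `match_documents` in plain JavaScript over an in-memory array. In pgvector, `<=>` is cosine distance, so `1 - distance` is cosine similarity. Both functions are hypothetical illustrations: the metadata filter is only a shallow approximation of the jsonb `@>` containment operator (top-level keys with scalar values), and zero vectors are not handled.

```javascript
// Cosine similarity, i.e. 1 - (a <=> b) in pgvector terms.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// In-memory analogue of match_documents: filter, score, sort, limit.
function matchDocuments(rows, queryEmbedding, matchCount = Infinity, filter = {}) {
  const filterEntries = Object.entries(filter);
  return rows
    // Shallow stand-in for "metadata @> filter".
    .filter((row) =>
      filterEntries.every(([key, value]) => row.metadata && row.metadata[key] === value))
    .map((row) => ({
      id: row.id,
      content: row.content,
      metadata: row.metadata,
      similarity: cosineSimilarity(row.embedding, queryEmbedding),
    }))
    // Highest similarity first, equivalent to ascending cosine distance.
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, matchCount);
}
```

Leaving `matchCount` at its default returns every matching row, which corresponds to `LIMIT NULL` in the SQL version.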

Add Google Drive Node to List Files:
- Name: Search files and folders
- Resource: fileFolder
- Filter: Set folderId with expression referencing Code in JavaScript output folderId field.
- Credentials: Configure Google Drive OAuth2 credentials.
- Connect Execute a SQL query → Search files and folders.

Add SplitInBatches Node:
- Name: Loop Over Items
- Default batch size (1) suffices.
- Connect Search files and folders → Loop Over Items.

Add Google Drive Download Node:
- Name: Download File
- Operation: Download file by ID.
- File ID set to current item id dynamically.
- Enable Google Docs conversion to text/plain.
- Credentials: Google Drive OAuth2 credentials (same as above).
- Connect Loop Over Items → Download File.

Add Langchain Document Loader Node:
- Name: Default Data Loader2
- Data Type: binary
- Connect Download File → Default Data Loader2 (to AI document input).

Add Langchain Google Gemini Embeddings Node:
- Name: Embeddings Google Gemini4
- Credentials: Google Palm API key configured for Google Gemini embeddings.
- Connect Default Data Loader2 → Embeddings Google Gemini4 (AI document → AI embedding).

Add Langchain Supabase Vector Store Node:
- Name: Insert into Supabase Vectorstore
- Mode: Insert
- Table Name: documents
- Query Name: match_documents
- Credentials: Supabase API credentials.
- Connect Embeddings Google Gemini4 → Insert into Supabase Vectorstore (AI embedding input).
- Also connect Download File → Insert into Supabase Vectorstore (main input) to pass document content and metadata.
- Connect Insert into Supabase Vectorstore → Loop Over Items (to continue loop).

Add Sticky Notes (Optional):
- Add sticky notes as described for documentation and clarity within the workflow editor.

5. General Notes & Resources

Note Content	Context or Link
Workflow creates vector embeddings for RAG applications from Google Drive files.	Overview sticky note in workflow titled "📁 Drive to Supabase Vector Store for Study RAG"
Requires Google Drive OAuth2 credentials and Google Gemini API key for embedding generation.	Credential requirements across Google Drive and Google Palm API nodes.
Supabase Postgres must support pgvector extension for vector storage and search functionality.	Database initialization node setup and SQL code.
Workflow input format: `{ "Drive_Folder_link": "your_drive_url" }`	Input JSON example in trigger node.
Embeddings generated are 768-dimensional vectors using Google Gemini text-embedding-004 model.	Embeddings node and sticky note documentation.
Google Docs files are automatically converted to plain text for embedding.	Download File node Google Docs conversion setting.
Workflow drops and recreates the `documents` table on each execution, so data from previous runs is not preserved.	Important caution in DB init sticky note.
Potential failure points include invalid Drive URLs, API authentication failures, and DB errors.	Consider adding error handling and retries in production environments.
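
The suggestion to add retries could be sketched as exponential backoff around any failing call. The original workflow contains no such logic: `backoffDelays` and `withRetry` are hypothetical helpers, and the default delay values are arbitrary assumptions.

```javascript
// Hypothetical helper: delays double on each attempt, capped at maxMs.
function backoffDelays(retries, baseMs = 500, maxMs = 30000) {
  return Array.from({ length: retries }, (_, attempt) =>
    Math.min(baseMs * 2 ** attempt, maxMs));
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: runs an async task, retrying after each failure until
// the delay schedule is exhausted, then rethrows the last error.
async function withRetry(task, { retries = 3, baseMs = 500, sleep = defaultSleep } = {}) {
  const delays = backoffDelays(retries, baseMs);
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= delays.length) {
        throw err;
      }
      await sleep(delays[attempt]);
    }
  }
}
```

A call such as `withRetry(() => someApiCall())` would make up to four attempts in total with the defaults; `someApiCall` is a placeholder for whichever request is prone to rate limiting.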

Disclaimer:
The provided text is derived exclusively from an automated workflow created with n8n, a tool for integration and automation. This processing strictly adheres to content policies and contains no illegal, offensive, or protected elements. All manipulated data is legal and public.
