Create RAG Vector Database from Google Drive Documents using Gemini & Supabase

https://n8nworkflows.xyz/workflows/create-rag-vector-database-from-google-drive-documents-using-gemini---supabase-10651


1. Workflow Overview

This workflow automates the process of creating a Retrieval-Augmented Generation (RAG) vector database from documents stored in a specified Google Drive folder. It is designed for semantic search and knowledge retrieval applications leveraging embeddings generated by Google Gemini and stored in a Supabase-managed Postgres vector database.

Target Use Cases:

  • Building vector embeddings from Google Drive documents for RAG applications.
  • Enabling semantic search over company or personal documents stored in Google Drive.
  • Automating document ingestion and vector storage for AI-powered retrieval.

Logical Blocks:

  • 1.1 Input Reception: Trigger the workflow with a Google Drive folder URL input.
  • 1.2 Drive Folder ID Extraction: Parse the folder or file ID from the input URL to use with Google Drive API.
  • 1.3 Database Initialization: Set up Supabase vector storage by creating the necessary Postgres table and vector search function, deleting any previously stored data.
  • 1.4 Fetch Drive Files: List all files and folders inside the specified Drive folder using the extracted ID.
  • 1.5 Iterative Processing Loop: Process the files one by one through a batch splitter.
  • 1.6 Download and Document Loading: Download each file and load its content for embedding.
  • 1.7 Embedding Generation: Generate 768-dimensional embeddings for each document using Google Gemini embeddings API.
  • 1.8 Vector Storage: Insert the document content, metadata, and embeddings into the Supabase vector table for semantic search.

2. Block-by-Block Analysis

1.1 Input Reception

  • Overview:
    Starts the workflow when called externally, accepting a JSON input containing the Google Drive folder URL.

  • Nodes Involved:

    • When Executed by Another Workflow
    • Sticky Note (Trigger explanation)
  • Node Details:

    • When Executed by Another Workflow

      • Type: Execute Workflow Trigger
      • Configuration: Defines an example JSON input whose key Drive_Folder_link holds the Google Drive folder URL.
      • Input/Output: Receives external trigger input → outputs JSON with folder URL.
      • Edge Cases: Missing or malformed input JSON will cause downstream failures. No validation on input structure.
    • Sticky Note

      • Content: Explains trigger node usage and input format.
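Because the trigger performs no validation, a small guard can reject bad input before any later node runs. The following is a minimal sketch, not part of the original workflow: the helper name validateTriggerInput and the assumption that valid links always use the host drive.google.com are illustrative only.

```javascript
// Hypothetical guard for the trigger payload; not part of the original workflow.
// Assumption: a valid link is an absolute URL on drive.google.com.
function validateTriggerInput(payload) {
  const link = payload && payload.Drive_Folder_link;
  if (typeof link !== "string" || link.trim() === "") {
    return { ok: false, error: "Drive_Folder_link is missing or not a string" };
  }
  let url;
  try {
    url = new URL(link.trim());
  } catch {
    return { ok: false, error: "Drive_Folder_link is not a valid URL" };
  }
  if (url.hostname !== "drive.google.com") {
    return { ok: false, error: `Unexpected host: ${url.hostname}` };
  }
  return { ok: true, url: url.href };
}
```

In a Code node, a check like this could run on the first input item and throw when ok is false, so the workflow stops before the database is reset.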

1.2 Drive Folder ID Extraction

  • Overview:
    Parses the provided Google Drive URL to extract the folder or file ID required for Google Drive API calls.

  • Nodes Involved:

    • Code in JavaScript
    • Sticky Note (Regex extraction explanation)
  • Node Details:

    • Code in JavaScript

      • Type: Code Node (JavaScript)
      • Configuration:
        • Extracts folder or file ID from the URL using regex patterns matching /folders/{id} or /file/d/{id}.
        • Outputs JSON with originalInput, folderId, and driveId (folderId and driveId carry the same extracted value).
      • Key Expressions: Custom regex parsing logic within the JavaScript code.
      • Connections: Input from trigger node → output to initialize DB node.
      • Edge Cases:
        • If the URL format is unexpected or the ID is missing, folderId and driveId will be null, which is likely to cause failures downstream.
        • No validation or error handling for invalid IDs.
    • Sticky Note

      • Content: Explains the regex extraction logic for folder/file ID.
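One way to address the null-ID edge case is to fail fast instead of passing null downstream. The sketch below reuses the two regex patterns described above; the wrapper requireDriveId is a hypothetical addition and does not exist in the original workflow.

```javascript
// Same two patterns the Code node uses: /folders/{id} and /file/d/{id}.
function extractDriveId(url) {
  const folderMatch = url.match(/\/folders\/([a-zA-Z0-9_-]+)/);
  const fileMatch = url.match(/\/file\/d\/([a-zA-Z0-9_-]+)/);
  return folderMatch ? folderMatch[1] : (fileMatch ? fileMatch[1] : null);
}

// Hypothetical wrapper: throw instead of returning null so the run stops early.
function requireDriveId(url) {
  const id = extractDriveId(String(url || ""));
  if (id === null) {
    throw new Error(`No Drive folder or file ID found in: ${url}`);
  }
  return id;
}
```

Query parameters such as ?usp=sharing are ignored, because the character class stops at the first character that is not a letter, digit, underscore, or hyphen.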

1.3 Database Initialization

  • Overview:
    Resets and prepares the Supabase Postgres database by dropping and recreating the documents table, enabling the pgvector extension, and defining a vector similarity search function.

  • Nodes Involved:

    • Execute a SQL query
    • Sticky Note (DB initialization explanation)
  • Node Details:

    • Execute a SQL query

      • Type: Postgres node
      • Configuration:
        • Executes multi-statement SQL: drops the documents table if it exists, enables the vector (pgvector) extension, and creates the documents table with id, content, metadata, and a 768-dimensional vector column named embedding.
        • Defines a SQL function match_documents that performs similarity search using pgvector's <=> (cosine distance) operator and returns similarity as 1 minus that distance.
      • Credentials: Uses configured Postgres/Supabase credentials.
      • Input/Output: Receives input from JS code node → outputs to file/folder search node.
      • Edge Cases:
        • Drops existing table, so all previous data is lost on each run.
        • SQL command failures (permissions, extension availability) can break workflow.
      • Version: Requires Postgres version supporting pgvector extension.
    • Sticky Note

      • Content: Warns about table drop and explains database setup.
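To make the ranking concrete, the sketch below mirrors in plain JavaScript what match_documents computes in SQL: pgvector's <=> operator returns cosine distance, the function reports 1 minus that distance as similarity, and rows are ordered from most to least similar. The function names and the tiny 2-dimensional vectors are illustrative assumptions; the real table stores 768-dimensional embeddings, and metadata filtering is omitted here.

```javascript
// Cosine similarity, i.e. 1 - cosine distance.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new RangeError("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// In-memory stand-in for the SQL function; a null matchCount means "no limit",
// matching LIMIT NULL in Postgres.
function matchDocuments(queryEmbedding, docs, matchCount = null) {
  const scored = docs
    .map((d) => ({ id: d.id, similarity: cosineSimilarity(d.embedding, queryEmbedding) }))
    .sort((x, y) => y.similarity - x.similarity);
  return matchCount == null ? scored : scored.slice(0, matchCount);
}

const docs = [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0, 1] },
  { id: "c", embedding: [1, 1] },
];
const top2 = matchDocuments([1, 0], docs, 2);
```

Sorting by descending similarity is equivalent to the SQL's ORDER BY on ascending distance.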

1.4 Fetch Drive Files

  • Overview:
    Retrieves the list of files and folders inside the specified Google Drive folder using the Drive API and the extracted folder ID.

  • Nodes Involved:

    • Search files and folders (Google Drive node)
    • Sticky Notes (List files explanation)
  • Node Details:

    • Search files and folders

      • Type: Google Drive node
      • Configuration:
        • Resource: fileFolder
        • Filter: By folderId extracted from JS code node (dynamic expression)
        • Operation: List files/folders within the target folder.
      • Credentials: Google Drive OAuth2 credentials.
      • Input/Output: Receives from DB init node → outputs to batch loop node.
      • Edge Cases:
        • An empty folder produces no items, so the loop has nothing to process.
        • API authentication failures or permission errors may halt workflow.
        • Large folder contents may require pagination (not explicitly handled).
    • Sticky Notes

      • Content: Explains that this node lists files in the specified Drive folder.
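For readers who want to see what listing a folder looks like at the API level, the sketch below builds request parameters for the Drive API v3 files.list method, including the pageToken needed for large folders. This is an assumption about an equivalent raw request, not a description of how the n8n node is implemented; the helper names are hypothetical.

```javascript
// Build the Drive search query for "direct children of this folder, not trashed".
// Backslashes and single quotes inside the ID are escaped for the query syntax.
function buildFolderQuery(folderId) {
  const escaped = String(folderId).replace(/\\/g, "\\\\").replace(/'/g, "\\'");
  return `'${escaped}' in parents and trashed = false`;
}

// Hypothetical helper: parameters for one files.list call.
// Pass the nextPageToken from the previous response to fetch the next page.
function buildListParams(folderId, pageToken) {
  const params = {
    q: buildFolderQuery(folderId),
    fields: "nextPageToken, files(id, name, mimeType)",
    pageSize: 100,
  };
  if (pageToken) {
    params.pageToken = pageToken;
  }
  return params;
}
```

A caller would repeat the request until the response no longer contains nextPageToken.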

1.5 Iterative Processing Loop

  • Overview:
    Processes each file in batches (default batch size 1) to sequentially download and embed the content.

  • Nodes Involved:

    • Loop Over Items (SplitInBatches node)
    • No dedicated sticky note
  • Node Details:

    • Loop Over Items
      • Type: SplitInBatches
      • Configuration: Default batch size (1) to process files serially.
      • Input/Output: Input from Google Drive file list → outputs to Download File node.
      • Edge Cases: Large number of files may slow down processing; no parallelization configured.
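The batching behaviour can be pictured as slicing the file list into fixed-size groups that are handed downstream one group at a time. The helper below is a hypothetical plain-JavaScript model of that idea, not the node's actual implementation.

```javascript
// Split a list into consecutive batches of at most batchSize items.
function toBatches(items, batchSize = 1) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

With the default size of 1, three files yield three batches, which is why processing is strictly serial; a larger size reduces the number of loop iterations but sends several files through the download step at once.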

1.6 Download and Document Loading

  • Overview:
    Downloads each file from Google Drive, converting Google Docs to plain text, then prepares document content for vector embedding.

  • Nodes Involved:

    • Download File (Google Drive download node)
    • Default Data Loader2 (Langchain document loader)
    • Sticky Notes (Download and loader explanations)
  • Node Details:

    • Download File

      • Type: Google Drive node
      • Configuration:
        • Operation: Download file by ID
        • Google Docs conversion: Converts docs to text/plain for embedding.
      • Credentials: Google Drive OAuth2 credentials.
      • Input/Output: Input from batch loop node → output to document loader.
      • Edge Cases:
        • Unsupported file formats or corrupted files may cause download or conversion failure.
        • Large files may timeout or cause memory issues.
    • Default Data Loader2

      • Type: Langchain Document Default Data Loader
      • Configuration: Extracts text from binary data to feed into embeddings node.
      • Input/Output: Input from Download File node → outputs to embeddings node.
      • Edge Cases:
        • Unsupported binary types may cause extraction failure.
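Since the listing step returns folders as well as files, it can help to decide per item whether to skip it, export it, or download it as-is. The sketch below is a hypothetical pre-download filter; only the Google Docs to text/plain mapping comes from the workflow, and the handling of other Google Workspace types is an assumption.

```javascript
// Google Workspace MIME types (these are not downloadable as raw bytes).
const FOLDER_MIME = "application/vnd.google-apps.folder";
const WORKSPACE_PREFIX = "application/vnd.google-apps.";

// Export targets; the workflow only configures Google Docs -> text/plain.
const EXPORT_AS = {
  "application/vnd.google-apps.document": "text/plain",
};

// Hypothetical helper: decide what to do with one listed Drive item.
function planDownload(file) {
  const mimeType = file.mimeType || "";
  if (mimeType === FOLDER_MIME) {
    return { action: "skip", reason: "folder" };
  }
  if (EXPORT_AS[mimeType]) {
    return { action: "export", mimeType: EXPORT_AS[mimeType] };
  }
  if (mimeType.startsWith(WORKSPACE_PREFIX)) {
    return { action: "skip", reason: "unsupported Google Workspace type" };
  }
  return { action: "download" };
}
```

Items marked skip could be filtered out before the loop so that a subfolder does not cause a download failure.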

1.7 Embedding Generation

  • Overview:
    Converts the extracted document text into a 768-dimensional vector embedding using the Google Gemini embeddings API.

  • Nodes Involved:

    • Embeddings Google Gemini4 (Langchain embeddings node)
    • Sticky Note (Embedding explanation)
  • Node Details:

    • Embeddings Google Gemini4
      • Type: Langchain Google Gemini embeddings node
      • Configuration: Uses default parameters targeting Google Gemini text-embedding-004 model.
      • Credentials: Google Gemini API key (Google Palm API).
      • Input/Output: Input from document loader → output to vector storage node.
      • Edge Cases:
        • API rate limits or key invalidation can cause failures.
        • Text exceeding model limits may be truncated or rejected.
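A common way to stay under an embedding model's input limit is to split long text into overlapping chunks before embedding. The sketch below is a hypothetical character-based splitter; the default sizes are illustrative assumptions, not limits taken from the Gemini documentation, and a text splitter attached to the document loader may already cover this in the workflow.

```javascript
// Split text into chunks of at most chunkSize characters, where each chunk
// repeats the last `overlap` characters of the previous one.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current chunk reaches the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each chunk would then be embedded and stored as its own row, so one Drive file can produce several entries in the documents table.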

1.8 Vector Storage

  • Overview:
    Inserts the document content, metadata, and vector embedding into the Supabase Postgres vector table for future semantic search queries.

  • Nodes Involved:

    • Insert into Supabase Vectorstore (Langchain vector store node)
    • Sticky Note (Storage explanation)
  • Node Details:

    • Insert into Supabase Vectorstore
      • Type: Langchain vector store Supabase node
      • Configuration:
        • Mode: Insert
        • Table: documents
        • Query function for search: match_documents
      • Credentials: Supabase API credentials.
      • Input/Output: Accepts embeddings and document data from embedding and loader nodes → loops back to batch node for next item.
      • Edge Cases:
        • Insert failures due to DB connection issues or schema mismatches.
        • Duplicates and failed inserts are not explicitly handled.
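A schema mismatch typically means the embedding length differs from the vector(768) column. The sketch below checks the dimension and formats the array in pgvector's text form ('[x,y,...]') before an insert. The helper toPgVector is hypothetical; the Supabase vector store node performs its own serialization, so this is mainly useful for manual inserts or debugging.

```javascript
// Must match the vector(768) column created during database initialization.
const EXPECTED_DIM = 768;

// Validate an embedding and format it as a pgvector text literal.
function toPgVector(embedding, expectedDim = EXPECTED_DIM) {
  if (!Array.isArray(embedding) || embedding.length !== expectedDim) {
    const got = Array.isArray(embedding) ? embedding.length : typeof embedding;
    throw new RangeError(`Expected ${expectedDim} dimensions, got ${got}`);
  }
  if (!embedding.every(Number.isFinite)) {
    throw new TypeError("Embedding contains a non-finite or non-numeric value");
  }
  return `[${embedding.join(",")}]`;
}
```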

3. Summary Table

| Node Name | Node Type | Functional Role | Input Node(s) | Output Node(s) | Sticky Note |
|---|---|---|---|---|---|
| When Executed by Another Workflow | Execute Workflow Trigger | Workflow trigger and input reception | - | Code in JavaScript | Trigger Node - Starts workflow when called from another n8n workflow. Accepts Drive folder URL as input. |
| Code in JavaScript | Code (JavaScript) | Extract folder/file ID from Drive URL | When Executed by Another Workflow | Execute a SQL query | Extract Folder ID - Parses Google Drive URL using regex to extract folder/file ID for API calls. |
| Execute a SQL query | Postgres | Initialize database & vector storage | Code in JavaScript | Search files and folders | Initialize Database - Creates Supabase vector table with pgvector extension and match_documents search function. ⚠️ Drops existing table! |
| Search files and folders | Google Drive | List all files/folders in Drive folder | Execute a SQL query | Loop Over Items | List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID. |
| Loop Over Items | SplitInBatches | Loop over each file for processing | Search files and folders | Download File | |
| Download File | Google Drive | Download each file and convert to text | Loop Over Items | Default Data Loader2 | List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID. |
| Default Data Loader2 | Langchain Document Loader | Extract text content from binary file | Download File | Embeddings Google Gemini4 | Store Embeddings - Generates 768-dim vectors via Gemini and inserts documents into Supabase for semantic search. |
| Embeddings Google Gemini4 | Langchain Embeddings Google Gemini | Generate 768-dim embeddings | Default Data Loader2 | Insert into Supabase Vectorstore | AI Embeddings - Converts text to 768-dimensional vectors using Google Gemini text-embedding-004 model. |
| Insert into Supabase Vectorstore | Langchain Vector Store Supabase | Insert document and embedding into DB | Embeddings Google Gemini4, Download File | Loop Over Items | Store Embeddings - Generates 768-dim vectors via Gemini and inserts documents into Supabase for semantic search. |
| Sticky Note1 | Sticky Note | Documentation overview | - | - | 📁 Drive to Supabase Vector Store for Study RAG - Processes Google Drive folder files into Supabase vector embeddings for RAG applications. |
| Sticky Note | Sticky Note | Documentation Trigger explanation | - | - | Trigger Node - Starts workflow when called from another n8n workflow. Accepts Drive folder URL as input. |
| Sticky Note2 | Sticky Note | Documentation folder ID extraction | - | - | Extract Folder ID - Parses Google Drive URL using regex to extract folder/file ID for API calls. |
| Sticky Note3 | Sticky Note | Documentation DB initialization | - | - | Initialize Database - Creates Supabase vector table with pgvector extension and match_documents search function. ⚠️ Drops existing table! |
| Sticky Note4 | Sticky Note | Documentation Drive files listing | - | - | List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID. |
| Sticky Note5 | Sticky Note | Documentation Drive files listing | - | - | List Drive Files - Retrieves all files from the specified Google Drive folder using extracted folder ID. |
| Sticky Note6 | Sticky Note | Documentation embedding and storage | - | - | Store Embeddings - Generates 768-dim vectors via Gemini and inserts documents into Supabase for semantic search. |

4. Reproducing the Workflow from Scratch

  1. Create Trigger Node:

    • Add "Execute Workflow Trigger" node named When Executed by Another Workflow.
    • Configure with JSON example input:
      {
        "Drive_Folder_link": "https://drive.google.com/drive/folders/example"
      }
      
    • This node will start the workflow and accept the Drive folder URL.
  2. Add JavaScript Code Node:

    • Name: Code in JavaScript
    • Purpose: Extract folderId and driveId from provided URL.
    • Use the following JS logic:
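      // URL passed in by the calling workflow (taken from the first input item).
      const driveUrl = $input.first().json.Drive_Folder_link;
      // Return the ID from a /folders/{id} or /file/d/{id} URL, or null if neither matches.
      function getDriveId(url) {
        const folderMatch = url.match(/\/folders\/([a-zA-Z0-9_-]+)/);
        const fileMatch = url.match(/\/file\/d\/([a-zA-Z0-9_-]+)/);
        return folderMatch ? folderMatch[1] : (fileMatch ? fileMatch[1] : null);
      }
      // Emit one output item per input item.
      return items.map(item => {
        // Prefer a chatInput field when present; otherwise fall back to the trigger URL.
        const chatInput = item.json.chatInput || driveUrl || '';
        const driveId = getDriveId(chatInput);
        return {
          json: {
            originalInput: chatInput,
            // Both fields carry the same extracted ID.
            folderId: driveId,
            driveId: driveId
          }
        };
      });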
      
    • Connect When Executed by Another Workflow → Code in JavaScript.
  3. Add Postgres Node for DB Initialization:

    • Name: Execute a SQL query
    • Credentials: Configure with Supabase or Postgres credentials.
    • Query (multi-statement):
      DROP TABLE IF EXISTS documents CASCADE;
      CREATE EXTENSION IF NOT EXISTS vector;
      CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        content text,
        metadata jsonb,
        embedding vector(768)
      );
      CREATE OR REPLACE FUNCTION match_documents(
        query_embedding vector(768),
        match_count int DEFAULT NULL,
        filter jsonb DEFAULT '{}'::jsonb
      )
      RETURNS TABLE (
        id bigint,
        content text,
        metadata jsonb,
        similarity double precision
      )
      LANGUAGE sql
      AS $$
        SELECT
          d.id,
          d.content,
          d.metadata,
          1 - (d.embedding <=> query_embedding) AS similarity
        FROM documents d
        WHERE (filter = '{}'::jsonb OR d.metadata @> filter)
        ORDER BY d.embedding <=> query_embedding
        LIMIT match_count;
      $$;
      
    • Connect Code in JavaScript → Execute a SQL query.
  4. Add Google Drive Node to List Files:

    • Name: Search files and folders
    • Resource: fileFolder
    • Filter: Set folderId with expression referencing Code in JavaScript output folderId field.
    • Credentials: Configure Google Drive OAuth2 credentials.
    • Connect Execute a SQL query → Search files and folders.
  5. Add SplitInBatches Node:

    • Name: Loop Over Items
    • Default batch size (1) suffices.
    • Connect Search files and folders → Loop Over Items.
  6. Add Google Drive Download Node:

    • Name: Download File
    • Operation: Download file by ID.
    • Set File ID dynamically to the id of the current item.
    • Enable Google Docs conversion to text/plain.
    • Credentials: Google Drive OAuth2 credentials (same as above).
    • Connect Loop Over Items → Download File.
  7. Add Langchain Document Loader Node:

    • Name: Default Data Loader2
    • Data Type: binary
    • Connect Download File → Default Data Loader2 (to AI document input).
  8. Add Langchain Google Gemini Embeddings Node:

    • Name: Embeddings Google Gemini4
    • Credentials: Google Palm API key configured for Google Gemini embeddings.
    • Connect Default Data Loader2 → Embeddings Google Gemini4 (AI document → AI embedding).
  9. Add Langchain Supabase Vector Store Node:

    • Name: Insert into Supabase Vectorstore
    • Mode: Insert
    • Table Name: documents
    • Query Name: match_documents
    • Credentials: Supabase API credentials.
    • Connect Embeddings Google Gemini4 → Insert into Supabase Vectorstore (AI embedding input).
    • Also connect Download File → Insert into Supabase Vectorstore (main input) to pass document content and metadata.
    • Connect Insert into Supabase Vectorstore → Loop Over Items (to continue the loop).
  10. Add Sticky Notes (Optional):

    • Add sticky notes as described for documentation and clarity within the workflow editor.

5. General Notes & Resources

| Note Content | Context or Link |
|---|---|
| Workflow creates vector embeddings for RAG applications from Google Drive files. | Overview sticky note in workflow titled "📁 Drive to Supabase Vector Store for Study RAG" |
| Requires Google Drive OAuth2 credentials and Google Gemini API key for embedding generation. | Credential requirements across Google Drive and Google Palm API nodes. |
| Supabase Postgres must support pgvector extension for vector storage and search functionality. | Database initialization node setup and SQL code. |
| Workflow input format: { "Drive_Folder_link": "your_drive_url" } | Input JSON example in trigger node. |
| Embeddings generated are 768-dimensional vectors using Google Gemini text-embedding-004 model. | Embeddings node and sticky note documentation. |
| Google Docs files are automatically converted to plain text for embedding. | Download File node Google Docs conversion setting. |
| Workflow drops and recreates the documents table on each execution - data is not persisted. | Important caution in DB init sticky note. |
| Potential failure points include invalid Drive URLs, API authentication failures, and DB errors. | Consider adding error handling and retries in production environments. |
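As one possible shape for the retries mentioned in the notes above, the sketch below wraps an async operation in a retry loop with capped exponential backoff. The helper names, attempt count, and delays are illustrative assumptions, not settings from the workflow.

```javascript
// Delay before the next attempt: baseMs, 2*baseMs, 4*baseMs, ... capped at capMs.
function backoffDelayMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Hypothetical wrapper: run `operation` up to maxAttempts times,
// waiting between failures and rethrowing the last error if all attempts fail.
async function withRetry(operation, maxAttempts = 3, baseMs = 500) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt, baseMs);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

A wrapper like this suits transient failures such as rate limits or dropped connections; an invalid Drive URL or a missing permission will fail on every attempt and is better caught by validation up front.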
