When working with Retrieval-Augmented Generation (RAG), one of the first and most important steps is document chunking: breaking large texts into smaller pieces (or “chunks”) that can be efficiently processed by text embedding models. But this process comes with a common challenge: getting the chunk size just right.
Chunk size is the number of characters or tokens in a single text segment (chunk) before it’s converted into a vector embedding. Here’s why the chunk size matters:
Vector embeddings of all chunks have exactly the same dimension, regardless of how much text or information each chunk contains. A smaller chunk size means less text is packed into a single vector, which can lead to more precise similarity searches because each vector is a more granular representation of the text. However, it also means that crucial information, for example the content of a single paragraph, may be split across multiple chunks, potentially causing the similarity search to miss the relevant chunk. A larger chunk size means more text is packed into a single vector, which may dilute its overall semantic meaning.
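To make this trade-off concrete, here is a small illustrative sketch using LangChain's RecursiveCharacterTextSplitter (the same splitter we use later); the sample text and sizes are arbitrary:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Any long document text (a repeated sentence, just for illustration)
text = "SAP HANA Cloud is an in-memory database platform. " * 200
small_chunks = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20).split_text(text)
large_chunks = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=20).split_text(text)
# Many small, precise chunks versus a few large, context-rich chunks
print(len(small_chunks), len(large_chunks))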
This is where the ParentDocumentRetriever (PDR) helps strike a balance between these two requirements.
Another example: imagine breaking a book into pages. If you split it by paragraph, the meaning might get fragmented. But if you split it by chapter, you might end up with sections too long for the AI to process effectively. So what is the right balance?
This is where the ParentDocumentRetriever (PDR) comes in.
PDR works by keeping track of which small chunks came from which larger document. When the AI finds a relevant chunk, PDR also brings in related information from the original document it came from. It’s like finding a useful quote and automatically getting the full article around it for better understanding. This way, the system doesn’t just retrieve a small snippet—it gets the full context behind it.
For more information on the Parent Document Retriever, please read https://medium.com/ai-insights-cobet/rag-and-parent-document-retrievers-making-sense-of-complex-contexts-with-code-5bd5c3474a8a
In this blog, we’ll explore how to use ParentDocumentRetriever with the SAP HANA Cloud Vector Engine to improve how documents are retrieved, understood, and used, ensuring your AI system delivers more relevant and accurate results.
How ParentDocumentRetriever works:
1. Ingestion Process (Preparing the Data):
Step 1: Decide on Chunk Sizes
First, we pick a size for breaking up long documents in a way that still keeps the overall meaning.
Example: Use 4,000 characters as the size for a bigger section.
Then, we also choose a smaller size for creating more precise AI-friendly representations.
Example: Use 1,000 characters for smaller chunks.
Step 2: Break Documents into Chunks
The large 4,000-character chunks become the “parent documents.”
Each parent document is then broken down into smaller 1,000-character “child chunks.” These are what the AI will use to match against queries.
Step 3: Store the Chunks
Both parent and child documents are stored in the system, but kept separately so we know which chunk came from where.
2. Retrieval Process (Using the Data):
When someone asks a question, the system searches only the small child chunks (1,000 characters) to find the most relevant piece.
After finding a useful child chunk, the system also pulls in the parent document (4,000 characters) that it came from—giving a fuller, more meaningful answer.
Why This Is Useful:
This approach ensures that the system benefits from the precision of small chunks while retaining the broader context of the original documents.
IMPLEMENTATION OF THE DATA INGESTION PROCESS
We need two stores for this approach:
DOCSTORE: this store holds the parent documents.
VECTORSTORE: this store holds the child documents along with their embeddings and metadata.
A) Create DOCSTORE
Normally, we don’t allow the runtime user to create tables dynamically, so create a table named DOCSTORE in the HANA database up front.
Run this command to create the table in the SAP HANA database:
CREATE COLUMN TABLE "Schema"."DOCSTORE" (
    "KEY" NVARCHAR(500) NOT NULL,
    "VALUE" NCLOB MEMORY THRESHOLD 1000,
    PRIMARY KEY ("KEY")
)
Create a file named PDR.py, which will define the HanaStore class used by the ParentDocumentRetriever.
Code for implementation:
Import the necessary libraries and set up a logger:
import json
import logging
from typing import Optional, Sequence, Iterator, TypeVar, Generic

from pydantic import BaseModel, Field
from sqlalchemy import create_engine, inspect, Column, String, MetaData, Table
from sqlalchemy.orm import declarative_base, sessionmaker, scoped_session
from sqlalchemy_hana.types import NCLOB
from langchain.schema import Document
from langchain_core.stores import BaseStore

# Module-level logger used by the store methods below
logger = logging.getLogger(__name__)
Define the DocumentModel:
class DocumentModel(BaseModel):
    key: Optional[str] = Field(None)
    page_content: Optional[str] = Field(None)
    metadata: dict = Field(default_factory=dict)

D = TypeVar("D", bound=Document)
Create the HanaStore: this store holds the parent documents.
To define this class, we extend BaseStore from the langchain_core library.
We need to define the following methods inside the class:
a) serialize_document: serializes a Document to a string
b) deserialize_document: deserializes a string back into a Document
c) mget: returns the documents for the given keys
d) mset: takes key/document pairs and saves them in the HanaStore
e) mdelete: deletes the documents for the given keys
f) yield_keys: yields the stored keys
Define the HanaStore class:
Connect to the HANA database in __init__ and add a get_session method that returns a session bound to that connection.
class HanaStore(BaseStore[str, DocumentModel], Generic[D]):
    def __init__(self, connection_string: str, schema: str, table_name: str):
        # code to connect to the HANA database
        ...

    def get_session(self):
        return self.Session()
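The __init__ body is left out of the snippet; a minimal sketch of what it could look like, assuming a SQLAlchemy connection URL for the hana+hdbcli dialect and the pre-created DOCSTORE table, is:
def __init__(self, connection_string: str, schema: str, table_name: str):
    # Connect to the HANA database and reflect the existing table
    # (the table was created up front, not at runtime)
    self.engine = create_engine(connection_string)
    metadata = MetaData(schema=schema)
    self.table = Table(table_name, metadata, autoload_with=self.engine)
    # Session factory used by get_session()
    self.Session = scoped_session(sessionmaker(bind=self.engine))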
a) Implement the serialize_document method:
def serialize_document(self, doc: Document) -> str:
    return json.dumps({"page_content": doc.page_content, "metadata": doc.metadata})
b) Implement the deserialize_document method:
def deserialize_document(self, value: str) -> Document:
    try:
        data = json.loads(value)
        return Document(page_content=data.get("page_content", ""), metadata=data.get("metadata", {}))
    except json.JSONDecodeError as e:
        logger.error(f"Failed to deserialize document: {e}")
        # Re-raise so callers do not silently receive None
        raise
c) Implement the mget method:
def mget(self, keys: Sequence[str]) -> list[Document]:
    with self.get_session() as session:
        try:
            # Query the reflected table directly
            select_stmt = self.table.select().where(self.table.c.key.in_(keys))
            result = session.execute(select_stmt).fetchall()
            # Take each row, deserialize it, and collect the documents
            documents = []
            for row in result:
                logger.debug(f"Retrieved SQLDocument with key: {row.key}, value: {row.value}")
                doc = self.deserialize_document(row.value)
                documents.append(doc)
            return documents
        except Exception as e:
            logger.error(f"Error in mget: {e}")
            session.rollback()
            return []
d) Implement the mset method:
def mset(self, key_value_pairs: Sequence[tuple[str, Document]]) -> None:
    with self.get_session() as session:
        try:
            # Prepare serialized documents
            serialized_docs = []
            for key, document in key_value_pairs:
                serialized_doc = self.serialize_document(document)
                serialized_docs.append({"key": key, "value": serialized_doc})
            # Insert or update documents manually
            for doc in serialized_docs:
                # Check if a document with the same key already exists
                select_stmt = self.table.select().where(self.table.c.key == doc["key"])
                existing_doc = session.execute(select_stmt).fetchone()
                if existing_doc:
                    # Update the existing document
                    update_stmt = self.table.update().where(self.table.c.key == doc["key"]).values(value=doc["value"])
                    session.execute(update_stmt)
                else:
                    # Insert a new document
                    insert_stmt = self.table.insert().values(doc)
                    session.execute(insert_stmt)
            session.commit()
        except Exception as e:
            logger.error(f"Error in mset: {e}")
            session.rollback()
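A design note: instead of the select-then-insert/update loop above, SAP HANA also offers an UPSERT statement keyed on the primary key, which can collapse each pair into a single statement. A rough sketch of that alternative (not what this blog uses; verify the exact syntax against your HANA version):
from sqlalchemy import text

# UPSERT inserts new rows and overwrites rows whose primary key already exists
upsert_stmt = text('UPSERT "DOCSTORE" ("KEY", "VALUE") VALUES (:key, :value) WITH PRIMARY KEY')
session.execute(upsert_stmt, serialized_docs)  # executemany over the prepared dicts
session.commit()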
e) Implement the mdelete method:
# Delete documents for the given keys
def mdelete(self, keys: Sequence[str]) -> None:
    with self.get_session() as session:
        try:
            # Perform the delete directly on the reflected table
            delete_stmt = self.table.delete().where(self.table.c.key.in_(keys))
            session.execute(delete_stmt)
            session.commit()
        except Exception as e:
            logger.error(f"Error in mdelete: {e}")
            session.rollback()
f) Implement the yield_keys method:
# Yield the stored keys
def yield_keys(self, *, prefix: Optional[str] = None) -> Iterator[str]:
    with self.get_session() as session:
        try:
            # Build the query against the reflected table, selecting only the key column
            select_stmt = self.table.select().with_only_columns(self.table.c.key)
            if prefix:
                select_stmt = select_stmt.where(self.table.c.key.like(f"{prefix}%"))
            # Execute the query and yield the keys
            result = session.execute(select_stmt)
            for row in result:
                yield row.key
        except Exception as e:
            logger.error(f"Error in yield_keys: {e}")
            session.rollback()
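With the class complete, a quick roundtrip check of the docstore could look like the sketch below; connection_url and schema are placeholders for your own connection values:
# Hypothetical smoke test for the HanaStore
doc_store = HanaStore(connection_string=connection_url, schema=schema, table_name="DOCSTORE")
doc_store.mset([("demo-key", Document(page_content="hello parent document", metadata={}))])
print(doc_store.mget(["demo-key"]))
doc_store.mdelete(["demo-key"])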
B) Define the Vectorstore:
This will store the child documents along with their embeddings.
The vectorstore table is named VECTOR_STORE.
Create a file named vectorstore.py to implement the vectorstore.
Import libraries:
from langchain_community.vectorstores.hanavector import HanaDB
from hdbcli import dbapi
from gen_ai_hub.proxy.langchain.init_models import init_embedding_model
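The HANA_DB_* values used in the connection below are not defined in the snippet; one common approach (an assumption, not part of the original code) is to read them from environment variables:
import os

# Connection details for the HANA Cloud instance (variable names are illustrative)
HANA_DB_ADDRESS = os.environ["HANA_DB_ADDRESS"]
HANA_DB_PORT = int(os.environ.get("HANA_DB_PORT", "443"))
HANA_DB_USER = os.environ["HANA_DB_USER"]
HANA_DB_PASSWORD = os.environ["HANA_DB_PASSWORD"]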
Make the connection:
connection = dbapi.connect(
    address=HANA_DB_ADDRESS,
    port=HANA_DB_PORT,
    user=HANA_DB_USER,
    password=HANA_DB_PASSWORD,
    autocommit=True,
    sslValidateCertificate=False,
)
Create the vectorstore:
# define your embedding model here
embedding_model = init_embedding_model('text-embedding-ada-002')
cursor = connection.cursor()
cursor.execute(f"set schema {schema}")
store = HanaDB(
    embedding=embedding_model, connection=connection, table_name='VECTOR_STORE'
)
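Once the ingestion step below has added documents, you can sanity-check the vectorstore directly; a minimal sketch:
# Optional smoke test: run a similarity search against the child-chunk table
# (only meaningful after documents have been added)
results = store.similarity_search("test query", k=2)
print(results)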
C) Implement the data ingestion for the ParentDocumentRetriever
1) Create a file named ingestion.py and import the necessary libraries:
from PDR import HanaStore
from vectorstore import store
from langchain_community.vectorstores.hanavector import HanaDB
from langchain.retrievers import ParentDocumentRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema import Document
import fitz  # PyMuPDF
import uuid
from sqlalchemy.engine import URL
2) Create two splitters, one for the parent documents and one for the child documents:
# Splitter for 4,000-character chunks; these carry the larger context and act as parent documents.
# The parameters are configurable.
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
# Splitter for 1,000-character chunks; these act as child documents.
# The parameters are configurable.
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
3) Read the file and create a document:
# Open the PDF file
pdf_document = fitz.open("sample.pdf")
# Extract text from each page
text = ""
for page in pdf_document:
    text += page.get_text("text") + "\n"
# Create the document
doc = Document(page_content=text, metadata={"source": "source"})
4) Create the parent documents:
docs = parent_splitter.split_documents([doc])
print(len(docs))
5) Create a doc_id for each parent document and save it in the metadata:
# Create an id for each corresponding parent document
doc_ids = [str(uuid.uuid4()) for _ in docs]
# This key is stored in the metadata and links the vectorstore and the docstore
id_key = "doc_id"
for i, d in enumerate(docs):
    _metadata = {id_key: doc_ids[i]}
    d.metadata.update(_metadata)
6) Create the child documents:
child_doc = []
for d in docs:
    smaller_doc = child_splitter.transform_documents([d])
    child_doc.extend(smaller_doc)
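Because transform_documents copies the parent's metadata onto each child chunk, every child carries the doc_id that links it back to its parent; a quick check:
# Each child chunk should carry its parent's doc_id in the metadata;
# this is what lets the retriever map a matched child back to its parent
print(child_doc[0].metadata.get("doc_id"))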
7) Create the ParentDocumentRetriever:
# Define the connection URL for the HANA SQLAlchemy dialect
connection_url = URL.create(
    drivername='hana+hdbcli',
    username=HANA_DB_USER,
    password=HANA_DB_PASSWORD,
    host=HANA_DB_ADDRESS,
    port=HANA_DB_PORT,
    query={"currentSchema": schema},  # remove this line if you don't use a schema
)
retriever = ParentDocumentRetriever(
    vectorstore=store,
    docstore=HanaStore(connection_string=connection_url, schema=schema, table_name="DOCSTORE"),
    child_splitter=child_splitter,
)
8) Save the parent documents and child documents in the HanaStore and vectorstore respectively:
# Add the child documents to the vectorstore
retriever.vectorstore.add_documents(child_doc)
# Add the parent documents to the docstore
retriever.docstore.mset(list(zip(doc_ids, docs)))
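As an aside, the ParentDocumentRetriever can also perform the splitting and storing itself in one call when it is constructed with a parent_splitter as well; a sketch of that alternative (not what this blog does, shown only for comparison):
retriever_alt = ParentDocumentRetriever(
    vectorstore=store,
    docstore=HanaStore(connection_string=connection_url, schema=schema, table_name="DOCSTORE"),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
# Splits the document into parents and children, stores both, and generates the ids itself
retriever_alt.add_documents([doc])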
IMPLEMENTATION OF THE RETRIEVAL PROCESS
1) Create a file named retrieval.py and import the necessary libraries:
from PDR import HanaStore
from vectorstore import store
from langchain_community.vectorstores.hanavector import HanaDB
from langchain.retrievers import ParentDocumentRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sqlalchemy.engine import URL
2) Create the document splitter (the retriever requires a child_splitter even though it is only used when adding documents; here we mirror the ingestion child splitter):
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
3) Create the ParentDocumentRetriever:
# Define the connection URL for the HANA SQLAlchemy dialect
connection_url = URL.create(
    drivername='hana+hdbcli',
    username=HANA_DB_USER,
    password=HANA_DB_PASSWORD,
    host=HANA_DB_ADDRESS,
    port=HANA_DB_PORT,
    query={"currentSchema": schema},  # remove this line if you don't use a schema
)
retriever = ParentDocumentRetriever(
    vectorstore=store,
    docstore=HanaStore(connection_string=connection_url, schema=schema, table_name="DOCSTORE"),
    child_splitter=splitter,
)
4) Get the relevant documents using the retriever:
result = retriever.invoke("what is sap")
print(result)
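To confirm the behaviour described earlier (child chunks are searched, parent chunks are returned), you can inspect the result; a small sketch:
# Each returned document should be a parent chunk of up to ~4,000 characters,
# even though the similarity search ran over the 1,000-character child chunks
for d in result:
    print(d.metadata.get("doc_id"), len(d.page_content))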