One of the most frustrating limitations of language models is that they know nothing about your specific company. They do not know your internal policies, they have no access to your customer data, and they do not know how your expense approval process works. Retrieval-Augmented Generation (RAG) is the technique that addresses exactly this problem, and in this article we explain how to implement it correctly.
RAG vs. fine-tuning: the fundamental decision
Before diving into implementation, we need to answer the question many teams ask first: should I fine-tune a model with my data or use RAG?
Fine-tuning modifies the model's weights with specific examples. It is useful for changing the response style, teaching particular formats, or adapting general behavior. Its limitations: it is expensive, requires curated training data, knowledge gets "frozen" in the model, and updating it requires a new training cycle.
RAG does not modify the model. Instead, it dynamically retrieves relevant information from your knowledge base and includes it in the context of each query. Its advantages: knowledge is updated simply by updating the source documents, it requires no training data, and it is much more economical.
In our experience, RAG is the right answer for the large majority (roughly 95%) of enterprise use cases where the goal is for the AI to respond with internal information.
The RAG architecture explained
A RAG system has three main components:
1. Document ingestion
Your company's documents (PDFs, Word files, internal web pages, databases) are processed and divided into smaller fragments called "chunks." Each chunk is converted into a numerical vector (embedding) that captures its semantic meaning.
2. Vector database
The embeddings are stored in a database specialized in semantic similarity search. When someone asks a question, the question is also converted into an embedding and the most similar chunks are found.
3. Augmented generation
The most relevant chunks are included in the prompt along with the original question, giving the LLM the context needed to respond with information specific to your company.
[User question] + [Retrieved relevant chunks] → [LLM response]
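The augmentation step above can be sketched as a small helper that joins the retrieved chunks and the question into one prompt. This is a minimal illustration, not a fixed format: the function name `build_augmented_prompt`, the chunk labels, and the sample texts are all made up for the example.

```python
def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into a single prompt."""
    # Number each chunk so the model (and a human reviewer) can refer to it.
    context = "\n\n".join(
        f"[Chunk {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical question and chunks, only to show the resulting prompt shape.
prompt = build_augmented_prompt(
    "How many vacation days do new employees get?",
    [
        "New employees accrue 15 vacation days per year.",
        "Vacation requests must be approved by a manager.",
    ],
)
print(prompt)
```

The string returned here is what gets sent to the LLM; retrieval decides which chunks go in, and the model never sees the rest of the knowledge base.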
Embeddings: the key that everyone oversimplifies
An embedding is a numerical representation of text in a high-dimensional space where texts with similar meaning sit close together. This is what lets the question "how many days do I have to file a warranty claim?" match a document that says "warranty period: 30 days," even though the wording is different.
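"Close together" is usually measured with cosine similarity. The sketch below computes it in plain Python on made-up 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and the numbers here are illustrative assumptions rather than output from any embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (values invented for the example).
question = [0.9, 0.1, 0.2]      # "how many days do I have to file a warranty claim?"
warranty_doc = [0.8, 0.2, 0.1]  # "warranty period: 30 days"
parking_doc = [0.1, 0.9, 0.3]   # unrelated text about parking

# The warranty text scores higher than the unrelated one.
print(cosine_similarity(question, warranty_doc) > cosine_similarity(question, parking_doc))  # True
```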
The most commonly used embedding models:
- text-embedding-3-small (OpenAI): Excellent cost-quality ratio for Spanish
- text-embedding-3-large (OpenAI): Higher accuracy, higher cost
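- all-MiniLM-L6-v2 (Sentence Transformers): Open-source, deployable locally; trained mainly on English text, so evaluate a multilingual variant before using it for Spanish content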
For Latin American Spanish content, we have found that text-embedding-3-small from OpenAI works well. Open-source models require more experimentation but are viable.
Vector databases: a comparison
pgvector (PostgreSQL)
The simplest option to adopt if you already use PostgreSQL, since pgvector is an extension rather than a separate system. Excellent for getting started and for moderate volumes (up to several million vectors). Requires no new infrastructure.
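-- Enable the pgvector extension (no-op if it is already installed)
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a vector column.
-- 1536 matches the default output dimension of text-embedding-3-small.
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    source VARCHAR(500),
    embedding vector(1536)
);

-- Approximate index for fast cosine-distance search.
-- IVFFlat builds its clusters from existing rows, so create it after loading data.
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);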
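To load rows into a table like the one above from Python, one option is to send each embedding in pgvector's text format (`[0.1,0.2,...]`) and cast it with `::vector`. The sketch below only prepares the SQL and parameters; the helper names `to_pgvector_literal` and `build_insert_params` are hypothetical, and the dimension check assumes the `vector(1536)` column defined above.

```python
EXPECTED_DIM = 1536  # Must match the vector(...) size of the embedding column

INSERT_SQL = (
    "INSERT INTO documents (content, source, embedding) "
    "VALUES (%s, %s, %s::vector)"
)


def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a list of floats in pgvector's text format, e.g. "[0.1,0.2]"."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_insert_params(
    content: str,
    source: str,
    embedding: list[float],
    expected_dim: int = EXPECTED_DIM,
) -> tuple[str, str, str]:
    """Validate the embedding size and return parameters for INSERT_SQL."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(embedding)}"
        )
    return (content, source, to_pgvector_literal(embedding))


# With an open database cursor (not created here), the insert would be:
#   cursor.execute(INSERT_SQL, build_insert_params(chunk_text, "handbook.pdf", vector))
print(to_pgvector_literal([0.1, 0.2, 1.0]))  # [0.1,0.2,1.0]
```

Checking the dimension before the insert gives a clearer error than the one the database raises when an embedding from a different model is written to the column by mistake.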
Pinecone
Specialized cloud service. Very easy to use, scales automatically. Cost can be high at volume. Good option when pgvector is not enough.
Chroma
Open-source vector database, ideal for development and prototyping. Simple to configure locally.
Chunking strategies
How you split the documents has an enormous impact on response quality.
Fixed-size chunking
You split by number of tokens, with overlap between chunks. Simple to implement but may cut ideas in half.
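A minimal sketch of fixed-size chunking with overlap follows. It assumes the text is already tokenized; whitespace-separated words stand in for tokens here, whereas a production system would count tokens with the tokenizer of its embedding model. The function name `chunk_fixed_size` is our own.

```python
def chunk_fixed_size(
    tokens: list[str],
    chunk_size: int = 512,
    overlap: int = 100,
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # This window already reached the end of the text
    return chunks


# Words as stand-in tokens, with small sizes so the overlap is visible.
words = "the warranty period for all products is thirty days from delivery".split()
chunks = chunk_fixed_size(words, chunk_size=5, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last two words of the previous one, which is what reduces the chance of an idea being cut exactly at a boundary.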
Semantic chunking
You split at paragraph or logical section boundaries. Better for structured documents.
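One simple way to approximate this is to split on blank lines and pack whole paragraphs into a chunk until a size budget is reached. The sketch below does that with a word count as a rough stand-in for tokens; `chunk_by_paragraph` and its `max_words` parameter are illustrative names, and a paragraph longer than the budget is kept whole rather than cut.

```python
import re


def chunk_by_paragraph(text: str, max_words: int = 600) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_words words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Made-up policy text with three paragraphs.
policy = (
    "Vacation requests need approval.\n\n"
    "Submit them two weeks ahead.\n\n"
    "Unused days expire at the end of the year."
)
for chunk in chunk_by_paragraph(policy, max_words=10):
    print(repr(chunk))
```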
Hierarchical chunking
You maintain hierarchy: the parent document + child chunks. Useful for legal or technical documents where the document context matters.
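The parent/child relationship can be sketched with two small data classes: child chunks are what you embed and search, and each one keeps a pointer to its parent so the document context can be restored at answer time. The class and function names below are hypothetical, and the contract text is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ParentDocument:
    doc_id: str
    title: str
    sections: list[str]


@dataclass
class ChildChunk:
    chunk_id: str
    parent_id: str
    text: str


def split_into_children(doc: ParentDocument) -> list[ChildChunk]:
    """Create one child chunk per section, each pointing back to its parent."""
    return [
        ChildChunk(chunk_id=f"{doc.doc_id}#{i}", parent_id=doc.doc_id, text=section)
        for i, section in enumerate(doc.sections)
    ]


def with_parent_context(chunk: ChildChunk, parents: dict[str, ParentDocument]) -> str:
    """Prefix a retrieved child chunk with the title of its parent document."""
    parent = parents[chunk.parent_id]
    return f"[{parent.title}]\n{chunk.text}"


contract = ParentDocument(
    doc_id="contract-001",
    title="Supplier Agreement (example)",
    sections=["Clause 1: Scope of services.", "Clause 2: Payment terms."],
)
parents = {contract.doc_id: contract}
children = split_into_children(contract)
print(with_parent_context(children[1], parents))
```

A fuller version might attach the whole parent section or neighboring clauses instead of just the title; the pointer from child to parent is the part that matters.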
For typical corporate documents (policies, manuals, FAQs), we recommend chunks of 512–800 tokens with 100-token overlap.
Complete RAG pipeline example in Python
from openai import OpenAI
import psycopg2  # `conn` below is a psycopg2 connection to the database holding `documents`

client = OpenAI()  # Reads OPENAI_API_KEY from the environment


def get_embedding(text: str) -> list[float]:
    """Generates an embedding for a text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding


def search_relevant_documents(
    question: str,
    conn,
    limit: int = 5
) -> list[dict]:
    """Finds the most relevant chunks for a question."""
    question_embedding = get_embedding(question)
    # The context manager closes the cursor once the query is done.
    with conn.cursor() as cursor:
        # <=> is pgvector's cosine distance, so 1 - distance is cosine similarity.
        cursor.execute("""
            SELECT content, source,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (question_embedding, question_embedding, limit))
        results = cursor.fetchall()
    return [
        {"content": r[0], "source": r[1], "similarity": r[2]}
        for r in results
        if r[2] > 0.75  # Minimum relevance threshold; tune it on your own data
    ]


def answer_with_context(
    question: str,
    documents: list[dict]
) -> str:
    """Generates a response using the retrieved documents."""
    if not documents:
        return "I did not find relevant information in the knowledge base."
    # Label each chunk with its source so answers can be traced back.
    context = "\n\n".join([
        f"[Source: {d['source']}]\n{d['content']}"
        for d in documents
    ])
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are the company's internal assistant.\n"
                    "Answer ONLY based on the information provided.\n"
                    "If you do not have enough information, say so clearly."
                )
            },
            {
                "role": "user",
                "content": (
                    f"Company context:\n{context}\n\n"
                    f"Question: {question}"
                )
            }
        ],
        temperature=0.1  # Low temperature keeps answers close to the context
    )
    return response.choices[0].message.content
Real use cases we have implemented
HR chatbot
For a company with 300 employees in Colombia, we implemented a chatbot that answers questions about vacation policies, leave, benefits, and internal procedures. Employees can ask in natural language and get precise answers based on the updated employee handbook.
Result: 60% reduction in HR department queries for administrative questions.
Product catalog search
A distributor with more than 50,000 SKUs implemented RAG so sales reps can ask questions like "what USB-C cable supports 100W charging and comes in black?" and get the exact products from the catalog.
Policy and compliance assistant
For a financial firm in Mexico, the legal team uses RAG to query the extensive regulatory compliance manual. What previously took hours of manual searching now takes seconds.
Estimated costs
For a knowledge base of 1,000 documents (approximately 500 pages):
- Embedding generation (once): ~$0.50 with text-embedding-3-small
- Monthly operation (1,000 queries/day): ~$15–50 depending on the LLM model
For 10,000 documents and higher volume, costs scale, but they remain orders of magnitude lower than the equivalent human cost.
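To redo this arithmetic for your own volumes, a small estimator is enough. The sketch below is generic: every price and token count in the usage example is a placeholder we chose for illustration, not a quote from any provider, so replace them with current rates and your measured prompt sizes.

```python
def token_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of processing `tokens` tokens at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million_usd


def monthly_llm_cost_usd(
    queries_per_day: int,
    input_tokens_per_query: int,
    output_tokens_per_query: int,
    input_price_per_million_usd: float,
    output_price_per_million_usd: float,
    days: int = 30,
) -> float:
    """Monthly generation cost: input and output tokens are priced separately."""
    queries = queries_per_day * days
    input_cost = token_cost_usd(
        queries * input_tokens_per_query, input_price_per_million_usd
    )
    output_cost = token_cost_usd(
        queries * output_tokens_per_query, output_price_per_million_usd
    )
    return input_cost + output_cost


estimate = monthly_llm_cost_usd(
    queries_per_day=1_000,              # Query volume used in the estimate above
    input_tokens_per_query=2_000,       # Placeholder: question plus retrieved chunks
    output_tokens_per_query=200,        # Placeholder: length of a typical answer
    input_price_per_million_usd=0.25,   # Placeholder price, check your provider
    output_price_per_million_usd=1.00,  # Placeholder price, check your provider
)
print(f"Estimated monthly LLM cost: ${estimate:.2f}")
```

Input tokens usually dominate in RAG because every query carries the retrieved chunks, so the number of chunks and their size affect cost as well as answer quality.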
Common pitfalls
Chunks too small: You lose context. An answer drawn from a contract needs the complete clause, not a fragment of it.
Chunks too large: You include irrelevant information that confuses the model.
No relevance threshold: If you do not filter by minimum similarity, the model receives irrelevant context and is more likely to fabricate responses.
Not updating embeddings: When you update a document, you must regenerate its embeddings. Without this, responses become outdated.
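One way to keep embeddings in sync is to store a hash of each document's content next to its embeddings and re-embed only when the hash changes. The sketch below assumes such hashes are available as a dictionary; `content_hash` and `documents_to_reembed` are hypothetical helper names, and the sample documents are invented.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def documents_to_reembed(
    current_docs: dict[str, str],
    stored_hashes: dict[str, str],
) -> list[str]:
    """Return IDs of documents that are new or whose content has changed."""
    return sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    )


# Hashes saved when the embeddings were last generated (example data).
stored = {
    "handbook": content_hash("Vacation: 15 days per year."),
    "faq": content_hash("Office hours: 9 to 5."),
}
# Current state of the source documents: one edited, one unchanged, one new.
current = {
    "handbook": "Vacation: 18 days per year.",
    "faq": "Office hours: 9 to 5.",
    "travel-policy": "New document.",
}
print(documents_to_reembed(current, stored))  # ['handbook', 'travel-policy']
```

Documents that were deleted at the source need the opposite check: any ID present in the stored hashes but missing from the current set should have its chunks removed from the vector database.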
Conclusion: RAG is the gateway to enterprise AI
Retrieval-Augmented Generation is the technology that makes LLMs useful in real enterprise contexts. It requires no training data, is easy to update, and is affordable for mid-sized companies.
If you have valuable corporate knowledge scattered across documents that no one consults, RAG can convert it into an active and accessible asset for your entire team.
At Alternetica we have implemented RAG systems for clients in multiple industries across LATAM. Contact us to explore how this technology can be applied to your specific use case.

