AI Tech Stack | AI.JBHerrera

The front door to AI. Every interface you see is essentially a wrapper around an API call — sending your prompt and displaying the response. The variety is wide, but the underlying mechanic is the same.

Web & mobile chat

Claude.ai, ChatGPT, Gemini — clean interfaces that handle conversation history, file uploads, and streaming responses token by token.

IDE plugins

Tools like Cursor or GitHub Copilot embed AI directly in your code editor. They pass your file context automatically with every prompt.

Voice assistants

Siri, Alexa, and now AI-native assistants add a speech-to-text layer before the prompt, and text-to-speech on the way back out.

Direct API consumers

Developers and platforms call the API directly — your own app, Zapier automations, custom agents — no visual interface at all.

This is where AI application design lives — and where Synergi AI operates. The raw model is powerful but generic. Application logic shapes it into a specific, valuable tool for a specific audience.

RAG (Retrieval-Augmented Generation)

Before sending a prompt, search a knowledge base for relevant documents and inject them into context. This lets the model answer questions about your private data without retraining.

Agentic workflows

Multi-step AI that can take actions — search the web, run code, send emails, call APIs — not just generate text. The model reasons about what to do next.

System prompts

Hidden instructions given to the model before every conversation. This is how you define persona, rules, tone, and constraints. The model always follows these first.

MCP (Model Context Protocol)

Anthropic's open standard for connecting AI to external tools — databases, calendars, CRMs. Enables richer, more capable agent systems without custom glue code.

The API is the public face of the AI system — a clean HTTP interface that hides the enormous complexity of what happens beneath. It handles authentication, routing, and streaming your response back word by word.

Streaming responses

Rather than waiting for the full response before sending anything, the API streams tokens as they're generated. This is why you see text appearing word by word — not all at once.

Safety filters

Inputs and outputs pass through classifiers and rule systems before reaching the model or the user. These enforce content policies independently of the model's own training.

SDKs

Libraries for Python, JavaScript, and other languages that wrap the raw HTTP API — handling authentication, retries, streaming, and error handling so developers don't have to.

Rate limiting

Measured in tokens per minute and requests per minute. Controls access fairness across millions of users and protects the infrastructure from overload.

The inference engine is what runs the model in real time — loading billions of parameters into GPU memory and executing the forward pass for each token generated. It's heavily optimized for speed and cost.

KV cache

Key-Value cache stores the intermediate results of processing the prompt so they don't have to be recomputed for each new token. Critical for long-context performance.

Quantization

Weights stored in lower precision (INT8 vs FP32) use 4× less memory with minimal quality loss. Enables larger models to fit in available VRAM.

Token sampling

After the model scores all possible next tokens, a sampling strategy picks one. Temperature, top-p, and top-k control the randomness/creativity of the output.

Request batching

Multiple user requests are grouped together and processed simultaneously on the GPU — dramatically increasing throughput and reducing cost per token.

The model itself — a deep neural network with billions of learned parameters, organized into stacked transformer blocks. Everything it knows is encoded in those weights. At inference time, weights are frozen; the model only predicts.

Tokenizer (BPE)

Text is broken into subword pieces before entering the model. "Unbelievable" becomes three tokens. A vocabulary of ~50K–100K pieces covers virtually any language or code.

Transformer blocks

The core repeating unit. Each block runs self-attention (every token looks at every other token) followed by a feed-forward network. Models stack dozens to hundreds of these.

Multi-head attention

The mechanism that lets tokens relate to each other across the full context window. Multiple "heads" run in parallel, each learning different relationship patterns.

Output softmax

The final layer converts raw scores into a probability distribution over all vocabulary tokens. The highest-probability token (adjusted for temperature) becomes the next word.

Training happens in two phases: pre-training (learning language from trillions of tokens over months of GPU compute) and alignment fine-tuning (teaching the model to be helpful and safe). By the time you use a model, this phase is long finished.

Pre-training

The model reads trillions of tokens from the web, books, and code. For each token, it predicts the next one, measures its error (loss), and adjusts weights via backpropagation. Repeated billions of times.

RLHF

Reinforcement Learning from Human Feedback. Human raters rank model outputs → a reward model learns those preferences → the LLM is fine-tuned to score higher on the reward model.

Constitutional AI

Anthropic's method: the model critiques its own outputs against a set of principles and revises them. Trains on the improved versions — less human labeling, more systematic value alignment.

Cost reality

Pre-training frontier models costs $50M–$500M+ in compute. This is why only a handful of organizations can do it — but application design on top of these models is accessible to everyone.

Three distinct storage needs: the raw training data (petabytes of text), the trained model weights (hundreds of gigabytes per checkpoint), and runtime data (vector embeddings, user sessions, application state).

Object storage (S3/GCS)

Cloud blob storage for training datasets and model weight checkpoints. Cheap, infinitely scalable, and the backbone of every major AI lab's data infrastructure.

Vector databases

Pinecone, Weaviate, Qdrant — specialized databases that store embeddings and enable semantic search. The data layer that powers RAG: find documents by meaning, not keyword.

Model weight checkpoints

Snapshots of model weights saved during training. A single checkpoint for a large model can be 300GB–1TB. Labs maintain many checkpoints to roll back if training goes wrong.

Application databases

SQL (Postgres, MySQL) and NoSQL (MongoDB, Supabase) for user data, conversation history, application state, and anything else your AI application needs to remember.

The physical layer everything runs on. AI training and inference are fundamentally matrix multiplication problems — GPUs excel at this because they have thousands of parallel cores designed for exactly that math.

NVIDIA H100 GPUs

The current standard for AI training. Each H100 has 80GB of HBM3 memory and ~3,000 teraflops of tensor performance. Training clusters link thousands of them together.

Google TPUs

Custom chips designed specifically for tensor operations. Google trains Gemini on its own TPU pods — gives them control over the full stack from silicon to model.

High-speed interconnects

NVLink (GPU-to-GPU on one server) and InfiniBand (server-to-server) allow the cluster to act as one giant computer. Bandwidth between GPUs is as important as GPU speed itself.

Power & cooling

A single H100 draws ~700W. A training cluster of 10,000 GPUs needs ~7 megawatts — roughly the power of a small town. Liquid cooling is now standard at this scale.