Python 3.11+ Deep Learning LLMs · Agents · CV · RL
📊Core sklearn Pipeline
sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf",    GradientBoostingClassifier(
                  n_estimators=200, learning_rate=0.05
               ))
])
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
🔍Cross-Validation & Grid Search
sklearn
from sklearn.model_selection import GridSearchCV

params = {
    "clf__n_estimators":   [100, 200, 300],
    "clf__learning_rate": [0.01, 0.05, 0.1],
    "clf__max_depth":      [3, 5, 7],
}
gs = GridSearchCV(pipe, params, cv=5, scoring="f1_macro", n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
📈Evaluation Metrics
metrics
Metric | Function | Use When
Accuracy | accuracy_score | Balanced classes
F1 Score | f1_score | Imbalanced classes
ROC-AUC | roc_auc_score | Ranking quality
MSE/RMSE | mean_squared_error | Regression
R² | r2_score | Regression fit
Log Loss | log_loss | Probability output
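To see exactly what accuracy and F1 measure, here is a minimal pure-Python sketch computing both from confusion counts on a toy binary task (labels are invented for illustration) — in practice use the sklearn functions above.

```python
# Accuracy and F1 by hand, binary labels only (toy sketch).
def confusion(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def f1(y_true, y_pred):
    tp, fp, fn, _ = confusion(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc, f1(y_true, y_pred))  # 0.75 0.75
```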
🌲Algorithm Quick Reference
algos
Algorithm | Best For | Key Params
LinearRegression | Continuous, linear | fit_intercept
LogisticRegression | Binary/multi clf | C, solver
RandomForest | Tabular, robust | n_estimators, max_depth
XGBoost | Kaggle, tabular | lr, subsample, colsample
SVM | High-dim, small data | C, kernel, gamma
KNN | Simple baseline | n_neighbors, metric
K-Means | Clustering | n_clusters, init
DBSCAN | Density clustering | eps, min_samples
⚖️Bias-Variance Tradeoff
theory
Error = Bias² + Variance + Irreducible Noise
High Bias | Underfitting — model too simple. Fix: more features, complex model
High Variance | Overfitting — memorizing noise. Fix: regularization, more data, pruning
Regularization | L1 (Lasso) → sparsity. L2 (Ridge) → shrinkage. ElasticNet → both
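A one-step sketch of why L1 produces sparsity while L2 only shrinks (toy weights and penalty strength are invented): L2 decay multiplies a weight toward zero, whereas the L1 proximal step soft-thresholds, snapping small weights to exactly zero.

```python
# One update step under each penalty (toy values).
def l2_step(w, lam, lr=1.0):
    # weight decay: w ← w·(1 − lr·lam) — shrinks, never reaches 0
    return w * (1 - lr * lam)

def l1_step(w, lam, lr=1.0):
    # soft-thresholding: |w| ≤ lr·lam snaps exactly to 0 → sparsity
    return max(abs(w) - lr * lam, 0.0) * (1 if w > 0 else -1)

weights = [0.05, -0.3, 2.0]
print([l2_step(w, 0.1) for w in weights])  # all stay nonzero
print([l1_step(w, 0.1) for w in weights])  # 0.05 becomes exactly 0.0
```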
🧹Feature Engineering
features
import numpy as np
import pandas as pd
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler,
    LabelEncoder, OneHotEncoder
)

df["log_feat"] = np.log1p(df["skewed_col"])
df["interact"] = df["a"] * df["b"]
df["binned"]  = pd.cut(df["age"], bins=5, labels=False)
df = pd.get_dummies(df, columns=["category"], drop_first=True)
🔀Train/Val/Test Split Strategy
data
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, TimeSeriesSplit
)
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tv, y_tv, test_size=0.18, stratify=y_tv
)
For time-series use TimeSeriesSplit. For imbalanced use StratifiedKFold.
🗜️PCA & t-SNE
dim-red
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Dims: {X_pca.shape[1]}, var: {pca.explained_variance_ratio_.sum():.2%}")

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_pca)
🔥Model Anatomy
torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(128, 256),
            nn.LayerNorm(256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, 10)
        )
    def forward(self, x):
        return self.layers(x)
🔄Training Loop Template
torch
for epoch in range(epochs):
    model.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        pred = model(xb)
        loss = criterion(pred, yb)
        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        scheduler.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(xb.to(device)), yb.to(device))
                       for xb, yb in val_loader)
Optimizers & Schedulers
optim
Optimizer | Best For
Adam | General default, fast convergence
AdamW | Transformers, proper weight decay
SGD + momentum | Vision, fine-tuning, sharp minima
Lion | Large models, memory efficient
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
📉Loss Functions
loss
Loss | Task
CrossEntropyLoss | Multi-class classification
BCEWithLogitsLoss | Binary / multi-label
MSELoss | Regression
HuberLoss | Robust regression
NLLLoss | Log-prob outputs
CTCLoss | Sequence-to-sequence (ASR)
TripletMarginLoss | Metric learning
🏗️Activation Functions
activations
Function | Formula | Use
ReLU | max(0, x) | Hidden layers (default)
GELU | x·Φ(x) | Transformers, BERT
SiLU/Swish | x·σ(x) | LLaMA, modern nets
Sigmoid | 1/(1+e⁻ˣ) | Binary output
Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | RNNs, normalised
Softmax | eˣᵢ/Σⱼeˣⱼ | Final classification
🧱Normalization Layers
norm
BatchNorm | Normalize over batch dim. Great for CNNs. Batch-size dependent.
LayerNorm | Normalize over feature dim. Standard for Transformers & RNNs.
GroupNorm | Normalize over channel groups. Works with small batches. Good for CV.
RMSNorm | Simplified LayerNorm (no mean shift). Used in LLaMA, Mistral.
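The RMSNorm idea fits in a few lines — a pure-Python sketch (unit gain assumed; real implementations also learn a per-feature scale): divide by the root-mean-square of the features, with no mean subtraction.

```python
import math

# Minimal RMSNorm over a feature vector (gain fixed to 1 for the sketch).
def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

out = rms_norm([3.0, -4.0])
# mean of squares of the output is ≈ 1 after normalization
print(sum(v * v for v in out) / len(out))
```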
💾Save / Load / Checkpoint
torch
torch.save({
    "epoch":      epoch,
    "model":      model.state_dict(),
    "optimizer":  opt.state_dict(),
    "scheduler": sched.state_dict(),
    "loss":       loss.item(),
}, "checkpoint.pt")

ckpt = torch.load("checkpoint.pt", map_location=device)
model.load_state_dict(ckpt["model"])
🎛️Regularization Techniques
reg
Dropout · Weight Decay · Label Smoothing · Grad Clipping · Early Stopping · Data Augment · Mixup · CutMix
nn.Dropout(0.3)
nn.Dropout2d(0.2)
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
🔤Self-Attention Mechanism
theory
Attention(Q,K,V) = softmax( QKᵀ / √d_k ) · V
import torch, torch.nn.functional as F

def scaled_dot_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attn = F.softmax(scores, dim=-1)
    return attn @ V, attn
🤗HuggingFace Pipelines
transformers
from transformers import pipeline

tasks = {
  "sentiment":    pipeline("sentiment-analysis"),
  "ner":           pipeline("ner", aggregation_strategy="simple"),
  "summarize":     pipeline("summarization", model="facebook/bart-large-cnn"),
  "qa":            pipeline("question-answering"),
  "translate":     pipeline("translation_en_to_fr"),
  "zero-shot":     pipeline("zero-shot-classification"),
  "fill-mask":     pipeline("fill-mask", model="bert-base-uncased"),
  "text-gen":      pipeline("text-generation", model="gpt2"),
}
⚙️Fine-Tuning with Trainer API
transformers
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer, Trainer, TrainingArguments
)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)
args = TrainingArguments(
    output_dir="./out", num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5, weight_decay=0.01,
    evaluation_strategy="epoch", load_best_model_at_end=True
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
🧩LoRA / PEFT Fine-Tuning
PEFT
from peft import get_peft_model, LoraConfig, TaskType

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none"
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable: 0.1% of total params
🔗RAG — Retrieval Augmented Generation
RAG
Query → Embed → Vector Search → Top-K Docs → LLM + Context → Answer
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(docs, embeddings)
retriever = db.as_retriever(search_kwargs={"k": 4})
chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=retriever
)
result = chain.invoke({"query": question})
💬Prompt Engineering Patterns
prompts
Technique | When to Use
Zero-shot | Simple tasks, capable models
Few-shot | Format/style matters, examples help
Chain-of-Thought | Reasoning, math, multi-step logic
Tree-of-Thought | Complex decisions, branching
ReAct | Tool use, agents, search
Self-Consistency | Reduce variance, vote on answers
Role Prompting | Persona, domain expertise
🌡️LLM Sampling Parameters
LLM
Param | Range | Effect
temperature | 0–2 | 0 = deterministic, high = creative
top_p | 0–1 | Nucleus sampling mass
top_k | 1–∞ | Restrict to top k tokens
frequency_penalty | −2–2 | Reduce repetition
presence_penalty | −2–2 | Encourage new topics
max_tokens | 1–ctx | Output length cap
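Temperature and top-k are easy to sketch directly — a toy sampler over invented token logits (token names and values are made up; real APIs do this server-side):

```python
import math, random

# Toy temperature + top-k sampling over a logit dict.
def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    if top_k is not None:
        cutoff = sorted(logits.values(), reverse=True)[top_k - 1]
        logits = {t: l for t, l in logits.items() if l >= cutoff}
    if temperature == 0:                      # greedy: argmax
        return max(logits, key=logits.get)
    exps = {t: math.exp(l / temperature) for t, l in logits.items()}
    z = sum(exps.values())
    r, acc = rng.random() * z, 0.0
    for tok, e in exps.items():               # inverse-CDF sampling
        acc += e
        if r <= acc:
            return tok
    return tok

logits = {"the": 5.0, "a": 3.0, "banana": 0.5}
print(sample_token(logits, temperature=0))    # deterministic → "the"
```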
📐Tokenization Quick Facts
tokens
GPT-4o | ~4 chars ≈ 1 token · 128K context window
Claude 3.5 | ~3.5 chars ≈ 1 token · 200K context window
Gemini 1.5 | 1M context · video, audio, code, images
BPE | Byte-Pair Encoding — used by GPT, Llama, Mistral
WordPiece | Used by BERT, DistilBERT
SentencePiece | Used by T5, Gemma, multilingual models
🔲Conv Layer Output Size
math
W_out = ⌊(W_in − K + 2P) / S⌋ + 1
W_in | Input width/height
K | Kernel size
P | Padding (same padding: P = (K−1)/2)
S | Stride
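The formula above as a helper, using floor division as PyTorch's Conv2d does:

```python
# Spatial output size of a conv layer: floor((W_in - K + 2P) / S) + 1.
def conv_out(w_in, k, p=0, s=1):
    return (w_in - k + 2 * p) // s + 1

print(conv_out(224, k=3, p=1, s=1))  # 224 — "same" padding preserves size
print(conv_out(224, k=7, p=3, s=2))  # 112 — the ResNet stem conv
```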
🏛️CNN Architecture Families
architectures
Model | Year | Key Innovation
AlexNet | 2012 | Deep CNN on GPU, ReLU, Dropout
VGG-16/19 | 2014 | Uniform 3×3 convs, depth
ResNet-50 | 2015 | Residual skip connections
EfficientNet | 2019 | Compound scaling (W×D×R)
ViT | 2020 | Patch-based transformer
ConvNeXt | 2022 | CNN with transformer design
SAM 2 | 2024 | Segment anything, video
🖼️torchvision Transforms
torchvision
from torchvision import transforms as T

train_tfm = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
    T.Normalize([0.485,0.456,0.406],
                [0.229,0.224,0.225]),
])
val_tfm = T.Compose([
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize([0.485,0.456,0.406],
                [0.229,0.224,0.225])
])
🎯Object Detection Metrics
metrics
IoU = Area(A∩B) / Area(A∪B)
mAP@0.5 = mean AP across classes at IoU 0.5
TP | IoU ≥ threshold (typically 0.5)
FP | Prediction with IoU < threshold
FN | Ground truth not detected
NMS | Non-max suppression removes duplicate boxes
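The IoU definition above for axis-aligned boxes in (x1, y1, x2, y2) format, as a short sketch (box coordinates below are invented):

```python
# IoU = intersection area / union area for two axis-aligned boxes.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```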
🔄Transfer Learning
transfer
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

for param in model.parameters():
    param.requires_grad = False

model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 256),
    nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, num_classes)
)
opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
🌐Segmentation Methods
segmentation
Method | Type | Model
Semantic | Class per pixel | DeepLabV3, SegFormer
Instance | Individual objects | Mask R-CNN, YOLACT
Panoptic | Semantic + Instance | Panoptic-FPN, DETR
Promptable | Any object | SAM, SAM 2
🎯Bellman Equations
theory
V(s) = max_a Σ P(s'|s,a) [R(s,a,s') + γ·V(s')]
Q(s,a) = R + γ·max_a' Q(s',a') [Bellman Optimality]
TD Error = R + γ·V(s') − V(s)
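The Bellman optimality update becomes concrete on a tiny hand-made MDP (the two states, actions, and rewards below are invented for illustration) — repeatedly applying V(s) = max_a Σ P(s'|s,a)[R + γV(s')] converges to the fixed point:

```python
# Value iteration on a toy 2-state MDP.
# transitions[s][a] = list of (prob, next_state, reward)
transitions = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma, V = 0.9, {0: 0.0, 1: 0.0}
for _ in range(200):
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in acts.values())
         for s, acts in transitions.items()}
# keep choosing "stay" in state 1: V(1) → 2 / (1 − γ) = 20
print(V)
```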
🗺️RL Algorithm Map
algos
Algorithm | Type | Best For
Q-Learning | Model-free, off-policy | Discrete tabular
DQN | Deep, off-policy | Discrete, Atari
DDQN | Deep, off-policy | Overestimation fix
A3C/A2C | Policy gradient | Parallel envs
PPO | On-policy, clip | Continuous actions
SAC | Off-policy, entropy | Continuous, robust
TD3 | Deterministic PG | Robotics, continuous
🏋️Gymnasium Environment Loop
gymnasium
import gymnasium as gym

env = gym.make("CartPole-v1", render_mode="rgb_array")
obs, info = env.reset(seed=42)

for _ in range(500):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
📦Stable-Baselines3 Quick Start
SB3
from stable_baselines3 import PPO, SAC
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("LunarLander-v2", n_envs=4)

model = PPO("MlpPolicy", env,
            n_steps=2048, batch_size=64,
            learning_rate=3e-4, verbose=1)
model.learn(total_timesteps=500_000)
model.save("ppo_lunar")
🧭Exploration Strategies
exploration
ε-greedy | Random action with prob ε, decay over time
Boltzmann | Sample actions with prob ∝ exp(Q/τ)
UCB | Upper Confidence Bound — explore uncertain states
Entropy Bonus | Add entropy term to reward (SAC, A2C)
RND | Random Network Distillation — curiosity driven
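ε-greedy with exponential decay in a few lines (Q-values, decay rate, and floor are invented for the sketch):

```python
import random

# Explore with prob ε, otherwise exploit the best-known action.
def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                           # explore
    return max(range(len(q_values)), key=q_values.__getitem__)        # exploit

epsilon, decay, eps_min = 1.0, 0.995, 0.05
for step in range(1000):
    action = epsilon_greedy([0.1, 0.5, 0.2], epsilon)
    epsilon = max(eps_min, epsilon * decay)
# ε has decayed to its floor (0.05) well before 1000 steps
```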
🔁RLHF Pipeline
RLHF
Pretrain LLM → SFT → Reward Model → PPO / DPO → Aligned Model
SFT | Supervised fine-tune on demonstration data
RM | Train reward model on human preference pairs
PPO | Optimize LLM policy against reward model
DPO | Direct Preference Optimization — no explicit RM
📐Key Probability Distributions
stats
Distribution | Params | Use in AI
Gaussian N(μ,σ²) | μ, σ | Weight init, VAE latent
Bernoulli | p | Binary classification
Categorical | p₁..pₖ | Token sampling, Softmax
Dirichlet | α | Topic models, LDA
Poisson | λ | Count data modeling
Beta | α, β | Bayesian priors, Thompson sampling
Gradient Descent Variants
optimization
SGD: θ = θ − η·∇L(θ)
Momentum: v = βv + η∇L ; θ = θ − v
Adam: m̂ = m/(1−β₁ᵗ) ; v̂ = v/(1−β₂ᵗ) ; θ = θ − ηm̂/(√v̂+ε)
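A one-dimensional sketch of the SGD and momentum updates above on L(θ) = θ², where ∇L = 2θ (the learning rates and momentum coefficient are toy values):

```python
# Plain SGD: θ ← θ − η·∇L(θ)
def sgd(theta, lr=0.1, steps=200):
    for _ in range(steps):
        theta -= lr * 2 * theta
    return theta

# Momentum: v ← βv + η∇L ; θ ← θ − v
def momentum(theta, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        v = beta * v + lr * 2 * theta
        theta -= v
    return theta

print(sgd(5.0), momentum(5.0))  # both converge to ~0
```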
📏Distance & Similarity Metrics
metrics
Metric | Formula | Use
Euclidean | √Σ(aᵢ−bᵢ)² | KNN, KMeans
Cosine | a·b / (‖a‖‖b‖) | Embeddings, NLP
Manhattan | Σ|aᵢ−bᵢ| | Sparse, robust
KL Divergence | Σ P log(P/Q) | VAE, distributions
Mahalanobis | √((a−b)ᵀΣ⁻¹(a−b)) | Anomaly detection
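The Euclidean and cosine rows above, implemented directly on toy vectors:

```python
import math

# Euclidean distance: √Σ(aᵢ−bᵢ)²
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Cosine similarity: a·b / (‖a‖·‖b‖)
def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(euclidean([0, 0], [3, 4]))   # 5.0
print(cosine_sim([1, 0], [1, 1]))  # 1/√2 ≈ 0.707
```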
🧮Information Theory Essentials
info theory
Entropy: H(X) = −Σ P(x) log P(x)
Cross-Entropy: H(P,Q) = −Σ P(x) log Q(x)
KL: D_KL(P‖Q) = H(P,Q) − H(P)
Mutual Info: I(X;Y) = H(X) + H(Y) − H(X,Y)
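The identities above check out numerically on a toy pair of distributions (natural log here; any base works as long as it is consistent):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P, Q = [0.7, 0.3], [0.5, 0.5]
# D_KL(P‖Q) = H(P,Q) − H(P), and KL ≥ 0 (Gibbs' inequality)
print(kl(P, Q), cross_entropy(P, Q) - entropy(P))
```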
🔢Linear Algebra for ML
linear algebra
import numpy as np

A = np.random.randn(4, 4)
U, S, Vt = np.linalg.svd(A)
eigenvals = np.linalg.eigvals(A)
rank = np.linalg.matrix_rank(A)

b = np.random.randn(4)
x = np.linalg.solve(A, b)
inv_A = np.linalg.inv(A)
det_A = np.linalg.det(A)
norm_A = np.linalg.norm(A, ord="fro")
🔗Bayes' Theorem & MAP
Bayesian
P(θ|X) = P(X|θ) · P(θ) / P(X)
Posterior ∝ Likelihood · Prior
MAP: θ* = argmax P(X|θ) · P(θ)
MLE: θ* = argmax P(X|θ) [flat prior]
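For a Bernoulli rate with a Beta(α, β) prior, both estimators have closed forms — a standard textbook case, shown with invented coin-flip counts:

```python
# MLE: argmax P(X|θ) — just the empirical frequency.
def mle(heads, n):
    return heads / n

# MAP: argmax P(X|θ)·P(θ) — mode of the Beta posterior,
# (heads + α − 1) / (n + α + β − 2).
def map_estimate(heads, n, alpha, beta):
    return (heads + alpha - 1) / (n + alpha + beta - 2)

print(mle(9, 10))                 # 0.9 — overconfident on 10 flips
print(map_estimate(9, 10, 2, 2))  # 10/12 ≈ 0.833 — prior pulls toward 0.5
```

With a flat prior Beta(1, 1) the MAP estimate reduces to the MLE, matching the note above.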
📦Python AI Ecosystem
ecosystem
Library | Purpose | Install
numpy | Array math, linear algebra | pip install numpy
pandas | DataFrames, data wrangling | pip install pandas
torch | Deep learning framework | pip install torch
transformers | Pretrained models, NLP | pip install transformers
scikit-learn | Classical ML, preprocessing | pip install scikit-learn
xgboost | Gradient boosting | pip install xgboost
langchain | LLM pipelines, agents | pip install langchain
gymnasium | RL environments | pip install gymnasium
diffusers | Diffusion models (HF) | pip install diffusers
faiss-cpu | Vector similarity search | pip install faiss-cpu
einops | Tensor rearranging | pip install einops
accelerate | Multi-GPU, mixed precision | pip install accelerate
Accelerate & Mixed Precision
training
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)
with accelerator.autocast():
    outputs = model(batch)
    loss = criterion(outputs, targets)
accelerator.backward(loss)
optimizer.step()
📊Weights & Biases (W&B)
experiment
import wandb

wandb.init(project="my-model", config={
    "lr": 3e-4, "epochs": 20,
    "batch_size": 64, "arch": "resnet50"
})

wandb.log({"loss": loss, "acc": acc, "epoch": epoch})
wandb.watch(model, log="all", log_freq=100)
wandb.finish()
🔍FAISS Vector Search
vector-db
import faiss, numpy as np

d = 768
index = faiss.IndexFlatIP(d)
faiss.normalize_L2(vectors)
index.add(vectors)

query = embed_text("What is attention?")  # float32 array, shape (1, d)
faiss.normalize_L2(query)
D, I = index.search(query, k=5)
print(f"Top-5 ids: {I[0]}, scores: {D[0]}")
🔧GPU Memory Tips
GPU
torch.cuda.empty_cache()
torch.backends.cuda.matmul.allow_tf32 = True

model = model.half()

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(input)

torch.utils.checkpoint.checkpoint(fn, *args)
Gradient checkpointing trades compute for memory — saves ~60% VRAM at 30% slowdown.
🤖LLM API Providers
LLMs
Provider | Top Models | Context
OpenAI | GPT-4o, o3, o4-mini | 128K
Anthropic | Claude 3.5, Claude 3 Opus | 200K
Google | Gemini 2.0, Gemini 1.5 Pro | 1M
Meta (OSS) | LLaMA 3.1, 3.3 (405B) | 128K
Mistral | Mistral Large, Mixtral 8x22B | 64K
Groq | LLaMA 3 (fast inference) | 128K
🐳Docker + CUDA Setup
infra