Deep Learning Notes
Fundamentals of Neural Networks
Key concepts
- Biological neuron: dendrites receive signals, soma processes, axon sends output, synapse acts as the connection.
- Artificial neuron: receives inputs, applies weights, computes weighted sum plus bias, then passes through an activation function.
- McCulloch-Pitts neuron: first computational neuron model using binary inputs and threshold logic.
- Perceptron: extends MCP by introducing learnable weights and bias for linearly separable classification.
- Three classes of deep learning: generative, discriminative, and hybrid architectures.
McCulloch-Pitts model: capabilities and limitations
- Can simulate basic logic gates such as AND, OR, NOT, NOR, and NAND (sketched in code after this list).
- Shows that networks of simple neurons can perform logical computation.
- Uses binary inputs and hard thresholding.
- Cannot solve non-linearly separable problems such as XOR.
- Has no learning capability; weights and threshold are fixed manually.
- Limited to binary input-output behavior, so it cannot naturally model real-valued data.
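A quick way to see the threshold logic in action is to code a McCulloch-Pitts unit directly. A minimal sketch (the weights and thresholds below are hand-chosen for illustration, not part of the notes):

```python
def mcp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fires (1) iff the weighted sum meets the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# AND gate: fires only when both inputs are 1 (unit weights, threshold 2)
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", mcp_neuron([a, b], [1, 1], threshold=2))
# OR is the same unit with threshold 1; NOT uses weight -1 and threshold 0
```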
Biological neuron vs artificial neuron
| Biological | Artificial |
|---|---|
| Dendrites receive impulses | Input features enter the model |
| Synapse controls signal strength | Weight controls feature importance |
| Soma integrates signals | Summation computes weighted input |
| Axon sends spike | Activation produces output |
| Learning through synaptic plasticity | Learning through weight updates |
Perceptron learning rule
Weight update: wᵢ(new) = wᵢ(old) + C · e · xᵢ
Bias update: b(new) = b(old) + C · e
Here C is the learning rate and e = (target − output) is the classification error. The perceptron repeatedly computes its output, compares it with the desired label, and updates the weights until the classification error becomes zero or sufficiently small.
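A minimal sketch of this update rule in Python (the toy AND dataset and the learning rate 0.1 are illustrative assumptions):

```python
import numpy as np

# Toy linearly separable data: the AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights
b = 0.0           # bias
C = 0.1           # learning rate

for epoch in range(20):
    errors = 0
    for xi, target in zip(X, y):
        output = 1 if xi @ w + b >= 0 else 0   # hard-threshold prediction
        e = target - output                    # error term e from the rule above
        w += C * e * xi                        # w_i(new) = w_i(old) + C·e·x_i
        b += C * e                             # b(new)  = b(old)  + C·e
        errors += int(e != 0)
    if errors == 0:                            # stop once perfectly classified
        break
```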
Training, Optimization and Regularization
Activation functions at a glance
| Function | Range | Best use | Main issue / benefit |
|---|---|---|---|
| Linear | (-∞, +∞) | Regression output | No non-linearity |
| Sigmoid | (0, 1) | Binary classification output | Vanishing gradients |
| Tanh | (-1, 1) | Zero-centered hidden activations | Still suffers vanishing gradients |
| ReLU | [0, ∞) | Hidden layers | Fast and reduces vanishing gradient |
| Leaky ReLU | (-∞, ∞) | Hidden layers | Fixes dying ReLU issue |
| Softmax | (0, 1), outputs sum to 1 | Multi-class output | Ideal for class probabilities |
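A minimal numpy sketch of these functions (the leaky-ReLU slope 0.01 is a common default, assumed here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                      # zero-centered, (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # [0, inf)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope avoids dead units

def softmax(x):
    z = np.exp(x - np.max(x))              # subtract max for numerical stability
    return z / z.sum()                     # probabilities summing to 1
```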
Choosing output layer and loss function
| Problem | Output activation | Loss |
|---|---|---|
| Regression | Linear | MSE / MAE |
| Binary classification | Sigmoid | Binary cross-entropy |
| Multi-class classification | Softmax | Categorical cross-entropy |
| Multi-label classification | Sigmoid per class | Binary cross-entropy |
MSE vs Cross-Entropy
- MSE: average squared difference between actual and predicted values. Used mainly in regression.
- Cross-Entropy: measures mismatch between predicted probabilities and true labels. Used in classification.
- Cross-entropy heavily penalizes confident wrong predictions, which is why it fits classification better.
Binary CE = -[ y log(p) + (1-y) log(1-p) ]
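To see the heavy penalty on confident wrong predictions, compare the loss for two wrong answers of different confidence (the probabilities below are illustrative):

```python
import math

def binary_ce(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(binary_ce(1, 0.4))    # mildly wrong            -> ~0.92
print(binary_ce(1, 0.01))   # confidently wrong       -> ~4.61
# MSE for the same cases: (1-0.4)^2 = 0.36 vs (1-0.01)^2 = 0.98 — far gentler
```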
Optimizers: SGD, Momentum, Adam
- SGD: updates parameters using one sample or mini-batch. Fast but noisy.
- Momentum: adds past update direction to reduce oscillation and speed convergence.
- Adam: combines momentum with adaptive learning rates per parameter, usually stable and fast.
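The three update rules side by side, as a minimal numpy sketch (the hyperparameter values are common defaults assumed here, not from the notes):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad                    # plain gradient step

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad                     # velocity accumulates past gradients
    return w - lr * v, v

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad**2         # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)                 # bias correction for early steps
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```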
Regularization methods
- L1: encourages sparse weights.
- L2 / weight decay: shrinks weights to small values.
- Dropout: randomly drops neurons during training to reduce co-adaptation (see the sketch after this list).
- Early stopping: stops training when validation performance stops improving.
- Batch normalization: stabilizes learning and also provides mild regularization.
- Data augmentation: enlarges dataset using label-preserving transformations.
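Dropout is easy to state precisely in code. A minimal sketch of inverted dropout (the keep-probability 0.8 is an illustrative choice):

```python
import numpy as np

def dropout(activations, keep_prob=0.8, training=True):
    if not training:
        return activations                  # no dropout at inference time
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob   # rescale so the expected value is unchanged
```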
High-yield points
- Learning rate is one of the most important hyperparameters.
- Too few epochs cause underfitting; too many cause overfitting.
- Batch size affects stability and speed of optimization.
- Bias-variance trade-off explains underfitting vs overfitting.
Exam-ready contrast: Momentum vs Adam
| Momentum | Adam |
|---|---|
| Uses velocity from past gradients | Uses first and second moments of gradients |
| Single global learning rate | Adaptive learning rate per parameter |
| Simpler, cheaper | Usually faster and more robust |
| Good upgrade over SGD | Common default for deep learning |
Autoencoders
Core idea
- An autoencoder is an unsupervised or self-supervised neural network that learns to reconstruct its own input.
- It contains three parts: encoder, bottleneck / latent code, and decoder.
- The network is trained to minimize reconstruction loss, usually MSE or L1 loss.
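A minimal sketch of the three parts in PyTorch (the framework, layer sizes, and code dimension are illustrative assumptions; the notes do not prescribe them):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):   # e.g. flattened 28x28 images
        super().__init__()
        self.encoder = nn.Sequential(              # compress input down to the code
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim))              # bottleneck / latent code
        self.decoder = nn.Sequential(              # reconstruct input from the code
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
loss_fn = nn.MSELoss()                             # reconstruction loss
x = torch.rand(16, 784)                            # dummy batch
loss = loss_fn(model(x), x)                        # the target is the input itself
```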
Types of autoencoders
- Basic / Linear Autoencoder: simple compression and reconstruction.
- Undercomplete: code dimension smaller than input, so it must learn salient features.
- Overcomplete: code dimension larger than input; may simply copy input unless regularized.
- Sparse: enforces that only a few hidden neurons are active.
- Denoising: takes noisy input and reconstructs clean output.
- Contractive: learns representations robust to small input changes.
Applications of autoencoders
- Dimensionality reduction and visualization
- Image compression
- Anomaly detection using high reconstruction error
- Denoising and missing-value imputation
- Feature extraction and similarity search
- Image inpainting, colorization, super-resolution
- Recommendation systems
Autoencoder vs PCA
| PCA | Autoencoder |
|---|---|
| Linear dimensionality reduction | Can learn non-linear representations |
| Closed-form mathematical solution | Trained using backpropagation |
| Limited to linear patterns | Captures more complex structures |
Convolutional Neural Networks (CNN)
CNN pipeline
- Convolution: filter slides over image to extract local patterns.
- Feature map: output of a filter after convolution.
- Stride: number of pixels the filter moves at each step.
- Padding: extra pixels added around the input, often zeros, to preserve edge information or output size.
- Pooling: downsampling operation, commonly max pooling.
Output size formula
Output = (N − F + 2P) / S + 1
- N = input size (height or width)
- F = filter size
- P = padding
- S = stride
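A quick worked check of the formula (the 32×32 input with a 3×3 filter is an illustrative example):

```python
def conv_output_size(N, F, P, S):
    return (N - F + 2 * P) // S + 1

print(conv_output_size(N=32, F=3, P=1, S=1))   # 32: "same" padding preserves size
print(conv_output_size(N=32, F=3, P=0, S=1))   # 30: valid convolution shrinks it
print(conv_output_size(N=32, F=2, P=0, S=2))   # 16: stride 2 halves the size
```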
Why pooling is used
- Reduces spatial dimensions and computation
- Helps retain important features
- Adds some translation invariance
- Can reduce overfitting
CNN vs fully connected network
| CNN | Fully connected NN |
|---|---|
| Uses local receptive fields | Every neuron connects to all inputs |
| Parameter sharing through filters | No weight sharing |
| Efficient for images | Too many parameters for images |
| Learns spatial features | Ignores spatial locality |
Exam tip: parameter sharing
The same filter weights are reused across the whole image. This drastically reduces the number of learnable parameters compared with a dense layer, while also making the model good at detecting the same feature anywhere in the image.
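A quick parameter count makes the saving concrete (the 32×32×3 input and 3×3 filter are illustrative numbers):

```python
# One 3x3 conv filter over a 32x32 RGB image vs a dense layer with one neuron per pixel
conv_params = 3 * 3 * 3 + 1                     # 3x3 kernel x 3 channels + bias = 28
dense_params = (32 * 32 * 3 + 1) * (32 * 32)    # every output unit sees every input
print(conv_params)                              # 28
print(dense_params)                             # 3,146,752
```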
Typical use cases
- Image classification
- Object detection
- OCR and handwriting recognition
- Medical image analysis
Recurrent Neural Networks (RNN)
Why RNN is needed
- Traditional feedforward and CNN models assume fixed-size input and no memory of previous steps.
- Sequence learning problems require output at time t to depend on previous inputs or outputs.
- RNN introduces a hidden state that acts like memory.
Unfolding computational graphs
RNN can be unrolled across time steps. Each time step behaves like a repeated layer that shares the same parameters. This makes sequential processing visible and trainable with backpropagation through time.
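A minimal numpy sketch of this shared-parameter unrolling (the sizes and the vanilla-RNN update h_t = tanh(W_xh·x_t + W_hh·h_{t−1} + b) are standard assumptions, not spelled out in the notes):

```python
import numpy as np

input_dim, hidden_dim, T = 4, 8, 5
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1    # shared across all steps
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

xs = rng.normal(size=(T, input_dim))    # a length-T input sequence
h = np.zeros(hidden_dim)                # initial hidden state ("memory")
for t in range(T):                      # the unrolled computation:
    h = np.tanh(W_xh @ xs[t] + W_hh @ h + b)   # same weights at every time step
```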
BPTT and gradient issues
- BPTT: backpropagation is applied through all previous time steps.
- Vanishing gradient: gradients become tiny, so the model forgets long-term dependencies.
- Exploding gradient: gradients grow too large, making training unstable.
- These problems motivate improved architectures such as LSTM and GRU.
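Exploding gradients in particular are commonly handled with gradient clipping. A minimal sketch of clipping by global norm (the threshold 5.0 is an illustrative choice):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    total = np.sqrt(sum(np.sum(g**2) for g in grads))     # global gradient norm
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]   # rescale all gradients
    return grads
```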
Bidirectional RNN
Uses one RNN moving forward and another moving backward through the sequence. This helps the model use both past and future context, which is useful in tasks like speech recognition and language understanding.
LSTM in one view
- Forget gate: decides what old information to discard.
- Input gate: decides what new information to store.
- Cell state: carries long-term memory.
- Output gate: decides what to expose as the hidden state.
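A minimal numpy sketch of one LSTM step showing how the gates combine (the stacked weight layout follows the standard LSTM equations, an assumption; the notes list only the gates):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W, U, b hold stacked parameters for all four gates."""
    z = W @ x + U @ h + b                          # all gate pre-activations at once
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    g = np.tanh(g)                                 # candidate new information
    c = f * c + i * g                              # cell state: discard old + store new
    h = o * np.tanh(c)                             # expose part of memory as hidden state
    return h, c

H, X = 8, 4
rng = np.random.default_rng(0)
h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H),
                 rng.normal(size=(4*H, X)), rng.normal(size=(4*H, H)), np.zeros(4*H))
```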
When to choose which model
| Model | Use |
|---|---|
| Simple RNN | Short sequences, simple dependencies |
| Bidirectional RNN | When both past and future context matter |
| LSTM | Long-term dependencies and better memory control |
| GRU | Good real-time choice with fewer gates and faster computation |
Recent Trends and Applications
GAN basics
- GAN stands for Generative Adversarial Network.
- It contains two networks: Generator (G) and Discriminator (D).
- The generator creates fake samples from random noise.
- The discriminator tries to distinguish real data from generated data.
- Training is adversarial: both models improve by competing with each other.
GAN working in short
- Feed real data to discriminator.
- Feed generated fake data to discriminator.
- Update discriminator to classify correctly.
- Update generator to fool discriminator.
- Repeat until generated samples look realistic.
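The loop above as a minimal PyTorch sketch (the network sizes, optimizers, toy data, and the non-saturating generator loss are illustrative assumptions):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # noise -> fake sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))    # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(32, 2) + 3.0           # stand-in "real" data distribution
    fake = G(torch.randn(32, 16))             # generator maps noise to fake samples

    # Steps 1-3: update discriminator to classify real vs fake correctly
    d_loss = (bce(D(real), torch.ones(32, 1)) +
              bce(D(fake.detach()), torch.zeros(32, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step 4: update generator to fool the discriminator (label fakes as real)
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```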
Applications
- Image generation
- DeepFake and media synthesis
- Image-to-image translation
- Super-resolution
- Art and design generation
Deep learning application mapping
| Task | Suitable technique | Reason |
|---|---|---|
| Image classification | CNN | Learns spatial features efficiently |
| Text prediction | RNN / LSTM / GRU | Handles sequential context |
| Fake image generation | GAN | Generative adversarial learning |
| Fake image detection | CNN | Learns artifacts and visual inconsistencies |