Deep Learning Notes, converted into an interactive HTML study sheet

Module 1

Fundamentals of Neural Networks

Biological neuron, McCulloch-Pitts neuron, perceptron, deep learning classes.

Key concepts

  • Biological neuron: dendrites receive signals, soma processes, axon sends output, synapse acts as the connection.
  • Artificial neuron: receives inputs, applies weights, computes weighted sum plus bias, then passes through an activation function.
  • McCulloch-Pitts neuron: first computational neuron model using binary inputs and threshold logic.
  • Perceptron: extends MCP by introducing learnable weights and bias for linearly separable classification.
  • Three classes of deep learning: generative, discriminative, and hybrid architectures.
Perceptron output: y = f( Σ(wᵢxᵢ) + b )
McCulloch-Pitts model: capabilities and limitations
  • Can simulate basic logic gates such as AND, OR, NOT, NOR, and NAND.
  • Shows that networks of simple neurons can perform logical computation.
  • Uses binary inputs and hard thresholding.
  • Cannot solve non-linearly separable problems such as XOR.
  • Has no learning capability; weights and threshold are fixed manually.
  • Limited to binary input-output behavior, so it cannot naturally model real-valued data.
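A tiny illustration of this fixed-threshold logic in NumPy; the weights and thresholds for AND and OR are standard textbook choices, not values from the notes:

import numpy as np

def mcp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: fire (1) iff the weighted sum reaches the threshold."""
    return 1 if np.dot(inputs, weights) >= threshold else 0

# AND gate: both inputs are needed to reach threshold 2
print([mcp_neuron([a, b], [1, 1], 2) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
# OR gate: any single input reaches threshold 1
print([mcp_neuron([a, b], [1, 1], 1) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 1, 1]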
Biological neuron vs artificial neuron
Biological | Artificial
Dendrites receive impulses | Input features enter the model
Synapse controls signal strength | Weight controls feature importance
Soma integrates signals | Summation computes weighted input
Axon sends spike | Activation produces output
Learning through synaptic plasticity | Learning through weight updates
Perceptron learning rule
Error: e = d - y   (d = desired output, y = actual output)
Weight update: wᵢ(new) = wᵢ(old) + C · e · xᵢ
Bias update: b(new) = b(old) + C · e   (C = learning rate)

The perceptron repeatedly computes output, compares it with the desired label, then updates weights until classification error becomes zero or sufficiently small.
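A minimal NumPy sketch of that loop; the 0/1 step activation, learning rate C = 0.1, and the AND-gate data are illustrative assumptions:

import numpy as np

def train_perceptron(X, d, C=0.1, epochs=20):
    """Perceptron learning rule: w <- w + C*e*x, b <- b + C*e."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = 1 if np.dot(w, x) + b > 0 else 0   # step activation
            e = target - y                          # error e = d - y
            w += C * e * x                          # weight update
            b += C * e                              # bias update
    return w, b

# Example: learn the AND gate (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, d)
print(w, b)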

Mini flashcards

Why is perceptron better than MCP?
Because perceptron has learnable weights and bias, so it can learn decision boundaries from data.
Main limitation of MCP neuron?
It cannot learn and cannot solve non-linearly separable problems such as XOR.
Module 2

Training, Optimization and Regularization

Activation functions, loss functions, gradient descent, optimizers, overfitting control.

Activation functions at a glance

Function | Range | Best use | Main issue / benefit
Linear | (-∞, +∞) | Regression output | No non-linearity
Sigmoid | (0, 1) | Binary classification output | Vanishing gradients
Tanh | (-1, 1) | Zero-centered hidden activations | Still suffers vanishing gradients
ReLU | [0, ∞) | Hidden layers | Fast and reduces vanishing gradient
Leaky ReLU | (-∞, ∞) | Hidden layers | Fixes dying ReLU issue
Softmax | (0, 1), outputs sum to 1 | Multi-class output | Ideal for class probabilities
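As a quick reference, a NumPy sketch of these activations (definitions only; the leaky-ReLU slope of 0.01 is an assumed common default):

import numpy as np

def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    z = np.exp(x - np.max(x))   # subtract max for numerical stability
    return z / z.sum()          # outputs sum to 1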
Choosing output layer and loss function
Problem | Output activation | Loss
Regression | Linear | MSE / MAE
Binary classification | Sigmoid | Binary cross-entropy
Multi-class classification | Softmax | Categorical cross-entropy
Multi-label classification | Sigmoid per class | Binary cross-entropy
MSE vs Cross-Entropy
  • MSE: average squared difference between actual and predicted values. Used mainly in regression.
  • Cross-Entropy: measures mismatch between predicted probabilities and true labels. Used in classification.
  • Cross-entropy heavily penalizes confident wrong predictions, which is why it fits classification better.
MSE = (1/n) Σ(y - ŷ)²
Binary CE = -[ y log(p) + (1-y) log(1-p) ]
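A short NumPy check of both formulas; the epsilon clip is an assumption added only to avoid log(0):

import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Cross-entropy punishes a confident wrong prediction much harder
y = np.array([1.0])
print(binary_cross_entropy(y, np.array([0.9])))    # small loss
print(binary_cross_entropy(y, np.array([0.01])))   # very large loss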
Optimizers: SGD, Momentum, Adam
  • SGD: updates parameters using one sample or mini-batch. Fast but noisy.
  • Momentum: adds past update direction to reduce oscillation and speed convergence.
  • Adam: combines momentum with adaptive learning rates per parameter, usually stable and fast.
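The three update rules side by side, as a minimal NumPy sketch; the hyperparameter values are typical defaults, not prescribed by the notes:

import numpy as np

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad                      # accumulate past update direction
    return w - lr * v, v

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad             # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2        # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)                # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v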
Regularization methods
  • L1: encourages sparse weights.
  • L2 / weight decay: shrinks weights to small values.
  • Dropout: randomly drops neurons during training to reduce co-adaptation.
  • Early stopping: stops training when validation performance stops improving.
  • Batch normalization: stabilizes learning and also provides mild regularization.
  • Data augmentation: enlarges dataset using label-preserving transformations.
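Of the methods above, dropout is the simplest to show in code; a minimal inverted-dropout sketch, where the keep probability of 0.8 is an assumed example value:

import numpy as np

def dropout(activations, keep_prob=0.8, training=True):
    """Inverted dropout: zero random units during training and rescale the rest,
    so the expected activation is unchanged and nothing is needed at test time."""
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < keep_prob)
    return activations * mask / keep_prob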

High-yield points

  • Learning rate is one of the most important hyperparameters.
  • Too few epochs cause underfitting; too many cause overfitting.
  • Batch size affects stability and speed of optimization.
  • Bias-variance trade-off explains underfitting vs overfitting.
Exam-ready contrast: Momentum vs Adam
Momentum | Adam
Uses velocity from past gradients | Uses first and second moments of gradients
Single global learning rate | Adaptive learning rate per parameter
Simpler, cheaper | Usually faster and more robust
Good upgrade over SGD | Common default for deep learning
Module 3

Autoencoders

Encoder-decoder model for compression, reconstruction, denoising, feature learning.

Core idea

  • An autoencoder is an unsupervised or self-supervised neural network that learns to reconstruct its own input.
  • It contains three parts: encoder, bottleneck / latent code, and decoder.
  • The network is trained to minimize reconstruction loss, usually MSE or L1 loss.
Input → Encoder → Latent Space (Code) → Decoder → Reconstructed Output
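A minimal Keras sketch of this encoder-bottleneck-decoder structure, assuming flattened 784-dimensional inputs (e.g. MNIST) and a 32-dimensional latent code; the layer sizes are illustrative:

from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_dim = 784, 32   # assumed sizes for illustration

autoencoder = keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),            # encoder
    layers.Dense(code_dim, activation="relu"),       # bottleneck / latent code
    layers.Dense(128, activation="relu"),            # decoder
    layers.Dense(input_dim, activation="sigmoid"),   # reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")    # reconstruction loss
# autoencoder.fit(x_train, x_train, epochs=10)       # input is also the target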
Types of autoencoders
  • Basic / Linear Autoencoder: simple compression and reconstruction.
  • Undercomplete: code dimension smaller than input, so it must learn salient features.
  • Overcomplete: code dimension larger than input; may simply copy input unless regularized.
  • Sparse: enforces that only a few hidden neurons are active.
  • Denoising: takes noisy input and reconstructs clean output.
  • Contractive: learns representations robust to small input changes.
Applications of autoencoders
  • Dimensionality reduction and visualization
  • Image compression
  • Anomaly detection using high reconstruction error
  • Denoising and missing-value imputation
  • Feature extraction and similarity search
  • Image inpainting, colorization, super-resolution
  • Recommendation systems
Autoencoder vs PCA
PCA | Autoencoder
Linear dimensionality reduction | Can learn non-linear representations
Closed-form mathematical solution | Trained using backpropagation
Limited to linear patterns | Captures more complex structures

Mini flashcards

Why is the bottleneck important?
It forces the network to compress information and learn the most useful features instead of memorizing the input.
How does anomaly detection work with autoencoders?
Train on normal data only; abnormal inputs reconstruct poorly, so high reconstruction error flags anomalies.
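In code this is just thresholding the per-sample reconstruction error; a hedged sketch reusing the autoencoder from the Core idea section, where the 95th-percentile threshold is an assumed choice:

import numpy as np

def anomaly_scores(autoencoder, x):
    x_hat = autoencoder.predict(x)
    return np.mean((x - x_hat) ** 2, axis=1)   # per-sample reconstruction error

# Fit a threshold on normal data only, then flag anything above it
# threshold = np.percentile(anomaly_scores(autoencoder, x_normal), 95)
# is_anomaly = anomaly_scores(autoencoder, x_new) > threshold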
Module 4

Convolutional Neural Networks (CNN)

Convolution, stride, padding, pooling, feature maps, image classification pipeline.

CNN pipeline

Input → Convolution → ReLU → Pooling → (repeat) → Flatten → Fully Connected → Softmax
  • Convolution: filter slides over image to extract local patterns.
  • Feature map: output of a filter after convolution.
  • Stride: number of pixels the filter moves at each step.
  • Padding: extra pixels added around the input, often zeros, to preserve edge information or output size.
  • Pooling: downsampling operation, commonly max pooling.
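A compact Keras sketch of this pipeline, assuming 32×32 RGB inputs and 10 output classes; filter counts and layer sizes are illustrative:

from tensorflow import keras
from tensorflow.keras import layers

cnn = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),                                    # assumed input size
    layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),  # convolution + ReLU
    layers.MaxPooling2D(pool_size=2),                                   # pooling
    layers.Conv2D(64, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),                               # fully connected
    layers.Dense(10, activation="softmax"),                             # class probabilities
])
cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])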
Output size formula
Output size = ((N - F + 2P) / S) + 1
  • N = input size
  • F = filter size
  • P = padding
  • S = stride
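A quick worked check of the formula; the 32×32 input and filter sizes below are just example values:

def conv_output_size(N, F, P, S):
    return (N - F + 2 * P) // S + 1

print(conv_output_size(N=32, F=5, P=0, S=1))  # 28: valid convolution shrinks the map
print(conv_output_size(N=32, F=5, P=2, S=1))  # 32: padding preserves the size
print(conv_output_size(N=32, F=2, P=0, S=2))  # 16: stride 2 halves the size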
Why pooling is used
  • Reduces spatial dimensions and computation
  • Helps retain important features
  • Adds some translation invariance
  • Can reduce overfitting
CNN vs fully connected network
CNN | Fully connected NN
Uses local receptive fields | Every neuron connects to all inputs
Parameter sharing through filters | No weight sharing
Efficient for images | Too many parameters for images
Learns spatial features | Ignores spatial locality
Exam tip: parameter sharing

The same filter weights are reused across the whole image. This drastically reduces the number of learnable parameters compared with a dense layer, while also making the model good at detecting the same feature anywhere in the image.

Typical use cases
  • Image classification
  • Object detection
  • OCR and handwriting recognition
  • Medical image analysis
Module 5

Recurrent Neural Networks (RNN)

Sequence learning, hidden state, BPTT, vanishing gradient, bidirectional RNN, LSTM, GRU.

Why RNN is needed

  • Traditional feedforward and CNN models assume fixed-size input and no memory of previous steps.
  • Sequence learning problems require output at time t to depend on previous inputs or outputs.
  • RNN introduces a hidden state that acts like memory.
xₜ + hₜ₋₁ → hₜ → yₜ
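One recurrence step of a simple RNN in NumPy, matching the flow above; the tanh activation and weight names are conventional assumptions:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """Single time step: new hidden state from the current input and the previous state."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # hidden state acts as memory
    y_t = W_hy @ h_t + b_y                            # output at time t
    return h_t, y_t

# The same weights are reused at every time step (parameter sharing across time)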
Unfolding computational graphs

RNN can be unrolled across time steps. Each time step behaves like a repeated layer that shares the same parameters. This makes sequential processing visible and trainable with backpropagation through time.

BPTT and gradient issues
  • BPTT: backpropagation is applied through all previous time steps.
  • Vanishing gradient: gradients become tiny, so the model forgets long-term dependencies.
  • Exploding gradient: gradients grow too large, making training unstable.
  • These problems motivate improved architectures such as LSTM and GRU.
Bidirectional RNN

Uses one RNN moving forward and another moving backward through the sequence. This helps the model use both past and future context, which is useful in tasks like speech recognition and language understanding.
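In Keras this is a one-line wrapper around a recurrent layer; the layer sizes and binary output below are illustrative assumptions:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(None, 16)),            # variable-length sequences of 16-dim features
    layers.Bidirectional(layers.LSTM(64)),     # forward and backward passes, outputs concatenated
    layers.Dense(1, activation="sigmoid"),
])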

LSTM in one view
  • Forget gate: decides what old information to discard.
  • Input gate: decides what new information to store.
  • Cell state: carries long-term memory.
  • Output gate: decides what to expose as the hidden state.
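The four bullets map onto the standard LSTM equations; a NumPy sketch of one cell step, where the per-gate weight dictionaries are an assumption made for readability:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b are dicts keyed by gate: 'f', 'i', 'g', 'o'."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate: discard old info
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate: admit new info
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate memory content
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate: what to expose
    c_t = f * c_prev + i * g                               # cell state: long-term memory
    h_t = o * np.tanh(c_t)                                 # hidden state passed onward
    return h_t, c_t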
When to choose which model
Model | Use
Simple RNN | Short sequences, simple dependencies
Bidirectional RNN | When both past and future context matter
LSTM | Long-term dependencies and better memory control
GRU | Good real-time choice with fewer gates and faster computation
Module 6

Recent Trends and Applications

GAN architecture, adversarial training, image generation, deep learning applications.

GAN basics

  • GAN stands for Generative Adversarial Network.
  • It contains two networks: Generator (G) and Discriminator (D).
  • The generator creates fake samples from random noise.
  • The discriminator tries to distinguish real data from generated data.
  • Training is adversarial: both models improve by competing with each other.
Noise z → Generator → Fake sample → Discriminator → Real / Fake decision
GAN working in short
  1. Feed real data to discriminator.
  2. Feed generated fake data to discriminator.
  3. Update discriminator to classify correctly.
  4. Update generator to fool discriminator.
  5. Repeat until generated samples look realistic.
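A hedged Keras sketch of this five-step loop; the 100-dimensional noise, flattened 784-dimensional samples, and layer sizes are illustrative assumptions rather than a specific architecture from the notes:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

noise_dim, data_dim = 100, 784   # assumed sizes

generator = keras.Sequential([
    layers.Input(shape=(noise_dim,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(data_dim, activation="sigmoid"),      # fake sample from noise
])
discriminator = keras.Sequential([
    layers.Input(shape=(data_dim,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),             # real / fake decision
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

discriminator.trainable = False                        # freeze D while training G
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_batch, batch_size=64):
    noise = np.random.normal(size=(batch_size, noise_dim))
    fake_batch = generator.predict(noise, verbose=0)
    # Steps 1-3: train discriminator on real (label 1) and fake (label 0) data
    discriminator.train_on_batch(real_batch, np.ones((batch_size, 1)))
    discriminator.train_on_batch(fake_batch, np.zeros((batch_size, 1)))
    # Step 4: train generator to make the discriminator say "real"
    gan.train_on_batch(noise, np.ones((batch_size, 1)))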
Applications
  • Image generation
  • DeepFake and media synthesis
  • Image-to-image translation
  • Super-resolution
  • Art and design generation

Deep learning application mapping

Task | Suitable technique | Reason
Image classification | CNN | Learns spatial features efficiently
Text prediction | RNN / LSTM / GRU | Handles sequential context
Fake image generation | GAN | Generative adversarial learning
Fake image detection | CNN | Learns artifacts and visual inconsistencies