Deep Learning Notes
Fundamentals of Neural Networks
Key concepts
- Biological neuron: dendrites receive signals, soma processes, axon sends output, synapse acts as the connection.
- Artificial neuron: receives inputs, applies weights, computes weighted sum plus bias, then passes through an activation function.
- McCulloch-Pitts neuron: first computational neuron model using binary inputs and threshold logic.
- Perceptron: extends MCP by introducing learnable weights and bias for linearly separable classification.
- Three classes of deep learning: generative, discriminative, and hybrid architectures.
McCulloch-Pitts model: capabilities and limitations
- Can simulate basic logic gates such as AND, OR, NOT, NOR, and NAND (sketched in code after this list).
- Shows that networks of simple neurons can perform logical computation.
- Uses binary inputs and hard thresholding.
- Cannot solve non-linearly separable problems such as XOR.
- Has no learning capability; weights and threshold are fixed manually.
- Limited to binary input-output behavior, so it cannot naturally model real-valued data.
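A quick way to see the threshold logic in action is to code a McCulloch-Pitts unit directly. A minimal sketch (the weights and thresholds below are hand-chosen for illustration, not part of the notes):

```python
def mcp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fires (1) iff the weighted sum meets the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# AND gate: fires only when both inputs are 1 (unit weights, threshold 2)
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", mcp_neuron([a, b], [1, 1], threshold=2))
# OR is the same unit with threshold 1; NOT uses weight -1 and threshold 0
```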
Biological neuron vs artificial neuron
| Biological | Artificial |
|---|---|
| Dendrites receive impulses | Input features enter the model |
| Synapse controls signal strength | Weight controls feature importance |
| Soma integrates signals | Summation computes weighted input |
| Axon sends spike | Activation produces output |
| Learning through synaptic plasticity | Learning through weight updates |
Perceptron learning rule
Weight update: wᵢ(new) = wᵢ(old) + C · e · xᵢ
Bias update: b(new) = b(old) + C · e
Here C is the learning rate and e = (target − output) is the classification error. The perceptron repeatedly computes its output, compares it with the desired label, and updates the weights until the classification error becomes zero or sufficiently small.
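A minimal sketch of this update rule in Python (the toy AND dataset and the learning rate 0.1 are illustrative assumptions):

```python
import numpy as np

# Toy linearly separable data: the AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights
b = 0.0           # bias
C = 0.1           # learning rate

for epoch in range(20):
    errors = 0
    for xi, target in zip(X, y):
        output = 1 if xi @ w + b >= 0 else 0   # hard-threshold prediction
        e = target - output                    # error term e from the rule above
        w += C * e * xi                        # w_i(new) = w_i(old) + C·e·x_i
        b += C * e                             # b(new)  = b(old)  + C·e
        errors += int(e != 0)
    if errors == 0:                            # stop once perfectly classified
        break
```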
Training, Optimization and Regularization
Activation functions at a glance
| Function | Range | Best use | Main issue / benefit |
|---|---|---|---|
| Linear | (-∞, +∞) | Regression output | No non-linearity |
| Sigmoid | (0, 1) | Binary classification output | Vanishing gradients |
| Tanh | (-1, 1) | Zero-centered hidden activations | Still suffers vanishing gradients |
| ReLU | [0, ∞) | Hidden layers | Fast and reduces vanishing gradient |
| Leaky ReLU | (-∞, ∞) | Hidden layers | Fixes dying ReLU issue |
| Softmax | (0, 1), outputs sum to 1 | Multi-class output | Ideal for class probabilities |
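A minimal numpy sketch of these functions (the leaky-ReLU slope 0.01 is a common default, assumed here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                      # zero-centered, (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # [0, inf)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope avoids dead units

def softmax(x):
    z = np.exp(x - np.max(x))              # subtract max for numerical stability
    return z / z.sum()                     # probabilities summing to 1
```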
Choosing output layer and loss function
| Problem | Output activation | Loss |
|---|---|---|
| Regression | Linear | MSE / MAE |
| Binary classification | Sigmoid | Binary cross-entropy |
| Multi-class classification | Softmax | Categorical cross-entropy |
| Multi-label classification | Sigmoid per class | Binary cross-entropy |
MSE vs Cross-Entropy
- MSE: average squared difference between actual and predicted values. Used mainly in regression.
- Cross-Entropy: measures mismatch between predicted probabilities and true labels. Used in classification.
- Cross-entropy heavily penalizes confident wrong predictions, which is why it fits classification better.
Binary CE = -[ y log(p) + (1-y) log(1-p) ]
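To see the heavy penalty on confident wrong predictions, compare the loss for two wrong answers of different confidence (the probabilities below are illustrative):

```python
import math

def binary_ce(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(binary_ce(1, 0.4))    # mildly wrong            -> ~0.92
print(binary_ce(1, 0.01))   # confidently wrong       -> ~4.61
# MSE for the same cases: (1-0.4)^2 = 0.36 vs (1-0.01)^2 = 0.98 — far gentler
```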
Optimizers: SGD, Momentum, Adam
- SGD: updates parameters using one sample or mini-batch. Fast but noisy.
- Momentum: adds past update direction to reduce oscillation and speed convergence.
- Adam: combines momentum with adaptive learning rates per parameter, usually stable and fast.
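The three update rules side by side, as a minimal numpy sketch (the hyperparameter values are common defaults assumed here, not from the notes):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad                    # plain gradient step

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad                     # velocity accumulates past gradients
    return w - lr * v, v

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad**2         # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)                 # bias correction for early steps
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```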
Regularization methods
- L1: encourages sparse weights.
- L2 / weight decay: shrinks weights to small values.
- Dropout: randomly drops neurons during training to reduce co-adaptation (see the sketch after this list).
- Early stopping: stops training when validation performance stops improving.
- Batch normalization: stabilizes learning and also provides mild regularization.
- Data augmentation: enlarges dataset using label-preserving transformations.
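Dropout is easy to state precisely in code. A minimal sketch of inverted dropout (the keep-probability 0.8 is an illustrative choice):

```python
import numpy as np

def dropout(activations, keep_prob=0.8, training=True):
    if not training:
        return activations                  # no dropout at inference time
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob   # rescale so the expected value is unchanged
```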
High-yield points
- Learning rate is one of the most important hyperparameters.
- Too few epochs cause underfitting; too many cause overfitting.
- Batch size affects stability and speed of optimization.
- Bias-variance trade-off explains underfitting vs overfitting.
Exam-ready contrast: Momentum vs Adam
| Momentum | Adam |
|---|---|
| Uses velocity from past gradients | Uses first and second moments of gradients |
| Single global learning rate | Adaptive learning rate per parameter |
| Simpler, cheaper | Usually faster and more robust |
| Good upgrade over SGD | Common default for deep learning |
Autoencoders
Core idea
- An autoencoder is an unsupervised or self-supervised neural network that learns to reconstruct its own input.
- It contains three parts: encoder, bottleneck / latent code, and decoder.
- The network is trained to minimize reconstruction loss, usually MSE or L1 loss.
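A minimal sketch of the three parts in PyTorch (the framework, layer sizes, and code dimension are illustrative assumptions; the notes do not prescribe them):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):   # e.g. flattened 28x28 images
        super().__init__()
        self.encoder = nn.Sequential(              # compress input down to the code
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim))              # bottleneck / latent code
        self.decoder = nn.Sequential(              # reconstruct input from the code
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
loss_fn = nn.MSELoss()                             # reconstruction loss
x = torch.rand(16, 784)                            # dummy batch
loss = loss_fn(model(x), x)                        # the target is the input itself
```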
Types of autoencoders
- Basic / Linear Autoencoder: simple compression and reconstruction.
- Undercomplete: code dimension smaller than input, so it must learn salient features.
- Overcomplete: code dimension larger than input; may simply copy input unless regularized.
- Sparse: enforces that only a few hidden neurons are active.
- Denoising: takes noisy input and reconstructs clean output.
- Contractive: learns representations robust to small input changes.
Applications of autoencoders
- Dimensionality reduction and visualization
- Image compression
- Anomaly detection using high reconstruction error
- Denoising and missing-value imputation
- Feature extraction and similarity search
- Image inpainting, colorization, super-resolution
- Recommendation systems
Autoencoder vs PCA
| PCA | Autoencoder |
|---|---|
| Linear dimensionality reduction | Can learn non-linear representations |
| Closed-form mathematical solution | Trained using backpropagation |
| Limited to linear patterns | Captures more complex structures |
Convolutional Neural Networks (CNN)
CNN pipeline
- Convolution: filter slides over image to extract local patterns.
- Feature map: output of a filter after convolution.
- Stride: number of pixels the filter moves at each step.
- Padding: extra pixels added around the input, often zeros, to preserve edge information or output size.
- Pooling: downsampling operation, commonly max pooling.
Output size formula
Output = (N − F + 2P) / S + 1
- N = input size (height or width)
- F = filter size
- P = padding
- S = stride
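A quick worked check of the formula (the 32×32 input with a 3×3 filter is an illustrative example):

```python
def conv_output_size(N, F, P, S):
    return (N - F + 2 * P) // S + 1

print(conv_output_size(N=32, F=3, P=1, S=1))   # 32: "same" padding preserves size
print(conv_output_size(N=32, F=3, P=0, S=1))   # 30: valid convolution shrinks it
print(conv_output_size(N=32, F=2, P=0, S=2))   # 16: stride 2 halves the size
```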
Why pooling is used
- Reduces spatial dimensions and computation
- Helps retain important features
- Adds some translation invariance
- Can reduce overfitting
CNN vs fully connected network
| CNN | Fully connected NN |
|---|---|
| Uses local receptive fields | Every neuron connects to all inputs |
| Parameter sharing through filters | No weight sharing |
| Efficient for images | Too many parameters for images |
| Learns spatial features | Ignores spatial locality |
Exam tip: parameter sharing
The same filter weights are reused across the whole image. This drastically reduces the number of learnable parameters compared with a dense layer, while also making the model good at detecting the same feature anywhere in the image.
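A quick parameter count makes the saving concrete (the 32×32×3 input and 3×3 filter are illustrative numbers):

```python
# One 3x3 conv filter over a 32x32 RGB image vs a dense layer with one neuron per pixel
conv_params = 3 * 3 * 3 + 1                     # 3x3 kernel x 3 channels + bias = 28
dense_params = (32 * 32 * 3 + 1) * (32 * 32)    # every output unit sees every input
print(conv_params)                              # 28
print(dense_params)                             # 3,146,752
```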
Typical use cases
- Image classification
- Object detection
- OCR and handwriting recognition
- Medical image analysis
Recurrent Neural Networks (RNN)
Why RNN is needed
- Traditional feedforward and CNN models assume fixed-size input and no memory of previous steps.
- Sequence learning problems require output at time t to depend on previous inputs or outputs.
- RNN introduces a hidden state that acts like memory.
Unfolding computational graphs
RNN can be unrolled across time steps. Each time step behaves like a repeated layer that shares the same parameters. This makes sequential processing visible and trainable with backpropagation through time.
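A minimal numpy sketch of this shared-parameter unrolling (the sizes and the vanilla-RNN update h_t = tanh(W_xh·x_t + W_hh·h_{t−1} + b) are standard assumptions, not spelled out in the notes):

```python
import numpy as np

input_dim, hidden_dim, T = 4, 8, 5
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1    # shared across all steps
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

xs = rng.normal(size=(T, input_dim))    # a length-T input sequence
h = np.zeros(hidden_dim)                # initial hidden state ("memory")
for t in range(T):                      # the unrolled computation:
    h = np.tanh(W_xh @ xs[t] + W_hh @ h + b)   # same weights at every time step
```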
BPTT and gradient issues
- BPTT: backpropagation is applied through all previous time steps.
- Vanishing gradient: gradients become tiny, so the model forgets long-term dependencies.
- Exploding gradient: gradients grow too large, making training unstable.
- These problems motivate improved architectures such as LSTM and GRU.
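Exploding gradients in particular are commonly handled with gradient clipping. A minimal sketch of clipping by global norm (the threshold 5.0 is an illustrative choice):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    total = np.sqrt(sum(np.sum(g**2) for g in grads))     # global gradient norm
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]   # rescale all gradients
    return grads
```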
Bidirectional RNN
Uses one RNN moving forward and another moving backward through the sequence. This helps the model use both past and future context, which is useful in tasks like speech recognition and language understanding.
LSTM in one view
- Forget gate: decides what old information to discard.
- Input gate: decides what new information to store.
- Cell state: carries long-term memory.
- Output gate: decides what to expose as the hidden state.
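A minimal numpy sketch of one LSTM step showing how the gates combine (the stacked weight layout follows the standard LSTM equations, an assumption; the notes list only the gates):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W, U, b hold stacked parameters for all four gates."""
    z = W @ x + U @ h + b                          # all gate pre-activations at once
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    g = np.tanh(g)                                 # candidate new information
    c = f * c + i * g                              # cell state: discard old + store new
    h = o * np.tanh(c)                             # expose part of memory as hidden state
    return h, c

H, X = 8, 4
rng = np.random.default_rng(0)
h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H),
                 rng.normal(size=(4*H, X)), rng.normal(size=(4*H, H)), np.zeros(4*H))
```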
When to choose which model
| Model | Use |
|---|---|
| Simple RNN | Short sequences, simple dependencies |
| Bidirectional RNN | When both past and future context matter |
| LSTM | Long-term dependencies and better memory control |
| GRU | Good real-time choice with fewer gates and faster computation |
Recent Trends and Applications
GAN basics
- GAN stands for Generative Adversarial Network.
- It contains two networks: Generator (G) and Discriminator (D).
- The generator creates fake samples from random noise.
- The discriminator tries to distinguish real data from generated data.
- Training is adversarial: both models improve by competing with each other.
GAN working in short
- Feed real data to discriminator.
- Feed generated fake data to discriminator.
- Update discriminator to classify correctly.
- Update generator to fool discriminator.
- Repeat until generated samples look realistic.
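The loop above as a minimal PyTorch sketch (the network sizes, optimizers, toy data, and the non-saturating generator loss are illustrative assumptions):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # noise -> fake sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))    # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(32, 2) + 3.0           # stand-in "real" data distribution
    fake = G(torch.randn(32, 16))             # generator maps noise to fake samples

    # Steps 1-3: update discriminator to classify real vs fake correctly
    d_loss = (bce(D(real), torch.ones(32, 1)) +
              bce(D(fake.detach()), torch.zeros(32, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step 4: update generator to fool the discriminator (label fakes as real)
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```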
Applications
- Image generation
- DeepFake and media synthesis
- Image-to-image translation
- Super-resolution
- Art and design generation
Deep learning application mapping
| Task | Suitable technique | Reason |
|---|---|---|
| Image classification | CNN | Learns spatial features efficiently |
| Text prediction | RNN / LSTM / GRU | Handles sequential context |
| Fake image generation | GAN | Generative adversarial learning |
| Fake image detection | CNN | Learns artifacts and visual inconsistencies |