From a single neuron to deep architectures
0.5.dev0+git.20260626.e137512 - June 26, 2026 18:41 UTC
Deep Learning#
Deep learning stacks simple, differentiable building blocks into networks trained end-to-end by gradient descent. This hub follows the same ground-up path as the source corpus: start by viewing logistic regression as one neuron, learn the computation graph and backpropagation that train it, then scale up to deep networks, the techniques that make them work, and the major architectures.
Three levels run through the page:
newcomers: a neuron, a loss, and how gradient descent learns;
practitioners: forward/backward propagation, activations, regularization and optimization;
researchers / engineers: CNNs and sequence models, and visualising them with scikit-plots' visualkeras integration.
Note
Open a dropdown for detail and follow See also links. Snippets use real NumPy / PyTorch / Keras calls. This page pairs with the Terminology reference (classification metrics for evaluating networks) and the Data Preparation & Analysis hub.
Discovery at a Glance#
The smallest network, and how it learns.
A weighted sum, a sigmoid, a probability: the atom of every deep net.
How derivatives flow backward through a chain of operations.
The update rule that does the learning, written without slow Python loops.
Stacking neurons into layers.
Layer by layer from input to prediction.
The nonlinearities (ReLU, sigmoid, tanh) that give depth its power.
What the network minimises, and the optimisers that get it there.
Generalization techniques and the major architectures.
Dropout, weight decay, batch norm, and the hyperparameters that matter.
Weight sharing for images: convolution, pooling, deep backbones.
RNNs, LSTMs and attention for ordered data.
Part 1: From Logistic Regression to a Neuron#
Logistic Regression as a Single Neuron#
What is it?
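A single neuron computes a weighted sum of its inputs, adds a bias, and passes the result through a sigmoid to produce a probability:
\[\hat{y} = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}\]
This is logistic regression, and the exact unit that, stacked and repeated, becomes a deep network. Training minimises binary cross-entropy:
\[\mathcal{L}(\hat{y}, y) = -\bigl[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\bigr]\]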
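The neuron and its loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's own code: the helper names (`sigmoid`, `neuron_forward`, `binary_cross_entropy`) and the toy weights and inputs are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(w, b, x):
    """Single neuron: weighted sum plus bias, then sigmoid."""
    return sigmoid(np.dot(w, x) + b)

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Binary cross-entropy for one example; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Hypothetical toy values, chosen only for illustration.
w = np.array([0.5, -0.25])
b = 0.1
x = np.array([2.0, 4.0])

y_hat = neuron_forward(w, b, x)      # weighted sum is 1.0 - 1.0 + 0.1 = 0.1
loss_if_positive = binary_cross_entropy(y_hat, 1.0)
loss_if_negative = binary_cross_entropy(y_hat, 0.0)
```

Because the predicted probability sits slightly above 0.5, the loss is smaller when the true label is 1 than when it is 0, which is the behaviour the cross-entropy is meant to reward.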
Computation Graph & Backpropagation#
What is it?
A computation graph represents a calculation as nodes (operations) and edges (values). Backpropagation applies the chain rule along this graph in reverse to compute the gradient of the loss with respect to every parameter efficiently, reusing intermediate results rather than recomputing them.
This reverse-mode automatic differentiation is what frameworks like PyTorch and TensorFlow implement under the hood.
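To make the reverse pass concrete, here is a hand-written sketch for the smallest possible graph: one scalar input, one sigmoid neuron, and a binary cross-entropy loss. The function names and the toy values are assumptions for illustration; frameworks generate the equivalent of `backward` automatically. A finite-difference estimate is included as an independent check on the chain-rule result.

```python
import numpy as np

def forward(w, b, x, y):
    """Forward pass; returns the loss and the cached intermediate values."""
    z = w * x + b                                        # node 1: affine
    a = 1.0 / (1.0 + np.exp(-z))                         # node 2: sigmoid
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))    # node 3: cross-entropy
    return loss, (z, a)

def backward(x, y, cache):
    """Reverse pass: chain rule from the loss back to w and b."""
    z, a = cache
    dloss_da = -(y / a) + (1 - y) / (1 - a)   # d loss / d a
    da_dz = a * (1 - a)                       # d a / d z, reuses the cached a
    dloss_dz = dloss_da * da_dz               # simplifies algebraically to a - y
    dloss_dw = dloss_dz * x                   # d z / d w = x
    dloss_db = dloss_dz                       # d z / d b = 1
    return dloss_dw, dloss_db

# Hypothetical toy values.
w, b, x, y = 0.3, -0.1, 2.0, 1.0
loss, cache = forward(w, b, x, y)
grad_w, grad_b = backward(x, y, cache)

# Central finite differences as an independent check on the gradients.
h = 1e-6
num_w = (forward(w + h, b, x, y)[0] - forward(w - h, b, x, y)[0]) / (2 * h)
num_b = (forward(w, b + h, x, y)[0] - forward(w, b - h, x, y)[0]) / (2 * h)
```

The cached activation `a` is computed once in the forward pass and reused in the backward pass, which is the saving that reverse-mode differentiation provides on larger graphs.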
Gradient Descent & Vectorization#
The update rule
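Gradient descent nudges each parameter against the gradient of the loss, scaled by a learning rate \(\alpha\):
\[w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}\]
where \(J\) denotes the loss averaged over the training examples.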
Vectorization replaces per-example Python loops with matrix operations over the whole batch, which is the key to practical speed:
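import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))            # toy inputs (assumed): 3 features, m = 5 examples as columns
Y = rng.integers(0, 2, size=(1, 5))    # toy binary labels (assumed)
W, b, m = np.zeros((1, 3)), 0.0, X.shape[1]
Z = W @ X + b                          # all examples at once
A = 1 / (1 + np.exp(-Z))               # sigmoid, vectorized
dW = (A - Y) @ X.T / m                 # gradient over the batch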
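The vectorized expressions above compute a single gradient. A minimal sketch of the full training loop follows, assuming a synthetic, linearly separable toy dataset, a learning rate of 0.5, and 200 iterations; all of these values are illustrative choices rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 2 features, 200 examples stored as columns.
m = 200
X = rng.normal(size=(2, m))
Y = (X[0:1] + X[1:2] > 0).astype(float)    # label is 1 when x1 + x2 > 0

W = np.zeros((1, 2))
b = 0.0
alpha = 0.5                                # learning rate (assumed value)

def predict(W, b):
    """Sigmoid output for every example at once."""
    return 1 / (1 + np.exp(-(W @ X + b)))

def cost(W, b):
    """Mean binary cross-entropy over the batch."""
    A = np.clip(predict(W, b), 1e-12, 1 - 1e-12)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

initial_cost = cost(W, b)
for _ in range(200):
    A = predict(W, b)                      # forward pass, whole batch
    dW = (A - Y) @ X.T / m                 # gradient with respect to W
    db = float(np.mean(A - Y))             # gradient with respect to b
    W -= alpha * dW                        # step against the gradient
    b -= alpha * db
final_cost = cost(W, b)
accuracy = float(np.mean((predict(W, b) > 0.5) == (Y == 1)))
```

With zero-initialised weights every prediction starts at 0.5, so the initial cost equals log 2; the loop then lowers it without any per-example Python iteration.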
See also
dl-loss-optim
Part 2: Neural Networks#
Forward Propagation#
What is it?
In a network, each layer applies a linear transform followed by a nonlinearity, feeding the next layer:
\[z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \qquad a^{[l]} = g^{[l]}\bigl(z^{[l]}\bigr)\]
where \(a^{[0]} = x\) is the input and the final layer's output is the prediction. Stacking layers lets the network compose simple features into complex ones.
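The layer-by-layer recipe above can be sketched as a short loop. The network shape (4-5-3-1), the random initialisation, and the helper names are assumptions made for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params, activations):
    """Apply z = W a + b, then a = g(z), one layer after another."""
    a = x                                    # the input plays the role of a^[0]
    for (W, b), g in zip(params, activations):
        z = W @ a + b                        # linear transform
        a = g(z)                             # nonlinearity
    return a

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3, 1]                   # hypothetical 4-5-3-1 network
params = [
    (rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros((n_out, 1)))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
activations = [relu, relu, sigmoid]          # sigmoid on the output layer

X = rng.normal(size=(4, 8))                  # 8 examples stored as columns
Y_hat = forward(X, params, activations)      # one probability per example
```

Storing examples as columns means the same loop handles one example or a whole batch; the bias of shape `(n_out, 1)` broadcasts across the batch dimension.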
Activation Functions#
Why they matter
Without a nonlinearity between layers, any stack collapses to a single linear map. Common choices:
ReLU \(g(z) = \max(0, z)\): the default for hidden layers; cheap to compute and less prone to vanishing gradients than sigmoid or tanh, since its gradient is 1 for every positive input.
Sigmoid: squashes to \((0, 1)\); used for binary output.
Tanh: zero-centred, with range \((-1, 1)\).
Softmax: multiclass output probabilities that sum to 1.
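The four activations listed above, and the claim that a stack without nonlinearities collapses to a single linear map, can both be checked numerically. This is a small sketch with assumed toy inputs and randomly drawn weight matrices.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Softmax over axis 0; subtracting the max avoids overflow."""
    shifted = z - np.max(z, axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=0, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])               # hypothetical pre-activations
relu_out = relu(z)                           # negatives are clipped to zero
sigmoid_out = sigmoid(z)                     # values in (0, 1)
tanh_out = np.tanh(z)                        # values in (-1, 1)
softmax_out = softmax(z)                     # non-negative, sums to 1

# Two linear layers with no activation in between equal one linear layer.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
stacked = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
```

Inserting `relu` between the two matrix products breaks this equality in general, which is what lets depth add expressive power.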
See also
dl-loss-optim
Part 3: Improving & Scaling Networks#
Regularization, Normalization & Tuning#
Generalization techniques
L2 / weight decay: penalise large weights.
Dropout: randomly zero units during training to prevent co-adaptation.
Batch normalization: standardise layer inputs to stabilise and speed up training.
Early stopping: halt when validation loss stops improving.
Hyperparameters that matter most: learning rate (first), then batch size, network width/depth, and regularization strength. Tune on a validation split, never the test set.
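from torch import nn

model = nn.Sequential(
    nn.Linear(20, 64), nn.BatchNorm1d(64), nn.ReLU(),  # hidden block: 20 inputs, 64 units
    nn.Dropout(0.3), nn.Linear(64, 1),                 # drop 30% of units, then one output logit
)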
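To show what two of these techniques do numerically, here is a NumPy sketch of inverted dropout and an L2 penalty. The function names and the penalty convention, \(\tfrac{\lambda}{2} \sum W^2\), are assumptions for illustration; libraries differ in how they scale the weight-decay term.

```python
import numpy as np

def dropout(a, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return a                              # identity at evaluation time
    keep_prob = 1.0 - rate
    mask = rng.random(a.shape) < keep_prob    # True for units that survive
    return a * mask / keep_prob               # rescale to keep the expected value

def l2_penalty(weights, lam):
    """L2 term added to the loss: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def l2_gradient(W, lam):
    """Gradient of the penalty with respect to one weight matrix."""
    return lam * W

rng = np.random.default_rng(0)
a = np.ones((64, 1000))                       # hypothetical activations
a_train = dropout(a, rate=0.3, rng=rng)
a_eval = dropout(a, rate=0.3, rng=rng, training=False)

W = np.array([[1.0, -2.0], [0.5, 0.0]])
penalty = l2_penalty([W], lam=0.1)            # 0.5 * 0.1 * (1 + 4 + 0.25)
```

Rescaling by `1 / keep_prob` during training keeps the mean activation unchanged, so no correction is needed at evaluation time.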
Convolutional Neural Networks (CNNs)#
What is it?
CNNs exploit spatial structure by sharing weights: a small kernel slides across the image (convolution), detecting the same feature everywhere, while pooling downsamples for translation tolerance. Stacking these yields deep backbones (VGG, ResNet, EfficientNet).
scikit-plots' visualkeras integration renders these architectures as layered diagrams, useful for documentation and review.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                  # 28x28 single-channel images
    keras.layers.Conv2D(32, 3, activation="relu"),   # 32 kernels of size 3x3
    keras.layers.MaxPooling2D(),                     # 2x2 pooling by default
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),    # 10-class probabilities
])
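What a convolution layer and a pooling layer compute can be sketched in plain NumPy. This is a deliberately slow, single-channel illustration with assumed function names and a hypothetical 6x6 image; real layers add channels, batches, padding, strides, and learned kernels.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over a 2-D image (no padding, stride 1).

    Like most deep learning libraries, this computes cross-correlation:
    the kernel is not flipped.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; trailing rows and columns are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Hypothetical 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
edge_kernel = np.array([[-1.0, 1.0]])          # responds to dark-to-bright steps

features = conv2d_valid(image, edge_kernel)    # shape (6, 5)
pooled = max_pool2d(features)                  # shape (3, 2)
```

The kernel fires only in the column where the brightness steps up, in every row, which is the "same feature everywhere" behaviour that weight sharing provides; pooling then keeps that response while halving the resolution.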
Sequence Models (RNNs, LSTMs, Attention)#
What is it?
For ordered data (text, audio, time series), recurrent networks carry a hidden state across steps. LSTMs / GRUs add gating to learn long-range dependencies, while attention / transformer layers let every position attend to every other; this is now the dominant approach for language.
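The two ideas in this paragraph, a hidden state carried across steps and every position attending to every other, can be sketched in NumPy. This is a simplified illustration under stated assumptions: a plain tanh recurrence without LSTM or GRU gating, and attention without the learned query, key, and value projections a transformer layer would use. Sizes and weights are hypothetical.

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh, b_h):
    """Plain recurrent layer: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = np.zeros(W_hh.shape[0])               # initial hidden state
    states = []
    for x_t in X:                             # one step per sequence position
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

def softmax(z, axis=-1):
    e = np.exp(z - np.max(z, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4                        # hypothetical sizes
X = rng.normal(size=(T, d_in))                # sequence of 5 input vectors

H = rnn_forward(
    X,
    rng.normal(size=(d_h, d_in)) * 0.1,
    rng.normal(size=(d_h, d_h)) * 0.1,
    np.zeros(d_h),
)
context, attn = attention(H, H, H)            # self-attention over all positions
```

The recurrence processes positions strictly in order, whereas the attention weights form a full `T x T` matrix computed in one step, which is why attention parallelises well over long sequences.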
Map to scikit-plots (visualkeras) & Frameworks#
Verified architecture-visualisation galleries and framework docs:
Layered diagram of a fully-connected classifier.
Spam classification network, visualised.
Encoder/decoder convolutional architecture.
Deep residual backbone diagram.
Tensors, autograd and torch.nn.
High-level model building on TensorFlow / JAX / PyTorch.
Sources#
Verified during preparation of this page; resolvable at build date.
Source context (framing only, re-expressed in our own words)
Deep Learning category (17 posts): https://insightful-data-lab.com/category/deep-learning/
Official documentation (API calls used above)
Keras: https://keras.io/
TensorFlow: https://www.tensorflow.org/
scikit-plots (this project)
visualkeras examples: https://scikit-plots.github.io/dev/auto_examples/index.html
Terminology reference: terminology-index
Standard reference
Goodfellow, Bengio & Courville, Deep Learning (free): https://www.deeplearningbook.org/