🧠 Deep Learning
From a single neuron to deep architectures
0.5.dev0+git.20260626.e137512 - June 26, 2026 18:41 UTC

Deep Learning#

Deep learning stacks simple, differentiable building blocks into networks trained end-to-end by gradient descent. This hub follows the same ground-up path as the source corpus: start by viewing logistic regression as one neuron, learn the computation graph and backpropagation that train it, then scale up to deep networks, the techniques that make them work, and the major architectures.

Three levels run through the page:

  • newcomers: a neuron, a loss, and how gradient descent learns;

  • practitioners: forward/backward propagation, activations, regularization and optimization;

  • researchers / engineers: CNNs and sequence models, and visualising them with scikit-plots’ visualkeras integration.

Note

Open a dropdown for detail and follow See also links. Snippets use real numpy / PyTorch / Keras calls. This page pairs with the Terminology reference (classification metrics for evaluating networks) and the Data Preparation & Analysis hub.


Discovery at a Glance#

The smallest network, and how it learns.

🔘 Logistic Regression as a Neuron

A weighted sum, a sigmoid, a probability: the atom of every deep net.

Logistic Regression as a Single Neuron
🕸️ Computation Graph & Backprop

How derivatives flow backward through a chain of operations.

Computation Graph & Backpropagation
⬇️ Gradient Descent & Vectorization

The update rule that does the learning, written without slow Python loops.

Gradient Descent & Vectorization

Stacking neurons into layers.

➑️ Forward Propagation

Layer by layer from input to prediction.

Forward Propagation
⚑ Activation Functions

The nonlinearities (ReLU, sigmoid, tanh) that give depth its power.

Activation Functions
πŸ“‰ Loss & Optimization

What the network minimises, and the optimisers that get it there.

dl-loss-optim

Generalization techniques and the major architectures.

🛡️ Regularization & Tuning

Dropout, weight decay, batch norm, and the hyperparameters that matter.

Regularization, Normalization & Tuning
🖼️ Convolutional Networks

Weight sharing for images: convolution, pooling, deep backbones.

Convolutional Neural Networks (CNNs)
🔁 Sequence Models

RNNs, LSTMs and attention for ordered data.

Sequence Models (RNNs, LSTMs, Attention)

Part 1: From Logistic Regression to a Neuron#

Logistic Regression as a Single Neuron#

What is it?

A single neuron computes a weighted sum of its inputs, adds a bias, and passes the result through a sigmoid to produce a probability:

\[\hat{y} = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}\]

This is logistic regression, and it is the exact unit that, stacked and repeated, becomes a deep network. Training minimises binary cross-entropy:

\[\mathcal{L} = -\frac{1}{m}\sum_{i=1}^{m} \big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]\]
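The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names (`sigmoid`, `neuron`, `binary_cross_entropy`), the clipping constant `eps`, and the example values are all assumptions introduced here.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Single neuron: weighted sum plus bias, passed through a sigmoid."""
    return sigmoid(w @ x + b)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy; eps (an assumed value) keeps log() away from 0."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Illustrative values only (not from the source).
w = np.array([0.5, -0.25])
b = 0.0
x = np.array([2.0, 4.0])      # w @ x + b = 1.0 - 1.0 = 0.0

p = neuron(x, w, b)           # sigmoid(0) = 0.5
loss = binary_cross_entropy(np.array([1.0]), np.array([p]))  # -log(0.5) = log(2)
```

With a weighted sum of exactly zero the neuron is maximally uncertain, so the loss for either label is log(2), a handy sanity check for an untrained model.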

Computation Graph & Backpropagation#

What is it?

A computation graph represents a calculation as nodes (operations) and edges (values). Backpropagation applies the chain rule along this graph in reverse to compute the gradient of the loss with respect to every parameter efficiently, reusing intermediate results rather than recomputing them.

This reverse-mode automatic differentiation is what frameworks like PyTorch and TensorFlow implement under the hood.
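To make the reverse pass concrete, here is a hand-written sketch for the single-neuron graph z → a → loss, checked against a finite-difference estimate. It is written by hand for illustration and is not how PyTorch or TensorFlow are implemented internally; the helper names `forward` and `backward` and the example values are assumptions.

```python
import numpy as np

def forward(w, b, x, y):
    """Forward pass through the graph z -> a -> loss, caching intermediates."""
    z = w @ x + b                                        # node 1: weighted sum
    a = 1.0 / (1.0 + np.exp(-z))                         # node 2: sigmoid
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))    # node 3: cross-entropy
    return loss, (z, a)

def backward(x, y, cache):
    """Reverse pass: chain rule node by node, reusing the cached activation."""
    _, a = cache
    dL_da = -(y / a) + (1 - y) / (1 - a)   # d loss / d a
    da_dz = a * (1 - a)                    # d a / d z (sigmoid derivative)
    dL_dz = dL_da * da_dz                  # algebraically equal to a - y
    dL_dw = dL_dz * x                      # d z / d w = x
    dL_db = dL_dz                          # d z / d b = 1
    return dL_dw, dL_db

# Illustrative values only.
w, b = np.array([0.3, -0.2]), 0.1
x, y = np.array([1.0, 2.0]), 1.0

loss, cache = forward(w, b, x, y)
dw, db = backward(x, y, cache)

# Central finite differences as an independent check on the analytic gradient.
eps = 1e-6
num_dw = np.array([
    (forward(w + eps * e, b, x, y)[0] - forward(w - eps * e, b, x, y)[0]) / (2 * eps)
    for e in np.eye(2)
])
```

The numerical check costs two forward passes per parameter, whereas the reverse pass yields every gradient from one forward and one backward sweep. That difference is the efficiency argument made above.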

Gradient Descent & Vectorization#

The update rule

Gradient descent nudges each parameter against the gradient of the loss, scaled by a learning rate \(\alpha\):

\[w := w - \alpha \frac{\partial \mathcal{L}}{\partial w}, \qquad b := b - \alpha \frac{\partial \mathcal{L}}{\partial b}\]

Vectorization replaces per-example Python loops with matrix operations over the whole batch, which is the key to practical speed:

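import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5                                  # illustrative sizes: features, examples
X, Y = rng.normal(size=(n, m)), rng.integers(0, 2, size=(1, m))
W, b = np.zeros((1, n)), 0.0                 # parameters start at zero

Z = W @ X + b                 # all examples at once
A = 1 / (1 + np.exp(-Z))      # sigmoid, vectorized
dW = (A - Y) @ X.T / m        # gradient over the batch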
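The update rule and the vectorized gradient can be combined into a complete training loop. The sketch below assumes a synthetic, linearly separable toy dataset, a learning rate of 0.5, and 200 steps. All of these are illustrative choices, as is the helper name `loss_fn`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): features in rows, examples in columns.
n, m = 2, 200
X = rng.normal(size=(n, m))
Y = (X[0:1] + X[1:2] > 0).astype(float)   # shape (1, m), linearly separable labels

W = np.zeros((1, n))
b = 0.0
alpha = 0.5                               # learning rate (illustrative)

def loss_fn(W, b):
    """Mean binary cross-entropy over the whole batch."""
    A = 1 / (1 + np.exp(-(W @ X + b)))
    A = np.clip(A, 1e-12, 1 - 1e-12)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

initial_loss = loss_fn(W, b)              # log(2) when all parameters are zero

for _ in range(200):
    A = 1 / (1 + np.exp(-(W @ X + b)))    # forward pass, all examples at once
    dZ = A - Y                            # gradient of the loss w.r.t. Z
    dW = dZ @ X.T / m                     # gradient w.r.t. the weights
    db = float(dZ.mean())                 # gradient w.r.t. the bias
    W -= alpha * dW                       # w := w - alpha * dL/dw
    b -= alpha * db                       # b := b - alpha * dL/db

final_loss = loss_fn(W, b)
accuracy = float(((W @ X + b > 0) == (Y == 1)).mean())
```

There is no loop over examples: each step touches the whole batch through two matrix products, so the per-step cost is dominated by optimized linear algebra rather than Python overhead.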

See also

dl-loss-optim


Part 2: Neural Networks#

Forward Propagation#

What is it?

In a network, each layer applies a linear transform followed by a nonlinearity, feeding the next layer:

\[a^{[l]} = g\big(W^{[l]} a^{[l-1]} + b^{[l]}\big)\]

where \(a^{[0]} = x\) is the input and the final layer’s output is the prediction. Stacking layers lets the network compose simple features into complex ones.
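The layer recurrence above maps directly onto a loop. This is a minimal sketch assuming ReLU hidden layers, a sigmoid output, and an arbitrary 4-5-3-1 layout with random weights. None of these choices come from the source.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params, activations):
    """Apply a[l] = g(W[l] @ a[l-1] + b[l]) layer by layer."""
    a = x                                   # a[0] = x
    for (W, b), g in zip(params, activations):
        a = g(W @ a + b)
    return a

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3, 1]                  # illustrative: input, two hidden, output
params = [
    (rng.normal(scale=0.5, size=(n_out, n_in)), np.zeros((n_out, 1)))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
activations = [relu, relu, sigmoid]

X = rng.normal(size=(4, 8))                 # 8 examples stored as columns
y_hat = forward(X, params, activations)     # one probability per example
```

Each weight matrix has shape (units in this layer, units in the previous layer), so the same code handles a single column vector or a whole batch of columns.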

Activation Functions#

Why they matter

Without a nonlinearity between layers, any stack collapses to a single linear map. Common choices:

  • ReLU \(g(z) = \max(0, z)\): the default for hidden layers; cheap to compute, and its gradient does not saturate for positive inputs, which mitigates vanishing gradients.

  • Sigmoid: squashes to \((0, 1)\), used for binary output.

  • Tanh: zero-centred, with outputs in \((-1, 1)\).

  • Softmax: multiclass output probabilities that sum to 1.
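The activations listed above, and the claim that a stack without nonlinearities collapses to one linear map, can be checked numerically. The sketch below uses assumed random matrices and a column-wise softmax with the usual max-subtraction trick for numerical stability.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Column-wise softmax; subtracting the max avoids overflow in exp()."""
    shifted = z - z.max(axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(2, 3))
x = rng.normal(size=(4, 1))

# Two linear layers with no activation equal one linear layer with W2 @ W1.
stacked = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x

# With a ReLU in between, the result is in general no longer that single map.
with_relu = W2 @ relu(W1 @ x)

probs = softmax(np.array([[1.0], [2.0], [3.0]]))   # three class scores
```

Tanh is available directly as `np.tanh`, so it needs no helper here.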

See also

dl-loss-optim


Part 3: Improving & Scaling Networks#

Regularization, Normalization & Tuning#

Generalization techniques

  • L2 / weight decay: penalise large weights.

  • Dropout: randomly zero units during training to prevent co-adaptation.

  • Batch normalization: standardise layer inputs to stabilise and speed up training.

  • Early stopping: halt when validation loss stops improving.

Hyperparameters that matter most: learning rate (first), then batch size, network width/depth, and regularization strength. Tune on a validation split, never the test set.

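from torch import nn

# 20 input features -> 64 hidden units -> 1 output (a raw score, no final activation).
model = nn.Sequential(
    nn.Linear(20, 64), nn.BatchNorm1d(64), nn.ReLU(),
    nn.Dropout(0.3), nn.Linear(64, 1),
)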
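What three of the listed techniques do can be sketched framework-free in NumPy. This illustrates the ideas and does not reproduce any library's internals: the helper names (`dropout`, `sgd_step_with_weight_decay`, `should_stop`) and the `patience` parameter are assumptions. The dropout shown is the common "inverted" variant, which rescales the surviving units during training so nothing changes at inference time.

```python
import numpy as np

def dropout(a, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return a
    keep = 1.0 - rate
    mask = rng.random(a.shape) < keep      # True for units that survive
    return a * mask / keep                 # rescale so the expected value is unchanged

def sgd_step_with_weight_decay(W, grad, lr, weight_decay):
    """One gradient step on loss + (weight_decay / 2) * ||W||^2."""
    return W - lr * (grad + weight_decay * W)

def should_stop(val_losses, patience):
    """Early stopping: True once the best validation loss is `patience` epochs old."""
    best_epoch = int(np.argmin(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
dropped = dropout(a, rate=0.3, rng=rng)    # about 30% zeros, the rest equal 1 / 0.7

W = np.array([[1.0, -2.0]])
# With a zero data gradient, weight decay alone shrinks W by (1 - lr * weight_decay).
W_new = sgd_step_with_weight_decay(W, grad=np.zeros_like(W), lr=0.1, weight_decay=0.5)
```

In the zero-gradient example the weights are multiplied by 0.95, which shows why the L2 penalty is also called weight decay.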

Convolutional Neural Networks (CNNs)#

What is it?

CNNs exploit spatial structure by sharing weights: a small kernel slides across the image (convolution), detecting the same feature everywhere, while pooling downsamples for translation tolerance. Stacking these yields deep backbones (VGG, ResNet, EfficientNet).

scikit-plots’ visualkeras integration renders these architectures as layered diagrams, useful for documentation and review.

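from tensorflow import keras

# Small CNN for 28x28 single-channel images with 10 output classes.
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                  # explicit input layer
    keras.layers.Conv2D(32, 3, activation="relu"),   # 32 kernels of size 3x3
    keras.layers.MaxPooling2D(),                     # 2x2 downsampling by default
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])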
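To show what the convolution and pooling layers compute, here is a deliberately slow, loop-based NumPy sketch for a single 2-D channel. The function names and the toy edge image are assumptions. Like most deep learning libraries, it computes cross-correlation, meaning the kernel is not flipped.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every position (weight sharing).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # vertical edge between columns 2 and 3
kernel = np.array([[-1.0, 1.0]])         # responds to a left-to-right increase

features = conv2d_valid(image, kernel)   # shape (6, 5); nonzero only in column 2
pooled = max_pool2d(features)            # shape (3, 2)
```

The same shape arithmetic applies to the Keras model above: a 3×3 kernel without padding turns 28×28 into 26×26, and 2×2 pooling halves that to 13×13.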

Sequence Models (RNNs, LSTMs, Attention)#

What is it?

For ordered data (text, audio, time series), recurrent networks carry a hidden state across steps. LSTMs / GRUs add gating to learn long-range dependencies, while attention / transformer layers let every position attend to every other, which is now the dominant approach for language.
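The contrast between carrying a hidden state step by step and letting every position attend to every other can be sketched in NumPy. Both functions are simplified assumptions: `rnn_forward` is a plain tanh RNN without gating, and `self_attention` uses the input itself as queries, keys, and values, with no learned projections, masking, or multiple heads.

```python
import numpy as np

def rnn_forward(X, Wx, Wh, b):
    """Vanilla RNN: h_t = tanh(Wx @ x_t + Wh @ h_{t-1} + b), one step at a time."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x_t in X:                                   # X has shape (T, d_in)
        h = np.tanh(Wx @ x_t + Wh @ h + b)          # new state depends on the previous one
        states.append(h)
    return np.stack(states)                         # shape (T, d_hidden)

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # (T, T): every position vs every other
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ X, weights

rng = np.random.default_rng(0)
T, d_in, d_hidden = 5, 3, 4                         # illustrative sizes
X = rng.normal(size=(T, d_in))

Wx = 0.5 * rng.normal(size=(d_hidden, d_in))
Wh = 0.5 * rng.normal(size=(d_hidden, d_hidden))
states = rnn_forward(X, Wx, Wh, np.zeros(d_hidden))

context, weights = self_attention(X)
```

The RNN loop is inherently sequential because step t needs the state from step t-1, while the attention scores for all pairs of positions come from a single matrix product.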


Map to scikit-plots (visualkeras) & Frameworks#

Verified architecture-visualisation galleries and framework docs:

visualkeras: Dense network

Layered diagram of a fully-connected classifier.

https://scikit-plots.github.io/dev/auto_examples/visualkeras/plot_dl_ann_dense.html
visualkeras: Conv1D + Dense

Spam classification network, visualised.

https://scikit-plots.github.io/dev/auto_examples/visualkeras/plot_dl_ann_conv_dense.html
visualkeras: Autoencoder (CNN)

Encoder/decoder convolutional architecture.

https://scikit-plots.github.io/dev/auto_examples/visualkeras/plot_dl_cnn_autoencoder.html
visualkeras: ResNetV2

Deep residual backbone diagram.

https://scikit-plots.github.io/dev/auto_examples/visualkeras/plot_dl_cnn_resnetv2.html
PyTorch

Tensors, autograd and torch.nn.

https://pytorch.org/docs/stable/index.html
Keras

High-level model building on TensorFlow / JAX / PyTorch.

https://keras.io/

Tags: purpose: reference, domain: neural network, model-type: classification, level: beginner, level: intermediate, level: advanced