Build a Network.
How neurons connect into layers, and how data flows through them — with code you can run yourself.
After this lesson you'll know
- How neurons connect to form layers
- The difference between input, hidden, and output layers
- Why architecture matters for what a network can learn
- What happens when data flows through a network
Layers are the architecture of intelligence.
A single neuron can make simple decisions. But stack neurons into layers — input, hidden, output — and suddenly the network can recognize faces, translate languages, and write code. The architecture (how many layers, how they connect) determines what the network can learn.
The input layer receives raw data and passes it forward. For an image, each input neuron holds one pixel value. For text, each input holds a token embedding. The input layer does no computation — it is purely a data entry point. Its size is fixed by the data: a 28x28 pixel image needs 784 input neurons. A sentence with 50 tokens needs 50 input positions.
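The size arithmetic is easy to check yourself. A quick sketch flattening a 28x28 image into the 784-value input vector (the pixel values here are random stand-ins):

```python
import numpy as np

# A fake 28x28 grayscale "image" with pixel values in [0, 1]
image = np.random.rand(28, 28)

# The input layer just receives the flattened pixels -- no computation
input_vector = image.reshape(-1)

print(input_vector.shape)  # (784,) -- one input neuron per pixel
```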
Hidden layers are where the magic happens. Each layer builds on the previous one, detecting increasingly abstract patterns. In an image classifier: layer 1 finds edges, layer 2 combines edges into shapes, layer 3 combines shapes into object parts, layer 4 recognizes whole objects. Think of it as a detective building a case — first individual clues, then connections, then the full picture.
The output layer produces the final answer. For classification, you get one neuron per category — a cat/dog classifier has 2 output neurons. For regression (predicting a number), you get one output neuron. The output values are often converted to probabilities using softmax, which ensures all outputs sum to 1.0 — so you can read them as confidence percentages.
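To see the "sum to 1.0" property concretely, here is softmax applied to three made-up raw scores (the max-subtraction is a standard numerical-stability trick; it doesn't change the result):

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating -- avoids overflow, same output
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw output scores (made up)
probs = softmax(logits)

print(probs.round(3))  # [0.659 0.242 0.099]
print(probs.sum())     # ~1.0 -- always sums to one
```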
The number and size of layers defines what the network can learn:
NETWORK DEPTH vs CAPABILITY
Layers    What It Can Learn            Real Example
──────    ─────────────────            ────────────
1         Linear boundaries            Is x > 5?
2-3       Curves and simple patterns   Digit recognition
5-20      Complex visual patterns      Image classification
50-100    Abstract reasoning           Language understanding
100+      Deep abstraction             GPT, Claude, DALL-E
More layers = more abstraction = more data needed
GPT-4 has ~120 layers. Your brain has ~6 cortical layers.
A neural network in 15 lines of Python.
import numpy as np
# Input: 3 features (e.g., pixel brightness values)
X = np.array([0.5, 0.8, 0.2])
# Layer 1: 3 inputs → 4 hidden neurons
W1 = np.random.randn(3, 4) * 0.5 # 3×4 weight matrix
b1 = np.zeros(4) # 4 biases
hidden = np.maximum(0, X @ W1 + b1) # ReLU activation
# Layer 2: 4 hidden → 2 outputs (cat vs dog)
W2 = np.random.randn(4, 2) * 0.5 # 4×2 weight matrix
b2 = np.zeros(2) # 2 biases
logits = hidden @ W2 + b2 # raw scores
# Softmax: convert raw scores to probabilities
probs = np.exp(logits) / np.sum(np.exp(logits))
print(f"Cat: {probs[0]:.1%}, Dog: {probs[1]:.1%}")
The @ operator is matrix multiplication — it computes every neuron's weighted sum in one shot. np.maximum(0, ...) is ReLU applied to the whole layer at once. That's the entire forward pass.
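You can verify that claim directly: the matrix product gives the same four numbers as looping over the neurons one at a time (weights here are random, so only the equality matters):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([0.5, 0.8, 0.2])
W1 = rng.standard_normal((3, 4)) * 0.5
b1 = np.zeros(4)

# One shot: all four weighted sums at once
vectorized = X @ W1 + b1

# By hand: one neuron at a time
looped = np.array([sum(X[i] * W1[i, j] for i in range(3)) + b1[j]
                   for j in range(4)])

print(np.allclose(vectorized, looped))  # True
```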
import torch
import torch.nn as nn
# Define the network architecture
model = nn.Sequential(
nn.Linear(3, 4), # 3 inputs → 4 hidden neurons
nn.ReLU(), # activation
nn.Linear(4, 2), # 4 hidden → 2 outputs
nn.Softmax(dim=0) # convert to probabilities
)
# Forward pass
X = torch.tensor([0.5, 0.8, 0.2])
probs = model(X)
print(f"Cat: {probs[0]:.1%}, Dog: {probs[1]:.1%}")
PyTorch's nn.Sequential builds the exact same architecture — but handles backpropagation and training automatically. The numpy version shows you what happens inside; PyTorch is what you use in production.
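One sanity check PyTorch makes easy is counting the learnable parameters; for this architecture it should match the hand count of 20 weights plus 6 biases:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 4),
    nn.ReLU(),
    nn.Linear(4, 2),
    nn.Softmax(dim=0),
)

# Each nn.Linear holds a weight matrix and a bias vector
total = sum(p.numel() for p in model.parameters())
print(total)  # 26 = (3*4 + 4) + (4*2 + 2)
```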
Not all layers are created equal.
The simple network above uses dense layers (also called fully connected) where every neuron connects to every neuron in the next layer. But real networks use specialized layer types designed for different kinds of data:
Dense (fully connected) layers.
Every neuron connects to every neuron in the next layer. Good for tabular data (spreadsheets, databases). Simple and effective, but it scales poorly for images: with a 1000x1000-pixel image, every neuron needs 1 million connections. Used as the final layers in most networks.
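That scaling claim is just multiplication — a quick back-of-the-envelope check (the hidden-layer size of 128 is illustrative):

```python
# Dense layer on a 1000x1000 image: every hidden neuron sees every pixel
inputs = 1000 * 1000           # 1,000,000 input values
hidden = 128                   # a modest hidden layer (illustrative)

weights = inputs * hidden      # one weight per connection
print(f"{weights:,} weights")  # 128,000,000 -- just for the first layer
```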
Convolutional layers.
Instead of connecting to every input, each neuron looks at a small patch (like a 3x3 window) and slides across the image. This makes CNNs excellent at finding visual patterns — edges, textures, shapes — regardless of where they appear. Used in image classification, object detection, and medical imaging.
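A minimal sketch of that sliding-window idea in plain numpy — one 3x3 filter reused at every position of a tiny image (the filter values are made up; real CNNs learn them):

```python
import numpy as np

image = np.random.rand(8, 8)      # tiny grayscale image
kernel = np.array([[-1, 0, 1],    # a made-up 3x3 vertical-edge filter
                   [-1, 0, 1],
                   [-1, 0, 1]])

h, w = image.shape
out = np.zeros((h - 2, w - 2))    # valid positions for a 3x3 window
for i in range(h - 2):
    for j in range(w - 2):
        patch = image[i:i + 3, j:j + 3]     # small local patch
        out[i, j] = np.sum(patch * kernel)  # one weighted sum per position

print(out.shape)  # (6, 6) -- the same filter slid over every location
```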
Attention layers (transformers).
Each token "pays attention" to every other token in the sequence, learning which words matter most for understanding each word. This is the architecture behind GPT, Claude, and every modern language model. The key innovation: unlike older approaches, transformers can process all words in parallel instead of one at a time.
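A minimal sketch of that attention step (scaled dot-product attention, with random vectors standing in for learned token embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8                      # 5 tokens, 8-dim embeddings (made up)
Q = rng.standard_normal((seq_len, d))  # queries
K = rng.standard_normal((seq_len, d))  # keys
V = rng.standard_normal((seq_len, d))  # values

scores = Q @ K.T / np.sqrt(d)          # every token scored against every other
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1

output = weights @ V                   # each token = weighted mix of all values
print(weights.shape, output.shape)     # (5, 5) (5, 8)
```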
What a neural network looks like.
Every neural network follows this pattern: data enters the input layer, flows through hidden layers that find patterns, and arrives at the output layer which makes the decision.
    INPUT     HIDDEN      OUTPUT
 (3 neurons)  (4 neurons)  (2 neurons)

               ┌────┐
     x₁ ──────▶│ h₁ │──┐
               └────┘  │   ┌────┐
               ┌────┐  ├──▶│ y₁ │ ← P(cat) = 0.82
     x₂ ──────▶│ h₂ │──┤   └────┘
               └────┘  │
               ┌────┐  │   ┌────┐
     x₃ ──────▶│ h₃ │──┼──▶│ y₂ │ ← P(dog) = 0.18
               └────┘  │   └────┘
               ┌────┐  │
               │ h₄ │──┘
               └────┘

 ↑ Each input    ↑ Each hidden   ↑ Output neurons
 connects to     neuron finds    give the final
 EVERY hidden    a different     prediction as
 neuron (fully   pattern in      probabilities
 connected)      the data        that sum to 1
Every arrow represents a weight — a number that gets adjusted during training. In the code above, W1 contains 12 weights (3 inputs × 4 hidden neurons) and W2 contains 8 weights (4 hidden × 2 outputs). Training means finding good values for all 20 weights — and for the 6 biases alongside them.
Data flow from start to finish.
Let's trace a single example through the entire network — from raw data to final prediction:
EXAMPLE: Classifying a 3-pixel "image" as cat or dog
INPUT: pixel values [0.5, 0.8, 0.2]
HIDDEN LAYER (4 neurons, each sees ALL inputs):
h1 = ReLU(0.5×w1  + 0.8×w2  + 0.2×w3  + bias) = 0.62
h2 = ReLU(0.5×w4  + 0.8×w5  + 0.2×w6  + bias) = 0.00 ← killed by ReLU
h3 = ReLU(0.5×w7  + 0.8×w8  + 0.2×w9  + bias) = 0.91
h4 = ReLU(0.5×w10 + 0.8×w11 + 0.2×w12 + bias) = 0.15
OUTPUT LAYER (2 neurons):
z_cat = 0.62×w13 + 0.00×w14 + 0.91×w15 + 0.15×w16 + bias
z_dog = 0.62×w17 + 0.00×w18 + 0.91×w19 + 0.15×w20 + bias
[cat, dog] = softmax([z_cat, z_dog]) ← softmax normalizes both scores together
RESULT: cat = 82%, dog = 18% → prediction: CAT
Total weights: 12 (input→hidden) + 8 (hidden→output) = 20
Total biases: 4 (hidden) + 2 (output) = 6
Total learnable parameters: 26
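The whole trace can be reproduced and the bookkeeping checked in a few lines (the weights are random here, so the exact probabilities will differ from the 82/18 split above):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([0.5, 0.8, 0.2])

W1, b1 = rng.standard_normal((3, 4)) * 0.5, np.zeros(4)
W2, b2 = rng.standard_normal((4, 2)) * 0.5, np.zeros(2)

hidden = np.maximum(0, X @ W1 + b1)              # ReLU hidden layer
logits = hidden @ W2 + b2
probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax over both scores

params = W1.size + b1.size + W2.size + b2.size
print(params)       # 26 learnable parameters
print(probs.sum())  # ~1.0 -- a valid probability distribution
```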
This tiny network has 26 parameters. GPT-4 has an estimated 1.8 trillion. The architecture is the same — layers of neurons with weights and biases — just scaled up by a factor of 70 billion.