After this lesson you'll know:
- How neurons connect to form layers
- The difference between input, hidden, and output layers
- Why architecture matters for what a network can learn
- What happens when data flows through a network
Layers are the architecture of intelligence.
A single neuron can make simple decisions. But stack neurons into layers — input, hidden, output — and suddenly the network can recognize faces, translate languages, and write code. The architecture (how many layers, how they connect) determines what the network can learn.
The input layer receives raw data and passes it forward. For an image, each input neuron holds one pixel value. For text, each input holds a token embedding. The input layer does no computation — it is purely a data entry point. Its size is fixed by the data: a 28x28 pixel image needs 784 input neurons. A sentence with 50 tokens needs 50 input positions.
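As a minimal sketch of the 28x28 case above, the snippet below flattens an image into the 784 values the input layer would hold. The image here is random placeholder data, not a real digit.

```python
import numpy as np

# Hypothetical 28x28 grayscale image with pixel values in [0, 1]
image = np.random.rand(28, 28)

# Flatten row by row into a 1-D vector: one value per input neuron
x = image.reshape(-1)

print(x.shape)  # (784,)
```

The network never sees the 2-D layout; pixel (row 1, column 0) simply becomes input number 28.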
Hidden layers are where the magic happens. Each layer builds on the previous one, detecting increasingly abstract patterns. In an image classifier: layer 1 finds edges, layer 2 combines edges into shapes, layer 3 combines shapes into object parts, layer 4 recognizes whole objects. Think of it as a detective building a case — first individual clues, then connections, then the full picture.
The output layer produces the final answer. For classification, you get one neuron per category — a cat/dog classifier has 2 output neurons. For regression (predicting a number), you get one output neuron. The output values are often converted to probabilities using softmax, which ensures all outputs sum to 1.0 — so you can read them as confidence percentages.
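The softmax step described above can be sketched in a few lines. The raw scores are made-up values for illustration, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than part of the definition.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # avoids overflow in exp; result is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 0.5])  # hypothetical raw scores for [cat, dog]
probs = softmax(logits)
print(probs, probs.sum())
```

The larger raw score always gets the larger probability, and the two values add up to 1.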
The number and size of layers define what the network can learn. As a rough rule of thumb:
| Layers | What It Can Learn | Real Example |
|---|---|---|
| 1 | Linear boundaries | Is x > 5? |
| 2-3 | Curves and simple patterns | Digit recognition |
| 5-20 | Complex visual patterns | Image classification |
| 50-100 | Abstract reasoning | Language understanding |
| 100+ | Deep abstraction | GPT, Claude, DALL-E |
More layers = more abstraction = more data needed. GPT-4 is reported to have about 120 layers, though OpenAI has not published the figure. The human neocortex has 6 layers.
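To make the first two rows of the table concrete, here is a small sketch using XOR, a pattern that no single linear layer can reproduce. One hidden layer with ReLU is enough. The weights below are hand-picked for illustration, not trained, and the 2 → 2 → 1 layout is an assumption chosen to keep the example small.

```python
import numpy as np

# The four XOR inputs and their targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Hand-picked (not trained) weights for a 2 -> 2 -> 1 network
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])

hidden = np.maximum(0, X @ W1 + b1)  # ReLU hidden layer
out = hidden @ W2                    # linear output neuron

print(out)  # [0. 1. 1. 0.]
```

Without the ReLU, the two layers would collapse into a single linear function and the outputs could not match the targets.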
A neural network in 15 lines of Python.
import numpy as np
# Input: 3 features (e.g., pixel brightness values)
X = np.array([0.5, 0.8, 0.2])
# Layer 1: 3 inputs → 4 hidden neurons
W1 = np.random.randn(3, 4) * 0.5 # 3×4 weight matrix
b1 = np.zeros(4) # 4 biases
hidden = np.maximum(0, X @ W1 + b1) # ReLU activation
# Layer 2: 4 hidden → 2 outputs (cat vs dog)
W2 = np.random.randn(4, 2) * 0.5 # 4×2 weight matrix
b2 = np.zeros(2) # 2 biases
logits = hidden @ W2 + b2 # raw scores
# Softmax: convert raw scores to probabilities
probs = np.exp(logits) / np.sum(np.exp(logits))
print(f"Cat: {probs[0]:.1%}, Dog: {probs[1]:.1%}")
The @ operator is matrix multiplication — it computes every neuron's weighted sum in one shot. np.maximum(0, ...) is ReLU applied to the whole layer at once. That's the entire forward pass.
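To see what "in one shot" means, the sketch below computes the hidden layer twice: once neuron by neuron with explicit loops, once with `@`. The fixed seed and the names `hidden_loop` and `hidden_vec` are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
X = np.array([0.5, 0.8, 0.2])
W1 = rng.standard_normal((3, 4)) * 0.5
b1 = np.zeros(4)

# One neuron at a time: weighted sum of all inputs, plus bias, then ReLU
hidden_loop = np.zeros(4)
for j in range(4):
    z = 0.0
    for i in range(3):
        z += X[i] * W1[i, j]
    hidden_loop[j] = max(0.0, z + b1[j])

# The whole layer at once
hidden_vec = np.maximum(0, X @ W1 + b1)

print(np.allclose(hidden_loop, hidden_vec))  # True
```

Column `j` of `W1` holds the three incoming weights of hidden neuron `j`, which is why the matrix has shape 3×4.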
import torch
import torch.nn as nn
# Define the network architecture
model = nn.Sequential(
    nn.Linear(3, 4),     # 3 inputs → 4 hidden neurons
    nn.ReLU(),           # activation
    nn.Linear(4, 2),     # 4 hidden → 2 outputs
    nn.Softmax(dim=-1),  # convert to probabilities over the last dimension
)
# Forward pass
X = torch.tensor([0.5, 0.8, 0.2])
probs = model(X)
print(f"Cat: {probs[0]:.1%}, Dog: {probs[1]:.1%}")
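PyTorch's nn.Sequential builds the exact same architecture, and PyTorch's autograd computes the gradients needed for backpropagation automatically. You still write the training loop (loss, optimizer, update step) yourself. The numpy version shows you what happens inside; PyTorch is what you would typically use in practice.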
Not all layers are created equal.
The simple network above uses dense layers (also called fully connected) where every neuron connects to every neuron in the next layer. But real networks use specialized layer types designed for different kinds of data:
Dense (fully connected): Every neuron connects to every neuron in the next layer. Good for tabular data (spreadsheets, databases). Simple and effective, but scales poorly for images: with a 1000x1000 pixel image as input, each neuron would need 1 million incoming connections. Often used as the final layers of a network.
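The scaling problem for dense layers is easy to check with arithmetic. In this sketch, `dense_params` is a hypothetical helper and the 100-neuron layer is an arbitrary size picked for illustration.

```python
def dense_params(n_inputs, n_neurons):
    """Weights plus biases in one fully connected layer."""
    return n_inputs * n_neurons + n_neurons

pixels = 1000 * 1000  # grayscale image, one input per pixel

print(dense_params(pixels, 1))    # 1000001
print(dense_params(pixels, 100))  # 100000100
```

Even a modest first layer of 100 neurons would already need about 100 million parameters for a single-channel image of this size.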
Convolutional: Instead of connecting to every input, each neuron looks at a small patch (like a 3x3 window) and slides across the image, reusing the same weights at every position. This makes convolutional neural networks (CNNs) excellent at finding visual patterns (edges, textures, shapes) regardless of where they appear. Used in image classification, object detection, and medical imaging.
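The sliding-window idea can be sketched with two loops. This is a simplified version with assumptions stated up front: no padding, stride 1, a single channel, and a hand-picked vertical-edge kernel rather than a learned one. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it is a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]   # the 3x3 window at this position
            out[r, c] = np.sum(patch * kernel)  # same 9 weights everywhere
    return out

# 5x5 image: dark on the left, bright on the right
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# Hand-picked kernel that responds to dark-to-bright vertical edges
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

result = conv2d(image, kernel)
print(result)  # every row is [3. 3. 0.]
```

The response is high where the window straddles the edge and zero over the uniform bright region, and the layer needs only 9 weights regardless of image size.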
Attention (transformer): Each token "pays attention" to every other token in the sequence, learning which words matter most for understanding each word. This is the architecture behind GPT, Claude, and nearly every modern language model. The key innovation: unlike older recurrent approaches, transformers can process all words in parallel instead of one at a time.
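A stripped-down sketch of the attention idea follows. It assumes the queries, keys, and values are all the raw token embeddings (real transformers apply learned projections first and use multiple heads), and the 5 tokens with 8-dimensional embeddings are random placeholders.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                         # (tokens, tokens) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over each row
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))  # 5 hypothetical tokens, 8-dim embeddings
out, weights = self_attention(X)

print(out.shape, weights.shape)  # (5, 8) (5, 5)
```

Row `i` of `weights` says how much token `i` attends to each token in the sequence. All rows are computed in a single matrix product, which is the parallelism the paragraph above refers to.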
What a neural network looks like.
Every neural network follows this pattern: data enters the input layer, flows through hidden layers that find patterns, and arrives at the output layer which makes the decision.
Every connection between two neurons carries a weight, a number that gets adjusted during training. In the code above, W1 contains 12 weights (3 inputs × 4 hidden neurons) and W2 contains 8 weights (4 hidden × 2 outputs). Training means finding good values for all 20 weights and the 6 biases in b1 and b2.
Test your understanding.
1. Why do neural networks need hidden layers?
2. For a cats vs dogs image classifier, why do you need 2 output neurons?
Data flow from start to finish.
Let's trace a single example through the entire network — from raw data to final prediction:
EXAMPLE: Classifying a 3-pixel "image" as cat or dog
INPUT: pixel values [0.5, 0.8, 0.2]
HIDDEN LAYER (4 neurons, each sees ALL inputs; the activations shown are illustrative and depend on the weights):
  h1 = ReLU(0.5×w1 + 0.8×w2 + 0.2×w3 + b_h1) = 0.62
  h2 = ReLU(0.5×w4 + 0.8×w5 + 0.2×w6 + b_h2) = 0.00 ← killed by ReLU
  h3 = ReLU(0.5×w7 + 0.8×w8 + 0.2×w9 + b_h3) = 0.91
  h4 = ReLU(0.5×w10 + 0.8×w11 + 0.2×w12 + b_h4) = 0.15
OUTPUT LAYER (2 neurons; raw scores first, then one softmax over both):
  z_cat = 0.62×w13 + 0.00×w14 + 0.91×w15 + 0.15×w16 + b_cat
  z_dog = 0.62×w17 + 0.00×w18 + 0.91×w19 + 0.15×w20 + b_dog
  [cat, dog] = softmax([z_cat, z_dog])
RESULT: cat = 82%, dog = 18% → prediction: CAT
Total weights: 12 (input→hidden) + 8 (hidden→output) = 20
Total biases: 4 (hidden) + 2 (output) = 6
Total learnable parameters: 26
This tiny network has 26 parameters. GPT-4 has an estimated 1.8 trillion (an unconfirmed figure). The basic building blocks are the same (layers of units with weights and biases), scaled up by a factor of roughly 70 billion, although large language models also add attention layers.
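The parameter arithmetic above generalizes to any stack of dense layers. In this sketch, `count_parameters` is a hypothetical helper written for this example, and the 1.8 trillion figure is the estimate quoted in the text, not a confirmed number.

```python
def count_parameters(layer_sizes):
    """Return (weights, biases) for a fully connected network."""
    # Each pair of adjacent layers contributes inputs x outputs weights
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # Every neuron after the input layer has one bias
    biases = sum(layer_sizes[1:])
    return weights, biases

weights, biases = count_parameters([3, 4, 2])
total = weights + biases
print(weights, biases, total)   # 20 6 26

# Ratio between the quoted GPT-4 estimate and this network
print(f"{1.8e12 / total:.1e}")  # 6.9e+10
```

Changing the list to something like `[784, 128, 10]` shows how quickly the count grows once the input is a real image.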