Neural Networks Without the Hype
What neural networks actually do under the hood. Code included. No analogies about brains.

Every neural network explanation starts by telling you it's "inspired by the human brain." I get why; it's an easy hook. But the analogy falls apart almost immediately, and then you're stuck thinking about neurons and synapses when what's actually happening is matrix math.
A neural network is a function. It takes numbers in and spits numbers out. The interesting part is that the function has millions of adjustable parameters, and through a training process, those parameters get tuned so the output is useful for whatever task you're trying to accomplish. That's the entire concept. Everything else is implementation detail.
I'm going to walk through the code for a simple neural network because I find that reading code makes these things click faster than reading descriptions. If you know basic Python and aren't scared of numpy, you'll be fine.
What the Network Looks Like
Picture a chain of matrix multiplications. Your input data (numbers) gets multiplied by a matrix of weights, then a non-linear function gets applied, then the result gets multiplied by another matrix of weights, another non-linear function, and so on until you get an output.
Each multiplication step is called a "layer." The non-linear function applied between layers is called an "activation function." Without the activation function, stacking multiple layers of matrix multiplication would just collapse into a single matrix multiplication; it'd be equivalent to one layer. The activation function is what lets the network model non-linear relationships, things that aren't just straight lines.
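If that claim sounds hand-wavy, it's easy to check with numpy. A quick sketch (the shapes here are arbitrary, picked just for the demonstration): two weight matrices applied back to back with no activation in between give exactly the same result as one combined matrix.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))    # 5 examples, 4 features
W1 = rng.standard_normal((4, 8))   # first "layer"
W2 = rng.standard_normal((8, 3))   # second "layer"

two_layers = X @ W1 @ W2           # two linear layers, no activation in between
one_layer = X @ (W1 @ W2)          # a single equivalent layer

print(np.allclose(two_layers, one_layer))  # True: the stack collapsed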
That's the forward pass. Data goes in one end, passes through layers, comes out the other end as a prediction.
The Activation Function
The simplest activation function to understand is sigmoid. It takes any number and squashes it into the range 0 to 1.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
Large positive numbers become close to 1. Large negative numbers become close to 0. Zero maps to 0.5. If you plot it, it looks like an S-curve.
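A few sample values, using the sigmoid defined above, make the squashing concrete:

for z in [-10, -2, 0, 2, 10]:
    print(f"sigmoid({z}) = {sigmoid(z):.4f}")
# sigmoid(-10) = 0.0000
# sigmoid(-2) = 0.1192
# sigmoid(0) = 0.5000
# sigmoid(2) = 0.8808
# sigmoid(10) = 1.0000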
Why sigmoid specifically? Historically, it was popular because the output looks like a probability, a number between 0 and 1. That's useful for binary classification: "is this email spam?" An output of 0.92 means the email is 92% likely to be spam.
In modern deep learning, most hidden layers use ReLU (max(0, x)) instead of sigmoid because it trains faster and doesn't have the "vanishing gradient" problem that sigmoid has in deep networks. But sigmoid is easier to understand and works fine for a simple example, so I'll stick with it here.
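For comparison, ReLU is even simpler to write down. I'm not using it in this example, but it's one line:

def relu(z):
    # Keeps positive values as-is, clips everything negative to zero
    return np.maximum(0, z)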
Setting Up the Network
class SimpleNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Weight matrix for input -> hidden layer
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        # Weight matrix for hidden layer -> output
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))
A few things to unpack here.
np.random.randn(input_size, hidden_size) * 0.01 creates a matrix of random numbers drawn from a standard normal distribution, then scales them down by multiplying by 0.01. The random initialization is important: if all weights started at zero, every neuron in a layer would compute the same thing and they'd all update identically during training. You'd effectively have a one-neuron layer no matter how wide you made it. Random initialization breaks that symmetry.
The * 0.01 keeps the initial values small. Large initial weights can cause the sigmoid function to saturate (output very close to 0 or 1), which makes gradients tiny and training painfully slow. Small weights keep the sigmoid in its "responsive" range where gradients flow better. This matters more than you'd think. I've seen training runs that completely failed to converge because of bad weight initialization.
self.b1 is the bias vector. It's an offset added after the matrix multiplication. Without it, the layer's output would always be zero when the input is zero, which limits what the network can learn. Biases give each neuron the ability to shift its activation threshold. Initializing them to zero is fine; they don't have the symmetry problem that weights have.
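If you don't want to take the symmetry and saturation points on faith, here's a quick check I'm adding for illustration (it's not part of the network class): with all-zero weights every hidden neuron computes the same value, and with large weights the sigmoid saturates.

X_demo = np.random.randn(5, 4)               # 5 examples, 4 features

W_zero = np.zeros((4, 8))
print(sigmoid(np.dot(X_demo, W_zero))[0])    # all 8 hidden neurons output 0.5, identical

W_big = np.random.randn(4, 8) * 10           # large initial weights
print(sigmoid(np.dot(X_demo, W_big))[0].round(3))  # mostly 0.0 or 1.0: saturated, tiny gradients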
The Forward Pass
This is where the prediction happens.
    def forward(self, X):
        # Layer 1: multiply inputs by weights, add bias
        self.Z1 = np.dot(X, self.W1) + self.b1
        # Apply activation function
        self.A1 = sigmoid(self.Z1)
        # Layer 2: hidden activations * weights + bias
        self.Z2 = np.dot(self.A1, self.W2) + self.b2
        # Final activation gives us the prediction
        self.A2 = sigmoid(self.Z2)
        return self.A2
I stored the intermediate values (Z1, A1, Z2, A2) as instance variables because we'll need them for backpropagation during training. If you only cared about making predictions, you wouldn't need to save them.
Let me trace through what happens with some concrete numbers. Say we have 4 input features, 8 hidden neurons, and 1 output.
model = SimpleNetwork(input_size=4, hidden_size=8, output_size=1)
# A single example with 4 features
sample_input = np.array([[120, 1, 0.2, 5]])
prediction = model.forward(sample_input)
print(f"Output: {prediction[0][0]:.4f}")
# Output: something close to 0.5
The output will be around 0.5 because the weights are random and small. The network is guessing. It has no idea what the correct answer should be. To make it useful, we need to train it, which means adjusting the weights so the output gets closer to the right answer.
Training: The Part Everyone Glosses Over
The forward pass is the easy part. Training is where it gets interesting and where a lot of the intuition lives.
Training works like this: you show the network an example, it makes a prediction, you measure how wrong the prediction was (using a loss function), and then you adjust the weights in the direction that would make the prediction less wrong. Repeat this thousands or millions of times.
The "adjust the weights" step is called backpropagation, and it uses calculus (specifically the chain rule) to figure out how much each weight contributed to the error. I'm not going to derive the full math here โ there are textbooks for that โ but I'll show the implementation because the code is more readable than the equations.
    def train(self, X, y, learning_rate=0.1):
        # Forward pass
        prediction = self.forward(X)
        m = X.shape[0]  # number of examples

        # How wrong were we? (derivative of binary cross-entropy loss)
        dZ2 = prediction - y

        # Gradients for layer 2 weights and biases
        dW2 = (1/m) * np.dot(self.A1.T, dZ2)
        db2 = (1/m) * np.sum(dZ2, axis=0, keepdims=True)

        # Propagate the error back to layer 1
        dA1 = np.dot(dZ2, self.W2.T)
        dZ1 = dA1 * self.A1 * (1 - self.A1)  # sigmoid derivative

        # Gradients for layer 1 weights and biases
        dW1 = (1/m) * np.dot(X.T, dZ1)
        db1 = (1/m) * np.sum(dZ1, axis=0, keepdims=True)

        # Update weights: move them in the opposite direction of the gradient
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
The learning_rate controls how big the weight adjustments are. Too large and the network overshoots the optimal values, bouncing around without converging. Too small and training takes forever. Finding the right learning rate is more art than science in my experience. There are adaptive methods (Adam, RMSProp) that help, but for this example a fixed learning rate works.
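If the tradeoff feels abstract, here's a toy illustration that has nothing to do with the network above: plain gradient descent on f(x) = x^2, whose minimum sits at x = 0. The function and values are mine, purely for demonstration.

def descend(learning_rate, steps=10, x=5.0):
    for _ in range(steps):
        x -= learning_rate * 2 * x   # the gradient of x^2 is 2x
    return x

print(descend(0.1))    # ~0.54: creeping toward 0, slowly
print(descend(0.45))   # ~0.0000000005: converges fast
print(descend(1.5))    # 5120.0: each step overshoots and lands farther away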
prediction - y is the error signal. If the network predicted 0.8 and the true answer was 0, the error is 0.8. The rest of the function works backwards through the layers, computing how much each weight contributed to that error and adjusting accordingly.
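One genuinely useful debugging tool here is a numerical gradient check: nudge a single weight by a tiny epsilon, measure how much the loss changes, and compare that to the analytic gradient. The helpers below are a sketch I'm adding, not part of the class above; they recompute the same backprop math as train() without updating anything, just to show the idea.

def bce_loss(model, X, y):
    # Same binary cross-entropy formula used in the training loop below
    p = model.forward(X)
    return -np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))

def numerical_grad_W1(model, X, y, i, j, eps=1e-5):
    # Central difference: nudge W1[i, j] up and down, measure the loss change
    original = model.W1[i, j]
    model.W1[i, j] = original + eps
    loss_plus = bce_loss(model, X, y)
    model.W1[i, j] = original - eps
    loss_minus = bce_loss(model, X, y)
    model.W1[i, j] = original
    return (loss_plus - loss_minus) / (2 * eps)

def analytic_grad_W1(model, X, y):
    # The same math as train(), minus the weight update
    m = X.shape[0]
    prediction = model.forward(X)
    dZ2 = prediction - y
    dA1 = np.dot(dZ2, model.W2.T)
    dZ1 = dA1 * model.A1 * (1 - model.A1)
    return (1/m) * np.dot(X.T, dZ1)

check_net = SimpleNetwork(4, 8, 1)
X_check = np.random.randn(10, 4)
y_check = (np.random.rand(10, 1) > 0.5).astype(float)

print(analytic_grad_W1(check_net, X_check, y_check)[0, 0])
print(numerical_grad_W1(check_net, X_check, y_check, 0, 0))
# The two numbers should agree to several decimal places.
# If they don't, there's a bug in the backward pass.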
Putting It All Together
# Generate some toy data
np.random.seed(42)
X = np.random.randn(200, 4)
# Label: 1 if sum of features > 0, else 0
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)

model = SimpleNetwork(4, 8, 1)

# Train for 1000 iterations
for i in range(1000):
    model.train(X, y, learning_rate=0.5)
    if i % 200 == 0:
        predictions = model.forward(X)
        loss = -np.mean(y * np.log(predictions + 1e-8) + (1 - y) * np.log(1 - predictions + 1e-8))
        accuracy = np.mean((predictions > 0.5) == y)
        print(f"Step {i}: loss={loss:.4f}, accuracy={accuracy:.2%}")
After 1000 steps, this simple network should get to about 95%+ accuracy on this toy task. The task itself is trivial (just predict whether the sum of the input features is positive), but it shows the complete training loop: forward pass, compute loss, backward pass, update weights, repeat.
What I Skipped (and Why It Matters)
There's a lot I'm leaving out. Regularization, which prevents the network from memorizing the training data instead of learning general patterns. Batch normalization, which stabilizes training by normalizing the activations between layers. Dropout, which randomly turns off neurons during training to force the network to be more resilient. Different optimizers. Learning rate schedules. Weight initialization strategies beyond the simple approach I used.
Each of these addresses a specific problem that shows up when you scale from a toy example to a real-world model: overfitting, vanishing gradients, training instability, slow convergence. The basic architecture I showed here, layers of matrix multiplications with non-linear activations, is the foundation of everything from image classifiers to GPT. The difference is scale (billions of parameters instead of dozens) and the accumulated tricks for making training work at that scale.
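To give a flavor of how small some of these additions can be: L2 regularization (weight decay) is often just an extra term in the gradient. Here's a sketch of what it might look like bolted onto the train() method above, with a hypothetical lambda_reg parameter that isn't in the class as written.

# Inside train(), assuming an extra lambda_reg argument (my addition, not in the class above).
# The penalty nudges every weight toward zero, which discourages the network from
# leaning too hard on any single weight and helps reduce overfitting.
dW2 = (1/m) * np.dot(self.A1.T, dZ2) + (lambda_reg / m) * self.W2
dW1 = (1/m) * np.dot(X.T, dZ1) + (lambda_reg / m) * self.W1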
I think the mistake people make when learning this stuff is starting with the big models and working backward. If you understand the forward pass and backpropagation on a tiny network first, the architecture papers for transformers and convnets become way more approachable. They're not different in kind; they're different in the specific way the layers are structured and connected.
The hardest part for me personally wasn't the math. It was building intuition for when things go wrong during training. Loss not decreasing? Could be the learning rate too high, or too low, or bad initialization, or a bug in your gradient computation, or data that hasn't been normalized. There's no single diagnostic tool. You just develop a feel for it over time, and even then I still get it wrong sometimes. You train a model for hours and the loss just sits there, refusing to budge, and you start questioning everything.
Anyway. The code above is a complete, working neural network in about 40 lines of Python. No libraries beyond numpy. If you copy it, run it, and step through the matrices with a debugger, you'll understand neural networks better than most people who've only watched explainer videos.
Written by
Anurag Sinha
Developer who writes about the stuff I actually use day-to-day. If I got something wrong, let me know.