Tinker with a neural network in your browser. Self-made model without any calculus.

Basic Tips/Info

  • Increase the epochs per frame (and lower the resolution) to massively speed up training (try to maximize FPS * epochs per frame)
  • If the cost remains high (or at 0.5) and the gradient magnitude is 0, it may help to reset the weights, as this suggests that the network is stuck in a local minimum or a flat region
  • If the output texture fluctuates excessively, the network is still learning: either the weight updates are still gradually slowing down (like a skateboard gradually losing speed and coming to a stop on a half pipe) or the network keeps slightly overshooting back and forth
  • On more complicated patterns, adding more inputs will likely help dramatically
  • A larger network (more hidden layers and more inputs) means more time per epoch

What is a Neural Network?

A neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain. It is a type of machine learning process, called deep learning, that uses interconnected nodes or neurons in a layered structure that resembles the human brain. It creates an adaptive system that computers use to learn from their mistakes and improve continuously. Thus, artificial neural networks attempt to solve complicated problems, like summarizing documents or recognizing faces, with greater accuracy.
— Amazon

How does a Neural Network learn?

A neural network learns by iteratively adjusting its weights and biases based on the observed input-output pairs during the training process. It computes the predicted outputs, compares them to the actual outputs using a loss function, and then updates the weights and biases through backpropagation, aiming to minimize the loss. This iterative optimization process gradually refines the network's parameters, allowing it to capture and generalize complex patterns in the data and improve its performance over time.
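
The loop described above can be sketched on the smallest possible model. This is an illustrative example, not the project's code: the data points, the learning rate, the epoch count, and the one-weight model y = w*x + b are all assumptions chosen to keep the arithmetic visible.

```python
# Minimal sketch: fit y = w*x + b to a few points with plain gradient descent.
# The data, learning rate, and epoch count are illustrative assumptions.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # points that lie on y = 2x + 1

w, b = 0.0, 0.0        # parameters start at arbitrary values
learning_rate = 0.1


def mean_squared_error(w, b):
    """Loss function: average squared gap between prediction and target."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


initial_loss = mean_squared_error(w, b)

for epoch in range(2000):
    # Gradients of the loss with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # Step "downhill": subtract a fraction of each gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mean_squared_error(w, b)
print(f"w={w:.4f}, b={b:.4f}, loss {initial_loss:.4f} -> {final_loss:.2e}")
```

A real network repeats the same predict, compare, update cycle, only with many more weights and with backpropagation computing the gradients layer by layer.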

Hyperparameters/Settings Explained

Visualization Specific

These settings do not reload the neural network and can be freely changed while the neural network is training.

Resolution

  • How detailed the output texture is (number of squares the texture is made up of)
  • Higher: smoother/clearer visualization of the neural network
  • Lower: higher FPS
  • Good values are 5-100, but you can increase it after training is complete or while learning is paused for a more detailed output
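
To see why resolution affects FPS, here is a rough sketch of how an output texture could be drawn. It assumes the resolution is the number of squares per side and that every square costs one evaluation of the network; `render_texture` and `toy_model` are hypothetical names, not taken from the project.

```python
def render_texture(model, resolution):
    """Evaluate `model` at the center of each cell in a resolution x resolution grid."""
    texture = []
    for row in range(resolution):
        y = (row + 0.5) / resolution
        texture.append(
            [model((col + 0.5) / resolution, y) for col in range(resolution)]
        )
    return texture


def toy_model(x, y):
    # Hypothetical stand-in for the trained network: position in, (R, G, B) out
    return (x, y, 0.5)


coarse = render_texture(toy_model, 5)    # 25 network evaluations
fine = render_texture(toy_model, 50)     # 2500 network evaluations
```

Under these assumptions, going from 5 to 50 multiplies the work per frame by 100, which is why a low resolution during training and a high one afterwards is a sensible pattern.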

Epochs per frame

  • How many training epochs happen per frame
  • Higher: faster training (but might also increase excessive fluctuations in the output texture)
  • Lower: higher FPS
  • Good values are 1-25, but the goal should be to increase this as much as possible without losing too much FPS (because epochs per second = epochs per frame * FPS)
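
The trade-off is easy to check with the formula above. The FPS figures below are made-up measurements for illustration only; the real numbers depend on your machine and network size.

```python
def epochs_per_second(epochs_per_frame, fps):
    """Training throughput: epochs per frame multiplied by frames per second."""
    return epochs_per_frame * fps


# Hypothetical measurements: more epochs per frame lowers FPS,
# but overall throughput can still go up.
smooth = epochs_per_second(1, 60)     # 60 epochs per second
choppy = epochs_per_second(25, 20)    # 500 epochs per second
```

In this made-up case the display is three times less smooth, yet training runs more than eight times faster.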

Neural Network

Except for the learning rate and the momentum, these hyperparameters are not applied immediately and require the "Apply" button to be clicked before they take effect (which also resets the network's weights and biases).

Learning Rate

  • How much of the weight gradient to subtract from the weights
  • Higher: weights and biases are updated more aggressively, which can lead to faster convergence and a better chance of skipping over/escaping local minima
  • Lower: less chance of overshooting the optimal solution, higher stability, and better learning of specific/highly detailed patterns
  • Generally good values are greater than 0 and at most 1, with smaller values usually being preferred (0.001, 0.003, 0.01, etc.), but it depends heavily on the data

Momentum

  • What proportion of the weight gradient to keep for the next training iteration/epoch
  • Higher: possibly faster convergence and improved avoidance of local minima
  • Lower: less chance of overshooting and less aggressive fluctuations in the output texture
  • Good values vary, but must be between 0 and 1 (inclusive)
    • The momentum value should be balanced with the learning rate. If the learning rate is large, a smaller momentum value may be appropriate to avoid overshooting and instability (and vice versa)
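
As a rough illustration of how the learning rate and momentum interact, the sketch below uses the classical momentum update on a single weight. The exact update rule used by this visualization is not shown on this page, so treat the formula, the function name `momentum_step`, and the test function f(w) = w^2 as assumptions.

```python
def momentum_step(weight, velocity, gradient, learning_rate, momentum):
    """One classical momentum update for a single weight.

    `velocity` carries over a proportion (`momentum`) of the previous update;
    the learning rate scales how much of the new gradient is subtracted.
    """
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity


# Minimize f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, 2 * w, learning_rate=0.05, momentum=0.9)

print(f"w after 300 steps: {w:.2e}")
```

With momentum at 0 the update reduces to plain gradient descent. With a high momentum the weight swings past the minimum a few times before settling, which matches the back-and-forth overshooting mentioned in the tips.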

Hidden Layers

  • The sizes of the hidden layers
    • All the layers are completely connected to each other
    • For example a value of "6, 4" would mean that there is the input layer (configured by the "Inputs" setting), then a layer with 6 neurons, then a layer with 4 neurons, then an output layer with 3 neurons (for R, G, B)
  • Having more layers with more neurons increases the amount of time per epoch
  • Good values depend a lot on your data. For the most part, the complexity from just 2d points with color should not require too many layers/neurons, but it depends.
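
The "6, 4" example can be turned into layer sizes and a parameter count with a few lines. This sketch assumes two inputs (X and Y) and one bias per neuron; `layer_sizes` and `parameter_count` are hypothetical helper names.

```python
def layer_sizes(num_inputs, hidden_layers_setting, num_outputs=3):
    """Turn a setting such as "6, 4" into the full list of layer sizes."""
    hidden = [int(part) for part in hidden_layers_setting.split(",") if part.strip()]
    return [num_inputs] + hidden + [num_outputs]


def parameter_count(sizes):
    """Each fully connected layer has inputs * outputs weights plus one bias per output."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


sizes = layer_sizes(2, "6, 4")      # [2, 6, 4, 3]
total = parameter_count(sizes)      # (2*6 + 6) + (6*4 + 4) + (4*3 + 3) = 61
```

Every one of those parameters is updated each epoch, which is why larger networks take longer per epoch.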

Activation

  • An activation function is a mathematical function applied to the output of each neuron
  • They introduce non-linearity to the network, which is needed to enable the network to learn and represent nonlinear patterns
  • (Note: "exp(x)" = "e^x")
Leaky ReLU
  • Mathematical Definition: f(x) = max(0.1x, x)
  • Strengths:
    • Resolves the "dying ReLU" problem by introducing a small gradient for negative values
    • Can converge faster than ReLU
  • Weaknesses:
    • The gradient for negative values is small (0.1), so the learning signal can still shrink as it passes back through many layers
ReLU (Rectified Linear Unit)
  • Mathematical Definition: f(x) = max(0, x)
  • Strengths:
    • Simplicity and computational efficiency
    • Avoids vanishing gradient problem for positive values
  • Weaknesses:
    • Prone to "dying ReLU" problem, where neurons can get stuck in a state of inactivity
Sigmoid
  • Mathematical Definition: σ(x) = 1 / (1 + exp(-x))
  • Strengths:
    • Smooth and bounded between 0 and 1, which can be interpreted as probabilities
  • Weaknesses:
    • Suffers from vanishing gradients (the difference between σ(10) and σ(11) is close to 0), making it harder for deep networks to converge
    • Output is not centered around zero
Tanh (Hyperbolic Tangent)
  • Mathematical Definition: tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
  • Strengths:
    • Squashes the output between -1 and 1, making it suitable for outputs that need to be symmetrically centered
  • Weaknesses:
    • Still prone to the vanishing gradient problem, particularly for inputs of very large magnitude (strongly positive or strongly negative)
Linear
  • Mathematical Definition: f(x) = x
  • Strengths:
    • Preserves the linearity of the input, making it suitable for regression tasks
    • Avoids the vanishing gradient problem and provides a straightforward mapping
  • Weaknesses:
    • Lacks non-linearity, limiting its ability to model complex patterns
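
The definitions above translate directly into code. This is a plain transcription of the formulas listed here, written for single numbers; it is a sketch and not the project's implementation.

```python
import math


def leaky_relu(x):
    return max(0.1 * x, x)


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def linear(x):
    return x


# Sigmoid saturation: a one-unit step in the input barely moves the output
saturation_gap = sigmoid(11) - sigmoid(10)
print(f"sigmoid(11) - sigmoid(10) = {saturation_gap:.2e}")
```

The tiny gap printed at the end is the vanishing gradient issue in miniature: once a sigmoid neuron saturates, changing its input changes its output very little, so there is little signal to learn from.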

Weight Initialization

How the weights are initially set (biases always start at 0)

Random
  • A random value from a uniform distribution from -0.5 to 0.5
  • Strengths:
    • Simplicity and ease of implementation
    • Provides initial diversity in weights, allowing the network to explore different solutions
  • Weaknesses:
    • Lack of control over the magnitude and distribution of weights
    • Prone to vanishing or exploding gradients
Xavier (Glorot)
  • A random value from a normal distribution with a mean of 0 and a standard deviation of sqrt(2 / (number of inputs + number of outputs))
    • Where the number of inputs/outputs are different per layer
  • Strengths:
    • Takes into account the size of the weight matrix, helping to maintain a stable signal flow
    • Mitigates the vanishing or exploding gradient problem more effectively than random initialization
  • Weaknesses:
    • Assumes the activation function to be linear, which may not be accurate for all activation functions
    • May not be suitable for very deep networks or networks with non-linear activation functions
He (Kaiming)
  • A random value from a normal distribution with a mean of 0 and a standard deviation of sqrt(2 / number of inputs)
    • Where the number of inputs is different per layer
  • Strengths:
    • Specifically designed for networks using the ReLU activation function
    • Addresses the vanishing gradient problem by considering the rectifier non-linearity of ReLU
  • Weaknesses:
    • Assumes the activation function to be ReLU
    • Similar to Xavier initialization, may not be optimal for very deep networks or networks with non-linear activation functions other than ReLU
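
The three schemes differ only in the distribution each weight is drawn from. The sketch below follows the formulas given above using Python's standard `random` module; the function names and the row-per-output-neuron layout are assumptions made for illustration.

```python
import math
import random


def random_init(n_in, n_out, rng):
    """Uniform values from -0.5 to 0.5."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def xavier_init(n_in, n_out, rng):
    """Normal values with standard deviation sqrt(2 / (n_in + n_out))."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def he_init(n_in, n_out, rng):
    """Normal values with standard deviation sqrt(2 / n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)              # seeded so the example is repeatable
first_layer = he_init(2, 6, rng)    # 6 neurons, each with 2 incoming weights
biases = [0.0] * 6                  # biases always start at 0
```

Note how both Xavier and He shrink the spread of the weights as a layer gets more inputs, which keeps the sum feeding each neuron in a similar range regardless of layer size.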

Regularization

Regularization techniques prevent overfitting and improve generalization by adding a penalty term to the loss function. This encourages simpler models and reduces reliance on noise or irrelevant patterns in the training data. The best choice of regularization depends on the specific problem, the available data, and the model complexity.

L1 (Lasso)
  • Adds the L1 norm multiplied by the regularization rate to the cost. The L1 norm is the sum of the absolute values of the weights (divided by the number of weights).
  • Strengths:
    • Encourages sparsity in weight values, driving some weights to become exactly zero
    • Effective for feature selection, as it can push irrelevant or less important features to have zero weights
  • Weaknesses:
    • May lead to a non-unique solution when multiple features are highly correlated
    • Can be sensitive to the choice of the regularization rate and require careful tuning
L2 (Ridge)
  • Adds the L2 norm multiplied by the regularization rate to the cost. The L2 norm is the sum of the squares of the weights (divided by the number of weights).
  • Strengths:
    • Encourages smaller weight values, effectively limiting the magnitudes of the weights
    • Provides smoother and more stable solutions compared to L1 regularization
  • Weaknesses:
    • May not drive any weights to exactly zero
    • L2 regularization does not explicitly perform feature selection and keeps all features in the model
None
  • No regularization 
  • Strengths:
    • Simplicity and reduced computational overhead, as there are no additional terms to consider in the loss function
    • Good if the "goal" is overfitting
  • Weaknesses:
    • Increased risk of overfitting
    • Lack of explicit control over the model's complexity, making it more prone to capturing noise and specific patterns in the training data

Regularization Rate

  • How much effect the regularization has on the cost
  • Higher: increased protection against overfitting and results in a simpler model
  • Lower: less chance of underfitting and faster convergence
  • Good values are usually very low, and for this visualization you probably want close to no regularization for faster convergence and a closer fit to the data points
    • (In theory you would want to change the learning rate as training continued, but this visualization does not support that)
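
The two penalties and the role of the regularization rate can be sketched as follows. The formulas follow the descriptions above (sum divided by the number of weights, multiplied by the rate); the example weights, base cost, and rate are made-up values.

```python
def l1_penalty(weights, rate):
    """Regularization rate times the mean absolute weight."""
    return rate * sum(abs(w) for w in weights) / len(weights)


def l2_penalty(weights, rate):
    """Regularization rate times the mean squared weight."""
    return rate * sum(w * w for w in weights) / len(weights)


weights = [0.5, -1.0, 2.0, 0.0]   # illustrative values
base_cost = 0.25                  # illustrative cost before regularization
rate = 0.01

cost_l1 = base_cost + l1_penalty(weights, rate)   # 0.25 + 0.01 * 3.5 / 4
cost_l2 = base_cost + l2_penalty(weights, rate)   # 0.25 + 0.01 * 5.25 / 4
```

L2 punishes the single large weight (2.0) far more than the small ones because it squares each value, while L1 charges every weight in proportion to its size. A rate of 0 turns either penalty off.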

Inputs

What inputs the model gets. More is usually better (but it might increase the time per epoch a little). While the network could learn the non-linear relationships on its own, supplying them directly as inputs allows for faster convergence and potentially fewer neurons/hidden layers to achieve the same accuracy.
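
To illustrate why extra inputs help, the sketch below expands a point into several derived features. The particular feature list is hypothetical (the inputs this visualization actually offers are not listed in this text), and `expand_inputs` and `inside_circle_score` are made-up names.

```python
import math


def expand_inputs(x, y):
    """Hypothetical feature set: raw position plus a few non-linear terms."""
    return [x, y, x * x, y * y, x * y, math.sin(x), math.sin(y)]


def inside_circle_score(features, radius=1.0):
    """Positive inside a circle around the origin, negative outside.

    Computes radius^2 - x^2 - y^2, which is a plain weighted sum of the
    expanded features, so no hidden layer is needed to express it.
    """
    _, _, x_squared, y_squared, *_ = features
    return radius * radius - x_squared - y_squared


center = inside_circle_score(expand_inputs(0.0, 0.0))    # positive: inside
far_away = inside_circle_score(expand_inputs(2.0, 0.0))  # negative: outside
```

With only X and Y as inputs, the network would need hidden neurons to approximate that circular boundary. With the squared terms supplied, a single weighted sum already captures it.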

Custom Points

  • Left click to add points
  • The color of the added point is the selected color
  • To clear the data points scroll to the bottom of the right bar

Download

Click download now to get access to the following files:

Source Code.zip 7 kB
Web Build.zip 7 kB

Comments

(1 edit) (-1)

I love the idea of playing with Neural Networks! I am excited to see a way to visualise training!

However, what is the goal here? The layers are training themselves, but what is it they are seeking to do? I can see lots of dials, but no single guiding principle.

(1 edit) (+1)

The primary objective of the training process is to discover a function that takes into account a pixel’s position (X, Y, and any other activated inputs) and produces a color (a set of 3 RGB numbers) based on the given data points. These data points refer to the “example” points, represented by dots that can be placed using the color selector buttons and left-clicking to add new points. The right panel provides an option to clear these points.

(I hope this helps!)

(-1)

The thing about Neural Networks is that you kill off ineffective permutations as the training is refined.

So it seems from your response that rather than having a directed goal, the function acts more or less like a random walk? 

(2 edits)

Sorry if my previous response wasn’t that clear.

The network learns through “gradient descent.”

Gradient descent is a fundamental optimization algorithm used in training neural networks. It works by iteratively adjusting the network’s parameters to minimize the difference between its predicted outputs and the actual targets in the training data.

The process starts with initializing the parameters randomly. Then, for each training example, the network calculates the error between its prediction and the target output. The gradients of the parameters with respect to this error are computed, indicating the direction and magnitude of the steepest descent in the error space.

The parameters are updated by subtracting a fraction of the gradients, known as the learning rate, from their current values. This step is repeated for all training examples in multiple iterations or epochs, gradually reducing the error.

By repeatedly adjusting the parameters based on the gradients, the neural network “descends” the error surface, converging towards a set of parameter values that yield better predictions. This iterative process enables the network to learn and improve its performance over time.

— ChatGPT

Think of the gradients like slope. You are always trying to determine which way is “downhill” and move towards the lowest possible point. A good visualization of gradient descent is this video: https://youtu.be/hfMk-kjRv4c?t=907 (from 15:07).