### Introduction: From Simple Linear Models to Neural Networks

Neural networks are essentially computational models designed to map input data to output data by learning weights through optimization. We’ll begin by explaining the simplest neural network: a linear model where the function is $f(x) = w \cdot x$.

In this case:

w is the weight we are adjusting.

x is the input.

f(x) represents the predicted output.

The goal of the neural network is to **adjust the weight $w$** so that the output $f(x)$ closely matches the desired output provided by the programmer.

### Loss Function and Gradient Descent

To quantify how good or bad a prediction is, we use a **cost function**. A common cost function for regression tasks is the **Mean Squared Error (MSE)**, which is calculated as:

$$C(w) = \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2$$

Here:

$C(w)$ is the cost function.

$f(x_i)$ is the predicted output for the $i$-th input.

$y_i$ is the desired output for the $i$-th input.

$n$ is the number of data points.

When the network's predictions match the desired outputs, the cost function will be close to 0. The further off the predictions are, the larger the cost becomes.
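As a rough illustration of the cost function, here is a minimal sketch that evaluates the MSE of the linear model $f(x) = w \cdot x$. The toy data and the helper name `mse` are assumptions made for this example.

```python
import numpy as np

def mse(w, x, y):
    """Mean squared error of the linear model f(x) = w * x."""
    predictions = w * x
    return np.mean((predictions - y) ** 2)

# Hypothetical toy data generated by y = 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(mse(2.0, x, y))  # the matching weight gives a cost of 0.0
print(mse(1.0, x, y))  # a wrong weight gives a larger cost
```

The second call returns a positive number because every prediction misses its target.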

### 💡 __This is where the Deep Analysis of Neural Networks starts__

### Gradient Descent

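To find the weight $w$ that minimizes the cost function, we use **gradient descent**. The gradient of the cost function tells us in which direction the cost increases, so we move the weight in the opposite direction. Mathematically, for $f(x) = w \cdot x$, the gradient of the cost function with respect to $w$ is:

$$\frac{\partial C}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right) x_i$$

To minimize the cost, we update $w$ using the following equation:

$$w \leftarrow w - \alpha \frac{\partial C}{\partial w}$$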

Where:

α (alpha) is the learning rate, which controls how big the steps are in each iteration.
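The update rule can be sketched in a few lines of NumPy. This is a minimal example, assuming the MSE cost for the linear model $f(x) = w \cdot x$. The toy data, the learning rate, and the number of steps are illustrative choices, not values from the text.

```python
import numpy as np

def gradient(w, x, y):
    """dC/dw for the cost C(w) = mean((w * x - y)^2)."""
    return 2.0 * np.mean((w * x - y) * x)

# Hypothetical toy data generated by y = 3x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0        # initial guess
alpha = 0.01   # learning rate (assumed value)

for step in range(200):
    # Move against the gradient to reduce the cost
    w = w - alpha * gradient(w, x, y)

print(w)  # approaches 3.0
```

With a learning rate that is too large, the same loop can overshoot and diverge, which is why $\alpha$ usually needs tuning.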

### Expanding to Neural Networks

A neural network consists of multiple layers built from these linear models, each followed by a non-linear activation. In the simplest form of a neural network, a single layer computes:

$$f(x) = \sigma(Wx + b)$$

Where:

W is the weight matrix.

b is the bias.

σ is the activation function (e.g., ReLU, sigmoid).

We use the **chain rule** during backpropagation to compute the gradient of the loss function with respect to all weights and biases in the network.

With this explanation, we have completed our deep analysis of neural networks.
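To make the chain rule concrete, here is a minimal sketch of a single layer with a sigmoid activation and an MSE loss. The layer sizes, the data, and the function names `forward` and `backward` are assumptions for illustration. The finite-difference check at the end only confirms that the hand-derived gradient is plausible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, b, x):
    """One layer: f(x) = sigmoid(W x + b). Returns pre-activation and output."""
    z = W @ x + b
    return z, sigmoid(z)

def backward(W, b, x, y):
    """Gradients of C = mean((a - y)^2) with respect to W and b via the chain rule."""
    z, a = forward(W, b, x)
    dC_da = 2.0 * (a - y) / a.size   # derivative of the loss w.r.t. the output
    da_dz = a * (1.0 - a)            # derivative of the sigmoid
    delta = dC_da * da_dz            # dC/dz
    dC_dW = np.outer(delta, x)       # because dz/dW depends on x
    dC_db = delta                    # because dz/db = 1
    return dC_dW, dC_db

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))          # hypothetical 3-input, 2-output layer
b = np.zeros(2)
x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.0])

dW, db = backward(W, b, x, y)

# Finite-difference check on a single weight
def cost(W_):
    return np.mean((forward(W_, b, x)[1] - y) ** 2)

eps = 1e-6
W_shift = W.copy()
W_shift[0, 1] += eps
numeric = (cost(W_shift) - cost(W)) / eps
print(dW[0, 1], numeric)  # the two values should nearly agree
```

In a deeper network the same pattern repeats: each layer multiplies the incoming gradient by its own local derivative and passes the result backward.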

## Attention Mechanism in Transformers

### Self-Attention Mechanism

The attention mechanism computes a weighted sum of input values, dynamically adjusting based on the input sequence. In self-attention, we compute three vectors for each word: **Query** $Q$, **Key** $K$, and **Value** $V$, as follows:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

Where $X$ is the input matrix, and $W^Q, W^K, W^V$ are learned weight matrices.

### Scaled Dot-Product Attention

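The attention score for a query $Q$ and key $K$ is calculated as the dot product, scaled by the square root of the dimension of the key vectors. A softmax then turns the scores into weights that are applied to the values:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where $d_k$ is the dimension of the key vectors.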

### Multi-Head Attention

The Transformer uses multiple attention heads to capture different aspects of the input. Each attention head is computed independently, and the results are concatenated:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$$

Where $W^O$ is a learned weight matrix that projects the concatenated output back to the original dimension.

### Feedforward Layer

After attention, the network applies a feedforward neural network to each position in the sequence:

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

Where $W_1, W_2, b_1, b_2$ are learnable parameters of the feedforward network.

### Let's now dig deeper into the attention mechanism in the Transformer architecture

**But before we start: why do we even call it a Transformer? Yes, we are Transformers fans, but the answer is more interesting than that. Let's see why:**

The **"Transformer"** model was introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017). The name fits because the model **transforms the way information is processed in neural networks**: it shifts the paradigm from sequential models (like RNNs or LSTMs) to an attention-based mechanism.

Here’s why it's called "Transformer":

### 1. **It Transforms Input Sequences with Attention:**

The key innovation in the Transformer is the **self-attention mechanism**, which allows the model to dynamically weigh the importance of different parts of the input sequence, regardless of their distance from each other. This contrasts with previous models, which processed data sequentially and had difficulty capturing long-range dependencies efficiently.

### 2. **It Transforms the Way Models Process Data:**

Transformers **parallelize the computation** by removing the recurrence found in models like RNNs and LSTMs. This makes them faster and more efficient to train on modern hardware like GPUs. The model processes the entire sequence at once, transforming the whole approach to sequence modeling.

### 3. **It Transforms Representations through Stacking Layers:**

The architecture uses a stack of **encoder and decoder layers**, each transforming the input or intermediate representations through attention mechanisms and feed-forward neural networks, refining the representation at every step.

### 4. **It Transforms the State of the Art:**

The model’s name also reflects the significant transformation it brought to the field of NLP and beyond. Its architecture has since been applied to a wide range of tasks, including text generation (e.g., GPT models), machine translation, and image processing.

The name “Transformer” captures its central mechanism—**transforming data through attention**—and its groundbreaking impact on AI architectures.

There you have it: that is why it is called the Transformer. Now let's analyze the attention mechanism, the core of the Transformer architecture.

The **attention mechanism** is one of the core innovations that propelled the development of the **Transformer model**, which revolutionized Natural Language Processing (NLP) and many other fields. The self-attention mechanism allows the model to weigh the importance of different words in a sentence relative to each other, no matter how far apart they are in the sequence.

In this article, we'll break down the attention mechanism in detail, covering each aspect, including **self-attention**, **scaled dot-product attention**, and **multi-head attention**, with an emphasis on mathematical equations, detailed explanations, and Python code examples.

### The Building Blocks of Attention

For every input word (or token) in a sequence, the self-attention mechanism computes three vectors: the **Query (Q)**, **Key (K)**, and **Value (V)** vectors. These vectors are derived through learned weight matrices that apply a linear transformation to the input embedding.

#### Query, Key, and Value Matrices

Given an input sequence represented by an embedding matrix $X$, the Query, Key, and Value vectors are computed as follows:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

Where:

Q is the Query matrix.

K is the Key matrix.

V is the Value matrix.

X is the input sequence (each token has an embedding vector).

$W^Q, W^K, W^V$ are weight matrices learned during training.

**Note:** Each input token in the sequence is transformed into a Query vector, a Key vector, and a Value vector using these weight matrices. The Query and Key vectors determine the "attention score" for each token pair, while the Value vector contains the information to be aggregated.
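The projections above can be sketched with NumPy as plain matrix multiplications. This is only an illustration: the sizes are arbitrary, and the weight matrices are random stand-ins for parameters that a real model would learn during training.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 3   # hypothetical sizes

X = rng.normal(size=(seq_len, d_model))   # one embedding vector per token
W_Q = rng.normal(size=(d_model, d_k))     # random stand-in for a learned matrix
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # one Query vector per token
K = X @ W_K   # one Key vector per token
V = X @ W_V   # one Value vector per token

print(Q.shape, K.shape, V.shape)  # (4, 3) (4, 3) (4, 3)
```

Row `i` of `Q`, `K`, and `V` belongs to token `i`, so each token ends up with its own Query, Key, and Value vector.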

### Self-Attention Mechanism

Self-attention allows each word in the sequence to "attend" to all other words. This is accomplished by calculating a score that reflects how much focus should be placed on each other word in the sequence. The score between a pair of words is computed using the dot product of their Query and Key vectors, followed by scaling and applying softmax normalization.

#### Scaled Dot-Product Attention

Given the Query, Key, and Value matrices, the **attention** output is computed as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Breaking this down:

**Dot Product of Query and Key**: The dot product of a Query vector $q$ and a Key vector $k$ gives a raw "similarity" score between two tokens in the sequence.

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$

Here, $d_k$ is the dimension of the Key vectors. Each element in the Query vector is multiplied by the corresponding element in the Key vector, and the results are summed.

**Scaling**: The result of the dot product is divided by the square root of the dimensionality:

$$\frac{q \cdot k}{\sqrt{d_k}}$$

This scaling keeps the scores from growing too large as the dimensionality increases. Large scores would push the softmax into saturated regions, where gradients become very small during optimization.

**Softmax**: The scaled dot products are passed through a softmax function to convert them into probabilities. This ensures that, for each query, the attention weights sum to 1, allowing the model to focus on certain tokens more than others.

**Weighting the Values**: Finally, the output is obtained by multiplying these attention scores with the corresponding Value vectors V. This produces a weighted sum of the Value vectors, where the weights are determined by the attention scores.
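The effect of the scaling step is easier to see with numbers. The sketch below compares the softmax of raw and scaled dot products for one query. The dimension, the random vectors, and the number of tokens are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512                            # hypothetical key dimension
q = rng.normal(size=d_k)             # one query vector
keys = rng.normal(size=(5, d_k))     # key vectors for 5 hypothetical tokens

raw = keys @ q                       # unscaled dot products
scaled = raw / np.sqrt(d_k)          # scaled dot products

print(softmax(raw).round(3))     # typically dominated by a single token
print(softmax(scaled).round(3))  # weight is spread more evenly
```

Without scaling, the weights tend to collapse onto one token, which is the saturated regime where gradients become very small.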

### Multi-Head Attention

The self-attention mechanism described so far operates as a **single attention head**. In practice, the Transformer uses **multi-head attention**, which allows the model to focus on different parts of the sequence in parallel, capturing various relationships between tokens.

#### Multi-Head Attention Mechanism

Each attention head independently computes its own Query, Key, and Value matrices, and the outputs of all attention heads are concatenated and linearly transformed to produce the final output. This is represented as:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$$

Where:

h is the number of attention heads.

$W^O$ is a learned projection matrix that linearly transforms the concatenated output.

Each attention head is calculated as:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

The multi-head mechanism allows the model to learn multiple attention distributions simultaneously, enabling the model to attend to different aspects of the input sequence.
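A minimal NumPy sketch of multi-head self-attention might look like the following. It assumes self-attention, so the same input `X` feeds the queries, keys, and values. The function names, the sizes, and the random weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for 2-D inputs."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V are lists holding one projection matrix per head."""
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the heads along the feature axis, then project with W_O
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
seq_len, d_model, h = 4, 8, 2        # hypothetical sizes
d_k = d_model // h                   # per-head dimension

X = rng.normal(size=(seq_len, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (4, 8)
```

Production implementations usually fuse the per-head matrices into one large projection and reshape the result. The loop over heads is used here only for readability.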

### Why Multi-Head Attention?

The use of multi-head attention enables the model to jointly focus on different representations of the same input, allowing it to capture a wide variety of linguistic patterns. For example, one head may focus on short-range dependencies while another may capture long-range dependencies.

### Example Python Code for Scaled Dot-Product Attention

Below is an example Python implementation of the scaled dot-product attention mechanism using NumPy:

```python
import numpy as np


def softmax(x, axis=-1):
    """Compute a numerically stable softmax along the given axis."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.

    :param Q: Queries matrix, shape (..., seq_len_q, d_k)
    :param K: Keys matrix, shape (..., seq_len_k, d_k)
    :param V: Values matrix, shape (..., seq_len_k, d_v)
    :param mask: Optional mask broadcastable to the scores; 0 marks blocked positions
    :return: Output and attention weights
    """
    d_k = Q.shape[-1]
    # Step 1: Compute dot products between Q and K^T, scaled by sqrt(d_k).
    # swapaxes transposes only the last two axes, so batched inputs also work.
    scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)
    # Step 2: Apply mask (optional). NumPy arrays have no masked_fill method,
    # so np.where sets blocked positions to a large negative value instead.
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    # Step 3: Apply softmax to get attention weights
    attention_weights = softmax(scores, axis=-1)
    # Step 4: Multiply by values V to get the output
    output = np.matmul(attention_weights, V)
    return output, attention_weights
```

This code computes the attention scores, applies softmax to turn them into probabilities, and then computes the weighted sum of the Value vectors.

### Feedforward Layer

After the multi-head attention mechanism, a **feedforward neural network (FFNN)** is applied independently to each position in the sequence. The feedforward network consists of two linear transformations with a ReLU activation in between:

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

Where $W_1, W_2, b_1, b_2$ are learned parameters.

This layer processes each token's representation individually and adds non-linearity to the model, allowing it to capture complex patterns in the data.
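The position-wise behavior can be sketched as follows. The sizes, the random weights, and the function name `feed_forward` are assumptions for illustration.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row of x."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU activation
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32    # hypothetical sizes

x = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model))
b2 = np.zeros(d_model)

out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)  # (4, 8)

# Position-wise: processing one token on its own gives the same row
print(np.allclose(out[2], feed_forward(x[2], W1, b1, W2, b2)))  # True
```

The last line shows that no information flows between positions in this layer. Mixing across tokens happens only in the attention step.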

### Conclusion

The **attention mechanism** is a powerful tool that allows neural networks, particularly Transformers, to focus on the most important parts of the input sequence dynamically. By computing attention scores between every token in a sequence, it enables the model to capture relationships that are essential for tasks like translation, summarization, and more.

The introduction of multi-head attention further enhances this by allowing the model to attend to different parts of the sequence in parallel. This combination of parallel attention and feedforward layers gives Transformers their immense power and flexibility, making them the foundation of many state-of-the-art models like BERT and GPT.
