Category: Artificial Intelligence

  • We haven’t yet seen what it can do

    Right now, we’re all trapped in a bizarre digital loop: you’re using ChatGPT to write an email that your boss is just going to ask ChatGPT to summarize. It’s a massive productivity gain… for the servers.

    But here’s the kicker: we’re currently living through a subsidized “inference party.” When the venture capital chips run out and the real bill hits the table, are you really going to drop $200 a month just to generate mediocre cat memes for your Instagram? I doubt it.

    It is a common misconception that electricity “happened” overnight. In reality, there was a massive 40-to-50-year gap between the invention of the lightbulb and the point where it actually changed how the world worked. For a long time, its practical effect was just serving as an expensive way to replace candles in wealthy households. It’s not very different from AI today: just an expensive toy that most of humankind doesn’t yet give a damn about.

    Inference costs will keep falling, right? For a while, maybe. But mind you, there are physical laws limiting how much computing power you can squeeze out of a square meter of silicon or a gigawatt-hour. Transistors won’t shrink forever; in fact, they are already very close to the theoretical limit, beyond which quantum tunneling causes electron leakage and makes them impossible to turn off. Breakthroughs in algorithms are unpredictable, and we’re going to need a lot of them for LLM inference to become a sustainable business.

    Besides all that, to reap real productivity gains, the people staring at Excel spreadsheets all day doing real work for real businesses need to figure out how to truly automate large chunks of their daily tasks. We aren’t going to achieve that just by giving them a model subscription and telling them to copy-paste frantically between different applications; the AI needs to be baked into their workflows. Instead of throwing data into a textbox and then manually moving it back into their systems, those systems need to have a button that does it all—or better yet, no button at all, where the work just gets done.

    I’d relate AI adoption to the dot-com boom of the late 90s. It took a lot of time and effort from highly competent people to change how the world works and gets things done. For a long time, the internet for “real” business was just mail on steroids—a faster way to run the same old processes. Some places still run like that just fine, by the way; but most had to adapt to the new ways, and many were driven out of business. Amazon’s operations are nowhere near what a bookstore used to be 25 years ago, and 25 years from now, the daily routines of most professions will be unrecognizable in ways no one can currently predict.

    Eventually, some folks will find novel ways to do novel things, and the first ones will reap great rewards and a huge competitive edge. Not all risk-tolerant people will succeed, but almost all risk-averse ones will fail.

    Keep your mind open and stay safe.

  • Transformer architecture applied to supply chain optimization

    In this post, I’ll implement a simple Deep Learning model to predict optimal task distribution across a hypothetical (and intentionally simplified) supply chain scenario. We’ll implement the model, load the data, and then train and evaluate it.

    The scenario: Imagine we have a warehouse with n stations capable of processing products. Each station is at a distinct distance d from each product’s storage location and has a speed s for processing each product. We need to process orders, each one consisting of a list of products. The orders come pre-prioritized by the time they need to be delivered.
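
    To make this concrete, here is one way the raw warehouse data could look. This is purely illustrative (the field names and numbers are mine, not part of the model below, which only consumes product indexes):

    # Hypothetical warehouse data; names and values are made up for illustration.
    stations = [
        {"id": 0, "distance_to_storage": 12.0, "speed": 1.5},  # d in meters, s in products per minute
        {"id": 1, "distance_to_storage": 4.0,  "speed": 0.8},
    ]
    orders = [
        {"priority": 1, "products": [0, 1, 2, 3]},  # most urgent order first
        {"priority": 2, "products": [4, 3, 2, 0]},
    ]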

    Motivation: We could encode the rules for this task using classical computing techniques (conditions and loops), and not have to prepare and load a lot of data, but that approach wouldn’t give us the flexibility to deal with different warehouses, or with the changing conditions that a very dynamic environment presents. Using Machine Learning, we are freed from devising detailed rules that can change at any moment, and we can easily feed the model with real-time data, allowing it to learn new patterns during operation.

    Available options: Since orders can have an arbitrary number of products, we need to create a model capable of dealing with inputs of any size. To achieve this, we could use Recurrent Neural Networks, or Transformers.

    • Recurrent Neural Networks (RNNs): RNNs process data sequentially, one element at a time. They maintain a “hidden state” that acts as a memory of previous inputs. However, they struggle with “vanishing gradients,” making it difficult for them to remember information from the beginning of a long sequence by the time they reach the end.
    • Transformers: Unlike RNNs, Transformers use a Self-Attention mechanism to process the entire input sequence simultaneously (parallelization). This allows the model to weigh the importance of every product in an order relative to every other product and station, regardless of their position in the list.

    Why Transformers are the Ideal Solution

    In our warehouse scenario, the “sequence” isn’t necessarily chronological—it’s a complex relational set. Transformers are superior here for three specific reasons:

    1. Global Context: A Transformer can simultaneously compare a product’s storage distance (d) and a station’s speed (s) across the entire order. It doesn’t “forget” the first product by the time it analyzes the tenth.
    2. Order Prioritization: Since our orders are pre-prioritized by delivery time, Transformers can use Positional Encodings to maintain this hierarchy while still calculating the most efficient physical path for the picker.
    3. Scalability: Because they don’t process data step-by-step, Transformers are significantly faster to train and more adept at handling massive “rush hour” spikes where the number of products (n) per order varies wildly.

    The model:

    The model’s input will be a batch of orders, each consisting of a list of products. Each product is just an integer: its index in the product catalog:

    my_input = torch.tensor([[0, 1, 2, 3], [4, 3, 2, 0]])

    The output is a list of the station indexes each product should be assigned to (obtained by picking the highest-scoring station for each product, which is what the predict helper defined below does):

    predict(model, my_input)
    > [[3, 4, 4, 3], [3, 0, 4, 3]]
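
    Orders won’t all have the same length in practice. Here is a minimal sketch (with made-up product indexes) of padding them to a common length before stacking them into a single tensor, assuming 0 is reserved as the padding index, as the model below does by default:

    import torch
    from torch.nn.utils.rnn import pad_sequence

    # Two orders of different lengths; 0 is the padding index here.
    orders = [torch.tensor([1, 2, 3]), torch.tensor([4, 3, 2, 1, 5])]
    padded = pad_sequence(orders, batch_first=True, padding_value=0)
    print(padded)
    # tensor([[1, 2, 3, 0, 0],
    #         [4, 3, 2, 1, 5]])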

    To achieve that, we first need to define the layers of the model:

    First we define the embedding layer. It is like a giant sticker book: every item in your supply chain gets its own special sticker (a long list of numbers) that describes what it is.

    self.embedding = nn.Embedding(vocab_size, self.d_model, padding_idx=pad_idx)

    Next, we define a positional encoder. It is like a calendar: it tells the brain when things happened. A “truck” arriving on Monday is different from a “truck” arriving on Friday. (The 5000 is the maximum number of items we can handle in a single order.)

    self.pos_encoder = nn.Parameter(torch.randn(1, 5000, self.d_model))
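
    A quick shape check of these two pieces (a standalone sketch; the vocab_size of 10 is just a placeholder): the embedding turns each product index into a 256-number vector, and we add the matching slice of the positional table on top:

    import torch
    import torch.nn as nn

    d_model = 256
    embedding = nn.Embedding(10, d_model, padding_idx=0)   # 10 = placeholder vocab_size
    pos_encoder = torch.randn(1, 5000, d_model)            # one position vector per slot

    x = torch.tensor([[0, 1, 2, 3]])                       # one order with 4 products
    x_emb = embedding(x) + pos_encoder[:, :x.size(1), :]   # broadcasting adds positions to every order
    print(x_emb.shape)  # torch.Size([1, 4, 256])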

    Then we define a transformer encoder. Think of it as a super-powered Reading Club for robots.

    encoder_layer = nn.TransformerEncoderLayer(d_model=self.d_model, nhead=8, batch_first=True)
    self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

    Imagine you give a group of friends a long sentence to read. In the old days, robots had to read one word at a time, like walking down a dark hallway with a tiny flashlight. They would often forget the beginning of the sentence by the time they got to the end! The Transformer Encoder is different. It looks at the whole page.

    The Encoder has many “heads” (usually 8, like an octopus).

    • One head might focus on who is doing the action.
    • Another head might focus on where they are.
    • Another head looks for colors or sizes.

    By looking at the sentence 8 different ways at once, it understands the story much better than a human could!

    We have num_layers=4. This means the sentence goes through 4 different clubs in a row.

    • Club 1 figures out the easy stuff (like which words are nouns).
    • Club 2, 3, and 4 figure out the complicated stuff (like the “vibe” of the sentence or the secret meaning behind the words).
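
    In code terms, here is a small standalone sketch with the same hyperparameters as the model below: each of the 8 heads attends over a 256 / 8 = 32-dimensional slice of every position, and stacking 4 layers just runs the sequence through that process four times. The shape never changes:

    import torch
    import torch.nn as nn

    d_model = 256
    encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
    transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

    x_emb = torch.randn(1, 4, d_model)  # an embedded order from the previous step
    x_enc = transformer_encoder(x_emb)  # each position has now "seen" every other position
    print(x_enc.shape)  # torch.Size([1, 4, 256])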

    Finally, we have a linear layer that gives our output the expected form: for each product, a list of scores, one per station, indicating how likely each station is to be the ideal one. Turned into probabilities, it looks something like this:

    > [0.25, 0.5, 0.125, 0.124, 0.001]

    To find the best station for a product to be processed by, we just take the index of the highest probability (1, in this case).
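
    That last step in code (the raw scores here are hypothetical, chosen so the softmax roughly reproduces the probabilities above):

    import torch

    logits = torch.tensor([1.2, 1.9, 0.5, 0.49, -4.3])  # raw scores for 5 stations
    probs = torch.softmax(logits, dim=-1)                # roughly [0.25, 0.50, 0.12, 0.12, 0.00]
    best_station = probs.argmax().item()
    print(best_station)  # 1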

    The complete model:

    import torch
    import torch.nn as nn

    class SupplyChainModel(nn.Module):
        def __init__(self, vocab_size, pad_idx=0):
            super().__init__()
            self.pad_idx = pad_idx
            self.d_model = 256
            
            self.embedding = nn.Embedding(vocab_size, self.d_model, padding_idx=pad_idx)
            self.pos_encoder = nn.Parameter(torch.randn(1, 5000, self.d_model))
            
            encoder_layer = nn.TransformerEncoderLayer(d_model=self.d_model, nhead=8, batch_first=True)
            self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
            self.out = nn.Linear(self.d_model, vocab_size)
    
        def forward(self, x):
            # Create mask: True where x is the padding index
            mask = (x == self.pad_idx) 
            
            x_emb = self.embedding(x) + self.pos_encoder[:, :x.size(1), :]
            
            # Pass the mask to the transformer
            x_enc = self.transformer_encoder(x_emb, src_key_padding_mask=mask)
            
            return self.out(x_enc)
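
    As a quick sanity check (vocab_size=10 here is just a placeholder for the number of products/stations), we can instantiate the model and confirm the output shape:

    model = SupplyChainModel(vocab_size=10)
    my_input = torch.tensor([[0, 1, 2, 3], [4, 3, 2, 0]])
    logits = model(my_input)
    print(logits.shape)  # torch.Size([2, 4, 10]) -> one score per station, for each product in each order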

    With all that, we can make a simple function to run the inference:

    def predict(model, input_list):
        # 1. Put model in evaluation mode (turns off Dropout, etc.)
        model.eval()
        
        with torch.no_grad():
            # 2. Forward pass (get raw scores)
            logits = model(input_list) # Shape: [Batch, Seq_Len, Vocab_Size]
            
            # 3. Get the most likely station index for each position
            predictions = torch.argmax(logits, dim=-1) # Shape: [Batch, Seq_Len]
            
            # 4. Return as a simple Python list
            return predictions.tolist()
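
    Calling it on the untrained model from the sanity check above already returns a nested list of station indexes, one per product, although the values are arbitrary until the model is trained:

    print(predict(model, my_input))  # nested list of station indexes; arbitrary before training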

    And a training loop. To run it we also need an optimizer, a loss function, and some training data.
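
    Here is a minimal setup sketch. The dataset is a hypothetical one I’m making up for illustration: it simply pairs the example orders with the target station assignments shown earlier; the vocab_size, batch size, learning rate, and epoch count are placeholder values:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    vocab_size = 10  # placeholder: number of distinct products/stations
    epochs = 50

    # Toy training pairs: each order (product indexes) mapped to target station indexes.
    train_orders  = torch.tensor([[0, 1, 2, 3], [4, 3, 2, 0]])
    train_targets = torch.tensor([[3, 4, 4, 3], [3, 0, 4, 3]])
    loader = DataLoader(TensorDataset(train_orders, train_targets), batch_size=2, shuffle=True)

    model = SupplyChainModel(vocab_size)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    With that in place, the loop itself: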

    for epoch in range(epochs):
        total_loss = 0
        model.train()
    
        for sequence, result in loader:
            optimizer.zero_grad()
            
            # Forward pass
            logits = model(sequence)
            
            # Flatten for CrossEntropy: (Batch * Seq, Vocab) vs (Batch * Seq)
            loss = criterion(logits.view(-1, vocab_size), result.view(-1))
            
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
    
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1} | Loss: {total_loss/len(loader):.4f}")

    Running everything together, we have this:

    Input:  [[0, 1, 2, 3], [4, 3, 2, 0]]
    Result: [[3, 4, 4, 3], [3, 0, 4, 3]]
    Epoch 10 | Loss: 0.0075
    Epoch 20 | Loss: 0.0020
    Epoch 30 | Loss: 0.0007
    Epoch 40 | Loss: 0.0004
    Epoch 50 | Loss: 0.0004
    
    Training Complete!
    Result after training: [[4, 3, 2, 1], [1, 2, 3, 1]]

    Conclusion: why use such a specific model?

    While Large Language Models (LLMs) are amazing at writing poems or code, they are often overkill for specific industrial tasks. This custom Transformer architecture is the better tool for the job: an LLM has billions of parameters, which costs a fortune to run, while our model is compact (a d_model of 256 and only 4 layers). It is also streamlined: it can process thousands of shipping data points in milliseconds, orders of magnitude faster. It’s the difference between waiting for a long email reply and getting an instant text message. Finally, we’re using a model trained only for this task, not to translate French or make cat images, which makes the accuracy skyrocket.