Floats and Precision in LLM Training

A while ago, I was SFT-ing a Qwen 0.6B model on the UltraIntellect dataset to understand how the training loop behaves. While running training locally, I ran into a CUDA out-of-memory (OOM) error (yup, it happened, lol), which turned out to be caused by using the wrong floating-point format and consuming more memory than expected. That experience pushed me to understand what floats really are, why they matter for LLM training, and how different formats change both memory usage and stability.
This blog gives a simple conceptual overview of floats, how computers represent them, and why precision choices matter during training.
Why floats matter in LLM training
Deep learning involves many operations that require representing numbers across a wide dynamic range. In the context of LLMs, tasks such as:
- calculating gradients
- updating weights
- normalization
- computing probabilities
all rely on continuous values rather than discrete ones. Using integers alone would make it difficult to represent small values like 0.0004, which appear frequently during optimization. Floats allow models to represent both very small and very large numbers efficiently, enabling stable training and smooth updates during gradient descent.
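As a toy illustration (my own addition, not from the post), here is how a tiny update like 0.0004 × 2.5 survives as a float but vanishes under integer arithmetic:

```python
# Toy example: a small learning-rate * gradient step like 0.0004 * 2.5
# disappears entirely under integer arithmetic but survives as a float.
lr, grad, weight = 0.0004, 2.5, 1.0

weight_int = int(weight) - int(lr * grad)   # int(0.001) == 0, so nothing changes
weight_fp = weight - lr * grad              # ~0.999, the small step is kept

print(weight_int, weight_fp)                # 1 0.999 (approximately)
```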
What are floats
Floats are a data type used to represent real numbers with both precision and range. Computers follow a standardized representation that breaks a number into three main components:
- Sign
- Exponent
- Mantissa (fraction)
A floating-point value is typically expressed as:
value = (-1)^sign × 1.mantissa × 2^(exponent - bias)
This structure lets computers store numbers in a form similar to scientific notation, allowing flexible scaling while maintaining useful precision.
Example of a float being represented
Let us take a simple number like 6.5 and see how it becomes a floating-point value.
First, convert it to binary:
6.5 = 110.1 (binary)
Now express it in normalized scientific notation:
110.1 = 1.101 × 2^2
From here we fill the float components:
- Sign = 0 (positive number)
- Exponent = 2 + bias
- Mantissa = 101000… (fractional part after the leading 1)
For FP32 (I’ll cover this later in the blog), the exponent bias is 127:
Stored exponent = 2 + 127 = 129
Final bit layout:
0 | 10000001 | 10100000000000000000000
Computers do not store the number directly; they store its sign, scaling factor, and precision separately.
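To sanity-check this layout, here is a small Python sketch (my addition, standard library only) that prints the sign, exponent, and mantissa bits of an FP32 value:

```python
import struct

def fp32_bits(x: float) -> str:
    # Reinterpret the 32-bit float's bytes as an unsigned integer, then slice the fields
    [n] = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{n:032b}"
    sign, exponent, mantissa = bits[0], bits[1:9], bits[9:]
    return f"{sign} | {exponent} | {mantissa}"

print(fp32_bits(6.5))  # 0 | 10000001 | 10100000000000000000000
```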
Different types of floats
There are several formats used in deep learning, with FP32, FP16, and BF16 being the most common. All of them follow the same general floating-point representation, but they differ in how many bits are allocated to range and precision.
| Format | Total bits | Sign | Exponent | Mantissa (fraction) | Bytes |
|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | 4 |
| FP16 | 16 | 1 | 5 | 10 | 2 |
| BF16 | 16 | 1 | 8 | 7 | 2 |
- FP32 (1/8/23) gives strong precision and wide range, but uses the most memory.
- FP16 (1/5/10) cuts memory, but its smaller exponent range can underflow/overflow more easily.
- BF16 (1/8/7) keeps FP32-like exponent range with lower precision, which is why it is widely used for stable LLM mixed-precision compute.
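You can inspect these trade-offs directly with `torch.finfo`. The snippet below is a quick sketch, assuming PyTorch is installed, comparing range and precision and showing the FP16 underflow behaviour mentioned above:

```python
import torch

# Largest value, smallest normal value, and machine epsilon for each format
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  tiny={info.tiny:.3e}  eps={info.eps:.3e}")

# FP16's 5-bit exponent underflows for very small values; BF16's 8-bit exponent does not
print(torch.tensor(1e-8, dtype=torch.float16))   # tensor(0., dtype=torch.float16)
print(torch.tensor(1e-8, dtype=torch.bfloat16))  # roughly 1e-8, still representable
```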
Simple memory example comparing FP32 vs BF16 (and what AMP does)
Consider a simple model with 1B parameters.
If we store parameters in FP32:
1B parameters × 4 bytes per parameter ≈ 4 GB
But during training we also store gradients and optimizer states:
- Parameters → 4 GB
- Gradients → 4 GB
- Adam m state → 4 GB (Optimizer State 1)
- Adam v state → 4 GB (Optimizer State 2)
Total persistent memory is approximately 16 GB, even before activations are counted.
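Written out as a quick sketch (taking 1 GB as 10^9 bytes for simplicity):

```python
# Persistent training memory for a 1B-parameter model, everything in FP32
num_params = 1_000_000_000
fp32_bytes = 4

components = {
    "parameters": num_params * fp32_bytes,
    "gradients": num_params * fp32_bytes,
    "adam_m": num_params * fp32_bytes,
    "adam_v": num_params * fp32_bytes,
}
for name, size in components.items():
    print(f"{name:12s} {size / 1e9:.1f} GB")
print(f"{'total':12s} {sum(components.values()) / 1e9:.1f} GB")  # ~16 GB before activations
```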
Now the important correction for real training stacks:
In default PyTorch AMP / Hugging Face Trainer mixed precision (bf16=True with autocast), parameters are typically still stored in FP32, and optimizer states are also FP32. So persistent memory is still roughly:
- Parameters (FP32) → 4 GB
- Gradients (FP32) → 4 GB
- Adam states (FP32) → 8 GB
So persistent memory stays near 16 GB. The main memory win comes from activations and many compute tensors being BF16 during the forward and backward passes.
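Here is a minimal sketch of that default mixed-precision pattern, assuming a CUDA GPU and using a tiny stand-in model instead of an actual LLM; the model, data, and loss below are placeholders, not from the post:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()         # parameters stay FP32
optimizer = torch.optim.AdamW(model.parameters())  # Adam m/v states are FP32 too

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

# Forward/backward compute runs in BF16 under autocast; storage stays FP32
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
    loss = torch.nn.functional.mse_loss(out.float(), target)

loss.backward()        # gradients match the parameter dtype (FP32)
optimizer.step()
optimizer.zero_grad()

print(model.weight.dtype, out.dtype)  # torch.float32 torch.bfloat16
```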
If we were to explicitly store trainable weights in BF16 (not the default AMP path), then the memory can look more like:
- Parameters (BF16) → 2 GB
- Gradients (BF16) → 2 GB
- FP32 optimizer/master copies for stability → often still significant (~8 GB or more, depending on optimizer/setup)
This is why practical memory savings depend on implementation details, not just the dtype name.
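For contrast, here is a hedged sketch of loading weights explicitly in BF16 with Hugging Face Transformers; the exact checkpoint name below is my assumption, not something from the post:

```python
import torch
from transformers import AutoModelForCausalLM

# Parameters are stored as BF16 on load (~2 bytes each), unlike the default AMP path
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",            # assumed checkpoint name; swap in your own
    torch_dtype=torch.bfloat16,
)
print(next(model.parameters()).dtype)  # torch.bfloat16
```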
Core intuition still holds: BF16 is used for compute efficiency, while FP32 is kept where stability matters (optimizer states and, in some setups, master weights).
Final thoughts
I have been enjoying writing and reading more often these days. This blog is one I had planned for a while, and it's finally here. One platform I've been using a lot is X (Twitter); I honestly love it, most of my learning comes from there, and I try to post a lot every day.
I can't help thinking about how everything is going to look in the next few years. The things I see and do every day are things I never expected a year ago; to be fair, I'm not even writing code by hand anymore. I'm happy and scared at the same time, but only time will tell how it all plays out. It's going to be a wild ride for sure. Sorry about the small hiatus from posting blogs; I love doing this and will keep doing it in the future.