Color Spaces
For the longest time we didn’t have to pay a lot of attention to the way we talk about color. Modern display technologies capable of showing more vivid shades have, for better or for worse, changed the rules of the game. Once-esoteric ideas like a gamut or a color space are becoming increasingly important.
A color can be described in many different ways. We could use words, list amounts of CMYK printer inks, enumerate quite flawed HSL and HSV values, or even quantify the responses of cells in a human retina. Those notions are useful in some contexts, but I’m not going to focus on any of them.
This article is dedicated entirely to RGB values from RGB color spaces. It may seem fairly restrictive, but considering the domination of displays as the medium for color presentation, it is a pragmatic approach and it ultimately won’t prevent us from describing everything we can see.
A dry definition of a color space is not a good way to kick things off. Instead, we’ll start by playing with one of the most common tools used to specify colors.
Color Pickers
You’ve probably seen an RGB color picker before. It usually looks like this:
By dragging all the sliders to the right you can create a white color. Lots of red and green with little blue produces shades of yellow. That color picker is not the only way to specify colors. Try using the one below:
The gist of the behavior of that new picker is the same – the red slider controls the red component, the green slider affects the green component, and the blue slider acts on the blue component. Both sets of sliders have the minimum value of 0 and the maximum value of 255. However, the shades and intensities they create are quite different. We can compare some of the colors for the same slider positions side by side. In each plate the top half shows the color from the first picker and the bottom half contains the color from the second picker:
The only color that looks the same is pure black. None of the other colors match despite the same values of red, green, and blue. You may be wondering which halves of the plates are wrong, but the truth is that they’re both correct. They just come from different RGB color spaces.
I’ll soon explain what causes those differences and we’ll eventually see what defines a color space, but for now the important lesson is that on their own the numeric values of the red, green, and blue components have no meaning. A color space is what assigns the meaning to those numeric values.
What I’ve shown you above is an extreme example in which almost everything about the two color spaces is different. Over the course of the next few paragraphs we will be looking at simpler examples which will help us get a feel for what different aspects of color spaces mean. Before we discuss those aspects we should make a quick detour to revisit how the components of an RGB color are described.
Normalized Range
When dealing with RGB colors you’ll often see them specified in a 0 to 255 range:
Or in an equivalent hexadecimal form:
While using the range of 0 to 255 is convenient, especially for specifying colors on the web, it ties the description of color to a specific 8-bit depth. What we actually want to express is the percentage of the maximum intensity a red, green, or blue component can represent.
In our discourse I’ll use a so-called normalized range where the minimum value is still 0.0, but the maximum value is 1.0. Calculating normalized values from the 0 to 255 range is easy – just divide the source numbers by 255.0:
This form may feel less familiar, but it lets us talk about the values without having to care what kind of discrete encoding scheme they’ll eventually use.
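As a minimal sketch of that normalization, the snippet below converts 8-bit components and a hexadecimal string into the 0.0 to 1.0 range. The function names `normalize_8bit` and `parse_hex` are made up for illustration.

```python
def normalize_8bit(r, g, b):
    """Map 0-255 component values to the normalized 0.0-1.0 range."""
    return (r / 255.0, g / 255.0, b / 255.0)


def parse_hex(color):
    """Turn a '#RRGGBB' string into normalized components."""
    digits = color.lstrip("#")
    # Each pair of hexadecimal digits is one 8-bit component.
    components = [int(digits[i:i + 2], 16) for i in (0, 2, 4)]
    return normalize_8bit(*components)


print(normalize_8bit(255, 128, 0))
print(parse_hex("#FF8000"))
```

Both calls describe the same color, which shows that the decimal and hexadecimal forms are only two spellings of the same 8-bit values.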
Intensity Mismatch
Let’s continue playing with the color pickers by upgrading them to show two colors at the same time. In the top half we’ll see the color obtained from the RGB values interpreted by one color space, while the bottom half shows the color obtained from the same RGB values, but interpreted by another color space:
While playing with the sliders you may have noticed something peculiar – if the sliders are at their minimum or maximum the colors look the same, otherwise they don’t. Here’s a comparison of some of the colors for different slider values:
A color space can specify how the numeric values of the red, green, and blue components map to intensity of the corresponding light source. In other words, the position of a slider may not be equal to intensity of the light the slider controls. The color space from the bottom half uses a simple linear encoding:
The light shining at 64% of its maximum intensity will be encoded as the number 0.64. The top color space uses an encoding with a fixed 2.0 exponent:
The light shining at 64% of its maximum intensity will be encoded as 0.8.
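The two encodings can be sketched in a few lines. This assumes that "an encoding with a fixed 2.0 exponent" means the intensity is the encoded value raised to the power of 2.0, which agrees with the 0.64 and 0.8 figures above. The names `encode` and `decode` are illustrative.

```python
def encode(intensity, exponent=2.0):
    """Turn a linear light intensity into an encoded value."""
    return intensity ** (1.0 / exponent)


def decode(encoded, exponent=2.0):
    """Turn an encoded value back into a linear light intensity."""
    return encoded ** exponent


# With an exponent of 1.0 the encoding is linear: 0.64 stays 0.64.
print(encode(0.64, exponent=1.0))
# With an exponent of 2.0 the same light is stored as roughly 0.8.
print(encode(0.64))
```

Both encodings agree at 0.0 and 1.0, which is why the picker halves only match when the sliders sit at their minimum or maximum.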
This may all seem like a pointless transformation, but there is a good reason for doing all this nonlinear mapping. The human eye is not a simple detector of the power of the incoming light – its response is nonlinear. A twofold increase in the number of photons emitted per second will not be perceived as a light that is twice as bright.
If we were to encode the colors using floating point numbers the need for a nonlinear encoding function would be diminished. However, the numeric values of color are often encoded using the familiar 8 bits per component, e.g. in the most common configurations of JPEG and PNG files. Using a nonlinear tone response curve, or TRC for short, lets us maintain more or less perceptual uniformity and use the chunky, quantized range to keep the detail in the darker parts.
Here are the first 14 representable values in 8 bit encoding of linearly increasing amount of light. You can probably tell that the brightness difference we perceive between the first two panes is much larger than between the last two panes:
First 14 representable linear shades of gray in 8 bit encoding
The extremely common sRGB color space employs a nonlinear TRC to make better use of human visual perception. Here are the first 14 representable values in an 8 bit encoding of output sRGB values. You may need to ramp up your display brightness to see them, or maybe even make the background black:
First 14 representable sRGB shades of gray in an 8 bit encoding
Notice the hollow circles under all but the first and the last colors. None of those 12 shades of gray would be representable in an 8 bit format if we were to use straightforward linear encoding. You’d actually need a range of 0 to 4095 (12 bits per component) to represent the same very dark shades of sRGB gray without using any tone response curves.
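The claim about bit depth can be checked with a rough sketch. It assumes the published sRGB decoding constants (12.92, 0.04045, 0.055, 1.055 and 2.4) and simple rounding to the nearest code. The function name `srgb_to_linear` is illustrative.

```python
def srgb_to_linear(encoded):
    """Decode a normalized sRGB value into linear light intensity."""
    if encoded <= 0.04045:
        return encoded / 12.92
    return ((encoded + 0.055) / 1.055) ** 2.4


# Linear intensities of the 14 darkest 8 bit sRGB codes (0 through 13).
intensities = [srgb_to_linear(code / 255.0) for code in range(14)]

# Quantize those intensities with a plain linear encoding at two bit depths.
as_8bit = [round(v * 255) for v in intensities]
as_12bit = [round(v * 4095) for v in intensities]

print(as_8bit)   # the shades collapse into very few distinct codes
print(as_12bit)  # every shade keeps a code of its own
```

With 8 bits the linear encoding has only two codes available for all 14 shades, while the 12 bit version keeps them apart.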
The situation at the other end of the spectrum is more difficult to visualize since the images on this website use sRGB color space, but we can at least show that the white shades in linear color space shown in the upper part of the picture get darker more slowly than corresponding white shades in sRGB at the bottom:
Last 14 representable linear (top) and sRGB (bottom) shades of gray in an 8 bit encoding
Linear encoding has the same precision of light intensity everywhere. Once we account for our nonlinear perception, this means that the perceived brightness jumps very quickly between steps at the dark end and changes very slowly between steps at the bright end.
Different color spaces use different TRCs. Some, like DCI P3, use a simple power function with a fixed exponent. Others, like ProPhoto, employ a piecewise combination of a linear segment with a power segment. The magnitude of the exponent differs between color spaces – there is no unanimous opinion on what constitutes the best encoding. Finally, some color spaces don’t use any special encoding and just represent the component values in linear fashion.
While TRCs are very useful for encoding the intensities for storage, the crucial thing to remember is that mathematical operations on light only make sense when done on linear values, because they represent the actual intensities of light. It’s actually fairly easy to visualize in a gradual mix between a pure red and a pure green:
Notice how in the top halves, which use nonlinear encoding, the colors get darker in the middle, while in the bottom, linear halves the progression looks more balanced.
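A small sketch of that difference follows. It assumes the nonlinear halves use the fixed 2.0 exponent encoding from earlier in this section, which may not be what the actual illustration uses. The names `mix_encoded` and `mix_linear` are illustrative.

```python
EXPONENT = 2.0  # assumed encoding exponent, taken from the earlier example


def decode(encoded):
    return encoded ** EXPONENT


def encode(intensity):
    return intensity ** (1.0 / EXPONENT)


def lerp(a, b, t):
    return a + (b - a) * t


def mix_encoded(c0, c1, t):
    """Blend the stored numbers directly (the darker-looking result)."""
    return tuple(lerp(a, b, t) for a, b in zip(c0, c1))


def mix_linear(c0, c1, t):
    """Decode to light intensities, blend, then encode again."""
    return tuple(encode(lerp(decode(a), decode(b), t)) for a, b in zip(c0, c1))


red, green = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
print(mix_encoded(red, green, 0.5))
print(mix_linear(red, green, 0.5))
```

Halfway through the gradient the direct blend stores 0.5 for red and green, which decodes to only a quarter of the maximum light. The linear blend keeps half of the light from each source and stores a larger number for it.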
What we’ve discussed in this section is the first component of a typical RGB display color space. We can now say that one of the defining properties of a color space is:
The encoding is an essential, but nonetheless, technical part of a color space. The specification of the actual colors is what we usually care about the most.
Primary Differences
Let’s look at a different example of two color spaces in a color picker. To make things simpler, both halves operate with linear intensities:
Notice that grayscale colors now match perfectly. However, the colors in the bottom half are clearly more subdued for the same numeric value:
I’ll symbolize the red, green, and blue values from the top half with a small bar on top of the letters: R̅G̅B̅. For the values from the bottom half I’ll use a bar at the bottom: R̲G̲B̲.
You may be wondering if, despite a rather different behavior, it is possible to make the colors match. In other words, knowing the R̲G̲B̲ values we’d like to express the same color using R̅G̅B̅ values.
The notion may seem contrived, but the underlying ideas are important for understanding how the same color can be described by many different triplets of values. To see the scenario in action we can fix the bottom half of the picker and let the sliders control only the top color:
After some trial and error you may get pretty close, or perhaps even achieve the exact match. Unless you’re quite skilled, the task probably wasn’t trivial and trying to do the same manual work for every possible color would be a daunting endeavor.
A consistent reproduction of color is crucial for realizing any design – we wouldn’t want a carefully selected shade of pastel beige to look like a ripe orange when seen on some other device. We’re in a dire need of a better approach for trying to make the colors look the same despite a different meaning of the values of the RGB components.
Primaries Matched
Reproducing arbitrary colors between different color spaces is difficult, but we can try to simplify things a little by just trying to match the pure red, green, and blue colors and seeing where it gets us. Let’s start by matching the red from the bottom plate:
Making the border between the two halves disappear may take some tweaking, but eventually we can agree on values that fit perfectly. It’s a step in the right direction – since we’re dealing with linear values we can now express how much a single unit of R̲ will contribute to R̅G̅B̅ components. For instance, to recreate the R̲G̲B̲ color of (0.6, 0.0, 0.0) we’d need to use:
0.6 × 0.100 = 0.060 units of G̅
0.6 × 0.024 = 0.014 units of B̅
We can put these results into equations. When a color from the R̲G̲B̲ space has no green and no blue, but some amount of red, we could use the following formulas to see the same color in the top plate:
G̅ = 0.100×R̲
B̅ = 0.024×R̲
We can repeat the experiment for pure G̲:
Having decided on the perfect match, we can write down the impact of G̲ on R̅G̅B̅ components:
G̅ = 0.690×G̲
B̅ = 0.000×G̲
All that’s left is to repeat the experiments for pure B̲. If you’re still not tired of dragging the sliders you can do it yourself, or just let me do it:
Once again, the results can be written down as:
G̅ = 0.210×B̲
B̅ = 0.976×B̲
The provided equations apply only when the R̲G̲B̲ half shows just some amount of red, or just some amount of green, or just some amount of blue. Being limited to having nonzero values in only one component won’t get us far. To create a more diverse set of colors we need to have a way of finding out how mixes of basic components in R̲G̲B̲ affect values in R̅G̅B̅.
Mixing Things Up
Thankfully, Grassmann’s laws come to our rescue. Those empirical rules tell us that if two colored lights match, then after addition of another set of matching light sources their overlap, or their sum, will also match:
The sum of two matching sets of colored lights also matches
As a result we can do the three matchings for R̲, G̲, and B̲ separately, then just combine the results into one set. For instance, to obtain the final R̅ value from R̲G̲B̲ all we need to do is to sum the contribution of R̲ to R̅, the contribution of G̲ to R̅, and the contribution of B̲ to R̅. Repeating the trick for the other two components yields the following set of equations:
G̅ = 0.100×R̲ + 0.690×G̲ + 0.210×B̲
B̅ = 0.024×R̲ + 0.000×G̲ + 0.976×B̲
What we just did was a matching of primary red, green, and blue colors from one color space to another. With the equations in hand we can finally recreate the math behind the matching beige from the first example. By plugging in the original R̲G̲B̲ values of (0.9, 0.5, 0.2) we can calculate the value of the exact match:
G̅ = 0.100×0.9 + 0.690×0.5 + 0.210×0.2 = 0.477
B̅ = 0.024×0.9 + 0.000×0.5 + 0.976×0.2 = 0.217
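The beige calculation can be reproduced with a short sketch. Only the green and blue rows of coefficients appear in the text here, so the snippet computes just those two outputs. The names `G_ROW`, `B_ROW` and `weighted_sum` are illustrative.

```python
# Coefficient rows quoted in the text. The row producing the red output
# is not reproduced in this section, so it is deliberately left out.
G_ROW = (0.100, 0.690, 0.210)
B_ROW = (0.024, 0.000, 0.976)


def weighted_sum(row, color):
    """Sum the contributions of each source component (Grassmann's laws)."""
    return sum(weight * value for weight, value in zip(row, color))


beige = (0.9, 0.5, 0.2)  # values from the bottom half of the picker
g_top = weighted_sum(G_ROW, beige)
b_top = weighted_sum(B_ROW, beige)
print(round(g_top, 3), round(b_top, 3))  # roughly 0.477 and 0.217
```

Each output is a dot product of one coefficient row with the source color, which is the operation the next section presents as a matrix.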
Seeing the Matrix
If you look closely at the three equations governing the conversion from R̲G̲B̲ to R̅G̅B̅ you may notice the coefficients form a 3×3 grid:
0.100  0.690  0.210
0.024  0.000  0.976
It’s actually a 3×3 matrix that defines the conversion from R̲G̲B̲ to R̅G̅B̅. If you’re familiar with matrices and vectors, you might have already realized that the transformation we did was a plain old matrix times vector multiplication.
On its own the matrix isn’t particularly interesting, but the transformation it does can be visualized in 3D. In the following interactive diagram the outer cube defines the R̅G̅B̅ space, while the inner, skewed cube (parallelepiped) defines the R̲G̲B̲ space. You can drag the cubes around to see them from different angles and control the R̲G̲B̲ color indicator using the sliders:
As an experiment I encourage you to set red and green components to 0 and just play with the blue slider. You should be able to see that movement along the pure blue color in R̲G̲B̲ space requires some shift in red and green in R̅G̅B̅.
Notice that the R̲G̲B̲ parallelepiped fits completely inside the R̅G̅B̅ cube, which means that one can recreate every single R̲G̲B̲ color using R̅G̅B̅ space. The inverse, however, is not true.
There and Back Again
In the example below the background of the top half is set to a pure R̅ at maximum intensity. You may try really hard, but there is no combination of values in R̲G̲B̲ that can cause the seam between the two halves to disappear:
If we ignore the issue for a minute and treat the problem algebraically we can solve the system of 3 equations looking for pure R̅ at the output:
0.0 = 0.100×R̲ + 0.690×G̲ + 0.210×B̲
0.0 = 0.024×R̲ + 0.000×G̲ + 0.976×B̲
The details of the evaluation are boring, so let’s just present the solution as is:
R̲ = 1.470
G̲ = −0.201
B̲ = −0.036
All three values are outside of 0.0 to 1.0 range – a clear indication of the unreachability of that color in R̲G̲B̲ space!
Values larger than 1.0 have a very straightforward explanation: 1.0 of R̅ is beyond the range of intensity of R̲. You could say that R̲ is not strong enough. Some less intense values of R̅ can be reached with R̲ within its limit. For example, 0.5 of R̅ requires 0.5 × 1.470 = 0.735 of R̲.
Negative values are slightly trickier because they are imaginary, but we can still reason about them quite easily.
Negative Values
Let’s try to think about an experiment in which we’re tasked with matching a light on the right side with a combination of a red, green, and blue light on the left side:
Mixing lights to obtain the color on the right
After some tuning we may realize that even with no green and no blue light the colors still don’t match and the left patch of light continues to look a bit orangey indicating that it has too much green in it:
The left patch is still too green
We can’t decrease the green light on the left side anymore, so we have to become more creative. We could say that the left side is too green, or we could say that the right side is not green enough! Since all we care about is making the colors match we could try adding some of the green light to the other side:
Adding green light to the target side
Now the right side is too green, but if we adjust the added amount we can eventually make the colors look the same:
A match with just the right amount of green
Let’s try to write down what we’ve achieved in an equation. We have a maximum amount of red and no green or blue on the left side, while on the right side we have the target color with some green light added to it:
This feels a little mischievous – we wanted to match the pure target color without any additions. Let’s try to rearrange the equation by moving the added green light from the right to the left side:
That equivalent equation tells us that getting the exact match with the starting target color requires using a negative amount of the green light on the left side. Naturally, this is something we can’t do in a physical world. Mathematically, however, a match of the very saturated red from the right side really requires removing some green light from the left side.
If you’re still not convinced consider what would happen if we somehow were able to create −0.2×G on the left side. I can’t show you a picture of that imaginary situation, but I can show you what would happen if we then added 0.2×G to both sides:
This is exactly the same image as before since both scenarios are equivalent. Recall from Grassmann’s laws that adding the same amount of light to both sides maintains the color match, so indeed the imaginary −0.2×G on the left side must have made it look like the original unmodified saturated red from the right side.
This may all be somewhat unsatisfying. However, when you step away from the physical restrictions and think about lights as just numbers, it all, quite literally, adds up.
The Matrix Seeing
If we go back to our quest for obtaining R̲G̲B̲ values from R̅G̅B̅ colors we can repeat the “system of equations” trick for the other two components and combine the results using Grassmann’s laws to end up with the full set of equations:
G̲ = −0.201×R̅ + 1.513×G̅ − 0.311×B̅
B̲ = −0.036×R̅ + 0.011×G̅ + 1.025×B̅
Once again, the set of coefficients can be presented in a 3×3 matrix:
−0.201  +1.513  −0.311
−0.036  +0.011  +1.025
If you’ve solved systems of linear equations before, you may have realized that this is just an inverse matrix of the original R̲G̲B̲ to R̅G̅B̅ transformation. This is a very useful property. When establishing the conversion from one linear color space to another we just need to figure out the coefficients of the matrix in one direction – the matrix in the other direction is just its inverse.
One more time we can visualize the transformation the matrix performs. This time R̲G̲B̲ is the perfect cube, while R̅G̅B̅ is the outer skewed cube:
You can easily see how much bigger the R̅G̅B̅ space is and indeed it lets us express some colors that R̲G̲B̲ can’t.
Breaking the Boundaries
So far whenever a numeric value in a color space was outside of 0.0 to 1.0 range we’d just clip it to the representable limits. Let’s try to see what happens when we remove that restriction. We can modify the sliders to allow −1.0 to 2.0 range and try to match the pure R̅ again:
With that approach we can actually represent the pure R̅ using R̲G̲B̲ color space. Since the numbers don’t care, the entire thing works and is often called unbounded or extended range.
To actually see R̅G̅B̅ colors represented using extended range R̲G̲B̲ colors the display has to be capable of showing R̅G̅B̅ colors in the first place, otherwise they would get physically clamped by the display itself. To put it differently, the transformation of values from a color space with extended range to the native color space of the display should end up with the values in a standard range.
While unbounded values are very flexible, there are two important caveats one should consider when dealing with them. First of all, since the range of values is unlimited, it is possible to create colors that truly have no physical meaning, e.g. (−1.0, −1.0, −1.0).
Additionally, storing the values in a type with a limited range may not be possible. If we decide to use 8 bits per component and encode the 0.0 to 1.0 range into 0 to 255 range then we won’t be able to represent the values outside of normalized range since they simply won’t fit.
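A minimal sketch of that loss, using the extended range values found earlier for the pure red from the top half. The helper names `clamp` and `to_8bit` are illustrative.

```python
def clamp(value, low=0.0, high=1.0):
    """Clip a component to the representable normalized range."""
    return max(low, min(high, value))


def to_8bit(value):
    """Encode a normalized component as an integer code from 0 to 255."""
    return round(clamp(value) * 255)


extended = (1.470, -0.201, -0.036)
stored = [to_8bit(v) for v in extended]
print(stored)  # [255, 0, 0]
```

After the round trip through 8 bits the color is indistinguishable from a plain (1.0, 0.0, 0.0), so the information that made the extended value special is gone.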
One Space to Rule Them All
All the color transformations we performed are somewhat contrived – the two color spaces we analyzed are defined in relation to each other and have no immutable attachment to the physical world. Their reds, greens, and blues are whatever we ended up seeing on the screen. We have no way of knowing how the colors would look on another device.
If we had a master color space that was derived from physical quantities we could solve the problem by finding a transformation from any color space to that common connection space. Luckily, in 1931 that very color space was specified by the CIE, resulting in what is known as the CIE 1931 XYZ color space. Before we discuss how the space was defined we need to take a quick look at physics and the human visual system.
And There Was Light
To ground the discussion of color perception in the real world we have to establish it in terms of light – the part of electromagnetic radiation that is visible to the human eye. The visible spectrum can only be approximated on a typical display, but its colors at specific wavelengths look roughly like this:
Visible spectrum with wavelengths shown in nanometers
Colors corresponding to a single wavelength are called spectral colors. For example, a light with wavelength of 570 nm would produce a pure spectral yellow. Notice that boundaries between the colors are soft, so there is no single “spectral red”, but a wavelength is all we need to describe a spectral color accurately.
Human eyes are not simple wavelength detectors and the perception of the same color can be created in many different ways. A yellow color can be created using light with wavelength around 570 nm, or as a mixture of a green and a red light.
Since a human retina has three different types of photoreceptive cones, any color sensation can be matched using three fixed colors with varying intensities. It doesn’t mean that all colors can be recreated using just additive mixes of three colors. As we’ve discussed, sometimes one or two of those colors have negative weights and have to be added to the target color instead.
Algebraically, however, we don’t need more than three primary colors to represent everything humans can perceive (with some very rare exceptions). This property of the human visual system is called trichromacy and is exploited by various display technologies to present a very broad spectrum of colors using just three RGB primaries.
The Color Matching Experiments
At the beginning of the 20th century David Wright and John Guild independently conducted experiments with human observers in which the test subjects tried to match the spectral colors in 10 nm increments with a combination of red, green, and blue light sources. The experiments were fairly similar to what we’ve been doing with the sliders, trying to match the target color as a combination of some other three colors.
The results of the experiments were standardized as rgb color matching functions that can be presented on a graph:
CIE rgb color matching functions
Notice the presence of negative values. Not all spectral colors could be achieved by a combination of selected RGB values and yet again a negative value means that the color was added to the target value.
The CIE standardized the reference R of the experiments as monochromatic light with wavelength of 700.0 nm, the reference G as 546.1 nm, and the reference B as 435.8 nm, with specific ratio of relative intensities between them. This is a critical step in our discussion of color – we’re finally grounded to some physically measurable properties.
Let’s look at the yellow wavelength of 570 nm. Reading from the color matching plot we can tell that the color match required 0.16768 units of R, 0.17087 units of G and −0.00135 units of B. A set of three RGB coordinates just begs for a three dimensional presentation. If we read the color matching value for every wavelength we obtain the following plot:
You may wonder why the spectrum curve doesn’t go through the pure red, green, or blue endpoints of the CIE RGB cube, but it’s simply the result of normalization and scaling applied to the color matching functions by the committee. Within reasonable limits a light source can be constructed to be as powerful as needed, so what constitutes a “1.0” is always defined somewhat arbitrarily.
XYZ Space
CIE RGB color space is rooted to concrete physical properties of monochromatic lights and we could use it to define any color sensation. However, the committee strived to create a derived space that would have a few useful properties, two of which are worth noting:
 No negative coordinates of spectral colors
 Separation of chromaticity (hue and colorfulness) from luminance (visual perception of brightness)
After some deliberation the following set of equations was created:
Y = 1.000×R + 4.591×G + 0.061×B
Z = 0.000×R + 0.057×G + 5.594×B
The factors may seem arbitrary, and in some sense they are, but this is simply yet another mapping of values from one space to another, similar to what we did with the R̲G̲B̲ to R̅G̅B̅ transformation. The Y component was chosen as the luminance component, while X and Z define the chromaticity.
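To see the mapping at work we can push the 570 nm color matching values quoted earlier through it. The Y and Z rows below come from the equations above. The X row does not appear in this section, so the sketch fills it with the commonly published CIE 1931 coefficients, which should be treated as an assumption. The name `cie_rgb_to_xyz` is illustrative.

```python
# Row for X: assumed, commonly published CIE 1931 coefficients.
X_ROW = (2.769, 1.752, 1.130)
# Rows for Y and Z: taken from the equations in the text.
Y_ROW = (1.000, 4.591, 0.061)
Z_ROW = (0.000, 0.057, 5.594)


def cie_rgb_to_xyz(rgb):
    """Apply the three linear equations to a CIE RGB triplet."""
    return tuple(
        sum(weight * value for weight, value in zip(row, rgb))
        for row in (X_ROW, Y_ROW, Z_ROW)
    )


yellow_570nm = (0.16768, 0.17087, -0.00135)
xyz = cie_rgb_to_xyz(yellow_570nm)
print(xyz)
```

The negative blue amount from the color matching experiment disappears: all three resulting coordinates are positive, which is one of the properties the committee was after.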
We can visualize the CIE XYZ space inside the CIE RGB space:
You can see how the XYZ space is designed around the spectrum curve to make sure it fits in the positive octant. Its smaller size is not really a concern, the scaling makes things more convenient to work with by making sure the perceptually brightest wavelength of 555 nm has a Y value of 1. As a final step we can just show the XYZ space and the spectral colors on their own:
Since the CIE XYZ space is derived from CIE RGB it is also grounded in measurable physical quantities. The XYZ color space is the base color space used for all conversions of matrix-based RGB display color spaces. The mapping between any two color spaces is done through a common CIE XYZ intermediate. This simplifies the color transformations since we don’t need to know how to convert colors between any two arbitrary color spaces, we just need to be able to convert both of them to the XYZ space.
xy Chromaticity Diagram
Three dimensional diagrams are fun to play with, but they’re often not particularly practical – a 2D plot is often easier to work with and reason about. Since the Y component of the XYZ space is devoid of any colorfulness and hue, we’re left with only two components that affect the chromaticity of a color. We can perform three operations intended to reduce the dimensionality of the space:
x = X / (X + Y + Z)
y = Y / (X + Y + Z)
z = Z / (X + Y + Z)
The distinction between a lower and an upper case is important here – X is not the same as x. These seemingly arbitrary equations have a simple visual explanation – it’s a projection onto a triangle spanned between XYZ coordinates of (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), and (0.0, 0.0, 1.0):
Notice that x, y, and z add up to 1, so we can drop the last component since it’s redundant – we can always recreate it by subtracting x and y from 1. Rejection of z is equivalent to a flat projection onto xy plane:
If we repeat that step for every combination of spectral colors we can finally present the 2D plot known as CIE xy chromaticity diagram in its full glory. You may have seen this horseshoe shape before:
CIE xy chromaticity diagram
No RGB display technology is capable of presenting all the colors that humans can see, so many of the ones shown in the picture are actually the result of clamping to the representable range of the sRGB color space.
The colors on the inside of the plot are just some combinations of the spectral colors and the colors outside of the plot don’t exist. Notice the straight diagonal line connecting the end points of the red and blue area. While every point on the outline of the horseshoe shape has a corresponding spectral wavelength, the colors on that line of purples do not – there is no wavelength of light that looks like magenta. The purples are simply how the human brain interprets the mixes of red and blue light and ultimately they are no different than any other shade. Perception of every color happens in our heads.
It’s important to mention that the xy chromaticity diagram is not perceptually uniform. In some areas of the plot one has to move relatively far away from a chosen color to notice the difference in chromaticity, while in some other areas the distance to change is much smaller. Over time CIE developed more uniform chromaticity diagrams, but since the xy diagram is easily obtained from the XYZ space it continues to be used in discussion of RGB color spaces.
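The projection described in this section can be sketched as a single function. The XYZ triplet used in the example is an approximate set of values for the 570 nm spectral yellow and should be treated as an assumption made for illustration. The name `xyz_to_xy` is illustrative.

```python
def xyz_to_xy(x_big, y_big, z_big):
    """Project an XYZ color onto the xy chromaticity plane."""
    total = x_big + y_big + z_big
    if total == 0.0:
        # Black has no chromaticity; the projection is undefined.
        raise ValueError("chromaticity of (0, 0, 0) is undefined")
    return (x_big / total, y_big / total)


# Approximate XYZ of the 570 nm spectral yellow (assumed values).
x, y = xyz_to_xy(0.7621, 0.9520, 0.0021)
z = 1.0 - x - y  # the dropped component can always be recovered
print(x, y, z)
```

Scaling all three XYZ components by the same factor leaves x and y unchanged, which is how the diagram separates chromaticity from luminance.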
Gamut
The chromaticity diagram is useful in visualizing the gamut of a color space – the extent of colors that a color space can represent. It’s necessary to note that a gamut is a three dimensional construct, so a 2D projection onto an image, somewhat paradoxically, does not present a full picture. It’s nonetheless a useful tool employed in comparison of color spaces.
We can finally present the chromaticities of primaries of both R̲G̲B̲ and R̅G̅B̅ color spaces in a single graph:
Comparison of primaries
I’ll discuss the meaning of the little cross in the middle soon, but the important fact is that a triangle depicts all representable chromaticities of a color space. Notice how the R̲G̲B̲ triangle is smaller than the R̅G̅B̅ triangle, showing us yet again that the latter can represent more colors.
This diagram summarizes a long journey we took to define the second component that defines every RGB color space:
In the simulation below you can drag the slider to see how the extent of the chromaticity triangle corresponds to the representable colors. The base values of an image from sRGB color space are converted to a space with a reduced gamut, clamped to 0.0 to 1.0 range, then finally converted back to sRGB for display. You can click or tap the image to change it:
In some cases the color clamping is pretty severe, but for the image with snowy mountains the change is minimal. An almost black and white picture has little chromaticity and therefore a reduced gamut has almost no effect on it.
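Checking whether a chromaticity is representable boils down to a point-in-triangle test. The sketch below uses the sRGB primaries listed near the end of this article. The test points (the D65 white point and an approximate chromaticity of the 570 nm spectral yellow) and the function names are assumptions made for illustration.

```python
def cross(origin, a, b):
    """Z component of the cross product of (a - origin) and (b - origin)."""
    return ((a[0] - origin[0]) * (b[1] - origin[1])
            - (a[1] - origin[1]) * (b[0] - origin[0]))


def inside_gamut(point, red, green, blue):
    """True when an xy point lies inside or on the primaries' triangle."""
    signs = (cross(red, green, point),
             cross(green, blue, point),
             cross(blue, red, point))
    has_negative = any(s < 0 for s in signs)
    has_positive = any(s > 0 for s in signs)
    # The point is inside when it is on the same side of all three edges.
    return not (has_negative and has_positive)


SRGB = ((0.64, 0.33), (0.30, 0.60), (0.15, 0.06))
print(inside_gamut((0.3127, 0.3290), *SRGB))  # white point
print(inside_gamut((0.4441, 0.5547), *SRGB))  # approx. 570 nm yellow
```

Keep in mind that this only tests chromaticity. Since a gamut is three dimensional, a color inside the triangle may still be too bright to be represented.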
White Point
The last aspect of color spaces we will discuss is related to the little cross in the middle of the R̲G̲B̲ and R̅G̅B̅ gamut visualization. Similarly to how different color spaces can assign different colors to their pure red defined as (1.0, 0.0, 0.0), the white point of a color space defines its color of white – the represented color when all three components are ones: (1.0, 1.0, 1.0).
In the color picker below both halves have the same RGB primaries, but different white points. See what happens when you drag all the sliders to the right:
The color space from the bottom half defines its white as a slightly different color than the top color space.
A 3D visualization shows two interesting properties. Firstly, the axes of the inner and the outer cubes are collinear, they’re just scaled differently. Secondly, the “far” endpoints of the cubes no longer overlap since their white points are different:
The white point is the last piece of the color space puzzle. We can now say that a color space is also defined by:
With xy coordinates of the red, green, and blue primaries, and the xy coordinates of the white point one can evaluate the RGB to XYZ transformation for a given color space. The details of this computation are not critical to our discussion and you can read about them in many places online, e.g. on Bruce Lindbloom’s website.
While necessary for correct calculation of RGB to XYZ transformation, in practice it may be difficult to notice that two color spaces have different white points. Most color conversion operations will undergo chromatic adaptation and the color of (1.0, 1.0, 1.0) in the source color space will be mapped to (1.0, 1.0, 1.0) in the destination color space.
sRGB Color Space
We’ll finish our discussion with a showcase of the sRGB color space, described by its authors as “A Standard Default Color Space for the Internet”. If you ever were specifying just “RGB” colors it’s extremely likely that the components were assumed to come from the sRGB color space. In fact, it’s the color space used in the very first color picker in this article.
While the sRGB specification isn’t free, Wikipedia and International Color Consortium provide all the information we need to describe it.
Tone Response Curve
Let’s look at the plot of the tone response curve of the sRGB color space. The light intensity value is on the vertical axis and it’s symbolized with a sun icon. The encoded value is on the horizontal axis and it’s symbolized with a binary code:
Tone response curve of sRGB
While the curve may look like a single power function, in the range of 0 to 0.04045 it actually is a linear segment:
When encoded value is larger than 0.04045 the intensity value is indeed defined by a slightly nudged power function with exponent of 2.4:
The transition between the two parts is continuous and I marked its location on the plot with a little bump.
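The two pieces of the curve described above can be written as a single piecewise function. This is a sketch based on the commonly published sRGB definition: the symbols $E$ (encoded value) and $I$ (light intensity) are names chosen here, and the constants 12.92, 0.055, and 1.055 come from that definition rather than from the text above.

```latex
I(E) =
\begin{cases}
  \dfrac{E}{12.92} & 0 \le E \le 0.04045 \\[1ex]
  \left( \dfrac{E + 0.055}{1.055} \right)^{2.4} & 0.04045 < E \le 1
\end{cases}
```

Both branches evaluate to roughly 0.00313 at $E = 0.04045$, which is why the two parts meet without a visible jump.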
The curve was designed to roughly correspond to the response curve of CRT displays thus to some extent minimizing the need for any color management. That convenience had the unfortunate consequence of creating a widespread confusion by binding the two separate ideas into one.
In the general case the purpose of tone response curves is to make the best use of available space when storing color information in formats with limited bit depth. The nonlinear response of an electron gun in a CRT display is a separate and only loosely related concept.
Chromaticity and White Point Coordinates
Chromaticity coordinates of red, green, blue, and white are just 4 pairs of x and y values and we can put them in a table:
     Red     Green   Blue    White
x    0.6400  0.3000  0.1500  0.3127
y    0.3300  0.6000  0.0600  0.3290
You’ll probably agree that this form is not particularly exciting, things look much nicer on the xy chromaticity diagram:
Chromaticity and white point coordinates of sRGB
On a technical note it’s worth mentioning that the primaries share the same location as Rec. 709 – the standard for HDTV. Additionally, the white point location corresponds to Standard Illuminant D65 – a representation of the average daylight.
While sRGB isn’t capable of representing some of the more vivid shades and clearly doesn’t contain all the colors humans can see, it has served its purpose as a standard color space of 8-bit graphics remarkably well.
Further Reading
The web is filled with interesting gems related to color. For a very good dissection of many of the major components on a pathway between hex colors and what we end up perceiving with our eyes I suggest reading Jamie Wong’s “Color: From Hexcodes to Eyeballs”. I especially like his very approachable explanation of spectral distributions.
Bruce MacEvoy’s set of pages on color vision is an amazing resource about various aspects of color. I highly recommend the first post in the series in which the author describes the fascinating details of biophysics of the human visual system.
Elle Stone authored an extensive collection of great articles on color spaces, calibration, and image editing. For a different take on what we’ve touched upon in this article you should see “Completely Painless Programmer’s Guide to XYZ, RGB, ICC, xyY, and TRCs”.
If you think you’d enjoy a deep dive into the optimization process of creating a minimal sRGB ICC profile, Clinton Ingram has got you covered in his four part series. His pragmatic approach to the problem provides a very entertaining read.
Finally, as a resource in a more traditional form, I recommend the book “Color Imaging: Fundamentals and Applications”. It covers a vast selection of topics related to color and its reproduction, all in a very readable form.
Final Words
Many of the concepts we’ve discussed were developed decades ago, but despite the age even the old ideas continue to be incredibly useful and the science behind them is expanded to this day.
Color is one of those areas with seemingly infinite depth of complexity. I hopefully showed you that some of that complexity isn’t actually as scary as it looks, you just need to shine a light on it.
Exposing Floating Point
Despite everyday use, floating point numbers are often understood in a handwavy manner and their behavior raises many eyebrows. Over the course of this article I’d like to show that things aren’t actually that complicated.
This blog post is a companion to my recently launched website – float.exposed. Other than exploiting the absurdity of present day list of top level domains, it’s intended to be a handy tool for inspecting floating point numbers. While I encourage you to play with it, the purpose of many of its elements may be exotic at first. By the time we’ve finished, however, all of them will hopefully become familiar.
On a technical note, by floating point I’m referring to the ubiquitous IEEE 754 binary floating point format. Types half, float, and double are understood to be binary16, binary32, and binary64 respectively. There were other formats back in the day, but whatever device you’re reading this on is pretty much guaranteed to use IEEE 754.
With the formalities out of the way, let’s start at the shallow end of the pool.
Writing Numbers
We’ll begin with the very basics of writing numeric values. The initial steps may seem trivial, but starting from the first principles will help us build a working model of floating point numbers.
Decimal Numbers
Consider the number 327.849. Digits to the left of the decimal point represent increasing powers of ten, while digits to the right of the decimal point represent decreasing powers of ten:
Even though this notation is very natural, it has a few disadvantages:
 small numbers like 0.000000000653 require skimming over many zeros before they start “showing” actually useful digits
 it’s hard to estimate the magnitude of large numbers like 7298345251 at a glance
 at some point the distant digits of a number become increasingly less significant and could often be dropped, yet for big numbers we don’t save any space by replacing them with zeros, e.g. 7298000000
By “small” and “big” numbers I’m referring to their magnitude so −4205 is understood to be bigger than 0.03 even though it’s to the left of it on the real number line.
Scientific notation solves all these problems. It shifts the decimal point to right after the first nonzero digit and sets the exponent accordingly:
Scientific notation has three major components: the sign (+), the significand (3.27849), and the exponent (2). For positive values the “+” sign is often omitted, but we’ll keep it around for the sake of verbosity. Note that the “10” simply shows that we’re dealing with the base-10 system. The aforementioned disadvantages disappear:
 the 0-heavy small number is presented as 6.53×10^{−10} with all the pesky zeros removed
 just by looking at the first digit and the exponent of 7.298345251×10^{9} we know that number is roughly 7 billion
 we can drop the unwanted distant digits from the tail to get 7.298×10^{9}
Continuing with the protagonist of this section, if we’re only interested in 4 most significant digits we can round the number using one of the many rounding rules:
The number of digits shown describes the precision we’re dealing with. A number with 8 digits of precision could be printed as:
Binary Numbers
With the familiar base-10 out of the way, let’s look at binary numbers. The rules of the game are exactly the same; it’s just that the base is 2 and not 10. Digits to the left of the binary point represent increasing powers of two, while digits to the right of the binary point represent decreasing powers of two:
When ambiguous I’ll use _{2} to mean the number is in base-2. As such, 1000_{2} is not a thousand, but 2^{3}, i.e. eight. To get the decimal value of the discussed 1001.0101_{2} we simply sum up the powers of two that have the bit set: 8 + 1 + 0.25 + 0.0625, ending up with the value of 9.3125.
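The summation of powers of two can be sketched in C. The function name binary_to_decimal and its error convention are made up for this illustration; it walks a string of binary digits with an optional binary point and adds up the weight of every set bit.

```c
#include <stddef.h>

/* Evaluates a string of binary digits with an optional binary point,
   e.g. "1001.0101", by summing the powers of two whose bit is set.
   Returns -1.0 on any character other than '0', '1', or a single '.'. */
double binary_to_decimal(const char *digits)
{
    double value = 0.0;
    double weight = 0.5;   /* weight of the next fractional digit: 2^-1 */
    int seen_point = 0;

    for (size_t i = 0; digits[i] != '\0'; i++) {
        char c = digits[i];
        if (c == '.' && !seen_point) {
            seen_point = 1;
        } else if (c == '0' || c == '1') {
            if (!seen_point) {
                value = value * 2.0 + (c - '0');   /* shift left, add digit */
            } else {
                value += (c - '0') * weight;       /* add 2^-k if bit is set */
                weight /= 2.0;
            }
        } else {
            return -1.0;
        }
    }
    return value;
}
```

Calling binary_to_decimal("1001.0101") gives 9.3125, matching the sum above.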
Binary numbers can use scientific notation as well. Since we’re shifting the binary point by three places, the exponent ends up having the value of 3:
Similarly to scientific notation in base-10, we also moved the binary point to right after the first nonzero digit of the original representation. However, since the only nonzero digit in the base-2 system is 1, every nonzero binary number in scientific notation starts with a 1.
We can round the number to a shorter form:
Or show that we’re more accurate by storing 11 binary digits:
If you’ve grasped everything that we’ve discussed so far then congratulations – you understand how floating point numbers work.
Floating Point Numbers
Floating point numbers are just numbers in base-2 scientific notation with the following two restrictions:
 limited number of digits in the significand
 limited range of the exponent – it can’t be greater than some maximum limit and also can’t be less than some minimum limit
That’s (almost) all there is to them.
Different floating point types have different numbers of significand digits and different allowed exponent ranges. For example, a float has 24 binary digits (i.e. bits) of significand and the exponent range of [−126, +127], where “[” and “]” denote inclusivity of the range (e.g. +127 is valid, but +128 is not). Here’s a number with a decimal value of −616134.5625 that can fit in a float:
Unfortunately, the number of bits of significand in a float is limited, so some real values may not be perfectly representable in the floating point form. A decimal number 0.2 has the following base-2 representation:
The overline (technically known as vinculum) indicates a forever repeating value. The 25^{th} and later significant digits of the perfect base-2 scientific representation of that number won’t fit in a float and have to be accounted for by rounding the remaining bits. The full significand:
Will be rounded to:
After multiplication by the exponent the resulting number has a different decimal value than the perfect 0.2:
If we tried rounding the full significand down:
The resulting number would be equal to:
No matter what we do, the limited number of bits in the significand prevents us from getting the correct result. This explains why some decimal numbers don’t have their exact floating point representation.
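A small C sketch can make the rounding visible. The helper name float_rounding_error is hypothetical, and it treats the double literal 0.2 as a stand-in for the real value, which is itself an approximation but one far closer to 0.2 than any float.

```c
/* Returns how far the float nearest to x lies from x itself.
   A result of 0.0 means x is stored exactly in a float. */
double float_rounding_error(double x)
{
    float stored = (float)x;      /* rounds to the nearest float by default */
    return (double)stored - x;    /* widening back to double is exact */
}
```

For 0.2 the result is a small positive number because the nearest float lies slightly above 0.2, while for a value such as 0.5 the result is exactly 0.0.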
Similarly, since the value of the exponent is limited, many huge and many tiny numbers won’t fit in a float: neither 2^{200} nor 2^{−300} can be represented since they don’t fall into the allowed exponent range of [−126, +127].
Encoding
Knowing the number of bits in the significand and the allowed range of the exponent we can start encoding floating point numbers into their binary representation. We’ll use the number −2343.53125 which has the following representation in base-2 scientific notation:
The Sign
The sign is easy – we just need 1 bit to express whether the number is positive or negative. IEEE 754 uses the value of 0 for the former and 1 for the latter. Since the discussed number is negative we’ll use one:
The Significand
For the significand of a float we need 24 bits. However, per what we’ve already discussed, the first digit of the significand in base-2 is always 1, so the format cleverly skips it to save a bit. We just have to remember it’s there when doing calculations. We copy the remaining 23 digits verbatim while filling in the missing bits at the end with 0s:
The leading “1” we skipped is often referred to as an “implicit bit”.
The Exponent
Since the exponent range of [−126, +127] allows 254 possible values, we’ll need 8 bits to store it. To avoid special handling of negative exponent values we’ll add a fixed bias to make sure no encoded exponent is negative.
To obtain a biased exponent we’ll use the bias value of 127. While 126 would work for regular range of exponents, using 127 will let us reserve a biased value of 0 for special purposes. Biasing is just a matter of shifting all values to the right:
The bias in a float
For the discussed number we have to shift its exponent of 11 by 127 to get 138, or 10001010_{2} and that’s what we will encode as the exponent:
Putting it All Together
To conform with the standard we’ll put the sign bit first, then the exponent bits, and finally, the significand bits. While seemingly arbitrary, the order is part of the standard’s ingenuity. By sticking all the pieces together a float is born:
The entire encoding occupies 32 bits. To verify we did things correctly we can fire up LLDB and let the hacky type punning do its work:
(lldb) p -2343.53125f
(float) $0 = -2343.53125
(lldb) p/t *(uint32_t *)&$0
(uint32_t) $1 = 0b11000101000100100111100010000000
While neither C nor C++ standards technically require a float or a double to be represented using IEEE 754 format, the rest of this article will sensibly assume so.
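The same check can be done in plain C by splitting the 32 bits into the three fields. This is a minimal sketch assuming float is IEEE 754 binary32; the struct and function names are invented for the example, and memcpy is used in place of the pointer cast to stay within defined behavior.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of an IEEE 754 binary32 value. */
typedef struct {
    uint32_t sign;             /* 1 bit */
    uint32_t biased_exponent;  /* 8 bits, bias of 127 already added */
    uint32_t fraction;         /* 23 stored significand bits, implicit bit excluded */
} float_fields;

float_fields decompose_float(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);  /* copy the raw encoding */

    float_fields fields;
    fields.sign = bits >> 31;
    fields.biased_exponent = (bits >> 23) & 0xFFu;
    fields.fraction = bits & 0x7FFFFFu;
    return fields;
}
```

For −2343.53125 this yields a sign of 1, a biased exponent of 138, and a fraction of 0x127880, which are the three groups of bits in the LLDB output.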
The same procedure of encoding a number in base-2 scientific notation can be repeated for almost any number; however, some of them require special handling.
Special Values
The float exponent range allows 254 different values and with a bias of 127 we’re left with two yet unused biased exponent values: 0 and 255. Both are employed for very useful purposes.
A Map of Floats
A dry description doesn’t really paint a picture, so let’s present all the special values visually. In the following plot every dot represents a unique positive float:
All the special values
Notice the necessary truncation of a large part of exponents and of a gigantic part of significand values.
We’ve already discussed all the unmarked dots — the normal floats. It’s time to dive into the remaining values.
Zero
A float number with biased exponent value of 0 and all zeros in significand is interpreted as positive or negative 0. The arbitrary value of sign (shown as _) decides which 0 we’re dealing with:
Yes, the floating point standard specifies both +0.0 and −0.0. This concept is actually useful because it tells us from which “direction” the 0 was approached as a result of storing a value too small to be represented in a float. For instance, the result of -10e-30f / 10e30f won’t fit in a float; however, it will produce the value of -0.0.
When working with zeros note that -0.0 == 0.0 is true even though the two zeros have different encodings. Additionally, -0.0 + 0.0 is equal to 0.0, so by default the compiler can’t optimize a + 0.0 into just a; however, you can set flags to relax the strict conformance.
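The behavior of the two zeros can be probed with a short C sketch. It assumes IEEE 754 floats with the default rounding mode and no fast-math style flags; the helper names are hypothetical, and the operands in the usage note are a tiny negative value divided by a huge positive one.

```c
#include <stdint.h>
#include <string.h>

/* Returns the raw sign bit (0 or 1) of a float. The bit stays visible
   even for zeros that compare equal to each other. */
unsigned float_sign_bit(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits >> 31;
}

/* Plain float division wrapped in a function for the example. */
float divide(float numerator, float denominator)
{
    return numerator / denominator;
}
```

Here divide(-10e-30f, 10e30f) compares equal to 0.0f, yet float_sign_bit reports 1 for it, so the result is −0.0. Similarly, float_sign_bit(-0.0f + 0.0f) reports 0, which is the reason a + 0.0 can’t be replaced with a.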
Infinity
A float number with maximum biased exponent value and all zeros in significand is interpreted as positive or negative infinity depending on the value of the sign bit:
Infinity arises as a result of rounding a value that’s too large to fit in the type (assuming default rounding mode). In case of a float, any number in base-2 scientific notation with exponent greater than 127 will become infinity. You can also use the INFINITY macro directly.
The positive and negative zeros become useful again since dividing a positive value by +0.0 will produce a positive infinity, while dividing it by −0.0 will produce a negative infinity.
Operations involving finite numbers and infinities are actually well defined and follow common sense property of keeping infinities infinite:
 any finite value added to or subtracted from ±infinity ends up as ±infinity
 any finite positive value multiplied by ±infinity ends up as ±infinity, while any finite negative value multiplied by ±infinity flips its sign to ∓infinity
 division by a finite nonzero value works similarly to multiplication (think of division as multiplication by an inverse)
 square root of a +infinity is +infinity
 any finite value divided by ±infinity will become ±0.0 depending on the signs of the operands
In other words, infinities are so big that any shifting or scaling won’t affect their infinite magnitude, only their sign may flip. However, some operations throw a wrench into that simple rule.
NaNs
A float number with maximum biased exponent value and nonzero significand is interpreted as NaN – Not a Number:
The easiest way to obtain NaN directly is by using the NAN macro. In practice though, NaN arises in the following set of operations:
 ±0.0 multiplied by ±infinity
 −infinity added to +infinity
 ±0.0 divided by ±0.0
 ±infinity divided by ±infinity
 square root of a negative number (−0.0 is fine though!)
If the floating point variable is uninitialized, it’s also somewhat likely to contain NaNs. By default the result of any operation involving NaNs will result in a NaN as well. That’s one of the reasons why the compiler can’t optimize seemingly simple cases like a + (b - b) into just a. If b is NaN the result of the entire operation has to be NaN too.
NaNs are not equal to anything, even to themselves. If you were to look at your compiler’s implementation of the isnan function you’d see something like return x != x;.
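The NaN rules above can be sketched in C. The function names are made up for this example, and the sketch assumes IEEE 754 semantics with no fast-math style optimizations enabled, since those flags are allowed to break exactly these properties.

```c
#include <math.h>   /* INFINITY and NAN macros */

/* NaN is the only value that compares unequal to itself. */
int is_nan_by_self_comparison(float x)
{
    return x != x;
}

/* a + (b - b) equals a for finite b, but becomes NaN when b is NaN
   or infinite, so it cannot be simplified to just a. */
float add_cancelled_term(float a, float b)
{
    return a + (b - b);
}
```

With these helpers, INFINITY - INFINITY and 0.0f * INFINITY are both detected as NaN, and add_cancelled_term(1.0f, NAN) is NaN while add_cancelled_term(1.0f, 2.0f) is 1.0f.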
It’s worth pointing out how many different NaN values there are – a float can store 2^{23}−1 (over 8 million) different NaNs, while a double fits 2^{52}−1 (over 4.5 quadrillion) different NaNs. It may seem wasteful, but the standard specifically made the pool large for, quote, “uninitialized variables and arithmetic-like enhancements”. You can read about one of those uses in Annie Cherkaev’s very interesting “the secret life of NaN”. Her article also discusses the concepts of quiet and signaling NaNs.
Maximum & Minimum
The exponent range limit puts some constraints on the minimum and the maximum value that can be represented with a float. The maximum value of that type is 2^{128} − 2^{104} (3.40282347×10^{38}). The biased exponent is one short of the maximum value and the significand is all lit up:
The smallest normal float is 2^{−126} (roughly 1.17549435×10^{−38}). Its biased exponent is set to 1 and the significand is cleared out:
In C the minimum and maximum values can be accessed with the FLT_MIN and FLT_MAX macros respectively. While FLT_MIN is the smallest normal value, it’s not the smallest value a float can store. We can squeeze things down even more.
Subnormals
When discussing base-2 scientific notation we assumed the numbers were normalized, i.e. the first digit of the significand was 1:
The range of subnormals (also known as denormals) relaxes that requirement. When the biased exponent is set to 0, the exponent is interpreted as −126 (not −127 despite the bias), and the leading digit is assumed to be 0:
The encoding doesn’t change, when performing calculations we just have to remember that this time the implicit bit is 0 and not 1:
While subnormals let us store smaller values than the minimum normal value, it comes at the cost of precision. As the significand decreases we effectively have fewer bits to work with which is more apparent after normalization:
The classic example for the need for subnormals is based on simple arithmetic. If two floating point values are equal to each other:
Then by simply rearranging the terms it follows that their difference should be equal to 0:
Without subnormal values that simple assumption would not be true! Consider x set to a valid normal float number:
And y as:
The numbers are distinct (observe the last few bits of significand). Their difference is:
Which is outside of the normal range of a float (notice the exponent value smaller than −126). If it wasn’t for subnormals the difference after rounding would be equal to 0, thus implying the equality of not equal numbers.
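The argument can be tried out in C. This is a sketch with illustrative operands picked here, 1.5 × 2^{−126} and 2^{−126}, which are not necessarily the values used above; it also assumes the platform hasn’t been switched into a flush-to-zero mode.

```c
#include <float.h>

/* Returns 1 when x and y differ and their difference is nonzero but
   smaller in magnitude than FLT_MIN, i.e. it is stored as a subnormal. */
int difference_is_subnormal(float x, float y)
{
    float diff = x - y;
    float magnitude = diff < 0.0f ? -diff : diff;
    return x != y && magnitude > 0.0f && magnitude < FLT_MIN;
}
```

For x = 0x1.8p-126f and y = 0x1.0p-126f the difference is 2^{−127}, a subnormal, so x − y stays nonzero just like x != y promises.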
On a historical note, subnormals were a very controversial part of the IEEE 754 standardization process; you can read more about it in “An Interview with the Old Man of Floating-Point”.
Discrete Space
Due to the fixed number of bits in the significand, floating point numbers can’t store arbitrarily precise values. Moreover, the exponential part causes the distribution of values in a float to be uneven. In the picture below each tick on the horizontal axis represents a unique float value:
Chunky float values
Notice how the powers of 2 are special – they define the transition points for the change of “chunkiness”. The distance between representable float values in between neighboring powers of two (i.e. between 2^{n} and 2^{n + 1}) is constant and we can jump between them by changing the significand by 1 bit.
The larger the exponent the “larger” the 1 bit of significand is. For example, the number 0.5 has the exponent value of −1 (since 2^{−1} is 0.5) and 1 bit of its significand jumps by 2^{−24}. For the number 1.0 the step is equal to 2^{−23}. The width of the jump at 1.0 has a dedicated name – machine epsilon. For a float it can be accessed via the FLT_EPSILON macro.
Starting at 2^{23} (decimal value of 8388608) increasing the significand by 1 increases the decimal value of a float by 1.0. As such, 2^{24} (16777216 in base-10) is the limit of the range of integers that can be stored in a float without omitting any of them. The next float has the value of 16777218; the value of 16777217 can’t be represented in a float:
The end of the gapless region
Note that the type can handle some larger integers as well, however, 2^{24} defines the end of the gapless region.
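The end of the gapless region is easy to probe in C. The helper below is a hypothetical name for this sketch: it converts a 32-bit integer to float and back and reports whether the value survived.

```c
#include <stdint.h>

/* Returns 1 when the integer survives a round trip through float. */
int fits_in_float_exactly(int32_t n)
{
    float as_float = (float)n;   /* rounds if n needs more than 24 bits */
    return (int64_t)as_float == (int64_t)n;
}
```

The function accepts 16777216 and 16777218 but rejects 16777217, which the conversion rounds to 16777216.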
Raw Integer Value
With a fixed exponent increasing the significand by 1 bit jumps between equidistant float values; however, the format has more tricks up its sleeve. Consider 2097151.875 stored in a float:
Ignoring the division into three parts for a second, we can think of the number as a string of 32 bits. Let’s try interpreting them as a 32-bit unsigned integer:
As a quick experiment, let’s add one to the value…
…and put the bits verbatim back into the float format:
We’ve just obtained the value of 2097152.0 which is the next representable float – the type can’t store any other values between this and the previous one.
Notice how adding one overflowed the significand and added one to the exponent value. This is the beauty of putting the exponent part before the significand. It lets us easily obtain the next/previous representable float (away/towards zero) by simply increasing/decreasing its raw integer value.
Incrementing the integer representation of the maximum float value by one? You get infinity. Decrementing the integer form of the minimum float? You enter the world of subnormals. Decrease it for the smallest subnormal? You get zero. Things fall into place just perfectly. The two caveats with this trick are that it won’t jump from +0.0 to −0.0 and vice versa, and that infinities will “increment” to NaNs, with the last NaN incrementing to zero.
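The raw integer trick can be written as a short C sketch, assuming a 32-bit IEEE 754 float. The function names are invented for this example, and memcpy performs the reinterpretation of bits.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets the 32 bits of a float as an unsigned integer. */
uint32_t float_to_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Reinterprets an unsigned integer as the 32 bits of a float. */
float bits_to_float(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Steps to the neighboring float away from zero by adding one to the
   raw integer value. Not meaningful for infinities and NaNs. */
float next_away_from_zero(float value)
{
    return bits_to_float(float_to_bits(value) + 1u);
}
```

Here next_away_from_zero(2097151.875f) returns 2097152.0f, and applying it to the largest finite float (raw value 0x7F7FFFFF) produces the encoding of infinity, 0x7F800000.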
Other Types
So far we’ve focused our discussion on a float, but its popular bigger cousin double and the less common half are also worth looking at.
Double
In base-2 scientific notation a double has 53 digits of significand and an exponent range of [−1022, +1023], resulting in an encoding with 11 bits dedicated to the exponent and 52 bits to the significand to form a 64-bit encoding:
Half
Half-float is used relatively often in computer graphics. In base-2 scientific notation a half has 11 digits of significand and an exponent range of [−14, +15], resulting in an encoding with 5 bits dedicated to the exponent and 10 bits to the significand, creating a 16-bit type:
half is really compact, but also has a very small range of representable values. Additionally, given only 5 bits of the exponent, almost ^{1}⁄_{32} of the possible half values are dedicated to NaNs.
Larger Types
IEEE 754 specifies a 128-bit floating point format; however, native hardware support is very limited. Some compilers will let you use it when the __float128 type is used, but the operations are usually done in software.
The standard also suggests equations for obtaining the number of exponent and significand bits in higher precision formats (e.g. 256-bit), but I think it’s fair to say those are rather impractical.
Same Behavior
While all IEEE 754 types have different lengths, they all behave the same way:
 ±0.0 always has all the bits of the exponent and the significand set to zero
 ±infinity has all ones in the exponent and all zeros in the significand
 NaNs have all ones in the exponent and a nonzero significand
 the encoded exponent of subnormals is 0
The only difference between the types is in how many bits they dedicate to the exponent and to the significand.
Conversions
While in practice many floating point calculations are performed using the same type throughout, a type change is often unavoidable. For example, JavaScript’s Number is just a double; however, WebGL deals with float values. Conversions to a larger and a smaller type behave differently.
Conversion to a Larger Type
Since a double has more bits of the significand and of the exponent than a float, and so does a float compared to a half, we can be sure that converting a floating-point value to a higher precision type will maintain the exact stored value.
Let’s see how this pans out for a half value of 234.125. Its binary representation is:
The same number stored in a float has the following representation:
And in a double:
Note that the new significand bits in a larger format are filled with zeros which simply follows from scientific notation. The new exponent bits are filled with 0s when the highest bit is 1, and with 1s when the highest bit is 0 (you can see it by changing type e.g. for 0.11328125) – a result of unbiasing the value with original bias then biasing again with the new bias.
Conversion to a Smaller Type
The following should be fairly unsurprising, but it’s worth going through an example. Consider a double value of −282960.039306640625:
When converting to a float we have to account for the significand bits that don’t fit, which is by default done using the round-to-nearest-even method. As such, the same number stored in a float has the following representation:
The decimal value of this float is −282960.03125, i.e. a different number than the one stored in a double. Converting to a half produces:
What happened here? The exponent value of 18 that fits perfectly fine in a float is too large for the maximum exponent of 15 that a half can handle and the resulting value is −infinity.
Converting from a higher to a lower precision floating point type will maintain the exact value if the significand bits that don’t fit in the smaller type are 0s and the exponent value can be represented in the smaller type. If we were to convert the previously examined 234.125 from a double to a float or to a half it would keep its exact value in all three types.
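The double to float case can be checked with a tiny C sketch. Standard C has no portable half type, so the example stops at float; the helper name is hypothetical.

```c
/* Returns 1 when narrowing to float and widening back gives the
   original double, i.e. when the conversion lost nothing. */
int survives_float_round_trip(double value)
{
    float narrowed = (float)value;   /* round-to-nearest-even by default */
    return (double)narrowed == value;
}
```

The value 234.125 survives the round trip, while −282960.039306640625 does not: the narrowed float holds −282960.03125.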
A Sidenote on Rounding
While round-half-up (“If the fraction is .5 – round up”) is the common rounding rule used in everyday life, it’s actually quite flawed. Consider the results of the following made-up survey:
 725 responders said their favorite color is red
 275 responders said their favorite color is green
The distribution of votes is 72.5% and 27.5% respectively. If we wanted to round the percentages to integer values and were to use round-half-up we’d end up with the following outcome: 73% and 28%. To everyone’s dissatisfaction we just made the survey results add up to 101%.
Round-to-nearest-even solves this problem by, unsurprisingly, rounding to the nearest even value. 72.5% becomes 72%, 27.5% becomes 28%. The expected sum of 100% is restored.
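The survey example can be reproduced with integer arithmetic. In this sketch the percentages are passed in tenths (725 stands for 72.5%) to keep floating point out of the picture; both function names are made up for the illustration and assume non-negative inputs.

```c
/* Rounds a value given in tenths (725 means 72.5) to a whole number,
   with ties going up. Assumes a non-negative input. */
int round_half_up(int tenths)
{
    return (tenths + 5) / 10;
}

/* Same, but ties go to the nearest even whole number. */
int round_half_even(int tenths)
{
    int whole = tenths / 10;
    int rest = tenths % 10;
    if (rest > 5 || (rest == 5 && whole % 2 != 0)) {
        whole += 1;
    }
    return whole;
}
```

With round_half_up the two results are 73 and 28, adding up to 101, while round_half_even gives 72 and 28, adding up to 100.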
Conversion of Special Values
Neither NaNs nor infinities follow the usual conventions. Their special rule is very straightforward: NaNs remain NaNs and infinities remain infinities in all the type conversions.
Printing
Working with floating point numbers often requires printing their value so that it can be restored accurately: every bit should maintain its exact value. When it comes to printf style formatting characters, %f and %e are commonly used. Sadly, they often fail to maintain enough precision:
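The effect can be sketched with two helpers that compare the text produced for a pair of floats. The helper names are hypothetical, and the pair used in the usage note, 0x1.810682p+1f and its immediate neighbor 0x1.810684p+1f, is an assumption chosen here for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 when two floats produce the same text under "%f". */
int same_text_with_f(float a, float b)
{
    char text_a[64];
    char text_b[64];
    snprintf(text_a, sizeof text_a, "%f", a);   /* floats are promoted to double */
    snprintf(text_b, sizeof text_b, "%f", b);
    return strcmp(text_a, text_b) == 0;
}

/* Returns 1 when two floats produce the same text under "%e". */
int same_text_with_e(float a, float b)
{
    char text_a[64];
    char text_b[64];
    snprintf(text_a, sizeof text_a, "%e", a);
    snprintf(text_b, sizeof text_b, "%e", b);
    return strcmp(text_a, text_b) == 0;
}
```

For the pair above both helpers return 1 even though the two values are different floats, so neither format is enough to restore the stored bits.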


Produces:
3.008011
3.008011
3.008011e+00
3.008011e+00
However, those two floating point numbers are not the same and store different values. f0 is:
And f1 differs from f0 by 3 in its raw integer value:
The usual solution to this problem is to specify the precision manually to the maximum number of digits. We can use the FLT_DECIMAL_DIG macro (value of 9) for this purpose:
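A minimal sketch of that approach passes FLT_DECIMAL_DIG as the precision through the * placeholder. The helper name is hypothetical; it also parses the text back with strtof to confirm that the printed digits restore the same float.

```c
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

/* Prints value with FLT_DECIMAL_DIG digits after the decimal point into
   text, then parses the text back. Returns 1 when the float is restored. */
int decimal_text_round_trips(float value, char *text, size_t size)
{
    snprintf(text, size, "%.*e", FLT_DECIMAL_DIG, value);
    return strtof(text, NULL) == value;
}
```

The price is verbosity: for 3.0f the text written into the buffer is 3.000000000e+00.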


Yields:
3.008011102e+00
3.008011817e+00
Unfortunately, it will print the long form even for simple values, e.g. 3.0f will be printed as 3.000000000e+00. It seems that there is no way to configure the printing of floating point values to automatically maintain the exact number of decimal digits needed to accurately represent the value.
Hexadecimal Form
Luckily, hexadecimal form comes to the rescue. It uses the %a specifier and prints the shortest, exact representation of a floating point number in a hexadecimal form:
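A sketch of the %a round trip could look as follows. The helper names are made up for this example; the exact spelling of the %a output may vary between C libraries, so the sketch relies only on the text being parsed back to the same value.

```c
#include <stdio.h>
#include <stdlib.h>

/* Writes the hexadecimal form of value into text. */
void format_hex_float(float value, char *text, size_t size)
{
    snprintf(text, size, "%a", value);   /* value is promoted to double */
}

/* Parses a hexadecimal (or decimal) floating point string. */
float parse_float(const char *text)
{
    return strtof(text, NULL);
}
```

Formatting a float with format_hex_float and feeding the text to parse_float returns the original value bit for bit, and parse_float("0x1.810682p+1") is the float written in code as 0x1.810682p+1f.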


Produces:
0x1.810682p+1
0x1.810688p+1
The hexadecimal constant can be used verbatim in code or as an input to scanf or strtof on any reasonable compiler and platform. To verify the results we can fire up LLDB one more time:
(lldb) p 0x1.810682p+1f
(float) $0 = 3.0080111
(lldb) p 0x1.810688p+1f
(float) $1 = 3.00801182
(lldb) p/t *(uint32_t *)&$0
(uint32_t) $2 = 0b01000000010000001000001101000001
(lldb) p/t *(uint32_t *)&$1
(uint32_t) $3 = 0b01000000010000001000001101000100
The hexadecimal form is exact and concise – each set of four bits of the significand is converted to the corresponding hex digit. Using our example values: 1000 becomes 8, 0001 becomes 1 and so on. An unbiased exponent just follows the letter p. You can find more details about the %a specifier in “Hexadecimal Floating-Point Constants”.
Nine digits may be enough to maintain the exact value, but it’s nowhere near the number of digits required to show the floating point number in its full decimal glory.
Exact Decimal Representation
While not every decimal number can be represented using floating point numbers (the infamous 0.1), every floating point number has its own exact decimal representation. The following example is done on a half since it’s much more compact, but the method is equivalent for a float and a double.
Let’s consider the value of 3.142578125 stored in a half:
The equivalent value in scientific base-2 notation is:
Firstly, we can convert the significand part to an integer by multiplying it by 1:
Which we can cleverly expand:
To obtain an integer times a power of two:
Then we can combine the fractional part with the exponent part:
And in decimal form:
We can get rid of the power of two by multiplying it by a cleverly written value of 1 yet another time:
We can pair every 2 with every 5 to obtain:
Putting back all the pieces together we end up with a product of two integers and a shift of a decimal place encoded in the power of 10:
Coincidentally, the trick of multiplying by 5^{−n}×5^{n} also explains why negative powers of 2 are just powers of 5 with a shifted decimal place (e.g. ^{1}⁄_{4} is ^{25}⁄_{100}, and ^{1}⁄_{16} is ^{625}⁄_{10000}).
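The steps above can be written out for the discussed value. This is a reconstruction of the derivation, assuming the significand $1.1001001001_2$ and the exponent of 1 that follow from 3.142578125 = 1609 / 512:

```latex
\begin{aligned}
1.1001001001_2 \times 2^{1}
  &= 1.1001001001_2 \times \frac{2^{10}}{2^{10}} \times 2^{1} \\
  &= 11001001001_2 \times 2^{-10} \times 2^{1} \\
  &= 1609 \times 2^{-9} \\
  &= 1609 \times 2^{-9} \times 5^{-9} \times 5^{9} \\
  &= 1609 \times 5^{9} \times 10^{-9} \\
  &= 1609 \times 1953125 \times 10^{-9} \\
  &= 3142578125 \times 10^{-9} = 3.142578125
\end{aligned}
```

The pairing of every 2 with a 5 happens in the fifth line, where $2^{-9} \times 5^{-9}$ collapses into $10^{-9}$.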
Even though the exact decimal representation always exists, it’s often cumbersome to use – some small numbers that can be stored in a double have over 760 significant digits of decimal representation!
Further Reading
My article is just a drop in the sea of resources about floating point numbers. Perhaps the most thorough technical write-up on floating point numbers is “What Every Computer Scientist Should Know About Floating-Point Arithmetic”. While very comprehensive, I find it difficult to get through. Almost five years have passed since I first mentioned it on this blog and, frankly, I’ve still limited my engagement to mostly skimming through it.
One of the most fascinating resources out there is Bruce Dawson’s amazing series of posts. Bruce dives into a ton of details about the format and its behavior. I consider many of his articles a must-read for any programmer who deals with floating point numbers on a regular basis, but if you only have time for one I’d go with “Comparing Floating Point Numbers, 2012 Edition”.
Exploring Binary contains many detailed articles on the floating point format. As a delightful example, it demonstrates that the maximum number of significant digits in the decimal representation of a float is 112, while a double requires up to 767 digits.
For a different look on floating point numbers I recommend Fabien Sanglard’s “Floating Point Visually Explained” – it shows an interesting concept of the exponent interpreted as a sliding window and the significand as an offset into that window.
Final Words
Even though we’re done, I encourage you to go on. Any of the mentioned resources should let you discover something more in the vast space of floating point numbers.
The more I learn about IEEE 754 the more enchanted I feel. William Kahan with the aid of Jerome Coonen and Harold Stone created something truly beautiful and everlasting.
I genuinely hope this trip through the details of floating point numbers made them a bit less mysterious and showed you some of that beauty.