Exposing Floating Point

Despite their everyday use, floating point numbers are often understood in a hand-wavy manner, and their behavior raises many eyebrows. Over the course of this article I’d like to show that things aren’t actually that complicated.

This blog post is a companion to my recently launched website – float.exposed. Other than exploiting the absurdity of the present day list of top-level domains, it’s intended to be a handy tool for inspecting floating point numbers. While I encourage you to play with it, many of its elements may seem exotic at first. By the time we’ve finished, however, all of them will hopefully become familiar.

On a technical note, by floating point I’m referring to the ubiquitous IEEE 754 binary floating point format. Types half, float, and double are understood to be binary16, binary32, and binary64 respectively. There were other formats back in the day, but whatever device you’re reading this on is pretty much guaranteed to use IEEE 754.

With the formalities out of the way, let’s start at the shallow end of the pool.

Writing Numbers

We’ll begin with the very basics of writing numeric values. The initial steps may seem trivial, but starting from the first principles will help us build a working model of floating point numbers.

Decimal Numbers

Consider the number 327.849. Digits to the left of the decimal point represent increasing powers of ten, while digits to the right of the decimal point represent decreasing powers of ten:

3·10² + 2·10¹ + 7·10⁰ + 8·10⁻¹ + 4·10⁻² + 9·10⁻³

Even though this notation is very natural, it has a few disadvantages:

  • small numbers like 0.000000000653 require skimming over many zeros before they start “showing” actually useful digits
  • it’s hard to estimate the magnitude of large numbers like 7298345251 at a glance
  • at some point the distant digits of a number become increasingly less significant and could often be dropped, yet for big numbers we don’t save any space by replacing them with zeros, e.g. 7298000000

By “small” and “big” numbers I’m referring to their magnitude so −4205 is understood to be bigger than 0.03 even though it’s to the left of it on the real number line.

Scientific notation solves all these problems. It shifts the decimal point to right after the first non-zero digit and sets the exponent accordingly:

+3.27849×10²

Scientific notation has three major components: the sign (+), the significand (3.27849), and the exponent (2). For positive values the “+” sign is often omitted, but we’ll keep it around for the sake of verbosity. Note that the “10” simply shows that we’re dealing with a base-10 system. The aforementioned disadvantages disappear:

  • the 0-heavy small number is presented as 6.53×10⁻¹⁰ with all the pesky zeros removed
  • just by looking at the first digit and the exponent of 7.298345251×10⁹ we know that number is roughly 7 billion
  • we can drop the unwanted distant digits from the tail to get 7.298×10⁹

Continuing with the protagonist of this section, if we’re only interested in the 4 most significant digits we can round the number using one of the many rounding rules:

+3.278×10²

The number of digits shown describes the precision we’re dealing with. A number with 8 digits of precision could be printed as:

+3.2784900×10²

Binary Numbers

With the familiar base-10 out of the way, let’s look at the binary numbers. The rules of the game are exactly the same, it’s just that the base is 2 and not 10. Digits to the left of the binary point represent increasing powers of two, while digits to the right of the binary point represent decreasing powers of two:

1·2³ + 0·2² + 0·2¹ + 1·2⁰ + 0·2⁻¹ + 1·2⁻² + 0·2⁻³ + 1·2⁻⁴

When ambiguous I’ll use a subscript 2 to mean the number is in base-2. As such, 1000₂ is not a thousand, but 2³, i.e. eight. To get the decimal value of the discussed 1001.0101₂ we simply sum up the powers of two that have the bit set: 8 + 1 + 0.25 + 0.0625, ending up with the value of 9.3125.

Binary numbers can use scientific notation as well. Since we’re shifting the binary point by three places, the exponent ends up having the value of 3:

+1.0010101×2³

Similarly to scientific notation in base-10, we also moved the binary point to right after the first non-zero digit of the original representation. However, since the only non-zero digit in base-2 system is 1, every non-zero binary number in scientific notation starts with a 1.

We can round the number to a shorter form:

+1.0011×2³

Or show that we’re more accurate by storing 11 binary digits:

+1.0010101000×2³

If you’ve grasped everything that we’ve discussed so far then congratulations – you understand how floating point numbers work.

Floating Point Numbers

Floating point numbers are just numbers in base-2 scientific notation with the following two restrictions:

  • limited number of digits in the significand
  • limited range of the exponent – it can’t be greater than some maximum limit and also can’t be less than some minimum limit

That’s (almost) all there is to them.

Different floating point types have different numbers of significand digits and allowed exponent ranges. For example, a float has 24 binary digits (i.e. bits) of significand and an exponent range of [−126, +127], where “[” and “]” denote inclusivity of the range (e.g. +127 is valid, but +128 is not). Here’s a number with a decimal value of −616134.5625 that can fit in a float:

−1.00101100110110001101001×2¹⁹
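
As a quick cross-check, the significand size and the exponent limits quoted above can be read back from C’s float.h. A small sketch – note that C’s convention is off by one from the range used here, since FLT_MAX_EXP is defined as the largest e for which 2^(e−1) is representable:

#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("%d\n", FLT_MANT_DIG); // 24 significand bits, implicit bit included
    printf("%d\n", FLT_MIN_EXP);  // -125, i.e. a minimum exponent of -126
    printf("%d\n", FLT_MAX_EXP);  // 128, i.e. a maximum exponent of +127
    return 0;
}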

Unfortunately, the number of bits of significand in a float is limited, so some real values may not be perfectly representable in the floating point form. The decimal number 0.2 has the following base-2 representation:

1.1̅0̅0̅1̅×2⁻³

The overline (technically known as a vinculum) indicates a forever repeating value. The 25th and later significant digits of the perfect base-2 scientific representation of that number won’t fit in a float and have to be accounted for by rounding the remaining bits. The full significand:

1.100110011001100110011001100

Will be rounded to:

1.10011001100110011001101

After applying the exponent the resulting number has a different decimal value than the perfect 0.2:

0.20000000298023223876953125

If we tried rounding the full significand down:

1.10011001100110011001100

The resulting number would be equal to:

0.199999988079071044921875

No matter what we do, the limited number of bits in the significand prevents us from getting the correct result. This explains why some decimal numbers don’t have their exact floating point representation.
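
We can watch that rounding happen in two lines of C – the literal 0.2f is silently rounded to the nearest representable float, and printing it with enough digits reveals the stored value (assuming a libc that prints exact decimal expansions, which e.g. glibc and the Apple libc do):

#include <stdio.h>

int main(void)
{
    // 0.2f rounds to the nearest float; the stored value is not 0.2
    printf("%.26f\n", 0.2f); // 0.20000000298023223876953125
    return 0;
}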

Similarly, since the value of the exponent is limited, many huge and many tiny numbers won’t fit in a float: neither 2²⁰⁰ nor 2⁻³⁰⁰ can be represented since they don’t fall into the allowed exponent range of [−126, +127].

Encoding

Knowing the number of bits in the significand and the allowed range of the exponent we can start encoding floating point numbers into their binary representation. We’ll use the number −2343.53125 which has the following representation in base-2 scientific notation:

−1.0010010011110001×2¹¹

The Sign

The sign is easy – we just need 1 bit to express whether the number is positive or negative. IEEE 754 uses the value of 0 for the former and 1 for the latter. Since the discussed number is negative we’ll use one:

1

The Significand

For the significand of a float we need 24 bits. However, per what we’ve already discussed, the first digit of the significand in base-2 is always 1, so the format cleverly skips it to save a bit. We just have to remember it’s there when doing calculations. We copy the remaining 23 digits verbatim while filling in the missing bits at the end with 0s:

00100100111100010000000

The leading “1” we skipped is often referred to as an “implicit bit”.

The Exponent

Since the exponent range of [−126, +127] allows 254 possible values, we’ll need 8 bits to store it. To avoid special handling of negative exponent values we’ll add a fixed bias to make sure no encoded exponent is negative.

To obtain a biased exponent we’ll use the bias value of 127. While 126 would work for the regular range of exponents, using 127 will let us reserve a biased value of 0 for special purposes. Biasing is just a matter of shifting all values to the right:

The bias in a float

For the discussed number we have to shift its exponent of 11 by 127 to get 138, or 10001010₂, and that’s what we will encode as the exponent:

10001010

Putting it All Together

To conform with the standard we’ll put the sign bit first, then the exponent bits, and finally, the significand bits. While seemingly arbitrary, the order is part of the standard’s ingenuity. By sticking all the pieces together a float is born:

11000101000100100111100010000000

The entire encoding occupies 32 bits. To verify we did things correctly we can fire up LLDB and let the hacky type punning do its work:

(lldb) p -2343.53125f
(float) $0 = -2343.53125

(lldb) p/t *(uint32_t *)&$0
(uint32_t) $1 = 0b11000101000100100111100010000000

While neither the C nor the C++ standard technically requires a float or a double to be represented using the IEEE 754 format, the rest of this article will sensibly assume so.
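
On a related note, the pointer cast above is fine for a quick LLDB session, but in portable C the usual strict-aliasing-safe way to get at the bits is memcpy. A minimal sketch reproducing the encoding we just did by hand:

#include <inttypes.h>
#include <stdio.h>
#include <string.h>

// reinterprets the bytes of a float as a 32-bit unsigned integer
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

int main(void)
{
    // 0xc5127880 is 0b11000101000100100111100010000000
    printf("0x%08" PRIx32 "\n", float_to_bits(-2343.53125f));
    return 0;
}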

The same procedure of encoding a number in base-2 scientific notation can be repeated for almost any number, however, some of them require special handling.

Special Values

The float exponent range allows 254 different values and with a bias of 127 we’re left with two yet unused biased exponent values: 0 and 255. Both are employed for very useful purposes.

A Map of Floats

A dry description doesn’t really paint a picture, so let’s present all the special values visually. In the following plot every dot represents a unique positive float:

All the special values

If you have trouble seeing color you can switch to the alternative version. Notice the necessary truncation of a large part of the exponents and of a gigantic part of the significand values – at a typical viewing size you’d have to scroll through an enormous number of window widths to see all the values of the significand.

We’ve already discussed all the unmarked dots — the normal floats. It’s time to dive into the remaining values.

Zero

A float number with a biased exponent value of 0 and all zeros in the significand is interpreted as positive or negative 0. The otherwise arbitrary value of the sign bit (shown as _) decides which 0 we’re dealing with:

_0000000000000000000000000000000

Yes, the floating point standard specifies both +0.0 and −0.0. This concept is actually useful because it tells us from which “direction” the 0 was approached as a result of storing a value too small to be represented in a float. For instance, -10e-30f / 10e30f won’t fit in a float, however, it will produce the value of -0.0.

When working with zeros note that 0.0 == -0.0 is true even though the two zeros have different encodings. Additionally, -0.0 + 0.0 is equal to 0.0, so by default the compiler can’t optimize a + 0.0 into just a (for a equal to -0.0 the addition would change the result), however, you can set flags to relax the strict conformance.
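
A short sketch of both behaviors – the comparison treats the zeros as equal, while signbit from math.h can still tell them apart:

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%d\n", 0.0f == -0.0f);     // 1 -- the zeros compare equal
    printf("%d\n", signbit(-0.0f));    // nonzero -- the sign bit is set
    printf("%f\n", -10e-30f / 10e30f); // -0.000000 -- underflow to -0.0
    return 0;
}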

Infinity

A float number with the maximum biased exponent value and all zeros in the significand is interpreted as positive or negative infinity depending on the value of the sign bit:

_1111111100000000000000000000000

Infinity arises as a result of rounding a value that’s too large to fit in the type (assuming the default rounding mode). In the case of a float, any number in base-2 scientific notation with an exponent greater than 127 will become infinity. You can also use the INFINITY macro directly.

The positive and negative zeros become useful again since dividing a positive value by +0.0 will produce a positive infinity, while dividing it by −0.0 will produce a negative infinity.
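
A quick check – under IEEE 754 dividing a finite non-zero value by zero is perfectly well defined (though it does raise the divide-by-zero flag):

#include <stdio.h>

int main(void)
{
    printf("%f\n", 1.0f / 0.0f);  // inf
    printf("%f\n", 1.0f / -0.0f); // -inf
    return 0;
}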

Operations involving finite numbers and infinities are actually well defined and follow common sense property of keeping infinities infinite:

  • any finite value added to or subtracted from ±infinity ends up as ±infinity
  • any finite positive value multiplied by ±infinity ends up as ±infinity, while any finite negative value multiplied by ±infinity flips its sign to ∓infinity
  • division by a finite non-zero value works similarly to multiplication (think of division as multiplication by an inverse)
  • square root of a +infinity is +infinity
  • any finite value divided by ±infinity will become ±0.0 depending on the signs of the operands

In other words, infinities are so big that any shifting or scaling won’t affect their infinite magnitude, only their sign may flip. However, some operations throw a wrench into that simple rule.

NaNs

A float number with the maximum biased exponent value and a non-zero significand is interpreted as NaN – Not a Number:

_11111111xxxxxxxxxxxxxxxxxxxxxxx (the 23 x bits contain at least one 1)

The easiest way to obtain NaN directly is by using the NAN macro. In practice though, NaN arises in the following set of operations:

  • ±0.0 multiplied by ±infinity
  • −infinity added to +infinity
  • ±0.0 divided by ±0.0
  • ±infinity divided by ±infinity
  • square root of a negative number (−0.0 is fine though!)

If a floating point variable is uninitialized, it’s also somewhat likely to contain NaNs. By default, any operation involving NaNs will produce a NaN as well. That’s one of the reasons why a compiler can’t optimize seemingly simple cases like a + (b - b) into just a. If b is NaN the result of the entire operation has to be NaN too.

NaNs are not equal to anything, not even themselves. If you were to look at your compiler’s implementation of the isnan function you’d see something like return x != x;.

It’s worth pointing out how many different NaN values there are – a float can store 2²³−1 (over 8 million) different NaNs, while a double fits 2⁵²−1 (over 4.5 quadrillion) different NaNs. It may seem wasteful, but the standard specifically made the pool large for, quote, “uninitialized variables and arithmetic-like enhancements”. You can read about one of those uses in Annie Cherkaev’s very interesting “the secret life of NaN”. Her article also discusses the concepts of quiet and signaling NaNs.

Maximum & Minimum

The exponent range limit puts some constraints on the minimum and the maximum value that can be represented with a float. The maximum value of that type is 2¹²⁸ − 2¹⁰⁴ (3.40282347×10³⁸). The biased exponent is one short of the maximum value and the significand is all lit up:

01111111011111111111111111111111

The smallest normal float is 2⁻¹²⁶ (roughly 1.17549435×10⁻³⁸). Its biased exponent is set to 1 and the significand is cleared out:

00000000100000000000000000000000

In C the minimum and maximum values can be accessed with FLT_MIN and FLT_MAX macros respectively. While FLT_MIN is the smallest normal value, it’s not the smallest value a float can store. We can squeeze things down even more.

Subnormals

When discussing base-2 scientific notation we assumed the numbers were normalized, i.e. the first digit of the significand was 1:

+1.00101100110110001101001×2¹⁹

The range of subnormals (also known as denormals) relaxes that requirement. When the biased exponent is set to 0, the exponent is interpreted as −126 (not −127 despite the bias), and the leading digit is assumed to be 0:

+0.00000000000110001101001×2⁻¹²⁶

The encoding doesn’t change – when performing calculations we just have to remember that this time the implicit bit is 0 and not 1:

00000000000000000000110001101001

While subnormals let us store values smaller than the minimum normal value, they come at the cost of precision. As the significand decreases we effectively have fewer bits to work with, which is more apparent after normalization:

+1.10001101001×2⁻¹³⁸

The classic example of the need for subnormals is based on simple arithmetic. If two floating point values are equal to each other:

x == y

Then by simply rearranging the terms it follows that their difference should be equal to 0:

x − y == 0

Without subnormal values that simple assumption would not be true! Consider x set to a valid normal float number:

+1.01100001111101010000101×2⁻¹²⁴

And y as:

+1.01100000011001011100001×2⁻¹²⁴

The numbers are distinct (observe the last few bits of significand). Their difference is:

+1.1000111101001×2⁻¹³²

Which is outside of the normal range of a float (notice the exponent value smaller than −126). If it wasn’t for subnormals the difference after rounding would be equal to 0, thus implying the equality of unequal numbers.

On a historical note, subnormals were a very controversial part of the IEEE 754 standardization process; you can read more about it in “An Interview with the Old Man of Floating-Point”.

Discrete Space

Due to the fixed number of bits in the significand, floating point numbers can’t store arbitrarily precise values. Moreover, the exponential part causes the distribution of values in a float to be uneven. In the picture below each tick on the horizontal axis represents a unique float value:

Chunky float values

Notice how the powers of 2 are special – they define the transition points for the change of “chunkiness”. The distance between representable float values in between neighboring powers of two (i.e. between 2ⁿ and 2ⁿ⁺¹) is constant and we can jump between them by changing the significand by 1 bit.

The larger the exponent, the “larger” 1 bit of significand is. For example, the number 0.5 has an exponent value of −1 (since 2⁻¹ is 0.5) and 1 bit of its significand jumps by 2⁻²⁴. For the number 1.0 the step is equal to 2⁻²³. The width of the jump at 1.0 has a dedicated name – machine epsilon. For a float it can be accessed via the FLT_EPSILON macro.
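
Both claims are easy to poke at with nextafterf, which returns the neighboring representable value – a quick sketch:

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    // the gap right above 1.0 is the machine epsilon, 2^-23
    printf("%a\n", nextafterf(1.0f, 2.0f) - 1.0f); // 0x1p-23
    printf("%a\n", FLT_EPSILON);                   // 0x1p-23
    // the gap right above 0.5 is half as wide, 2^-24
    printf("%a\n", nextafterf(0.5f, 1.0f) - 0.5f); // 0x1p-24
    return 0;
}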

Starting at 2²³ (a decimal value of 8388608) increasing the significand by 1 increases the decimal value of the float by 1.0. As such, 2²⁴ (16777216 in base-10) is the limit of the range of integers that can be stored in a float without omitting any of them. The next float has the value of 16777218; the value of 16777217 can’t be represented in a float:

The end of the gapless region

Note that the type can handle some larger integers as well, however, 2²⁴ defines the end of the gapless region.
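
This is easy to verify – with the default rounding, adding 1.0f to 2²⁴ changes nothing, while adding 2.0f lands on the next representable float:

#include <stdio.h>

int main(void)
{
    float f = 16777216.0f;         // 2^24, the end of the gapless region

    printf("%d\n", f + 1.0f == f); // 1 -- 16777217 rounds back to 16777216
    printf("%.1f\n", f + 2.0f);    // 16777218.0 -- still representable
    return 0;
}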

Raw Integer Value

With a fixed exponent, increasing the significand by 1 bit jumps between equidistant float values; however, the format has more tricks up its sleeve. Consider 2097151.875 stored in a float:

01001001111111111111111111111111

Ignoring the division into three parts for a second, we can think of the number as a string of 32 bits. Let’s try interpreting them as a 32-bit unsigned integer:

01001001111111111111111111111111

As a quick experiment, let’s add one to the value…

01001010000000000000000000000000

…and put the bits verbatim back into the float format:

01001010000000000000000000000000

We’ve just obtained the value of 2097152.0 which is the next representable float – the type can’t store any other values between this and the previous one.

Notice how adding one overflowed the significand and added one to the exponent value. This is the beauty of putting the exponent part before the significand. It lets us easily obtain the next/previous representable float (away/towards zero) by simply increasing/decreasing its raw integer value.

Incrementing the integer representation of the maximum float value by one? You get infinity. Decrementing the integer form of the minimum normal float? You enter the world of subnormals. Decrementing it for the smallest subnormal? You get zero. Things fall into place just perfectly. The caveats with this trick are that it won’t jump from +0.0 to −0.0 and vice versa, that infinities will “increment” to NaNs, and that the last NaN will increment to zero.
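
Here’s a sketch of the trick in C, again with memcpy doing the type punning. It only handles positive finite values – a complete version (this is essentially how nextafterf can be implemented) would also have to deal with the sign and the special values listed above:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// next representable float away from zero; a sketch that’s
// only valid for positive finite inputs
static float next_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += 1;                    // bump the raw integer value
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void)
{
    printf("%.1f\n", next_float(2097151.875f)); // 2097152.0
    return 0;
}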

Other Types

So far we’ve focused our discussion on a float, but its popular bigger cousin double and the less common half are also worth looking at.

Double

In base-2 scientific notation a double has 53 digits of significand and an exponent range of [−1022, +1023], resulting in an encoding with 11 bits dedicated to the exponent and 52 bits to the significand to form a 64-bit encoding:

1011111101001011000101101101100100111101101110100010001101101000

Half

Half-float is used relatively often in computer graphics. In base-2 scientific notation a half has 11 digits of significand and an exponent range of [−14, +15], resulting in an encoding with 5 bits dedicated to the exponent and 10 bits to the significand, creating a 16-bit type:

0101101101010001

half is really compact, but it also has a very small range of representable values. Additionally, given only 5 bits of the exponent, almost 1/32 of the possible half values are dedicated to NaNs (2×(2¹⁰ − 1) = 2046 out of the 65536 bit patterns).

Larger Types

IEEE 754 specifies a 128-bit floating point format, however, native hardware support is very limited. Some compilers will let you use it via the __float128 type, but the operations are usually done in software.

The standard also suggests equations for obtaining the number of exponent and significand bits in higher precision formats (e.g. 256-bit), but I think it’s fair to say those are rather impractical.

Same Behavior

While all IEEE 754 types have different lengths, they all behave the same way:

  • ±0.0 always has all the bits of the exponent and the significand set to zero
  • ±infinity has all ones in the exponent and all zeros in the significand
  • NaNs have all ones in the exponent and a non-zero significand
  • the encoded exponent of subnormals is 0

The only difference between the types is in how many bits they dedicate to the exponent and to the significand.

Conversions

While in practice many floating point calculations are performed using the same type throughout, a type change is often unavoidable. For example, JavaScript’s Number is just a double, however, WebGL deals with float values. Conversions to a larger and a smaller type behave differently.

Conversion to a Larger Type

Since a double has more bits of both the significand and the exponent than a float, and a float in turn has more than a half, we can be sure that converting a floating point value to a higher precision type will maintain the exact stored value.

Let’s see how this pans out for a half value of 234.125. Its binary representation is:

0 10110 1101010001

The same number stored in a float has the following representation:

0 10000110 11010100010000000000000

And in a double:

0 10000000110 1101010001000000000000000000000000000000000000000000

Note that the new significand bits in the larger format are filled with zeros, which simply follows from scientific notation. The new exponent bits are filled with 0s when the highest bit of the original exponent is 1, and with 1s when that bit is 0 (you can see it by changing the type e.g. for 0.11328125) – a result of unbiasing the value with the original bias and then biasing it again with the new one.

Conversion to a Smaller Type

The following should be fairly unsurprising, but it’s worth going through an example. Consider a double value of −282960.039306640625:

1100000100010001010001010100000000101000010000000000000000000000

When converting to a float we have to account for the significand bits that don’t fit, which is by default done using the round-to-nearest-even method. As such, the same number stored in a float has the following representation:

1 10010001 00010100010101000000001

The decimal value of this float is −282960.03125, i.e. a different number than the one stored in a double. Converting to a half produces:

1 11111 0000000000

What happened here? The exponent value of 18 that fits perfectly fine in a float is too large for the maximum exponent of 15 that a half can handle and the resulting value is −infinity.

Converting from a higher to a lower precision floating point type will maintain the exact value if the significand bits that don’t fit in the smaller type are 0s and the exponent value can be represented in the smaller type. If we were to convert the previously examined 234.125 from a double to a float or to a half it would keep its exact value in all three types.
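
The double-to-float half of this example can be reproduced in plain C – halves would need a compiler extension such as __fp16, so this sketch stops at float:

#include <stdio.h>

int main(void)
{
    double d = -282960.039306640625;
    float  f = (float)d;  // rounded to the nearest representable float

    printf("%.12f\n", d); // -282960.039306640625
    printf("%.5f\n", f);  // -282960.03125
    return 0;
}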

A Sidenote on Rounding

While round-half-up (“If the fraction is .5 – round up”) is the common rounding rule used in everyday life, it’s actually quite flawed. Consider the results of the following made-up survey:

  • 725 respondents said their favorite color is red
  • 275 respondents said their favorite color is green

The distribution of votes is 72.5% and 27.5% respectively. If we wanted to round the percentages to integer values and were to use round-half-up we’d end up with the following outcome: 73% and 28%. To everyone’s dissatisfaction we just made the survey results add up to 101%.

Round-to-nearest-even solves this problem by, unsurprisingly, rounding to nearest even value. 72.5% becomes 72%, 27.5% becomes 28%. The expected sum of 100% is restored.
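
Round-to-nearest-even also happens to be the default rounding mode of IEEE 754 arithmetic itself. nearbyint from math.h, which rounds according to the current mode, shows it in action:

#include <math.h>
#include <stdio.h>

int main(void)
{
    // halfway cases go to the even neighbor in the default rounding mode
    printf("%.0f\n", nearbyint(72.5)); // 72
    printf("%.0f\n", nearbyint(27.5)); // 28
    return 0;
}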

Conversion of Special Values

Neither NaNs nor infinities follow the usual conventions. Their special rule is very straightforward: NaNs remain NaNs and infinities remain infinities in all the type conversions.

Printing

Working with floating point numbers often requires printing their value so that it can be restored accurately – every bit should maintain its exact value. When it comes to printf-style format specifiers, %f and %e are commonly used. Sadly, they often fail to maintain enough precision:

float f0 = 3.0080111026763916015f;
float f1 = 3.0080118179321289062f;

printf("%f\n", f0);
printf("%f\n", f1);
printf("%e\n", f0);
printf("%e\n", f1);

Produces:

3.008011
3.008011
3.008011e+00
3.008011e+00

However, those two floating point numbers are not the same and store different values. f0 is:

01000000010000001000001101000001

And f1’s raw integer value differs from f0’s by 3:

01000000010000001000001101000100

The usual solution to this problem is to manually specify the precision using the maximum number of digits. We can use the FLT_DECIMAL_DIG macro (with a value of 9) for this purpose:

float f0 = 3.0080111026763916015f;
float f1 = 3.0080118179321289062f;

printf("%.*e\n", FLT_DECIMAL_DIG, f0);
printf("%.*e\n", FLT_DECIMAL_DIG, f1);

Yields:

3.008011102e+00
3.008011817e+00

Unfortunately, it will print the long form even for simple values, e.g. 3.0f will be printed as 3.000000000e+00. It seems that there is no way to configure the printing of floating point values to automatically use exactly the number of decimal digits needed to accurately represent the value.

Hexadecimal Form

Luckily, hexadecimal form comes to the rescue. The %a specifier prints the shortest exact representation of a floating point number in a hexadecimal form:

float f0 = 3.0080111026763916015f;
float f1 = 3.0080118179321289062f;

printf("%a\n", f0);
printf("%a\n", f1);

Produces:

0x1.810682p+1
0x1.810688p+1

The hexadecimal constant can be used verbatim in code or as an input to scanf/strtof on any reasonable compiler and platform. To verify the results we can fire up LLDB one more time:

(lldb) p 0x1.810682p+1f
(float) $0 = 3.0080111

(lldb) p 0x1.810688p+1f
(float) $1 = 3.00801182

(lldb) p/t *(uint32_t *)&$0
(uint32_t) $2 = 0b01000000010000001000001101000001

(lldb) p/t *(uint32_t *)&$1
(uint32_t) $3 = 0b01000000010000001000001101000100

The hexadecimal form is exact and concise – each set of four bits of the significand is converted to the corresponding hex digit. Using our example values: 1000 becomes 8, 0001 becomes 1, and so on. The unbiased exponent just follows the letter p. You can find more details about the %a specifier in “Hexadecimal Floating-Point Constants”.

Nine digits may be enough to maintain the exact value, but it’s nowhere near the number of digits required to show the floating point number in its full decimal glory.

Exact Decimal Representation

While not every decimal number can be represented using floating point numbers (the infamous 0.1), every floating point number has its own exact decimal representation. The following example uses a half since it’s much more compact, but the method is equivalent for a float and a double.

Let’s consider the value of 3.142578125 stored in a half:

0100001001001001

The equivalent value in scientific base-2 notation is:

+1.1001001001×2¹

Firstly, we can convert the significand part to an integer by multiplying it by 1:

1.1001001001×1

Which we can cleverly expand:

1.1001001001×2¹⁰×2⁻¹⁰

To obtain an integer times a power of two:

11001001001₂×2⁻¹⁰

Then we can combine the fractional part with the exponent part:

11001001001₂×2⁻¹⁰×2¹

And in decimal form:

1609×2⁻⁹

We can get rid of the power of two by multiplying it by a cleverly written value of 1 yet another time:

2⁻⁹×5⁻⁹×5⁹

We can pair every 2 with every 5 to obtain:

10⁻⁹×5⁹

Putting back all the pieces together we end up with a product of two integers and a shift of a decimal place encoded in the power of 10:

10⁻⁹×5⁹×1609 = 3.142578125
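
The arithmetic is simple enough to double-check with plain integers – a sketch:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t five_to_9 = 1953125;       // 5^9
    uint64_t digits = 1609 * five_to_9; // 3142578125

    // shifting the decimal point 9 places to the left gives 3.142578125
    printf("%" PRIu64 "\n", digits);
    return 0;
}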

Coincidentally, the trick of multiplying by 5⁻ⁿ×5ⁿ also explains why negative powers of 2 are just powers of 5 with a shifted decimal place (e.g. 1/4 is 25/100, and 1/16 is 625/10000).

Even though the exact decimal representation always exists, it’s often cumbersome to use – some small numbers that can be stored in a double have over 760 significant digits in their decimal representation!

Further Reading

My article is just a drop in the sea of resources about floating point numbers. Perhaps the most thorough technical write-up on floating point numbers is “What Every Computer Scientist Should Know About Floating-Point Arithmetic”. While very comprehensive, I find it difficult to get through. Almost five years have passed since I first mentioned it on this blog and, frankly, I’ve still limited my engagement to mostly skimming through it.

One of the most fascinating resources out there is Bruce Dawson’s amazing series of posts. Bruce dives into a ton of details about the format and its behavior. I consider many of his articles a must-read for any programmer who deals with floating point numbers on a regular basis, but if you only have time for one I’d go with “Comparing Floating Point Numbers, 2012 Edition”.

Exploring Binary contains many detailed articles on the floating point format. As a delightful example, it demonstrates that the maximum number of significant digits in the decimal representation of a float is 112, while a double requires up to 767 digits.

For a different look at floating point numbers I recommend Fabien Sanglard’s “Floating Point Visually Explained” – it shows an interesting concept of the exponent interpreted as a sliding window and the significand as an offset into that window.

Final Words

Even though we’re done, I encourage you to go on. Any of the mentioned resources should let you discover something more in the vast space of floating point numbers.

The more I learn about IEEE 754 the more enchanted I feel. William Kahan, with the aid of Jerome Coonen and Harold Stone, created something truly beautiful and everlasting.

I genuinely hope this trip through the details of floating point numbers made them a bit less mysterious and showed you some of that beauty.

Mesh Transforms

I’m a huge fan of the transform property. Combining rotations, translations, and scalings is one of the easiest ways to modify the shape of a UIView or a CALayer. While easy to use, regular transforms are quite limited in what they can achieve – the rectangular shape of a layer can be transformed into an arbitrary quadrilateral. It’s nothing to sneeze at, but there are much more powerful toys out there.

This article is focused on mesh transforms. The core idea of a mesh transform is very straightforward: you introduce a set of vertices in the layer, then you move them around, deforming the entire contents:

A mesh transform

The major part of this post is dedicated to a private Core Animation API that’s been part of the framework since iOS 5.0. If at this point you’re leaving this page to prevent spoiling your mind with deliciousness of private API, fear not. In the second section of the article I present an equivalent, open-sourced alternative.

CAMeshTransform

The first time I saw iOS-runtime headers I was mesmerized. The descriptions of so many private classes and hidden properties were extremely eye-opening. One of my most intriguing findings was the CAMeshTransform class and a corresponding meshTransform property on CALayer. I badly wanted to figure it all out and recently I finally did. While seemingly complex, the concepts behind a mesh transform are easy to grasp. Here’s what a convenience construction method of CAMeshTransform looks like:

+ (instancetype)meshTransformWithVertexCount:(NSUInteger)vertexCount
                                    vertices:(CAMeshVertex *)vertices
                                   faceCount:(NSUInteger)faceCount
                                       faces:(CAMeshFace *)faces
                          depthNormalization:(NSString *)depthNormalization;

This method clearly describes the basic components of a CAMeshTransform – vertices, faces, and a string describing depth normalization. We will tackle these components one by one.

Disclaimer: unfortunately, the names of fields inside a struct are lost during compilation so I had to come up with reasonable descriptions on my own. While the original names are most likely different, their intention remains the same.

A Vertex

CAMeshVertex is a simple struct with two fields:

typedef struct CAMeshVertex {
    CGPoint from;
    CAPoint3D to;
} CAMeshVertex;

CAPoint3D is very similar to a regular CGPoint – it’s trivially extended by the missing z coordinate:

typedef struct CAPoint3D {
    CGFloat x;
    CGFloat y;
    CGFloat z;
} CAPoint3D;

With that in mind the purpose of a CAMeshVertex is easily inferred: it describes the mapping between the flat point on the surface of a layer and the transformed point located in a 3D space. CAMeshVertex defines the following action: “take this point from the layer and move it to that position”. Since CAPoint3D field has x, y, and z components, a mesh transform doesn’t have to be flat:

A vertex moves from a 2D point on a layer to a 3D point in a space

A Face

CAMeshFace is simple as well:

typedef struct CAMeshFace {
    unsigned int indices[4];
    float w[4];
} CAMeshFace;

The indices array describes which four vertices a face is spanned on. Since a CAMeshTransform is defined by an array of vertices, a CAMeshFace can reference the vertices it’s built with by their indices in the vertices array. This is a standard paradigm in computer graphics and it’s very convenient – many faces can point at the same vertex, as the sketch below the figure shows. This not only removes the problem of data duplication, but also makes it easy to continuously modify the shape of all the attached faces:

Faces are defined by their vertices
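
For instance, two quads sharing an edge could reference their common vertices like this – a sketch using the struct declared above, with indices in the top-left, top-right, bottom-right, bottom-left order that the example later in this article uses:

// two faces of a 2x3 grid of vertices, sharing vertices 1 and 4
CAMeshFace leftFace  = {.indices = {0, 1, 4, 3}};
CAMeshFace rightFace = {.indices = {1, 2, 5, 4}};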

As for the w field of CAMeshFace, we’ll temporarily postpone its discussion.

Coordinates

With an overview of the vertices and faces at hand, it’s still not obvious what values we should put inside a CAMeshVertex. While the vast majority of CALayer’s properties are defined in points, there are a few that make use of unit coordinates, the anchorPoint being probably the most popular one. CAMeshVertex makes use of unit coordinates as well. A from point of {0.0, 0.0} corresponds to the top left corner of the layer and a {1.0, 1.0} point corresponds to the bottom right corner of the layer. The to point uses exactly the same coordinate system:

Vertices are defined in unit coordinates

The reason for using unit coordinates is introduced in Core Animation Programming Guide:

Unit coordinates are used when the value should not be tied to screen coordinates because it is relative to some other value.

The best thing about unit coordinates is that they’re size invariant. You can reuse the exact same mesh to transform both small and large views and it will all work just fine. I believe this was the main reason for choosing them as units of CAMeshTransform.
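
For example, a vertex that keeps the center of a layer in place but pushes it along the z axis could look like this, regardless of the layer’s size:

// maps the center of the layer to the same spot, lifted along the z axis
CAMeshVertex centerVertex = {
    .from = {0.5, 0.5},      // center of the layer in unit coordinates
    .to   = {0.5, 0.5, 0.25} // same x and y, a quarter of a depth unit in z
};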

Modifying Mesh Transforms

One of the drawbacks of creating a regular CAMeshTransform is that it’s immutable and all the vertices and faces have to be defined before the transform is created. Thankfully, a mutable subclass named CAMutableMeshTransform also exists and this one does allow adding, removing and replacing vertices and faces at any time.

Both the mutable and the immutable mesh transform have a subdivisionSteps property that describes how many mesh subdivisions should be performed by the framework when the layer gets rendered on screen. The number of splits grows exponentially, so setting the property to 3 will divide each edge into 8 pieces. The default value of subdivisionSteps is −1 and it usually makes the meshes look smooth. I assume it tries to automatically adjust the number of subdivisions to make the final result look good.

What’s not obvious is that for non-zero values of subdivisionSteps the generated mesh doesn’t go through all of its vertices:

Shape of subdivided mesh vs its vertices

In fact, the vertices are control points of a surface and by observing how they influence the shape I suspect CAMeshTransform actually defines a cubic NURBS surface. Here’s where the w field of CAMeshFace comes back. Setting a value at one of the four indices of the w array seems to influence the weight of the corresponding vertex. The factor doesn’t seem to be the weight as defined in the NURBS equation. Unfortunately, I couldn’t force myself to get through literally hundreds of lines of floating point assembly to figure out what’s going on.

Even though NURBS surfaces are extremely powerful, the fact that they don’t go through the defining vertices is quite a kicker. When I was designing my meshes I wanted to have total control over what’s going on and how the generated mesh looked, so I usually set the subdivisionSteps property to 0.

Applying Mesh Transform

On its own CAMeshTransform is of little use, but it can be easily assigned to a private property of a CALayer:

@property (copy) CAMeshTransform *meshTransform;

The following piece of code creates a wavy mesh transform. It’s excessively verbose for the purpose of demonstrating how all the pieces fit together. With a bunch of convenience methods, the same effect can be created within literally a few lines of code.

- (CAMeshTransform *)wavyTransform
{
    const float Waves = 3.0;
    const float Amplitude = 0.15;
    const float DistanceShrink = 0.3;
    
    const int Columns = 40;
    
    CAMutableMeshTransform *transform = [CAMutableMeshTransform meshTransform];
    for (int i = 0; i <= Columns; i++) {
        
        float t = (float)i / Columns;
        float sine = sin(t * M_PI * Waves);
        
        CAMeshVertex topVertex = {
            .from = {t, 0.0},
            .to   = {t, Amplitude * sine * sine + DistanceShrink * t, 0.0}
        };
        CAMeshVertex bottomVertex = {
            .from = {t, 1.0},
            .to   = {t, 1.0 - Amplitude + Amplitude * sine * sine - DistanceShrink * t, 0.0}
        };
        
        [transform addVertex:topVertex];
        [transform addVertex:bottomVertex];
    }
    
    for (int i = 0; i < Columns; i++) {
        unsigned int topLeft     = 2 * i + 0;
        unsigned int topRight    = 2 * i + 2;
        unsigned int bottomRight = 2 * i + 3;
        unsigned int bottomLeft  = 2 * i + 1;
        
        [transform addFace:(CAMeshFace){.indices = {topLeft, topRight, bottomRight, bottomLeft}}];
    }
    
    transform.subdivisionSteps = 0;
    
    return transform;
}

Here’s a UILabel with a mesh transform applied:

Mesh-transformed UILabel

It’s worth pointing out that you might often see different results depending on whether the app is run on a simulator or on a device. Apparently, the iOS simulator version of Core Animation uses a software renderer for its 3D stuff and it’s a different software renderer than the one used for OpenGL ES. This is especially visible with patterned textures.

Leaky Abstractions

When you look closer at the mesh-transformed UILabel on a retina device, you’ll notice that its text quality is subpar. It turns out it can be easily improved with a single property assignment:

label.layer.rasterizationScale = [UIScreen mainScreen].scale;

This is a giveaway of how it all might work under the hood. The contents of the layer and all its sublayers get rasterized into a single texture that later gets applied to the vertex mesh. In theory, the rasterization process could be avoided by generating the correct meshes for all the sublayers of the transformed layer, making them perfectly overlap their respective superlayers. In the general case, however, vertices of the sublayers would be placed in-between vertices of the parent layer which would surely cause nasty z-fighting. Rasterization looks like a good solution.

The other problem I noticed has its roots in hardware. CAMeshTransform provides a nice abstraction of a face which is nothing more than a quadrilateral, also known as a quad. However, modern GPUs are only interested in rendering triangles. Before it is sent to the GPU a quad has to be split into two triangles. This process can be done in two distinct ways by choosing either diagonal as a separating edge:

Two different divisions of the same quad into triangles

It might not seem like a big deal, but performing a seemingly similar transform can produce vastly different results:

Symmetrical meshes, asymmetrical results

Notice that the shapes of the mesh transforms are perfectly symmetrical, yet the result of their action is not. In the left mesh only one of the triangles actually gets transformed. In the right mesh both triangles do. It shouldn’t be hard to guess which diagonal Core Animation uses for its quad subdivision. Note that the effect will also happen for the exact same meshes if you change the order of indices inside their respective faces.

Even though the small issues caused by rasterization and triangulation are leaky abstractions and can’t be completely ignored, they seem to be the only viable solutions to the complexity they try to mask.

Adding Depth

The unit coordinates are a neat idea and they work great for both width and height. However, we don’t have any way to define the third dimension – the size field of a layer’s bounds has merely two dimensions. One unit of width is equal to bounds.size.width points and height works correspondingly. How can one specify how many points one unit of depth has? The authors of Core Animation have solved this problem in a very simple but surprisingly effective way.

The depthNormalization property of CAMeshTransform is an NSString that can legally be set to one of the six following constants:

extern NSString * const kCADepthNormalizationNone;
extern NSString * const kCADepthNormalizationWidth;
extern NSString * const kCADepthNormalizationHeight;
extern NSString * const kCADepthNormalizationMin;
extern NSString * const kCADepthNormalizationMax;
extern NSString * const kCADepthNormalizationAverage;

Here’s the trick: CAMeshTransform evaluates the depth normalization as a function of the other two dimensions. The constant names are self-explanatory, but let’s go through a quick example. Let’s assume depthNormalization is set to kCADepthNormalizationAverage and the layer’s bounds are equal to CGRectMake(0.0, 0.0, 100.0, 200.0). Since we picked the average normalization, one unit of depth will map to 150.0 points. A CAMeshVertex with to coordinates of {1.0, 0.5, 1.5} will map to a 3D point with coordinates equal to {100.0, 100.0, 225.0}:

Converting from units to points
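
Here’s a sketch of that unit-to-points conversion with the normalization factor computed by hand, just to make the arithmetic explicit:

// bounds of 100x200 points with kCADepthNormalizationAverage
CGFloat width  = 100.0;
CGFloat height = 200.0;
CGFloat depth  = (width + height) / 2.0; // average normalization: 150 points

CAPoint3D to     = {1.0, 0.5, 1.5};      // unit coordinates of the vertex
CAPoint3D points = {to.x * width,        // 100.0
                    to.y * height,       // 100.0
                    to.z * depth};       // 225.0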

Why go through the trouble of converting unit coordinates to points? It’s because of the transform property of a CALayer and its type – CATransform3D. Components of CATransform3D are defined in terms of points. You can actually apply any transform to the layer itself and it will influence its vertices as well. The z coordinate translation and a perspective transform come to mind as major beneficiaries of this feature.

At this point we could create another example, this time with depthNormalization not equal to the default kCADepthNormalizationNone. The results would be quite disappointing – everything would look flat. The depth added by non-zero z coordinates of vertices is very unconvincing. We can skip this step altogether and add the missing component that would emphasize the slopes and curvatures of the mesh – shading.

Meeting Prometheus

Since we’ve already opened Pandora’s box of private Core Animation classes, we might as well use another one. At this point it should come as no surprise that a class named CALight exists and it’s actually very useful since CALayer has a private, NSArray-typed lights property.

A CALight is created with + (id)lightWithType:(NSString *)lightType convenience method and the lightType argument can be one of the following four values:

extern NSString * const kCALightTypeAmbient;
extern NSString * const kCALightTypeDirectional;
extern NSString * const kCALightTypePoint;
extern NSString * const kCALightTypeSpot;

I’m not going to discuss CALight in detail, so let’s jump straight to an example. This time we’re going to use two hand-made CAMutableMeshTransform convenience methods. The first one, identityMeshTransformWithNumberOfRows:numberOfColumns:, creates a mesh with uniformly spread vertices that don’t introduce any disturbances. Then we’re going to modify those vertices with the mapVerticesUsingBlock: method that maps all vertices to some other vertices.

CALight *light = [CALight new]; // directional light by default
[label.superview.layer setLights:@[light]]; // has to be applied to superlayer

CAMutableMeshTransform *meshTransform = [CAMutableMeshTransform identityMeshTransformWithNumberOfRows:50 numberOfColumns:50];
[meshTransform mapVerticesUsingBlock:^CAMeshVertex(CAMeshVertex vertex, NSUInteger vertexIndex) {
    float x = vertex.from.x - 0.5f;
    float y = vertex.from.y - 0.5f;
    
    float r = sqrtf(x * x + y * y);
    
    vertex.to.z = 0.05 * sinf(r * 2.0 * M_PI * 4.0);
    
    return vertex;
}];
label.layer.meshTransform = meshTransform;

CATransform3D transform = CATransform3DMakeRotation(-M_PI_4, 1, 0, 0);
transform.m34 = -1.0/800.0; // some perspective
label.layer.transform = transform;

And here’s the result of applying the code to a square UILabel:

CALight, CAMeshTransform, and CATransform3D all working together

While the lighting looks a little bit cheesy, it certainly is impressive how easy it is to do quite complex effects.

CALight seems to have tunable ambient, diffuse and specular intensities – a standard set of coefficients of Phong reflection model. Moreover, CALayer has corresponding surface reflectance properties. I played with these for a few minutes and I didn’t really get anywhere, but I cleaned-up the private headers so it should be much easier to test the lighting capabilities of Core Animation.

Private for a Reason

One of the most important reasons for keeping an API private is that it doesn’t have to be bullet proof and CAMeshTransform certainly is not. There are a few ways to get hurt.

To begin with, assigning 20 to subdivisionSteps property is probably the easiest way to programmatically reboot your device. A set of memory warnings spilled into console is a clear indication of what’s going on. This is certainly annoying, but can be easily avoided – don’t touch the property or set it to 0.

If one of the faces you provide is degenerate, e.g. all of its indices point to the same vertex, you will hang your device. Everything will stop working, including the hardware buttons (!), and only a hard restart will help (long press home + power buttons). The framework doesn’t seem to be prepared for malformed input.

Why do these problems happen? It’s because of backboardd – a process that is, among other activities, acting as a render server for Core Animation. Technically, it’s not the app itself that makes the system crash; it’s the indirect misuse of one of the core components of iOS that causes all the troubles.

Missing Features

The idea of a general purpose mesh-transformable layer is complex enough that the Core Animation team had to cut some corners and skip some of the potential features.

Core Animation allows mesh-transformed layers to have an alpha channel. Rendering semitransparent objects correctly is not a trivial problem. It’s usually done with a Painter’s algorithm. The z-sorting step is not hard to implement and indeed the code does seem to execute a radix sort call which is quite clever, since floats can be sorted with radix sort as well. However, it’s not enough to sort the triangles as some of them may overlap or intersect.

The usual solution to this problem is to divide the triangles so that all the edge cases are removed. This part of the algorithm doesn’t seem to be implemented. Granted, correct and good-looking meshes should rarely overlap in a tricky way, but sometimes it does happen and the mesh-transformed layer may look glitchy.

Another feature that’s been completely ignored is hit testing – the layer behaves as if it hadn’t been mesh-transformed at all. Since neither CALayer’s nor UIView’s hitTest: methods are aware of the mesh, the hit test area of all the controls will rarely match their visual representation:

Hit test area of an embedded UISwitch is not affected by a mesh transform

The solution to this problem would be to shoot a ray through the space, figure out which triangle has been hit, project the hit point from the 3D space back into the 2D space of the layer and then do the regular hit testing. Doable, but not easy.

Replacing Private API

Taking into account all the drawbacks of CAMeshTransform one could argue it’s a faulty product. It’s not. It’s just amazing. It opens up an entire new spectrum of interaction and animation on iOS. It’s a breath of fresh air in a world of plain old transforms, fade-ins, and blurs. I badly wanted to mesh-transform everything, but I can’t consciously rely on that many lines of private API calls. So I wrote an open-sourced and very closely matching replacement.

In the spirit of CAMeshTransform I created a BCMeshTransform which copies almost every feature of the original class. My intention was clear: if CAMeshTransform ever ships, you should be able to use the exact same mesh transforms on any CALayer and achieve extremely similar, if not exact, results. The only required step would be to find and replace BC class prefix with CA.

With a transform class in hand the only thing that’s missing is a target of a mesh transform. For this purpose I created a BCMeshTransformView, a UIView subclass that has a meshTransform property.

Without direct, public access to Core Animation render server I was forced to use OpenGL for my implementation. This is not a perfect solution as it introduces some drawbacks the original class didn’t have, but it seems to be the only currently available option.

A Few Tricks

When I was creating the classes I encountered a few challenges and it probably won’t hurt to discuss my solutions to these problems.

Animating with UIView Animation Block

It turns out it’s not that hard to write a custom animatable property of any class. David Rönnqvist has pointed out in his presentation on UIView animations that a CALayer asks its delegate (a UIView owning the layer) for an action when any of its animatable properties is set.

If we’re inside an animation block then the UIView will return an animation as a result of the actionForKey: method call. With a CAAnimation in hand we can check its properties to figure out what parameters the block-based animation has.

My initial implementation looked like this:

- (void)setMeshTransform:(BCMeshTransform *)meshTransform
{
    CABasicAnimation *animation = [self actionForLayer:self.layer forKey:@"opacity"];
    if ([animation isKindOfClass:[CABasicAnimation class]]) {
    	// we're inside an animation block
    	NSTimeInterval duration = animation.duration;
    	...
    }
    ...
}

I quickly realized it was an invalid approach – the completion callback did not fire. When a block-based animation is made, UIKit creates an instance of UIViewAnimationState and sets it as a delegate of any CAAnimation created within the block. What I suspect also happens is that UIViewAnimationState waits for all the animations it owns to finish or get cancelled before firing the completion block. Since I was obtaining the animation just for the purpose of reading its properties, it was never added to any layer and thus it never finished.

The solution for this problem was much less complicated than I expected. I added a dummy view as a subview of BCMeshTransformView itself. Here’s the code I’m currently using:

- (void)setMeshTransform:(BCMeshTransform *)meshTransform
{
    [self.dummyAnimationView.layer removeAllAnimations];
    self.dummyAnimationView.layer.opacity = 1.0;
    self.dummyAnimationView.layer.opacity = 0.0;
    CAAnimation *animation = [self.dummyAnimationView.layer animationForKey:@"opacity"];
    
    if ([animation isKindOfClass:[CABasicAnimation class]]) {
    	// we're inside UIView animation block
    }
    ...
}

The double opacity assignment is needed to ensure the property changes its value. The animation will not be added to a layer if the layer is already in the destination state. Moreover, a layer has to be in the view hierarchy of some UIWindow, otherwise its properties won’t get animated.

As for animating the meshes themselves, it’s possible to force Core Animation to interpolate any float values by packing them in NSNumbers, shoving them into an NSArray, implementing the needsDisplayForKey: class method, and responding to presentation layer changes inside the setValue:forKey: method. While very convenient, this approach has some serious performance issues. Meshes with 25x25 faces were not animated at 60 FPS, even on the iPad Air. The cost of packing and unpacking is very high.

Instead of pursuing the Core Animation way, I used a very simple animation engine powered by CADisplayLink. This approach is much more performant, handling 100x100 faces at a butter smooth 60 FPS. It’s not a perfect solution – we’re losing many conveniences of CAAnimation – but I believe the 16x speed boost is worth the trouble.

Rendering Content

The core purpose of BCMeshTransformView is to display its mesh-transformed subviews. The view hierarchy has to be rendered into a texture before it’s submitted to OpenGL. The textured vertex mesh then gets displayed by a GLKView, which is the main workhorse of BCMeshTransformView. This high level overview is straightforward, but it doesn’t mention the problem of snapshotting the subview hierarchy.

We don’t want to snapshot the GLKView itself as this would quickly create a mirror-tunnel like effect. On top of that, we don’t want to display the other subviews directly – they’re supposed to be visible inside the OpenGL world, not within the UIKit view hierarchy. They can’t be put beneath the GLKView as it has to be non-opaque. To solve these issues I came up with the concept of a contentView, similar to how UITableViewCell handles its user defined subviews. Here’s how the view hierarchy looks:

The view hierarchy of BCMeshTransformView

The contentView is embedded inside a containerView. The containerView has a frame of CGRectZero and clipsToBounds property set to YES, making it invisible to the user but still within the reach of BCMeshTransformView. Every subview that should get mesh-transformed must be added to contentView.

The content of the contentView is rendered into a texture using drawViewHierarchyInRect:afterScreenUpdates:. The entire process of snapshotting and uploading the texture is very fast, but unfortunately for larger views it takes more than 16 milliseconds. This is too much to re-render the hierarchy on every frame. Even though BCMeshTransformView automatically observes changes to its contentView subviews and re-renders the texture on its own, it doesn’t support animations inside the meshed subviews.

Final Words

Without a doubt, a mesh transform is a fantastic concept, yet it seems so unexplored in the world of interfaces. It certainly adds more playfulness to otherwise dull screens. In fact, you can experience mesh transforms today, on your iOS device, by launching Game Center and watching the subtle deformations of bubbles. This is CAMeshTransform working its magic.

I encourage you to check out the demo app I made for BCMeshTransformView. It contains a few ideas of how a mesh transform can be used to enrich interaction, like my very simple, but functional take on that famous Dribbble. For more inspiration on some sweet meshes, Experiments by Marcus Eckert is a great place to start.

I wholeheartedly hope BCMeshTransformView becomes obsolete on the first day of WWDC 2014. The Core Animation implementation of mesh transforms has more features and is more tightly integrated with the system. Although it currently doesn’t handle all the edge cases correctly, with a bit of polishing it surely could. Fingers crossed for June 2.