Saturday, July 31, 2010

Small Comparison Optimization = Big Gains

This is a small and pretty well-known trick, but it's worth mentioning since a lot of high-level coders don't seem to realize it, and it can result in some big speedups.

Let's take a look at a random code snippet:

#include <iostream>
using namespace std;

int main() {
    const double len = 10000;
    const double fract = 1.0/3.0;
    for(double i = 1; i < len; i++) {
        for(double j = 1; ; j++) {
            if (j/i < fract) {
                cout << j << " / " << i << endl;
            }
            else break;
        }
    }
}


What this code does is print out every fraction j/i that is less than 1/3, for all denominators i less than 10,000.

Notice anything that we can optimize?

Well if you take a look at the if-statement, notice we are dividing j by i, and then doing a comparison. But as we should know by now, division is a slow operation.

On Core 2 Duos, double floating-point division has a 6~32 clock cycle latency (exact latency depends on the operand values), whereas multiplication is only 5 cycles, and add/sub are 3 cycles.

So it would be nice if we could omit that division from the comparison... oh wait, we can!

Using simple mathematics (and the fact that i is always positive here, so multiplying both sides doesn't flip the inequality), we can do:

j/i < fract =>
i * (j/i) < fract * i =>
j < fract * i


Now we have an equivalent comparison, but we were able to remove the division completely!
So in the worst case, where the division was taking ~32 clock cycles, this new version is ~6 times faster.

Now let's see this code in action.
Here is the same problem as above, except we've now increased 'len' to 100,000, and instead of printing out the results we just count the fractions (this way we're not limited by console-write speed):

#include <iostream>
using namespace std;

int main() {
    const double len = 100000;
    const double fract = 1.0/3.0;
    int ret = 0;
    for(double i = 1; i < len; i++) {
        for(double j = 1; ; j++) {
            if (j < fract * i) {
                ret++;
            }
            else break;
        }
    }
    cout << ret << endl;
}



When I ran the above code using "if (j/i < fract)", it took ~19 seconds to finish on my PC.
When I ran the optimized version using "if (j < fract*i)", it took ~2 seconds to finish executing! That's an even bigger speedup than predicted.
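
If you want to reproduce the measurement yourself, here's a minimal timing harness (just a sketch using clock() from <ctime>; swap the comparison in the inner loop to time the slow version):

#include <iostream>
#include <ctime>
using namespace std;

int main() {
    const double len = 100000;
    const double fract = 1.0/3.0;
    int ret = 0;
    clock_t start = clock(); // timestamp before the work
    for(double i = 1; i < len; i++) {
        for(double j = 1; ; j++) {
            if (j < fract * i) ret++; // change to "j/i < fract" to time the slow version
            else break;
        }
    }
    double secs = double(clock() - start) / CLOCKS_PER_SEC;
    cout << ret << " (" << secs << "s)" << endl;
}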

I hope this shows why optimization is important.

One last thing to note:
When using integers, you might not always want to do this trick.
j/i > x is not necessarily the same as j > x*i when using integers, since the division truncates the result (obviously, since integers can't hold the fractional part).
For example:

#include <iostream>
using namespace std;

int main() {
    int i = 21;
    int j = 2;

    if (i/j > 10) cout << "First one" << endl;
    if (i > 10*j) cout << "Second one" << endl;
}


The above example only prints out "Second one", since we're using integers (if i and j were doubles, it would print both "First one" and "Second one").

This isn't really a downside, since when using integers you probably want to use the second method anyway, so you don't have to deal with converting to floats; in that case you not only save time by not doing a division, but also by not doing an int-to-float conversion.

Wednesday, July 21, 2010

Recursion is not always good!

In my university's computer science curriculum, they seem to put a big emphasis on using recursion to solve things, as opposed to alternative methods (such as loops).

It is indeed true that recursion can simplify high-level code, and many times it can look beautiful. There is of course a price to pay with recursion: the extra overhead of the recursive calls (therefore recursive versions are usually slower than the alternative iterative/loop versions).

I will stress that learning to think recursively is indeed a good skill to have, and any good programmer should be able to do so. So in that sense, it is good that computer science curriculums stress recursion; however, once you have learned to think recursively, it is also beneficial to start thinking non-recursively when algorithms are simple enough to allow it.

Let's consider one of the most common examples of recursion used, the factorial function! (I have used this function many times in my previous articles to demonstrate things, so shouldn't be a surprise :D)


#include <iostream>
using namespace std;

template<typename T>
T factorial(T n) {
    return n <= 1 ? 1 : n * factorial(n-1);
}

int main() {
    int n = factorial(5);
    cout << n << endl; // Prints "120"
    return 0;
}


This does end up looking pretty clean, as we can do the whole factorial function in one line.

Now let's take a look at the iterative/loop version:

#include <iostream>
using namespace std;

template<typename T>
T factorial(T n) {
    T ret = n ? n : (n=1); // if n is 0, treat it as 1 (0! == 1)
    while(--n) ret *= n;
    return ret;
}

int main() {
    int n = factorial(5);
    cout << n << endl; // Prints "120"
    return 0;
}


Now this iterative version is not as 'beautiful' as the recursive version, because as you can see it's 3 lines instead of 1.

However, the generated code for this version ends up being nicer than for the recursive version.

When you make recursive calls, the arguments passed to the recursive function need to be pushed on the stack, and eventually popped back off as well. And not only the arguments: the return address needs to be pushed for every call too. Sometimes the recursion can be nested so deep that you run out of stack and get a stack-overflow crash. This extra overhead is the downside of recursion.

The iterative/loop version doesn't have such problems. No function calls are made, so you don't pay the function-call overhead; loop versions usually compile down to a simple short jump.

Another thing is that the loop version is inline-able, whereas the recursive version likely will not be inlined; and even if it does get inlined, probably only the first call will be.

The compiler could potentially turn a recursive call into an iterative/loop version, but don't count on this. At least, I haven't seen it happen...
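
One exception worth knowing: compilers have a much easier time when the recursion is a tail call, i.e. when the recursive call is the very last thing the function does. Here's a sketch of a tail-recursive factorial that carries the partial result in an extra accumulator parameter (the 'acc' parameter is my addition for illustration); GCC can often turn this form into a plain loop:

template<typename T>
T factorial(T n, T acc = 1) {
    // the recursive call is in tail position, so it can become a jump
    return n <= 1 ? acc : factorial(n - 1, acc * n);
}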


Now, there are some cases where an algorithm is naturally recursive, and it ends up being difficult or impractical to code an iterative version. In those cases recursion is the better way to go (and there the algorithm can even end up faster with recursion than with iteration). So in the end, it really depends on the situation; the more experience you get, the better you'll know when to use recursion vs iteration.


Notes:
If you noticed, I was using template functions for the factorial function. The reason I made it a template is that factorial is something you may want to reuse with different data types (int, unsigned int, long long int, float, double, some custom BigInteger class, etc...).
Using templates, we only have to code one function that can be reused with all these data types.

Monday, July 19, 2010

Max/Min Values for Signed and Unsigned Ints

There are a variety of ways you can assign the max/min values to a signed/unsigned integer.

A very professional-looking way is using the numeric_limits class, like so:

#include <limits>
using namespace std;

int main() {
    int x = numeric_limits<int>::max();
    int y = numeric_limits<int>::min();
    return 0;
}


But there are many other ways to do this; they all take a lot less typing than numeric_limits, and you don't have to include any extra headers...

Let's first take a look at the max/min values for the different integer types (in hexadecimal representation, because it's easier to memorize):

// u8, u16, u32... mean unsigned int of 8, 16, and 32 bits respectively
// s8, s16, s32... mean signed int of 8, 16, and 32 bits respectively
------------------------------------
| type | max        | min        |
------------------------------------
| u8   | 0xff       | 0x0        |
| u16  | 0xffff     | 0x0        |
| u32  | 0xffffffff | 0x0        |
------------------------------------
| s8   | 0x7f       | 0x80       |
| s16  | 0x7fff     | 0x8000     |
| s32  | 0x7fffffff | 0x80000000 |
------------------------------------


The table is pretty easy to memorize. For unsigned ints, the max value is always a bunch of f's in hexadecimal, and the min value is always 0.
For signed ints, the max value is always 0x7f followed by a bunch of f's.
And lastly, the min value for signed ints is always 0x80 followed by a bunch of 0's.


One thing to notice is that the bitwise representation of "-1" always has all bits set.
That means that for a 32-bit integer, "-1" is "0xffffffff", which is the max value for an unsigned integer.

This means that for unsigned integers, we can declare the max/min like this:

int main() {
    unsigned int x = -1u; // Unsigned int max
    unsigned int y = 0;   // Unsigned int min
    return 0;
}


In c++, you can put the suffix "u" at the end of an int literal to mean it's unsigned. So what that code is doing is treating the value "-1" as an unsigned int, which means it assigns 0xffffffff to the variable 'x'.

Here's the thing: I've noticed that some compilers generate warnings when you treat a negative number as unsigned in operations, so instead of writing '-1u' we can notice that negative one is really the same as NOT 0, and write '~0u' instead. This does the same thing, and is generally better since it won't generate those compiler warnings.

So it's nicer to do:

int main() {
    unsigned int x = ~0u; // Unsigned int max
    unsigned int y = 0;   // Unsigned int min
    return 0;
}


Now that we know this trick for unsigned ints, let's think about signed integers.
The max value for 32-bit signed integers is 0x7fffffff.
Notice, though, that 0x7fffffff is really (0xffffffff >> 1).
That is, signed int max is equal to unsigned int max shifted right by one.
And since we already learned how to represent unsigned int max easily, we can use that to our advantage:

int main() {
    int x = ~0u>>1; // Signed int max
    return 0;
}


The last thing to notice for signed ints is that the minimum value for 32-bit signed ints in hex is 0x80000000, and it just so happens that that number is signed int max + 1, that is, 0x7fffffff + 1.

So we can end up doing this:

int main() {
    int x = ~0u>>1;     // Signed int max
    int y = (~0u>>1)+1; // Signed int min
    return 0;
}
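
If you want to convince yourself these expressions really match numeric_limits, here's a quick sanity-check sketch (the last line assumes two's complement, like the rest of this post):

#include <limits>
#include <cassert>
using namespace std;

int main() {
    assert(~0u == numeric_limits<unsigned int>::max());
    assert((int)(~0u >> 1) == numeric_limits<int>::max());
    assert((int)((~0u >> 1) + 1) == numeric_limits<int>::min());
    return 0;
}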


That about does it for integers; for floats it's not as simple, so I didn't talk about them here. Maybe I'll end up making another blog post about the different floating-point representations...

Notes:
When doing the '-1' or '~0' tricks where you shift the value, be sure to do the shift as an unsigned operation! That is, either do ~0u>>1 or (unsigned int)~0>>1; DO NOT do ~0>>1. That would do an arithmetic shift instead of a logical shift, and make the result -1 instead of 0x7fffffff.

And in case you're wondering, these tricks are not 'slower' than typing in the numbers manually. C++ compilers do something called constant propagation and constant folding, which converts expressions like (-1u>>1)+1 into a constant at compile time, so there's no extra overhead compared to typing 0x80000000...

Also, this will not apply to most people, but another thing to note is that some very old/weird hardware can use one's complement integers (instead of two's complement) or some other funky stuff; when dealing with weird hardware it's best to use the numeric_limits class instead of bitwise tricks, because numeric_limits assigns its values based on the specific platform. That said, I don't know of any current hardware that uses one's complement for signed integers, so these tricks are almost always safe :p

Thursday, July 15, 2010

C++0x reusing same keywords and symbols for entirely different meanings

One thing especially bad about c++0x is that it gives different meanings to already-known keywords and symbols.

For example, the square brackets "[]" can already be used to define arrays and index into them, but now in c++0x they also denote a lambda's capture list.

But an even more ludicrous example is the keyword "auto".

"auto" in c++0x can be used to infer the type of an identifier, but it just so happens that "auto" already has a meaning in c++.

auto is used to denote "automatic storage", which essentially means stack storage for variables that weren't declared globally.
So you can do:
auto int i = 0;


The reason you don't ever see code explicitly using auto is that automatic storage is already the default, so it's rarely spelled out.

This would seem to mean that in a c++0x function you should be able to do this:
auto auto i = 0;


I tested this with GCC 4.5.0, and got a compiler error :/
In fact, GCC 4.5.0 with c++0x extensions enabled doesn't even compile "auto int i = 0;"...

As far as I can tell, this isn't actually a GCC bug: c++0x resolves the clash by removing the old storage-class meaning of "auto" altogether, so the old usage simply no longer compiles. Still, this shows the kind of problems that reusing keywords and symbols can cause.

Also, needless to say, it makes things a lot more confusing for beginners, who have to untangle the different meanings of the same words...

Wednesday, July 14, 2010

String Literal with Array Subscript Trick

This is a cool trick I just found out recently.

If you have a string literal such as "abcdef", you can treat it as an array and use the '[]' operator to get a character out of it.

For example:


#include <iostream>
using namespace std;

int main() {
    for (int i = 9; i >= 0; i--) {
        cout << "abcdefghij"[i];
    }
    cout << endl;
}


That will print out "jihgfedcba".


Now knowing this trick (plus some other minor changes), we can shave a few lines off the toHexString() function from my previous article.


#include <string>
using namespace std;

typedef unsigned int u32;

string toHexString(u32 value) {
    char str9[9] = {'0'}; // '0' char followed by null bytes
    for (int i = 7, pos = 0; i >= 0; i--) {
        char dig = (value>>(i*4))&0xf; // extract each hex digit, most-significant first
        if (!dig && !pos) continue;    // skip leading zeros
        str9[pos++] = "0123456789abcdef"[dig];
    }
    return string(str9);
}

Multi-Dimensional Array Optimization

When using multidimensional arrays in c++ (or most languages for that matter), it is better to use powers of 2 for all dimensions except the first (the first dimension's size never takes part in the address calculation, so it doesn't matter).

For example, take the following multi-dimensional array:
int a[3][3][3];

It will be faster to reference the elements in the following array instead:
int a[3][4][4];

The reason is that the compiler can use shifts when calculating the address to read from, as opposed to using multiplication operations. On x86 processors, this shifting by powers of 2 could even be implicit as part of the memory addressing modes which allow a shift parameter in their encoding.
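
To see the difference, here's roughly the index arithmetic the compiler has to generate in each case (a sketch; real codegen folds much of this into the addressing mode):

// int a[3][3][3]: inner dimensions of 3 need real multiplies
//   &a[i][j][k] == &a[0][0][0] + (i*9 + j*3 + k)

// int a[3][4][4]: inner dimensions of 4 reduce to shifts
//   &a[i][j][k] == &a[0][0][0] + ((i<<4) + (j<<2) + k)  // i*16 + j*4 + k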

So if you need speed, it can be beneficial to use power-of-2 dimensions even when you don't actually reference the extra memory. This may require benchmarking, though: if you grow the array with too much unused data, the code can actually get slower due to less efficient cache usage from the extra unused memory. (In the case where all the memory will be used anyway, it's a win-win situation.)

In some cases you don't even need to increase the array dimensions to benefit from this knowledge. If you have an array "int arr[4][7];", you know it will be more efficient to swap the dimensions and access it the other way around: "int arr[7][4];".

Reading Hexadecimal Numbers and Understanding Binary Numbers

This post is aimed at beginners trying to understand hexadecimal numbers, and why programmers often use them instead of base-10 decimal.

Computers internally deal with logic circuits, and so we have the concept of a 'bit', which represents either '1' or '0', 'on' or 'off', etc...

You can put 8 bits together and form a byte. These 8 bits can then be interpreted as a binary string representing a number.
The first bit, called the least-significant bit, represents 2^0 (0 or 1).
The second bit represents 2^1 (0 or 2).
The third bit represents 2^2 (0 or 4), etc...

So in one byte, you have 8-bits that represent this:

b7,  b6,  b5,  b4,  b3,  b2,  b1,  b0
2^7, 2^6, 2^5, 2^4, 2^3, 2^2, 2^1, 2^0


All these bit-values are then added together to give you the value of the number.

For example, '4' is represented as "00000100"; notice that b2 is set (2^2).
Likewise, '5' is "00000101"; notice that b2 and b0 are set (2^2 + 2^0 == 4 + 1 == 5).
And so on...


Now because internally numbers are represented in this format, you should understand that binary numbers are pretty important to programmers, especially low-level programmers.

But notice that it takes 8 digits to represent 1 byte, and a byte only holds one char in c++.
Usually we deal with ints, which aren't 1 byte but (usually) 4 bytes.

That means it would take 8*4 = 32 digits to write an integer in binary!
You can't expect a programmer to be writing 32-digit numbers full of 1's and 0's. There has to be a better way!

That's where hexadecimal comes in.
It just so happens that 1 digit in hexadecimal represents exactly 4 digits in binary. This is very useful, because it means that if we memorize the hexadecimal digits, we can refer to any 4 binary digits with just 1 hexadecimal digit.

The 16 digits in hexadecimal are:

Hex | Binary
--------------
0 | 0000
1 | 0001
2 | 0010
3 | 0011
4 | 0100
5 | 0101
6 | 0110
7 | 0111
8 | 1000
9 | 1001
a | 1010
b | 1011
c | 1100
d | 1101
e | 1110
f | 1111


The key to using hexadecimal fluently is memorizing this hex-to-binary table.
I'll give you some hints to remember some of the numbers:
0 is 0 in hex and in binary.
1 is 1 in hex and in binary.
1, 2, 4, and 8 only have 1-bit set (because they're powers of 2)
f is the last hex number, and has all 4-bits set (all 1's)
'a' comes after 9, which in decimal would be '10'; if you notice, 'a' represents '1010' in binary, so think of two 10's.


Anyways, it takes some time to memorize the table, but once you do, you'll be able to convert hex numbers to binary very easily.
For example:

          7      f
0x7f = | 0111 | 1111 |

          5      5
0x55 = | 0101 | 0101 |

                a      b      c      d      e      f      9      8
0xabcdef98 = | 1010 | 1011 | 1100 | 1101 | 1110 | 1111 | 1001 | 1000 |



As you can see from the last example, we were able to represent a 32-bit number with just 8 digits. Instead of typing "10101011110011011110111110011000", we just had to type "0xabcdef98".

Oh, and just so you know, c++ uses the prefix "0x" to refer to hexadecimal numbers.
Sometimes people use "$" to refer to hex, or sometimes they just use "x" as the prefix.
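
By the way, if you want to play around with this, c++ iostreams can print a value in hex for you:

#include <iostream>
using namespace std;

int main() {
    unsigned int x = 0xabcdef98; // a hex literal, using the "0x" prefix
    cout << hex << x << endl;    // prints "abcdef98"
    cout << dec << x << endl;    // prints "2882400152"
    return 0;
}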

Also, I should note that the reason you can't use normal base-10 decimal numbers to refer to binary this easily is that a single decimal digit does not correspond to an exact number of binary digits: 3 binary digits per digit gives you octal, and 4 gives you hex. Decimal sits in between, so it doesn't map onto binary nicely.

Anyways I hope this article helps those who were having a bit of trouble understanding hex.

Monday, July 12, 2010

Random String Function Implementations

There are a bunch of string functions out there that do almost everything you could want to do with strings.

There's so much, however, that sometimes it's simpler to implement your own version than to hunt down the appropriate function.

What if you wanted to find the size of a char-array string including its null byte?
Well, you can do this easily and elegantly:

// like strlen(), except it counts the null byte too
int strSize(const char* str) {
    int len = 0;
    while(str[len++]);
    return len;
}


Another easy thing to implement is a toLower() function which converts uppercase letters to lowercase:

void strToLower(char* str) {
    for( ; *str; str++) {
        char& c = *str;
        if (c >= 'A' && c <= 'Z') c = c - 'A' + 'a';
    }
}


A matching toUpper() function is obviously just as easy ;p


Lastly, here's a fun function I made that converts a 'u32' (unsigned 32-bit integer) to a hex string:


#include <string>
using namespace std;

typedef unsigned int u32;

string toHexString(u32 value) {
    char str9[9];
    u32 pos = 0;
    for (int i = 0; i < 8; i++) {
        char dig = (value>>((7-i)*4))&0xf; // extract each hex digit, most-significant first
        if (!dig && !pos) continue;        // skip leading zeros
        if (dig < 0xa) str9[pos] = '0' + dig;
        else           str9[pos] = 'a' + (dig-0xa);
        pos++;
    }
    if (!pos) { str9[0] = '0'; pos = 1; } // the value was 0
    str9[pos] = '\0';
    return string(str9);
}


That one is definitely more complex than the others, but it was fun to code.
Each hex digit is 4 bits, so a u32 needs a char array of at most size 9 (8 digits + 1 null byte).

Anyway, in real coding projects it's probably best to use already-defined functions that perform the string operation you want; but implementing your own can sometimes be fun, and possibly impress your professors on homework assignments.
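
And here's a quick test driver for the three functions above (a sketch assuming they're all in scope; expected outputs are in the comments):

#include <iostream>
using namespace std;

int main() {
    cout << strSize("abc") << endl;          // prints "4" (3 chars + the null byte)
    char s[] = "HeLLo WoRLD";
    strToLower(s);
    cout << s << endl;                       // prints "hello world"
    cout << toHexString(0xdeadbeef) << endl; // prints "deadbeef"
    return 0;
}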

Nested Functions in C++

Although c++ does not natively support nested functions, you can do a trick to support them using structs/classes.

In my other blog post I talked about how lambdas can be used as nested functions, and I also showed a way to do recursion with lambdas. If you have access to c++0x, the nicer way to do nested functions is lambdas; but if you don't, you can use the nested-class trick.

Although it may sound odd, you can declare a struct/class inside a function, and that struct/class can have a static method which you can then call.

Here's one version of it:

#include <iostream>
using namespace std;

int main() {
    struct factClass {
        static int factorial(int x) {
            return (x <= 1) ? 1 : x * factorial(x-1);
        }
    };

    cout << factClass::factorial(5) << endl; // Prints '120'
    return 0;
}


Okay, real quickly I want to point out that in c++, structs and classes are the same thing; the only difference is their default access.

With a struct, all the methods/variables are 'public' by default; with a class, they're 'private' by default.
You can change the behavior by specifying the "public:", "protected:", or "private:" keywords in both structs and classes.
When I first started with c++ I didn't know this, so that's why I'm mentioning it :D


Okay, now that I've got that out of the way: as you can see, the above function is pretty bloated.
We can use macros to hide some of the bloat of the class declaration, like so:


#define nested(className) struct className { static

int main() {
    nested(factClass) int factorial(int x) {
        return (x <= 1) ? 1 : x * factorial(x-1);
    }};

    cout << factClass::factorial(5) << endl; // Prints '120'
    return 0;
}


There's also another way to do this using functors.
Basically what we do is have a struct/class and then overload its "()" operator to make it look like a function call like so:


int main() {
    struct {
        int operator()(int x) {
            return (x <= 1) ? 1 : x * this->operator()(x-1);
        }
    } factorial;

    cout << factorial(5) << endl; // Prints '120'
    return 0;
}



Now that's pretty cool.
We create an anonymous struct, and 'factorial' is an instance of that struct. The struct has an overloaded '()' operator, which lets us call factorial(5) as if it were a function.

We can again use macros to hide some of the bloat, although the end result looks kind of weird:


#define nested(xReturnType) struct { xReturnType operator()

int main() {
    nested(int)(int x) {
        return (x <= 1) ? 1 : x * this->operator()(x-1);
    }} factorial;

    cout << factorial(5) << endl; // Prints '120'
    return 0;
}


The nice thing about using functors like in the last two examples is that you don't have to use a separate class name and function name (className::functionName()); you just use one name and call it with that name.

Anyways, there you go: a variety of ways to do nested functions. I would stick to lambdas unless you're not using c++0x, in which case you'll have to use one of the nested-class tricks mentioned above ;p

c++0x autos, lambdas, and lambda recursion!

C++0x introduces some new features that will probably end up making c++0x code many times more obfuscated and harder to read; however, the features can be very useful and fun to code with as well.

The first interesting feature I'll talk about is the "auto" keyword. Essentially you can use the auto keyword to hold a type that will be automatically inferred.
For example:

int a = 1; // Type is explicitly int
auto b = 1; // Type of 'int', which is inferred by the int literal
auto c = a; // Type of 'int', which is inferred by the type of a
auto d = 1.4f; // Type of 'float', which is inferred by the float literal


The auto keyword will probably be abused by c++0x coders, making code harder to follow; I imagine people will start using it for a lot of their variable declarations, making it difficult to know the true types of variables when browsing code.

The auto keyword has some great uses, but I will not explicitly talk about them here because it'll make this post too long...


The second feature I'll talk about is lambdas.
When I first saw the c++0x lambdas I thought they looked very confusing, but after playing around with them I found out they're pretty fun.

I'll use Microsoft's definition of lambda expression:
"A lambda expression is a function or subroutine without a name that can be used wherever a delegate is valid. Lambda expressions can be functions or subroutines and can be single-line or multi-line. You can pass values from the current scope to a lambda expression."

Basically a lambda allows you to create anonymous functions, and they can be defined within other c++ functions, so it also allows you to do nested functions.

They look like this:

// foo holds a lambda that has nothing in its body,
// so it does nothing...
auto foo = [](){};

// [] tells which variables in the current scope will be available to the lambda
// () is the parameter list to be passed to the lambda function
// {} is the lambda's body

// You can now call foo like so:
foo();


Okay I know this is confusing, so I'll give a concrete example:


#include <iostream>
using namespace std;

int main() {
    auto foo = [](int x) { return x * x; };
    cout << foo(2) << endl;
    return 0;
}


That prints out '4' to the console.
As you can see, we have a parameter list of "int x", and in the body of the lambda we return x*x.
Notice how you don't have to explicitly mention a return type! Having it inferred like this is also an interesting feature of c++0x.

Now, the '[]' part of the lambda is a bit tricky to understand: it lets you specify which variables from the current scope will be available to the lambda's body.

For example you can do this:


int main() {
    int x = 2;
    auto foo = [&x]() { return x * x; };
    cout << foo() << endl;
    return 0;
}


This will also print '4'.
What it's saying is that the lambda's body has access to the 'x' from the main() function.
If you modify 'x' in the lambda's body, the 'x' in the main() function will also be modified, because we're capturing the variable by reference.
If you want to capture x by value, just omit the ampersand.

If you want the lambda to have access to all local variables then you can do:

int main() {
    int x = 2;
    auto foo = [&]() { return x * x; };
    cout << foo() << endl;
    return 0;
}


Using '[&]' means you're capturing all variables in the current scope by reference.
You can also use '[=]', which captures them all by value.
You can even do something like '[&, x]', which captures everything by reference except 'x', which is captured by value.

Technically I've been assigning an identifier ('foo') to each lambda, so I haven't really been using them as anonymous functions here.

The way you would use a lambda as an anonymous function (a function without a name or identifier) is when another function takes a function object as an argument.
The typical example seen all over the web is with 'for_each'.
The parameters of for_each() are:
param1 = A starting iterator
param2 = An ending iterator
param3 = A function object

So you can either pass an already defined function as param3, or you can use lambdas to define the function in the current scope.

This example is on wikipedia, but I'll reuse it here.
Suppose you have a vector full of ints, and you want to add up all the ints together.
With for_each, you can pass a lambda as param3 which adds each int to a 'total' variable:

#include <vector>
#include <algorithm>

std::vector<int> someList;
int total = 0;
std::for_each(someList.begin(), someList.end(), [&](int x) {
    total += x;
});


I recommend reading the wikipedia link as it has more examples.


Finally, the last thing I want to mention is lambda recursion.
When I originally tried using recursion with c++ lambdas, I got compiler errors. So I concluded that recursive lambdas in c++ were not possible without hacks.

I ended up coding a hack to support lambda recursion:


// function pointer type the lambda will be converted to
typedef int(*pFoo)(int, int);

int main() {

    pFoo foo = [](int x, int _f) -> int {
        // reinterpret the int argument back into a function pointer
        // (NOTE: this assumes sizeof(void*) == sizeof(int), i.e. a 32-bit build)
        int* _p = (int*)&_f;
        void*& p = (void*&)*_p;
        pFoo* a = (pFoo*)_p;
        pFoo f = *a;
        return (x<=1) ? 1 : x * f(x-1, _f);
    };

    // smuggle the function pointer into an int
    int* _p = (int*)&foo;
    int p = *_p;

    cout << foo(5, p) << endl; // Prints '120'
    return 0;
}


Yes, I know it's super ugly; I had to do a lot of ugly casts to get GCC to compile it without warnings or errors.

The basic idea here is that the lambda takes two arguments: the integer we perform the factorial on, and a function pointer to the lambda itself, disguised as an 'int'. The lambda casts that int back to the function pointer type and uses it to call itself.

I recently figured out, however, that you don't need to resort to hacks for lambda recursion!

Here is the proper way to do lambda recursion:


#include <iostream>
#include <functional>
using namespace std;

int main() {
    function<int(int)> factorial = [&factorial](int n) -> int {
        return n <= 1 ? 1 : n * factorial(n-1);
    };
    cout << factorial(5) << endl; // Prints '120'
    return 0;
}


Notice how we give the lambda access to the 'factorial' variable.

I had tried this before using 'auto' instead of 'function<int(int)>' and it didn't work with GCC, so I assumed recursive lambdas weren't supported (I had also read posts by people saying they weren't supported... guess they were wrong :p).

So the trick is to not use 'auto' when declaring the lambda, but instead use function<int(int)> to explicitly give it a type.

And when you think about it, it makes sense that you can't use 'auto' with lambda recursion, since you'd be using the identifier before its type has been fully deduced.

Anyways, with the introduction of lambdas, c++0x has nice support for nested functions, recursive nested functions, and anonymous functions.

So get yourself a c++0x compiler and try it out!

Sunday, July 11, 2010

Using c++ references for nicer code

It is common to see casts and conversions between different primitive types in c++ code.
Many times, however, you don't want to convert the value at all; you just want to reinterpret the binary representation of the variable as another type.

The common ugly way to do this is something like this:

#include <iostream>
using namespace std;

// Program Entry Point
int main() {

    // Make x hold the binary representation of a
    // positive infinity float
    int x = 0x7f800000;

    // Treat x as if it was holding a float instead of an int,
    // and then give f the same value
    float f = *(float*)&x;

    // Print float f (positive infinity) to the console
    cout << f << endl;
    return 0;
}


This does work; however, as you can see, the conversion line is pretty ugly.
The reason you can't just do "float f = (float)x;" is that that would numerically convert the int value to a float, changing the bits.

But that's not what we want: we want the bits of 'int x' to be treated as if they were a float all along. That is why we first take the address of 'x' (&x), then treat it as a float pointer ((float*)&x), then finally dereference it (*(float*)&x).

With c++ however, you can use references to simplify this whole procedure.

#include <iostream>
using namespace std;

// Program Entry Point
int main() {

    // Make x hold the binary representation of a
    // positive infinity float
    int x = 0x7f800000;

    // Treat x as if it was holding a float instead of an int,
    // and then give f the same value
    float f = (float&)x;

    // Print float f (positive infinity) to the console
    cout << f << endl;
    return 0;
}


This trick was taught to me by Jake Stine from pcsx2, and apparently it's something not a lot of coders know, because I commonly see people doing this the ugly way instead.
Most likely that's because you can't do this in normal C (just another reason why C sucks compared to C++ xD).
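
One caveat I'll mention: strictly speaking, both versions break c++'s aliasing rules, so a very aggressive optimizer is allowed to miscompile them. The fully well-defined way is to copy the raw bytes; compilers optimize the memcpy away, so it costs nothing. A sketch (the helper name intBitsToFloat is just mine):

#include <cstring>

float intBitsToFloat(int x) {
    float f;
    memcpy(&f, &x, sizeof(f)); // copy the bits; no pointer-aliasing tricks
    return f;
}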

Fast Multiplication and Division

Another well-known bitwise trick is that you can do multiplication and division by powers of two (2^n) with simple shifts by n.

For example, you can do:


// Normal Multiplication
void mul(int& x) {
    x = x * 4;
}

// Optimized Multiplication
void mul(int& x) {
    x = x << 2;
}


And for division:


// Normal Division
void div(unsigned int& x) {
    x = x / 4;
}

// Optimized Division
void div(unsigned int& x) {
    x = x >> 2;
}


Unlike the modulus optimization trick, the multiplication trick works for both signed and unsigned variables, so the compiler is safe to make these optimizations for you.

In the case of the division trick, there is a problem when you have a negative signed number: if you divide a negative integer and the result should have a remainder, a plain shift gives you the wrong result (for example, -1/2 should be 0, but -1>>1 is -1).
Compilers are still able to optimize signed division by powers of two, using shifts plus a little extra logic to account for the possible errors.
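
To give you an idea, here's the gist of the correction logic a compiler emits for a signed divide by 4 (just a sketch of the idea; real codegen does this branch-free, and the function name div4 is mine):

int div4(int x) {
    if (x < 0) x += 3; // bias by (divisor - 1) so the shift rounds toward zero
    return x >> 2;     // arithmetic shift
}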

Since the compiler will make the necessary optimizations for you, it is generally better to leave the code as normal multiplications and divisions, for readability and to prevent mistakes.
If you're dealing with low-level assembly, though, you'd obviously want to do the shifts where possible instead of using mul and div.

Fast Modulus Operation and Using Unsigned Ints as an Optimization

This is another pretty well-known trick, but there are often a few misconceptions about it.

If you are using the modulus operator '%' with a power-of-2 value (2^n), you can do the same operation with a bitwise AND with 2^n-1.


// Normal Mod Operation
void mod(unsigned int& x) {
    x = x % 4;
}

// Optimized Mod Operation
void mod(unsigned int& x) {
    x = x & 3;
}


It is very important to note that this only works with positive/unsigned values.
If x were '-6', for example, the correct value returned would be '-2', but the optimized mod would return '2'.

The optimized mod trick is still very useful, though, since perhaps the most common use of mod is to check whether a number is odd or even:


// Normal isEven Operation
bool isEven(int x) {
    return !(x % 2);
}

// Optimized isEven Operation
bool isEven(int x) {
    return !(x & 1);
}


This works regardless of whether the input is positive or negative, and is much faster than using the modulus operator.

The way the modulus operator works on x86-32 is that it has to do a div operation and take the remainder; div is almost always a slow operation, on any architecture.


I also want to point out something else.
The misconception some people have is that the compiler will always optimize something like "x = x % 2" into "x = x & 1"; the real answer is: not always.

As I already said, the optimization only works as-is for positive/unsigned values, so if you declare x as a normal signed int, the compiler can't blindly turn the mod into an AND.
This is why using "unsigned int" can itself be an optimization: it allows the compiler to turn such mod operations into plain bitwise ANDs.

If you leave the values signed, the compiler does one of two things:
1) Compiles it using a div and takes the remainder.
2) Compiles code that checks the sign bit of the value, does the optimized AND, and negates the result if the original value was negative.

That is, the compiler will do something like this:

// What the compiler might generate for
// "x = x % 2;" with signed values
if (x < 0) {
    x = x & 1;
    x = -x;
}
else x = x & 1;



Many times I have seen performance increase dramatically from removing modulus operations in favor of bitwise ANDs or other logic.

For example, if you were to have a loop like this:

for(int i = 0; i < 100000; i++) {
    someVar++;
    someVar %= 7;
    // ... Do more stuff
}


You can make it much faster by doing this (it works here because someVar only grows by one step at a time, so it can never jump past 7):

for(int i = 0; i < 100000; i++) {
    someVar++;
    if (someVar >= 7) someVar = 0;
    // ... Do more stuff
}

Saturday, July 10, 2010

Bitwise Set-Bit

Say you want to set a variable to '0' if an integer is 0, and to '1' if the integer is non-zero.

You can do it this way:

int setBit(int x) {
    return !!x;
}


If you instead want to emulate this behavior with arithmetic, there are some other ways to do it.

If you know for sure the sign-bit is not set, then you can implement this behavior like this:

int setBit(int x) {
    return (uint)(x + 0x7fffffff) >> 31;
}


This works because if any bit of x was set, the addition ends up carrying a '1' over into bit #31 (the sign bit).

If you also want to include the sign bit into this, you can implement the setBit() function like this:

int setBit(int x) {
    return ((((uint)x >> 1) | (x&1)) + 0x7fffffff) >> 31;
}


Or you can use 64bit operations like this:

int setBit(int x) {
    uint64 y = (uint64)(uint)x + 0xfffffffful;
    return ((int*)&y)[1]; // read the high 32 bits (note: [1] assumes little-endian)
}
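
A more portable way to grab those high 32 bits, without the little-endian assumption, is a plain shift:

int setBit(int x) {
    uint64 y = (uint64)(uint)x + 0xfffffffful;
    return (int)(y >> 32); // high 32 bits: 1 if any bit of x was set, else 0
}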



Notes:
This trick isn't really useful when sticking to high level c++, but it can be useful when dealing with low level assembly.

With x86, if you don't wish to use the SETcc instruction, you can use a variation of this trick for a full 32-bit result, using the carry flag like so:

add eax, -1 // Adds 0xffffffff; sets the carry flag iff eax was non-zero
rcl eax, 1  // Rotate the carry flag into the lsb
and eax, 1  // Keep only that bit

The above is nice because SETcc only sets the lower byte to 1/0, while this sets the full 32-bit register to 1/0. Also, because SETcc modifies only the lower byte, it can cause a partial-register penalty in the CPU's pipeline, as opposed to modifying the full 32-bit register.

Also note that I use 'int' and 'uint' in this article to refer to 32-bit signed and unsigned ints, and 'uint64' to refer to an unsigned 64-bit int.

Friday, July 9, 2010

Testing for NaN and Infinity values

From my work on pcsx2 I have had to deal extensively with the problem of NaN and Infinity values with floats.

The ps2's FPU and VU processors do not support NaN or Infinity values, so it is a pain to emulate them on a system that does support such values (x86-32/SSE processors).

There are a variety of ways to test for NaN and Infinity values, and I will list a few here.

The first is pretty well known.
If you compare a float for equality against itself, it returns true unless the float is a NaN.
So the typical approach is to do something like:

// Test for NaN
bool isNaN(float x) {
return x != x;
}


That will only test for NaNs; if you want to check for infinities, you can do something like this:


#include <limits>
#include <math.h>
using namespace std;

// Test for positive or negative infinity values
bool isInf(float x) {
    return fabs(x) == numeric_limits<float>::infinity();
}



You can instead use bitwise logic to test for NaN's and Infinities.


// Test for NaN with bitwise logic
bool isNaN(float x) {
    return ((int&)x & 0x7fffffff) >= 0x7f800001;
}

// Test for Inf with bitwise logic
bool isInf(float x) {
    return ((int&)x & 0x7fffffff) == 0x7f800000;
}


You can even check for both NaN and Infinities with just one comparison:


// Test for NaN or Inf with bitwise logic
bool isNaNorInf(float x) {
    return ((int&)x & 0x7fffffff) >= 0x7f800000;
}
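
Here's a quick driver to exercise these (a test sketch, assuming the functions above are in scope; dividing by a float zero is well defined under IEEE 754 and produces exactly these special values):

#include <iostream>
using namespace std;

int main() {
    float zero = 0.0f;
    float inf  = 1.0f / zero; // +infinity
    float nan  = zero / zero; // NaN
    cout << isNaN(nan) << isInf(inf) << isNaNorInf(0.5f) << endl; // prints "110"
    return 0;
}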



Generally, you probably don't want to use the bitwise versions of these functions, because the compiled code ends up switching from floating-point to integer arithmetic, which will most likely be slower than sticking to the floating-point comparisons.

Compare Optimization Tricks

Okay, I came up with these bitwise tricks myself, so they shouldn't be as well known as the last one I posted; well, the first one is fairly obvious (so other people have probably used it), but the second one not so much.

If you're ever checking that an integer is in the range 0...n, you can normally do this:

// Normal Compare
bool compare(int x) {
    return (x >= 0) && (x <= 10);
}


However, you can do that same operation with just one compare instruction:

// Optimized Compare
bool compare(int x) {
    // a negative x wraps around to a huge unsigned value, so it fails the compare
    return (unsigned int)x <= 10;
}



Now, this second trick is for when you want to compare 2 different integers and check that they're both in the range 0...2^n - 1 (note the bound: it has to be one less than a power of two, i.e. all 1's in binary, like 15).

Normally you could do this:

// Normal Compare
bool compare(int x, int y) {
    return (x >= 0) && (x <= 15)
        && (y >= 0) && (y <= 15);
}


Using the trick mentioned before, we can optimize it to this:

// Optimized Compare
bool compare(int x, int y) {
    return ((unsigned int)x <= 15)
        && ((unsigned int)y <= 15);
}


But now we can further optimize this to just one single compare!

// Very Optimized Compare!
bool compare(int x, int y) {
    return ((unsigned int)(x|y) <= 15);
}


That ends up doing the work of 4 comparisons with just 1!
(Although you now have the added OR instruction; but OR is very fast to compute on almost every processor.)

I want to stress again that the last optimization only works when the upper bound is of the form 2^n - 1 (all 1's in binary). With a bound like 16, for example, x=16 and y=1 would give x|y == 17, and the check would wrongly fail even though both values are in range.
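
Since these are easy to get wrong at the boundaries, here's a brute-force sketch that checks the single-compare version against the naive one over a small range:

#include <cassert>

int main() {
    for (int x = -64; x <= 64; x++) {
        for (int y = -64; y <= 64; y++) {
            bool naive = (x >= 0) && (x <= 15) && (y >= 0) && (y <= 15);
            bool fast  = ((unsigned int)(x|y) <= 15);
            assert(naive == fast); // they agree everywhere
        }
    }
    return 0;
}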

Thursday, July 8, 2010

xor swap trick and add/sub swap trick

This is a pretty common bitwise trick that most veteran coders know about, but it's probably a good one to start this section with.

If you have 2 integers, x and y, you can swap their values normally by doing:

// Normal Swap Function
_f void swap(int& x, int& y) {
    int z = x;
    x = y;
    y = z;
}


But this requires a temporary variable/register to be used (the "int z").

You can do this same functionality without using a temporary variable at all, using the xor swap trick:

// XOR Swap Function
_f void swap(int& x, int& y) {
    assert(&x != &y);
    x ^= y;
    y ^= x;
    x ^= y;
}


The result ends up being the same as the first function, but notice now it doesn't use any temporary variables.
This trick can be useful if you're doing some low-level asm work, or reg-alloc in a recompiler, or whenever you don't have an extra register to spare and your instruction set doesn't have an xchg opcode. Admittedly, if you're just sticking to c++ code, you might be better off using the normal swap routine.

Also beware! This trick has a big problem if &x == &y, that is, if x and y are the same exact variable: instead of keeping the value the same, you end up zeroing it out!
So the xor swap trick should only be used when you know you are dealing with 2 distinct 'x' and 'y' variables. That is why we put the assert(&x != &y) line in the function, so debug builds will give an error if the references of x and y point to the same variable.

Lastly, I'll point out that you can do the same trick using add and sub operations, but it's a bit messier and not as useful:

// Add/Sub Swap Function
_f void swap(int& x, int& y) {
    assert(&x != &y);
    x += y;
    y  = x - y;
    x -= y;
}


The problem with this code is the second instruction.
For many instruction sets (for example SSE, MMX, or x86-32), you're stuck with "dest, src" syntax for instructions.
This means that the compiled code for this will still end up having to use some type of temporary register or the stack to compute this.

If you're dealing with another architecture however that has 3-operand syntax "dest, src1, src2", then this is more useful since you can do the second line with just 1 instruction.

Notes:

I use "_f" as a macro for "__forceinline" so:
// Forceinline Macro for in-lining functions
#define _f __forceinline


Also note that this trick can be extended to floats, doubles, etc...
But you have to be sure to do the operations on the raw bits, as 32-bit or 64-bit integers respectively.

Here's an example of swapping 2 floating point singles with this trick:

// XOR Swap Function
_f void swap(int& x, int& y) {
    assert(&x != &y);
    x ^= y;
    y ^= x;
    x ^= y;
}

// Float Swap Function
_f void swap(float& x, float& y) {
    swap((int&)x, (int&)y);
}


Technically this isn't very useful when sticking with high-level c++, especially since floating-point operations are generally compiled to use the x87 fpu (so the floats get moved into general-purpose registers and back, ending up slower than a regular swap). It will also confuse the compiler, which is another reason it won't generate as good code. But if you're dealing with SSE directly, the xor trick can be useful.

So...

I had an idea.

I was thinking of stuff I should post in this blog, and I came up with the idea of posting random code snippets/tricks (mostly c++/asm stuff).

Hopefully you guys like that idea; at least it's something kind of fun :D