Tuesday, June 23, 2009

LINQ Overkill. Why? Ha! Why not??

Source Code: http://www.box.net/shared/tnt1569ql9

Before going to SetFocus, we had to take a few test before being accepted. One of these tests was a simple coding exercise, which went something like this:

Write a program that accepts a string input from the user. Your program will then count the occurrence of each character in the string, and output the results in this format:

There are 0 A's.
There are 2 B's.
There are 1 C's.
...
...


Simple program really, with a few easy approaches. I personally used a HashTable to hold the chars and the integers. After going through SetFocus, and learning about generics, I rewrote the app to use a Dictionary<char,int>, and the app basically looked something like this:



private static void CountCharacters(string text)
{
Dictionary<char,int> chars = new Dictionary<char, int>();
for (int i = 65; i <= 90; i++) //character integer values for 'A' - 'Z'
{
chars.Add((char)i, 0);
}

foreach (var character in text.ToUpperInvariant())
{
if (chars.ContainsKey(character))
{
chars[character]++;
}
}

foreach (var kvp in chars)
{
Console.WriteLine("There are {0} {1}'s.", kvp.Value, kvp.Key);
}
}


Basically, populate the dictionary with the alphabet in upper case, initialize all int's to zero's. Then, loop through the text, and increment the corresponding integer. Finally, loop through the dictionary, and output the results.

Very nice. Recently however, I was bored one bight, and it hit me. Having used LINQ extensivley for quite a while now, I figured, I'm sure I can figure out a way to do this in a "LINQ one liner", ie. one gigantic impossible-to-understand LINQ statement. Why? Why not!!

So, here's what I came up with. I'm warning you, it's isn't pretty but it works. I'll try my best to explain after:



public static StringBuilder CountCharsAlexLinqMethod(string text)
{
var result = new StringBuilder();
foreach (var item in (from p in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
join pa in
(from p in text.ToUpperInvariant()
group p by p into g
select new { Count = g.Count(), Character = g.Key })
on p equals pa.Character into w
from a in w.DefaultIfEmpty()
select new
{
Count = a == null ? 0 : a.Count,
Character = p
}))
{
result.AppendFormat(Format, item.Count, item.Character);
}

return result;
}


Told you it wasn't pretty :) Ok, it looks worse than it is. The basic premise is as follows. First, let's analyze the inner query:


(from p in text.ToUpperInvariant()
group p by p into g
select new { Count = g.Count(), Character = g.Key })


Think of this as a "GROUP BY" in SQL. It basically groups all the letters together in the given text, and counts them up. So at this point, if our text for example was the word "Hello", then this collection of anonymous objects would look something like this:

{Character = 'H', Count = 1}
{Character = 'E', Count = 1}
{Character = 'L', Count = 2}
{Character = 'O', Count = 1}

Once we have that, we basically do the equivalent of a LEFT OUTER JOIN on the entire alphabet. Since string implements IEnumerable<char> we can query it just like any other IEnumerable and even join on it. So what's happening is, we're joining our results from the inner query earlier, to a collection that contains the entire alphabet. Wherever they join, meaning wherever the letter is found in the inner query, use that as the count, if not, use 0 as the count:


on p equals pa.Character into w
from a in w.DefaultIfEmpty()
select new
{
Count = a == null ? 0 : a.Count,
Character = p
}))


Crazy, but not too bad. Well, the story isn't over. Since I thought this was kinda neat, and I HAD to show someone, I decided to bug James Arendt who was one of my instructors at Set Focus and I sent him my code. Here was his response:

Fun solution, Alex! I don't use the group join too often so it was neat to see it in action on this solution. I decided to take a stab at the code as well, but I took a different direction.




private static void CountCharacters(string text)
{
// I could have used a literal, but I wanted to demonstrate some
// other LINQ routines.
var letters = Enumerable.Range('A', 26).Select(i => (char)i);
var sourceChars = letters.Concat(text.ToUpperInvariant());

var results = from c in sourceChars
group c by c into g
where char.IsLetter(g.Key)
orderby g.Key
select new { Char = g.Key, Count = g.Count() - 1 };

foreach (var result in results)
{
Console.WriteLine("There are {0} {1}'s.", result.Count, result.Char);
}
}


His approach is a little different. First, he builds up an enumerable with the alphabet without using literals (no real difference there). Then however, he tacks on the text to the end of that enumerable. So at this point, if our test text was "Hello", the enumerable would be:

'A','B','C','D'.....'Z','H','E','L','L','O'.

Then, he does a group by on this enumerable, counting up the duplicates.

Nice. So, whenever you have more than one way of doing something, what does every computer nerd like me love to do? Why benchmark of course!!

In my initial comparisons between my method and the way James did it, his way was quicker every time. I wrote back to him, and congratulated him that his way was in fact quicker, but he pointed out to me, that his way was only quicker with shorter strings. When the strings got really long, my way turned out to be fast. So, I decided to flesh this all out, and really benchmark all ways of doing it.

Before I do that, just for the sake of really doing it in just one line, I wrote it one more way:



public static StringBuilder CountCharsAlexLinqCompleteOverkill(string text)
{
var result = new StringBuilder();

(from p in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
join pa in
(from p in text.ToUpperInvariant()
group p by p into g
select new { Count = g.Count(), Character = g.Key })
on p equals pa.Character into w
from a in w.DefaultIfEmpty()
select new
{
Count = a == null ? 0 : a.Count,
Character = p
}).ForEach(item => result.AppendFormat(Format, item.Count, item.Character));

return result;
}

private static void ForEach<T>(this IEnumerable<T> list, Action<T> action)
{
foreach (var item in list)
{
action(item);
}
}


Not really THAT much different, but now it really is just one statement. There's a ForEach extension method now, so the actual CountCharacters method is just one big statement. So, here's the entire class I used for benchmarking:



using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace LinqOverkill
{
public static class CharacterCounter
{
private const string Format = "There are {0} {1}'s.\n";

public static StringBuilder CountCharsDictionaryMethod(string text)
{
var chars = new Dictionary<char, int>();
for (int i = 65; i <= 90; i++) //character integer values for 'A' - 'Z'
{
chars.Add((char)i, 0);
}

foreach (var character in text.ToUpperInvariant())
{
if (chars.ContainsKey(character))
{
chars[character]++;
}
}

var result = new StringBuilder();


foreach (var kvp in chars)
{
result.AppendFormat(Format, kvp.Value, kvp.Key);
}

return result;
}

public static StringBuilder CountCharsAlexLinqMethod(string text)
{
var result = new StringBuilder();
foreach (var item in (from p in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
join pa in
(from p in text.ToUpperInvariant()
group p by p into g
select new { Count = g.Count(), Character = g.Key })
on p equals pa.Character into w
from a in w.DefaultIfEmpty()
select new
{
Count = a == null ? 0 : a.Count,
Character = p
}))
{
result.AppendFormat(Format, item.Count, item.Character);
}

return result;
}


public static StringBuilder CountCharsJamesLinqMethod(string text)
{
var builder = new StringBuilder();

var letters = Enumerable.Range('A', 26).Select(i => (char)i);
var sourceChars = letters.Concat(text.ToUpperInvariant());

var results = from c in sourceChars
group c by c into g
where char.IsLetter(g.Key)
orderby g.Key
select new { Char = g.Key, Count = g.Count() - 1 };

foreach (var result in results)
{
builder.AppendFormat(Format, result.Count, result.Char);
}

return builder;
}

public static StringBuilder CountCharsAlexLinqCompleteOverkill(string text)
{
var result = new StringBuilder();

(from p in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
join pa in
(from p in text.ToUpperInvariant()
group p by p into g
select new { Count = g.Count(), Character = g.Key })
on p equals pa.Character into w
from a in w.DefaultIfEmpty()
select new
{
Count = a == null ? 0 : a.Count,
Character = p
}).ForEach(item => result.AppendFormat(Format, item.Count, item.Character));

return result;
}

private static void ForEach<T>(this IEnumerable<T> list, Action<T> action)
{
foreach (var item in list)
{
action(item);
}
}
}
}


I then wrote a test method which tested it first with a short string of 24 characters, and then 4135 characters. Here's the main method:



using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace LinqOverkill
{
class Program
{
static void Main(string[] args)
{
string text = "This is some dummy text.";
int iterations = 10000;

BenchmarhMethods(text, iterations);

string bigText = LinqOverkill.Resource.BigText;
BenchmarhMethods(bigText, iterations);

Console.ReadKey(true);
}

private static void BenchmarhMethods(string text, int iterations)
{
Console.WriteLine("Testing with {0} characters.", text.Length);
Stopwatch watch = new Stopwatch();

watch.Start();
for (int i = 1; i <= iterations; i++)
{
CharacterCounter.CountCharsDictionaryMethod(text);
}
watch.Stop();
Console.WriteLine("Dictionary method took: {0} milliseconds.", watch.ElapsedMilliseconds);

watch.Reset();

watch.Start();
for (int i = 1; i <= iterations; i++)
{
CharacterCounter.CountCharsAlexLinqMethod(text);
}
watch.Stop();
Console.WriteLine("Alex LINQ method took: {0} milliseconds.", watch.ElapsedMilliseconds);

watch.Reset();

watch.Start();
for (int i = 1; i <= iterations; i++)
{
CharacterCounter.CountCharsAlexLinqCompleteOverkill(text);
}
watch.Stop();
Console.WriteLine("Alex Overkill method took: {0} milliseconds.", watch.ElapsedMilliseconds);

watch.Reset();

watch.Start();
for (int i = 1; i <= iterations; i++)
{
CharacterCounter.CountCharsJamesLinqMethod(text);
}
watch.Stop();
Console.WriteLine("James LINQ method took: {0} milliseconds.", watch.ElapsedMilliseconds);

watch.Reset();
Console.WriteLine();
}
}
}


Here were the results on my machine:

Testing with 24 characters.
Dictionary method took: 427 milliseconds.
Alex LINQ method took: 1176 milliseconds.
Alex Overkill method took: 1157 milliseconds.
James LINQ method took: 636 milliseconds.

Testing with 4135 characters.
Dictionary method took: 7484 milliseconds.
Alex LINQ method took: 6417 milliseconds.
Alex Overkill method took: 6429 milliseconds.
James LINQ method took: 6738 milliseconds.

As you can see, with 24 characters, using a Dictionary and looping yourself is the quickest way. Second quickest is the way James did it. Then, my way (the Alex LINQ and the overkill are basically the same exact thing) came in last.

When I pumped it up to 4135 characters though, my was fast fastest :) Even faster than using a Dictionary.

One final note. In production code, this really isn't a good idea. Yes, it's loads of fun to do, (I love doin stuff like this) but if another developer, or even you really, ever needs to go back to look at this code, it'll take them 3 times as long to figure out what's going on and how it's doing it.

Finally, I'll leave you with this example of the worst abuse of LINQ ever. It's awesome, but I have no clue what the heck it's doing!! http://blogs.msdn.com/lukeh/archive/2007/10/01/taking-linq-to-objects-to-extremes-a-fully-linqified-raytracer.aspx

No comments: