.NET Framework - what is the best datatype for..

Asked By calvert4ren on 20-Nov-07 07:43 AM
I need some sort of data type that will hold a listing of IDs and
their counts/frequency.  As I do some processing I get an ID back,
which I need to store while keeping an accurate count of how many times
that ID appears.  For example I might get a listing of these
numbers:
1,4,6,23,6,2,54,1,6,23,2,4,1,6

I will get those numbers one at a time and need to build something
like
1,3
4,2
6,4
23,2
2,2
54,1

In each pair X,Y above:
X is the number
Y is the count of that number

I then need to sort the data on the Y (the count) column:
6,4
1,3
4,2
23,2
2,2
54,1

Can anyone suggest how best to do this?  Any suggestions or help is
much appreciated!

Thanks,

Calvert




Nicholas Paldino [.NET/C# MVP] replied on 19-Nov-07 11:11 AM
I would store this in a Dictionary<int, int> where the key is the id and
the value is the count.

Then, when you need to sort the keys, I would enumerate through the
KeyValuePair instances and sort those (Dictionary<TKey, TValue> implements
IEnumerable<KeyValuePair<TKey, TValue>>).  You can simply put them into an
array and then call the static Sort method on the Array class, passing in an
anonymous method for the Comparison<T> delegate (which in this case, would
be Comparison<KeyValuePair<int, int>>) which would compare the values based
on the count.
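
For illustration, a rough sketch of that approach might look something like
this (a sketch only, assuming .NET 2.0; the sample values are the ones from
the original post):

using System;
using System.Collections.Generic;

class CountDemo
{
    static void Main()
    {
        // the sample ids from the original post
        int[] ids = { 1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6 };

        // key = id, value = count
        Dictionary<int, int> counts = new Dictionary<int, int>();
        foreach (int id in ids)
        {
            if (counts.ContainsKey(id))
                counts[id]++;
            else
                counts.Add(id, 1);
        }

        // copy the KeyValuePair instances into an array and sort by count, descending
        KeyValuePair<int, int>[] pairs =
            new List<KeyValuePair<int, int>>(counts).ToArray();
        Array.Sort(pairs, delegate(KeyValuePair<int, int> a, KeyValuePair<int, int> b)
        {
            return b.Value.CompareTo(a.Value);
        });

        foreach (KeyValuePair<int, int> pair in pairs)
            Console.WriteLine("{0},{1}", pair.Key, pair.Value);
    }
}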

You could also get away with using a DataTable with an id column and a
count column, and then just placing a data view on the data table which is
sorted by the count.
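
Roughly, that DataTable variant might look like this (a sketch only, assuming
a reference to System.Data; populating the rows is left out):

DataTable table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Columns.Add("Count", typeof(int));

// ... add or update a row each time an id comes in ...

DataView view = new DataView(table);
view.Sort = "Count DESC";   // rows in the view are now ordered by count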


--
- Nicholas Paldino [.NET/C# MVP]
- mvp@spam.guard.caspershouse.com
Chris Dunaway replied on 20-Nov-07 07:43 AM
As for data types, a general rule of thumb would be that if you plan
to do some calculations with it (like addition, subtraction, etc.)
then it should be some sort of numeric type (int, double, decimal).
If you are only going to store it, then use a string.  For example,
even though a phone number is all digits, you should use a string
since you don't do calculations with a phone number.

For your problem, I think your best bet would be to use a generic
dictionary.  Use the ID as the key and the value as the count.  Read
each value in the array, check to see if it already exists in the
dictionary.  If it does, then just increment the count, otherwise, add
it to the dictionary.

Dim values() As Integer = {1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6}
Dim idCount As New Dictionary(Of Integer, Integer)()

For Each i As Integer In values          'Check each value in the array
    If idCount.ContainsKey(i) Then       'If the value already exists in the dictionary
        idCount(i) += 1                  'increment the count for that value
    Else
        idCount.Add(i, 1)                'otherwise add the value to the dictionary
    End If
Next


Hope this helps

Chris
calvert4ren replied on 20-Nov-07 07:43 AM
Thank you both very much.  That was exactly the information I was
looking for.  I am going to give it a shot now.  Thanks again.
Hilton replied on 19-Nov-07 03:38 PM
Hi,

Being a CF engineer, I'm always looking to optimize, so even though hashes,
dictionaries etc will work, the whole process is inefficient.  Do you have a
known range of values?  Let's say you know the values will be 0-1000: define
a struct with two values, one being the value and the other being the count.
Create an array of the structs and off you go.  Alternatively you could use
two separate arrays.

Again, this is for performance reasons and I'll bet it'll be way faster than
dictionaries etc.  Of course, only do this extra work if you need the
performance and you're not just running a one-off process.

I get this feeling of pending flames - it always seems to happen when I
suggest more efficient techniques.  :)

Hilton
Peter Duniho replied on 19-Nov-07 03:56 PM
On 2007-11-19 12:38:47 -0800, "Hilton" <nospam@nospam.com> said:


Define "inefficient".

A dictionary implementation will be faster than your suggested implementation.


If I understand your suggestion correctly, you are proposing creating
an array that will be scanned searching for the index value, at which
point the count can be updated.

For small data sets, I think that would work fine.  Nothing wrong at
all.  But claiming that it's more "efficient" begs the question: in
what way?

It's more efficient in memory usage, but it's certainly not more
efficient with respect to performance.  Furthermore, for large data
sets where the difference in memory footprint is likely to be of
concern, the performance difference will be phenomenal.  The dictionary
implementation will always provide a basically constant order retrieval
of the element in question, while your linear search will be linear
order, taking orders of magnitude longer as the data set is also orders
of magnitude larger.

In other words, the performance for just 100 elements is ten times
slower than the performance for just 10 elements, and for 1000 elements
is another ten times slower than for 100 elements.


You'd bet wrong.  Your suggestion is more memory efficient, but it's
not going to be faster.


No, only do this extra work if you absolutely must minimize the memory
footprint of the implementation.  It's not going to perform better at
all.


Well, a) you seem to misunderstand how the dictionary solution would
work, and b) you do seem to insist on characterizing techniques in a
way that isn't relevant to the more general case (this is not a
newsgroup for writing Compact Framework code).

It is true that there are often trade-offs between being memory
efficient and computation efficient.  It's false for you to assert that
you know best which is more important.  Even on a CF-based
implementation, sometimes you'll still prefer a faster implementation
over one that uses less memory.  It just depends on the priorities of
the situation.

Pete
Hilton replied on 19-Nov-07 07:56 PM
Peter,


You misunderstood probably because I never explained it well enough.


Nope, no scanning whatsoever.

If you know you'll have say 1000 possible indices, then define an array of
1000 structs; then when you see a value of 605, you simply increment the
count in element 605 directly - which will be faster
than any dictionary implementation.  In fact, you don't even need the struct
if most of the work is 'reading' the value and the sorting doesn't need to
be 'high-performance' - then you can define an array of ints and not structs
and spend a little more time doing the sorting.  To decide, we'd need to
understand more about the nature of the data.
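
As a minimal sketch of that direct-index idea (assuming the ids are known to
fall in 0-1000, and using a plain array of ints rather than structs):

const int MaxId = 1000;
int[] counts = new int[MaxId + 1];

foreach (int id in new int[] { 1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6 })
{
    counts[id]++;   // no hashing, no lookup - just an index into the array
}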

Hilton
Wingot replied on 19-Nov-07 09:33 PM
Err, isn't that VB? Correct me if I'm wrong, but the "Dim" was the
biggest clue, and the fact that the comments start with ', among other
things.

Just thought it was weird to see it in the csharp newsgroup :).

Regards,
Wingot

-----Original Message-----
From: Chris Dunaway [mailto:dunawayc@gmail.com]
Posted At: Tuesday, 20 November 2007 1:14 AM
Posted To: microsoft.public.dotnet.languages.csharp
Conversation: what is the best datatype for..
Subject: Re: what is the best datatype for..

*snip*

For your problem, I think your best bet would be to use a generic
dictionary.  Use the ID as the key and the value as the count.  Read
each value in the array, check to see if it already exists in the
dictionary.  If it does, then just increment the count, otherwise, add
it to the dictionary.

Dim values() As Integer = {1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6}
Dim idCount As New Dictionary(Of Integer, Integer)()

For Each i As Integer In values          'Check each value in the array
    If idCount.ContainsKey(i) Then       'If the value already exists in the dictionary
        idCount(i) += 1                  'increment the count for that value
    Else
        idCount.Add(i, 1)                'otherwise add the value to the dictionary
    End If
Next


Hope this helps

Chris
Peter Duniho replied on 19-Nov-07 10:14 PM
On 2007-11-19 16:56:41 -0800, "Hilton" <nospam@nospam.com> said:


Yes, it's hard to know what a person is talking about when they don't
explain themselves.


Yes, that's probably true.  Based on your updated description, I'd
agree the dictionary implementation will be slower.  However, it will
only be slower by a very tiny amount (assuming the rest of the code is
doing anything interesting, it would be difficult to even measure the
difference...the Dictionary class is still constant order just like
your lookup-array), and your implementation will use a lot more memory
than a dictionary would unless the range of values is completely used
or at least nearly so.

It also won't be flexible if the input data changes for any reason,
since the implementation itself is so dependent on the input data.

Finally, one could use the SortedDictionary class for simplicity.  It
would be much slower for the counting part of the algorithm (O(log n)
instead of constant order), but when you're done counting the data
would already be sorted.  In the end, I'd be surprised if the total
time spent was significantly different (for sure, the order of the
algorithm would be the same: O(n log n) for all three variations).

No reasonable interpretation of "way faster" would properly describe the
minimal difference in performance between using a dictionary and your
lookup-array proposal.


Well, personally I would almost never implement a design like the one
you proposed.  I certainly would never do it based solely on
performance reasons.  If my input data was a reasonably narrow range
(say, 16-bit values or less), _and_ I expected the input values to
cover the range with very few exceptions, then I might choose a
solution like the one you proposed.  But mainly because it's simpler,
not because it might be a little faster.

Frankly, when you write stuff like "Being a CF engineer, I'm always
looking to optimize", you are REALLY insulting to those of us who don't
write CF code.  As if we aren't concerned with optimization as well.
Furthermore, it's pretty clear from this thread and some others we've
seen that you have a very specific and narrow idea of what "optimize"
means, and if someone doesn't find your choice of trade-offs useful,
you dismiss them as someone who isn't optimizing their code.

The fact is, here you've suggested a solution that isn't really going
to gain any significant performance, and which has its own costs that
may not be appropriate for the situation (and in fact is likely to be
inappropriate for most situations, _especially_ in CF where memory is
at a much greater premium than on a desktop computer).

Calling the dictionary solution "inefficient" is just plain dumb; it
might not be quite as fast as a straight array lookup, but it's not
like it's a _slow_ design.  It's not "inefficient" at all.  It's just a
different way of approaching the problem and your implication that
anyone who might use a dictionary implementation is writing inefficient
code is very annoying.

So, if you're wondering why your posts seem to generate what you think
are "flames", you might take a good look at how you write your posts,
anyway.

Pete
Hilton replied on 20-Nov-07 03:00 AM
Dude, chill.  The guy asked how to write a histogram, I said use an array of
integers.

Hilton
Jon Skeet [C# MVP] replied on 20-Nov-07 07:44 AM
That depends on how many samples there are. Suppose there's only one
sample, but a range of size 100,000 - with the dictionary approach,
you'd only have a single entry to sort, whereas with your approach
you'd have to sort 100,000 entries. Which of those do you think would
be quicker?

(Note that using a struct may well be slower than using a class, by
the way - when sorting, each swap would be swapping more data. I
wouldn't like to say without testing though.)

Personally I'd go for the simplest code first and only optimise when
it's shown to be necessary - but at that point I'd definitely be
benchmarking instead of guessing. I would *try* using a Dictionary to
collect the values, then converting it to a list and sorting the list.
That way you wouldn't pay a penalty for unused values, although
there's a higher constant factor due to more copying being involved.
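
As a rough sketch of that last step (assuming the counts have already been
collected into a Dictionary<int, int> called counts - the variable names here
are made up):

List<KeyValuePair<int, int>> pairs = new List<KeyValuePair<int, int>>(counts);
pairs.Sort((a, b) => b.Value.CompareTo(a.Value));   // descending by count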

Jon
Peter Duniho replied on 20-Nov-07 05:38 AM
If that is _all_ you had said, we'd not have a problem.

It is the other crap you insist on writing that causes the issues.

Stick to the facts and you will not get flamed.
Marc Gravell replied on 20-Nov-07 07:17 AM
Just to turn things around... in .NET 3.5 and C# 3 (VS2008) you can
use LINQ to do the same thing, but focusing more on what you want to
do, rather than how it should happen. See below, but you can read at a
glance that we are grouping by the value itself, ordering by the count
(descending) and then value (ascending), and projecting the value and
the count.

If you do the same with a dictionary, it'll work, but it won't be as
clear what the code is actually trying to achieve. IMO, at least ;-p
The advantage of this is that it also ports very well both to database
usage, and to (for example) Parallel LINQ - without you having to redo
it from scratch.

Marc

// data-stream could be any source
IEnumerable<int> dataStream = new[] {
    1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6 };

// prepare a query
var query = from value in dataStream
            group value by value into grouped
            orderby grouped.Count() descending,
                    grouped.Key
            select new { Value = grouped.Key, Count = grouped.Count() };

// execute and iterate
foreach (var row in query) {
    Console.WriteLine("{0}: {1}", row.Value, row.Count);
}
Marc Gravell replied on 23-Nov-07 06:13 AM
As a semi-retraction... for reference, looking deeper into the code it
appears that Enumerable.GroupBy() uses a Lookup<TKey,TElement>
internally, which adds the elements to each grouping; as such, it
probably isn't suited to cases where you want to forget about the
contents and just keep the aggregates...

But nevertheless, worth knowing. Now to see if I can write a more
efficient LINQ aggregator ;-p

Marc
Chris Dunaway replied on 23-Nov-07 06:13 AM
Oops!  I had been looking at the VB group and obviously forgot where I
was!  I am too young to have a "senior moment"!

Chris
Hilton replied on 20-Nov-07 01:19 PM
OK....  I'm not looking to get into a pissing match here, so this will be my
last post to this thread, but for the record, I'll correct just one of the
many incorrect things in your post.

You said: "Frankly, when you write stuff like "Being a CF engineer, I'm
always looking to optimize", you are REALLY insulting to those of us who
don't
write CF code.  As if we aren't concerned with optimization as well."

Peter, this logic is so flawed that it is laughable.  If a red car has
wheels, then a blue car doesn't?  Secondly, it is pretty obvious that
running code on a small device with a computationally-limited 200MHz CPU
with a tiny bus to slow memory, etc requires some additional optimizations
that a dual-core desktop may not.

Well, there you go.

Hilton
Peter Duniho replied on 20-Nov-07 02:46 PM
On 2007-11-20 10:19:07 -0800, "Hilton" <nospam@nospam.com> said:


Sorry?  You wrote you were going to correct something.  But you just
wrote another untrue thing.


I'm not laughing.


I have never seen anyone write "being a red car, it has wheels" and if
I did, I would call out that statement as being just as wrong as what
you wrote.  Clearly the phrasing you used makes the exact implication I
describe.

Specifically, the word "being" along with an adjective ("CF" in your
original statement, "red" in this latest example) implies some
significance to the adjective, not just the noun.

If you don't want to accept or admit that implication, fine.  You'll
probably keep making the same mistake.  But stop being so surprised
when the things you write set someone off.



Just because you write code on the CF platform, that in no way
necessarily gives you a leg up on people who don't with respect to
optimal techniques.  If anything, the performance of your platform is
more likely to affect the quantity of data you are able to deal with,
rather than how efficient an algorithm you need.  Optimal techniques
are needed across ALL platforms.

Frankly, your contribution to this thread is a very good example of the
fallaciousness in believing that working on CF somehow gives you an
advantage, since your claims about the relative performance of the
techniques offered are just plain wrong.

Again, there was nothing wrong with the basic technique you suggested.
But you couldn't resist comparing yourself and your technique to the
other people, and _that_ is the problem.  It would be annoying if there
was even some basis to the comparison, but the fact that the comparison
itself is completely flawed just aggravates the situation.


Yes, there I go.

Pete
Hilton replied on 20-Nov-07 03:59 PM
Jon,


I agree 100% - the nature of the data is significant.  BTW: I wouldn't have
to sort 100,000 entries; the algorithm could, for example, remove the unused
counts by creating another array, copying over only the used counts, and
converting from an int to a struct which would include the index.



The "int count, index" struct would have double the data to swap so you are
correct (instead of just a reference) - assuming that we use ints (again,
depends on the nature of the data - we might require bytes or longs).
However, since arrays of structs are contiguous, cache misses would/might be
reduced and that might end up being significant (or not).  On the flip side,
creating (and disposing) the objects incurs a penalty.  There are several
'exercises for the reader' here; e.g. how much more expensive are objects
than structs during creation and swapping, how much penalty is incurred
during boxing and unboxing (if required), whether it is faster to swap
structs or references, etc.  Another thing: it seems like accessing x[y].z
should be faster when using structs than objects (I think).



I think it really all depends on the nature of the data and the nature of
the app.  With large amounts of data (e.g. large bitmap or network streams)
I would definitely favor the array of ints, but the techniques you suggest
definitely should be part of the same toolkit to solve the problem.

Jon, do you have any URLs with information on the internals of "Dictionary<int,
int>"?  I'm particularly interested in boxing, unboxing, casting, hashing,
etc.  Performance info would be great too.

Thanks,

Hilton
Jon Skeet [C# MVP] replied on 20-Nov-07 04:39 PM
Removing unused counts is still O(R) where R is the theoretical range -
that may be *much* larger than the set of actual numbers used.


Possibly. My bigger point was that it's not a good idea to claim
performance benefits without testing, which is what you were doing.
Optimization is a very subtle art - and a time consuming exercise,
which should only be attempted where really needed.


Why would you "definitely favour the array of ints"? You've already
suggested that with a range of size 100,000 you wouldn't use an array -
that could be the case with large streams.

In particular, would you "definitely favour the array of ints" without
benchmarking? If so, that's foolish.


There'd be no boxing involved, due to the way that generics are handled
by the JIT. General performance info is given in MSDN - lookup on a
Dictionary<K,V> approaches an O(1) operation; adding is O(1) unless
extra capacity is required, in which case it becomes O(n).

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Scott Gifford replied on 20-Nov-07 04:57 PM
Just for the OP's reference (and some Google keywords), this is a
standard tradeoff, and it depends on whether the data is "dense" or
"sparse": for dense data the array is likely to be the more efficient,
while for sparse data the dictionary is likely to be best.  But
of course, performance specifics will depend on a lot of factors.

Since both ways are fairly simple, if performance matters it would be
straightforward to write it both ways and benchmark, as Jon suggests.
It would also give some useful data for the next time you have to make
this tradeoff.
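
A rough harness for that kind of comparison might look like the following
(the data size, value range, seed and usings - System,
System.Collections.Generic, System.Diagnostics - are just assumptions for
illustration):

int[] data = new int[1000000];
Random rng = new Random(42);
for (int i = 0; i < data.Length; i++)
    data[i] = rng.Next(0, 256);          // dense range, 0-255

Stopwatch sw = Stopwatch.StartNew();
int[] arrayCounts = new int[256];
foreach (int v in data)
    arrayCounts[v]++;
sw.Stop();
Console.WriteLine("array:      {0} ms", sw.ElapsedMilliseconds);

sw = Stopwatch.StartNew();
Dictionary<int, int> dictCounts = new Dictionary<int, int>();
foreach (int v in data)
{
    int c;
    dictCounts.TryGetValue(v, out c);    // c is 0 if the key is missing
    dictCounts[v] = c + 1;
}
sw.Stop();
Console.WriteLine("dictionary: {0} ms", sw.ElapsedMilliseconds);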

Good luck!

----Scott.
Hilton replied on 20-Nov-07 06:14 PM
You're combining two of my comments into one.  When I refer to 'range size'
I mean are all the values 0-255 or 0-65535 etc.  So a huge network stream
will only have a range size of 0-255 if you look at the individual bytes.
Same goes with calculating a histogram of an 8-bit indexed image; e.g. GIF
or PNG8 which is a very common operation - look at (almost) any image
application.  You think they use an array or a dictionary?  If I have a huge
input stream and was only looking at bytes (i.e. range of 0-255), would a
dictionary be faster than "x[y]++" where x is an array of ints?  And which
would be clearer to read?  Jon, Peter, take that as my throwing down a
challenge.  :)

Hilton
Jon Skeet [C# MVP] replied on 20-Nov-07 06:52 PM
Um, you said in a single post: "With large amounts of data (e.g. large
bitmap or network streams) I would definitely favour the array of ints".

How is that combining two comments?


You never suggested that "huge network stream" implies "we're only
examining a single byte at a time". Why can't a network stream be a
stream of longs, for instance?


In the case of only 256 entries, I'd probably go with the array as
well. That's not always the case with large network streams, nor with
images (which may well *not* be 8-bit indexed).

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Peter Duniho replied on 20-Nov-07 06:52 PM
On 2007-11-20 15:14:02 -0800, "Hilton" <nospam@nospam.com> said:


Challenge?  Looks more like a straw man to me.

Jon's point appears to me to be that you wrote "with large amounts of
data I would definitely favor the array of ints", but there's really
nothing about the amount of data that predisposes it to being better
solved by an array of ints when that amount is large.

He's not combining two comments.  The single statement is by itself
flawed.  Even if you have a large amount of data, if the range is large
and the population within that range is sparse, an array of ints is not
likely to be a good solution.

I have already agreed that in many situations, I would prefer an array
of ints for its simplicity.  Raising that as an arguing point doesn't
make any sense; we're already in agreement there.

The point of contention, which you seem to keep overlooking, is that
your original claim was that an array of ints is efficient while a
dictionary is NOT efficient.  The claim is wrong twice: first, because
the dictionary isn't inefficient at all even in cases where it's slower
than the array of ints, and second because there are situations in
which a dictionary solution is actually more efficient than the array
of ints.

As far as readability goes, which is more readable depends on context.
While I think that the array of ints is likely to be the simpler, more
clear implementation in most cases, that doesn't mean it would be in
all cases.  For example, using a SortedDictionary produces very simple
code IMHO because you never even have to have a separate section of
code to sort the data.  Even a Dictionary may be found to be simpler
because of the ability to do compile-time type checking, allowing for
it to be used in a simpler way elsewhere.

Context is everything, and there's not enough context to claim a priori
that the array of ints is in fact the clearer implementation.

But besides, your contention has been that the array solution is
superior due to _performance_, not readability.  And on that basis,
your claim has no firm foundation.

Pete
Hilton replied on 20-Nov-07 07:20 PM
Jon,

Yet another thread where being in the same room would resolve it in 2
minutes but NG posts result in ambiguity etc.  I agree I wasn't specific
enough with large network stream, that's why I clarified it.  You and I are
agreeing here that small range, lots of data would imply array of int usage;
sparse and/or big-range data would imply Dictionary or other non-array
solution.  And as we both keep saying/implying, the nature of the data
ultimately determines the course of action.

Cheers,

Hilton
Chris Mullins [MVP - C#] replied on 20-Nov-07 07:35 PM
Well, as both Jon and Peter have pointed out your array of ints has some
potential problems. It's not a clear-cut "This is better across the board"
type of decision. It's likely better in many cases, but not all. Sadly, that
always seems to be the case...

Here's one more piece of fuel for the fire:
If you allocate a large array of ints, they're going to be allocated on the
Large Object Heap. This may have some long-term implications for your
process, as the LOH is never compacted. In the limited memory environment of
a mobile platform, this impact may be much larger than it would be on a
desktop system. For long term stability, something more linked-list based may
be better, as it won't have the appearance of leaking memory...


Ya know, the most efficient technique would be to scan the values, and see if
you can further compact them. You should run something like the Punycode
algorithm over them, so as to guarantee that all of your values are in a
default range. Then you can fit, on average, 2 values into the space that
was taken by 1. You could then (in memory) ZIP the punycoded array, and
compact it by a factor of 5:1. This would give you an overall memory saving
of nearly 10:1.

For further savings, you could dump the values to the screen, and ask the
end-user to remember them for a brief amount of time. Then run a sliding
visual window through, and have the user enter the values. This would let
you completely eliminate the array from memory, thereby significantly
decreasing your hardware specs and opening up your target market.

For long term storage, you should use the built-in IR channel to dump the
array to a printer in a 3D barcode format. When you need to recover the
data, use the built-in camera on the phone as a scanner, and voila, you have
reduced footprint even more.

If you're really brave, see if you can setup a harmonic based on the 802.11
receivers built into many of the Windows Mobile devices. With proper
encoding, you could offload everything to the wireless....

(It's a slow day. Honestly though, there is no cut-and-dried answer.... :) )

--
Chris Mullins
Wingot replied on 20-Nov-07 08:12 PM
*snip*
answer....

Nice one Chris, I love that idea. :)

You could even pretend it is a game for the user to test his memory :P.
Peter Duniho replied on 20-Nov-07 08:57 PM
On 2007-11-20 16:20:42 -0800, "Hilton" <nospam@nospam.com> said:


I disagree that that's the root of the disagreement here.  Hopefully
you can handle the recursion.  :)


No, you accused Jon of "combining two of [your] comments into one".
That's not clarification, that's finger-pointing.


I'm not sure that Jon would agree that "lots of data" has anything to
do with it.  I know I don't.  The amount of data is essentially
irrelevant; the distribution of the data is what matters, along with
code maintainability (something that is almost always in direct
conflict with squeezing the last few percentage points of performance
out of some code).


No disagreement there.  But you continue to refuse to acknowledge and
retract your false statement at the outset, claiming that using a
dictionary implementation would be inefficient.  You've simply shifted
your stance to a point where you can claim a lack of disagreement,
silently abandoning your original premise.

There are some scenarios in which an array implementation would be
slightly more efficient (not "way faster" as you wrote), but that's not
a given and in no case would a dictionary implementation actually be
considered _inefficient_.

Pete
Marc Gravell replied on 21-Nov-07 03:27 AM
Aside, but I did manage to cobble something together that aggregates
efficiently while streaming, allowing usage as below... uses
Dictinary<TKey, ...> approach, and will have a little overhead
compared to doing it manually, but it is a little more expressive (but
works for LINQ-to-objects only; it won't parse down to the database!)

If anybody wants the source let me know (extensions are to
IEnumerable<T>).

Marc

var query = data.GroupAggregate(
        // grouping key (with optional comparer)
        x => x.Type,
        // "params" of desired aggregates
        data.GroupMin(x => x.SomeValue),
        data.GroupSum(x => x.SomeValue),
        data.GroupAvg(x => x.SomeValue),
        data.GroupCount(),
        data.GroupMax(x => (decimal) x.SomeOtherValue)
    )
    // could further compose query via LINQ, but this will do...
    .OrderByDescending(x => x.Key).ThenBy(x => x.Value[0]);

// look at results; key is in x.Key; x.Value is a decimal[] of the accumulators
foreach (var x in query) {
    Console.WriteLine("{0}: {1}\t{2}\t{3}\t{4}",
        x.Key,
        x.Value[0], x.Value[1], x.Value[2], x.Value[3]);
}
Hilton replied on 23-Nov-07 01:43 AM
Chris, I totally agree (as mentioned in previous posts).  If you look at the
OP, the numbers mentioned were: 1,4,6,23,6,2,54,1,6,23,2,4,1,6, hence my
suggestion for using the array.

I think that if there was always one best solution, our skill/art/work
wouldn't be as challenging and rewarding.  :)

Hilton
Jon Skeet [C# MVP] replied on 23-Nov-07 06:58 PM
Just to say I've just implemented this myself, or rather a variation,
due to something else I'm writing at the moment when I want to group
and count. Here's what the unit tests look like, so you can see the
kind of thing it does. Like Marc, let me know if you'd like it...
(It'll be part of MiscUtil when I get round to it.)

Note that my code only does group counting - no summing etc, at the
moment. I'm not sure whether either CountGroup or Marc's GroupCount
names really capture the functionality nicely.  I did call mine
CountGroup, anyway, as you can see in the tests.

[Test]
public void CountGroupIdentity()
{
    var query = new[] { "One", "Two", "Two", "Three" }.CountGroup();

    query = from element in query
            orderby element.Count, element.Key
            select element;

    AssertAreEqual(new[] { 1, 1, 2 },
                   query.Select(elt => elt.Count));
    AssertAreEqual(new[] { "One", "Three", "Two" },
                   query.Select(elt => elt.Key));
}

[Test]
public void CountGroupWithMapping()
{
    // strings chosen so the lengths give group counts of 3, 2 and 4
    var query = new[] { "a", "b", "c", "aa", "bb", "ccc", "ddd", "eee", "fff" }
        .CountGroup(x => x.Length);
    query = from element in query
            orderby element.Count
            select element;

    AssertAreEqual(new[] { 2, 3, 4 },
                   query.Select(elt => elt.Count));
    AssertAreEqual(new[] { 2, 1, 3 },
                   query.Select(elt => elt.Key));
}

[Test]
public void CountGroupWithMappingAndComparison()
{
    var query = new[] { "a", "A", "a", "B", "b" }.CountGroup
        (x => x, StringComparer.OrdinalIgnoreCase);
    query = from element in query
            orderby element.Count
            select element;

    AssertAreEqual(new[] { 2, 3 },
                   query.Select(elt => elt.Count));
    AssertAreEqual(new[] { "B", "a" },
                   query.Select(elt => elt.Key));
}


--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Marc Gravell replied on 25-Nov-07 04:04 AM
(I'm assuming that this is a single-pass counter that discards the
elements and keeps the aggregates)

The group scenario is such a common case that an *efficient*
aggregator probably is worthy of a place in MiscUtil. The more general
case (multiple concurrent aggregates) gets very scrappy very quickly,
but a single-purpose "does one thing well" is a good thing. And IMO
perhaps a very good educational aid for people wanting to understand
LINQ-to-objects. As for the names - it stumped me too ;-p I didn't
stress over it...

It almost seems a shame that this common case didn't make it into the
base-libs so that the same code could port to the database too... but
I guess it helps to keep things simple.

Now... to get some sleep...

Marc
Jon Skeet [C# MVP] replied on 24-Nov-07 02:48 AM
Yup.


Right. I may get round to doing sum/min/max etc at some point. Not just
yet though :)




Indeed. Another thing that I'm rather amazed isn't in the framework is
a way of reading a file, line by line, as an IEnumerable<string>. It's
simple to implement, but *so* handy. Again, it'll be in MiscUtil next
time I do a release (the code's there already, and is more flexible
than just files, of course).
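
A bare-bones version of that idea might look like this (the name ReadLines
is made up here; the MiscUtil version is more flexible, and this assumes
using System.Collections.Generic and System.IO):

static IEnumerable<string> ReadLines(string path)
{
    using (StreamReader reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;   // stream one line at a time, lazily
        }
    }
}

// e.g. foreach (string line in ReadLines("log.txt")) { ... }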

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Marc Gravell replied on 25-Nov-07 04:04 AM
Brain dump...

For info, I think I've reached the conclusion that the optimal
approach here is to get all the common aggregates (and offer no
bespoke), rather than worry about individual aggregate functions
(which make for a lot of overhead) - but the main difficulty arises
from generics not supporting operators like plus (for the accumulator)
and divide (for an average). Min/max are slightly easier thanks to
Comparer<T>.Default. The simplest (but less pleasing?) approach is to
cheat and use "decimal" or similar. As a middle level, support for the
common types could be offered by a private factory, i.e. the generic
Accumulator<TSource, TValue> might need a "Func<TValue, TValue,
TValue> add" and a "Func<TValue, int, TValue> quotient" (just for
Average at the end), with a factory that switches on TValue, using
specific Int32, Single, etc functions. At the far end, the other thing
I had in mind to look at was building an Expression of the same
(probably in a generic static for cache) and compiling it... certainly
easier than using Reflection.Emit, and ?should? cover the wider range
of types?

And of course the other problem is how to offer the different
aggregates to the caller if they want aggregates on multiple
concurrent properties, with different types... again "decimal" makes
this simple but feels unsatisfying - but the alternatives... boxing?
or an IAccumulator with generic wrapper methods like Min<T>() (and
hope the user asks for the same T as TValue)...

But I have other things that are more pressing, and no real /need/ to
look at the above, other than academic interest. I'll probably take
another look at some point...

Marc
Marc Gravell replied on 25-Nov-07 04:04 AM
First off - apologies to the group; this is turning into a bit more
free-thought/discussion than question/answer. As such, it occurs to me that
I should really start blogging this stuff instead (as a more
appropriate forum). If anybody can recommend an appropriate blog site,
I'd be interested...

I looked into the Expression stuff above, and upon consideration I
believe it is a pretty good solution for that old chestnut: how to do
maths with generics. The following works, and is easily extended to
other operators... the generated delegate is fully cached etc...

This has some application to the LINQ / grouping scenario above, but I
suspect it has a lot *more* uses, and it looks like it should
automatically support "JoesRandomStruct" etc...

Marc

using System;
using System.Linq.Expressions;

namespace ConsoleApplication1 {
    class Program {
        static void Main() {
            var data = new[] { 1.2, 3.1, 2.9 };
            Test(data);
            /*double i = 0;
            int j = 1;
            double x = i / j;*/
        }
        static void Test<T>(T[] args) {
            T sum = default(T);
            int count = 0;
            foreach (T t in args) {
                sum = MathOperator<T>.Add(sum, t);
                count++;
            }
            T avg = MathOperator<T>.DivideInt32(sum, count);
        }
    }
}

static class MathOperator<T> {
    private static Func<T, T, T> add, divide;
    private static Func<T, int, T> divideInt32;

    static Func<TLhs, TRhs, TReturn> BuildBinary<TLhs, TRhs, TReturn>(
            Func<Expression, Expression, BinaryExpression> method) {
        ParameterExpression lhs = Expression.Parameter(typeof(TLhs), "lhs"),
                            rhs = Expression.Parameter(typeof(TRhs), "rhs");
        try {
            return Expression.Lambda<Func<TLhs, TRhs, TReturn>>(
                method(lhs, rhs), lhs, rhs).Compile();
        } catch (InvalidOperationException) {
            // try casting rhs to lhs
            Expression castRhs = Expression.Convert(rhs, typeof(TLhs));
            return Expression.Lambda<Func<TLhs, TRhs, TReturn>>(
                method(lhs, castRhs), lhs, rhs).Compile();
        }
    }
    public static T Add(T arg1, T arg2) {
        if (add == null) {
            add = BuildBinary<T, T, T>(Expression.Add);
        }
        return add(arg1, arg2);
    }
    public static T Divide(T arg1, T arg2) {
        if (divide == null) {
            divide = BuildBinary<T, T, T>(Expression.Divide);
        }
        return divide(arg1, arg2);
    }
    public static T DivideInt32(T arg1, int arg2) {
        if (divideInt32 == null) {
            divideInt32 = BuildBinary<T, int, T>(Expression.Divide);
        }
        return divideInt32(arg1, arg2);
    }

}
Jon Skeet [C# MVP] replied on 24-Nov-07 02:46 PM
Fine by me - although I've got nothing to add to the particular
suggestion you've presented here.

What I've been thinking about is a sort of "Split" method which could
take a further query expression as its parameter, and apply that query
expression to each group within a set of results. That way we wouldn't
need to write "group and count", "group and sum" etc - just split,
passing count, sum or whatever as the parameter.

Unfortunately, I can't get this to work in my head with IEnumerable<T>
without creating one thread per group - we get into stack difficulties,
effectively. (Or you have to read the whole lot and cache it first,
which is what we're trying to avoid.) It may not be possible, but that
would certainly be a pity... it's *logically* feasible, but LINQ
doesn't help us as much as we might want.

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Marc Gravell replied on 25-Nov-07 04:04 AM
That is exactly what I tried to do in my Nov 21 implementation - passing
the aggregation in as a parameter (in fact a set of "params" parameters).
Each of these "Sum" etc parameters was a seed plus a Func<Func<...>> - i.e.
a factory method that returned an accumulator method. The first time a new
branch is found, it invokes the *outer* method to get hold of the
accumulator method to use for that branch... which might be a common
one, or might be stateful. For existing branches it simply cranks the
handle on the accumulator it got previously. This means it only stores
one state per branch (group) [per accumulator].

I'm not entirely happy with my implementation (in fact I intend to
redo from scratch - it passes the selector too far into the stack and
uses "decimal", etc), but unless I've misunderstood your post we are
thinking along the same lines. Again, while I think the concept is OK,
I stress that I'm not very happy with the current /code/, hence I'm
not wildly posting it here in case it just adds confusion, but I can e-
mail it if you are interested.

At the moment the crank-handle drives aggregates (counters, sum, etc),
but the same approach (i.e. a factory returning new crank-handles for
each group) could theoretically drive any operation that just needs
the group-key and some prior "so far" state - so it could start
progressing the first group even though the last entry (for that
group) is at the end of the stream.

Marc
Jon Skeet [C# MVP] replied on 25-Nov-07 03:11 AM
I don't *think* so. As far as I can see, your aggregation works on a
single value at a time, so you can't use the existing extension
methods. I was thinking of something like:

var query = data.Split (x => x.Count());

foreach (var group in query)
{
    Console.WriteLine (group.Key + ": " + group.Result);
}

Here the x in the Split lambda would be an IEnumerable<T> of the same
kind as data, or possibly IGrouping<T> to allow the lower levels to
access the key if they wanted - but Count() is the standard
Enumerable.Count().

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Marc Gravell replied on 25-Nov-07 04:05 AM
Ah right, so each branch yields an IEnumerable over the source
entities for that key...

I see what you mean... the thread approach is fine until you get lots
of groups...

Marc
Jon Skeet [C# MVP] replied on 25-Nov-07 03:32 AM
Indeed. It feels like it *should* be feasible, but basically we need to
push the data rather than pulling it. Very annoying.

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Marc Gravell replied on 27-Nov-07 02:23 AM
Well, for data where we know the grouping key is ascending (to use
your log-file example, grouping per day or per week) there is an
obvious single-threaded optimisation, since we can simply "yield
break" when we find the key changes, and start a new IEnumerable<T>.

In the more general case... for small volume the inbuilt (Lookup<,>
based) code suffices, so I'm thinking of the worst case high-volume.
In this case, *generally* the number of records is order(s) of
magnitude greater than the number of groups. I'm excluding scenarios
like "group orders by customer" where each customer only has a small
number of orders (perhaps a database is the answer there!); but things
like "group site hits by [major] browser-flavor" would be perfectly
reasonable.

In such volume cases, pre-loading everything is a non-starter, and the
simplest non-cached drip of data (over m=groups threads) would lead to
a lot of idle threads, but there is perhaps a middle-ground for those
cases where you want a bit more parallelism (rather than 20 threads
only 1 of which is doing anything). If you allocated each group a
*small* queue (either a Queue<T> or T[] or whatever), you would then
hopefully find (assuming reasonable distribution) that *most* of the
time there is space in the queue for the item you are adding. If full
it could Monitor.Wait on the group's sync-lock until there is space to
add. If the queue was empty prior to adding it could Monitor.Pulse the
group. And the reverse for the enumerator: while the queue is empty
then Monitor.Wait; if the queue was full prior to dequeing then
Monitor.Pulse. And some close-down test for the end-of-stream.
Essentially parallel single-producer/single-consumer queues, but where
the producer is writing to multiple such queues.
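
A bare-bones sketch of one such per-group queue (illustrative only: single
producer, single consumer, no end-of-stream handling; assumes using
System.Collections.Generic and System.Threading):

class BoundedQueue<T>
{
    private readonly Queue<T> queue = new Queue<T>();
    private readonly int capacity;
    public BoundedQueue(int capacity) { this.capacity = capacity; }

    public void Enqueue(T item)
    {
        lock (queue)
        {
            while (queue.Count == capacity)
                Monitor.Wait(queue);        // full: wait for the consumer
            bool wasEmpty = queue.Count == 0;
            queue.Enqueue(item);
            if (wasEmpty)
                Monitor.Pulse(queue);       // queue was empty: wake the consumer
        }
    }

    public T Dequeue()
    {
        lock (queue)
        {
            while (queue.Count == 0)
                Monitor.Wait(queue);        // empty: wait for the producer
            bool wasFull = queue.Count == capacity;
            T item = queue.Dequeue();
            if (wasFull)
                Monitor.Pulse(queue);       // queue was full: wake the producer
            return item;
        }
    }
}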

Obviously pool-threads would be risky due to saturation, but regular
Threads could be used (perhaps at lower priority than the thread that
is reading the stream).

You've probably already considered this, though...

Marc
Marc Gravell replied on 27-Nov-07 02:23 AM
As an aside - with some luck it could also elect *not* to start a new
thread until the group's queue is full and needs pumping... then at the
end the primary thread could mop up any groups without threads (or
even share them among the threads that *do* exist, but this is more
complex).

That way, you don't get idling threads for things that you only see
once or twice...

Marc
Jon Skeet [C# MVP] replied on 25-Nov-07 07:39 AM
That's an interesting idea - I suspect it would make the implementation
pretty nasty though.

I think I'll have a go at implementing it the naive "thread per group"
way, for the sake of interest, and write a blog post about it... when
I've time, of course.

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk