How not to write a LINQ query

I recently ran into a performance problem in my vote scoring strategy. The strategy was developed using TDD and passed its tests with flying colors when run against 6 contestants and 20 or so votes. All edge cases were covered and the calculations were correct. It wasn't until production reached 24 contestants and the vote count approached 4,200 that performance problems were observed.

So I wrote a load test against the vote scoring strategy, charted the run times, and found that the algorithm was O(n^2): the time to complete in milliseconds was roughly 0.04v + 0.000285v^2, where v is the number of votes. Based on this regression equation, I could predict that 1 million votes, a number I think we should be able to handle, would take about 79 hours to score. At that point it was clearly worth investing some time in improving the algorithm.
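
As a quick sanity check on that 79-hour figure, plugging one million votes into the regression is straightforward (a throwaway snippet, not part of the scoring code):

double v = 1_000_000;
double milliseconds = 0.04 * v + 0.000285 * v * v;   // 40,000 + 285,000,000 ms
Console.WriteLine(milliseconds / 1000 / 60 / 60);     // ≈ 79 hours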

Charted Runtimes

Votes Milliseconds
1000 323
2000 1252
3000 2832
4000 4751
5000 7308
6000 10411
7000 14029
8000 18622
9000 23407
10000 28825

Inefficient algorithm:

public async Task<IEnumerable<ContestantWithScore>> GetRankings(int warId)
{
    var contestants = await _contestantRepository.GetAll(warId);
    var matches = await _matchRepository.GetAll(warId);
    var votes = await _voteRepository.GetAll(warId);

    var scores = (from v in votes.Where(v => v.Choice != VoteChoice.Pass)
                  from m in matches.Where(m => m.Id == v.MatchId)
                  select (v.Choice == VoteChoice.Contestant1Won ? m.Contestant1 : m.Contestant2))
                 .GroupBy(x => x)
                 .Select(group => new { Id = group.Key, Score = group.Count() });

    var results = from c in contestants
                  from s in scores.Where(s => s.Id == c.Id).DefaultIfEmpty()
                  select new ContestantWithScore
                  {
                      Contestant = c,
                      Score = s == null ? 0 : s.Score
                  };
    return results;
}

The first problem that stood out to me is how the code alternates between LINQ query keywords and LINQ extension methods. The first "join" happens after a filter is applied to the votes; the second "join" applies its filter to the second data set, scores. It took me a moment to spot the biggest culprit here, though: those are not joins. They are "select many" operations! This is not an optimized join operation; it is brute force. The query iterates over every element of the vote collection and, for each vote, iterates over every element of the matches collection looking for the ones whose Id property equals the current vote's MatchId. It's no wonder performance took a dive every time we added new votes. If there is one vote for every match (as is likely; let's ignore the possibility that no one voted on a particular match-up) and there are 4,000 randomly generated match-ups, then we scan 4,000 matches once for each of 4,000 votes, a total of 16 million iterations. Add just one more match-up and vote, and we iterate an additional 8,001 times. This is clearly an O(n^2) algorithm.
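
To make the shape of the problem concrete, that query behaves roughly like the nested loops below (a simplified sketch; I'm assuming Contestant1 and Contestant2 are contestant ids, which is how the GroupBy treats them):

var winners = new List<int>();              // assuming contestant ids are ints
foreach (var v in votes)                    // outer pass: every vote
{
    if (v.Choice == VoteChoice.Pass) continue;

    foreach (var m in matches)              // inner pass: scans ALL matches for every vote
    {
        if (m.Id == v.MatchId)
        {
            winners.Add(v.Choice == VoteChoice.Contestant1Won
                ? m.Contestant1
                : m.Contestant2);
        }
    }
}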

Though it is written the same way, we can probably overlook the contestant "select many" since the number of contestants should be relatively low compared to the number of votes.

I rewrote the whole algorithm to properly take advantage of LINQ's join operations. The improved algorithm's run times appear much more linear: roughly O(n), with the time to complete in milliseconds near 0.00155v, where v is the number of votes. This is a vast improvement over the previous algorithm. I was able to calculate scores for one million votes in just over a second. That is a number I can work with, and I can look at other solutions such as caching if I need to improve production speeds further.

Charted Runtimes

Votes Milliseconds
0 0
200000 292
400000 500
600000 941
800000 953
1000000 1201
1200000 1512
1400000 2946
1600000 3317
1800000 2526
2000000 2832

Improved algorithm:

public async Task<IEnumerable<ContestantWithScore>> GetRankings(int warId)
{
    var contestants = await _contestantRepository.GetAll(warId);
    var matches = await _matchRepository.GetAll(warId);
    var votes = await _voteRepository.GetAll(warId);

    // A real join: each vote is matched to its match-up by key.
    var winners = from v in votes
                  join m in matches on v.MatchId equals m.Id
                  where v.Choice != VoteChoice.Pass
                  select (v.Choice == VoteChoice.Contestant1Won ? m.Contestant1 : m.Contestant2);

    // Tally one point per winning vote for each contestant.
    var scores = from w in winners
                 group w by w into g
                 select new { Id = g.Key, Score = g.Count() };

    // Left join so contestants with no winning votes still appear with a score of 0.
    var results = from c in contestants
                  join s in scores on c.Id equals s.Id into scoredGroup
                  from score in scoredGroup.DefaultIfEmpty()
                  select new ContestantWithScore
                  {
                      Contestant = c,
                      Score = score == null ? 0 : score.Score
                  };
    return results;
}
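
The reason this version scales is that LINQ to Objects' join builds a hash-based lookup over the matches once and then streams the votes against it, instead of rescanning the matches for every vote. Conceptually it is close to the sketch below (assuming match ids are unique; the real join uses an internal lookup that also tolerates duplicate keys):

// Conceptual equivalent of the vote-to-match join: one pass to index, one pass to stream.
var matchesById = matches.ToDictionary(m => m.Id);        // O(matches) to build
var winners = new List<int>();                            // assuming contestant ids are ints
foreach (var v in votes)                                  // O(votes) to stream
{
    if (v.Choice == VoteChoice.Pass) continue;
    if (matchesById.TryGetValue(v.MatchId, out var m))
    {
        winners.Add(v.Choice == VoteChoice.Contestant1Won ? m.Contestant1 : m.Contestant2);
    }
}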

The good news was that tests were already in place to guarantee that no functionality was lost in my refactoring. The lesson learned is to spot a "select many" that is doing the job of a join.
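
If I had to distill that smell into a pattern to watch for, it would be this (a toy example over plain int arrays, illustrative only):

var outer = new[] { 1, 2, 3 };
var inner = new[] { 2, 3, 4 };

// Smells like a join, scales like a nested loop: every outer element scans all of inner.
var slow = from a in outer
           from b in inner.Where(x => x == a)
           select a;

// The same result as a real join, which indexes inner once and streams outer.
var fast = from a in outer
           join b in inner on a equals b
           select a;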