PowerShell Challenge – Beating Hadoop with Posh

Update

Read the follow-up with in-depth analysis for many of the techniques used by our combatants!

Premise

I saw this super interesting read online over the weekend:

Command line tools can be 235x faster than Hadoop

In this post, the author posits that he can crunch numbers from the Linux command line MUCH faster than Hadoop can!

If he can do that, surely we can beat the Hadoop cluster too. Then I started wondering how I would replicate this in PowerShell, and thus this challenge was born.

Challenge

  • Download the repo here (2 GB!), unzip it and keep the first 10 folders
  • This equates to ~3.5 GB, which is roughly the same amount of data as in the original post
  • Be sure to only parse the first 10 folders 🙂 (you can delete RebelSite, Twic and WorldChampionships)
  • Iterate through all of the chess record files it contains (*.pgn) and parse out each record. We need to return a total count of black wins, white wins and draws. To read a PGN:

Continue reading
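
To give a feel for the counting step, here is a minimal sketch of one possible starting point, not a tuned or winning entry. It assumes each game carries the standard PGN `[Result "..."]` tag, where `1-0` is a white win, `0-1` a black win and `1/2-1/2` a draw. The function name `Get-PgnResultCount` and the output property names are made up for illustration.

```powershell
# Hypothetical helper: tallies game outcomes from PGN [Result "..."] tag lines.
# Assumes one tag per line, as in standard PGN export format.
function Get-PgnResultCount {
    param(
        [Parameter(Mandatory)]
        [string]$Path   # root folder that holds the *.pgn files
    )

    $white = 0
    $black = 0
    $draw  = 0

    foreach ($file in Get-ChildItem -Path $Path -Filter *.pgn -Recurse -File) {
        # ReadLines streams the file lazily instead of loading it all into memory.
        foreach ($line in [System.IO.File]::ReadLines($file.FullName)) {
            # Skip everything that is not a Result tag (moves, other tags, blanks).
            if (-not $line.StartsWith('[Result ', [System.StringComparison]::Ordinal)) {
                continue
            }
            switch -Wildcard ($line) {
                '*"1-0"*'     { $white++ }
                '*"0-1"*'     { $black++ }
                '*"1/2-1/2"*' { $draw++ }
                # Anything else (e.g. "*" for unfinished games) is not counted.
            }
        }
    }

    [pscustomobject]@{
        WhiteWins = $white
        BlackWins = $black
        Draws     = $draw
    }
}
```

Calling it would look something like `Get-PgnResultCount -Path .\ChessData`, where `.\ChessData` stands in for wherever you unzipped the first 10 folders. Wrapping the call in `Measure-Command { ... }` gives a rough timing to compare against.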