PowerShell Challenge – Beating Hadoop with Posh

Update

Read the follow-up for an in-depth analysis of many of the techniques used by our combatants!


 

Premise

Saw this super interesting read online over the weekend:

Command line tools can be 235x faster than Hadoop

In this post, the author posits that he can crunch numbers from the Linux command line MUCH faster than Hadoop can!

If he can do that, surely we can beat the Hadoop cluster too… Then I started wondering how I would replicate this in PowerShell, and thus this challenge was born…

Challenge

  • Download the repo here (2 GB!), unzip it, and keep only the first 10 folders
  • This equates to ~3.5 GB, which is roughly the same data size as in the original post
  • Be sure to only parse the first 10 folders 🙂

    You can delete RebelSite, Twic and WorldChampionships
  • Iterate through all of the chess record files it contains (*.pgn) and parse each record out. We need to return a total count of black wins, white wins, and draws (a minimal parsing sketch follows this list). To read a PGN:

We are only interested in the results of the game, which only have 3 real outcomes. The 1-0 case means that white won, the 0-1 case means that black won, and the 1/2-1/2 case means the game was a draw. There is also a case meaning the game is ongoing or cannot be scored, but we ignore that for our purposes.

  • Use solid PowerShell best practices, pipelines or whatever you want to beat the Hadoop cluster’s time of 26 minutes!
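
To make the rules concrete, here is a minimal, unoptimized sketch of the counting logic; it assumes the unzipped folder is named ChessData-master and makes no attempt at contest-winning speed:

    # Count wins and draws across every PGN file (slow but simple baseline)
    $counts = @{ White = 0; Black = 0; Draw = 0 }
    Get-ChildItem .\ChessData-master -Filter *.pgn -Recurse |
        Select-String -Pattern '\[Result "(1-0|0-1|1/2-1/2)"\]' |
        ForEach-Object {
            switch ($_.Matches[0].Groups[1].Value) {
                '1-0'     { $counts.White++ }
                '0-1'     { $counts.Black++ }
                '1/2-1/2' { $counts.Draw++  }
            }
        }
    [pscustomobject]$counts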

To enter

Post a comment with a link to a GitHub Gist or the code you used to solve this problem. Have it in by March 19th.

Winners will be determined by my decision, one from each of these categories:

  • Creative Coder Award – could be the tersest, the most ‘Dave Wyatt’, or the most .NET
  • Most ‘Best Practice’ Award – if you’re emitting objects and embracing the teachings of Snover, you’re in the running
  • The So Fast Award – fastest wins, bar none

Remember, the OP from the other thread did all of this in a single Linux pipeline. PowerShell is all about the pipeline, so embrace it to win!

Here’s how this will work: once the time is up, I’ll take everyone’s final submission and script them to run one after the next, with a console reset in between. I’ll run them on my PC with a newer i7 and put the input files on a RAM disk.

This will decide the speed winner. The other two will be hand-selected by me. I’ll write up a short post with the findings and announce the winners on the 20th.
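
If you want to sanity-check your own timings the same way, a rough sketch of that kind of harness is below; the Submissions folder and the R:\ RAM disk path are placeholder assumptions, not the actual test rig:

    # Run each entry in a brand-new powershell.exe so no state carries over between runs
    foreach ($entry in (Get-ChildItem .\Submissions\*.ps1)) {
        $elapsed = Measure-Command {
            Start-Process powershell.exe -ArgumentList '-NoProfile', '-File', $entry.FullName `
                -WorkingDirectory 'R:\ChessData' -Wait -WindowStyle Hidden
        }
        '{0}: {1:n1} seconds' -f $entry.Name, $elapsed.TotalSeconds
    }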

The luxurious prizes

Winners will get their pick from this pile of GitHub and PowerShell stickers!


I’ll mail it to you, unless you live in the middle of nowhere in Europe and shipping will kill me.

Have your entry in by March 19th!

I’ll post my best attempt once the entries are in!

 


25 thoughts on “PowerShell Challenge – Beating Hadoop with Posh”

  1. Kevin Marquette (@KevinMarquette) March 1, 2016 / 1:33 pm

    This is about as simple and quick as I could make it.

    Feels like it eats way too much RAM. I’ll see if I can fix that in a future submission. I’ll get more creative with the next one.

    • Simon March 6, 2016 / 3:24 pm

      I made a small tweak and it is running about 7 seconds quicker on my machine /Simon

      function Get-ChessResult2
      {
          [CmdletBinding()]
          Param(
              [Parameter(
                  Position = 0,
                  Mandatory = $true,
                  ValueFromPipeline = $true,
                  ValueFromPipelineByPropertyName = $true
              )]
              [String]$Fullname
          )

          begin
          {
              # One counter per distinct [Result "..."] tag, accumulated across every file
              $results = @{}
          }

          process
          {
              Select-String -Path $Fullname -Pattern 'Result' |
                  ForEach-Object { $results[$_.Line.Trim()]++ }
          }

          end
          {
              # Project the accumulated counts into a single summary object
              Write-Output ($results | Select-Object @{N='White';E={$_.'[Result "1-0"]'}},
                                                     @{N='Black';E={$_.'[Result "0-1"]'}},
                                                     @{N='Tie';E={$_.'[Result "1/2-1/2"]'}})
          }
      }

      Measure-Command {
          Get-ChildItem .\ChessData-master -Filter *.pgn -Recurse | Get-ChessResult2
      }

  2. DavidKuehn March 2, 2016 / 2:41 am

    Not a one-liner and only uses a single core, but I’m really just glad it beats 26 minutes.

    • DavidKuehn March 2, 2016 / 9:15 pm

      Not sure why the link didn’t show up in my initial submission:

      It’s a nine+ minute runtime, which should put me squarely in last place!

      • FoxDeploy March 7, 2016 / 1:46 pm

        Don’t worry, I found the link, and it’s been added to the leaderboards.

    • Martin9700 March 2, 2016 / 2:52 pm

      Did I mention the sub 3 minute run time?

  3. Martin9700 March 2, 2016 / 2:50 pm

    I tried to add some multi-threading using Boe’s PoshRSJob (Runspaces) but those actually ran slower!

  4. Craig Duff March 4, 2016 / 8:57 pm

    Here’s my go.

    • Craig Duff March 4, 2016 / 10:18 pm

      Edited the regex to capture the case where there are multiple spaces between Result and the first quote.

  5. Mathias Jessen (@IISResetMe) March 5, 2016 / 11:04 pm

    Going for speed? Compiled code is your friend!

    We can utilize `System.Threading.Tasks.Parallel.Foreach()` and a `ConcurrentDictionary` to read the pgn files in parallel:
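
    A rough sketch of that approach, compiling a small helper with Add-Type, might look like the following; the class name and regex here are illustrative assumptions rather than Mathias’ actual code:

    # Compile a tiny C# helper that scans all PGN files in parallel
    Add-Type -TypeDefinition @'
    using System.IO;
    using System.Text.RegularExpressions;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    public static class PgnCounter
    {
        // Returns a thread-safe dictionary of result tag -> count
        public static ConcurrentDictionary<string, int> Count(string root)
        {
            var counts = new ConcurrentDictionary<string, int>();
            var regex  = new Regex("\\[Result \"(.+?)\"\\]", RegexOptions.Compiled);
            Parallel.ForEach(Directory.EnumerateFiles(root, "*.pgn", SearchOption.AllDirectories), file =>
            {
                foreach (var line in File.ReadLines(file))
                {
                    var m = regex.Match(line);
                    if (m.Success) { counts.AddOrUpdate(m.Groups[1].Value, 1, (k, v) => v + 1); }
                }
            });
            return counts;
        }
    }
'@

    # Resolve-Path gives the helper a full path, since .NET's working directory
    # can differ from the current PowerShell location
    [PgnCounter]::Count((Resolve-Path .\ChessData-master).Path)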

    • FoxDeploy March 6, 2016 / 10:17 am

      Can you write an example of how to do that? Not sure how to implement this (if possible) in PowerShell.

    • Boe Prox March 6, 2016 / 7:07 pm

      I looked at that (System.Threading.Tasks.Parallel.Foreach()) briefly on Thursday while trying to make it work purely in PowerShell, but I moved on when it threw errors about lacking a runspace to execute the scriptblock whenever more than one item was passed into it. I also wanted to avoid any sort of code compiling and focus purely on making stuff work in PowerShell. I may come back to it later, though, and see if I can find a way to make it work.

  6. Tore Groneng (@ToreGroneng) March 7, 2016 / 6:40 am

    hi,

    Created mine yesterday using C#. I am going for fast and ugly 🙂 @KevinMarquette’s function takes 2 minutes and 46 seconds on my computer (it is a 4.5-year-old Lenovo). My function takes 20-25 seconds.

  7. Øyvind Kallstad March 9, 2016 / 3:27 pm

    This is my entry. I found Mathias’ code to be near perfect, so I had no choice but to reuse his work and replace some parts of it with slightly tighter code 😉 So it’s kind of a collaboration between Mathias and me 🙂

  8. Kevin Marquette (@KevinMarquette) July 13, 2016 / 9:07 pm

    Can you update this post with a link to the resulting analysis? I keep referring to the results and sharing them with others. It is great to share any time the speed of a script comes up.

    • FoxDeploy July 13, 2016 / 9:17 pm

      Will do. Check back tomorrow am (can’t edit well from phone!)
