PowerShell Challenge – Beating Hadoop with Posh

Update

Read the follow-up with in-depth analysis for many of the techniques used by our combatants!


 

Premise

I saw this super interesting read online over the weekend:

Command line tools can be 235x faster than Hadoop

In this post, the author posits that he can crunch numbers from the Linux command line MUCH faster than Hadoop can!

If he can do that, surely we can beat the Hadoop cluster too. Then I started wondering how I would replicate this in PowerShell, and thus this challenge was born.

Challenge

  • Download the repo here (2 GB!), unzip it and keep the first 10 folders
  • This equates to ~3.5 GB, which is roughly the same data size as in the original post
  • Be sure to only parse the first 10 folders 🙂 You can delete RebelSite, Twic and WorldChampionships
  • Iterate through all of the chess record files (*.pgn) it contains and parse each record out. We need to return a total count of black wins, white wins and draws. To read a PGN:

We are only interested in the results of the game, which only have 3 real outcomes. The 1-0 case means that white won, the 0-1 case means that black won, and the 1/2-1/2 case means the game was a draw. There is also a case meaning the game is ongoing or cannot be scored, but we ignore that for our purposes.

  • Use solid PowerShell best practices, pipelines or whatever you want to beat the Hadoop cluster’s time of 26 minutes!
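
As a rough starting point, the tally described above might be sketched like this. It is not a contest entry and is not tuned for speed. The function name `Get-PgnResultCount` and the `C:\ChessData` path in the usage note are hypothetical, and the sketch assumes each game carries one `[Result "..."]` tag line. Any other result value (the PGN format uses `*` for unfinished games) is simply skipped.

```powershell
function Get-PgnResultCount {
    [CmdletBinding()]
    param(
        # Lines of PGN text, for example from Get-Content
        [Parameter(ValueFromPipeline)]
        [string]$Line
    )
    begin {
        # Running totals, kept in a fixed order for readable output
        $count = [ordered]@{ White = 0; Black = 0; Draw = 0 }
    }
    process {
        # Only the result tag matters; every other line is ignored
        if ($Line -match '^\[Result\s+"(1-0|0-1|1/2-1/2)"\]') {
            switch ($Matches[1]) {
                '1-0'     { $count['White']++ }
                '0-1'     { $count['Black']++ }
                '1/2-1/2' { $count['Draw']++ }
            }
        }
    }
    end {
        # Emit a single object so the totals can be piped onward
        [pscustomobject]$count
    }
}
```

It could be fed with something like `Get-ChildItem C:\ChessData -Filter *.pgn -Recurse | Get-Content | Get-PgnResultCount`. Pushing every line through the pipeline is the simple option, not the fast one.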

To enter

Post your comment with a link to a GitHub Gist or the code you used to solve this problem.  Have it in by March 19th.

I’ll pick the winners, one from each of these categories:

  • Creative Coder Award: could be the tersest, the most ‘Dave Wyatt’, or the most .NET
  • Most ‘Best Practice’ Award: if you’re emitting objects and embracing the teachings of Snover, you’re in the running
  • The So Fast Award: fastest wins, bar none

Remember, OP from the other thread did this all in a Linux pipeline. PowerShell is all about the pipeline. So embrace it to win!

Here’s how this will work. Once the time is up, I’ll take everyone’s final submission and then script them to run one after the next with a console reset in between. I’ll run them on my pc with a newer i7 and put the input files on a ramdisk.
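
A timing loop along those lines might look like the sketch below. The function name `Measure-Entry` and the `R:\Entries` and `R:\ChessData` paths in the usage note are placeholders. Launching each script in a fresh PowerShell process is just one way to approximate a console reset, and the sketch assumes every entry accepts the data folder as its first argument.

```powershell
function Measure-Entry {
    [CmdletBinding()]
    param(
        # Paths to the submitted .ps1 files (hypothetical layout)
        [Parameter(Mandatory)][string[]]$ScriptPath,
        # Folder holding the .pgn files, passed to each entry as its first argument
        [string]$DataPath = '.'
    )
    # Reuse whichever PowerShell executable is running this session
    $hostExe = (Get-Process -Id $PID).Path
    foreach ($entry in $ScriptPath) {
        # A fresh process per entry stands in for the console reset:
        # no compiled types or variables carry over between runs
        $elapsed = Measure-Command {
            & $hostExe -NoProfile -File $entry $DataPath | Out-Null
        }
        [pscustomobject]@{
            Entry   = Split-Path $entry -Leaf
            Seconds = [math]::Round($elapsed.TotalSeconds, 2)
        }
    }
}
```

For example, `Measure-Entry -ScriptPath (Get-ChildItem R:\Entries\*.ps1).FullName -DataPath R:\ChessData | Sort-Object Seconds` would list the entries from fastest to slowest. Each measurement includes the process start-up time, which is the same small overhead for every entry.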

This will decide the speed winner. The other two will be hand selected, by me. I’ll write up a short post with the findings and announce the winners on the 20th.

The luxurious prizes

Winners will get their pick from this pile of GitHub and PowerShell stickers!

image

I’ll mail it to you unless you live in the middle of nowhere Europe and shipping will kill me.

Have your entry in by March 19th!

I’ll post my best attempt once the entries are in!

 

25 thoughts on “PowerShell Challenge – Beating Hadoop with Posh”

  1. Kevin Marquette (@KevinMarquette) March 1, 2016 / 1:33 pm

    This is about as simple and quick as I could make it.

    <#
    .Example
    .\HadoopVSPowershell.ps1
    .Author
    @KevinMarquette
    .Notes
    You may need to [GC]::Collect() after this because it uses a lot of RAM
    #>
    [cmdletbinding()]
    param([string]$Path = (Get-Location))
    process
    {
    $count = @{}
    ls $Path -Include *.pgn -Recurse | Get-Content | ?{$_ -match 'Result '} | %{$count[$_]+=1}
    Write-Output $count
    }

    Feels like it eats way too much RAM. I’ll see if I can fix that in a future submission. I’ll get more creative with the next one.


    • Simon March 6, 2016 / 3:24 pm

      I made a small tweak and it is running about 7 seconds quicker on my machine /Simon

      function Get-ChessResult2
      {
      [CmdletBinding()]
      Param(
      [Parameter(
      Position = 0,
      Mandatory = $true,
      ValueFromPipeline = $true,
      ValueFromPipelineByPropertyName = $true
      )]
      [String]$Fullname
      )

      begin
      {
      # Tally keyed by the matched tag line, e.g. '[Result "1-0"]'
      $results = @{}
      }

      process
      {
      # Add this file's matches to the running tally instead of overwriting it
      Select-String -Path $Fullname -Pattern 'Result' | ForEach-Object { $results[$_.Line] += 1 }
      }
      end
      {
      Write-Output ([pscustomobject]$results | Select-Object @{N='White';E={$_.'[Result "1-0"]'}},@{N='Black';E={$_.'[Result "0-1"]'}},@{N='Tie';E={$_.'[Result "1/2-1/2"]'}})
      }
      }

      Measure-Command {
      Get-ChildItem .\ChessData-master -Filter *.pgn -Recurse | Get-ChessResult2
      }


  2. DavidKuehn March 2, 2016 / 2:41 am
    <#
    https://foxdeploy.com/2016/03/01/powershell-challenge-beating-hadoop-with-posh/
    9m 25s 643ms on my machine, Core i5-2500K @ 3.30Ghz
    Only uses a single core so there is much room to improve.
    Update $Path to point to the folder containing the folders of .pgn files.
    Change "select -First 10" to "select -First 1" for quick test runs.
    #>
    $Path = 'V:\ChessData Project\ChessData-master\ChessData-master'
    Set-Location -Path $Path
    Measure-Command -Expression {
    [int]$W = 0
    [int]$B = 0
    [int]$D = 0
    Get-ChildItem -Directory | sort Name | select -First 10 | % {
    Get-ChildItem -Path $_.FullName -File -Filter *.pgn | % {
    Write-Verbose "$($_.FullName) $($($_.length)/1MB)" -Verbose
    $CurrentFile = New-Object System.IO.StreamReader($_.FullName)
    while (($CurrentLine = $CurrentFile.ReadLine()) -ne $null) {
    if ($CurrentLine -match 'result'){
    $M = (($CurrentLine -split '-')[0]).ToString()[-1]
    if($M -eq '2'){$D++}elseif($M -eq '1'){$W++}elseif($M -eq '0'){$B++}
    }
    }
    $CurrentFile.Close()
    }
    }
    $Result = [pscustomobject]@{
    White = $W
    Black = $B
    Draw = $D
    }
    }
    $Result


    Not a one-liner and only uses a single core, but I’m really just glad it beats 26 minutes.


    • DavidKuehn March 2, 2016 / 9:15 pm

      Not sure why the link didn’t show up in my initial submission:

      It’s a nine+ minute runtime, which should put me squarely in last place!


      • FoxDeploy March 7, 2016 / 1:46 pm

        Don’t worry, I found the link, and it’s been added to the leaderboards.


  3. Martin9700 March 2, 2016 / 2:47 pm

    Here’s my take:

    [CmdletBinding()]
    Param (
    [ValidateScript({ Test-Path $_ } )]
    [string]$Path = "c:\Test"
    )
    $ResultHash = @{
    "1-0" = 0
    "0-1" = 0
    "1/2-1/2" = 0
    }
    $DisplayHash = @{
    "1-0" = "White"
    "0-1" = "Black"
    "1/2-1/2" = "Draw"
    }
    $Start = Get-Date
    $Results = ForEach ($Dir in (Get-ChildItem $Path -Directory | Sort Name | Select -First 10))
    {
    ForEach ($File in (Get-ChildItem "$($Dir.FullName)\*.pgn" -File))
    {
    $RawFile = (New-Object System.IO.StreamReader -ArgumentList $File.FullName).ReadToEnd()
    ForEach ($Group in ($RawFile | Select-String -AllMatches -Pattern '\[Result \"(.*)\"').Matches)
    {
    $ResultHash[$Group.Groups[1].Value] ++
    }
    }
    }
    $ResultHash.Keys | Where { $_ -ne "*" } | Sort | Select @{Name="Winner";Expression={ $DisplayHash[$_] }},@{Name="Count";Expression={ $ResultHash[$_] }}
    Write-Verbose $(New-TimeSpan -Start $Start -End (Get-Date)) -Verbose


  4. Martin9700 March 2, 2016 / 2:50 pm

    I tried to add some multi-threading using Boe’s PoshRSJob (Runspaces) but those actually ran slower!


  5. Irwin Strachan March 4, 2016 / 4:03 am

    Here’s my take…

    [CmdletBinding()]
    Param (
    [ValidateScript({ Test-Path $_ } )]
    [string]$ChessMatchesPath = 'C:\Users\Irwin\Downloads\ChessData-master\ChessData-master',
    [switch]$DisplaySumTotal
    )
    #region Define regex results. Make search case-insensitive just in case.
    [regex]$whiteWin = '(?i)result\s+"1-0"'
    [regex]$blackWin = '(?i)Result\s+"0-1"'
    [regex]$Tie = '(?i)result\s+"1/2.*'
    #endregion
    #region Main Process All pgn files
    Get-ChildItem -Path $ChessMatchesPath -File -Filter *.pgn -Recurse |
    ForEach-Object {
    $pgnFile = Get-Content $_.FullName -Raw
    [PSCustomObject]@{
    FileReference = $_.Name
    WhiteWin = @($whiteWin.Matches($pgnFile)).Count
    BlackWin = @($blackWin.Matches($pgnFile)).Count
    Tie = @($Tie.Matches($pgnFile)).Count
    }
    } -OutVariable ChessResults
    If($DisplaySumTotal){
    "`n`r"
    'Total wins White: {0}' -f ($ChessResults.WhiteWin | Measure-Object -Sum).Sum
    'Total wins Black: {0}' -f ($ChessResults.BlackWin | Measure-Object -Sum).Sum
    'Total ties: {0}' -f ($ChessResults.Tie | Measure-Object -Sum).Sum
    }
    #endregion


    • Craig Duff March 4, 2016 / 10:18 pm

      Edited regex to capture case where there are multiple spaces between Result and the first quote.


  6. Mathias Jessen (@IISResetMe) March 5, 2016 / 11:04 pm

    Going for speed? Compiled code is your friend!

    We can utilize `System.Threading.Tasks.Parallel.Foreach()` and a `ConcurrentDictionary` to read the pgn files in parallel:

    param(
    [parameter(Position=0)]
    $ChessFolder = 'D:\iisresetme\ChessData\'
    )
    # Prepare C# method to read and process the files
    $Challenge = @{
    Name = 'ResultCounter'
    Namespace = 'ChessData'
    PassThru = $true
    UsingNamespace = @(
    'System.Collections',
    'System.Collections.Concurrent',
    'System.Collections.Generic',
    'System.IO',
    'System.Threading.Tasks'
    )
    MemberDefinition = @'
    public static Hashtable CountChessResults(string folder)
    {
    // Use a ConcurrentDictionary to avoid concurrent updates from Parallel.ForEach() overwriting each other
    ConcurrentDictionary<string, int> winners = new ConcurrentDictionary<string, int>(new Dictionary<string, int>() { { "Draw", 0 }, { "Black", 0 }, { "White", 0 } });
    Parallel.ForEach<string>(Directory.EnumerateFiles(folder, "*.pgn", SearchOption.AllDirectories), fileName =>
    {
    // StreamReader.ReadLine() seems to be the fastest text reader
    using (StreamReader reader = new StreamReader(fileName))
    {
    string line;
    while ((line = reader.ReadLine()) != null)
    {
    if (line.StartsWith("[Result "))
    {
    if(line.Contains("1/2"))
    winners.AddOrUpdate("Draw", 1, (k, v) => v + 1);
    else if(line.Contains("1-0"))
    winners.AddOrUpdate("White", 1, (k, v) => v + 1);
    else if(line.Contains("0-1"))
    winners.AddOrUpdate("Black", 1, (k, v) => v + 1);
    }
    }
    }
    });
    // return a hashtable, easily converted to an object in PowerShell
    return new Hashtable(winners);
    }
    '@
    }
    $CompilerParams = [System.CodeDom.Compiler.CompilerParameters]::new()
    $CompilerParams.CompilerOptions = "/optimize+ /warn:0"
    # Don't try to re-add assembly if it already exists (ie. someone already ran the script once before)
    try {
    $ChallengeModule = [ChessData.ResultCounter] -as [type]
    }
    catch{
    $ChallengeModule = Add-Type @Challenge -CompilerParameters $CompilerParams |Select-Object -First 1
    }
    # wrapper function that takes a path to the root folder containing *.pgn files
    function Measure-PGNResult
    {
    param($path)
    return New-Object psobject -Property $ChallengeModule::CountChessResults($(Resolve-Path $path).Path)
    }
    Measure-Command {
    $Results = Measure-PGNResult $ChessFolder
    }
    $Results


    • FoxDeploy March 6, 2016 / 10:17 am

      Can you write an example of how to do that? Not sure how to implement this (if possible) in PowerShell.


    • Mathias Jessen (@IISResetMe) March 6, 2016 / 10:50 am

      The comment form ate my first submission 😦

      Here we go again:

      param(
      [parameter(Position=0)]
      $ChessFolder = 'D:\iisresetme\ChessData\'
      )
      # Prepare C# method to read and process the files
      $Challenge = @{
      Name = 'ResultCounter'
      Namespace = 'ChessData'
      PassThru = $true
      UsingNamespace = @(
      'System.Collections',
      'System.Collections.Concurrent',
      'System.Collections.Generic',
      'System.IO',
      'System.Threading.Tasks'
      )
      MemberDefinition = @'
      public static Hashtable CountChessResults(string folder)
      {
      // Use a ConcurrentDictionary to avoid concurrent updates from Parallel.ForEach() overwriting each other
      ConcurrentDictionary<string, int> winners = new ConcurrentDictionary<string, int>(new Dictionary<string, int>() { { "Draw", 0 }, { "Black", 0 }, { "White", 0 } });
      Parallel.ForEach<string>(Directory.EnumerateFiles(folder, "*.pgn", SearchOption.AllDirectories), fileName =>
      {
      // StreamReader.ReadLine() seems to be the fastest text reader
      using (StreamReader reader = new StreamReader(fileName))
      {
      string line;
      while ((line = reader.ReadLine()) != null)
      {
      if (line.StartsWith("[Result "))
      {
      if(line.Contains("1/2"))
      winners.AddOrUpdate("Draw", 1, (k, v) => v + 1);
      else if(line.Contains("1-0"))
      winners.AddOrUpdate("White", 1, (k, v) => v + 1);
      else if(line.Contains("0-1"))
      winners.AddOrUpdate("Black", 1, (k, v) => v + 1);
      }
      }
      }
      });
      // return a hashtable, easily converted to an object in PowerShell
      return new Hashtable(winners);
      }
      '@
      }
      $CompilerParams = [System.CodeDom.Compiler.CompilerParameters]::new()
      $CompilerParams.CompilerOptions = "/optimize+ /warn:0"
      # Don't try to re-add assembly if it already exists (ie. someone already ran the script once before)
      try {
      $ChallengeModule = [ChessData.ResultCounter] -as [type]
      }
      catch{
      $ChallengeModule = Add-Type @Challenge -CompilerParameters $CompilerParams |Select-Object -First 1
      }
      # wrapper function that takes a path to the root folder containing *.pgn files
      function Measure-PGNResult
      {
      param($path)
      return New-Object psobject -Property $ChallengeModule::CountChessResults($(Resolve-Path $path).Path)
      }
      Measure-Command {
      $Results = Measure-PGNResult $ChessFolder
      }
      $Results


    • Boe Prox March 6, 2016 / 7:07 pm

      I looked at that (System.Threading.Tasks.Parallel.Foreach()) briefly on Thursday, trying to make it work purely in PowerShell, but moved on when it threw errors about lacking a runspace to execute the scriptblock whenever more than one item was passed into it. I also wanted to avoid any sort of code compiling and focus purely on making stuff work in PowerShell. I may come back to that later, though, and see if I can find a way to make it work.


    • Mathias Jessen (@IISResetMe) March 7, 2016 / 1:31 pm


  7. Tore Groneng (@ToreGroneng) March 7, 2016 / 6:40 am

    hi,

    Created mine yesterday using C#. I am going for fast and ugly 🙂 @KEVINMARQUETTE’s function takes 2 minutes and 46 seconds on my computer (it is a 4.5-year-old Lenovo). My function takes 20-25 seconds.

    function Get-ChessScores
    {
    [cmdletbinding()]
    Param(
    [Parameter(ValueFromPipeline)]
    [string]$RootPath
    )
    Begin {
    $SharpCode = @'
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    public static class ChessResult
    {
    public static System.Collections.Hashtable GetResults(string fullFileName)
    {
    System.Collections.Hashtable hash = new System.Collections.Hashtable();
    hash.Add("White", 0);
    hash.Add("Black", 0);
    hash.Add("Draw", 0);
    List<byte> ByteList = new List<byte>();
    var allBytes = System.IO.File.ReadAllBytes(fullFileName);
    int resultsBracketIndex = 0;
    int white = 0;
    int black = 0;
    int draw = 0;
    string resultString = string.Empty;
    for (int i = 0; i < allBytes.Length; i++)
    {
    if (allBytes[i] == 91 && allBytes[(i + 1)] == 82 && allBytes[(i + 2)] == 101)
    {
    resultsBracketIndex = i;
    }
    if (allBytes[i] == 93 && resultsBracketIndex != 0)
    {
    resultsBracketIndex = 0;
    ByteList.Add(allBytes[i]);
    int indexx = ByteList.IndexOf(34);
    if (ByteList[indexx + 1] == 49 && ByteList[indexx + 2] == 45) white++;
    if (ByteList[indexx + 1] == 48 && ByteList[indexx + 2] == 45) black++;
    if (ByteList[indexx + 1] == 49 && ByteList[indexx + 2] == 47) draw++;
    ByteList.Clear();
    }
    if (resultsBracketIndex != 0)
    {
    ByteList.Add(allBytes[i]);
    }
    }
    hash["White"] = white;
    hash["Black"] = black;
    hash["Draw"] = draw;
    return hash;
    }
    }
    '@
    Add-Type $SharpCode -ErrorAction SilentlyContinue
    $hash = @{
    White = 0
    Black = 0
    Draw = 0
    }
    [gc]::Collect()
    }
    Process {
    $Files = Get-ChildItem -Path $RootPath -Filter *.pgn -Recurse
    foreach($file in $Files)
    {
    $FileResult = [ChessResult]::GetResults($file.fullname)
    $hash["White"] += $FileResult["White"]
    $hash["Black"] += $FileResult["Black"]
    $hash["Draw"] += $FileResult["Draw"]
    }
    }
    End {
    return $hash
    }
    }


  8. Øyvind Kallstad March 9, 2016 / 3:27 pm

    This is my entry. I found Mathias’s code to be near perfect, so I had no choice but to reuse his work and replace some parts of it with slightly tighter code 😉 So, it’s kind of a collaboration between Mathias and me 🙂

    param([Parameter(Position = 0)][string] $Path = 'C:\Users\grave\Downloads\ChessData-master\')
    $code = @{
    Name = 'ResultCounter'
    Namespace = 'ChessData'
    PassThru = $true
    UsingNamespace = @(
    'System.Collections.Concurrent',
    'System.IO',
    'System.Threading.Tasks'
    )
    MemberDefinition = @'
    public static ConcurrentDictionary<string, int> ProcessChessFiles(string path)
    {
    ConcurrentDictionary<string, int> result = new ConcurrentDictionary<string, int>();
    string tie = "1/2-1/2";
    string black = "0-1";
    string white = "1-0";
    string lineResult = "[Result";
    Parallel.ForEach<string>(Directory.EnumerateFiles(path, "*.pgn", SearchOption.AllDirectories), filename =>
    {
    using (StreamReader sr = new StreamReader(filename))
    {
    string line;
    while ((line = sr.ReadLine()) != null)
    {
    if ((line.Length - line.Replace(lineResult, String.Empty).Length) / lineResult.Length == 1)
    {
    if ((line.Length - line.Replace(white, String.Empty).Length) / white.Length == 1)
    {
    result.AddOrUpdate("White", 1, (k, v) => v + 1);
    }
    else if ((line.Length - line.Replace(tie, String.Empty).Length) / tie.Length == 1)
    {
    result.AddOrUpdate("Tie", 1, (k, v) => v + 1);
    }
    else if ((line.Length - line.Replace(black, String.Empty).Length) / black.Length == 1)
    {
    result.AddOrUpdate("Black", 1, (k, v) => v + 1);
    }
    }
    }
    }
    });
    return result;
    }
    '@
    }
    $CompilerParams = [System.CodeDom.Compiler.CompilerParameters]::new()
    $CompilerParams.CompilerOptions = "/optimize+ /warn:0"
    $class = Add-Type @code -CompilerParameters $CompilerParams |Select-Object -First 1
    Measure-Command {
    $res = $class::ProcessChessFiles($Path)
    }
    $res | Format-Table


  9. Kevin Marquette (@KevinMarquette) July 13, 2016 / 9:07 pm

    Can you update this post with a link to the resulting analysis? I keep referring to the results and sharing them with others. It is great to share anytime speed of a script comes up.

