Coding for speed

In this post, we review some of the things we learned about coding for speed in the Hadoop PowerShell challenge. The winners are at the end of this post, so zip down there to see if you won!

We’ll use this post to cover some of what we learned from the entries. Here are our top three tips for making your PowerShell scripts run just that much faster!

When searching through files, don’t use Get-Content

As it turns out, Select-String (PowerShell’s text-searching cmdlet) can open and read a file by itself, so there is no need to gc (Get-Content) it first. It’s also MUCH slimmer, and has speed for days. Look at the performance difference in this common scenario: searching the first 10 entries in $pgnfiles using Select-String alone, in stark contrast to piping them through Get-Content first.

#Get-Content | Select-String example
 dir $pgnfiles | select -first 10 | get-content | Select-String "Result"

#Select-String Only example
 dir $pgnfiles | select -first 10 | Select-String "Result"

Testing GC | Select-String...3108.5527 MS
Testing Select-String Only...99.1534   MS

Using Select-String alone is a 31x Speed Increase!  This is pretty much a no-brainer.  If you need to look inside of files, definitely dump your Get-Content  steps.  Credit goes to Chris Warwick for this find.
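If you’d like to try the comparison on your own machine, here is a minimal sketch. The temp folder, the file names, and the sample lines are all made up for illustration (they are not the contest data), so expect different timings from the ones above.

```powershell
# Hypothetical sample data: five small text files in a scratch folder
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ("pgn-demo-" + [guid]::NewGuid().Guid)
New-Item -ItemType Directory -Path $dir | Out-Null
foreach ($i in 1..5) {
    # Every tenth line is a fake "Result" tag, the rest is filler
    $lines = foreach ($n in 1..200) {
        if ($n % 10 -eq 0) { '[Result "1-0"]' } else { "move $n" }
    }
    Set-Content -Path (Join-Path $dir "game$i.pgn") -Value $lines
}
$files = Get-ChildItem -Path $dir -Filter *.pgn

# Both approaches should find exactly the same lines
$viaGetContent = @($files | Get-Content | Select-String 'Result')
$direct        = @($files | Select-String 'Result')

# Time each approach on its own
$slow = (Measure-Command { $files | Get-Content | Select-String 'Result' }).TotalMilliseconds
$fast = (Measure-Command { $files | Select-String 'Result' }).TotalMilliseconds

"Get-Content | Select-String: {0} matches in {1} ms" -f $viaGetContent.Count, $slow
"Select-String only:          {0} matches in {1} ms" -f $direct.Count, $fast

# Clean up the scratch folder
Remove-Item -Path $dir -Recurse -Force
```

A nice side effect of handing the files straight to Select-String is that each match remembers which file it came from (see the Path and Filename properties on the results).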

$collection += $object is SLOW!

We see this structure a LOT in PowerShell:

#init my collection
$collection = @()

ForEach ($file in $pgnfiles) {

  $collection += $file | Select-String "Result"

 }

 $collection

This structure sets up a ‘master list’, then does some processing for each object, eventually adding it to our master list, then at the end, display the list.

Why shouldn’t I do this?

PowerShell is built on .NET, and some .NET types are immutable or fixed in size: our beloved string can’t be changed once created, and an array can’t grow past the length it was created with.  This means that PowerShell can’t simply tack your entry onto the end of $collection, like you’d think.


No, instead PowerShell has to make a new array big enough for the whole of the old one plus our new entry, copy everything across, and then throw away the old array. This has almost no impact on small datasets, but look at the difference when we go through 100,000 GUIDs here!

Write-Output "testing ArrayList.."

(measure-command -ex {$guid = new-object System.Collections.ArrayList
1..100000 | % {
$guid.Add([guid]::NewGuid().guid) | out-null

}

}).TotalMilliseconds

Write-Output "testing `$collection+=..."

(measure-command -ex {

$guid = @()
1..100000 | % {
    $guid += [guid]::NewGuid().guid
    }

}).TotalMilliseconds

testing ArrayList...    7784.5875  MS
testing $collection+=...465156.249 MS

Sixty times faster!!!  The really crazy part: you can watch PowerShell’s RAM usage jump all over the place, as it doubles up the variable in memory, commits it, and then runs garbage collection.  Watch how the RAM keeps doubling, then halving!

[GIF: PowerShell’s RAM usage during the += test] I didn’t actually think it would be this dramatic!
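You can see the copying for yourself without a benchmark. This is a small sketch with made-up variable names: it keeps a second reference to an array, appends with +=, and then checks whether the two variables still point at the same object.

```powershell
$original = @(1, 2, 3)
$alias    = $original            # a second reference to the very same array

# .NET arrays report that they cannot grow
$fixed = $original.IsFixedSize

# += cannot extend the array in place, so a new 4-element array is built
# and $original is pointed at it; $alias still holds the old 3-element array
$original += 4

$sameObject = [object]::ReferenceEquals($original, $alias)

"IsFixedSize: $fixed"
"Same object after +=: $sameObject"
"Old array length: $($alias.Count), new array length: $($original.Count)"
```

Every += repeats that allocate-and-copy step, which is why the cost climbs so steeply as the collection grows.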

 

How do I not use $collection += structure in my code?

ArrayList will be your new best friend.

ArrayList is a bit different from a regular array; here’s how you use it. First you have to make a new ArrayList (which developers call instantiating a class, sounds so cool to say it!), like so:

$collection = New-Object System.Collections.ArrayList

Next, we iterate through each object, and here’s the real difference.

We call the ArrayList’s .Add() method instead of using the += syntax. Finally, at the end, we get the whole list back out by using return, or just putting the variable name in again.

ForEach ($file in $pgnfiles) {
  $result = $file | Select-String "Result"
  $collection.Add($result)

 }
 return $collection

You might notice when you run this that you see something like this:

[Screenshot: the console filling up with index numbers] Ohhh, so many numbers

ArrayList is a bit weird.  When you add an entry to it, ArrayList responds with the index position of the new item you added.  Somewhere in the world there may be a use case where this is helpful, but not really for us.  So, we just pipe our .Add() statement into null, like so:

$collection.Add($result) | Out-Null

Some people put [void] on the front of the line instead, same result.
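For reference, here is a short sketch of the common ways to silence .Add(), side by side. The variable names are made up; the last part uses a generic List, which is a different collection type from the ArrayList used in this post and whose Add() returns nothing in the first place.

```powershell
# Collect anything the block writes to the pipeline so we can check it stayed quiet
$emitted = @(& {
    $list = New-Object System.Collections.ArrayList

    [void]$list.Add('a')          # cast the returned index to void
    $null = $list.Add('b')        # assign the returned index to $null
    $list.Add('c') > $null        # redirect the returned index to $null
    $list.Add('d') | Out-Null     # pipe it away (works, but adds a pipeline)

    # Generic List: Add() has no return value, so there is nothing to suppress
    $typed = New-Object 'System.Collections.Generic.List[string]'
    $typed.Add('a')
    $typed.Add('b')

    # Hand the counts out through script-scoped variables instead of the pipeline
    $script:arrayListCount = $list.Count
    $script:typedCount     = $typed.Count
})

"ArrayList items: $arrayListCount, List[string] items: $typedCount, stray output: $($emitted.Count)"
```

All four forms leave the same items in the collection; they differ only in how the unwanted index is discarded.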

In one project we were migrating customers from two different remote desktop systems into one with some complex PowerShell code. There was a section of the code which built a list of all of their files while omitting certain ones. When we swapped out += for an ArrayList, we dropped our execution time from six minutes to only 20 seconds! A huge performance boost with this one tip!

The fastest way to read a file, stream reader

I was simply astounded to see the tremendous speed difference between using PowerShell’s Get-Content cmdlet versus the incredibly fast StreamReader.

Here’s why Get-Content can be a bit slow.  When you’re running Get-Content, PowerShell parses the file and dumps out an object for each line, sending it on down the pipeline for processing.  Capture that output in a variable, as in the example below, and the whole file is held in memory at once.

This is VERY SLOW on big files.  If you’d like to know a bit more, read Don’s great post on Get-Content here, or Keith’s write-up here.

When we’re working with large files, or lots of small files, we have a better option, and that is the StreamReader from .NET. It IS fundamentally different in how it presents the content from the file, so here’s a comparison.

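#Working with Get-Content

#A hashtable to tally our matches
$results = @{}

#Read our file into File
$file = Get-Content $fullname

#Step through each line
foreach ($line in $file){
    #Do something with our line here
    #ex: (the backtick escapes [, which -like would otherwise read as the start of a wildcard set)
    if($line -like '`[Re*')
       {
       $results[$line]+=1
       }
}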

And now, with StreamReader

#Same concept but with StreamReader

#A hashtable to tally our matches
$results = @{}

#Setup a streamreader to process the file
$file = New-Object System.IO.StreamReader -ArgumentList $Fullname

:loop while ($true )
{
    #Read this line
    $line = $file.ReadLine()
    if ($line -eq $null)
    {
        #If the line was $null, we're at the end of the file, let's break
        $file.close()
        break loop
    }
    #Do something with our line here
    if($line.StartsWith('[Re'))
        {
        $results[$line]+=1
        }

}

So, now that you’ve seen how it works, how much faster and better is it?

Speed results

The numbers speak for themselves

Method          Time
Get-Content     3562 MS
StreamReader     133 MS

StreamReader is 26 times faster!
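To check the numbers on your own files, here is a self-contained sketch that runs both readers over the same scratch file. A few assumptions to flag: the file and its contents are generated just for the demo, both loops use the same StartsWith() test so that only the reading method differs, and the StreamReader is closed in a finally block so it is released even if something throws mid-loop.

```powershell
# Hypothetical sample file: every fourth line is a fake "Result" tag
$path = Join-Path ([System.IO.Path]::GetTempPath()) ("sr-demo-" + [guid]::NewGuid().Guid + ".pgn")
$sample = foreach ($n in 1..1000) {
    if ($n % 4 -eq 0) { '[Result "1/2-1/2"]' } else { "1. e4 e5 $n" }
}
Set-Content -Path $path -Value $sample

# Reader 1: Get-Content
$viaGetContent = @{}
$getContentMs = (Measure-Command {
    foreach ($line in Get-Content $path) {
        if ($line.StartsWith('[Re')) { $viaGetContent[$line] += 1 }
    }
}).TotalMilliseconds

# Reader 2: StreamReader, closed in finally
$viaStreamReader = @{}
$streamReaderMs = (Measure-Command {
    $reader = New-Object System.IO.StreamReader -ArgumentList $path
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line.StartsWith('[Re')) { $viaStreamReader[$line] += 1 }
        }
    }
    finally {
        $reader.Close()
    }
}).TotalMilliseconds

"Get-Content:  {0} ms" -f $getContentMs
"StreamReader: {0} ms" -f $streamReaderMs

Remove-Item -Path $path -Force
```

One thing to watch when adapting this: .NET resolves relative paths against the process working directory, which is not always the folder your PowerShell prompt is sitting in, so hand StreamReader a full path.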

Wouldn’t it be great to have a PowerShell snippet for StreamReader?

I thought so too! So here you go.  Load this into the ISE and run it once.  After that, you can hit Ctrl+J and have a nice sample StreamReader code structure.

$snippet = @{
    Title = 'StreamReader Snippet'
    Description = 'Use this to quickly have a working StreamReader'
    #Single-quoted here-string, so the $variables are stored as text instead of being expanded now
    Text = @'
param($fullname) #FilePathHere
begin
{
    $results = @{}
}
process
{
    $file = New-Object System.IO.StreamReader -ArgumentList $Fullname

    :loop while ($true )
    {
        $line = $file.ReadLine()
        if ($line -eq $null)
        {
            $file.close()
            break loop
        }
        if($line.StartsWith('[Re'))
        {
            #do something with the line here
            $results[$line]+=1
        }
    }
}
end
{
    return $results
}
'@
}
New-IseSnippet @snippet

This syntax comes to us by way of /u/evetsleep, /u/Vortex100 and Kevin Marquette, from Reddit’s /r/powershell!

Other ways to speed up your code

I know I said my top three tips, but I also want to give a little extra.  Here are some extra BONUS TIPS for you.

Runspaces are crazy fast – Boe Prox turned in an awesome example of working with RunSpaces, here.  If you’d like to read a bit more, check out his full write-up guide here. This guide should be considered REQUIRED reading, if speed is your game. Amazing stuff, and incredibly fast, much better than using PowerShell Jobs.
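To give a feel for the shape of runspace code, here is a minimal sketch. It is not Boe’s entry: the squaring script, the pool size of four, and all of the variable names are placeholders chosen for the demo.

```powershell
# A pool that runs up to four runspaces at once (size picked arbitrarily)
$pool = [runspacefactory]::CreateRunspacePool(1, 4)
$pool.Open()

# The work each runspace performs; a stand-in for real per-file processing
$script = 'param($x) $x * $x'

# Start one asynchronous PowerShell instance per input value
$jobs = foreach ($n in 1..4) {
    $ps = [powershell]::Create()
    $ps.RunspacePool = $pool
    [void]$ps.AddScript($script).AddArgument($n)
    [pscustomobject]@{ Shell = $ps; Handle = $ps.BeginInvoke() }
}

# Wait for each one in turn, collect its output, then release it
$results = foreach ($job in $jobs) {
    $job.Shell.EndInvoke($job.Handle)
    $job.Shell.Dispose()
}

$pool.Close()
$pool.Dispose()

$results -join ', '
```

Each runspace starts with a clean session, so anything the script needs has to be passed in as an argument, the way $n is here.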

Taking out your own Trash – This cool tip comes to us from Kevin Marquette.  If PowerShell has some monster objects in memory, or you just want to clean things up, you can call a System Garbage Collection method to take out your trash, like so:

[GC]::Collect()

True Speed comes from going native – The fastest of the fast approaches used native C# code, which PowerShell has supported since v 3. Using this, you gain a whole slew (that’s a technical term) of new dotnet goodness to play with. For examples of this technique, check out what Tore, Øyvind and Mathias did.
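As a rough illustration of the idea (and not the winners’ code), the sketch below compiles a tiny C# class with Add-Type and calls it from PowerShell. The class name LineCounter, its CountPrefix method, and the scratch file are all invented for this example.

```powershell
# C# source for a hypothetical helper that counts lines starting with a prefix
$source = @'
using System;
using System.IO;

public static class LineCounter
{
    public static int CountPrefix(string path, string prefix)
    {
        int count = 0;
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.StartsWith(prefix, StringComparison.Ordinal)) { count++; }
            }
        }
        return count;
    }
}
'@

# A type can only be added once per session, so skip it if it is already loaded
if (-not ('LineCounter' -as [type])) {
    Add-Type -TypeDefinition $source
}

# Made-up sample file: every fourth line is a fake "Result" tag
$path = Join-Path ([System.IO.Path]::GetTempPath()) ("native-demo-" + [guid]::NewGuid().Guid + ".pgn")
$sample = foreach ($n in 1..1000) {
    if ($n % 4 -eq 0) { '[Result "1-0"]' } else { "move $n" }
}
Set-Content -Path $path -Value $sample

$count = [LineCounter]::CountPrefix($path, '[Re')
"Lines starting with [Re: $count"

Remove-Item -Path $path -Force
```

The whole read loop runs as compiled code here, so PowerShell only pays for a single method call instead of interpreting every iteration.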

Can PowerShell beat Hadoop?

From the original post that started this whole thing, Adam Drake’s Can command line tools be faster than your Hadoop cluster?

[using Amazon Web Services hosting…] with 7 x c1.medium machine[s] in the cluster took 26 minutes…processing data at ~ 1.14MB/sec

All of these entrants can proudly say that their code DID beat the Hadoop cluster: Boe Prox, Craig Duff, Martin Pugh, /u/evetsleep, /u/Vortex100 and Kevin Marquette, Irwin Strachan, Flynn Bundy, David Kuehn, and /u/LogicalDiagram from Reddit, and @IisResetme!  All eleven averaged at least 10.76 MB/sec.  Their code all completed in less than six minutes, much faster than the 26 minutes of the mighty seven-node Hadoop cluster!

But can PowerShell beat Linux?

When I saw that Adam Drake, a master of the Linux command line and Bash tools, was able to process all of the results in only 11 seconds, I knew this was a tall order.  We gave it our all guys, there’s no shame in…BEATING that time!

Amazingly, our two Speed Demons, Tore Groneng and Øyvind Kallstad, working in conjunction with Mathias Jensen, each turned in a blazing fast time of eight seconds!  To be specific, Øyvind’s time was 8,778 MS, while Tore beat that by an additional 200 MS.  This represents a data throughput of 411.75 MB/s!  This is close to the maximum speed of my all-SSD RAID-0, so they REALLY turned in quite a result!

Measured by throughput (411.75 MB/s against the cluster’s ~1.14 MB/sec), that is roughly 360 times faster than the Hadoop cluster. Astounding!

Winners!

I’m now pleased to announce the winners of the Hadoop contest.  I was so impressed with the entries that I decided to pick a bonus fourth winner.

Speed King Winner – This one goes to Tore Groneng.  He worked closely with Mathias Jensen, and turned out an incredible 8 second total execution.  For comparison, this is a 200x speed increase over the results of the Hadoop Cluster from our original challenge.  He should be proud.

A close runner-up was Øyvind Kallstad, with a very honorable time of 8778 MS.

Most Best Practice Award – This one goes to Boe Prox, with a textbook perfect entry, including object creation, runspaces, and just plain pretty code.

Regex God – This award goes to Craig Duff, who blew my socks off with his impressive Regex skills!

One-liner Champion – This award was well earned by Flynn Bundy, who managed to turn out a very respectable time of two minutes, and did it all in a one-liner!  His code ALMOST fits in a single tweet, in fact!  Only 216 characters!

If your name is mentioned here, send me a DM and we’ll work out getting you your hard-earned stickers 🙂

Name | Link | Time (ms) | Hours:Min:Sec | Winner
Tore Groneng | https://gist.github.com/torgro/4b8aa80ad5b9b2da351b#file-get-chessscore-ps1 | 8525 | 00:00:08.7673.32 | Speed King!
Boe Prox | https://gist.github.com/proxb/eba9b262e1dcb593ec94 | 28274 | 00:00:28.25447.28 | Most Best Practice Award
Craig Duff | https://gist.github.com/duffwv/eaf16d733fdb00e4d6e8#file-beatinghadoop-ps1 | 39813 | 00:00:39.35832.08 | Regex God Award
Flynn Bundy | https://gist.github.com/bundyfx/1ef0455eb9bcbcc2d627 | 119774 | 00:01:59.107797.31 | One-liner Champion

Thank you to everyone who entered.  The leaderboards have been updated with your times, and I’ll add your throughput when I get the chance this week!

19 thoughts on “Coding for speed”

  1. Derek March 23, 2016 / 2:42 pm

    Arraylist is awesome but I’ve found you have to be careful when using it with functions. It tends to ignore scoping (I’m assuming since it’s not a native powershell type or something?). If you modify the variable inside the function, the changes are saved where normally the child scope doesn’t directly affect the parent scope.

    function Add-CollectionItem{
    $arrayList.add("placeholder")
    Write-Output "Arraylist inside function" $arrayList
    }

    $arrayList = New-Object System.Collections.ArrayList
    Write-Output "Arraylist before function" $arrayList
    Add-CollectionItem
    Write-Output "Arraylist after function" $arrayList

    # PowerShell array
    function Add-CollectionItem2{
    $newArray += "placeholder"
    Write-Output "New array in function" $newArray
    }

    $newArray = @()
    Add-CollectionItem2
    Write-Output "New array outside of function" $newArray

    • FoxDeploy March 23, 2016 / 3:26 pm

      Wow, I never knew that! That actually explains why the entries with ArrayLists seemed to return more entries than expected. Still, their speed alone means it’s worth it.

      • Derek March 23, 2016 / 4:21 pm

        Arraylist is totally worth it! I just figured I’d save some people the troubleshooting time I had to go through.

    • Tim Green March 23, 2016 / 11:53 pm

      This is a good tip because it is not obvious and could easily cause some real confusion. But to clarify, it is not that scope is ignored, it is that the variable is passed “by reference” instead of “by value”. The same is true with hash tables in PowerShell.

      • Derek March 24, 2016 / 9:39 am

        Thanks Tim! That makes more sense than what I assumed it was.

  2. Cody March 24, 2016 / 12:55 am

    Your CMS has messed up the HTML formatting on your scripts; they’re showing amp;quot; etc.

    Also the reason people cast to void is because it doubles performance again.

    (measure-command -ex {$guid = new-object System.Collections.ArrayList
    1..100000 | % {
    $guid.Add([guid]::NewGuid().guid) | out-null

    }

    }).TotalMilliseconds
    17251.4011

    (measure-command -ex {$guid = new-object System.Collections.ArrayList
    1..100000 | % {
    [void] $guid.Add([guid]::NewGuid().guid)

    }

    }).TotalMilliseconds
    8844.5349

    Hard to sneeze at.

    Also I tested some other combinations:
    a) Using ArrayList with += instead of Add (very slow).
    b) Using System.Collections.Generic.List`1[string] instead of ArrayList (doesn’t return an index and runs at basically the same speed as the ArrayList with a void return).
    c) Using the normal @() with .Add instead of += (invalid, doesn’t work).

  3. Irwin Strachan March 24, 2016 / 9:02 am

    Reblogged this on pshirwin and commented:
    The results are in! Great summary about the Hadoop PowerShell Challenge by Stephen Owen! It was fun to see the different approaches. When it comes to speed you can’t beat native C# code! Great tips! Worth the read!

  4. Leon Bambrick March 24, 2016 / 8:51 pm

    Awesome tips.

    Not sure I agree with this claim: “dotnet variables are immutable by design.”

    Sure “strings” in .net are immutable by design, but variables and objects are not generally immutable.

    I think there’s something else going on to cause the performance increase you’re seeing.

    Sorry to find fault with an otherwise absolutely stellar article 😉

  5. rkeithhill March 24, 2016 / 9:12 pm

    You can simplify your StreamReader snippet and make it more reliable like so:

    $file = New-Object System.IO.StreamReader -ArgumentList $pwd\build.ps1
    try {
        while (($line = $file.ReadLine()) -ne $null) {
            if ($line.StartsWith('[Re')) {
                #do something with the line here
                $results[$line]+=1
            }
        }
    }
    finally {
        $file.Close()
    }
    
  6. Kirk Munro March 25, 2016 / 8:49 am

    Great article Stephen. A few comments:

    1. Dot net variables are not immutable by design. Certain dot net types are immutable, arrays being one of those types. Strings are another. Not dot net variables in general.

    2. Since this is about performance, you should never pipe to Out-Null. Piping to Out-Null, especially when you’re doing so in a pipeline with only two pipeline elements, is a performance hit because you’re invoking a pipeline where you don’t need one. You should do one of the following instead:

    a) Cast as [void], like this:

    [void]$collection.Add($result)

    b) Assign to $null, like this:

    $null = $collection.Add($result)

    c) Redirect to null, like this:

    $collection.Add($result) > $null

    Believe it or not, avoiding the pipeline here makes a difference, especially over very large loops. My preference is the last of those three, but you’ll see all of them used in place of Out-Null.

    3. When you use a stream, you should use try/finally and close it in the finally block. Otherwise you risk leaving it open in the event of an exception. Always close your streams. Keith Hill’s example in these comments has a cleaner version that does this.

    4. You forgot the “y” in “Øyvind” in numerous locations in your article.

    • FoxDeploy March 25, 2016 / 8:50 am

      Thanks for the tips. I knew I was missing a distinction about the variables vs strings. I’ll update this

  7. adfaf March 25, 2016 / 11:11 am

    What’s really funny is you call this article “Coding for Speed” yet you say this:

    “Some people put [void] on the front of the line instead, I try to avoid it, seems confusing and very ‘developery’ too me.”

    Except Out-Null is several orders of magnitude slower than [void]:

    http://stackoverflow.com/a/5263780/3131004

    So please use [void]. Otherwise you’re slowing your code down, ironically enough.

  8. Flavius March 29, 2016 / 3:59 am

    When doing a Get-Content vs StreamReader speed comparison, you use -like in one and StartsWith in another.

    Because string comparison can be expensive (I am not sure whether -like uses regex behind the scenes), wouldn’t it be more relevant if you used the same string function? Otherwise you might be testing the difference in string comparison operations more than the file readers.

  9. Per Møller (@pmmviper) May 6, 2016 / 11:10 am

    If possible you should not be adding to collections or arrays inside the loop, just output what you need to pipeline and assign the output instead, much faster.

    Like this:
    Write-Output "testing pipeline"

    (measure-command -ex {

    $guid = 1..100000 | % {
    [guid]::NewGuid().guid
    }

    }).TotalMilliseconds

    Results on my system:
    testing ArrayList..
    4603,7343
    testing pipeline
    1672,9649

    • FoxDeploy May 6, 2016 / 4:20 pm

      Good idea, but how would you suggest we capture the emitted objects? Does capturing them introduce a delay?

      • Kitt Holland July 28, 2016 / 2:32 pm

        He is capturing them, he assigns $guid = to the loop.
