Update
Read the follow-up, with in-depth analysis of many of the techniques used by our combatants!
Premise
Saw this super interesting read online over the weekend:
Command line tools can be 235x faster than Hadoop
In this post, the author posits that he can crunch numbers from the Linux command line MUCH faster than Hadoop can!
If he can do that, surely we can beat the Hadoop cluster too. I started wondering how I would replicate this in PowerShell, and thus this challenge was born…
Challenge
- Download the repo here (2 GB!), unzip it and keep the first 10 folders
- This equates to ~3.5 GB, which is roughly the same data size as in the original post
- Be sure to only parse the first 10 folders 🙂 (you can delete RebelSite, Twic and WorldChampionships)
- Iterate through all of the chess record files (*.pgn) those folders contain and parse each record out. We need to return a total count of black wins, white wins and draws.

To read a PGN:
We are only interested in the results of the games, which have only three real outcomes. 1-0 means white won, 0-1 means black won, and 1/2-1/2 means the game was a draw. There is also a – case, meaning the game is ongoing or cannot be scored, but we ignore that for our purposes.
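The counting logic above can be sketched in a few lines of PowerShell. The sample PGN lines here are made up for illustration:

```powershell
# Count outcomes from [Result "..."] tags; sample data is fabricated
$sample = @'
[Result "1-0"]
[Result "0-1"]
[Result "1/2-1/2"]
[Result "1-0"]
[Result "*"]
'@

$white = 0; $black = 0; $draw = 0
foreach ($line in ($sample -split '\r?\n')) {
    if ($line -match '\[Result\s+"(.+?)"\]') {
        switch ($Matches[1]) {
            '1-0'     { $white++ }
            '0-1'     { $black++ }
            '1/2-1/2' { $draw++ }
            # anything else (e.g. "*") is ongoing/unscorable and ignored
        }
    }
}
"White: $white  Black: $black  Draws: $draw"   # White: 2  Black: 1  Draws: 1
```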
- Use solid PowerShell best practices, pipelines or whatever you want to beat the Hadoop cluster’s time of 26 minutes!
To enter
Post a comment with a link to a GitHub Gist or the code you used to solve this problem. Have it in by March 19th.
Winners will be determined by my decision, one from each of these categories:
- Creative Coder Award: could be the tersest, the most ‘Dave Wyatt’, or the most .NET
- Most ‘Best Practice’ Award: if you’re emitting objects and embracing the teachings of Snover, you’re in the running
- The So Fast Award: fastest wins, bar none
Remember, the OP from the other thread did this all in a Linux pipeline. PowerShell is all about the pipeline, so embrace it to win!
Here’s how this will work: once the time is up, I’ll take everyone’s final submission and script them to run one after another, with a console reset in between. I’ll run them on my PC with a newer i7 and put the input files on a RAM disk.
This will decide the speed winner. The other two will be hand-selected by me. I’ll write up a short post with the findings and announce the winners on the 20th.
The luxurious prizes
Winners will get their pick from this pile of GitHub and PowerShell stickers!
I’ll mail it to you unless you live in the middle of nowhere Europe and shipping will kill me.
Have your entry in by March 19th!
I’ll post my best attempt once the entries are in!
This is about as simple and quick as I could make it.
HadoopVSPowershell.ps1
Feels like it eats way too much RAM. I’ll see if I can fix that in a future submission. I’ll get more creative with the next one.
I found a small bug and fixed it: https://gist.github.com/KevinMarquette/f81b6f0a54c9df650c22
I had to remove the -ReadCount because I was using it wrong.
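For context, the -ReadCount pitfall is easy to hit: with -ReadCount N, Get-Content emits arrays of up to N lines rather than individual lines, so downstream code expecting one line per pipeline item silently misbehaves. A minimal illustration (the temp file name is arbitrary):

```powershell
# With -ReadCount N, Get-Content emits ARRAYS of up to N lines
$path = Join-Path ([IO.Path]::GetTempPath()) 'readcount-demo.txt'
1..10 | Set-Content -Path $path

# Each pipeline item here is a chunk (array) of lines, not a single line
$chunks = Get-Content -Path $path -ReadCount 4
"Chunks emitted: $($chunks.Count)"   # 3 chunks: 4 + 4 + 2 lines

# To count actual lines, you have to look inside each chunk
$lineCount = 0
Get-Content -Path $path -ReadCount 4 | ForEach-Object { $lineCount += $_.Count }
"Lines seen: $lineCount"             # 10

Remove-Item $path
```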
Reblogged this on Skatterbrainz Blog.
So over at /r/PowerShell (https://www.reddit.com/r/PowerShell/comments/48gmy3/powershell_challenge_can_you_beat_hadoop_with_posh/) we pulled together this script: https://gist.github.com/KevinMarquette/999e3eaeb75dd59a04d5
Credit goes to /u/evetsleep and /u/Vortex100 for most of this; I mostly made small tweaks.
We have it running in 67-100 seconds.
I made a small tweak and it is running about 7 seconds quicker on my machine. /Simon
function Get-ChessResult2
{
    [CmdletBinding()]
    Param(
        [Parameter(
            Position = 0,
            Mandatory = $true,
            ValueFromPipeline = $true,
            ValueFromPipelineByPropertyName = $true
        )]
        [String]$FullName
    )
    begin
    {
        # Running totals, accumulated across every file piped in
        $white = 0
        $black = 0
        $draw  = 0
    }
    process
    {
        # Each [Result "..."] line records the outcome of one game
        foreach ($match in Select-String -Path $FullName -Pattern 'Result')
        {
            switch -Wildcard ($match.Line)
            {
                '*"1-0"*'     { $white++ }
                '*"0-1"*'     { $black++ }
                '*"1/2-1/2"*' { $draw++ }
            }
        }
    }
    end
    {
        [PSCustomObject]@{
            White = $white
            Black = $black
            Tie   = $draw
        }
    }
}

Measure-Command {
    Get-ChildItem .\ChessData-master -Filter *.pgn -Recurse | Get-ChessResult2
}
gistfile1.txt
Not a one-liner and only uses a single core, but I’m really just glad it beats 26 minutes.
Not sure why the link didn’t show up in my initial submission:
It’s a nine+ minute runtime, which should put me squarely in last place!
Don’t worry, I found the link, and it’s been added to the leaderboards.
Here’s my take:
Get-HadoopChallenge.ps1
Did I mention the sub-three-minute run time?
I tried to add some multi-threading using Boe’s PoshRSJob (Runspaces) but those actually ran slower!
Here’s my take…
Get-ChessMatchesResults.ps1
https://gist.github.com/duffwv/eaf16d733fdb00e4d6e8
Here’s my go.
Edited regex to capture case where there are multiple spaces between Result and the first quote.
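Presumably that fix amounts to allowing flexible whitespace in the pattern. A sketch of the idea (these exact patterns are my guess, not necessarily the gist's):

```powershell
# A single literal space misses tags with extra spacing; \s+ tolerates it
$strict  = '\[Result "(.+?)"\]'
$relaxed = '\[Result\s+"(.+?)"\]'

'[Result  "1-0"]' -match $strict    # False (two spaces break the match)
'[Result  "1-0"]' -match $relaxed   # True; $Matches[1] holds 1-0
```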
Going for speed? Compiled code is your friend!
We can utilize `System.Threading.Tasks.Parallel.ForEach()` and a `ConcurrentDictionary` to read the PGN files in parallel:
Measure-PGNResult.ps1
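A minimal sketch of that compiled approach, assuming nothing about the actual gist (the `PgnTally` type and `Count` method are names I made up): embed C# via Add-Type, fan the files out with `Parallel.ForEach`, and tally outcomes into a `ConcurrentDictionary`:

```powershell
# Sketch only: compile a small C# helper that counts [Result "..."] tags
# across many files in parallel; type and method names are hypothetical
Add-Type -TypeDefinition @'
using System.Collections.Concurrent;
using System.IO;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class PgnTally
{
    public static ConcurrentDictionary<string, int> Count(string[] files)
    {
        var counts = new ConcurrentDictionary<string, int>();
        var rx = new Regex("\\[Result\\s+\"(.+?)\"\\]", RegexOptions.Compiled);
        // Each file is processed on a thread-pool thread; the concurrent
        // dictionary makes the per-result tallies safe to update in parallel
        Parallel.ForEach(files, file =>
        {
            foreach (Match m in rx.Matches(File.ReadAllText(file)))
            {
                counts.AddOrUpdate(m.Groups[1].Value, 1, (k, v) => v + 1);
            }
        });
        return counts;
    }
}
'@

# Hypothetical usage against the challenge data set:
# $files = (Get-ChildItem .\ChessData-master -Filter *.pgn -Recurse).FullName
# [PgnTally]::Count($files)
```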
Can you write an example of how to do that? Not sure how to implement this (if possible) in PowerShell.
The comment form ate my first submission 😦
Here we go again:
Measure-PGNResult.ps1
I looked at System.Threading.Tasks.Parallel.ForEach() briefly on Thursday while trying to make this work purely in PowerShell, but moved on when it threw errors about lacking a runspace to execute the scriptblock whenever more than one item was passed in. I also wanted to avoid any sort of code compiling and focus on making everything work purely in PowerShell. I may come back to it later and see if I can find a way to make it work.
Measure-PGNResult.ps1
Hi,
I created mine yesterday using C#. I am going for fast and ugly 🙂 @KEVINMARQUETTE's function takes 2 minutes and 46 seconds on my computer (a 4.5-year-old Lenovo). My function takes 20-25 seconds.
Get-ChessScore.ps1
This is my entry. I found Mathias’ code to be near perfect, so I had no choice but to reuse his work and replace some parts of it with slightly tighter code 😉 So it’s kind of a collaboration between Mathias and me 🙂
communary.ps1
Can you update this post with a link to the resulting analysis? I keep referring to the results and sharing them with others. It’s great to share any time the speed of a script comes up.
Will do. Check back tomorrow AM (can’t edit well from my phone!)
Done