Extracting and monitoring web content with PowerShell


This kind of request comes up all the time on StackOverflow and /r/PowerShell.

“How can I extract content from a webpage using PowerShell?”

And it’s an interesting problem to solve.  However, nothing motivates like greed, and I recently revisited this topic in order to help me track down the newest must-have item, the Switch.

In fact, this post COULD have been called ‘Finding a Nintendo Switch with PowerShell’!

I have been REALLY wanting a Nintendo Switch, and since I’ll be flying up to NYC next month for Tome’s NYC TechStravaganza (come see me if you’ll be in Manhattan that day!), it’s the perfect justification for She-Who-Holds-The-Wallet for me to get one!

But EVERYWHERE is sold out.  Still!  😦

However, the stores have been receiving inventory every now and then, and I know that when GameStop has it in stock, I want to buy it from them!  With that in mind, I knew I just needed a way to monitor the page and alert me when some text on it changes.

Web scraping, here we go!

Caveat: Scraping a site isn’t illegal, but it may violate the terms of service of some sites out there.  Furthermore, if you scrape too often, you might be blocked from the site temporarily or forever.  Don’t get greedy in scraping, or try to use it commercially.

If a site provides an API, go that route instead: APIs are sanctioned and provided by the company for exactly this kind of use, and they require a tiny fraction of the resources needed to load a full page.
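For comparison, here’s a minimal sketch of the API route. The endpoint below is made up purely for illustration; the point is that Invoke-RestMethod hands you parsed objects instead of a page of HTML:

```powershell
# Hypothetical endpoint for illustration only - GameStop does not publish this API.
$stock = Invoke-RestMethod -Uri 'https://example.com/api/products/141887/availability'

# The JSON response comes back already converted to a PowerShell object,
# so there is no HTML to parse at all.
$stock.inStock
```

When an endpoint like this exists, one small JSON payload replaces the entire page download, which is why it’s so much cheaper for both you and the site.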

Finally, some Content Management Systems will never update an existing page, but create a new one with a new URL and update all links accordingly.  If you’re not careful, you could end up querying a page that will never change. 

GameStop Nintendo Switch with Neon Joycons

First things first: let’s load this page in PowerShell and store the result in a variable.  We’ll use Invoke-WebRequest to handle this task.

$url ='http://www.gamestop.com/nintendo-switch/consoles/nintendo-switch-console-with-neon-blue-and-neon-red-joy-con/141887'
$response = Invoke-WebRequest -Uri $url

Next, I want to find a particular element on the page, which I’ll parse to see if it looks like they have some in stock. For that, I need to locate the ID or ClassName of the particular element, which we’ll do using Chrome Developer Tools.

On the page, right-click an element of your choosing and pick ‘Inspect Element’.  In my case, I right-clicked the ‘Unavailable’ text area.

This will launch the Chrome Developer Console with the element already selected for you, so you can just copy the class name.  You can see me moving the mouse around; I do this to see which element is the most likely one to contain the value.


You want the class name, in this case ats-prodBuy-inventory.  We can use PowerShell’s wonderful HTML parsing to do some heavy lifting here, by leveraging the HTMLWebResponseObject’s useful ParsedHtml.getElementsByClassName method.

So, to select only the element in the body with the class name of ats-prodBuy-inventory, I’ll run:

$response.ParsedHtml.body.getElementsByClassName('ats-prodBuy-inventory')

This will list ALL the properties of this element, including lots of HTML info and properties that we don’t need.

To truncate things a bit, I’ll select only properties which have text or content somewhere in the property name.

$response.ParsedHtml.body.getElementsByClassName('ats-prodBuy-inventory') | Select-Object *text*, *content*

The output:

innerText         : Currently unavailable online
outerText         : Currently unavailable online
parentTextEdit    : System.__ComObject
isTextEdit        : False
oncontextmenu     : 
contentEditable   : inherit
isContentEditable : False

Much easier to read.  So, now I know that the innerText or outerText property will tell me whether the product is in stock.  To validate, I took a look at another product which was in stock, and saw that the same properties applied.

All that remained was to convert this few-liner into a script that loops once every 30 minutes, exiting when the message text on the site changes.  When it does, I use a tool I wrote a few years ago, Send-PushMessage, to send a PushBullet message to my phone and give me a heads-up!


$url = 'http://www.gamestop.com/nintendo-switch/consoles/nintendo-switch-console-with-neon-blue-and-neon-red-joy-con/141887'
$classname = 'ats-prodBuy-inventory'
$notInStock = 'Currently unavailable online'

# Check the page, then sleep 30 minutes and repeat until the text changes
do {
    $response = Invoke-WebRequest -Uri $url
    $InStock = $response.ParsedHtml.body.getElementsByClassName($classname) |
        Select-Object -ExpandProperty innerText
    "$(Get-Date) is device in stock? $($InStock -ne $notInStock)`n-----$InStock"
    if ($InStock -eq $notInStock) { Start-Sleep -Seconds (60 * 30) }
} while ($InStock -eq $notInStock)
Send-PushMessage -Type Message -title "NintendoSwitch" -msg "In stock, order now!!!!"

This is what I’ve been seeing…but eventually I’ll get a Push Message when the site text changes, and then, I’ll have my Switch!

Willing to help!

Are you struggling to extract certain text from a site?  Don’t worry, I’m here to help!  Leave me a comment below and I’ll do my best to help you.  But before you ask, check out this post on Reddit to see how I helped someone else with a similar problem.

reddit /r/PowerShell: Downloading News Articles from the Web


14 thoughts on “Extracting and monitoring web content with PowerShell”

  1. fxslayer March 30, 2017 / 2:08 pm

    I have used this on IE & Chrome but currently have data embedded inside ‘EO.Web’ controls (Essential Objects) java wrapper – and my Webpage that is inside the wrapper has hidden elements which are not recognized by inspect.exe OR UIspy or UIAutomation spy.


    • FoxDeploy March 30, 2017 / 4:02 pm

      If the data is being loaded by a Java connection, you should use fiddler to examine the connection and see if you can replicate it. If this is publicly accessible, I can help.


    • Nas November 19, 2018 / 8:12 am

      This is just what I’m looking for though I have an issue ( Forgive my ignorance I’m quite new to Powershell). I’m trying to write a script that will query a website to check if the latest version of a particular software is available by querying the ‘date’ class (which is a class name repeated over the page) within a tr id. (‘download-2209’) How would I modify the line below to enable this

      $rep.ParsedHtml.body.getElementsByClassName(‘class-name’)

      If it has changed I then want it to send a mail out and download it

      it’s a secure site so I’ve added the line below. to the start of the script.

      Any help is much appreciated


      • FoxDeploy November 19, 2018 / 10:57 am

        Hi! Please post your code as a github gist or pastebin link and share it with me. You can email me as well if it’s very secure and you’re concerned ☺


  2. Paul September 27, 2017 / 2:32 pm

I’m going through this now. First off, awesome real world example of how to practically scrape websites with PS! A quick nitpicky correction. You should change your code from $rep.ParsedHtml.body.getElementsByClassName(‘ats-prodBuy-inventory’) to $response.ParsedHtml.body.getElementsByClassName(‘ats-prodBuy-inventory’). It’s correct in the screenshot, but not the text preceding the screenshot.


  3. Amol Dhaygude November 9, 2018 / 12:23 pm

    Hi, this is working scripts with Internet but I have requirement of internal printer site open and extract the data but it’s not happening, could you please help me


    • FoxDeploy January 10, 2019 / 5:24 pm

      Sure, I’ll take a look at it tomorrow, should be possible!


    • FoxDeploy January 11, 2019 / 10:58 pm

      Alright, I took a stab and wrote it up here.

      Extracting Content from a site’s unpublished API

Imagine that you’ve found a site that has a perfect list of some info you need, but the site owners don’t have it in a format you can easily use! This happens a lot, but fortunately for us, if the data can be retrieved and displayed in a web browser, we can normally request that same data directly through a web call instead!

The problem is that the API endpoints we need to hit may not always be published publicly.

      For instance, this webpage has a lot of good info on beer, but no great way to export it.

      https://www.systembolaget.se/sok-dryck/?subcategory=%C3%96l&type=Ale%20brittisk-amerikansk%20stil&style=Imperial%2FDubbel%20IPA&fullassortment=1

      I’ll walk you through a technique to find the API used and grab the data ourselves!

      Start by opening Chrome and then open up the DevTools and navigate to the Network pane. Now, navigate to the URL.

Next, click ‘I am over 20’ and watch for requests of the type ‘xhr’. An XHR request is an AJAX request (Asynchronous JavaScript and XML, though it could just as well be called AJAJ, since pretty much everything is JSON now!)

      Example of an XHR Request in Chrome Tools


      AJAX requests are commonly used when you want to load a form quickly and then retrieve the data from an API to fill out a table or form. This is a lot faster for user experience than holding up the whole page load until you’ve sent the full data payload over.

      So, we watch for XHR requests because they’re basically always interesting!

      In this case, it loads their catalog of beer!

      Easily Viewing the body of the response

      You can just click the request to see info about it, and on the Response tab, you can see the payload. This is what we were looking for!

If the JSON is really complex, it might be hard to read in Chrome, so I recommend copying it and pasting it into JSONLint.com to format it. You can even take the URL for the XHR request (in the red box above) and paste it into JSONLint to get a pretty-printed version of the JSON object.

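If you’d rather stay in the console, PowerShell can pretty-print the JSON itself. A small sketch (the URL below is shortened; use whichever XHR URL you copied from the Network pane):

```powershell
# $xhrUrl is whatever request URL you copied from Chrome's Network pane
$xhrUrl = 'https://www.systembolaget.se/api/productsearch/search/sok-dryck/?fullassortment=1'

# Round-tripping through ConvertFrom-Json / ConvertTo-Json re-serializes
# the payload with indentation, giving a readable local copy
(Invoke-WebRequest -Uri $xhrUrl).Content | ConvertFrom-Json | ConvertTo-Json -Depth 5
```

This does the same job as JSONLint without sending the data anywhere, which also matters if the payload is sensitive.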

      Now that we know this URL has the info that we’re looking for, we can just paste this directly into Invoke-RestMethod and then look at the output until we find the values we want!

      $t = Invoke-RestMethod -Uri "https://www.systembolaget.se/api/productsearch/search/sok-dryck/?style=Imperial%2FDubbel%20IPA&subcategory=%C3%96l&type=Ale%20brittisk-amerikansk%20stil&sortdirection=Ascending&site=all&fullassortment=1"
      $t.ProductSearchResults | select ProducerName, ProductNameBold, ProductNameThin


      Enjoy, it was a fun little challenge.


  4. DanD February 27, 2019 / 1:59 pm

    I know its been awhile but im looking to write a fun little script that pulls the current food truck and perhaps the next one or two as well from seattlefoodtruck.com. For instance “https://www.seattlefoodtruck.com/schedule/plaza-east” The javascript that runs outputs the details but no matter what i do i cannot access the contents of the output. Any help on what im missing?


    • FoxDeploy March 3, 2019 / 11:32 am

      Here you go, here’s an explanation and the working code to do what you’re trying to do 🙂

      The key is to monitor the request in Chrome Dev tools and filter down to just XHR requests, which is what will be used normally to populate sublists or reactive content in webpages.

Getting food truck times and locations in Bellevue

      We got a request on FoxDeploy a few days ago asking the following:

      I know its been awhile but im looking to write a fun little script that pulls the current food truck and perhaps the next one or two as well from seattlefoodtruck.com. For instance “https://www.seattlefoodtruck.com/schedule/plaza-east” The javascript that runs outputs the details but no matter what i do i cannot access the contents of the output. Any help on what im missing?

      First off, we loaded up the URL in Chrome and then opened up devtools and went to the network tab, then refreshed. We’re looking for XHR requests

Filtering down to XHR requests (which is what an AJAX request will basically always be), we see just one request. We have a good idea it’s going to be AJAX, as a nice website like this will usually break its components up into reusable modules or partial views, and composite the whole thing together with a few requests.


      Clicking into the request, we see a number of the restaurants listed here….looks like we’re in the right neighborhood!

      We can then copy the request like this…

      And paste it into an Invoke-RestMethod cmdlet and then assign the results to a variable and play with them a bit…


$FoodTruckEvents = Invoke-RestMethod 'https://www.seattlefoodtruck.com/api/events?page=1&for_locations=51&with_active_trucks=true&include_bookings=true&with_booking_status=approved'
#this endpoint gives us a JSON response with multiple events, so we parse out each event
ForEach ($event in $FoodTruckEvents.events){
    #each event seems to be one day, so we resolve the .start_time and make it human readable
    $date = Get-Date $event.start_time | Select-Object -ExpandProperty DateTime
    #each event could have more than one truck, so we step through them
    ForEach ($booking in $event.bookings){
        #create a new PowerShell object to output the info we need
        [pscustomobject]@{Date=$date;TruckName=$booking.truck.name;FoodType=$booking.truck.food_categories -join ','}
    }
}



  5. AltHexOrtega December 11, 2019 / 5:25 pm

    Hi Stephen, I’m looking for some ideas on how could I monitor a login to a webpage… you know… invoke-login-logout, but I have no idea on how to tell through the result that login was successfull or wasn’t. I’m actually using “com object” so i can handle navigation and pass information of user and password to the login form… now the only thing I need to know is how can I get success or non-success message after login.

    Hope you see this.
    Thanks in advance!!


    • FoxDeploy December 28, 2019 / 11:59 am

      If you’re not authorized, how this is returned to the end user depends on the site specifically.

For instance, a lot of ASP.NET sites and nginx instances will return a status code 401. This is a property you’ll get back in your web response. But returning a raw 401 is sort of unfriendly to the end user, so a lot of sites will show a ‘nice error page’ instead. This means you’ll have to do a login in Chrome or Edge Dev, monitoring it with the Network tab open and the ‘Preserve log’ box checked, to ensure that you can see what sort of info is returned to the user.
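As a rough sketch of the 401 case (the URL here is a placeholder, not a real site): in Windows PowerShell, Invoke-WebRequest throws on a 401, and the server’s status code rides along on the exception, so you can catch it like this:

```powershell
try {
    $response = Invoke-WebRequest -Uri 'https://example.com/members-only' -ErrorAction Stop
    "Request succeeded with status $($response.StatusCode) - login likely worked"
}
catch [System.Net.WebException] {
    # The server's response is attached to the exception, including the status code
    $status = [int]$_.Exception.Response.StatusCode
    "Request was rejected with status $status - login likely failed"
}
```

If the site uses a friendly error page instead of a 401, you’ll get a 200 back either way, and you’d have to check the response body for login-failure text rather than the status code.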

      If this is too vague, post on /r/FoxDeploy on reddit and I’ll reply there with more info.


Have a code issue? Share your code by going to Gist.github.com and pasting your code there, then post the link here!
