Pete Hinchley: Exporting a Web Site as a Collection of Word Documents using PowerShell

For reasons I can't quite understand, I was recently asked to develop a solution for exporting a web site to a collection of Microsoft Word documents. The content within each page, excluding headers, footers, navigation bars, etc., was to be saved to a separate Word document in a dynamically generated folder structure that mirrored the site hierarchy.

Fortunately, I didn't need to crawl the site, as the URL of every page to be processed was to be provided in a text file exported from the web site content management system (with each URL stored on a separate line within the file).
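
For example, the file might contain entries like the following (the addresses here are just placeholders):

https://www.example.com/about
https://www.example.com/about/history
https://www.example.com/services/consulting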

A few notes on the solution:

- The list of page URLs is read from urls.txt in the same folder as the script.
- Each page is downloaded with invoke-webrequest, the relevant fragment is extracted from the parsed HTML, written to a temporary file (scratch.html), and then converted to a Word document via the Word COM automation interface, so Microsoft Word must be installed on the machine running the script.
- Pages flagged with a robots noindex meta tag are skipped.
- The output folder structure is derived from the path component of each URL, and a reference URL is inserted into the footer of every generated document.

Here is the code:

# script root ($psscriptroot is empty when the code is run interactively, so fall back to the current directory).
$root = (pwd).path
if ($psscriptroot -ne '') { $root = $psscriptroot }
$root = $root.trim('\')

# path to file containing urls to crawl.
$urls = "$root\urls.txt"

# path to temporary html file.
$scratch = "$root\scratch.html"

# inserted into the footer of every document.
$footerReference = 'http://example.com'

# extract content by class name.
function extract-class($body, $class) {
  return $body.getElementsByClassName($class) | select -first 1 | select -expand innerhtml
}

# extract content by id.
function extract-id($html, $id) {
  return $html.IHTMLDocument3_getElementByID($id) | select -expand innerhtml
}

# get meta tag.
function get-metatag($html, $name) {
  $head = ($html.childnodes | ? nodename -eq 'HTML').childnodes | ? nodename -eq 'HEAD'
  return $head.getElementsByTagName('meta') | ? name -eq $name | select -expand content
}

# save section of web page to a temporary html file.
function save-webpage($url) {
  $page = invoke-webrequest $url
  $html = $page.parsedhtml

  # skip pages flagged with <meta name="robots" content="noindex">.
  if ((get-metatag $html 'robots') -eq 'noindex') { return }

  # create folder structure based on uri.
  $structure = ($url -replace 'https?://[^/]+/?(.*)', '$1') -replace '/', '\'
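  # e.g. a url such as https://www.example.com/services/consulting (a made-up address)
  # yields the relative path services\consulting, so the document ends up at
  # $root\services\consulting.docx.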

  $folder = split-path -parent $structure
  $folder = '{0}\{1}' -f $root, $folder
  $file   = split-path -leaf $structure

  mkdir $folder -force -ea silentlycontinue | out-null

  $path = '{0}\{1}.docx' -f $folder, $file

  # e.g. extract <div class="content">
  # $text = extract-class $html.body 'content'

  # e.g. extract <div id="maincontent">
  $text = extract-id $html 'maincontent'

  # microsoft word needs the content to be wrapped in html tags.
  $text = '<html>{0}</html>' -f $text
  $text | out-file $scratch

  return $path
}

# convert temporary html file to word document.
function convert-webpage($path) {
  $format = 'microsoft.office.interop.word.WdSaveFormat' -as [type]
  $word = new-object -comobject word.application
  $word.visible = $false

  "Processing $path"

  $doc = $word.documents.open($scratch)
  $footer = $doc.sections.first.footers | ? index -eq 1
  $footer.range.text = $footerReference
  $doc.saveas([ref]$path, [ref]$format::wdFormatDocumentDefault)

  $doc.close()
  $word.quit()
  $word = $null

  [gc]::collect()
  [gc]::WaitForPendingFinalizers()
}

get-content $urls | %{
  if ($path = save-webpage $_) { convert-webpage $path }
}

remove-item $scratch -force -ea silentlycontinue
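
Assuming the script has been saved as export-site.ps1 (the name is arbitrary) in the same folder as urls.txt, it can be run from a command prompt like this:

powershell -executionpolicy bypass -file export-site.ps1

The generated documents are written beneath the script folder in a structure that mirrors the path component of each URL.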