7

I have a PowerShell script whose main purpose is to search through HTML files within a folder, find specific HTML markup, and replace it with what I tell it to.

Three of my four find-and-replace operations work perfectly. The one I am having trouble with involves a regular expression.

This is the markup that I am trying to make my regex find and replace:

<a href="programsactivities_skating.html"><br />
                                           </a>

Here is the regex I have so far, along with the function I am using it in:

automate -school "C:\Users\$env:username\Desktop\schools\$question" -query '(?mis)(?!exclude1|exclude2|exclude3)(<a[^>]*?>(\s|&nbsp;|<br\s?/?>)*</a>)' -replace ''

And here is the automate function:

function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in  $processFiles) {
        $text = Get-Content $file
        $text = $text -replace $query, $replace
        $text | Out-File $file -Force -Encoding utf8
    }
}

I have been trying to figure out the solution to this for about 2 days now, and just can't seem to get it to work. I have determined that the problem is that I need to tell my regex to match across multiple lines, and that's what I'm having trouble with.

Any help anyone can provide is greatly appreciated.

Thanks in Advance.

3 Answers

20

Get-Content produces an array of strings, where each string contains a single line from your input file, so you won't be able to match text passages spanning more than one line. You need to merge the array into a single string if you want to be able to match more than one line:

$text = Get-Content $file | Out-String

or

[String]$text = Get-Content $file

or

$text = [IO.File]::ReadAllText($file)

Note that the 1st and 2nd methods don't preserve line breaks from the input file. Method 2 replaces every line break with a space (the default value of $OFS), as Keith pointed out in the comments, and method 1 puts <CR><LF> at the end of each line when joining the array. The latter may be an issue when dealing with Linux/Unix or Mac files.
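
To make the difference concrete, here is a minimal sketch that works on an in-memory array instead of a real file. The sample lines and the variable names ($lines, $perLine, $viaOutString, $viaCast and so on) are made up for illustration, and the pattern is a simplified form of the one in the question with the exclude lookahead left out.

```powershell
# Simulate what Get-Content returns: one string per line of the file.
$lines = @(
    '<p>before</p>',
    '<a href="programsactivities_skating.html"><br />',
    '                                           </a>',
    '<p>after</p>'
)

# Simplified form of the question's pattern (exclude lookahead omitted).
$query = '(?is)<a[^>]*?>(\s|&nbsp;|<br\s?/?>)*</a>'

# -replace runs on each array element separately, so the two-line anchor never matches.
$perLine = $lines -replace $query, ''

# Method 1: Out-String appends a line break after every element, including the last.
$viaOutString = $lines | Out-String

# Method 2: the [String] cast joins the elements with $OFS (a space by default).
[String]$viaCast = $lines

# Either single string lets \s span what used to be the line boundary.
$fromOutString = $viaOutString -replace $query, ''
$fromCast      = $viaCast -replace $query, ''
```

In this sketch $perLine should come back unchanged, while both $fromOutString and $fromCast lose the empty anchor. The cast version also loses every line break, which is why it is usually the least attractive of the three options.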


1 Comment

Or, if you're on V3 or greater, $text = Get-Content $file -Raw. BTW, be careful with the [String] cast example, as it does NOT preserve line breaks.
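
For reference, this is roughly how the function from the question might look with -Raw. It is only a sketch: it assumes PowerShell 5.0 or later, because the -NoNewline switch on Out-File is not available before that, and adding -NoNewline is my own choice to keep Out-File from appending an extra line break on every run.

```powershell
function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in $processFiles) {
        # -Raw (PowerShell 3.0+) returns the whole file as one string with its line breaks intact.
        $text = Get-Content -LiteralPath $file.FullName -Raw
        $text = $text -replace $query, $replace
        # -NoNewline (PowerShell 5.0+, an assumption here) avoids adding a trailing line break.
        $text | Out-File -LiteralPath $file.FullName -Force -Encoding utf8 -NoNewline
    }
}

# Called the same way as in the question, for example:
# automate -school "C:\Users\$env:username\Desktop\schools\$question" -query '...' -replace ''
```

Because the file content is now a single string, the \s in the pattern can match the line break between the opening and closing tags.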
1

I don't get what it is you're trying to do with those Exclude elements, but I find multi-line regex is usually easier to construct in a here-string:

$text = @'
<a href="programsactivities_skating.html"><br />
                                       </a>
'@

$regex = @'
(?mis)<a href="programsactivities_skating.html"><br />
\s+?</a>
'@

$text -match $regex

True


-1

Get-Content returns an array of strings; you want to concatenate those strings into a single string:

function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in  $processFiles) {
        $text = ""
        # Append each line to $text; assigning the pipeline's (empty) output back to $text would wipe it out
        Get-Content $file | % { $text += $_ + "`r`n" }
        $text = $text -replace $query, $replace
        $text | Out-File $file -Force -Encoding utf8
    }
}

1 Comment

Why not $text = (Get-Content $file) -join "`r`n" or, as mentioned above, $text = Get-Content $file | Out-String?
