RegEx in Powershell, combine replace calls

Question

I've written my own CSS minifier for fun and profit (not so much profit), and it works great. I am now trying to streamline it, since I'm essentially filtering the file 10+ times. Not a huge deal with a small file, but the larger they get, the worse that performance hit will be.

Is there a more elegant way to filter my input file? I'm assuming regex will have a way, but I am no regex wizard...

$a = (gc($path + $file) -Raw)
$a = $a -replace "\s{2,100}(?<!\S)", ""
$a = $a -replace " {",    "{"
$a = $a -replace "} ",    "}"
$a = $a -replace " \(",   "("
$a = $a -replace "\) ",   ")"
$a = $a -replace " \[",   "["
$a = $a -replace "\] ",   "]"
$a = $a -replace ": ",    ":"
$a = $a -replace "; ",    ";"
$a = $a -replace ", ",    ","
$a = $a -replace "\n",    ""
$a = $a -replace "\t",    ""

To save you a little headache: I'm basically using the first -replace to strip any run of whitespace from 2 to 100 characters long. The remaining -replace statements clean up single spaces in specific circumstances.

How can I combine this, so I'm not filtering the file 12 times?

I suggest trying the string method to replace literal strings, e.g. $a = $a.Replace(') ', ')'), and measuring the time. Don't be surprised if it turns out much faster than any of the regex-based answers, even on a large text. Anyway, you should use a proper CSS parser instead.
– woxxom, Commented Nov 5, 2016 at 10:22
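
A minimal sketch of the measurement this comment suggests. The sample string, the repetition count and the variable names are made up for illustration, and the timings will vary with the machine and the input, so treat the numbers as something to check locally rather than as a result.

```powershell
# Hypothetical sample input: one small rule repeated to build a large string.
$css = 'a { color: red; margin: 0, 1px; } ' * 20000

# Regex-based: the -replace operator runs a regular expression for each rule.
$regexTime = Measure-Command {
    $null = $css -replace ': ', ':' -replace '; ', ';' -replace ', ', ','
}

# Literal: String.Replace searches for the exact text, with no regex involved.
$literalTime = Measure-Command {
    $null = $css.Replace(': ', ':').Replace('; ', ';').Replace(', ', ',')
}

'regex:   {0:N1} ms' -f $regexTime.TotalMilliseconds
'literal: {0:N1} ms' -f $literalTime.TotalMilliseconds

# Both approaches should produce identical text for these literal patterns.
$viaRegex   = $css -replace ': ', ':' -replace '; ', ';' -replace ', ', ','
$viaLiteral = $css.Replace(': ', ':').Replace('; ', ';').Replace(', ', ',')
$viaRegex -ceq $viaLiteral
```

Only patterns without regex features (quantifiers, classes, groups) can be swapped for String.Replace like this; the \s{2,100} rule would still need a regex.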

TessellatingHeckler · Accepted Answer · 2016-11-04 23:01:48Z

A negative lookbehind (?<!\S) is used like this: (?<!prefix)thing, to match a thing which does not have the prefix on its left. When you put it at the end of the regex, with nothing after it, it does nothing here: the character just before that position was matched by \s, so the assertion always succeeds. You might have intended it to go on the left, or might have intended it to be a negative lookahead; I won't try to guess, I'll just remove it for this answer.
You're missing the use of character classes. abc looks for the text abc, but put the characters in square brackets and [abc] looks for any one of the characters a, b, c.
1. Using that, you can combine the last two lines into one: [\n\t], which matches either a newline or a tab.
You can combine the two separate "replace with nothing" rules using the regex logical OR, |, to make one match: \s{2,100}|[\n\t] matches the run of whitespace or the newline or the tab. (You could probably use OR twice instead of a character class, i.e. \n|\t, fwiw.)
Use regex capture groups, which allow you to reference whatever the regex matched without knowing in advance what that was.
1. e.g. "space bracket -> bracket" and "space colon -> colon" and "space comma -> comma" all follow the general pattern "space (thing) -> (thing)". The same goes for the trailing spaces: "(thing) space -> (thing)".
2. Combine capture groups with character classes to merge the rest of the lines into one.

e.g.

$a -replace " (:)", '$1'    # capture the colon, replacement is not ':' 
                            # it is "whatever was in the capture group"

$a -replace " ([:,])", '$1' # capture the colon, or comma. Replacement  
                            # is "whatever was in the capture group"
                            # space colon -> colon, space comma -> comma

# make the space optional with \s{0,1} (same as \s?) and put it at the start
# and end; now it will match "space (thing)" or "(thing) space"
$a -replace '\s{0,1}([:,])\s{0,1}', '$1'

# Add in the rest of the characters, with appropriate \ escapes
# gained from [regex]::Escape('those chars here')

# Your original:
$a = (gc D:\css\1.css -Raw)
$a = $a -replace "\s{2,100}(?<!\S)", ""
$a = $a -replace " {",    "{"
$a = $a -replace "} ",    "}"
$a = $a -replace " \(",   "("
$a = $a -replace "\) ",   ")"
$a = $a -replace " \[",   "["
$a = $a -replace "\] ",   "]"
$a = $a -replace ": ",    ":"
$a = $a -replace "; ",    ";"
$a = $a -replace ", ",    ","
$a = $a -replace "\n",    ""
$a = $a -replace "\t",    ""

# My version:
$b = gc d:\css\1.css -Raw
$b = $b -replace "\s{2,100}|[\n\t]", ""
$b = $b -replace '\s{0,1}([])}{([:;,])\s{0,1}', '$1'

# Test that they both do the same thing on my random downloaded sample file:
$b -eq $a

# Yep.

Do that again with another | to combine the two into one:

$c = gc d:\css\1.css -Raw
$c = $c -replace "\s{2,100}|[\n\t]|\s{0,1}([])}{([:;,])\s{0,1}", '$1'

$c -eq $a   # also same output as your original.

NB. when the whitespace-run or the newline/tab alternative matches, the capture
    group takes no part in the match, so '$1' is empty, which removes the text.

And you can spend lots of time building your own unreadable regex which probably won't be noticeably faster in any real scenario. :)

NB. In the replacement '$1', the dollar is .NET regex engine syntax, not a PowerShell variable. If you use double quotes, PowerShell will interpolate the variable $1 into the string before the regex engine sees it, and since that variable is normally undefined, the replacement becomes empty.
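
A small sketch of that quoting difference, using a made-up input string. It assumes no variable named $1 has been defined in the session and that strict mode is off.

```powershell
$text = 'a : b'

# Single quotes: '$1' reaches the regex engine, which substitutes group 1.
$single = $text -replace '\s{0,1}([:,])\s{0,1}', '$1'

# Double quotes: PowerShell expands $1 as a variable before the regex engine
# sees the string. $1 is undefined here, so the replacement is empty.
$double = $text -replace '\s{0,1}([:,])\s{0,1}', "$1"

# A backtick keeps the dollar literal inside double quotes.
$escaped = $text -replace '\s{0,1}([:,])\s{0,1}', "`$1"

$single    # a:b
$double    # ab
$escaped   # a:b
```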

Awesome. Thank you for your help, @TessellatingHeckler ! This works perfectly and looks a lot better now too!

Wiktor Stribiżew · Answer · 2016-11-04 21:04:01Z

You may join the patterns that are similar into 1 bigger expression with capturing groups, and use a callback inside a Regex replace method where you may evaluate the match structure and use appropriate action.

Here is a solution for your scenario that you may extend:

$callback = {
  param($match)
  # Group 1 is whitespace to drop; groups 2-4 hold the punctuation to keep.
  if     ($match.Groups[1].Success) { "" }
  elseif ($match.Groups[2].Success) { $match.Groups[2].Value }
  elseif ($match.Groups[3].Success) { $match.Groups[3].Value }
  elseif ($match.Groups[4].Success) { $match.Groups[4].Value }
}
$path = "d:\input\folder\"
$file = "input_file.txt"
$a = [IO.File]::ReadAllText($path + $file)
$rx = [regex]'(\s{2,100}(?<!\S)|[\n\t])|\s+([{([])|([])}])\s+|([:;,])\s+'
$rx.Replace($a, $callback) | Out-File "d:\result\file.txt"

Pattern details:

(\s{2,100}(?<!\S)|[\n\t]) - Group 1 capturing 2 to 100 whitespace chars OR a newline or tab char. The trailing lookbehind (?<!\S) only checks that the char just before it is not a non-whitespace char, which is always true right after \s, so it is redundant
| - or
\s+([{([]) - just matching one or more whitespaces (\s+), and then capturing into Group 2 any single char from the [{([] character class: {, ( or [
|([])}])\s+ - or Group 3 capturing any single char from the [])}] character class: }, ) or ] and then just matching one or more whitespaces
|([:;,])\s+ - or Group 4 capturing any char from [:;,] char class (:, ; or ,) and one or more whitespaces.
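
A quick way to check that the lookbehind in Group 1 is redundant: run the sub-pattern with and without it and compare where the matches fall. The sample string and variable names below are made up for illustration.

```powershell
# Hypothetical sample with runs of whitespace of various lengths.
$sample = "a {`n    color:   red;`n}"

# Record each match as "index+length" so the two runs can be compared.
$spansWith = ([regex]::Matches($sample, '\s{2,100}(?<!\S)') |
    ForEach-Object { '{0}+{1}' -f $_.Index, $_.Length }) -join ' '

$spansWithout = ([regex]::Matches($sample, '\s{2,100}') |
    ForEach-Object { '{0}+{1}' -f $_.Index, $_.Length }) -join ' '

$spansWith
$spansWithout
$spansWith -eq $spansWithout   # True: the lookbehind changes nothing
```

The last character consumed by \s{2,100} is always whitespace, so a lookbehind asserting "not preceded by non-whitespace" at that point cannot fail.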

edited Nov 4, 2016 at 21:04

answered Nov 4, 2016 at 20:58


4 Comments

woxxom Over a year ago

The callback approach is much slower than the OP's 10+ repetitions because scriptblock invocation overhead is very big in PowerShell, and this particular one will be invoked a lot.

Wiktor Stribiżew Over a year ago

I do not insist it is the best approach here, I just made the first step of OP code analysis. If I had more time, I'd reach the higher level of abstraction like in the accepted answer. In cases where you can't get to a single backreference replacement, this will be the only valid approach.

woxxom Over a year ago

Proper parsing seems the only valid approach. Brute force regexps will fail on edge cases like content property with parentheses inside. Anyway, my point is that instead of slow callbacks one can use for example [regex]::matches and a much faster normal loop via while or foreach statement.

woxxom Over a year ago

...and assemble the output in [StringBuilder]
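
A rough sketch of the loop these comments describe: collect the matches once, walk them with foreach, and build the output in a StringBuilder. The pattern is borrowed from the accepted answer's single-group version, and the sample text and variable names are assumptions for illustration. It has not been benchmarked here, and it shares the edge-case limits of any regex approach to CSS.

```powershell
# Hypothetical sample input.
$text = "a {`n    color: red;`n}"

# One capture group: punctuation to keep. The whitespace alternatives capture nothing.
$rx = [regex]'\s{2,100}|[\n\t]|\s{0,1}([])}{([:;,])\s{0,1}'

$sb  = New-Object System.Text.StringBuilder
$pos = 0
foreach ($m in $rx.Matches($text)) {
    # Copy the untouched text between the previous match and this one.
    [void]$sb.Append($text, $pos, $m.Index - $pos)
    # Keep only the captured punctuation (empty if group 1 did not participate).
    [void]$sb.Append($m.Groups[1].Value)
    $pos = $m.Index + $m.Length
}
# Copy whatever follows the last match.
[void]$sb.Append($text, $pos, $text.Length - $pos)

$result = $sb.ToString()
$result                                  # a{color:red;}
$result -ceq $rx.Replace($text, '$1')    # same output as a plain '$1' replacement
```

When a single '$1' replacement is enough, as here, the loop adds nothing; it is only worth considering where each match needs its own logic, which is the case the callback was written for.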
