regex - PowerShell multiple string replacement efficiency
I'm trying to replace 600 different strings in a large text file (30 MB+). I'm currently building a script that does this, following an earlier question.
Script:
    $string = gc $filepath
    $string | % {
        $_ -replace 'something0','somethingelse0' `
           -replace 'something1','somethingelse1' `
           -replace 'something2','somethingelse2' `
           -replace 'something3','somethingelse3' `
           -replace 'something4','somethingelse4' `
           -replace 'something5','somethingelse5' `
           ...
           (600 more lines...)
           ...
    }
    $string | ac "c:\log.txt"
But this checks each line 600 times, and there are over 150,000 lines in the text file, which means there's a lot of processing time.
Is there a better, more efficient alternative?
Any advice on this would be appreciated, cheers.
So, you're saying that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?
Yes, there is a way to do it, but not in PowerShell; at least I can't think of one. It can be done in Perl.
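The method:
- Construct a hash where the keys are the somethings and the values are the somethingelses.
- Join the keys of the hash with the | symbol, and use the result as a match group in the regex.
- In the replacement, interpolate an expression that retrieves the value from the hash, using the match variable for the capture group.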
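The three steps above can be sketched in Perl roughly as follows. This is a minimal illustration on an in-memory string, not the full script: the replacement table and the replace_all helper are made up for the example, and sorting the keys longest-first is an added precaution (so that a key which is a prefix of another key cannot win the alternation), not something the steps above require.

```perl
use strict;
use warnings;

# Step 1: hypothetical replacement table (keys = what to find, values = what to insert).
my %replacements = (
    'something0' => 'somethingelse0',
    'something1' => 'somethingelse1',
    'a.b'        => 'dot',    # key containing a regex metacharacter
);

# Step 2: quote each key so it matches literally, then join the keys with |.
# Longest keys first is an assumption added here, not part of the original method.
my $pattern = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %replacements;

# Step 3: a single substitution pass; $1 holds whichever key matched,
# so it can be used directly as the hash lookup.
sub replace_all {
    my ($text) = @_;
    $text =~ s/($pattern)/$replacements{$1}/g;
    return $text;
}

print replace_all("something1 then a.b then axb then something0"), "\n";
# prints: somethingelse1 then dot then axb then somethingelse0
```

Note that "axb" is left alone because the dot in the key 'a.b' was quoted, and that replaced text is not scanned again, so one replacement cannot trigger another.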
The problem:
Frustratingly, PowerShell doesn't expose the match variables outside of the regex replace call. It doesn't work with the -replace operator, and it doesn't work with [regex]::replace.
In Perl, you can do this, for example:
    $string =~ s/(1|2|3)/@{[$1 + 5]}/g;
This adds 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
However, in PowerShell, both of these fail:
    $string -replace '(1|2|3)',"$($1 + 5)"
    [regex]::replace($string,'(1|2|3)',"$($1 + 5)")
In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can quote the $ before the number in a double-quoted string so that it evaluates to the corresponding match group, but that defeats the purpose: it can't participate in an expression.
The solution:
[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]
If using another language is acceptable to you, the following Perl script works like a charm:
    $filePath = $ARGV[0]; # Or hard-code it or whatever
    open INPUT, "< $filePath" or die "Cannot open $filePath: $!";
    open OUTPUT, '> c:\log.txt' or die "Cannot open output file: $!";

    # Keys are the strings to find; values are their replacements.
    %replacements = (
      'something0' => 'somethingelse0',
      'something1' => 'somethingelse1',
      'something2' => 'somethingelse2',
      'something3' => 'somethingelse3',
      'something4' => 'somethingelse4',
      'something5' => 'somethingelse5',
      'x:\group_14\dacu' => '\\dacu$',
      '.*[^xyz]' => 'oo{xyz}',
      'moresomethings' => 'moresomethingelses'
    );

    foreach (keys %replacements) {
      push @strings, qr/\Q$_\E/;           # quote metacharacters so the key matches literally
      $replacements{$_} =~ s/\\/\\\\/g;    # double the backslashes in the replacement string
    }

    # One alternation of all keys: key1|key2|...
    $pattern = join '|', @strings;

    while (<INPUT>) {
      s/($pattern)/$replacements{$1}/g;    # $1 is the key that matched
      print OUTPUT;
    }

    close INPUT;
    close OUTPUT;
It searches for the keys of the hash (left of the =>) and replaces them with the corresponding values. Here's what's happening:
- The foreach loop goes through all the elements of the hash and creates an array called @strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result of that quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
- Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
- The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
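To see why the \Q...\E quoting in the first step matters, here is a small sketch using one of the keys from the script, '.*[^xyz]'. The input line is made up for the illustration. Without quoting, the key is treated as a regex and swallows the whole line; with quoting, only the literal text is replaced.

```perl
use strict;
use warnings;

my $key  = '.*[^xyz]';           # a key from the script, full of metacharacters
my $line = 'foo .*[^xyz] bar';   # hypothetical input line containing the key literally

# Unquoted: $key is interpreted as a regex, so .* matches the whole line.
(my $unquoted = $line) =~ s/$key/oo{xyz}/g;

# Quoted with \Q...\E: the key matches only its own literal text.
(my $quoted = $line) =~ s/\Q$key\E/oo{xyz}/g;

print "unquoted: $unquoted\n";   # prints: unquoted: oo{xyz}
print "quoted:   $quoted\n";     # prints: quoted:   foo oo{xyz} bar
```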
BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.
- while (<INPUT>) reads the file one line at a time. That's a lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
- I simplified @{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so @{[ ]} is used as a workaround: it creates an anonymous array of one element containing the expression's value, which is then interpolated. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
- The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
- I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.