Regex removing certain newlines (Python)

Question

I'm looking for a regex that lets me remove certain "\r\n" sequences (or just \n in Python) when the following line does not start with a number.

In Perl I have achieved this by matching \r\n(?!\d) and replacing with \1 (in order not to lose the character matched in the following line), but when I try that in Python (\n(?!\d)), it removes every \n in my document.

You have not defined any capture group in your pattern. Replace with an empty string. See the \r?\n(?!\d) demo. Is this demo working the way you expect?
– Wiktor Stribiżew, Commented May 4, 2016 at 15:34
I tried replacing with an empty string, but the result is that almost every \n is removed from my document. I've tried that demo and, unless I'm doing something wrong, the result is unfortunately much the same.
– Jausk, Commented May 4, 2016 at 15:40
Please paste the string you test against (as a Python variable) and the expected result. I feel as if your intention is not what you described in the question. Questions about newlines are almost always the result of some misunderstanding or typo in the code, or a matter of checking whether the line endings are CR+LF or just LF. Sometimes they are encoding issues. So some code that does not work would be very helpful.
– Wiktor Stribiżew, Commented May 4, 2016 at 15:42
Is it removing any newlines that are followed by a digit? You say it removes almost all newlines, but we have no sample data to look at. Maybe almost all newlines should be removed? We don't know until you show us examples of ones being removed that shouldn't be.
– coffee-converter, Commented May 4, 2016 at 15:43
Sorry, I'm trying to replicate the issue with an example, as I am not allowed to paste the raw data I'm working on (due to company policy). I have re-tested with the demo and I'm getting what I expected, so the problem seems to be not the regex but how I'm applying it (over a file, line by line, rather than over a single string variable).
– Jausk, Commented May 4, 2016 at 15:49

Blckknght · Accepted Answer · 2016-05-04 17:24:25Z

Based on your comments, I'm pretty sure the issue is that you're applying your match to individual lines, rather than to the whole text at once. A zero-width negative lookahead (which you're using, with (?!\d)) will match successfully if the newline is the last character in the input string, which will be the case if your code is working line by line. The lookahead basically says "match if not followed by a digit". That is always true if there is nothing left in the input string.
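
To make the difference concrete, here is a minimal sketch that applies the same pattern once per line and once to the whole text. The sample string is made up for illustration (the real data was never posted), and splitlines(keepends=True) stands in for reading a file line by line.

```python
import re

# The pattern from the comments: a newline that is NOT followed by a digit.
PATTERN = re.compile(r"\r?\n(?!\d)")

# Hypothetical sample data, not the asker's real file.
text = "1 first record\ncontinued\n2 second record\n"

# Line by line: every line ends with its own newline, so nothing follows it
# inside that string and the negative lookahead always succeeds.
per_line = "".join(
    PATTERN.sub("", line) for line in text.splitlines(keepends=True)
)

# Whole text: the lookahead can see the first character of the next line.
whole = PATTERN.sub("", text)

print(repr(per_line))  # every newline is gone
print(repr(whole))     # the newline before "2" survives
```

Note that even on the whole text the final newline is removed, because end of input also counts as "not followed by a digit".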

You can't change the regex to fix this issue. Nothing you check on a single line can tell you what the contents of the next line will be, so you'll need to change your surrounding code in some way. One approach would be to read and transform the whole text rather than just a single line at a time. Or you could use something like the pairwise recipe from itertools to examine two lines at a time, looking at the second line to decide whether you need to transform the first.
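
Both approaches might look roughly like the sketch below. The helper names (join_whole_text, join_pairwise, strip_newline) and the sample data are assumptions made for illustration, and io.StringIO stands in for an open file. The pairwise helper is the recipe from the itertools documentation; on Python 3.10 and later, itertools.pairwise could be used instead.

```python
import io
import re
from itertools import chain, tee

PATTERN = re.compile(r"\r?\n(?!\d)")
STARTS_WITH_DIGIT = re.compile(r"\d")


def pairwise(iterable):
    """itertools recipe: s -> (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


def strip_newline(line):
    """Drop one trailing CR+LF or LF, the same text the pattern would consume."""
    if line.endswith("\r\n"):
        return line[:-2]
    if line.endswith("\n"):
        return line[:-1]
    return line


def join_whole_text(text):
    """Approach 1: run the regex over the complete text in one call."""
    return PATTERN.sub("", text)


def join_pairwise(lines):
    """Approach 2: look at each line together with the line that follows it."""
    # The trailing None sentinel lets the last real line show up as `current`.
    for current, following in pairwise(chain(lines, [None])):
        if following is not None and STARTS_WITH_DIGIT.match(following):
            yield current  # next line starts with a digit: keep the newline
        else:
            yield strip_newline(current)


# Hypothetical sample data; io.StringIO imitates iterating over a file.
sample = "1 first record\ncontinued\n2 second record\n"
whole_result = join_whole_text(sample)
pairwise_result = "".join(join_pairwise(io.StringIO(sample)))
print(repr(whole_result))
print(repr(pairwise_result))
```

The whole-text version is simpler but holds the entire file in memory. The pairwise version streams the file and only ever keeps two lines at once, which may matter for very large inputs.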

I'd also like to note that substituting with \1 is not appropriate, since you have no capturing group (the parentheses in your pattern are part of the zero-width lookahead syntax, not grouping syntax). You should just be substituting with an empty string (which is effectively what your Perl substitution does anyway, since the back-reference doesn't refer to anything).
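
A short check of the capture-group point, as a sketch with a made-up input string: the compiled pattern reports zero groups, an empty replacement does the job, and Python's re module (unlike a Perl substitution, where an unset back-reference simply expands to nothing) refuses a \1 replacement for this pattern.

```python
import re

PATTERN = re.compile(r"\r?\n(?!\d)")

# The lookahead parentheses do not create a capture group.
group_count = PATTERN.groups

# Replacing with an empty string is all that is needed.
cleaned = PATTERN.sub("", "a\r\nb\r\n1 c")

# A back-reference to a group that does not exist is rejected by re.sub.
try:
    PATTERN.sub(r"\1", "a\nb")
except re.error as exc:
    backref_error = str(exc)
else:
    backref_error = None

print(group_count, repr(cleaned), backref_error)
```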

1 Comment

Jausk Over a year ago

You are right, that is the main problem. I'm parsing one line at a time instead of parsing the file content as a whole. I will try your suggestion about using pairwise from itertools.
