I'm looking for a regex that lets me remove certain "\r\n" sequences (or just \n in Python) when the following line does not start with a number.

In Perl I achieved this by matching \r\n(?!\d) and replacing with \1 (so as not to lose the character matched on the following line), but when I try the same in Python (\n(?!\d)), it removes every \n in my document.

  • You have not defined any capture group in your pattern. Replace with empty string. See \r?\n(?!\d) demo. Is this demo working the way you expect? Commented May 4, 2016 at 15:34
  • I tried replacing with an empty string, but the result is that almost every \n is removed from my document. I've tried that demo and, unless I'm doing something wrong, the result is unfortunately much the same. Commented May 4, 2016 at 15:40
  • Please paste the string you test against (as a Python variable) and the expected result. I feel as if your intention is not what you described in the question. All questions related to newlines are almost always a result of some misunderstanding or typo in the code, or even a matter of checking if there are CR+LF or just LF line endings. Sometimes, encoding issues. Thus, some code that does not work would be very helpful. Commented May 4, 2016 at 15:42
  • Is it removing any newlines that are followed by a digit? You say it removes almost all newlines, but we have no sample data to look at. Maybe almost all newlines should be removed? We don't know until you show us examples of ones being removed that shouldn't be. Commented May 4, 2016 at 15:43
  • Sorry, I'm trying to replicate the issue with some example as I am not allowed to paste the raw data I'm working on (due to company policy). I have re-tested with the demo, and I'm getting what I expected, so it seems not to be a problem with regex, but with how I'm applying it (over a file, line by line, not over a string var.) Commented May 4, 2016 at 15:49

1 Answer

Based on your comments, I'm pretty sure the issue is that you're applying your match to individual lines, rather than to the whole text at once. A zero-width negative lookahead (which you're using, with (?!\d)) will match successfully if the newline is the last character in the input string, which will be the case if your code is working line by line. The lookahead basically says "match if not followed by a digit". That is always true if there is nothing left in the input string.
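To make the difference concrete, here is a minimal sketch that applies the same pattern once per line and once to the whole text. The sample text and the variable names are made up for illustration; they are not the asker's data.

```python
import re

# Same pattern as in the question, with an optional \r for CR+LF files.
pattern = re.compile(r"\r?\n(?!\d)")

# Hypothetical sample: line 2 is a continuation of line 1.
text = "1 first record\ncontinued text\n2 second record\n"

# Line by line: each line ends with a newline followed by nothing,
# so the negative lookahead always succeeds and every newline is removed.
per_line = "".join(
    pattern.sub("", line) for line in text.splitlines(keepends=True)
)

# Whole text: the lookahead can see the first character of the next line,
# so the newline in front of "2 second record" is kept.
whole = pattern.sub("", text)

print(repr(per_line))
print(repr(whole))
```

Note that even on the whole text the very last newline is removed, because nothing follows it. If that matters, it would need to be added back or excluded separately.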

You can't change the regex to fix this issue. Nothing you check on a single line can tell you what the contents of the next line will be, so you'll need to change your surrounding code in some way. One approach would be to read and transform the whole text rather than just a single line at a time. Or you could use something like the pairwise recipe from itertools to examine two lines at a time, looking at the second line to decide whether you need to transform the first.
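A rough sketch of the pairwise idea follows. The pairwise helper is the tee-based recipe from the itertools documentation (newer Python versions also ship itertools.pairwise). The join_continuations function and the None sentinel are my own assumptions about how you might wire it up, not the only way to do it.

```python
import re
from itertools import chain, tee


def pairwise(iterable):
    """s -> (s0, s1), (s1, s2), ... (recipe from the itertools docs)."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


def join_continuations(lines):
    """Yield each line, dropping its line ending when the next line
    does not start with a digit. (Hypothetical helper name.)"""
    # A trailing None sentinel ensures the final line is also yielded.
    for current, following in pairwise(chain(lines, [None])):
        if following is not None and not re.match(r"\d", following):
            yield current.rstrip("\r\n")
        else:
            yield current


lines = ["1 first record\n", "continued text\n", "2 second record\n"]
result = "".join(join_continuations(lines))
print(repr(result))
```

Because a file object iterates over its lines lazily, something like join_continuations(open_file) should work without loading the whole file into memory. Unlike the regex applied to the full text, this version leaves the final line's newline alone.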

I'd also like to note that substituting with \1 is not appropriate, since you have no capturing group (the parentheses in your pattern are part of the zero-width lookahead syntax, not grouping syntax). You should just be substituting with an empty string (which is effectively what you're doing in Perl anyway, since the back-reference doesn't refer to anything).
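You can check the group count from Python itself. The short sketch below uses a made-up sample string. It also shows that Python's re module rejects a \1 replacement when the pattern has no group, rather than quietly treating it as empty.

```python
import re

pattern = re.compile(r"\r?\n(?!\d)")

# The lookahead parentheses do not create a capture group.
group_count = pattern.groups
print(group_count)

# Hypothetical sample text.
sample = "1 first\nmore\n2 second"

# Replacing with an empty string is all that is needed.
cleaned = pattern.sub("", sample)
print(repr(cleaned))

# A back-reference to a non-existent group raises re.error in Python.
try:
    pattern.sub(r"\1", sample)
    backref_error = None
except re.error as exc:
    backref_error = exc
print(backref_error)
```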

1 Comment

You are right, that is the main problem. I'm parsing one line at a time instead of parsing the file content as a whole. I will try your suggestion about using pairwise from itertools.
