python regex - replace newline (\n) to something else

Question

I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.

For example,

i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i) 

In [25]: i
Out [25]: 'Inc____Contact'

This string works fine. I can parse them using ____ later.

However it doesn't work on this particular string.

i =  "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)

Out [31]: '(2 months)____L'

It ate capital M. What am I missing here?

place the + inside the brackets - ([\n]+)([A-Z]+) - or just leave it out — Sebastian Proske
– Sebastian Proske, Commented Mar 18, 2016 at 18:20
@SebastianProske i = re.sub(r'([\n+])([A-Z])+', r"____\2", i) still eats first Capital letter. — aerin
– aerin, Commented Mar 18, 2016 at 18:24
As @SebastianProske points out, you need to place the + inside the brackets, like so: i = re.sub(r'(\n+)([A-Z]+)', r"____\2", i) Additionally, no square brackets around only one character is needed. — Jan
– Jan, Commented Mar 18, 2016 at 18:25

Quinn · Accepted Answer · 2016-03-18 19:16:18Z

6

EDIT To replace multiple continuous newline characters (\n) to ____, this should do:

>>> import re
>>> i =  "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'

(?=[A-Z]) is to assert "newline characters followed by Capital Letter". REGEX DEMO.

edited Mar 18, 2016 at 19:16

answered Mar 18, 2016 at 18:33

Quinn

4,5142 gold badges24 silver badges19 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Sebastian Proske · Accepted Answer · 2016-03-18 18:40:10Z

Well let's take a look at your regex ([\n]+)([A-Z])+ - the first part ([\n]+) is fine, matching multiple occurences of a newline into one group (note - this wont match the carriage return \r). However the second part ([A-Z])+ leeds to your error it matches a single uppercase letter into a capturing group - multiple times, if there are multiple Uppercase letter, which will reset the group to the last matched uppercase letter, which is then used for the replace.

Try the following and see what happens

import re    
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)

You could simply place the + inside the capturing group, so multiple uppercase letters are matched into it. You could also just leave it out, as it doesn't make a difference, how many of these uppercase letters follow.

import re    
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)

If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try

import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)

You could also use ([\r\n]+) as pattern, if you want to consider carriage returns

Saleem · Accepted Answer · 2016-03-18 18:26:33Z

0

Try:

import re
p = re.compile(ur'[\r?\n]')
test_str = u"(2 months)\n\nML"
subst = u"_"

result = re.sub(p, subst, test_str)

It will reduce string to

(2 months)__ML

See Demo

answered Mar 18, 2016 at 18:26

Saleem

9,0582 gold badges22 silver badges34 bronze badges

Collectives™ on Stack Overflow

python regex - replace newline (\n) to something else

3 Answers 3

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related