3

I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.

For example,

i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i) 

In [25]: i
Out [25]: 'Inc____Contact'

This string works fine. I can parse them using ____ later.

However it doesn't work on this particular string.

i =  "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)

Out [31]: '(2 months)____L'

It ate capital M. What am I missing here?

8
  • 3
    place the + inside the brackets - ([\n]+)([A-Z]+) - or just leave it out Commented Mar 18, 2016 at 18:20
  • What is particular string? please update your post with Commented Mar 18, 2016 at 18:21
  • @Saleem i = "(2 months)\n\nML" Commented Mar 18, 2016 at 18:23
  • @SebastianProske i = re.sub(r'([\n+])([A-Z])+', r"____\2", i) still eats first Capital letter. Commented Mar 18, 2016 at 18:24
  • 1
    As @SebastianProske points out, you need to place the + inside the capturing group, like so: i = re.sub(r'(\n+)([A-Z]+)', r"____\2", i) Additionally, square brackets are not needed around a single character. Commented Mar 18, 2016 at 18:25

3 Answers

6

EDIT: To replace multiple consecutive newline characters (\n) with ____, this should do:

>>> import re
>>> i =  "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'

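(?=[A-Z]) is a lookahead: it asserts that the newline characters are followed by a capital letter without consuming that letter, so nothing has to be put back in the replacement. REGEX DEMO.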
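As a rough sketch of how the lookahead behaves, the snippet below wraps the pattern in a helper. The names PATTERN and mark_breaks are made up for illustration and are not from the answer; the second call uses an invented input to show that newlines followed by a lowercase letter are left alone.

```python
import re

# Hypothetical names, for illustration only.
# \n+ consumes the newlines; (?=[A-Z]) only peeks at the next character.
PATTERN = re.compile(r'\n+(?=[A-Z])')

def mark_breaks(text, marker="____"):
    """Replace each run of newlines that precedes a capital letter."""
    return PATTERN.sub(marker, text)

print(mark_breaks("(2 months)\n\nML"))    # (2 months)____ML
print(repr(mark_breaks("a\n\nb\n\nC")))   # 'a\n\nb____C'

# If the marker is only a stepping stone for parsing, splitting directly
# on the same pattern may be enough.
print(PATTERN.split("(2 months)\n\nML"))  # ['(2 months)', 'ML']
```

Because the capital letter is never part of the match, no backreference is needed in the replacement string.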

1

Let's take a look at your regex ([\n]+)([A-Z])+. The first part, ([\n]+), is fine: it matches one or more newlines into one group (note that this won't match the carriage return \r). The second part, ([A-Z])+, leads to your error. It captures a single uppercase letter into a group and repeats that group, so when several uppercase letters follow, each repetition overwrites the group with the latest letter. The whole run of letters is consumed by the match, but only the last one is left in \2 for the replacement.

Try the following and see what happens

import re    
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
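For reference, here is a small sketch of what that snippet produces, using re.search to inspect the groups. The variable names m and result are only for illustration.

```python
import re

text = "Inc\n\nABRAXAS"

# Inspect what the original pattern matches and what group 2 holds.
m = re.search(r'([\n]+)([A-Z])+', text)
print(repr(m.group(0)))  # '\n\nABRAXAS'  (the whole word is consumed)
print(m.group(2))        # S              (only the last repetition is kept)

# The replacement therefore puts back just the final letter.
result = re.sub(r'([\n]+)([A-Z])+', r"____\2", text)
print(result)            # Inc____S
```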

You could simply place the + inside the capturing group, ([A-Z]+), so that all the uppercase letters are captured together. You could also just leave the + out, since it makes no difference how many uppercase letters follow:

import re    
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
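A quick check, sketched under the assumption that only the text after the newlines matters, suggests both variants give the same result on this input. The names with_plus and without_plus are illustrative.

```python
import re

text = "Inc\n\nABRAXAS"

# + inside the group: the whole run of capitals is captured and put back.
with_plus = re.sub(r'(\n+)([A-Z]+)', r"____\2", text)

# No + at all: only the first capital is matched and put back;
# the remaining letters are never touched by the match.
without_plus = re.sub(r'(\n+)([A-Z])', r"____\2", text)

print(with_plus)     # Inc____ABRAXAS
print(without_plus)  # Inc____ABRAXAS
```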

If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try

import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)

You could also use ([\r\n]+) as the pattern if you want carriage returns to be matched as well.
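To illustrate why this matters, the sketch below runs both patterns on a string with Windows-style line endings. The input is made up for the example and is not from the question.

```python
import re

# Assumed input with \r\n line endings, for illustration only.
text = "Inc\r\n\r\nABRAXAS"

# \n+ alone sees two separate newlines and leaves the \r characters behind.
newline_only = re.sub(r'\n+', "____", text)
print(repr(newline_only))  # 'Inc\r____\r____ABRAXAS'

# [\r\n]+ treats the whole run of \r and \n as one break.
any_break = re.sub(r'[\r\n]+', "____", text)
print(any_break)           # Inc____ABRAXAS
```

Note that [\r\n]+ also matches a lone \r, which may or may not be what you want.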

0

Try:

import re
p = re.compile(r'\r?\n')  # an optional carriage return followed by a newline
test_str = "(2 months)\n\nML"
subst = "_"

result = re.sub(p, subst, test_str)

This reduces the string to

(2 months)__ML

See Demo
