simple python regex issue

Question

I am trying to write a small bit of code using the regex module that will remove a portion of a URL read from a .csv file and return the selected chunk as output. If the URL contains .com/go/, I would like it to return the content AFTER "go". Here's the code:

import csv
import re

with open('rtdata.csv', 'rb') as fhand:
    reader = csv.reader(fhand)
    for row in reader:
        url=row[6].strip()
        section=re.findall("^http://www.xxxxxxxxx.com/(.*/)", url)
        if section==re.findall("^go.*", url):
            section=re.findall("^http://www.xxxxxxxxx.com/go/(.*/)", url)

        print url
        print section

And here is some sample input and output:

Example 1
1. input: http://www.xxxxxxxxx.com/go/news/videos/
2. output: news/videos
Example 2
1. input: http://www.xxxxxxxxx.com/new-cars/
2. output: new-cars

What am I missing here?

A csv file with various columns. The column I want to read is in position [6] of the row Python reads in.
– Mike, Commented Oct 23, 2013 at 21:09
I'm not fluent in Python, but it seems like the "if section==re.findall("^go.*", url):" line is actually matching against the original url, not the subpart found on the previous line.
– James, Commented Oct 23, 2013 at 21:10

arturomp · Accepted Answer · 2013-10-24 03:45:38Z

2

Try the following

s = re.search('http://www.xxxxxxxxx.com/(go/)?(.*)/', url)
section = s.group(2)

instead of

    section=re.findall("^http://www.xxxxxxxxx.com/(.*/)", url)
    if section==re.findall("^go.*", url):
        section=re.findall("^http://www.xxxxxxxxx.com/go/(.*/)", url)

The regex used (originally illustrated with a Debuggex visualization):

http://www.xxxxxxxxx.com/(go/)?(.*)/
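As a minimal sketch of how this pattern could be wrapped up, the snippet below puts it in a small helper. The function name `extract_section` is hypothetical, and the escaped dots, the `^`/`$` anchors and the `None` fallback are additions that are not part of the answer above. It is written with `print()` calls so it runs on Python 3.

```python
import re

# Same idea as the pattern above, with the dots escaped and both ends anchored
# (those two tweaks are assumptions, not part of the original answer).
PATTERN = re.compile(r'^http://www\.xxxxxxxxx\.com/(go/)?(.*)/$')

def extract_section(url):
    """Return the path after the host and an optional leading 'go/', or None."""
    match = PATTERN.search(url.strip())
    # group(1) is the optional 'go/' prefix, group(2) is the part we want
    return match.group(2) if match else None

print(extract_section('http://www.xxxxxxxxx.com/go/news/videos/'))
print(extract_section('http://www.xxxxxxxxx.com/new-cars/'))
```

Returning `None` for a non-matching url avoids the `AttributeError` that `s.group(2)` would raise when `re.search` finds nothing.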

edited Oct 24, 2013 at 3:45

answered Oct 23, 2013 at 21:11

arturomp


2 Comments

Mike Over a year ago

still doesn't seem to be working, I'll try the debuggex tool and see if I can't take a look at it from a different angle

arturomp Over a year ago

Welcome to SO! I edited your question to make the examples clearer - please edit it back if that's not what you meant! I've now revised my answer - let me know if it works!

Nolen Royalty · Accepted Answer · 2013-10-23 21:14:47Z

1

This is failing because of the ^ in your second regex. go is not at the start of the url, and so the match is failing.

Changing "^go.*" to "go.*" should resolve your issue.
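The effect of the anchor can be checked directly. This is a small sketch using the first sample url from the question; the variable names are only illustrative.

```python
import re

url = 'http://www.xxxxxxxxx.com/go/news/videos/'

# '^' pins the match to the start of the string, and the url starts with 'http'
anchored = re.findall('^go.*', url)
# without '^' the pattern may match anywhere in the string
unanchored = re.findall('go.*', url)
# the list the question compares against
section = re.findall('^http://www.xxxxxxxxx.com/(.*/)', url)

print(anchored)
print(unanchored)
print(section == unanchored)
```

Note that the comparison only holds here because "go" does not occur earlier in this url; a path such as cargo/ would also match the unanchored pattern.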

answered Oct 23, 2013 at 21:14

Nolen Royalty


James · Accepted Answer · 2013-10-23 21:15:44Z

0

From what I see elsewhere, this is the correct way to do what you were doing:

section=re.match("^http://www.xxxxxxxxx.com/(.*/)", url).group(1)
if re.match("^go.*", section):
    section=re.match("^go/(.*/)", section).group(1)

Better yet, you should do all of this with a single regex:

section=re.match("^http://www.xxxxxxxxx.com/(go/)?(.*/)", url).group(2)
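To see which capture group holds what in the single-regex form, here is a short sketch that prints the groups for both sample urls from the question. The escaped dots and the `rstrip('/')` step are additions for illustration, not part of the answer.

```python
import re

# Group 1 is the optional 'go/' prefix, group 2 is the remaining path
pattern = re.compile(r'^http://www\.xxxxxxxxx\.com/(go/)?(.*/)')

urls = ['http://www.xxxxxxxxx.com/go/news/videos/',
        'http://www.xxxxxxxxx.com/new-cars/']

groups = [pattern.match(u).groups() for u in urls]
# (.*/) keeps the trailing slash, so strip it to get the desired output
sections = [g[1].rstrip('/') for g in groups]

print(groups)
print(sections)
```

When the optional group does not take part in the match, it is reported as `None`, which is why the second group is the one to read.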

answered Oct 23, 2013 at 21:15

James


eyquem · Accepted Answer · 2013-10-23 22:05:19Z

You can analyze the content of the file directly, without reading it through the csv module:

import re

tata = '''0,1,2,3,4,5, http://www.gagal.com/go/zui ,kkll
00,10,20,30,40,50, http://hardo.fr/glut/popolo , ocean
000,100,200,300,400,500,  http://debeny.cz/rutu/padu/go/gemini/sun=
00,01,02,03,04,05,http://www.klemperer.com/discs/major
000,100,200,300,400,500,  http://www.julia.ch/go/snowy/trf
'''

r = re.compile('^[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,'
               ' *(http://[^ ,\n]+?(?:/go/([^ ,\n]+))?(?=[ ,\n]))',
               re.MULTILINE)

print tata

for g1,g2 in r.findall(tata):
    print '%s\n%s\n' % (g1,g2 if g2 else g1)

result

0,1,2,3,4,5, http://www.gagal.com/go/zui ,kkll
00,10,20,30,40,50, http://hardo.fr/glut/popolo , ocean
000,100,200,300,400,500,  http://debeny.cz/rutu/padu/go/gemini/sun=
00,01,02,03,04,05,http://www.klemperer.com/discs/major
000,100,200,300,400,500,  http://www.julia.ch/go/snowy/trf

http://www.gagal.com/go/zui
zui

http://hardo.fr/glut/popolo
http://hardo.fr/glut/popolo

http://debeny.cz/rutu/padu/go/gemini/sun=
gemini/sun=

http://www.klemperer.com/discs/major
http://www.klemperer.com/discs/major

http://www.julia.ch/go/snowy/trf
snowy/trf
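For comparison, a rough sketch of the same extraction done with the csv module and plain string methods instead of one large regex. It assumes Python 3 (for `io.StringIO` with a str literal), reuses three of the sample rows above, and the helper name `after_go` is hypothetical.

```python
import csv
import io

# Three sample rows from the answer above; the url sits at column index 6
sample = (
    "0,1,2,3,4,5, http://www.gagal.com/go/zui ,kkll\n"
    "00,10,20,30,40,50, http://hardo.fr/glut/popolo , ocean\n"
    "000,100,200,300,400,500,  http://debeny.cz/rutu/padu/go/gemini/sun=\n"
)

def after_go(url):
    """Return the text after the first '/go/', or the whole url if there is none."""
    head, sep, tail = url.partition('/go/')
    return tail if sep else url

reader = csv.reader(io.StringIO(sample))
# strip() removes the spaces that surround the url inside its field
results = [after_go(row[6].strip()) for row in reader]
print(results)
```

Like the regex above, `partition` splits on the first `/go/` it finds, wherever it appears in the url.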
