1

I am trying to write a small bit of code using the re module that will remove a portion of a URL read from a .csv file and return the selected chunk as output. If the URL starts with .com/go/ after the host name, I would like it to return the content AFTER "go". Here's the code:

import csv
import re

with open('rtdata.csv', 'rb') as fhand:
    reader = csv.reader(fhand)
    for row in reader:
        url=row[6].strip()
        section=re.findall("^http://www.xxxxxxxxx.com/(.*/)", url)
        if section==re.findall("^go.*", url):
            section=re.findall("^http://www.xxxxxxxxx.com/go/(.*/)", url)

        print url
        print section

And here's some sample input and output:

  1. Example 1
    1. input: http://www.xxxxxxxxx.com/go/news/videos/
    2. output: news/videos
  2. Example 2
    1. input: http://www.xxxxxxxxx.com/new-cars/
    2. output: new-cars

What am I missing here?

2
  • A csv file with various columns. The column I want to read is at position [6] of the row Python reads in. Commented Oct 23, 2013 at 21:09
  • I'm not fluent in Python, but it seems like the "if section==re.findall("^go.*", url):" line is actually matching against the original url, not the subpart found at the previous line. Commented Oct 23, 2013 at 21:10

4 Answers

2

Try the following

s = re.search('http://www.xxxxxxxxx.com/(go/)?(.*)/', url)
section = s.group(2)

instead of

    section=re.findall("^http://www.xxxxxxxxx.com/(.*/)", url)
    if section==re.findall("^go.*", url):
        section=re.findall("^http://www.xxxxxxxxx.com/go/(.*/)", url)

The regex used (the original answer illustrated it with a Debuggex visualization):

http://www.xxxxxxxxx.com/(go/)?(.*)/
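
A minimal sketch of how this could be wrapped up, assuming a hypothetical helper name `extract_section`. `re.search` returns `None` when nothing matches, so calling `.group` on the result directly would raise `AttributeError`; the helper returns `None` instead. The dots are escaped here so they only match literal dots, and the `print` calls are written so they also run on Python 3.

```python
import re

# (go/)? optionally consumes a leading "go/"; group 2 is the rest,
# without the final slash.
SECTION_RE = re.compile(r'http://www\.xxxxxxxxx\.com/(go/)?(.*)/')

def extract_section(url):
    """Hypothetical helper: return the section of url, or None if no match."""
    m = SECTION_RE.search(url)
    return m.group(2) if m else None

print(extract_section('http://www.xxxxxxxxx.com/go/news/videos/'))  # news/videos
print(extract_section('http://www.xxxxxxxxx.com/new-cars/'))        # new-cars
```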

2 Comments

still doesn't seem to be working, I'll try the debuggex tool and see if I can't take a look at it from a different angle
Welcome to SO! I edited your question to make the examples clearer - please edit it back if that's not what you meant! I've now revised my answer - let me know if it works!
1

This is failing because of the ^ in your second regex. go is not at the start of the url, and so the match is failing.

Changing "^go.*" to "go.*" should resolve your issue.
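
To see the effect of the anchor concretely, here is a small sketch using the sample URL from the question (the variable names are just for illustration). It also shows that `re.findall` returns a list, so the `==` in the question compares two lists.

```python
import re

url = "http://www.xxxxxxxxx.com/go/news/videos/"

# Anchored at the start of the full URL: no match, because the URL starts with "http".
anchored = re.findall("^go.*", url)

# Without the anchor the pattern can match further into the string.
unanchored = re.findall("go.*", url)

# The first regex from the question captures everything after the host name.
section = re.findall("^http://www.xxxxxxxxx.com/(.*/)", url)

# The anchored pattern does match when applied to the captured part itself.
on_section = re.findall("^go.*", section[0])

print(anchored)     # []
print(unanchored)   # ['go/news/videos/']
print(section)      # ['go/news/videos/']
print(on_section)   # ['go/news/videos/']
```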

0

From what I see elsewhere, the correct way to do what you were trying to do is:

section=re.match("^http://www.xxxxxxxxx.com/(.*/)", url).group(1)
# only strip the prefix when the section really starts with "go/"
m=re.match("^go/(.*/)", section)
if m:
    section=m.group(1)

Better yet, you should do all of this with a single regex:

section=re.match("^http://www.xxxxxxxxx.com/(go/)?(.*/)", url).group(2)
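
Groups are numbered by their opening parenthesis, so with a capturing `(go/)?` the optional prefix is group 1 and the wanted part is group 2. As a sketch of one alternative, a non-capturing `(?:go/)?` leaves the wanted part as group 1. The helper name `section_of` is hypothetical, and stripping the trailing slash is an assumption based on the sample output in the question.

```python
import re

url = "http://www.xxxxxxxxx.com/go/news/videos/"

# Capturing prefix: group 1 is "go/" (or None when absent), group 2 is the rest.
capturing = re.match(r"^http://www\.xxxxxxxxx\.com/(go/)?(.*/)", url).groups()
print(capturing)  # ('go/', 'news/videos/')

# Non-capturing prefix: the only numbered group is the part we want.
PATTERN = re.compile(r"^http://www\.xxxxxxxxx\.com/(?:go/)?(.*/)")

def section_of(url):
    """Hypothetical helper: section without the trailing slash, or None."""
    m = PATTERN.match(url)
    return m.group(1).rstrip("/") if m else None

print(section_of(url))                                   # news/videos
print(section_of("http://www.xxxxxxxxx.com/new-cars/"))  # new-cars
```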

0

You can analyze the content of the file directly, without using the csv module:

import re

tata = '''0,1,2,3,4,5, http://www.gagal.com/go/zui ,kkll
00,10,20,30,40,50, http://hardo.fr/glut/popolo , ocean
000,100,200,300,400,500,  http://debeny.cz/rutu/padu/go/gemini/sun=
00,01,02,03,04,05,http://www.klemperer.com/discs/major
000,100,200,300,400,500,  http://www.julia.ch/go/snowy/trf
'''

r = re.compile('^[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,'
               ' *(http://[^ ,\n]+?(?:/go/([^ ,\n]+))?(?=[ ,\n]))',
               re.MULTILINE)

print tata

for g1,g2 in r.findall(tata):
    print '%s\n%s\n' % (g1,g2 if g2 else g1)

result

0,1,2,3,4,5, http://www.gagal.com/go/zui ,kkll
00,10,20,30,40,50, http://hardo.fr/glut/popolo , ocean
000,100,200,300,400,500,  http://debeny.cz/rutu/padu/go/gemini/sun=
00,01,02,03,04,05,http://www.klemperer.com/discs/major
000,100,200,300,400,500,  http://www.julia.ch/go/snowy/trf

http://www.gagal.com/go/zui
zui

http://hardo.fr/glut/popolo
http://hardo.fr/glut/popolo

http://debeny.cz/rutu/padu/go/gemini/sun=
gemini/sun=

http://www.klemperer.com/discs/major
http://www.klemperer.com/discs/major

http://www.julia.ch/go/snowy/trf
snowy/trf
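
Since the pattern above is fairly dense, here is a sketch of the same regex written with `re.VERBOSE` so that each part carries a comment. The six repeated `[^,]*,` pieces are folded into `(?:[^,]*,){6}`, and the leading ` *` becomes `[ ]*` because verbose mode ignores bare spaces outside character classes. The names `URL_RE` and `extract` are made up for this illustration, and only two of the sample lines are reused.

```python
import re

URL_RE = re.compile(r"""
    ^(?:[^,]*,){6}          # skip the first six comma-terminated fields
    [ ]*                    # optional spaces before the URL
    (                       # group 1: the whole URL
      http://[^ ,\n]+?      #   shortest run of non-separator characters
      (?:/go/([^ ,\n]+))?   #   optional /go/ part; group 2 is what follows it
      (?=[ ,\n])            #   the URL must be followed by a space, comma or newline
    )
""", re.MULTILINE | re.VERBOSE)

def extract(text):
    """Return (url, part after /go/ or the whole url) for each line of text."""
    return [(url, tail if tail else url) for url, tail in URL_RE.findall(text)]

sample = ("0,1,2,3,4,5, http://www.gagal.com/go/zui ,kkll\n"
          "00,01,02,03,04,05,http://www.klemperer.com/discs/major\n")

for url, part in extract(sample):
    print(url)
    print(part)
```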
