How to extract URLs using Python, html.parser, and regex

Question

I need to create a program in Python that parses all the URLs from a .html file and prints out all the tags and links like so:

meta: https://someurl.com
a: https://someurl.com
link: css/bootstrap.min.css
script: https://somescript.js

Currently, what I have is:

from html.parser import HTMLParser
import re

class HeadParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        #use re.findall to get all the links
        links = re.findall("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", website)
        for url in links:
            print("{0}: {1}".format(tag, url))

website = open("./head.html").read()
HeadParser().feed(website)

and it prints:

head: https://scooptacular.net
head: https://scooptacular.net/img/uploaded/379d05029c0d84618c70ac037a25fd88.jpg
head: https://scooptacular.net/img/uploaded/4baaa58a1a37fd3da3e4e78caf366b7f.jpg
head: https://fonts.googleapis.com/css?family=Montserrat:400,700
head: https://fonts.googleapis.com/css?family=Kaushan+Script' rel='stylesheet' type='text/css'>
head: https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
head: https://fonts.googleapis.com/css?family=Roboto+Slab:400,100,300,700' rel='stylesheet' type='text/css'>
head: https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js
head: https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js
meta: https://scooptacular.net
meta: https://scooptacular.net/img/uploaded/379d05029c0d84618c70ac037a25fd88.jpg
meta: https://scooptacular.net/img/uploaded/4baaa58a1a37fd3da3e4e78caf366b7f.jpg
meta: https://fonts.googleapis.com/css?family=Montserrat:400,700
meta: https://fonts.googleapis.com/css?family=Kaushan+Script' rel='stylesheet' type='text/css'>
meta: https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
meta: https://fonts.googleapis.com/css?family=Roboto+Slab:400,100,300,700' rel='stylesheet' type='text/css'>
meta: https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js
meta: https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js
meta: https://scooptacular.net
meta: https://scooptacular.net/img/uploaded/379d05029c0d84618c70

As you can see, it prints every link for every tag, including duplicates, and does not print any local file links. What is wrong with my code?

EDIT:

The HTML I'm using is:

<head>
<meta property="og:url" content="https://scooptacular.net" />    
<meta property="og:image" content="https://scooptacular.net/img/uploaded/379d05029c0d84618c70ac037a25fd88.jpg" />
<meta property="og:image" content="https://scooptacular.net/img/uploaded/4baaa58a1a37fd3da3e4e78caf366b7f.jpg" />

<link href="css/bootstrap.min.css" rel="stylesheet">

<link href="css/agency.css" rel="stylesheet">

<link href="font-awesome-4.1.0/css/font-awesome.min.css" rel="stylesheet" type="text/css">

        <link href="https://fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet" type="text/css">
    <link href='https://fonts.googleapis.com/css?family=Kaushan+Script' rel='stylesheet' type='text/css'>
    <link href='https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
    <link href='https://fonts.googleapis.com/css?family=Roboto+Slab:400,100,300,700' rel='stylesheet' type='text/css'>


<link href="css/bootstrap-formhelpers.min.css" rel="stylesheet" media="screen">

    <script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
    <script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
</head>

It looks like you are getting what you want: all the URLs. The duplicates may simply occur more than once in that web page. Which ones did you not expect?
– Jongware, Commented Dec 2, 2018 at 21:38
It makes no sense to have a link inside of a "head" tag (there is no such thing as <head href="">), so why is it printing it out?
– Erika N, Commented Dec 2, 2018 at 21:41
You can use set(links) to remove the duplicates. Also, you are clearly getting the first tag it runs into; what you want is the tag inside <head>.
– wishmaster, Commented Dec 2, 2018 at 21:43
Can you post an example of the links that you're handling? Right now, I tried http(|s):[^\'$\s]+ and it seems to work (it gets all the links there, but let's try something more general). Note: I assume all links start with http or https. regex101.com/r/h1Y8Cu
– lucas_7_94, Commented Dec 2, 2018 at 21:47
@lucas_7_94 I edited the post to include the links I'm handling. Not all links start with http/s.
– Erika N, Commented Dec 2, 2018 at 21:53

cody · Accepted Answer · 2018-12-03 03:09:07Z

The primary issue is that handle_starttag is called for every tag, and with each call you're searching the entire page for matches to your regex, not just the tag you're on (the second argument you're passing to re.findall is website).
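To see this concretely, here is a minimal sketch that records each call to handle_starttag. The CallLogger class and the sample markup are made up for illustration. The parser invokes the method once per start tag and passes only that tag's attributes as a list of (name, value) pairs, so the callback never needs to look at the full page text.

```python
from html.parser import HTMLParser


class CallLogger(HTMLParser):
    """Hypothetical helper: records every handle_starttag call."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def handle_starttag(self, tag, attrs):
        # attrs holds only this tag's attributes as (name, value) pairs
        self.calls.append((tag, attrs))


# Made-up markup, not the asker's file
sample = '<head><meta content="https://example.com"><link href="css/site.css"></head>'

logger = CallLogger()
logger.feed(sample)
for tag, attrs in logger.calls:
    print(tag, attrs)
```

The head tag shows up with an empty attribute list, which is why the original code printed "head:" lines: the tag name came from the callback, but the URLs came from a regex run over the whole page.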

I don't see why you need to use regular expressions at all here. Why not just rely on whether or not the tag has an href, src or content attribute:

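from html.parser import HTMLParser


class HeadParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this tag only
        for name, value in attrs:
            if name in ('href', 'src', 'content'):
                print('{0}: {1}'.format(tag, value))


# Read the whole file, closing it as soon as the text is loaded
with open("./head.html") as f:
    website = f.read()
HeadParser().feed(website)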

Output:

meta: https://scooptacular.net
meta: https://scooptacular.net/img/uploaded/379d05029c0d84618c70ac037a25fd88.jpg
meta: https://scooptacular.net/img/uploaded/4baaa58a1a37fd3da3e4e78caf366b7f.jpg
link: css/bootstrap.min.css
link: css/agency.css
link: font-awesome-4.1.0/css/font-awesome.min.css
link: https://fonts.googleapis.com/css?family=Montserrat:400,700
link: https://fonts.googleapis.com/css?family=Kaushan+Script
link: https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic,700italic
link: https://fonts.googleapis.com/css?family=Roboto+Slab:400,100,300,700
link: css/bootstrap-formhelpers.min.css
script: https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js
script: https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js
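One caveat with this approach: a meta tag's content attribute is not always a URL (a viewport or description tag, for example), and the same link can appear more than once. The sketch below is one possible variant, not part of the original answer. The LinkCollector class, the looks_like_url heuristic, and the viewport line in the sample markup are all assumptions made for illustration. It collects (tag, url) pairs instead of printing them, accepts content values only when they parse as absolute URLs, and skips duplicates while preserving order.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Attributes whose value is taken as a link without further checks
URL_ATTRS = {'href', 'src'}


def looks_like_url(value):
    """Rough heuristic (an assumption): absolute URL with scheme and host."""
    parts = urlparse(value)
    return bool(parts.scheme and parts.netloc)


class LinkCollector(HTMLParser):
    """Hypothetical variant that stores unique (tag, url) pairs in order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. <script async>
                continue
            if name in URL_ATTRS or (name == 'content' and looks_like_url(value)):
                pair = (tag, value)
                if pair not in self.links:  # drop exact duplicates
                    self.links.append(pair)


# Sample markup; the viewport line and the repeated link are made up
sample = '''<head>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta property="og:url" content="https://scooptacular.net" />
<link href="css/bootstrap.min.css" rel="stylesheet">
<link href="css/bootstrap.min.css" rel="stylesheet">
</head>'''

collector = LinkCollector()
collector.feed(sample)
for tag, url in collector.links:
    print('{0}: {1}'.format(tag, url))
```

Relative paths in href and src are kept as written, since the question asks for local file links to be printed too. Only content is filtered, because it is the one attribute of the three that routinely holds non-URL text.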
