Regular expression to extract image url from html code

Question

I want to extract the URL of an image from HTML code, e.g. the HTML below:

<div class="imageContainer">
   <img src="http://ecx.images-amazon.com/images/I/41%2B7N48F7JL._SL135_.jpg"
      alt="" width="135" height="94"
      style="margin-top: 21px; margin-bottom:20px;" /></div>

I found this code on the net:

String regexImage = "(?<=<img src=\")[^\"]*";
Pattern pImage = Pattern.compile(regexImage);
Matcher mImage = pImage.matcher(elementString);
while (mImage.find()) {
   String imagePath = mImage.group();
}

which works, and uses the regular expression

"(?<=<img src=\")[^\"]*"

But now I want to extract the image URL from HTML like the following:

<img onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
   data-imagesize="thumb"
   data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
   src="http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg"
   alt="Samsung Galaxy S Duos S7562: Mobile"
   title="Samsung Galaxy S Duos S7562: Mobile"></img></a>
<div class="bp-offer-image image-offer"></div>

where there are other attributes between img and src=

I'm trying the regular expression "(?<=<img (*)src=\")[^\"]*", but it's not working. Please give me a regular expression with which I can extract the image URL, i.e. http://ecx.images-amazon.com/images/I/61xqOQ3Sj8L._SL135_.jpg, from the above HTML code.

First, I'm using Jsoup to parse the HTML and extract the img tags:

doc = Jsoup.connect(urlFromBrowse).get();
Elements elements = doc.getElementsByTag("img");

for (Element element : elements) {
    String elementString = element.toString();
}

and I pass this elementString to the matcher() method. From the tag (element) that I get, I use a regular expression to parse out the image URL, name, and so on.

Parsing well-formed HTML is easy, but if it isn't well formed it's a nightmare!
– Aubin, Commented Oct 31, 2012 at 15:23
Just saw this on the front page. Surely Java has some DOM parser. Investigate this, rather than regex.
– Joel Berger, Commented Oct 31, 2012 at 15:23
@Cthulhu please see the question, because I have edited it. And now tell me, am I doing something wrong by parsing it?
– user1699548, Commented Oct 31, 2012 at 15:27

Aubin · Accepted Answer · 2012-10-31 16:29:05Z

5

This post is an answer to the question, not a guideline.

The question was not "RegExp vs DOM", the question was "Regular expression to extract image url from html code".

Here it is:

String htmlFragment =
   "<img onerror=\"img_onerror(this);\" data-logit=\"true\" data-pid=\"MOBDDDBRHVWQZHYY\"\n" + 
   "   data-imagesize=\"thumb\"\n" + 
   "   data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\"\n" + 
   "   src=\"http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\"\n" + 
   "   alt=\"Samsung Galaxy S Duos S7562: Mobile\"\n" + 
   "   title=\"Samsung Galaxy S Duos S7562: Mobile\"></img></a>";
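// (?s) lets '.' match line terminators, so (.*) can span the attribute lines;
// (?m) only affects ^ and $, which this pattern does not use.
// Group 2 captures the value of the src attribute.
Pattern pattern =
   Pattern.compile( "(?m)(?s)<img\\s+(.*)src\\s*=\\s*\"([^\"]+)\"(.*)" );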
Matcher matcher = pattern.matcher( htmlFragment );
if( matcher.matches()) {
   System.err.println(
      "OK:\n" +
      "1: '" + matcher.group(1) + "'\n" +
      "2: '" + matcher.group(2) + "'\n" +
      "3: '" + matcher.group(3) + "'\n" );
}

and the output:

OK:
1: 'onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
   data-imagesize="thumb"
   data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
   '
2: 'http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg'
3: '
   alt="Samsung Galaxy S Duos S7562: Mobile"
   title="Samsung Galaxy S Duos S7562: Mobile"></img></a>'
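
Note that matches() succeeds only when the whole input matches the pattern, and the greedy (.*) before src backtracks to the last src="..." in the input, so the code above suits a single isolated tag. For input that holds several tags or surrounding markup, a find() loop over a pattern confined to one tag may be a better fit. What follows is only a minimal sketch under stated assumptions: attribute values are double-quoted, no '>' appears inside an attribute before src, and the names ImgSrcExtractor and extractImageUrls, as well as the pattern itself, are illustrative and not part of the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class ImgSrcExtractor {

    // "<img", then lazily any characters that stay inside the tag ([^>]*?),
    // then whitespace followed by src="..." with the value in group 1.
    // Requiring whitespace before "src" skips attributes such as data-src.
    private static final Pattern IMG_SRC = Pattern.compile(
        "<img\\b[^>]*?\\ssrc\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    // Returns the src value of every <img> tag found, in document order.
    static List<String> extractImageUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {   // find() scans forward; matches() would need the whole input to match
            urls.add(matcher.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"imageContainer\">\n"
            + "   <img src=\"http://ecx.images-amazon.com/images/I/41%2B7N48F7JL._SL135_.jpg\"\n"
            + "      alt=\"\" width=\"135\" height=\"94\" /></div>\n"
            + "<img data-pid=\"MOBDDDBRHVWQZHYY\"\n"
            + "   data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\"\n"
            + "   src=\"http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\"\n"
            + "   alt=\"Samsung Galaxy S Duos S7562: Mobile\"></img>";
        for (String url : extractImageUrls(html)) {
            System.out.println(url);
        }
    }
}
```

Capturing the URL in a group, rather than using a look-behind as in the question, sidesteps the need for a variable-length look-behind when other attributes sit between img and src. The sketch still inherits the usual fragility of regexes on HTML: single-quoted or unquoted attribute values are not handled.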

edited Oct 31, 2012 at 16:29

answered Oct 31, 2012 at 15:42

6 Comments

Joel Berger Over a year ago

less snarky: how does this handle old, ill-formed HTML better than a DOM parser? How do you know the DOM parser doesn't handle it in the first place?

user1699548 Over a year ago

Hey, you are great, but I think you have made one mistake: your code works here, but not in my code. I think the reason is that you used \ before every " and designed the regular expression for that, but in my code there are no \. So please give a regular expression for that; you are my last hope.

Aubin Over a year ago

The example code you give contains ". Please, give the url of the real HTML source.

user1699548 Over a year ago

I'm trying to build an application like pinterest.com. The HTML sample above is a tag from amazon.com's HTML.

Andrew Wyld Over a year ago

if( matcher.matches() ) may be incorrect in the above example; should it be while( matcher.find() ) ?

Joel Berger · Accepted Answer · 2012-10-31 15:36:46Z

2

According to the docs JSoup (a DOM parser) can easily get the attribute after you have gotten the tag element. Something like

doc.getElementsByTag("img").attr("src")

ought to work.

For the record, I'm a Perl guy, and the Perl community often reaches for regexes too quickly. I am constantly trying to enlighten people to the joy of using DOM parsers rather than fragile regexes.

edited Oct 31, 2012 at 15:36

answered Oct 31, 2012 at 15:31

5 Comments

Aubin Over a year ago

Yes, use DOM for x-html but for ill-formed HTML (3.2) it's not applicable.

Joel Berger Over a year ago

who said anything about ill-formed HTML 3.2?

user1699548 Over a year ago

this seems like what I wanted, I will try this and will come back. By the way, thanks

Joel Berger Over a year ago

you may have to loop over the tag elements, see the docs for the Elements class for helper methods for this.

user1699548 Over a year ago

thanks for the useful link, because in my Eclipse I'm not getting any documentation on mouse hover for any of the JSoup methods.

Brian Agnew · Accepted Answer · 2012-10-31 15:30:03Z

0

I'd expect you to be able to get the various attributes of the <img> element via the JSoup API. Does Node.attributes() give you what you want?

answered Oct 31, 2012 at 15:30
