3

I want to extract the URL of an image from HTML code, e.g. the HTML below:

<div class="imageContainer">
   <img src="http://ecx.images-amazon.com/images/I/41%2B7N48F7JL._SL135_.jpg"
      alt="" width="135" height="94"
      style="margin-top: 21px; margin-bottom:20px;" /></div>

I found this code on the net:

String regexImage = "(?<=<img src=\")[^\"]*";
Pattern pImage = Pattern.compile(regexImage);
Matcher mImage = pImage.matcher(elementString);
while (mImage.find()) {
   String imagePath = mImage.group();
}

which works and uses this regular expression:

"(?<=<img src=\")[^\"]*"

But now I want to extract the image URL from HTML code like the following:

<img onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
   data-imagesize="thumb"
   data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
   src="http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg"
   alt="Samsung Galaxy S Duos S7562: Mobile"
   title="Samsung Galaxy S Duos S7562: Mobile"></img></a>
<div class="bp-offer-image image-offer"></div>

where there are other attributes between img and src=.

I tried the regular expression "(?<=<img (*)src=\")[^\"]*", but it's not working. Please give me a regular expression that extracts the image URL, i.e. http://ecx.images-amazon.com/images/I/61xqOQ3Sj8L._SL135_.jpg, from the HTML code above.

Also, I first use Jsoup to parse the HTML and extract the img tags:

doc = Jsoup.connect(urlFromBrowse).get();
Elements elements = doc.getElementsByTag("img");

for (Element element : elements) {
    String elementString = element.toString();
}

and pass this elementString to the matcher() method. From each tag (element) I get, I use a regular expression to parse the image URL, name, and so on.

8
  • 3
    Don't use Regex. Parse it as html code. Commented Oct 31, 2012 at 15:18
  • stackoverflow.com/questions/590747/… Commented Oct 31, 2012 at 15:22
  • 2
    Parsing well-formed HTML is easy, but if it isn't well formed it's a nightmare! Commented Oct 31, 2012 at 15:23
  • Just saw this on the front page. Surely Java has some DOM parser. Investigate this, rather than regex. Commented Oct 31, 2012 at 15:23
  • @Cthulhu please see the question again, because I have edited it. Now tell me, am I doing something wrong by parsing it this way? Commented Oct 31, 2012 at 15:27

3 Answers

5

This post is an answer to the question, not a guideline.

The question was not "RegExp vs DOM", the question was "Regular expression to extract image url from html code".

Here it is:

String htmlFragment =
   "<img onerror=\"img_onerror(this);\" data-logit=\"true\" data-pid=\"MOBDDDBRHVWQZHYY\"\n" + 
   "   data-imagesize=\"thumb\"\n" + 
   "   data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\"\n" + 
   "   src=\"http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\"\n" + 
   "   alt=\"Samsung Galaxy S Duos S7562: Mobile\"\n" + 
   "   title=\"Samsung Galaxy S Duos S7562: Mobile\"></img></a>";
Pattern pattern =
   Pattern.compile( "(?m)(?s)<img\\s+(.*)src\\s*=\\s*\"([^\"]+)\"(.*)" );
Matcher matcher = pattern.matcher( htmlFragment );
if( matcher.matches()) {
   System.err.println(
      "OK:\n" +
      "1: '" + matcher.group(1) + "'\n" +
      "2: '" + matcher.group(2) + "'\n" +
      "3: '" + matcher.group(3) + "'\n" );
}

and the output:

OK:
1: 'onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
   data-imagesize="thumb"
   data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
   '
2: 'http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg'
3: '
   alt="Samsung Galaxy S Duos S7562: Mobile"
   title="Samsung Galaxy S Duos S7562: Mobile"></img></a>'
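The pattern above relies on matches(), so it only succeeds when the whole input is a single fragment that starts with <img. A minimal sketch of a variant that uses find() to collect the src of every <img> in a longer string follows. The class name ImgSrcRegex and the method extractSrc are hypothetical names chosen for illustration, and the sketch assumes double-quoted src values and no '>' character inside any attribute value:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class ImgSrcRegex {

    // "<img", then any other attributes (reluctantly, never crossing '>'),
    // then a whitespace-preceded src="..." whose value is captured in group 1.
    // The leading \s keeps attributes such as data-src from matching.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the src value of every <img> tag found, in document order.
    static List<String> extractSrc(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
                "<img onerror=\"img_onerror(this);\" data-logit=\"true\"\n"
              + "   data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\"\n"
              + "   src=\"http://img8a.flixcart.com/image/mobile/h/y/y/"
              + "samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\"\n"
              + "   alt=\"Samsung Galaxy S Duos S7562: Mobile\"></img>";
        for (String url : extractSrc(html)) {
            System.out.println(url);
        }
    }
}
```

Because [^>]*? cannot cross a '>', each match stays inside one tag, and capturing the URL in a group avoids the lookbehind from the question. This is still a regex over HTML, so single-quoted or unquoted src values would need a different pattern.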

6 Comments

less snarky: how does this handle old, ill-formed HTML better than a DOM parser? How do you know the DOM parser doesn't handle it in the first place?
Hey, you are great, but I think you have made one mistake: your code works great here, but not in my code. I think the reason is that you have used \ before every " and designed the regular expression for that, but in my code there are no \. So please give a regular expression for that case; you are my last hope.
The example code you give contains ". Please, give the url of the real HTML source.
I'm trying to build an application like pinterest.com. The above HTML sample is a tag from the HTML of amazon.com.
if( matcher.matches() ) may be incorrect in the above example; should it be while( matcher.find() ) ?
2

According to the docs, JSoup (a DOM parser) can easily get the attribute once you have the tag element. Something like

doc.getElementsByTag("img").attr("src")

ought to work.
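As far as I know, calling attr("src") on an Elements collection returns the value from the first matching element only, so collecting every image URL means looping. A minimal sketch, assuming the jsoup library is on the classpath; the class name ImgSrcJsoup and the method extractSrc are hypothetical, and the HTML is parsed from a string rather than fetched so the example needs no network access:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration.
class ImgSrcJsoup {

    // Returns the src attribute of every <img> element that has one.
    static List<String> extractSrc(String html) {
        Document doc = Jsoup.parse(html);
        List<String> urls = new ArrayList<>();
        // The selector img[src] skips <img> elements without a src attribute.
        for (Element img : doc.select("img[src]")) {
            urls.add(img.attr("src"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
                "<img data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\" "
              + "src=\"http://img8a.flixcart.com/image/mobile/h/y/y/"
              + "samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\" "
              + "alt=\"Samsung Galaxy S Duos S7562: Mobile\">";
        for (String url : extractSrc(html)) {
            System.out.println(url);
        }
    }
}
```

The order of the attributes inside the tag does not matter here, which is the case the regex in the question could not handle.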

For the record I'm a Perl guy, a community that often reaches for regexes too quickly. I am constantly trying to enlighten people to the joy that is using DOM parsers rather than fragile regexes.

5 Comments

Yes, use DOM for x-html but for ill-formed HTML (3.2) it's not applicable.
who said anything about ill-formed HTML 3.2?
this seems like what I wanted, I will try this and will come back. By the way, thanks
you may have to loop over the tag elements, see the docs for the Elements class for helper methods for this.
thanks for the useful link, because in my eclipse I'm not getting any documentation for mouse hover for any of the JSoup methods.
0

I'd expect you to be able to get the various attributes of the <img> element via the JSoup API. Does Node.attributes() give you what you want?
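A minimal sketch of that idea, assuming the jsoup library is on the classpath: attributes() on an element can be iterated to read each key and value, which would cover the URL, alt text, title and so on in one pass. The class name ImgAttributes and the method attributesOfFirstImg are hypothetical names for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration.
class ImgAttributes {

    // Copies all attributes of the first <img> element into an ordered map.
    static Map<String, String> attributesOfFirstImg(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Element img = Jsoup.parse(html).select("img").first();
        if (img == null) {
            return result; // no <img> in the input
        }
        for (Attribute attribute : img.attributes()) {
            result.put(attribute.getKey(), attribute.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<img src=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\" "
                    + "alt=\"Samsung Galaxy S Duos S7562: Mobile\" "
                    + "title=\"Samsung Galaxy S Duos S7562: Mobile\">";
        attributesOfFirstImg(html)
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```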
