7

I'm trying to get the URLs for images (all MIME types) in a remote CSS file using Java.

I am using jsoup to get the URL of the CSS file.

After countless hours of looking at CSS Parser, I still couldn't figure out how to use it because of the lack of documentation.

I also looked at some other threads, but they just confused me even more.

I've also seen some examples using regex, but I am not too familiar with how to implement it in Java.

Would anyone have some suggestions on how to approach this problem?

5
  • Try CSS Parser Commented Nov 21, 2011 at 6:12
  • Do you need to follow recursive references to other CSS files? You could use a regular expression to find all url() occurrences. Commented Nov 21, 2011 at 6:16
  • Yes, I eventually need to get references to other CSS files. What regex would find all url() occurrences? Commented Nov 21, 2011 at 6:20
  • I've actually managed to get the contents of a CSS file using simple java URL code, so what would be the next step in matching all .jpg, .gif, .png, and other possible MIME inside the CSS file Commented Nov 21, 2011 at 6:54
  • ([^\s]+(\.(?i)(jpg|png|gif|bmp))$) works, now just need java implementation to pass it the css file as a String and find all URLS of images Commented Nov 21, 2011 at 7:32

2 Answers

6

In Java, you have to use a Pattern and a Matcher from the java.util.regex package.

You compile your pattern, instantiate a matcher for your string, and then loop over everything that matches the pattern.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("...");  // your regex goes here
Matcher m = p.matcher("your CSS file as a String");
while (m.find()) {
  // Here use m.group(), m.group(1), ...
}

The CSS 2.1 spec states:

The format of a URI value is 'url(' followed by optional white space followed by an optional single quote (') or double quote (") character followed by the URI itself, followed by an optional single quote (') or double quote (") character followed by optional white space followed by ')'. The two quote characters must be the same.

Thus you could use a regex like this one:

url\(\s*(['"]?+)(.*?)\1\s*\)

The .*? is non-greedy, so it takes as few characters as necessary. The possessive quantifier in ['"]?+ prevents any backtracking into the optional quote. The URI itself ends up in the second capturing group.
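Putting the two pieces together, here is a minimal sketch of how the regex could be plugged into the Pattern/Matcher loop. The class name CssUrlExtractor, the method name extractUrls, and the sample CSS are made up for illustration. Note that every backslash has to be doubled inside a Java string literal, and that the pattern as written only matches a lowercase url(.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
public class CssUrlExtractor {

    // Same regex as above, with backslashes and the double quote escaped for Java.
    private static final Pattern URL_PATTERN =
            Pattern.compile("url\\(\\s*(['\"]?+)(.*?)\\1\\s*\\)");

    /** Returns the content of every url(...) found in the CSS text, in order of appearance. */
    public static List<String> extractUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(css);
        while (m.find()) {
            // Group 1 is the optional quote character, group 2 is the URI itself.
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample CSS covering single-quoted, double-quoted and unquoted URIs.
        String css = "body { background: url('images/bg.png'); }\n"
                   + ".logo { background-image: url( \"img/logo.gif\" ); }\n"
                   + "@import url(other.css);";
        for (String url : extractUrls(css)) {
            System.out.println(url);
        }
    }
}
```

This returns every URI, including references to other stylesheets such as the @import above, so filtering for images would still be a separate step.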


2 Comments

very nice, you nailed it right on. the code I wrote is almost the same except for the regex, which i'm about to test right now. Just wanted to clarify that it will match everything between the '' inside the parentheses correct? url('domain/link/images/graphic.png'); would return domain.../graphic.png
Yes, it will return it in the second matching group.
0

You may also use ph-css for this. See the example "Visit all URLs contained in a CSS" located at https://github.com/phax/ph-css#code-examples. Can't do it much easier :)

3 Comments

Hi, how can I visit only the URLs of images and not all the URLs?
This is not easily possible because for the parser a URL is a URL. Maybe you can decide based on the suffix of the URL: if it ends with ".jpg" or ".gif", then it is an image...
Alternatively you can check if declaration.getProperty ().equals ("background-image") etc. (declaration is the second parameter of onUrlDeclaration)
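
As a rough sketch of the suffix check suggested above, written in plain Java and independent of ph-css: the class name ImageUrlFilter, its method names, and the suffix list are assumptions made for illustration, so adjust the list to the image types you care about. A query string or fragment is cut off before the comparison so that something like bg.png?v=2 still counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of ph-css or any other library.
public class ImageUrlFilter {

    // Assumed set of image suffixes; extend as needed.
    private static final List<String> IMAGE_SUFFIXES =
            Arrays.asList(".jpg", ".jpeg", ".png", ".gif", ".bmp");

    /** Returns true if the URL path ends with one of the known image suffixes. */
    public static boolean looksLikeImage(String url) {
        // Drop anything after the first '?' or '#', then compare case-insensitively.
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        for (String suffix : IMAGE_SUFFIXES) {
            if (path.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    /** Keeps only the URLs that look like images, preserving their order. */
    public static List<String> filterImages(List<String> urls) {
        List<String> images = new ArrayList<>();
        for (String url : urls) {
            if (looksLikeImage(url)) {
                images.add(url);
            }
        }
        return images;
    }
}
```

A check like this only looks at the file name, so images embedded as data: URIs or served from extension-less paths would be missed.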
