
I'm working on building a Java program that will download a copy of a website to a local machine while maintaining the original file hierarchy.

I'm using the following patterns. To find CSS links on pages such as http://www.w3schools.com/css/css_howto.asp (not working):

private static final String HTML_CSS_TAG_PATTERN = "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
private static final String CSS_TAG_PATTERN = "(?i)<link([^>]+)>(.+?)>";

To find images (working fine):

private static final String HTML_IMG_TAG_PATTERN = "\\s*(?i)src\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
private static final String IMG_TAG_PATTERN = "(?i)<img([^>]+)>(.+?)>";

To find links on pages such as http://www.w3schools.com/html/html_links.asp (working fine):

private static final String HTML_A_HREF_TAG_PATTERN = "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
private static final String HTML_A_TAG_PATTERN = "(?i)<a([^>]+)>(.+?)</a>";

The link and image patterns work fine, but the CSS one doesn't. I would like it to extract the link to the CSS file so that I can save it. Could anyone point out what I missed?

  • What's wrong with just using an HTML parser like jsoup (jsoup.org)? Why do it the overcomplicated and error-prone way? Commented Dec 3, 2013 at 1:54
  • It's an assignment, I have to use regexes. Commented Dec 3, 2013 at 6:22

3 Answers


Try this for CSS_TAG_PATTERN:

<link[^>]+?text/css[^>]*?>

It will match:

<link rel="stylesheet" type="text/css" href="//cdn.sstatic.net/stackoverflow/all.css?v=0eb8b68aff29">
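Note that this pattern contains no capturing group, so calling group(1) on its matcher throws an IndexOutOfBoundsException. The following is a minimal sketch, not the asker's actual code: it wraps the tag body in one group and then pulls out the href value. The class name CssLinkExtractor, the method extractCssHrefs, and the simplified HREF pattern are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssLinkExtractor {
    // The pattern above with the tag body wrapped in a capturing group,
    // so that group(1) holds the attributes of the <link> tag.
    private static final Pattern CSS_TAG =
            Pattern.compile("(?i)<link([^>]+?text/css[^>]*?)>");

    // Hypothetical, simplified href pattern: captures the value of a
    // double-quoted (group 1) or single-quoted (group 2) href attribute.
    private static final Pattern HREF =
            Pattern.compile("(?i)\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    static List<String> extractCssHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher tag = CSS_TAG.matcher(html);
        while (tag.find()) {
            // Search for href only inside the attributes of this tag.
            Matcher href = HREF.matcher(tag.group(1));
            if (href.find()) {
                hrefs.add(href.group(1) != null ? href.group(1) : href.group(2));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" type=\"text/css\" "
                + "href=\"//cdn.sstatic.net/stackoverflow/all.css?v=0eb8b68aff29\">";
        System.out.println(extractCssHrefs(html));
    }
}
```

For the sample tag this prints [//cdn.sstatic.net/stackoverflow/all.css?v=0eb8b68aff29]. Unquoted href values are not handled by this simplified HREF pattern.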

3 Comments

It's throwing the following error:
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
    at java.util.regex.Matcher.group(Unknown Source)
    at CSSRegEx.grabHTMLLinks(CSSRegEx.java:42)
    at HTML.main(HTML.java:63)
I think that means that it's finding the reference to the CSS, but there is still a mistake in CSS_TAG_PATTERN?
What is expected in group 1?

To make sure you only get CSS stylesheets, try the following CSS_TAG_PATTERN:

<link.*\s+rel="stylesheet"([^>]+)>

This pattern will match the following two

    <link rel="stylesheet" type="text/css" href="theme.css">
    <link type="text/css" rel="stylesheet"  href="theme.css">

but not

    <link type="text/css" rel="license"  href="someStuff">
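As a quick check of these claims, here is a small sketch that runs the pattern against the three tags. The class name StylesheetLinkCheck and the helper isStylesheetLink are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

public class StylesheetLinkCheck {
    // The pattern from this answer, escaped as a Java string literal.
    private static final Pattern STYLESHEET_LINK =
            Pattern.compile("<link.*\\s+rel=\"stylesheet\"([^>]+)>");

    // Returns true if the pattern is found anywhere in the given text.
    static boolean isStylesheetLink(String tag) {
        return STYLESHEET_LINK.matcher(tag).find();
    }

    public static void main(String[] args) {
        String[] tags = {
            "<link rel=\"stylesheet\" type=\"text/css\" href=\"theme.css\">",
            "<link type=\"text/css\" rel=\"stylesheet\"  href=\"theme.css\">",
            "<link type=\"text/css\" rel=\"license\"  href=\"someStuff\">"
        };
        for (String tag : tags) {
            System.out.println(isStylesheetLink(tag) + "  " + tag);
        }
    }
}
```

This prints true for the first two tags and false for the third. One limitation to be aware of: because ([^>]+) needs at least one character after rel="stylesheet", a tag in which rel is the last attribute, such as <link href="a.css" rel="stylesheet">, is not matched.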

1 Comment

I'm not sure what you mean.

Try this pattern:

<link.+?text/css.*?>

It will match

<link rel="stylesheet" type="text/css" href="theme.css">
<link type="text/css" rel="stylesheet"  href="theme.css">
<link type="text/css" rel="license"  href="someStuff">
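One detail worth checking with patterns like this: in java.util.regex a dot inside a character class, [.], matches only a literal '.', while a bare dot matches any character except a line terminator. The sketch below compares the two forms on one of the tags above. The class name DotInCharacterClass and the helper found are hypothetical.

```java
import java.util.regex.Pattern;

public class DotInCharacterClass {
    // Bracketed dots: [.] matches only a literal '.' character.
    static final Pattern BRACKETED = Pattern.compile("<link[.]+?text/css[.]*?>");
    // Bare dots: . matches any character except a line terminator.
    static final Pattern BARE = Pattern.compile("<link.+?text/css.*?>");

    // Returns true if the pattern is found anywhere in the given text.
    static boolean found(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        String tag = "<link rel=\"stylesheet\" type=\"text/css\" href=\"theme.css\">";
        System.out.println("bracketed: " + found(BRACKETED, tag));
        System.out.println("bare:      " + found(BARE, tag));
    }
}
```

Only the bare-dot form finds the tag here. The bracketed form would need literal dots between <link and text/css, which ordinary markup does not contain.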
