I am currently working on a University assignment, and have a (most likely simple) question regarding regex / regular expressions.

To summarize: the assignment is a simple RSS feed manager that uses JSP and a RESTful web service.

I am currently working on a section of the assignment spec which requires me to accept XML feed data as input (e.g. <feeds><feed><name>FEED NAME</name><uri>http://FEEDuri/</uri></feed></feeds>) and extract the FEED NAME and FEEDuri from this data via regex.

My lecturer has provided a base method for us to work off, and I think I have implemented it correctly within my RESTful web service, and now I am implementing error handling.

I have successfully implemented error handling for the case where the user inputs no data. My question is this: based on the example method (below), is it possible to implement error handling for the case where the input feed format is incorrect?

e.g. <fed>FEED NAME</fiid><uro>http://FEEDuri</pro>

The XML tags here are obviously incorrect.

Will regex ONLY pull the group from the String IF it lies between the defined values passed as the argument to the compile method?

To supplement my question, here is the base method given to us to use (instead of an XML parser):

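public static List<Feed> getFeedsFromXml(String xml) {
      // Result list declared locally so the snippet compiles on its own (needs java.util.ArrayList)
      List<Feed> feeds = new ArrayList<>();
      Pattern feedPattern = Pattern.compile("<feed>\\s*<name>\\s*([^<]*)</name>\\s*<uri>\\s*([^<]*)</uri>\\s*</feed>");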
      Matcher feedMatch = feedPattern.matcher(xml);

      while (feedMatch.find()) {
          String feedName = feedMatch.group(1);
          String feedURI = feedMatch.group(2);
          feeds.add(new Feed(feedName, feedURI));
      }

      return feeds;
}

2 Answers

1

Yes, the regex will only match sections of the string that it, well, matches. If your regex contains "<feed>", it's not going to go matching strings like "<fed>" or "<fiid>".

If there are no matches of the regex in the input string, feedMatch.find() will simply return false the first time you call it, so nothing in the while loop will execute. This method will simply return an empty list, as it probably should.
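A minimal sketch of that behavior, reusing the pattern from the question. The FeedRegexCheck class and the extract method are hypothetical names, and a String[] pair stands in for the Feed class, which isn't shown in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FeedRegexCheck {

    // Same pattern as the method in the question.
    private static final Pattern FEED_PATTERN = Pattern.compile(
            "<feed>\\s*<name>\\s*([^<]*)</name>\\s*<uri>\\s*([^<]*)</uri>\\s*</feed>");

    // Returns {name, uri} pairs; String[] is a stand-in for the Feed class (assumption).
    static List<String[]> extract(String xml) {
        List<String[]> feeds = new ArrayList<>();
        Matcher m = FEED_PATTERN.matcher(xml);
        while (m.find()) {
            feeds.add(new String[] { m.group(1), m.group(2) });
        }
        return feeds;
    }

    public static void main(String[] args) {
        String good = "<feeds><feed><name>FEED NAME</name><uri>http://FEEDuri/</uri></feed></feeds>";
        String bad = "<fed>FEED NAME</fiid><uro>http://FEEDuri</pro>";
        System.out.println(extract(good).size()); // prints 1
        System.out.println(extract(bad).size());  // prints 0
    }
}
```

If an empty result for non-empty input should count as an error, the caller can check for that after the call.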

1

I'm not entirely sure what your exact question is. If I understand correctly, you are implementing error handling and want to cleanly handle any ill-formed XML. There are two considerations here: 1) you need to report an error for any ill-formed XML, and 2) you don't want the regex to match the correct XML while silently skipping past any ill-formed XML.

Let's start by looking at how Matcher.find() works with a simplified version of your XML parser. I want to match anything that is between <feed> and </feed>. For simplicity, I will simply print out the results to the display.

Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {

    public static void main(String[] args) {
        System.out.println("Good XML");
        String goodXML = "<feed>CODE-GURU</feed><feed>ALEXM</feed>";
        matchFeeds(goodXML);

        System.out.println("Bad XML 1");
        String badXML1 = "<fed>CODE-GURU</feed><feed>ALEXM</feed>";
        matchFeeds(badXML1);

        System.out.println("Bad XML 2");
        String badXML2 = "<feed>CODE-GURU</fid><feed>ALEXM</feed>";
        matchFeeds(badXML2);

        System.out.println("Bad XML 3");
        String badXML3 = "<feed>CODE-GURU</fid><fiid>ALEXM</feed>";
        matchFeeds(badXML3);
    }

    public static void matchFeeds(String xml) {
        Pattern feedPattern = Pattern.compile("<feed>([^<]*)</feed>");
        Matcher feedMatch = feedPattern.matcher(xml);

        while (feedMatch.find()) {
            String feedName = feedMatch.group(1);

            System.out.println("Feed Name: " + feedName);
        }
    }
}

Output:

Good XML
Feed Name: CODE-GURU
Feed Name: ALEXM
Bad XML 1
Feed Name: ALEXM
Bad XML 2
Feed Name: ALEXM
Bad XML 3

The "Good XML" test prints out exactly what is expected. However, "Bad XML 1" and "Bad XML 2" might surprise you, if you don't understand how Java regexes work. The Matcher.find() locates "the next subsequence of the input sequence that matches the pattern." This means that it will skip anything that doesn't match until it finds a valid match, if any.
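To see the skipping directly, you can ask the Matcher where the first match begins. This is only a small sketch; FindSkipDemo and firstMatchStart are hypothetical names used for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindSkipDemo {

    // Returns the index where the first match begins, or -1 if there is no match.
    static int firstMatchStart(String xml) {
        Matcher m = Pattern.compile("<feed>([^<]*)</feed>").matcher(xml);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // The malformed first element occupies indices 0..20, so the match begins at 21.
        System.out.println(firstMatchStart("<fed>CODE-GURU</feed><feed>ALEXM</feed>")); // prints 21
        // Nothing in this input matches, so find() returns false.
        System.out.println(firstMatchStart("<feed>CODE-GURU</fid><fiid>ALEXM</feed>")); // prints -1
    }
}
```

A start index greater than 0 means find() stepped over text that did not match the pattern.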

Fortunately, you can stop find() from skipping ahead with the right regex. Simply add \G at the beginning of the pattern. \G matches only at the end of the previous match (or at the beginning of the input on the first call), so Matcher.find() can no longer jump over text that doesn't match; it returns false as soon as it reaches an element that doesn't fit. So in my example, the regex would be "\\G<feed>([^<]*)</feed>".
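One way to turn this into actual error reporting is to remember where the last match ended and compare that with the length of the input. The sketch below assumes the simplified <feed>...</feed> format from my example; StrictFeedMatcher and matchFeedsStrict are hypothetical names, and throwing IllegalArgumentException is just one possible way to signal the error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictFeedMatcher {

    // \G forces each match to begin exactly where the previous one ended.
    private static final Pattern FEED = Pattern.compile("\\G<feed>([^<]*)</feed>");

    // Returns every feed name, or throws if any part of the input is not a valid element.
    static List<String> matchFeedsStrict(String xml) {
        List<String> names = new ArrayList<>();
        Matcher m = FEED.matcher(xml);
        int consumed = 0; // index just past the last successful match
        while (m.find()) {
            names.add(m.group(1));
            consumed = m.end();
        }
        // If matching stopped before the end of the input, something was malformed.
        if (consumed != xml.length()) {
            throw new IllegalArgumentException("Malformed feed data at index " + consumed);
        }
        return names;
    }
}
```

With the test strings from above, "Good XML" returns both names, while all three "Bad XML" inputs throw instead of silently dropping an element. Note that this strict version also rejects any extra text or whitespace between elements.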
