I have a large String that contains some XML. The XML contains input like:

<xyz1>...</xyz1>
<hello>text between strings #1</hello>
<xyz2>...</xyz2>
<hello>text between strings #2</hello>
<xyz3>...</xyz3>

I want to get all of these <hello>text between strings</hello> elements.

So in the end I want a List, or any other Collection, that contains all the <hello>...</hello> entries.

I tried it with Regex and Matcher, but it doesn't work with large strings; with smaller strings it works. I read a blog post about this which says that Java regex is broken for alternation over large strings.

Is there an easy and good way to do this?

Edit:

My attempt so far:

String pattern1 = "<hello>";
String pattern2 = "</hello>";
List<String> helloList = new ArrayList<String>();

String regexString = Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2);


Pattern pattern = Pattern.compile(regexString);

Matcher matcher = pattern.matcher(scannerString);
while (matcher.find()) {
  String textInBetween = matcher.group(1); // Since (.*?) is capturing group 1
  // You can insert match into a List/Collection here
  helloList.add(textInBetween);
  logger.info("-------------->>>> " + textInBetween);
}
  • Never use regex for parsing (X)HTML: stackoverflow.com/questions/1732348/… Commented Apr 17, 2015 at 8:32
  • I suggest you use XPath to query XML. Commented Apr 17, 2015 at 8:33
  • You will have to use an XML parser, e.g. SAX or DOM, extract all values between the tags you want, and put them in a Collection. Commented Apr 17, 2015 at 8:33
  • Never use regex for parsing ^(HT|X)ML$ Commented Apr 17, 2015 at 8:34
  • String regexString = "(?s)" + Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2); Commented Apr 17, 2015 at 8:35
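
On the "(?s)" suggestion in the last comment: by default "." does not match line terminators, so the pattern from the question skips any <hello> element whose text spans several lines. Whether that is what goes wrong on the large input is only a guess, since the real data is not shown. A minimal sketch of the difference (the class name DotallDemo and the sample input are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Collects the text between <hello> and </hello>.
    // dotAll switches Pattern.DOTALL on, which is what the inline flag (?s) does.
    static List<String> extract(String input, boolean dotAll) {
        String regex = Pattern.quote("<hello>") + "(.*?)" + Pattern.quote("</hello>");
        Pattern pattern = Pattern.compile(regex, dotAll ? Pattern.DOTALL : 0);
        Matcher matcher = pattern.matcher(input);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input: the first element contains a line break.
        String input = "<hello>line one\nline two</hello><hello>single</hello>";
        System.out.println(extract(input, false)); // only the single-line element is found
        System.out.println(extract(input, true));  // DOTALL: both elements are found
    }
}
```

This only addresses line breaks. It does not make the regex approach safe against comments, CDATA sections or nested tags.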

4 Answers

1

If you have to parse an XML file, I suggest you use the XPath language. Basically, you have to do the following:

  1. Parse the XML String into a DOM object
  2. Create an XPath query
  3. Query the DOM

Have a look at this link.

An example of what you have to do:

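import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

String xml = ...;
try {
   // Build structures to parse the String
   DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
   // Parse the XML string into a DOM object; reading from a StringReader avoids
   // the platform-default charset that xml.getBytes() would use
   Document document = builder.parse(new InputSource(new StringReader(xml)));
   // Create an XPath query
   XPath xPath = XPathFactory.newInstance().newXPath();
   // Query the DOM object with the query '//hello'
   NodeList nodeList = (NodeList) xPath.compile("//hello").evaluate(document, XPathConstants.NODESET);
} catch (Exception e) {
   e.printStackTrace();
}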
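
To get from the NodeList to the List the question asks for, something along these lines should work. It is a sketch using only JDK classes; the class name HelloXPath, the method name extractHellos and the <root> wrapper in the sample are my own assumptions, and the input must be well-formed XML with a single root element:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HelloXPath {
    // Returns the text content of every <hello> element in the given XML string.
    static List<String> extractHellos(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate("//hello", document, XPathConstants.NODESET);

            // Copy the text of each matched node into a plain Java list.
            List<String> hellos = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hellos.add(nodes.item(i).getTextContent());
            }
            return hellos;
        } catch (Exception e) {
            // Malformed XML and similar problems end up here.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: the fragments from the question wrapped in one root element.
        String xml = "<root><xyz1>...</xyz1><hello>text between strings #1</hello>"
                   + "<xyz2>...</xyz2><hello>text between strings #2</hello></root>";
        System.out.println(extractHellos(xml));
    }
}
```

Note that DOM loads the whole document into memory, so for very large input a streaming parser may be the better fit.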

1

You have to parse your XML with an XML parser. It is easier than using regular expressions.

The DOM parser is the simplest to use, but if your XML is very big, use the SAX parser.
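
For the very big case, a SAX handler could look roughly like this. It is only a sketch with JDK classes: the names HelloSax and extractHellos are made up, it assumes <hello> elements are not nested inside each other, and it relies on the default (non-namespace-aware) parser so that qName is the plain tag name:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class HelloSax {
    // Streams through the XML and collects the text of every <hello> element.
    static List<String> extractHellos(String xml) {
        List<String> hellos = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside a <hello> element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("hello".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append instead of assigning.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("hello".equals(qName) && current != null) {
                    hellos.add(current.toString());
                    current = null;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return hellos;
    }
}
```

Calling HelloSax.extractHellos(xmlString) returns the texts in document order. When the XML comes from a file or stream, an InputSource built from that stream avoids holding the whole document in a String first.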

1

I would highly recommend using one of the many publicly available XML parsers.

It is simply easier to achieve what you're trying to do (even if you want to extend your request in the future). If speed and memory are not a concern, go ahead and use dom4j. There are a lot of resources online; let me know if you would like me to add good examples to this answer. For now my answer simply points you to alternative options, because I'm not sure what your limitations are.


Regarding REGEX when parsing XML, Dour High Arch gave a great response:

XML is not a regular language. You cannot parse it using a regular expression. An expression you think will work will break when you get nested tags, then when you fix that it will break on XML comments, then CDATA sections, then processor directives, then namespaces, ... It cannot work, use an XML parser.

Parsing XML with REGEX in Java
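
To make the quoted point concrete, here is a small comparison on input that contains a comment and a CDATA section. It is a sketch under my own assumptions: it uses the JDK's built-in StAX reader rather than dom4j, and the class name RegexVsParser and the sample document are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class RegexVsParser {
    // Naive approach: everything between <hello> and </hello>, comments included.
    static List<String> withRegex(String xml) {
        Matcher matcher = Pattern.compile("(?s)<hello>(.*?)</hello>").matcher(xml);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    // Parser approach: StAX skips comments and unwraps CDATA sections.
    static List<String> withStax(String xml) {
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "hello".equals(reader.getLocalName())) {
                    result.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<root><!-- <hello>old</hello> --><hello><![CDATA[a < b]]></hello></root>";
        System.out.println(withRegex(xml)); // includes the commented-out element and raw CDATA markup
        System.out.println(withStax(xml));  // [a < b]
    }
}
```

The regex version reports the commented-out element as a match and returns the CDATA markup verbatim, while the parser returns only the real element's text.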

0

With Java 8 you could use the Dynamics library to do this in a straightforward way:

XmlDynamic xml = new XmlDynamic(
    "<bunch_of_data>" +
        "<xyz1>...</xyz1>" +
        "<hello>text between strings #1</hello>" +
        "<xyz2>...</xyz2>" +
        "<hello>text between strings #2</hello>" +
        "<xyz3>...</xyz3>" +
    "</bunch_of_data>");

List<String> hellos = xml.get("bunch_of_data").children()
    .filter(XmlDynamic.hasElementName("hello"))
    .map(hello -> hello.asString())
    .collect(Collectors.toList()); // ["text between strings #1", "text between strings #2"]

See https://github.com/alexheretic/dynamics#xml-dynamics
