2

I am trying to match a DTD node such as this text:

<!ELEMENT note (to,from,body)>

With this regular expression:

match(/<!ELEMENT\s?(.*?)\s?\(.*?\)>/i)

and it returns the desired text plus the text 'note'. Can anyone explain why?

Also, when I remove either or both of the spaces on either side of the 'note' text, it still returns a result, which is not wanted. Can anyone explain why it does that too?

Here is my test file:

<!ENTITY Aring "&amp;#197;" >,
<!ENTITY aring "&amp;#229;" >,
<!ENTITY agrave "&amp;#224;" >,
<!ENTITY aacute "&amp;#225;" >,
<!ATTLIST ARTICLE AUTHOR CDATA #REQUIRED>,
<!ATTLIST ARTICLE EDITOR CDATA #IMPLIED>,
<!ATTLIST ARTICLE DATE CDATA #IMPLIED>,
<!ATTLIST ARTICLE EDITION CDATA #IMPLIED>,
<!ELEMENT note (to,from,heading,body)>,
<!ELEMENT to (#PCDATA)>,
<!ELEMENT from (#PCDATA)>,
<!ELEMENT heading (#PCDATA)>,
<!ELEMENT body (#PCDATA)>

Thanks in advance for any help!

4
  • So you only want to match "(to,from,body)"? Is the note element required? Commented Jun 26, 2013 at 11:44
  • I want to match the whole node so long as it is properly formatted...or any node of a similar structure, so the result in this case will be <!ELEMENT note (to,from,body)> and null if the original string was <!ELEMENTnote (to,from,body)> for example. Commented Jun 26, 2013 at 11:45
  • Show us how you are using this regex - it seems to work. What is wanted? Commented Jun 26, 2013 at 12:31
  • @Bergi, based on a comment to zmo's answer, I'm guessing the OP wants to match any DTD element node. Commented Jun 26, 2013 at 12:33

4 Answers

2

Here is what your regular expression looks like when drawn as an automaton:

Regular expression image

So you're actually matching what you want, but the returned array has two elements, the full match and the captured group:

  1. "<!ELEMENT note (to,from,body)>"
  2. "note"

but it will also match other kinds of strings, like:

  • <!ELEMENT%e (jmopV|)>
  • <!ELEMENT r()>

which are not well-formed tags.
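To make this concrete, here is a minimal sketch that runs the pattern from the question against a well-formed node and two malformed ones. The variable names are only illustrative and the input strings are made up for the demonstration.

```javascript
// The pattern from the question, unchanged.
var original = /<!ELEMENT\s?(.*?)\s?\(.*?\)>/i;

// A well-formed node: index 0 is the full match, index 1 is the capture group.
var good = '<!ELEMENT note (to,from,body)>'.match(original);
console.log(good[0]); // <!ELEMENT note (to,from,body)>
console.log(good[1]); // note

// The space after ELEMENT is optional (\s?), so this malformed node also matches.
var noSpace = '<!ELEMENTnote (to,from,body)>'.match(original);
console.log(noSpace !== null); // true

// Even an empty name with empty parentheses is accepted.
var empty = '<!ELEMENT()>'.match(original);
console.log(empty !== null); // true
```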

So you'd be better off with a more precise regex, like:

<!ELEMENT\s+\w+\s+\((\w+, ?)*\w+\)>
  • here's what the regex matches:
    • <!ELEMENT the literal text
    • \s+ one or more whitespace characters
    • \w+ one or more word characters
    • \s+ one or more whitespace characters
    • \( a literal opening parenthesis
    • ( start of a group
    • \w+ one or more word characters
    • , a comma
    • a space followed by ? zero or one space (could be * for zero or more spaces)
    • )* end of the group, that group being matched zero or more times
    • \w+ one or more word characters
    • (you may want to add \s* here if you want to match optional spaces before the closing parenthesis)
    • \) a literal closing parenthesis
    • (you may want to add \s* here if you want to match optional spaces before the end of the tag)
    • > the closing character of the tag

Regular expression image

Then, when you do match(/<!ELEMENT\s+\w+\s+\((\w+, *)*\w+\)>/i), you will still get an array with two elements, because a repeated group keeps only its last iteration:

  1. "<!ELEMENT note (to,from,body)>"
  2. "from,"

The full match is always the first element of the returned array, so that is the one to read:

var match = "<!ELEMENT note (to,from,body)>".match(/<!ELEMENT\s+\w+\s+\((\w+, *)*\w+\)>/i);
if (match !== null)
    match = match[0];

and if you want to use a RegExp object to do so:

var text = "<!ELEMENT note (to,from,body)>";
var pattern = new RegExp(/<!ELEMENT\s+\w+\s+\((\w+, *)*\w+\)>/i);
var match = pattern.exec(text);
if (match !== null)
    match = match[0];

that will get you element 0 of the match array (which is the full match).

AFTER EDIT:

you want a regex that works on this set of values:

<!ENTITY Aring "&amp;#197;" >,
<!ENTITY aring "&amp;#229;" >,
<!ENTITY agrave "&amp;#224;" >,
<!ENTITY aacute  "&amp;#225;" >,
<!ATTLIST ARTICLE AUTHOR CDATA #REQUIRED>,
<!ATTLIST ARTICLE EDITOR CDATA #IMPLIED>,
<!ATTLIST ARTICLE DATE CDATA #IMPLIED>,
<!ATTLIST ARTICLE EDITION CDATA #IMPLIED>,
<!ELEMENT note (to,from,heading,body)>,
<!ELEMENT to (#PCDATA)>,
<!ELEMENT from (#PCDATA)>,
<!ELEMENT heading (#PCDATA)>,
<!ELEMENT body (#PCDATA)>

so you want a regex that looks like this one:

/<!ELEMENT\s+\w+\s+\((\#?\w+,\s*)*\#?\w+\s*\)\s*>/

Regular expression image

look it up here

var match = "<!ELEMENT note (to,from,body)>".match(/<!ELEMENT\s+\w+\s+\((\#?\w+,\s*)*\#?\w+\s*\)\s*>/i);
if (match !== null)
    match = match[0];

This matches only the <!ELEMENT... nodes, not the <!ATTLIST... or <!ENTITY... nodes. For those, match will be null. For <!ELEMENT... nodes, match will contain the full string of the matched node.
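As a quick check of that behaviour, the sketch below applies the pattern to a few lines copied from the test file in the question. The names elementPattern, nodes and results are only illustrative.

```javascript
// Pattern from this answer, copied as written.
var elementPattern = /<!ELEMENT\s+\w+\s+\((\#?\w+,\s*)*\#?\w+\s*\)\s*>/i;

// A few lines taken from the test file in the question.
var nodes = [
    '<!ENTITY Aring "&amp;#197;" >',
    '<!ATTLIST ARTICLE AUTHOR CDATA #REQUIRED>',
    '<!ELEMENT note (to,from,heading,body)>',
    '<!ELEMENT to (#PCDATA)>'
];

// Keep the full match for each line, or null when the line does not match.
var results = nodes.map(function (node) {
    var m = node.match(elementPattern);
    return m !== null ? m[0] : null;
});

console.log(results);
// [ null, null, '<!ELEMENT note (to,from,heading,body)>', '<!ELEMENT to (#PCDATA)>' ]
```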


14 Comments

I got the distinct impression the OP did not want to match two groups, just one.
this doesn't seem to work, great images though! I tried using this: var testMatch = dtdNodes[i].match(/<!ELEMENT\s+\w+\s+\((\w+, *)*\w+\)>/i);
weird, I've tried : js> matches = '<!ELEMENT note (to,from,body)>'.match(/<!ELEMENT\s+\w+\s+\((\w+, *)*\w+\)>/i)[0]; and it returns "<!ELEMENT note (to,from,body)>". Though, I'm not sure about my new RegExp example.
About the images, I like them a lot, because they explain well why the regex is good or not. A regex is basically just an Automaton (more precisely a NFA). There are great courses on the topic at the MIT online courses: about NFA and Regex
oh and btw, always think about testing if the match === null before accessing an array element!
1

Providing the note part is fixed:

var node = '<!ELEMENT note (to,from,body)>';
node.match(/<!ELEMENT note \(.+,.+,.+\)/); // Returns an array; element 0 is the matched text up to the closing parenthesis

var invalidNode = '<!ELEMENTnote (to,from,body)>';
invalidNode.match(/<!ELEMENT note \(.+,.+,.+\)/); // Returns null

See: http://jsfiddle.net/a5KkF/

Comments

1

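Both behaviours come from the pattern being too permissive: (.*?) is a capturing group, so whatever it matches is returned alongside the full match, and . matches any character (including none at all when followed by *), while \s? makes each space optional.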

Instead, use the following regular expression:

/<!(?:ELEMENT|ENTITY|ATTLIST)\s+\w+\s+.+>/i

Proof the regular expression works

A fiddle to further demonstrate this works

And a lovely image to illustrate how the match works:

Regular expression image

To summarize, this matches the string <!, followed by either ELEMENT or ENTITY or ATTLIST, followed by 1 or more spaces (\s+), followed by 1 or more word characters (\w+), followed by 1 or more spaces, followed by one or more characters, followed by the closing bracket.
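The sketch below tries that pattern on a few sample lines; the names anyNode, samples and flags are only illustrative. It also shows one thing to keep in mind: because .+ is greedy, the pattern assumes one node per string, and two nodes on the same line come back as a single match.

```javascript
var anyNode = /<!(?:ELEMENT|ENTITY|ATTLIST)\s+\w+\s+.+>/i;

var samples = [
    '<!ENTITY Aring "&amp;#197;" >',
    '<!ATTLIST ARTICLE AUTHOR CDATA #REQUIRED>',
    '<!ELEMENT note (to,from,heading,body)>',
    '<!ELEMENTnote (to,from,body)>' // missing space after ELEMENT
];

// true for the three well-formed nodes, false for the last one.
var flags = samples.map(function (s) {
    return anyNode.test(s);
});
console.log(flags); // [ true, true, true, false ]

// Greedy .+ runs to the last > in the string, so both nodes are matched as one.
var joined = '<!ELEMENT a (b)>,<!ELEMENT c (d)>'.match(anyNode)[0];
console.log(joined); // <!ELEMENT a (b)>,<!ELEMENT c (d)>
```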

7 Comments

don't know - I copied the RegEx directly and it doesn't return anything!
@user1360809, it wasn't clear from your question that you wanted to match any valid DTD element node. The RE I gave you before only matched the specific string you provided. I have edited my answer so that it now matches any DTD element node.
yes works now thanks! You are correct, I am using it to pick out any DTD node so long as it matches the required format.
You're welcome. I'm glad we were able to find something that works. :) Please consider upvoting/accepting this answer. Thx!
will do...I appreciate the time it takes people to reply and try to give them the time back...have learnt lots such as the 'capturing' thing and efficiency of not using ungreedy quantifiers :) Is it the same for every language or can I expect subtle differences for example?
0

The reason you get note is capturing. Sets of parentheses make that part of the match available later (or within backreferences). Since you don't even need the parentheses for grouping, just remove them, if you don't want note.

Then your spaces are optional (due to the ?) - hence, removing them in the string does not matter at all. Simply remove the ? or make it a + (so that more than one space is allowed).

The other problem is that . can match spaces as well. You should probably be a bit more restrictive (this way you can also avoid ungreedy quantifiers, which are generally worse in performance):

/<!ELEMENT\s+\S*\s+\([^)]*\)>/i

\S matches any non-whitespace character and [^)] matches any character except ) (it's a negated character class). In fact, you might want to exclude ( from the \S as well, because otherwise it could already match into the parentheses:

/<!ELEMENT\s+[^\s(]*\s+\([^)]*\)>/i

If the note part has to contain at least one character you should make that clear in the regex as well, by using + instead of *

/<!ELEMENT\s+[^\s(]+\s+\([^)]*\)>/i

If the note part is optional on the other hand, my earlier version requires at least 2 spaces (due to the two \s+). In that case, you could group the note part along with the following space and make it optional together. This way you only require the space, if note is there. To suppress capturing (so you don't get two strings again), use (?:...) for grouping instead of (...):

/<!ELEMENT\s+(?:[^\s(]+\s+)?\([^)]*\)>/i

Note that match will still give you an array containing the string you are looking for (and you can't do anything about that), so you'll have to access it with [0].
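Here is a small sketch of the version that requires a name, checked against the two malformed inputs discussed in the question. The variable names are only illustrative.

```javascript
// No capturing groups, mandatory whitespace, and a non-empty element name.
var strict = /<!ELEMENT\s+[^\s(]+\s+\([^)]*\)>/i;

var ok = '<!ELEMENT note (to,from,body)>'.match(strict);
console.log(ok.length); // 1, because there is nothing captured besides the full match
console.log(ok[0]);     // <!ELEMENT note (to,from,body)>

// Removing either space now makes the match fail.
var noLeadingSpace = '<!ELEMENTnote (to,from,body)>'.match(strict);
var noTrailingSpace = '<!ELEMENT note(to,from,body)>'.match(strict);
console.log(noLeadingSpace);  // null
console.log(noTrailingSpace); // null
```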

3 Comments

thanks - it still seems to match with spaces removed - any idea why?
@user1360809 yeah, I misunderstood that part of your question and edited my answer now.
can confirm it works! ;) Even though the syntax is a little more complex I prefer this answer for now...
