JavaScript RegEx - returns result but still not working

Question

I am trying to match a DTD node such as this text:

<!ELEMENT note (to,from,body)>

With this regular expression:

match(/<!ELEMENT\s?(.*?)\s?\(.*?\)>/i)

and it returns the desired text plus the text 'note'. Can anyone explain why?

Also, when I remove either or both of the blank spaces either side of the 'note' text it still returns the result, and this is not wanted. Can anyone help explain why it is doing that too?

Here is my test file:

<!ENTITY Aring "&amp;#197;" >,
<!ENTITY aring "&amp;#229;" >,
<!ENTITY agrave "&amp;#224;" >,
<!ENTITY aacute "&amp;#225;" >,
<!ATTLIST ARTICLE AUTHOR CDATA #REQUIRED>,
<!ATTLIST ARTICLE EDITOR CDATA #IMPLIED>,
<!ATTLIST ARTICLE DATE CDATA #IMPLIED>,
<!ATTLIST ARTICLE EDITION CDATA #IMPLIED>,
<!ELEMENT note (to,from,heading,body)>,
<!ELEMENT to (#PCDATA)>,
<!ELEMENT from (#PCDATA)>,
<!ELEMENT heading (#PCDATA)>,
<!ELEMENT body (#PCDATA)>

Thanks in advance for any help!

So you only want to match "(to,from,body)"? Is the note element required?
– Erik Schierboom, Commented Jun 26, 2013 at 11:44
I want to match the whole node so long as it is properly formatted...or any node of a similar structure, so the result in this case will be <!ELEMENT note (to,from,body)> and null if the original string was <!ELEMENTnote (to,from,body)> for example.
– user1360809, Commented Jun 26, 2013 at 11:45
Show us how you are using this regex - it seems to work. What is wanted?
– Bergi, Commented Jun 26, 2013 at 12:31
@Bergi, based on a comment to zmo's answer, I'm guessing the OP wants to match any DTD element node.
– Derek Henderson, Commented Jun 26, 2013 at 12:33

Community · Accepted Answer · 2017-02-08 14:42:28Z

2

Here is what your regular expression looks like when drawn as an automaton:

Regular expression image

So you're actually matching what you want, but the pattern also contains a capturing group, so the returned array has two elements, the full match and the captured group:

"<!ELEMENT note (to,from,body)>"
"note"

but it will also match other kinds of strings, like:

<!ELEMENT%e (jmopV|)>
<!ELEMENT r()>

which are not well-formed tags.
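Both effects are easy to reproduce in a console. The following is only a sketch; the variable names (original, result, noSpaces) are illustrative and not part of the question.

```javascript
// The original pattern from the question.
var original = /<!ELEMENT\s?(.*?)\s?\(.*?\)>/i;

// Without the g flag, match() returns [full match, capture group 1, ...].
var result = "<!ELEMENT note (to,from,body)>".match(original);
console.log(result[0]); // the whole node
console.log(result[1]); // the text captured by (.*?), here "note"

// Both \s? are optional, so a node with no spaces at all is accepted too.
var noSpaces = "<!ELEMENTnote(to,from,body)>".match(original);
console.log(noSpaces !== null); // true
```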

So you would be better off with a more precise regex, like:

<!ELEMENT\s+\w+\s+\((\w+, ?)*\w+\)>

here's what the regex matches:
- <!ELEMENT the literal text
- \s+ one or more whitespace characters
- \w+ one or more word characters
- \s+ one or more whitespace characters
- \( a literal opening parenthesis
- ( beginning of a group
- \w+ one or more word characters
- , a comma
- ? zero or one space (could be * for zero or more spaces)
- )* end of the group, which is matched zero or more times
- \w+ one or more word characters
- (you may want to add \s* here to allow optional spaces before the closing parenthesis)
- \) a literal closing parenthesis
- (you may want to add \s* here to allow optional spaces before the end of the tag)
- > the closing character of the tag
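As a rough check of the breakdown above, the stricter pattern can be run against one well-formed node and two malformed ones. The sample strings and the names strict, samples and outcomes are assumptions made for this sketch.

```javascript
// The stricter pattern: mandatory whitespace and a comma-separated word list.
var strict = /<!ELEMENT\s+\w+\s+\((\w+, ?)*\w+\)>/i;

var samples = [
  "<!ELEMENT note (to,from,body)>", // well formed
  "<!ELEMENTnote (to,from,body)>",  // missing space after ELEMENT
  "<!ELEMENT r()>"                  // no space before "(", empty list
];

// test() only reports whether the pattern matches somewhere in the string.
var outcomes = samples.map(function (s) {
  return strict.test(s);
});
console.log(outcomes); // [ true, false, false ]
```

Note that this version does not yet accept content such as (#PCDATA), because # is not a word character.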

Regular expression image

Then, when you do match(/<!ELEMENT\s+\w+\s+\((\w+, *)*\w+\)>/i), you will still get an array with two elements:

"<!ELEMENT note (to,from,body)>"
"from,"

The second element is whatever the capturing group matched on its last repetition. To get the full match, you just need the first element of the returned array:

var match = "<!ELEMENT note (to,from,body)>".match(/<!ELEMENT\s+\w+\s+\((\w+, *)*\w+\)>/i);
if (match !== null)
    match = match[0];

and if you want to use a RegExp object to do so:

var text = "<!ELEMENT note (to,from,body)>";
var pattern = new RegExp(/<!ELEMENT\s+\w+\s+\((\w+, *)*\w+\)>/i);
var match = pattern.exec(text); // same result array as text.match(pattern)
if (match !== null)
    match = match[0];

That will get you element 0 of the result, which is the full match.

AFTER EDIT:

you want a regex that works on this set of values:

<!ENTITY Aring "&amp;#197;" >,
<!ENTITY aring "&amp;#229;" >,
<!ENTITY agrave "&amp;#224;" >,
<!ENTITY aacute  "&amp;#225;" >,
<!ATTLIST ARTICLE AUTHOR CDATA #REQUIRED>,
<!ATTLIST ARTICLE EDITOR CDATA #IMPLIED>,
<!ATTLIST ARTICLE DATE CDATA #IMPLIED>,
<!ATTLIST ARTICLE EDITION CDATA #IMPLIED>,
<!ELEMENT note (to,from,heading,body)>,
<!ELEMENT to (#PCDATA)>,
<!ELEMENT from (#PCDATA)>,
<!ELEMENT heading (#PCDATA)>,
<!ELEMENT body (#PCDATA)>

so you want a regex that looks like this one:

/<!ELEMENT\s+\w+\s+\((\#?\w+,\s*)*\#?\w+\s*\)\s*>/

Regular expression image

look it up here

var match = "<!ELEMENT note (to,from,body)>".match(/<!ELEMENT\s+\w+\s+\((\#?\w+,\s*)*\#?\w+\s*\)\s*>/i);
if (match !== null)
    match = match[0];

This matches only the <!ELEMENT... nodes, not the <!ATTLIST... or <!ENTITY... nodes. For those, match will be null. For <!ELEMENT... nodes, match will contain the full string of the matched node.
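Applied to a list of nodes, the same check might look like the sketch below. The array name dtdNodes is borrowed from the comments on this answer; its contents are a shortened copy of the test file plus one deliberately malformed node, and elementPattern and elementNodes are names invented for the example.

```javascript
// The final pattern from above, including the optional leading "#".
var elementPattern = /<!ELEMENT\s+\w+\s+\((\#?\w+,\s*)*\#?\w+\s*\)\s*>/i;

var dtdNodes = [
  '<!ENTITY Aring "&amp;#197;" >',
  '<!ATTLIST ARTICLE AUTHOR CDATA #REQUIRED>',
  '<!ELEMENT note (to,from,heading,body)>',
  '<!ELEMENT to (#PCDATA)>',
  '<!ELEMENTbody (#PCDATA)>' // malformed on purpose: no space after ELEMENT
];

// Keep only the nodes that match, always checking for null first.
var elementNodes = [];
for (var i = 0; i < dtdNodes.length; i++) {
  var m = dtdNodes[i].match(elementPattern);
  if (m !== null) {
    elementNodes.push(m[0]);
  }
}
console.log(elementNodes);
```

Only the two well-formed ELEMENT nodes end up in elementNodes.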

edited Feb 8, 2017 at 14:42

CommunityBot


answered Jun 26, 2013 at 11:59

zmo


14 Comments

Derek Henderson Over a year ago

I got the distinct impression the OP did not want to match two groups, just one.

user1360809 Over a year ago

this doesn't seem to work, great images though! I tried using this: var testMatch = dtdNodes[i].match(/<!ELEMENT\s+\w+\s+\((\w+, *)*\w+\)>/i);

zmo Over a year ago

weird, I've tried : js> matches = '<!ELEMENT note (to,from,body)>'.match(/<!ELEMENT\s+\w+\s+\((\w+, *)*\w+\)>/i)[0]; and it returns "<!ELEMENT note (to,from,body)>". Though, I'm not sure about my new RegExp example.

zmo Over a year ago

About the images, I like them a lot, because they explain well why the regex is good or not. A regex is basically just an Automaton (more precisely a NFA). There are great courses on the topic at the MIT online courses: about NFA and Regex

zmo Over a year ago

oh and btw, always think about testing if the match === null before accessing an array element!


Erik Schierboom · Accepted Answer · 2013-06-26 11:56:52Z

1

Providing the note part is fixed:

var node = '<!ELEMENT note (to,from,body)>';
node.match(/<!ELEMENT note \(.+,.+,.+\)/); // Returns an array; element 0 is the matched text up to the closing parenthesis

var invalidNode = '<!ELEMENTnote (to,from,body)>';
invalidNode.match(/<!ELEMENT note \(.+,.+,.+\)/); // Returns null

See: http://jsfiddle.net/a5KkF/

answered Jun 26, 2013 at 11:56

Erik Schierboom


Comments

Community · Accepted Answer · 2017-02-08 14:42:27Z

1

The answer to both questions is that you are using .*, which matches any character (other than a line break) zero or more times.

Instead, use the following regular expression:

/<!(?:ELEMENT|ENTITY|ATTLIST)\s+\w+\s+.+>/i

Proof the regular expression works

A fiddle to further demonstrate this works

And a lovely image to illustrate how the match works:

Regular expression image

To summarize, this matches the string <!, followed by either ELEMENT or ENTITY or ATTLIST, followed by 1 or more spaces (\s+), followed by 1 or more word characters (\w+), followed by 1 or more spaces, followed by one or more characters, followed by the closing bracket.
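A small sketch of how this pattern behaves on the three node types follows. The names anyNode, lines and checks are made up for the example, and the last sample is included only to show that the trailing .+ is permissive.

```javascript
// Accepts ELEMENT, ENTITY and ATTLIST nodes that have a name and some content.
var anyNode = /<!(?:ELEMENT|ENTITY|ATTLIST)\s+\w+\s+.+>/i;

var lines = [
  '<!ENTITY Aring "&amp;#197;" >',
  '<!ATTLIST ARTICLE AUTHOR CDATA #REQUIRED>',
  '<!ELEMENT note (to,from,body)>',
  '<!ELEMENTnote (to,from,body)>', // no space after the keyword
  '<!ELEMENT note garbage>'        // not a valid content model
];

var checks = lines.map(function (line) {
  return anyNode.test(line);
});
console.log(checks); // [ true, true, true, false, true ]
```

The last result shows the trade-off: because .+ accepts any characters, the pattern checks the overall shape of a node rather than the validity of its content.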

edited Feb 8, 2017 at 14:42

CommunityBot


answered Jun 26, 2013 at 11:47

Derek Henderson


7 Comments

user1360809 Over a year ago

don't know - I copied the RegEx directly and it doesn't return anything!

Derek Henderson Over a year ago

@user1360809, it wasn't clear from your question that you wanted to match any valid DTD element node. The RE I gave you before only matched the specific string you provided. I have edited my answer so that it now matches any DTD element node.

user1360809 Over a year ago

yes works now thanks! You are correct, I am using it to pick out any DTD node so long as it matches the required format.

Derek Henderson Over a year ago

You're welcome. I'm glad we were able to find something that works. :) Please consider upvoting/accepting this answer. Thx!

user1360809 Over a year ago

will do...I appreciate the time it takes people to reply and try to give them the time back...have learnt lots such as the 'capturing' thing and efficiency of not using ungreedy quantifiers :) Is it the same for every language or can I expect subtle differences for example?


Martin Ender · Accepted Answer · 2013-06-26 11:53:17Z

0

The reason you get note is capturing. Sets of parentheses make that part of the match available later (or within backreferences). Since you don't even need the parentheses for grouping, just remove them, if you don't want note.

Then your spaces are optional (due to the ?) - hence, removing them in the string does not matter at all. Simply remove the ? or make it a + (so that more than one space is allowed).
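The two fixes described so far (dropping the capturing parentheses and making the spaces mandatory) can be compared side by side. This is a sketch only; optionalSpace, requiredSpace, squashed and res are names chosen for the example.

```javascript
// Question's pattern without the capturing parentheses, spaces still optional.
var optionalSpace = /<!ELEMENT\s?.*?\s?\(.*?\)>/i;
// Same pattern with \s+ so at least one whitespace character is required.
var requiredSpace = /<!ELEMENT\s+.*?\s+\(.*?\)>/i;

var squashed = "<!ELEMENTnote(to,from,body)>";
console.log(optionalSpace.test(squashed)); // true, the spaces were never required
console.log(requiredSpace.test(squashed)); // false

// With no capturing group, the result array holds only the full match.
var res = "<!ELEMENT note (to,from,body)>".match(requiredSpace);
console.log(res.length); // 1
console.log(res[0]);
```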

The other problem is that . can match spaces as well. You should maybe be a bit more restrictive (this way you can also avoid ungreedy quantifiers, which are generally worse in performance):

/<!ELEMENT\s+\S*\s+\([^)]*\)>/i

\S matches any non-whitespace character and [^)] matches any character except ) (it's a negated character class). In fact, you might want to exclude ( from the \S as well, because otherwise it could already match into the parentheses:

/<!ELEMENT\s+[^\s(]*\s+\([^)]*\)>/i

If the note part has to contain at least one character, you should make that clear in the regex as well, by using + instead of *:

/<!ELEMENT\s+[^\s(]+\s+\([^)]*\)>/i

If the note part is optional on the other hand, my earlier version requires at least 2 spaces (due to the two \s+). In that case, you could group the note part along with the following space and make it optional together. This way you only require the space, if note is there. To suppress capturing (so you don't get two strings again), use (?:...) for grouping instead of (...):

/<!ELEMENT\s+(?:[^\s(]+\s+)?\([^)]*\)>/i

Note that match will still give you an array containing the string you are looking for (and you can't do anything about that), so you'll have to access it with [0].
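Putting the last variant to work might look like this. The sample strings and the names optionalName, withName, withoutName and broken are assumptions for the sketch.

```javascript
// Last variant: the name (plus its trailing space) is optional and not captured.
var optionalName = /<!ELEMENT\s+(?:[^\s(]+\s+)?\([^)]*\)>/i;

var withName = "<!ELEMENT note (to,from,body)>".match(optionalName);
var withoutName = "<!ELEMENT (to,from,body)>".match(optionalName);
var broken = "<!ELEMENTnote (to,from,body)>".match(optionalName);

// (?:...) groups without capturing, so each result has exactly one element.
console.log(withName.length, withName[0]);
console.log(withoutName.length, withoutName[0]);
console.log(broken); // null, the space after ELEMENT is mandatory
```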

edited Jun 26, 2013 at 11:53

answered Jun 26, 2013 at 11:45

Martin Ender


3 Comments

user1360809 Over a year ago

thanks - it still seems to match with spaces removed - any idea why?

Martin Ender Over a year ago

@user1360809 yeah, I misunderstood that part of your question and edited my answer now.

user1360809 Over a year ago

can confirm it works! ;) Even though the syntax is a little more complex I prefer this answer for now...
