3

I'm trying to write a JavaScript HTML/PHP parser that extracts all opening tags from an HTML/PHP source and returns each tag's type and its attributes with their values, while also tracking whether each attribute or value comes from static text or from PHP variables. The problem arises when I try to compose the JavaScript RegExp pattern, specifically in certain rare cases. The patterns I was able to come up with either involve a negative lookbehind (to cope with the closing PHP tag, that is, to match a closing bracket that is not preceded by a question mark) or fail in certain cases. The lookbehind version looks like:

<[a-zA-Z]+.*?(?<!\?)>

...and works perfectly, except that in my case I must avoid lookbehind. A more JavaScript-friendly version would be:

<[a-zA-Z]+((.(?!</)(?!<[a-zA-Z]+))*)?>

...which works except in this case:

<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>><?php echo $img; ?></option>
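To make the failure concrete, here is a small sketch that runs the lookahead-only pattern against the sample line. The variable names are illustrative and not part of the original tool; the only change to the pattern is escaping the slash so it fits in a regex literal.

```javascript
// Lookahead-only pattern from the question (the "/" is escaped for the literal).
const lookaheadPattern = /<[a-zA-Z]+((.(?!<\/)(?!<[a-zA-Z]+))*)?>/;

// The problematic template line. It contains no "${" sequence and no backticks,
// so a template literal holds it verbatim.
const sampleLine = `<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>><?php echo $img; ?></option>`;

// What the opening tag should be: everything up to and including the ">" after "?>".
const wantedTag = sampleLine.slice(0, sampleLine.indexOf('?>>') + 3);

const found = sampleLine.match(lookaheadPattern)[0];

console.log(found === wantedTag);             // false
console.log(found.length > wantedTag.length); // true: the match runs on into the element content
console.log(found);
```

The lookaheads only stop the match in front of `</` or another `<tag`, so the PHP block that follows the real closing bracket is swallowed as well.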

Am I approaching the problem completely wrong, or is the lookbehind really necessary in my case? Any help is greatly appreciated.

6
  • stackoverflow.com/questions/1732348/… Commented Nov 7, 2011 at 15:00
  • Hm... maybe I shouldn't have added the (parsing) tag to the question. The tool I was developing isn't anything close to a real parser. It's more like a text processing tool that eats opening tags, or sometimes simple HTML elements with an opening tag, innerHTML and a closing tag. Nothing complex: no nested tags, no crappy code. I'm the one who writes the templates I will feed it, so what I'm asking for is really a simple JavaScript regex that will match the opening tag of an HTML element and break it down into normal attributes and attributes that involve PHP code. Commented Nov 7, 2011 at 15:17
  • Or, to make things even simpler, the HTML I'm planning to test against the pattern will be just the opening tag part. Out of curiosity, I was wondering: if I take a simple element like <td>foo</td> and test it against the pattern, could I get only the opening tag as a result, making sure the match doesn't end with a closing PHP tag instead of the closing HTML bracket? Commented Nov 7, 2011 at 15:21
  • JavaScript with DOM already provides a way to parse HTML. Why not use it? Commented Nov 7, 2011 at 15:47
  • 1
    The browser's parser will probably choke on the PHP. You could replace the PHP code with HTML entities or something similar before feeding it to the browser's parser, and decode the entities afterwards. Also note that browsers will sometimes modify the DOM, for example by automatically creating closing elements, or by creating a tbody element if absent. Commented Nov 7, 2011 at 16:09

3 Answers

8

Just make sure the last character before the '>' is not a ?, using [^?]. No lookaheads or lookbehinds needed.

<[a-zA-Z](.*?[^?])?>

The parentheses and the trailing ? are there so that tags like <b> also match.

EDIT: The pattern above overmatched on single-character tags without attributes (for example, it matched all of <b>foo</b>). Here is one that handles them:

<[a-zA-Z]+(>|.*?[^?]>)
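
As a quick check, the sketch below compares the first attempt with the revised pattern. The constant names are made up for illustration, and the `<option>` line is the one from the question. The first attempt overmatches on `<b>foo</b>` because the optional group is tried before it is skipped; in the revised pattern the bare `>` alternative is tried first.

```javascript
const firstAttempt = /<[a-zA-Z](.*?[^?])?>/;
const revisedPattern = /<[a-zA-Z]+(>|.*?[^?]>)/;

// Single-character tag without attributes.
console.log('<b>foo</b>'.match(firstAttempt)[0]);   // "<b>foo</b>"
console.log('<b>foo</b>'.match(revisedPattern)[0]); // "<b>"

// The line from the question (no "${" or backticks, so a template literal is safe).
const optionLine = `<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>><?php echo $img; ?></option>`;

// The match skips both "?>" sequences and stops at the ">" that closes the tag.
const openingTag = optionLine.match(revisedPattern)[0];
console.log(openingTag);
```

Note that `.` does not match line breaks, so this sketch assumes the whole opening tag sits on one line.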

9 Comments

Hey, a little side note here. Gerben's suggestion works really well, but I think it captures both the opening and closing tag in <b>foo</b> if I provide it with the full definition of the HTML element :). I also wanted to put capturing groups around the tag name and the rest of the tag content. I modified it a little to include those features (note that it also extracts only the opening tags of nested elements, as in the string '<td><a title="Move Up"'). The new regex is: <([a-zA-Z]+)((.(?!</)(?!<[a-zA-Z]))*?[^?])?>
No, it shouldn't. In <b>foo</b> it will just match '<b>'. This is because '.*?' is ungreedy.
True. Interesting, though: I just tested it in Firefox. '<b>foo</b>'.match(new RegExp(/<[a-zA-Z](.*?[^?])?>/)); returns ["<b>foo</b>", ">foo</b"]
That's interesting. Seems like the parentheses part is greedy somehow. I edited my post with another solution.
The s flag is not standard and does not exist in Firefox.
3

A much simpler answer would be <[^/^>]+>
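
The following sketch probes where this pattern stops working. The test strings are illustrative assumptions, apart from the `<option>` fragment, which is taken from the question. Inside a character class only the leading `^` negates; the second one is a literal circumflex, so the class excludes `/`, `^` and `>`.

```javascript
const simplePattern = /<[^/^>]+>/;   // excludes "/", "^" and ">"
const withoutCaret = /<[^/>]+>/;     // excludes only "/" and ">"

// Plain opening tag: matched; closing tag: skipped.
console.log('<div></div>'.match(simplePattern)[0]);                 // "<div>"

// Any "/" inside the tag (for example in a URL) prevents a match.
console.log("<a href='www.a.com/ff'>".match(simplePattern));        // null

// The literal "^" makes the two variants differ.
console.log('<a^b>'.match(simplePattern));                          // null
console.log('<a^b>'.match(withoutCaret)[0]);                        // "<a^b>"

// With embedded PHP, the match ends at the first "?>" instead of the tag's ">".
const phpTag = '<option value="<?php echo $img; ?>" class="x">';
console.log(phpTag.match(simplePattern)[0]); // '<option value="<?php echo $img; ?>'
```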

4 Comments

This should be the answer.
We could save another byte by leaving out the second circumflex <[^/>]+>. Am I wrong?
<[^/^>]+> also matches, for example, '<let's see>' and '< 3 should return True? \nSo should 3 >' in "<let's see> how you handle this one. Did you know that 2 < 3 should return True? So should 3 > 2. But 2 > 3 is always False." (the string from Udacity course "Design of Computer Programs")
This doesn't work for anchors, where the character '/' may appear in links
0

Matching all opening tags (including anchors like <a src="https://www.google.com">), a bit simpler than the accepted answer:

<[^/][^>]*>

Example:

let str = "<div></div><hello></hello><a src='www.a.com/ff'></a>";
let regex = /<[^/][^>]*>/g;
let matches = str.match(regex);
console.log(matches);
