Create a Javascript RegExp to find opening tags in HTML/php template

Question

I'm trying to write a JavaScript HTML/PHP parser that extracts all opening tags from an HTML/PHP source and returns the tag type and the attributes with their values, while also tracking whether each value or attribute comes from static text or from PHP variables. The problem is composing the JavaScript RegExp pattern, specifically for certain rare cases. The patterns I was able to come up with either involve negative lookbehind (to cope with the closing PHP tag, that is, to match a closing bracket that is not preceded by a question mark) or fail in certain cases. The lookbehind version looks like:

<[a-zA-Z]+.*?(?<!\?)>

...and works perfectly, except that in my case I must avoid lookbehind. A more JavaScript-friendly version would be:

<[a-zA-Z]+((.(?!</)(?!<[a-zA-Z]+))*)?>

...which works except in this case:

<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>><?php echo $img; ?></option>
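
To make the failure concrete, here is a minimal sketch that runs the lookahead-based pattern against the sample line. The variable names are illustrative and not part of the original code. Neither lookahead triggers until the final </option>, so the greedy group runs past the real end of the tag and the match closes on the '>' of the last '?>'.

```javascript
// The sample line from the question, kept verbatim in a template literal.
const sample = `<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>><?php echo $img; ?></option>`;

// The lookbehind-free pattern from the question.
// The "/" is escaped only because this is a regex literal.
const lookaheadPattern = /<[a-zA-Z]+((.(?!<\/)(?!<[a-zA-Z]+))*)?>/;

const overMatch = sample.match(lookaheadPattern)[0];

// The match swallows the PHP block that follows the tag, so it ends in "?>".
console.log(overMatch.endsWith("; ?>")); // true

// Only the closing </option> is left out of the match.
console.log(overMatch === sample.slice(0, -"</option>".length)); // true
```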

Am I approaching the problem completely wrong, or is the lookbehind really necessary in my case? Any help is greatly appreciated.

Hm... maybe I shouldn't have added the (parsing) tag to the question. The tool I was developing isn't anything close to a real parser. It's more like a text-processing tool that eats opening tags, or sometimes simple HTML elements with an opening tag, innerHTML and a closing tag. Nothing complex: no nested tags, no crappy code. I'm the one who writes the templates I will feed it, so what I'm asking for is really a simple JavaScript regex that will match the opening tag of an HTML element and break it down into normal attributes and attributes that involve PHP code.
– CodeFan, Commented Nov 7, 2011 at 15:17
Or, to make things even simpler, the HTML I'm planning to test against the pattern will be just the opening tag part. Out of curiosity, I was wondering: if I take a simple element like <td>foo</td> and test it against the pattern, could I get only the opening tag as a result, making sure the match doesn't end with a closing PHP tag instead of the closing HTML bracket?
– CodeFan, Commented Nov 7, 2011 at 15:21
JavaScript with DOM already provides a way to parse HTML. Why not use it?
– Felix Kling, Commented Nov 7, 2011 at 15:47
The browser's parser will probably choke on the PHP. You could replace the PHP code with HTML entities or something similar before feeding it to the browser's parser, and decode the entities afterwards. Also note that the browser will sometimes modify the DOM, for example by automatically creating closing elements, or by creating a tbody element if it is absent.
– Gerben, Commented Nov 7, 2011 at 16:09

Gerben · Accepted Answer · 2011-11-07 20:51:34Z

8

Just make sure the last character before the '>' is not a '?', using [^?]. No lookaheads or lookbehinds needed.

<[a-zA-Z](.*?[^?])?>

The parentheses and the last ? are there so that bare tags like <b> also match.

EDIT The solution didn't work for single character tags without attributes. So here is one that does:

<[a-zA-Z]+(>|.*?[^?]>)
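
A quick sketch of how the edited pattern behaves on the two inputs discussed in this thread. The global flag and the variable names are added here for illustration. Note that '.' does not match line breaks, so this sketch assumes each tag sits on a single line.

```javascript
// The edited pattern, with the global flag added to collect every match.
const openingTag = /<[a-zA-Z]+(>|.*?[^?]>)/g;

// A simple element: only the opening tag should come back.
const simple = "<b>foo</b>";
console.log(simple.match(openingTag)); // ["<b>"]

// The sample line from the question, kept verbatim in a template literal.
const phpSample = `<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>><?php echo $img; ?></option>`;

const phpMatches = phpSample.match(openingTag);

// Both "?>" sequences inside the tag are skipped, because the character
// before ">" must not be "?". The match stops at the real closing bracket.
console.log(phpMatches.length); // 1
console.log(phpMatches[0].endsWith("?>>")); // true
```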

edited Nov 7, 2011 at 20:51

answered Nov 7, 2011 at 15:55

Gerben


9 Comments

CodeFan Over a year ago

Hey, a little side note here. Gerben's suggestion works really well, but I think it captures both the opening and closing tag in <b>foo</b> if I give it the full definition of the HTML element :). I also wanted to put capturing groups around the tag name and the rest of the tag content. I modified it a little to include those features (note that it also extracts only the opening tags of nested elements, as in the string '<td><a title="Move Up"'). The new regex is: <([a-zA-Z]+)((.(?!</)(?!<[a-zA-Z]))*?[^?])?>
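
As a rough illustration of the capturing groups in that modified pattern, the sketch below pulls the tag name (group 1) and the rest of the tag (group 2) out of a nested snippet. The input string, the global flag and the variable names are assumptions made for this example.

```javascript
// CodeFan's pattern: group 1 is the tag name, group 2 the rest of the tag.
// The "/" in the lookahead is escaped because this is a regex literal.
const tagParts = /<([a-zA-Z]+)((.(?!<\/)(?!<[a-zA-Z]))*?[^?])?>/g;

// Hypothetical nested input, loosely based on the string in the comment.
const nested = '<td><a title="Move Up">up</a></td>';

// Group 2 is undefined when the tag has no attributes, so default it to "".
const parts = [...nested.matchAll(tagParts)].map((m) => ({
  name: m[1],
  rest: m[2] ?? "",
}));

console.log(parts);
// Two entries: name "td" with an empty rest,
// and name "a" with the rest ' title="Move Up"'.
```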

Gerben Over a year ago

No, it shouldn't. In <b>foo</b> it will just match '<b>'. This is because '.*?' is ungreedy.

CodeFan Over a year ago

True. Interesting though, I just tested it in Firefox. '<b>foo</b>'.match(new RegExp(/<[a-zA-Z](.*?[^?])?>/)); returns ["<b>foo</b>", ">foo</b"]

Gerben Over a year ago

That's interesting. Seems like the parentheses part is greedy somehow. I edited my post with another solution.
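
One way to read that result: the ? that makes the parenthesised group optional is itself a greedy quantifier, so the engine tries the group first and skips it only if no match can be found with it. The sketch below compares the original pattern with a lazy ?? variant. The ?? form is an assumption added here for illustration and was not proposed in the thread.

```javascript
const bold = "<b>foo</b>";

// Original pattern: the optional group is tried first, and ".*?" keeps
// expanding until "[^?]" can be followed by ">", which only happens at
// the end of "</b>".
const greedyOptional = /<[a-zA-Z](.*?[^?])?>/;
console.log(bold.match(greedyOptional)[0]); // "<b>foo</b>"

// Hypothetical variant: "??" tries skipping the group before using it.
const lazyOptional = /<[a-zA-Z](.*?[^?])??>/;
console.log(bold.match(lazyOptional)[0]); // "<b>"
console.log("<td>foo</td>".match(lazyOptional)[0]); // "<td>"
```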

user128511 Over a year ago

the s flag is not standard and does not exist in firefox


user31481 · Answer · 2015-11-20 10:32:31Z

3

A much simpler answer would be <[^/^>]+>

answered Nov 20, 2015 at 10:32

user31481


4 Comments

AspiringCanadian Over a year ago

This should be the answer.

yckart Over a year ago

We could save another byte by leaving out the second circumflex <[^/>]+>. Am I wrong?

Nil Over a year ago

<[^/^>]+> also matches, for example, '<let's see>' and '< 3 should return True? \nSo should 3 >' in "<let's see> how you handle this one. Did you know that 2 < 3 should return True? So should 3 > 2. But 2 > 3 is always False." (the string from Udacity course "Design of Computer Programs")

user7870824 Over a year ago

This doesn't work for anchors, where the character '/' may appear in links
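
The two objections above can be reproduced with a short sketch. It uses the shortened form of the pattern from yckart's comment, the sentence quoted by Nil (written on one line here), and a made-up anchor whose URL is only a placeholder.

```javascript
// Shortened form of the pattern, with the global flag added.
// The "/" inside the class is escaped to keep the literal unambiguous.
const noSlash = /<[^\/>]+>/g;

// Plain text with stray angle brackets, taken from Nil's comment.
const prose = "<let's see> how you handle this one. Did you know that 2 < 3 should return True? So should 3 > 2. But 2 > 3 is always False.";
console.log(prose.match(noSlash));
// ["<let's see>", "< 3 should return True? So should 3 >"]

// An anchor whose URL contains "/" (example.com is a placeholder).
const anchor = '<a href="https://example.com/page">';
console.log(anchor.match(noSlash)); // null
```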

OfirD · Answer · 2023-06-14 14:01:34Z

0

Matching all opening tags (including anchors like <a src="https://www.google.com">), a bit simpler than the accepted answer:

<[^/][^>]*>

Example:

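const str = "<div></div><hello></hello><a src='www.a.com/ff'></a>";
// "<", then one character that is not "/", then anything up to the next ">"
const regex = /<[^/][^>]*>/g;
const matches = str.match(regex);
console.log(matches); // ["<div>", "<hello>", "<a src='www.a.com/ff'>"]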
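
For comparison, here is a sketch of the same pattern applied to the PHP template line from the question. The variable names are illustrative. Because [^>]* stops at the first '>', the match for the option tag is cut off at the first '?>', and each PHP block is reported as if it were an opening tag, so this pattern seems better suited to plain HTML than to the mixed template.

```javascript
// The sample line from the question, kept verbatim in a template literal.
const phpLine = `<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>><?php echo $img; ?></option>`;

// Same pattern as above; the "/" in the class is escaped for clarity.
const anyOpening = /<[^\/][^>]*>/g;

const found = phpLine.match(anyOpening);

// One truncated <option ...> match plus two "<?php ... ?>" blocks.
console.log(found.length); // 3
console.log(found[0]); // <option value="<?php echo $img; ?>
```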

answered Jun 14, 2023 at 14:01

OfirD
