PHP RegEx for BBCode multi-parameter

Question

This script finds BBCode tags, together with their parameters and values, in a piece of text:

<?php
preg_match_all(
    '#\[(link)(.*?)!?\](.*?)\[\/\\1\]#i', 
    '[link href="http://www.google.com" title="Google" target="_blank"]Google[/link]
     [link href="http://www.facebook.com"]Facebook[/link]
     [link href=\'http://www.twitter.com\' rel="nofollow"]Twitter[/link]', 
    $StrMatches
);

/* $StrMatches[0] = Full tag string
 * $StrMatches[1] = Tag name
 * $StrMatches[2] = tag params string
 * $StrMatches[3] = Tag content
 * */
print_r($StrMatches);


$ParamList = array();

foreach ($StrMatches[2] as $TagParamStr )
{
   // Inside a character class "|" is a literal pipe, not alternation, so it is left out.
   preg_match_all('#\s*([^=]+)=[\'"]([^\'"]*)[\'"]#', $TagParamStr, $ParamMatches);
   array_push($ParamList, $ParamMatches);
}

/* $ParamList[n][0] = Full param strings of the n-th tag
 * $ParamList[n][1] = Param names of the n-th tag
 * $ParamList[n][2] = Param values of the n-th tag
 * */
print_r($ParamList);

Output:

 Array
(
[0] => Array
    (
        [0] => [link href="http://www.google.com" title="Google" target="_blank"]Google[/link]
        [1] => [link href="http://www.facebook.com"]Facebook[/link]
        [2] => [link href='http://www.twitter.com' rel="nofollow"]Twitter[/link]
    )

[1] => Array
    (
        [0] => link
        [1] => link
        [2] => link
    )

[2] => Array
    (
        [0] =>  href="http://www.google.com" title="Google" target="_blank"
        [1] =>  href="http://www.facebook.com"
        [2] =>  href='http://www.twitter.com' rel="nofollow"
    )

[3] => Array
    (
        [0] => Google
        [1] => Facebook
        [2] => Twitter
    )

) 
Array
(
[0] => Array
    (
        [0] => Array
            (
                [0] =>  href="http://www.google.com"
                [1] =>  title="Google"
                [2] =>  target="_blank"
            )

        [1] => Array
            (
                [0] => href
                [1] => title
                [2] => target
            )

        [2] => Array
            (
                [0] => http://www.google.com
                [1] => Google
                [2] => _blank
            )

    )

[1] => Array
    (
        [0] => Array
            (
                [0] =>  href="http://www.facebook.com"
            )

        [1] => Array
            (
                [0] => href
            )

        [2] => Array
            (
                [0] => http://www.facebook.com
            )

    )

[2] => Array
    (
        [0] => Array
            (
                [0] =>  href='http://www.twitter.com'
                [1] =>  rel="nofollow"
            )

        [1] => Array
            (
                [0] => href
                [1] => rel
            )

        [2] => Array
            (
                [0] => http://www.twitter.com
                [1] => nofollow
            )

    )

)

The code works fine, but I would like to optimize it so that it uses a single RegEx.

How can I combine the two patterns into one RegEx?

sorry for my bad English :(

Sam · Accepted Answer · 2014-05-30 15:35:06Z


Short Answer:

Not really possible in the way that you would think, since a regular expression must capture a fixed, predefined set of groups. Ideally a single match would capture every attribute name and value of a link, but since the number of attributes varies from tag to tag, this is impossible. If we tried to repeat a capture group 1+ times, it would match the whole string but only keep the last occurrence in the group.
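To illustrate the limitation, here is a minimal sketch that repeats one capture group over all attributes of a tag. The pattern and the sample tag are assumptions made up for this illustration, not part of the final solution:

```php
<?php
// Hypothetical one-shot pattern: a single repeated group for every attribute.
$tag     = '[link href="http://www.google.com" title="Google" target="_blank"]';
$pattern = '#\[link(?:\s+(\w+)="([^"]*)")+\]#';

preg_match($pattern, $tag, $m);

// The whole tag is matched, but groups 1 and 2 only hold the final repetition.
$lastName  = $m[1];
$lastValue = $m[2];

echo $lastName, ' => ', $lastValue, PHP_EOL; // target => _blank
```

The href and title attributes are consumed by the match but are not available in any capture group, which is why the approach below splits each link into several matches instead.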

However, it is possible to match and capture all of this data with one expression. Each link will be split into multiple matches though, each containing part of the data. In my example I used capture group 1 for the attribute's name, capture group 2 for the attribute's value, and capture group 3 for the link's content. If one of these items does not exist in a given match, its capture group is left empty.

Explanation:

(?# START OF LINK)
(?:         (?# start non-capture group)
  \[link    (?# match [link literally)
 |          (?# OR)
  (?!^)     (?# assertion to make sure we aren't at the beginning of the string)
  \G        (?# start at the end of last match)
)           (?# end non-capture group)
\K          (?# throw everything to the left away)

(?# START OF CAPTURING)
(?:         (?# start non-capture group)
  \s+       (?# match 1+ whitespace characters)
  ([^=\s]+) (?# capture attribute)
  =         (?# match = literally)
  ["']      (?# match ' or ")
  (.*?)     (?# lazily capture attribute's value)
  ["']      (?# match ' or ")
 |          (?# OR)
  \s*       (?# optionally match whitespace characters)
  \]        (?# match ] literally)
  (.*?)     (?# lazily capture link's value)
  \[/link\] (?# match [/link] literally)
)           (?# end non-capture group)


The key to this is \G and \K. The first time the RegEx engine makes a match, it starts at [link, and everything matched so far is thrown away by \K. The capturing part then finds and grabs an attribute and its value, and the match is over. On the next attempt the engine cannot find a [link at the current position, so \G lets it resume from the end of the previous match (just after the last attribute), and \K again discards whatever was matched to get there. From here it either finds another attribute, or it takes the second alternative and matches the end of the link, capturing the link's content in the third capture group. After that, the next attempt finds another [link and the whole process repeats.
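The sequence of matches described above can be made visible with a short trace. This is only a sketch; the sample text and its example URLs are assumptions, while the pattern is the one used in the PHP section of this answer:

```php
<?php
$regex = '#(?:\[link|(?!^)\G)\K(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';

// Hypothetical sample input with two links.
$text = '[link href="http://a.example" rel="nofollow"]A[/link] '
      . '[link href="http://b.example"]B[/link]';

preg_match_all($regex, $text, $steps, PREG_SET_ORDER);

// Describe what each individual match contributed.
$trace = array();
foreach ($steps as $step) {
    $trace[] = isset($step[3])
        ? 'end of link, content: ' . $step[3]
        : 'attribute: ' . $step[1] . ' = ' . $step[2];
}

print_r($trace);
```

Each attribute produces its own match, and each link ends with one match that fills group 3, so the first link yields three matches and the second yields two.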

Update: the (?!^) before the \G is what solves the problems raised in your comments. \G matches not only at the end of the last match, but also at the beginning of the string. We want to make sure we are inside a link ([link) before we start matching attributes, so we don't want \G to match at the beginning of the string. This negative lookahead asserts just that.
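A small comparison shows the effect of the guard. The sample text is an assumption chosen so that something looking like an attribute sits at the very start of the string, outside any link:

```php
<?php
// Hypothetical input: attribute-like text before the first [link.
$text = ' class="note" [link href="http://a.example"]A[/link]';

$tail         = '\K(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';
$withoutGuard = '#(?:\[link|\G)' . $tail;       // \G may match at offset 0
$withGuard    = '#(?:\[link|(?!^)\G)' . $tail;  // \G is rejected at offset 0

preg_match_all($withoutGuard, $text, $loose, PREG_SET_ORDER);
preg_match_all($withGuard, $text, $strict, PREG_SET_ORDER);

echo $loose[0][1], PHP_EOL;  // class (picked up outside any link)
echo $strict[0][1], PHP_EOL; // href  (first attribute inside the link)
```

Without (?!^), the stray class="note" is reported as if it belonged to a link; with it, matching only begins once a [link has been seen.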

PHP:

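// $html holds the BBCode text to parse (for example, the sample string from the question).
$regex = '#(?:\[link|(?!^)\G)\K(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';
preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

$links = array();
$reset = true; // true when the next match starts a new link

foreach($matches as $match) {
    if($reset) {
        // Start a fresh entry for the link that begins with this match.
        $links[] = array(
            'params' => array(),
            'value' => null
        );

        $reset = false;
    }

    // Point at the entry of the link currently being assembled.
    end($links);
    $key = key($links);

    if(isset($match[3])) {
        // Closing part: group 3 holds the link's content, so this link is complete.
        $links[$key]['value'] = $match[3];
        $reset = true;
    } else {
        // Attribute part: group 1 is the name, group 2 is the value.
        $links[$key]['params'][$match[1]] = $match[2];
    }
}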

var_dump($links);

Output:

array(3) {
  [0]=>
  array(2) {
    ["params"]=>
    array(3) {
      ["href"]=>
      string(21) "http://www.google.com"
      ["title"]=>
      string(6) "Google"
      ["target"]=>
      string(6) "_blank"
    }
    ["value"]=>
    string(6) "Google"
  }
  [1]=>
  array(2) {
    ["params"]=>
    array(1) {
      ["href"]=>
      string(23) "http://www.facebook.com"
    }
    ["value"]=>
    string(8) "Facebook"
  }
  [2]=>
  array(2) {
    ["params"]=>
    array(2) {
      ["href"]=>
      string(22) "http://www.twitter.com"
      ["rel"]=>
      string(8) "nofollow"
    }
    ["value"]=>
    string(7) "Twitter"
  }
}

edited May 30, 2014 at 15:35

answered May 29, 2014 at 15:22


6 Comments

ar099968 Over a year ago

Look here: 3v4l.org/Qou7q. If the subject string does not begin with [link, it does not work well. Output: Array ([params] => Array ( ) [value] => www.google.com)

Sam Over a year ago

Change \[link to .*?\[link, I'll update the answer when back at my PC.

ar099968 Over a year ago

I found another problem: if there are two tags on the same line, it does not work. regex101.com/r/cD4sT1

Sam Over a year ago

@ar099968 that's what happens when you try to fix things on a phone: you cause more problems. I changed my answer to use (?!^), which fixes both problems from these comments. I also added the s modifier and tweaked the pattern slightly so that multi-line links work. I think this should cover everything.

Sam Over a year ago

@sndesign if you remove the \K from the regular expression, then nothing will be removed from the matches. You can then reference this in the foreach loop with $match[0] or reset($match) and append the match value of each part of the link. See the demo!
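
A minimal sketch of what the last comment describes, assuming the same pattern with \K removed and a made-up sample text. Appending $match[0] from each part rebuilds the full text of every tag:

```php
<?php
// Same pattern as in the answer, but without \K, so $match[0] keeps the matched text.
$regex = '#(?:\[link|(?!^)\G)(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';

// Hypothetical sample input.
$text = 'See [link href="http://a.example" rel="nofollow"]A[/link] now';

preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

$tags = array();
$full = '';

foreach ($matches as $match) {
    $full .= $match[0];        // append this part of the link
    if (isset($match[3])) {    // closing part reached: the tag is complete
        $tags[] = $full;
        $full = '';
    }
}

print_r($tags);
```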
