This script extracts the "BBCode" tags (with their parameters and values) from a text (online test):

<?php
preg_match_all(
    '#\[(link)(.*?)!?\](.*?)\[\/\\1\]#i', 
    '[link href="http://www.google.com" title="Google" target="_blank"]Google[/link]
     [link href="http://www.facebook.com"]Facebook[/link]
     [link href=\'http://www.twitter.com\' rel="nofollow"]Twitter[/link]', 
    $StrMatches
);

/* $StrMatches[0] = Full tag string
 * $StrMatches[1] = Tag name
 * $StrMatches[2] = tag params string
 * $StrMatches[3] = Tag content
 * */
print_r($StrMatches);


$ParamList = array();

foreach ($StrMatches[2] as $TagParamStr )
{
   // Inside a character class "|" is a literal pipe, not an OR, so it is left out here.
   preg_match_all('#\s*([^=]+)=[\'"]([^\'"]*)[\'"]#', $TagParamStr, $ParamMatches);
   array_push($ParamList, $ParamMatches);
}

/* For each tag n:
 * $ParamList[n][0] = Full param strings
 * $ParamList[n][1] = Param names
 * $ParamList[n][2] = Param values
 * */
print_r($ParamList);

Output:

Array
(
[0] => Array
    (
        [0] => [link href="http://www.google.com" title="Google" target="_blank"]Google[/link]
        [1] => [link href="http://www.facebook.com"]Facebook[/link]
        [2] => [link href='http://www.twitter.com' rel="nofollow"]Twitter[/link]
    )

[1] => Array
    (
        [0] => link
        [1] => link
        [2] => link
    )

[2] => Array
    (
        [0] =>  href="http://www.google.com" title="Google" target="_blank"
        [1] =>  href="http://www.facebook.com"
        [2] =>  href='http://www.twitter.com' rel="nofollow"
    )

[3] => Array
    (
        [0] => Google
        [1] => Facebook
        [2] => Twitter
    )

) 
Array
(
[0] => Array
    (
        [0] => Array
            (
                [0] =>  href="http://www.google.com"
                [1] =>  title="Google"
                [2] =>  target="_blank"
            )

        [1] => Array
            (
                [0] => href
                [1] => title
                [2] => target
            )

        [2] => Array
            (
                [0] => http://www.google.com
                [1] => Google
                [2] => _blank
            )

    )

[1] => Array
    (
        [0] => Array
            (
                [0] =>  href="http://www.facebook.com"
            )

        [1] => Array
            (
                [0] => href
            )

        [2] => Array
            (
                [0] => http://www.facebook.com
            )

    )

[2] => Array
    (
        [0] => Array
            (
                [0] =>  href='http://www.twitter.com'
                [1] =>  rel="nofollow"
            )

        [1] => Array
            (
                [0] => href
                [1] => rel
            )

        [2] => Array
            (
                [0] => http://www.twitter.com
                [1] => nofollow
            )

    )

)

The code works fine, but I would like to optimize it so that it uses a single RegEx.

How can I do all of this with one RegEx?

sorry for my bad English :(

1 Answer

Short Answer:

Not really possible in the way that you would think, since a regular expression must have a fixed set of capture groups. Ideally a single match would capture param1, param2, and the value, but since the number of attributes varies, this is impossible. If we try to repeat a capture group one or more times, the pattern matches the whole string but only captures the last occurrence, as shown in this quick demo.
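To make the "only the last occurrence" point concrete, here is a minimal sketch. The one-shot pattern below is hypothetical (it is not the pattern recommended in this answer); it simply wraps the attribute part in a repeated group:

```php
<?php
// Hypothetical one-shot pattern: the attribute part is repeated with (?: ... )+
$pattern = '#\[link(?:\s+(\w+)=["\'](.*?)["\'])+\](.*?)\[/link\]#si';
$subject = '[link href="http://www.google.com" title="Google" target="_blank"]Google[/link]';

preg_match($pattern, $subject, $m);

// The whole tag matches, but groups 1 and 2 keep only the last repetition.
echo $m[1], ' = ', $m[2], "\n"; // target = _blank
echo $m[3], "\n";               // Google
```

The href and title attributes were matched along the way, but their captures were overwritten by the final repetition.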

However, it is possible to match and capture all of this data with one expression. Each link will be split into multiple matches though, each containing part of the data. In my example I used capture group 1 for the attribute, capture group 2 for the attribute's value, and capture group 3 for the link's value. If one of these items does not take part in a match, its capture group is left empty (in PHP's result array, an empty string or a missing entry).


Explanation:

(?# START OF LINK)
(?:         (?# start non-capture group)
  \[link    (?# match [link literally)
 |          (?# OR)
  (?!^)     (?# assertion to make sure we aren't at the beginning of the string)
  \G        (?# start at the end of last match)
)           (?# end non-capture group)
\K          (?# throw everything to the left away)

(?# START OF CAPTURING)
(?:         (?# start non-capture group)
  \s+       (?# match 1+ whitespace characters)
  ([^=\s]+) (?# capture attribute)
  =         (?# match = literally)
  ["']      (?# match ' or ")
  (.*?)     (?# lazily capture attribute's value)
  ["']      (?# match ' or ")
 |          (?# OR)
  \s*       (?# optionally match whitespace characters)
  \]        (?# match ] literally)
  (.*?)     (?# lazily capture link's value)
  \[/link\] (?# match [/link] literally)
)           (?# end non-capture group)
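The annotated pattern can also be run more or less as written by using free-spacing mode. This is only a sketch: the ~ delimiter, the x modifier, and the sample $subject are choices made for this illustration, not part of the original pattern.

```php
<?php
// Free-spacing (x) version of the annotated pattern; whitespace and # comments are ignored.
$regex = '~
    (?: \[link | (?!^) \G )               # literal [link, or resume where the last match ended
    \K                                    # drop everything matched so far
    (?:
        \s+ ([^=\s]+) = ["\'] (.*?) ["\'] # group 1: attribute name, group 2: attribute value
      |
        \s* \] (.*?) \[/link\]            # group 3: text between the tags
    )
~six';

$subject = '[link href=\'http://www.twitter.com\' rel="nofollow"]Twitter[/link]';
preg_match_all($regex, $subject, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    // Group 3 is only present on the match that closes the link.
    echo isset($match[3]) ? "value: {$match[3]}\n" : "{$match[1]} = {$match[2]}\n";
}
```

A different delimiter is needed here because # would otherwise clash with the comments inside the pattern.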

The key to this is the \G and \K. The first time the RegEx engine makes a match, it starts at [link, and everything matched so far gets thrown away with \K. Then we move on to the capturing part, where we find and grab an attribute and its value. That match is then over. On the next attempt the engine can't find a [link at the current position, so it uses \G to resume from the end of the last attribute. Everything gets thrown away again with \K. It may find another attribute, or it may hit the alternation and match the end of the link with the third capture group. When the regular expression starts over after that, it looks for the next [link and does it all over again.
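The sequence of matches described above can be observed directly. A small sketch, assuming a made-up $subject with some text before the link; the $steps array is only there for illustration:

```php
<?php
$regex = '#(?:\[link|(?!^)\G)\K(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';
$subject = 'Intro [link href="http://www.google.com" title="Google"]Google[/link]';

preg_match_all($regex, $subject, $matches, PREG_SET_ORDER);

// One entry per match: attributes first, then the closing part of the link.
$steps = array();
foreach ($matches as $match) {
    if (isset($match[3])) {
        $steps[] = 'end of link, value: ' . $match[3];
    } else {
        $steps[] = 'attribute: ' . $match[1] . ' => ' . $match[2];
    }
}
print_r($steps);
```

This should list two attribute matches (href, title) followed by one match carrying the link's value.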

Update: the (?!^) before the \G is what solves the problems from your comments. \G not only matches at the end of your last match, but also at the beginning of the string. We want to make sure we are inside a link ([link) before we start matching anything, so we don't want \G to match at the beginning of the string. This negative lookahead asserts just that.
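A quick way to see what the guard does is to run the pattern with and without it. The input below is a hypothetical string that starts with attribute-like text outside of any link:

```php
<?php
// Same pattern with and without the (?!^) guard in front of \G.
$guarded   = '#(?:\[link|(?!^)\G)\K(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';
$unguarded = '#(?:\[link|\G)\K(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';

// Hypothetical input: attribute-like text at the very start, outside any [link].
$subject = ' class="intro" [link href="http://www.google.com"]Google[/link]';

$withGuard    = preg_match_all($guarded, $subject, $a, PREG_SET_ORDER);
$withoutGuard = preg_match_all($unguarded, $subject, $b, PREG_SET_ORDER);

echo "with guard: $withGuard matches, first attribute: {$a[0][1]}\n";
echo "without guard: $withoutGuard matches, first attribute: {$b[0][1]}\n";
```

Without the guard, \G anchors at offset 0 and the stray class="intro" is picked up as if it belonged to a link.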


PHP:

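// $html holds the text to parse; here it is the sample input from the question.
$html = '[link href="http://www.google.com" title="Google" target="_blank"]Google[/link]
     [link href="http://www.facebook.com"]Facebook[/link]
     [link href=\'http://www.twitter.com\' rel="nofollow"]Twitter[/link]';

$regex = '#(?:\[link|(?!^)\G)\K(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';
preg_match_all($regex, $html, $matches, PREG_SET_ORDER);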

$links = array();
$reset = true;

foreach($matches as $match) {
    if($reset) {
        $links[] = array(
            'params' => array(),
            'value' => null
        );

        $reset = false;
    }

    end($links);
    $key = key($links);

    if(isset($match[3])) {
        $links[$key]['value'] = $match[3];
        $reset = true;
    } else {
        $links[$key]['params'][$match[1]] = $match[2];
    }
}

var_dump($links);

Output:

array(3) {
  [0]=>
  array(2) {
    ["params"]=>
    array(3) {
      ["href"]=>
      string(21) "http://www.google.com"
      ["title"]=>
      string(6) "Google"
      ["target"]=>
      string(6) "_blank"
    }
    ["value"]=>
    string(6) "Google"
  }
  [1]=>
  array(2) {
    ["params"]=>
    array(1) {
      ["href"]=>
      string(23) "http://www.facebook.com"
    }
    ["value"]=>
    string(8) "Facebook"
  }
  [2]=>
  array(2) {
    ["params"]=>
    array(2) {
      ["href"]=>
      string(22) "http://www.twitter.com"
      ["rel"]=>
      string(8) "nofollow"
    }
    ["value"]=>
    string(7) "Twitter"
  }
}
6 Comments

Look here: 3v4l.org/Qou7q. If the subject string does not begin with [link, it does not work well. Output: Array ([params] => Array ( ) [value] => www.google.com)
Change \[link to .*?\[link, I'll update the answer when back at my PC.
I found another problem: if there are two tags on the same line, it does not work. See regex101.com/r/cD4sT1
@ar099968 that's what happens when you try to fix things on a phone, you cause more problems. I changed my answer to use (?!^), which will fix both problems in these comments. I also added the s modifier and tweaked it slightly so multi-line links work. I think this should cover everything.
@sndesign if you remove the \K from the regular expression, then nothing will be removed from the matches. You can then reference this in the foreach loop with $match[0] or reset($match) and append the match value of each part of the link. See the demo!
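The suggestion in the comment above could look roughly like this. It is a sketch only: the $subject string and the $tags and $current variables are made up for the illustration, and the pattern is the one from the answer with \K removed.

```php
<?php
// Pattern from the answer with \K removed, so $match[0] keeps the full matched text.
$regex = '#(?:\[link|(?!^)\G)(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';
$subject = 'See [link href="http://www.google.com" title="Google"]Google[/link] and '
         . '[link href="http://www.facebook.com"]Facebook[/link].';

preg_match_all($regex, $subject, $matches, PREG_SET_ORDER);

$tags = array();
$current = '';
foreach ($matches as $match) {
    $current .= $match[0];       // append this part of the link
    if (isset($match[3])) {      // group 3 marks the end of a link
        $tags[] = $current;
        $current = '';
    }
}
print_r($tags);
```

Each entry of $tags should then hold one complete [link ...]...[/link] string.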