I'm iterating through each character in a string in PHP. Currently I'm using direct access:

 $len = strlen($str);
 // Start at offset 0 and stop before $len so every byte is visited exactly once
 for ($i = 0; $i < $len; $i++) {
     $char = $str[$i];
     ....
 }

That got me pondering a question that is probably purely academic. How does direct access work under the hood, and is there a string length at which a character loop would be faster (however slightly) if the string were first split into an array, so that the array's internal pointer keeps track of the position in memory?

TL;DR: Would accessing each member of a 5-million-item array be faster than accessing each character of a 5-million-character string directly?

5 Comments

  • You mean you are iterating over every BYTE of a string. Remember that UTF-8 and other multibyte encodings exist. Commented Jul 12, 2016 at 19:58
  • str_split will split into bytes as well. Commented Jul 12, 2016 at 20:00
  • All it takes is one 😎 to ruin your day if you're doing it byte by byte. Commented Jul 12, 2016 at 20:02
  • The string in PHP is implemented as an array of bytes (ref.) Commented Jul 12, 2016 at 20:06
  • Would it be so hard to measure it? Commented Jul 12, 2016 at 20:07

3 Answers

1

Accessing a string's bytes directly is faster by an order of magnitude than multibyte-aware access. Why? A string offset most likely maps straight to the memory location of that byte, so PHP can jump to that location, read one byte, and be done. Note that unless every character is a single byte, byte-offset access will not necessarily give you a usable character.

When accessing a potentially multi-byte string (via mb_substr), a number of additional steps are needed: determine whether the character is more than one byte, work out how many bytes it occupies, then read each of those bytes and return the (possibly multi-byte) character.
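
To make the byte/character distinction concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and that the string is UTF-8; the sample string and variable names are illustrative and are not part of the benchmark.

```php
<?php
// Illustrative sample: "é" (U+00E9) takes two bytes in UTF-8,
// so the byte count and the character count differ.
$str = "h\u{00E9}llo";

$byteLen = strlen($str);              // counts bytes
$charLen = mb_strlen($str, 'UTF-8');  // counts characters

// Byte access: offset 1 holds only the first byte of "é".
$byte = $str[1];

// Character access: mb_substr returns the whole two-byte sequence.
$char = mb_substr($str, 1, 1, 'UTF-8');

echo $byteLen, ' bytes, ', $charLen, " characters\n"; // 6 bytes, 5 characters
echo bin2hex($byte), ' vs ', bin2hex($char), "\n";    // c3 vs c3a9
```

The single byte `c3` is not a valid character on its own, which is why the two access styles are not interchangeable for multi-byte text.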

So I put together a simple test to show that byte-offset access is orders of magnitude faster (but will not give you a usable character if a multi-byte character sits at the given byte index). I grabbed the random character function from here ( Optimal function to create a random UTF-8 string in PHP? (letter characters only) ), then added the following:

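// 5,000,000 random UTF-8 characters (rand_str() comes from the linked question)
$str = rand_str( 5000000, 5000000 );
// One array element per byte; only used here to count the bytes
$bStr = unpack('C*', $str);

$len = count($bStr)-1;

$i = 0;
$startTime = microtime(true);
// Note: $i++ runs before the body, so offsets 1..$len are read and offset 0 is skipped
while($i++<$len) {
    $char = $str[$i];
}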
$endTime = microtime(true);

echo '<pre>Array access: ' . $len . ' items: ', $endTime-$startTime, ' seconds</pre>';


$i = 0;
$len = mb_strlen($str)-1;
$startTime = microtime(true);
while($i++<$len) {
    $char = mb_substr($str, $i, 1);
    if( $i >= 100000 ) {
        break;
    }
}
$endTime = microtime(true);

echo '<pre>Substring access: ' . ($len+1) . ' (limited to ' . $i . ') items: ', $endTime-$startTime, ' seconds</pre>';

You will notice that the mb_substr loop I have restricted to 100,000 characters. Why? It just takes too darn long to run through all 5,000,000 characters!

What were my results?

Array access: 12670380 items: 0.4850001335144 seconds

Substring access: 5000000 (limited to 100000) items: 17.00200009346 seconds

Notice that string offset access got through all 12,670,380 bytes (yes, 12.6 million bytes from 5 million characters, many of them multi-byte) in about half a second, while mb_substr, limited to 100,000 characters, took 17 seconds!


5 Comments

If you measure stuff, you should make sure the two algorithms do the same thing. Currently this is not the case: mb_substr() cuts out a correct representation of exactly one (in your test case) character. Array access will simply grab one byte from anywhere. These two algorithms are not comparable right now. Even if you restricted yourself to the UTF-8 case only, your array access example would have to detect whether it is dealing with part of a multi-byte character (the high bit is set, so ($byte & 128) === 128; the parentheses matter because === binds tighter than & in PHP), and would then have to look left and right to find the remaining bytes.
@Sven - Thanks, I did state that in my answer (they are not the same thing). I included it because that is, from my understanding, what was asked (array access vs individual characters).
Maybe you should compare array access to bare string functions without multibyte handling. I'm sure array access will win, but it might be less devastating for the string functions.
@Sven - I can include that later today; however, that would seem even further from the question, since the regular substr() function (when not mapped to mb_substr()) will not return a valid character when it reaches a multi-byte character.
That's the point: Array access does it wrong in the same way that substr() does it wrong. But your benchmark compares the wrong array access (which is fast, hence suggesting it is superior) with the correct access using mb_substr() (which is slow, but does a lot more things than simply accessing bytes - but maybe "slow === evil" for some less experienced reader).
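
For readers wondering what a byte-level loop would need in order to return whole characters, here is a minimal sketch of the lead-byte check discussed in these comments. It assumes well-formed UTF-8 input; `utf8SequenceLength` and `utf8Chars` are hypothetical helper names, not PHP built-ins, and this is not the code used in the benchmark above.

```php
<?php
/**
 * Hypothetical helper: number of bytes in the UTF-8 sequence that starts
 * with $lead (an integer 0-255), or 0 for a continuation or invalid byte.
 */
function utf8SequenceLength(int $lead): int
{
    // Parentheses are required: === binds tighter than & in PHP.
    if (($lead & 0x80) === 0x00) { return 1; } // 0xxxxxxx: ASCII
    if (($lead & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($lead & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($lead & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0;                                  // 10xxxxxx or invalid lead byte
}

/**
 * Walk the string by byte offset but yield whole characters.
 * Assumes well-formed UTF-8; stray bytes are yielded one at a time.
 */
function utf8Chars(string $str): Generator
{
    $len = strlen($str);
    for ($i = 0; $i < $len; ) {
        $n = utf8SequenceLength(ord($str[$i])) ?: 1;
        yield substr($str, $i, $n);
        $i += $n;
    }
}

// "a" (1 byte) + "é" (2 bytes) + "😎" (4 bytes) = 7 bytes, 3 characters
$chars = iterator_to_array(utf8Chars("a\u{00E9}\u{1F60E}"), false);
echo count($chars), "\n"; // 3
```

A loop like this does work comparable to what mb_substr() does per character, so it would be a fairer counterpart in a benchmark than raw byte access.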
1

The answer to your question is that your current method is highly likely the fastest way.

Why?

Since a string in PHP is just an array of bytes with one byte representing each character (when using UTF-8), there shouldn't be a theoretically faster form of array.

Moreover, any additional implementation of an array to which you'd copy the characters of your original string would add overhead and slow things down.
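
The copy overhead can be observed with a rough sketch like the following. The string size is illustrative (smaller than the 5 million in the question), and the exact figures depend on the PHP version and build, so no numbers are claimed here.

```php
<?php
// Illustrative input: 1,000,000 single-byte characters.
$str = str_repeat('abcdefghij', 100000);

$before = memory_get_usage();
$chars  = str_split($str);   // builds one array element per byte
$after  = memory_get_usage();

$stringBytes = strlen($str);
$arrayBytes  = $after - $before;

echo 'String payload: ', $stringBytes, " bytes\n";
echo 'Extra memory for the array copy: ', $arrayBytes, " bytes\n";
```

Each array element carries its own bookkeeping on top of the character it holds, so the copy is expected to be several times larger than the original string, and building it takes time before the loop even starts.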

If your string is highly limited in its contents (for instance, only allowing 16 characters instead of 256), there may be faster implementations, but that seems like an edge case.

3 Comments

In PHP strings are always arrays of bytes, independently from which encoding has been used.
@Sven - Thank you for pointing out the error in the way I stated it. I've updated my answer to better reflect what I was trying to get across.
Well, I'd say it's worse now. The essence of UTF-8 is that one byte is not one character. One character may be represented by up to 4 bytes. If you access string bytes individually, this variable length of characters will make things very complicated if you want to do string manipulation.
1

Quick answer (for non-multibyte strings, which may be what the OP was asking about, and useful to others as well): direct access is still faster, by about a factor of 2. Here's the code, based on the accepted answer, but doing an apples-to-apples comparison using substr() rather than mb_substr():

 $str = base64_encode(random_bytes(4000000));
 $len = strlen($str)-1;
 $i = 0;
 $startTime = microtime(true);
 while($i++<$len) {
     $char = $str[$i];
 }
 $endTime = microtime(true);

 echo '<pre>Array access: ' . $len . ' items: ', $endTime-$startTime, ' seconds</pre>';
 
 $i = 0;
 $len = strlen($str)-1;
 $startTime = microtime(true);
 while($i++<$len) {
     $char = substr($str, $i, 1);
 }
 $endTime = microtime(true);

 echo '<pre>Substring access: ' . ($len) . ' items: ', $endTime-$startTime, ' seconds</pre>';  

Note: I used base64 encoding of random bytes to create the random string, as rand_str was not a defined function. Maybe not the most random, but certainly random enough for testing.

My results:

Array access: 5333335 items: 0.40552091598511 seconds

Substring access: 5333335 items: 0.87574410438538 seconds

Note: I also tried $chars = preg_split('//', $str, -1, PREG_SPLIT_NO_EMPTY); and iterating through $chars. Not only was this slower, it also ran out of memory with a 5,000,000-character string.
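
The array idea from the question can be timed in the same style. Below is a minimal sketch that swaps in str_split() for the preg_split() call and uses a smaller, illustrative string so it fits in memory; `timeIt` is a hypothetical helper, and no timings are claimed because they depend on the machine and PHP version.

```php
<?php
// Hypothetical helper: run $fn once and return the elapsed seconds.
function timeIt(callable $fn): float
{
    $start = microtime(true);
    $fn();
    return microtime(true) - $start;
}

// 750,000 random bytes encode to exactly 1,000,000 base64 characters.
$str = base64_encode(random_bytes(750000));
$len = strlen($str);

$direct = timeIt(function () use ($str, $len) {
    for ($i = 0; $i < $len; $i++) {
        $char = $str[$i];
    }
});

$split = timeIt(function () use ($str) {
    // The cost of building the array is included on purpose.
    foreach (str_split($str) as $char) {
        // no-op, mirrors the loop above
    }
});

printf("Direct access: %d items: %.4f seconds\n", $len, $direct);
printf("str_split + foreach: %d items: %.4f seconds\n", $len, $split);
```

Counting the split inside the timed section reflects the real cost of the approach, since the array has to be built before it can be iterated.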
