153

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

5
  • 5
    This question might be related. Commented Oct 5, 2010 at 8:51
  • 1
    The reason is that it's very hard to think of a case where it's useful. Why do you want this? Commented Oct 5, 2010 at 9:02
  • 12
    @Glenn: Recently I saw a question about splitting a long string into chunks of n words. One of the solutions split the string and then returned a generator working on the result of split. That got me thinking if there was a way for split to return a generator to start with. Commented Oct 5, 2010 at 9:07
  • 6
    There is a relevant discussion on the Python Issue tracker: bugs.python.org/issue17343 Commented Apr 19, 2013 at 1:51
  • 2
    @GlennMaynard it can be useful for really large bare string/file parsing, but anybody can write generator parser himself very easy using self-brewed DFA and yield Commented Dec 5, 2018 at 6:50

19 Answers 19

106

It is highly probable that re.finditer uses fairly minimal memory overhead.

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

I have confirmed that this takes constant memory in python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was a growth in memory, it was far far less than the 1GB string).

More general version:

In reply to a comment "I fail to see the connection with str.split", here is a more general version:

def splitStr(string, sep="\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep=='':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
    # alternatively, more verbosely:
    regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
    for match in re.finditer(regex, string):
        fragment = match.group(1)
        yield fragment

The idea is that ((?!pat).)* 'negates' a group by ensuring it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state-machine). In pseudocode: repeatedly consume (begin-of-string xor {sep}) + as much as possible until we would be able to begin again (or hit end of string)

Demo:

>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>

>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']

>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']

>>> list(splitStr('.......A...b...c....', '\.\.\.'))
['', '', '.A', 'b', 'c', '.']

>>> list(splitStr('   A  b  c. '))
['', 'A', 'b', 'c.', '']

(One should note that str.split has an ugly behavior: it special-cases having sep=None as first doing str.strip to remove leading and trailing whitespace. The above purposefully does not do that; see the last example where sep="\s+".)

(I ran into various bugs (including an internal re.error) when trying to implement this... Negative lookbehind will restrict you to fixed-length delimiters so we don't use that. Almost anything besides the above regex seemed to result in errors with the beginning-of-string and end-of-string edge-cases (e.g. r'(.*?)($|,)' on ',,,a,,b,c' returns ['', '', '', 'a', '', 'b', 'c', ''] with an extraneous empty string at the end; one can look at the edit history for another seemingly-correct regex that actually has subtle bugs.)

(If you want to implement this yourself for higher performance (although they are heavweight, regexes most importantly run in C), you'd write some code (with ctypes? not sure how to get generators working with it?), with the following pseudocode for fixed-length delimiters: Hash your delimiter of length L. Keep a running hash of length L as you scan the string using a running hash algorithm, O(1) update time. Whenever the hash might equal your delimiter, manually check if the past few characters were the delimiter; if so, then yield substring since last yield. Special case for beginning and end of string. This would be a generator version of the textbook algorithm to do O(N) text search. Multiprocessing versions are also possible. They might seem overkill, but the question implies that one is working with really huge strings... At that point you might consider crazy things like caching byte offsets if few of them, or working from disk with some disk-backed bytestring view object, buying more RAM, etc. etc.)

Sign up to request clarification or add additional context in comments.

11 Comments

Excellent! I had forgotten about finditer. If one were interested in doing something like splitlines, I would suggest using this RE: '(.*\n|.+$)' str.splitlines chops off the trainling newline though (something that I don't really like...); if you wanted to replicated that part of the behavior, you could use grouping: (m.group(2) or m.group(3) for m in re.finditer('((.*)\n|(.+)$)', s)). PS: I guess the outer paren in the RE are not needed; I just feel uneasy about using | without paren :P
What about performance? re matching should be slower that ordinary search.
How would you rewrite this split_iter function to work like a_string.split("delimiter")?
split accepts regular expressions anyway so it's not really faster, if you want to use the returned value in a prev next fashion, look at my answer at the bottom...
str.split() does not accept regular expressions, that's re.split() you're thinking of...
|
20

The most efficient way I can think of it to write one using the offset parameter of the str.find() method. This avoids lots of memory use, and relying on the overhead of a regexp when it's not needed.

[edit 2016-8-2: updated this to optionally support regex separators]

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

This can be used like you want...

>>> print list(isplit("abcb","b"))
['a','c','']

While there is a little bit of cost seeking within the string each time find() or slicing is performed, this should be minimal since strings are represented as continguous arrays in memory.

Comments

15

Did some performance testing on the various methods proposed (I won't repeat them here). Some results:

  • str.split (default = 0.3461570239996945
  • manual search (by character) (one of Dave Webb's answer's) = 0.8260340550004912
  • re.finditer (ninjagecko's answer) = 0.698872097000276
  • str.find (one of Eli Collins's answers) = 0.7230395330007013
  • itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
  • str.split(..., maxsplit=1) recursion = N/A†

†The recursion answers (string.split with maxsplit = 1) fail to complete in a reasonable time, given string.splits speed they may work better on shorter strings, but then I can't see the use-case for short strings where memory isn't an issue anyway.

Tested using timeit on:

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

This raises another question as to why string.split is so much faster despite its memory usage.

3 Comments

This is because memory is slower than cpu and in this case, the list is loaded by chunks where as all the others are loaded element by element. On the same note, many academics will tell you linked lists are faster and have less complexity while your computer will often be faster with arrays, which it finds easier to optimise. You can't assume an option is faster than another, test it ! +1 for testing.
The problem arise in the next steps of a processing chain. If you then want to find an specific chunk and ignore the rest when you find it, then you have the justification to use a generator based split instead of the built-in solution.
One of the other reasons that the default (builtin) approach is faster is because the builtin is basically written in C, which is in general a faster language. The other solutions have more overhead from the Python interpreter.
11

This is generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: Corrected handling of surrounding whitespace if no separator chars are given.

2 Comments

why is this any better than re.finditer?
@ErikKaplun Because the regex logic for the items can be more complex than for their separators. In my case, I wanted to process each line individually, so I can report back if a line failed to match.
6

Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I'll just copy the docstring of the main str_split function:


str_split(s, *delims, empty=None)

Split the string s by the rest of the arguments, possibly omitting empty parts (empty keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. empty is then True by default.

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except this function is a generator.

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both 2 and 3 versions. The first lines of the function should be changed to:

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')

Comments

3

No, but it should be easy enough to write one using itertools.takewhile().

EDIT:

Very simple, half-broken implementation:

import itertools
import string

def isplitwords(s):
  i = iter(s)
  while True:
    r = []
    for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
      r.append(c)
    else:
      if r:
        yield ''.join(r)
        continue
      else:
        raise StopIteration()

11 Comments

@Ignacio: The example in docs uses a list of integers to illustrate the use of takeWhile. What would be a good predicate for splitting a string into words (default split) using takeWhile()?
Look for presence in string.whitespace.
The separator can have multiple characters, 'abc<def<>ghi<><>lmn'.split('<>') == ['abc<def', 'ghi', '', 'lmn']
@Ignacio: Can you add an example to your answer?
Easy to write, but many orders of magnitude slower. This is an operation that really should be implemented in native code.
|
3

I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over so you're not going to save any memory by having a generator.

If you wanted to write one it would be fairly easy though:

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

5 Comments

You'd halve the memory used, by not having to store a second copy of the string in each resulting part, plus the array and object overhead (which is typically more than the strings themselves). That generally doesn't matter, though (if you're splitting strings so large that this matters, you're probably doing something wrong), and even a native C generator implementation would always be significantly slower than doing it all at once.
@Glenn Maynard - I just realised that. I for some reason I originally the generator would store a copy of the string rather than a reference. A quick check with id() put me right. And obviously as strings are immutable you don't need to worry about someone changing the original string while you're iterating over it.
Isn't the main point in using a generator not the memory usage, but that you could save yourself having to split the whole string if you wanted to exit early? (That's not a comment on your particular solution, I was just surprised by the discussion about memory).
@Scott: It's hard to think of a case where that's really a win--where 1: you want to stop splitting partway through, 2: you don't know how many words you're splitting in advance, 3: you have a large enough string for it to matter, and 4: you consistently stop early enough for it to be a significant win over str.split. That's a very narrow set of conditions.
You can have much higher benefit if your string is lazily generated as well (e.g. from network traffic or file reads)
3

I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. whitespace delimited by default and you can specify a delimiter).

def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, incase it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both python 3 and python 2):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

python's regex module says that it does "the right thing" for unicode whitespace, but I haven't actually tested it.

Also available as a gist.

Comments

3

If you would also like to be able to read an iterator (as well as return one) try this:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

Usage

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']

Comments

3

more_itertools.split_at offers an analog to str.split for iterators.

>>> import more_itertools as mit


>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]

>>> "abcdcba".split("b")
['a', 'cdc', 'a']

more_itertools is a third-party package.

2 Comments

Note that more_itertools.split_at() is still using a newly allocated list on each call, so while this does return an iterator, it is not achieving the constant memory requirement. So depending on why you wanted an iterator to begin with, this may or may not be helpful.
@jcater Good point. The intermediate values are indeed buffered as sub lists within the iterator, according to its implementation. One could adapt the source to substitute lists with iterators, append with itertools.chain and evaluate results using a list comprehension. Depending on the need and request, I can post an example.
2

I wanted to show how to use the find_iter solution to return a generator for given delimiters and then use the pairwise recipe from itertools to build a previous next iteration which will get the actual words as in the original split method.


from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

note:

  1. I use prev & curr instead of prev & next because overriding next in python is a very bad idea
  2. This is quite efficient

Comments

2

Dumbest method, without regex / itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            text = text[end + 1:]

Comments

2

Very old question, but here is my humble contribution with an efficient algorithm:

def str_split(text: str, separator: str) -> Iterable[str]:
    i = 0
    n = len(text)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        i = j + 1

Comments

1

Implementation:

iter(io.StringIO(my_str))

Example usage:

>>> import io
>>> for x in iter(io.StringIO('hello')):
...   print(x)
...
hello
>>> for x in iter(io.StringIO('hello\nworld\n')):
...   print(x)
...
hello

world

Documentation: https://docs.python.org/3/library/io.html#io.StringIO

Comments

0
def split_generator(f,s):
    """
    f is a string, s is the substring we split on.
    This produces a generator rather than a possibly
    memory intensive list. 
    """
    i=0
    j=0
    while j<len(f):
        if i>=len(f):
            yield f[j:]
            j=i
        elif f[i] != s:
            i=i+1
        else:
            yield [f[j:i]]
            j=i+1
            i=i+1

1 Comment

why do you yield [f[j:i]]and not f[j:i]?
0

here is a simple response

def gen_str(some_string, sep):
    j=0
    guard = len(some_string)-1
    for i,s in enumerate(some_string):
        if s == sep:
           yield some_string[j:i]
           j=i+1
        elif i!=guard:
           continue
        else:
           yield some_string[j:]

Comments

0
def isplit(text, sep=None, maxsplit=-1):
    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep in ('', b''):
        raise ValueError('empty separator')

    if maxsplit == 0 or not text:
        yield text
        return

    regex = (
        re.escape(sep) if sep is not None
        else [br'\s+', r'\s+'][isinstance(text, str)]
    )
    yield from re.split(regex, text, maxsplit=max(0, maxsplit))

Comments

0

Here is an answer that is based on split and maxsplit. This does not use recursion.

def gsplit(todo):
    chunk= 100
    while todo:
        splits = todo.split(maxsplit=chunk)
        if len(splits) == chunk:
            todo = splits.pop()
        else:
            todo=None
        for item in splits:
            yield item

Comments

0
def splitter(string, delimiter=" "):
    start = end = 0
    while end < len(string):
        while end<len(string) and string[end] != delimiter:
            end += 1
        yield string[start: end]
        start = end = end +1
    return string[end:]

print(list(splitter("abdcabcd", "b")))

#> ['a', 'dca', 'cd']

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.