Is there a generator version of `string.split()` in Python?

Question

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

The reason is that it's very hard to think of a case where it's useful. Why do you want this?
– Glenn Maynard, Commented Oct 5, 2010 at 9:02
@Glenn: Recently I saw a question about splitting a long string into chunks of n words. One of the solutions split the string and then returned a generator working on the result of split. That got me wondering whether there was a way for split to return a generator to start with.
– Manoj Govindan, Commented Oct 5, 2010 at 9:07
There is a relevant discussion on the Python issue tracker: bugs.python.org/issue17343
– saffsd, Commented Apr 19, 2013 at 1:51
@GlennMaynard It can be useful for parsing really large bare strings or files, but anybody can easily write a generator parser themselves using a self-brewed DFA and yield.
– Dmitry Ponyatov, Commented Dec 5, 2018 at 6:50

ninjagecko · Accepted Answer · 2023-04-14 22:26:05Z

106

It is highly probable that re.finditer uses fairly minimal memory overhead.

import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

I have confirmed that this takes constant memory in python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was a growth in memory, it was far far less than the 1GB string).
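One way to reproduce that kind of measurement is with the standard-library tracemalloc module. The following is only a sketch under assumptions: the helper names `peak_bytes`, `count_lazily` and `count_eagerly` are made up for illustration, and the test string is far smaller than the 1GB one described above. No absolute numbers are claimed; they depend on the interpreter and platform.

```python
import re
import tracemalloc

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

def peak_bytes(consume, text):
    """Return the peak number of bytes allocated while consume(text) runs."""
    tracemalloc.start()
    consume(text)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

def count_lazily(text):
    # Only one fragment is alive at any moment.
    return sum(1 for _ in split_iter(text))

def count_eagerly(text):
    # The complete list of fragments exists before counting starts.
    return len(text.split())

# Built before tracing starts, so the source string itself is not counted.
text = "word " * 100_000

lazy_peak = peak_bytes(count_lazily, text)
eager_peak = peak_bytes(count_eagerly, text)
print("lazy peak bytes: ", lazy_peak)
print("eager peak bytes:", eager_peak)
```

The lazy peak should stay small and roughly independent of the input length, while the eager peak grows with the number of fragments.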

More general version:

In reply to a comment "I fail to see the connection with str.split", here is a more general version:

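import re

def splitStr(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep=='':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))

# alternatively, more verbosely. This has to be a separate function: a `yield`
# inside splitStr would turn it into a generator function, and its `return`
# statements would then produce an empty generator.
def splitStrVerbose(string, sep=r"\s+"):
    regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
    for match in re.finditer(regex, string):
        fragment = match.group(1)
        yield fragment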

The idea is that ((?!pat).)* 'negates' a group by ensuring it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state-machine). In pseudocode: repeatedly consume (begin-of-string xor {sep}) + as much as possible until we would be able to begin again (or hit end of string)
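To see the "negated group" on its own, here is a minimal sketch that uses `--` as an arbitrary example separator (my choice, not from the answer). The lookahead only peeks at the upcoming characters, so the repetition stops right before the separator without consuming it.

```python
import re

# "Any character that does not start a `--`", repeated greedily.
not_sep = re.compile(r"(?:(?!--).)*")

# A single '-' is consumed; the run stops right before the first '--'.
print(not_sep.match("ab-c--d").group(0))    # ab-c

# If the text starts with the separator, the run is empty.
print(repr(not_sep.match("--d").group(0)))  # ''
```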

Demo:

>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>

>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']

>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']

>>> list(splitStr('.......A...b...c....', r'\.\.\.'))
['', '', '.A', 'b', 'c', '.']

>>> list(splitStr('   A  b  c. '))
['', 'A', 'b', 'c.', '']

(One should note that str.split has an ugly behavior: it special-cases having sep=None as first doing str.strip to remove leading and trailing whitespace. The above purposefully does not do that; see the last example where sep="\s+".)
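For reference, the difference between the two modes of str.split can be seen in a short example; the sample string is arbitrary. The regex split on `\s+` is included because it behaves like the generator above: it collapses runs of whitespace but keeps the empty strings at both ends.

```python
import re

text = "  a  b "

# sep=None: runs of whitespace count as one separator, and no empty
# strings are produced at the start or end.
print(text.split())            # ['a', 'b']

# An explicit separator: every single occurrence splits, so empty strings appear.
print(text.split(" "))         # ['', '', 'a', '', 'b', '']

# Splitting on the regex \s+ collapses runs but keeps the empty edge pieces.
print(re.split(r"\s+", text))  # ['', 'a', 'b', '']
```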

(I ran into various bugs (including an internal re.error) when trying to implement this. Negative lookbehind would restrict you to fixed-length delimiters, so we don't use that. Almost anything besides the above regex seemed to result in errors with the beginning-of-string and end-of-string edge cases; e.g. r'(.*?)($|,)' on ',,,a,,b,c' returns ['', '', '', 'a', '', 'b', 'c', ''] with an extraneous empty string at the end. One can look at the edit history for another seemingly correct regex that actually has subtle bugs.)

(If you want to implement this yourself for higher performance (although they are heavyweight, regexes most importantly run in C), you'd write some native code (with ctypes? I'm not sure how to get generators working with it), following this pseudocode for fixed-length delimiters: Hash your delimiter of length L. Keep a running hash of the last L characters as you scan the string, using a rolling hash algorithm with O(1) update time. Whenever the running hash equals the delimiter's hash, manually check whether the last L characters really are the delimiter; if so, yield the substring since the last yield. Special-case the beginning and end of the string. This would be a generator version of the textbook O(N) text-search algorithm. Multiprocessing versions are also possible. They might seem overkill, but the question implies that one is working with really huge strings. At that point you might consider crazy things like caching byte offsets if there are few of them, working from disk with some disk-backed bytestring view object, buying more RAM, etc.)
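The pseudocode above can be sketched in pure Python to show the logic. This is an illustration only: the function name `rolling_hash_split` and the `base` and `mod` parameters are my own assumptions, it handles str input only, and in pure Python it will be slower than the regex version, since the point of the original suggestion is to do this in native code.

```python
def rolling_hash_split(text, sep, base=257, mod=(1 << 61) - 1):
    """Lazily split `text` on the fixed, non-empty string `sep`."""
    L = len(sep)
    if L == 0:
        raise ValueError("empty separator")
    n = len(text)
    if n < L:
        # The separator cannot occur at all.
        yield text
        return

    # Hash of the separator and of the first window text[0:L].
    target = window = 0
    for k in range(L):
        target = (target * base + ord(sep[k])) % mod
        window = (window * base + ord(text[k])) % mod
    high = pow(base, L - 1, mod)  # weight of the character leaving the window

    start = 0  # where the fragment currently being accumulated begins
    for i in range(n - L + 1):
        if i > 0:
            # O(1) update: drop text[i-1], append text[i+L-1].
            window = ((window - ord(text[i - 1]) * high) * base
                      + ord(text[i + L - 1])) % mod
        # `i >= start` skips windows that overlap the previous separator;
        # startswith() guards against hash collisions.
        if window == target and i >= start and text.startswith(sep, i):
            yield text[start:i]
            start = i + L
    yield text[start:]


print(list(rolling_hash_split("a<>b<>c", "<>")))  # ['a', 'b', 'c']
```

Matches that overlap the previous separator are skipped, which keeps the result in line with str.split for inputs such as "aaaa" split on "aa".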

edited Apr 14, 2023 at 22:26

answered Mar 19, 2012 at 12:41

ninjagecko


11 Comments

allyourcode Over a year ago

Excellent! I had forgotten about finditer. If one were interested in doing something like splitlines, I would suggest using this RE: '(.*\n|.+$)'. str.splitlines chops off the trailing newline though (something that I don't really like...); if you wanted to replicate that part of the behavior, you could use grouping: (m.group(2) or m.group(3) for m in re.finditer('((.*)\n|(.+)$)', s)). PS: I guess the outer parens in the RE are not needed; I just feel uneasy about using | without parens :P

anatoly techtonik Over a year ago

What about performance? re matching should be slower than an ordinary search.

Moberg Over a year ago

How would you rewrite this split_iter function to work like a_string.split("delimiter")?

Veltzer Doron Over a year ago

split accepts regular expressions anyway so it's not really faster, if you want to use the returned value in a prev next fashion, look at my answer at the bottom...

alexis Over a year ago

str.split() does not accept regular expressions, that's re.split() you're thinking of...


Eli Collins · Accepted Answer · 2016-09-02 23:34:16Z

The most efficient way I can think of is to write one using the start (offset) parameter of the str.find() method. This avoids heavy memory use, and avoids the overhead of a regexp when one isn't needed.

[edit 2016-8-2: updated this to optionally support regex separators]

import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

This can be used like you want...

>>> print(list(isplit("abcb", "b")))
['a', 'c', '']

While there is a small cost to seeking within the string each time find() or slicing is performed, it should be minimal since strings are represented as contiguous arrays in memory.

c z · Accepted Answer · 2017-02-21 16:51:43Z

15

Did some performance testing on the various methods proposed (I won't repeat them here). Some results:

str.split (default) = 0.3461570239996945
manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
re.finditer (ninjagecko's answer) = 0.698872097000276
str.find (one of Eli Collins's answers) = 0.7230395330007013
itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
str.split(..., maxsplit=1) recursion = N/A†

†The recursion answers (string.split with maxsplit = 1) fail to complete in a reasonable time. Given string.split's speed they may work better on shorter strings, but then I can't see the use case for short strings, where memory isn't an issue anyway.

Tested using timeit on:

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

This raises another question as to why string.split is so much faster despite its memory usage.

answered Feb 21, 2017 at 16:51

c z


3 Comments

Benoît P Over a year ago

This is because memory is slower than the CPU, and in this case the list is loaded in chunks, whereas all the others are loaded element by element. On the same note, many academics will tell you linked lists are faster and have lower complexity, while your computer will often be faster with arrays, which it finds easier to optimise. You can't assume one option is faster than another; test it! +1 for testing.

jgomo3 Over a year ago

The problem arises in the next steps of a processing chain. If you then want to find a specific chunk and ignore the rest once you find it, you have the justification to use a generator-based split instead of the built-in solution.

DeepThought42 Over a year ago

One of the other reasons that the default (builtin) approach is faster is because the builtin is basically written in C, which is in general a faster language. The other solutions have more overhead from the Python interpreter.
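The early-exit scenario raised in the comments above can be illustrated with a small sketch. The `find_split` helper and the sample log text are assumptions made for this example; the helper is a simplified str.find-based splitter along the lines of the answers in this thread.

```python
from itertools import islice

def find_split(text, sep):
    """Lazily yield the pieces of `text` separated by the non-empty string `sep`."""
    if not sep:
        raise ValueError("empty separator")
    start = 0
    while True:
        idx = text.find(sep, start)
        if idx == -1:
            yield text[start:]
            return
        yield text[start:idx]
        start = idx + len(sep)

log = "INFO boot;INFO ready;ERROR disk full;INFO retry;" * 100_000

# Stop at the first chunk of interest; the rest of `log` is never scanned.
first_error = next(chunk for chunk in find_split(log, ";")
                   if chunk.startswith("ERROR"))
print(first_error)  # ERROR disk full

# Or take only the first few pieces.
print(list(islice(find_split(log, ";"), 2)))  # ['INFO boot', 'INFO ready']
```

With str.split the whole string would be scanned and a list of 400,001 pieces built before the first one could be inspected.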

Bernd Petersohn · Accepted Answer · 2010-10-05 16:12:50Z

11

This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: Corrected handling of surrounding whitespace if no separator chars are given.

edited Oct 5, 2010 at 16:12

answered Oct 5, 2010 at 15:47

Bernd Petersohn


2 Comments

Erik Kaplun Over a year ago

why is this any better than re.finditer?

rovyko Over a year ago

@ErikKaplun Because the regex logic for the items can be more complex than for their separators. In my case, I wanted to process each line individually, so I can report back if a line failed to match.

Oleh Prypin · Accepted Answer · 2012-10-06 22:41:58Z

Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I'll just copy the docstring of the main str_split function:

str_split(s, *delims, empty=None)

Split the string s by the rest of the arguments, possibly omitting empty parts (empty keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. empty is then True by default.

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except this function is a generator.

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    # validate before joining: once joined into one string, every element
    # is a single character and the check could never fail
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both 2 and 3 versions. The first lines of the function should be changed to:

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')

Ignacio Vazquez-Abrams · Accepted Answer · 2010-10-05 08:53:21Z

3

No, but it should be easy enough to write one using itertools.takewhile().

EDIT:

Very simple, half-broken implementation:

import itertools
import string

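def isplitwords(s):
  i = iter(s)
  while True:
    r = []
    for c in itertools.takewhile(lambda x: x not in string.whitespace, i):
      r.append(c)
    if r:
      yield ''.join(r)
    else:
      # An empty run means the input is exhausted (or two whitespace
      # characters were adjacent, which is why this is only half-working).
      # Use return, not raise StopIteration: since PEP 479 the latter
      # becomes a RuntimeError inside a generator.
      return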

edited Oct 5, 2010 at 8:53

answered Oct 5, 2010 at 8:33

Ignacio Vazquez-Abrams


11 Comments

Manoj Govindan Over a year ago

@Ignacio: The example in docs uses a list of integers to illustrate the use of takeWhile. What would be a good predicate for splitting a string into words (default split) using takeWhile()?

Ignacio Vazquez-Abrams Over a year ago

Look for presence in string.whitespace.

kennytm Over a year ago

The separator can have multiple characters, 'abc<def<>ghi<><>lmn'.split('<>') == ['abc<def', 'ghi', '', 'lmn']

Manoj Govindan Over a year ago

@Ignacio: Can you add an example to your answer?

Glenn Maynard Over a year ago

Easy to write, but many orders of magnitude slower. This is an operation that really should be implemented in native code.


David Webb · Accepted Answer · 2010-10-05 09:00:31Z

3

I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over so you're not going to save any memory by having a generator.

If you wanted to write one it would be fairly easy though:

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

edited Oct 5, 2010 at 9:00

answered Oct 5, 2010 at 8:53

David Webb


5 Comments

Glenn Maynard Over a year ago

You'd halve the memory used, by not having to store a second copy of the string in each resulting part, plus the array and object overhead (which is typically more than the strings themselves). That generally doesn't matter, though (if you're splitting strings so large that this matters, you're probably doing something wrong), and even a native C generator implementation would always be significantly slower than doing it all at once.

David Webb Over a year ago

@Glenn Maynard - I just realised that. For some reason I originally thought the generator would store a copy of the string rather than a reference. A quick check with id() put me right. And obviously, as strings are immutable, you don't need to worry about someone changing the original string while you're iterating over it.

Scott Griffiths Over a year ago

Isn't the main point in using a generator not the memory usage, but that you could save yourself having to split the whole string if you wanted to exit early? (That's not a comment on your particular solution, I was just surprised by the discussion about memory).

Glenn Maynard Over a year ago

@Scott: It's hard to think of a case where that's really a win--where 1: you want to stop splitting partway through, 2: you don't know how many words you're splitting in advance, 3: you have a large enough string for it to matter, and 4: you consistently stop early enough for it to be a significant win over str.split. That's a very narrow set of conditions.

Lie Ryan Over a year ago

You can have much higher benefit if your string is lazily generated as well (e.g. from network traffic or file reads)
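The lazily generated input mentioned in the last comment might look like the following sketch. The name `split_stream` and the chunked-input interface are assumptions for illustration. It buffers only the unfinished tail, so a separator that straddles two chunks is still found.

```python
def split_stream(chunks, sep):
    """Lazily split text that arrives as an iterable of string chunks."""
    if not sep:
        raise ValueError("empty separator")
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete piece currently in the buffer.
        start = 0
        while True:
            idx = buffer.find(sep, start)
            if idx == -1:
                break
            yield buffer[start:idx]
            start = idx + len(sep)
        # Keep the unfinished tail; it may end with half a separator.
        buffer = buffer[start:]
    yield buffer


# The separator "<>" is cut in half between the first two chunks.
print(list(split_stream(["ab<", ">cd<>", "ef"], "<>")))  # ['ab', 'cd', 'ef']
```

For a text file `f`, the chunks could come from something like `iter(lambda: f.read(4096), "")`, so that the whole file never has to sit in memory at once.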

dshepherd · Accepted Answer · 2015-04-17 11:43:01Z

I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. whitespace delimited by default and you can specify a delimiter).

import re

def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both python 3 and python 2):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

Python's re module says that it does "the right thing" for Unicode whitespace, but I haven't actually tested it.
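A quick way to check this on Python 3 is sketched below. The sample characters (a no-break space and an em space) are my own choice. For str patterns `\s` matches Unicode whitespace by default, and the re.ASCII flag restricts it to ASCII whitespace.

```python
import re

# "1", NO-BREAK SPACE (U+00A0), "2", EM SPACE (U+2003), "3"
text = "1\u00a02\u20033"

# Default str patterns: \s covers Unicode whitespace.
print(re.findall(r"[^\s]+", text))                  # ['1', '2', '3']

# With re.ASCII only [ \t\n\r\f\v] count as whitespace, so nothing is split.
print(re.findall(r"[^\s]+", text, flags=re.ASCII) == [text])  # True

# str.split() with no arguments agrees with the Unicode behaviour.
print(text.split())                                 # ['1', '2', '3']
```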

Also available as a gist.

reubano · Accepted Answer · 2016-01-08 12:54:02Z

3

If you would also like to be able to read an iterator (as well as return one) try this:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

Usage

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']

answered Jan 8, 2016 at 12:54

reubano


boot-scootin · Accepted Answer · 2019-11-19 15:31:31Z

3

more_itertools.split_at offers an analog to str.split for iterators.

>>> import more_itertools as mit


>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]

>>> "abcdcba".split("b")
['a', 'cdc', 'a']

more_itertools is a third-party package.

edited Nov 19, 2019 at 15:31

boot-scootin


answered Jan 22, 2018 at 6:21

pylang


2 Comments

jcater Over a year ago

Note that more_itertools.split_at() is still using a newly allocated list on each call, so while this does return an iterator, it is not achieving the constant memory requirement. So depending on why you wanted an iterator to begin with, this may or may not be helpful.

pylang Over a year ago

@jcater Good point. The intermediate values are indeed buffered as sub lists within the iterator, according to its implementation. One could adapt the source to substitute lists with iterators, append with itertools.chain and evaluate results using a list comprehension. Depending on the need and request, I can post an example.

Veltzer Doron · Accepted Answer · 2018-01-22 19:15:20Z

2

I wanted to show how to use the finditer solution to return a generator for given delimiters, and then use the pairwise recipe from itertools to build a previous/next iteration that yields the actual words, as in the original split method.

from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

note:

I use prev & curr instead of prev & next because shadowing the built-in next in Python is a very bad idea
This is quite efficient
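The same idea can be wrapped in a reusable generator function. This is a sketch under assumptions: the name `pairwise_split` is hypothetical, `itertools.pairwise` is only in the standard library from Python 3.10 (a fallback based on the documented tee recipe is included), and `re.escape` is added so that delimiter characters special to regexes are taken literally. As in the snippet above, the delimiter is treated as a set of characters, and runs of them collapse into one boundary.

```python
import re

try:
    from itertools import pairwise  # Python 3.10+
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        # Classic recipe: s -> (s0, s1), (s1, s2), ...
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

def pairwise_split(string, delimiter=" "):
    """Lazily yield the text between consecutive boundaries.

    A boundary is the start of the string, a run of delimiter characters,
    or the end of the string.
    """
    pattern = "^|[{0}]+|$".format(re.escape(delimiter))
    for prev, curr in pairwise(re.finditer(pattern, string)):
        yield string[prev.end():curr.start()]


print(list(pairwise_split("dasdha hasud  hasuid")))  # ['dasdha', 'hasud', 'hasuid']
```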

edited Jan 22, 2018 at 19:15

answered Dec 18, 2017 at 14:54

Veltzer Doron


Tavy · Accepted Answer · 2020-05-21 12:00:27Z

2

Dumbest method, without regex / itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            text = text[end + len(split):]

answered May 21, 2020 at 12:00

Tavy


David Rissato Cruz · Accepted Answer · 2021-02-08 16:35:07Z

2

Very old question, but here is my humble contribution with an efficient algorithm:

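from typing import Iterable

def str_split(text: str, separator: str) -> Iterable[str]:
    if not separator:
        raise ValueError("empty separator")
    i = 0
    n = len(text)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        # advance past the whole separator, not just one character
        i = j + len(separator)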

edited Feb 8, 2021 at 16:35

answered Jan 7, 2021 at 17:30

David Rissato Cruz


Max Bileschi · Accepted Answer · 2025-01-15 19:56:41Z

1

Implementation:

iter(io.StringIO(my_str))

Example usage:

>>> import io
>>> for x in iter(io.StringIO('hello')):
...   print(x)
...
hello
>>> for x in iter(io.StringIO('hello\nworld\n')):
...   print(x)
...
hello

world

Documentation: https://docs.python.org/3/library/io.html#io.StringIO

answered Jan 15 at 19:56

Max Bileschi


travelingbones · Accepted Answer · 2013-03-11 19:36:45Z

0

def split_generator(f,s):
    """
    f is a string, s is the substring we split on.
    This produces a generator rather than a possibly
    memory intensive list. 
    """
    i=0
    j=0
    while j<len(f):
        if i>=len(f):
            yield f[j:]
            j=i
        elif f[i] != s:
            i=i+1
        else:
            yield [f[j:i]]
            j=i+1
            i=i+1

answered Mar 11, 2013 at 19:36

travelingbones


1 Comment

Moberg Over a year ago

why do you yield [f[j:i]] and not f[j:i]?

Narcisse Doudieu Siewe · Accepted Answer · 2019-02-06 16:54:46Z

0

here is a simple response

def gen_str(some_string, sep):
    j=0
    guard = len(some_string)-1
    for i,s in enumerate(some_string):
        if s == sep:
           yield some_string[j:i]
           j=i+1
        elif i!=guard:
           continue
        else:
           yield some_string[j:]

answered Feb 6, 2019 at 16:54

Narcisse Doudieu Siewe


Apalala · Accepted Answer · 2021-02-26 21:28:03Z

0

import re

def isplit(text, sep=None, maxsplit=-1):
    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep in ('', b''):
        raise ValueError('empty separator')

    if maxsplit == 0 or not text:
        yield text
        return

    regex = (
        re.escape(sep) if sep is not None
        else [br'\s+', r'\s+'][isinstance(text, str)]
    )
    yield from re.split(regex, text, maxsplit=max(0, maxsplit))

edited Feb 26, 2021 at 21:28

answered Feb 26, 2021 at 18:21

Apalala


Brian C. · Accepted Answer · 2021-09-08 02:23:42Z

0

Here is an answer that is based on split and maxsplit. This does not use recursion.

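def gsplit(todo):
    chunk = 100
    while todo:
        # split(maxsplit=chunk) returns at most chunk + 1 items;
        # the last one is the still-unsplit remainder
        splits = todo.split(maxsplit=chunk)
        if len(splits) == chunk + 1:
            todo = splits.pop()
        else:
            todo = None
        for item in splits:
            yield item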

answered Sep 8, 2021 at 2:23

Brian C.


mh-firouzjah · Accepted Answer · 2024-11-24 17:43:14Z

0

def splitter(string, delimiter=" "):
    start = end = 0
    while end < len(string):
        while end<len(string) and string[end] != delimiter:
            end += 1
        yield string[start: end]
        start = end = end +1
    return string[end:]

print(list(splitter("abdcabcd", "b")))

#> ['a', 'dca', 'cd']

answered Nov 24, 2024 at 17:43

mh-firouzjah
