Is there a generator version of `string.split()` in Python?

3 Answers

User

#1 · Edited 2024-05-06 04:11:29

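The most efficient approach I can think of is to write one using the offset parameter of str.find(). This avoids heavy memory use, and it avoids the overhead of a regexp when one is not needed.

[Edit 2016-8-2: updated this to optionally support regex separators]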

import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        if not source:
            # str.split() returns no fields for empty or all-whitespace input
            return
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

Use it however you like...

>>> print(list(isplit("abcb", "b")))
['a', 'c', '']

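Each call to find() or each slice does incur a small lookup cost within the string, but this should be minimal because strings are represented as contiguous arrays in memory.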
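To make the role of the start offset concrete, here is a minimal sketch of the find-based loop on its own, combined with itertools.islice to stop after the first two fields. The name find_split and the sample line are hypothetical, chosen only for illustration. Because the generator is consumed lazily, the rest of the string is never scanned or sliced.

```python
from itertools import islice

def find_split(source, sep):
    """Hypothetical minimal splitter: yield fields lazily using str.find()."""
    start = 0
    while True:
        # Resume the search at `start`; the source string itself is never copied.
        idx = source.find(sep, start)
        if idx == -1:
            yield source[start:]
            return
        yield source[start:idx]
        start = idx + len(sep)

line = "alpha,beta,gamma,delta,epsilon"
# Only the first two fields are produced; the loop never reaches "gamma".
first_two = list(islice(find_split(line, ","), 2))
print(first_two)  # ['alpha', 'beta']
```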

User

#2 · Edited 2024-05-06 04:11:29

It is very likely that re.finditer() uses fairly minimal memory overhead.

import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

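Edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (about 1 GB), then iterated over the iterable with a for loop (not a list comprehension, which would have allocated extra memory). This did not result in noticeable memory growth (that is, if memory did grow, it was far less than the 1 GB string).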
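A scaled-down version of that experiment can be sketched with the standard-library tracemalloc module. This is not the original test: the helper names (peak_bytes, consume_lazily, consume_eagerly) are hypothetical, the input is about 1 MB rather than 1 GB, and tracemalloc only sees allocations made through Python's allocator. The input string is built before tracing starts, so only the memory allocated while splitting is counted.

```python
import re
import tracemalloc

def split_iter(string):
    return (m.group(0) for m in re.finditer(r"[A-Za-z']+", string))

def peak_bytes(fn):
    """Run fn() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# Built before tracing begins, so it does not count toward either peak.
text = "word " * 200_000

def consume_lazily():
    # Plain for loop: each token is discarded before the next is produced.
    count = 0
    for _ in split_iter(text):
        count += 1
    return count

def consume_eagerly():
    # str.split() materializes the full list of tokens at once.
    return len(text.split())

lazy_peak = peak_bytes(consume_lazily)
eager_peak = peak_bytes(consume_eagerly)
print("lazy peak bytes: ", lazy_peak)
print("eager peak bytes:", eager_peak)
```

The lazy peak should come out far below the eager one, since the eager call holds all 200,000 tokens and the list that contains them at the same time.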

User

#3 · Edited 2024-05-06 04:11:29

Here is a generator version of split() implemented with re.search(). It does not suffer from the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

Edit: corrected the handling of surrounding whitespace when no separator is given.
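The whitespace handling can be checked a bit further with empty and separator-only inputs, which the four samples above do not cover. The sketch below repeats itersplit unchanged so that it runs standalone; the cases list is an assumption about which edge cases matter, not part of the original answer.

```python
import re

def itersplit(s, sep=None):
    # Same logic as the answer above, repeated so this sketch is self-contained.
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()

# (input string, separator) pairs; None means "split on runs of whitespace".
cases = [
    ("", None),
    ("   ", None),
    ("\tone\n two  ", None),
    ("", ","),
    (",", ","),
    ("a,,b", ","),
]

# Collect every case where the generator disagrees with str.split().
mismatches = [(s, sep) for s, sep in cases if list(itersplit(s, sep)) != s.split(sep)]
print(mismatches)  # []
```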
