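<p>The most efficient approach I can think of is to write one yourself using the <code>start</code> argument of <code>str.find()</code>. This avoids heavy memory use, and it avoids paying the overhead of a regexp when one isn't needed.</p>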
<p><em>[Edit 2016-8-2: updated this to optionally support regex separators]</em></p>
<pre><code>import re


def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior: split on runs of whitespace
        source = source.strip()
        if not source:
            # str.split() returns [] for empty or all-whitespace input
            return
        sep = r"\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        if not sep:
            # str.split() rejects an empty separator; without this check
            # the loop below would never advance
            raise ValueError("empty separator")
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize
</code></pre>
<p>It can be used however you like...</p>
<pre><code>>>> print(list(isplit("abcb", "b")))
['a', 'c', '']
</code></pre>
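<p>There is a small cost to seeking within the string each time <code>find()</code> or a slice is performed, but it should be minimal because strings are represented as contiguous arrays in memory.</p>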
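<p>As a rough sanity check, the sketch below compares a condensed copy of the <code>str.find()</code> branch above against <code>str.split()</code> for a few explicit separators, then uses <code>itertools.islice</code> to take only the first few fields of a large string without building the full list. The name <code>isplit_find</code>, the empty-separator check, and the sample inputs are my own assumptions for illustration, not part of the original function.</p>

```python
from itertools import islice


def isplit_find(source, sep):
    """Condensed, hypothetical copy of the str.find() branch of isplit()."""
    if not sep:
        # Assumption: reject an empty separator, as str.split() does.
        raise ValueError("empty separator")
    start = 0
    while True:
        idx = source.find(sep, start)
        if idx == -1:
            # No more separators: yield the tail and stop.
            yield source[start:]
            return
        yield source[start:idx]
        start = idx + len(sep)


# The generator should agree with str.split() for explicit separators.
samples = [("abcb", "b"), ("", ","), ("a,,b", ","), ("no-sep", ","), ("x--y--", "--")]
for text, sep in samples:
    assert list(isplit_find(text, sep)) == text.split(sep)

# Laziness: only as much of the string is scanned as the consumer asks for.
big = ",".join(str(i) for i in range(100000))
first_three = list(islice(isplit_find(big, ","), 3))
print(first_three)  # ['0', '1', '2']
```

<p>Because the generator stops scanning once the consumer stops iterating, the <code>islice</code> call only performs a handful of <code>find()</code> calls, regardless of how long <code>big</code> is.</p>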