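<p>You can use a recursive function like the following:</p>
<p><strong>Note</strong>: the <code>result</code> parameter effectively behaves like a global variable. Its default value is a mutable list that is created once and shared between calls, and mutating a mutable object passed to a function is visible to the caller.</p>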
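<pre><code>import re

def finder(st, past_ind=0, result=[]):
    # Leftmost block made of a unit repeated at least twice
    m = re.search(r'(.+)\1+', st)
    if m:
        i, j = m.span()
        sub = st[i:j]
        # Reduce the matched block to its shortest repeating unit
        ind = (sub + sub).find(sub, 1)
        sub = sub[:ind]
        if len(sub) > 1:
            # Shift the span into the original string's coordinates, plus one
            result.append([sub, (i + past_ind + 1, j + past_ind + 1)])
        past_ind += j
        # result is not passed on: the shared default list keeps accumulating
        return finder(st[j:], past_ind)
    else:
        return result

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print(finder(s))
</code></pre>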
<p>Result:</p>
<pre><code>[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]
</code></pre>
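<p>Because the default list is shared, calling <code>finder</code> a second time in the same session would append to the results of the first call. The following is a minimal sketch of a variant that avoids this, assuming <code>re.finditer</code> is an acceptable substitute for the recursion. The name <code>find_repeats</code> is hypothetical and not part of the original answer:</p>

```python
import re

def find_repeats(text):
    """Return [unit, (start, end)] for every repeated block whose unit is longer than one character."""
    found = []  # a fresh list per call, so no state leaks between calls
    for m in re.finditer(r'(.+)\1+', text):
        block = m.group(0)
        # Shortest period of the matched block (same doubling trick as above)
        period = (block + block).find(block, 1)
        unit = block[:period]
        if len(unit) > 1:
            start, end = m.span()
            # Keep the "+1" offset convention used by finder()
            found.append([unit, (start + 1, end + 1)])
    return found

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print(find_repeats(s))
# [['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]
```

<p>Since <code>re.finditer</code> yields non-overlapping matches from left to right, it visits the same blocks as the recursive search on <code>st[j:]</code>.</p>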
<h2>Answer to the previous version of the question, for the following string:</h2>
<pre><code>s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
</code></pre>
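<p>You can use the answer from the <a href="https://stackoverflow.com/questions/29481088/how-can-i-tell-if-a-string-repeats-itself-in-python">mentioned question</a> together with a few extra steps:</p>
<p>First, split the string on <code>**</code>, then build a new list containing the repeated part of each piece, using the regex <code>r'(.+)\1+'</code>:</p>
<p>So the result is:</p>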
<pre><code>>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']
</code></pre>
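<p>Note that <code>'ACGTACGT'</code> is missing the <code>A</code> at the end, because the regex keeps only the repeated part of <code>'ACGTACGTA'</code>.</p>
<p>Then you can use the <code>principal_period</code> function to get the repeated substring:</p>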
<pre><code>def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]

>>> sub=[]
>>> l=[]
>>> for i in new:
...     p=principal_period(i)
...     if p is not None and len(p)>1:
...         l.append(p)
...         sub.append(i)
...
</code></pre>
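<p>To make the doubling trick easier to follow, here is a small illustration. If <code>s</code> is built from a shorter unit, <code>s</code> reappears inside <code>s+s</code> at an offset equal to the length of that unit; the <code>1</code> and <code>-1</code> bounds exclude the two trivial matches at the very start and at the second copy. The sample strings are chosen only for illustration:</p>

```python
def principal_period(s):
    # Look for s inside s+s, excluding the trivial positions 0 and len(s)
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

for text in ['ACGTACGT', 'GTGTGT', 'AAA', 'ACGTA']:
    print(text, principal_period(text))
# ACGTACGT ACGT
# GTGTGT GT
# AAA A
# ACGTA None
```

<p>The last case shows why the regex step comes first: <code>'ACGTA'</code> as a whole is not periodic, so <code>principal_period</code> returns <code>None</code> for it.</p>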
<p>So <code>l</code> holds the repeated units and <code>sub</code> holds the full repeated strings:</p>
<pre><code>>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']
</code></pre>
<p>Next you need the <code>region</code>, which you can get with the <code>span</code> method:</p>
<pre><code>>>> regons=[]
>>> for t in sub:
...     regons.append(re.search(t,s).span())
...
>>> regons
[(6, 14), (24, 30), (38, 50)]
</code></pre>
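<p>Note that <code>re.search(t, s)</code> treats <code>t</code> as a regular expression and reports only the first occurrence. That is harmless for plain DNA letters, but a literal lookup may be safer in general. A minimal sketch using <code>str.find</code>; the helper name <code>literal_span</code> is hypothetical:</p>

```python
def literal_span(text, piece):
    """Return the (start, end) span of the first literal occurrence of piece, or None."""
    start = text.find(piece)
    if start == -1:
        return None
    return (start, start + len(piece))

s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
print([literal_span(s, t) for t in ['ACGTACGT', 'GTGTGT', 'TATACGTATACG']])
# [(6, 14), (24, 30), (38, 50)]
```

<p>These spans are positions in the string that still contains the <code>**</code> markers.</p>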
<p>Finally, you can zip the three lists <code>sub</code>, <code>l</code> and <code>regons</code>, and use a dict comprehension to build the expected result:</p>
<pre><code>>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
</code></pre>
<p>The complete code:</p>
<pre><code>>>> import re
>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub=[]
>>> l=[]
>>> regons=[]
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> for i in new:
...     p=principal_period(i)
...     if p is not None and len(p)>1:
...         l.append(p)
...         sub.append(i)
...
>>> for t in sub:
...     regons.append(re.search(t,s).span())
...
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
</code></pre>
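<p>For reuse outside the interactive session, the same steps can be wrapped in one function. This is only a sketch under stated assumptions: the name <code>summarize_repeats</code> and its <code>sep</code> parameter are hypothetical, pieces without any repeat are skipped rather than raising an error, and regions are looked up literally with <code>str.find</code>:</p>

```python
import re

def principal_period(s):
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

def summarize_repeats(marked, sep='**'):
    """Map each repeated block to its repeat count and its span in the marked string."""
    out = {}
    for piece in marked.split(sep):
        m = re.search(r'(.+)\1+', piece)
        if m is None:
            continue  # assumption: a piece with no repeat is simply ignored
        block = m.group(0)
        unit = principal_period(block)
        if unit is None or len(unit) <= 1:
            continue  # drop single-letter runs such as 'AAA'
        start = marked.find(block)  # first literal occurrence only
        out[block] = {'repeat': block.count(unit),
                      'region': (start, start + len(block))}
    return out

s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
print(summarize_repeats(s))
# {'ACGTACGT': {'repeat': 2, 'region': (6, 14)}, 'GTGTGT': {'repeat': 3, 'region': (24, 30)}, 'TATACGTATACG': {'repeat': 2, 'region': (38, 50)}}
```

<p>The printed key order shown here assumes Python 3.7 or later, where dicts keep insertion order.</p>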