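<p>You could iteratively look for common patterns and build a list of the most frequent ones to remove. It sounds like your dataset is large enough that the result doesn't have to be 100% correct.</p>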
<p>Since the patterns you mention only appear at the beginning or the end of a title, you could do something like this:</p>
<pre class="lang-py prettyprint-override"><code>from collections import Counter
data = [
"Amet urna tincidunt efficitur - The Guardian",
"Yltricies hendrerit eu a nisi - The Guardian",
"Faucibus pharetra id quis arck - The Guardian",
"Net tristique facilisis | New York Times",
"Quis finibus lacinia | New York Times",
"My blog: Net tristique facilisis",
"My blog: Quis finibus lacinia",
]

def find_common(data, num_phrases=50):
    # Count the 2- to 5-word phrases that start or end each title.
    phrases = Counter()
    for sentence in data:
        words = sentence.split()
        for n in range(2, 6):
            phrases[" ".join(words[:n])] += 1
            phrases[" ".join(words[-n:])] += 1
    return phrases.most_common(num_phrases)

find_common(data, 8)
Out[145]:
[('The Guardian', 3),
('- The Guardian', 3),
('York Times', 2),
('Net tristique facilisis', 2),
('New York Times', 2),
('| New York Times', 2),
('Quis finibus lacinia', 2),
('My blog:', 2)]
</code></pre>
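<p>From this you can see that "- The Guardian", "| New York Times" and "My blog:" are common site-name patterns. You can then remove them from the data and run it again, iterating until you feel you have caught most of them.</p>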
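<p>The removal step could be sketched roughly as follows. This is only a minimal illustration: <code>strip_patterns</code> is a hypothetical helper, and the <code>prefixes</code> and <code>suffixes</code> lists are assumed to be picked by hand from the counts above, not detected automatically.</p>

```python
def strip_patterns(title, prefixes, suffixes):
    """Remove the first matching prefix and suffix from a title.

    Hypothetical helper: the pattern lists are chosen by hand after
    inspecting the most common leading/trailing phrases.
    """
    for prefix in prefixes:
        if prefix and title.startswith(prefix):
            title = title[len(prefix):]
            break
    for suffix in suffixes:
        if suffix and title.endswith(suffix):
            title = title[:-len(suffix)]
            break
    # Drop the whitespace left behind next to the removed pattern.
    return title.strip()


# Assumed pattern lists, taken from the sample output in this answer.
prefixes = ["My blog:"]
suffixes = ["- The Guardian", "| New York Times"]

titles = [
    "Amet urna tincidunt efficitur - The Guardian",
    "Net tristique facilisis | New York Times",
    "My blog: Quis finibus lacinia",
]

cleaned = [strip_patterns(t, prefixes, suffixes) for t in titles]
print(cleaned)
# ['Amet urna tincidunt efficitur', 'Net tristique facilisis', 'Quis finibus lacinia']
```

<p>Passing <code>cleaned</code> back into <code>find_common</code> gives the next round of candidates, so the pattern lists can grow a little on each pass.</p>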