<blockquote>
<p>All of the sentences have some form of the word 'conjecture', i.e. conjectures, conjectured, etc.</p>
</blockquote>
<p>The <code>word in string</code> method shown in other answers usually fails: for example, it won't find the word <code>community</code> in a sentence that contains <code>communities</code>.</p>
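<p>A minimal sketch of that failure mode, using made-up sentences (the variable names are illustrative only). The substring test only succeeds when the inflected form happens to contain the base word verbatim:</p>

```python
# Plain substring search: works only if the base word appears verbatim.
sentence_a = "Several communities were surveyed."
sentence_b = "He conjectured that the interface was..."

# "communities" does not contain "community" (the y became "ies"), so this misses.
found_community = "community" in sentence_a

# "conjectured" happens to contain "conjecture", so this one matches by luck.
found_conjecture = "conjecture" in sentence_b

print(found_community, found_conjecture)
```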
<p>In this case you might need a stemming algorithm, such as those provided by the <a href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem-module.html" rel="nofollow"><code>nltk</code> package</a>:</p>
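<pre><code>from nltk.stem.snowball import EnglishStemmer
from nltk import word_tokenize  # needs NLTK's Punkt tokenizer data installed

stemmer = EnglishStemmer()
stem_word = stemmer.stem
# Stem of the target word; its inflected forms reduce to the same stem.
stem = stem_word(u"conjecture")

sentence = u'He conjectured that the interface was...'
words = word_tokenize(sentence)
# Keep (position, token) pairs whose stem matches the target stem.
found_words = [(i, w) for i, w in enumerate(words) if stem_word(w) == stem]
# -> [(1, u'conjectured')]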
</code></pre>
<p>There are other stemming and <a href="http://nltk.org/api/nltk.tokenize.html" rel="nofollow">tokenize methods in nltk</a> that you could use, depending on your exact needs.</p>
<blockquote>
<p>however some words start with the nasty characters: “ or the like.. how can I get rid of them?</p>
</blockquote>
<p>The "nasty characters" are the result of incorrectly treating a <code>utf-8</code> byte sequence as <code>cp1252</code>.</p>
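<p>A small Python 3 sketch of how this kind of garbling arises, assuming the original character was a left double quotation mark (U+201C). The variable names are illustrative:</p>

```python
# A left double quotation mark, as it would appear in correctly decoded text.
original = "\u201c"

# Its UTF-8 encoding is the three bytes E2 80 9C.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec (cp1252) turns each byte
# into a separate character, producing the "nasty characters".
garbled = utf8_bytes.decode("cp1252")

print(repr(utf8_bytes), garbled)
```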
<p>You should not blindly remove the garbled text; fix the character encoding instead.</p>
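<p>The cleanest fix is to decode the input as <code>utf-8</code> where it is first read. If only the already-garbled string is available, the wrong decoding can sometimes be reversed. The sketch below assumes the text was UTF-8 mistakenly decoded as cp1252; <code>fix_mojibake</code> is a hypothetical helper, not part of any library:</p>

```python
def fix_mojibake(text):
    """Reverse UTF-8 text that was mistakenly decoded as cp1252.

    Hypothetical helper for illustration: returns the input unchanged
    when it does not look like this particular kind of garbling.
    """
    try:
        # Recover the original bytes, then decode them with the right codec.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The three garbled characters from the question, followed by a word.
garbled = "\u00e2\u20ac\u0153conjecture"
repaired = fix_mojibake(garbled)
print(repaired)
```

<p>This round trip is not always possible: a few byte values have no cp1252 mapping in Python's codec, so some garbled strings cannot be re-encoded and are returned as they were.</p>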
<p><a href="http://www.hanselman.com/blog/WhyTheAskObamaTweetWasGarbledOnScreenKnowYourUTF8UnicodeASCIIAndANSIDecodingMrPresident.aspx" rel="nofollow">Why the #AskObama Tweet was Garbled on Screen: Know your UTF-8, Unicode, ASCII and ANSI Decoding Mr. President</a> shows a public example of this issue appearing on TV.</p>
<p>To understand it, read <a href="http://www.joelonsoftware.com/articles/Unicode.html" rel="nofollow">The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)</a>.</p>