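<p>I tried your code and it works well, but there is one small tweak I would suggest. Instead of using <code>replace</code>, use <a href="https://docs.python.org/2/library/stdtypes.html#str.startswith" rel="nofollow"><code>startswith</code></a> to check that the string really does begin with <code>transcript</code>. <code>replace</code> removes every occurrence of "transcript" from the whole string, whereas what you actually need is to remove only the one at the start of the string.</p>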
<pre><code>import urllib2
import sys
from string import punctuation as p
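import re
from bs4 import BeautifulSoup  # assumes the bs4 package; BeautifulSoup is used below but was never imported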
reload(sys)
chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib2.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)
# find the speech itself within the HTML
chester_3752 = chester_3752.find('div', {'id': 'transcript', 'class': 'displaytext'})
# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()
# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))
chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('-',' ')
print(chester_3752)
# Avoid chester_3752.replace('transcript', ''): it deletes every occurrence of the word in the string.
# Checking startswith removes only the leading "transcript", which is what you want.
if chester_3752.startswith("transcript"):
    chester_3752 = chester_3752[len("transcript"):].strip()
print chester_3752
</code></pre>
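<p>To see the difference without fetching the page, here is a minimal sketch on a made-up string. The helper name <code>strip_prefix</code> and the sample text are my own, purely for illustration; they are not part of your script.</p>

```python
def strip_prefix(text, prefix):
    """Drop `prefix` only when `text` begins with it (hypothetical helper)."""
    if text.startswith(prefix):
        # Slice off exactly the prefix, then trim the whitespace left behind.
        return text[len(prefix):].strip()
    return text


# Made-up sample: the word appears as a heading and again inside the speech.
sample = "transcript the transcript of the speech follows"

# replace() removes both occurrences, including the one in the body.
with_replace = sample.replace("transcript", "")

# The startswith check removes only the leading one.
with_prefix_check = strip_prefix(sample, "transcript")

print(with_replace)
print(with_prefix_check)
```

<p>With <code>replace</code>, the second "transcript" inside the sentence disappears too, while the <code>startswith</code> version leaves the body of the text untouched.</p>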