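<p>I'm not sure what your problem is, but when I ran this program with Python 3.4 and bs4, it removed "transcript" and a lot of punctuation. (I took out a bunch of the imports and changed <code>urllib2</code> to <code>urllib.request</code>.)</p>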
<pre><code>import urllib.request
import re
from bs4 import BeautifulSoup
from string import punctuation as p
chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib.request.urlopen(chester_url).read()
# name the parser explicitly so bs4 does not have to guess one
chester_3752 = BeautifulSoup(chester_3752, 'html.parser')
# find the speech itself within the HTML
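# (the third positional argument of find() is `recursive`, so the extra
# {'class': 'displaytext'} dict was never applied as a filter; match on id only)
chester_3752 = chester_3752.find('div', {'id': 'transcript'})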
# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()
# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))
chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('—',' ')
chester_3752 = chester_3752.replace('transcript','')
print(chester_3752)
</code></pre>
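<p>If it helps to check the clean-up steps without fetching the page, here is a minimal sketch that applies the same lower-casing, punctuation removal and replacements to a plain string. The function name <code>normalize_text</code> and the sample string are made up for illustration; they are not part of the original script.</p>

```python
import re
from string import punctuation

# Same pattern as in the script above: one or more ASCII punctuation characters.
PUNCTUATION_RE = re.compile('[{}]+'.format(re.escape(punctuation)))


def normalize_text(text):
    """Hypothetical helper: repeat the script's clean-up steps on a plain string."""
    text = text.lower()
    # string.punctuation is ASCII-only, so the em dash survives this step...
    text = PUNCTUATION_RE.sub('', text)
    # ...and is turned into a space here, as in the original script.
    text = text.replace('—', ' ')
    # Drop the literal heading word that comes along with the div's text.
    return text.replace('transcript', '')


if __name__ == '__main__':
    # Made-up sample input, not text from the actual speech.
    print(repr(normalize_text('Transcript Hello, World!—Again')))
```

<p>Note that a hyphen counts as punctuation here, so a hyphenated word is joined into one token rather than split in two.</p>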