<p>I'm trying to count the number of contractions politicians use in certain speeches. I have many speeches, but here are a few sample URLs:</p>
<pre><code>every_link_test = ['http://www.millercenter.org/president/obama/speeches/speech-4427',
'http://www.millercenter.org/president/obama/speeches/speech-4424',
'http://www.millercenter.org/president/obama/speeches/speech-4453',
'http://www.millercenter.org/president/obama/speeches/speech-4612',
'http://www.millercenter.org/president/obama/speeches/speech-5502']
</code></pre>
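<p>I have a very rough counter right now: it only keeps a running total of the contractions used across all of these links. For example, the code below returns <code>79,101,101,182,224</code> for the five links above. However, I want to tie each count to <code>filename</code>, a variable I create below, so that I end up with something like <code>(speech_1, 79),(speech_2, 22),(speech_3,0),(speech_4,81),(speech_5,42)</code>. That way, I can track the number of contractions used in each individual speech. My code raises the following error: <code>AttributeError: 'tuple' object has no attribute 'split'</code></p>
<p>Here's my code:</p>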
<pre><code>import urllib2,sys,os
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
reload(sys)
url = 'http://www.millercenter.org/president/speeches'
url2 = 'http://www.millercenter.org'
conn = urllib2.urlopen(url)
html = conn.read()
miller_center_soup = BeautifulSoup(html)
links = miller_center_soup.find_all('a')
linklist = [tag.get('href') for tag in links if tag.get('href') is not None]
# remove all items in list that don't contain 'speeches'
linkslist = [_ for _ in linklist if re.search('speeches',_)]
del linkslist[0:2]
# concatenate 'http://www.millercenter.org' with each speech's URL ending
every_link_dups = [url2 + end_link for end_link in linkslist]
# remove duplicates
seen = set()
every_link = [] # no duplicates array
for l in every_link_dups:
    if l not in seen:
        every_link.append(l)
        seen.add(l)
def processURL_short_2(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id':'transcript'},{'class':'displaytext'})
    item_str = item_div.text.lower()
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return item_str, filename
every_link_test = every_link[0:5]
print every_link_test
count = 0
for l in every_link_test:
    content_1 = processURL_short_2(l)
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    print count, filename
</code></pre>
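<p>For reference, here is a minimal sketch of one way to get per-speech counts. It rests on an assumption about the cause rather than a confirmed fix: <code>processURL_short_2</code> returns the tuple <code>(item_str, filename)</code>, so the result would need to be unpacked into two names before calling <code>split()</code>, and the counter would need to start from zero for each speech. The <code>contractions</code> set, the <code>count_contractions</code> helper, and the sample texts and filenames are all hypothetical stand-ins so that the example runs without network access.</p>

```python
from string import punctuation

# Hypothetical, deliberately tiny contraction list (the question does not show its own).
contractions = {"don't", "can't", "we're", "it's", "won't"}


def count_contractions(text, known=contractions):
    """Count the whitespace-separated words in text that are known contractions."""
    count = 0
    for word in text.lower().split():
        # Strip leading/trailing punctuation; the inner apostrophe is kept.
        if word.strip(punctuation) in known:
            count += 1
    return count


# Stand-ins for the (item_str, filename) tuples returned by processURL_short_2.
fetched = [
    ("we're ready, and we won't stop. it's time.", "speech_a"),
    ("no contractions appear in this sentence.", "speech_b"),
]

results = []
for item_str, filename in fetched:  # unpack the tuple instead of calling split() on it
    results.append((filename, count_contractions(item_str)))

print(results)
```

<p>In the loop from the question, the corresponding change would presumably be <code>item_str, filename = processURL_short_2(l)</code>, followed by counting over <code>item_str</code> only.</p>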