This is a follow-up to Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4? and {a2}.
I'm not using the Twitter API because it doesn't return tweets by hashtag this far back.
Edit: the error described here only occurred on Windows 7. As bernie reports in the comments below, the code runs as expected on Linux, and I can also run it on OS X 10.10.2 without encoding errors.
The encoding error occurs when I try to loop over the code that gets the tweet content.
The first snippet grabs only the first tweet and, as expected, gets everything inside the <p> tag:
amessagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
amessage = amessagetext[0]
However, when I try to get all the tweets with a loop, using the second snippet
^{pr2}$
I get the well-known cp437.py encoding error:
File "C:\Anaconda3\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 4052: character maps to <undefined>
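The traceback can be reproduced without any scraping at all: cp437 (the legacy Windows console code page) has no mapping for characters such as the em dash U+2014, which is common in tweet text, so encoding through that codec raises. A minimal sketch of the failure, independent of BeautifulSoup:

```python
# cp437 cannot represent U+2014 (em dash), so encoding raises
# UnicodeEncodeError -- the same error the Windows console print hits.
text = "test \u2014 dash"

try:
    text.encode("cp437")
    failed = False
except UnicodeEncodeError:
    failed = True

# Encoding the same string as UTF-8 always succeeds.
utf8_bytes = text.encode("utf-8")
```

This is why the error only appears on Windows 7: Linux and OS X terminals default to UTF-8, so `print` never routes through cp437 there.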
So why is the first tweet scraped successfully, while multiple tweets cause an encoding problem? Is it because the first tweet happens not to contain a problematic character? I've successfully scraped the first tweet across several different searches, so I'm not sure that's the reason.
How can I fix this? I've read several posts and book sections about it, and I understand why it happens, but I'm not sure how to correct it in the BeautifulSoup code.
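One common workaround when you do need to print on a console with a limited code page is to substitute the characters the codec cannot represent. A hedged sketch (not from the original code) using the `errors="replace"` handler:

```python
text = "tweet \u2014 text"

# Encode with errors="replace" so unencodable characters become "?",
# then decode back; the result is safe to print on a cp437 console.
safe = text.encode("cp437", errors="replace").decode("cp437")
```

This loses the original character, so it is only suitable for debug output, not for the data you keep.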
Here is the full code for reference.
from bs4 import BeautifulSoup
import requests
import sys
import csv #Will be exporting to csv
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0'} # (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")
names = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
usernames = [name.contents[0] for name in names]
handles = soup('span', {'class': 'username js-action-profile-name'})
userhandles = [handle.contents[1].contents[0] for handle in handles]
athandles = [('@')+abhandle for abhandle in userhandles]
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
fullurls = [('http://www.twitter.com')+permalink for permalink in urls]
timestamps = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
datetime = [timestamp["title"] for timestamp in timestamps]
messagetexts = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
messages = [messagetext for messagetext in messagetexts]
amessagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
amessage = amessagetext[0]
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcounts = [retweet.contents[3].contents[1].contents[1].string for retweet in retweets]
favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcounts = [favorite.contents[3].contents[1].contents[1].string for favorite in favorites]
print (usernames, "\n", "\n", athandles, "\n", "\n", fullurls, "\n", "\n", datetime, "\n", "\n",retweetcounts, "\n", "\n", favcounts, "\n", "\n", amessage, "\n", "\n", messages)
I solved this to my own satisfaction by eliminating the print statements I was using for error checking, and by adding encoding="utf-8" to both
with open
commands, specifying the encoding for the scraped HTML file and the CSV output file.