Strange behavior when parsing a tab-delimited file in Python

Published 2024-10-02 02:38:40

I am parsing a tab-delimited file in which the first element is a Twitter hashtag and the second element is the tweet content.

My input file looks like this:

#trumpisanabuser    of young black men . calling for the execution of the innocent !url "
#centralparkfiv of young black men . calling for the execution of the innocent !url "
#trumppence16   "
#trumppence16   "
#america2that   @user "

My code filters out duplicate content, such as retweets, by checking whether the second tab-delimited element has already been seen:

import sys
import csv

tweetfile = sys.argv[1]
tweetset = set()   # tweet texts already seen
taglines = []      # unique "hashtag<TAB>tweet" lines to keep
with open(tweetfile, "rt") as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print("hashtag: " + str(row[0]) + "\t" + "tweet: " + str(row[1]))
        row[1] = row[1].replace("\\ n", "").rstrip()
        # Skip tweets whose text was already seen (e.g. retweets)
        if row[1] in tweetset:
            continue
        # Drop placeholders and non-alphanumeric characters to test for real content
        temp = row[1].replace("!url", "")
        temp = temp.replace("@user", "")
        temp = "".join(c for c in temp if c.isalnum())
        if temp:
            taglines.append(row[0] + "\t" + row[1])
        tweetset.add(row[1])

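However, the parsing behaves strangely. When I print each parsed item, the output is as shown below. Can someone explain why the parsing breaks and prints this line (hashtag: #trumppence16 tweet:, then a newline, then #trumppence16)?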

hashtag: #centralparkfive   tweet: of young black men . calling for the execution of the innocent !url "
hashtag: #trumppence16  tweet: 
#trumppence16   
hashtag: #america2that  tweet: @user "

1 answer

#1 · Posted 2024-10-02 02:38:40

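Your tweets contain lines with a " character. CSV allows a column to be quoted by wrapping the value in " characters, and a quoted value may contain newlines. Everything from an opening " to the next closing " is treated as a single column value, which is why the second #trumppence16 line ended up inside the tweet field of the first.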
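To see the effect in isolation, here is a minimal sketch that feeds a few lines shaped like the question's data through csv.reader with its default quoting. The sample string is an assumption reconstructed from the question, with real tab characters, and io.StringIO stands in for the file.

```python
import csv
import io

# Assumed sample: three tab-delimited lines, two of which have a lone '"' as the tweet.
sample = '#trumppence16\t"\n#trumppence16\t"\n#america2that\t@user "\n'

# Default dialect: '"' at the start of a field opens a quoted value.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

for row in rows:
    print(repr(row))
```

The first two input lines come back as a single row whose second field starts with a newline and contains the next hashtag. The last line is unaffected, because its " is not at the start of the field.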

You can disable quote handling by setting the quoting option to csv.QUOTE_NONE:

reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
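As a quick check of this fix, the sketch below parses a small in-memory sample with quoting=csv.QUOTE_NONE. The sample text and the helper name read_rows are illustrative assumptions, not part of the original code.

```python
import csv
import io


def read_rows(text):
    """Parse tab-delimited text, treating '"' as an ordinary character."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Assumed sample shaped like the question's input (real tabs between columns).
sample = '#trumppence16\t"\n#trumppence16\t"\n#america2that\t@user "\n'

rows = read_rows(sample)
for hashtag, tweet in rows:
    print("hashtag: " + hashtag + "\t" + "tweet: " + tweet)
```

Each input line now yields exactly one row, so the duplicate check on the second column sees the two #trumppence16 tweets as separate, identical values.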
