Pandas: How to merge extra columns into the last column when reading a CSV

Posted 2024-09-30 06:25:16


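I have a CSV with a fixed set of columns, and I am reading it with pandas read_csv. However, some rows have extra column values. I need to merge all of those extra columns into the last column.

Basically, I am trying to read a CSV in which some columns contain the special characters (,) and ('), so those values get split and create additional columns. As a result I get 'ParserError: Error tokenizing data. C error: Expected 4 fields in line 7, saw 5'. So I need a way to dynamically fold the extra columns into the last column.

For example, in the sample below the problem is the last column, which contains a mix of , and '.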

from io import StringIO
import pandas as pd

csv = r"""dummy,obj,loc,query
bar,6usrg82hwsa3,a,'select * from abc'
bar,b6usrg82hwsa3,a,'select * from abc'
bar,4g9cgbm813czs,a,'select * from abc'
bar,fhf8upax5cxsz,b,'select * from abc'
bar,cnphq355f5rah,b,'select * from abc'
bar,b6usrg82hwsa3,b,'SELECT LIST(HIGHLIGHT, ',') WITHIN GR...'"""

df = pd.read_csv(StringIO(csv), quotechar="'")

This throws 'Error tokenizing data'.

The expected output is:

>>> print(df)
  dummy            obj loc              query
0   bar   6usrg82hwsa3   a  select * from abc
1   bar  b6usrg82hwsa3   a  select * from abc
2   bar  4g9cgbm813czs   a  select * from abc
3   bar  fhf8upax5cxsz   b  select * from abc
4   bar  cnphq355f5rah   b  select * from abc
5   bar  b6usrg82hwsa3   b  SELECT LIST(HIGHLIGHT, ',') WITHIN GR...

3 Answers

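One possible solution is to read the file into a single-column DataFrame by using a separator that does not occur in the data, such as |, and then use Series.str.split with the n parameter: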

from io import StringIO
import pandas as pd

csv = r"""dummy,obj,loc,query
bar,6usrg82hwsa3,a,'select * from abc'
bar,b6usrg82hwsa3,a,'select * from abc'
bar,4g9cgbm813czs,a,'select * from abc'
bar,fhf8upax5cxsz,b,'select * from abc'
bar,cnphq355f5rah,b,'select * from abc'
bar,b6usrg82hwsa3,b,'SELECT LIST(HIGHLIGHT, ',') WITHIN GR...'"""

df = pd.read_csv(StringIO(csv), quotechar="'", sep='|')
print (df)
                                 dummy,obj,loc,query
0             bar,6usrg82hwsa3,a,'select * from abc'
1            bar,b6usrg82hwsa3,a,'select * from abc'
2            bar,4g9cgbm813czs,a,'select * from abc'
3            bar,fhf8upax5cxsz,b,'select * from abc'
4            bar,cnphq355f5rah,b,'select * from abc'
5  bar,b6usrg82hwsa3,b,'SELECT LIST(HIGHLIGHT, ',...

df1 = df.iloc[:, 0].str.split(',', expand=True, n=3).apply(lambda x: x.str.strip("'"))
df1.columns = df.columns[0].split(',')
print (df1)
  dummy            obj loc                                     query
0   bar   6usrg82hwsa3   a                         select * from abc
1   bar  b6usrg82hwsa3   a                         select * from abc
2   bar  4g9cgbm813czs   a                         select * from abc
3   bar  fhf8upax5cxsz   b                         select * from abc
4   bar  cnphq355f5rah   b                         select * from abc
5   bar  b6usrg82hwsa3   b  SELECT LIST(HIGHLIGHT, ',') WITHIN GR...
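As an alternative to the split approach above, here is a minimal sketch that lets read_csv repair the over-long rows itself. It is a different technique from the one in this answer: it assumes pandas 1.4 or newer, where on_bad_lines accepts a callable when engine="python" is used. The names raw, N_COLUMNS and merge_extra_fields are made up for this illustration. Quoting is switched off so that every comma splits a field and the extra fields can be re-joined without losing characters.

```python
import csv
from io import StringIO

import pandas as pd

raw = """dummy,obj,loc,query
bar,6usrg82hwsa3,a,'select * from abc'
bar,b6usrg82hwsa3,b,'SELECT LIST(HIGHLIGHT, ',') WITHIN GR...'"""

N_COLUMNS = 4  # assumption: the number of columns in the header is known


def merge_extra_fields(bad_line):
    """Fold every field past the last expected column back into that column."""
    head = bad_line[:N_COLUMNS - 1]
    tail = ",".join(bad_line[N_COLUMNS - 1:])
    return head + [tail]


df = pd.read_csv(
    StringIO(raw),
    engine="python",                  # callables are only supported by this engine
    quoting=csv.QUOTE_NONE,           # treat ' as ordinary text, split on every comma
    on_bad_lines=merge_extra_fields,  # called for each row with too many fields
)

# The quotes are still part of the text, so remove them from the last column.
df["query"] = df["query"].str.strip("'")
```

The Python engine is slower than the default C engine, so this trade-off may matter for large files.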

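If your data contains text columns, do not store it as .csv, even if those columns contain no commas at the moment. Use CSV only if you know for certain that commas can never appear in the data; otherwise use a tab-separated format or another file type. That said, the solution below works for your case: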

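def refactor_text(csv):
    # One list per expected column; 'query' must stay last because it
    # absorbs any extra comma-separated fields.
    my_dict = dict(
        dummy=[],
        obj=[],
        loc=[],
        query=[]
        )
    # Skip the header row (the first line of the text).
    for line in csv.split('\n')[1:]:
        line_args = line.split(',')
        for pos, key in enumerate(my_dict):
            if key != 'query':
                my_dict[key].append(line_args[pos])
            else:
                # Re-join everything from this position on, restoring the commas.
                my_dict[key].append(','.join(line_args[pos:]))
    return my_dict


# Assumes `csv` (the text above) and `pd` (pandas) are already defined.
df = pd.DataFrame(refactor_text(csv))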

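The function refactor_text takes a single argument, the CSV text as a string; if you are reading directly from a file, or in other situations, you may need to adapt it.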
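One way such an adaptation for files might look is sketched below. It is not part of the answer above: iter_rows is a hypothetical helper that takes any open text handle, reads the column count from the header instead of hard-coding the column names, and splits each line at most that many times so the extra commas stay in the last field. It assumes every data line has at least as many fields as the header.

```python
import io


def iter_rows(handle, sep=",", quote="'"):
    """Yield the header, then one list of fields per non-empty line."""
    header = next(handle).rstrip("\r\n").split(sep)
    yield header
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        # Split at most len(header) - 1 times: the remainder stays in one piece.
        fields = line.split(sep, len(header) - 1)
        # Drop the surrounding quotes from the last (free-text) field.
        fields[-1] = fields[-1].strip(quote)
        yield fields


# Demonstration with an in-memory handle; a real file opened with open() works the same way.
sample = io.StringIO(
    "dummy,obj,loc,query\n"
    "bar,6usrg82hwsa3,a,'select * from abc'\n"
    "bar,b6usrg82hwsa3,b,'SELECT LIST(HIGHLIGHT, ',') WITHIN GR...'\n"
)
rows = iter_rows(sample)
header = next(rows)
records = list(rows)
```

The result can then be handed to pandas with pd.DataFrame(records, columns=header). Because the handle is read line by line, the whole file never has to be held as a single string.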

This works, though I am not sure how well it performs on huge datasets.

import pandas as pd

csv = r"""dummy,obj,loc,query
bar,6usrg82hwsa3,a,'select * from abc'
bar,b6usrg82hwsa3,a,'select * from abc'
bar,4g9cgbm813czs,a,'select * from abc'
bar,fhf8upax5cxsz,b,'select * from abc'
bar,cnphq355f5rah,b,'select * from abc'
bar,b6usrg82hwsa3,b,'SELECT LIST(HIGHLIGHT, ',') WITHIN GR...'"""

# The smallest field count across all lines is taken as the true number of columns.
lines = csv.split('\n')
n_columns = min(line.count(',') + 1 for line in lines)

# Split each line at most n_columns - 1 times, so any extra commas
# stay inside the last column.
rows = [line.split(',', n_columns - 1) for line in lines]
data = pd.DataFrame(data=rows[1:], columns=rows[0])

print(data)

#   dummy            obj loc                                       query
# 0   bar   6usrg82hwsa3   a                         'select * from abc'
# 1   bar  b6usrg82hwsa3   a                         'select * from abc'
# 2   bar  4g9cgbm813czs   a                         'select * from abc'
# 3   bar  fhf8upax5cxsz   b                         'select * from abc'
# 4   bar  cnphq355f5rah   b                         'select * from abc'
# 5   bar  b6usrg82hwsa3   b  'SELECT LIST(HIGHLIGHT, ',') WITHIN GR...'
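The query values printed above still carry their surrounding single quotes, while the expected output in the question does not. A small sketch of removing them afterwards with Series.str.strip follows; the two-row frame here is made-up sample data standing in for the real result.

```python
import pandas as pd

# Stand-in for the frame built above: the query column still has its quotes.
data = pd.DataFrame({
    "obj": ["6usrg82hwsa3", "b6usrg82hwsa3"],
    "query": [
        "'select * from abc'",
        "'SELECT LIST(HIGHLIGHT, ',') WITHIN GR...'",
    ],
})

# strip("'") removes quote characters at both ends only; the quotes
# in the middle of the second value are left untouched.
data["query"] = data["query"].str.strip("'")
```

Note that strip removes every leading and trailing quote character, so a value that legitimately ends in a quote would lose it as well.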
