Converting data types in R or Python

Posted 2024-10-03 11:13:26


I have Amazon data and want to convert it to CSV format using R or Python. The raw data I have looks like this:

product/productId: B000GKXY3   
product/title: Nun Chuck  
product/price: 17.99  
review/userId: ADX8VLDUOL7BG  
review/profileName: M. Gingras


product/productId: B000GKXY34  
product/title: Nun Chuck  
product/price: 17.99  
review/userId: A3NM6P6BIWTIAE  
review/profileName: Maria Carpenter

I want to convert it to CSV format, with one record per row and one field per column.

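The Amazon dataset looks somewhat unusual to me, and I don't know how to convert it to CSV. I mainly use R, but I'm open to Python as well. So if anyone knows how to do this in R or Python, please share your ideas.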

Thanks in advance.


3 Answers

I assume you have a fixed list of fields. In that case, you can generate the CSV like this:

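import csv

buff = []  # values collected for the current output row
with open('source.txt') as inp, open('target.txt', 'w', newline='') as out:
    writer = csv.writer(out, lineterminator='\n')  # quotes values that contain commas
    for line in inp:
        line = line.strip()  # drop the newline and any trailing spaces
        if not line:  # a blank line in the input ends the current record
            if buff:  # ignore repeated blank lines
                writer.writerow(buff)
                buff = []  # clear buffer
        else:
            # keep everything after the first ': ', so values may contain ': '
            buff.append(line.partition(': ')[2])
    if buff:  # the last record may not be followed by a blank line
        writer.writerow(buff)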
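If the field list turns out not to be fixed, one possible variation is to parse each block into a dict and let `csv.DictWriter` produce the header row. The following is only a sketch under that assumption: the helper names `parse_records` and `records_to_csv` are made up for illustration, and the sample text is copied from the question.

```python
import csv
import io

def parse_records(lines):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            if record:  # a blank line closes the current record
                yield record
                record = {}
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(':')
        record[key.strip()] = value.strip()
    if record:  # last record without a trailing blank line
        yield record

def records_to_csv(lines, out):
    """Write the parsed records to `out` as CSV with a header row."""
    records = list(parse_records(lines))
    fields = []  # field names in first-seen order across all records
    for rec in records:
        for key in rec:
            if key not in fields:
                fields.append(key)
    writer = csv.DictWriter(out, fieldnames=fields, restval='',
                            lineterminator='\n')
    writer.writeheader()
    writer.writerows(records)  # missing fields are written as empty cells

sample = """product/productId: B000GKXY3
product/title: Nun Chuck
product/price: 17.99
review/userId: ADX8VLDUOL7BG
review/profileName: M. Gingras

product/productId: B000GKXY34
product/title: Nun Chuck
product/price: 17.99
review/userId: A3NM6P6BIWTIAE
review/profileName: Maria Carpenter
"""

buf = io.StringIO()
records_to_csv(sample.splitlines(), buf)
csv_text = buf.getvalue()
print(csv_text)
```

To process a real file, pass an open file object in place of `sample.splitlines()` and another one (opened with `newline=''`) in place of `buf`. Note that this version holds all records in memory in order to collect the header, which may not suit a very large dataset.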

Assuming your data is consistent with your sample: ordered, five lines per record, with a blank sixth line...

#!/usr/bin/env python3

def partition(l, n):
    """Split list l into consecutive chunks of n items."""
    return [l[i:i+n] for i in range(0, len(l), n)]

def loadData():
    # Read the non-blank lines and split each into [field, value]
    # on the first ': ' only, so values may contain ': '.
    with open('data.dat') as f:
        return [row.strip().split(': ', 1)
                for row in f.read().splitlines() if row.strip()]

data = partition(loadData(), 5)  # 5 fields per record

headers = [[h[0] for h in data[0]]]  # field names from the first record
columns = [[col[1] for col in row] for row in data]  # values of every record

_data = headers + columns

print("\n".join(",".join(row) for row in _data))

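Result (for the two sample records from the question):

product/productId,product/title,product/price,review/userId,review/profileName
B000GKXY3,Nun Chuck,17.99,ADX8VLDUOL7BG,M. Gingras
B000GKXY34,Nun Chuck,17.99,A3NM6P6BIWTIAE,Maria Carpenter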

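Here is one way to do it in R. It requires that every block of data has the same fields (same order and same names) and that blocks are separated by blank lines. I imagine there are simpler ways to do this, perhaps using plyr?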

Read in some data. You could point readLines at a text file instead.

dat <- readLines(textConnection('product/productId: B000GKXY3
product/title: Nun Chuck
product/price: 17.99
review/userId: ADX8VLDUOL7BG
review/profileName: M. Gingras

product/productId: B000GKXY34
product/title: Nun Chuck
product/price: 17.99
review/userId: A3NM6P6BIWTIAE
review/profileName: Maria Carpenter

product/productId: B000GKXY35
product/title: Nun Chuck
product/price: 17.99
review/userId: A3NM6P6BIWTIAF
review/profileName: Someone Else'))

# Identify blocks of data (assuming blank line indicates a new block) 
#  and split to list L.
L <- split(dat, rep(seq_along(diff(c(0, which(dat==''), length(dat)))), 
                   diff(c(0, which(dat==''), length(dat)))))

# Remove empty elements.
L <- lapply(L, function(x) x[x != ''])

# rbind to a matrix
M <- do.call(rbind, L)

# Extract column names
nm <- sub(':.*$', '', M[1, ])

# Remove column names from matrix elements
#  (strip only up to the first colon, so values may contain colons)
M <- sub('^[^:]*: *', '', M)

# Add column names attribute
colnames(M) <- nm

M

  product/productId product/title product/price review/userId    review/profileName
1 "B000GKXY3"       "Nun Chuck"   "17.99"       "ADX8VLDUOL7BG"  "M. Gingras"      
2 "B000GKXY34"      "Nun Chuck"   "17.99"       "A3NM6P6BIWTIAE" "Maria Carpenter" 
3 "B000GKXY35"      "Nun Chuck"   "17.99"       "A3NM6P6BIWTIAF" "Someone Else" 

You can then easily coerce this to a data.frame and make product/price numeric, if that floats your boat.
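For readers taking the Python route instead, the same kind of type conversion can be sketched with the standard `csv` module. This is a minimal illustration, not part of the answer above: the `csv_text` literal stands in for a real CSV file, `to_float` is a hypothetical helper, and returning `None` for non-numeric prices is an assumption (the sample only shows numeric prices).

```python
import csv
import io

# Stand-in for the converted CSV file; in practice, open the real file instead.
csv_text = """product/productId,product/title,product/price,review/userId,review/profileName
B000GKXY3,Nun Chuck,17.99,ADX8VLDUOL7BG,M. Gingras
B000GKXY34,Nun Chuck,17.99,A3NM6P6BIWTIAE,Maria Carpenter
"""

def to_float(text):
    """Return float(text), or None when the text is not a number."""
    try:
        return float(text)
    except ValueError:
        return None

# DictReader uses the header row as keys; every value starts out as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    row['product/price'] = to_float(row['product/price'])

prices = [row['product/price'] for row in rows]
print(prices)  # [17.99, 17.99]
```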
