Cannot load ARFF dataset with scipy.io.arff.loadarff

3 answers

User

Answer #1 · edited 2024-09-29 21:53:58

Is this expected? (aka scipy implementation does not fully comply with the arff format)

Yes, unfortunately. As noted in the docstring for loadarff, "It cannot read files with sparse data ({} in the file)." The file yahoo_arts.arff uses the sparse format in its @data section.

You could try searching PyPI for "arff" to find an alternative. I haven't used any of these, so I don't have any specific recommendation.
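To check whether a given file is affected before handing it to the scipy loader, one could look at the first row after the @data line. The helper below is only a minimal sketch: `uses_sparse_format` is a hypothetical name, not part of scipy, and it assumes that sparse rows start with `{` as the docstring describes.

```python
def uses_sparse_format(lines):
    """Return True if the first data row after @data starts with '{'.

    `lines` is any iterable of text lines, e.g. an open file.
    Hypothetical helper for illustration; not part of scipy.
    """
    in_data = False
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and ARFF comments (lines starting with '%').
        if not line or line.startswith("%"):
            continue
        if not in_data:
            # ARFF keywords are case-insensitive.
            in_data = line.lower().startswith("@data")
            continue
        # The first real row of the data section decides the answer.
        return line.startswith("{")
    return False


sparse_example = ["@relation demo", "@attribute a numeric", "@data", "{0 1}"]
dense_example = ["@relation demo", "@attribute a numeric", "@data", "1"]
print(uses_sparse_format(sparse_example))  # True
print(uses_sparse_format(dense_example))   # False
```

With a real file this could be called as `uses_sparse_format(open("yahoo_arts.arff"))`, assuming the file is readable as text.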

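User

Answer #2 · edited 2024-09-29 21:53:58

As Warren Weckesser's answer shows, scipy cannot read sparse ARFF files. I have implemented a quick workaround to parse sparse ARFF files, and I am sharing it below in case it helps others. If I find time to write a clean version, I will try to contribute it to scipy.

Edit: Sorry, I hadn't seen your version, but I guess it works too.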

from scipy.sparse import coo_matrix
import pandas as pd

def loadarff(filename):
  """Parse a sparse ARFF file with integer values into a dense DataFrame."""

  features = list()
  data = list()
  row_idx = 0

  with open(filename, "rb") as f:
    for line in f:
      line = line.decode("utf8").strip()
      # Skip blank lines and ARFF comments (lines starting with '%').
      if not line or line.startswith("%"):
        continue
      # ARFF keywords are case-insensitive.
      keyword = line.lower()
      if keyword.startswith("@data"):
        continue
      elif keyword.startswith("@relation"):
        continue
      elif keyword.startswith("@attribute"):
        try:
          # The attribute name is the second whitespace-separated token.
          features.append(line.split()[1])
        except Exception as e:
          print(f"Cannot parse {line}")
          raise e
      elif line.startswith("{"):
        try:
          # A sparse row looks like "{index value, index value, ...}".
          line = line.replace("{", "").replace("}", "")
          # Build [row, column, value] triples; an empty row "{}" yields none.
          line = [[row_idx,]+[int(x) for x in v.split()] for v in line.split(",") if v.strip()]
          data.append(line)
          row_idx += 1
        except Exception as e:
          print(f"Cannot parse {line}")
          raise e
      else:
        print(f"Cannot parse {line}")

  # Flatten the per-row lists into a single list of triples.
  flatten = lambda l: [item for sublist in l for item in sublist]
  data = flatten(data)

  sparse_matrix = coo_matrix(([x[2] for x in data], ([x[0] for x in data], [x[1] for x in data])), shape=(row_idx, len(features)))

  df = pd.DataFrame(sparse_matrix.todense(), columns=features)

  return df

User

Answer #3 · edited 2024-09-29 21:53:58

You can use the following workaround:

import numpy as np
import pandas as pd


with open('yahoo_arts.arff', 'r') as fp:
    file_content = fp.readlines()


def parse_row(line, len_row):
    line = line.replace('{', '').replace('}', '')

    row = np.zeros(len_row)
    for data in line.split(','):
        index, value = data.split()
        row[int(index)] = float(value)

    return row


columns = []
len_attr = len('@attribute')

# get the columns
for line in file_content:
    if line.startswith('@attribute '):
        col_name = line[len_attr:].split()[0]
        columns.append(col_name)

rows = []
len_row = len(columns)
# get the rows
for line in file_content:
    if line.startswith('{'):
        rows.append(parse_row(line, len_row))

df = pd.DataFrame(data=rows, columns=columns)

df.head()
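The `parse_row` function above assumes every row holds at least one `index value` pair, so a row written as `{}` would make `data.split()` fail to unpack. As a hedged variant, the sketch below uses only the standard library and also accepts empty rows and extra whitespace. The name `parse_sparse_row` is hypothetical, and the sketch still assumes all values are numeric; nominal or string values would raise in `float()`.

```python
def parse_sparse_row(line, n_columns):
    """Expand one sparse ARFF row such as '{0 1, 3 2.5}' into a dense list.

    Assumes numeric values only; indexes are zero-based column positions.
    """
    row = [0.0] * n_columns
    # Drop surrounding whitespace and the enclosing braces.
    body = line.strip().lstrip("{").rstrip("}")
    for entry in body.split(","):
        entry = entry.strip()
        if not entry:
            # An empty row "{}" produces a single empty entry; skip it.
            continue
        index, value = entry.split(None, 1)
        row[int(index)] = float(value)
    return row


print(parse_sparse_row("{0 1, 3 2.5}\n", 5))  # [1.0, 0.0, 0.0, 2.5, 0.0]
print(parse_sparse_row("{}", 3))              # [0.0, 0.0, 0.0]
```

Returning plain lists keeps the sketch free of dependencies; the rows can still be passed to `pd.DataFrame(data=rows, columns=columns)` as in the workaround above.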
