Divide-and-conquer list in Python (reading sav files with pyreadstat)

import pyreadstat

# Read only the metadata to get information about the variables
df, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
columns = meta.column_names  # all variable names are stored in this list

result = []
for var in columns:
    print(var)
    try:
        df, meta = pyreadstat.read_sav('Test.sav', usecols=[str(var)])
        # No error, so this variable can be stored in result
        result.append(var)
    except Exception:
        # Reading this variable failed, so skip it
        pass

# Finally, load the sav file using only the variables that read without error
df, meta = pyreadstat.read_sav('Test.sav', usecols=result)

1 answer

User

#1 · Posted 2024-09-27 07:35:54

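For this particular case, I would suggest a different approach: you can pass the `encoding` argument to pyreadstat.read_sav to set the encoding manually. If you don't know which encoding the file uses, you can iterate over the list of encodings at https://gist.github.com/hakre/4188459 to find out which one produces sensible output. For example: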

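import pyreadstat

# here codes is a list with all the encodings in the link mentioned before
for c in codes:
    try:
        df, meta = pyreadstat.read_sav("Test.sav", encoding=c)
        # Show which encoding worked, then a preview of the decoded data
        print(c)
        print(df.head())
    except Exception:
        # This encoding failed; move on to the next one
        pass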

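I tried this, and a few encodings gave output that might make sense, assuming the strings are in a non-Latin script. However, the most promising one is not in the list: encoding="UTF8" (the list contains UTF-8, with a dash, but that one fails). With UTF8 (no dash) I get the following: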

నేను గతంలో వాడిన బ

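According to Google Translate, this means something like "I used to come b" in Telugu. I'm not sure it makes complete sense, but it is a way forward.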

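The advantage of this approach is that, if you find the correct encoding, you lose no data and reading is fast. The disadvantage is that you may not find the correct encoding.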
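The encoding search can be wrapped in two small helpers, sketched below under some assumptions. The names `with_dashless_variants` and `find_working_encodings` are hypothetical and not part of pyreadstat. The first adds a dash-free spelling for each candidate, since UTF8 worked above while UTF-8 did not. The second takes any callable that reads the file with a given encoding. The demo uses a plain `bytes.decode` stand-in so it runs without a sav file; Python's own decoder accepts both spellings, so the stand-in does not reproduce the UTF-8 failure seen with the real file.

```python
def with_dashless_variants(names):
    """Return each name plus its dash-free spelling (UTF-8 -> UTF8), without duplicates."""
    seen = []
    for name in names:
        for candidate in (name, name.replace("-", "")):
            if candidate not in seen:
                seen.append(candidate)
    return seen


def find_working_encodings(read, encodings):
    """Call read(encoding) for every candidate and keep the results that did not raise."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = read(enc)
        except Exception:
            # This candidate could not read the data; try the next one.
            continue
    return results


# Stand-in reader for demonstration only: decodes some UTF-8 bytes.
raw = "నేను".encode("utf-8")


def fake_read(enc):
    return raw.decode(enc)


candidates = with_dashless_variants(["ascii", "UTF-8"])
found = find_working_encodings(fake_read, candidates)
for enc, value in found.items():
    print(enc, value)
```

With pyreadstat, the `read` argument could be something like `lambda enc: pyreadstat.read_sav("Test.sav", encoding=enc)`. Each successful result still has to be inspected by eye, because an encoding that reads without error is not necessarily the correct one.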

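If you cannot find the correct encoding, you will still read the problematic columns very quickly, and you can discard them later in pandas by checking which character columns do not contain Latin characters. This will be much faster than the algorithm you proposed.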
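One possible way to do that column check is sketched below. It assumes that "problematic" means a column holding at least one string with a letter outside the Latin script, judged by the character's Unicode name; that criterion and the helper names `has_non_latin_letter` and `non_latin_columns` are assumptions for illustration. Wrongly decoded text that happens to consist of Latin letters and symbols would not be caught by this test.

```python
import unicodedata


def has_non_latin_letter(text):
    """Return True if text contains a letter whose Unicode name is not a LATIN one."""
    for ch in text:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return True
    return False


def non_latin_columns(table):
    """Return the names of columns that hold at least one string with a non-Latin letter.

    table is anything whose .items() yields (column name, iterable of values),
    such as a dict of lists.
    """
    flagged = []
    for name, values in table.items():
        # Non-string values (numbers, missing values) are ignored.
        if any(isinstance(v, str) and has_non_latin_letter(v) for v in values):
            flagged.append(name)
    return flagged


# Small made-up sample: one numeric column, one Latin text column, one Telugu column.
sample = {
    "id": [1, 2],
    "name": ["Ana", "José"],
    "comment": ["నేను", "ok"],
}
flagged = non_latin_columns(sample)
print(flagged)
```

Because a pandas DataFrame also provides `.items()` yielding column name and Series pairs, the same helper should work on the frame itself, for example `df = df.drop(columns=non_latin_columns(df))`.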
