How do I read a data.txt text file, sort the data, and then convert it to a DataFrame using Python?

2 Answers

User

Answer 1 · edited 2024-06-26 01:32:22

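OK, you say each file holds one record, but there are many files. Let's assume you have something that provides the file names, so that list(filenames()) is a list containing the relevant file names.

You should first write a function that builds a dictionary from a file name: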

fieldnames = ['Yield', 'Timestamp', 'Angle', 'ErrorCode 10', 'ErrorCode 12',
              'ErrorCode 13', 'ErrorCode 20']

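def getrecord(filename):
    """Build one record (a dict) from a file of 'key: value' lines."""
    with open(filename) as fd:
        d = {'FileName': filename}
        for line in fd:
            # Skip blank or malformed lines that have no 'key: value' separator.
            if ':' not in line:
                continue
            # Split on the first colon only, so values may contain colons.
            k, v = [i.strip() for i in line.split(':', 1)]
            if k in fieldnames:
                d[k] = v
        return d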
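
The `split(':', 1)` call above splits only on the first colon, which matters when a value itself contains colons, as timestamps usually do. A small illustration, assuming a hypothetical line of the form `Timestamp: 2019-05-01 12:30:45` (the real file contents are not shown in the question):

```python
# Hypothetical sample line; the real data.txt format is assumed, not known.
line = "Timestamp: 2019-05-01 12:30:45\n"

# maxsplit=1 keeps everything after the first colon together.
key, value = [part.strip() for part in line.split(":", 1)]

# Without maxsplit the time of day is broken into extra pieces.
pieces = [part.strip() for part in line.split(":")]

print(key)     # Timestamp
print(value)   # 2019-05-01 12:30:45
print(pieces)  # ['Timestamp', '2019-05-01 12', '30', '45']
```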

Now you can build the DataFrame with:

import pandas as pd

df = pd.DataFrame([getrecord(filename) for filename in filenames()],
                  columns=['FileName'] + fieldnames)
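
The answer leaves `filenames()` undefined. One possible sketch, assuming the records live as `*.txt` files in a single directory (the `directory` and `pattern` parameters are assumptions, not part of the original answer):

```python
from pathlib import Path


def filenames(directory=".", pattern="*.txt"):
    """Yield matching file paths as strings, in sorted order.

    Both parameters are hypothetical; adapt them to where the files live.
    """
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            yield str(path)
```

Sorting the paths makes the row order of the resulting DataFrame reproducible, since the order in which `glob` returns files is not guaranteed.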

User

Answer 2 · edited 2024-06-26 01:32:22

Updated answer:

Since the OP mentioned that each text file holds only one record, the following solution is appropriate:

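import os  # needed for os.path.isdir below
import re
from collections import OrderedDict
from os import sep
# pathlib (standard library) is used here in place of the third-party
# `path` package; Path(...).glob("*.txt") behaves the same for this code.
from pathlib import Path

import pandas as pd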

def oneFileSingleRecordParser(textFilePath):
    fileName = textFilePath.rsplit(sep, 1)[-1]
    
    with open(textFilePath, "r") as textFile:
        # The structure is:
        # Yield:
        # Timestamp
        # Angle
        # ErrorCode 10
        # ErrorCode 12
        # ErrorCode 16
        # ErrorCode 20
        
        # The error codes can be present or absent
        lines = textFile.readlines()
        
        dataDict = OrderedDict()
        dataDict["File Name"] = fileName
        
        for line in lines:
            matchObject = re.match(r"(\w+\s?\d*):\s(.*)", line.strip())
            
            if matchObject is not None:
                key, value = matchObject.groups()
                dataDict[key] = value
            
        return dict(dataDict)
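
To see what the regular expression in the parser accepts, here is a small check against a few hypothetical lines (the sample values are made up; only the key names come from the answer):

```python
import re

# Same pattern as in the parser above.
PATTERN = re.compile(r"(\w+\s?\d*):\s(.*)")

samples = [
    "Yield: 0.85",
    "Timestamp: 2019-05-01 12:30:45",
    "ErrorCode 10: 3",
    "no separator here",
]

parsed = {}
for line in samples:
    match = PATTERN.match(line.strip())
    if match is not None:
        key, value = match.groups()
        parsed[key] = value

print(parsed)
```

Lines that do not match are silently skipped. Note that the pattern requires a whitespace character after the colon, so a stripped line such as `Yield:` with an empty value is skipped as well.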

def convertAllFilesToDataFrame(textFilePathsRoot, parser = oneFileSingleRecordParser):
    if not os.path.isdir(textFilePathsRoot):
        raise Exception("Please pass in a valid path to the root of the text files")

    textFilePaths = list(map(lambda path: str(path), Path(textFilePathsRoot).glob("*.txt")))
    
    dataDicts = []
    
    for textFilePath in textFilePaths:
        dataDicts.append(parser(textFilePath))
    
    dataFrame = pd.DataFrame(dataDicts)
    return dataFrame

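Calling convertAllFilesToDataFrame("path/to/your/text/file/directory") should still produce a DataFrame with one row per file (in my case, I only had two files containing exactly the same record).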
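
Neither parser sorts the rows or converts types; every value arrives as a string. A minimal sketch of that last step, using hand-written records in place of real files (the values, and the assumption that `Timestamp` parses as a date and `Yield` as a number, are hypothetical):

```python
import pandas as pd

# Stand-ins for the dictionaries the parser returns; values are made up.
records = [
    {"File Name": "b.txt", "Yield": "0.91", "Timestamp": "2019-05-02 08:00:00",
     "Angle": "12", "ErrorCode 10": "1"},
    {"File Name": "a.txt", "Yield": "0.85", "Timestamp": "2019-05-01 12:30:45",
     "Angle": "15", "ErrorCode 12": "4"},
]

# Keys missing from a record become NaN in that row.
df = pd.DataFrame(records)

# Convert the string columns so that sorting is chronological and numeric.
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df["Yield"] = pd.to_numeric(df["Yield"])

# Sort by time and renumber the rows.
df = df.sort_values("Timestamp").reset_index(drop=True)
print(df)
```

Sorting before conversion would compare the values as text, which only happens to work for zero-padded ISO-style timestamps.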

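Original answer

Depending on the structure of the text files, this can be approached in two ways:

A text file contains exactly five lines (one record)
A single text file may contain a multiple of five lines (multiple records)

Here is how I handle both: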

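import re
from collections import OrderedDict
from os import sep, getcwd
# pathlib (standard library) is used here in place of the third-party
# `path` package; Path(...).glob("*.txt") behaves the same for this code.
from pathlib import Path

import pandas as pd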

def oneFileSingleRecordParser(textFilePath):
    fileName = textFilePath.rsplit(sep, 1)[-1]
    
    with open(textFilePath, "r") as textFile:
        # The structure is:
        # Yield:
        # Timestamp
        # Angle
        # ErrorCode 10
        # ErrorCode 12
        lines = textFile.readlines()
        
        if len(lines) != 5:
            raise Exception("The file at {} doesn't have a proper single record.".format(textFilePath))
        
        dataDict = OrderedDict()
        dataDict["File Name"] = fileName
        
        for line in lines:
            # regex to extract the key and value name
            matchObject = re.match(r"(\w+\s?\d*):\s(.*)", line.strip())
            
            if matchObject is not None:
                key, value = matchObject.groups()
                dataDict[key] = value
            
        return dict(dataDict)

def oneFileMultiRecordParser(textFilePath):
    fileName = textFilePath.rsplit(sep, 1)[-1]
    
    with open(textFilePath, "r") as textFile:
        # The structure is:
        # Yield_1:
        # Timestamp_1:
        # Angle_1:
        # ErrorCode 10_1:
        # ErrorCode 12_1:
        # Yield_2:
        # Timestamp_2:
        # Angle_2:
        # ErrorCode 10_2:
        # ErrorCode 12_2:
        # ...
        lines = textFile.readlines()
        
        if len(lines) % 5 != 0:
            raise Exception("The file at {} doesn't have a uniform structure.".format(textFilePath))
        
        records = []
        
        dataDict = OrderedDict()
        dataDict["File Name"] = fileName
        
        for index, line in enumerate(lines):
            # regex to extract the key and value name
            matchObject = re.match(r"(\w+\s?\d*):\s(.*)", line.strip())
            
            if matchObject is not None:
                key, value = matchObject.groups()
                dataDict[key] = value
            else:
                raise Exception("Line={}, content=\"{}\" has some formatting issues, regex failed".format(index + 1, line))
            
            if (index + 1) % 5 == 0:
                records.append(dataDict)
                dataDict = OrderedDict() # reset for next iteration
                dataDict["File Name"] = fileName
            
        return records
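
The `(index + 1) % 5 == 0` bookkeeping can also be written by slicing the list of lines into fixed-size chunks. A sketch of that alternative, assuming five lines per record as in the answer (`chunk_records` and `LINES_PER_RECORD` are hypothetical names):

```python
LINES_PER_RECORD = 5  # assumption taken from the answer's five-line records


def chunk_records(lines, size=LINES_PER_RECORD):
    """Split a flat list of lines into consecutive groups of `size` lines."""
    if len(lines) % size != 0:
        raise ValueError("Number of lines is not a multiple of {}".format(size))
    return [lines[start:start + size] for start in range(0, len(lines), size)]


# Ten made-up lines form two records of five lines each.
lines = ["line {}".format(n) for n in range(10)]
chunks = chunk_records(lines)
print(len(chunks))   # 2
print(chunks[1][0])  # line 5
```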

def convertAllFilesToDataFrame(
        parser = oneFileSingleRecordParser, 
        validParserNames = ("oneFileSingleRecordParser", "oneFileMultiRecordParser",)
    ):
    
    if parser.__name__ not in validParserNames:
        raise Exception("Proper parser was not used")
    
    pathToFiles = getcwd()
    textFilePaths = list(map(lambda path: str(path), Path(pathToFiles).glob("*.txt")))
    
    dataDicts = []
    
    for textFilePath in textFilePaths:
        if parser.__name__ == validParserNames[0]:
            dataDicts.append(parser(textFilePath))
        elif parser.__name__ == validParserNames[1]:
            dataDicts.extend(parser(textFilePath))
    
    dataFrame = pd.DataFrame(dataDicts)
    return dataFrame

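Calling convertAllFilesToDataFrame(parser = oneFileMultiRecordParser) will produce a DataFrame with one row per record.

Calling convertAllFilesToDataFrame(parser = oneFileSingleRecordParser) will produce a DataFrame with one row per file.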

The code isn't completely DRY (the two parsers repeat a lot of logic), but cleaning that up would take more time.
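
One way to remove the duplication would be a single parser that always returns a list of records, so that a one-record file is just the special case of a one-element list and the caller can always use `extend`. A rough sketch under the same five-lines-per-record assumption (`parse_records` is a hypothetical name, and unlike the parsers above it ignores blank lines):

```python
import os
import re
import tempfile

import pandas as pd

PATTERN = re.compile(r"(\w+\s?\d*):\s(.*)")


def parse_records(text_file_path, lines_per_record=5):
    """Return a list of dicts, one per group of `lines_per_record` lines."""
    file_name = os.path.basename(text_file_path)
    with open(text_file_path, "r") as text_file:
        # Assumption: blank lines carry no data and can be dropped.
        lines = [line.strip() for line in text_file if line.strip()]

    if len(lines) % lines_per_record != 0:
        raise ValueError("{} doesn't have a uniform structure.".format(text_file_path))

    records = []
    for start in range(0, len(lines), lines_per_record):
        record = {"File Name": file_name}
        for line in lines[start:start + lines_per_record]:
            match = PATTERN.match(line)
            if match is None:
                raise ValueError("Cannot parse line: {!r}".format(line))
            key, value = match.groups()
            record[key] = value
        records.append(record)
    return records


# Demo with a made-up two-record file written to a temporary directory.
sample = (
    "Yield: 0.85\nTimestamp: 2019-05-01 12:30:45\nAngle: 15\n"
    "ErrorCode 10: 0\nErrorCode 12: 1\n"
    "Yield: 0.91\nTimestamp: 2019-05-02 08:00:00\nAngle: 12\n"
    "ErrorCode 10: 2\nErrorCode 12: 0\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w") as handle:
        handle.write(sample)
    df = pd.DataFrame(parse_records(path))

print(df.shape)  # (2, 6)
```

With this shape, the name-based dispatch in convertAllFilesToDataFrame is no longer needed: every file contributes a list, whether it holds one record or many.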
