How do I read a data.txt text file, sort the data, and then convert it to a DataFrame using Python?

Posted 2024-06-26 01:32:22


I have a text file (.txt) containing data that looks like this:

Yield: 99.7598
Timestamp: 2021/February/13-01:55:04
Angle: 0.00309331
ErrorCode 10: 6
ErrorCode 12: 2 

Now I want to use Python to convert it into a DataFrame that looks like this:

FileName | Yield | TimeStamp | Angle | ErrorCode 10 | ErrorCode 12
xxxxx    | 99.75 | 2021/Feb  | 0.003 | 6            | 2

I tried to write the code as follows:

import os
import pandas as pd

def sortbycode():
    sam_file = open('210107343_summary.txt', 'r')
    sams = []
    for line in sam_file:
        sams.append([i for i in line.strip("\n").split(":")])
    sams.sort(key=lambda x:x[0])
    for sam in sams:
        print("{0:5}|{1:13}".format(*sam))
sortbycode()

This is the output I currently get:

Angle| 0.00309331  
ErrorCode 10| 6           
ErrorCode 12| 2           
Timestamp| 2021/February/13-01
Yield| 99.7598 

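This is not good, because my plan is to build on it and convert it into a DataFrame. I am stuck at the conversion step. Another problem with this output is that the file name is missing.

Could you help me fix this, or point me in the right direction?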


2 answers

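OK, you say each file holds one record, but there are many files. Let's assume you have something that supplies the file names, so that list(filenames()) is a list of the relevant file names.

You should first write a function that builds a dictionary from a file name: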

fieldnames = ['Yield', 'Timestamp', 'Angle', 'ErrorCode 10', 'ErrorCode 12',
              'ErrorCode 13', 'ErrorCode 20']

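def getrecord(filename):
    # Start each record with the name of the file it came from.
    d = {'FileName': filename}
    with open(filename) as fd:
        for line in fd:
            if ':' not in line:
                continue  # skip blank or malformed lines
            # Split on the first colon only; the timestamp value contains colons.
            k, v = [i.strip() for i in line.split(':', 1)]
            if k in fieldnames:
                d[k] = v
    return d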
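The second argument to `split` matters here because the timestamp value itself contains colons. A small sketch using the timestamp line from the question shows the difference, and it also explains why the question's output stopped at `2021/February/13-01` (only the first two pieces were printed):

```python
line = "Timestamp: 2021/February/13-01:55:04"

# Splitting on every colon breaks the time of day into separate pieces.
pieces = line.split(":")
print(pieces)

# Splitting at most once keeps the whole value together.
key, value = (part.strip() for part in line.split(":", 1))
print(key, "->", value)
```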

Now you can build the DataFrame with:

df = pd.DataFrame([getrecord(filename) for filename in filenames()],
                  columns = ['FileName'] + fieldnames)
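For readers who want to try this end to end, here is a minimal sketch of the same approach. Several things in it are assumptions and not part of the answer: `filenames()` is a hypothetical helper that takes a directory and globs for `*_summary.txt`, the second file name is made up, the sample files are written to a temporary directory, and only the base name of each file is stored in `FileName`.

```python
import glob
import os
import tempfile

import pandas as pd

FIELDNAMES = ["Yield", "Timestamp", "Angle", "ErrorCode 10", "ErrorCode 12"]

# Sample content copied from the question.
SAMPLE = """Yield: 99.7598
Timestamp: 2021/February/13-01:55:04
Angle: 0.00309331
ErrorCode 10: 6
ErrorCode 12: 2
"""


def filenames(directory):
    """Hypothetical helper: sorted paths of all *_summary.txt files."""
    return sorted(glob.glob(os.path.join(directory, "*_summary.txt")))


def getrecord(filename):
    """Build one dict per file, keeping only the known field names."""
    record = {"FileName": os.path.basename(filename)}
    with open(filename) as fd:
        for line in fd:
            if ":" not in line:
                continue  # skip blank or malformed lines
            key, value = (part.strip() for part in line.split(":", 1))
            if key in FIELDNAMES:
                record[key] = value
    return record


with tempfile.TemporaryDirectory() as tmp:
    # The second file name is invented for the demonstration.
    for name in ("210107343_summary.txt", "210107344_summary.txt"):
        with open(os.path.join(tmp, name), "w") as fd:
            fd.write(SAMPLE)
    df = pd.DataFrame([getrecord(f) for f in filenames(tmp)],
                      columns=["FileName"] + FIELDNAMES)

print(df)
```

Passing `columns` fixes the column order and leaves a missing value wherever a file lacks one of the fields.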

Latest answer:

Since the OP mentioned that each text file holds only one record, the following solution is suitable:

import os  # needed for os.path.isdir below
import re
from collections import OrderedDict
from os import sep
from pathlib import Path  # Path.glob is used to find the text files

import pandas as pd

def oneFileSingleRecordParser(textFilePath):
    fileName = textFilePath.rsplit(sep, 1)[-1]
    
    with open(textFilePath, "r") as textFile:
        # The structure is:
        # Yield:
        # Timestamp
        # Angle
        # ErrorCode 10
        # ErrorCode 12
        # ErrorCode 16
        # ErrorCode 20
        
        # The error codes can be present or absent
        lines = textFile.readlines()
        
        dataDict = OrderedDict()
        dataDict["File Name"] = fileName
        
        for line in lines:
            matchObject = re.match(r"(\w+\s?\d*):\s(.*)", line.strip())
            
            if matchObject is not None:
                key, value = matchObject.groups()
                dataDict[key] = value
            
        return dict(dataDict)

def convertAllFilesToDataFrame(textFilePathsRoot, parser = oneFileSingleRecordParser):
    if not os.path.isdir(textFilePathsRoot):
        raise Exception("Please pass in a valid path to the root of the text files")

    textFilePaths = list(map(lambda path: str(path), Path(textFilePathsRoot).glob("*.txt")))
    
    dataDicts = []
    
    for textFilePath in textFilePaths:
        dataDicts.append(parser(textFilePath))
    
    dataFrame = pd.DataFrame(dataDicts)
    return dataFrame

convertAllFilesToDataFrame("path/to/your/text/file/directory") should still produce the following output (in my case, I only have two files with exactly the same record):

(image not available)
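Every value produced by the parser is a string, so sorting or rounding needs a type conversion first. The sketch below is one possible way to do that, not part of the answer: the two rows are hand-written stand-ins for parsed files (the second row is invented), and the timestamp format string assumes English month names.

```python
import pandas as pd

# Stand-in for the parsed files; the second row's values are made up.
df = pd.DataFrame([
    {"File Name": "b.txt", "Yield": "99.7598",
     "Timestamp": "2021/February/13-01:55:04",
     "Angle": "0.00309331", "ErrorCode 10": "6", "ErrorCode 12": "2"},
    {"File Name": "a.txt", "Yield": "98.5",
     "Timestamp": "2021/January/02-10:00:00",
     "Angle": "0.0021", "ErrorCode 10": "1"},
])

# Convert the numeric columns; missing or unparsable values become NaN.
numeric_columns = ["Yield", "Angle", "ErrorCode 10", "ErrorCode 12"]
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors="coerce")

# %B is the full month name, which assumes an English locale.
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%Y/%B/%d-%H:%M:%S")

# Sort chronologically and renumber the rows.
df = df.sort_values("Timestamp").reset_index(drop=True)
print(df.dtypes)
```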

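Original answer

Depending on the structure of the text files, there are two cases to handle:

  • A text file contains exactly five lines (one record)
  • A single text file may contain a multiple of 5 lines (multiple records)

Here is how I handle both: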

import re
from collections import OrderedDict
from os import sep, getcwd
from pathlib import Path  # Path.glob is used to find the text files

import pandas as pd

def oneFileSingleRecordParser(textFilePath):
    fileName = textFilePath.rsplit(sep, 1)[-1]
    
    with open(textFilePath, "r") as textFile:
        # The structure is:
        # Yield:
        # Timestamp
        # Angle
        # ErrorCode 10
        # ErrorCode 12
        lines = textFile.readlines()
        
        if len(lines) != 5:
            raise Exception("The file at {} doesn't have a proper single record.".format(textFilePath))
        
        dataDict = OrderedDict()
        dataDict["File Name"] = fileName
        
        for line in lines:
            # regex to extract the key and value name
            matchObject = re.match(r"(\w+\s?\d*):\s(.*)", line.strip())
            
            if matchObject is not None:
                key, value = matchObject.groups()
                dataDict[key] = value
            
        return dict(dataDict)

def oneFileMultiRecordParser(textFilePath):
    fileName = textFilePath.rsplit(sep, 1)[-1]
    
    with open(textFilePath, "r") as textFile:
        # The structure is:
        # Yield_1:
        # Timestamp_1:
        # Angle_1:
        # ErrorCode 10_1:
        # ErrorCode 12_1:
        # Yield_2:
        # Timestamp_2:
        # Angle_2:
        # ErrorCode 10_2:
        # ErrorCode 12_2:
        # ...
        lines = textFile.readlines()
        
        if len(lines) % 5 != 0:
            raise Exception("The file at {} doesn't have a uniform structure.".format(textFilePath))
        
        records = []
        
        dataDict = OrderedDict()
        dataDict["File Name"] = fileName
        
        for index, line in enumerate(lines):
            # regex to extract the key and value name
            matchObject = re.match(r"(\w+\s?\d*):\s(.*)", line.strip())
            
            if matchObject is not None:
                key, value = matchObject.groups()
                dataDict[key] = value
            else:
                raise Exception("Line={}, content=\"{}\" has some formatting issues, regex failed".format(index + 1, line))
            
            if (index + 1) % 5 == 0:
                records.append(dataDict)
                dataDict = OrderedDict() # reset for next iteration
                dataDict["File Name"] = fileName
            
        return records

def convertAllFilesToDataFrame(
        parser = oneFileSingleRecordParser, 
        validParserNames = ("oneFileSingleRecordParser", "oneFileMultiRecordParser",)
    ):
    
    if parser.__name__ not in validParserNames:
        raise Exception("Proper parser was not used")
    
    pathToFiles = getcwd()
    textFilePaths = list(map(lambda path: str(path), Path(pathToFiles).glob("*.txt")))
    
    dataDicts = []
    
    for textFilePath in textFilePaths:
        if parser.__name__ == validParserNames[0]:
            dataDicts.append(parser(textFilePath))
        elif parser.__name__ == validParserNames[1]:
            dataDicts.extend(parser(textFilePath))
    
    dataFrame = pd.DataFrame(dataDicts)
    return dataFrame

convertAllFilesToDataFrame(parser = oneFileMultiRecordParser) produces: (image not available)

convertAllFilesToDataFrame(parser = oneFileSingleRecordParser) produces: (image not available)

The code is not exactly DRY (the two parsers repeat each other), but cleaning it up would take more time.
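One possible way to remove the duplication, offered as a sketch and not as the answer's method, is a single parser that starts a new record whenever a key repeats. That replaces the fixed five-line assumption with a different rule, so it also copes with optional error codes. The name `parse_records` and the sample text are made up for illustration; the regex is the one used above.

```python
import re

# Same pattern as in the parsers above: "key: value" with an optional number.
LINE_PATTERN = re.compile(r"(\w+\s?\d*):\s(.*)")


def parse_records(text, file_name):
    """Return a list of dicts; a repeated key marks the start of a new record."""
    records = []
    current = {"File Name": file_name}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        key, value = match.groups()
        if key in current:
            # Seeing a key again means the previous record is complete.
            records.append(current)
            current = {"File Name": file_name}
        current[key] = value
    if len(current) > 1:  # something other than the file name was collected
        records.append(current)
    return records


# Invented sample: two records, the second without "ErrorCode 12".
sample = """Yield: 99.7598
Timestamp: 2021/February/13-01:55:04
Angle: 0.00309331
ErrorCode 10: 6
ErrorCode 12: 2
Yield: 98.5
Timestamp: 2021/February/14-02:00:00
Angle: 0.0021
ErrorCode 10: 1
"""

records = parse_records(sample, "demo.txt")
print(records)
```

The resulting list can be passed to `pd.DataFrame(records)` directly, and a file with a single record simply yields a one-element list.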
