Optimizing code that processes multiple txt files (Python 3.6)

Published 2024-09-27 21:32:23


I want to optimize my code. I need to process 150,000 text files, extracting an id from each file's name and keeping only the words from the Detected Text fields. My code works, but it takes a very long time. I need a faster solution, such as running the jobs in parallel, but I don't know how to implement something like that.

My code:


import glob
import os
import pandas as pd

# folder with multiple text files
file_list = glob.glob(os.path.join(os.getcwd(), "text_files", "*.txt"))

corpus = []

for file_path in file_list:
    with open(file_path, encoding="utf8") as f_input:
        pathPre = file_path[:file_path.find(".")]    # strip the extension
        pathFinal = pathPre[pathPre.rfind("\\")+1:]  # keep text after the last backslash
        ID = "ID: " + pathFinal + "\n----------\n"
        corpus.append(ID + f_input.read())


df = pd.DataFrame(corpus) 
df_txt = df[0].str.split('\n', expand=True)
df_txt[0] = df_txt[0].str.partition('_')[0].str.strip()

listOfLists = []

for index, row in df_txt.iterrows():
    detectedTextIter = []
    for i in range(len(row)):  # iterate over every column of the split text
        if row[i] is None:
            continue
        if "ID: " in row[i]:
            detectedTextIter.append(row[i].split("ID: ", 1)[1])
        elif "Detected Text:" in row[i]:
            detectedTextIter.append(row[i].split("Detected Text:", 1)[1])
    listOfLists.append(detectedTextIter)

newDF = pd.DataFrame.from_records(listOfLists)

IDList = []

for index, row in newDF.iterrows():
    ID = row[0]
    IDList.append(ID)

uniqueIDList = list(set(IDList))

# one [id, keyword-list] pair per unique id
keywordListofLists = [[uid, []] for uid in uniqueIDList]

for i in listOfLists:
    IDLookup = i[0]
    words = []
    
    for k in range(1, len(i)):
        words.append(i[k])
    
    for j in range(len(keywordListofLists)):
        if (keywordListofLists[j][0] == IDLookup):
            
            for x in words:
                keywordListofLists[j][1].append(x)

for i in keywordListofLists:
    wordList = i[1]
    uniqueWords = list(set(wordList))
    i[1] = uniqueWords

UniqueKeywordPerUniqueIDList = pd.DataFrame.from_records(keywordListofLists)
UniqueKeywordPerUniqueIDList.columns = ['id','text']
ML_df = (UniqueKeywordPerUniqueIDList.text.apply(pd.Series)
         .merge(UniqueKeywordPerUniqueIDList, left_index=True, right_index=True)
         .drop(['text'], axis=1)
         .melt(id_vars=['id'], value_name='text')
         .drop('variable', axis=1)
         .dropna())

My text file:


Id: 84194a52-6402-41c8-9057-4fd31a9b2cea
Type: LINE
Detected Text: ORPINE
Confidence: 37.295963
Id: dcfeca0e-1dc2-47e7-8abe-6c4a4309b525
Type: LINE
Detected Text: BOAT SOA
Confidence: 91.334778
Id: 69889983-b22a-4d08-bf3b-841bb1303512
Type: LINE
Detected Text: ORPRODUCTS
Confidence: 96.001930
Id: 67c9a54a-f8c0-4217-8764-d288842ad3a1
Type: LINE
Detected Text: PRESH
Confidence: 38.313396
Id: bddd43e2-b284-40fb-ae0b-6d91c3f41cc3
Type: WORD
Detected Text: ORPINE
Confidence: 37.295963
Id: 7f55b98d-e2eb-4e38-a517-79a45aff5a94
Type: WORD
Detected Text: BOAT
Confidence: 99.360634
Id: 52309976-6dcd-4727-98d4-fc24640f98ae
Type: WORD
Detected Text: SOA
Confidence: 83.308922
Id: 50c1f5b2-a2e2-470e-b823-9ed9b8059ea3
Type: WORD
Detected Text: ORPRODUCTS
Confidence: 96.001930
Id: aa7b9b69-8f8e-49e6-bd63-b26e8f7a3d08
Type: WORD
Detected Text: PRESH
Confidence: 38.313396


2 Answers

You can read them using pandas and glob. Take a look at this example:

With the help of glob, we can read all the txt files and put them into one dataframe:

import pandas as pd
from glob import glob

txts = sorted(glob('*.txt'))
df = pd.concat((pd.read_csv(file, sep=',') for file in txts), ignore_index=True)

For extracting the id and the other fields, you can use regex together with nltk.
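As a minimal sketch of the regex approach, assuming each record follows the four-line block format shown in the question (`parse_records` is a hypothetical helper name, not from either answer):

```python
import re

# One record looks like:
# Id: <uuid> / Type: <LINE|WORD> / Detected Text: <text> / Confidence: <float>
RECORD_RE = re.compile(
    r"Id: (?P<id>[0-9a-f-]+)\s*\n"
    r"Type: (?P<type>\w+)\s*\n"
    r"Detected Text: (?P<text>.*)\s*\n"
    r"Confidence: (?P<conf>[\d.]+)"
)

def parse_records(raw):
    """Return a list of dicts, one per Id/Type/Detected Text/Confidence block."""
    return [m.groupdict() for m in RECORD_RE.finditer(raw)]

sample = (
    "Id: 84194a52-6402-41c8-9057-4fd31a9b2cea\n"
    "Type: LINE\n"
    "Detected Text: ORPINE\n"
    "Confidence: 37.295963\n"
)
records = parse_records(sample)
# records[0]["text"] == "ORPINE"
```

The list of dicts can then be handed straight to `pd.DataFrame(records)` for filtering by `type` or tokenizing the text with nltk.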

Simply read the text files directly using : as the separator. Then subset the dataframe to keep only the Detected Text rows. Use assign to add the needed id column, and chain all the calls inside a list comprehension:

import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join(os.getcwd(), "text_files", "*.txt"))

# BUILD LIST OF DFs WITH LIST COMPREHENSION
df_list = [(pd.read_csv(file_path, sep=":", header=None, names=['key', 'text'])  # IMPORT COLON-SEPARATED TEXT FILE
             .query("key=='Detected Text'")                                      # SUBSET DF FOR NEEDED ROWS
             .drop(['key'], axis='columns')                                      # DROP UNNEEDED COLUMN
             .assign(id=os.path.basename(file_path).split('_')[0])               # EXTRACT STRING BEFORE UNDERSCORE
             .reindex(['id', 'text'], axis='columns')                            # RE-ORDER COLUMNS
           ) for file_path in file_list]

final_df = pd.concat(df_list, ignore_index=True)                                 # VERTICALLY STACK ALL DFs
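Since the question also asks about parallel jobs: one possible sketch (not from either answer) runs the per-file parsing in parallel with `concurrent.futures.ProcessPoolExecutor`, on the assumption that every file can be parsed independently. `parse_file` and `parse_all` are hypothetical helper names:

```python
import concurrent.futures
import glob
import os

def parse_file(file_path):
    """Extract (id, detected_text) pairs from one file; the id comes from the filename."""
    file_id = os.path.basename(file_path).split('_')[0]
    rows = []
    with open(file_path, encoding="utf8") as f:
        for line in f:
            if line.startswith("Detected Text:"):
                rows.append((file_id, line.split(":", 1)[1].strip()))
    return rows

def parse_all(file_list, max_workers=None):
    """Parse files in parallel; chunksize batches work to reduce IPC overhead."""
    all_rows = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
        for rows in pool.map(parse_file, file_list, chunksize=100):
            all_rows.extend(rows)
    return all_rows
```

With 150,000 small files the bottleneck is often disk I/O rather than CPU, so a `ThreadPoolExecutor` may work just as well; it is worth measuring both. The resulting list of tuples can go straight into `pd.DataFrame(all_rows, columns=['id', 'text'])`.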
