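I want to optimize my code. I need to process 150,000 text files and extract an id from each file name. I also need to extract only the words from the Detected Text fields. My code works, but it takes a very long time. I need a faster solution, such as parallel jobs or something similar, but I don't know how to implement one.
My code: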
import glob
import os

import pandas as pd

# folder with multiple text files
file_list = glob.glob(os.path.join(os.getcwd(), "text_files", "*.txt"))

# read every file and prepend an "ID: <file name>" header to its content
corpus = []
for file_path in file_list:
    with open(file_path, encoding="utf8") as f_input:
        pathPre = file_path[:file_path.find(".")]
        pathFinal = pathPre[pathPre.rfind("\\")+1:]
        ID = "ID: " + pathFinal + "\n----------\n"
        corpus.append(ID + f_input.read())

# one row per file, one column per line of text
df = pd.DataFrame(corpus)
df_txt = df[0].str.split('\n', expand=True)
df_txt[0] = df_txt[0].str.partition('_')[0].str.strip()

# for each file, collect its id followed by every "Detected Text" value
listOfLists = []
for index, row in df_txt.iterrows():
    detectedTextIter = []
    for i in range(12115):  # column count is hard-coded in the original post
        if row[i] is None:
            continue
        if "ID: " in row[i]:
            ID = row[i].split("ID: ", 1)[1]
            detectedTextIter.append(ID)
        elif "Detected Text:" in row[i]:
            detectedText = row[i].split("Detected Text:", 1)[1]
            detectedTextIter.append(detectedText)
    listOfLists.append(detectedTextIter)
newDF = pd.DataFrame.from_records(listOfLists)

# list of unique ids
IDList = []
for index, row in newDF.iterrows():
    ID = row[0]
    IDList.append(ID)  # the original had IDlist here, which raises a NameError
uniqueIDList = list(set(IDList))

# one [id, keyword list] pair per unique id
keywordListofLists = []
for i in uniqueIDList:
    newList = []
    newList.append(i)
    keywordList = []
    newList.append(keywordList)
    keywordListofLists.append(newList)

# append each file's words to the pair with the matching id
for i in listOfLists:
    IDLookup = i[0]
    words = []
    for k in range(1, len(i)):
        words.append(i[k])
    for j in range(len(keywordListofLists)):
        if keywordListofLists[j][0] == IDLookup:
            for x in words:
                keywordListofLists[j][1].append(x)

# keep only the unique words per id
for i in keywordListofLists:
    wordList = i[1]
    uniqueWords = list(set(wordList))
    i[1] = uniqueWords

UniqueKeywordPerUniqueIDList = pd.DataFrame.from_records(keywordListofLists)
UniqueKeywordPerUniqueIDList.columns = ['id', 'text']

# reshape to long format: one (id, text) row per unique keyword
ML_df = (
    UniqueKeywordPerUniqueIDList.text.apply(pd.Series)
    .merge(UniqueKeywordPerUniqueIDList, left_index=True, right_index=True)
    .drop(['text'], axis=1)
    .melt(id_vars=['id'], value_name='text')
    .drop("variable", axis=1)
    .dropna()
)
My text files:
Id: 84194a52-6402-41c8-9057-4fd31a9b2cea
Type: LINE
Detected Text: ORPINE
Confidence: 37.295963
Id: dcfeca0e-1dc2-47e7-8abe-6c4a4309b525
Type: LINE
Detected Text: BOAT SOA
Confidence: 91.334778
Id: 69889983-b22a-4d08-bf3b-841bb1303512
Type: LINE
Detected Text: ORPRODUCTS
Confidence: 96.001930
Id: 67c9a54a-f8c0-4217-8764-d288842ad3a1
Type: LINE
Detected Text: PRESH
Confidence: 38.313396
Id: bddd43e2-b284-40fb-ae0b-6d91c3f41cc3
Type: WORD
Detected Text: ORPINE
Confidence: 37.295963
Id: 7f55b98d-e2eb-4e38-a517-79a45aff5a94
Type: WORD
Detected Text: BOAT
Confidence: 99.360634
Id: 52309976-6dcd-4727-98d4-fc24640f98ae
Type: WORD
Detected Text: SOA
Confidence: 83.308922
Id: 50c1f5b2-a2e2-470e-b823-9ed9b8059ea3
Type: WORD
Detected Text: ORPRODUCTS
Confidence: 96.001930
Id: aa7b9b69-8f8e-49e6-bd63-b26e8f7a3d08
Type: WORD
Detected Text: PRESH
Confidence: 38.313396
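You can read them with pandas and glob.
With the help of glob, we can read all the txt files and put them into a single DataFrame.
For extracting the id and the other values, you can use regex and nltk.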
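The regex suggestion, combined with the parallel processing the question asks about, could look roughly like the sketch below. It is not the answerer's code: the helper names (list_txt_files, file_id, parse_file, collect) and the max_workers value are made up for illustration, and the id rule (file name stem up to the first underscore) is an assumption that mirrors the partition('_') call in the question. A thread pool is used because the work is mostly file reading; concurrent.futures.ProcessPoolExecutor offers the same map interface if parsing turns out to dominate.

```python
import glob
import os
import re
from concurrent.futures import ThreadPoolExecutor

# Matches lines such as "Detected Text: BOAT SOA" and captures the value.
DETECTED_RE = re.compile(r"^Detected Text:[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def list_txt_files(folder):
    """Return all .txt files in folder, sorted for reproducibility."""
    return sorted(glob.glob(os.path.join(folder, "*.txt")))


def file_id(path):
    """Id from the file name: the stem up to the first underscore (assumed convention)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.partition("_")[0].strip()


def parse_file(path):
    """Return (id, list of detected texts) for one file, without any DataFrame."""
    with open(path, encoding="utf8") as f:
        return file_id(path), DETECTED_RE.findall(f.read())


def collect(paths, max_workers=8):
    """Parse files concurrently and merge the unique detected texts per id."""
    texts_by_id = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for fid, texts in pool.map(parse_file, paths):
            texts_by_id.setdefault(fid, set()).update(texts)
    return texts_by_id
```

Calling collect(list_txt_files("text_files")) returns a dict that maps each id to a set of detected texts. Like the question's code, this keeps the values of both LINE and WORD records; it avoids the wide intermediate DataFrame and the nested lookup loops, which are the likely slow parts.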
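Just read each text file directly, using a colon (:) as the separator. Then subset the DataFrame to keep only the Detected Text rows. Use assign to add the required column. Wrap everything in a list, using chained calls.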
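The code for this answer did not survive, so the following is only a sketch of what those steps might look like. The function names, the column names key and text, and the final pd.concat plus drop_duplicates step are assumptions, as is the id rule (file name stem up to the first underscore, as in the question). It also assumes that no value in the files contains a colon, since a second colon would give read_csv more fields than expected.

```python
import csv
import glob
import os

import pandas as pd


def list_txt_files(folder):
    """Return all .txt files in folder, sorted for reproducibility."""
    return sorted(glob.glob(os.path.join(folder, "*.txt")))


def read_detected_text(path):
    """Read one file as key/value rows and keep only the 'Detected Text' rows."""
    stem = os.path.splitext(os.path.basename(path))[0]
    file_id = stem.partition("_")[0]  # assumed id convention
    return (
        pd.read_csv(
            path,
            sep=":",                 # "key: value" lines split on the colon
            header=None,
            names=["key", "text"],
            dtype=str,               # keep confidence values etc. as plain strings
            skipinitialspace=True,   # drop the space after the colon
            quoting=csv.QUOTE_NONE,  # treat quote characters in OCR text literally
            encoding="utf8",
        )
        .loc[lambda d: d["key"].eq("Detected Text"), ["text"]]
        .assign(id=file_id)
    )


def build_frame(paths):
    """Concatenate the per-file frames into one long (id, text) table."""
    frames = [read_detected_text(p) for p in paths]
    combined = pd.concat(frames, ignore_index=True)[["id", "text"]]
    return combined.drop_duplicates().reset_index(drop=True)
```

build_frame(list_txt_files("text_files")) yields one row per unique (id, text) pair, which matches the shape of ML_df in the question. Note that pd.concat raises a ValueError when the list of paths is empty.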