Extracting sub-text by topic from a text using Python (data cleaning/extraction/processing)

Posted 2024-05-05 09:48:59


Consider Text 1:

What is Lorem Ipsum:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Where does it come from:
Contrary to popular belief, Lorem Ipsum is not simply random text.

Why do we use it:
It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.

Text 2:

What is Lorem Ipsum:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Other Topic:
There are many variations of passages of Lorem Ipsum available.

Why do we use it:
It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.

Text 3:

What is Lorem Ipsum:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Where does it come from:
Contrary to popular belief, Lorem Ipsum is not simply random text.

Some other topic:
Various versions have evolved over the years.

I can process this text with Python and extract everything between a start string and an end string. The code I use:

# This code is run once separately for each text variation 
import sys
s = "text1 or text2 or text3" # one at a time
start_String = s.find("What is Lorem Ipsum:")
end_String = s.find("Why do we use it:")
if start_String == -1 or end_String == -1:
    print("Not found")
    sys.exit(0)
print(s[start_String:end_String])

But my requirement is different. I only need the text that belongs to "What is Lorem Ipsum:", "Where does it come from:" and "Why do we use it:".

Expected result:
Text 1:

What is Lorem Ipsum:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Where does it come from:
Contrary to popular belief, Lorem Ipsum is not simply random text.

Why do we use it:
It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.

Text 2:

What is Lorem Ipsum:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Why do we use it:
It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.

Text 3:

What is Lorem Ipsum:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Where does it come from:
Contrary to popular belief, Lorem Ipsum is not simply random text.

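I have collected texts like the ones above in a huge dataset. All I need to do is extract only the sub-texts for the required topics. How can I do this in Python? I hope this makes sense.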


1 answer
User
#1 · Posted 2024-05-05 09:48:59

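Something like this should get you there. It splits each text at every ":" and collects the pieces in one list: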

my_list=["""What is Lorem Ipsum:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Where does it come from:
Contrary to popular belief, Lorem Ipsum is not simply random text.

Why do we use it:
It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.""","""What is Lorem Ipsum:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Why do we use it:
It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.""","""What is Lorem Ipsum:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Where does it come from:
Contrary to popular belief, Lorem Ipsum is not simply random text."""]


new_list = []  # collects the pieces from every text

for text in my_list:
    # Split at every ":" and append the resulting pieces to new_list
    new_list.extend(text.split(":"))
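
Note that splitting on ":" alone leaves each body joined to the heading that follows it, and it does not drop the unwanted topics. A minimal sketch of one way to keep only the required topics is below. It assumes that every heading is a whole line ending in a colon, and the names `WANTED_TOPICS`, `HEADING_RE` and `extract_topics` are made up for this illustration:

```python
import re

# Topics to keep, taken from the question; adjust for your own data.
WANTED_TOPICS = (
    "What is Lorem Ipsum:",
    "Where does it come from:",
    "Why do we use it:",
)

# Assumption: a heading is a complete line whose last character is a colon.
HEADING_RE = re.compile(r"^.*:[ \t]*$", re.MULTILINE)


def extract_topics(text, wanted=WANTED_TOPICS):
    """Return only the sections of `text` whose heading is in `wanted`."""
    headings = list(HEADING_RE.finditer(text))
    kept = []
    for i, match in enumerate(headings):
        # A section runs from its heading up to the next heading (or the end).
        end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
        if match.group().strip() in wanted:
            kept.append(text[match.start():end].strip())
    return "\n\n".join(kept)


sample = """What is Lorem Ipsum:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Other Topic:
There are many variations of passages of Lorem Ipsum available.

Why do we use it:
It is a long established fact that a reader will be distracted."""

# Prints the first and third sections only; "Other Topic:" is dropped.
print(extract_topics(sample))
```

For a large dataset the same function could be applied to each text in turn, for example `[extract_topics(t) for t in my_list]`. If a body line can itself end with a colon, the heading pattern would need to be stricter, for instance by matching only the known topic strings.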
