How do I convert this text into a Pandas DataFrame?

"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees=== Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible `spectrum]] is absorbed through [[photosynthesis]]) " "TITULO: Albedo SUBTITULO Y PARRAFO: ===Human activities=== Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas around "TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ==U.S. House of Representatives, 1847–1849== [[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle True to his record, Lincoln professed to friends in 1861 to be ""an old line Whig, "TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ===Re-election=== {{Main|1864 United States presidential election}} [[File:ElectoralCollege1864.svg|thumb|upright=1.3|alt=Map of the "TITULO: Algeria SUBTITULO Y PARRAFO: ===Research and alternative energy sources=== Algeria has invested an estimated 100 billion dinars towards developing research facilities and paying researchers. Ecological anthropology is defined as the ""study of [[cultural adaptation]]s to environments"" "TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields or related disciplines== {{Col-begin}} {{Col-break}} * [[Agricultural biotechnology]] * [[Agricultural chemistry]] * [[Agricultural diversification]] * [[Agricultural education]] * [[Agricultural economics]] * [[Agricultural engineering]]

Title                 Head                           TXT
Albedo                Trees                          Because forests generally have a low ...([[photosynthesis]])
Albedo                Human activities               Human activities (e.g., de...areas around
Abraham Lincoln       U.S. House of..1849            [[File:Abraham Lincoln by... line Whig,
. . .
Agricultural science  Fields or related disciplines  {{Col-begin}} {{Col-break}}...* [[Agricultural engineering]]

1 answer

User

#1 · posted 2024-09-28 22:19:42

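IIUC, you can do this with two regular expressions from the re module. First, we iterate over the text file line by line to get the title and heading fields.

Second, we use re.split to collect the text field. This relies on the assumption that, although your data is a messy text format, it still keeps a consistent order: title > heading > text.

You will have to clean the text column further, but that's part of the fun :)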

import re
from collections import defaultdict
import pandas as pd

pandas_dict = defaultdict(list)

# Pass 1: pull the title and heading out of every line that has a ===heading===.
pat = re.compile(r"TITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)===")

with open("file.txt", "r") as f:
    for line in f:
        match = pat.search(line)
        if match:
            pandas_dict["title"].append(match.group(1))
            pandas_dict["head"].append(match.group(2))


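# Pass 2: split the whole file on "===" and keep every second chunk from the
# third one on, i.e. the text that follows each heading.
with open("file.txt", "r") as f:
    body = f.read()

b = re.split(r"===", body.strip())

for line in b[2::2]:
    pandas_dict["text"].append(line.strip())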

df = pd.DataFrame(pandas_dict)

print(df)

             title                                     head                                               text
0           Albedo                                    Trees  Because forests generally have a low albedo, (...
1           Albedo                         Human activities  Human activities (e.g., deforestation, farming...
2  Abraham Lincoln                              Re-election  {{Main|1864 United States presidential electio...
3          Algeria  Research and alternative energy sources  Algeria has invested an estimated 100 billion ...
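
The text column above still carries wiki markup such as `{{Main|...}}` and `[[...]]`. As a rough idea of what the cleanup might look like, here is a minimal sketch. The rows are made-up stand-ins for the parsed data, and `clean_wiki` and `text_clean` are hypothetical names, not part of the original answer. It only handles non-nested, properly closed markup, so truncated records (for example an unclosed `[[File:...`) would need extra rules.

```python
import re
import pandas as pd

# Hypothetical rows standing in for the parsed DataFrame.
df = pd.DataFrame({
    "text": [
        "the [[visible spectrum]] is absorbed through [[photosynthesis]]",
        "{{Main|1864 United States presidential election}} Lincoln ran again",
        "study of [[cultural adaptation]]s and [[File:Map.svg|thumb|a map]] here",
    ]
})

def clean_wiki(text: str) -> str:
    # Drop {{templates}} (non-nested only).
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Drop [[File:...]] embeds entirely.
    text = re.sub(r"\[\[File:[^\[\]]*\]\]", "", text)
    # Unwrap links: [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

df["text_clean"] = df["text"].map(clean_wiki)
print(df["text_clean"].tolist())
```

Writing the result to a new column keeps the raw text around, which helps when a rule turns out to be too aggressive.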

print(df[df['title'] == 'Algeria']['text'])

paying researchers.
Ecological anthropology is defined as the ""study of [[cultural adaptation]]s to environments""
"TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields or related disciplines==
{{Col-begin}}
{{Col-break}}
* [[Agricultural biotechnology]]
* [[Agricultural chemistry]]
* [[Agricultural diversification]]
* [[Agricultural education]]
* [[Agricultural economics]]
* [[Agricultural engineering]]"
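
In the output above, the Agricultural science record ends up inside the Algeria row because its heading is wrapped in `==` rather than `===`. One possible alternative, offered as a sketch rather than a definitive fix, is a single pattern that captures title, heading, and text in one pass and accepts any run of two or more `=` signs. The `sample` string is a shortened, made-up stand-in for the real file, and the pattern assumes the literal `TITULO: ` never occurs inside a paragraph.

```python
import re
import pandas as pd

# Hypothetical sample mirroring the question's format; not the full data.
sample = (
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees=== Because forests generally have a low albedo " '
    '"TITULO: Algeria SUBTITULO Y PARRAFO: ===Research=== Algeria has invested in research. '
    '"TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields== * [[Agricultural biotechnology]]"'
)

# The heading is wrapped in two or more '=' signs; \2 makes the closing run
# match the opening one. The text runs lazily up to the next TITULO marker
# (with an optional leading quote) or the end of the input.
record = re.compile(
    r'TITULO: (?P<title>.*?) SUBTITULO Y PARRAFO: '
    r'(={2,})(?P<head>.*?)\2'
    r'(?P<text>.*?)(?="?TITULO: |\Z)',
    re.DOTALL,
)

# groupdict() returns only the named groups, so the '=' run is left out.
rows = [
    {key: value.strip(' "\n') for key, value in m.groupdict().items()}
    for m in record.finditer(sample)
]
df = pd.DataFrame(rows, columns=["title", "head", "text"])
print(df)
```

Because every record is matched as a unit, the three columns cannot drift out of step the way two separate passes can when a line fails to match.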
