How can I convert this text into a Pandas DataFrame?

Posted 2024-09-28 22:19:42

I have a file that is not in CSV format. Its contents look like this:

File:

"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible 
spectrum]] is absorbed through [[photosynthesis]])
"

"TITULO: Albedo SUBTITULO Y PARRAFO: ===Human activities===
Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas 
around 
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ==U.S. House of Representatives, 1847–1849==
[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle 
True to his record, Lincoln professed to friends in 1861 to be ""an old line Whig,
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ===Re-election===
{{Main|1864 United States presidential election}}
[[File:ElectoralCollege1864.svg|thumb|upright=1.3|alt=Map of the 
"TITULO: Algeria SUBTITULO Y PARRAFO: ===Research and alternative energy sources===
Algeria has invested an estimated 100 billion dinars towards developing research facilities and 
paying researchers. 
Ecological anthropology is defined as the ""study of [[cultural adaptation]]s to environments""
"TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields or related disciplines==
{{Col-begin}}
{{Col-break}}
* [[Agricultural biotechnology]]
* [[Agricultural chemistry]]
* [[Agricultural diversification]]
* [[Agricultural education]]
* [[Agricultural economics]]
* [[Agricultural engineering]]

I have this program:

import pandas as pd

data = pd.read_csv('datos_titulos.csv', header = None)
print(data)

I get this error:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 3

The DataFrame must look like this:

Title                  Head                          TXT
Albedo                 Trees                         Because forests generally have a low  ...([[photosynthesis]])
Albedo                 Human activities              Human activities (e.g., de...areas around 
Abraham Lincoln        U.S. House of..1849           [[File:Abraham Lincoln by... line Whig,
.
.
.
Agricultural science  Fields or related disciplines  {{Col-begin}} {{Col-break}}...* [[Agricultural engineering]]

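That is, the Title column is the TITULO value; Head is the text between the == markers after SUBTITULO Y PARRAFO; and TXT is the text that follows, up to the next TITULO.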


1 Answer
User
#1 · Posted 2024-09-28 22:19:42

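IIUC, you can use two regular expressions from the re module. First, we iterate over your text file line by line to get the title and head fields.

Second, we use re.split to collect the text field. This relies on the assumption that, although your data is messy free text, it still keeps the order title > head > text.

You will have to clean the text column further, but that's part of the fun :)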
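To make the second step more concrete, here is a small sketch of how re.split behaves when the separator appears on both sides of every heading. The two-record `sample` string is made up for illustration and is not the real file.

```python
import re

# Made-up sample with the same shape as the file (an assumption for this demo).
sample = (
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===\n'
    "Text about trees\n"
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Human activities===\n'
    "Text about people\n"
)

pieces = re.split(r"===", sample.strip())

# The pieces alternate: prefix, heading, body, heading, body, ...
heads = pieces[1::2]                        # every heading
texts = [p.strip() for p in pieces[2::2]]   # the body after each heading

print(heads)
print(texts)
```

Note that every body except the last one still ends with the `"TITULO: ...` prefix of the following record, which is one reason the text column needs cleaning afterwards.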

import re
from collections import defaultdict

import pandas as pd

pandas_dict = defaultdict(list)

# Only "===" headings match this pattern; "==" headings are skipped.
pat = re.compile(r"TITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)===")

# First pass: collect title and head from each header line.
with open("file.txt", "r") as f:
    for line in f:
        match = pat.search(line)
        if match:
            pandas_dict["title"].append(match.group(1))
            pandas_dict["head"].append(match.group(2))

# Second pass: split the whole file on "===" and keep every second piece
# after the first heading, i.e. the text that follows each heading.
with open("file.txt", "r") as f:
    body = f.read()

b = re.split(r"===", body.strip())

for chunk in b[2::2]:
    pandas_dict["text"].append(chunk.strip())

df = pd.DataFrame(pandas_dict)

print(df)

             title                                     head                                               text
0           Albedo                                    Trees  Because forests generally have a low albedo, (...
1           Albedo                         Human activities  Human activities (e.g., deforestation, farming...
2  Abraham Lincoln                              Re-election  {{Main|1864 United States presidential electio...
3          Algeria  Research and alternative energy sources  Algeria has invested an estimated 100 billion ...
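The output has only four rows because the pattern matches `===` headings only, so records whose heading is wrapped in `==` are skipped and their content ends up inside the preceding row's text. One possible way around this is to locate every header line in a single pass and slice the text between consecutive headers. This is a minimal sketch, assuming every record starts with a `TITULO: ... SUBTITULO Y PARRAFO:` line whose heading is wrapped in the same number of `=` signs on both sides; `HEADER` and `parse_records` are hypothetical names.

```python
import re

# One header line per record; accepts "==" or "===" (assumed layout).
HEADER = re.compile(
    r'^"?TITULO: (?P<title>.*?) SUBTITULO Y PARRAFO: '
    r"(?P<eq>={2,})(?P<head>.*?)(?P=eq)[ \t]*$",
    re.MULTILINE,
)


def parse_records(raw):
    """Return one dict per record with title, head and text keys."""
    matches = list(HEADER.finditer(raw))
    rows = []
    for i, m in enumerate(matches):
        # A record's text runs up to the next header, or to the end of the file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        rows.append(
            {
                "title": m["title"],
                "head": m["head"],
                "text": raw[m.end():end].strip(),
            }
        )
    return rows
```

Passing the result to `pd.DataFrame(parse_records(body))` then gives the title, head and text columns. Stray quote characters are left in the text on purpose, so cleaning stays a separate step.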

print(df.loc[df['title'] == 'Algeria', 'text'].iloc[0])

paying researchers.
Ecological anthropology is defined as the ""study of [[cultural adaptation]]s to environments""
"TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields or related disciplines==
{{Col-begin}}
{{Col-break}}
* [[Agricultural biotechnology]]
* [[Agricultural chemistry]]
* [[Agricultural diversification]]
* [[Agricultural education]]
* [[Agricultural economics]]
* [[Agricultural engineering]]"
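The further cleaning of the text column could start from something like the following. It is a minimal sketch, assuming that the doubled quotes (`""`) are CSV-style escapes for a single quote and that a quote sitting alone on the last line is a leftover field terminator; `clean_text` is a hypothetical helper, not part of pandas.

```python
import re


def clean_text(text):
    """Tidy one raw text cell (assumed CSV-style quoting)."""
    # Drop a closing quote left alone on the final line.
    text = re.sub(r'\n"\s*$', "", text)
    # Undo CSV-style escaping, where "" stands for a single ".
    text = text.replace('""', '"')
    # Collapse line breaks and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

It can be applied to the whole column with `df["text"] = df["text"].map(clean_text)`. The wiki markup (`[[...]]`, `{{...}}`) is left untouched.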
