I have a file that is not in valid CSV format. Its contents look like this:
File:
"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible
`spectrum]] is absorbed through [[photosynthesis]])
"
"TITULO: Albedo SUBTITULO Y PARRAFO: ===Human activities===
Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas
around
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ==U.S. House of Representatives, 1847–1849==
[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle
True to his record, Lincoln professed to friends in 1861 to be ""an old line Whig,
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ===Re-election===
{{Main|1864 United States presidential election}}
[[File:ElectoralCollege1864.svg|thumb|upright=1.3|alt=Map of the
"TITULO: Algeria SUBTITULO Y PARRAFO: ===Research and alternative energy sources===
Algeria has invested an estimated 100 billion dinars towards developing research facilities and
paying researchers.
Ecological anthropology is defined as the ""study of [[cultural adaptation]]s to environments""
"TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields or related disciplines==
{{Col-begin}}
{{Col-break}}
* [[Agricultural biotechnology]]
* [[Agricultural chemistry]]
* [[Agricultural diversification]]
* [[Agricultural education]]
* [[Agricultural economics]]
* [[Agricultural engineering]]
I have this program:
import pandas as pd
data = pd.read_csv('datos_titulos.csv', header = None)
print(data)
I get this error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 3
The resulting DataFrame should look like this:
Title                 Head                           TXT
Albedo                Trees                          Because forests generally have a low ...([[photosynthesis]])
Albedo                Human activities               Human activities (e.g., de...areas around
Abraham Lincoln       U.S. House of..1849            [[File:Abraham Lincoln by... line Whig,
.
.
.
Agricultural science  Fields or related disciplines  {{Col-begin}} {{Col-break}}...* [[Agricultural engineering]]
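That is, the Title column is the TITULO value, Head is the SUBTITULO Y PARRAFO value (the text between the == markers, as in ==this text==), and TXT is everything that follows, up to the next TITULO.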
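IIUC, you can handle this with two regular expressions from the re module. First, iterate over the text file to extract the title and heading fields.
Second, use re.split to collect the text field. This assumes that, although your data is in a messy text format, it still keeps a consistent order: Title > Heading > Text. You will have to clean up the Text column further, but that's part of the fun :)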
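A minimal sketch of that two-pass idea, assuming every record begins with a line containing both TITULO: and SUBTITULO Y PARRAFO:. The names parse_sections, HEADER_FIELDS and HEADER_LINE are hypothetical, the column names follow the table in the question, and the sample is an abbreviated copy of the data shown there.

```python
import re
import pandas as pd

# Header line of a record, e.g.  "TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
# Group 1 is the title, group 2 the heading without its = markers.
HEADER_FIELDS = re.compile(
    r'^"?TITULO:[ \t]*(.*?)[ \t]*SUBTITULO Y PARRAFO:[ \t]*=*[ \t]*(.*?)[ \t]*=*[ \t]*$',
    re.MULTILINE,
)
# The same header line without capture groups, so that split() returns
# only the text found between consecutive headers.
HEADER_LINE = re.compile(r'^"?TITULO:.*SUBTITULO Y PARRAFO:.*$', re.MULTILINE)


def parse_sections(raw):
    # Pass 1: title and heading from every header line.
    fields = HEADER_FIELDS.findall(raw)
    # Pass 2: split on the header lines; element 0 is whatever precedes
    # the first header, so it is dropped.
    texts = [chunk.strip() for chunk in HEADER_LINE.split(raw)[1:]]
    if len(fields) != len(texts):
        raise ValueError("header and text counts differ; input breaks the assumed layout")
    rows = [(title, head, text) for (title, head), text in zip(fields, texts)]
    return pd.DataFrame(rows, columns=["Title", "Head", "TXT"])


# Abbreviated sample taken from the question.
sample = '''"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible
spectrum]] is absorbed through [[photosynthesis]])
"
"TITULO: Albedo SUBTITULO Y PARRAFO: ===Human activities===
Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas
around
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ==U.S. House of Representatives, 1847–1849==
True to his record, Lincoln professed to friends in 1861 to be ""an old line Whig,
'''

df = parse_sections(sample)
print(df[["Title", "Head"]])
```

For the real file, the raw string would come from reading datos_titulos.csv as plain text with open(...).read() instead of pd.read_csv (the file's encoding is not stated in the question, so that part is a guess). The TXT values are only stripped of surrounding whitespace here, so stray quote characters and wiki markup still need cleaning.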