How to get the start and end offsets of paragraphs in a text file in Python 3

Posted 2024-10-01 00:23:31


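I am trying to get the start and end offsets of the paragraphs in a text file in Python. I tried the code below, which does give start and end offsets, but if a paragraph begins with a space or a tab, it is not treated as a paragraph.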

  import re

  paraStartOffset = []
  paraEndOffset = []

  # textFile holds the contents of the file as a single string
  for match in re.finditer(r'(?s)((?:[^\n]?)+)', textFile):
      paraStartOffset.append(match.start())
      paraEndOffset.append(match.end())

  print("start Offset --> ", paraStartOffset)
  print("end Offset --> ", paraEndOffset)

Can someone tell me what I am missing? Thanks.


1 answer
User
#1 · Posted 2024-10-01 00:23:31

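I think this question / answer covers basically what you are looking for. When I test the code (taken from that answer) on paragraphs that also begin with leading whitespace, it almost works.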

import re
# DATA holds the whole file as a single string
for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
    print(match.start(), match.end())

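When I run it on my test text (taken from Bram Stoker's Dracula), it returns the results below. The first paragraph is a standard one, the second starts with spaces, and the third starts with a TAB.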

Results: (the start and end offset of each paragraph)

(output block missing from the source)

Test text: (I couldn't reproduce the original formatting exactly, but anyway...)

_3 May. Bistritz._ Left Munich at 8:35 P. M., on 1st May, arriving at
Vienna early next morning; should have arrived at 6:46, but train was an
hour late. Buda-Pesth seems a wonderful place, from the glimpse which I
got of it from the train and the little I could walk through the
streets. I feared to go very far from the station, as we had arrived
late and would start as near the correct time as possible. The
impression I had was that we were leaving the West and entering the
East; the most western of splendid bridges over the Danube, which is
here of noble width and depth, took us among the traditions of Turkish
rule.

  "My Friend. Welcome to the Carpathians. I am anxiously expecting
you. Sleep well to-night. At three to-morrow the diligence will
start for Bukovina; a place on it is kept for you. At the Borgo
Pass my carriage will await you and will bring you to me. I trust
that your journey from London has been a happy one, and that you
will enjoy your stay in my beautiful land.

    Just before I was leaving, the old lady came up to my room and said in a
very hysterical way:
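
As a follow-up to the pattern above, here is a minimal Python 3 sketch of one way to handle paragraphs that start with spaces or a tab. It is not taken from the linked answer: the names PARAGRAPH_RE, paragraph_offsets and sample are made up for illustration. It assumes that paragraphs are separated by at least one blank line (empty or whitespace-only) and that lines end with "\n". Unlike the pattern above, which includes a single trailing newline in each match, this one stops at the end of the paragraph's last line.

```python
import re

# Hypothetical helper, not from the linked answer.
# A paragraph is a run of consecutive lines that each contain at least one
# non-whitespace character. [^\S\n]* matches indentation (any whitespace
# except a newline), so leading spaces or tabs stay inside the match.
PARAGRAPH_RE = re.compile(r"^[^\S\n]*\S.*(?:\n[^\S\n]*\S.*)*", re.MULTILINE)


def paragraph_offsets(text):
    """Return a list of (start, end) pairs, one per paragraph in text."""
    return [(m.start(), m.end()) for m in PARAGRAPH_RE.finditer(text)]


# Small made-up sample: a plain paragraph, one indented with spaces,
# and one indented with a tab.
sample = (
    "First paragraph,\n"
    "second line.\n"
    "\n"
    "  Indented with spaces.\n"
    "\n"
    "\tIndented with a tab.\n"
)

for start, end in paragraph_offsets(sample):
    # sample[start:end] is the paragraph text, indentation included
    print(start, end, repr(sample[start:end]))
```

To use it on a real file, read the whole file into one string first (for example with open(path, encoding="utf-8") and read()) and pass that string in. The offsets are then character positions in the decoded string, which are not necessarily the same as byte positions in the file.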
