Splitting a sentence into sub-sentences with Python's re package

Published 2024-05-20 15:27:51


I have data containing many sentences. Taking the sentence below as an example, I want to split it into two sub-sentences:

Both whole plasma and the d < 1.006 g/ml density fraction of plasma from 2/2 mice show this broad beta-migration pattern (Fig. 1 B) |T:**1SP3E3| ; |I:**1SP3E3| |L:**1SP3E3| in contrast, 3/3 plasma shows virtually no lipid staining at the beta-position. |T:**1SN3E3| |I:**1SN3E3| |L:**1SN3E3|

It should be split into:

Both whole plasma and the d < 1.006 g/ml density fraction of plasma from 2/2 mice show this broad beta-migration pattern (Fig. 1 B)

and

in contrast, 3/3 plasma shows virtually no lipid staining at the beta-position.

My code is:

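import re

newData = []
for item in Data:
    # Split on runs of |...| tags; item[0] holds the tagged sentence.
    test2 = re.split(r" (?:\|.*?\| ?)+", item[0])
    test2 = test2[:-1]  # drop the empty string left after the trailing tags
    for tx in test2:
        newData.append(tx)
print(len(newData))
print(newData)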

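However, the result contains 3 items, one of which is ";". I checked the original sentence and found that the ";" sits between two tags, in "|T:**1SP3E3| ; |I:**1SP3E3|", so I need to remove this ";" from the result. I changed the code to: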

test2= re.split(r" (?:\|.*?\| ?;?)+", item[0])

But I still can't get the correct result. Can anyone help? Thanks.


3 Answers

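import re

strings = [s.strip() for s in re.sub(r'\|\w:\*\*\w+\|', '', string).split(';')]

Output:

['Both whole plasma and the d < 1.006 g/ml density fraction of plasma from 2/2 mice show this broad beta-migration pattern (Fig. 1 B)', 'in contrast, 3/3 plasma shows virtually no lipid staining at the beta-position.']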

import re
x="""Both whole plasma and the d < 1.006 g/ml density fraction of plasma from 2/2 mice show this broad beta-migration pattern (Fig. 1 B) |T:**1SP3E3| ; |I:**1SP3E3| |L:**1SP3E3| in contrast, 3/3 plasma shows virtually no lipid staining at the beta-position. |T:**1SN3E3| |I:**1SN3E3| |L:**1SN3E3|"""
print([i for i in re.split(r"(?:\|[^:]*:.*?\|(?:[\s;]+|$))+", x) if i])

The output is:

['Both whole plasma and the d < 1.006 g/ml density fraction of plasma from 2/2 mice show this broad beta-migration pattern (Fig. 1 B) ', 'in contrast, 3/3 plasma shows virtually no lipid staining at the beta-position. ']
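
For readers who find the one-line pattern hard to parse, here is the same regular expression written with re.VERBOSE and wrapped in a small helper. This is only a sketch: the names TAG_RUN and split_on_tags are made up for illustration, and the extra strip() call (which removes the trailing spaces seen in the output above) is an addition, not part of the answer.

```python
import re

# The pattern from the answer above, spelled out piece by piece.
# TAG_RUN and split_on_tags are hypothetical names used for illustration.
TAG_RUN = re.compile(r"""
    (?:                 # one delimiter unit:
        \|[^:]*:.*?\|   #   a tag such as |T:**1SP3E3|
        (?:[\s;]+|$)    #   then whitespace/semicolons, or the end of the string
    )+                  # consecutive units are treated as a single delimiter
""", re.VERBOSE)


def split_on_tags(sentence):
    """Split on runs of tags, strip each piece, and drop empty pieces."""
    return [part.strip() for part in TAG_RUN.split(sentence) if part.strip()]


sentence = (
    "Both whole plasma and the d < 1.006 g/ml density fraction of plasma "
    "from 2/2 mice show this broad beta-migration pattern (Fig. 1 B) "
    "|T:**1SP3E3| ; |I:**1SP3E3| |L:**1SP3E3| in contrast, 3/3 plasma shows "
    "virtually no lipid staining at the beta-position. "
    "|T:**1SN3E3| |I:**1SN3E3| |L:**1SN3E3|"
)
print(split_on_tags(sentence))
```

Because the ";" is consumed by [\s;]+ as part of the delimiter, it never shows up as a separate item, which is what the pattern in the question was missing.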

[i.strip() for i in re.sub(r'\|\w:\*\*\w*\|', '', re.sub(r' +', r' ', s.strip())).split(';')]

returns

['Both whole plasma and the d < 1.006 g/ml density fraction of plasma from 2/2 mice show this broad beta-migration pattern (Fig. 1 B)', 'in contrast, 3/3 plasma shows virtually no lipid staining at the beta-position.']

Take this with a grain of salt, though, since it depends on whether the rest of your text is consistent with your example.
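
If the tag layout varies more than in the example, one possible way to depend less on it is to treat any run of tags, semicolons, and surrounding whitespace as a single delimiter. The sketch below is not taken from the answers; it assumes every tag is enclosed in a pair of "|" characters and that neither "|" nor ";" ever appears inside the sentence text itself. The names DELIMITER and split_tagged are hypothetical.

```python
import re

# Assumption: tags always look like |...| and the real text never contains
# '|' or ';'. If either can occur in a sentence, this will split too eagerly.
DELIMITER = re.compile(r"(?:\s*(?:\|[^|]*\||;))+\s*")


def split_tagged(text):
    """Split text on runs of |...| tags and semicolons, dropping empty pieces."""
    return [piece for piece in DELIMITER.split(text) if piece]


example = (
    "pattern (Fig. 1 B) |T:**1SP3E3| ; |I:**1SP3E3| |L:**1SP3E3| "
    "in contrast, no staining. |T:**1SN3E3| |I:**1SN3E3| |L:**1SN3E3|"
)
print(split_tagged(example))
```

Since the delimiter also swallows the whitespace around each run, the pieces come back already trimmed on the side that touches a tag, and the tag contents are never inspected, so tags with a different prefix or payload are handled the same way.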
