Deleting a block of sentences whose start and end are clearly defined

Posted 2024-10-03 02:48:19


I am using Python 3.6.8.

I have a text file that looks like this:

###
books 22 feb 2017 21 april 2018
books 22 feb 2017 21
22 feb 2017 21 april
feb 2017 21 april 2018
$$$
###
risk true stories people never thought they d dare share
risk true stories people never
true stories people never thought
stories people never thought they
people never thought they d
never thought they d dare
thought they d dare share
$$$
###
everyone hanging out without me mindy kaling non fiction
everyone hanging out without me
hanging out without me mindy
out without me mindy kaling
without me mindy kaling non
me mindy kaling non fiction
$$$

It was generated with:

for line_no, line in enumerate(books):
    tokens = line.split(" ")
    output = list(ngrams(tokens, 5))
    booksWithNGrams.append("###") #Adding start of block
    booksWithNGrams.append(books[line_no]) # Adding original line
    for x in output: # Adding n-grams
        booksWithNGrams.append(' '.join(x))
    booksWithNGrams.append("$$$") # Adding end of block

As you can see, each block of n-grams for a sentence starts with ### and ends with $$$, so the start and end of every block are clearly defined.
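
For readers who want to reproduce the block layout without extra dependencies, here is a minimal sketch. It assumes the `ngrams` call above behaves like a sliding window over the tokens (as NLTK's `ngrams` does); the local `ngrams` stand-in and the `build_blocks` name are hypothetical and not part of the original code.

```python
def ngrams(tokens, n):
    """Return consecutive n-token tuples (stand-in for the ngrams helper used above)."""
    return zip(*(tokens[i:] for i in range(n)))


def build_blocks(books, n=5):
    """Return a flat list of lines: ###, the title, its n-grams, then $$$."""
    lines = []
    for title in books:
        tokens = title.split(" ")
        lines.append("###")   # start of block
        lines.append(title)   # original line
        lines.extend(" ".join(gram) for gram in ngrams(tokens, n))
        lines.append("$$$")   # end of block
    return lines


books = ["books 22 feb 2017 21 april 2018"]
print("\n".join(build_blocks(books)))
```

A title with fewer than `n` tokens produces a block that holds only the title, because the sliding window yields nothing.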

Given a sentence, I want to delete the whole block that contains it. For example, if I input 22 feb 2017 21 april, I want to delete:

###
books 22 feb 2017 21 april 2018
books 22 feb 2017 21
22 feb 2017 21 april
feb 2017 21 april 2018
$$$

How can I do this?


1 Answer

Answer 1 · posted 2024-10-03 02:48:19

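As you said, each block is bounded by ### and $$$. We can treat the text as a sequence of character offsets between these markers and use finditer to locate the block boundaries.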

    import re

    # `text` is assumed to hold the whole file as a single string

    # character offsets where each block starts
    starts = [s.start() for s in re.finditer('###', text)]
    # [0, 105, 349]

    # character offsets where each block ends ($ is special in a regex, so escape it)
    ends = [e.end() for e in re.finditer(re.escape('$$$'), text)]
    # [104, 348, 558]

    # pair the offsets into [start, end] blocks
    nBlocks = [list(pair) for pair in zip(starts, ends)]
    # [[0, 104], [105, 348], [349, 558]]

    # find where the input text belongs (first occurrence only)
    find = '22 feb 2017 21 april'
    where = text.index(find)
    # 10

    # remove the block that contains the match
    cleanText = text
    for start, end in nBlocks:
        if start <= where < end:
            # end + 1 also drops the newline that follows $$$
            cleanText = text[:start] + text[end + 1:]
            break

    print(cleanText)

    ###
    risk true stories people never thought they d dare share
    risk true stories people never
    true stories people never thought
    stories people never thought they
    people never thought they d
    never thought they d dare
    thought they d dare share
    $$$
    ###
    everyone hanging out without me mindy kaling non fiction
    everyone hanging out without me
    hanging out without me mindy
    out without me mindy kaling
    without me mindy kaling non
    me mindy kaling non fiction
    $$$
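
As an alternative to tracking character offsets, the same idea can be sketched line by line. This is only a sketch under stated assumptions: the markers sit on lines of their own, and the sentence must equal a whole line of the block (the offset version above matches any substring and only its first occurrence). The name `remove_block` is hypothetical.

```python
def remove_block(text, sentence, start="###", end="$$$"):
    """Return text without any block that has a line equal to sentence."""
    kept = []
    block = None  # lines of the block being read, or None when outside a block
    for line in text.splitlines():
        if line == start:
            block = [line]            # a new block begins
        elif block is None:
            kept.append(line)         # line outside any block: keep it
        else:
            block.append(line)
            if line == end:           # block is complete: keep or drop it
                if sentence not in block:
                    kept.extend(block)
                block = None
    if block is not None:
        kept.extend(block)            # unterminated last block: keep it
    return "\n".join(kept)


sample = "###\na b c\na b\nb c\n$$$\n###\nx y z\nx y\ny z\n$$$"
print(remove_block(sample, "b c"))
```

Unlike the offset version, this drops every block that contains the sentence, not just the first one, and it returns the text unchanged when nothing matches.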
