
Delete a block of sentences whose start and end are clearly defined


千巷猫影 2022-06-22 17:48:52
I am using Python 3.6.8. I have a text file like this:

    ###
    books 22 feb 2017 21 april 2018
    books 22 feb 2017 21
    22 feb 2017 21 april
    feb 2017 21 april 2018
    $$$
    ###
    risk true stories people never thought they d dare share
    risk true stories people never
    true stories people never thought
    stories people never thought they
    people never thought they d
    never thought they d dare
    thought they d dare share
    $$$
    ###
    everyone hanging out without me mindy kaling non fiction
    everyone hanging out without me
    hanging out without me mindy
    out without me mindy kaling
    without me mindy kaling non
    me mindy kaling non fiction
    $$$

I generate it with:

    for line_no, line in enumerate(books):
        tokens = line.split(" ")
        output = list(ngrams(tokens, 5))
        booksWithNGrams.append("###")  # Adding start of block
        booksWithNGrams.append(books[line_no])  # Adding original line
        for x in output:  # Adding n-grams
            booksWithNGrams.append(' '.join(x))
        booksWithNGrams.append("$$$")  # Adding end of block

As you can see, each sentence together with its n-grams starts with ### and ends with $$$, so the start and end of every block are clearly defined.

Given a sentence, I want to delete the block it belongs to. For example, if I input 22 feb 2017 21 april, I want to delete:

    ###
    books 22 feb 2017 21 april 2018
    books 22 feb 2017 21
    22 feb 2017 21 april
    feb 2017 21 april 2018
    $$$

How can I do this?

1 Answer

catspeake


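As you said, each block is delimited by ### and $$$. We can treat the text as a sequence of character offsets between those markers and use re.finditer to locate the block boundaries.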


    import re

    # `text` is assumed to hold the whole file as a single string.

    # Offset where each block starts (the first '#' of '###').
    starts = [s.start() for s in re.finditer(re.escape('###'), text)]
    # [0, 105, 349]

    # Offset just past each closing '$$$' ('$' is a regex metacharacter, so escape it).
    ends = [e.end() for e in re.finditer(re.escape('$$$'), text)]
    # [104, 348, 558]

    # Sequence of blocks: one [start, end] pair per block.
    nBlocks = [list(pair) for pair in zip(starts, ends)]
    # [[0, 104], [105, 348], [349, 558]]

    # Find where the input sentence first occurs (raises ValueError if absent).
    find = '22 feb 2017 21 april'
    where = text.index(find)
    # 10

    # Remove the block containing that offset, plus the newline after '$$$'.
    cleanText = text
    for start, end in nBlocks:
        if start <= where < end:
            cleanText = text[:start] + text[end + 1:]
            break

    print(cleanText)


Output:

    ###
    risk true stories people never thought they d dare share
    risk true stories people never
    true stories people never thought
    stories people never thought they
    people never thought they d
    never thought they d dare
    thought they d dare share
    $$$
    ###
    everyone hanging out without me mindy kaling non fiction
    everyone hanging out without me
    hanging out without me mindy
    out without me mindy kaling
    without me mindy kaling non
    me mindy kaling non fiction
    $$$
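
The offset-based code above removes only the block that holds the first substring match of the sentence. As a possible variant, the sketch below matches whole blocks with one regular expression and drops every block that has the sentence as one of its lines. The names BLOCK_RE and remove_blocks are hypothetical and not part of the original answer, and the sketch assumes the file uses "\n" line endings and that ### and $$$ appear only as block markers.

```python
import re

# Hypothetical pattern: '###' on its own line, the block body (captured),
# then '$$$' and the newline that follows it, if any.
BLOCK_RE = re.compile(r"###\n(.*?)\$\$\$\n?", re.DOTALL)


def remove_blocks(text, sentence):
    """Return `text` without any ###...$$$ block that has `sentence` as a full line."""

    def keep_or_drop(match):
        lines = match.group(1).splitlines()
        # Drop the whole block (markers included) when the sentence is one of its lines.
        return "" if sentence in lines else match.group(0)

    return BLOCK_RE.sub(keep_or_drop, text)


# Small made-up sample in the same layout as the file in the question.
sample = (
    "###\n"
    "a b c d\n"
    "a b c\n"
    "b c d\n"
    "$$$\n"
    "###\n"
    "w x y z\n"
    "w x y\n"
    "x y z\n"
    "$$$\n"
)

print(remove_blocks(sample, "b c d"), end="")
# ###
# w x y z
# w x y
# x y z
# $$$
```

Comparing against whole lines rather than substrings is a design choice: a short input such as "b c" does not delete a block merely because it appears inside a longer n-gram. If substring matching is what you want, the membership test could be swapped for a check on match.group(1) directly.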


Answered 2022-06-22