Splitting text into sentences in Python without exceeding a character limit

湖上湖 2023-10-11 21:14:51
I have a string that contains sentences. If the string has more characters than a given number, I want to split it into several strings, each with fewer characters than the maximum but still made up of complete sentences. I did the following and it seems to work, but I am not sure whether I will run into bugs once it goes to production. Does the code below look OK?

from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(my_text)
sentences_split = []
shortened_sentence = ""
for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 5120:
        shortened_sentence += sentence

    if (len(shortened_sentence) + len(sentence) > 5120) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""

print(sentences_split)

1 Answer

哔哔one


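To better explain my point (made in a comment) about the problem with the second if block, consider the following example. We want strings with a maximum length of 15, so the 5120 from the question becomes 16 here. As you can see, the first three items in the list have lengths 5 + 6 + 4 = 15, so the first shortened_sentence should consist of the first three items in the list. It does not, because the logic of the second if is incorrect: the check runs after the current sentence has already been appended, so it counts that sentence's length a second time and closes the chunk too early.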


sentences = ['abcde', 'fghijk', 'lmno', 'pqr']

# we need sentences with less than 16 chars
print([len(sentence) for sentence in sentences])

sentences_split = []
shortened_sentence = ""
for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence

    # second check from the question: by now the sentence may already be
    # part of shortened_sentence, so its length is counted twice
    if (len(shortened_sentence) + len(sentence) > 16) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""

print(sentences_split)
print([len(sentence) for sentence in sentences_split])

Output

[5, 6, 4, 3]
['abcdefghijk', 'lmnopqr']
[11, 7]

Compare that with


sentences = ['abcde', 'fghijk', 'lmno', 'pqr']

# we need sentences with less than 16 chars
print([len(sentence) for sentence in sentences])

sentences_split = []
shortened_sentence = ""
for sentence in sentences:
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence
    else:
        # the current chunk is full: store it and start the next one
        # with the sentence that did not fit
        sentences_split.append(shortened_sentence)
        shortened_sentence = sentence
# store whatever is left after the loop
sentences_split.append(shortened_sentence)

print(sentences_split)
print([len(sentence) for sentence in sentences_split])

Output

[5, 6, 4, 3]
['abcdefghijklmno', 'pqr']
[15, 3]
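Splitting too early is not the only symptom of the question's loop. When the first if fails because the next sentence does not fit, that sentence is never added anywhere, so it is silently lost. The sketch below wraps the question's loop unchanged in a helper; the name split_like_question and the sample input are made up for illustration.

```python
def split_like_question(sentences, limit):
    """Reproduce the loop from the question (hypothetical helper name)."""
    chunks = []
    current = ""
    for idx, sentence in enumerate(sentences):
        if len(current) + len(sentence) < limit:
            current += sentence

        if (len(current) + len(sentence) > limit) or (idx + 1 == len(sentences)):
            chunks.append(current)
            current = ""
    return chunks


sentences = ["abcde", "fghij", "klmnopqr"]
chunks = split_like_question(sentences, 16)

print(chunks)                                  # ['abcdefghij']
print("".join(chunks) == "".join(sentences))   # False: 'klmnopqr' was dropped
```

The third item does not fit next to the first two, so the first if skips it, the second if flushes the chunk, and the loop ends without ever storing it.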

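Finally, if you are not sure whether you will "run into bugs once it goes to production": write tests, lots of tests. That is what tests are for: helping to minimize bugs in production.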
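As a rough idea of what such tests could look like, here is a sketch that wraps the loop from the second snippet in a function and checks a few properties with plain assert statements. The function name split_sentences, the test names, and the sample inputs are assumptions made for illustration; the test functions follow the naming that pytest picks up, but they can also be called directly.

```python
def split_sentences(sentences, limit):
    """Loop from the second snippet, wrapped in a function (hypothetical name)."""
    chunks = []
    current = ""
    for sentence in sentences:
        if len(current) + len(sentence) < limit:
            current += sentence
        else:
            chunks.append(current)
            current = sentence
    chunks.append(current)
    return chunks


def test_example_from_answer():
    result = split_sentences(["abcde", "fghijk", "lmno", "pqr"], 16)
    assert result == ["abcdefghijklmno", "pqr"]


def test_no_text_is_lost():
    sentences = ["abcde", "fghij", "klmnopqr"]
    assert "".join(split_sentences(sentences, 16)) == "".join(sentences)


def test_chunks_stay_under_limit():
    sentences = ["a" * 7, "b" * 7, "c" * 7, "d" * 2]
    assert all(len(chunk) < 16 for chunk in split_sentences(sentences, 16))


def test_oversized_first_sentence():
    # Pins down current behaviour: a sentence longer than the limit is kept
    # whole, and an empty chunk appears in front of it.
    assert split_sentences(["x" * 20, "ab"], 16) == ["", "x" * 20, "ab"]
```

The last test is the kind of edge case that is easy to miss by eye: writing it down forces a decision on whether an empty chunk and an over-long chunk are acceptable in production.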


Also note that the second snippet is just a sample implementation; other implementations are possible.
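One such alternative, as a sketch only: a function that treats the limit as inclusive, joins sentences with a separator (counted towards the length), never emits empty chunks, and keeps a sentence that is longer than the limit as a chunk of its own. The name chunk_sentences, the sep parameter, and the choice to keep oversized sentences whole are all assumptions, not something taken from the question.

```python
def chunk_sentences(sentences, max_chars, sep=" "):
    """Group whole sentences into chunks of at most max_chars characters.

    Hypothetical helper: a single sentence longer than max_chars is kept
    as its own (oversized) chunk rather than being cut.
    """
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + sep + sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:              # avoid emitting an empty chunk
                chunks.append(current)
            current = sentence       # may exceed max_chars on its own
    if current:
        chunks.append(current)
    return chunks


sentences = ["abcde", "fghijk", "lmno", "pqr"]
print(chunk_sentences(sentences, 15, sep=""))  # ['abcdefghijklmno', 'pqr']
print(chunk_sentences(sentences, 15))          # ['abcde fghijk', 'lmno pqr']
print(chunk_sentences([], 15))                 # []
```

Passing sep="" reproduces the plain concatenation used in the snippets above; a space separator is usually what you want when the sentences come from a tokenizer that strips the whitespace between them.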


Answered 2023-10-11