How do I extract data from a text file into sentences, where a sentence is the run of data rows between blank rows?

My data is in a text file and I want to group it into sentences. A sentence is defined as all consecutive rows that each contain at least one character. The rows of data are separated by blank rows, so I want the blank rows to mark where a sentence starts and ends. Is there a way to do this with a list comprehension?

Example from the text file. The data looks like this:

    This is the
    first sentence.

    This is a really long sentence
    and it just keeps going across many
    rows there will not necessarily be punctuation
    or consistency in word length
    the only difference in ending sentence
    is the next row will be blank

    here would be the third sentence
    as you see
    the blanks between rows of data help define what a sentence is

    this would be sentence 4
    i want to pull data
    from text file
    as such (in sentences) where sentences are defined with
    blank records in between

    this would be sentence 5 since blank row above it
    and continues but ends because blank row(s) below it


2 Answers

GCT1015


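You can read the whole file into a single string with file_as_string = file_object.read(). You want to split this string at blank lines, which is the same as splitting on two consecutive newline characters, so you can write sentences = file_as_string.split("\n\n"). Finally, you probably want to remove the newlines that are still left inside each sentence. A list comprehension does that by replacing each newline with nothing: sentences = [s.replace('\n', '') for s in sentences]. Note that replacing with '' fuses the last word of one row with the first word of the next; replace with ' ' instead if the rows break between words.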

Putting it all together:

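# file_object is assumed to be a text file that is already open for reading
file_as_string = file_object.read()
# A blank line shows up as two consecutive newline characters
sentences = file_as_string.split("\n\n")
# Remove the line breaks that remain inside each sentence
sentences = [s.replace('\n', '') for s in sentences]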
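As a possible variation on the snippet above (a minimal sketch, not the answerer's code), the split and the cleanup can be folded into one list comprehension that joins rows with a space and skips the empty chunks produced by several blank rows in a row. The function name split_sentences and the sample string are made up for illustration, and a row containing only spaces or tabs is not treated as blank here.

```python
def split_sentences(text):
    """Group non-blank rows into sentences separated by blank rows."""
    # Normalise Windows line endings so that "\n\n" marks a blank row.
    text = text.replace("\r\n", "\n")
    chunks = text.split("\n\n")
    # " ".join(chunk.split()) joins the rows of a chunk with single spaces;
    # the trailing `if` skips chunks that are empty or whitespace only.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


# Hypothetical sample: three sentences, the last one preceded by two blank rows.
sample = "This is the\nfirst sentence.\n\nhere would be\nthe second\n\n\nand the third\n"
print(split_sentences(sample))
```

Joining with a space rather than an empty string keeps "the" and "first" from running together when a row ends in the middle of a phrase.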

Posted 2023-10-05

蝴蝶不菲


A regular-expression split works well for this.

If you only want to split on blank lines (lines that are empty or contain only spaces and tabs), use this pattern:

^[ \t]*$


In Python, you can do:

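import re

# fn is the path to the text file
with open(fn) as f_in:
    # Split wherever a line break is followed by an empty or whitespace-only line.
    # Every piece after the first keeps a leading newline; strip() it if needed.
    sentences = re.split(r'\r?\n^[ \t]*$', f_in.read(), flags=re.M)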
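To see what this split returns, here is a small sketch run on a made-up in-memory string instead of a file (the sample text and the names pieces and cleaned are assumptions for illustration). It shows that a row containing only a space counts as a separator, and that each piece after the first keeps the newline that preceded its first row.

```python
import re

# Hypothetical sample: one empty row and one row holding a single space.
text = "a\nb\n\nc\nd\n \ne"

# re.M lets ^ and $ match at the start and end of every line.
pieces = re.split(r'\r?\n^[ \t]*$', text, flags=re.M)
print(pieces)  # ['a\nb', '\nc\nd', '\ne']

# Stripping each piece removes the leftover leading newline.
cleaned = [p.strip() for p in pieces]
print(cleaned)  # ['a\nb', 'c\nd', 'e']
```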

If you also want to remove the single \n characters inside the text:

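with open(fn) as f_in:
    # Split on blank lines, then collapse the line breaks inside each piece into one space.
    # strip() drops the newline that the split leaves at the start of each later piece.
    sentences = [re.sub(r'[ \t]*(?:\r?\n){1,}', ' ', s.strip())
                 for s in re.split(r'\r?\n^[ \t]*$', f_in.read(), flags=re.M)]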
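If reading the whole file into one string is not desirable, a different technique (itertools.groupby from the standard library, not the regex approach above) can group rows as they are read. This is only a sketch under the assumption that a blank row is any row that is empty after stripping whitespace; the function name sentences_from_lines and the sample rows are hypothetical.

```python
from itertools import groupby


def sentences_from_lines(lines):
    """Join each run of consecutive non-blank rows into one sentence."""
    # groupby yields (key, run) pairs; the key is True for a run of rows
    # with text and False for a run of empty or whitespace-only rows.
    runs = groupby(lines, key=lambda line: bool(line.strip()))
    return [" ".join(line.strip() for line in run)
            for has_text, run in runs if has_text]


# Hypothetical sample rows, as they would come from iterating over a file.
lines = [
    "this would be sentence 4\n",
    "i want to pull data\n",
    " \t\n",
    "\n",
    "this would be sentence 5\n",
]
print(sentences_from_lines(lines))
```

Because an open text file iterates row by row, the same function should also accept the file object directly, for example sentences_from_lines(f_in) inside a with open(fn) as f_in: block.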

Posted 2023-10-05
