How do I split text into words using a regular expression?

Cats萌萌 2022-12-31 13:58:13

I have an .srt file with a fixed text structure. Example:

1
00:00:01,514 --> 00:00:04,185
I'm investigating
Saturday night's shootings.

2
00:00:04,219 --> 00:00:05,754
What's to investigate?
Innocent people

I want to get the individual words, such as "I'm", "investigating", "Saturday". I created the pattern @"[a-zA-Z']", which splits my text almost correctly. However, the .srt file also contains some useless tags, such as <i>, that I want to remove. How can I build a pattern that splits the text into words and removes everything between "<" and ">" (including the angle brackets themselves)?


2 Answers

HUX布斯

Contributed 1876 experience points, received more than 6 likes

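Well, it is hard to do this with a single regular expression (at least for me), but you can do it in two steps.

First, remove the HTML-style tags from the string; then extract the words from what is left.

Have a look below.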

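var text = "00:00:01,514 --> 00:00:04,185 I'm investigating Saturday night's shootings.<i>";

// Step 1: remove every tag, i.e. everything from "<" up to the next ">".
// (No trailing ".*" here, otherwise the text after a tag would be deleted as well.)
var noHtml = Regex.Replace(text, @"<[^>]*>", "");

// Step 2: extract the words by matching @"[a-zA-Z']+" on noHtml.
// The "+" matches whole words instead of single characters.
// You should get "I'm", "investigating", "Saturday", "night's", "shootings".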
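The two steps above can be put together as a small program. This is only a minimal sketch: the class name SubtitleWords and the helper Extract are hypothetical names chosen for illustration, and replacing each tag with a space (rather than an empty string) is an assumption made here so that words on either side of a tag cannot run together.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SubtitleWords
{
    // Hypothetical helper (not part of the original answer):
    // strips tags first, then collects the words.
    public static List<string> Extract(string text)
    {
        // Step 1: replace every "<...>" tag, brackets included, with a space
        // so that "foo<br>bar" does not collapse into "foobar".
        string noTags = Regex.Replace(text, @"<[^>]*>", " ");

        // Step 2: collect runs of letters and apostrophes.
        var words = new List<string>();
        foreach (Match m in Regex.Matches(noTags, @"[a-zA-Z']+"))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        string line = "00:00:01,514 --> 00:00:04,185 <i>I'm investigating</i> Saturday night's shootings.";
        Console.WriteLine(string.Join("|", Extract(line)));
        // prints: I'm|investigating|Saturday|night's|shootings
    }
}
```

The timestamps and the "-->" arrow contain no letters, so they drop out in step 2 without any special handling.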

Answered 2022-12-31

LEATH

Contributed 1936 experience points, received more than 7 likes

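You can use negative lookarounds to assert that the word is not preceded by a "<" followed by a run of non-">" characters, and is not followed by a run of non-"<" characters ending in a ">".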

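using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        // Sample input containing two tags that must be ignored.
        string input = @"
Hello world, <rubbish>it's a wonderful day.
<trash>
";

        // Lookbehind: the word is not inside an opened "<...".
        // Lookahead: the word is not followed by "...>" without a "<" in between.
        foreach (Match match in Regex.Matches(input, @"(?<!<[^>]*)[a-zA-Z']+(?![^<]*>)"))
        {
            Console.WriteLine(match.Value);
        }
    }
}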

Output:

Hello
world
it's
a
wonderful
day
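One caveat when applying this pattern to real .srt data: the timestamp arrow "-->" itself contains a ">", so when the whole file is matched in one go, the lookahead treats any word that is followed by a later "-->" (with no "<" in between) as if it were inside a tag and skips it. The sketch below illustrates this and shows one possible workaround, matching line by line. The class and method names are hypothetical, and the per-line approach assumes that no tag spans more than one line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundOnSrt
{
    // Pattern taken from the answer above.
    private const string Pattern = @"(?<!<[^>]*)[a-zA-Z']+(?![^<]*>)";

    // Runs the pattern over the whole text at once.
    public static List<string> WordsWholeText(string srt)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(srt, Pattern))
        {
            words.Add(m.Value);
        }
        return words;
    }

    // Hypothetical workaround: run the pattern on one line at a time, so a ">"
    // on a later timestamp line cannot affect the lookahead.
    public static List<string> WordsPerLine(string srt)
    {
        var words = new List<string>();
        foreach (string line in srt.Split('\n'))
        {
            foreach (Match m in Regex.Matches(line, Pattern))
            {
                words.Add(m.Value);
            }
        }
        return words;
    }

    public static void Main()
    {
        string srt =
            "1\n00:00:01,514 --> 00:00:04,185\n<i>I'm investigating</i>\nSaturday night's shootings.\n\n" +
            "2\n00:00:04,219 --> 00:00:05,754\nWhat's to investigate?\n";

        Console.WriteLine(string.Join("|", WordsWholeText(srt)));
        // prints: I'm|investigating|What's|to|investigate

        Console.WriteLine(string.Join("|", WordsPerLine(srt)));
        // prints: I'm|investigating|Saturday|night's|shootings|What's|to|investigate
    }
}
```

In the whole-text run, "Saturday night's shootings" is lost because the second block's "-->" follows it. Stripping the tags first and then matching plain words avoids the issue altogether, since no lookaround is involved.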


Answered 2022-12-31
