
strings.index unicode behavior

倚天杖 2022-08-09 20:34:39

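package main

import (
	"fmt"
	"strings"
)

func main() {
	fmt.Println(strings.Index("ééé hannah", "han"))
	fmt.Println(strings.Index("eee hannah", "han"))
}

Expected output:

4
4

Actual output:

7
4

I suspect this behavior has to do with é being a non-ASCII character. Do you know how I can get the expected output?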


2 answers

ITMISS


The matches are at byte index 7 and byte index 4 respectively. See the comments below, and try this:

package main

import (
	"fmt"
	"strings"
)

func main() {
	s1 := "ééé hannah"
	s2 := "eee hannah"
	s3 := "han"

	fmt.Println([]rune(s3))
	// [104 97 110]

	// s1 is 10 runes long, but each é takes 2 bytes in UTF-8, so it is 13 bytes long.
	fmt.Println([]rune(s1))
	// [233 233 233 32 104 97 110 110 97 104]
	fmt.Println([]byte(s1))
	// [195 169 195 169 195 169 32 104 97 110 110 97 104]
	fmt.Println(strings.Index(s1, s3))
	// 7

	// s2 is pure ASCII, so its rune and byte representations coincide.
	fmt.Println([]rune(s2))
	// [101 101 101 32 104 97 110 110 97 104]
	fmt.Println([]byte(s2))
	// [101 101 101 32 104 97 110 110 97 104]
	fmt.Println(strings.Index(s2, s3))
	// 4
}

See Go/src/strings/strings.go; it uses IndexByte:

// IndexByte returns the index of the first instance of c in s, or -1 if c is not present in s.
func IndexByte(s string, c byte) int {
	return bytealg.IndexByteString(s, c)
}

2022-08-09

斯蒂芬大帝


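So, as wasmup already said in their answer: strings.Index returns a byte index. What you were expecting is a Unicode (rune) index. In UTF-8, é is encoded as two bytes, which is why each of the 3 é's in the input string appears to be counted twice (yielding index 7 instead of the expected 4).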
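To see why a byte index is still the useful thing for strings.Index to return, here is a minimal sketch: the returned offset can be used directly to slice the string. The helper name tailFrom is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// tailFrom is a hypothetical helper: it returns the part of s that starts
// at the first occurrence of sub, or "" if sub does not occur in s.
func tailFrom(s, sub string) string {
	i := strings.Index(s, sub) // byte offset, not rune offset
	if i < 0 {
		return ""
	}
	// Slicing a string is byte-based too, so the offset can be used as-is.
	return s[i:]
}

func main() {
	fmt.Println(tailFrom("ééé hannah", "han")) // hannah
}
```

Slicing with the rune index 4 instead would cut the string in the middle of the third é.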

Some background

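Strings in Go are essentially read-only slices of bytes. That is why strings.Index returns the value it does: the offset, in bytes, at which the match was found. Unicode, however, works in terms of code points, which may need more than one byte to encode. Rather than giving this type a fairly abstract name like codepoint, Go calls it a rune. There is a lot more to say on this subject.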
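The byte/rune distinction can be observed directly. The following is a small sketch, assuming the é in the string is the precomposed character U+00E9 (as the byte dump in the first answer suggests); byteAndRuneLen is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen is a hypothetical helper that reports the length of s
// in bytes and in runes (Unicode code points).
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := byteAndRuneLen("ééé hannah")
	fmt.Println(b, r) // 13 10

	// Ranging over a string decodes runes, but the index variable is the
	// byte offset at which each rune starts.
	for i, c := range "éé h" {
		fmt.Printf("%d:%c ", i, c)
	}
	fmt.Println()
}
```

The loop prints offsets 0, 2, 4 and 5: each é advances the byte offset by two, while the space and h advance it by one.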

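With that in mind, though, we can write our own Index function that gives you the rune index rather than the byte index. Let's call it RuneIndex. A ready-made implementation of such a function might look like this: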

func RuneIndex(str, sub string) int {
	// ensure valid input
	if len(str) == 0 || len(sub) == 0 {
		return -1
	}
	// convert to rune slices
	rin, rmatch := []rune(str), []rune(sub)
	// iterate over the input up to and including the last position where the match could still fit
	for i := 0; i <= len(rin)-len(rmatch); i++ {
		// slight optimisation: if the first runes don't match, don't bother comparing full substrings
		if rin[i] != rmatch[0] {
			continue
		}
		// compare substrings directly; if they match, we're done
		if string(rin[i:i+len(rmatch)]) == sub {
			return i
		}
	}
	// no match found
	return -1
}

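It basically just compares the substring against sub-slices of the string we are searching. By converting the rune sub-slice back to a string, we can simply use the == operator, and if a match is found we return i, which is the rune index (not the byte index). I added a check to make sure the arguments are not empty, and if no match is found the function returns -1, similar to the standard library function.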

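The implementation is very simple and not heavily optimised, but given that I think this is a fairly niche thing to want to do, optimising this kind of function would count as micro-optimisation anyway.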
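As an alternative that avoids converting both strings to rune slices, one could let strings.Index find the byte offset and then count the runes that precede it. This is only a sketch, not part of the original answer; the name RuneIndexOf is hypothetical, and it assumes both arguments are valid UTF-8.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// RuneIndexOf is a hypothetical helper: it returns the index, counted in
// runes, of the first occurrence of sub in s, or -1 if sub is not present.
func RuneIndexOf(s, sub string) int {
	i := strings.Index(s, sub) // byte offset of the match
	if i < 0 {
		return -1
	}
	// The number of runes before the match is its rune index.
	return utf8.RuneCountInString(s[:i])
}

func main() {
	fmt.Println(RuneIndexOf("ééé hannah", "han")) // 4
	fmt.Println(RuneIndexOf("eee hannah", "han")) // 4
	fmt.Println(RuneIndexOf("ééé hannah", "xyz")) // -1
}
```

One difference in behaviour: for an empty sub this sketch returns 0, following strings.Index, whereas RuneIndex returns -1.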

2022-08-09
