Detecting Whether a Byte Stream Is UTF-8 Encoded

Tags: Architecture

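A few days ago I happened to see a forum post asking "how to automatically tell whether Chinese parameters in a URL are GB2312 or UTF-8 encoded".

I also read wcwtitxu's algorithm, which detects UTF-8 with a very impressive regular expression.

A regular expression built from a long chain of alternations, however, does not perform well in practice.

I had a similar requirement in a past project, so here are the approach and the cleaned-up source code for reference.

First, the principle:

The UTF-8 encoding rules are shown in the table below.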

[Figure: UTF8 Encoding Rule table]

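It looks complicated, but it boils down to this:

ASCII characters (U+0000 - U+007F) are not transformed; each is stored as a single byte with the high bit 0.

All other characters are encoded as follows:

• The first byte, in binary, consists of n ones followed by a single 0 (n >= 2). The bits after that 0 store part of the actual character code, and n is the total number of bytes in the multi-byte group (including the first byte).
• It is followed by n - 1 bytes that each start with 10; the low 6 bits of each store the rest of the actual character code.
So by walking the whole byte stream and checking this structure, we can decide whether it is UTF-8 encoded.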
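To make the bit layout concrete, here is a minimal sketch that classifies a byte by counting its leading 1 bits. The names `Utf8LeadByte`, `SequenceLength` and `IsContinuation` are hypothetical and only for illustration. The sketch follows the "n ones then a zero" rule literally and validates nothing else.

```csharp
using System;

public static class Utf8LeadByte
{
    // Returns the total byte count announced by a lead byte:
    //   1 for ASCII (0xxxxxxx),
    //   n for a byte made of n ones followed by a zero (n >= 2),
    //   0 for a continuation byte (10xxxxxx), which cannot start a sequence.
    // Note: modern UTF-8 (RFC 3629) only uses lengths 1 to 4; larger values
    // returned here describe the bit pattern, not a valid sequence.
    public static int SequenceLength(byte lead)
    {
        if ((lead & 0x80) == 0)
        {
            return 1; // ASCII
        }

        int ones = 0;
        for (int mask = 0x80; mask != 0 && (lead & mask) != 0; mask >>= 1)
        {
            ones++;
        }

        // Exactly one leading 1 means 10xxxxxx, a continuation byte.
        return ones == 1 ? 0 : ones;
    }

    // True for bytes of the form 10xxxxxx.
    public static bool IsContinuation(byte b)
    {
        return (b & 0xC0) == 0x80;
    }
}
```

For example, '中' (U+4E2D) is encoded as E4 B8 AD: `SequenceLength(0xE4)` gives 3 (binary 11100100), and both 0xB8 and 0xAD satisfy `IsContinuation`.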

Following this rule, my C# code is:

/// <summary>
///   Determines whether the given <paramref name="inputStream"/> contains UTF8 encoded bytes.
/// </summary>
/// <param name="inputStream">
///   The input stream.
/// </param>
/// <returns>
///   <see langword="true"/> if the given byte stream is in UTF8 encoding; otherwise, <see langword="false"/>.
/// </returns>
/// <remarks>
///   Input that consists only of ASCII chars is regarded as not UTF8 encoded.
/// </remarks>
public static bool IsTextUTF8(ref byte[] inputStream)
{
    // Number of following (10xxxxxx) bytes still expected for the current char.
    int encodingBytesCount = 0;
    bool allTextsAreASCIIChars = true;

    for (int i = 0; i < inputStream.Length; i++)
    {
        byte current = inputStream[i];

        if ((current & 0x80) == 0x80)
        {
            allTextsAreASCIIChars = false;
        }

        if (encodingBytesCount == 0)
        {
            // First byte of a char.
            if ((current & 0x80) == 0)
            {
                // ASCII chars, from 0x00-0x7F
                continue;
            }

            if ((current & 0xC0) == 0xC0)
            {
                encodingBytesCount = 1;
                current <<= 2;

                // More than two bytes are used to encode a unicode char.
                // Calculate the real length.
                while ((current & 0x80) == 0x80)
                {
                    current <<= 1;
                    encodingBytesCount++;
                }

                // Current UTF8 (RFC 3629) allows at most 4 bytes per char,
                // so a first byte announcing more than 3 following bytes is invalid.
                if (encodingBytesCount > 3)
                {
                    return false;
                }
            }
            else
            {
                // Invalid bits structure for UTF8 encoding rule.
                return false;
            }
        }
        else
        {
            // Following bytes, must start with 10.
            if ((current & 0xC0) == 0x80)
            {
                encodingBytesCount--;
            }
            else
            {
                // Invalid bits structure for UTF8 encoding rule.
                return false;
            }
        }
    }

    if (encodingBytesCount != 0)
    {
        // Invalid bits structure for UTF8 encoding rule.
        // Wrong following bytes count.
        return false;
    }

    // Although UTF8 can encode ASCII chars, an input stream whose contents
    // are all ASCII is regarded as using the default encoding.
    return !allTextsAreASCIIChars;
}
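As an independent cross-check (this is not the method above, just an alternative), the framework's own decoder can be asked to throw on malformed input. A minimal sketch, assuming the hypothetical names `StrictUtf8Check` and `IsValidUtf8`; it relies on the `UTF8Encoding(bool, bool)` constructor, whose second argument turns on exceptions for invalid bytes. Unlike `IsTextUTF8`, it reports pure ASCII input as valid, because ASCII is well-formed UTF-8.

```csharp
using System;
using System.Text;

public static class StrictUtf8Check
{
    // encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // Returns true if the framework decoder accepts the bytes as UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            // Decoding is done only for its validation side effect.
            Strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // Thrown when the decoder meets a malformed or truncated sequence.
            return false;
        }
    }
}
```

This trades speed for simplicity, since it decodes the whole input and uses an exception for the negative case. The hand-written loop is the better fit when many short inputs, such as URL parameters, need to be checked.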

 

 

The unit test code follows:

 

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;

/// <summary>
/// This is a test class for EncodingHelper and is intended
/// to contain all EncodingHelper unit tests.
/// </summary>
[TestClass]
public class EncodingHelperTest
{
    /// <summary>
    ///   Normal test for this method.
    /// </summary>
    [TestMethod]
    public void IsTextUTF8Test()
    {
        // Note: on .NET Core / .NET 5+, code page 932 (Shift-JIS) is only available
        // after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        Random rd = new Random();

        for (int i = 0; i < 1000; i++)
        {
            // Always start with a non-ASCII char so the text is never pure ASCII.
            List<char> chars = new List<char>();
            chars.Add('中');

            // Append 255 random chars from the Basic Multilingual Plane.
            while (chars.Count < 256)
            {
                char ch = (char)rd.Next(0xFFFF);
                UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
                if (uc != UnicodeCategory.Surrogate &&       // A single surrogate cannot be encoded correctly.
                    uc != UnicodeCategory.PrivateUse &&      // Private use blocks should be excluded.
                    uc != UnicodeCategory.OtherNotAssigned)
                {
                    chars.Add(ch);
                }
            }

            string str = new string(chars.ToArray());

            byte[] inputStream = Encoding.UTF8.GetBytes(str);
            bool actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(true, actual, string.Format("UTF8_Assert Fails at:{0}", str));

            inputStream = Encoding.GetEncoding(932).GetBytes(str);
            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(false, actual, string.Format("ShiftJIS_Assert Fails at:{0}", str));
        }
    }

    /// <summary>
    ///   Check with all ASCII chars.
    /// </summary>
    [TestMethod]
    public void IsTextUTF8Test_AllASCII()
    {
        string str = "ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD";

        byte[] inputStream = Encoding.UTF8.GetBytes(str);
        bool actual = EncodingHelper.IsTextUTF8(ref inputStream);
        Assert.AreEqual(false, actual, string.Format("UTF8_Assert Fails at:{0}", str));
    }
}

 

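Also:

If you only need to decide whether a file is saved as UTF-8, you do not necessarily need this method. Files saved as UTF-8 often begin with a BOM (the three bytes EF BB BF), which marks the file as UTF-8. Not every tool writes a BOM, though, so a missing BOM does not prove the file is not UTF-8.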
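A minimal sketch of the BOM check, assuming the hypothetical names `BomCheck` and `StartsWithUtf8Bom`. It compares the start of a buffer with `Encoding.UTF8.GetPreamble()`, which returns the bytes EF BB BF.

```csharp
using System;
using System.Text;

public static class BomCheck
{
    // Returns true if the buffer begins with the UTF-8 byte order mark.
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF

        if (data == null || data.Length < bom.Length)
        {
            return false;
        }

        for (int i = 0; i < bom.Length; i++)
        {
            if (data[i] != bom[i])
            {
                return false;
            }
        }

        return true;
    }
}
```

For a file, reading only the first three bytes is enough for this check. When it returns false, the byte-structure analysis above is still needed.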

Reference:

Wikipedia: http://en.wikipedia.org/wiki/UTF-8
