I am trying to parse arbitrary webpages with the requests and BeautifulSoup libraries, using this code:

    try:
        response = requests.get(url)
    except Exception as error:
        return False

    if response.encoding == None:
        soup = bs4.BeautifulSoup(response.text) # This is line 155
    else:
        soup = bs4.BeautifulSoup(response.text, from_encoding=response.encoding)

On most webpages this works fine. However, on some arbitrary pages (<1%) I get this crash:

    Traceback (most recent call last):
      File "/home/dotancohen/code/parser.py", line 155, in has_css
        soup = bs4.BeautifulSoup(response.text)
      File "/usr/lib/python3/dist-packages/requests/models.py", line 809, in text
        content = str(self.content, encoding, errors='replace')
    TypeError: str() argument 2 must be str, not None

For reference, this is the relevant method of the requests library:

    @property
    def text(self):
        """Content of the response, in unicode.

        if Response.encoding is None and chardet module is available, encoding
        will be guessed.
        """

        # Try charset from content-type
        content = None
        encoding = self.encoding

        # Fallback to auto-detected encoding.
        if self.encoding is None:
            if chardet is not None:
                encoding = chardet.detect(self.content)['encoding']

        # Decode unicode from given encoding.
        try:
            content = str(self.content, encoding, errors='replace') # This is line 809
        except LookupError:
            # A LookupError is raised if the encoding was not found which could
            # indicate a misspelling or similar mistake.
            #
            # So we try blindly encoding.
            content = str(self.content, errors='replace')

        return content

As can be seen, I am not passing in an encoding when this error is thrown. How am I using the library incorrectly, and what can I do to prevent this error? This is on Python 3.2.3, but I get the same results with Python 2.
1 Answer

天涯尽头无女友
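This means that the server did not send an encoding for the content in the headers, and the chardet library was also unable to determine an encoding for the content. You are in fact deliberately testing for the lack of an encoding; why try to get decoded text if no encoding is available?
You can try leaving the decoding up to the BeautifulSoup parser: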
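To make the failure mode concrete, here is a minimal sketch that reproduces it offline. The function `decode_like_text` is a hypothetical stand-in that mirrors only the decoding step of the `text` property quoted in the question; it is not part of requests. It assumes the encoding guess came back as `None`, as it would when neither the headers nor chardet supply one.

```python
def decode_like_text(content, encoding):
    """Hypothetical stand-in for the decoding step quoted in the question."""
    try:
        return str(content, encoding, errors="replace")
    except LookupError:
        # Only unknown codec names land here; a None encoding raises
        # TypeError instead, so it is not caught by this handler.
        return str(content, errors="replace")


# A misspelled codec name is handled by the LookupError fallback.
fallback = decode_like_text(b"abc", "no-such-codec")

# A None encoding escapes the handler and surfaces as a TypeError.
try:
    decode_like_text(b"abc", None)
    raised = False
except TypeError:
    raised = True
```

The exact wording of the `TypeError` message differs between Python versions, so the sketch checks only the exception type.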
    if response.encoding is None:
        soup = bs4.BeautifulSoup(response.content)
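There is also no need to pass the encoding to BeautifulSoup, because if .text does not fail you are already working with Unicode, and BeautifulSoup will ignore the encoding parameter anyway: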
    else:
        soup = bs4.BeautifulSoup(response.text)
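Putting the two branches together, the choice can be folded into one small helper. This is only a sketch: `markup_for_parser` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for `requests.Response` so that the example runs without a network connection or third-party packages.

```python
from types import SimpleNamespace


def markup_for_parser(response):
    """Pick what to hand to BeautifulSoup (hypothetical helper)."""
    if response.encoding is None:
        # No encoding reported: pass the raw bytes and leave the
        # decoding to the parser.
        return response.content
    # An encoding is set, so .text does not hit the None case.
    return response.text


# Stand-ins for requests.Response objects (assumption: only these three
# attributes matter for this sketch).
no_charset = SimpleNamespace(encoding=None, content=b"<p>hi</p>", text=None)
with_charset = SimpleNamespace(encoding="utf-8", content=b"<p>hi</p>", text="<p>hi</p>")

print(type(markup_for_parser(no_charset)).__name__)    # bytes
print(type(markup_for_parser(with_charset)).__name__)  # str
```

In the question's code this would be used as `soup = bs4.BeautifulSoup(markup_for_parser(response))`, replacing the whole if/else.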