u'\ufeff' in a Python string


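Question:

I am getting an error with the following exception message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in
position 155: ordinal not in range(128)

I don't know what u'\ufeff' is; it shows up when I'm web scraping. How can I fix this? The .replace() string method doesn't work on it.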

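Solution 1:

I ran into this on Python 3 and found this question (and its solution). When opening a file, Python 3 supports the encoding keyword to handle the encoding automatically.

Without it, the BOM is included in the result of the read (assuming the platform's default encoding is UTF-8):

>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'

With the correct encoding, the BOM is omitted from the result:

>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'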
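
For anyone who wants to reproduce this without a BOM-prefixed file at hand, here is a minimal, self-contained sketch. The temporary file and its contents are made up for illustration, and the first read passes encoding='utf-8' explicitly so the result does not depend on the platform default.

```python
import os
import tempfile

# Hypothetical sample file: the text "test" preceded by the UTF-8 BOM bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xef\xbb\xbftest")
    path = tmp.name

try:
    # Plain utf-8 keeps the BOM as the first character of the text.
    with open(path, mode="r", encoding="utf-8") as f:
        with_bom = f.read()

    # utf-8-sig drops a leading BOM if one is present.
    with open(path, mode="r", encoding="utf-8-sig") as f:
        without_bom = f.read()
finally:
    os.remove(path)

print(repr(with_bom))     # '\ufefftest'
print(repr(without_bom))  # 'test'
```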

Just my 2 cents.

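Solution 2:

The Unicode character U+FEFF is the byte order mark (BOM), used to tell big-endian and little-endian UTF-16 encodings apart. If you decode the web page with the right codec, Python will remove it for you. Example: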

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

Note that EF BB BF is the UTF-8-encoded BOM. It is not required for UTF-8 and serves only as a signature (usually on Windows).

Output:

utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

Note that the utf-16 codec relies on the BOM being present; without it, Python has no way to know whether the data is big-endian or little-endian.
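
If you do not know in advance which of these encodings a downloaded page uses, one option is to look at the leading bytes yourself. The following is only a rough Python 3 sketch: sniff_bom is a hypothetical helper (not part of any library), it ignores UTF-32, and it assumes that data without a BOM is UTF-8. The BOM constants come from the standard codecs module.

```python
import codecs


def sniff_bom(data: bytes) -> str:
    """Pick a codec name based on a leading BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # strips the EF BB BF signature on decode
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"      # uses the BOM to choose the byte order
    return "utf-8"           # assumption: BOM-less input is UTF-8


samples = [
    "ABC".encode("utf-8"),
    "ABC".encode("utf-8-sig"),
    codecs.BOM_UTF16_LE + "ABC".encode("utf-16-le"),
    codecs.BOM_UTF16_BE + "ABC".encode("utf-16-be"),
]

# Every sample should decode to the same text, with no stray U+FEFF.
decoded = [raw.decode(sniff_bom(raw)) for raw in samples]
print(decoded)
```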

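Solution 3:

That character is the BOM, or "byte order mark". It usually arrives as the first few bytes of a file and tells you how to interpret the encoding of the rest of the data. You can simply remove the character to carry on. That said, since the error shows you were trying to convert to 'ascii', you should probably pick another encoding for whatever you are trying to do.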
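
Both suggestions can be sketched in a few lines of Python 3. The sample string is made up; the point is only that U+FEFF cannot be encoded as ASCII, while stripping it or encoding as UTF-8 avoids the error.

```python
text = "\ufeffABC"  # made-up sample: decoded text that still carries the BOM

# Encoding to ASCII fails because U+FEFF is outside the ASCII range.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Option 1: drop the leading BOM character, after which ASCII works here.
cleaned = text.lstrip("\ufeff")
as_ascii = cleaned.encode("ascii")

# Option 2: encode with a codec that can represent every character.
as_utf8 = text.encode("utf-8")

print(ascii_failed, as_ascii, as_utf8)
```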

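Solution 4:

The content you are scraping is encoded in Unicode rather than ASCII text, and you are getting a character that does not convert to ASCII. The right "translation" depends on what the original web page thinks it is. Python's Unicode page gives background on how this works.

Are you trying to print the result or write it to a file? The error suggests that writing the data is what causes the problem, not reading it. This question is a good place to look for fixes.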
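
To illustrate the point about writing, here is a hedged Python 3 sketch. The raw bytes stand in for a scraped response body and are invented for the example; the in-memory buffer stands in for a file opened with an explicit encoding. It assumes the page really is UTF-8.

```python
import io

# Made-up response body: UTF-8 BOM followed by a little HTML.
raw = b"\xef\xbb\xbf<html>caf\xc3\xa9</html>"

# Decoding with plain utf-8 keeps the BOM, and an ASCII write then fails on it.
page = raw.decode("utf-8")
failed_char = None
try:
    page.encode("ascii")
except UnicodeEncodeError as exc:
    failed_char = exc.object[exc.start]  # the character that could not be encoded

# Decoding with utf-8-sig removes the BOM; writing as UTF-8 handles the rest.
html = raw.decode("utf-8-sig")
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(html)
writer.flush()
written = buffer.getvalue()

print(repr(failed_char), written)
```

With a real file, the equivalent would be open(path, 'w', encoding='utf-8') instead of the in-memory buffer.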

Solution 5:

You can replace it while iterating over all the characters, like this:

''.join(char.replace('\ufeff', '') for char in html_string)

Source: https://www.reddit.com/r/learnpython/comments/rlfos7/comment/hpfezxz/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
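
As far as I can tell, on a Python 3 str the per-character join gives the same result as a single str.replace call, as long as the pattern is written with a backslash ('\ufeff', not '/ufeff'). A small check, using a made-up html_string:

```python
# Made-up sample with U+FEFF at the start and in the middle.
html_string = "\ufeff<p>te\ufeffxt</p>"

# Character-by-character version from the answer above.
per_char = "".join(char.replace("\ufeff", "") for char in html_string)

# Single call on the whole string.
direct = html_string.replace("\ufeff", "")

print(per_char == direct, repr(direct))
```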

Solution 6:

This is based on Mark Tolonen's answer. The string contains the word "test" in different languages, separated by "|", so you can see the differences.

u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print('utf-8     %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16    %r' % e16)
print('utf-16le  %r' % e16le)
print('utf-16be  %r' % e16be)
print()
print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))

Here is a test run:

>>> u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> e8 = u.encode('utf-8')        # encode without BOM
>>> e8s = u.encode('utf-8-sig')   # encode with BOM
>>> e16 = u.encode('utf-16')      # encode with BOM
>>> e16le = u.encode('utf-16le')  # encode without BOM
>>> e16be = u.encode('utf-16be')  # encode without BOM
>>> print('utf-8     %r' % e8)
utf-8     b'ABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-8-sig %r' % e8s)
utf-8-sig b'\xef\xbb\xbfABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-16    %r' % e16)
utf-16    b"\xff\xfeA\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16le  %r' % e16le)
utf-16le  b"A\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16be  %r' % e16be)
utf-16be  b"\x00A\x00B\x00C\x00t\x00e\x00s\x00t\x03\xb2\x8c\x9dXT\xc7\x04\x00m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x06'\x06.\x06*\x06(\x06'\x061\x00|mK\x8b\xd5\x00|n,\x8af\x00|0\xc60\xb90\xc8\x00|\t*\t0\t@\t\x15\tM\t7\t>\x00|\r*\r0\r?\r6\rK\r'\r(\x00|\x05\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x00|\x00k\x00i\x1e\xc3\x00m\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|"
>>> print()

>>> print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
utf-8  w/ BOM decoded with utf-8     '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
utf-8  w/ BOM decoded with utf-8-sig 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
utf-16 w/ BOM decoded with utf-16    'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))
utf-16 w/ BOM decoded with utf-16le  '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'

It is worth noting that only utf-8-sig and utf-16 give back the original string after both encode and decode.
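
The same observation can be checked programmatically rather than by eye. This is only a sketch with a shortened sample string; the UTF-16 data is built with an explicit little-endian BOM (via the codecs constants) so the result does not depend on the machine's byte order.

```python
import codecs

original = "ABCtestβ|测试|"  # shortened sample string

# BOM-prefixed data in both encodings.
utf8_bom = original.encode("utf-8-sig")
utf16_bom = codecs.BOM_UTF16_LE + original.encode("utf-16-le")

# For each decoder, record whether it returns exactly the original string.
round_trips = {
    "utf-8": utf8_bom.decode("utf-8") == original,
    "utf-8-sig": utf8_bom.decode("utf-8-sig") == original,
    "utf-16": utf16_bom.decode("utf-16") == original,
    "utf-16-le": utf16_bom.decode("utf-16-le") == original,
}

for codec_name, ok in round_trips.items():
    print(codec_name, ok)
```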

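Solution 7:

This problem basically shows up when you save your Python code in UTF-8 or UTF-16 encoding, because a few special characters (which text editors do not display) are then automatically added at the start of the code to identify the encoding. When you try to run the code, you get a syntax error on line 1 (that is, at the very start of the code), because the Python compiler expects ASCII. If you look at the file's contents with read(), you can see '\ufeff' at the start of the returned code. The simplest fix is to change the encoding back to ASCII (to do that, copy the code into Notepad and save it, remembering to choose ASCII encoding). Hope this helps.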
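
If re-saving by hand is inconvenient, the BOM can also be removed from the file programmatically. The sketch below is an assumption-laden illustration, not a standard tool: strip_utf8_bom is a hypothetical helper, it only handles the UTF-8 BOM (not UTF-16), and the demo runs on a throwaway temporary file.

```python
import codecs
import os
import tempfile


def strip_utf8_bom(path):
    """Rewrite the file without a leading UTF-8 BOM; return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True


# Demo on a made-up script file that starts with the BOM bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".py") as tmp:
    tmp.write(codecs.BOM_UTF8 + b"print('hello')\n")
    script_path = tmp.name

try:
    removed = strip_utf8_bom(script_path)        # BOM present the first time
    removed_again = strip_utf8_bom(script_path)  # nothing left to remove
    with open(script_path, "rb") as f:
        content = f.read()
finally:
    os.remove(script_path)

print(removed, removed_again, content)
```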