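Summary: Problem description: I received some text that is encoded, but I don't know which charset was used. Is there a way to determine the encoding of a text file using Python? (A related question covers how to detect the encoding/codepage of a text file with C#.) Solution 1: Edit: chardet seems to be unmaintained, but most of the answer still applies. See https://pypi.org/project/charset-norma...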

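Problem description:

I received some text that is encoded, but I don't know which charset was used. Is there a way to determine the encoding of a text file using Python? A related question covers how to detect the encoding/codepage of a text file with C#.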

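Solution 1:

Edit: chardet seems to be unmaintained, but most of this answer still applies. See https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding every time is impossible.

(From the chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that it isn't English (even though it is composed entirely of English letters). By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

The chardet library uses that study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.

You can also use UnicodeDammit. It tries the following methods, in order:

- An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252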
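
The first step in that list (an encoding declared inside the document) can be approximated with the standard library alone. The following is only a rough sketch and not how Beautiful Soup implements it: the function name declared_encoding, the two regular expressions and the 1024-byte window are assumptions made for illustration, and the approach only works for ASCII-compatible encodings.

```python
import re
from typing import Optional

# Simplified patterns for an encoding declared inside the document itself.
# Both operate on raw bytes because the encoding is not known yet.
XML_DECL = re.compile(rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
HTML_META = re.compile(rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def declared_encoding(data: bytes, sniff_bytes: int = 1024) -> Optional[str]:
    """Return the encoding name declared near the start of data, or None."""
    head = data[:sniff_bytes]
    for pattern in (XML_DECL, HTML_META):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>'))
```

A declared encoding is only a claim made by the document, so it is still worth decoding with it inside a try block.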

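Solution 2:

Another option for working out the encoding is to use libmagic (the code behind the file command). There is a profusion of Python bindings available.

The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. They can determine the encoding of a file like this: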

import magic

blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)

Solution 3:

Some encoding strategies; uncomment to taste:

#!/bin/bash
#
tmpfile=$1
echo '-- info about file file ........'
file -i $tmpfile
enca -g $tmpfile
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > $tmpfile
#enca -x utf-8 $tmpfile
#enca -g $tmpfile
recode CP1250..UTF-8 $tmpfile

You might like to check the encoding by opening and reading the file in a loop... but you might need to check the file size first:

# PYTHON
encodings = ['utf-8', 'windows-1250', 'windows-1252']  # add more
for e in encodings:
    try:
        # Reading the whole file forces every byte to be decoded
        with open('file.txt', 'r', encoding=e) as fh:
            fh.read()
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break
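
For large files, one possible variation on the loop above is to test only a leading sample. This is a sketch under assumptions: the function name first_decodable and the 64 KiB default are made up for illustration. It uses an incremental decoder with final=False so that a multi-byte character cut off at the end of the sample is not reported as an error.

```python
import codecs
from typing import Iterable, Optional


def first_decodable(path: str, encodings: Iterable[str],
                    sample_size: int = 64 * 1024) -> Optional[str]:
    """Return the first candidate encoding that decodes a leading sample of the file."""
    with open(path, "rb") as fh:
        sample = fh.read(sample_size)
    for name in encodings:
        # final=False tolerates a multi-byte character cut off at the sample boundary
        decoder = codecs.getincrementaldecoder(name)()
        try:
            decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

A successful result only says that the sample decodes; bytes later in the file may still be invalid for that encoding.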

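Solution 4:

Here is an example of reading a file and taking a chardet encoding prediction at face value, reading only n_lines from the file in case it is large.

chardet also gives you the probability (i.e. confidence) of its encoding prediction (I haven't looked at how they come up with that), which is returned together with the prediction by chardet.detect(), so you could work that in somehow if you like.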

import chardet
from pathlib import Path

def predict_encoding(file_path: Path, n_lines: int=20) -> str:
    '''Predict a file's encoding using chardet'''

    # Open the file as binary data
    with Path(file_path).open('rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']
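
One way to work the confidence value in is to fall back to a default when the guess is missing or weak. This is a minimal sketch: the helper names pick_encoding and predict_encoding_checked, the 0.5 threshold and the utf-8 fallback are all assumptions, not part of chardet.

```python
from typing import Optional


def pick_encoding(result: dict, min_confidence: float = 0.5,
                  fallback: str = "utf-8") -> str:
    """Return the detected encoding, or fallback when the guess is missing or weak."""
    encoding: Optional[str] = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


def predict_encoding_checked(raw: bytes, min_confidence: float = 0.5) -> str:
    """Run chardet on raw bytes and apply the threshold above."""
    import chardet  # third-party; imported lazily so pick_encoding works without it
    return pick_encoding(chardet.detect(raw), min_confidence)
```

The right threshold depends on your data, so treat 0.5 as a placeholder to tune.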

Solution 5:

This might be helpful:

from bs4 import UnicodeDammit
with open('automate_data/billboard.csv', 'rb') as file:
   content = file.read()

suggestion = UnicodeDammit(content)
suggestion.original_encoding
#'iso-8859-1'

Solution 6:

If you are not satisfied with the automatic tools, you can try all codecs and check manually which one is right.

all_codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 
'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u', 
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 
'utf_8', 'utf_8_sig']

def find_codec(text):
    for i in all_codecs:
        for j in all_codecs:
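            try:
                print(i, "to", j, text.encode(i).decode(j))
            except Exception:
                # skip codec pairs that cannot encode or decode the text;
                # unlike a bare except, Ctrl+C still interrupts the loop
                pass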

find_codec("The example string which includes ö, ü, or ÄŸ, Ã¶")

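This script can produce up to 9409 results (97 × 97 codec pairs, minus the pairs that fail). If the output does not fit on the terminal screen, try writing it to a text file.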
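
If you already know what a fragment of the text should look like, the manual inspection can be narrowed down. The sketch below is an assumption-laden variant of the script above: the function name find_codec_pairs and the short CANDIDATES list are hypothetical, and only pairs whose result contains the expected fragment are kept.

```python
from typing import List, Tuple

# A deliberately short candidate list to keep the search fast; extend as needed.
CANDIDATES = ["ascii", "cp1250", "cp1252", "latin_1", "iso8859_15", "utf_8", "utf_16"]


def find_codec_pairs(garbled: str, expected: str) -> List[Tuple[str, str]]:
    """Return (encode, decode) codec pairs that turn garbled into text containing expected."""
    hits = []
    for enc in CANDIDATES:
        for dec in CANDIDATES:
            try:
                repaired = garbled.encode(enc).decode(dec)
            except UnicodeError:
                continue  # this pair cannot encode or decode the text
            if expected in repaired:
                hits.append((enc, dec))
    return hits


print(find_codec_pairs("Ã¶sterreich", "österreich"))
```

Several pairs may match, since many single-byte code pages agree on the characters involved.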

Solution 7:

# Function: OpenRead(file)

# A text file can be encoded using:
#   (1) The default operating system code page, Or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8, and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as notepad++, and rerun the python script,
#  otherwise the file is read as a codepage file with the 
#  invalid codepage characters removed

import sys
if int(sys.version[0]) != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function

    EXAMPLE:
        bom = bomType(file)
        open(file, encoding=bom, errors='ignore')
    """

    f = open(file, 'rb')
    b = f.read(4)
    f.close()

    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM
    if (b[0:4] == b'\x00\x00\xfe\xff') or (b[0:4] == b'\xff\xfe\x00\x00'):
        return "utf32"

    # utf_8_sig strips the BOM when the file is read
    if b[0:3] == b'\xef\xbb\xbf':
        return "utf_8_sig"

    # Python automatically detects endianess if utf-16 bom is present
    # write endianess generally determined by endianess of CPU
    if (b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe'):
        return "utf16"

    # If BOM is not provided, then assume its the codepage
    #     used by your operating system
    return "cp1252"
    # For the United States its: cp1252


def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')


#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

# utf_8_sig writes the BOM header that bomType() looks for
fout = open("myfile2.txt", "w", encoding="utf_8_sig")
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L =fin.readline() 
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()

# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L =fin.readline() 
print(L)
fin.close()

# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
print(fin.read(20))
fin.close()

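Solution 8:

In principle, it is impossible to determine the encoding of a text file in the general case. So no, there is no standard Python library that does this for you.

If you have more specific knowledge about the text file (e.g. that it is XML), there might be library functions.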

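Solution 9:

Depending on your platform, I simply opt to use the Linux shell file command. This works for me because I use it in a script that runs exclusively on one of our Linux machines.

Obviously this isn't an ideal solution or answer, but it could be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.

import subprocess

def is_utf8(path):
    # 'file' prints "<path>: <description>"; the wording of the description
    # may vary between versions of file, so look for UTF-8 anywhere in it
    output = subprocess.run(['file', path], capture_output=True,
                            text=True, check=True).stdout
    description = output.split(": ", 1)[1]
    return 'UTF-8' in description

print(is_utf8('test.txt'))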

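Solution 10:

If you know some content of the file, you can try to decode it with several encodings and see which ones lose that content. In general there is no way, since a text file is a text file and those are stupid ;)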
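
The idea above can be sketched as follows, assuming you know a word or phrase that must appear in the decoded text. The function name encodings_matching and its parameters are hypothetical.

```python
from typing import Iterable, List


def encodings_matching(data: bytes, known_text: str,
                       candidates: Iterable[str]) -> List[str]:
    """Return the candidate encodings under which data decodes and contains known_text."""
    matches = []
    for name in candidates:
        try:
            decoded = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # invalid bytes for this encoding, or unknown codec name
        if known_text in decoded:
            matches.append(name)
    return matches


data = "Grüße aus Köln".encode("cp1252")
print(encodings_matching(data, "Köln", ["utf-8", "cp437", "cp1252"]))
```

The known text should contain non-ASCII characters; a pure ASCII fragment matches under almost every encoding.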

Solution 11:

This site has Python code for recognizing ASCII, encodings with BOMs, and UTF-8 without a BOM: https://unicodebook.readthedocs.io/guess_encoding.html. Read the file into a byte array (data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array. Here's an example. I'm on OS X.

#!/usr/bin/python                                                                                                  

import sys

def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

def get_bytes_from_file(filename):
    return open(filename, "rb").read()

filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)


PS /Users/js> ./isutf8.py hi.txt                                                                                     
True

Solution 12:

`cchardet` is a faster alternative to `chardet`.

Install: pip install cchardet

Usage:

from pathlib import Path

import cchardet as chardet

filename = "unknown-file"  # placeholder path
filepath = Path(filename)

blob = filepath.read_bytes()
detection = chardet.detect(blob)

encoding = detection["encoding"]
confidence = detection["confidence"]

Solution 13:

Using the Linux file -i command:

import subprocess

file = "path/to/file/file.txt"

# -b omits the file name, -i prints the MIME type and charset,
# e.g. "text/plain; charset=utf-8"
output = subprocess.run(["file", "-bi", file], capture_output=True,
                        text=True, check=True).stdout

encoding = output.split("charset=")[1].strip()

print(encoding)

Solution 14:

You can use the python-magic package, which does not load the whole file into memory:

import magic


def detect(
    file_path,
):
    return magic.Magic(
        mime_encoding=True,
    ).from_file(file_path)

The output is the encoding name, for example:

iso-8859-1
us-ascii
utf-8

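Solution 15:

You can use the chardet module:

import chardet

filepath = "unknown-file"  # placeholder path

with open(filepath, "rb") as f:
    data = f.read()

detector = chardet.UniversalDetector()
detector.feed(data)  # the detector needs the bytes before it can guess
detector.close()
print(detector.result)

Or you can use the chardet3 command on Linux, but it takes a while:

chardet3 fileName

Example: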

chardet3 donnee/dir/donnee.csv
donnee/dir/donnee.csv: ISO-8859-1 with confidence 0.73

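Solution 16:

Some text files are aware of their encoding, most are not. Aware:

- a text file that has a BOM
- an XML file, which is encoded in UTF-8 or has its encoding given in the prolog
- a JSON file, which is always encoded in UTF-8

Not aware:

- a CSV file
- any random text file

Some encodings are universal, i.e. they can decode any sequence of bytes, and some are not. US-ASCII is not universal, since no byte greater than 127 maps to a character. UTF-8 is not universal, since some byte sequences are invalid.

Conversely, Latin-1, Windows-1252, etc. are universal (even if some bytes are not officially mapped to a character):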

>>> [b.to_bytes(1, 'big').decode("latin-1") for b in range(256)]
['\x00', ..., 'ÿ']
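
A quick way to probe this property for single-byte encodings is to decode each of the 256 byte values on its own. This is a minimal sketch; the function name decodes_every_byte is hypothetical, and the check is not meaningful for multi-byte encodings, where validity depends on byte sequences.

```python
def decodes_every_byte(encoding: str) -> bool:
    """Return True if each of the 256 single-byte values decodes on its own."""
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            return False
    return True


for name in ("ascii", "utf_8", "latin_1", "cp1252"):
    print(name, decodes_every_byte(name))
```

Note that Python's cp1252 codec rejects the five byte values that Windows-1252 leaves unmapped (0x81, 0x8D, 0x8F, 0x90, 0x9D), so in Python latin_1 is the safer choice when you need a fallback that never fails.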

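Given a random text file encoded as a sequence of bytes, you cannot determine the encoding unless the file is aware of its encoding, because some encodings are universal. But you can sometimes rule out non-universal encodings. All universal encodings remain possible. The chardet module uses the frequency of bytes to guess which encoding best fits the encoded text.

If you do not want to use this module or a similar one, here is a simple method:

- check whether the file is aware of its encoding (BOM)
- check non-universal encodings and accept the first one that can decode the bytes (ASCII before UTF-8, because it is stricter)
- choose a fallback encoding.

The second step is a bit risky if you check only a sample, because some bytes in the rest of the file may be invalid.

The code: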

import codecs


def guess_encoding(data: bytes, fallback: str = "iso8859_15") -> str:
    """
    A basic encoding detector.
    """
    for bom, encoding in [
        (codecs.BOM_UTF32_BE, "utf_32_be"),
        (codecs.BOM_UTF32_LE, "utf_32_le"),
        (codecs.BOM_UTF16_BE, "utf_16_be"),
        (codecs.BOM_UTF16_LE, "utf_16_le"),
        (codecs.BOM_UTF8, "utf_8_sig"),
    ]:
        if data.startswith(bom):
            return encoding

    if all(b < 128 for b in data):
        return "ascii"  # you may want to use the fallback here if data is only a sample.

    decoder = codecs.getincrementaldecoder("utf_8")()
    try:
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return fallback
    else:
        return "utf_8"  # not certain if data is only a sample

Remember that a non-universal encoding may fail. The errors parameter of the decode method can be set to 'ignore', 'replace' or 'backslashreplace' to avoid exceptions.
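
As a small illustration of those three error handlers, the snippet below decodes Latin-1 bytes as UTF-8. The sample bytes are made up for the example.

```python
raw = b"caf\xe9 au lait"  # Latin-1 bytes; 0xE9 is not valid UTF-8 here

# 'ignore' drops the bad byte, 'replace' substitutes U+FFFD,
# 'backslashreplace' keeps it visible as an escape sequence
for mode in ("ignore", "replace", "backslashreplace"):
    print(mode, "->", repr(raw.decode("utf_8", errors=mode)))
```

'backslashreplace' is the only one of the three that loses no information, which helps when you want to inspect the offending bytes later.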

Solution 17:

A long time ago, I had this need.

Reading my old code, I found this:

    import urllib.request
    import chardet
    import os
    import settings

    [...]
    file = 'sources/dl/file.csv'
    media_folder = settings.MEDIA_ROOT
    file = os.path.join(media_folder, str(file))
    if os.path.isfile(file):
        file_2_test = urllib.request.urlopen('file://' + file).read()
        encoding = (chardet.detect(file_2_test))['encoding']
        return encoding

This worked for me and returned ascii.

Solution 18:

Building on this answer:

For everyone's reference, I just want to add how to install magic with pip on Python 3:

pip install python-magic


Solution 19:

Summary:

- cchardet for small inputs
- charset_normalizer for large inputs

When the input is random bytes (no text encoding detected), you may need hex_escape.
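
The hex_escape helper mentioned above is not shown in this answer. A hypothetical version, written here only as an assumption about what such a helper might do, keeps printable ASCII and escapes every other byte:

```python
def hex_escape(bs: bytes) -> str:
    """Render bytes as text, keeping printable ASCII and escaping the rest as \\xNN."""
    return "".join(
        # the backslash itself (0x5C) is escaped so the output stays unambiguous
        chr(b) if 0x20 <= b < 0x7F and b != 0x5C else f"\\x{b:02x}"
        for b in bs
    )


print(hex_escape(b"ok\xff\x00"))
```

This gives a lossless, readable form for byte strings that none of the detectors can assign to a text encoding.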

Comparing some algorithms: guess-encoding-of-bytestring.py

#!/usr/bin/env python3

# guess text encoding of bytestring

# [cchardet]: https://github.com/PyYoshi/cChardet
# [faust-cchardet]: https://github.com/faust-streaming/cChardet
# [uchardet]: https://gitlab.freedesktop.org/uchardet/uchardet
# good for short strings
# fails on long strings
def guess_encoding_cchardet(bs: bytes):
    return cchardet.detect(bs).get("encoding")

# [charset_normalizer]: https://github.com/jawah/charset_normalizer
# [charset_normalizer#566]: https://github.com/jawah/charset_normalizer/issues/566
# good for long strings
# fails on short strings
#   https://github.com/jawah/charset_normalizer/issues/486
# 20x faster than chardet [charset_normalizer]
#   -> 200x slower than cchardet
# 5x slower than cchardet [charset_normalizer#566]
# benchmark versus chardet
#   https://github.com/jawah/charset_normalizer/raw/master/bin/performance.py
def guess_encoding_charset_normalizer(bs: bytes):
    match = charset_normalizer.from_bytes(bs).best()
    if match:
        return match.encoding
    return None

# [rs_chardet]: https://github.com/emattiza/rs_chardet
# 40x slower than cchardet [rs_chardet]
def guess_encoding_rs_chardet(bs: bytes):
    return rs_chardet.detect_rs_enc_name(bs)
    # return rs_chardet.detect_codec(bs).name

# [chardet]: https://github.com/chardet/chardet
# 4000x slower than cchardet [rs_chardet]
# 2000x slower than cchardet [cchardet]
def guess_encoding_chardet(bs: bytes):
    return chardet.detect(bs).get("encoding")

# [magic]: https://github.com/ahupp/python-magic
# fails on short strings
def guess_encoding_magic(bs: bytes):
    e = magic.detect_from_content(bs).encoding
    if e in ("binary", "unknown-8bit"):
        return None
    return e

# [icu]: https://github.com/unicode-org/icu
# fails on short strings
def guess_encoding_icu(bs: bytes):
    try:
        return icu.CharsetDetector(bs).detect().getName()
    except icu.ICUError:
        return None



if __name__ == "__main__":

    # test

    import random

    bytes_encoding_list = [
        ("ü".encode("latin1"), "latin1"),
        ("üü".encode("latin1"), "latin1"),
        ("üüü".encode("latin1"), "latin1"),
    ]

    for _ in range(10):
        bytes_encoding_list += [
            (random.randbytes(20), None),
        ]

    def test(guess_encoding):
        global bytes_encoding_list
        module_name = guess_encoding._name
        for input_bytes, expected_encoding in bytes_encoding_list:
            assert isinstance(input_bytes, bytes)
            # TODO better...
            guessed_encoding = guess_encoding(input_bytes)
            actual_string = None
            if guessed_encoding:
                try:
                    actual_string = input_bytes.decode(guessed_encoding)
                except Exception as exc:
                    if expected_encoding == None:
                        print(f"{module_name}: fail. found wrong encoding {guessed_encoding} in random bytes {input_bytes}")
                        continue
                    else:
                        print(f"{module_name}: FIXME failed to decode bytes: {exc}")
            if expected_encoding == None:
                # the guessed encoding can be anything -> dont compare encoding
                if guessed_encoding == None:
                    print(f"{module_name}: ok. found no encoding in random bytes {input_bytes}")
                else:
                    print(f"{module_name}: ok. found encoding {guessed_encoding} in random bytes {input_bytes} -> string {actual_string!r}")
            else:
                expected_string = input_bytes.decode(expected_encoding)
                if actual_string == expected_string:
                    print(f"{module_name}: ok. decoded {actual_string} from {guessed_encoding} bytes {input_bytes}")
                else:
                    #print(f"{module_name}: fail. actual {actual_string!r} from {guessed_encoding}. expected {expected_string!r} from {expected_encoding} bytes {input_bytes}")
                    print(f"{module_name}: fail. string: {actual_string!r} != {expected_string!r}. encoding: {guessed_encoding} != {expected_encoding}. bytes: {input_bytes}")

    for k in list(globals().keys()):
        if not k.startswith("guess_encoding_"):
            continue
        module_name = k[15:]
        module_found = False
        try:
            module = __import__(module_name)
            globals()[module_name] = module
            module_found = True
        except ModuleNotFoundError as exc:
            print(f"{module_name}: module not found. hint: pip install {module_name}")
            pass
        if module_found:
            guess_encoding = locals()[k]
            guess_encoding._name = module_name
            test(guess_encoding)
