Python Chinese Encoding Problems

2024/10/22 22:04:33  Source: https://blog.csdn.net/u014210048/article/details/143034498

Reading the file with pandas:

df = pd.read_csv(in_file, sep='\t', encoding='gb18030',error_bad_lines=False)

Error output:

$python test.py
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1169, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1318, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1611, in pandas._libs.parsers._string_box_decode
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xe1 in position 150: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 31, in <module>
    df = pd.read_csv(in_file, sep='\t', encoding='gb18030',error_bad_lines=False)
  File "/home/wubin/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/wubin/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/home/wubin/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/home/wubin/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1176, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1318, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1611, in pandas._libs.parsers._string_box_decode
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xe1 in position 150: illegal multibyte sequence
 

This means that some bytes in the file cannot be decoded as 'gb18030'.
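To make the failure concrete, here is a minimal sketch that reproduces the same kind of error on a small byte string. The sample data is hypothetical and not taken from the original file; it only assumes that a stray 0xE1 byte is followed by a tab, which is not a valid gb18030 two-byte sequence.

```python
# Hypothetical sample: two gb18030-encoded fields, then a stray 0xE1 byte
# followed by a tab. 0xE1 is a lead byte in gb18030 and a tab is not a
# valid second byte, so the pair cannot be decoded.
sample = "样本\t丰度\t".encode("gb18030") + b"\xe1\t3\n"

error = None
try:
    sample.decode("gb18030")
except UnicodeDecodeError as exc:
    error = exc
    # exc.start is the byte offset reported as "position" in the message
    print(exc)
    print("offending byte:", hex(sample[exc.start]), "at offset", exc.start)
```

The "position" in the pandas traceback has the same meaning: it is a byte offset inside the chunk being decoded, not a line number, which is why the line has to be located separately.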

So which encoding should be used?

One way to find out:

import chardet

def test_read_by_open(in_file):
    # Open in binary mode so that nothing is decoded while reading
    with open(in_file, 'rb') as f:
        for i, line in enumerate(f, start=1):  # read line by line
            try:
                line.decode('gb18030')
                # line.decode('ISO-8859-9')
            except UnicodeDecodeError:
                # Report the lines that cannot be decoded as 'gb18030'
                print(i)
                # print(str(line))
                print(chardet.detect(line))

Output:

$python test.py
177891
{'encoding': 'ISO-8859-9', 'confidence': 0.36612147346031526, 'language': 'Turkish'}
 

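In other words, line 177891 contains bytes that are not valid gb18030, and chardet's best guess for that line is "ISO-8859-9" (with a low confidence of about 0.37), so: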

df = pd.read_csv(in_file, sep='\t', encoding='ISO-8859-9',error_bad_lines=False)

With this change the file is read without the error.
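One caveat is worth checking before relying on the result. ISO-8859-9 is a single-byte encoding in which every byte maps to some character, so decoding with it never raises, whatever the file really contains. The sketch below uses hypothetical data, not the original file, to compare that with keeping gb18030 and replacing only the undecodable bytes.

```python
# Hypothetical data: a gb18030 header line plus one row with a stray 0xE1 byte
data = "样本\t丰度\n".encode("gb18030") + b"S1\t\xe1\t3\n"

# ISO-8859-9 maps every byte to exactly one character, so this never raises,
# but the Chinese header comes out as unrelated Latin characters.
as_iso = data.decode("ISO-8859-9")

# Keeping gb18030 and replacing only the undecodable bytes with U+FFFD
# preserves the Chinese text and marks the damaged spot.
as_gb = data.decode("gb18030", errors="replace")

print(repr(as_iso))
print(repr(as_gb))
```

If most of the file is Chinese text in gb18030, the second approach may be the safer one. For a file on disk, open(in_file, encoding='gb18030', errors='replace') gives a text handle with the same behavior, and pd.read_csv accepts an open file handle in place of a path.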

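Lessons learned:

  1. Think of opening the file with open(in_file, 'rb'). In text mode, reading fails at the decoding step, so you never get as far as calling chardet.detect().
  2. The core operation is chardet.detect(line).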
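The binary-mode scan can also be packaged as a small reusable helper. This is only a sketch: the name find_undecodable_lines and its return format are made up for illustration, and it uses only the standard library, leaving the chardet.detect() call to be run on whichever lines it flags.

```python
def find_undecodable_lines(path, encoding="gb18030"):
    """Return (line_number, byte_offset, byte_value) for each line that
    cannot be decoded with `encoding`. Line numbers start at 1."""
    problems = []
    with open(path, "rb") as f:  # binary mode: raw bytes, no decoding
        for line_number, line in enumerate(f, start=1):
            try:
                line.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first bad byte in the line
                problems.append((line_number, exc.start, line[exc.start]))
    return problems
```

Returning the byte offset along with the line number makes it easier to inspect the exact bytes around the problem before deciding whether to change the encoding for the whole file or only repair that line.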
