Reading the file with pandas:
df = pd.read_csv(in_file, sep='\t', encoding='gb18030',error_bad_lines=False)
The error:
$python test.py
Traceback (most recent call last):
File "pandas/_libs/parsers.pyx", line 1169, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1318, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1611, in pandas._libs.parsers._string_box_decode
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xe1 in position 150: illegal multibyte sequence
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 31, in <module>
df = pd.read_csv(in_file, sep='\t', encoding='gb18030',error_bad_lines=False)
File "/home/wubin/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/wubin/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/pandas/io/parsers.py", line 435, in _read
data = parser.read(nrows)
File "/home/wubin/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/pandas/io/parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "/home/wubin/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/pandas/io/parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1176, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1318, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1611, in pandas._libs.parsers._string_box_decode
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xe1 in position 150: illegal multibyte sequence
This means the file contains bytes that cannot be decoded as 'gb18030'.
So which encoding should be used instead?
One way to find out:
import chardet

def test_read_by_open(in_file):
    # Open in binary mode so that nothing is decoded while reading
    with open(in_file, 'rb') as f:
        for i, line in enumerate(f, start=1):  # read line by line
            try:
                line.decode('gb18030')
                # line.decode('ISO-8859-9')
            except UnicodeDecodeError:
                # Print the lines that cannot be decoded as 'gb18030'
                print(i)
                # print(str(line))
                print(chardet.detect(line))
Output:
$python test.py
177891
{'encoding': 'ISO-8859-9', 'confidence': 0.36612147346031526, 'language': 'Turkish'}
In other words, line 177891 contains bytes that chardet guesses to be "ISO-8859-9" (with a low confidence of about 0.37), so:
df = pd.read_csv(in_file, sep='\t', encoding='ISO-8859-9',error_bad_lines=False)
With that, the file loads.
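One caveat worth checking: ISO-8859-9 is a single-byte codec in which every byte value maps to some character, so decoding with it never raises. Any genuine GB18030 text in the file would then be read as accented Latin letters rather than Chinese. The sketch below uses invented sample bytes, not the original file, to illustrate this. It also shows an alternative: keep 'gb18030' and replace only the undecodable bytes with U+FFFD. The same idea should carry over to pandas by passing an already opened text handle, for example pd.read_csv(open(in_file, encoding='gb18030', errors='replace'), sep='\t'), or by using encoding_errors='replace' in pandas 1.3 and later.

```python
# Sample bytes are invented for illustration; they are not from the original file.
good = "中文\tabc".encode("gb18030")   # valid GB18030 text
bad = b"abc\xe1\tdef"                   # 0xe1 followed by a tab is not a valid GB18030 sequence

# ISO-8859-9 maps every byte to a character, so decoding never fails ...
as_latin5 = good.decode("iso8859_9")
# ... but the Chinese characters come out as unrelated Latin letters.
print(repr(as_latin5))

# Strict GB18030 decoding rejects the stray byte.
try:
    bad.decode("gb18030")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print(exc)

# errors="replace" keeps the rest of the line and marks the bad byte with U+FFFD.
repaired = bad.decode("gb18030", errors="replace")
print(repr(repaired))
```

If the first print shows readable text for the real data, ISO-8859-9 may well be the right choice. If it shows garbled Latin letters where Chinese is expected, the replace approach is probably the safer one.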
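Lessons learned:
- Remember to open the file in binary mode with open(in_file, 'rb'). In text mode with the wrong encoding, reading the file fails with the same decode error, so chardet.detect() never gets a chance to run.
- The core operation is chardet.detect(line).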
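To look at the offending bytes themselves rather than only the line number, the exception object can help: UnicodeDecodeError carries start and end offsets into the bytes being decoded. Below is a minimal sketch along the lines of the scan above, using only the standard library. The function name find_bad_lines and its context parameter are hypothetical, and the sample data is invented.

```python
import os
import tempfile


def find_bad_lines(path, encoding="gb18030", context=8):
    """Return (line_number, byte_offset, surrounding_bytes) for each undecodable line."""
    problems = []
    with open(path, "rb") as f:  # binary mode: no decoding happens on read
        for line_no, line in enumerate(f, start=1):
            try:
                line.decode(encoding)
            except UnicodeDecodeError as exc:
                lo = max(exc.start - context, 0)
                problems.append((line_no, exc.start, line[lo:exc.end + context]))
    return problems


# Small demonstration file with one deliberately broken line (line 2).
with tempfile.NamedTemporaryFile(delete=False, suffix=".tsv") as tmp:
    tmp.write("编号\t名称\n".encode("gb18030"))
    tmp.write(b"1\tabc\xe1\tdef\n")
    tmp.write("2\t正常\n".encode("gb18030"))
    demo_path = tmp.name

report = find_bad_lines(demo_path)
os.remove(demo_path)
for line_no, offset, snippet in report:
    print(line_no, offset, snippet)
```

Seeing the raw bytes next to their neighbours often makes it easier to judge whether the line is in a different encoding or simply corrupted.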