A roundup of mojibake problems hit by a Node.js crawler
The previous article showed a Node.js program parsing pages encoded as gbk, gb2312, and utf-8. There are three special mojibake cases that deserve a separate write-up.
1. The page declares utf-8 but decodes as mojibake, case one; example site: www.guoguo-app.com.
This one is honestly absurd. The page source declares its encoding as utf8; the page title reads:
查快递
Since everything I decoded came out garbled, I captured the traffic, and the response headers give the encoding as GBK. Sure enough, once I decoded as gbk the result was no longer mojibake. Taobao really bends over backwards for anti-crawling; I am curious how this trick is implemented, though, so tell me if you know.
GET / HTTP/1.1
Host: www.guoguo-app.com
Connection: close

HTTP/1.1 200 OK
Date: Thu, 06 Apr 2017 01:56:23 GMT
Content-Type: text/html;charset=GBK
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Vary: Accept-Encoding
Content-Language: zh-CN
Server: Tengine/Aserver
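A minimal sketch of the fix (detectCharset is a hypothetical helper name, not from the original program): trust the charset in the Content-Type response header first, and fall back to the page's meta tag only when the header does not name one.

```javascript
// Hypothetical helper: prefer the server's Content-Type charset over
// the <meta> tag in the HTML, since (as with www.guoguo-app.com) the
// two can disagree and the header reflects the actual bytes sent.
function detectCharset(headers, html) {
  const fromHeader = /charset=([\w-]+)/i.exec(headers['content-type'] || '');
  if (fromHeader) return fromHeader[1].toLowerCase();
  const fromMeta = /<meta[^>]+charset=["']?([\w-]+)/i.exec(html);
  return fromMeta ? fromMeta[1].toLowerCase() : 'utf-8'; // assume utf-8 otherwise
}
```

With this, www.guoguo-app.com would be detected as gbk from its headers even though its source claims utf8.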
2. The page declares utf-8 but decodes as mojibake, case two; example site: http://andersonjiang.blog.sohu.com/
Nothing looks wrong in the page source itself, so I captured the traffic again and got the following:
GET / HTTP/1.1
Host: andersonjiang.blog.sohu.com
Connection: close

HTTP/1.1 200 OK
Content-Type: text/html; charset=GBK
Transfer-Encoding: chunked
Connection: close
Server: nginx
Date: Thu, 06 Apr 2017 02:10:33 GMT
Vary: Accept-Encoding
Expires: Thu, 01 Jan 1970 00:00:00 GMT
RHOST: 192.168.110.68@11177
Pragma: No-cache
Cache-Control: no-cache
Content-Language: en-US
Content-Encoding: gzip
FSS-Cache: MISS from 13539701.18454911.21477824
FSS-Proxy: Powered by 9935166.11245896.17873234
andersonjiang.blog.sohu.com uses both the Transfer-Encoding: chunked transfer coding and the Content-Encoding: gzip content coding. Since my Node.js crawler has no gzip decompression, it extracts nothing from this site at all, neither title nor charset. To handle sites like this you have to add gzip decompression.
For comparison, the two sites below, www.cr173.com and www.csdn.net, show what a normal capture looks like.
GET / HTTP/1.1
Host: www.cr173.com
Connection: close

HTTP/1.1 200 OK
Expires: Thu, 06 Apr 2017 02:42:20 GMT
Date: Thu, 06 Apr 2017 02:12:20 GMT
Content-Type: text/html
Last-Modified: Thu, 06 Apr 2017 00:52:42 GMT
ETag: "96a4141970aed21:0"
Cache-Control: max-age=1800
Accept-Ranges: bytes
Content-Length: 158902
Accept-Ranges: bytes
X-Varnish: 1075189606
Via: 1.1 varnish
X-Via: 1.1 dxxz46:4 (CdnCacheServer V2.0), 1.1 oudxin15:1 (CdnCacheServer V2.0)
Connection: close

GET / HTTP/1.1
Host: www.csdn.net
Connection: close

HTTP/1.1 200 OK
Server: openresty
Date: Thu, 06 Apr 2017 02:18:59 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 99363
Connection: close
Vary: Accept-Encoding
Last-Modified: Thu, 06 Apr 2017 02:10:02 GMT
Vary: Accept-Encoding
ETag: "58e5a37a-18423"
Accept-Ranges: bytes
3. The page uses some other encoding entirely and decodes as mojibake, for example:
(1) Big5; example sites: www.ruten.com.tw, www.ctgoodjobs.hk
(2) Shift_JIS; example sites: www.vector.co.jp, www.smbc.co.jp
(3) windows-125x codepages; example sites: www.tff.org, www.pravda.com.ua
(4) EUC-JP; example site: www.showtime.jp
(5) EUC-KR; example sites: www.incruit.com, www.samsunghospital.com
Meanwhile, the iconv-lite README at the time lists support for only the following encodings:
Currently only a small part of encodings supported:
All node.js native encodings: 'utf8', 'ucs2', 'ascii', 'binary', 'base64'.
Base encodings: 'latin1'
Cyrillic encodings: 'windows-1251', 'koi8-r', 'iso8859-5'.
Simplified chinese: 'gbk', 'gb2313'.
Other encodings are easy to add, see the source. Please, participate
So for the page encodings listed above, the only way forward is to add the decoders yourself.
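If you would rather not patch iconv-lite, one alternative (a sketch assuming a Node build with full ICU, the default in recent releases) is the built-in WHATWG TextDecoder, which already understands Big5, Shift_JIS, EUC-JP, EUC-KR, and the windows-125x codepages:

```javascript
// Sketch: decode a raw body Buffer using whatever charset was found
// in the headers or <meta> tag. TextDecoder is a Node global in
// full-ICU builds and covers the encodings iconv-lite was missing.
function decodeWithCharset(rawBody, charset) {
  return new TextDecoder(charset).decode(rawBody);
}
```

For example, decodeWithCharset(body, 'big5') would handle www.ruten.com.tw without any extra dependency.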
In short, there is still a long way to go before this becomes a truly general-purpose crawler.