This article explains how to configure Nutch to imitate a web browser so that it can get past simple anti-crawler restrictions.

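When Nutch is configured to crawl http://yangshangchuan.iteye.com, every fetched page contains only the message "您的访问请求被拒绝 ......" ("Your access request was denied ......"). This is the simplest kind of anti-crawler strategy: the site reads the User-Agent HTTP request header to decide whether the client is a person (a browser) or a crawler. Configuring Nutch to imitate a web browser (simulate web browser) is enough to bypass this restriction.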
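To make the mechanism concrete, here is a minimal sketch of what such a server-side check might look like. It is not the actual implementation used by the site above; the class name, the method name, and the list of marker substrings are all assumptions made for illustration.

```java
import java.util.Locale;

class UserAgentFilterSketch {

    // Hypothetical marker substrings; a real site may use a different list
    // or a different rule entirely.
    private static final String[] CRAWLER_MARKERS = {"nutch", "bot", "spider", "crawler"};

    // Returns true when the User-Agent is missing or contains one of the markers.
    static boolean looksLikeCrawler(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return true;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String marker : CRAWLER_MARKERS) {
            if (ua.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A crawler-style User-Agent and a browser-style User-Agent.
        System.out.println(looksLikeCrawler("MyCrawler/Nutch-1.7"));
        System.out.println(looksLikeCrawler(
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```

A check of this kind only looks at a string the client chooses to send, which is why changing the User-Agent is enough to get past it.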
nutch-default.xml contains five properties related to the User-Agent:
<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>
<property>
  <name>http.agent.version</name>
  <value>Nutch-1.7</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>
The class nutch2.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java shows how these five properties are combined into the User-Agent string:
this.userAgent = getAgentString(conf.get("http.agent.name"),
    conf.get("http.agent.version"),
    conf.get("http.agent.description"),
    conf.get("http.agent.url"),
    conf.get("http.agent.email"));

private static String getAgentString(String agentName,
                                     String agentVersion,
                                     String agentDesc,
                                     String agentURL,
                                     String agentEmail) {
  if ((agentName == null) || (agentName.trim().length() == 0)) {
    // TODO : NUTCH-258
    if (LOGGER.isErrorEnabled()) {
      LOGGER.error("No User-Agent string set (http.agent.name)!");
    }
  }
  StringBuffer buf = new StringBuffer();
  // The agent name and version are joined as "name/version".
  buf.append(agentName);
  if (agentVersion != null) {
    buf.append("/");
    buf.append(agentVersion);
  }
  // Description, URL and email, if any, follow in parentheses.
  if (((agentDesc != null) && (agentDesc.length() != 0))
      || ((agentEmail != null) && (agentEmail.length() != 0))
      || ((agentURL != null) && (agentURL.length() != 0))) {
    buf.append(" (");
    if ((agentDesc != null) && (agentDesc.length() != 0)) {
      buf.append(agentDesc);
      if ((agentURL != null) || (agentEmail != null))
        buf.append("; ");
    }
    if ((agentURL != null) && (agentURL.length() != 0)) {
      buf.append(agentURL);
      if (agentEmail != null)
        buf.append("; ");
    }
    if ((agentEmail != null) && (agentEmail.length() != 0))
      buf.append(agentEmail);
    buf.append(")");
  }
  return buf.toString();
}

The class nutch2.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java then sends this value as the User-Agent request header. The userAgent returned by http.getUserAgent() here is the userAgent field set in HttpBase.java:
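String userAgent = http.getUserAgent();
if ((userAgent == null) || (userAgent.length() == 0)) {
  if (Http.LOG.isErrorEnabled()) { Http.LOG.error("User-agent is not set!"); }
} else {
  reqStr.append("User-Agent: ");
  reqStr.append(userAgent);
  reqStr.append("\r\n");
}

The analysis above shows that http.agent.name and http.agent.version are joined with a "/", and that the parenthesized part is added only when the description, URL or email is set. A browser's User-Agent string can therefore be reproduced by splitting it at one of its slashes: the part before the slash goes into http.agent.name and the part after it into http.agent.version. Adding any one of the following configurations to nutch-site.xml is enough to imitate a specific browser (Imitating a specific browser):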
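1. Imitating Firefox:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>20100101 Firefox/27.0</value>
</property>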
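2. Imitating IE:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>6.0)</value>
</property>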
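3. Imitating Chrome:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>537.36</value>
</property>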
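4. Imitating Safari:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>534.57.2</value>
</property>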
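5. Imitating Opera:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>19.0.1326.59</value>
</property>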
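The browser values listed here only make sense once the name and version are joined with a "/". The sketch below restates just that part of getAgentString so the resulting header values can be checked without running Nutch. The class UserAgentSketch and the method compose are hypothetical names, not part of Nutch, and the sketch assumes that http.agent.description, http.agent.url and http.agent.email are left empty, so no parenthesized suffix is appended.

```java
class UserAgentSketch {

    // Simplified restatement of the name/version step of getAgentString.
    // Assumes description, URL and email are empty, so nothing else is appended.
    static String compose(String agentName, String agentVersion) {
        StringBuilder buf = new StringBuilder();
        buf.append(agentName);
        if (agentVersion != null) {
            buf.append("/");
            buf.append(agentVersion);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // {http.agent.name, http.agent.version} pairs taken from the list above.
        String[][] configs = {
            {"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko", "20100101 Firefox/27.0"},
            {"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident", "6.0)"},
        };
        for (String[] c : configs) {
            // Prints the value Nutch would send after "User-Agent: ".
            System.out.println(compose(c[0], c[1]));
        }
    }
}
```

For the Firefox pair this yields Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0, and for the IE pair the string ends in Trident/6.0), which is why the closing parenthesis has to live in http.agent.version.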
Postscript: ways to check your User-Agent:
1. http://www.useragentstring.com
2. http://whatsmyuseragent.com
3. http://www.enhanceie.com/ua.aspx
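If none of these sites is reachable, the header can also be checked locally. The following is a minimal sketch, not part of Nutch: it starts a throwaway server on 127.0.0.1 using the JDK's built-in com.sun.net.httpserver.HttpServer, sends one request with a chosen User-Agent, and returns what the server saw. The class and method names are hypothetical, and it assumes Java 9 or later (for InputStream.readAllBytes).

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class UserAgentEchoSketch {

    // Sends one request carrying the given User-Agent to a local echo server
    // and returns the User-Agent value that the server received.
    static String roundTrip(String userAgent) {
        HttpServer server;
        try {
            // Port 0 lets the operating system pick a free port.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        server.createContext("/", exchange -> {
            String seen = exchange.getRequestHeaders().getFirst("User-Agent");
            byte[] body = String.valueOf(seen).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            HttpURLConnection conn = (HttpURLConnection)
                URI.create("http://127.0.0.1:" + port + "/").toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```

To check Nutch itself rather than this test client, the same kind of echo server could be left running and used as a crawl seed, with the fetched page content showing the header Nutch actually sent.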
