
Here are the things that could potentially slow down fetching:

1) DNS setup

2) The number of crawlers you have: too many or too few.

3) Bandwidth limitations

4) Number of threads per host (politeness)

5) Uneven distribution of urls to fetch and politeness.

6) High crawl-delays from robots.txt (usually along with an uneven distribution of urls).

7) Many slow websites (again usually with an uneven distribution).

8) Downloading lots of content (PDFs, very large HTML pages, again possibly an uneven distribution).

9) Others

Now, how do we fix them?

1) Have a DNS setup on each local crawling machine. With multiple crawling machines and a single centralized DNS server, the lookups can act like a DoS attack on that server and slow the entire system. We always used a two-layer setup, hitting the local DNS cache first and then a large DNS cache such as OpenDNS or Verizon.
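The two-layer idea can be sketched in-process, as a rough illustration only. The setup described above uses a local DNS cache on each machine, not application code; the class name `LocalDnsCache`, the `ttl_seconds` value, and the injectable `upstream` resolver are all assumptions made for this example.

```python
import socket
import time
from typing import Callable, Dict, Tuple


class LocalDnsCache:
    """Hypothetical local cache layer in front of an upstream resolver."""

    def __init__(self,
                 upstream: Callable[[str], str] = socket.gethostbyname,
                 ttl_seconds: float = 300.0) -> None:
        self._upstream = upstream          # second layer (e.g. a large shared cache)
        self._ttl = ttl_seconds            # assumed lifetime of a cached answer
        self._entries: Dict[str, Tuple[str, float]] = {}
        self.upstream_lookups = 0          # how often the second layer was hit

    def resolve(self, host: str) -> str:
        now = time.monotonic()
        cached = self._entries.get(host)
        if cached is not None and cached[1] > now:
            return cached[0]               # answered locally, no upstream traffic
        self.upstream_lookups += 1
        address = self._upstream(host)
        self._entries[host] = (address, now + self._ttl)
        return address
```

Repeated lookups for the same host are answered by the first layer, so only the first request per host (per TTL window) reaches the upstream resolver.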

2) This is the number of map tasks * fetcher.threads.fetch. So 10 map tasks * 20 threads = 200 fetchers at once. Too many and you overload your system; too few and, combined with the other factors here, the machine sits idle. You will need to experiment with this setting for your setup.
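The arithmetic is simple enough to keep in a small helper when comparing setups. This is only a sketch; the function name is made up for illustration.

```python
def concurrent_fetchers(map_tasks: int, threads_per_task: int) -> int:
    """Total fetcher threads running at once: map tasks * fetcher.threads.fetch."""
    if map_tasks < 0 or threads_per_task < 0:
        raise ValueError("counts must be non-negative")
    return map_tasks * threads_per_task


# The example from the text: 10 map tasks with 20 threads each.
print(concurrent_fetchers(10, 20))
```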

3) Bandwidth limitations. Use ntop, ganglia, and other monitoring tools to determine how much bandwidth you are using. Account for both inbound and outbound bandwidth. A simple test: from a server inside the fetching network that is not itself fetching, try connecting to or downloading content while fetching is occurring. If that is very slow, it is a good bet you are maxing out bandwidth. If you set http.timeout as described under 7) and are maxing out your bandwidth, you will start seeing many HTTP timeout errors.

4) Politeness along with uneven distribution of URLs is probably the biggest limiting factor. If one thread is processing a single site and there are a lot of URLs from that site to fetch, all other threads will sit idle while that one thread finishes. Some solutions: use fetcher.server.delay to shorten the time between page fetches, and raise the per-host thread count to increase the number of threads fetching a single site (these threads would still be in the same map task, and hence the same JVM ChildTask process). If you increase the thread count, you could also set fetcher.server.min.delay to some value > 0 so that politeness is bounded by both a minimum and a maximum delay.
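To get a feel for how much one large host can hold up a fetch, here is a deliberately simplified model. It is an assumption for illustration, not Nutch's actual scheduling: every "round" fetches one page per thread for that host and then waits out the delay. The function and parameter names are hypothetical.

```python
import math


def host_drain_seconds(pages: int,
                       delay_seconds: float,
                       avg_fetch_seconds: float = 0.0,
                       threads_per_host: int = 1) -> float:
    """Rough time to work through all queued pages of a single host."""
    if threads_per_host < 1:
        raise ValueError("threads_per_host must be at least 1")
    rounds = math.ceil(pages / threads_per_host)
    return rounds * (delay_seconds + avg_fetch_seconds)


# 1000 pages from one site with a 5 second delay between fetches:
print(host_drain_seconds(1000, 5.0))                      # one thread
print(host_drain_seconds(1000, 5.0, threads_per_host=2))  # two threads
```

Under this model, halving the delay or doubling the threads for that host halves the time during which the rest of the fetcher may be waiting on it.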

5) Fetching a lot of pages from a single site, or a lot of pages from a few sites, will slow down fetching dramatically. For full web crawls you want an even distribution so all fetching threads can be active. Setting the per-host/domain page limit to a value > 0 will cap the number of pages fetched from a single host/domain.
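The effect of such a per-host cap on a fetch list can be sketched as follows. This is a minimal illustration, not Nutch's generator code; `cap_per_host` and `max_per_host` are hypothetical names, and treating a non-positive value as "no cap" is an assumption that mirrors the "> 0" wording above.

```python
from collections import Counter
from typing import Iterable, List
from urllib.parse import urlsplit


def cap_per_host(urls: Iterable[str], max_per_host: int) -> List[str]:
    """Keep at most max_per_host URLs for each host, preserving order."""
    if max_per_host <= 0:
        return list(urls)                  # assumed: non-positive means unlimited
    seen: Counter = Counter()
    kept: List[str] = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(url)
    return kept


fetch_list = [
    "http://a.example/1", "http://a.example/2", "http://a.example/3",
    "http://b.example/1",
]
print(cap_per_host(fetch_list, 2))
```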

6) Crawl-delay in robots.txt can be used and is obeyed by Nutch. Most sites don't use this setting, but a few (some of them malicious) do. I have seen crawl-delays as high as 2 days, expressed in seconds. The fetcher.max.crawl.delay variable will ignore pages with crawl delays > x. I usually set this to 10 seconds; the default is 30. Even at 10 seconds, if you have a lot of pages from a site from which you can only crawl 1 page every 10 seconds, it is going to be slow. On the flip side, setting this to a low value means pages above that delay are ignored and not fetched.
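The skip decision can be illustrated with Python's standard `urllib.robotparser`. This is a sketch of the idea only, not how Nutch implements it; the function name, the user agent `mybot`, and the sample robots.txt contents are assumptions.

```python
from urllib.robotparser import RobotFileParser


def exceeds_max_crawl_delay(robots_txt: str,
                            user_agent: str,
                            max_crawl_delay: float) -> bool:
    """True if robots.txt declares a Crawl-delay above the allowed maximum."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return delay is not None and delay > max_crawl_delay


# A crawl-delay of 2 days in seconds (2 * 24 * 3600 = 172800).
hostile = "User-agent: *\nCrawl-delay: 172800\n"
polite = "User-agent: *\nCrawl-delay: 5\n"

print(exceeds_max_crawl_delay(hostile, "mybot", 10))
print(exceeds_max_crawl_delay(polite, "mybot", 10))
```

With a maximum of 10 seconds, pages from the first site would be skipped, while the second site would be fetched at one page every 5 seconds.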

7) Sometimes, in fact many times, websites are just slow. Setting a low value for http.timeout helps. The default is 10 seconds. If you don't care and want as many pages as fast as possible, set it lower. Some websites, Digg for instance, will bandwidth-limit you on their side, only allowing x connections per given time frame. So even if you only have, say, 50 pages from a single site (which I still think is too many), the fetcher may be waiting 10 seconds on each page. The ftp.timeout can also be set if fetching FTP content.

8) Lots of content means slower fetching. This is especially true if you are downloading PDFs and other non-HTML documents. To avoid non-HTML content you can use the URL filters. I prefer the prefix and suffix filters. The http.content.limit and ftp.content.limit settings can be used to limit the amount of content downloaded for a single document.
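Both ideas, filtering by suffix and capping the size of a downloaded document, can be sketched in a few lines. This illustrates the concepts rather than Nutch's filter plugins; the suffix list and function names are hypothetical, and treating a negative limit as "no truncation" is an assumption of this sketch.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Hypothetical list of suffixes to skip; adjust to your crawl.
SKIPPED_SUFFIXES: Tuple[str, ...] = (".pdf", ".zip", ".gz", ".jpg", ".png")


def passes_suffix_filter(url: str,
                         skipped: Tuple[str, ...] = SKIPPED_SUFFIXES) -> bool:
    """Reject URLs whose path ends with an unwanted suffix (case-insensitive)."""
    path = urlsplit(url).path.lower()
    return not path.endswith(tuple(skipped))


def truncate_content(body: bytes, content_limit: int) -> bytes:
    """Keep at most content_limit bytes; a negative limit keeps everything."""
    if content_limit < 0:
        return body
    return body[:content_limit]


print(passes_suffix_filter("http://example.org/docs/report.PDF?dl=1"))
print(passes_suffix_filter("http://example.org/index.html"))
print(len(truncate_content(b"x" * 100000, 65536)))
```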

9) Other things that could be causing slow fetching:

    Maxing out the number of open sockets/files on a machine. You will start seeing IO errors or "can't open socket" errors.
    Poor routing. Bad routers or home routers might not be able to handle the number of connections going through at once. An incorrect routing setup could also be causing problems, but those are usually much more complex to diagnose. Use network trace and mapping tools if you think this is happening. Upstream routing at your network provider can also be a problem.
    Bad network cards. I have seen network cards flip once they reach a certain bandwidth point. This was more prevalent on what were, at the time, newer gigabit cards. Not usually my first thought, but always a possibility. Use tcpdump and network monitoring tools on the single interface.
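For the open sockets/files case, a quick check on Unix-like systems is to compare the planned number of fetcher threads with the process's file descriptor limit. This is a minimal sketch using Python's standard `resource` module (not available on Windows); the function names and the `reserved` headroom value are assumptions for illustration.

```python
import resource
from typing import Tuple


def open_file_limits() -> Tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_socket_headroom(fetcher_threads: int, reserved: int = 100) -> bool:
    """Check that one socket per thread plus some reserved descriptors fits."""
    soft, _hard = open_file_limits()
    if soft == resource.RLIM_INFINITY:
        return True
    return fetcher_threads + reserved <= soft


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
print("enough for 200 fetchers:", has_socket_headroom(200))
```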

Time: 2024-12-14 14:50:24

