gocrawl analysis

1. gocrawl type structure

// The crawler itself, the master of the whole process
type Crawler struct {
    Options *Options

    // Internal fields
    logFunc         func(LogFlags, string, ...interface{})
    push            chan *workerResponse
    enqueue         chan interface{}
    stop            chan struct{}
    wg              *sync.WaitGroup
    pushPopRefCount int
    visits          int

    // keep lookups in maps, O(1) access time vs O(n) for slice. The empty struct value
    // is of no use, but this is the smallest type possible - it uses no memory at all.
    visited map[string]struct{}
    hosts   map[string]struct{}
    workers map[string]*worker
}

// The Options available to control and customize the crawling process.
type Options struct {
    UserAgent             string
    RobotUserAgent        string
    MaxVisits             int
    EnqueueChanBuffer     int
    HostBufferFactor      int
    CrawlDelay            time.Duration // Applied per host
    WorkerIdleTTL         time.Duration
    SameHostOnly          bool
    HeadBeforeGet         bool
    URLNormalizationFlags purell.NormalizationFlags
    LogFlags              LogFlags
    Extender              Extender
}

// Extension methods required to provide an extender instance.
type Extender interface {
    // Start, End, Error and Log are not related to a specific URL, so they don't
    // receive a URLContext struct.
    Start(interface{}) interface{}
    End(error)
    Error(*CrawlError)
    Log(LogFlags, LogFlags, string)

    // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
    // is related to a URLContext (holds a ctx field).
    ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

    // All other extender methods are executed in the context of an URL, and thus
    // receive an URLContext struct as first argument.
    Fetch(*URLContext, string, bool) (*http.Response, error)
    RequestGet(*URLContext, *http.Response) bool
    RequestRobots(*URLContext, string) ([]byte, bool)
    FetchedRobots(*URLContext, *http.Response)
    Filter(*URLContext, bool) bool
    Enqueued(*URLContext)
    Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
    Visited(*URLContext, interface{})
    Disallowed(*URLContext)
}
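
The comment on the visited and hosts fields describes the usual Go set idiom: a map whose value type is the zero-sized struct{}. The sketch below illustrates that idiom on its own. The urlSet type and its methods are hypothetical names invented for this example and are not part of gocrawl. Note that only the value is zero-sized; the keys and the map's own bookkeeping still take memory.

```go
package main

import "fmt"

// urlSet is a hypothetical helper (not part of gocrawl): a set of strings
// backed by map[string]struct{}, the same shape as the Crawler.visited field.
type urlSet map[string]struct{}

// Add inserts u and reports whether it was not already present.
func (s urlSet) Add(u string) bool {
	if _, ok := s[u]; ok {
		return false
	}
	s[u] = struct{}{}
	return true
}

// Has reports whether u is in the set.
func (s urlSet) Has(u string) bool {
	_, ok := s[u]
	return ok
}

func main() {
	visited := urlSet{}
	fmt.Println(visited.Add("http://example.com/"))   // true: first insertion
	fmt.Println(visited.Add("http://example.com/"))   // false: duplicate
	fmt.Println(visited.Has("http://example.com/a")) // false: never added
	fmt.Println(len(visited))                         // 1
}
```

A lookup is a single map access, which is the O(1) behaviour the source comment contrasts with scanning a slice.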

Entry point:

func main() {
    // Ext is a user-defined type wrapping *gocrawl.DefaultExtender (definition not shown here)
    ext := &Ext{&gocrawl.DefaultExtender{}}
    // Set custom options
    opts := gocrawl.NewOptions(ext)
    opts.CrawlDelay = 1 * time.Second
    opts.LogFlags = gocrawl.LogError
    opts.SameHostOnly = false
    opts.MaxVisits = 10

    c := gocrawl.NewCrawlerWithOptions(opts)
    c.Run("http://0value.com")
}

Three steps in main:

1) get an Extender

2) create Options with the given Extender

3) create the Crawler from those Options

As the source comment says, the Crawler controls the whole process; Options supplies the configuration, and the Extender does the real work.
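
To make that division of roles concrete, here is a small standard-library-only model of it. Every name in it (hooks, defaultHooks, myHooks, options, controller) is a hypothetical stand-in chosen for this sketch; none of it is gocrawl's code, and the real Extender has many more methods. It only shows the shape: a controller drives the loop, an options struct carries the settings, and an embedded default implementation lets the user override just the callbacks they care about.

```go
package main

import (
	"fmt"
	"strings"
)

// hooks plays the role of Extender: callbacks the controller invokes.
type hooks interface {
	Filter(url string) bool  // should this URL be processed?
	Visit(url string) string // process the URL and return a result
}

// defaultHooks plays the role of DefaultExtender: usable defaults.
type defaultHooks struct{}

func (defaultHooks) Filter(url string) bool  { return true }
func (defaultHooks) Visit(url string) string { return "default:" + url }

// myHooks embeds the defaults and overrides only Filter, the same shape
// as Ext{&gocrawl.DefaultExtender{}} in the entry point.
type myHooks struct{ *defaultHooks }

func (myHooks) Filter(url string) bool { return strings.HasPrefix(url, "http://a/") }

// options plays the role of Options: configuration plus the hooks.
type options struct {
	MaxVisits int
	Hooks     hooks
}

// controller plays the role of Crawler: it owns the control flow.
type controller struct{ opts *options }

// Run walks the seeds, delegating every decision to the hooks.
func (c *controller) Run(seeds []string) []string {
	var out []string
	for _, u := range seeds {
		if len(out) >= c.opts.MaxVisits {
			break // configuration limit reached
		}
		if !c.opts.Hooks.Filter(u) {
			continue // rejected by the user's override
		}
		out = append(out, c.opts.Hooks.Visit(u)) // promoted default method
	}
	return out
}

func main() {
	opts := &options{MaxVisits: 2, Hooks: myHooks{&defaultHooks{}}}
	c := &controller{opts: opts}
	fmt.Println(c.Run([]string{"http://a/1", "http://b/1", "http://a/2", "http://a/3"}))
}
```

Because myHooks embeds *defaultHooks, the Visit method is promoted automatically, so the override of Filter is the only code the user has to write.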

2. Other key structs

worker, workerResponse and sync.WaitGroup

// Communication from worker to the master crawler, about the crawling of a URL
type workerResponse struct {
    ctx           *URLContext
    visited       bool
    harvestedURLs interface{}
    host          string
    idleDeath     bool
}

// The worker is dedicated to fetching and visiting a given host, respecting
// this host's robots.txt crawling policies.
type worker struct {
    // Worker identification
    host  string
    index int

    // Communication channels and sync
    push    chan<- *workerResponse
    pop     popChannel
    stop    chan struct{}
    enqueue chan<- interface{}
    wg      *sync.WaitGroup

    // Robots validation
    robotsGroup *robotstxt.Group

    // Logging
    logFunc func(LogFlags, string, ...interface{})

    // Implementation fields
    wait           <-chan time.Time
    lastFetch      *FetchInfo
    lastCrawlDelay time.Duration
    opts           *Options
}
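
The fields above suggest a master/worker layout: one worker per host, a pop channel feeding it URLs, a shared push channel carrying responses back, a stop channel for shutdown, and a WaitGroup so the master can wait for workers to exit. The sketch below is a simplified model of that layout using only the standard library. It is an assumption about the general pattern, not gocrawl's actual implementation; response, runWorker and crawl are hypothetical names, and the pending counter is only my guess at the kind of bookkeeping a field like pushPopRefCount supports.

```go
package main

import (
	"fmt"
	"sync"
)

// response is a simplified stand-in for workerResponse.
type response struct {
	host    string
	url     string
	visited bool
}

// runWorker serves one host: it takes URLs from pop, reports each one on
// push, and exits when stop is closed.
func runWorker(host string, pop <-chan string, push chan<- response,
	stop <-chan struct{}, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-stop:
			return
		case u := <-pop:
			// A real worker would fetch and parse the page here.
			push <- response{host: host, url: u, visited: true}
		}
	}
}

// crawl starts one worker per host, collects one response per queued URL,
// then signals the workers to stop and waits for them to exit.
func crawl(byHost map[string][]string) []response {
	push := make(chan response)
	stop := make(chan struct{})
	var wg sync.WaitGroup

	pending := 0 // URLs handed to workers but not yet reported back
	for host, urls := range byHost {
		pop := make(chan string, len(urls))
		for _, u := range urls {
			pop <- u
			pending++
		}
		wg.Add(1)
		go runWorker(host, pop, push, stop, &wg)
	}

	out := make([]response, 0, pending)
	for ; pending > 0; pending-- {
		out = append(out, <-push)
	}

	close(stop) // broadcast shutdown to every worker
	wg.Wait()   // block until all workers have returned
	return out
}

func main() {
	got := crawl(map[string][]string{
		"a.example": {"http://a.example/1", "http://a.example/2"},
		"b.example": {"http://b.example/1"},
	})
	fmt.Println(len(got), "responses received")
}
```

Closing stop rather than sending on it is what lets a single signal reach every worker at once, which is one plausible reason for its chan struct{} type.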

3. I will write up the complete gocrawl workflow in a few days. (6/20/2014)

