Crawling Sogou search result pages with WebCollector 2.x

// Import paths assume WebCollector 2.7x and jsoup; adjust them to your version.
import cn.edu.hfut.dmic.webcollector.conf.Configuration;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.ram.RamCrawler;
import org.jsoup.select.Elements;

/**
 * Searches Sogou for a keyword and prints the titles of the results.
 * @author tele
 *
 */
public class SougouCrawler extends RamCrawler {

    public SougouCrawler() {

    }

    public SougouCrawler(String keyword, int maxnum) {
        for (int i = 1; i <= maxnum; i++) {
            // Build the URL of result page i
            String url = "https://www.sogou.com/web?query=" + keyword + "&s_from=result_up&cid=&page=" + i + "&ie=utf8&p=40040100&dp=1&w=01029901&dr=1";
            // Store the page number as metadata so visit() can report it
            CrawlDatum crawlDatum = new CrawlDatum(url).meta("pageNum", i);
            addSeed(crawlDatum);
        }
        // Accept every URL; the rule only needs to be registered once
        addRegex(".*");
    }


    @Override
    public void visit(Page page, CrawlDatums next) {
        String pageNum = page.meta("pageNum");
        // Select the title link of each result
        Elements results = page.doc().select("div.results div[^class] h3 a");
        for (int i = 0; i < results.size(); i++) {
            System.out.println("Page " + pageNum + ", result " + (i + 1) + " ------ " + results.get(i).text());
        }
    }

    public static void main(String[] args) throws Exception {
        String keyword = "淘宝";
        SougouCrawler crawler = new SougouCrawler(keyword, 3);
        crawler.setThreads(8);

        Configuration conf = Configuration.copyDefault();
        conf.setExecuteInterval(3000);
        conf.setReadTimeout(5000);
        conf.setWaitThreadEndTime(3000);

        crawler.setConf(conf);
        crawler.start(1); // only one level deep
    }
}
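
The constructor concatenates the keyword straight into the query string. That works for a plain keyword, but a keyword containing a space or `&` would change the meaning of the URL. Below is a minimal sketch of building the same seed URLs with the keyword percent-encoded first. The class `SogouSeedUrls` and its methods are hypothetical helpers written for illustration; they are not part of WebCollector or of the original post. Only `java.net.URLEncoder` from the JDK is used.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of WebCollector.
final class SogouSeedUrls {

    private SogouSeedUrls() {
    }

    /** Percent-encodes the keyword as UTF-8 so it is safe inside a query string. */
    static String encodeKeyword(String keyword) {
        try {
            return URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM must support UTF-8, so this should be unreachable.
            throw new IllegalStateException(e);
        }
    }

    /** Builds the URL of one result page, reusing the query parameters from the post. */
    static String pageUrl(String keyword, int page) {
        return "https://www.sogou.com/web?query=" + encodeKeyword(keyword)
                + "&s_from=result_up&cid=&page=" + page
                + "&ie=utf8&p=40040100&dp=1&w=01029901&dr=1";
    }

    /** Builds the URLs of result pages 1..maxPages. */
    static List<String> seedUrls(String keyword, int maxPages) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            urls.add(pageUrl(keyword, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // "\u6dd8\u5b9d" is the keyword used in the post (Taobao, written in Chinese).
        for (String url : seedUrls("\u6dd8\u5b9d", 3)) {
            System.out.println(url);
        }
    }
}
```

Inside the crawler's constructor, each URL could then be wrapped as `new CrawlDatum(url).meta("pageNum", page)` and passed to `addSeed`, exactly as the listing does with its hand-built string.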

Output screenshot (partial)

Checking the titles against page 2 of the search results

Original article: https://www.cnblogs.com/tele-share/p/9466947.html

Time: 2024-10-06 09:14:43
