Basics of Scraping Web Page Data

The code is as follows:

package com.tracker.offline.tools;

import com.alibaba.fastjson.JSONObject;
import com.google.common.collect.Lists;
import com.tracker.common.utils.StringUtil;
import com.tracker.coprocessor.utils.JsonUtil;
import org.apache.commons.lang.StringUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import java.util.Map;

/**
 * Scrapes the data shown on a web page and appends it to a TSV file.
 */
public class SpiderUserPosTag {

    private static List<Integer> idList = Lists.newArrayList(113717565,113856580);

    private static final String url="http://192.168.202.17:8080/business/job51_jobstats/actions/jobshow_list";
    private static final String output="E:\\result.tsv";

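    /**
     * Posts a JSON query for the given id and returns the raw response body.
     */
    public String getDataFromWeb(String id) throws IOException {
        // JSON payload; the date range and paging values are hard-coded for this query.
        String body = "{\"startTime\":\"20190624\",\"endTime\":\"20190627\",\"seType\":\"2\",\"idType\":\"1\","
                + "\"timeType\":\"1\",\"startIndex\":1,\"offset\":50,\"id\":" + id + "}";

        Connection.Response response = Jsoup.connect(url)
                .timeout(12 * 1000)
                .userAgent("Mozilla/5.0")
                .method(Connection.Method.POST)
                .ignoreContentType(true)   // the endpoint returns JSON, not HTML
                .cookie("JSESSIONID", "986C7BA4E6FE3DB5C4591F3481D3FF1D")   // session cookie value
                .header("Content-Type", "application/json;charset=UTF-8")
                .data("a", "b")   // placeholder key/value pair
                .requestBody(body)
                .execute();
        // body() returns the response text as-is, without passing the JSON through the HTML parser.
        return response.body();
    }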

    public static void main(String[] args) throws Exception {
        SpiderUserPosTag sp = new SpiderUserPosTag();
        // Open the output file in append mode; try-with-resources flushes and closes it even if a request fails.
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(new File(output), true))) {
            for (Integer id : idList) {
                // Convert and parse the returned data, which is shaped like Map<String, List<Map<String, String>>>
                String line = sp.getDataFromWeb(String.valueOf(id));
                Map<String, String> maps = JsonUtil.parseJSON2MapStr(line);
                String str2 = maps.get("result");
                List<String> lists = JSONObject.parseArray(str2, String.class);
                for (String str3 : lists) {
                    Map<String, String> maps2 = JsonUtil.parseJSON2MapStr(str3);
                    // One tab-separated line per record
                    bw.write(StringUtil.joinString("\t", maps2.get("jobId"), maps2.get("jobName"), maps2.get("totalShowCount"),
                            maps2.get("totalClickCount"), maps2.get("totalApplyCount"), maps2.get("time"), maps2.get("webShowCount"),
                            maps2.get("webClickCount"), maps2.get("webApplyCount"), maps2.get("appShowCount"), maps2.get("appClickCount"),
                            maps2.get("appApplyCount"), maps2.get("mShowCount"), maps2.get("mClickCount"), maps2.get("mApplyCount"),
                            maps2.get("showCount"), maps2.get("clickCount"), maps2.get("applyCount")) + "\n");
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

}

Three things need to be determined:

The URL:

The cookie ID and the format of the request body:

The response parameters:
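To make the request body and the response fields more concrete, here is a minimal sketch that uses only the JDK. The class name `RequestSketch` and the helpers `buildRequestBody` and `toTsvLine` are hypothetical names introduced for illustration; they are not part of the original project. The JSON field names are taken from the code above, and writing an empty cell for a missing field is an assumption, not necessarily what `StringUtil.joinString` does.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

public class RequestSketch {

    // Builds the JSON request body. Field names mirror the request in the code above;
    // seType, idType and timeType stay fixed, as they are in the original request.
    public static String buildRequestBody(String startTime, String endTime,
                                          int startIndex, int offset, long id) {
        return String.format(Locale.ROOT,
                "{\"startTime\":\"%s\",\"endTime\":\"%s\",\"seType\":\"2\",\"idType\":\"1\","
                        + "\"timeType\":\"1\",\"startIndex\":%d,\"offset\":%d,\"id\":%d}",
                startTime, endTime, startIndex, offset, id);
    }

    // Joins the chosen columns of one result record into a tab-separated line.
    // Assumption: a missing field becomes an empty cell.
    public static String toTsvLine(Map<String, String> row, String... columns) {
        StringJoiner joiner = new StringJoiner("\t");
        for (String column : columns) {
            String value = row.get(column);
            joiner.add(value == null ? "" : value);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildRequestBody("20190624", "20190627", 1, 50, 113717565L));

        // A made-up record, only to show the output shape.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("jobId", "113717565");
        row.put("jobName", "Example job");
        System.out.println(toTsvLine(row, "jobId", "jobName", "totalShowCount"));
    }
}
```

Building the body in one helper keeps the date range and paging values in a single place, so changing them does not mean editing an escaped string literal inside the request chain.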

Original article: https://www.cnblogs.com/parent-absent-son/p/11317024.html

Time: 2024-08-29 13:09:43
