Scraping data from web pages with htmlparser

package parser;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/**
 * Basic web page fetching: the URL has to be entered by hand (it is
 * hard-coded below), and the entire HTML content is saved to the
 * specified file.
 *
 * @author chenguoyong
 */
public class ScrubSelectedWeb {

    private final static String CRLF = System.getProperty("line.separator");

    /**
     * @param args not used
     */
    public static void main(String[] args) {
        try {
            URL ur = new URL("http://10.249.187.199:8083/injs100/");
            InputStream instr = ur.openStream();
            String s, str;
            BufferedReader in = new BufferedReader(new InputStreamReader(instr));
            StringBuffer sb = new StringBuffer();
            BufferedWriter out = new BufferedWriter(new FileWriter(
                    "D:/outPut.txt"));
            // Read the response line by line, restoring the line breaks
            while ((s = in.readLine()) != null) {
                sb.append(s + CRLF);
            }
            System.out.println(sb);
            str = new String(sb);
            // Write the whole page to the output file
            out.write(str);
            out.close();
            in.close();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

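This basically achieves web page fetching, but the URL has to be entered by hand, and the code has not been refactored. It is only a simple starting idea.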
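One possible refactoring, as a rough sketch rather than a definitive version: the URL and the output file come from the command line instead of being hard-coded, and try-with-resources closes the streams even when an exception is thrown. The class name `PageSaver`, the helper names `readAll` and `savePage`, and the choice of UTF-8 are assumptions made for this illustration; they are not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.Writer;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical refactoring of ScrubSelectedWeb; all names here are illustrative.
public class PageSaver {

    /** Reads all lines from the reader and joins them with the platform line separator. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(reader);
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append(System.lineSeparator());
        }
        return sb.toString();
    }

    /** Fetches the page at pageUrl, writes it to target, and returns the saved text. */
    static String savePage(String pageUrl, Path target) throws IOException {
        URL url = new URL(pageUrl);
        String html;
        // Assumption: the page is UTF-8 encoded; a real page may declare another charset.
        try (Reader reader = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
            html = readAll(reader);
        }
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(html);
        }
        return html;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java PageSaver <url> <output-file>");
            return;
        }
        String html = savePage(args[0], Paths.get(args[1]));
        System.out.println("Saved " + html.length() + " characters to " + args[1]);
    }
}
```

It could be run as `java PageSaver http://10.249.187.199:8083/injs100/ D:/outPut.txt` to reproduce what the original program does. Splitting the reading loop into `readAll` also makes that part usable without a network connection, since it accepts any `Reader`.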

Time: 2024-11-03 22:43:26
