Java WebClient Summary

private WebClient getAWebClient() {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
        webClient.getOptions().setTimeout(20000);
        // webClient.getCookieManager().setCookiesEnabled(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        webClient.addRequestHeader("Accept-Encoding", "gzip, deflate");
        webClient.addRequestHeader("Accept-Language", "en-US,en;q=0.5");
        webClient.addRequestHeader("Cache-Control", "max-age=0");
        webClient.addRequestHeader("Connection", "keep-alive");
        webClient.addRequestHeader("Host", "www.amazon.com");
        webClient.addRequestHeader("User-Agent", "Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0");
        return webClient;
    }
/**
     * Crawl a web page.
     */
    public StringBuilder crawlPage(String url) {
        StringBuilder builder = new StringBuilder();
        logger.info(Thread.currentThread().getName() + " crawl " + url);
        // The MyGetPage code goes here
        webClient.getCookieManager().clearCookies();
        logger.info(Thread.currentThread().getName() + " webClient.getCookieManager().clearCookies();");
        File file = new File(cookiePathAppendRandom());
        logger.info(Thread.currentThread().getName() + " File file = new File(cookiePathAppendRandom());");
        if (file.exists()) {
            CookieStore cookieStore = null;
            // try-with-resources closes the stream even if deserialization fails
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                cookieStore = (CookieStore) in.readObject();
            } catch (IOException | ClassNotFoundException e) {
                logger.error(e);
            }
            // Restore cookies only if the file was read successfully; this avoids a NullPointerException
            if (cookieStore != null) {
                for (org.apache.http.cookie.Cookie temp : cookieStore.getCookies()) {
                    Cookie cookie = new Cookie(temp.getDomain(), temp.getName(), temp.getValue(), temp.getPath(),
                            temp.getExpiryDate(), false);
                    webClient.getCookieManager().addCookie(cookie);
                }
            }
        }
        logger.info(Thread.currentThread().getName() + " MyGetPage start,url:" + url);
        HtmlPage page = MyGetPage(new StringBuffer(url));
        logger.info(Thread.currentThread().getName() + " MyGetPage end,url:" + url);
        if (page == null) {
            // Models that hit an exception while crawling can be collected in one list and sent to the server to be re-added to the crawl assignment queue
            logger.info("Page null!");
            AmazonCrawlModel model=new AmazonCrawlModel(crawlId, crawlURLId, url, depth,ischange);
            exceptionFun(model);
            return (new StringBuilder("getNullPage"));
        }
        logger.info(Thread.currentThread().getName() + " builder.append(page.asXml());");
        builder.append(page.asXml());
        logger.info(Thread.currentThread().getName() + " return builder;");
        logger.info(Thread.currentThread().getName() +" CrawlPage $Length="+builder.length());
        // Treat a response of 300 characters or fewer as a failed fetch
        if(builder.length()<=300){
            AmazonCrawlModel model=new AmazonCrawlModel(crawlId, crawlURLId, url, depth,ischange);
            exceptionFun(model);
            return (new StringBuilder("getNullPage"));
        }
        return builder;
    }
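
The cookie-restoring step in crawlPage deserializes an object from disk and then uses it. Below is a minimal sketch of that pattern using only the standard library. It assumes a hypothetical helper class `CookieFileSketch`, and any `Serializable` value (for example an `ArrayList<String>`) stands in for the real `CookieStore`. The streams are closed by try-with-resources, and a failed read yields an empty `Optional` rather than a null reference.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Optional;

// Hypothetical helper for illustration; not part of the original crawler.
final class CookieFileSketch {

    private CookieFileSketch() {
    }

    // Serializes a value to the given file, closing the stream automatically.
    static void save(File file, Serializable value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the object back. Returns Optional.empty() when the file is missing,
    // unreadable, or holds an object of a different type.
    static <T> Optional<T> load(File file, Class<T> type) {
        if (!file.exists()) {
            return Optional.empty();
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            Object value = in.readObject();
            if (type.isInstance(value)) {
                return Optional.of(type.cast(value));
            }
            return Optional.empty();
        } catch (IOException | ClassNotFoundException e) {
            return Optional.empty();
        }
    }
}
```

A caller can then write `CookieFileSketch.load(file, CookieStore.class).ifPresent(...)` and copy the cookies into the WebClient only when something was actually loaded.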
/***
     * Custom getPage: when a CAPTCHA page is encountered, keep solving it until it succeeds.
     *
     */
    private HtmlPage MyGetPage(StringBuffer URL) {
        HtmlPage page = null;
        boolean flag = true;
        int TryTimeCnt = 1;
        int UnknowHostTryTimeCnt = 1;
        while (flag) {
            flag = false;
            try {
                logger.info(Thread.currentThread().getName() + " webClient.getPage : " + URL + ",CrawlURL_id:"
                        + crawlURLId);
                page = webClient.getPage(URL.toString());
                Document doc = Jsoup.parse(page.asXml());
                int robotchecknum = 1;
                while (doc.select("title").text().equals("Robot Check")) {
                    logger.info(Thread.currentThread().getName() + " " + dayformat1.format(System.currentTimeMillis())
                            + " [Robot Check,URL:" + URL + "]");
                    String captcha_str = AmazonGetCaptcha.GetCaptcha(new StringBuilder(doc.toString()));
                    logger.info(Thread.currentThread().getName() + " " + dayformat1.format(System.currentTimeMillis())
                            + " end AmazonGetCaptcha.GetCaptcha");
                    logger.info(dayformat1.format(new Date()) + " " + Thread.currentThread().getName() + " : "
                            + captcha_str);

                    HtmlForm form = null;

                    logger.info(Thread.currentThread().getName() + " page.getForms().get(0) Start");
                    form = page.getForms().get(0);
                    logger.info(Thread.currentThread().getName() + " page.getForms().get(0) End");

                    HtmlButton button = null;

                    logger.info(Thread.currentThread().getName() + " form.getElementsByTagName(button).get(0) Start");
                    button = (HtmlButton) form.getElementsByTagName("button").get(0);
                    logger.info(Thread.currentThread().getName() + " form.getElementsByTagName(button).get(0) End");

                    logger.info(Thread.currentThread().getName() + " setValueAttribute Start");
                    form.getInputByName("field-keywords").setValueAttribute(captcha_str);
                    logger.info(Thread.currentThread().getName() + " setValueAttribute End");

                    logger.info(Thread.currentThread().getName() + " button.click Start");
                    boolean click_flag = false;
                    while (!click_flag) {
                        try {
                            click_flag = true;
                            page = button.click();
                        } catch (Exception e1) {
                            logger.error(Thread.currentThread().getName() + " button.click failed: " + e1);
                            //e1.printStackTrace();
                            click_flag = false;
                        }
                    }
                    logger.info(Thread.currentThread().getName() + " button.click end");
                    while (page.asXml() == null) {
                        logger.info(Thread.currentThread().getName() + " page xml null");
                        logger.info(Thread.currentThread().getName() +" "+ page.asXml());
                        page.refresh();
                        logger.info(Thread.currentThread().getName() + " refresh End!");
                    }
                    logger.info(Thread.currentThread().getName() + " button.click End");

                    logger.info(Thread.currentThread().getName() + " Start ParsePage!");
                    doc = Jsoup.parse(page.asXml());
                    if (!doc.select("title").text().equals("Robot Check")) {
                        logger.info(Thread.currentThread().getName() + " " + doc.select("title").text());
                        logger.info(Thread.currentThread().getName() + " "
                                + dayformat1.format(System.currentTimeMillis()) + " [Robot Check,captcha success:"
                                + captcha_str + ",try num:" + robotchecknum + "]");
                    }
                    robotchecknum++;
                }

            } catch (FailingHttpStatusCodeException e) {
                logger.error(Thread.currentThread().getName() +" "+ e);
                flag = true;
            } catch (MalformedURLException e) {
                logger.error(Thread.currentThread().getName() +" "+ e);
                // A malformed URL will not become valid on retry, so give up instead of looping forever
                return null;
            }catch(UnknownHostException e) {
                logger.error(Thread.currentThread().getName() +" "+ e);
                flag = true;
                logger.info("found UnknownHostException,start sleep 20 min");
                try {
                    Thread.sleep(1000*60*Integer.parseInt(Configuration.getProperties("unknowhost_sleeptime")));
                } catch (InterruptedException e1) {
                    logger.error(Thread.currentThread().getName() +" "+ e1);
                }
                logger.info("found UnknownHostException,end sleep 20 min");
                UnknowHostTryTimeCnt++;// increment the unknown-host failure counter
                logger.info(Thread.currentThread().getName() + " " + dayformat1.format(System.currentTimeMillis())
                        + " [UnknowHostTryTimeCnt:" + UnknowHostTryTimeCnt + "]");
                if (UnknowHostTryTimeCnt > Integer.parseInt(Configuration.getProperties("unknowhost_maxtrytime"))) {
                    return null;
                }
            }catch (Exception eq) {
                logger.error(Thread.currentThread().getName() + " "+eq);
                TryTimeCnt++;// increment the failure counter
                logger.info(Thread.currentThread().getName() + " " + dayformat1.format(System.currentTimeMillis())
                        + " [TryTimeCnt:" + TryTimeCnt + "]");
                if (TryTimeCnt > 5) {
                    return null;
                }
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                    logger.error(Thread.currentThread().getName() + e);
                }
                flag = true;
            }
            try {
                Thread.sleep(random.nextInt(500) + 1500);
            } catch (InterruptedException e) {
                logger.error(Thread.currentThread().getName() + e);
                flag = true;
            }
        }
        return page;
    }
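
MyGetPage combines three ideas: retry on failure, give up after a fixed number of attempts, and pause for a slightly randomized time between attempts. Here is a minimal standard-library sketch of that pattern, separated from HtmlUnit. The class `RetrySketch` and its parameters are hypothetical names chosen for illustration. Like MyGetPage, it returns null once all attempts are used up.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not part of the original crawler.
final class RetrySketch {

    private RetrySketch() {
    }

    // Calls the task until it succeeds or maxAttempts is reached.
    // Returns null when every attempt failed or the thread was interrupted.
    static <T> T retry(Callable<T> task, int maxAttempts, long baseDelayMs, long jitterMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    return null; // attempts exhausted
                }
                if (!sleepQuietly(baseDelayMs + randomJitter(jitterMs))) {
                    return null; // interrupted while waiting
                }
            }
        }
        return null; // only reached when maxAttempts <= 0
    }

    // Returns a random value in [0, jitterMs), or 0 when no jitter is requested.
    private static long randomJitter(long jitterMs) {
        return jitterMs > 0 ? ThreadLocalRandom.current().nextLong(jitterMs) : 0L;
    }

    // Sleeps for the given time; returns false (and restores the interrupt flag) if interrupted.
    private static boolean sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this shape, a call such as `RetrySketch.retry(() -> webClient.getPage(url), 5, 1500, 500)` would roughly mirror the limit of five attempts and the 1500 to 2000 ms pause used above. Because every failure path shares one counter, no exception type can cause an unbounded loop.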
Time: 2024-10-11 23:37:31
