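// Imports assume HtmlUnit 2.x; in HtmlUnit 3.x the packages start with org.htmlunit instead.
import java.io.IOException;
import java.net.URL;
import com.gargoylesoftware.htmlunit.CookieManager;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Fetches the page and, at the same time, captures the cookies set for that link.
public static HtmlPage getCookieAndHtml(String url) throws IOException {
    URL link = new URL(url);
    WebClient wc = new WebClient();
    try {
        WebRequest request = new WebRequest(link);
        wc.getCookieManager().setCookiesEnabled(true); // enable cookie management
        wc.getOptions().setJavaScriptEnabled(true);    // enable JavaScript; required for particularly tricky pages
        wc.getOptions().setCssEnabled(true);           // enable CSS; required for particularly tricky pages
        HtmlPage page = wc.getPage(request);
        CookieManager cookieManager = wc.getCookieManager(); // wc is your WebClient instance
        // The returned cookies are stored here so that a later request can reuse them.
        // ThreeExecute.cookie is a static field of the ThreeExecute class, of type Set<Cookie>.
        ThreeExecute.cookie = cookieManager.getCookies();
        return page;
    } finally {
        wc.close(); // release the client even if getPage throws
    }
}

// Calling side: fetch a page while sending the cookies saved above.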
// Additionally needs java.util.Set and com.gargoylesoftware.htmlunit.util.Cookie (HtmlUnit 2.x package names).
public static HtmlPage getHtml1(String url, Set<Cookie> cookies) throws IOException {
    URL link = new URL(url);
    final WebClient webClient = new WebClient();
    try {
        WebRequest request = new WebRequest(link);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(true);
        // Copy every previously saved cookie into this client's cookie manager.
        for (Cookie cookie : cookies) {
            webClient.getCookieManager().addCookie(cookie);
        }
        return webClient.getPage(request);
    } finally {
        webClient.close();
    }
}

With Jsoup, handling cookies while scraping is much simpler:
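// First request: collect the cookies the server sets (Response is org.jsoup.Connection.Response).
Response res = Jsoup.connect("http://www.chengmi.com/shanghai").timeout(30000).execute();
Map<String, String> cookies = res.cookies();

// Later request: send those cookies back along with the request.
Document doc = Jsoup.connect(url).cookies(cookies).timeout(30000).get();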
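To make the cookie round trip above more concrete, here is a minimal sketch, using only the JDK, of what carrying cookies between two requests amounts to: the name/value pairs are read from the `Set-Cookie` response headers and sent back in a single `Cookie` request header. The class name `CookieHeaderSketch`, its two helper methods, and the sample header values are hypothetical and chosen for illustration; this is not how Jsoup or HtmlUnit implement it internally.

```java
import java.net.HttpCookie;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class CookieHeaderSketch {

    // Collects name/value pairs from raw Set-Cookie header values,
    // similar in spirit to the Map<String, String> that res.cookies() returns.
    static Map<String, String> parseSetCookie(List<String> setCookieHeaders) {
        Map<String, String> cookies = new LinkedHashMap<>(); // keeps insertion order
        for (String header : setCookieHeaders) {
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                cookies.put(cookie.getName(), cookie.getValue());
            }
        }
        return cookies;
    }

    // Builds the value of the Cookie request header: "name1=value1; name2=value2".
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Made-up Set-Cookie values, as a server might send on the first response.
        List<String> received = Arrays.asList(
                "sid=abc123; Path=/; HttpOnly",
                "city=shanghai; Path=/");
        Map<String, String> cookies = parseSetCookie(received);
        System.out.println(toCookieHeader(cookies)); // prints: sid=abc123; city=shanghai
    }
}
```

Attributes such as `Path` and `HttpOnly` only tell the client when to send a cookie; they are not echoed back to the server, which is why a plain name-to-value map is enough for the second request.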
Time: 2024-10-14 00:37:45