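When crawling websites, some of them require you to log in before you can access their content. If a plain login is all that is needed, there is a tool I can recommend that works very well.
Firefox has an add-on called Firebug (it has to be installed separately). It lets you inspect in detail how a site is accessed (the redirects and the order in which requests are made), the request and response headers of every request, and the data submitted by POST. IE and Chrome also ship developer tools that open with F12. Personally I find Firebug the most convenient, though I occasionally use the IE and Chrome tools as well.
With these tools you can capture every step of the login flow, and reproducing the login in code from there is straightforward.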
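As a rough illustration of what the captured POST data turns into on the wire, here is a minimal sketch that form-encodes a set of fields using only the JDK. The class name `FormBody` and the sample values are made up for illustration; the real field names are whatever the browser tool shows for the site you are crawling.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: turns captured form fields into an
// application/x-www-form-urlencoded request body.
final class FormBody {

    // Encodes each name and value, then joins the pairs with '&'.
    // URLEncoder.encode(String, Charset) requires Java 10 or later.
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in the order they were captured.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "demo user"); // placeholder values
        fields.put("password", "secret");
        System.out.println(encode(fields)); // prints username=demo+user&password=secret
    }
}
```

An HTTP library normally does this encoding for you; the sketch is only meant to show how the fields listed by the browser tool map onto the request body.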
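What I want to cover here is the case where the login also requires a captcha, which is more troublesome. The workaround I came up with, and a fairly common one, is to pop the captcha up in a dialog and have a person type it in.
The implementation follows. First, the simulated login: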
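// Imports assume Apache HttpClient 4.x
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class LoginByCode {

    public static void main(String[] args) {
        CloseableHttpClient httpClient = HttpClientBuilder.create().build();
        // HH is the 24-hour clock, so file names stay unique across the whole day
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss");
        String path = "d:/img/tmp/" + format.format(new Date()) + ".jpg";
        try {
            // Download the captcha image. The session cookie set by this response
            // is kept by httpClient and sent again with the login request below.
            String imgurl = "http://www.shanghaiip.cn/wasWeb/login/Random.jsp";
            HttpUriRequest get = new HttpGet(imgurl);
            HttpResponse res = httpClient.execute(get);
            byte[] img = EntityUtils.toByteArray(res.getEntity());
            saveFile(path, img);

            // Pop up the captcha and read what the user types in
            String code = new ImgDialog().showDialog(null, path);

            String login = "http://www.shanghaiip.cn/wasWeb/login/loginServer.jsp";
            HttpPost post = new HttpPost(login);
            List<NameValuePair> data = new ArrayList<NameValuePair>();
            data.add(new BasicNameValuePair("username", "zhpatent"));
            // Firebug shows that the password is submitted already encrypted
            data.add(new BasicNameValuePair("password", "5ca072839350b0733a2a456cc4004371"));
            data.add(new BasicNameValuePair("newrandom", code));
            post.setEntity(new UrlEncodedFormEntity(data));
            res = httpClient.execute(post);
            // Consume the response body so the connection can be reused
            EntityUtils.consume(res.getEntity());

            // Read the redirect target that the server sends after login
            Header[] headers = res.getHeaders("Location");
            if (headers.length == 0) {
                System.out.println("Login failed, no redirect: " + res.getStatusLine());
                return;
            }
            // Resolve against the login URL in case the Location value is relative
            get = new HttpGet(URI.create(login).resolve(headers[0].getValue()));
            res = httpClient.execute(get);
            String body = EntityUtils.toString(res.getEntity());
            int at = body.indexOf("zhpatent");
            if (at >= 0) {
                // Print some context around the user name, clamped to the page bounds
                System.out.println("Simulated login succeeded: "
                        + body.substring(Math.max(0, at - 40), Math.min(body.length(), at + 40)));
            }
        } catch (Exception e) {
            System.out.println("Error: " + e.getMessage());
        } finally {
            // Remove the temporary captcha image
            File file = new File(path);
            if (file.exists()) {
                file.delete();
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Writes the downloaded bytes to the given path
    private static void saveFile(String path, byte[] data) {
        try {
            Files.write(Paths.get(path), data);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}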
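The captcha dialog utility class:
import java.awt.Container;
import java.awt.Dimension;
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;

import javax.swing.ImageIcon;
import javax.swing.JButton;
import javax.swing.JDialog;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JTextField;
import javax.swing.border.EtchedBorder;

public class ImgDialog {

    private JButton confirm;
    private JDialog dialog = null;
    private JTextField field;
    String result = "";

    // Shows the image at path in a modal dialog and returns the text the user entered
    public String showDialog(JFrame father, String path) {
        // Label that displays the captcha image
        JLabel label = new JLabel();
        label.setBorder(new EtchedBorder(EtchedBorder.LOWERED, null, null));
        label.setBounds(10, 10, 125, 51);
        label.setIcon(new ImageIcon(path));

        // Input box for the captcha text
        field = new JTextField();
        field.setBounds(145, 10, 65, 20);

        // The confirm button stores the input and closes the dialog
        confirm = new JButton("OK");
        confirm.setBounds(145, 40, 65, 20);
        confirm.addActionListener(new ActionListener() {
            @Override
            public void actionPerformed(ActionEvent e) {
                result = field.getText();
                ImgDialog.this.dialog.dispose();
            }
        });

        // Modal dialog: setVisible(true) blocks until the dialog is disposed
        dialog = new JDialog(father, true);
        dialog.setTitle("Enter the captcha shown in the image");
        Container pane = dialog.getContentPane();
        pane.setLayout(null);
        pane.add(label);
        pane.add(field);
        pane.add(confirm);
        dialog.pack();
        dialog.setSize(new Dimension(235, 110));
        dialog.setLocation(750, 430);
        // dialog.setLocationRelativeTo(father);
        dialog.setVisible(true);
        return result;
    }
}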
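The result looks like this:
Running the program downloads the captcha and pops it up in a dialog.
After I enter the captcha, my user information is retrieved from the page that the login redirects to.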
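I used HttpClient for the simulated login here. HttpClient handles cookies for you: requests made through the same client instance share one cookie store, so the session cookie received with the captcha is sent back automatically on login. That makes it convenient, and the captcha cannot end up mismatched with the session.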
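To make the idea of a shared cookie store concrete, here is a small sketch that uses the JDK's own `java.net.CookieManager` as a stand-in. It is not HttpClient's implementation. The class name `SharedCookieDemo`, the `example.com` URLs, and the cookie value are all hypothetical, and no network access is involved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical demo: a cookie received with the captcha response is
// offered again for the login request because both go through one store.
final class SharedCookieDemo {

    // Stores the Set-Cookie value as if it came from the captcha response and
    // returns the Cookie header value the store would attach to the login request.
    static String cookieHeaderForLogin(String setCookieValue) {
        CookieManager store = new CookieManager();
        URI captcha = URI.create("http://example.com/login/captcha.jsp"); // placeholder URL
        URI login = URI.create("http://example.com/login/submit.jsp");    // placeholder URL
        try {
            store.put(captcha, Collections.singletonMap(
                    "Set-Cookie", Collections.singletonList(setCookieValue)));
            Map<String, List<String>> headers =
                    store.get(login, Collections.<String, List<String>>emptyMap());
            List<String> cookies = headers.get("Cookie");
            return cookies == null ? "" : String.join("; ", cookies);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cookieHeaderForLogin("JSESSIONID=abc123; Path=/"));
    }
}
```

Because the server ties the expected captcha answer to that session cookie, sending the same cookie with the login request is what keeps the captcha and the login in the same session.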
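Doing the same thing with Jsoup is slightly more work, because you have to manage the cookies yourself: when requesting the captcha page you need to both download the captcha image and grab the cookies from the response, and then send those cookies along with the login request.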
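The manual bookkeeping can be sketched with the JDK alone. This is an illustration under assumptions, not Jsoup code: the class name `ManualCookies` and the sample header values are hypothetical. The name-to-value map it builds has the same shape as the cookie map that Jsoup's `Connection.Response.cookies()` returns and `Connection.cookies(Map)` accepts.

```java
import java.net.HttpCookie;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for carrying cookies from the captcha response
// over to the login request by hand.
final class ManualCookies {

    // Collects name/value pairs from the raw Set-Cookie header values
    // of the captcha response; attributes such as Path are dropped.
    static Map<String, String> fromSetCookie(List<String> setCookieValues) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String value : setCookieValues) {
            for (HttpCookie cookie : HttpCookie.parse(value)) {
                cookies.put(cookie.getName(), cookie.getValue());
            }
        }
        return cookies;
    }

    // Renders the map as the value of a Cookie request header for the login request.
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner header = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            header.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return header.toString();
    }

    public static void main(String[] args) {
        // Sample header values, made up for illustration
        Map<String, String> cookies = fromSetCookie(Arrays.asList(
                "JSESSIONID=abc123; Path=/; HttpOnly",
                "lang=zh; Path=/"));
        System.out.println(toCookieHeader(cookies));
    }
}
```

The important point is the order of operations: the cookies must come from the same response that delivered the captcha image, otherwise the server checks the submitted captcha against a different session.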
Time: 2024-10-07 08:29:48