Pure Golang Crawler in Practice (7): Uploading Attachments with mime/multipart

As before, start with Fiddler (set up a filter, automatic breakpoints, and traffic capture) and intercept the following request:

POST http://192.168.132.80/docs/docs/UploadDoc.jsp HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Referer: http://192.168.132.80/docs/docs/DocAdd.jsp?mainid=15&subid=49&secid=48&showsubmit=1&coworkid=&prjid=&isExpDiscussion=&crmid=&hrmid=&topage=
Accept-Language: zh-CN
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Content-Type: multipart/form-data; boundary=---------------------------7e431d37a30abc
Accept-Encoding: gzip, deflate
Host: 192.168.132.80
Content-Length: 4212
Connection: Keep-Alive
Pragma: no-cache
Cookie: testBanCookie=test; JSESSIONID=abcIswHnk9uU49ql9MP2w; loginfileweaver=%2Fwui%2Ftheme%2Fecology7%2Fpage%2Flogin.jsp%3FtemplateId%3D6%26logintype%3D1%26gopage%3D; loginidweaver=114; languageidweaver=7

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="needShow"

0
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docreplyable"

0
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="usertype"

1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="from"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="userCategory"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="userId"

114
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="userType"

1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docstatus"

0
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="doccode"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docedition"

-1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="doceditionid"

-1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="maincategory"

15
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="subcategory"

49
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="seccategory"

48
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="ownerid"

114
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docdepartmentid"

10
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="doclangurage"

7
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="maindoc"

-1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="topage"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="operation"

addsave
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="SecId"

48
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="imageidsExt"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="imagenamesExt"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="delImageidsExt"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="namerepeated"

0
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docsubject"

1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="doccontent"

1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="readoptercanprint"

1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="selectCategory"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="tempDocModule"

-1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docmodule"

-1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="keyword"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="selectMainDocument"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="invalidationdate"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="dummycata"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="hrmresid"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="crmid"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="projectid"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="imgType"

2
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="imgUrl_doccontent"

http://
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docimages_num"

0
-----------------------------7e431d37a30abc--

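To keep the code simple, log in with a browser, copy the resulting JSESSIONID into the code, and run the code while the browser session is still logged in.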
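As a small illustration of reusing the browser session, the session id can also be attached as a cookie instead of being pasted into a raw header string. This is only a sketch: the helper name newSessionRequest and the value "abc123" are made up for the example, and the real site may require the other cookies from the capture as well.

```go
package main

import (
	"fmt"
	"net/http"
)

// newSessionRequest is a hypothetical helper: it builds a request that
// carries a JSESSIONID copied from a logged-in browser session.
func newSessionRequest(method, url, sessionID string) (*http.Request, error) {
	req, err := http.NewRequest(method, url, nil)
	if err != nil {
		return nil, err
	}
	// AddCookie appends the cookie to the request's Cookie header.
	req.AddCookie(&http.Cookie{Name: "JSESSIONID", Value: sessionID})
	return req, nil
}

func main() {
	// "abc123" is a placeholder; use the value shown by the browser.
	req, err := newSessionRequest("POST", "http://192.168.132.80/docs/docs/UploadDoc.jsp", "abc123")
	if err != nil {
		fmt.Println("cannot build request:", err)
		return
	}
	fmt.Println(req.Header.Get("Cookie"))
}
```

The session id expires when the browser logs out or the server times the session out, so the value has to be refreshed before each run.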

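For background on Content-Length, see https://www.cnblogs.com/lovelacelee/p/5385683.html

The capture above has Content-Length: 4212. If you modify the body in Fiddler, you can copy the modified content into Notepad++ to check its actual character count.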

Code:

package main

import (
    "bytes"
    "fmt"
    "io"
    "io/ioutil"
    "log"
    "mime/multipart"
    "net/http"
    "os"
    "path/filepath"
    "strings"

    "crypto/md5"
    "encoding/hex"
)

func main() {
    bodyBuffer := &bytes.Buffer{}
    bodyBuffer.WriteString(`-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="needShow"

0
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docreplyable"

0
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="usertype"

1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="from"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="userCategory"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="userId"

114
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="userType"

1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docstatus"

0
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="doccode"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docedition"

-1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="doceditionid"

-1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="maincategory"

15
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="subcategory"

49
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="seccategory"

48
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="ownerid"

114
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docdepartmentid"

10
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="doclangurage"

7
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="maindoc"

-1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="topage"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="operation"

addsave
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="SecId"

48
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="imageidsExt"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="imagenamesExt"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="delImageidsExt"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="namerepeated"

0
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docsubject"

1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="doccontent"

1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="readoptercanprint"

1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="selectCategory"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="tempDocModule"

-1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docmodule"

-1
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="keyword"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="selectMainDocument"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="invalidationdate"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="dummycata"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="hrmresid"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="crmid"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="projectid"

-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="imgType"

2
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="imgUrl_doccontent"

http://
-----------------------------7e431d37a30abc
Content-Disposition: form-data; name="docimages_num"

0
-----------------------------7e431d37a30abc--`)

    headers := `Accept: text/html, application/xhtml+xml, */*
Referer: http://192.168.132.80/docs/docs/DocAdd.jsp?mainid=15&subid=49&secid=48&showsubmit=1&coworkid=&prjid=&isExpDiscussion=&crmid=&hrmid=&topage=
Accept-Language: zh-CN
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Content-Type: multipart/form-data; boundary=---------------------------7e431d37a30abc
Accept-Encoding: gzip, deflate
Host: 192.168.132.80
Content-Length: 4212
Connection: Keep-Alive
Pragma: no-cache
Cookie: testBanCookie=test; JSESSIONID=abcIswHnk9uU49ql9MP2w; loginfileweaver=%2Fwui%2Ftheme%2Fecology7%2Fpage%2Flogin.jsp%3FtemplateId%3D6%26logintype%3D1%26gopage%3D; loginidweaver=114; languageidweaver=7`

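    // This is the address reached after the 302 redirect, not the UploadDoc.jsp address captured above.
    uri := "http://192.168.132.80/docs/docs/DocDsp.jsp?fromFlowDoc=&id=803038&blnOsp=false&topage=&pstate=sub"
    // Pass the *bytes.Buffer directly so that net/http can determine the body length itself.
    req, err := http.NewRequest("POST", uri, bodyBuffer)
    if err != nil {
        log.Printf("Cannot NewRequest: %s , err: %v", uri, err)
        return
    }
    AddHeaders(req, headers)
    fmt.Println(req.Header)
    //fmt.Println(req.Body)
    client := &http.Client{}
    resp, err := client.Do(req)
    if err != nil {
        log.Printf("Cannot client.Do, err: %v", err)
        return
    }
    // Defer Close only after the error check: resp is nil when Do fails.
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Printf("Cannot read response body, err: %v", err)
        return
    }
    // Print the size of the response body in bytes.
    fmt.Println(len(body))
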
}

func attachField(bodyWriter *multipart.Writer, keyname, keyvalue string) error {
    if err := bodyWriter.WriteField(keyname, keyvalue); err != nil {
        log.Printf("Cannot WriteField: %s, err: %v", keyname, err)
        return err
    }
    return nil
}

func attachFile(bodyWriter *multipart.Writer, formname, filename string) error {
    fullname := filepath.Join(".", filename)
    file, err := os.Open(fullname)
    if err != nil {
        log.Printf("Cannot open file: %s , err: %v", fullname, err)
        return err
    }
    defer file.Close()

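    // MD5
    md5hash := md5.New()
    if _, err = io.Copy(md5hash, file); err != nil {
        log.Printf("Cannot compute md5 hash: %s , err: %v", fullname, err)
        return err
    }
    // Hashing read the file to the end; rewind so the upload below copies it from the start.
    if _, err = file.Seek(0, io.SeekStart); err != nil {
        log.Printf("Cannot rewind file: %s , err: %v", fullname, err)
        return err
    }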

    keyname := filename + ".md5cksum"
    keyvalue := hex.EncodeToString(md5hash.Sum(nil)[:16])
    if err = attachField(bodyWriter, keyname, keyvalue); err != nil {
        log.Printf("Cannot WriteField: %s, err: %v", keyname, err)
        return err
    }

    // file
    part, err := bodyWriter.CreateFormFile(formname, filename)
    if err != nil {
        log.Printf("Cannot CreateFormFile for: %s , err: %v", filename, err)
        return err
    }

    _, err = io.Copy(part, file)
    if err != nil {
        log.Printf("Cannot Copy file: %s , err: %v", fullname, err)
        return err
    }

    return nil
}

func AddHeaders(req *http.Request, headers string) *http.Request {
    // Split the raw header block into lines, then split each line into a name and a value.
    for _, line := range strings.Split(headers, "\n") {
        i := strings.Index(line, ":")
        if i < 0 {
            // Skip blank or malformed lines instead of panicking on line[:-1].
            continue
        }
        name := strings.TrimSpace(line[:i])
        value := strings.TrimSpace(line[i+1:])
        // Set the header. net/http ignores Host and Content-Length set this way
        // and derives them from the URL and the request body instead.
        req.Header.Set(name, value)
    }
    return req
}

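Important: the request URI in the code is not the address that was captured at the beginning! I was stuck on this for a whole day. Only after setting a breakpoint on the response did I discover that there is a 302 redirect, so the uri in the code has to be changed to the address requested after the 302 redirect before the request succeeds.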

References:

https://www.jianshu.com/p/f2d9c601c66a (highly recommended)

https://www.cnblogs.com/wonyun/p/7966967.html

https://my.oschina.net/bianweiall/blog/544355

https://stackoverflow.com/questions/3508338/what-is-the-boundary-in-multipart-form-data

https://studygolang.com/articles/14075

https://www.jianshu.com/p/f95558a49e98

http://www.mamicode.com/info-detail-2406025.html

Original article: https://www.cnblogs.com/pu369/p/12327676.html

Time: 2024-10-13 11:41:08
