[Crawler] Scraping Tmall Pages with Go

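A recent task at work required scraping product information from Tmall. The overall flow is as follows:

First, modify the back-end ad exchange platform code to parse a URL out of the creative material uploaded by Alibaba. The URL has the following format: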

https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D

It is clearly percent-encoded, so we first need to decode it. An online decoder is available at:

http://tool.chinaz.com/Tools/urlencode.aspx

After decoding, we get:

https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content={"items":[{"images":["https://asearch.alicdn.com/bao/uploaded//i4/22356367/TB2PMQinN6I8KJjy0FgXXXXzVXa_!!0-saturn_solar.jpg"],"itemid":"7664169349","shorttitle":"乒乓球拍 无线专属"}]}

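All we need from it is "itemid":"7664169349".

Visiting https://detail.tmall.com/item.htm?id=7664169349 then opens the product detail page.

This is the page we need to scrape. The ad exchange platform puts the parsed itemid into nsq; the crawler builds the URL from it, scrapes the key information from the page, and sends that information to Kafka; Hive and ES then read it from Kafka to serve queries.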
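
As a rough illustration of the crawler's side of this flow, the sketch below decodes a queue message and builds the detail-page URL. It assumes the message body is JSON carrying an item_id field (matching the Msg struct used later in this article); queueMsg and itemURLFromMessage are hypothetical names, and no real nsq or Kafka client is involved.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queueMsg is a hypothetical, trimmed-down message carrying only the item ID.
// The JSON wire format is an assumption, not something the pipeline confirms.
type queueMsg struct {
	ItemID int64 `json:"item_id"`
}

// itemURLFromMessage decodes a JSON message body and builds the Tmall detail URL.
func itemURLFromMessage(body []byte) (string, error) {
	var m queueMsg
	if err := json.Unmarshal(body, &m); err != nil {
		return "", err
	}
	if m.ItemID <= 0 {
		return "", fmt.Errorf("message has no item_id")
	}
	return fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", m.ItemID), nil
}

func main() {
	u, err := itemURLFromMessage([]byte(`{"item_id":7664169349}`))
	if err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Println(u)
}
```

Rejecting messages without an item ID up front keeps malformed input from turning into a request for id=0.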

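Step 1

The first step is to extract the ItemId. The ad exchange platform gives us the URL to parse; we then decode the URL in code and pull out the ItemId value. The project is written in Golang, so the examples here use Golang. Python would be even simpler, and the principle is the same.

For URL parsing, see:

https://gobyexample.com/url-parsing

For JSON serialization and deserialization, see:

https://www.cnblogs.com/liang1101/p/6741262.html

Here is my code: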

package main

import (
    "encoding/json"
    "fmt"
    "net/url"
    "strconv"
)
// Struct fields must start with an uppercase letter so encoding/json can populate them.
type item struct {
    Images []string
    ItemId string
    ShortTitle string
}

func main() {
    var urlstring string = "https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D"
    // Parse the raw URL first; Query() then percent-decodes each value exactly once.
    // Unescaping the whole URL before parsing would break on any "&" or "%" inside content.
    parse, err := url.Parse(urlstring)
    if err != nil {
        fmt.Println("err is", err)
        return
    }
    content := parse.Query().Get("content")
    fmt.Println(content)
    m := make(map[string][]item)
    if err := json.Unmarshal([]byte(content), &m); err != nil {
        fmt.Println("err is", err)
        return
    }
    if len(m["items"]) == 0 {
        fmt.Println("no items found in content")
        return
    }
    itemValue := m["items"][0]
    fmt.Println(itemValue.ItemId)
    // Convert to int64.
    i, err := strconv.ParseInt(itemValue.ItemId, 10, 64)
    if err != nil {
        fmt.Println("err is", err)
        return
    }
    fmt.Printf("%T, %v\n", i, i)
}

Running it gives us the ItemId value we need.

Step 2

The second step is to build the URL and scrape the page content.

How do we fetch a web page in Golang? Here is a simple demo.

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
)

func main() {
    website := "http://www.future.org.cn"
    resp, err := http.Get(website)
    if err != nil {
        fmt.Println("Cannot connect the server:", err)
        return
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("Cannot read from connected http server:", err)
        return
    }
    fmt.Println("HTML content:", string(body))
}

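After scraping the page, however, a problem shows up: the Chinese text is garbled, because the page is served in a GBK-family encoding while Go treats strings as UTF-8.

For a fix to the garbled Chinese text, see:

https://gocn.vip/article/364

Install iconv-go:

go get github.com/djimenez/iconv-go

You can transcode after fetching the content, for example: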

func convFromGbk(s string) string {
    res, err := iconv.ConvertString(s, "gbk", "utf-8")
    if err != nil {
        // Fall back to the untouched input if the conversion fails.
        return s
    }
    return res
}

You can also wrap the response Reader so that it is transcoded as it is read:

req, err := http.NewRequest("GET", url, nil)
if err != nil {
    return nil, err
}
req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
rsp, err := j.client.Do(req)
if err != nil {
    return nil, err
}
// Transcode the GB2312 body to UTF-8.
utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
//if body, err := ioutil.ReadAll(utfBody); err == nil {
//    fmt.Println("HTML content:", string(body))
//}
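
The snippets above hard-code the source encoding. As a hedged complement, the sketch below reads the charset that a server declares in its Content-Type header, using only the standard library, so the crawler could decide whether transcoding is needed at all. charsetFromContentType is a hypothetical helper; pages that declare their encoding only in a meta tag are not covered.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType returns the lowercase charset named in a Content-Type
// header value, or "" if the header is malformed or declares no charset.
func charsetFromContentType(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	fmt.Println(charsetFromContentType("text/html; charset=GBK"))
}
```

A caller would pass rsp.Header.Get("Content-Type") and skip the iconv step when the result is already "utf-8".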

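Once the page is fetched, we need to parse it. XPath is used here.

For how to use XPath, see:

http://www.w3school.com.cn/xpath/xpath_axes.asp

It is very simple and easy to pick up after one read.

What we fetch is HTML, so we only need to extract the content we want; in other words, this is just HTML parsing.

There is one more difficulty: the static page we fetch contains most of the information, but not the price, because the price is loaded dynamically.

Let's take a closer look at the request that loads the price.

Copying that request's URL and opening it directly in the browser fails with a 403. Don't worry: we only need to add a Referer header to the request.

In code, it looks like this: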

referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
req.Header.Set("Referer", referer)
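
To make the effect of the Referer header concrete, here is a minimal sketch that can be run locally. It does not contact Tmall: newFakePriceServer is a hypothetical stand-in that returns 403 unless the Referer points at detail.tmall.com, which is only an assumption about how the real endpoint behaves, and fetchWithReferer is likewise an illustrative helper.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// fetchWithReferer issues a GET with the given Referer (if any) and returns the
// body, treating any non-200 status such as 403 as an error.
func fetchWithReferer(client *http.Client, targetURL, referer string) (string, error) {
	req, err := http.NewRequest("GET", targetURL, nil)
	if err != nil {
		return "", err
	}
	if referer != "" {
		req.Header.Set("Referer", referer)
	}
	rsp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer rsp.Body.Close()
	if rsp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", rsp.StatusCode)
	}
	body, err := io.ReadAll(rsp.Body)
	return string(body), err
}

// newFakePriceServer is a local stand-in that rejects requests lacking a Tmall Referer.
func newFakePriceServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.HasPrefix(r.Referer(), "https://detail.tmall.com/") {
			w.WriteHeader(http.StatusForbidden)
			return
		}
		fmt.Fprint(w, `{"isSuccess":true}`)
	}))
}

func main() {
	srv := newFakePriceServer()
	defer srv.Close()

	_, err := fetchWithReferer(srv.Client(), srv.URL, "")
	fmt.Println("without Referer:", err)

	body, err := fetchWithReferer(srv.Client(), srv.URL, "https://detail.tmall.com/item.htm?id=7664169349")
	fmt.Println("with Referer:", body, err)
}
```

Checking the status code before reading the body means a 403 surfaces as an error instead of being parsed as if it were price data.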

Looking at the response, we find that it is JSON.

Formatted (using the formatter at http://tool.oschina.net/codeformat/json), it looks like this:

{
    "defaultModel": {
        "bannerDO": {
            "success": true
        },
        "deliveryDO": {
            "areaId": 110100,
            "deliveryAddress": "浙江金华",
            "deliverySkuMap": {
                "6310159781": [
                    {
                        "arrivalNextDay": false,
                        "arrivalThisDay": false,
                        "forceMocked": false,
                        "postage": "快递: 0.00 ",
                        "postageFree": false,
                        "skuDeliveryAddress": "浙江金华",
                        "type": 0
                    }
                ],
                "default": [
                    {
                        "arrivalNextDay": false,
                        "arrivalThisDay": false,
                        "forceMocked": false,
                        "postage": "快递: 0.00 ",
                        "postageFree": false,
                        "skuDeliveryAddress": "浙江金华",
                        "type": 0
                    }
                ],
                "6310159797": [
                    {
                        "arrivalNextDay": false,
                        "arrivalThisDay": false,
                        "forceMocked": false,
                        "postage": "快递: 0.00 ",
                        "postageFree": false,
                        "skuDeliveryAddress": "浙江金华",
                        "type": 0
                    }
                ],
                "3280089025135": [
                    {
                        "arrivalNextDay": false,
                        "arrivalThisDay": false,
                        "forceMocked": false,
                        "postage": "快递: 0.00 ",
                        "postageFree": false,
                        "skuDeliveryAddress": "浙江金华",
                        "type": 0
                    }
                ],
                "3280089025136": [
                    {
                        "arrivalNextDay": false,
                        "arrivalThisDay": false,
                        "forceMocked": false,
                        "postage": "快递: 0.00 ",
                        "postageFree": false,
                        "skuDeliveryAddress": "浙江金华",
                        "type": 0
                    }
                ]
            },
            "destination": "北京市",
            "success": true
        },
        "detailPageTipsDO": {
            "crowdType": 0,
            "hasCoupon": true,
            "hideIcons": false,
            "jhs99": false,
            "minicartSurprise": 0,
            "onlyShowOnePrice": false,
            "priceDisplayType": 4,
            "primaryPicIcons": [ ],
            "prime": false,
            "showCuntaoIcon": false,
            "showDou11Style": false,
            "showDou11SugPromPrice": false,
            "showDou12CornerIcon": false,
            "showDuo11Stage": 0,
            "showJuIcon": false,
            "showMaskedDou11SugPrice": false,
            "success": true,
            "trueDuo11Prom": false
        },
        "doubleEleven2014": {
            "doubleElevenItem": false,
            "halfOffItem": false,
            "showAtmosphere": false,
            "showRightRecommendedArea": false,
            "step": 0,
            "success": true
        },
        "extendedData": { },
        "extras": { },
        "gatewayDO": {
            "changeLocationGateway": {
                "queryDelivery": true,
                "queryProm": false
            },
            "success": true,
            "trade": {
                "addToBuyNow": { },
                "addToCart": { }
            }
        },
        "inventoryDO": {
            "hidden": false,
            "icTotalQuantity": 225,
            "skuQuantity": {
                "3280089025136": {
                    "quantity": 71,
                    "totalQuantity": 71,
                    "type": 1
                },
                "6310159781": {
                    "quantity": 33,
                    "totalQuantity": 33,
                    "type": 1
                },
                "6310159797": {
                    "quantity": 44,
                    "totalQuantity": 44,
                    "type": 1
                },
                "3280089025135": {
                    "quantity": 77,
                    "totalQuantity": 77,
                    "type": 1
                }
            },
            "success": true,
            "totalQuantity": 225,
            "type": 1
        },
        "itemPriceResultDO": {
            "areaId": 110100,
            "duo11Item": false,
            "duo11Stage": 0,
            "extraPromShowRealPrice": false,
            "halfOffItem": false,
            "hasDPromotion": false,
            "hasMobileProm": false,
            "hasTmallappProm": false,
            "hiddenNonBuyPrice": false,
            "hideMeal": false,
            "priceInfo": {
                "6310159781": {
                    "areaSold": true,
                    "onlyShowOnePrice": false,
                    "price": "178.00",
                    "promotionList": [
                        {
                            "amountPromLimit": 0,
                            "amountRestriction": "",
                            "basePriceType": "IcPrice",
                            "canBuyCouponNum": 0,
                            "endTime": 1561651200000,
                            "extraPromTextType": 0,
                            "extraPromType": 0,
                            "limitProm": false,
                            "postageFree": false,
                            "price": "75.00",
                            "promType": "normal",
                            "start": false,
                            "startTime": 1546267717000,
                            "status": 2,
                            "tfCartSupport": false,
                            "tmallCartSupport": false,
                            "type": "火爆促销",
                            "unLogBrandMember": false,
                            "unLogShopVip": false,
                            "unLogTbvip": false
                        }
                    ],
                    "sortOrder": 0
                },
                "6310159797": {
                    "areaSold": true,
                    "onlyShowOnePrice": false,
                    "price": "178.00",
                    "promotionList": [
                        {
                            "amountPromLimit": 0,
                            "amountRestriction": "",
                            "basePriceType": "IcPrice",
                            "canBuyCouponNum": 0,
                            "endTime": 1561651200000,
                            "extraPromTextType": 0,
                            "extraPromType": 0,
                            "limitProm": false,
                            "postageFree": false,
                            "price": "75.00",
                            "promType": "normal",
                            "start": false,
                            "startTime": 1546267717000,
                            "status": 2,
                            "tfCartSupport": false,
                            "tmallCartSupport": false,
                            "type": "火爆促销",
                            "unLogBrandMember": false,
                            "unLogShopVip": false,
                            "unLogTbvip": false
                        }
                    ],
                    "sortOrder": 0
                },
                "3280089025135": {
                    "areaSold": true,
                    "onlyShowOnePrice": false,
                    "price": "168.00",
                    "promotionList": [
                        {
                            "amountPromLimit": 0,
                            "amountRestriction": "",
                            "basePriceType": "IcPrice",
                            "canBuyCouponNum": 0,
                            "endTime": 1561651200000,
                            "extraPromTextType": 0,
                            "extraPromType": 0,
                            "limitProm": false,
                            "postageFree": false,
                            "price": "68.00",
                            "promType": "normal",
                            "start": false,
                            "startTime": 1546267717000,
                            "status": 2,
                            "tfCartSupport": false,
                            "tmallCartSupport": false,
                            "type": "火爆促销",
                            "unLogBrandMember": false,
                            "unLogShopVip": false,
                            "unLogTbvip": false
                        }
                    ],
                    "sortOrder": 0
                },
                "3280089025136": {
                    "areaSold": true,
                    "onlyShowOnePrice": false,
                    "price": "168.00",
                    "promotionList": [
                        {
                            "amountPromLimit": 0,
                            "amountRestriction": "",
                            "basePriceType": "IcPrice",
                            "canBuyCouponNum": 0,
                            "endTime": 1561651200000,
                            "extraPromTextType": 0,
                            "extraPromType": 0,
                            "limitProm": false,
                            "postageFree": false,
                            "price": "68.00",
                            "promType": "normal",
                            "start": false,
                            "startTime": 1546267717000,
                            "status": 2,
                            "tfCartSupport": false,
                            "tmallCartSupport": false,
                            "type": "火爆促销",
                            "unLogBrandMember": false,
                            "unLogShopVip": false,
                            "unLogTbvip": false
                        }
                    ],
                    "sortOrder": 0
                }
            },
            "queryProm": false,
            "success": true,
            "successCall": true,
            "tmallShopProm": [ ]
        },
        "memberRightDO": {
            "activityType": 0,
            "level": 0,
            "postageFree": false,
            "shopMember": false,
            "success": true,
            "time": 1,
            "value": 0.5
        },
        "miscDO": {
            "bucketId": 15,
            "city": "北京",
            "cityId": 110100,
            "debug": { },
            "hasCoupon": false,
            "region": "东城区",
            "regionId": 110101,
            "rn": "fa015e69c6a4ca4bb559805d670557e7",
            "smartBannerFlag": "top",
            "success": true,
            "supportCartRecommend": false,
            "systemTime": "1555232632711",
            "town": "东华门街道",
            "townId": 110101001
        },
        "regionalizedData": {
            "success": true
        },
        "sellCountDO": {
            "sellCount": "5",
            "success": true
        },
        "servicePromise": {
            "has3CPromise": false,
            "servicePromiseList": [
                {
                    "description": "商品支持正品保障服务",
                    "displayText": "正品保证",
                    "icon": "无",
                    "link": "//www.tmall.com/wow/portal/act/bzj",
                    "rank": -1
                },
                {
                    "description": "极速退款是为诚信会员提供的退款退货流程的专享特权,额度是根据每个用户当前的信誉评级情况而定",
                    "displayText": "极速退款",
                    "icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif",
                    "link": "//vip.tmall.com/vip/privilege.htm?spm=3.1000588.0.141.2a0ae8&priv=speed",
                    "rank": -1
                },
                {
                    "description": "卖家为您购买的商品投保退货运费险(保单生效以下单显示为准)",
                    "displayText": "赠运费险",
                    "icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif",
                    "link": "//service.tmall.com/support/tmall/knowledge-1121473.htm?spm=0.0.0.0.asbDA1",
                    "rank": -1
                },
                {
                    "description": "七天无理由退换",
                    "displayText": "七天无理由退换",
                    "icon": "//img.alicdn.com/tps/i3/T1Vyl6FCBlXXaSQP_X-16-16.png",
                    "link": "//pages.tmall.com/wow/seller/act/seven-day",
                    "rank": -1
                }
            ],
            "show": true,
            "success": true,
            "titleInformation": [ ]
        },
        "soldAreaDataDO": {
            "currentAreaEnable": true,
            "success": true,
            "useNewRegionalSales": true
        },
        "tradeResult": {
            "cartEnable": true,
            "cartType": 2,
            "miniTmallCartEnable": true,
            "startTime": 1554812946000,
            "success": true,
            "tradeEnable": true
        },
        "userInfoDO": {
            "activeStatus": 0,
            "companyPurchaseUser": false,
            "loginMember": false,
            "loginUserType": "buyer",
            "success": true,
            "userId": 0
        }
    },
    "isSuccess": true
}

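The JSON contains a great deal of content, and parsing every field would be tedious. Here we only need the price information, i.e. priceInfo, so we want an XPath-like way of picking values out of the JSON. For this we use JSONPath.

See: https://github.com/DarrenChanChenChi/jsonpath

Its usage is much the same as XPath.

Then we simply extract the values we want.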

Full code

common.go:

package main

import (
    "github.com/djimenez/iconv-go"
    "time"
    "net"
    "net/http"
    "gopkg.in/xmlpath.v2"
    "strings"
    "fmt"
    "math/rand"
)

type Msg struct{
    AdID int64 `json:"ad_id"`
    SourceID int64 `json:"source_id"`
    Source string `json:"source"`
    ItemID int64 `json:"item_id"`
    URL string `json:"url"`
    UID int64 `json:"uid"`
    DID int64 `json:"did"`
}

func convFromGbk(s string) string {
    res, err := iconv.ConvertString(s, "gbk", "utf-8")
    if err != nil {
        // Fall back to the untouched input if the conversion fails.
        return s
    }
    return res
}

func newHTTPClient() *http.Client {
    client := &http.Client{
        Transport: &http.Transport{
            Dial: func(netw, addr string) (net.Conn, error) {
                return net.DialTimeout(netw, addr, time.Duration(1500*time.Millisecond))
            },
            MaxIdleConnsPerHost: 200,
        },
        Timeout: time.Duration(1500 * time.Millisecond),
    }
    return client
}

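// parseNode returns only the first non-empty match.
func parseNode(node *xmlpath.Node, xpath string) string {
    path, err := xmlpath.Compile(xpath)
    if err != nil {
        // fmt.Errorf only builds an error value; print it so the failure is visible.
        fmt.Println("compile xpath failed:", err)
        return ""
    }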

    it := path.Iter(node)
    for it.Next() {
        s := strings.TrimSpace(it.Node().String())
        if len(s) != 0 {
            //return convFromGbk(s)
            return s
        }
    }
    return ""
}

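// parseNodeForAll returns every non-empty match.
func parseNodeForAll(node *xmlpath.Node, xpath string) []string {
    path, err := xmlpath.Compile(xpath)
    if err != nil {
        // fmt.Errorf only builds an error value; print it so the failure is visible.
        fmt.Println("compile xpath failed:", err)
        return nil
    }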

    it := path.Iter(node)
    elements := []string{}
    for it.Next() {
        s := strings.TrimSpace(it.Node().String())
        if len(s) != 0 {
            //return convFromGbk(s)
            elements = append(elements, s)
        }
    }
    return elements
}

// percent returns the possibility of pct
func percent(pct int) bool {
    if pct < 0 || pct > 100 {
        return false
    }
    return pct > rand.Intn(100)
}

ali_spider.go:

package main

import (
    "code.byted.org/gopkg/logs"
    "encoding/json"
    "fmt"
    "github.com/djimenez/iconv-go"
    "github.com/ngaut/logging"
    "github.com/oliveagle/jsonpath"
    "gopkg.in/xmlpath.v2"
    "io/ioutil"
    "math/rand"
    "net/http"
    "strconv"
    "strings"
)

const itemURLPatternAli = "https://detail.tmall.com/item.htm?id=%d"
const priceURLPatternAli = "https://mdskip.taobao.com/core/initItemDetail.htm?isUseInventoryCenter=false&cartEnable=true&service3C=false&isApparel=true&isSecKill=false&tmallBuySupport=true&isAreaSell=false&tryBeforeBuy=false&offlineShop=false&itemId=%d&showShopProm=false&isPurchaseMallPage=false&itemGmtModified=1555201252000&isRegionLevel=false&household=false&sellerPreview=false&queryMemberRight=true&addressLevel=2&isForbidBuyItem=false&callback=setMdskip&timestamp=1555210888509&isg=bBQF1SmIvk4dQ8UGBOCNIZNDTp7T7IRAguWjmN99i_5Qy1Y_p8_OlZkxNev6Vj5RsG8p46-P7M29-etfw&isg2=BPPzr6M1qyiTZGdgYB4puOBagvEXdGgbstRSkqWQUpJJpBNGLPrUOlF1XpTvBN_i"

var ualist = []string{
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
}

type AliSpider struct {
    client *http.Client
}

func NewAliSpider() *AliSpider {
    return &AliSpider{
        client: newHTTPClient(),
    }
}

func (j *AliSpider) loadPage(url string) (*xmlpath.Node, error) {
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
    rsp, err := j.client.Do(req)
    if err != nil {
        return nil, err
    }
    // Close the body on every return path, including parse errors.
    defer rsp.Body.Close()
    // Transcode the GB2312 body to UTF-8 before parsing.
    utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
    if err != nil {
        return nil, err
    }
    return xmlpath.ParseHTML(utfBody)
}

func (j *AliSpider) parsePrice(itemID int64) (map[string]map[string]float64, error) {
    priceURL := fmt.Sprintf(priceURLPatternAli, itemID)
    req, err := http.NewRequest("GET", priceURL, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
    referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
    req.Header.Set("Referer", referer)
    rsp, err := j.client.Do(req)
    if err != nil {
        return nil, err
    }
    defer rsp.Body.Close()
    priceInfoRaw, err := ioutil.ReadAll(rsp.Body)
    if err != nil {
        return nil, err
    }
    jsonStr := convFromGbk(string(priceInfoRaw))

    // The response is JSONP (callback=setMdskip); keep only the payload between
    // the first "(" and the last ")", since the JSON itself may contain ")".
    leftIndex := strings.Index(jsonStr, "(") + 1
    rightIndex := strings.LastIndex(jsonStr, ")")
    if leftIndex == 0 || rightIndex < leftIndex {
        return nil, fmt.Errorf("unexpected price response for item %d", itemID)
    }
    var jsonData interface{}
    if err := json.Unmarshal([]byte(jsonStr[leftIndex:rightIndex]), &jsonData); err != nil {
        return nil, err
    }

    skuQuantity, err := jsonpath.JsonPathLookup(jsonData, "$.defaultModel.inventoryDO.skuQuantity")
    if err != nil {
        return nil, err
    }
    skuQuantityMap, ok := skuQuantity.(map[string]interface{})
    if !ok {
        return nil, fmt.Errorf("skuQuantity has unexpected type %T", skuQuantity)
    }
    itemPriceResultMap := map[string]map[string]float64{}
    for skuQuantityId := range skuQuantityMap {
        jpathPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.price", skuQuantityId)
        jpathPromotionPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.promotionList[0].price", skuQuantityId)
        // Allocate a fresh map per SKU; sharing one map across iterations makes
        // every SKU report the prices of whichever SKU was processed last.
        detail := map[string]float64{}
        if price, err := jsonpath.JsonPathLookup(jsonData, jpathPrice); err != nil {
            logs.Info("jpathPrice is err, err is %v", err)
        } else if s, ok := price.(string); ok {
            detail["price"], _ = strconv.ParseFloat(s, 64)
        }
        if promotionPrice, err := jsonpath.JsonPathLookup(jsonData, jpathPromotionPrice); err != nil {
            logs.Info("jpathPromotionPrice is err, err is %v", err)
        } else if s, ok := promotionPrice.(string); ok {
            detail["promotion_price"], _ = strconv.ParseFloat(s, 64)
        }
        itemPriceResultMap[skuQuantityId] = detail
    }
    return itemPriceResultMap, nil
}

func (j *AliSpider) Parse(msg *Msg) (map[string]interface{}, error) {
    defer func() {
        if r := recover(); r != nil {
            logging.Errorf("parse msg %v, error %v", *msg, r)
            return
        }
    }()
    itemURL := fmt.Sprintf(itemURLPatternAli, msg.ItemID)
    node, err := j.loadPage(itemURL)
    if err != nil {
        return nil, err
    }
    //metricsClient.EmitCounter("jd_spider", 1, "", map[string]string{"step": "parse"})

    name := parseNode(node, "//h1[@data-spm]")
    // Product attributes (sample page text below)
    /**
    产品名称:纽曼
    品牌: 纽曼
    型号: EX16
    功能: 睡眠监测 计步 防水
     */
    details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
    detailsMap := make(map[string]string, len(details))
    for _, detail := range details {
        // Split on the first colon only, so values that contain colons stay intact.
        split := strings.SplitN(detail, ":", 2)
        if len(split) > 1 {
            detailsMap[split[0]] = strings.TrimSpace(split[1])
        }
    }

    shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")

    // Shop ratings: description, service, logistics.
    shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
    var describe, service, logistics float64
    if len(shopinfos) >= 3 {
        describe, _ = strconv.ParseFloat(shopinfos[0], 64)
        service, _ = strconv.ParseFloat(shopinfos[1], 64)
        logistics, _ = strconv.ParseFloat(shopinfos[2], 64)
    }

    // Prices (one entry per SKU; price is the list price, promotion_price is the promotional price)
    //map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
    itemPriceResultMap, err := j.parsePrice(msg.ItemID)
    if err != nil {
        return nil, err
    }

    res := map[string]interface{}{}
    res["source"] = "Ali"
    res["source_id"] = msg.SourceID
    res["id"] = msg.ItemID
    res["ad_id"] = msg.AdID
    res["url"] = itemURL
    res["name"] = name
    res["details"] = detailsMap
    res["shopname"] = shopname
    res["describe"] = describe
    res["service"] = service
    res["logistics"] = logistics
    res["uid"] = msg.UID
    res["did"] = msg.DID
    res["item_price"] = itemPriceResultMap
    // Validate a few fields that every valid page must contain.
    if res["name"] == "" && res["shopname"] == "" {
        return nil, fmt.Errorf("invalid html page %s", itemURL)
    }
    return res, nil
}
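
Because the price URL carries callback=setMdskip, parsePrice has to strip a JSONP wrapper before decoding. The sketch below isolates that step as a hypothetical stripJSONP helper, under the assumption that the response has the shape setMdskip({...}); it takes the last closing parenthesis so that parentheses inside the JSON text do not cut the payload short.

```go
package main

import (
	"fmt"
	"strings"
)

// stripJSONP returns the payload between the first "(" and the last ")" of a
// JSONP response such as `setMdskip({...})`.
func stripJSONP(s string) (string, error) {
	left := strings.Index(s, "(")
	right := strings.LastIndex(s, ")")
	if left < 0 || right < left {
		return "", fmt.Errorf("not a JSONP response")
	}
	return s[left+1 : right], nil
}

func main() {
	payload, err := stripJSONP(`setMdskip({"description":"a (b) c"})`)
	fmt.Println(payload, err)
}
```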

ali_spider_test.go:

package main

import (
    "encoding/json"
    "fmt"
    "strconv"
    "strings"
    "testing"
)

func TestName(t *testing.T) {
    //conf, err := ssconf.LoadSsConfFile(confFile)
    //if err != nil {
    //    panic(err)
    //}
    aliSpider := NewAliSpider()
    //554867117919 585758506034
    var itemId int64 = 7664169349
    itemURL := fmt.Sprintf(itemURLPatternAli, itemId)
    node, err := aliSpider.loadPage(itemURL)
    if err != nil {
        t.Fatalf("load page %s: %v", itemURL, err)
    }
    //fmt.Println(node)
    name := parseNode(node, "//h1[@data-spm]")
    // Product attributes (sample page text below)
    /**
    产品名称:纽曼
    品牌: 纽曼
    型号: EX16
    功能: 睡眠监测 计步 防水
     */
    details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
    detailsMap := make(map[string]string, len(details))
    for _, detail := range details {
        // Split on the first colon only, so values that contain colons stay intact.
        split := strings.SplitN(detail, ":", 2)
        if len(split) > 1 {
            detailsMap[split[0]] = strings.TrimSpace(split[1])
        }
    }

    shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")

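    // Shop ratings: description, service, logistics.
    shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
    if len(shopinfos) < 3 {
        t.Fatalf("expected 3 shop ratings, got %d", len(shopinfos))
    }
    describe, _ := strconv.ParseFloat(shopinfos[0], 64)
    service, _ := strconv.ParseFloat(shopinfos[1], 64)
    logistics, _ := strconv.ParseFloat(shopinfos[2], 64)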
    // Prices (one entry per SKU; price is the list price, promotion_price is the promotional price)
    //map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
    itemPriceResultMap, err := aliSpider.parsePrice(itemId)

    res := map[string]interface{}{}
    res["source"] = "Ali"
    res["url"] = itemURL
    res["name"] = name
    res["details"] = detailsMap
    res["shopname"] = shopname
    res["describe"] = describe
    res["service"] = service
    res["logistics"] = logistics
    res["item_price"] = itemPriceResultMap

    bytes, err := json.Marshal(res)
    if err != nil {
        fmt.Println("error is ", err)
    }
    fmt.Println(string(bytes))
}

Output of the original run:

{"describe":4.9,"details":{"上市时间":"2014年冬季","乒乓底板材质":"其他","品牌":"Palio/拍里奥","型号":"TNT-1","层数":"9层","拍柄重量":"头沉柄轻","是否商场同款":"是","系列":"拍里奥TNT-1","货号":"TNT-1","颜色分类":"TNT-1直拍(短柄)1只+赠送:1海绵护边【7木+2碳】 TNT-1横拍(长柄)1只+赠送:1海绵护边【7木+2碳】 新TNT直拍(短柄)1只+赠送:1海绵护边【5木+2碳】 新TNT横拍(长柄)1只+赠送:1海绵护边【5木+2碳】"},"item_price":{"3280089025135":{"price":168,"promotion_price":68},"3280089025136":{"price":168,"promotion_price":68},"6310159781":{"price":168,"promotion_price":68},"6310159797":{"price":168,"promotion_price":68}},"logistics":4.8,"name":"正品 拍里奥乒乓球底板新TNT-1碳素快攻弧圈乒乓球拍底板球拍球板","service":4.8,"shopname":"玺源运动专营店","source":"Ali","url":"https://detail.tmall.com/item.htm?id=7664169349"}

Original article: https://www.cnblogs.com/DarrenChan/p/10706019.html

Time: 2024-10-08 10:27:25
