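A recent task at work required crawling product information from Tmall. The overall flow is as follows:
First, modify the backend ad exchange code to parse a URL out of the creative material uploaded by Alibaba. The URL looks like this: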
https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D
It is clearly percent-encoded, so we first need to decode it. An online decoder is available at:
http://tool.chinaz.com/Tools/urlencode.aspx
After decoding, we get:
https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content={"items":[{"images":["https://asearch.alicdn.com/bao/uploaded//i4/22356367/TB2PMQinN6I8KJjy0FgXXXXzVXa_!!0-saturn_solar.jpg"],"itemid":"7664169349","shorttitle":"乒乓球拍 无线专属"}]}
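What we need from it is "itemid":"7664169349".
Visiting https://detail.tmall.com/item.htm?id=7664169349 then opens the product detail page.
This is the page whose information we need to crawl. The ad exchange puts the parsed itemid into nsq; the crawler builds the URL from it, scrapes the key information from the page, and sends that information to Kafka; Hive and ES then read from Kafka to serve queries.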
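Step 1
Step 1 is to extract the ItemId. The ad exchange gives us the URL to parse; in code we decode the URL and pull out the ItemId value. The project is written in Golang, so the example is in Golang. It would be even simpler in Python, and the principle is the same.
For URL parsing, see: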
https://gobyexample.com/url-parsing
For JSON serialization and deserialization, see:
https://www.cnblogs.com/liang1101/p/6741262.html
Here is my code:
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
)

// Struct fields must be exported (capitalized) so encoding/json can fill them.
// JSON keys are matched case-insensitively, so "itemid" maps to ItemId.
type item struct {
	Images     []string
	ItemId     string
	ShortTitle string
}

func main() {
	urlstring := "https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D"

	// Percent-decode the whole URL.
	unescape, err := url.QueryUnescape(urlstring)
	if err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Println(unescape)

	parse, err := url.Parse(unescape)
	if err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Println(parse.RawQuery)

	query, err := url.ParseQuery(parse.RawQuery)
	if err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Println(query)
	content := query.Get("content")
	fmt.Printf("%T, %v\n", content, content)

	// The content parameter is a JSON object of the form {"items":[...]}.
	m := make(map[string][]item)
	if err := json.Unmarshal([]byte(content), &m); err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Println("m:", m)
	if len(m["items"]) == 0 {
		fmt.Println("no items found")
		return
	}
	itemValue := m["items"][0]
	fmt.Println(itemValue.ItemId)

	// Convert to int64.
	i, err := strconv.ParseInt(itemValue.ItemId, 10, 64)
	if err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Printf("%T, %v\n", i, i)
}
Running it gives us the ItemId value we need.
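For use inside a service, the same steps can be wrapped in a function that returns an error instead of printing. The following is only a sketch under a few assumptions: the names extractItemID, creativeItem and creativeContent are made up for illustration, and the sample URL in main is a shortened stand-in, not a real creative. It parses the URL first and lets Query() decode the content parameter, so a title containing a character such as & cannot split the query string the way it could if the whole URL were decoded up front.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
)

// creativeItem and creativeContent are hypothetical names for the JSON
// carried in the "content" query parameter.
type creativeItem struct {
	Images     []string `json:"images"`
	ItemID     string   `json:"itemid"`
	ShortTitle string   `json:"shorttitle"`
}

type creativeContent struct {
	Items []creativeItem `json:"items"`
}

// extractItemID returns the itemid of the first item in the content parameter.
func extractItemID(rawURL string) (int64, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0, fmt.Errorf("parse url: %w", err)
	}
	// Query() percent-decodes each parameter value exactly once.
	content := u.Query().Get("content")
	if content == "" {
		return 0, fmt.Errorf("no content parameter in %q", rawURL)
	}
	var c creativeContent
	if err := json.Unmarshal([]byte(content), &c); err != nil {
		return 0, fmt.Errorf("decode content: %w", err)
	}
	if len(c.Items) == 0 {
		return 0, fmt.Errorf("content has no items")
	}
	return strconv.ParseInt(c.Items[0].ItemID, 10, 64)
}

func main() {
	// Shortened, made-up sample; note the "&" inside the title.
	payload := `{"items":[{"itemid":"7664169349","shorttitle":"a&b"}]}`
	raw := "https://handycam.alicdn.com/slideshow/26/sample.mp4?content=" + url.QueryEscape(payload)
	id, err := extractItemID(raw)
	fmt.Println(id, err)
}
```

The struct tags are optional here because encoding/json matches keys case-insensitively, but they make the expected key names explicit.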
Step 2
Step 2 is to build the URL and crawl the page content.
How do we fetch a web page in Golang? Here is a simple demo.
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	website := "http://www.future.org.cn"

	resp, err := http.Get(website)
	if err != nil {
		fmt.Println("Cannot connect the server:", err)
		return
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("Cannot read from connected http server:", err)
		return
	}
	fmt.Println("HTML content:", string(body))
}
After fetching the page, however, a problem shows up: the Chinese text is garbled.
To fix the garbled Chinese, install iconv-go:
go get github.com/djimenez/iconv-go
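You can transcode the string after fetching it, for example:
func convFromGbk(s string) string {
	gbkConvert, err := iconv.NewConverter("gbk", "utf-8")
	if err != nil {
		return s // no converter available: fall back to the raw input
	}
	defer gbkConvert.Close()
	res, _ := gbkConvert.ConvertString(s)
	return res
}
Alternatively, wrap the Reader so the body is transcoded as it is read (this is an excerpt from loadPage in the full code, so url, ualist and j.client come from there):
req, err := http.NewRequest("GET", url, nil)
if err != nil {
	return nil, err
}
req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
rsp, err := j.client.Do(req)
if err != nil {
	return nil, err
}
// Transcode the body while reading it.
utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
//if body, err := ioutil.ReadAll(utfBody); err == nil {
//	fmt.Println("HTML content:", string(body))
//}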
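Both snippets hard-code the source encoding (gbk or gb2312). If the server declares a charset in its Content-Type header, that value could be read instead of guessed. The sketch below is not part of the original code: charsetFromContentType and its fallback argument are hypothetical names, and whether the target site actually sends a charset is an assumption to check against the real response. It relies only on mime.ParseMediaType from the standard library.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType is a hypothetical helper. It returns the charset
// parameter of a Content-Type header value in lower case, or fallback when
// the header is empty, malformed, or carries no charset.
func charsetFromContentType(header, fallback string) string {
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return fallback
	}
	if cs := strings.TrimSpace(params["charset"]); cs != "" {
		return strings.ToLower(cs)
	}
	return fallback
}

func main() {
	fmt.Println(charsetFromContentType("text/html;charset=GBK", "utf-8")) // gbk
	fmt.Println(charsetFromContentType("text/html", "utf-8"))             // utf-8
}
```

The returned name could then be passed as the source encoding to iconv.NewReader in place of the literal "gb2312", with the literal kept as the fallback.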
The fetched page then needs to be parsed; XPath is used here.
For how to use XPath, see:
http://www.w3school.com.cn/xpath/xpath_axes.asp
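It is very simple; one read is enough to understand it.
What comes back from the crawl is HTML, so you only need to pick out the content you want. In other words, it is just HTML parsing.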
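The next difficulty: the static page we fetched contains most of the information, but not the price, because the price is loaded dynamically.
Let's analyze the requests the page makes.
Opening the price request and pasting its URL into the browser fails with 403. Don't worry: the request only needs a Referer header.
In code:
referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
req.Header.Set("Referer", referer)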
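To see the effect of the header without hitting the real endpoint, the behavior can be reproduced against a local stand-in server. This is only an illustration: newRefererGuardedServer and statusWithReferer are hypothetical helpers, and the rule "reject unless the Referer starts with https://detail.tmall.com/" is an assumption that mimics the observed 403, not the actual server-side check.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newRefererGuardedServer starts a local stand-in server that answers 403
// unless the Referer header starts with prefix.
func newRefererGuardedServer(prefix string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.HasPrefix(r.Referer(), prefix) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		fmt.Fprint(w, `{"isSuccess":true}`)
	}))
}

// statusWithReferer issues a GET and returns the HTTP status code.
// An empty referer means the header is not sent at all.
func statusWithReferer(target, referer string) (int, error) {
	req, err := http.NewRequest("GET", target, nil)
	if err != nil {
		return 0, err
	}
	if referer != "" {
		req.Header.Set("Referer", referer)
	}
	rsp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer rsp.Body.Close()
	return rsp.StatusCode, nil
}

func main() {
	srv := newRefererGuardedServer("https://detail.tmall.com/")
	defer srv.Close()

	var itemID int64 = 7664169349
	referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)

	without, err := statusWithReferer(srv.URL, "")
	fmt.Println("without Referer:", without, err)
	with, err := statusWithReferer(srv.URL, referer)
	fmt.Println("with Referer:", with, err)
}
```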
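Looking at the response, we find that it is JSON (wrapped in the setMdskip(...) callback named in the request URL).
Formatted with the formatter at http://tool.oschina.net/codeformat/json, it looks like this: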
{ "defaultModel": { "bannerDO": { "success": true }, "deliveryDO": { "areaId": 110100, "deliveryAddress": "浙江金华", "deliverySkuMap": { "6310159781": [ { "arrivalNextDay": false, "arrivalThisDay": false, "forceMocked": false, "postage": "快递: 0.00 ", "postageFree": false, "skuDeliveryAddress": "浙江金华", "type": 0 } ], "default": [ { "arrivalNextDay": false, "arrivalThisDay": false, "forceMocked": false, "postage": "快递: 0.00 ", "postageFree": false, "skuDeliveryAddress": "浙江金华", "type": 0 } ], "6310159797": [ { "arrivalNextDay": false, "arrivalThisDay": false, "forceMocked": false, "postage": "快递: 0.00 ", "postageFree": false, "skuDeliveryAddress": "浙江金华", "type": 0 } ], "3280089025135": [ { "arrivalNextDay": false, "arrivalThisDay": false, "forceMocked": false, "postage": "快递: 0.00 ", "postageFree": false, "skuDeliveryAddress": "浙江金华", "type": 0 } ], "3280089025136": [ { "arrivalNextDay": false, "arrivalThisDay": false, "forceMocked": false, "postage": "快递: 0.00 ", "postageFree": false, "skuDeliveryAddress": "浙江金华", "type": 0 } ] }, "destination": "北京市", "success": true }, "detailPageTipsDO": { "crowdType": 0, "hasCoupon": true, "hideIcons": false, "jhs99": false, "minicartSurprise": 0, "onlyShowOnePrice": false, "priceDisplayType": 4, "primaryPicIcons": [ ], "prime": false, "showCuntaoIcon": false, "showDou11Style": false, "showDou11SugPromPrice": false, "showDou12CornerIcon": false, "showDuo11Stage": 0, "showJuIcon": false, "showMaskedDou11SugPrice": false, "success": true, "trueDuo11Prom": false }, "doubleEleven2014": { "doubleElevenItem": false, "halfOffItem": false, "showAtmosphere": false, "showRightRecommendedArea": false, "step": 0, "success": true }, "extendedData": { }, "extras": { }, "gatewayDO": { "changeLocationGateway": { "queryDelivery": true, "queryProm": false }, "success": true, "trade": { "addToBuyNow": { }, "addToCart": { } } }, "inventoryDO": { "hidden": false, "icTotalQuantity": 225, "skuQuantity": { "3280089025136": { "quantity": 71, 
"totalQuantity": 71, "type": 1 }, "6310159781": { "quantity": 33, "totalQuantity": 33, "type": 1 }, "6310159797": { "quantity": 44, "totalQuantity": 44, "type": 1 }, "3280089025135": { "quantity": 77, "totalQuantity": 77, "type": 1 } }, "success": true, "totalQuantity": 225, "type": 1 }, "itemPriceResultDO": { "areaId": 110100, "duo11Item": false, "duo11Stage": 0, "extraPromShowRealPrice": false, "halfOffItem": false, "hasDPromotion": false, "hasMobileProm": false, "hasTmallappProm": false, "hiddenNonBuyPrice": false, "hideMeal": false, "priceInfo": { "6310159781": { "areaSold": true, "onlyShowOnePrice": false, "price": "178.00", "promotionList": [ { "amountPromLimit": 0, "amountRestriction": "", "basePriceType": "IcPrice", "canBuyCouponNum": 0, "endTime": 1561651200000, "extraPromTextType": 0, "extraPromType": 0, "limitProm": false, "postageFree": false, "price": "75.00", "promType": "normal", "start": false, "startTime": 1546267717000, "status": 2, "tfCartSupport": false, "tmallCartSupport": false, "type": "火爆促销", "unLogBrandMember": false, "unLogShopVip": false, "unLogTbvip": false } ], "sortOrder": 0 }, "6310159797": { "areaSold": true, "onlyShowOnePrice": false, "price": "178.00", "promotionList": [ { "amountPromLimit": 0, "amountRestriction": "", "basePriceType": "IcPrice", "canBuyCouponNum": 0, "endTime": 1561651200000, "extraPromTextType": 0, "extraPromType": 0, "limitProm": false, "postageFree": false, "price": "75.00", "promType": "normal", "start": false, "startTime": 1546267717000, "status": 2, "tfCartSupport": false, "tmallCartSupport": false, "type": "火爆促销", "unLogBrandMember": false, "unLogShopVip": false, "unLogTbvip": false } ], "sortOrder": 0 }, "3280089025135": { "areaSold": true, "onlyShowOnePrice": false, "price": "168.00", "promotionList": [ { "amountPromLimit": 0, "amountRestriction": "", "basePriceType": "IcPrice", "canBuyCouponNum": 0, "endTime": 1561651200000, "extraPromTextType": 0, "extraPromType": 0, "limitProm": false, "postageFree": 
false, "price": "68.00", "promType": "normal", "start": false, "startTime": 1546267717000, "status": 2, "tfCartSupport": false, "tmallCartSupport": false, "type": "火爆促销", "unLogBrandMember": false, "unLogShopVip": false, "unLogTbvip": false } ], "sortOrder": 0 }, "3280089025136": { "areaSold": true, "onlyShowOnePrice": false, "price": "168.00", "promotionList": [ { "amountPromLimit": 0, "amountRestriction": "", "basePriceType": "IcPrice", "canBuyCouponNum": 0, "endTime": 1561651200000, "extraPromTextType": 0, "extraPromType": 0, "limitProm": false, "postageFree": false, "price": "68.00", "promType": "normal", "start": false, "startTime": 1546267717000, "status": 2, "tfCartSupport": false, "tmallCartSupport": false, "type": "火爆促销", "unLogBrandMember": false, "unLogShopVip": false, "unLogTbvip": false } ], "sortOrder": 0 } }, "queryProm": false, "success": true, "successCall": true, "tmallShopProm": [ ] }, "memberRightDO": { "activityType": 0, "level": 0, "postageFree": false, "shopMember": false, "success": true, "time": 1, "value": 0.5 }, "miscDO": { "bucketId": 15, "city": "北京", "cityId": 110100, "debug": { }, "hasCoupon": false, "region": "东城区", "regionId": 110101, "rn": "fa015e69c6a4ca4bb559805d670557e7", "smartBannerFlag": "top", "success": true, "supportCartRecommend": false, "systemTime": "1555232632711", "town": "东华门街道", "townId": 110101001 }, "regionalizedData": { "success": true }, "sellCountDO": { "sellCount": "5", "success": true }, "servicePromise": { "has3CPromise": false, "servicePromiseList": [ { "description": "商品支持正品保障服务", "displayText": "正品保证", "icon": "无", "link": "//www.tmall.com/wow/portal/act/bzj", "rank": -1 }, { "description": "极速退款是为诚信会员提供的退款退货流程的专享特权,额度是根据每个用户当前的信誉评级情况而定", "displayText": "极速退款", "icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif", "link": "//vip.tmall.com/vip/privilege.htm?spm=3.1000588.0.141.2a0ae8&priv=speed", "rank": -1 }, { "description": "卖家为您购买的商品投保退货运费险(保单生效以下单显示为准)", "displayText": "赠运费险", "icon": 
"//img.alicdn.com/bao/album/sys/icon/discount.gif", "link": "//service.tmall.com/support/tmall/knowledge-1121473.htm?spm=0.0.0.0.asbDA1", "rank": -1 }, { "description": "七天无理由退换", "displayText": "七天无理由退换", "icon": "//img.alicdn.com/tps/i3/T1Vyl6FCBlXXaSQP_X-16-16.png", "link": "//pages.tmall.com/wow/seller/act/seven-day", "rank": -1 } ], "show": true, "success": true, "titleInformation": [ ] }, "soldAreaDataDO": { "currentAreaEnable": true, "success": true, "useNewRegionalSales": true }, "tradeResult": { "cartEnable": true, "cartType": 2, "miniTmallCartEnable": true, "startTime": 1554812946000, "success": true, "tradeEnable": true }, "userInfoDO": { "activeStatus": 0, "companyPurchaseUser": false, "loginMember": false, "loginUserType": "buyer", "success": true, "userId": 0 } }, "isSuccess": true }
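The JSON is very large, and parsing every field would be tedious. We only need the price information, that is, priceInfo, so we want a way to address it much as XPath addresses HTML. JSONPath does exactly that.
See: https://github.com/DarrenChanChenChi/jsonpath
Its usage is much the same as XPath.
Just extract the fields we want.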
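As a point of comparison, and not the approach used in this article, the same fields can be pulled out with only encoding/json by declaring structs for just the path we care about; everything else in the response is ignored during decoding. The sketch below assumes the response shape shown above. The names skuPrice, priceResponse, stripJSONP and parsePriceInfo are hypothetical, and the sample body in main is a trimmed-down copy of the real response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// skuPrice mirrors only the fields of one priceInfo entry that we need.
type skuPrice struct {
	Price         string `json:"price"`
	PromotionList []struct {
		Price string `json:"price"`
	} `json:"promotionList"`
}

// priceResponse describes just defaultModel.itemPriceResultDO.priceInfo.
type priceResponse struct {
	DefaultModel struct {
		ItemPriceResultDO struct {
			PriceInfo map[string]skuPrice `json:"priceInfo"`
		} `json:"itemPriceResultDO"`
	} `json:"defaultModel"`
}

// stripJSONP removes a callback wrapper such as setMdskip(...). Text that
// does not end in ")" (after trimming space and one ";") is returned as is.
func stripJSONP(s string) string {
	s = strings.TrimSuffix(strings.TrimSpace(s), ";")
	l := strings.Index(s, "(")
	if l < 0 || !strings.HasSuffix(s, ")") {
		return s
	}
	return s[l+1 : len(s)-1]
}

// parsePriceInfo decodes a JSONP or plain JSON body into a per-SKU map.
func parsePriceInfo(body string) (map[string]skuPrice, error) {
	var resp priceResponse
	if err := json.Unmarshal([]byte(stripJSONP(body)), &resp); err != nil {
		return nil, err
	}
	return resp.DefaultModel.ItemPriceResultDO.PriceInfo, nil
}

func main() {
	body := `setMdskip({"defaultModel":{"itemPriceResultDO":{"priceInfo":{"6310159781":{"price":"178.00","promotionList":[{"price":"75.00"}]}}}},"isSuccess":true})`
	info, err := parsePriceInfo(body)
	if err != nil {
		fmt.Println("err is", err)
		return
	}
	for sku, p := range info {
		promo := ""
		if len(p.PromotionList) > 0 {
			promo = p.PromotionList[0].Price
		}
		fmt.Println(sku, p.Price, promo)
	}
}
```

The trade-off is roughly this: JSONPath keeps the lookup as a one-line path string, while the struct approach needs a few type declarations but gives typed fields and no type assertions.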
Full code
common.go:
package main

import (
	"fmt"
	"math/rand"
	"net"
	"net/http"
	"strings"
	"time"

	"github.com/djimenez/iconv-go"
	"gopkg.in/xmlpath.v2"
)

// Msg is the input to Parse: one item to crawl plus its ad metadata.
type Msg struct {
	AdID     int64  `json:"ad_id"`
	SourceID int64  `json:"source_id"`
	Source   string `json:"source"`
	ItemID   int64  `json:"item_id"`
	URL      string `json:"url"`
	UID      int64  `json:"uid"`
	DID      int64  `json:"did"`
}

// convFromGbk converts a GBK-encoded string to UTF-8.
func convFromGbk(s string) string {
	gbkConvert, err := iconv.NewConverter("gbk", "utf-8")
	if err != nil {
		return s // no converter available: fall back to the raw input
	}
	defer gbkConvert.Close()
	res, _ := gbkConvert.ConvertString(s)
	return res
}

// newHTTPClient returns a client with a 1.5s dial timeout and a 1.5s
// overall request timeout.
func newHTTPClient() *http.Client {
	const timeout = 1500 * time.Millisecond
	return &http.Client{
		Transport: &http.Transport{
			DialContext:         (&net.Dialer{Timeout: timeout}).DialContext,
			MaxIdleConnsPerHost: 200,
		},
		Timeout: timeout,
	}
}

// parseNode returns only the first non-empty match of xpath.
func parseNode(node *xmlpath.Node, xpath string) string {
	path, err := xmlpath.Compile(xpath)
	if err != nil {
		fmt.Println("compile xpath:", err)
		return ""
	}
	it := path.Iter(node)
	for it.Next() {
		s := strings.TrimSpace(it.Node().String())
		if len(s) != 0 {
			//return convFromGbk(s)
			return s
		}
	}
	return ""
}

// parseNodeForAll returns every non-empty match of xpath.
func parseNodeForAll(node *xmlpath.Node, xpath string) []string {
	path, err := xmlpath.Compile(xpath)
	if err != nil {
		fmt.Println("compile xpath:", err)
		return nil
	}
	it := path.Iter(node)
	elements := []string{}
	for it.Next() {
		s := strings.TrimSpace(it.Node().String())
		if len(s) != 0 {
			//return convFromGbk(s)
			elements = append(elements, s)
		}
	}
	return elements
}

// percent returns true with a probability of pct percent.
// Values outside [0, 100] always return false.
func percent(pct int) bool {
	if pct < 0 || pct > 100 {
		return false
	}
	return pct > rand.Intn(100)
}
ali_spider.go:
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"math/rand"
	"net/http"
	"strconv"
	"strings"

	"code.byted.org/gopkg/logs"
	"github.com/djimenez/iconv-go"
	"github.com/ngaut/logging"
	"github.com/oliveagle/jsonpath"
	"gopkg.in/xmlpath.v2"
)

const itemURLPatternAli = "https://detail.tmall.com/item.htm?id=%d"
const priceURLPatternAli = "https://mdskip.taobao.com/core/initItemDetail.htm?isUseInventoryCenter=false&cartEnable=true&service3C=false&isApparel=true&isSecKill=false&tmallBuySupport=true&isAreaSell=false&tryBeforeBuy=false&offlineShop=false&itemId=%d&showShopProm=false&isPurchaseMallPage=false&itemGmtModified=1555201252000&isRegionLevel=false&household=false&sellerPreview=false&queryMemberRight=true&addressLevel=2&isForbidBuyItem=false&callback=setMdskip&timestamp=1555210888509&isg=bBQF1SmIvk4dQ8UGBOCNIZNDTp7T7IRAguWjmN99i_5Qy1Y_p8_OlZkxNev6Vj5RsG8p46-P7M29-etfw&isg2=BPPzr6M1qyiTZGdgYB4puOBagvEXdGgbstRSkqWQUpJJpBNGLPrUOlF1XpTvBN_i"

var ualist = []string{
	"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
	"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
	"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
	"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
	"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
	"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
	"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
	"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
	"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
	"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
	"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
	"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
	"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
	"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
	"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
}

type AliSpider struct {
	client *http.Client
}

func NewAliSpider() *AliSpider {
	return &AliSpider{
		client: newHTTPClient(),
	}
}

// loadPage fetches url with a random User-Agent and parses the HTML,
// transcoding the body from gb2312 to utf-8 while reading.
func (j *AliSpider) loadPage(url string) (*xmlpath.Node, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
	rsp, err := j.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer rsp.Body.Close()
	// Transcode.
	utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
	if err != nil {
		return nil, err
	}
	//if body, err := ioutil.ReadAll(utfBody); err == nil {
	//	fmt.Println("HTML content:", string(body))
	//}
	return xmlpath.ParseHTML(utfBody)
}

// parsePrice returns, for each SKU id, the list price ("price") and the
// first promotion price ("promotion_price").
func (j *AliSpider) parsePrice(itemID int64) (map[string]map[string]float64, error) {
	priceURL := fmt.Sprintf(priceURLPatternAli, itemID)
	req, err := http.NewRequest("GET", priceURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
	// Without a Referer the request is rejected with 403.
	referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
	req.Header.Set("Referer", referer)
	rsp, err := j.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer rsp.Body.Close()
	priceInfoRaw, err := ioutil.ReadAll(rsp.Body)
	if err != nil {
		return nil, err
	}
	jsonStr := convFromGbk(string(priceInfoRaw))
	// The body is JSONP, setMdskip({...}); keep only the JSON between the
	// first "(" and the last ")".
	leftIndex := strings.Index(jsonStr, "(") + 1
	rightIndex := strings.LastIndex(jsonStr, ")")
	if rightIndex < leftIndex {
		return nil, fmt.Errorf("unexpected price response for item %d", itemID)
	}
	var jsonData interface{}
	if err := json.Unmarshal([]byte(jsonStr[leftIndex:rightIndex]), &jsonData); err != nil {
		return nil, err
	}
	skuQuantity, err := jsonpath.JsonPathLookup(jsonData, "$.defaultModel.inventoryDO.skuQuantity")
	if err != nil {
		logs.Info("json path is err, err is %v", err)
		return nil, err
	}
	skuQuantityMap, ok := skuQuantity.(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("skuQuantity has unexpected type %T", skuQuantity)
	}
	itemPriceResultMap := map[string]map[string]float64{}
	for skuQuantityId := range skuQuantityMap {
		jpathPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.price", skuQuantityId)
		jpathPromotionPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.promotionList[0].price", skuQuantityId)
		price, err := jsonpath.JsonPathLookup(jsonData, jpathPrice)
		if err != nil {
			logs.Info("jpathPrice is err, err is %v", err)
		}
		promotionPrice, err := jsonpath.JsonPathLookup(jsonData, jpathPromotionPrice)
		if err != nil {
			logs.Info("jpathPromotionPrice is err, err is %v", err)
		}
		// Allocate a fresh map per SKU. Reusing one map for all SKUs would
		// make every SKU report whichever values were written last.
		detail := map[string]float64{}
		if priceStr, ok := price.(string); ok {
			detail["price"], _ = strconv.ParseFloat(priceStr, 64)
		}
		if promotionPriceStr, ok := promotionPrice.(string); ok {
			detail["promotion_price"], _ = strconv.ParseFloat(promotionPriceStr, 64)
		}
		itemPriceResultMap[skuQuantityId] = detail
	}
	return itemPriceResultMap, nil
}

// Parse crawls the detail page and the price endpoint for msg.ItemID.
func (j *AliSpider) Parse(msg *Msg) (res map[string]interface{}, err error) {
	defer func() {
		if r := recover(); r != nil {
			logging.Errorf("parse msg %v, error %v", *msg, r)
			// Surface the panic as an error instead of returning nil, nil.
			res, err = nil, fmt.Errorf("parse item %d: %v", msg.ItemID, r)
		}
	}()
	itemURL := fmt.Sprintf(itemURLPatternAli, msg.ItemID)
	node, err := j.loadPage(itemURL)
	if err != nil {
		return nil, err
	}
	//metricsClient.EmitCounter("jd_spider", 1, "", map[string]string{"step": "parse"})
	name := parseNode(node, "//h1[@data-spm]")
	// Detail attributes, one "key: value" pair per <li>, for example
	// (translated from the Chinese page text):
	//   Product name: Newman
	//   Brand: Newman
	//   Model: EX16
	//   Features: sleep monitoring, step counting, waterproof
	details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
	detailsMap := make(map[string]string, len(details))
	for _, detail := range details {
		split := strings.SplitN(detail, ":", 2)
		if len(split) > 1 {
			detailsMap[split[0]] = strings.TrimSpace(split[1])
		}
	}
	shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")
	// Shop scores, in page order: description, service, logistics.
	shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
	if len(shopinfos) < 3 {
		return nil, fmt.Errorf("invalid html page %s: shop scores not found", itemURL)
	}
	describe, _ := strconv.ParseFloat(shopinfos[0], 64)
	service, _ := strconv.ParseFloat(shopinfos[1], 64)
	logistics, _ := strconv.ParseFloat(shopinfos[2], 64)
	// Prices per SKU: price is the list price, promotion_price the promotional price, e.g.
	//map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
	itemPriceResultMap, err := j.parsePrice(msg.ItemID)
	if err != nil {
		// A missing price does not invalidate the rest of the page.
		logging.Errorf("parse price for item %d: %v", msg.ItemID, err)
	}
	res = map[string]interface{}{}
	res["source"] = "Ali"
	res["source_id"] = msg.SourceID
	res["id"] = msg.ItemID
	res["ad_id"] = msg.AdID
	res["url"] = itemURL
	res["name"] = name
	res["details"] = detailsMap
	res["shopname"] = shopname
	res["describe"] = describe
	res["service"] = service
	res["logistics"] = logistics
	res["uid"] = msg.UID
	res["did"] = msg.DID
	res["item_price"] = itemPriceResultMap
	// Validate a few fields that a real product page must contain.
	if res["name"] == "" && res["shopname"] == "" {
		return nil, fmt.Errorf("invalid html page %s", itemURL)
	}
	return res, nil
}
ali_spider_test.go:
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
	"testing"
)

func TestName(t *testing.T) {
	//conf, err := ssconf.LoadSsConfFile(confFile)
	//if err != nil {
	//	panic(err)
	//}
	aliSpider := NewAliSpider()
	// Other sample item ids: 554867117919 585758506034
	var itemId int64 = 7664169349
	itemURL := fmt.Sprintf(itemURLPatternAli, itemId)
	node, err := aliSpider.loadPage(itemURL)
	if err != nil {
		t.Fatalf("load page: %v", err)
	}
	//fmt.Println(node)
	name := parseNode(node, "//h1[@data-spm]")
	// Detail attributes, one "key: value" pair per <li>, for example
	// (translated from the Chinese page text):
	//   Product name: Newman
	//   Brand: Newman
	//   Model: EX16
	//   Features: sleep monitoring, step counting, waterproof
	details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
	detailsMap := make(map[string]string, len(details))
	for _, detail := range details {
		split := strings.SplitN(detail, ":", 2)
		if len(split) > 1 {
			detailsMap[split[0]] = strings.TrimSpace(split[1])
		}
	}
	shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")
	// Shop scores, in page order: description, service, logistics.
	shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
	if len(shopinfos) < 3 {
		t.Fatalf("expected 3 shop scores, got %d", len(shopinfos))
	}
	describe, _ := strconv.ParseFloat(shopinfos[0], 64)
	service, _ := strconv.ParseFloat(shopinfos[1], 64)
	logistics, _ := strconv.ParseFloat(shopinfos[2], 64)
	// Prices per SKU: price is the list price, promotion_price the promotional price, e.g.
	//map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
	itemPriceResultMap, err := aliSpider.parsePrice(itemId)
	if err != nil {
		t.Logf("parse price: %v", err)
	}
	res := map[string]interface{}{}
	res["source"] = "Ali"
	res["url"] = itemURL
	res["name"] = name
	res["details"] = detailsMap
	res["shopname"] = shopname
	res["describe"] = describe
	res["service"] = service
	res["logistics"] = logistics
	res["item_price"] = itemPriceResultMap
	bytes, err := json.Marshal(res)
	if err != nil {
		t.Fatalf("marshal result: %v", err)
	}
	fmt.Println(string(bytes))
}
Output, as recorded by the author:
{"describe":4.9,"details":{"上市时间":"2014年冬季","乒乓底板材质":"其他","品牌":"Palio/拍里奥","型号":"TNT-1","层数":"9层","拍柄重量":"头沉柄轻","是否商场同款":"是","系列":"拍里奥TNT-1","货号":"TNT-1","颜色分类":"TNT-1直拍(短柄)1只+赠送:1海绵护边【7木+2碳】 TNT-1横拍(长柄)1只+赠送:1海绵护边【7木+2碳】 新TNT直拍(短柄)1只+赠送:1海绵护边【5木+2碳】 新TNT横拍(长柄)1只+赠送:1海绵护边【5木+2碳】"},"item_price":{"3280089025135":{"price":168,"promotion_price":68},"3280089025136":{"price":168,"promotion_price":68},"6310159781":{"price":168,"promotion_price":68},"6310159797":{"price":168,"promotion_price":68}},"logistics":4.8,"name":"正品 拍里奥乒乓球底板新TNT-1碳素快攻弧圈乒乓球拍底板球拍球板","service":4.8,"shopname":"玺源运动专营店","source":"Ali","url":"https://detail.tmall.com/item.htm?id=7664169349"}
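One thing stands out in this output: every SKU reports 168/68, while the JSON response shown earlier lists 178.00/75.00 for SKUs 6310159781 and 6310159797. That pattern is what one would expect if all SKU entries in item_price pointed at a single shared inner map, because Go maps are reference types and a later write is visible through every key that holds the same map. The sketch below illustrates the pitfall and the fix in isolation; buildShared and buildFresh are hypothetical names, not functions from the spider.

```go
package main

import "fmt"

// buildShared stores the SAME inner map under every key, so all keys end up
// showing the value written in the last loop iteration.
func buildShared(prices map[string]float64) map[string]map[string]float64 {
	out := map[string]map[string]float64{}
	inner := map[string]float64{}
	for sku, p := range prices {
		inner["price"] = p
		out[sku] = inner
	}
	return out
}

// buildFresh allocates a new inner map per key, so each key keeps its own value.
func buildFresh(prices map[string]float64) map[string]map[string]float64 {
	out := map[string]map[string]float64{}
	for sku, p := range prices {
		out[sku] = map[string]float64{"price": p}
	}
	return out
}

func main() {
	prices := map[string]float64{"6310159781": 178, "3280089025135": 168}
	// Both SKUs print the same price; which one depends on iteration order.
	fmt.Println("shared:", buildShared(prices))
	// Each SKU prints its own price.
	fmt.Println("fresh: ", buildFresh(prices))
}
```

In parsePrice this corresponds to creating the per-SKU detail map inside the loop over SKU ids rather than once before it.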
Original article: https://www.cnblogs.com/DarrenChan/p/10706019.html