Scraping the download links of searched movies and saving them to Excel

1. Background

Use the requests module to fetch pages, BeautifulSoup to extract the needed content, and finally the xlsxwriter module to save the results to Excel. The flow is: first fetch the keyword-search result pages to collect the URL of each movie's sub-page, then scrape those sub-pages and save the extracted details to Excel.
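
One detail worth calling out before the code: the dydytt.net search endpoint expects the keyword percent-encoded in GB2312 rather than UTF-8, which is why get_url() below passes encoding='gb2312' to urlencode. A minimal sketch of just that step (the keyword '科幻' here is only an example):

from urllib import parse

# '科幻' encoded as GB2312 becomes %BF%C6%BB%C3, matching the
# keyword parameter visible in the site's search URL.
print(parse.urlencode({'keyword': '科幻'}, encoding='gb2312'))
# -> keyword=%BF%C6%BB%C3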

2. Code

Two modules were written, geturldytt and getexceldytt, both called from main at the end.

The geturldytt code is as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from urllib import parse
import requests
from bs4 import BeautifulSoup
#http://s.dydytt.net/plus/search.php?keyword=%BF%C6%BB%C3&searchtype=titlekeyword&channeltype=0&orderby=&kwtype=0&pagesize=10&typeid=0&TotalResult=279&PageNo=4
class get_urldic:
    def __init__(self):
        self.first_url = 'http://s.dydytt.net/plus/search.php?'
        self.second_url = '&searchtype=titlekeyword&channeltype=0&orderby=&kwtype=0&pagesize=10&typeid=0&TotalResult=279&PageNo='
        self.info_url = 'http://s.dydytt.net'
    # Read the search keyword and page count, then build the list of search-page URLs
    def get_url(self):
        urlList = []
        try:
            search = input("Please input search name:")
            dic = {'keyword': search}
            # The search endpoint expects the keyword percent-encoded as GB2312
            keyword_dic = parse.urlencode(dic, encoding='gb2312')
            page = int(input("Please input page:"))
        except Exception as e:
            print('Input error:', e)
            exit()
        for num in range(1, page + 1):
            url = self.first_url + keyword_dic + self.second_url + str(num)
            urlList.append(url)
        print("Please wait....")
        print(urlList)
        return urlList, search

    # Fetch each search page; the site serves GBK, so decode accordingly
    def get_html(self, urlList):
        response_list = []
        for r_num in urlList:
            response = requests.get(r_num)
            html = response.content.decode('gbk', 'ignore')
            response_list.append(html)
        return response_list

    # Extract each movie title and its detail-page URL from the result tables
    def get_soup(self, html_doc):
        result = {}
        for g_num in html_doc:
            soup = BeautifulSoup(g_num, 'html.parser')
            context = soup.find_all('td', width="55%")
            for i in context:
                title = i.get_text()
                result[title.strip()] = self.info_url + i.b.a['href']
        return result

    # Visit each detail page and pull out the publish date, Douban rating and download link
    def get_info(self, info_dic):
        info_tmp = []
        for k, v in info_dic.items():
            print(v)
            response = requests.get(v)
            new_response = response.content.decode('gbk', 'ignore')
            soup = BeautifulSoup(new_response, 'html.parser')
            divs = soup.find_all('div', class_="co_content8")
            info_list1 = []
            for context in divs:
                result = list(context.get_text().split())
                for i in range(0, len(result)):
                    if '发布' in result[i]:
                        info_list1.append(result[i])
                    elif "豆瓣" in result[i] and i + 1 < len(result):
                        info_list1.append(result[i] + result[i + 1])
                    elif "【下载地址】" in result[i] and i + 1 < len(result):
                        info_list1.append(result[i] + result[i + 1])
            info_tmp.append(info_list1)
        return info_tmp

if __name__ == '__main__':
    blog = get_urldic()
    urllist, search = blog.get_url()
    html_doc = blog.get_html(urllist)
    result = blog.get_soup(html_doc)
    for k, v in result.items():
        print('search movie_name is:%s,movie_url is:%s' % (k, v))
    info_list = blog.get_info(result)
    for info in info_list:
        print(info)
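
To make the hand-off to the Excel module easier to follow: get_soup() returns a dict mapping each movie title to its detail-page URL, and get_info() returns one list per movie. The values below are made-up placeholders that only illustrate the shape:

result = {'Example Movie Title': 'http://s.dydytt.net/html/example.html'}
info_list = [['发布时间:2018-01-01', '豆瓣评分8.0/10', '【下载地址】ftp://example.com/movie.mkv']]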

The getexceldytt code is as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author  : kaliarch

import xlsxwriter

class create_excel:
    def __init__(self):
        self.tag_list = ["movie_name", "movie_url"]
        self.info = "information"

    def create_workbook(self, search=" "):
        # Name the workbook after the search keyword
        excel_name = search + '.xlsx'
        workbook = xlsxwriter.Workbook(excel_name)
        worksheet_M = workbook.add_worksheet(search)
        worksheet_info = workbook.add_worksheet(self.info)
        print('create %s....' % excel_name)
        return workbook, worksheet_M, worksheet_info

    def col_row(self, worksheet):
        worksheet.set_row(0, 17)
        worksheet.set_column('A:A', 58)
        worksheet.set_column('B:B', 58)

    def shell_format(self, workbook):
        # Merged-header format
        merge_format = workbook.add_format({
            'bold': 1,
            'border': 1,
            'align': 'center',
            'valign': 'vcenter',
            'fg_color': '#FAEBD7'
        })
        # Column-title format
        name_format = workbook.add_format({
            'bold': 1,
            'border': 1,
            'align': 'center',
            'valign': 'vcenter',
            'fg_color': '#E0FFFF'
        })
        # Body format
        normal_format = workbook.add_format({
            'align': 'center',
        })
        return merge_format, name_format, normal_format

    # Write the merged title row
    def write_title(self, worksheet, search, merge_format):
        title = search + " search results"
        worksheet.merge_range('A1:B1', title, merge_format)
        print('write title success')

    # Write the column names
    def write_tag(self, worksheet, name_format):
        tag_row = 1
        tag_col = 0
        for num in self.tag_list:
            worksheet.write(tag_row, tag_col, num, name_format)
            tag_col += 1
        print('write tag success')

    # Write the movie-name/URL pairs (the original early break dropped the last entry)
    def write_context(self, worksheet, con_dic, normal_format):
        row = 2
        for k, v in con_dic.items():
            worksheet.write(row, 0, k, normal_format)
            worksheet.write(row, 1, v, normal_format)
            row += 1
        print('write context success')
    # Write the sub-page details to the second sheet
    def write_info(self, worksheet_info, info_list, normal_format):
        row = 1
        for infomsg in info_list:
            for num in range(0, len(infomsg)):
                worksheet_info.write(row, num, infomsg[num], normal_format)
            row += 1
        print("write info success")

    # Close the workbook
    def workbook_close(self, workbook):
        workbook.close()

if __name__ == '__main__':
    print('This is the create-excel module')
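
For reference, this module can be exercised on its own with placeholder data (the dict and list below are made up to match the shapes the scraper produces):

import getexceldytt

excel = getexceldytt.create_excel()
workbook, worksheet, worksheet_info = excel.create_workbook('test')
excel.col_row(worksheet)
merge_format, name_format, normal_format = excel.shell_format(workbook)
excel.write_title(worksheet, 'test', merge_format)
excel.write_tag(worksheet, name_format)
excel.write_context(worksheet, {'Example Movie': 'http://example.com/a'}, normal_format)
excel.write_info(worksheet_info, [['发布时间:2018-01-01', '豆瓣评分8.0/10', '【下载地址】ftp://example.com/movie.mkv']], normal_format)
excel.workbook_close(workbook)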

The main code is as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import geturldytt
import getexceldytt

# Build the movie-name/URL dict and the detail info
def get_dic():
    blog = geturldytt.get_urldic()
    urllist, search = blog.get_url()
    html_doc = blog.get_html(urllist)
    result = blog.get_soup(html_doc)
    info_list = blog.get_info(result)
    return result, search, info_list

# Write everything to Excel
def write_excel(urldic, search, info_list):
    excel = getexceldytt.create_excel()
    workbook, worksheet, worksheet_info = excel.create_workbook(search)
    excel.col_row(worksheet)
    merge_format, name_format, normal_format = excel.shell_format(workbook)
    excel.write_title(worksheet, search, merge_format)
    excel.write_tag(worksheet, name_format)
    excel.write_context(worksheet, urldic, normal_format)
    excel.write_info(worksheet_info, info_list, normal_format)
    excel.workbook_close(workbook)

def main():
    url_dic, search_name, info_list = get_dic()
    write_excel(url_dic, search_name, info_list)

if __name__ == '__main__':
    main()

3. Results

Run the code and enter a search keyword and the number of result pages to fetch. In the generated Excel file, Sheet1 holds the searched titles and their URLs, while Sheet2 holds the download link and Douban rating for each URL in Sheet1; note that some sub-pages have no Douban rating or no download link.
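
Assuming the entry script is saved as main.py, a run looks roughly like this (the printed URL lists are omitted; the prompts and messages come straight from the code above):

$ python main.py
Please input search name:科幻
Please input page:2
Please wait....
create 科幻.xlsx....
write title success
write tag success
write context success
write info success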



Original article: http://blog.51cto.com/kaliarch/2069544
