Description:
Libraries used
1. time
2. json
3. requests
4. BeautifulSoup (from bs4)
5. RequestException (from requests.exceptions)
Lab code:
""" 作者:李舵 日期:2019-4-27 功能:抓取豆瓣电影top250 版本:V1.0 """ import time import json import requests from bs4 import BeautifulSoup from requests.exceptions import RequestException def get_one_page(url): try: headers = {‘User-Agent‘: ‘Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36‘} response = requests.get(url, headers=headers) if response.status_code == 200: return response.text return None except RequestException: return None def parse_one_page(html): soup = BeautifulSoup(html, ‘lxml‘) ol_list = soup.find(‘ol‘, {‘class‘: ‘grid_view‘}) li_list = ol_list.find_all(‘li‘) for i in range(25): move_value = li_list[i] yield { ‘index‘: move_value.find(‘em‘, {‘class‘: ‘‘}).text.strip(), ‘title‘: move_value.find(‘span‘, {‘class‘: ‘title‘}).text.strip(), ‘actor‘: move_value.find(‘p‘, {‘class‘: ‘‘}).text.strip(), ‘score‘: move_value.find(‘span‘, {‘class‘: ‘rating_num‘}).text.strip() } def write_to_file(content): with open(‘result.txt‘, ‘a‘, encoding=‘utf-8‘) as f: print(type(json.dumps(content))) f.write(json.dumps(content, ensure_ascii=False)+‘\n‘) def main(start): url = ‘https://movie.douban.com/top250?start=‘ + str(start) html = get_one_page(url) for item in parse_one_page(html): print(item) write_to_file(item) if __name__ == ‘__main__‘: for i in range(11): main(start=i * 25) time.sleep(1)
Additional notes:
1.
Original article: https://www.cnblogs.com/liduo0413/p/10779802.html
Time: 2024-10-04 17:47:42