1. Introduction
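Python-goose is a Python rewrite of Goose, an article extraction tool originally written in Java. Given any news article or article-style web page, Python-goose aims to extract not only the main body of the article but also its metadata, images, and related information. Chinese web pages are supported.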
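Python-goose can extract the following:
- The main text of the article
- The main image of the article
- Any Youtube/Vimeo videos embedded in the article
- The meta description
- The meta tags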
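As a rough illustration of how these fields might be gathered in one place, here is a minimal sketch. The helper name `summarize_article` is hypothetical. The attributes `cleaned_text`, `top_image.src`, and `meta_description` appear in the usage session in this post; `movies` and `meta_keywords` are assumptions about the article object and should be checked against the installed version.

```python
def summarize_article(article):
    """Collect the fields listed above from a Goose-style article object.

    Uses getattr with defaults so a missing attribute yields an empty
    value instead of raising AttributeError.
    """
    top_image = getattr(article, "top_image", None)
    # `movies` is an assumed attribute name for embedded videos.
    movies = getattr(article, "movies", None) or []
    return {
        "text": getattr(article, "cleaned_text", ""),
        "top_image": getattr(top_image, "src", None),
        "videos": [getattr(m, "src", None) for m in movies],
        "meta_description": getattr(article, "meta_description", ""),
        # `meta_keywords` is an assumed attribute name for the meta tags.
        "meta_keywords": getattr(article, "meta_keywords", ""),
    }
```

With an article returned by `g.extract(url=url)`, calling `summarize_article(article)` would return a plain dict that is easy to log or serialize.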
2. Installation
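    # --no-site-packages applies to older virtualenv releases; newer ones isolate by default and no longer accept the flag
    virtualenv --no-site-packages goose
    cd goose
    # On Windows
    Scripts\activate
    # On Linux
    source bin/activate
    git clone https://github.com/grangier/python-goose.git
    cd python-goose
    pip install -r requirements.txt
    python setup.py install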
3. Usage
    >>> from goose import Goose
    >>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
    >>> g = Goose()
    >>> article = g.extract(url=url)
    >>> article.title
    u'Occupy London loses eviction fight'
    >>> article.meta_description
    "Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
    >>> article.cleaned_text[:150]
    (CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
    >>> article.top_image.src
    http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
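When several pages need processing, the `extract(url=...)` call shown in the session can be wrapped in a small loop. The following is only a sketch: the helper name `extract_titles` is hypothetical, and it assumes nothing about the extractor beyond an `extract(url=...)` method that returns an object with a `title` attribute.

```python
def extract_titles(extractor, urls):
    """Return a {url: title} dict; URLs that fail map to None."""
    results = {}
    for url in urls:
        try:
            article = extractor.extract(url=url)
            results[url] = article.title
        except Exception:
            # A network or parsing failure on one page should not
            # abort the whole batch, so record None and continue.
            results[url] = None
    return results
```

For example, `extract_titles(Goose(), [url])` would reuse a single `Goose` instance for every URL in the list.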
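For Chinese articles, construct Goose with the Chinese stopwords class (a browser user agent is also set here):
    from goose import Goose
    from goose.text import StopWordsChinese
    g = Goose({'browser_user_agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36', 'stopwords_class': StopWordsChinese})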
Reference:
https://pypi.python.org/pypi/goose-extractor/
Time: 2024-10-06 16:56:22