goose-extractor: Python package on PyPI

HTML content / article extractor, web scraping

goose-extractor project description

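Intro

Goose was originally an article extractor written in Java that was most recently (August 2011) converted to a Scala project.

This is a complete rewrite in Python. The aim of the software is to take any news article or article-type web page and extract not only the main body of the article but also all metadata and the most probable image candidate.

Goose will try to extract the following information:

Main text of an article
Main image of the article
Any YouTube/Vimeo movies embedded in the article
Meta description
Meta tags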
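
To make the last two items concrete, here is a minimal standard-library sketch of what "meta description" and "meta tags" refer to in a page's HTML. It is not how Goose itself extracts them; `MetaTagParser`, `extract_meta` and `SAMPLE_HTML` are hypothetical names invented for this illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        content = attrs.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(raw_html):
    """Return a dict mapping meta names to their content strings."""
    parser = MetaTagParser()
    parser.feed(raw_html)
    parser.close()
    return parser.meta


# Made-up sample page, used only for this sketch.
SAMPLE_HTML = """
<html><head>
<title>Occupy London loses eviction fight</title>
<meta name="description" content="Occupy London protesters lost their court bid.">
<meta name="keywords" content="occupy, london, eviction">
</head><body><p>Article body.</p></body></html>
"""

meta = extract_meta(SAMPLE_HTML)
print(meta["description"])
print(meta["keywords"])
```

Goose exposes the description as `article.meta_description`, as the session further down shows; the sketch only illustrates where such values live in the markup.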

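The Python version was rewritten by:

Xavier Grangier

Licensing

If you find Goose useful or have issues, please drop me a line. I'd love to hear how you're using it or what features should be improved.

Goose is licensed by Gravity.com under the Apache 2.0 license; see the LICENSE file for more details.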

Setup

mkvirtualenv --no-site-packages goose
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install

Take it for a spin

>>> from goose import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg

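Configuration

There are two ways to pass configuration to Goose. The first is to pass Goose a Configuration() object. The second is to pass a configuration dict.

For instance, to change the user agent Goose uses, pass: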

>>> g = Goose({'browser_user_agent': 'Mozilla'})

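Switching parsers: Goose can now be used with the lxml html parser or the lxml soup parser. By default the html parser is used. If you want to use the soup parser, pass it in the configuration dict: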

>>> g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})
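
The "object or dict" pattern described above can be sketched in isolation. The following is a hedged illustration, not Goose's source: `DemoConfiguration` and `build_config` are hypothetical names, and the default values are placeholders rather than Goose's real defaults.

```python
class DemoConfiguration(object):
    """Hypothetical stand-in for a configuration object with defaults."""

    def __init__(self):
        self.browser_user_agent = "DemoAgent/1.0"  # placeholder default
        self.parser_class = "lxml"                 # placeholder default


def build_config(config=None):
    """Accept a ready-made object, a dict of overrides, or nothing."""
    if isinstance(config, DemoConfiguration):
        return config
    resolved = DemoConfiguration()
    for key, value in (config or {}).items():
        # This sketch ignores keys that are not known attributes.
        if hasattr(resolved, key):
            setattr(resolved, key, value)
    return resolved


from_dict = build_config({"browser_user_agent": "Mozilla", "parser_class": "soup"})
print(from_dict.browser_user_agent, from_dict.parser_class)

prepared = DemoConfiguration()
prepared.browser_user_agent = "Mozilla"
from_object = build_config(prepared)
print(from_object is prepared)
```

A dict is convenient for one-off overrides, while a prepared object is easier to share between several extractor instances.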

Goose is now language aware

For example, scraping a Spanish content page with correct meta language tags:

>>> from goose import Goose
>>> url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Las listas de espera se agravan'
>>> article.cleaned_text[:150]
u'Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay m\xe1s ciudad'

Some pages don't have correct meta language tags; you can force the language using configuration:

>>> from goose import Goose
>>> url = 'http://www.elmundo.es/elmundo/2012/10/28/espana/1351388909.html'
>>> g = Goose({'use_meta_language': False, 'target_language':'es'})
>>> article = g.extract(url=url)
>>> article.cleaned_text[:150]
u'Importante golpe a la banda terrorista ETA en Francia. La Guardia Civil ha detenido en un hotel de Macon, a 70 kil\xf3metros de Lyon, a Izaskun Lesaka y '

Passing {'use_meta_language': False, 'target_language': 'es'} as configuration will force the Spanish language.
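
The interaction between the two options can be sketched as a small decision function. This is an assumption about the behaviour described above, not Goose's actual code; `pick_language` is a hypothetical name, and the defaults shown (`True` and `"en"`) are assumed for illustration.

```python
def pick_language(meta_lang, use_meta_language=True, target_language="en"):
    """Choose the language code used for stop-word analysis (sketch).

    meta_lang is whatever the page declares, e.g. "es" or "es-ES",
    or None when the page declares nothing.
    """
    if use_meta_language and meta_lang:
        # Keep only the two-letter language part of values like "es-ES".
        return meta_lang[:2].lower()
    return target_language


print(pick_language("es-ES"))
print(pick_language(None))
print(pick_language("en", use_meta_language=False, target_language="es"))
```

The third call mirrors the forced configuration: the page's own declaration is ignored and the target language wins.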

Video extraction

>>> import goose
>>> url = 'http://www.liberation.fr/politiques/2013/08/12/journee-de-jeux-pour-ayrault-dans-les-jardins-de-matignon_924350'
>>> g = goose.Goose({'target_language':'fr'})
>>> article = g.extract(url=url)
>>> article.movies
[<goose.videos.videos.Video object at 0x25f60d0>]
>>> article.movies[0].src
'http://sa.kewego.com/embed/vp/?language_code=fr&playerKey=1764a824c13c&configKey=dcc707ec373f&suffix=&sig=9bc77afb496s&autostart=false'
>>> article.movies[0].embed_code
'<iframe src="http://sa.kewego.com/embed/vp/?language_code=fr&amp;playerKey=1764a824c13c&amp;configKey=dcc707ec373f&amp;suffix=&amp;sig=9bc77afb496s&amp;autostart=false" frameborder="0" scrolling="no" width="476" height="357"/>'
>>> article.movies[0].embed_type
'iframe'
>>> article.movies[0].width
'476'
>>> article.movies[0].height
'357'
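
Note that `width` and `height` come back as strings in the session above. A small helper can convert them before any arithmetic. This is only a sketch: `DemoVideo` is a hypothetical stand-in that mirrors the attribute names shown above, not Goose's `Video` class.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the attribute names used in the session above.
DemoVideo = namedtuple("DemoVideo", "src embed_type width height")


def video_dimensions(video):
    """Return (width, height) as ints, or None if either is missing or non-numeric."""
    try:
        return int(video.width), int(video.height)
    except (TypeError, ValueError):
        return None


def aspect_ratio(video):
    """Return width / height as a float, or None when it cannot be computed."""
    dims = video_dimensions(video)
    if dims is None or dims[1] == 0:
        return None
    return float(dims[0]) / dims[1]


movie = DemoVideo("http://example.com/embed", "iframe", "476", "357")
print(video_dimensions(movie))
print(aspect_ratio(movie))
```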

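Goose in Chinese

Some users want to use Goose for Chinese content. Chinese word segmentation is much harder to deal with than in Western languages. Chinese needs a dedicated stop-word analyser that has to be passed to the config object: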

>>> from goose import Goose
>>> from goose.text import StopWordsChinese
>>> url  = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
>>> g = Goose({'stopwords_class': StopWordsChinese})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
香港行政长官梁振英在各方压力下就其大宅的违章建筑（僭建）问题到立法会接受质询，并向香港民众道歉。

梁振英在星期二（12月10日）的答问大会开始之际在其演说中道歉，但强调他在违章建筑问题上没有隐瞒的意图和动机。

一些亲北京阵营议员欢迎梁振英道歉，且认为应能获得香港民众接受，但这些议员也质问梁振英有
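
Why a dedicated analyser is needed can be shown with a toy comparison: splitting on whitespace finds stop words in English but not in Chinese, which has no spaces between words. The stop-word sets and function names below are tiny hypothetical examples, and the character-level count is a crude stand-in for real word segmentation, not what `StopWordsChinese` does.

```python
# -*- coding: utf-8 -*-

TEXT_EN = u"the protesters lost their bid in the court of appeal"
TEXT_ZH = u"香港行政长官梁振英在各方压力下就其大宅的违章建筑问题到立法会接受质询"

# Tiny illustrative stop-word sets, not Goose's real lists.
STOPWORDS_EN = {u"the", u"in", u"of", u"their"}
STOPWORDS_ZH = {u"在", u"的", u"就", u"到"}


def count_stopwords_by_whitespace(text, stopwords):
    """Count stop words after splitting on whitespace."""
    return sum(1 for token in text.split() if token in stopwords)


def count_stopwords_by_character(text, stopwords):
    """Count single-character stop words; a crude stand-in for segmentation."""
    return sum(1 for char in text if char in stopwords)


print(count_stopwords_by_whitespace(TEXT_EN, STOPWORDS_EN))
print(count_stopwords_by_whitespace(TEXT_ZH, STOPWORDS_ZH))
print(count_stopwords_by_character(TEXT_ZH, STOPWORDS_ZH))
```

The whitespace approach sees the whole Chinese sentence as a single token and therefore finds no stop words in it, which is why a language-specific analyser has to be supplied.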

Goose in Arabic

In order to use Goose in Arabic you have to use the StopWordsArabic class.

>>> from goose import Goose
>>> from goose.text import StopWordsArabic
>>> url = 'http://arabic.cnn.com/2013/middle_east/8/3/syria.clashes/index.html'
>>> g = Goose({'stopwords_class': StopWordsArabic})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
دمشق، سوريا (CNN) -- أكدت جهات سورية معارضة أن فصائل مسلحة معارضة لنظام الرئيس بشار الأسد وعلى صلة بـ"الجيش الحر" تمكنت من السيطرة على مستودعات للأسل

Goose in Korean

In order to use Goose in Korean you have to use the StopWordsKorean class.

>>> from goose import Goose
>>> from goose.text import StopWordsKorean
>>> url='http://news.donga.com/3/all/20131023/58406128/1'
>>> g = Goose({'stopwords_class':StopWordsKorean})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
경기도 용인에 자리 잡은 민간 시험인증 전문기업 ㈜디지털이엠씨(www.digitalemc.com).
14년째 세계 각국의 통신·안전·전파 규격 시험과 인증 한 우물만 파고 있는 이 회사 박채규 대표가 만나기로 한 주인공이다.
그는 전기전자·무선통신·자동차 전장품 분야에

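Known issues

There are some issues with Unicode URLs.

Cookie handling: some websites need cookie handling. At the moment the only workaround is to use raw_html extraction. For instance: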

>>> import urllib2
>>> import goose
>>> url = "http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp"
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
>>> response = opener.open(url)
>>> raw_html = response.read()
>>> g = goose.Goose()
>>> a = g.extract(raw_html=raw_html)
>>> a.cleaned_text
u'CAIRO \u2014 For a moment, at least, American and European diplomats trying to defuse the volatile standoff in Egypt thought they had a breakthrough.\n\nAs t'
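
The download step of this workaround can be wrapped in small helpers that keep an explicit cookie jar, so cookies persist across several requests. This is a sketch using only the standard library; `make_cookie_opener` and `fetch_raw_html` are hypothetical names, and the imports are written to work on both Python 2 and Python 3.

```python
try:  # Python 2, as used in the session above
    from urllib2 import build_opener, HTTPCookieProcessor
    from cookielib import CookieJar
except ImportError:  # Python 3
    from urllib.request import build_opener, HTTPCookieProcessor
    from http.cookiejar import CookieJar


def make_cookie_opener():
    """Return (opener, jar); the jar stores cookies set by visited sites."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def fetch_raw_html(url, opener, timeout=10):
    """Download a page through the cookie-aware opener and return its bytes."""
    response = opener.open(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


opener, jar = make_cookie_opener()
# The network call and the Goose call are left commented out in this sketch:
# raw_html = fetch_raw_html(url, opener)
# article = goose.Goose().extract(raw_html=raw_html)
```

Reusing the same opener for subsequent pages on the same site sends back the cookies collected in `jar`.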

TODO

Video HTML5 tag extraction

goose-extractor 1.0.25