Python 2.7 x32: NLTK Punkt tokenizer does not detect sentences correctly

Published 2024-10-02 00:36:11


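The text is http://pastebin.com/gD65sS22, the abstract of a paper about viral media. I am using the example code from http://www.nltk.org/api/nltk.tokenize.html, i.e. I read the text file, load the punkt/english.pickle tokenizer, and print the resulting sentences.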

Basically, the output is bad. Almost none of the "e.g." occurrences are correctly ignored, and several citations are split badly as well.

Is this just a weakness of NLTK, or am I doing something wrong? Should I look into using regex instead?


2 answers

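First, your text is somewhat noisy: if nltk.sent_tokenize sees a \r\n line break, it will break there and use it as a sentence boundary. Second, sent_tokenize is not very good with text where a period that looks like a sentence-final full stop appears mid-sentence, as it does after abbreviations. E.g.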

from urllib.request import urlopen, Request
from nltk import sent_tokenize

# Fetch the raw paste.
request = Request("http://pastebin.com/raw.php?i=gD65sS22")
response = urlopen(request)

# Join the lines with spaces so that no \r\n is left in the text.
text = " ".join(response.read().decode('utf8').split('\r\n'))

for sent in sent_tokenize(text):
    print(sent)

[out]:


Now let's try some hacks:

from urllib.request import urlopen, Request
from nltk import sent_tokenize

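def hack(text):
    # Hide "et al. " from the tokenizer by gluing it to the next word.
    return text.replace('et al. ', 'et_al._')

def unhack(text):
    # Undo the substitution after tokenizing.
    return text.replace('et_al._', 'et al. ')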


request = Request("http://pastebin.com/raw.php?i=gD65sS22")
response = urlopen(request)

text = " ".join(response.read().decode('utf8').split('\r\n'))
text = hack(text)

for sent in sent_tokenize(text):
    print(unhack(sent))

[out]:

Testing text / paper abstract taken from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10505  Viral videos have become a staple of the social Web.
The term refers to videos that are uploaded to video sharing sites such as YouTube, Vimeo, or Blip.tv and more or less quickly gain the attention of millions of people.
Viral videos mainly contain humorous content such as bloopers in television shows (eg.
boom goes the dynamite) or quirky Web productions (eg.
nyan cat).
Others show extraordinary events caught on video (eg.
battle at Kruger) or contain political messages (eg.
kony 2012).
The arguably most prominent example, however, is the music video Gangnam style by PSY which, as of January 2015, has been viewed over 2 billion times on YouTube.
Yet, while the recent surge in viral videos has been attributed to the availability of affordable digital cameras and video sharing sites (Grossman 2006), viral Web videos predate modern social media.
An example is the dancing baby which appeared in 1996 and was mainly shared via email.
The fact that videos became Internet phenomena already before the first video sharing sites appeared suggests that collective attention to viral videos may spread in form of a contact process.
Put differently, it seems reasonable to surmise that attention to viral videos spreads through the Web very much as viruses spread through the world.
Indeed, the times series shown in Fig.
1 support this intuition.
They show exemplary developments of YouTube view counts and Google searches related to recent viral videos and closely resemble the progress of infection counts often observed in epidemic outbreaks.
However, although viral videos attract growing research efforts, the suitability of the viral metaphor was apparently not studied systematically yet.
In this paper, we therefore ask to what extend the dynamics in Fig.
1 can be explained in terms of the dynamics of epidemics?
This question extends existing viral video research which, so far, can be distinguished into two broad categories: On the one hand, researchers especially in the humanities and in marketing, ask for what it is that draws attention to viral videos (Burgess 2008; Southgate, Westoby, and Page 2010).
In a recent study, Shifman (2012) looked at attributes common to viral videos and, based on a corpus of 30 prominent examples, identified six predominant features, namely: focus on ordinary people, flawed masculinity, humor, simplicity, repetitiveness, and whimsical content.
However, while he argues that these attributes mark a video as incomplete or flawed and therefore invoke further attention or creative dialogue, the presence of these key signifiers does not im- ply virality.
After all, there are millions of videos that show these attributes but never attract significant viewership Another popular line of research, especially among data scientists, therefore consists in analyzing viewing patterns of viral videos.
For instance, Figueiredo et al. (2011) found that the temporal dynamics of view counts of YouTube videos seem to depend on whether or not the material is copyrighted.
While copyrighted videos (typically music videos) were observed to reach peak popularity early in their lifetime, other viral videos had been available for quite some time  before  they  experienced  sudden  significant  bursts  in popularity.
In addition, the authors observed that these bursts depended  on  external  factors  such  as  being  listed  on  the YouTube front page.
The importance of external effects for the viral success of a video was also noted by Broxton et al. (2013) who found that viewership patterns of YouTube videos strongly depend on referrals from sites such as Face- book or Twitter.
In particular, they observed that ‘social’ videos with many outside referrals rise to and fall from peak popularity much quicker than ‘less social’ ones.
Sudden bursts in view counts seem to be suitable predictors of a video’s future popularity (Crane and Sornette 2008; Pinto, Almeida, and Goncalves 2013; Jiang et al. 2014).
In fact, it appears that initial view count statistics combined with additional information as to, say, video related sharing activities in other social media, allow for predicting whether or not a video will ’go viral’ soon (Shamma et al. 2011; Jain, Manweiler, and Choudhury 2014).
Yet, Broxton et al. (2013) point out that not all ‘social’ videos go viral and not all viral videos are indeed ‘social’.
Given this interest in video related time series analysis, it is surprising that the viral metaphor has not been scrutinized from this angle.
To the best of our knowledge, the most closely related work is found in a recent report by CintroArias (2014) who attempted to match an intricate infectious disease model to view count data for the video Gangnam Style.
We, too, investigate the attention dynamics of viral videos from the point of view of mathematical epidemiology and present results based on a data set of more than 800 time series.
Our contributions are of theoretical and empirical nature, namely: 1) we introduce a simple yet expressive probabilistic model of the dynamics of epidemics; in contrast to traditional approaches, our model admits a closed form expression for the evolution of infected counts and we show that it amounts to the convolution of two geometric distributions 2) we introduce a time continuous characterization of this result; major advantages of this continuous model are that it is analytically tractable and allows for the use of highly robust maximum likelihood techniques in model fitting as well as for easily interpretable results 3) we fit our model to YouTube view count data and Google Trends time series which reflect collective attention to prominent viral videos and find it to fit well.
Our work therefore constitutes a data scientific approach towards viral video research.
However, it is model- rather than data driven.
This way, we follow arguments brought forth, for instance, by Bauckhage et al. (2013) or Lazer et al. (2014) who criticized the lack of interpretability and the ‘big data hubris’ of purely data driven approaches for their potential of over-fitting and misleading results.
Our presentation proceeds as follows: Next, we review concepts from mathematical epidemiology, briefly discuss approaches based on systems of differential equations, and introduce the probabilistic model that forms the basis for our study; mathematical details behind this model are deferred to the Appendix.
Then, we present the data we analyzed and discuss our empirical results.
We conclude by summarizing our approach, results, and implications of our findings.

Now it looks a lot better, but there are still problems with the (eg. ... and Fig. ... cases. Let's keep hacking:

from urllib.request import urlopen, Request
from nltk import sent_tokenize

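def hack(text):
    # Glue each abbreviation to the next word so no "period + space" is left.
    text = text.replace('et al. ', 'et_al._')
    text = text.replace('eg. ', 'eg._')
    text = text.replace('Fig. ', 'Fig._')
    return text

def unhack(text):
    text = text.replace('et_al._', 'et al. ')
    # Note: 'eg._' and 'Fig._' are restored without the trailing space, which
    # is why the output shows "eg.boom" and "Fig.1"; use 'eg. ' and 'Fig. '
    # as replacements to keep the space.
    text = text.replace('eg._', 'eg.')
    text = text.replace('Fig._', 'Fig.')
    return text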


request = Request("http://pastebin.com/raw.php?i=gD65sS22")
response = urlopen(request)

text = " ".join(response.read().decode('utf8').split('\r\n'))
text = hack(text)

for sent in sent_tokenize(text):
    print(unhack(sent))

[out]:

Testing text / paper abstract taken from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10505  Viral videos have become a staple of the social Web.
The term refers to videos that are uploaded to video sharing sites such as YouTube, Vimeo, or Blip.tv and more or less quickly gain the attention of millions of people.
Viral videos mainly contain humorous content such as bloopers in television shows (eg.boom goes the dynamite) or quirky Web productions (eg.nyan cat).
Others show extraordinary events caught on video (eg.battle at Kruger) or contain political messages (eg.kony 2012).
The arguably most prominent example, however, is the music video Gangnam style by PSY which, as of January 2015, has been viewed over 2 billion times on YouTube.
Yet, while the recent surge in viral videos has been attributed to the availability of affordable digital cameras and video sharing sites (Grossman 2006), viral Web videos predate modern social media.
An example is the dancing baby which appeared in 1996 and was mainly shared via email.
The fact that videos became Internet phenomena already before the first video sharing sites appeared suggests that collective attention to viral videos may spread in form of a contact process.
Put differently, it seems reasonable to surmise that attention to viral videos spreads through the Web very much as viruses spread through the world.
Indeed, the times series shown in Fig.1 support this intuition.
They show exemplary developments of YouTube view counts and Google searches related to recent viral videos and closely resemble the progress of infection counts often observed in epidemic outbreaks.
However, although viral videos attract growing research efforts, the suitability of the viral metaphor was apparently not studied systematically yet.
In this paper, we therefore ask to what extend the dynamics in Fig.1 can be explained in terms of the dynamics of epidemics?
This question extends existing viral video research which, so far, can be distinguished into two broad categories: On the one hand, researchers especially in the humanities and in marketing, ask for what it is that draws attention to viral videos (Burgess 2008; Southgate, Westoby, and Page 2010).
In a recent study, Shifman (2012) looked at attributes common to viral videos and, based on a corpus of 30 prominent examples, identified six predominant features, namely: focus on ordinary people, flawed masculinity, humor, simplicity, repetitiveness, and whimsical content.
However, while he argues that these attributes mark a video as incomplete or flawed and therefore invoke further attention or creative dialogue, the presence of these key signifiers does not im- ply virality.
After all, there are millions of videos that show these attributes but never attract significant viewership Another popular line of research, especially among data scientists, therefore consists in analyzing viewing patterns of viral videos.
For instance, Figueiredo et al. (2011) found that the temporal dynamics of view counts of YouTube videos seem to depend on whether or not the material is copyrighted.
While copyrighted videos (typically music videos) were observed to reach peak popularity early in their lifetime, other viral videos had been available for quite some time  before  they  experienced  sudden  significant  bursts  in popularity.
In addition, the authors observed that these bursts depended  on  external  factors  such  as  being  listed  on  the YouTube front page.
The importance of external effects for the viral success of a video was also noted by Broxton et al. (2013) who found that viewership patterns of YouTube videos strongly depend on referrals from sites such as Face- book or Twitter.
In particular, they observed that ‘social’ videos with many outside referrals rise to and fall from peak popularity much quicker than ‘less social’ ones.
Sudden bursts in view counts seem to be suitable predictors of a video’s future popularity (Crane and Sornette 2008; Pinto, Almeida, and Goncalves 2013; Jiang et al. 2014).
In fact, it appears that initial view count statistics combined with additional information as to, say, video related sharing activities in other social media, allow for predicting whether or not a video will ’go viral’ soon (Shamma et al. 2011; Jain, Manweiler, and Choudhury 2014).
Yet, Broxton et al. (2013) point out that not all ‘social’ videos go viral and not all viral videos are indeed ‘social’.
Given this interest in video related time series analysis, it is surprising that the viral metaphor has not been scrutinized from this angle.
To the best of our knowledge, the most closely related work is found in a recent report by CintroArias (2014) who attempted to match an intricate infectious disease model to view count data for the video Gangnam Style.
We, too, investigate the attention dynamics of viral videos from the point of view of mathematical epidemiology and present results based on a data set of more than 800 time series.
Our contributions are of theoretical and empirical nature, namely: 1) we introduce a simple yet expressive probabilistic model of the dynamics of epidemics; in contrast to traditional approaches, our model admits a closed form expression for the evolution of infected counts and we show that it amounts to the convolution of two geometric distributions 2) we introduce a time continuous characterization of this result; major advantages of this continuous model are that it is analytically tractable and allows for the use of highly robust maximum likelihood techniques in model fitting as well as for easily interpretable results 3) we fit our model to YouTube view count data and Google Trends time series which reflect collective attention to prominent viral videos and find it to fit well.
Our work therefore constitutes a data scientific approach towards viral video research.
However, it is model- rather than data driven.
This way, we follow arguments brought forth, for instance, by Bauckhage et al. (2013) or Lazer et al. (2014) who criticized the lack of interpretability and the ‘big data hubris’ of purely data driven approaches for their potential of over-fitting and misleading results.
Our presentation proceeds as follows: Next, we review concepts from mathematical epidemiology, briefly discuss approaches based on systems of differential equations, and introduce the probabilistic model that forms the basis for our study; mathematical details behind this model are deferred to the Appendix.
Then, we present the data we analyzed and discuss our empirical results.
We conclude by summarizing our approach, results, and implications of our findings.

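But yes, cleaning the text before it goes to the sentence tokenizer is not that hard: just look for the common patterns that break the tokenizer. I hope you get the general idea from the examples above.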
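One way to generalize the hack, short of retraining, is to keep the abbreviations in a list and mask their periods with a placeholder character, so that the spaces survive the round trip. This is only a minimal sketch: the ABBREVIATIONS list, the PLACEHOLDER character and the protect/restore names are illustrative assumptions, not part of NLTK.

```python
import re

# Hypothetical list: extend it with whatever breaks the tokenizer on your corpus.
ABBREVIATIONS = ["et al.", "e.g.", "eg.", "i.e.", "Fig."]

# Assumed not to occur in the input text (U+2024, ONE DOT LEADER).
PLACEHOLDER = "\u2024"

# Longest first, so a longer abbreviation wins over a shorter one it contains.
_ABBREV_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")"
)


def protect(text):
    """Mask the periods inside known abbreviations."""
    return _ABBREV_RE.sub(lambda m: m.group(0).replace(".", PLACEHOLDER), text)


def restore(text):
    """Put the masked periods back."""
    return text.replace(PLACEHOLDER, ".")
```

It would be used like the hack/unhack pair: [restore(s) for s in sent_tokenize(protect(text))]. Because only the period is swapped, "Fig. 1" comes back as "Fig. 1" and not "Fig.1".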

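So the hacks work, but only on this dataset. How do I generalize them? The more general solution is to retrain the punkt tokenizer to get a sentence tokenizer specific to academic text; see "training data format for nltk punkt".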

But do note that you might need a small set of sentence-tokenized text to train the tokenizer. Have fun!
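For the retraining route, NLTK ships a PunktTrainer class in nltk.tokenize.punkt. The following is a rough sketch, assuming you already have a large sample of raw academic text in a string; the train_punkt name is a placeholder. As far as I know the trainer is unsupervised and reads plain text, and the quality of the result depends heavily on how much text it sees.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def train_punkt(raw_text):
    """Learn Punkt parameters (abbreviations, collocations, ...) from raw text."""
    trainer = PunktTrainer()
    # Treat every word pair whose first word ends in a period as a candidate
    # collocation, not only the pairs that follow a known abbreviation.
    trainer.INCLUDE_ALL_COLLOCS = True
    trainer.train(raw_text, finalize=True)
    return PunktSentenceTokenizer(trainer.get_params())
```

The returned object has the same tokenize(text) method as the pretrained tokenizer, so it can replace sent_tokenize in the loops above.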

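Why don't I get an email when someone answers these questions... Anyway. I have been looking into this and eventually found an answer via Google, which is to use NLTK functionality and add more "known" abbrev_types: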

tokenizer._params.abbrev_types.update(extra_abbreviations)

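That covers special abbreviations for particular languages and contexts. But even just using ['e.g', 'al', 'i.e'] improved my results dramatically, which really makes me wonder why the "trained" english pickle does not seem to contain these.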
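The abbrev_types update can be tried out in isolation. The sketch below builds a tokenizer from an empty PunktParameters object instead of the pretrained English model, an assumption made so that it runs without downloading any data. With the pretrained tokenizer the same update would go through its _params attribute, which is private and may change between NLTK versions. Entries are lowercase and written without the final period; 'eg' and 'fig' are additions for this particular text.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Lowercase, without the trailing period.
extra_abbreviations = ['e.g', 'eg', 'al', 'i.e', 'fig']

# Start from empty parameters (assumption: no pretrained model is loaded).
params = PunktParameters()
params.abbrev_types.update(extra_abbreviations)
tokenizer = PunktSentenceTokenizer(params)

text = "Viral videos (e.g. nyan cat) are popular. Broxton et al. (2013) agree."
for sent in tokenizer.tokenize(text):
    print(sent)
```

A tokenizer built this way knows nothing except the listed abbreviations, so it is a way to see the effect of the list rather than a replacement for the trained model.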
