NLTK fails to generate both bigrams and trigrams in the same run

Posted 2024-10-03 06:21:59

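I am trying to pass text to the script below and have it output both bigrams and trigrams in one run. This is roughly my sixth attempt, and for some reason it only produces the first kind of n-gram and not the other. I have tried changing the order of the calls and a variety of other things.

Here is the current script: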

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import os
import sys
from datetime import datetime, timedelta
import random
import nltk
from nltk.collocations import *
import re
import json
from pprint import pprint


def bigram_generator(important_words, gram_dict):
    finder = BigramCollocationFinder.from_words(important_words, 2)

    for bigram, count in finder.ngram_fd.items():
          gram_dict[' '.join(bigram)] = count

    return gram_dict

def trigram_generator(important_words, gram_dict):
    finder1 = TrigramCollocationFinder.from_words(important_words, 3)

    for trigram, count in finder1.ngram_fd.items():
          gram_dict[' '.join(trigram)] = count

    return gram_dict

def execute_gram_analysis2(important_words):
    bigram_dict = {}
    for x in range(1,10):
        bigram_dict = bigram_generator(important_words, bigram_dict)

    trigram_dict = {}
    for y in range(1,10):
        trigram_dict = trigram_generator(important_words, trigram_dict)

    return bigram_dict, trigram_dict

def convert_gram_dict_to_json(gram_dict):
    json_grams_dict = json.dumps(gram_dict, ensure_ascii=False)
    return json_grams_dict


stopwords = nltk.corpus.stopwords.words('english')

scraped_url_id = 2

s = scraped_urls.select().where(scraped_urls.c.id==scraped_url_id)
results = monitor_bot_conn.execute(s)
for row in results:
    row_id = row[0]
    text = row[6]

    print (text)

    words = re.findall(r'\w+', text.decode('utf-8'))

    words_lowercase = []
    for word in words:
        words_lowercase.append(word.lower())


    important_words = filter(lambda x: x not in stopwords, words_lowercase)

    bigrams_dict, trigrams_dict = execute_gram_analysis2(important_words)
    json_bigrams_dict = convert_gram_dict_to_json(bigrams_dict)
    print ('\n\n---[ BIGRAMS ]---\n\n')
    pprint (json_bigrams_dict)

    json_trigrams_dict = convert_gram_dict_to_json(trigrams_dict)
    print ('\n\n---[ TRIGRAMS ]---\n\n')
    pprint (json_trigrams_dict)

Running the script above on the source text below, I get the following output:

    ---[ SOURCE TEXT ]---
b'A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing?not even particles and electromagnetic radiation such as light?can escape from inside it.[1] The theory of general relativity predicts that a sufficiently compact mass can deform spacetime to form a black hole.[2][3] The boundary of the region from which no escape is possible is called the event horizon. Although the event horizon has an enormous effect on the fate and circumstances of an object crossing it, no locally detectable features appear to be observed.[4] In many ways a black hole acts like an ideal black body, as it reflects no light.[5][6] Moreover, quantum field theory in curved spacetime predicts that event horizons emit Hawking radiation, with the same spectrum as a black body of a temperature inversely proportional to its mass. This temperature is on the order of billionths of a kelvin for black holes of stellar mass, making it essentially impossible to observe.\n\nObjects whose gravitational fields are too strong for light to escape were first considered in the 18th century by John Michell and Pierre-Simon Laplace.[7] The first modern solution of general relativity that would characterize a black hole was found by Karl Schwarzschild in 1916, although its interpretation as a region of space from which nothing can escape was first published by David Finkelstein in 1958. Black holes were long considered a mathematical curiosity; it was during the 1960s that theoretical work showed they were a generic prediction of general relativity. The discovery of neutron stars in the late 1960s sparked interest in gravitationally collapsed compact objects as a possible astrophysical reality.\n'


---[ BIGRAMS OUTPUT]---

('{"black hole": 4, "hole region": 1, "region spacetime": 1, "spacetime '
 'exhibiting": 1, "exhibiting strong": 1, "strong gravitational": 1, '
 '"gravitational effects": 1, "effects nothing": 1, "nothing even": 1, "even '
 'particles": 1, "particles electromagnetic": 1, "electromagnetic radiation": '
 '1, "radiation light": 1, "light escape": 2, "escape inside": 1, "inside 1": '
 '1, "1 theory": 1, "theory general": 1, "general relativity": 3, "relativity '
 'predicts": 1, "predicts sufficiently": 1, "sufficiently compact": 1, '
 '"compact mass": 1, "mass deform": 1, "deform spacetime": 1, "spacetime '
 'form": 1, "form black": 1, "hole 2": 1, "2 3": 1, "3 boundary": 1, "boundary '
 'region": 1, "region escape": 1, "escape possible": 1, "possible called": 1, '
 '"called event": 1, "event horizon": 2, "horizon although": 1, "although '
 'event": 1, "horizon enormous": 1, "enormous effect": 1, "effect fate": 1, '
 '"fate circumstances": 1, "circumstances object": 1, "object crossing": 1, '
 '"crossing locally": 1, "locally detectable": 1, "detectable features": 1, '
 '"features appear": 1, "appear observed": 1, "observed 4": 1, "4 many": 1, '
 '"many ways": 1, "ways black": 1, "hole acts": 1, "acts like": 1, "like '
 'ideal": 1, "ideal black": 1, "black body": 2, "body reflects": 1, "reflects '
 'light": 1, "light 5": 1, "5 6": 1, "6 moreover": 1, "moreover quantum": 1, '
 '"quantum field": 1, "field theory": 1, "theory curved": 1, "curved '
 'spacetime": 1, "spacetime predicts": 1, "predicts event": 1, "event '
 'horizons": 1, "horizons emit": 1, "emit hawking": 1, "hawking radiation": 1, '
 '"radiation spectrum": 1, "spectrum black": 1, "body temperature": 1, '
 '"temperature inversely": 1, "inversely proportional": 1, "proportional '
 'mass": 1, "mass temperature": 1, "temperature order": 1, "order billionths": '
 '1, "billionths kelvin": 1, "kelvin black": 1, "black holes": 2, "holes '
 'stellar": 1, "stellar mass": 1, "mass making": 1, "making essentially": 1, '
 '"essentially impossible": 1, "impossible observe": 1, "observe objects": 1, '
 '"objects whose": 1, "whose gravitational": 1, "gravitational fields": 1, '
 '"fields strong": 1, "strong light": 1, "escape first": 2, "first '
 'considered": 1, "considered 18th": 1, "18th century": 1, "century john": 1, '
 '"john michell": 1, "michell pierre": 1, "pierre simon": 1, "simon laplace": '
 '1, "laplace 7": 1, "7 first": 1, "first modern": 1, "modern solution": 1, '
 '"solution general": 1, "relativity would": 1, "would characterize": 1, '
 '"characterize black": 1, "hole found": 1, "found karl": 1, "karl '
 'schwarzschild": 1, "schwarzschild 1916": 1, "1916 although": 1, "although '
 'interpretation": 1, "interpretation region": 1, "region space": 1, "space '
 'nothing": 1, "nothing escape": 1, "first published": 1, "published david": '
 '1, "david finkelstein": 1, "finkelstein 1958": 1, "1958 black": 1, "holes '
 'long": 1, "long considered": 1, "considered mathematical": 1, "mathematical '
 'curiosity": 1, "curiosity 1960s": 1, "1960s theoretical": 1, "theoretical '
 'work": 1, "work showed": 1, "showed generic": 1, "generic prediction": 1, '
 '"prediction general": 1, "relativity discovery": 1, "discovery neutron": 1, '
 '"neutron stars": 1, "stars late": 1, "late 1960s": 1, "1960s sparked": 1, '
 '"sparked interest": 1, "interest gravitationally": 1, "gravitationally '
 'collapsed": 1, "collapsed compact": 1, "compact objects": 1, "objects '
 'possible": 1, "possible astrophysical": 1, "astrophysical reality": 1}')

---[ TRIGRAMS OUTPUT ]---

'{}'

I don't understand why this script can't produce both the bigram and the trigram output.

Thanks in advance for your help.


1 answer
User
Reply 1 · posted 2024-10-03 06:21:59

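In Python 3, filter returns a lazy iterator, not a list. Once you have iterated over it, it is exhausted and yields nothing more. In your script the first call to bigram_generator consumes important_words, so every later call, including all the calls to trigram_generator, receives an empty sequence, which is why the trigram output is '{}'. If you need to traverse the filtered words more than once, convert the iterator to a list first: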

important_words = list(filter(lambda x: x not in stopwords, words_lowercase))
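The exhaustion behaviour can be reproduced without NLTK or the database. The following is only a minimal sketch: `count_ngrams` and the tiny `STOPWORDS` set are hypothetical stand-ins for the NLTK collocation finders and `nltk.corpus.stopwords`, and the sample sentence is invented for illustration.

```python
from collections import Counter

# Tiny stand-in for nltk's English stopword list (assumption for this sketch).
STOPWORDS = {"a", "is", "of", "the"}


def count_ngrams(words, n):
    """Count space-joined n-grams; hypothetical stand-in for the NLTK finders."""
    tokens = list(words)  # drains `words` if it is a one-shot iterator
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


text = "A black hole is a region of spacetime near the black hole"
words_lowercase = [word.lower() for word in text.split()]

# Lazy iterator: the first consumer drains it, the second one sees nothing.
lazy_words = filter(lambda w: w not in STOPWORDS, words_lowercase)
lazy_bigrams = count_ngrams(lazy_words, 2)
lazy_trigrams = count_ngrams(lazy_words, 3)

# Materialized list: it can be traversed any number of times.
important_words = list(filter(lambda w: w not in STOPWORDS, words_lowercase))
bigrams = count_ngrams(important_words, 2)
trigrams = count_ngrams(important_words, 3)

print(len(lazy_bigrams), len(lazy_trigrams))
print(len(bigrams), len(trigrams))
```

With the list in place, the `for x in range(1,10)` and `for y in range(1,10)` loops in execute_gram_analysis2 should not be needed either: each pass over the same words writes the same counts into the dictionary, so a single call to each generator is likely enough.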
