How to join multiple lists for Python BeautifulSoup NLTK analysis


Python newbie here, working on my first web scraping / word-frequency analysis with BeautifulSoup and NLTK.

I'm scraping the Texas Department of Criminal Justice archive of executed offenders' last statements.

I've gotten to the point where I can pull the text I want to analyze from each offender's page and tokenize the words in every paragraph, but it returns a list of tokenized words per paragraph. I'd like to merge those lists into a single list of tokenized words so that I can run the analysis per offender.

I initially thought .join would solve my problem, but it still returns one list per paragraph. I've also tried itertools, with no luck.
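To show the kind of merge I mean, here is a minimal sketch using the standard library's itertools.chain.from_iterable on placeholder token lists (I just haven't figured out how to fit this into my scraping loop):

from itertools import chain

# Placeholder per-paragraph token lists, standing in for word_tokenize output
per_paragraph = [['I', 'am', 'sorry'], ['May', 'God', 'bless', 'you']]

# chain.from_iterable concatenates the inner lists into one token stream
all_tokens = list(chain.from_iterable(per_paragraph))
print(all_tokens)  # ['I', 'am', 'sorry', 'May', 'God', 'bless', 'you']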

Here is the full code for finding the most common word in an offender's statement, but it returns the most common word from each paragraph separately. Any help would be appreciated!

from bs4 import BeautifulSoup
import urllib.request
import re
import nltk
from nltk import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

resp = urllib.request.urlopen("https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html")
soup = BeautifulSoup(resp, "lxml", from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=re.compile('last'))[1:2]:
    lastlist = 'https://www.tdcj.state.tx.us/death_row/'+link['href']
    resp2 = urllib.request.urlopen(lastlist)
    soup2 = BeautifulSoup(resp2, "lxml", from_encoding=resp2.info().get_param('charset'))
    body = soup2.body

    for paragraph in body.find_all('p')[4:5]:
        name = paragraph.text
        print(name)

    for paragraph in body.find_all('p')[6:]:
        tokens = word_tokenize(paragraph.text)
        addWords = ['I',',','Yes','.','\'m','n\'t','?',':','None','To','would','y\'all',')','Last','\'s']
        stopWords = set(stopwords.words('english')+addWords)
        wordsFiltered = []

        for w in tokens:
            if w not in stopWords:
                wordsFiltered.append(w)

        fdist1 = FreqDist(wordsFiltered)
        common = fdist1.most_common(1)
        print(common)

1 Answer
from bs4 import BeautifulSoup
import urllib.request
import re
import nltk
from nltk import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

resp = urllib.request.urlopen("https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html")
soup = BeautifulSoup(resp, "lxml", from_encoding=resp.info().get_param('charset'))

# Declare the accumulator once, outside the loops, so tokens from every
# paragraph end up in a single list
wordsFiltered = []
for link in soup.find_all('a', href=re.compile('last'))[1:2]:
    lastlist = 'https://www.tdcj.state.tx.us/death_row/' + link['href']
    resp2 = urllib.request.urlopen(lastlist)
    soup2 = BeautifulSoup(resp2, "lxml", from_encoding=resp2.info().get_param('charset'))
    body = soup2.body

    # The offender's name is in the fifth <p> of the page
    for paragraph in body.find_all('p')[4:5]:
        name = paragraph.text
        print(name)

    # The statement itself starts at the seventh <p>
    for paragraph in body.find_all('p')[6:]:
        tokens = word_tokenize(paragraph.text)
        addWords = ['I',',','Yes','.','\'m','n\'t','?',':','None','To','would','y\'all',')','Last','\'s']
        stopWords = set(stopwords.words('english') + addWords)

        for w in tokens:
            if w not in stopWords:
                wordsFiltered.append(w)

# Build the frequency distribution once, after all paragraphs are collected
fdist1 = FreqDist(wordsFiltered)
common = fdist1.most_common(1)
print(common)

I've edited your code so that it returns the most common word across the whole statement. Feel free to comment if anything is unclear. Also, remember not to declare a list inside a loop when you append to it on every iteration.
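If you later want one result per offender rather than one overall, the same idea applies one level up: reset the accumulator at the top of the outer loop and build the FreqDist at its bottom. A minimal sketch of that pattern (the offenders dict below is placeholder data standing in for the scraped, tokenized pages):

from nltk import FreqDist

# Hypothetical stand-in for the tokenized paragraphs of two offenders
offenders = {
    'Offender A': [['love', 'you', 'all'], ['peace', 'love']],
    'Offender B': [['warden', 'ready'], ['ready', 'ready']],
}

for name, paragraphs in offenders.items():
    words = []                     # reset once per offender, not per paragraph
    for tokens in paragraphs:
        words.extend(tokens)       # merge the paragraph lists into one list
    print(name, FreqDist(words).most_common(1))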
