How to check a list against a tokenized column in pandas

<section name="thisisaxml-file">
  <topic>
    <utterance name="John Doe" id="264"> foo bar? </utterance>
    <utterance name="Henry Parker" id="265"> foo foo bar. New York, wind. </utterance>
  </topic>
</section>

import pandas as pd
import xml.etree.ElementTree as ET
import nltk
from nltk.tokenize import word_tokenize

# xml file data input
xml_data = 'sample.xml'

# create an ElementTree object
etree = ET.parse(xml_data)
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))

hedge = ['foo', 'wind', 'base']

df = pd.DataFrame({'utterance': doc_df['utterance']})
df['id'] = pd.DataFrame({'id': doc_df['id']})
df['name'] = pd.DataFrame({'name': doc_df['name']})
df['tokenized_sents'] = df.apply(lambda row: word_tokenize(row['utterance']), axis=1)
df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)

final = df[df.tokenized_sents.apply(lambda x: hedge in x)]
final.to_csv('out.csv', sep='\t', encoding='utf-8')  # prints to file

1 Answer

Anonymous user

#1 · Posted 2024-09-26 18:18:56

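The variable df is already a DataFrame built from a dictionary. Creating DataFrames inside DataFrames corrupts your data, or at least I have seen it corrupt some of mine. If that is not the case, I would like to know how. Anyway, I don't know whether this will solve your problem, but it will certainly clean up your code: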

hedge = ['foo', 'wind', 'base']
df = pd.DataFrame({
    'utterance': doc_df['utterance'],
    'id': doc_df['id'],
    'name': doc_df['name'],
})
df['tokenized_sents'] = df.apply(lambda row: word_tokenize(row['utterance']), axis=1)
df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)
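
As for the filter itself: `hedge in x` asks whether the whole hedge list is a single element of the token list, which is never true for these rows, so the result would come out empty. Below is a minimal sketch of one possible fix, assuming the goal is to keep every row whose tokens contain at least one hedge word. The token lists are written by hand as a stand-in for the `word_tokenize` output, and the third row (`Jane Roe`, id `266`) and the `matched` column are hypothetical additions for illustration, not part of the original data.

```python
import pandas as pd

hedge = ['foo', 'wind', 'base']

# Hand-written token lists standing in for word_tokenize output (assumption).
# The third row is a made-up example with no hedge word in it.
df = pd.DataFrame({
    'id': ['264', '265', '266'],
    'name': ['John Doe', 'Henry Parker', 'Jane Roe'],
    'tokenized_sents': [
        ['foo', 'bar', '?'],
        ['foo', 'foo', 'bar', '.', 'New', 'York', ',', 'wind', '.'],
        ['nothing', 'here', '.'],
    ],
})

# A set makes the membership test cheap and easy to read.
hedge_set = set(hedge)

# Keep rows whose token list shares at least one word with hedge.
mask = df['tokenized_sents'].apply(
    lambda tokens: not hedge_set.isdisjoint(tokens)
)
final = df[mask]

# Optionally record which hedge words were found in each row.
final = final.assign(
    matched=final['tokenized_sents'].apply(
        lambda tokens: sorted(hedge_set.intersection(tokens))
    )
)

print(final[['id', 'name', 'matched']])
```

The comparison here is case-sensitive; if that matters, lowercasing the tokens before the check would be one way to handle it.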
