import pandas as pd

texts = ['(nytimes) "news on war" (wsj) "news on business" (nytimes) "more news"']
data = pd.DataFrame(texts, columns=['text'])
# Use extractall to pull out every '(source) "headline"' block
new = data['text'].str.extractall(r'((?:\(\w+\)."\w+.\w+ \w+?"))')
# extractall returns a MultiIndex on the rows (original row, match number),
# so drop the match level to get back to the original row labels
new_1 = new.droplevel(level=1)
# Split each block into its source tag and its quoted headline
val = new_1[0].str.split().str[0]
val2 = new_1[0].str.split().str[1:].apply(lambda x: ' '.join(x))
# Build a helper DataFrame to clean and group the extracted pieces
cleaner = pd.DataFrame()
cleaner['one'] = val
cleaner['two'] = val2
# One list of unique headlines per source, then transpose so sources become columns
cleaner = cleaner.groupby('one').aggregate(lambda tdf: tdf.unique().tolist()).reset_index().T
# cleaner.columns = ['NyTimes', 'WSJ']
cleaner.reset_index(inplace=True)
cleaner.drop(0, axis=0, inplace=True)
cleaner.drop('index', axis=1, inplace=True)
# Once the data is cleaned as expected, add it back to the main DataFrame
data[['NyTimes', 'WSJ']] = cleaner.values
data.head()
I'm sure this code could be improved, but it works without loops.
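Use Series.str.extractall to extract every occurrence of the regex pattern from the strings in the Text column, then group by organization, aggregate the news articles with list, and finally unstack to reshape the result.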
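The extractall, group, aggregate, and unstack steps described above can be sketched as follows. This is only an illustration under stated assumptions: the sample string is reused from the first answer, and the named groups `org` and `news`, the index names `row` and `match`, and the exact pattern are hypothetical choices, not taken from the original answer.

```python
import pandas as pd

# Sample data reused from the first answer; the column name "Text" follows the description.
df = pd.DataFrame(
    {"Text": ['(nytimes) "news on war" (wsj) "news on business" (nytimes) "more news"']}
)

# Assumed pattern: the organization in parentheses, then the quoted article.
pattern = r'\((?P<org>\w+)\)\s?"(?P<news>.*?)"'

# extractall returns one row per match, indexed by (original row, match number).
matches = (
    df["Text"]
    .str.extractall(pattern)
    .rename_axis(index=["row", "match"])
    .reset_index()
)

# Group by original row and organization, collect the articles into lists,
# then move the organization level into the columns.
result = matches.groupby(["row", "org"])["news"].agg(list).unstack("org")

print(result)
```

Grouping on the original row label as well as the organization keeps articles from different rows apart when the frame has more than one row.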
Regex patterns for news blocks
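Using (foo) as a delimiter might be more Pythonic, but capturing the whole news block works nicely too ;) The regex patterns for the news blocks are, respectively,
\(nytimes\)\s?".*?"
and \(wsj\)\s?".*?"
In short: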
Capture news blocks
Using pattern, we can extract the target blocks as a list.
Tidy up the blocks
What remains is routine data cleaning.
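The capture and tidy-up steps might look roughly like this. The two patterns are the ones quoted above; the helper names `capture_blocks` and `tidy_block` and the sample string are assumptions made for illustration, not the original answer's code.

```python
import re

# Sample string reused from the first answer.
text = '(nytimes) "news on war" (wsj) "news on business" (nytimes) "more news"'

# One pattern per organization, as quoted above.
patterns = {
    "nytimes": r'\(nytimes\)\s?".*?"',
    "wsj": r'\(wsj\)\s?".*?"',
}


def capture_blocks(pattern, s):
    """Return every whole news block matching the pattern (hypothetical helper)."""
    return re.findall(pattern, s)


def tidy_block(block):
    """Drop the leading '(org)' tag and the surrounding quotes (hypothetical helper)."""
    return re.sub(r'^\(\w+\)\s?', "", block).strip('"')


blocks = {org: capture_blocks(p, text) for org, p in patterns.items()}
tidied = {org: [tidy_block(b) for b in found] for org, found in blocks.items()}

print(blocks)
print(tidied)
```

The non-greedy `.*?` matters here: a greedy `.*` would run from the first opening quote to the last closing quote in the string and swallow the blocks in between.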
Populate the DataFrame
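One possible way to do this last step is sketched below. It is an assumption-laden illustration rather than the original answer's code: the column names NyTimes and WSJ are borrowed from the first answer, a capture group is added to each quoted pattern so that only the article text is returned, and Series.str.findall stands in for whatever extraction call the answer actually used.

```python
import pandas as pd

# Sample data reused from the first answer.
df = pd.DataFrame(
    {"Text": ['(nytimes) "news on war" (wsj) "news on business" (nytimes) "more news"']}
)

# The quoted patterns with an added capture group around the article text
# (the capture group is an assumption, not part of the patterns as quoted).
patterns = {
    "NyTimes": r'\(nytimes\)\s?"(.*?)"',
    "WSJ": r'\(wsj\)\s?"(.*?)"',
}

# With a single capture group, str.findall returns a list of the captured
# article texts for every row.
for column, pattern in patterns.items():
    df[column] = df["Text"].str.findall(pattern)

print(df[["NyTimes", "WSJ"]])
```

Each new column holds one list per row, so rows that mention an organization several times keep all of its articles together.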