Split a column by brackets with conditional statements

3 answers

User

Answer #1 · edited 2024-05-22 09:36:48

I'm sure this code could be improved, but it works without a loop.

import pandas as pd

texts = ['(nytimes) "news on war" (wsj) "news on business" (nytimes) "more news"']

data = pd.DataFrame(texts, columns=['text'])
# Use extractall to pull out every (source) "headline" chunk
new = data['text'].str.extractall(r'((?:\(\w+\)."\w+.\w+ \w+?"))')

# extractall returns a MultiIndex on the rows (row label, match number),
# so drop the match-number level
new_1 = new.droplevel(level=1)
val = new_1[0].explode()[0].str.split().str[0]
val2 = new_1[0].explode()[0].str.split().str[1:].apply(lambda x: ' '.join(x))

# Build a new DataFrame to clean and group the extracted results
cleaner = pd.DataFrame()
cleaner['one'] = val
cleaner['two'] = val2

cleaner = cleaner.groupby('one').aggregate(lambda tdf: tdf.unique().tolist()).reset_index().T
# cleaner.columns = ['NyTimes', 'WSJ']

cleaner.reset_index(inplace=True)
cleaner.drop(0, axis=0, inplace=True)
cleaner.drop('index', axis=1, inplace=True)

# Once the data is cleaned as expected, add it back to the main data
data[['NyTimes', 'WSJ']] = cleaner.values
data.head()
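
To make the comment about the index concrete, here is a minimal sketch of what extractall returns for the sample string. The simplified pattern in it (anything between one pair of quotes) is an assumption for illustration, not the pattern used in the answer above.

```python
import pandas as pd

data = pd.DataFrame(
    {"text": ['(nytimes) "news on war" (wsj) "news on business" (nytimes) "more news"']}
)

# One capture group, so the result has a single column labelled 0.
# Assumed, simplified pattern: (source) followed by one quoted string.
matches = data["text"].str.extractall(r'(\(\w+\)\s*"[^"]*")')

# The row index has two levels: the original row label and the match number.
print(matches.index.nlevels)  # 2

# Dropping the "match" level leaves only the original row label,
# so all three matches from row 0 are now labelled 0.
flat = matches.droplevel("match")
print(flat.index.tolist())  # [0, 0, 0]
```

Because every match keeps the label of the row it came from, the results can later be grouped or joined back onto the original frame.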

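User

Answer #2 · edited 2024-05-22 09:36:48

`Series.str.extractall`

Extract every match of the regex pattern from the strings in the Text column, then group by organization and aggregate the news items with list, and finally unstack to reshape.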

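# Column 0 of the result holds the organization, column 1 the news text
s = df['Text'].str.extractall(r'\((.*?)\)\s*(.*?)\s*(?=\(|$)')
# Group by original row and organization, collect the texts into lists, then pivot
s = s.set_index(0, append=True).groupby(level=[0, 2])[1].agg(list).unstack()

df.join(s)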
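
For anyone who wants to try the idea without the original data, here is a self-contained sketch of the same extract, group, unstack approach. It is a variant, not the answer's exact code: the sample rows, the named groups `source` and `headline`, and the variable `wide` are assumptions made for illustration, and the quotes are matched outside the capture group so they are dropped from the result.

```python
import pandas as pd

# Hypothetical sample data; only the column name 'Text' comes from the answer.
df = pd.DataFrame(
    {
        "Text": [
            '(nytimes) "news on war" (wsj) "news on business" (nytimes) "more news"',
            '(wsj) "markets rally"',
        ]
    }
)

# Named groups give the extracted columns descriptive labels.
s = df["Text"].str.extractall(r'\((?P<source>.*?)\)\s*"(?P<headline>.*?)"')

# Group by the original row label and the source, collect headlines into lists,
# then pivot the source level into columns.
wide = (
    s.groupby([s.index.get_level_values(0), "source"])["headline"]
    .agg(list)
    .unstack()
)

result = df.join(wide)
print(result[["nytimes", "wsj"]])
```

Rows that have no item for a given source end up with NaN in that column, which is worth handling before further processing.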


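User

Answer #3 · edited 2024-05-22 09:36:48

Regex pattern for the news chunks

Using (foo) as a delimiter might be more Pythonic, but capturing whole news chunks works well too ;)

The regex patterns for the news chunks are `\(nytimes\)\s?".*?"` and `\(wsj\)\s?".*?"`, respectively.

In short: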

pattern = re.compile(r'\(nytimes\)\s".*?"|\(wsj\)\s".*?"')
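
The `?` after `.*` makes the quantifier lazy, which is what keeps each match inside a single pair of quotes. A small sketch, using only the nytimes half of the pattern on the sample string from this thread (the names `lazy` and `greedy` are just illustrative), shows the difference:

```python
import re

text = '(nytimes) "news on war" (wsj) "news on business" (nytimes) "more news"'

# Lazy: stop at the first closing quote after each (nytimes) tag.
lazy = re.findall(r'\(nytimes\)\s".*?"', text)

# Greedy: run on to the last closing quote in the whole string.
greedy = re.findall(r'\(nytimes\)\s".*"', text)

print(lazy)    # ['(nytimes) "news on war"', '(nytimes) "more news"']
print(greedy)  # a single match that spans the entire string
```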

Capturing the news chunks

With pattern, we can extract the target chunks as a list:

import re
import pandas as pd

my_text = '(nytimes) "news on war" (wsj) "news on business" (nytimes) "more news"'

pattern = re.compile(r'\(nytimes\)\s".*?"|\(wsj\)\s".*?"')
chunks = pattern.findall(my_text)
print(chunks)
# ['(nytimes) "news on war"', '(wsj) "news on business"', '(nytimes) "more news"']

Cleaning up the chunks

What remains is routine data cleaning:

list_nyt = []
list_wsj = []
for chunk in chunks:
    if chunk.startswith('(nytimes)'):
        list_nyt.append(chunk.replace('(nytimes) ', '').strip('"'))
    elif chunk.startswith('(wsj)'):
        list_wsj.append(chunk.replace('(wsj) ', '').strip('"'))
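
If more sources than nytimes and wsj can appear, the same cleaning step can be written without one branch per source. The following is a sketch under that assumption; the helper name `group_by_source` and the capture-group pattern are hypothetical and not part of the answer above.

```python
import re
from collections import defaultdict


def group_by_source(text):
    """Return {source: [headline, ...]} for every (source) "headline" pair."""
    # With two capture groups, findall returns (source, headline) tuples.
    pairs = re.findall(r'\((\w+)\)\s*"(.*?)"', text)
    grouped = defaultdict(list)
    for source, headline in pairs:
        grouped[source].append(headline)
    return dict(grouped)


my_text = '(nytimes) "news on war" (wsj) "news on business" (nytimes) "more news"'
grouped = group_by_source(my_text)
print(grouped)
# {'nytimes': ['news on war', 'more news'], 'wsj': ['news on business']}
```

Capturing the source and the headline separately also removes the need for the replace and strip calls.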

Populating the DataFrame

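# DataFrame.append was removed in pandas 2.0 and never modified df in place,
# so build the frame directly from a list of row dicts instead.
df = pd.DataFrame(
    [
        {
            'Text': str(my_text),
            'NYTimes': str(list_nyt),
            'WSJ': str(list_wsj)
        }
    ],
    columns=['Text', 'NYTimes', 'WSJ']
)
df
# 🍻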
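
The steps above handle a single string. For a Text column with several rows, one possible way to apply the same idea is `Series.apply` with a small helper. This is only a sketch: the helper `headlines_for`, the pattern `PAIR`, and the second sample row are assumptions for illustration, and the lists are stored as real lists rather than converted with str.

```python
import re

import pandas as pd

# Assumed pattern: a (source) tag followed by one quoted headline.
PAIR = re.compile(r'\((\w+)\)\s*"(.*?)"')


def headlines_for(text, source):
    """Return the headlines in text that are tagged with the given source."""
    return [headline for src, headline in PAIR.findall(text) if src == source]


df = pd.DataFrame(
    {
        "Text": [
            '(nytimes) "news on war" (wsj) "news on business" (nytimes) "more news"',
            '(wsj) "markets rally"',  # hypothetical extra row
        ]
    }
)

# One list-valued column per source; rows without that source get an empty list.
df["NYTimes"] = df["Text"].apply(lambda t: headlines_for(t, "nytimes"))
df["WSJ"] = df["Text"].apply(lambda t: headlines_for(t, "wsj"))
print(df[["NYTimes", "WSJ"]])
```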
