Inserting square brackets around specific words in a Python text string?

Posted 2024-09-30 01:31:05


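I have a text string and have identified a set of words that I want to wrap in []. I have stored these words in an array, and each entry also stores the index positions of the word's first and last characters.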

In Python, how can I add [] on either side of these words?

Here is an example of the text string I extract the words from: "The SARs were leaked to the Buzzfeed website and shared with the International Consortium of Investigative Journalists (ICIJ). Panorama led the research for the BBC as part of a global probe. The ICIJ led the reporting of the Panama Papers and Paradise Papers leaks - secret files detailing the offshore activities of the wealthy and the famous. Fergus Shiel, from the consortium, said the FinCEN Files are an insight into what banks know about the vast flows of dirty money across the globe… [The] system that is meant to regulate the flows of tainted money is broken. The leaked SARs had been submitted to the US Financial Crimes Enforcement Network, or FinCEN between 2000 and 2017 and cover transactions worth about $2 trillion. FinCEN said the leak could impact US national security, risk investigations, and threaten the safety of those who file the reports. But last week it announced proposals to overhaul its anti-money laundering programmes. The UK also unveiled plans to reform its register of company information to clamp down on fraud and money laundering.The investment scam that HSBC was warned about was called WCM777. It led to the death of investor Reynaldo Pacheco, who was found under water on a wine estate in Napa, California, in April 2014. Police say he had been bludgeoned with rocks. He signed up to the scheme and was expected to recruit other investors. The promise was everyone would get rich. A woman Mr Pacheco, 44, introduced lost about $3,000. That led to the killing by men hired to kidnap him. He literally was trying to… make people's lives better, and he himself was scammed, and conned, and he unfortunately paid for it with his life,said Sgt Chris Pacheco (no relation), one of the officers who investigated the killing. Reynaldo, he said, was murdered for being a victim in a Ponzi scheme."

Here is an example of what the array of words I want to wrap in square brackets looks like:

[('Buzzfeed', 28, 36, 'ORG'), ('International Consortium of Investigative Journalists', 61, 118, 'ORG'), ('Panorama', 127, 135, 'ORG'), ('BBC', 161, 164, 'ORG'), ('Panama Papers', 222, 239, 'ORG'), ('Fergus Shiel', 346, 358, 'PERSON'), ('Files', 397, 402, 'PRODUCT'), ('US Financial Crimes Enforcement Network', 608, 651, 'ORG'), ('FinCEN', 733, 739, 'ORG'), ('US', 767, 769, 'GPE'), ('last week', 869, 878, 'DATE'), ('UK', 956, 958, 'GPE'), ('HSBC', 1094, 1098, 'ORG'), ('Reynaldo Pacheco', 1167, 1183, 'PERSON'), ('Napa', 1231, 1235, 'GPE'), ('California', 1237, 1247, 'GPE'), ('April 2014', 1252, 1262, 'DATE'), ('Mr Pacheco', 1431, 1441, 'PERSON'), ('44', 1443, 1445, 'DATE'), ('Sgt Chris Pacheco', 1677, 1694, 'PERSON')]
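
Before inserting anything, it can help to confirm that each stored span really covers the word stored next to it. In the list above, for example, the span 61 to 118 is four characters longer than the 53-character phrase it is paired with, which is consistent with the leading "the " that appears inside the brackets in the index-based output further down. The following is only a minimal sketch: `check_spans`, `sample_text` and `sample_spans` are hypothetical names, and the sample is a made-up miniature input, not the text from the question.

```python
def check_spans(text, spans):
    """Return (stored_word, actual_slice) pairs where the two differ."""
    mismatches = []
    for word, start, end, _label in spans:
        actual = text[start:end]  # end index is exclusive, as in the question's data
        if actual != word:
            mismatches.append((word, actual))
    return mismatches


# Hypothetical miniature input (not the question's text)
sample_text = "Leaked to the Buzzfeed website by the BBC."
sample_spans = [("Buzzfeed", 14, 22, "ORG"), ("BBC", 34, 41, "ORG")]

# The second span starts at "the", so it is reported as a mismatch
print(check_spans(sample_text, sample_spans))
```

Whether a mismatch like this should be corrected or kept depends on what the indices are meant to cover; the index-based answers below bracket exactly `text[start:end]`, whatever word is stored alongside.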


3 answers

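If you sort the list of phrases (which I will call words) in reverse order of start index, you can insert [] around each phrase in a loop. The reason for working backwards is that each insertion changes the indices of all later characters in the string: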

# Process the spans from last to first so that earlier indices stay valid
for w in sorted(words, key=lambda x: -x[1]):
    text = text[:w[1]] + '[' + text[w[1]:w[2]] + ']' + text[w[2]:]

print(text)

Output:

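The SARs were leaked to the [Buzzfeed] website and shared with [the International Consortium of Investigative Journalists] (ICIJ). [Panorama] led the research for the [BBC] as part of a global probe. The ICIJ led the reporting of [the Panama Papers] and Paradise Papers leaks - secret files detailing the offshore activities of the wealthy and the famous. [Fergus Shiel], from the consortium, said the FinCEN [Files] are an insight into what banks know about the vast flows of dirty money across the globe… [The] system that is meant to regulate the flows of tainted money is broken. The leaked SARs had been submitted to [the US Financial Crimes Enforcement Network], or FinCEN between 2000 and 2017 and cover transactions worth about $2 trillion. [FinCEN] said the leak could impact [US] national security, risk investigations, and threaten the safety of those who file the reports. But [last week] it announced proposals to overhaul its anti-money laundering programmes. The [UK] also unveiled plans to reform its register of company information to clamp down on fraud and money laundering.The investment scam that [HSBC] was warned about was called WCM777. It led to the death of investor [Reynaldo Pacheco], who was found under water on a wine estate in [Napa], [California], in [April 2014]. Police say he had been bludgeoned with rocks. He signed up to the scheme and was expected to recruit other investors. The promise was everyone would get rich. A woman [Mr Pacheco], [44], introduced lost about $3,000. That led to the killing by men hired to kidnap him. He literally was trying to… make people's lives better, and he himself was scammed, and conned, and he unfortunately paid for it with his life,said [Sgt Chris Pacheco] (no relation), one of the officers who investigated the killing. Reynaldo, he said, was murdered for being a victim in a Ponzi scheme.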
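
To see why the order matters, here is a small sketch that wraps the back-to-front loop in a function and compares it with a deliberately naive front-to-back loop that does not adjust the indices. The function names and the sample data are hypothetical and are not part of the original answer.

```python
def bracket_spans(text, spans):
    """Wrap text[start:end] in [] for each (word, start, end, label) span."""
    # Last span first: inserting near the end never shifts earlier indices.
    for _word, start, end, _label in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + "[" + text[start:end] + "]" + text[end:]
    return text


def bracket_spans_forward_unadjusted(text, spans):
    """Deliberately naive: first span first, indices left unadjusted."""
    for _word, start, end, _label in sorted(spans, key=lambda s: s[1]):
        text = text[:start] + "[" + text[start:end] + "]" + text[end:]
    return text


# Hypothetical miniature input (not the question's text)
sample_text = "Panorama led the research for the BBC."
sample_spans = [("Panorama", 0, 8, "ORG"), ("BBC", 34, 37, "ORG")]

print(bracket_spans(sample_text, sample_spans))
# [Panorama] led the research for the [BBC].
print(bracket_spans_forward_unadjusted(sample_text, sample_spans))
# [Panorama] led the research for th[e B]BC.
```

In the naive version the first pair of brackets pushes "BBC" two characters to the right, so the second span lands on the wrong characters.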


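The following should work, where l is your original list and t is your text (this assumes l is ordered by start index, as it is in your example):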

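# Convert the tuples to lists so their indices can be updated in place
l = [list(i) for i in l]
for i in range(len(l)):
    x1, x2 = l[i][1], l[i][2]
    t = t[:x1] + '[' + t[x1:x2] + ']' + t[x2:]
    # The two inserted brackets shift every later span right by 2
    for k in range(i + 1, len(l)):
        l[k][1] += 2
        l[k][2] += 2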

This gives the following output:

"The SARs were leaked to the [Buzzfeed] website and shared with [the International Consortium of Investigative Journalists] (ICIJ). [Panorama] led the research for the [BBC] as part of a global probe. The ICIJ led the reporting of [the Panama Papers] and Paradise Papers leaks - secret files detailing the offshore activities of the wealthy and the famous. [Fergus Shiel], from the consortium, said the FinCEN [Files] are an insight into what banks know about the vast flows of dirty money across the globe… [The] system that is meant to regulate the flows of tainted money is broken. The leaked SARs had been submitted to [the US Financial Crimes Enforcement Network], or FinCEN between 2000 and 2017 and cover transactions worth about $2 trillion. [FinCEN] said the leak could impact [US] national security, risk investigations, and threaten the safety of those who file the reports. But [last week] it announced proposals to overhaul its anti-money laundering programmes. The [UK] also unveiled plans to reform its register of company information to clamp down on fraud and money laundering.The investment scam that [HSBC] was warned about was called WCM777. It led to the death of investor [Reynaldo Pacheco], who was found under water on a wine estate in [Napa], [California], in [April 2014]. Police say he had been bludgeoned with rocks. He signed up to the scheme and was expected to recruit other investors. The promise was everyone would get rich. A woman [Mr Pacheco], [44], introduced lost about $3,000. That led to the killing by men hired to kidnap him. He literally was trying to… make people's lives better, and he himself was scammed, and conned, and he unfortunately paid for it with his life,said [Sgt Chris Pacheco] (no relation), one of the officers who investigated the killing. Reynaldo, he said, was murdered for being a victim in a Ponzi scheme."

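Both index-based answers above rebuild the whole string once per span. For long texts with many spans, one possible alternative is to collect the pieces in a list and join them once at the end, which also makes it easy to reject overlapping spans. This is a sketch under the assumption that spans never overlap; `bracket_spans_single_pass` and the sample data are hypothetical names, not taken from either answer.

```python
def bracket_spans_single_pass(text, spans):
    """Build the bracketed text in one left-to-right pass over the spans."""
    pieces = []
    cursor = 0  # position in the ORIGINAL text that has been copied so far
    for _word, start, end, _label in sorted(spans, key=lambda s: s[1]):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:start])            # untouched text before the span
        pieces.append("[" + text[start:end] + "]")   # the span itself, bracketed
        cursor = end
    pieces.append(text[cursor:])                     # whatever follows the last span
    return "".join(pieces)


# Hypothetical miniature input (not the question's text)
sample_text = "in Napa, California, in April 2014."
sample_spans = [
    ("Napa", 3, 7, "GPE"),
    ("California", 9, 19, "GPE"),
    ("April 2014", 24, 34, "DATE"),
]

print(bracket_spans_single_pass(sample_text, sample_spans))
# in [Napa], [California], in [April 2014].
```

Because every index refers to the original text, no offsets need to be adjusted and the processing order does not cause shifting problems.
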
If you can make certain assumptions about your data, here is a very simple version, which is probably what my first attempt would look like:

text = "The SARs were leaked..."
keywords_indexed = [('Buzzfeed', 28, 36, 'ORG'), ...]

# Construct a set of keywords that we want to bracket
words_to_bracket = set(k[0] for k in keywords_indexed)

# Replace every instance of a word-to-be-bracketed
bracketed_text = text
for word in words_to_bracket:
    bracketed_text = bracketed_text.replace(word, "[{}]".format(word))

print(bracketed_text)

Pros: it is simple, easy to understand, and easy to maintain

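Cons: it is inefficient, although that probably does not matter unless you are dealing with very large blocks of text and have to process them quickly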

Only you can decide which trade-offs to make. I just wanted to offer you a nice, clean version to choose from


Output of the above code on the OP's sample input:

The SARs were leaked to the [Buzzfeed] website and shared with the [International Consortium of Investigative Journalists] (ICIJ). [Panorama] led the research for the [BBC] as part of a global probe. The ICIJ led the reporting of the [Panama Papers] and Paradise Papers leaks - secret files detailing the offshore activities of the wealthy and the famous. [Fergus Shiel], from the consortium, said the [FinCEN] [Files] are an insight into what banks know about the vast flows of dirty money across the globe… [The] system that is meant to regulate the flows of tainted money is broken. The leaked SARs had been submitted to the [US] Financial Crimes Enforcement Network, or [FinCEN] between 2000 and 2017 and cover transactions worth about $2 trillion. [FinCEN] said the leak could impact [US] national security, risk investigations, and threaten the safety of those who file the reports. But [last week] it announced proposals to overhaul its anti-money laundering programmes. The [UK] also unveiled plans to reform its register of company information to clamp down on fraud and money laundering.The investment scam that [HSBC] was warned about was called WCM777. It led to the death of investor [Reynaldo Pacheco], who was found under water on a wine estate in [Napa], [California], in [April 2014]. Police say he had been bludgeoned with rocks. He signed up to the scheme and was expected to recruit other investors. The promise was everyone would get rich. A woman [Mr Pacheco], [44], introduced lost about $3,000. That led to the killing by men hired to kidnap him. He literally was trying to… make people's lives better, and he himself was scammed, and conned, and he unfortunately paid for it with his life,said [Sgt Chris Pacheco] (no relation), one of the officers who investigated the killing. Reynaldo, he said, was murdered for being a victim in a Ponzi scheme.
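
Note that the replace-based version ignores the stored indices, so it brackets every occurrence of a keyword, and keywords that contain one another can interfere: in the output above, "US" ends up bracketed inside "US Financial Crimes Enforcement Network" while the longer phrase is left unbracketed. The result can also depend on the iteration order of the set. One possible way around this, if bracketing every occurrence is acceptable, is a single regular-expression pass that tries longer keywords first. This is a sketch, not the answerer's method; `bracket_keywords` and the sample data are hypothetical.

```python
import re


def bracket_keywords(text, keywords):
    """Bracket every whole-word occurrence of any keyword in a single pass."""
    # Longest first, so a long phrase wins over a shorter keyword it contains.
    ordered = sorted(set(keywords), key=len, reverse=True)
    if not ordered:
        return text
    alternation = "|".join(re.escape(k) for k in ordered)
    # The lookarounds keep "US" from matching inside a longer word such as "BUSY".
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return pattern.sub(lambda m: "[" + m.group(0) + "]", text)


# Hypothetical miniature input (not the question's text)
sample_text = ("Submitted to the US Financial Crimes Enforcement Network; "
               "the US and the UK responded.")
sample_keywords = ["US", "US Financial Crimes Enforcement Network", "UK"]

print(bracket_keywords(sample_text, sample_keywords))
# Submitted to the [US Financial Crimes Enforcement Network]; the [US] and the [UK] responded.
```

Because the text is scanned only once, text that has already been bracketed is never matched again, so nested brackets cannot appear.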
