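Thanks for stopping by. I have two DataFrames: one called news_test, which stores 3 million news articles, and another called company_fuzzy_name, which stores 280,000 company names (with fuzzy name variants). Here are some examples: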
+=======+===========================================================================+
| index | content |
+=======+===========================================================================+
| 0 | Apple and Google are two of the strongest companies in the world. |
+-------+---------------------------------------------------------------------------+
| 1 | Working in Facebook and Google is my dream, however, it is still a dream. |
+-------+---------------------------------------------------------------------------+
+=======+========+==============+=======================+
| index | ID | Company_Name | Company_FuzzyName_new |
+=======+========+==============+=======================+
| 0 | 123456 | Apple Inc. | Apple Inc.|Apple |
+-------+--------+--------------+-----------------------+
| 1 | 789111 | Google LLC | Google LLC|Google |
+-------+--------+--------------+-----------------------+
| 2 | 333333 | Facebook | Facebook|FB |
+-------+--------+--------------+-----------------------+
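Now, if any of the names in "Company_FuzzyName_new" (DataFrame: company_fuzzy_name, separated by |) matches any word in "content" (DataFrame: news_test), I want to add a new column named "Com" to news_test whose value is the corresponding "ID" from company_fuzzy_name. So, for the example above, the result would be: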
+=======+===========================================================================+==================+
| index | content | Com |
+=======+===========================================================================+==================+
| 0 | Apple and Google are two of the strongest companies in the world. | [123456, 789111] |
+-------+---------------------------------------------------------------------------+------------------+
| 1 | Working in Facebook and Google is my dream, however, it is still a dream. | [789111, 333333] |
+-------+---------------------------------------------------------------------------+------------------+
I already have the code below, and it works:
```python
import re

list_total = []
for i in range(len(news_test)):
    list_match = []
    for j in range(len(company_fuzzy_name)):
        # The fuzzy-name string is used directly as a regex, so its | acts as alternation
        if re.search(company_fuzzy_name.iloc[j]['Company_FuzzyName_new'],
                     news_test.iloc[i]['content'].encode('utf-8')):
            list_match.append(company_fuzzy_name.iloc[j]['ID'])
    # One list of matching IDs per news row
    list_total.append(list_match)
news_test['Com'] = list_total
```
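However, this is far too slow (because of the 3M * 280K comparisons). Is there a way to reduce the running time, or to restructure the code to make it more efficient? The format of the "Com" column is not fixed; it can be a list, a string, etc. Thanks for your help.
My Python environment is 2.7.
Sorry, can anyone help me? I have been stuck on this for a long time.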
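One possible direction, offered as a sketch rather than something tested at this scale: instead of running 280K regex searches per article, index every fuzzy name in a dictionary keyed by its words, then look up each word n-gram of an article in that dictionary. The helper names `build_name_index` and `match_companies` and the `\w+` tokenization are assumptions made for illustration, not part of the original code. Note that this matches whole words, as the description asks, whereas the `re.search` loop also matches substrings and treats `.` as a wildcard, so the two can differ on some inputs.

```python
import re

# Assumption: a "word" is a run of letters, digits, or underscores.
TOKEN = re.compile(r"\w+")

def build_name_index(companies):
    """companies: iterable of (ID, 'Name A|Name B') pairs, in table order."""
    index = {}      # tuple of words -> list of (row position, ID)
    max_words = 1   # longest name in words; bounds the n-gram size
    for pos, (company_id, fuzzy_names) in enumerate(companies):
        for name in fuzzy_names.split("|"):
            words = tuple(TOKEN.findall(name))
            if words:
                index.setdefault(words, []).append((pos, company_id))
                max_words = max(max_words, len(words))
    return index, max_words

def match_companies(content, index, max_words):
    """Return IDs of companies whose name occurs as whole words in content."""
    tokens = TOKEN.findall(content)
    found = set()
    for start in range(len(tokens)):
        for size in range(1, max_words + 1):
            if start + size > len(tokens):
                break
            found.update(index.get(tuple(tokens[start:start + size]), ()))
    # Sort by row position so IDs come out in company-table order
    return [company_id for _, company_id in sorted(found)]

# Sample data copied from the tables in the question
companies = [
    (123456, "Apple Inc.|Apple"),
    (789111, "Google LLC|Google"),
    (333333, "Facebook|FB"),
]
news = [
    "Apple and Google are two of the strongest companies in the world.",
    "Working in Facebook and Google is my dream, however, it is still a dream.",
]

index, max_words = build_name_index(companies)
com = [match_companies(text, index, max_words) for text in news]
print(com)  # [[123456, 789111], [789111, 333333]]
```

With the real data, the pairs could presumably be built with `zip(company_fuzzy_name['ID'], company_fuzzy_name['Company_FuzzyName_new'])` and the result assigned via `news_test['Com'] = [match_companies(t, index, max_words) for t in news_test['content']]`. Each article then costs roughly (number of words times longest name length in words) dictionary lookups, independent of the number of companies.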