Searching for a substring across 3 million records in Python

2 answers

User

#1 · edited 2024-10-06 19:19:42

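Using the pandas isin() function should be faster. Note that isin() compares whole cell values rather than substrings, so it applies when each cell holds a single word.

Example: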

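import pandas as pd

# Split the search string into individual words
a = 'Hello world'
ss = a.split(" ")

df = pd.DataFrame({'col1': ['Hello', 'asd', 'asdasd', 'world']})
# Keep the rows whose value equals one of the words and read off their index
df.loc[df['col1'].isin(ss)].index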

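This returns the index of the matching rows (pandas 2.0 and later print it as Index([0, 3], dtype='int64')):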

Int64Index([0, 3], dtype='int64')
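
Since the question is about substrings, it may help to see how isin() differs from a per-cell substring test. The following is only a small sketch on the same toy DataFrame; using str.contains() with regex=False is an assumption about what a substring search would look like here, not something the answer above proposes.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['Hello', 'asd', 'asdasd', 'world']})

# isin() compares each whole cell value against the given list.
exact = df.index[df['col1'].isin(['Hello', 'world'])].tolist()

# A partial word matches nothing with isin().
partial_with_isin = df.index[df['col1'].isin(['Hell'])].tolist()

# str.contains() with regex=False does a literal substring test per cell.
partial = df.index[df['col1'].str.contains('asd', regex=False)].tolist()

print(exact, partial_with_isin, partial)
```

isin() is typically fast because it relies on hashing, while str.contains() has to scan every cell, so the two are not interchangeable on a large column.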

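User

#2 · edited 2024-10-06 19:19:42

I found another approach. I built a dictionary of words (an inverted index) for the description column of the 3M-row dataset by splitting each description into words. (I replaced the digits in the descriptions with zeros and used the result to build the dictionary.)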

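import re

def tokenize(desc):
    # Replace every digit with '0', then split on runs of whitespace
    desc = re.sub(r'\d', '0', desc)
    tokens = re.split(r'\s+', desc)
    return tokens

def make_inv_index(df):
    # Map each token to the list of row labels whose description contains it
    inv_index = {}
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i, tokens in df['description_removed_numbers'].items():
        for token in tokens:
            try:
                inv_index[token].append(i)
            except KeyError:
                inv_index[token] = [i]
    return inv_index

df['description_removed_numbers'] = df['description'].apply(tokenize)
inv_index_df = make_inv_index(df)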

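Now, when searching the descriptions, the same tokenization has to be applied to the search string. The dictionary then gives the intersection of the row indices for the words in the search string, and only those rows need to be searched. This greatly reduced the total running time of my program.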
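
The lookup step described above could be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the search() helper, the sample data, and the use of sets instead of lists in the index are all assumptions. It also assumes the search string is made of whole words, because a query that starts or ends in the middle of a word has no entry in a word-level index.

```python
import re
import pandas as pd

def tokenize(desc):
    # Same normalisation as in the answer: digits become '0', split on whitespace.
    desc = re.sub(r'\d', '0', desc)
    return [t for t in re.split(r'\s+', desc) if t]

def make_inv_index(token_lists):
    # Map each token to the set of row labels containing it (sets make intersection easy).
    inv_index = {}
    for i, tokens in token_lists.items():
        for token in tokens:
            inv_index.setdefault(token, set()).add(i)
    return inv_index

def search(df, inv_index, query):
    # Hypothetical helper: rows whose description contains `query` as a substring.
    tokens = tokenize(query)
    if not tokens:
        return df.iloc[0:0]
    # Rows that contain every token of the query.
    candidates = set.intersection(*(inv_index.get(t, set()) for t in tokens))
    if not candidates:
        return df.iloc[0:0]
    # Run the real substring test on the candidate rows only.
    rows = df.loc[sorted(candidates)]
    return rows[rows['description'].str.contains(query, regex=False)]

# Made-up sample data for illustration.
df = pd.DataFrame({'description': ['red widget 12 mm', 'blue widget 15 mm',
                                   'red gadget 12 mm', 'widget red 12 mm']})
inv_index = make_inv_index(df['description'].apply(tokenize))
print(search(df, inv_index, 'red widget'))
```

Because digits are collapsed to zeros, the index only narrows the candidates. The final str.contains() check against the original description is what rejects rows that share the same words but differ in the actual numbers or word order.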
