Speed drops sharply when list construction is moved out of the function

Posted 2024-09-28 05:23:44

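I wrote some code that is supposed to perform some operations on the sentences of a file and on the elements of two lists, keywords and keywords2. The details are as follows: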

import os
keywords=['a','b']
keywords2=['c','d mvb']

def foo(sentence,k2):

    gs_list=[]                       #####
    for k in keywords:               #####    
        if k in sentence:            #####
            gs_list.append(k)        #####

    for k in gs_list:
        if (k in sentence) and (k2 in sentence):
            print 'a match'
    return 4

for path, dirs, files in os.walk(r'F:\M.Tech\for assigning cl\selected\random 100'):
    for file in files:
        sentences=open(os.path.join(path,file)).readlines()
        for sentence in sentences:
            if sentence.startswith('!series_title'):      
                for k2 in keywords2:
                    foo(sentence,k2)

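I have marked the part of the code in question with #####. This piece (let's call it BETA) basically builds a list of the keywords that occur in the selected sentence, so that later operations only have to work with those keywords.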

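This code takes about 47 seconds to run on 100 files. Now I am looking for ways to speed it up. keywords2 has about 50 elements, so I figured that by placing BETA inside the function foo I was essentially running it 50 times per sentence, even though it only needs the list keywords and sentence. Both are already available in my main code, so I moved that part into the main code: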

import os
keywords=['a','b']
keywords2=['c','d mvb']

def foo(sentence,k2):

    for k in gs_list:
        if (k in sentence) and (k2 in sentence):
            print 'a match'
    return 4

for path, dirs, files in os.walk(r'F:\M.Tech\for assigning cl\selected\random 100'):
    for file in files:
        sentences=open(os.path.join(path,file)).readlines()
        for sentence in sentences:
            if sentence.startswith('!series_title'):   

                gs_list=[]                       #####
                for k in keywords:               #####    
                    if k in sentence:            #####
                        gs_list.append(k)        #####                        

                for k2 in keywords2:
                    foo(sentence,k2)

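My idea was that this would make the list-building step happen only once per sentence instead of 50 times as before, which should certainly speed up the code. But this version actually took 89 seconds to process the same 100 files.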

I do not understand why this takes more time than the previous code. Any ideas?
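Before restructuring further, it may help to measure the BETA step on its own rather than infer its cost from the total run time. The following is a minimal sketch using the standard `timeit` module; the sample sentence, the keyword list, and the helper name `build_gs_list` are made up for illustration and are not part of the original code.

```python
import timeit

# Made-up stand-ins for the real data (assumptions for illustration only).
keywords = ['a', 'b', 'zzz']
sentence = '!series_title a sample line mentioning b'

def build_gs_list(sentence, keywords):
    """The BETA step: keep only the keywords that occur in the sentence."""
    gs_list = []
    for k in keywords:
        if k in sentence:
            gs_list.append(k)
    return gs_list

# Time 10,000 runs of BETA alone; the absolute number depends on the machine.
elapsed = timeit.timeit(lambda: build_gs_list(sentence, keywords), number=10000)
print('BETA x 10000: %.4f s' % elapsed)
```

Running the same measurement with the real keyword list and a real title line would show how much of the 47 or 89 seconds this step can account for.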

Full code:

import os
import re
import time
start_time = time.time()
a = open(r'F:\M.Tech\patterns for gmk_down.txt','r').readlines()
a1 = open(r'F:\M.Tech\patterns for gmk_up.txt','r').readlines()
keywords2=a+a1
ri2 = open(r'F:\M.Tech\for assigning cl\rules occurence\s\induced two.txt', 'w')

keywords = open(r'F:\M.Tech\mouse_gs_small_simple_reduced.txt','r').readlines()  # this has the new small GS
keystripped = [k.rstrip().lower() for k in keywords]
c=0

def foo(s, gmk):    
    if gmk in s:  # checking if gmk is in the line
        l = re.split('\s|(?<!\d)[,.]|[,.](?!\d)|;|[()]|-', s) # split the line by comma, semicolon and space to check for gmks and gs.
        l = filter(None, l)   # remove empty elements in the list (the result must be assigned back)
        #gs_list = [k for k in keystripped if k in s]    # <-------- PIECE IN QUESTION --------       
        for gs in gs_list: # gene symbols

            gs1 = re.split('\s|(?<!\d)[,.]|[,.](?!\d)|;|-', gs)
            gs1=filter(None, gs1)
            gmk1 = re.split('\s|(?<!\d)[,.]|[,.](?!\d)|;|-', gmk)
            gmk1=filter(None, gmk1)
            if any(l[i:i+len(gs1)]==gs1 for i in xrange(len(l)-len(gs1)+1)) and (any(l[i:i+len(gmk1)]==gmk1 for i in xrange(len(l)-len(gmk1)+1))): # this ensures that both gs and gmk are in l, as a unit(i.e. and in order) otherwise it was detecting things like 'beta c' from beta cells
                #  UPTO THIS POINT WE HAVE ESTABLISHED THAT THE GMK AND GS ARE INDEED IN THE LINE                    
                k1 = '_MKKEYWORD_1_'
                k2 = '_SKEYWORD_2_'
                #print gmk
                text = re.sub(re.escape(gmk), k1, s, flags=re.I) # because of this replacement, we dont have the problem of counting r from behind etc.

                text = re.sub(r'(\b%s\b)' % (re.escape(gs)), k2, text, flags=re.I)
                lt = text.split()                    
                d_idx = {k1:[], k2:[]}
                for k,v in enumerate(lt):
                    if k1 in v:
                        d_idx[k1].append(k)
                    if k2 in v:
                        d_idx[k2].append(k)
                distance = 8
                data = []
                for idx1 in d_idx[k1]:
                    for idx2 in d_idx[k2]:
                        d = abs(idx1 - idx2)
                        if d<=distance:
                            data.append((d,idx1,idx2))

                data.sort(key=lambda x: x[0])
                for i in range (0, len(data)):  
                    aq = data[i]
                    loq = min(aq[1], aq[2])
                    hiq = max(aq[1], aq[2])
                    brrq = lt[max(0, loq-6):hiq+6]
                    brq = " ".join(brrq)                     

                if data:                     
                    cl(s, gmk, gs, gs_list, data)


def cl(s1, gmk1, gs1, gs_list1, data1): # output will be the confidence level
    if gmk1 == 'induced':
        # NOTE: br0 is not defined anywhere in the posted code
        if re.search(r'(%s.*?-induced)' % gs1, br0, re.I|re.S):
            ri2.write('good')

    return 4

c=0

for path, dirs, files in os.walk(r'F:\M.Tech\for assigning cl\selected\random 100'):
    for file in files:
        sentences = open(os.path.join(path,file),'r').readlines();        
        print("--- %s seconds ---" % (time.time() - start_time))
        for s in sentences:            
            if s.startswith('!series_title'):
                gs_list = [k for k in keystripped if k in s] #<------- PIECE IN QUESTION --------
                for k2 in keywords2:
                    k2 = k2.rstrip().lower()
                    foo(s, k2)
ri2.close()
print("--- %s seconds ---" % (time.time() - start_time))
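One detail of the full code may be relevant to the timing: the commented-out list build inside foo sits under `if gmk in s:`, so it ran only for the (line, gmk) pairs that actually matched, whereas the hoisted version runs once for every `!series_title` line whether or not any gmk matches. The sketch below counts how often the list is built under each layout. The data and the function names `count_guarded` and `count_hoisted` are made up for illustration.

```python
# Made-up data (assumptions): two title lines, only one of which contains a gmk.
keystripped = ['il6', 'tnf', 'p53']
gmks = ['induced', 'knockdown']
lines = [
    '!series_title il6 induced response in mouse liver',
    '!series_title tnf expression in mouse spleen',
]

def count_guarded(lines, gmks, keystripped):
    """List built inside foo, under `if gmk in s` (the original layout)."""
    builds = 0
    for s in lines:
        for gmk in gmks:
            if gmk in s:
                gs_list = [k for k in keystripped if k in s]
                builds += 1
    return builds

def count_hoisted(lines, gmks, keystripped):
    """List built once per title line, before the gmk loop (the new layout)."""
    builds = 0
    for s in lines:
        gs_list = [k for k in keystripped if k in s]
        builds += 1
        for gmk in gmks:
            pass  # foo(s, gmk) would be called here
    return builds

guarded = count_guarded(lines, gmks, keystripped)
hoisted = count_hoisted(lines, gmks, keystripped)
print('guarded builds: %d, hoisted builds: %d' % (guarded, hoisted))
```

When few gmks match a line, the guarded layout builds the list less often than the hoisted one. Whether that accounts for the measured difference depends on the real files and keyword lists, which the post does not show.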

1 Answer
User
#1 · Posted 2024-09-28 05:23:44

You are not passing gs_list to foo, so foo reads it as a global variable. Using a global variable may slow down your script.

Also, consider writing BETA as a list comprehension. This should be all you need:

gs_list = [k for k in keywords if k in sentence]
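Putting the two suggestions together, a minimal sketch might look like the following. It assumes foo can take gs_list as a third parameter and return its matches instead of printing them. Both are changes from the posted code, and the `process` helper and the sample sentences are made up for illustration.

```python
keywords = ['a', 'b']
keywords2 = ['c', 'd mvb']

def foo(sentence, k2, gs_list):
    """Return the (k, k2) pairs that both occur in the sentence."""
    matches = []
    if k2 in sentence:            # check k2 once instead of once per k
        for k in gs_list:         # gs_list already holds only keywords found in sentence
            matches.append((k, k2))
    return matches

def process(sentences):
    """Build gs_list once per title line and pass it to foo explicitly."""
    results = []
    for sentence in sentences:
        if sentence.startswith('!series_title'):
            gs_list = [k for k in keywords if k in sentence]
            for k2 in keywords2:
                results.extend(foo(sentence, k2, gs_list))
    return results

sample = ['!series_title a b c', 'not a title c', '!series_title b d mvb']
print(process(sample))
```

Because gs_list is filtered before the call, the `k in sentence` test inside foo becomes redundant and is dropped here. gs_list is then a local name inside foo rather than a global one.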
