Optimizing URL-to-page-title conversion in Python 2.7

Posted 2024-09-29 23:15:26


I have a set of tweets in a file named twfile.txt. A sample of the file might look like this:

RT @CriticalReading: How #Islamophobia works. #Germanwings http://t.co/rX6XVxARiD
Family of Australian victims visit the #Germanwings #GermanWingsCrash crash site in #FrenchAlps #A320Crash #A320 http://t.co/ztReJ1tifU
RT @morningshowon7: #Germanwings: Australian relatives have visited the memorial site in the French alps. #TMS7 http://t.co/BmfiLxHPkC
Three generations from the same family were killed in the #Germanwings Alps crash: http://t.co/6F5MgvBSZG http://t.co/HzJZCZKVZe
Alps crash pilot's hidden illness sparks medical privacy debate #Germanwings. http://t.co/Efe89rxwJG
#Germanwings crash: church in #AndreasLubitz's home town stands by his family http://t.co/QkePs5sG4W http://t.co/irdDnHhxF7
Breaking: #Germanwings co-pilot had been treated 4 suicidal tendencies: http://t.co/6qEynKMSEI/s/KJKu http://t.co/TVdqP4EeWu/s/b4vR @Reuters
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
Audio last 60 seconds from flight deck http://t.co/T4IYK26NrG     #Germanwings #GermanWingsCrash #GermanyWings #4U9525 #AndreasLubitz
#Germanwings: Australian relatives have visited the memorial site in the French alps. #TMS7 http://t.co/BmfiLxHPkC
RT @surfinwav: American intelligence contractor among those killed in Alps plane crash http://t.co/m4L0EOd9L2 #Germanwings #GermanWingsCrash
Excellent help & resources from our friends @MindframeMedia over responsible reporting re #Germanwings http://t.co/EQG0kxyQgd  #NoStigma
.@Boba71 @Reuters So in Germany any sick psycho can fly a commercial plane hiding behind the so called privacy laws? #germanwings
The World Will Never Forget  https://t.co/Th41xouUiS  #4U9525 #GermanWings #A320Crash #indeepsorrow #AndreasLubitz
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
I am uncomfortable using word 'depression' for the #Germanwings pilot, depression does not kill other people.
Google Maps has blurred out the home of #Germanwings crash pilot Andreas Lubitz. http://t.co/VTm5sfmT6e
#Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/YpDB8trKFL http://t.co/uML8h6vwD8
#Lufthansa #Germanwings prepare for negligence charges since copilot was known to be suicidal 7 years ago
ICYMI: @swaindiana's interview w. lawyer who represents 4 families, who lost loved ones in #Germanwings crash. http://t.co/dnUXKkCD46 #CBCNN
An airplane crashes, after a couple of HOURS we get who's guilty, with the perfect solution for everybody. I don't buy it. #Germanwings
#Germanwings Crash Settlements Are Likely to Vary by Passenger Nationality - #aviationlaw #montrealconvention http://t.co/MWM8nSEYwG
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
German prosecutors confirm #Germanwings pilot "had continued to see psychiatrists and neurologists until recently" http://t.co/ma1v9zeiIV
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
RT @MindframeMedia: MEDIA: tips when including #mentalillness in stories to avoid perpetuating #stigma http://t.co/W7RlJVe9Lq #Germanwings
#Germanwings plane crash in French Alps: First clues - CNN : http://t.co/AbMPbXFfjG
RT @MindframeMedia: MEDIA: Get to know the facts about  #mentalillness & avoiding  stigmatising stories http://t.co/ZDd7AFOAir #Germanwings
RT @michaelhallida4: Am I Mad Enough To Crash A Plane Into A Mountain? https://t.co/M9d5nlf4bM #auspol #Germanwings
It's a sick world! How can this happen? RT @Reuters #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/ryw6nTmTNF
RT @Reuters: #Germanwings co-pilot Andreas Lubitz had been treated for suicidal tendencies: http://t.co/p7wqBNvoEW http://t.co/KKAGnvXFDd
I suffer #depression too but I would never risk other people's life. #Germanwings

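The code below reads the file. It expands each URL and replaces the shortened URL with the expanded one. It also checks whether the URL points to an image. If it does not, the URL is replaced with the title of the web page; otherwise the URL is left as is. The code works fine except for one problem: the process takes far too long, which makes it unsuitable for documents containing thousands of tweets. How can I make it run faster?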

import codecs
from bs4 import BeautifulSoup
import urllib

output = codecs.open('tw1file.txt','w','utf-8')

with open('twfile.txt','r') as inputf:
    for line in inputf:
        try:
            list1 = line.split(' ')
            for i in range(len(list1)):
                a = list1[i]
                if "http" in list1[i]:
                    ##print list1[i]
                    response = urllib.urlopen(list1[i])
                    a = response.url
                    ##print a
                    if 'photo' in a:
                        ##print a                       
                        list1[i] = a + ' '
                        ##print list1[i]
                    else:

                        html = response.read()
                        soup = BeautifulSoup(html)
                        list1[i] = soup.html.head.title
                        t = str(list1[i])
                        list1[i] = t[8:-9] + ' '


                    list1[i] = ''.join(ch for ch in list1[i])
                else:
                    list1[i] = ''.join(ch for ch in list1[i])
            line = ' '.join(list1)
            print line
            output.write(line)
        except:
            pass


inputf.close()
output.close()

1 answer
User
#1 · Posted 2024-09-29 23:15:26

Possibly by buying more bandwidth.

See here: "Accurately measure time python function takes"

Then work out where the time is actually going. I would bet that most of the script's run time is spent downloading the web pages.
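As a rough illustration of that measurement step, the sketch below accumulates wall-clock time per phase. The names `timed`, `totals`, `fake_download` and `fake_parse` are hypothetical and not part of the question's script; the two `fake_*` functions only stand in for the real `urllib.urlopen(...)`/`response.read()` and `BeautifulSoup(html)` calls so the example runs without network access.

```python
import time
from collections import defaultdict

# Accumulated wall-clock seconds, keyed by phase name.
totals = defaultdict(float)

def timed(phase, func, *args, **kwargs):
    """Call func and add its wall-clock duration to totals[phase]."""
    start = time.time()
    try:
        return func(*args, **kwargs)
    finally:
        totals[phase] += time.time() - start

# Hypothetical stand-ins for the real work (assumptions, see lead-in).
def fake_download(url):
    time.sleep(0.05)  # simulate network latency
    return "<html><head><title>Example</title></head></html>"

def fake_parse(html):
    # Crude title extraction, just enough for the demonstration.
    return html[html.index("<title>") + 7:html.index("</title>")]

for url in ["http://t.co/a", "http://t.co/b", "http://t.co/c"]:
    html = timed("download", fake_download, url)
    title = timed("parse", fake_parse, html)

# Report where the time went, one line per phase.
for phase, seconds in sorted(totals.items()):
    print("%-8s %.3f s" % (phase, seconds))
```

If the download phase dominates in the real script as it does here, the program is waiting on the network rather than on the CPU, which is the case where worker threads can help.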

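If you spend a lot of time idle on the network (because the sites respond more slowly than your bandwidth allows), you could try putting the lines into a processing queue and letting a pool of worker threads do the actual work.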

See here: "Threading pool similar to the multiprocessing Pool?" (for example code that uses workers, see dgorissen's answer)
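A minimal sketch of the worker-thread idea, assuming the standard library's `multiprocessing.pool.ThreadPool` is acceptable; this is not necessarily the code from the linked answer. The names `expand_line`, `process_lines` and `fake_resolve` and the default of 20 workers are assumptions made for illustration. `fake_resolve` stands in for the question's download-and-extract-title step so the example runs offline.

```python
from multiprocessing.pool import ThreadPool

def expand_line(line, resolve):
    """Replace every token containing 'http' with resolve(token)."""
    tokens = line.rstrip("\n").split(" ")
    return " ".join(resolve(t) if "http" in t else t for t in tokens)

def process_lines(lines, resolve, workers=20):
    """Run expand_line over all lines on a pool of worker threads.

    pool.map returns results in input order, so the output keeps the
    same tweet order as the input file.
    """
    pool = ThreadPool(workers)
    try:
        return pool.map(lambda line: expand_line(line, resolve), lines)
    finally:
        pool.close()
        pool.join()

# Hypothetical stand-in for the real "open URL, return page title" step.
def fake_resolve(url):
    return "[title of %s]" % url

sample = [
    "Alps crash debate #Germanwings. http://t.co/Efe89rxwJG\n",
    "I suffer #depression too. #Germanwings\n",
]
for out_line in process_lines(sample, fake_resolve, workers=2):
    print(out_line)
```

In the real script, `resolve` would wrap the `urllib.urlopen` and `BeautifulSoup` logic from the question. Catching exceptions inside `resolve` and returning the original URL on failure would keep one bad link from discarding a whole tweet. Because the threads spend most of their time waiting on the network, they can overlap those waits even under Python's global interpreter lock.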
