Extracting text from links in Python

Posted 2024-10-02 08:16:28


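I have a Python 2.7 script that scrapes the table on this page: http://www.the-numbers.com/movie/budgets/all

I want to extract every column. The problem is that my code does not pick up the columns whose cells contain links (columns 2 and 3).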

budgeturl = "http://www.the-numbers.com/movie/budgets/all"
s = urllib.urlopen(budgeturl).read()
htmlpage = etree.HTML(s)
htmltable = htmlpage.xpath("//td[@class='data']/text()")

With this code, htmltable[0] is the rank, htmltable[1] is the production budget, and so on. For the columns I am missing, I need the text, not the link.


2 Answers

You need to change the XPath, because not all td elements have class='data'. Try this XPath expression instead: //td//text()

import urllib
from lxml import etree

budgeturl = "http://www.the-numbers.com/movie/budgets/all"
s = urllib.urlopen(budgeturl).read()
htmlpage = etree.HTML(s)
htmltable = htmlpage.xpath("//td//text()")
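
To see why the original expression skips the linked columns, here is a minimal sketch on a made-up one-row table. It is written for Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in for lxml, so it runs without extra packages. The SAMPLE markup and its href values are assumptions for illustration, not the real page source. The first list roughly mirrors //td[@class='data']/text(), the second mirrors //td//text() joined per cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two plain cells with class="data" and two cells
# whose text sits inside <a> links (the href values are made up).
SAMPLE = """<table>
<tr>
<td class="data">1</td>
<td><a href="/example/date">12/18/2009</a></td>
<td><b><a href="/example/movie">Avatar</a></b></td>
<td class="data">$425,000,000</td>
</tr>
</table>"""

root = ET.fromstring(SAMPLE)

# Only cells with class="data", and only their direct text:
# the linked cells are skipped entirely.
direct = [td.text for td in root.iter("td") if td.get("class") == "data"]

# Every cell, with the text of all descendants joined together:
# text nested inside <a> or <b> is included.
all_text = ["".join(td.itertext()).strip() for td in root.iter("td")]

print(direct)
print(all_text)
```

With lxml on the real page, //td//text() returns one flat list of text nodes rather than one string per cell, so some whitespace-only entries may need filtering.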


import urllib

budgeturl = "http://www.the-numbers.com/movie/budgets/all"
s = urllib.urlopen(budgeturl).read()

def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

s = find_between(s, '<table>', '</table>')

print s[:500]
print '.............................................................'
print s[-250:]

Find string between two substrings

Returns: the first 500 and the last 250 characters of the markup between <table> and </table> (the output was originally posted as screenshots).

I need the text not the link.

Via http://www.convertcsv.com/html-table-to-csv.htm

Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross
1,12/18/2009,Avatar,"$425,000,000","$760,507,625","$2,783,918,982"
8/5/2005,My Date With Drew,"$1,100","$181,041","$181,041"
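
If you would rather produce this kind of CSV yourself than use the online converter, a rough sketch with Python 3's csv module could look like the following. The row values are copied from the Avatar line above. It assumes the cell text has already been extracted into a list. The csv writer adds the quotes around the dollar amounts on its own because they contain commas.

```python
import csv
import io

# One table row as a list of already-extracted cell strings
# (values taken from the Avatar line in the CSV output above).
row = ["1", "12/18/2009", "Avatar",
       "$425,000,000", "$760,507,625", "$2,783,918,982"]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text, end="")

# Reading the line back yields the original fields, commas intact.
parsed = next(csv.reader(io.StringIO(csv_text)))
```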

You can do the same with BeautifulSoup; see:

beautifulSoup html csv
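
BeautifulSoup is a third-party package, so as a rough illustration of the same idea (walk the table, keep each cell's text, ignore the link markup) here is a sketch that uses only Python 3's standard html.parser module instead. The CellTextParser class and the SAMPLE markup are hypothetical names and data made up for this example. The real page may be structured differently.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text chunks of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> still arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up sample row; the href values are placeholders.
SAMPLE = (
    '<table><tr><td class="data">1</td>'
    '<td><a href="/example/date">12/18/2009</a></td>'
    '<td><b><a href="/example/movie">Avatar</a></b></td>'
    '<td class="data">$425,000,000</td></tr></table>'
)

parser = CellTextParser()
parser.feed(SAMPLE)
parser.close()
print(parser.rows)
```

Each entry of parser.rows is one table row as a list of strings, which can be passed straight to csv.writer.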
