Downloading posts from a Google Group before it is reset

#!/usr/bin/perl
# groups2csv.pl
# Google Groups results exported to CSV suitable for import into Excel.
# Usage: perl groups2csv.pl < groups.html > groups.csv

# The CSV Header.
print qq{"title","url","group","date","author","number of articles"\n};

# The base URL for Google Groups.
my $url = "http://groups.google.com";

# Rake in those results.
my($results) = (join '', <>);

# Perform a regular expression match to glean individual results.
while ( $results =~ m!<a href=(/groups[^\>]+?rnum=[0-9]+)>(.+?)</a>.*? <br>(.+?)<br>.*?<a href="?/groups.+?class=a>(.+?)</a> - (.+?) by (.+?)\s+.*?\(([0-9]+) article!mgis ) {
    my($path, $title, $snippet, $group, $date, $author, $articles) =
        ($1||'',$2||'',$3||'',$4||'',$5||'',$6||'',$7||'');
    $title =~ s!"!""!g;   # double escape " marks
    $title =~ s!<.+?>!!g; # drop all HTML tags
    print qq{"$title","$url$path","$group","$date","$author","$articles"\n\n};
}

1 answer

User

#1 · Posted 2024-10-01 11:28:08

Take a look at the HTTrack utility mentioned in this webapps question and this forum discussion.

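Note that I am assuming you do not actually want to filter and process the data, and only want a copy of the discussions for future reference.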

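Edit: If you really do want to screen scrape, you can do that too, but writing a script for it could be a significant time sink. Screen scraping is about extracting specific pieces of data from an HTML document, not about fetching the whole document. For example, if you were looking at a Jeopardy site and wanted to capture every question, its point value, who answered it correctly, which game it appeared in, and so on, you would probably need to screen scrape in order to insert that data into a database.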
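To make the distinction concrete, here is a minimal Perl sketch of screen scraping in the sense described above: it pulls a few named fields out of a page and ignores everything else. The HTML markup, the class names (`clue`, `value`, `question`, `correct`), the `data-game` attribute, and the helper names `extract_clues` and `csv_field` are all assumptions invented for illustration; a real site's markup will differ, and the patterns would have to be adapted to it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical markup, invented for this example (not taken from any real site).
my $html = <<'HTML';
<div class="clue" data-game="101">
  <span class="value">$200</span>
  <span class="question">This planet is known as the Red Planet</span>
  <span class="correct">Alice</span>
</div>
<div class="clue" data-game="101">
  <span class="value">$400</span>
  <span class="question">This language was created by Larry Wall</span>
  <span class="correct">Bob</span>
</div>
HTML

# Return one hash reference per clue block, holding only the fields of interest.
sub extract_clues {
    my ($page) = @_;
    my @clues;
    while ( $page =~ m{<div class="clue" data-game="(\d+)">(.*?)</div>}gs ) {
        my ( $game, $body ) = ( $1, $2 );
        my %clue = ( game => $game );
        for my $field (qw(value question correct)) {
            # Stays undef if the field is missing from this block.
            ( $clue{$field} ) = $body =~ m{<span class="$field">(.*?)</span>}s;
        }
        push @clues, \%clue;
    }
    return @clues;
}

# Quote a field for CSV output, doubling any embedded quote marks.
sub csv_field {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/"/""/g;
    return qq{"$text"};
}

# Emit one CSV row per clue, ready for import into a database or spreadsheet.
for my $clue ( extract_clues($html) ) {
    print join( ',', map { csv_field( $clue->{$_} ) } qw(game value question correct) ), "\n";
}
```

Regular expressions keep the sketch dependency-free and match the style of the script in the question, but they break easily when the markup changes. For anything beyond a one-off export, a proper HTML parser is likely the safer choice.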
