Python: extracting URLs from a text file that has no HTML tags

http://saiconference.com/ficc2018/submit http://52.21.30.170/sendy/unsubscribe/qhiz2s763l892rkps763chacs52ieqkagf8rbueme9n763jv6da/hs1ph7xt5nvdimnwwfioya/qg0qteh7cllbw8j6amo892ca> https://www.youtube.com/watch?v=gvwyoqnztpy> http://saiconference.com/ficc http://saiconference.com/ficc> http://saiconference.com/ficc2018/submit>

1 answer

User

#1 · Posted on 2024-09-30 06:10:52

Quick solution, assuming '>' is the only unwanted character at the end: url.rstrip('>')

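rstrip removes the trailing character from a single string, however many times it repeats at the end. So you have to loop over the list and strip the character from each URL.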
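As a minimal sketch of that loop, assuming the URLs have already been split into a list (the variable names `urls` and `clean_urls` are illustrative, and the values are taken from the question's sample):

```python
# Sample values from the question; some carry a trailing '>'.
urls = [
    "http://saiconference.com/ficc2018/submit>",
    "https://www.youtube.com/watch?v=gvwyoqnztpy>",
    "http://saiconference.com/ficc",
]

# str.rstrip('>') works on one string at a time and removes every
# trailing '>', so it is applied to each item of the list.
clean_urls = [url.rstrip(">") for url in urls]
print(clean_urls)
```

URLs that do not end in '>' pass through unchanged, so the same call is safe for the whole list.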

Edit: I just got access to a PC with Python, so here is a regex answer that I have tested.

import re

def extractURLs(fileContent):
    # Match http/https URLs case-insensitively instead of lowercasing the text,
    # because URL paths and query strings can be case-sensitive.
    # The range $-_ inside the character class also matches '>', which is why
    # a trailing '>' ends up in the match and has to be stripped below.
    urls = re.findall(r'https?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', fileContent, re.IGNORECASE)
    cleanUrls = []
    for url in urls:
        lastChar = url[-1]  # get the last character
        # If the last character is not a letter, a digit, or a '/'
        # (some URLs end with one; add your own allowed characters as needed),
        # drop it.
        if re.match(r'[^a-zA-Z0-9/]', lastChar):
            cleanUrls.append(url[:-1])  # strip exactly one trailing character
        else:
            cleanUrls.append(url)  # otherwise keep the URL as it is
    print(cleanUrls)
    return cleanUrls

URLs = extractURLs("http://saiconference.com/ficc2018/submit>")

However, if there is only one such character to remove, using .rstrip() is simpler.
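For comparison, here is a rough sketch of that simpler route applied to raw text. It assumes the URLs are separated by whitespace and that '>' is the only stray trailing character; the function name `extract_urls_simple` and the pattern `https?://\S+` are illustrative choices, not part of the answer above.

```python
import re

def extract_urls_simple(text):
    """Return the URLs found in text, with any trailing '>' removed."""
    # \S+ takes everything up to the next whitespace, so a trailing '>'
    # is captured along with the URL and stripped afterwards.
    candidates = re.findall(r"https?://\S+", text)
    return [url.rstrip(">") for url in candidates]

sample = (
    "http://saiconference.com/ficc> "
    "https://www.youtube.com/watch?v=gvwyoqnztpy> "
    "http://saiconference.com/ficc2018/submit"
)
print(extract_urls_simple(sample))
```

This keeps the original letter case of each URL, but it would also keep other trailing punctuation such as a period or comma, which the regex answer handles.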
