How do I extract http or https after tokenization?

Posted 2024-09-30 01:33:55


I have a text file that contains text like this:

>  because she s the worst 
    i am referring to  this   http  iimgurcom5srylmijpg  does it have any deeper meaning or does it signify anything  i just do nt get it why she d do that 
    cheating but zoldycks must have a great time at thanksgiving 
     kurosaki ichigo    http  images5fanpopcomimagephotos29000000ichigowallpaperkurosakiichigo290694271024768jpg  and  kurosaki mea   http  staticzerochannetkurosakimeafull1689483jpg 
    there are a shit ton of koutarous  but the presence of  one   https  smediacacheak0pinimgcomoriginals1219ed1219ed717fc2bfce372759bba2fe1cfegif  is enough to make it the most interesting party.

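Because the spacing is inconsistent, I extract the tokens by first collapsing runs of whitespace into a single space, using the following command: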

words = re.sub(r'\s+', ' ', sentence).strip()
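
For reference, a minimal sketch of that normalization step followed by a split into tokens. The `sentence` value is a fragment of the sample text above, and the `tokens` name is only an illustration, not part of my actual script.

```python
import re

# Fragment of the sample text; the irregular spacing is kept on purpose.
sentence = "  i am referring to  this   http  iimgurcom5srylmijpg  does it have any deeper meaning "

# Collapse every run of whitespace into one space, then trim both ends.
words = re.sub(r"\s+", " ", sentence).strip()

# After normalization, splitting on a single space yields the tokens.
tokens = words.split(" ")
```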

Now I only want to get the http or https tokens, because, as you can see, the text contains no well-formed URLs.

I tried the pattern (http|https)\s, but it did not work.

Is there another way to do this?
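
One possible approach, sketched under the assumption that `http` and `https` always appear as standalone whitespace-separated tokens, is to match them with word boundaries (`\bhttps?\b`) and `re.findall`. The `text` value below is built from two fragments of the sample above; the variable names `schemes`, `pairs`, and `from_tokens` are illustrative only.

```python
import re

# Two fragments of the sample text, joined into one string.
text = (
    "i am referring to  this   http  iimgurcom5srylmijpg  does it have any deeper meaning "
    "but the presence of  one   https  "
    "smediacacheak0pinimgcomoriginals1219ed1219ed717fc2bfce372759bba2fe1cfegif  is enough"
)

# \b on both sides keeps the match to whole words, so a token that merely
# starts with "http" is not picked up; "s?" makes the trailing "s" optional.
schemes = re.findall(r"\bhttps?\b", text)

# Optionally capture the token that follows each scheme (the mangled URL body).
pairs = re.findall(r"\b(https?)\s+(\S+)", text)

# A regex-free alternative: split on whitespace and keep the exact tokens.
from_tokens = [t for t in text.split() if t in {"http", "https"}]

print(schemes)
print(pairs)
```

Note that `re.match` only matches at the start of the string, so `re.findall` or `re.search` is the safer choice when the token can appear anywhere in the line.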

