How to avoid splitting names that contain spaces in Python

Posted 2024-09-29 23:33:19


I am trying to parse commodity pricing data from http://agmarknet.nic.in/ and store it in my database.

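I get the data in the form Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500, then split it with split() and store it in the DB. But some names contain spaces, so split() breaks those apart as well, giving:

['Ambala', 'Cantt.', '1.2', 'Bitter', 'Gourd', '1200', '2000', '1500']

But I want it to be: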

['Ambala Cantt.', '1.2', 'Bitter Gourd', '1200', '2000', '1500']

I iterate over the data in a for-each loop and split each item. To solve this problem I tried a regular expression:

 ([c.strip() for c in re.match(r"""
        (?P<market>[^0-9]+)
        (?P<arrivals>[^ ]+)
        (?P<variety>[^0-9]+)
        (?P<min>[0-9]+)
        \ (?P<max>[0-9]+)
        \ (?P<modal>[0-9]+)""",
        example,
        re.VERBOSE
    ).groups()])

If I write example = "Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500", the code block above works fine. But if I put it inside the for-each loop over y, for example:

([c.strip() for c in re.match(r"""
    (?P<market>[^0-9]+)
    (?P<arrivals>[^ ]+)
    (?P<variety>[^0-9]+)
    (?P<min>[0-9]+)
    \ (?P<max>[0-9]+)
    \ (?P<modal>[0-9]+)""",
    example,
    re.VERBOSE
).groups()])

I get an attribute error: AttributeError: 'NoneType' object has no attribute 'groups'. My code looks like this:

    params = urllib.urlencode({'cmm': 'Bitter gourd', 'mkt': '', 'search': ''})
    headers = {'Cookie': 'ASPSESSIONIDCCRBQBBS=KKLPJPKCHLACHBKKJONGLPHE; ASP.NET_SessionId=kvxhkhqmjnauyz55ult4hx55; ASPSESSIONIDAASBRBAS=IEJPJLHDEKFKAMOENFOAPNIM','Origin': 'http://agmarknet.nic.in', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6','Upgrade-Insecure-Requests': '1','User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36', 'Content-Type': 'application/x-www-form-urlencoded','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Cache-Control': 'max-age=0','Referer': 'http://agmarknet.nic.in/mark2_new.asp','Connection': 'keep-alive'}
    conn = httplib.HTTPConnection("agmarknet.nic.in")
    conn.request("POST", "/SearchCmmMkt.asp", params, headers)
    response = conn.getresponse()
    data = response.read()
    soup = bs(data, "html.parser")
    #print dir(soup)
    z = []
    y = []
    w = []
    x1 = []
    test = []
    trs = soup.findAll("tr")
    for tr in trs:
        c = unicodedata.normalize('NFKD', tr.text)
        y.append(str(c))
    for x in y:
        #data1 = "Ambala 1.2 Onion 1200 2000 1500"
        x1 =    ([c.strip() for c in re.match(r"""
            (?P<market>[^0-9]+)
            (?P<arrivals>[^ ]+)
            (?P<variety>[^0-9]+)
            (?P<min>[0-9]+)
            \ (?P<max>[0-9]+)
            \ (?P<modal>[0-9]+)""",
            x,
            re.VERBOSE
        ).groups()])
    print x1

Can anyone help me get the data as ['Ambala Cantt.', '1.2', 'Bitter Gourd', '1200', '2000', '1500'] rather than ['Ambala', 'Cantt.', '1.2', 'Bitter', 'Gourd', '1200', '2000', '1500']?
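On the AttributeError itself: re.match returns None whenever a string does not fit the pattern, and calling .groups() on None raises exactly this error. It seems likely that some of the tr rows (a header row or an empty row, for instance) do not have the market / arrivals / variety / prices shape. The following is a minimal sketch of guarding against that, written so it also runs on Python 3. The sample rows and the names ROW_RE and parse_rows are made up for illustration; the real page may contain other row shapes.

```python
import re

# Same pattern as in the question, compiled once outside the loop.
ROW_RE = re.compile(r"""
    (?P<market>[^0-9]+)
    (?P<arrivals>[^ ]+)
    (?P<variety>[^0-9]+)
    (?P<min>[0-9]+)
    \ (?P<max>[0-9]+)
    \ (?P<modal>[0-9]+)""", re.VERBOSE)


def parse_rows(rows):
    """Return the stripped fields of every row that matches; skip the rest."""
    parsed = []
    for row in rows:
        m = ROW_RE.match(row)
        if m is None:
            # Not a data row (header, blank line, ...): skip instead of crashing.
            continue
        parsed.append([g.strip() for g in m.groups()])
    return parsed


# Hypothetical sample rows, not taken from the real site.
rows = [
    "Market Arrivals Variety Minimum Maximum Modal",
    "",
    "Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500",
    "Ambala 1.2 Onion 1200 2000 1500",
]
result = parse_rows(rows)
```

Collecting the parsed rows in a list also keeps every match, whereas the loop in the question overwrites x1 on each iteration and prints only the last one.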


1 Answer
User
#1 · Posted 2024-09-29 23:33:19
Use the shlex module:

import re
import shlex

l = "Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500"
# first put quotes around word pairs
l = re.sub(r'([A-Z]\w+\s+\w+)',r'"\1"',l)
# then split with shlex, it will not split inside the quoted strings
l = shlex.split(l)

['Ambala Cantt.', '1.2', 'Bitter Gourd', '1200', '2000', '1500']

You can also run it as a one-liner:

result = shlex.split(re.sub(r'([A-Z]\w+\s+\w+)',r'"\1"',"Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500"))
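Note that the quoting regex pairs a capitalized word with whatever word follows it, so it relies on every name having exactly two words. With a single-word market such as "Ambala 1.2 Onion ...", it would pair "Ambala" with the "1" that follows. A different technique, sketched below, is to split on whitespace and then merge adjacent non-numeric tokens. This is not the answer's method, and it assumes that name tokens never start with a digit while numeric fields always do; split_keeping_names is a hypothetical helper name.

```python
def split_keeping_names(line):
    """Split on whitespace, then merge runs of adjacent non-numeric tokens."""
    fields = []
    prev_was_text = False
    for token in line.split():
        # Assumption: a token is "numeric" if its first character is a digit.
        is_text = not token[0].isdigit()
        if is_text and prev_was_text:
            # Continue the current name, e.g. "Ambala" + "Cantt."
            fields[-1] += " " + token
        else:
            fields.append(token)
        prev_was_text = is_text
    return fields


two_word = split_keeping_names("Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500")
one_word = split_keeping_names("Ambala 1.2 Onion 1200 2000 1500")
```

Because the merge is driven by token type rather than by capitalization or word count, names of one, two, or more words are handled the same way, as long as a numeric field separates the market from the variety.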
