Splitting Series data

Posted 2024-10-06 11:26:24

I have a DataFrame built from API calls. I call the API 120 times, and each call appends a 1000x31 dataset.

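import json

import pandas as pd
import requests

def load_full2(times):
    # url_2, data_two and headers are defined elsewhere in the script
    dfs = []
    item_count = 0
    # Note: item_count starts at 0, so `<=` sends times + 1 requests
    while item_count <= times:
        response = requests.post(url_2, data=json.dumps(data_two), headers=headers)
        response_json = response.json()
        # pd.json_normalize replaces the deprecated pd.io.json.json_normalize
        result = pd.json_normalize(response_json['hits']['hits'])
        item_count += 1
        dfs.append(result)

    # Combine all pages into one frame and export it
    df = pd.concat(dfs, ignore_index=True)
    df.to_csv("export2.csv", encoding='utf-8', index=False)
    return df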

My final exported dataset looks like this:

120000 x 31

id    _index    _score     _source.agent    _source.cookie                                                                                                                                  .source.id    _source.log    _source.keys    _source.name    _source.category    _source.class    _source.companyid    _source.cname    _source.ip    _source.method    _source.process    _source.skid    _source.severity    _source.sysname    _source.template    _source.time    _source.country    _source.event    _source.hostname    _source.ipip    _source.namespace    _source.refer    _source.request_url    _source.type
n/a    n/a      n/a        n/a              __cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_path=google.com       n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a       https://google.com/au/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE

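I'm mainly interested in the '_source.cookie' and '_source.request_url' columns. My goal is to add two new columns to my dataset. The first, gclid_from_cookie, will hold the value that follows gclid= and ends at ;. The second, gclid_from_url, will hold the value from _source.request_url that follows gclid= or click_id=.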

My expected output looks like this:

120000 x 33

id    ...    _source.cookie    ...    _source.request_url   gclid_from_cookie      gclid_from_url 
1     ...    c1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_pat     ...    pn/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE    EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE       CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE
2     ...    c1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_pat     ...    to/?click_type=gclid&click_id=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&click_     EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE        CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE
...

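I'm fairly new to programming and not sure how to proceed or how to code this. Should I split the strings in the two columns I care about inside each loop iteration while building the file, or split them after the full file has been compiled?

The second problem is that in the _source.request_url column the value appears after either gclid= or click_id=. So I'm not sure how to split the string when the value may sit behind either key or be missing entirely.

When I try to split the string, I get the error AttributeError: 'Series' object has no attribute 'split'

I'd really appreciate your help.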


2 Answers

The DataFrame is still a bit hard to read, but take the following example:

df = pd.DataFrame({'_source.request_url': ['https://google.com/au/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ 5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE', 'https://google.com/au/?click_id=CjwKCAiAlO7uBRANEiwA_vXQ 5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE', 'no match example'], 
                   '_source.cookie': ['__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE;', '__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE;', None]})

To extract the string between = and ;, you can use the regex pattern r'=(.+?);'.
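As a quick check of what that generic pattern does, here is a minimal sketch run against the sample cookie string from the example DataFrame. It suggests why the key name (gclid) is worth including in the pattern: the non-greedy group stops at the first ;, so without a key it picks up whichever value comes first.

```python
import re

# Sample cookie string taken from the example DataFrame above
cookie = ("__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; "
          "gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE;")

# Generic pattern: captures the first value between any '=' and the next ';'
generic = re.search(r'=(.+?);', cookie).group(1)

# Anchoring on the key name captures the gclid value specifically
specific = re.search(r'gclid=(.+?);', cookie).group(1)

print(generic)   # d118f225fac35345d9e1d87e533b596ec1574680126
print(specific)  # EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE
```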

import re

def get_glid_from_source(pattern, data):
    # str() lets None/NaN cells go through re.search without raising
    result = re.search(pattern, str(data))
    if result is not None:
        return result.group(1)
    return None

# (?:gclid|click_id) is an alternation; [gclid|click_id] would be a character class matching one character
df['glid_from_url'] = df.apply(lambda x: get_glid_from_source(r'(?:gclid|click_id)=([^&#]+)', x['_source.request_url']), axis=1)
df['gclid_from_cookie'] = df.apply(lambda x: get_glid_from_source(r'gclid=(.+?)[;%&]', x['_source.cookie']), axis=1)

If there is no match in the data, re.search returns None, so you have to catch that case with if result is not None.

The output DataFrame is:

    _source.request_url _source.cookie                  glid_from_url                                                                                           gclid_from_cookie
0   https://google.com/au/?gclid=CjwKCAiAlO7uBRANE...   __cfduid=d118f225fac35345d9e1d87e533b596ec1574...   CjwKCAiAlO7uBRANEiwA_vXQ 5YOAD-mFNQFuM0dbd7lH...   EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEg...
1   https://google.com/au/?click_id=CjwKCAiAlO7uBR...   __cfduid=d118f225fac35345d9e1d87e533b596ec1574...   CjwKCAiAlO7uBRANEiwA_vXQ 5YOAD-mFNQFuM0dbd7lH...   EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEg...
2   no match example                                    None                                                None                                                None

This example works when each value contains a single match. If there can be several matches and you want to capture all of them, use re.findall(pattern, data) instead.
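A minimal sketch of the difference between re.search and re.findall, assuming a made-up URL that carries the parameter twice (the URL and its values are hypothetical, not from the asker's data):

```python
import re

# Hypothetical URL with the same parameter appearing twice
url = "https://example.com/?gclid=AAA111&foo=bar&gclid=BBB222"

pattern = r"gclid=([^&]+)"

# re.search stops at the first match
first = re.search(pattern, url).group(1)

# re.findall returns every captured group as a list (empty if nothing matches)
all_matches = re.findall(pattern, url)

print(first)        # AAA111
print(all_matches)  # ['AAA111', 'BBB222']
```

Since re.findall returns an empty list when nothing matches, there is no None to guard against in that case.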

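  1. It is better to do this after the DataFrame has been created.
  2. You cannot call string methods directly on a pd.Series; you have to go through its .str accessor, like this: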

    df['str_col'].str.split(':')
    

For example:
Suppose you have a DataFrame like this:

data = {'Name':['Tom:bar', 'nick:bar', 'krish:bar', 'jack:bar'], 'Age':[20, 21, 19, 18]} 

# Create DataFrame 
df = pd.DataFrame(data) 
print(df)
[Out]:
        Name   Age
0    Tom:bar   20
1   nick:bar   21
2  krish:bar   19
3   jack:bar   18

You can create a new column with the following:

df['bar_col'] = [x.split(':')[1] for x in df.Name]
print(df)
[Out]:
        Name  Age  bar_col
0    Tom:bar   20  bar
1   nick:bar   21  bar
2  krish:bar   19  bar
3   jack:bar   18  bar
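The same .str accessor could also be applied to the two columns from the question. The following is only a sketch: the rows are shortened, made-up stand-ins for the real data, and Series.str.extract is swapped in for split because it leaves NaN for rows with no match or with missing values instead of raising.

```python
import pandas as pd

# Shortened, hypothetical stand-ins for the asker's two columns
df = pd.DataFrame({
    "_source.request_url": [
        "https://google.com/au/?gclid=AAA111",
        "https://google.com/to/?click_type=gclid&click_id=BBB222&click_x=1",
        "no match example",
    ],
    "_source.cookie": [
        "__cfduid=d118; gclid=CCC333; full_path=google.com",
        "__cfduid=d118; gclid=CCC333; full_path=google.com",
        None,
    ],
})

# expand=False returns a Series when the pattern has one capture group.
# The value is taken after either gclid= or click_id=, up to the next '&'.
df["gclid_from_url"] = df["_source.request_url"].str.extract(
    r"(?:gclid|click_id)=([^&]+)", expand=False
)

# In the cookie the value runs from gclid= up to the next ';'
df["gclid_from_cookie"] = df["_source.cookie"].str.extract(
    r"gclid=([^;]+)", expand=False
)

print(df[["gclid_from_url", "gclid_from_cookie"]])
```

Because this runs on whole columns, it fits the advice in point 1: build the complete DataFrame first, then derive the two new columns in one pass.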
