我想将下面的文本文件转换成一个以列为标题,以行为对应数据的文本文件

2024-09-19 23:34:08 发布

您现在位置:Python中文网/ 问答频道 /正文

下载后,我所有的数据都在文本文件(.txt)中。我有多个这样的文件(至少70000个),我想我会先用cmd中的copy.txt将它们合并成一个mastertext文件。然后,我想分析mastertext文件,以便理想的输出是.csv文件

为了便于分析,需要将数据转换为类似表的结构。每个文件(在我拥有的70000.txt文件中)都以“[”开头,以“]”结尾,每一行新数据都以“{”开头,以“}”结尾,我有多行格式相同的数据。我希望转换后的数据具有诸如“活动ID”之类的标题,以便我可以使用数据透视分析列(就像在excel中所做的那样)。我是编程新手,不确定Python中的哪些库可以帮助我实现我的目标

这是我所拥有的数据的一个例子。(注意——一个文件中只有两行,我有大约70000个这样的文件)

{
    "campaignId": "all",
    "startDate": "2020-06-11",
    "endDate": "2020-06-11",
    "device": "Computers",
    "network": "Display Network",
    "channel": "all",
    "accLevelQS": -1.0,
    "impressions": 389,
    "clicks": 3,
    "ctr": 0.0,
    "avgCPC": 0.0,
    "convValuePerClick": 0.0,
    "convValuePerCost": 0.0,
    "costConv1PerClick": 0.0,
    "convRate1PerClick": 0.0,
    "cost": 0.142884,
    "conv1PerClick": 0.0,
    "totalConvValue": 0.00,
    "allConversions": 0.0,
    "allConversionValue": 0.00,
    "avgPosition": 0.0,
    "intr": 3,
    "searchImprShare": 0.0,
    "contImprShare": 5.0,
    "impressionShare": 5.0
},
{
    "campaignId": "all",
    "startDate": "2020-06-11",
    "endDate": "2020-06-11",
    "device": "Mobile devices with full browsers",
    "network": "Display Network",
    "channel": "all",
    "accLevelQS": -1.0,
    "impressions": 6101,
    "clicks": 90,
    "ctr": 0.0,
    "avgCPC": 0.0,
    "convValuePerClick": 0.0,
    "convValuePerCost": 0.0,
    "costConv1PerClick": 0.0,
    "convRate1PerClick": 0.0,
    "cost": 4.342799,
    "conv1PerClick": 0.0,
    "totalConvValue": 0.00,
    "allConversions": 0.0,
    "allConversionValue": 0.00,
    "avgPosition": 0.0,
    "intr": 90,
    "searchImprShare": 0.0,
    "contImprShare": 5.0077566465021217,
    "impressionShare": 5.0077566465021217
}

Tags: 文件数据txtdevicedisplay结尾channelnetwork
2条回答

{}中包含的每个文本实际上都是dict类型。使用pd.DataFrame可以轻松地将此类型转换为pandas,最后使用to_excel转换为excel

完整的代码可以编写如下:

import pandas as pd

my_dict_1={
  "campaignId": "all", "startDate":"2020-06-11", "endDate":"2020-06-11", "device":"Computers", "network":"Display Network", "channel":"all", "accLevelQS":-1.0, "impressions":389, "clicks":3, "ctr":0.0, "avgCPC":0.0, "convValuePerClick":0.0, "convValuePerCost":0.0, "costConv1PerClick":0.0, "convRate1PerClick":0.0, "cost":0.142884, "conv1PerClick":0.0, "totalConvValue":0.00, "allConversions":0.0, "allConversionValue":0.00, "avgPosition":0.0, "intr":3, "searchImprShare":0.0, "contImprShare":5.0, "impressionShare":5.0
}
my_dict_2={
  "campaignId": "all", "startDate":"2020-06-11", "endDate":"2020-06-11", "device":"Mobile devices with full browsers", "network":"Display Network", "channel":"all", "accLevelQS":-1.0, "impressions":6101, "clicks":90, "ctr":0.0, "avgCPC":0.0, "convValuePerClick":0.0, "convValuePerCost":0.0, "costConv1PerClick":0.0, "convRate1PerClick":0.0, "cost":4.342799, "conv1PerClick":0.0, "totalConvValue":0.00, "allConversions":0.0, "allConversionValue":0.00, "avgPosition":0.0, "intr":90, "searchImprShare":0.0, "contImprShare":5.0077566465021217, "impressionShare":5.0077566465021217
}
all_dict=[my_dict_1,my_dict_2]
df1 = pd.DataFrame(all_dict)
df1.to_excel("output.xlsx",index = False)

I have made some assumptions as the data you have provided seems like a list of JSON data. This is because there is a "," in between your entries.

这是我的input.json文件。请注意,我在顶部和底部添加了[and],因为这将为您的数据提供适当的JSON结构

[
  {
    "campaignId": "all",
    "startDate": "2020-06-11",
    "endDate": "2020-06-11",
    "device": "Computers",
    "network": "Display Network",
    "channel": "all",
    "accLevelQS": -1.0,
    "impressions": 389,
    "clicks": 3,
    "ctr": 0.0,
    "avgCPC": 0.0,
    "convValuePerClick": 0.0,
    "convValuePerCost": 0.0,
    "costConv1PerClick": 0.0,
    "convRate1PerClick": 0.0,
    "cost": 0.142884,
    "conv1PerClick": 0.0,
    "totalConvValue": 0.00,
    "allConversions": 0.0,
    "allConversionValue": 0.00,
    "avgPosition": 0.0,
    "intr": 3,
    "searchImprShare": 0.0,
    "contImprShare": 5.0,
    "impressionShare": 5.0
  },
  {
    "campaignId": "all",
    "startDate": "2020-06-11",
    "endDate": "2020-06-11",
    "device": "Mobile devices with full browsers",
    "network": "Display Network",
    "channel": "all",
    "accLevelQS": -1.0,
    "impressions": 6101,
    "clicks": 90,
    "ctr": 0.0,
    "avgCPC": 0.0,
    "convValuePerClick": 0.0,
    "convValuePerCost": 0.0,
    "costConv1PerClick": 0.0,
    "convRate1PerClick": 0.0,
    "cost": 4.342799,
    "conv1PerClick": 0.0,
    "totalConvValue": 0.00,
    "allConversions": 0.0,
    "allConversionValue": 0.00,
    "avgPosition": 0.0,
    "intr": 90,
    "searchImprShare": 0.0,
    "contImprShare": 5.0077566465021217,
    "impressionShare": 5.0077566465021217
  }
]

下面的代码使用pandas库将数据处理为数据帧,然后将其写入CSV文件

import json  # Available by default, no install required.
import glob  # Available by default, no install required.
import pandas as pd  # Requires installation via pip.

# Initialise a list to store our results.
combined_json = []

# Set a glob pattern to *.txt since your files are txt files.
# You can also write the full path e.g. /home/user/textfiles/*.txt
text_files = glob.glob("*.txt")

# Loop through all the text files and combine them into a single JSON list.
# As for 70,000 files, I am unsure how the performance will turn out.
for json_text in text_files:
    with open(json_text, 'r') as text_file:
        combined_json.extend(json.load(text_file))

# Write all the files to a JSON file. For your future usage.
# You also can read directly from the combed_json variable.
with open('input.json', 'w') as json_file:
    json.dump(combined_json, json_file, indent=2)

# Convert the JSON data into a dataframe, using the combined_json variable.
json_df = pd.json_normalize(combined_json)

# Write the data from the dataframe to the CSV file.
# Mode "w" will always overwrite the CSV file, use mode "a" to append text instead of overwriting.
json_df.to_csv("dataframe.csv", mode="w")

有关pd.json_normalize如何工作的更多信息,您可以参考here

要开始使用pandas库,可以参考here

如果希望从字符串而不是文件加载JSON,可以引用here

要了解有关glob的更多信息,您可以参考here

相关问题 更多 >