Python Pandas: how to save the output to CSV

Published 2024-10-03 09:17:49

Hello, I am working on my project. I want to find text block candidates using the algorithm below.

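My input is a CSV document that contains:

HTML: the HTML code in one line
TAG: the tag of the HTML code in one line
Words: the text inside the tag in one line
TC: the number of words in one line
LTC: the number of anchor-text words in one line
TG: the number of tags in one line
P: the number of P and br tags in one line
CTTD: TC + (0.2 * LTC) + TG - P
CTTDs: the smoothed CTTD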
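
The CTTD formula in the list above can be evaluated column-wise in pandas. This is only a minimal sketch: it assumes the columns are named TC, LTC, TG and P as listed, and the values are made up for illustration. The smoothing step that produces CTTDs is not described in the post, so it is left out.

```python
import pandas as pd

# Hypothetical per-line counts; the values are made up for illustration.
df = pd.DataFrame({
    "TC": [12, 0, 30],   # words in the line
    "LTC": [5, 0, 0],    # anchor-text words in the line
    "TG": [3, 1, 2],     # tags in the line
    "P": [1, 0, 2],      # P and br tags in the line
})

# CTTD = TC + (0.2 * LTC) + TG - P, evaluated row by row
df["CTTD"] = df["TC"] + 0.2 * df["LTC"] + df["TG"] - df["P"]
print(df)
```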

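This is my algorithm for finding candidate text blocks. I use pandas to turn the CSV file into a dataframe, and I use the CTTDs, TC and TG columns to find the candidates.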

from ListSmoothing import get_filepaths_smoothing
import pandas as pd
import numpy as np
import csv

filenames = get_filepaths_smoothing(r"C:\Users\kimhyesung\PycharmProjects\newsextraction\smoothing")
index = 0
for f in filenames:
    # Read each smoothed CSV straight from its path; pandas opens and closes the file itself.
    df = pd.read_csv(f)
    # df = pd.read_csv('smoothing/Smoothing001.csv')

    news = np.array(df['CTTDs'])
    new = np.array(df['TG'])

    minval = np.min(news[np.nonzero(news)])
    maxval = np.max(news[np.nonzero(news)])

    j = 0.2
    thetaCTTD = minval + j * (maxval-minval)
    # maxGap = np.max(new[np.nonzero(new)])
    # minGap = np.min(new[np.nonzero(new)])
    thetaGap = np.min(new[np.nonzero(new)])
    #print thetaCTTD
    #print maxval
    #print minval
    #print thetaGap
    def create_candidates(df, thetaCTTD, thetaGAP):
        # Scan the rows and collect {k: {'start': ..., 'end': ...}} for each candidate block.
        k = 0
        TB = {}
        TC = 0
        for index in range(0, len(df) - 1):
            start = index
            # df.ix has been removed from pandas; df.loc[row, column] reads the same cell.
            if df.loc[index, 'CTTDs'] > thetaCTTD:
                start = index
                gap = 0
                TC = df.loc[index, 'TC']
                for index in range(index + 1, len(df) - 1):
                    if df.loc[index, 'TG'] == 0:
                        continue
                    elif df.loc[index, 'CTTDs'] <= thetaCTTD and gap >= thetaGAP:
                        break
                    elif df.loc[index, 'CTTDs'] <= thetaCTTD:
                        gap += 1
                    TC += df.loc[index, 'TC']
            if (TC < 1) or (start == index):
                continue
            TB[k] = {'start': start, 'end': index - 1}
            k += 1
        return TB

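    def get_unique_candidate(tb):
        # Copy every entry so that trimming a block does not modify the input dictionary.
        TB = {key: dict(value) for key, value in tb.items()}
        for key, value in tb.items():
            if key == len(tb) - 1:
                break
            if value['end'] == tb[key+1]['end']:
                # The next block ends on the same row, so drop it as a duplicate.
                del TB[key+1]
            elif value['start'] < tb[key+1]['start'] < value['end']:
                # The next block starts inside this one, so cut this one short.
                TB[key]['end'] = tb[key+1]['start'] - 1
            else:
                continue
        return TB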

    index += 1
    stored_file = "textcandidate/textcandidate" + '{0:03}'.format(index) + ".csv"
    tb = create_candidates(df, thetaCTTD, thetaGap)
    TB = get_unique_candidate(tb)
    df_list = []
    for (k, d) in TB.items():
        # .loc slices by label, so row d['end'] is included; copy before adding a column.
        candidate_df = df.loc[d['start']:d['end']].copy()
        candidate_df['candidate'] = k
        df_list.append(candidate_df)
    output_df = pd.concat(df_list)
    # to_csv opens and closes the file itself, so no separate open() or csv.writer is needed.
    output_df.to_csv(stored_file)

thetaCTTD is 10.36 and thetaGap is 1.
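
For readers who want to see where two such threshold values come from, the sketch below repeats the threshold step of the code above on a tiny made-up dataframe. The column names CTTDs and TG come from the question; the numbers are invented and do not reproduce the 10.36 reported here.

```python
import numpy as np
import pandas as pd

# Made-up smoothed CTTD values and tag counts, for illustration only.
df = pd.DataFrame({
    "CTTDs": [0.0, 2.0, 12.0, 7.0, 0.0],
    "TG": [0, 3, 1, 2, 0],
})

cttds = df["CTTDs"].to_numpy()
tg = df["TG"].to_numpy()

# Ignore zero entries when looking for the extremes, as the question's code does.
minval = cttds[np.nonzero(cttds)].min()
maxval = cttds[np.nonzero(cttds)].max()

j = 0.2  # fraction of the (max - min) range added on top of the minimum
theta_cttd = minval + j * (maxval - minval)
theta_gap = tg[np.nonzero(tg)].min()

print(theta_cttd, theta_gap)
```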

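The output is:

The output means that there are 2 text block candidates. The first candidate starts at line 215 and ends at line 225 (as in the picture below). The other candidate starts at line 500 and ends at line 501.

My question is: how can I save the output to CSV so that it shows not only the line numbers, but also the range of each text block together with the other columns?

The output I expect looks like this screenshot of candidate text blocks.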

1 answer

#1 · Posted 2024-10-03 09:17:49

Assuming your output TB is a dictionary whose values are dictionaries with 'start' and 'end' keys:

pd.concat([df.loc[d['start']:d['end']] for (k, d) in TB.items()])

Note that we slice by label, so d['end'] will be included.
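
The difference between label-based and position-based slicing is easy to check on a small frame. This is a sketch with made-up data; the column name TC is borrowed from the question.

```python
import pandas as pd

df = pd.DataFrame({"TC": [10, 20, 30, 40, 50]})  # default integer index 0..4

by_label = df.loc[1:3]      # label-based: rows 1, 2 and 3 (end included)
by_position = df.iloc[1:3]  # position-based: rows 1 and 2 only (end excluded)

print(len(by_label), len(by_position))
```

With the default integer index the labels and positions coincide, which is why the only visible difference is whether the end row is kept.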

Edit: adding the candidate number in a new column.

Writing a loop is simpler than performing two concat operations:

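df_list = []
for (k, d) in TB.items():
    candidate_df = df.loc[d['start']:d['end']].copy()
    candidate_df['candidate'] = k
    df_list.append(candidate_df)
output_df = pd.concat(df_list)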

Concatenating all the dataframes once at the end is also faster.
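
Putting the pieces together, the following sketch goes from a TB-style dictionary to CSV text. The dataframe, the TB contents and the index label "line" are made-up stand-ins rather than the asker's real data, and the CSV is written to an in-memory buffer so the example runs anywhere.

```python
import io
import pandas as pd

# Made-up stand-ins for the question's dataframe and TB dictionary.
df = pd.DataFrame({
    "Words": ["a", "b", "c", "d", "e", "f"],
    "TC": [1, 4, 5, 0, 6, 2],
})
TB = {0: {"start": 1, "end": 2}, 1: {"start": 4, "end": 5}}

frames = []
for k, d in TB.items():
    # Copy the slice so that adding a column does not touch df itself.
    block = df.loc[d["start"]:d["end"]].copy()
    block["candidate"] = k
    frames.append(block)

# One concat at the end instead of growing a dataframe inside the loop.
output_df = pd.concat(frames)

# Write to an in-memory buffer here; pass a file path instead to write to disk.
buffer = io.StringIO()
output_df.to_csv(buffer, index_label="line")
csv_text = buffer.getvalue()
print(csv_text)
```

The original row numbers survive as the index of output_df, so the saved file shows the row range of each candidate next to all the other columns.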
