<p><strong>Option 1: <code>.csv</code>, <code>.txt</code> files</strong></p>
<p>Native Python cannot read <code>.xls</code> files. If you convert the file to <code>.csv</code> or <code>.txt</code>, you can use the <code>csv</code> module from the Standard Library:</p>
<pre><code># `csv` module, Standard Library
import csv
filepath = "./test.csv"
with open(filepath, "r") as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)  # skip 'A', 'B'
    items = set()
    for line in reader:
        line = [word.replace(" ", "") for word in line if word]
        line = filter(str.strip, line)
        items.update(line)

print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
</code></pre>
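<p>To try the same idea without a file on disk, here is a minimal sketch that wraps the logic in a function and feeds it in-memory text. The function name <code>unique_cells</code> and the sample data are hypothetical, chosen only for illustration, and the cleaning step is simplified slightly (cells that are empty after removing spaces are dropped).</p>

```python
import csv
import io


def unique_cells(f, delimiter=","):
    """Return the set of non-empty cell values, ignoring the header row."""
    reader = csv.reader(f, delimiter=delimiter)
    next(reader, None)  # skip the header row, if there is one
    items = set()
    for row in reader:
        # Remove spaces inside each cell, then drop cells that end up empty
        cleaned = (cell.replace(" ", "") for cell in row)
        items.update(cell for cell in cleaned if cell)
    return items


# Hypothetical sample data standing in for test.csv
sample = io.StringIO("A,B\nAFD, TTT\nuyete,\nAFD,lkd\n")
result = unique_cells(sample)
print(sorted(result))
# ['AFD', 'TTT', 'lkd', 'uyete']
```

<p>Sorting is used only to make the printed order predictable; a <code>set</code> itself has no guaranteed order.</p>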
<hr/>
<p><strong>Option 2: <code>.xls</code>, <code>.xlsx</code> files</strong></p>
<p>If you want to keep the original <code>.xls</code> format, you must install a <a href="http://www.python-excel.org/" rel="nofollow noreferrer">third-party module</a> to read it.</p>
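<p>Install <code>xlrd</code> from the command prompt, e.g. with <code>pip install xlrd</code>. Note that <code>xlrd</code> 2.0 and later read only <code>.xls</code> files; for <code>.xlsx</code> you need an older <code>xlrd</code> or another library such as <code>openpyxl</code>.</p>
<p>In Python:</p>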
<pre><code># `xlrd` module, third-party
import itertools
import xlrd
filepath = "./test.xls"
with xlrd.open_workbook(filepath) as workbook:
    worksheet = workbook.sheet_by_index(0)  # assumes first sheet
    rows = (worksheet.row_values(i) for i in range(1, worksheet.nrows))
    cells = itertools.chain.from_iterable(rows)
    items = list({val.replace(" ", "") for val in cells if val})

print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
</code></pre>
<hr/>
<p><strong>Option 3: DataFrames</strong></p>
<p>You can handle csv and text files with pandas DataFrames. <a href="http://pandas.pydata.org/pandas-docs/stable/io.html" rel="nofollow noreferrer">See documentation</a> for other formats.</p>
<pre><code>import pandas as pd
import numpy as np
# Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
filepath = "./test2.txt"
# Determines columns from the first line, so add commas in text file, else may throw an error
# pandas >= 1.3 uses on_bad_lines="skip"; older versions used error_bad_lines=False
df = pd.read_csv(filepath, sep=",", header=None, on_bad_lines="skip")
df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True) # remove special chars
stack = df.stack()
clean_df = pd.Series(stack.unique())
clean_df
</code></pre>
<p>Output</p>
<pre><code>0 India1
1 India2
2 myIndia
3 Where
4 Here
5 India
6 uyete
7 AFD
8 TTT
dtype: object
</code></pre>
<p>Save as a file</p>
<pre><code># Save as .txt or .csv without index, optional
# target = "./output.csv"
target = "./output.txt"
clean_df.to_csv(target, index=False)
</code></pre>
<p>Note: the results from Options 1 and 2 can also be converted to an unordered pandas columnar object with <code>pd.Series(list(items))</code>.</p>
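<p>The results from Options 1 and 2 are unordered because they pass through a <code>set</code>. If first-seen order matters, one possible alternative (not part of the options above) is to deduplicate with <code>dict.fromkeys</code>, which relies on dictionaries keeping insertion order in Python 3.7+. The helper name <code>unique_in_order</code> and the sample list are hypothetical.</p>

```python
def unique_in_order(values):
    """Drop duplicates while keeping the order in which values first appear."""
    # dict keys are unique and, in Python 3.7+, preserve insertion order
    return list(dict.fromkeys(values))


# Hypothetical cell values, already cleaned
cells = ["AFD", "TTT", "uyete", "AFD", "lkd", "TTT"]
print(unique_in_order(cells))
# ['AFD', 'TTT', 'uyete', 'lkd']
```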
<p><strong>Finally: As a script</strong></p>
<p>Save any of the three options above in a function (<code>stack</code>) within a file named <code>restack.py</code>. Save this script to a directory.</p>
<pre><code># restack.py
import pandas as pd
import numpy as np
def stack(filepath, save=False, target="./output.txt"):
    # Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
    # Determines columns from the first line, so add commas in text file, else may throw an error
    # pandas >= 1.3 uses on_bad_lines="skip"; older versions used error_bad_lines=False
    df = pd.read_csv(filepath, sep=",", header=None, on_bad_lines="skip")
    df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)  # remove special chars
    stacked = df.stack()
    clean_df = pd.Series(stacked.unique())
    if save:
        clean_df.to_csv(target, index=False)
        print("Your results have been saved to '{}'".format(target))
    return clean_df

if __name__ == "__main__":
    # Set up input prompts
    msg1 = "Enter path to input file e.g. ./test.txt: "
    msg2 = "Save results to a file? y/[n]: "
    try:
        # Python 2
        fp = raw_input(msg1)
        result = raw_input(msg2)
    except NameError:
        # Python 3
        fp = input(msg1)
        result = input(msg2)
    if result.startswith("y"):
        save = True
    else:
        save = False
    print(stack(fp, save=save))
</code></pre>
<p>From its working directory, run the script via the command line. Answer the prompts:</p>
<pre><code>> python restack.py
Enter path to input file e.g. ./test.txt: ./@data/test2.txt
Save results to a file? y/[n]: y
Your results have been saved to './output.txt'
</code></pre>
<p>Your results should print in your console and, optionally, be saved to a file <code>output.txt</code>. Adjust any parameters to suit your needs.</p>
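<p>If you would rather pass the parameters on the command line than answer prompts, a rough sketch using the Standard Library's <code>argparse</code> could look like the following. The function name <code>parse_args</code> and the flag names <code>--save</code> and <code>--target</code> are assumptions made for this example; they are not part of <code>restack.py</code> as written above.</p>

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Stack unique cell values from a delimited file."
    )
    parser.add_argument("filepath", help="path to the input file, e.g. ./test.txt")
    parser.add_argument("--save", action="store_true",
                        help="write the results to a file")
    parser.add_argument("--target", default="./output.txt",
                        help="output path used together with --save")
    return parser.parse_args(argv)


# Simulate `python restack.py ./test.txt --save` without touching sys.argv
args = parse_args(["./test.txt", "--save"])
print(args.filepath, args.save, args.target)
# ./test.txt True ./output.txt
```

<p>The parsed values could then be handed to the function, e.g. <code>stack(args.filepath, save=args.save, target=args.target)</code>.</p>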