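<p>If you don't mind the extra dependency, I would use <code>pandas</code> or <code>numpy</code>. With a <a href="http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.html" rel="nofollow"><code>DataFrame</code></a> you can run an <a href="http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.isin.html#pandas.DataFrame.isin" rel="nofollow"><code>isin</code></a> check on its columns. Otherwise I would suggest using sets, since a regex should be much slower than a set membership test. Something like this:</p>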
<pre class="lang-python prettyprint-override"><code># Load each noun list into a set for fast membership tests.
with open(colA_file, "rb") as file_h:
    noun_a = set(line.strip() for line in file_h)
with open(colB_file, "rb") as file_h:
    noun_b = set(line.strip() for line in file_h)

# Copy only the lines whose first two fields appear in the respective sets.
with open(output_file, "wb") as outfile:
    with open(input_file, "rb") as opened_input:
        for line in opened_input:
            split_line = line.split()
            # Skip blank or short lines instead of raising IndexError.
            if len(split_line) >= 2 and split_line[0] in noun_a and split_line[1] in noun_b:
                outfile.write(line)
</code></pre>
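<p>For comparison, here is a minimal sketch of the <code>pandas</code> route mentioned above. The sample data, the column names <code>a</code>, <code>b</code>, <code>count</code>, and the assumption that the input is whitespace-separated with no header row are all made up for illustration; it uses <code>Series.isin</code> on each column rather than <code>DataFrame.isin</code> on the whole frame.</p>

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the two noun lists and the input file.
nouns_a = {"cat", "dog"}
nouns_b = {"house", "tree"}
pairs_text = "cat house 3\ndog river 1\nbird tree 7\ndog tree 2\n"

# Read whitespace-separated columns; header=None because there is no header row.
df = pd.read_csv(
    io.StringIO(pairs_text),
    sep=r"\s+",
    header=None,
    names=["a", "b", "count"],
)

# Series.isin yields one boolean mask per column; keep rows where both match.
mask = df["a"].isin(nouns_a) & df["b"].isin(nouns_b)
filtered = df[mask]

print(filtered.to_string(index=False, header=False))
```

<p>With a real file you would pass its path to <code>pd.read_csv</code> instead of the <code>StringIO</code> object and write the result out with <code>filtered.to_csv</code>. Whether this beats the plain set version depends on file size, so it is worth timing both.</p>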