How to check the contents of a txt file

# Python program to
# demonstrate reading files
# using for loop
import re

file2 = open('contry.txt', 'w')
file3 = open('noncountry.txt', 'w')

# Opening file
file1 = open('myfile.txt', 'r')
count = 0
noncountrycount = 0
countrycounter = 0

# Using for loop
print("Using for loop")
for line in file1:
    count += 1
    pattern = re.compile(r'^\.\w{2}\s')
    if pattern.match(line):
        print(line)
        countrycounter += 1
    else:
        print("fail", line)
        noncountrycount += 1

print(noncountrycount)
print(countrycounter)
file1.close()
file2.close()
file3.close()

.aaa generic American Automobile Association, Inc.
.aarp generic AARP
.abarth generic Fiat Chrysler Automobiles N.V.
.abb generic ABB Ltd
.abbott generic Abbott Laboratories, Inc.
.abbvie generic AbbVie Inc.
.abc generic Disney Enterprises, Inc.
.able generic Able Inc.
.abogado generic Minds + Machines Group Limited
.abudhabi generic Abu Dhabi Systems and Information Centre
.ac country-code Internet Computer Bureau Limited
.academy generic Binky Moon, LLC
.accenture generic Accenture plc
.accountant generic dot Accountant Limited
.accountants generic Binky Moon, LLC
.aco generic ACO Severin Ahlmann GmbH & Co. KG
.active generic Not assigned
.actor generic United TLD Holdco Ltd.
.ad country-code Andorra Telecom
.adac generic Allgemeiner Deutscher Automobil-Club e.V. (ADAC)
.ads generic Charleston Road Registry Inc.
.adult generic ICM Registry AD LLC
.ae country-code Telecommunication Regulatory Authority (TRA)
.aeg generic Aktiebolaget Electrolux
.aero sponsored Societe Internationale de Telecommunications Aeronautique (SITA INC USA)

3 Answers

User

Answer 1 · edited 2024-09-30 16:41:40

Is this what you are looking for?

with open('lorem.txt') as file:
    data = file.readlines()

for line in data:
    temp = line.split()[0]
    if len(temp) == 3:
        print(temp)

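In short:

file.readlines() in this case returns a list of all the lines in the file, essentially splitting the file on \n.

Each of those lines is then split further on whitespace. Because the code you want comes first on the line, it is also the first item in the resulting list. All that remains is to check whether that first item is 3 characters long: your format appears to be very consistent, and only country codes have a length of 3 (the dot plus two letters).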
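The same idea can be sketched with a guard for blank lines, since line.split()[0] raises IndexError when a line is empty. The helper name first_tokens_of_length and the sample lines are assumptions made for illustration; they are not part of the answer above.

```python
def first_tokens_of_length(lines, length=3):
    """Return the first whitespace-separated token of each line
    when that token is exactly `length` characters long."""
    result = []
    for line in lines:
        parts = line.split()
        if not parts:  # blank line: nothing to inspect
            continue
        if len(parts[0]) == length:
            result.append(parts[0])
    return result


# Hypothetical sample input, including one blank line.
sample = [
    ".abb generic ABB Ltd\n",
    ".ac country-code Internet Computer Bureau Limited\n",
    "\n",
    ".ad country-code Andorra Telecom\n",
]
codes = first_tokens_of_length(sample)
print(codes)
```

With a real file, the list returned by file.readlines() could be passed in place of sample.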

User

Answer 2 · edited 2024-09-30 16:41:40

You are splitting on three spaces, but the code is followed by only a single space, so your logic is wrong.

>>> s = '.ac country-code    Internet Computer Bureau Limited'
>>> s.strip().split('   ')
['.ac country-code', ' Internet Computer Bureau Limited']
>>>

Check that the third character is not a space and that the fourth character is a space.

>>> if s[2] != ' ' and s[3] == ' ':
...     print(f'country code: {s[:3]}')
... else: print('NO')
...
country code: .ac
>>> s = '.abogado    generic Minds + Machines Group Limited'
>>> if s[2] != ' ' and s[3] == ' ':
...     print(f'country code: {s[:3]}')
... else: print('NO')
...
NO
>>>
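The index check from the session above could be wrapped in a small function. This is only a sketch: the name looks_like_country_code is hypothetical, and the length guard is an addition so that lines shorter than four characters return False instead of raising IndexError.

```python
def looks_like_country_code(line):
    """True when the third character is not a space and the fourth is."""
    if len(line) < 4:  # too short to hold '.xx' plus a separator
        return False
    return line[2] != ' ' and line[3] == ' '


print(looks_like_country_code('.ac country-code    Internet Computer Bureau Limited'))
print(looks_like_country_code('.abogado    generic Minds + Machines Group Limited'))
```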

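User
Answer 3 · edited 2024-09-30 16:41:40

The problem is often not only in the code, so we need the full context to reproduce, debug and solve it.

Encoding error

The final hint was the console output you pasted (the error and its stack trace).

Read the stack trace & research

This is how I read & analyze the error output (Python's stack trace):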

... C:/Users/tyler/Desktop ...

... findcountrycodes/Test.py", line 15 ...

... Python36\lib\encodings\cp1252.py ...

... UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032:

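From this output we can extract the context needed to research and solve the problem:

You are using Windows
Line 15 of the script Test.py points to the failing statement that reads the file: file1 = open('myfile.txt', 'r')
You are using Python 3.6, and the encoding currently in use is Windows-1252 (cp1252)
The root cause is a UnicodeDecodeError, a Python exception that often occurs when reading files

You can now:

Research this exception on Stack Overflow and the web: UnicodeDecodeError
Improve your question by adding this context (as keywords, tags, or the dump as plain output)

Try a different encoding

One answer suggests using UTF-8, which is common nowadays: open(filename, encoding="utf8")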
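A minimal sketch of why the encoding argument matters. It writes a small UTF-8 file into a temporary directory (the contents are made up for illustration), then reads it back once as cp1252 and once as UTF-8. The byte 0x90 has no mapping in Python's cp1252 codec, which is the kind of failure reported in the stack trace.

```python
import os
import tempfile

# Hypothetical sample: the Cyrillic letter U+0410 is the bytes D0 90 in UTF-8.
text = ".ad country-code Andorra Telecom \u0410\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the wrong codec fails on byte 0x90.
    cp1252_failed = False
    bad_byte = None
    try:
        with open(path, "r", encoding="cp1252") as f:
            f.read()
    except UnicodeDecodeError as exc:
        cp1252_failed = True
        bad_byte = exc.object[exc.start]
        print(f"cp1252 could not decode byte {bad_byte:#x}")

    # Reading with the codec the file was written in succeeds.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

print(decoded == text)
```

Whether UTF-8 is the right choice for myfile.txt is still an assumption until the file's real encoding has been checked.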

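Detect the file encoding

A methodical approach is:

Check the encoding or charset of the file, for example with an editor such as Windows Notepad
Open the file in your Python code with the appropriate encoding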
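When no editor is at hand, one rough alternative is to try a few candidate encodings and keep the first one that decodes without error. This is a heuristic sketch only: the function name guess_encoding and the candidate list are assumptions, and a successful decode does not prove that the encoding is the correct one.

```python
def guess_encoding(raw_bytes, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes raw_bytes, or None."""
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical samples encoded in two different ways.
utf8_sample = "Soci\u00e9t\u00e9 \u0410".encode("utf-8")
cp1252_sample = "Soci\u00e9t\u00e9".encode("cp1252")
print(guess_encoding(utf8_sample))
print(guess_encoding(cp1252_sample))
```

For a real file, the bytes could come from open(path, 'rb').read(), and the returned name could then be passed as the encoding argument of open().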

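Filtering country codes

So you only want the lines with country-codes.

Expected filtering

The following 3 lines of the input file are then expected to pass the filter: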

.ad country-code    Andorra Telecom
.ac country-code    Internet Computer Bureau Limited
.ae country-code    Telecommunication Regulatory Authority (TRA)

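Solution using a regular expression

As you already do, test each line of the file. Test whether the line starts with these 4 characters: a dot, two letters and a space, as in ".xx " (where xx can be any two ASCII letters).

Regular expression explained

This regular expression tests for a valid two-letter country code: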

^\.\w{2}\s

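^ from the start of the string (the line)
\. the (first) character must be a dot
\w{2} (followed by) any two word characters (⚠️ also matches digits and the underscore, e.g. _0)
\s (followed by) one whitespace character (space, tab, etc.)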
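To see the caveat about \w in practice, the sketch below compares the pattern above with a stricter variant that accepts only lowercase ASCII letters. The stricter pattern [a-z]{2} is a suggested alternative and an assumption about the data, not part of the original code.

```python
import re

loose = re.compile(r'^\.\w{2}\s')
strict = re.compile(r'^\.[a-z]{2}\s')  # assumption: codes are lowercase ASCII

samples = ['.ad country-code', '._0 bogus', '.abb generic']
for s in samples:
    # Print whether each pattern accepts the sample line.
    print(repr(s), bool(loose.match(s)), bool(strict.match(s)))
```

Only the loose pattern accepts '._0 bogus'; both reject '.abb generic' because the fourth character is not whitespace.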

Python code

In your code this is done as follows (assuming line is populated from the lines read):

import re

line = '.ad '
pattern = re.compile(r'^\.\w{2}\s')
if pattern.match(line):
    print('found country-code')

A runnable demo is available on IDEone.

Further reading

Filter list with regex
Python 3 documentation: Regular Expression HOWTO
Bharath Sivakumar on Medium (2020): Extracting Words from a string in Python using the “re” module
koenwoortman's blog (2020): Remove None values from a list in Python
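Putting the pieces together, a sketch of the whole filter might look like the following. It assumes the input is UTF-8 and that matching lines should go to one output file and all other lines to a second one, which the question's unused file2 and file3 handles seem to hint at. The function names split_country_lines and filter_file are hypothetical.

```python
import re

COUNTRY_PATTERN = re.compile(r'^\.\w{2}\s')


def split_country_lines(lines):
    """Partition lines into (country_lines, other_lines) using the regex."""
    country, other = [], []
    for line in lines:
        if COUNTRY_PATTERN.match(line):
            country.append(line)
        else:
            other.append(line)
    return country, other


def filter_file(in_path, country_path, other_path, encoding="utf-8"):
    """Read in_path and write matching and non-matching lines to two files."""
    with open(in_path, "r", encoding=encoding) as src:
        country, other = split_country_lines(src)
    with open(country_path, "w", encoding=encoding) as dst:
        dst.writelines(country)
    with open(other_path, "w", encoding=encoding) as dst:
        dst.writelines(other)
    return len(country), len(other)
```

With the file names from the question, a call could look like filter_file('myfile.txt', 'contry.txt', 'noncountry.txt'). The with blocks close every file even if an exception is raised partway through.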