Matching the words in a list against the words in a line

Published 2024-06-01 21:51:31


Below are two examples of the many lines I need to analyze and extract specific words from:

[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl


[37.786221300000001, -122.1965002] 6 2011-08-28 19:55:26 I wish I could lay up with the love of my life And watch cartoons all day.

Ignore the coordinates and the numbers.

The goal is to find how many words in each tweet line appear in this keyword list:

['hate', 1]
['hurt', 1]
['hurting', 1]
['like', 5]
['lonely', 1]
['love', 10]

and, at the same time, to find the sum of the values of the keywords found in each tweet line (e.g. ['love', 10]).

For example, for this sentence:

'I hate to feel lonely at times'

the sum of the sentiment values hate = 1 and lonely = 1 equals 2, and the word count of the line is 7.
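The per-line computation described above can be sketched as follows (a minimal sketch, assuming the keyword list has been read into a dict; `score_line` is a hypothetical helper name):

```python
# Keyword -> sentiment value, built from the keyword list above.
keywords = {'hate': 1, 'hurt': 1, 'hurting': 1, 'like': 5, 'lonely': 1, 'love': 10}

def score_line(line):
    """Return (word_count, sentiment_sum) for one line of text."""
    words = line.split()
    sentiment = sum(keywords.get(w.lower(), 0) for w in words)
    return len(words), sentiment

print(score_line('I hate to feel lonely at times'))  # (7, 2)
```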

I have tried list-to-list approaches, and even looping over every sentence and keyword, but none of them worked, because there are many tweets and keywords, so I need a loop-based approach to find the values.

What I want to find is the sentiment value of the keywords found in each line, and how many words each line contains.

Thanks in advance for your insight!! :)

My code:

try:
    KeywordFileName=input('Input keyword file name: ')
    KeywordFile = open(KeywordFileName, 'r')
except FileNotFoundError:
    print('The file you entered does not exist or is not in the directory')
    exit()
keywords = {}                                 # keyword -> sentiment value
KeyLine = KeywordFile.readline()
while KeyLine != '':
    fields = KeyLine.rstrip().split(',')      # avoid shadowing the built-in `list`
    fields[1] = int(fields[1])
    keywords[fields[0]] = fields[1]
    print(fields)
    KeyLine = KeywordFile.readline()          # read the next line last, so none is skipped

try:
    TweetFileName = input('Input Tweet file name: ')
    TweetFile = open(TweetFileName, 'r')
except FileNotFoundError:
    print('The file you entered does not exist or is not in the directory')
    exit()

TweetLine = TweetFile.readline()
while TweetLine != '':
    TweetLine = TweetLine.rstrip()
    # ... per-line analysis goes here ...
    TweetLine = TweetFile.readline()          # read the next line last, so none is skipped
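One way to finish the loop above (a sketch, assuming the keyword values were collected into a dict named `keywords`, and that every tweet line begins with five space-separated fields: the two bracketed coordinates, a number, a date and a time):

```python
keywords = {'hate': 1, 'lonely': 1, 'love': 10}  # built from the keyword file

def analyze(tweet_line):
    """Return (word_count, sentiment_sum), skipping coordinates, number, date, time."""
    words = tweet_line.split()[5:]               # drop the first five fields
    sentiment = sum(keywords.get(w.lower(), 0) for w in words)
    return len(words), sentiment

line = ('[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 '
        'Sometimes I wish my life was a movie; #unreal I hate the fact '
        'I feel lonely surrounded by so many ppl')
print(analyze(line))  # (21, 2)
```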

2 Answers

The simplest way is to use `word_tokenize` from the nltk library on each tweet:

from nltk.tokenize import word_tokenize
import collections
import re

# Sample text from above
s = '[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl'
num_regex = re.compile(r"[+-]?\d+(?:\.\d+)?")
# Removing the numbers from the text
s = num_regex.sub('',s)
# Tokenization
tokens = word_tokenize(s)
# Counting the words
fdist = collections.Counter(tokens)
print(fdist)
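Once the `Counter` is built, the sentiment sum follows by weighting each keyword's count by its value (a sketch assuming the keyword list is held in a dict):

```python
import collections

tokens = ['I', 'hate', 'the', 'fact', 'I', 'feel', 'lonely']   # e.g. from word_tokenize
fdist = collections.Counter(tokens)

keywords = {'hate': 1, 'hurt': 1, 'lonely': 1, 'love': 10}
sentiment = sum(value * fdist[word] for word, value in keywords.items())
print(sentiment)  # 2
```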


If your tweets are in a .txt like this file, and the tweet lines follow the pattern described in the question, you can try the following:

import re
import json
pattern = r'\d{2}:\d{2}:\d{2}\s([a-zA-Z].+)'
sentiment_dict = {'hate': 1, 'hurt': 1, 'hurting': 1, 'like': 5, 'lonely': 1, 'love': 10}

final = []
with open('senti.txt', 'r+') as f:
    for line in f:
        match = re.finditer(pattern, line)
        for find in match:
            if find.group(1).split():
                final.append(find.group(1).split())

line = []

for item in final:
    final_dict = {}

    for sub_item in item:
        if sub_item in sentiment_dict:
            if sub_item not in final_dict:
                final_dict[sub_item] = [sentiment_dict.get(sub_item)]
            else:
                final_dict[sub_item].append(sentiment_dict.get(sub_item))

    line.append((item, len(item), {key: sum(value) for key, value in final_dict.items()}))

result = json.dumps(line, indent=2)

print(result)

Output:

[
  [
    [
      "Sometimes",       # the tweet line as a list of words
      "I",
      "wish",
      "my",
      "life",
      "was",
      "a",
      "movie;",
      "#unreal",
      "I",
      "hate",
      "the",
      "fact",
      "I",
      "feel",
      "lonely",
      "surrounded",
      "by",
      "so",
      "many",
      "ppl"
    ],
    21,                   # word count of the tweet
    {
      "lonely": 1,        # sentiment value summed per keyword
      "hate": 1
    }
  ],
  [
    [
      "I",
      "wish",
      "I",
      "could",
      "lay",
      "up",
      "with",
      "the",
      "love",
      "of",
      "my",
      "life",
      "And",
      "watch",
      "cartoons",
      "all",
      "day."
    ],
    17,
    {
      "love": 10
    }
  ],
  [
    [
      "I",
      "hate",
      "to",
      "feel",
      "lonely",
      "at",
      "times"
    ],
    7,
    {
      "lonely": 1,
      "hate": 1
    }
  ]
]

Options for regex if one pattern doesn't work for your file:

  • r'[a-zA-Z].+' #if you use this change find.group(1) to find.group()

  • r'(?<=\d.\s)[a-zA-Z].+' #if you use this change find.group(1) to find.group()

  • r'\d{2}:\d{2}:\d{2}\s([a-zA-Z].+)'

  • r'\b\d{2}:\d{2}:\d{2} (.+)' #group(1)
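To check which variant fits your file, you can try the patterns on a sample line (a quick sketch using the first and third patterns above):

```python
import re

line = ('[37.786221300000001, -122.1965002] 6 2011-08-28 19:55:26 '
        'I wish I could lay up')

# Time-anchored pattern: group(1) holds the tweet text.
m = re.search(r'\d{2}:\d{2}:\d{2}\s([a-zA-Z].+)', line)
print(m.group(1))   # I wish I could lay up

# Simpler variant: the whole match starts at the first letter on the line.
m2 = re.search(r'[a-zA-Z].+', line)
print(m2.group())   # I wish I could lay up
```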
