First, I extracted some text from a file using the following code:
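from collections import Counter

def n_gram_opcodes(source, n):
    # Read the whole file; the with-block closes it afterwards
    with open(source) as f:
        text = f.read()
    OPCODES = set(["add", "call", "cmp", "mov", "jnz", "jmp", "jz", "lea", "pop", "push",
                   "retn", "sub", "test", "xor"])
    source_words = text.split()
    # Keep only the tokens that are known opcode mnemonics
    opcodes = [w for w in source_words if w in OPCODES]
    # Zip n shifted copies of the list to form the n-grams, then count them
    return Counter(zip(*[opcodes[i:] for i in range(n)]))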
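To see what the zip idiom on the last line produces, here is a minimal sketch that works on an in-memory list instead of a file. The helper name ngram_counts and the sample opcode list are made up for illustration and are not part of the original code.

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Hypothetical helper: same zip idiom as n_gram_opcodes, minus the file I/O.
    # tokens[i:] for i in 0..n-1 gives n shifted copies; zip stops at the shortest.
    return Counter(zip(*[tokens[i:] for i in range(n)]))

# Made-up opcode sequence, purely for illustration
sample = ["mov", "mov", "push", "mov", "mov", "push"]
bigrams = ngram_counts(sample, 2)
print(bigrams)
```

Each key is a tuple of n consecutive opcodes and each value is how many times that tuple occurred, so the result can be used like any other dictionary.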
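The code also counts how often certain words (here, opcode n-grams) occur in the file and stores them in dictionary format, with each n-gram as a key and its frequency as the value.
Given such a dictionary, I want to take the values (the frequencies) and use them in the log-likelihood formula below. My question is how to modify the code so that it can take the values from any dictionary of this kind and use them with the code below. The final result should return numbers and plot a graph with matplotlib.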
import math

# The placeholder value for 0 counts
epsilon = 0.0001

def opcode_llr(opcode, freq_table_before, freq_table_after):
    '''
    Args:
        opcode: A single opcode mnemonic, e.g., 'mov'
        freq_table_before: The frequency table for opcode trigrams *before*
            extraction.
        freq_table_after: The frequency table for opcode trigrams *after*
            extraction.

    The keys for both tables are tuples of string. So, each is of the form
        {
            ('mov', 'mov', 'mov'): 5.0,
            ('mov', 'jmp', 'mov'): 7.0,
            ...
        }
    '''
    t_b = len(freq_table_before) or epsilon
    t_a = len(freq_table_after) or epsilon
    # Compute the opcode counts when occurring in positions 0, 1, 2
    opcode_counts = [epsilon, epsilon, epsilon]
    for triplet in freq_table_after.keys():
        for i, comp in enumerate(triplet):
            if comp == opcode:
                opcode_counts[i] += 1
    f1 = opcode_counts[0]
    f2 = opcode_counts[1]
    f3 = opcode_counts[2]
    return (f1 + f2 + f3) * math.log(float(t_b) / t_a)
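Note that opcode_llr as written only looks at the keys: the totals come from len() and each match adds 1, so the stored frequencies are never read. One possible way to use the values is sketched below. It assumes the totals should be the sums of the frequencies and that each trigram should contribute its frequency once per matching position. Both are guesses at the intended formula, and opcode_llr_weighted is a hypothetical name.

```python
import math

EPSILON = 0.0001  # placeholder for zero counts, same value as the original epsilon

def opcode_llr_weighted(opcode, freq_table_before, freq_table_after):
    # Assumption: totals are the sums of the stored frequencies, not len() of the dict.
    t_b = sum(freq_table_before.values()) or EPSILON
    t_a = sum(freq_table_after.values()) or EPSILON
    # Start from one epsilon per position (three positions), as the original does.
    weight = 3 * EPSILON
    for ngram, count in freq_table_after.items():
        # Add the n-gram's frequency once for every position holding the opcode.
        weight += count * sum(1 for comp in ngram if comp == opcode)
    return weight * math.log(float(t_b) / t_a)

# Made-up tables in the same shape as the docstring example
before = {("mov", "mov", "mov"): 5.0, ("mov", "jmp", "mov"): 7.0}
after = {("mov", "mov", "mov"): 2.0, ("mov", "jmp", "mov"): 1.0}
score_mov = opcode_llr_weighted("mov", before, after)
score_jmp = opcode_llr_weighted("jmp", before, after)
```

Because the function only calls .values() and .items(), it accepts a plain dict or the Counter returned by n_gram_opcodes.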
This is a general way to compute the LLR from a Counter.
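For the plotting part of the question, a minimal matplotlib sketch could look like the following. The scores shown are invented for illustration, sorted_scores and plot_scores are hypothetical helper names, and the plot itself requires matplotlib to be installed.

```python
def sorted_scores(scores):
    # Order opcodes from highest to lowest score so the bars are easy to compare.
    items = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    labels = [name for name, _ in items]
    values = [value for _, value in items]
    return labels, values

def plot_scores(scores):
    # matplotlib is imported here so the rest of the sketch runs without it.
    import matplotlib.pyplot as plt
    labels, values = sorted_scores(scores)
    fig, ax = plt.subplots()
    ax.bar(labels, values)  # one bar per opcode
    ax.set_xlabel("opcode")
    ax.set_ylabel("LLR score")
    plt.show()

# Made-up scores purely for illustration; real ones would come from the LLR function.
scores = {"mov": 11.09, "jmp": 1.39, "push": 4.16}
labels, values = sorted_scores(scores)
```

Calling plot_scores(scores) would then open a bar chart with one bar per opcode.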