Parses VCDB JSON data into a pandas DataFrame, and provides summary functions and basic enumeration plotting.

Detailed description of the verispy Python project


verispy

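This is a Python package built for working with VERIS data. The package has two main purposes:
  1. Allow the user to extract VERIS JSON objects into a pandas DataFrame structure. The most likely source of VERIS data is the VERIS Community Database (VCDB).
  2. Provide some basic data-analysis functionality for the DataFrame. This includes calculating the count and frequency of a given enumeration and plotting a simple horizontal bar chart.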
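
To make the first purpose concrete, here is a minimal sketch of how a nested VERIS-style JSON record could be flattened into dotted boolean column names such as `action.malware.variety.Ransomware`. The function name `flatten_enumerations` and the sample record are hypothetical and only illustrate the idea; this is not verispy's own implementation.

```python
def flatten_enumerations(record, prefix=""):
    """Map a nested VERIS-style dict to dotted column names.

    Lists of strings are treated as enumeration values and become
    boolean columns; other scalars are kept under their dotted key.
    Illustration only, not verispy's implementation.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_enumerations(value, name))
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            for item in value:
                flat[f"{name}.{item}"] = True
        else:
            flat[name] = value
    return flat


# Hypothetical, heavily trimmed incident record.
incident = {
    "action": {"malware": {"variety": ["Ransomware"]}},
    "victim": {"state": "NJ"},
}

columns = flatten_enumerations(incident)
print(columns)
```

Columns that never appear in a record would be filled with False when many records are combined into one table.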

Installation

To install this package, git clone this repository and then run

python -m pip install <path>/verispy/

Or you can simply use:

pip install verispy

You will also need to download the VCDB data:

git clone https://github.com/vz-risk/VCDB.git

Loading the data

Once installed, creating a VERIS object is simple. We just need the path to the VCDB JSON directory:

In [1]: from verispy import VERIS

In [2]: data_dir = '../VCDB/data/json/validated/'

In [3]: v = VERIS(json_dir=data_dir)

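We may wish to verify that the VERIS schema URL is correct. Note that the schema comes from the VERIS repo on GitHub. If you cannot connect to the Internet every time you run verispy, you can also download the schema and point to the local copy with the schema_path argument of the json_to_df function.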

In [4]: v.schema_url
Out[4]: 'https://raw.githubusercontent.com/vz-risk/veris/master/verisc-merged.json'
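
For offline use, one possible approach is to download the schema once and reuse the local file afterwards. The helper below is a sketch using only the standard library; `ensure_local_schema` is a hypothetical name, and the commented usage assumes that schema_path accepts a local file path, as the paragraph above suggests.

```python
import os
import urllib.request

SCHEMA_URL = "https://raw.githubusercontent.com/vz-risk/veris/master/verisc-merged.json"


def ensure_local_schema(dest, url=SCHEMA_URL):
    """Return dest, downloading the schema there only if the file is missing."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


# Assumed usage (needs verispy and, on the first call, network access):
# schema_path = ensure_local_schema("verisc-merged.json")
# veris_df = v.json_to_df(schema_path=schema_path)
```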

Then we can load the VERIS data from the JSON files and assign it to a DataFrame:

In [5]: veris_df = v.json_to_df(verbose=True)
Loading schema
Loading JSON files to DataFrame.
Finished loading JSON files to dataframe.
Building DataFrame with enumerations.
Done building DataFrame with enumerations.
Post-Processing DataFrame (A4 Names, Victim Industries, Patterns)
Finished building VERIS DataFrame

Inspecting the data

We may then want to inspect the DataFrame:

In [6]: veris_df.shape
Out[6]: (7839, 2315)

In [7]: veris_df.head()
Out[7]:
   action.Environmental  action.Error  ...  victim.state                       victim.victim_id
0                 False         False  ...            NJ                        C.R. Bard, Inc.
1                 False          True  ...           NaN   British Columbia Ministry of Finance
2                 False         False  ...           NaN                                    NaN
3                 False         False  ...            FL                 Camberwell High School
4                 False         False  ...           NaN  Loyalist Certification Services Exams

[5 rows x 2315 columns]

Do a quick value count on one of the enumerations:

In [8]: veris_df['action.malware.variety.Ransomware'].value_counts()
Out[8]:
False    7716
True      123
Name: action.malware.variety.Ransomware, dtype: int64

Most of the enumerations are True/False values.

To see a dictionary of the enumerations, look at the enumerations attribute of the VERIS object:

In [9]: len(v.enumerations)
Out[9]: 68

In [10]: import pprint

In [11]: pprint.pprint(v.enumerations)
{'action.environmental.variety': ['Deterioration',
                                  'Earthquake',
                                  'EMI',
                                  'ESD',
                                  'Fire',
                                  'Flood',
                                  'Hazmat',
                                  'Humidity',
                                  'Hurricane',
                                  'Ice',
                                  'Landslide',
                                  'Leak',
                                  'Lightning',
                                  'Meteorite',
                                  'Particulates',
                                  'Pathogen',
                                  'Power failure',
                                  'Temperature',
                                  'Tornado',
                                  'Tsunami',
                                  'Vermin',
                                  'Volcano',
                                  'Wind',
                                  'Other',
                                  'Unknown'],
 'action.error.variety': ['Capacity shortage',
                          'Classification error',
                          'Data entry error',
                          'Disposal error',
                          'Gaffe',
                          'Loss',
 ... # many more lines

Analysis

The enum_summary function (get enumerations with confidence interval) is the main analysis function in verispy.

We can look at top-level enumerations:

In [12]: v.enum_summary(veris_df, 'action')
Out[12]:
            enum     x       n     freq
0          Error  2268  7629.0  0.29729
1        Hacking  2079  7629.0  0.27251
2         Misuse  1604  7629.0  0.21025
3       Physical  1517  7629.0  0.19885
4        Malware   635  7629.0  0.08324
5         Social   517  7629.0  0.06777
6  Environmental     8  7629.0  0.00105
7        Unknown   210     NaN      NaN

Or at lower-level enumerations:

In [13]: v.enum_summary(veris_df, 'action.social.variety')
Out[13]:
           enum    x      n     freq
0      Phishing  350  501.0  0.69860
1       Bribery   51  501.0  0.10180
2    Pretexting   41  501.0  0.08184
3     Extortion   33  501.0  0.06587
4       Forgery   16  501.0  0.03194
5     Influence   13  501.0  0.02595
6         Other   10  501.0  0.01996
7       Baiting    2  501.0  0.00399
8   Elicitation    2  501.0  0.00399
9    Propaganda    2  501.0  0.00399
10         Scam    2  501.0  0.00399
11         Spam    1  501.0  0.00200
12      Unknown   16    NaN      NaN
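
The x, n and freq columns in these outputs are consistent with a simple count over the records whose value is known: in the action example, n = 7629 is the 7839 records minus the 210 marked Unknown, and freq is x / n. The sketch below reproduces that arithmetic in plain Python. It is an interpretation of the output shown here, not verispy's code, and `summarize` is a hypothetical name.

```python
from collections import Counter


def summarize(records, unknown="Unknown"):
    """Count enumeration values among records that are not marked unknown.

    records: list of sets, one set of enumeration values per incident.
    Returns (rows, n_unknown); rows are (enum, x, n, freq) tuples
    sorted by count, largest first.
    """
    known = [r for r in records if unknown not in r]
    n = len(known)
    counts = Counter(value for r in known for value in r)
    rows = [(enum, x, n, round(x / n, 5)) for enum, x in counts.most_common()]
    return rows, len(records) - n


# Hypothetical toy data: an incident may carry more than one value.
records = [{"Phishing"}, {"Phishing", "Pretexting"}, {"Bribery"}, {"Unknown"}]
rows, n_unknown = summarize(records)
for row in rows:
    print(row)
print("Unknown:", n_unknown)
```

Because one incident can carry several values, the x column can sum to more than n, as it does in the action table.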

We can pass a second variable as the by parameter, which gives the enumeration broken down by each value of the "by" variable:

In [14]: v.enum_summary(veris_df, 'action', by='attribute')
Out[14]:
                          by           enum     x       n     freq
0     attribute.Availability       Physical  1153  2350.0  0.49064
1     attribute.Availability        Hacking   664  2350.0  0.28255
2     attribute.Availability          Error   446  2350.0  0.18979
3     attribute.Availability        Malware   138  2350.0  0.05872
4     attribute.Availability         Misuse    67  2350.0  0.02851
5     attribute.Availability         Social    59  2350.0  0.02511
6     attribute.Availability  Environmental     8  2350.0  0.00340
7     attribute.Availability        Unknown     5     NaN      NaN
8  attribute.Confidentiality          Error  2231  7057.0  0.31614
9  attribute.Confidentiality        Hacking  1684  7057.0  0.23863
10 attribute.Confidentiality         Misuse  1552  7057.0  0.21992
11 attribute.Confidentiality       Physical  1492  7057.0  0.21142
12 attribute.Confidentiality        Malware   555  7057.0  0.07865
13 attribute.Confidentiality         Social   459  7057.0  0.06504
14 attribute.Confidentiality  Environmental     2  7057.0  0.00028
15 attribute.Confidentiality        Unknown   198     NaN      NaN
16       attribute.Integrity        Hacking   916  1833.0  0.49973
17       attribute.Integrity        Malware   635  1833.0  0.34643
18       attribute.Integrity         Social   517  1833.0  0.28205
19       attribute.Integrity       Physical   321  1833.0  0.17512
20       attribute.Integrity         Misuse   257  1833.0  0.14021
21       attribute.Integrity          Error    35  1833.0  0.01909
22       attribute.Integrity  Environmental     0  1833.0  0.00000
23       attribute.Integrity        Unknown    15     NaN      NaN
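
One way to read the by output is as the same count repeated on a subset of records for each value of the second variable. The sketch below shows that grouping in plain Python on toy data. `summarize_by` and the record layout are hypothetical and are not taken from verispy.

```python
from collections import Counter


def summarize_by(records, enum_key, by_key, unknown="Unknown"):
    """For each value of by_key, count enum_key values among matching records.

    records: list of dicts mapping a key to a set of values.
    Returns {by_value: [(enum, x, n, freq), ...]}.
    """
    by_values = sorted({b for r in records for b in r[by_key]})
    result = {}
    for by_value in by_values:
        subset = [
            r for r in records
            if by_value in r[by_key] and unknown not in r[enum_key]
        ]
        n = len(subset)
        counts = Counter(v for r in subset for v in r[enum_key])
        result[by_value] = [
            (enum, x, n, round(x / n, 5)) for enum, x in counts.most_common()
        ]
    return result


# Hypothetical toy incidents; one incident can affect several attributes.
incidents = [
    {"action": {"Hacking"}, "attribute": {"Confidentiality", "Integrity"}},
    {"action": {"Error"}, "attribute": {"Confidentiality"}},
    {"action": {"Physical"}, "attribute": {"Availability"}},
    {"action": {"Hacking", "Malware"}, "attribute": {"Integrity"}},
]
by_attribute = summarize_by(incidents, "action", "attribute")
for attribute, rows in by_attribute.items():
    print(attribute, rows)
```

An incident that touches several attributes is counted once in each group, which would explain why the n values in the output above add up to more than the number of records.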

We can add confidence intervals by specifying ci_method (currently supported methods: wilson, normal, or agresti_coull; see https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportion_confint.html for more information):

In [15]: v.enum_summary(veris_df, 'action.social.variety', ci_method='wilson')
Out[15]:
           enum    x      n     freq  method    lower    upper
0      Phishing  350  501.0  0.69860  wilson  0.65704  0.73715
1       Bribery   51  501.0  0.10180  wilson  0.07828  0.13138
2    Pretexting   41  501.0  0.08184  wilson  0.06090  0.10914
3     Extortion   33  501.0  0.06587  wilson  0.04728  0.09106
4       Forgery   16  501.0  0.03194  wilson  0.01975  0.05124
5     Influence   13  501.0  0.02595  wilson  0.01523  0.04388
6         Other   10  501.0  0.01996  wilson  0.01088  0.03635
7       Baiting    2  501.0  0.00399  wilson  0.00110  0.01444
8   Elicitation    2  501.0  0.00399  wilson  0.00110  0.01444
9    Propaganda    2  501.0  0.00399  wilson  0.00110  0.01444
10         Scam    2  501.0  0.00399  wilson  0.00110  0.01444
11         Spam    1  501.0  0.00200  wilson  0.00035  0.01122
12      Unknown   16    NaN      NaN  wilson      NaN      NaN

We can change the width of the confidence interval with ci_level (default is 0.95):

In [16]: v.enum_summary(veris_df, 'action.social.variety', ci_method='wilson', ci_level=0.5)
Out[16]:
           enum    x      n     freq  method    lower    upper
0      Phishing  350  501.0  0.69860  wilson  0.68460  0.71224
1       Bribery   51  501.0  0.10180  wilson  0.09304  0.11127
2    Pretexting   41  501.0  0.08184  wilson  0.07395  0.09048
3     Extortion   33  501.0  0.06587  wilson  0.05878  0.07374
4       Forgery   16  501.0  0.03194  wilson  0.02705  0.03767
5     Influence   13  501.0  0.02595  wilson  0.02157  0.03119
6         Other   10  501.0  0.01996  wilson  0.01616  0.02463
7       Baiting    2  501.0  0.00399  wilson  0.00249  0.00639
8   Elicitation    2  501.0  0.00399  wilson  0.00249  0.00639
9    Propaganda    2  501.0  0.00399  wilson  0.00249  0.00639
10         Scam    2  501.0  0.00399  wilson  0.00249  0.00639
11         Spam    1  501.0  0.00200  wilson  0.00103  0.00387
12      Unknown   16    NaN      NaN  wilson      NaN      NaN
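
For readers who want to see what the wilson method computes, the following is a standalone sketch of the Wilson score interval using only the standard library. It is written from the textbook formula rather than taken from verispy or statsmodels, and `wilson_interval` is a hypothetical name. Applied to the Phishing row (x = 350, n = 501) it should land very close to the lower and upper values shown for both confidence levels.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(x, n, ci_level=0.95):
    """Wilson score interval for a binomial proportion x / n."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - ci_level) / 2)
    p = x / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


lower_95, upper_95 = wilson_interval(350, 501)
lower_50, upper_50 = wilson_interval(350, 501, ci_level=0.5)
print(round(lower_95, 5), round(upper_95, 5))
print(round(lower_50, 5), round(upper_50, 5))
```

A lower ci_level gives a smaller critical value z and therefore a narrower interval around nearly the same center.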

The enum_summary function returns a DataFrame. With that enumeration DataFrame, we can draw a simple horizontal bar chart using the plot_barchart function:

In [17]: actionci_df = v.enum_summary(veris_df, 'action')

In [18]: action_fig = v.plot_barchart(actionci_df, 'Actions')

In [19]: action_fig.show()

Action Enumeration Bar Plot

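Clustering with Patterns

Another useful feature of the verispy package is the df_to_matrix function, which converts the VERIS DataFrame into a matrix of boolean values for selected enumerations. This feature was inspired by Jay Jacobs's blog post "DBIR Data-Driven Cover". The post discusses the DBIR "patterns", which were first described in the 2014 DBIR.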

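Patterns function

In his blog post, Jay points the reader to a GitHub gist containing a function he wrote, getpatternlist.R. We have converted that R function to Python; it can be found here: dbir_patterns.py

Using this function, we can create a DataFrame with the DBIR patterns: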

In [23]: import pandas as pd

def get_pattern(df):
    """ Generates the DBIR "patterns," with liberal inspiration from the getpatternlist.R:
    https://gist.github.com/jayjacobs/a145cb87551f551fc719

    Parameters
    ----------
    df: pd DataFrame with most VERIS encodings already built (from verispy package).

    Returns
    -------
    pd DataFrame with the patterns. Does not return as part of original VERIS DF.
    """
    skimmer = df['action.physical.variety.Skimmer'] | \
              (df['action.physical.variety.Tampering'] & df['attribute.confidentiality.data.variety.Payment'])

    espionage = df['actor.external.motive.Espionage'] | df['actor.external.variety.State-affiliated']

    ... (more lines, see gist)

In [23]: patterns = get_pattern(veris_df)

In [24]: patterns['pattern'].value_counts()
Out[24]:
Miscellaneous Errors      1814
Privilege Misuse          1597
Lost and Stolen Assets    1460
Everything Else           1028
Web Applications           896
Payment Card Skimmers      278
Crimeware                  268
Cyber-Espionage            248
Denial of Service          162
Point of Sale               88
Name: pattern, dtype: int64
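
The two rules visible in the excerpt are plain boolean combinations of VERIS columns. The sketch below applies just those two rules to a single record held as a dict. The function name, the order in which the rules are checked and the "Unclassified" fallback label are assumptions made for illustration; the real function in the gist covers more patterns and may resolve overlaps differently.

```python
def skimmer_or_espionage(record):
    """Apply only the skimmer and espionage rules to one record.

    record: dict of boolean VERIS columns; missing columns count as False.
    The check order and the "Unclassified" label are assumptions.
    """
    def flag(column):
        return bool(record.get(column, False))

    skimmer = flag("action.physical.variety.Skimmer") or (
        flag("action.physical.variety.Tampering")
        and flag("attribute.confidentiality.data.variety.Payment")
    )
    espionage = flag("actor.external.motive.Espionage") or flag(
        "actor.external.variety.State-affiliated"
    )
    if skimmer:
        return "Payment Card Skimmers"
    if espionage:
        return "Cyber-Espionage"
    return "Unclassified"


# Hypothetical records.
tampered_atm = {
    "action.physical.variety.Tampering": True,
    "attribute.confidentiality.data.variety.Payment": True,
}
state_actor = {"actor.external.variety.State-affiliated": True}
print(skimmer_or_espionage(tampered_atm))
print(skimmer_or_espionage(state_actor))
```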

From here, we can go back to the veris_df DataFrame and generate the boolean VERIS matrix:

In [25]: vmat = v.df_to_matrix(veris_df)

In [26]: vmat
Out[26]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 1, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

In [27]: vmat.shape
Out[27]: (7839, 569)
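
Conceptually, the matrix has one row per incident and one 0/1 column per selected enumeration. A minimal plain-Python sketch of that conversion is shown below. `to_matrix` and the sample rows are hypothetical, and how verispy chooses the 569 columns is not shown here.

```python
def to_matrix(rows, columns):
    """Build a 0/1 matrix (list of lists) from dict rows, one column per name.

    Any value other than True, including a missing key, becomes 0.
    """
    return [[1 if row.get(col) is True else 0 for col in columns] for row in rows]


# Hypothetical rows using boolean enumeration columns.
rows = [
    {"action.Error": True, "action.Hacking": False},
    {"action.Hacking": True, "action.malware.variety.Ransomware": True},
    {},
]
columns = ["action.Error", "action.Hacking", "action.malware.variety.Ransomware"]
matrix = to_matrix(rows, columns)
print(matrix)
```

A matrix of this shape is the kind of numeric input that dimensionality-reduction methods expect.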

We can then apply a dimensionality-reduction technique called t-SNE. The following may take a few minutes:

In [28]: from sklearn.manifold import TSNE

In [29]: tsne = TSNE(n_components=2, random_state=42)

In [30]: v_tsne = tsne.fit_transform(vmat)

Finally, we can use seaborn to create the following plot, colored by DBIR "pattern":

In [31]: import seaborn as sns

In [32]: import pandas as pd

In [33]: import matplotlib.pyplot as plt

In [34]: tsne_df = pd.DataFrame({'x': v_tsne[:, 0], 'y': v_tsne[:, 1], 'pattern': patterns['pattern']})

In [35]: tsne_df.head()
Out[35]:
           x          y               pattern
0   0.411892 -34.907738      Privilege Misuse
1  29.374905  12.816430  Miscellaneous Errors
2 -63.858070 -47.406250       Cyber-Espionage
3 -58.987106   7.611073      Web Applications
4 -75.674927   7.452817      Web Applications

In [36]: tsne_centers = tsne_df.groupby(by='pattern').mean()
    ...: tsne_centers['pattern'] = tsne_centers.index

In [37]: # Note: seaborn 0.9 and later renamed lmplot's `size` argument to `height`.
    ...: p1 = sns.lmplot(x='x', y='y', data=tsne_df, fit_reg=False, hue='pattern',
    ...:                 scatter_kws={'alpha': 0.25}, size=6)
    ...:

In [38]: def label_point(df, ax):
    ...:     for i, point in df.iterrows():
    ...:         ax.text(point['x'] - 30, point['y'], point['pattern'])
    ...:

In [39]: label_point(tsne_centers, plt.gca())

In [40]: plt.show()

TSNE plot with clusters

Unit Tests

pytest

Running the tests:

(veris) verispy $ pytest
============================== test session starts ==============================
platform darwin -- Python 3.6.5, pytest-3.5.1, py-1.5.3, pluggy-0.6.0
rootdir: /Users/tylerbyers/src/verispy, inifile:
plugins: remotedata-0.2.1, openfiles-0.3.0, doctestplus-0.1.3, arraydiff-0.2
collected 8 items

verispy/tests/test_veris.py ........                                       [100%]

=========================== 8 passed in 11.50 seconds ===========================
