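With this package, you can easily calculate the epigenetic consequences of a specific disease and reveal the relationship between epigenetic effects and the expression of specific genes. For many samples of a given disease, you can calculate the correlation between gene expression levels and summations derived from DNA methylation beta-values. In addition, you can easily view a density plot of the beta-value distribution.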

Detailed description of the GeneMethyl Python project


GeneMethyl 1.0.0 Author: 重亨公园, Applied Biology and Chemistry, Seoul National University, South Korea, clearclouds@snu.ac.kr

a. Purpose

DNA methylation, the addition of a methyl group to the 5-carbon of the cytosine pyrimidine ring, is considered an important biomarker for some diseases.
CpG sites, regions of a single DNA strand where a cytosine nucleotide is followed by a guanine nucleotide, are known to be readily methylated in the human body.
Methylation of CpG sites can affect the expression levels of specific genes. In some diseases such as cancer, hypermethylation of CpG sites silences tumor suppressor genes, while hypomethylation promotes the activity of retrotransposons such as LINE-1, which leads to chromosomal instability.
Numerous clinical studies on the methylation of CpG sites have therefore been completed, and more are underway.
To make it easy to analyze the relationship between DNA methylation status and target genes' activities, I made a package named GeneMethyl.
I hope this package will contribute to many studies. I plan to add more functions to solve more complicated problems.

b. Input data format

Required files : DNA methylation data and RNAseq data (raw counts or normalized by RSEM) for samples of a target disease.

1. File name : [TargetDiseaseName].DNA_methylation_450K.tsv (Separated by tab)

	Column : Each sample
	Row : Each CpG site

	Missing values are left empty. (Just separated by tabs)
	Sample names are recognized up to a length of 12 characters. (excluding '-' or '_')

Illumina's Infinium HumanMethylation450K BeadChip makes it easy to obtain DNA methylation data for about 480,000 selected CpG sites.
So, this package uses the beta-value data table generated from HumanMethylation450K.
The format of the DNA methylation dataset is the same as that of TCGA's pan-cancer atlas methylation dataset.



2. File name : [TargetDiseaseName].PANCANCER.RNAseq.tsv (Separated by tab)

	Column : Each sample
	Row : Each gene

	Missing values are left empty. (Just separated by tabs)
	Sample names are recognized up to a length of 12 characters. (excluding '-' or '_')

Your target gene expression levels are automatically calculated from RNAseq raw counts or from data normalized by the RSEM package.
The format of the RNAseq dataset is the same as that of TCGA's pan-cancer atlas RNAseq dataset.

c. Package functions

1. BetavalueDistribution

	(1) BetavalueDistribution.Draw("TargetDiseaseName", Cutoff, WhetherHistogram) : Drawing the beta-value distribution as a density plot based on a histogram.

		Input parameters : 

			"TargetDiseaseName" : Target disease name
			Cutoff : Cutoff must be in [0, 1]. The range [0, 1] is divided into sections of this width.
				ex) Cutoff : 0.1 	-> 	Your Sections : 0~0.1, 0.1~0.2, 0.2~0.3, 0.3~0.4, 0.4~0.5, 0.5~0.6, 0.6~0.7, 0.7~0.8, 0.8~0.9, 0.9~1.0
			WhetherHistogram : True or False. If you choose True, the histogram of beta-values divided by the cutoff is also shown.


		Description : 

			DNA methylation data is too large to handle easily, so loading the whole dataset and drawing the density plot at once is not memory-efficient.
			So, I read the DNA methylation data line by line and approximated each beta-value by assigning it to one of the sections divided by the cutoff. (You can choose the cutoff)
			Then, from this histogram, I made the density plot of the beta-value distribution.
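
The line-by-line binning described above can be sketched as follows. This is only an illustration and not the package's own code: the function names are hypothetical, the input is assumed to have one header row of sample names and the CpG site ID in the first column, and a beta-value of exactly 1.0 is assumed to fall into the last section.

```python
def count_beta_values_by_section(lines, cutoff):
    """Count beta-values per section of width `cutoff`, one row at a time.

    `lines` can be an open file object or any iterable of tab-separated
    rows, so the whole table never has to be held in memory.
    """
    n_sections = round(1 / cutoff)
    counts = [0] * n_sections
    rows = iter(lines)
    next(rows, None)  # assumption: the first row holds the sample names
    for line in rows:
        fields = line.rstrip("\r\n").split("\t")
        for field in fields[1:]:  # assumption: first field is the CpG site ID
            if field == "":  # missing values are left empty
                continue
            beta = float(field)
            # Assumption: a beta-value of exactly 1.0 goes to the last section.
            index = min(int(beta / cutoff), n_sections - 1)
            counts[index] += 1
    return counts


def counts_to_density(counts, cutoff):
    """Scale the histogram so that the area under it equals 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [count / (total * cutoff) for count in counts]


# Tiny made-up table: 2 CpG sites, 3 samples, one missing value.
example = ["site\tS1\tS2\tS3", "cg1\t0.05\t\t0.95", "cg2\t0.15\t1.0\t0.05"]
counts = count_beta_values_by_section(example, 0.1)
density = counts_to_density(counts, 0.1)
```

Passing an open file handle instead of the list gives the same result while reading only one row at a time. Values that lie exactly on a section boundary may land on either side because of floating-point rounding.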


		Output file :

			/Result/DistributionPlot/[TargetDiseaseName].Betavalue.Distribution.Plot.pdf





2. SimpleCutoff

	(1) SimpleCutoff.TargetGeneActivity("TargetDiseaseName", [TargetGenesList]) : Calculating target genes' activities.

		Input parameters : 

			"TargetDiseaseName" : Target disease name
			[TargetGenesList] : Target genes' list. You can choose multiple genes.


		Description : 

			You can simply calculate target genes' activities.
			I defined the representative target genes' activity by using a logarithm.
				base : the number of genes
				anti-logarithm : geometric mean of the target genes' RNAseq values, each with a pseudocount (1) added.
				cf) I added a pseudocount (1) to all RNAseq data to prevent the representative target genes' activity from taking a negative value.
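
One possible reading of the definition above is activity = log_n(geometric mean of (x_i + 1)), where n is the number of target genes and x_i is the RNAseq value of gene i in one sample. The sketch below follows that reading. The function name is hypothetical, and because a logarithm with base 1 is undefined, this sketch assumes at least two target genes; how the package itself treats a single gene is not stated here.

```python
import math


def representative_activity(expression_values):
    """Log (base = number of genes) of the geometric mean of (value + 1).

    `expression_values` holds the RNAseq values of the target genes
    for ONE sample.
    """
    n_genes = len(expression_values)
    if n_genes < 2:
        # Assumption of this sketch: log base 1 is undefined, so require >= 2 genes.
        raise ValueError("this sketch needs at least two target genes")
    # ln(geometric mean) equals the mean of the ln values.
    mean_log = sum(math.log(value + 1) for value in expression_values) / n_genes
    # Change of base: log_n(g) = ln(g) / ln(n).
    return mean_log / math.log(n_genes)


# Made-up values for two genes in one sample:
# geometric mean of (3 + 1, 3 + 1) is 4, and log base 2 of 4 is 2.
activity = representative_activity([3, 3])
```

Because every value is shifted to at least 1, the geometric mean is at least 1 and the result is never negative, which matches the stated purpose of the pseudocount.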


		Output file :

			/Result/SimpleCutoff/[TargetDiseaseName].TargetGeneActivity.txt




	(2) SimpleCutoff.View_Correlation_AND_ScatterPlot("TargetDiseaseName", [TargetGenesList], [Cutoff], Type, WhetherFoldChange) : Calculating Spearman's correlation between the representative target genes' activities and the summations across all samples. Drawing scatter plots of this correlation.

		Input parameters : 

			"TargetDiseaseName" : Target disease name
			[TargetGenesList] : Target genes' list. You can choose multiple genes.
			Cutoff, Type : The summations are calculated by using Cutoff according to Type.
				Type : "Lower", "Higher", "Both", "All"
					"Lower" : If a beta-value is lower than the cutoff, it is converted into 1; otherwise, into 0. These values are then summed for each sample. -> You get the number of beta-values lower than the cutoff for each sample. (in [0, cutoff])
					"Higher" : If a beta-value is higher than the cutoff, it is converted into 1; otherwise, into 0. These values are then summed for each sample. -> You get the number of beta-values higher than the cutoff for each sample. (in [cutoff, 1])
					"Both" : If a beta-value is lower than cutoff/2 or higher than 1 - cutoff/2, it is converted into 1; otherwise, into 0. These values are then summed for each sample. -> You get the number of beta-values in [0, cutoff/2] or [1 - cutoff/2, 1] for each sample.
					"All" : Doing all of these types respectively.
				Cutoff : Cutoff must be in [0, 1] and it determines the section. You can choose multiple cutoffs.
			WhetherFoldChange : Whether to calculate the fold change
				Fold change = (Mean or median of the representative target genes' activity of samples included in the section by cutoff) / (Mean or median of the representative target genes' activity of samples NOT included in the section by cutoff)
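
The binarization, the per-sample summation and the Spearman correlation described above can be sketched in plain Python as follows. The function names and the in-memory list layout are assumptions made for illustration, not the package's API. The section boundaries are treated as inclusive, following the closed intervals given above.

```python
def in_section(beta, cutoff, kind):
    """Return True if `beta` falls in the section selected by `kind`."""
    if kind == "Lower":
        return beta <= cutoff
    if kind == "Higher":
        return beta >= cutoff
    if kind == "Both":
        return beta <= cutoff / 2 or beta >= 1 - cutoff / 2
    raise ValueError("kind must be 'Lower', 'Higher' or 'Both'")


def summation_per_sample(rows, cutoff, kind):
    """Count, for each sample, the CpG sites whose beta-value is in the section.

    `rows` holds one list per CpG site with one beta-value per sample;
    None marks a missing value.
    """
    totals = [0] * len(rows[0])
    for row in rows:
        for sample_index, beta in enumerate(row):
            if beta is not None and in_section(beta, cutoff, kind):
                totals[sample_index] += 1
    return totals


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_x) ** 2 for a in rank_x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in rank_y) ** 0.5
    return covariance / (spread_x * spread_y)


# Made-up beta-values: 3 CpG sites (rows) x 3 samples (columns).
betas = [[0.01, 0.5, 0.97], [0.02, 0.08, 0.6], [0.9, 0.03, 0.99]]
lower = summation_per_sample(betas, 0.1, "Lower")
higher = summation_per_sample(betas, 0.1, "Higher")
both = summation_per_sample(betas, 0.1, "Both")
```

The summation list of a given type can then be passed to `spearman` together with the per-sample representative activities, in the same sample order.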

		Output file :

			/Result/SimpleCutoff/FC_CpGsites/WholeSites.Cutoff.[Cutoff].[TargetDiseaseName].[Type].FC.CpGsites.txt
			/Result/SimpleCutoff/Summation/WholeSites.Cutoff.[Cutoff].[TargetDiseaseName].[Type].Binarization.Summation.txt
			/Result/SimpleCutoff/Correlation/WholeSites.[TargetDiseaseName].[Type].Correlation.Summation.And.TargetGeneActivity.txt -> includes all cutoffs for easy comparison
			/Result/SimpleCutoff/Correlation/WholeSites.[TargetDiseaseName].CompareAll.Correlation.Summation.And.TargetGeneActivity.txt -> only produced if Type is "All", for easy comparison
			/Result/SimpleCutoff/ScatterPlot/WholeSites.Cutoff.[Cutoff].[TargetDiseaseName].[Type].ScatterPlot.pdf





3. TopPercentageCutoff

	(1) TopPercentageCutoff.TargetGeneActivity("TargetDiseaseName", [TargetGenesList]) : Calculating target genes' activities.

		Input parameters : 

			"TargetDiseaseName" : Target disease name
			[TargetGenesList] : Target genes' list. You can choose multiple genes.


		Description : 

			You can simply calculate target genes' activities.
			I defined the representative target genes' activity by using a logarithm.
				base : the number of genes
				anti-logarithm : geometric mean of the target genes' RNAseq values, each with a pseudocount (1) added.
				cf) I added a pseudocount (1) to all RNAseq data to prevent the representative target genes' activity from taking a negative value.


		Output file :

			/Result/TopNpercentageCutoff/[TargetDiseaseName].TargetGeneActivity.txt




	(2) TopPercentageCutoff.View_Correlation_AND_ScatterPlot("TargetDiseaseName", [TargetGenesList], [Percentage], Type, WhetherMeanMethod, WhetherFoldChange) : Calculating Spearman's correlation between the representative target genes' activities and the summations across all samples. Drawing scatter plots of this correlation.

		Background : 

			Before explaining TopNpercentageCutoff, we have to define what percentage means in this method.

			In some diseases like cancer, the beta-value distribution of each CpG site looks roughly like a 2-peaked graph.
			So, I roughly classified the beta-value distribution of each CpG site into 2 categories, left-skewed and right-skewed.
			To determine the skewness of a CpG site, we compare its median with its mean or with 0.5 ( = the exact midpoint of [0, 1]).
				If you want to use the 'mean method', set WhetherMeanMethod to True.
				If you want to use the '0.5 method', set WhetherMeanMethod to False.

			After determining the skewness, the types are classified as "Positive", "Negative", and "Both".
				If the type is "Positive" and the CpG site's beta-value distribution is right-skewed, the top N% of the sample beta-values are converted into 1 and the rest into 0.
				If the type is "Positive" and the CpG site's beta-value distribution is left-skewed, the bottom N% of the sample beta-values are converted into 1 and the rest into 0.
				Thus, the "Positive" type means that, for each sample, we count the number of beta-values following the tendency of each CpG site's distribution.

				If the type is "Negative" and the CpG site's beta-value distribution is right-skewed, the bottom N% of the sample beta-values are converted into 1 and the rest into 0.
				If the type is "Negative" and the CpG site's beta-value distribution is left-skewed, the top N% of the sample beta-values are converted into 1 and the rest into 0.
				Thus, the "Negative" type means that, for each sample, we count the number of beta-values not following the tendency of each CpG site's distribution.

				If the type is "Both", regardless of the CpG site's skewness, the top N/2% and the bottom N/2% of the sample beta-values are converted into 1 and the rest into 0.
				Thus, the "Both" type means that, for each sample, we count the number of strongly methylated or demethylated CpG sites.

			By using this method, we can simply count the number of hypermethylated or hypomethylated CpG sites in a specific situation.
			To explain this, let's take cancer as an example.
				Globally, CpG sites are hypomethylated, which increases chromosomal instability, for example through the expression of retrotransposons.
				However, CpG sites near promoter regions are hypermethylated to silence life-critical genes.
				So, if we use the TopNpercentageCutoff "Positive" type, we can count the number of these hypermethylated or hypomethylated CpG sites.
				Then, we can correlate these values with the representative target genes' activity.

			In short, we can analyze the epigenetic impact of diseases.
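
The rules in this background can be sketched for a single CpG site as follows. Several points are assumptions of this sketch and not confirmed behavior of the package: the function names are hypothetical, a site is called left-skewed when its median is larger than the reference value (the usual statistical convention) and right-skewed otherwise, the percentage is given as a fraction such as 0.2, and the number of selected samples is rounded to the nearest integer.

```python
import statistics


def skew_direction(betas, use_mean_method):
    """Classify one CpG site as 'left' or 'right' skewed.

    Assumption: median > reference means left-skewed, otherwise right-skewed.
    """
    reference = statistics.mean(betas) if use_mean_method else 0.5
    return "left" if statistics.median(betas) > reference else "right"


def binarize_site(betas, fraction, kind, use_mean_method=True):
    """Convert the selected top or bottom share of samples to 1, the rest to 0."""
    if kind not in ("Positive", "Negative", "Both"):
        raise ValueError("kind must be 'Positive', 'Negative' or 'Both'")
    n_samples = len(betas)
    ascending = sorted(range(n_samples), key=lambda i: betas[i])

    def bottom(k):
        return set(ascending[:k])

    def top(k):
        return set(ascending[n_samples - k:]) if k else set()

    if kind == "Both":
        k = round(n_samples * fraction / 2)  # assumption: round to nearest
        chosen = top(k) | bottom(k)
    else:
        k = round(n_samples * fraction)  # assumption: round to nearest
        right_skewed = skew_direction(betas, use_mean_method) == "right"
        # "Positive": top for right-skewed sites, bottom for left-skewed sites.
        # "Negative": the opposite choice.
        take_top = right_skewed if kind == "Positive" else not right_skewed
        chosen = top(k) if take_top else bottom(k)
    return [1 if i in chosen else 0 for i in range(n_samples)]


# Made-up beta-values of one CpG site across 10 samples (median < mean).
site = [0.05, 0.10, 0.20, 0.12, 0.90, 0.08, 0.15, 0.11, 0.09, 0.30]
positive = binarize_site(site, 0.2, "Positive")
negative = binarize_site(site, 0.2, "Negative")
both = binarize_site(site, 0.2, "Both")
```

Summing these 0/1 lists over all CpG sites, sample by sample, gives the per-sample summation that is then correlated with the representative target genes' activity.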


		Input parameters :

			"TargetDiseaseName" : Target disease name
			[TargetGenesList] : Target genes' list. You can choose multiple genes.
			Percentage, Type : The summations are calculated by using Percentage according to Type.
				Type : "Positive", "Negative", "Both", "All"
				Percentage : The percentage (N) used to select the top or bottom beta-values.
			WhetherFoldChange : Whether to calculate the fold change
				Fold change = (Mean or median of the representative target genes' activity of samples included in the section by percentage) / (Mean or median of the representative target genes' activity of samples NOT included in the section by percentage)
			WhetherMeanMethod : Choosing the method of determining the skewness (True : 'mean method', False : '0.5 method')
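
The fold change defined in the parameter list above can be sketched for one CpG site as follows. The function name, the `use_median` switch and the choice to return None when a group is empty or the denominator is zero are assumptions of this sketch, not documented behavior of the package.

```python
import statistics


def fold_change(activities, flags, use_median=False):
    """Ratio of the average activity of flagged samples to that of the rest.

    `activities` holds the representative activity of each sample and
    `flags` the 0/1 binarization of one CpG site, in the same sample order.
    """
    included = [a for a, flag in zip(activities, flags) if flag == 1]
    excluded = [a for a, flag in zip(activities, flags) if flag == 0]
    if not included or not excluded:
        return None  # assumption: undefined when one group is empty
    average = statistics.median if use_median else statistics.mean
    denominator = average(excluded)
    if denominator == 0:
        return None  # assumption: avoid division by zero
    return average(included) / denominator


# Made-up activities for 5 samples; samples 1 and 3 are in the section.
ratio = fold_change([4.0, 2.0, 6.0, 1.0, 3.0], [1, 0, 1, 0, 0])
```

A ratio above 1 means that the samples inside the section have a higher average activity than the samples outside it.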


		Output file :

			/Result/TopPercentageCutoff/FC_CpGsites/WholeSites.Percentage.[Percentage].[TargetDiseaseName].[Type].FC.CpGsites.txt
			/Result/TopPercentageCutoff/Summation/WholeSites.Percentage.[Percentage].[TargetDiseaseName].[Type].Binarization.Summation.txt
			/Result/TopPercentageCutoff/Skewed/WholeSites.[TargetDiseaseName].Left.Skewed.CpGsites.txt
			/Result/TopPercentageCutoff/Skewed/WholeSites.[TargetDiseaseName].Right.Skewed.CpGsites.txt
			/Result/TopPercentageCutoff/Correlation/WholeSites.[TargetDiseaseName].[Type].Correlation.Summation.And.TargetGeneActivity.txt -> includes all percentages for easy comparison
			/Result/TopPercentageCutoff/Correlation/WholeSites.[TargetDiseaseName].CompareAll.Correlation.Summation.And.TargetGeneActivity.txt -> only produced if Type is "All", for easy comparison
			/Result/TopPercentageCutoff/ScatterPlot/WholeSites.Percentage.[Percentage].[TargetDiseaseName].[Type].ScatterPlot.pdf

d. Usage examples

from GeneMethyl import *

BetavalueDistribution.DrawDensityPlot("PANCANCER", 0.001, True)

SimpleCutoff.TargetGeneActivity("PANCANCER", ["GZMA", "PRF1"])
SimpleCutoff.View_Correlation_AND_ScatterPlot("PANCANCER", ["GZMA", "PRF1"], [0.1, 0.2], "All", True)
	cf) Even if TargetGeneActivity has not been called beforehand, it is executed automatically inside this function.

TopPercentageCutoff.TargetGeneActivity("PANCANCER", ["GZMA", "PRF1"])
TopPercentageCutoff.View_Correlation_AND_ScatterPlot("PANCANCER", ["GZMA", "PRF1"], [0.05, 0.1, 0.15, 0.2], "All", False, True)
	cf) Even if TargetGeneActivity has not been called beforehand, it is executed automatically inside this function.
