从一个大fi模式的特定出现中提取名称

2024-09-30 05:32:05 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个FASTA文件,它基本上是一个文本文件,用于描述生物序列数据(https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp),其中包含超过10000个FASTA序列(从>;开始)。文件的开头如下所示:

>Gene A
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATT
>Gene B
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATTACACCGTTAGCAGTGTGAGCAAAAACGATTAA
AAAGTAAATATTATAAAAGCCCTC
>Gene C
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
>Gene D
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT

以此类推,大约有10000个基因。 我想:

  1. 找出哪些基因包含特定的模式(CTTTGTA)
  2. 这种模式在那个基因中出现了多少次?你知道吗
  3. 以模式的频率导出包含模式的基因名列表。你知道吗

欢迎使用Bash或Python(或R)的任何解决方案。你知道吗

另外,到目前为止我已经尝试过但没有成功:将基因及其序列提取到不同的文件中,然后在不同的文件中对模式进行grep。但是,我不能生成这些单独的文件。我曾经

grep '^>' file.txt > new_file.txt

但是我得到的结果是一个文件,只包含所有的基因名。你知道吗


Tags: 文件type基因模式序列fastageneaaaatacccgaaaaatataactgatatgattgccaactaccctgcgactatgtaaacccaaccttcccccctcctttacc
2条回答

这里是一个使用stringi包的R的解决方案。由于没有一个文本文件或类似文件可以作为可复制的示例进行访问,因此我使用cat()readlines()来读取表示您提供的行的副本的临时文件。也请检查时间基准,可能对大文件感兴趣。你知道吗

library(stringi)

cat(">Gene A
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
    TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATT
    >Gene B
    GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
    TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATTACACCGTTAGCAGTGTGAGCAAAAACGATTAA
    AAAGTAAATATTATAAAAGCCCTC
    >Gene C
    AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
    GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
    TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
    ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
    AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
    CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
    ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
    >Gene D
    AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
    GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
    TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
    ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
    AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
    CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
    ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
", file = "tempfile.txt")

genes <- readLines("tempfile.txt", n=-1)
unlink("tempfile.txt")

genes <- unlist(stri_split_fixed(paste(genes, collapse = " "), ">"))
genes <- genes[ genes != ""]

genenames <- unlist(stri_extract_all_regex(genes, "Gene \\w+"))
genes <- stri_replace_all_fixed(genes, genenames, "")
names(genes) <- genenames

genes <- gsub("\\s+", "", genes, perl = T) 

gene_pattern_freq <- function(str, patterns) {

  res <- sapply(patterns, function(p) {

    stringi::stri_count_fixed(str, p)

  }, USE.NAMES = T)

  rownames(res) <- names(str)

  return(res)
}

searchpatterns <- c("AA", "GT", "GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAA")
result <- gene_pattern_freq(genes, searchpatterns)
result
#        AA GT GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAA
# Gene A 14  6                                      1
# Gene B 21 10                                      1
# Gene C 52 18                                      0
# Gene D 52 18                                      0

library(microbenchmark)
microbenchmark(gene_pattern_freq(genes, searchpatterns))
# Unit: microseconds
# expr                                      min     lq    mean   median   uq     max   neval
# gene_pattern_freq(genes, searchpatterns) 68.687 77.371 123.438 78.161 79.345 4479.19   100

#export
write.csv(result, file = "../mypath/gene_pattern_freq_result.csv" )
sequences = open('fastafile.txt').read().split('>')  # Creates a list of sequences.

needle = 'CTTTGTA'

occurrences = {}

for sequence in sequences:
   occ = sequence.count(needle)  # Returns the number of times the substring occurs in the string sequence.
   if occ:  # If greater than 0, create an entry in our dictionary. The sequence being the key and the count the value.
      occurrences[sequence] = occ

output = []

sorted_occurrences = sorted(occurrences.items(), key=operator.itemgetter(1))  # Sort the dictionary by length, so sequences with the highest occurrence of the needle appear at the top.

for seq, occ_count in sorted_occurrences.iteritems():
    gene_name, sequence = seq.split('\n')
    formatted_line = '{gene_name} - {occ_count}'.format(gene_name=gene_name, occ_count=str(occ_count))  # Format the lines the way you want.
    output.append(formatted_line)  

with open('occurences.txt') as o_f:
    o_f.write('\n'.join(output))

相关问题 更多 >

    热门问题