<p>这里是一个使用<code>stringi</code>包的R的解决方案。由于没有一个文本文件或类似文件可以作为可复制的示例进行访问,因此我使用<code>cat()</code>和<code>readlines()</code>来读取表示您提供的行的副本的临时文件。也请检查时间基准,可能对大文件感兴趣。你知道吗</p>
<pre><code>library(stringi)
cat(">Gene A
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATT
>Gene B
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATTACACCGTTAGCAGTGTGAGCAAAAACGATTAA
AAAGTAAATATTATAAAAGCCCTC
>Gene C
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
>Gene D
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
", file = "tempfile.txt")
genes <- readLines("tempfile.txt", n=-1)
unlink("tempfile.txt")
genes <- unlist(stri_split_fixed(paste(genes, collapse = " "), ">"))
genes <- genes[ genes != ""]
genenames <- unlist(stri_extract_all_regex(genes, "Gene \\w+"))
genes <- stri_replace_all_fixed(genes, genenames, "")
names(genes) <- genenames
genes <- gsub("\\s+", "", genes, perl = T)
gene_pattern_freq <- function(str, patterns) {
res <- sapply(patterns, function(p) {
stringi::stri_count_fixed(str, p)
}, USE.NAMES = T)
rownames(res) <- names(str)
return(res)
}
searchpatterns <- c("AA", "GT", "GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAA")
result <- gene_pattern_freq(genes, searchpatterns)
result
# AA GT GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAA
# Gene A 14 6 1
# Gene B 21 10 1
# Gene C 52 18 0
# Gene D 52 18 0
library(microbenchmark)
microbenchmark(gene_pattern_freq(genes, searchpatterns))
# Unit: microseconds
# expr min lq mean median uq max neval
# gene_pattern_freq(genes, searchpatterns) 68.687 77.371 123.438 78.161 79.345 4479.19 100
#export
write.csv(result, file = "../mypath/gene_pattern_freq_result.csv" )
</code></pre>