**文書の過去の版を表示しています。**
fastaファイルから欲しい配列を抜き出す
fastaファイルの中には
>ENSDART00000189226.1 cdna chromosome:GRCz11:2:36088047:36088056:1 gene:ENSDARG00000116470.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:BX681417.24 TCTGGACTAC >ENSDART00000189431.1 cdna chromosome:GRCz11:2:36087769:36087779:1 gene:ENSDARG00000116509.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:BX681417.25 GATTGGGGTAC >ENSDART00000169803.2 cdna chromosome:GRCz11:7:17261616:17264309:-1 gene:ENSDARG00000103653.2 gene_biotype:polymorphic_pseudogene transcript_biotype:polymorphic_pseudogene gene_symbol:CU179757.1 GCCACGCCCCCTGATATAAATAAAGCAGATTGGGTTCCTCATCTTAATCTCCTCGTTGGA TCTAGACTGATATACAGCAATGGTGTGTTTATTAATTTCAGCAATGTTGATGGGTAAGTG TTACTGCACTAAAAATACCTTAATCATGCACAATTATGATCTTTAAAAACAGATTTTGTG GGGCAATGTATTGTATGATTCTAGGATTTTAAATAAAACATGTTATTTCTCTGGATTTTA GGTTTGCTTTCACTTGTCAAATCTTCTCAGATAGTCAATATATCAGCTCAACCAGGAAAA
のように配列中に改行が含まれるものがある。この場合欲しい配列がgrep -A 1
とかで抜いてくることができないので、seqkitというツールが有用。
seqkit grep -p "ENSDART00000140557.4" ~/work/ZebraVsCarp/zebrafish/dna_fasta/Danio_rerio.GRCz11.cdna.all.fa
これで
>ENSDART00000140557.4 cdna chromosome:GRCz11:23:10469955:10587257:1 gene:ENSDARG00000076292.6 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:tns2a description:tensin 2a [Source:ZFIN;Acc:ZDB-GENE-090312-163] GAGTTGTGTACGTTTGGCTGACCGTACGCAGGCTAAGAGTGCTAATAGACTTTTGATTTA ATAGGCATATTCTCACCTTTGTTTACATACGTGTGCTCACGAGAGACAGATGAGTGTGAA CTGGCGAACATGTGTGTGTGTTTGAATTAAAGCCAGGAAAAGGGGAAGGGCTTCGAGTGA GCGAACAAAAACAGCTGCCTCTTGTGTGCTCGGGAAACCTCATCCAGAGAGAGGGAGAGA GAAGAGACGGACCAGCAGGGTCCAGGCCTCTATATGGGTTGCGCCTTGAGTGGGGATAGC ...
のようにほしい配列をキーワード指定で抜いてこれた。まあ少し考えればawkでできそうだけど。 ちなみにseqkitはいろいろな機能があり、こんな感じ。文字列処理に慣れてないうちは楽かも。
[kijima.yusuke@m48 OnlyDNAHit_CI01000021]$ seqkit -h SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation Version: 0.7.3-dev Author: Wei Shen <shenwei356@gmail.com> Documents : http://bioinf.shenwei.me/seqkit Source code: https://github.com/shenwei356/seqkit Please cite: https://doi.org/10.1371/journal.pone.0163962 Suggestion : Install pigz to gain better parsing performance for gzipped data Usage: seqkit [command] Available Commands: common find common sequences of multiple files by id/name/sequence concat concatenate sequences with same ID from multiple files convert convert FASTQ quality encoding between Sanger, Solexa and Illumina duplicate duplicate sequences N times faidx create FASTA index file and extract subsequence fq2fa convert FASTQ to FASTA fx2tab convert FASTA/Q to tabular format (with length/GC content/GC skew) genautocomplete generate shell autocompletion script grep search sequences by pattern(s) of name or sequence motifs head print first N FASTA/Q records help Help about any command locate locate subsequences/motifs range print FASTA/Q records in a range (start:end) rename rename duplicated IDs replace replace name/sequence by regular expression restart reset start position for circular genome rmdup remove duplicated sequences by id/name/sequence sample sample sequences by number or proportion seq transform sequences (revserse, complement, extract ID...) shuffle shuffle sequences sliding sliding sequences, circular genome supported sort sort sequences by id/name/sequence/length split split sequences into files by id/seq region/size/parts stats simple statistics of FASTA/Q files subseq get subsequences by region/gtf/bed, including flanking sequences tab2fx convert tabular format to FASTA/Q format version print version information and check for update Flags: --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000) -h, --help help for seqkit --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud... --id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?") -w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60) -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-") --quiet be quiet and do not show extra information -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto") -j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2) Use "seqkit [command] --help" for more information about a command.