fastaファイルから欲しい配列のみ抜き出す

fastaファイルから欲しい配列を抜き出す

fastaファイルの中には

>ENSDART00000189226.1 cdna chromosome:GRCz11:2:36088047:36088056:1 gene:ENSDARG00000116470.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:BX681417.24
TCTGGACTAC
>ENSDART00000189431.1 cdna chromosome:GRCz11:2:36087769:36087779:1 gene:ENSDARG00000116509.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:BX681417.25
GATTGGGGTAC
>ENSDART00000169803.2 cdna chromosome:GRCz11:7:17261616:17264309:-1 gene:ENSDARG00000103653.2 gene_biotype:polymorphic_pseudogene transcript_biotype:polymorphic_pseudogene gene_symbol:CU179757.1
GCCACGCCCCCTGATATAAATAAAGCAGATTGGGTTCCTCATCTTAATCTCCTCGTTGGA
TCTAGACTGATATACAGCAATGGTGTGTTTATTAATTTCAGCAATGTTGATGGGTAAGTG
TTACTGCACTAAAAATACCTTAATCATGCACAATTATGATCTTTAAAAACAGATTTTGTG
GGGCAATGTATTGTATGATTCTAGGATTTTAAATAAAACATGTTATTTCTCTGGATTTTA
GGTTTGCTTTCACTTGTCAAATCTTCTCAGATAGTCAATATATCAGCTCAACCAGGAAAA

のように配列中に改行が含まれるものがある。この場合欲しい配列がgrep -A 1とかで抜いてくることができないので、seqkitというツールが有用。

seqkit grep -nrp "ENSDART00000140557.4" ~/work/ZebraVsCarp/zebrafish/dna_fasta/Danio_rerio.GRCz11.cdna.all.fa

これで

>ENSDART00000140557.4 cdna chromosome:GRCz11:23:10469955:10587257:1 gene:ENSDARG00000076292.6 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:tns2a description:tensin 2a [Source:ZFIN;Acc:ZDB-GENE-090312-163]
GAGTTGTGTACGTTTGGCTGACCGTACGCAGGCTAAGAGTGCTAATAGACTTTTGATTTA
ATAGGCATATTCTCACCTTTGTTTACATACGTGTGCTCACGAGAGACAGATGAGTGTGAA
CTGGCGAACATGTGTGTGTGTTTGAATTAAAGCCAGGAAAAGGGGAAGGGCTTCGAGTGA
GCGAACAAAAACAGCTGCCTCTTGTGTGCTCGGGAAACCTCATCCAGAGAGAGGGAGAGA
GAAGAGACGGACCAGCAGGGTCCAGGCCTCTATATGGGTTGCGCCTTGAGTGGGGATAGC
...

のようにほしい配列をキーワード指定で抜いてこれた。まあ少し考えればawkでできそうだけど。 ちなみにseqkitはいろいろな機能があり、こんな感じ。文字列処理に慣れてないうちは楽かも。

[kijima.yusuke@m48 OnlyDNAHit_CI01000021]$ seqkit -h
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 0.7.3-dev

Author: Wei Shen <shenwei356@gmail.com>

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962

Suggestion : Install pigz to gain better parsing performance for gzipped data

Usage:
  seqkit [command]

Available Commands:
  common          find common sequences of multiple files by id/name/sequence
  concat          concatenate sequences with same ID from multiple files
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  duplicate       duplicate sequences N times
  faidx           create FASTA index file and extract subsequence
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (with length/GC content/GC skew)
  genautocomplete generate shell autocompletion script
  grep            search sequences by pattern(s) of name or sequence motifs
  head            print first N FASTA/Q records
  help            Help about any command
  locate          locate subsequences/motifs
  range           print FASTA/Q records in a range (start:end)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  rmdup           remove duplicated sequences by id/name/sequence
  sample          sample sequences by number or proportion
  seq             transform sequences (revserse, complement, extract ID...)
  shuffle         shuffle sequences
  sliding         sliding sequences, circular genome supported
  sort            sort sequences by id/name/sequence/length
  split           split sequences into files by id/seq region/size/parts
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  tab2fx          convert tabular format to FASTA/Q format
  version         print version information and check for update

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

Use "seqkit [command] --help" for more information about a command.

fastaファイルから欲しい配列を抜き出す samtools編

samtools faidx file.fasta chr1:100-1000

などとすると、file.fasta中の染色体chr1の100-1000を抜き出してくれる。

染色体chr1全体を抜き出したい場合は、

samtools faidx file.fasta chr1

で良い

fastaファイルから欲しい配列を抜き出す 高速編

正規表現を使わない限定で、複数のクエリーにマッチする配列を抜き出したい場合は、

 LC_ALL=C fgrep -A3 -f query.file <(zcat DB.fastq.gz) | grep -vE "^\-\-$"

クエリー398591, レコード3990531でも

real	0m10.212s
user	0m9.384s
sys	0m1.409s

ちなみにseqkitや通常のgrepだとかなりの時間がかかる。

  • fastaファイルから欲しい配列のみ抜き出す.1551888139.txt.gz
  • 最終更新: 2019/03/06 16:02
  • by 210.191.58.97