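Data check
MitoFish is a database of fish mitochondrial sequences. 2,500 species are currently registered. The registered mitochondrial sequences come with predicted gene regions, so the goal here is to extract those regions. First, take a peek at the data.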
NC_000860_Salvelinus_fontinalis  TOPOLOGY  circular
    source  1..16624                  ff_definition     NC_000860_Salvelinus_fontinalis mitochondrial DNA, ???
                                      organelle         mitochondrion
                                      organism          NC_000860_Salvelinus_fontinalis
                                      specimen_voucher  ???
                                      mol_type          genomic DNA
    CDS     2843..3817                codon_start       1
                                      gene              ND1
                                      product           NADH dehydrogenase subunit 1
                                      transl_table      2
    CDS     4032..5080                codon_start       1
                                      gene              ND2
                                      product           NADH dehydrogenase subunit 2
                                      note              Incomplete stop codon
                                      transl_except     (pos:5079..5080,aa:TERM)
                                      transl_table      2
...
    rRNA    69..1015                  product           12S rRNA
    rRNA    1088..2767                product           16S rRNA
    tRNA    1..68                     product           tRNA-Phe
    tRNA    1016..1087                product           tRNA-Val
    tRNA    2768..2842                product           tRNA-Leu
                                      note              Anticodon: UAA
    tRNA    3824..3895                product           tRNA-Ile
    tRNA    complement(3893..3963)    product           tRNA-Gln
...
NC_000861_Salvelinus_alpinus  TOPOLOGY  circular
    source  1..16659                  ff_definition     NC_000861_Salvelinus_alpinus mitochondrial DNA, ???
                                      organelle         mitochondrion
                                      organism          NC_000861_Salvelinus_alpinus
                                      specimen_voucher  ???
                                      mol_type          genomic DNA
    CDS     2843..3817                codon_start       1
                                      gene              ND1
                                      product           NADH dehydrogenase subunit 1
                                      transl_table      2
    CDS     4032..5081                codon_start       1
                                      gene              ND2
                                      product           NADH dehydrogenase subunit 2
                                      transl_table      2
...
It looks like each record declares the sequence name first, and the CDS information is then listed on tab-indented lines below it.
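One rough way to confirm that layout is to tally the feature keys. The sketch below assumes five tab-separated columns (entry, feature, location, qualifier, value), which is a guess from the excerpt rather than a documented fact. The function name `feature_counts` and the sample records are made up for illustration.

```shell
# Count how often each feature key (CDS, rRNA, tRNA, ...) appears.
# Assumption: tab-separated columns are entry, feature, location, qualifier, value,
# so the feature key is column 2 and is empty on qualifier-only lines.
feature_counts() {
    awk -F'\t' '
        $2 != "" { n[$2]++ }
        END { for (k in n) print k "\t" n[k] }
    ' "$@" | LC_ALL=C sort
}

# Made-up sample records, so the function can be tried without the real files.
printf 'seqA\tTOPOLOGY\t\tcircular\n\tsource\t1..500\torganelle\tmitochondrion\n\tCDS\t10..99\tcodon_start\t1\n\t\t\tgene\tND1\n\tCDS\tcomplement(200..300)\tcodon_start\t1\n\ttRNA\t1..9\tproduct\ttRNA-Phe\n' |
    feature_counts
```

On the real annotation files this would be run as `cat * | feature_counts`. If the counts look odd, the column assumption is probably wrong.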
Format check
To work out a strategy, check the format a little more.
[kijima.yusuke@m48 annotation]$ cat * | grep CDS | head -n 30
    CDS     2843..3817                codon_start       1
    CDS     4032..5080                codon_start       1
    CDS     5472..7022                codon_start       1
    CDS     7186..7876                codon_start       1
    CDS     7952..8119                codon_start       1
    CDS     8110..8792                codon_start       1
    CDS     8793..9577                codon_start       1
    CDS     9648..9996                codon_start       1
    CDS     10067..10363              codon_start       1
    CDS     10357..11737              codon_start       1
    CDS     11950..13788              codon_start       1
    CDS     complement(13785..14306)  codon_start       1
    CDS     14379..15519              codon_start       1
    CDS     2843..3817                codon_start       1
    CDS     4032..5081                codon_start       1
    CDS     5473..7023                codon_start       1
    CDS     7187..7877                codon_start       1
    CDS     7953..8120                codon_start       1
    CDS     8111..8793                codon_start       1
    CDS     8794..9578                codon_start       1
    CDS     9649..9997                codon_start       1
    CDS     10068..10364              codon_start       1
    CDS     10358..11738              codon_start       1
    CDS     11951..13789              codon_start       1
    CDS     complement(13786..14307)  codon_start       1
    CDS     14380..15520              codon_start       1
    CDS     2840..3814                codon_start       1
    CDS     4026..5070                codon_start       1
    CDS     5461..7017                codon_start       1
    CDS     7170..7860                codon_start       1
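Odd strings like complement(...) seem to show up now and then in the location field; in feature-table notation this marks a feature on the reverse strand. The codon start position matters when converting to amino acid sequences, so check that too, just in case.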
[kijima.yusuke@m48 annotation]$ cat * | grep CDS | cut -f 5 | uniq
1
Every CDS starts at codon position 1, so that is fine.
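The check above is enough here, but `grep CDS` matches any line that contains the string CDS, and `uniq` only merges adjacent duplicates. A slightly stricter tally might look like the sketch below. It assumes that `codon_start` sits in columns 4 and 5 of the CDS line itself, as the `cut -f 5` output suggests. The name `codon_start_counts` and the sample records are hypothetical.

```shell
# Tally codon_start values, looking only at lines whose feature column is exactly CDS.
# Assumption: columns are entry, feature, location, qualifier, value (tab-separated),
# with codon_start given on the CDS line itself.
codon_start_counts() {
    awk -F'\t' '
        $2 == "CDS" && $4 == "codon_start" { n[$5]++ }
        END { for (v in n) print v "\t" n[v] }
    ' "$@" | sort
}

# Made-up sample: two CDS with codon_start 1 and one with codon_start 2.
printf 'seqA\tTOPOLOGY\t\tcircular\n\tCDS\t10..99\tcodon_start\t1\n\t\t\tgene\tND1\n\tCDS\tcomplement(200..300)\tcodon_start\t2\n\tCDS\t400..500\tcodon_start\t1\n' |
    codon_start_counts
```

Each output line is a codon_start value followed by its count, so on the real data a single line starting with `1` would confirm the result above.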
The real run
The target is one line per CDS, with the sequence name and the CDS start and end positions separated by tabs.
cat * | awk 'BEGIN{tmp=0} /^[^\t]/{tmp=$1} /^\tCDS/{ORS="";print tmp"\t"$2"\n"}' | sed -e 's/\.\./\t/g' | sed -e 's/complement(\|)//g' > mitoCDSStartEnd
Now take a look at the result.
[kijima.yusuke@m48 annotation]$ head -n 20 mitoCDSStartEnd
NC_000860_Salvelinus_fontinalis  2843   3817
NC_000860_Salvelinus_fontinalis  4032   5080
NC_000860_Salvelinus_fontinalis  5472   7022
NC_000860_Salvelinus_fontinalis  7186   7876
NC_000860_Salvelinus_fontinalis  7952   8119
NC_000860_Salvelinus_fontinalis  8110   8792
NC_000860_Salvelinus_fontinalis  8793   9577
NC_000860_Salvelinus_fontinalis  9648   9996
NC_000860_Salvelinus_fontinalis  10067  10363
NC_000860_Salvelinus_fontinalis  10357  11737
NC_000860_Salvelinus_fontinalis  11950  13788
NC_000860_Salvelinus_fontinalis  13785  14306
NC_000860_Salvelinus_fontinalis  14379  15519
NC_000861_Salvelinus_alpinus     2843   3817
NC_000861_Salvelinus_alpinus     4032   5081
NC_000861_Salvelinus_alpinus     5473   7023
NC_000861_Salvelinus_alpinus     7187   7877
NC_000861_Salvelinus_alpinus     7953   8120
NC_000861_Salvelinus_alpinus     8111   8793
NC_000861_Salvelinus_alpinus     8794   9578
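One thing this table no longer shows is which CDS were written as complement(...), i.e. which lie on the reverse strand (13785..14306 above is one of them). If that matters later, for example when translating, a variant that keeps a strand column could look like the sketch below. This is not the command used above. The name `cds_table` and the `+`/`-` strand column are assumptions for illustration, and only plain `start..end` and `complement(start..end)` locations are handled.

```shell
# Print: sequence name, start, end, strand (+ or -) for every CDS line.
# Assumption: tab-separated columns are entry, feature, location, qualifier, value.
# Locations such as join(...) or partial coordinates are NOT handled here.
cds_table() {
    awk -F'\t' -v OFS='\t' '
        $1 != "" { name = $1 }              # entry line: remember the sequence name
        $2 == "CDS" {
            loc = $3
            strand = "+"
            if (loc ~ /^complement\(/) {    # reverse-strand feature
                strand = "-"
                sub(/^complement\(/, "", loc)
                sub(/\)$/, "", loc)
            }
            split(loc, pos, /\.\./)         # "2843..3817" -> pos[1], pos[2]
            print name, pos[1], pos[2], strand
        }
    ' "$@"
}

# Made-up sample with one forward and one reverse-strand CDS.
printf 'seqA\tTOPOLOGY\t\tcircular\n\tCDS\t10..99\tcodon_start\t1\n\t\t\tgene\tND1\n\tCDS\tcomplement(200..300)\tcodon_start\t1\n' |
    cds_table
```

On the real files, `cat * | cds_table` would give the same three columns as mitoCDSStartEnd plus the strand as a fourth.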
Looks good enough.