test

Looking at the contents of the file

format-version: 1.2
data-version: releases/2018-05-07
subsetdef: goantislim_grouping "Grouping classes that can be excluded"
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
...

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]
is_a: GO:0007005 ! mitochondrion organization

[Term]
id: GO:0000003
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
...

The goal is to extract id, name, and namespace from each Term block and write them out as a tab-separated list.

Testing whether the Term blocks keep the same structure through to the end of the file

If the number of [Term] lines matches the number of id, name, and namespace lines, it is probably fair to say (roughly) that each block contains one of each.

[kijima.yusuke@m48 Uniprot_swiss]$ grep -E "^\[Term\]$" go.obo | wc -l
47179
[kijima.yusuke@m48 Uniprot_swiss]$ grep -E "^id:" go.obo | wc -l
47189
[kijima.yusuke@m48 Uniprot_swiss]$ grep -E "^name:" go.obo | wc -l
47189
[kijima.yusuke@m48 Uniprot_swiss]$ grep -E "^namespace:" go.obo | wc -l
47189

The counts differ. Looking at the end of the file:

[kijima.yusuke@m48 Uniprot_swiss]$ tail -n 20 go.obo
namespace: external
xref: RO:0002213
holds_over_chain: negatively_regulates negatively_regulates
is_a: regulates ! regulates
transitive_over: part_of ! part of

[Typedef]
id: regulates
name: regulates
namespace: external
xref: RO:0002211
is_transitive: true
transitive_over: part_of ! part of

[Typedef]
id: starts_during
name: starts_during
namespace: external
xref: RO:0002091

[kijima.yusuke@m48 Uniprot_swiss]$ grep "\[Typedef\]" go.obo | wc -l
10

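Stanzas other than [Term] (the [Typedef] ones) are mixed in, and there are exactly 10 of them, which matches the difference of 10 (47189 - 47179), so that settles it for now.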
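The count comparison above is only a rough check, so here is a minimal sketch of a stricter one. The function names `stanza_counts` and `check_terms` are hypothetical helpers made up for this page. The first tallies every stanza header in the file, and the second reports any [Term] stanza that does not contain exactly one id, one name, and one namespace line. It assumes stanza headers are lines of the form `[Word]`, as in the excerpts above.

```shell
# Hypothetical helper: tally stanza headers such as [Term] and [Typedef].
stanza_counts() {
    grep -E '^\[[A-Za-z]+\]$' "$1" | sort | uniq -c
}

# Hypothetical helper: report every [Term] stanza that does not have exactly
# one id:, one name:, and one namespace: line. Prints nothing if all pass.
check_terms() {
    awk '
        function report() {
            if (in_term && (n_id != 1 || n_name != 1 || n_ns != 1))
                printf "line %d: id=%d name=%d namespace=%d\n", start, n_id, n_name, n_ns
        }
        /^\[[A-Za-z]+\]$/ {           # any stanza header closes the previous stanza
            report()
            in_term = ($0 == "[Term]")
            start = NR                # line number of this stanza header
            n_id = n_name = n_ns = 0
            next
        }
        in_term && /^id: /        { n_id++ }
        in_term && /^name: /      { n_name++ }
        in_term && /^namespace: / { n_ns++ }
        END { report() }              # the last stanza has no header after it
    ' "$1"
}
```

Running `stanza_counts go.obo` and then `check_terms go.obo` would show whether [Term] and [Typedef] are the only stanza types and whether any Term block is irregular. No output from `check_terms` means every Term block passed.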

Extracting the fields and lining them up

Use awk.

[kijima.yusuke@m48 Uniprot_swiss]$ cat go.obo | awk 'BEGIN{flag1=0} {if(match($0,/^\[Term\]$/))flag1=1} flag1==1&&/^id: GO/{ORS="";print $0"\t"} flag1==1&&/^name: /{ORS="";print $0"\t"} flag1==1&&/^namespace:/{ORS="";print $0"\n";flag1=0}' | sed -e 's/id: \|name: \|namespace: //g' | head
GO:0000001      mitochondrion inheritance       biological_process
GO:0000002      mitochondrial genome maintenance        biological_process
GO:0000003      reproduction    biological_process
GO:0000005      obsolete ribosomal chaperone activity   molecular_function
GO:0000006      high-affinity zinc transmembrane transporter activity   molecular_function
GO:0000007      low-affinity zinc ion transmembrane transporter activity        molecular_function
GO:0000008      obsolete thioredoxin    molecular_function
GO:0000009      alpha-1,6-mannosyltransferase activity  molecular_function
GO:0000010      trans-hexaprenyltranstransferase activity       molecular_function
GO:0000011      vacuole inheritance     biological_process

Looks good. Checking the line count and the end of the output as well:

[kijima.yusuke@m48 Uniprot_swiss]$ cat go.obo | awk 'BEGIN{flag1=0} {if(match($0,/^\[Term\]$/))flag1=1} flag1==1&&/^id: GO/{ORS="";print $0"\t"} flag1==1&&/^name: /{ORS="";print $0"\t"} flag1==1&&/^namespace:/{ORS="";print $0"\n";flag1=0}' | sed -e 's/id: \|name: \|namespace: //g' | wc -l
47179
[kijima.yusuke@m48 Uniprot_swiss]$ cat go.obo | awk 'BEGIN{flag1=0} {if(match($0,/^\[Term\]$/))flag1=1} flag1==1&&/^id: GO/{ORS="";print $0"\t"} flag1==1&&/^name: /{ORS="";print $0"\t"} flag1==1&&/^namespace:/{ORS="";print $0"\n";flag1=0}' | sed -e 's/id: \|name: \|namespace: //g' | tail
GO:2001308      gliotoxin metabolic process     biological_process
GO:2001309      gliotoxin catabolic process     biological_process
GO:2001310      gliotoxin biosynthetic process  biological_process
GO:2001311      lysobisphosphatidic acid metabolic process      biological_process
GO:2001312      lysobisphosphatidic acid biosynthetic process   biological_process
GO:2001313      UDP-4-deoxy-4-formamido-beta-L-arabinopyranose metabolic process        biological_process
GO:2001314      UDP-4-deoxy-4-formamido-beta-L-arabinopyranose catabolic process        biological_process
GO:2001315      UDP-4-deoxy-4-formamido-beta-L-arabinopyranose biosynthetic process     biological_process
GO:2001316      kojic acid metabolic process    biological_process
GO:2001317      kojic acid biosynthetic process biological_process

OK. The line count (47179) matches the number of [Term] lines.
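For reuse, the extraction above can be sketched as a single awk program without the `cat` and `sed` stages. This is only one possible variant, and `obo_terms_tsv` is a hypothetical function name. It assumes each [Term] stanza has one id, one name, and one namespace line. The tag is stripped by position with `substr`, so a name that happens to contain a string such as `id: ` is left intact, whereas the `sed ... //g` stage would remove it. Whether any such name exists in this release was not checked.

```shell
# Hypothetical helper: print "id<TAB>name<TAB>namespace" for each [Term] stanza.
# Assumes each [Term] stanza has one id:, one name:, and one namespace: line.
obo_terms_tsv() {
    awk '
        function emit_row() {
            if (in_term) print id "\t" name "\t" ns
        }
        /^\[[A-Za-z]+\]$/ {           # a new stanza starts: output the previous one
            emit_row()
            in_term = ($0 == "[Term]")
            id = name = ns = ""
            next
        }
        # Strip the tag by position: "id: " is 4 characters, "name: " is 6,
        # and "namespace: " is 11.
        in_term && /^id: /        { id   = substr($0, 5) }
        in_term && /^name: /      { name = substr($0, 7) }
        in_term && /^namespace: / { ns   = substr($0, 12) }
        END { emit_row() }            # output the last stanza
    ' "$1"
}
```

A typical call would be `obo_terms_tsv go.obo > go_terms.tsv`, where `go_terms.tsv` is just an example output name. Piping the result through `awk -F'\t' 'NF != 3' | wc -l` should then print 0 if every row has exactly three fields.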

  • Last modified: 2018/05/22 07:27
  • by 133.11.222.89