Looking at the file contents
    format-version: 1.2
    data-version: releases/2018-05-07
    subsetdef: goantislim_grouping "Grouping classes that can be excluded"
    subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
    subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
    subsetdef: goslim_agr "AGR slim"
    ...

    [Term]
    id: GO:0000001
    name: mitochondrion inheritance
    namespace: biological_process
    def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
    synonym: "mitochondrial inheritance" EXACT []
    is_a: GO:0048308 ! organelle inheritance
    is_a: GO:0048311 ! mitochondrion distribution

    [Term]
    id: GO:0000002
    name: mitochondrial genome maintenance
    namespace: biological_process
    def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]
    is_a: GO:0007005 ! mitochondrion organization

    [Term]
    id: GO:0000003
    name: reproduction
    namespace: biological_process
    alt_id: GO:0019952
    alt_id: GO:0050876
    ...
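I want to pull the id, name, and namespace out of each [Term] block and turn them into a tab-separated list.
Testing whether the Term blocks keep the same structure all the way to the end
If the number of [Term] lines matches the number of id, name, and namespace lines, then (roughly speaking) it is fair to say that each block contains one of each.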
    [kijima.yusuke@m48 Uniprot_swiss]$ grep -E "^\[Term\]$" go.obo | wc -l
    47179
    [kijima.yusuke@m48 Uniprot_swiss]$ grep -E "^id:" go.obo | wc -l
    47189
    [kijima.yusuke@m48 Uniprot_swiss]$ grep -E "^name:" go.obo | wc -l
    47189
    [kijima.yusuke@m48 Uniprot_swiss]$ grep -E "^namespace:" go.obo | wc -l
    47189
The counts are off by 10. Looking at the end of the file:
    [kijima.yusuke@m48 Uniprot_swiss]$ tail -n 20 go.obo
    namespace: external
    xref: RO:0002213
    holds_over_chain: negatively_regulates negatively_regulates
    is_a: regulates ! regulates
    transitive_over: part_of ! part of

    [Typedef]
    id: regulates
    name: regulates
    namespace: external
    xref: RO:0002211
    is_transitive: true
    transitive_over: part_of ! part of

    [Typedef]
    id: starts_during
    name: starts_during
    namespace: external
    xref: RO:0002091

    [kijima.yusuke@m48 Uniprot_swiss]$ grep "\[Typedef\]" go.obo | wc -l
    10
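The file also contains [Typedef] stanzas, which carry their own id, name, and namespace lines. There are exactly 10 of them, which accounts for the difference (47189 - 47179 = 10), so that settles it for now.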
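The count comparison above only shows that the totals line up. A stricter check, sketched here, counts the three tags inside each [Term] stanza individually. It assumes that every stanza begins with a bracketed header line such as [Term] or [Typedef]. The file sample.obo and the function name check_terms are hypothetical stand-ins made up for this example; point the function at go.obo to run it on the real file.

```shell
#!/bin/sh
# Tiny stand-in for go.obo (hypothetical sample, not the real file).
cat > sample.obo <<'EOF'
format-version: 1.2

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process

[Typedef]
id: regulates
name: regulates
namespace: external
EOF

# check_terms FILE
# Reports every [Term] stanza that does not contain exactly one
# id:, one name:, and one namespace: line, then prints a summary.
check_terms() {
    awk '
        # Called whenever a stanza ends (next header or end of file).
        function report() {
            if (in_term && (n_id != 1 || n_name != 1 || n_ns != 1)) {
                printf "bad [Term] stanza starting at line %d\n", start
                bad++
            }
        }
        # Any bracketed header closes the previous stanza and opens a new one.
        /^\[/ {
            report()
            in_term = ($0 == "[Term]")
            start = NR
            n_id = n_name = n_ns = 0
            if (in_term) terms++
            next
        }
        in_term && /^id: /        { n_id++ }
        in_term && /^name: /      { n_name++ }
        in_term && /^namespace: / { n_ns++ }
        END {
            report()
            printf "terms=%d bad=%d\n", terms, bad
        }
    ' "$1"
}

check_terms sample.obo
```

On the sample this prints `terms=2 bad=0`. The [Typedef] stanza is skipped because only stanzas whose header is exactly [Term] are counted.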
Extracting the fields and listing them
Use awk.
    [kijima.yusuke@m48 Uniprot_swiss]$ cat go.obo | awk 'BEGIN{flag1=0} {if(match($0,/^\[Term\]$/))flag1=1} flag1==1&&/^id: GO/{ORS="";print $0"\t"} flag1==1&&/^name: /{ORS="";print $0"\t"} flag1==1&&/^namespace:/{ORS="";print $0"\n";flag1=0}' | sed -e 's/id: \|name: \|namespace: //g' | head
    GO:0000001 mitochondrion inheritance biological_process
    GO:0000002 mitochondrial genome maintenance biological_process
    GO:0000003 reproduction biological_process
    GO:0000005 obsolete ribosomal chaperone activity molecular_function
    GO:0000006 high-affinity zinc transmembrane transporter activity molecular_function
    GO:0000007 low-affinity zinc ion transmembrane transporter activity molecular_function
    GO:0000008 obsolete thioredoxin molecular_function
    GO:0000009 alpha-1,6-mannosyltransferase activity molecular_function
    GO:0000010 trans-hexaprenyltranstransferase activity molecular_function
    GO:0000011 vacuole inheritance biological_process
Looks good. Checking the line count and the tail of the output as well:
    [kijima.yusuke@m48 Uniprot_swiss]$ cat go.obo | awk 'BEGIN{flag1=0} {if(match($0,/^\[Term\]$/))flag1=1} flag1==1&&/^id: GO/{ORS="";print $0"\t"} flag1==1&&/^name: /{ORS="";print $0"\t"} flag1==1&&/^namespace:/{ORS="";print $0"\n";flag1=0}' | sed -e 's/id: \|name: \|namespace: //g' | wc -l
    47179
    [kijima.yusuke@m48 Uniprot_swiss]$ cat go.obo | awk 'BEGIN{flag1=0} {if(match($0,/^\[Term\]$/))flag1=1} flag1==1&&/^id: GO/{ORS="";print $0"\t"} flag1==1&&/^name: /{ORS="";print $0"\t"} flag1==1&&/^namespace:/{ORS="";print $0"\n";flag1=0}' | sed -e 's/id: \|name: \|namespace: //g' | tail
    GO:2001308 gliotoxin metabolic process biological_process
    GO:2001309 gliotoxin catabolic process biological_process
    GO:2001310 gliotoxin biosynthetic process biological_process
    GO:2001311 lysobisphosphatidic acid metabolic process biological_process
    GO:2001312 lysobisphosphatidic acid biosynthetic process biological_process
    GO:2001313 UDP-4-deoxy-4-formamido-beta-L-arabinopyranose metabolic process biological_process
    GO:2001314 UDP-4-deoxy-4-formamido-beta-L-arabinopyranose catabolic process biological_process
    GO:2001315 UDP-4-deoxy-4-formamido-beta-L-arabinopyranose biosynthetic process biological_process
    GO:2001316 kojic acid metabolic process biological_process
    GO:2001317 kojic acid biosynthetic process biological_process
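Yes, it checks out. The output has 47179 lines, matching the number of [Term] lines, and the last entries look right.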
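As a possible variation, the same extraction can be sketched in a single awk program that strips the tag prefixes itself, so the `cat` and `sed` stages are not needed. This also sidesteps one property of the unanchored `sed` substitution, which removes the string "id: " wherever it occurs on a line, including inside a name. The sketch assumes that in every [Term] stanza the namespace line comes after the id and name lines, as in the excerpts above. The file sample.obo and the function name extract_terms are hypothetical.

```shell
#!/bin/sh
# Tiny stand-in for go.obo (hypothetical sample, not the real file).
cat > sample.obo <<'EOF'
format-version: 1.2

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process

[Typedef]
id: regulates
name: regulates
namespace: external
EOF

# extract_terms FILE
# Prints "id<TAB>name<TAB>namespace" for each [Term] stanza.
extract_terms() {
    awk '
        # A bracketed header starts a new stanza; remember whether it is a Term.
        /^\[/ { in_term = ($0 == "[Term]"); next }
        # substr() drops the tag prefix: "id: " is 4 characters,
        # "name: " is 6, and "namespace: " is 11.
        in_term && /^id: /        { id = substr($0, 5) }
        in_term && /^name: /      { name = substr($0, 7) }
        # Assumption: namespace is the last of the three tags in a stanza.
        in_term && /^namespace: / { printf "%s\t%s\t%s\n", id, name, substr($0, 12) }
    ' "$1"
}

extract_terms sample.obo

# Sanity check: every output line should have exactly 3 tab-separated fields.
extract_terms sample.obo | awk -F'\t' 'NF != 3 { bad = 1 } END { exit bad + 0 }'
```

Because the fields are anchored to the start of the line and cut by position, a name that happens to contain "id: " or "name: " passes through unchanged.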