Load GenBank into Chado
Abstract
This HOWTO describes how to load GenBank format files into Chado.
Authors
Copyright
This document is copyright Don Gilbert, 2007. For reproduction other than personal use, please contact <cain@cshl.edu>.
Revision History
Revision 1.0 | 2007-04-16 | BIO | First version |
Creating GFF3 from GenBank Files
GFF3 can also be generated using a script provided by BioPerl, scripts/Bio-DB-GFF/genbank2gff3.pl, which is installed as bp_genbank2gff3.pl (this script is currently preferred over the script of the same name found in the GMOD package). If your working directory contains a GenBank file, you could use it like this:
>bp_genbank2gff3.pl --dir . --outdir .
A recent update (April 2007) to bp_genbank2gff3.pl and gmod_bulk_load_gff3.pl should solve the first two problems below. Another addition to bp_genbank2gff3.pl is the --noCDS option, which produces GFF gene models suited to loading into Chado.
>bp_genbank2gff3.pl --noCDS --in mygenome.gbk
>gmod_bulk_load_gff3.pl --database mygenome --gff mygenome.gbk.gff
Possible Errors
This method for generating GFF3 files is not completely satisfactory, and development is ongoing to provide better translation. However, by proceeding carefully you should be able to get it to produce GFF3 that can be loaded. Possible errors from running these scripts, and their fixes, are described below.
couldn't open /var/lib/gmod/conf directory for reading:No such file or directory
Make sure the environment variable GMOD_ROOT is set to the directory where GMOD was installed, for example:
setenv GMOD_ROOT /usr/local/gmod/ # tcsh
or
export GMOD_ROOT=/usr/local/gmod/ # bash
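Before rerunning the loader, it can help to confirm that the variable is visible to the process and points somewhere sensible. The following is a minimal sketch, not part of GMOD: the function name check_gmod_root is hypothetical, and the assumption that the loader reads a conf subdirectory under GMOD_ROOT is inferred only from the path in the error message above.

```python
import os


def check_gmod_root(environ):
    """Return a list of problems found with GMOD_ROOT (empty if none)."""
    root = environ.get("GMOD_ROOT")
    if not root:
        return ["GMOD_ROOT is not set"]
    # Assumption: the configuration lives in a 'conf' subdirectory,
    # as the '/var/lib/gmod/conf' path in the error message suggests.
    conf = os.path.join(root, "conf")
    if not os.path.isdir(conf):
        return [conf + " is not a directory"]
    return []


# Report any problems for the current environment.
for problem in check_gmod_root(os.environ):
    print(problem)
```

If nothing is printed, the variable is set and the assumed conf directory exists.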
Unable to find srcfeature <some feature> in the database
Solution: Edit the '##sequence-region' directive on the second line of the GFF3 output. Changing it to '# sequence-region' is enough; alternatively, remove the line.
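For files with many sequences, making this edit by hand gets tedious. A minimal sketch of the same edit in Python follows; the function name neutralize_sequence_region and the sample sequence ID are hypothetical, and the code simply turns every '##sequence-region' directive into an ordinary '# sequence-region' comment, leaving all other lines untouched.

```python
def neutralize_sequence_region(lines):
    """Rewrite '##sequence-region' directives as plain '# ' comments."""
    out = []
    for line in lines:
        if line.startswith("##sequence-region"):
            # Drop the '##' directive marker and prefix a normal comment marker.
            line = "# " + line[2:]
        out.append(line)
    return out


# Hypothetical sample input; a real file would be read with open(...).
sample = [
    "##gff-version 3\n",
    "##sequence-region NC_000001 1 5000\n",
    "NC_000001\tGenBank\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
]
fixed = neutralize_sequence_region(sample)
print("".join(fixed), end="")
```

To apply it to a real file, read the lines from the GFF3 output, pass them through the function, and write the result to a new file so the original is kept.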
Your GFF3 file uses a tag called <term>, but this term is not already in the cvterm and dbxref tables so that its value can be inserted into the featureprop table
Solution: This error message is followed by SQL statements that insert the term correctly; execute them. One explanation for this error is that the source sequence was curated with terms that do not come from the Sequence Ontology.
DBD::Pg::db pg_endcopy failed: ERROR: duplicate key violates unique constraint "featureprop_c1"
CONTEXT: COPY featureprop, line ...
Solution: The CONTEXT line above tells you which data caused the failure. This error probably means that two features in the GFF file share the same name or ID and the same feature type. Correct these errors by hand and reload.
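Finding the clashing features by eye is hard in a large file. The sketch below is one possible way to list candidates, assuming a standard nine-column, tab-delimited GFF3 file; the function names parse_attributes and find_duplicates and the sample data are hypothetical. It reports every (feature type, ID) or (feature type, Name) pair that appears on more than one line. Note that GFF3 allows a single feature to span several lines with the same ID, so a reported pair is a candidate to inspect, not necessarily an error.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse the GFF3 attributes column into a dict (no URL-decoding)."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def find_duplicates(lines, tag="ID"):
    """Return {(feature_type, tag_value): [line numbers]} for repeated pairs."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("##FASTA"):
            break  # sequence data follows; no more feature lines
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        value = parse_attributes(cols[8]).get(tag)
        if value is not None:
            seen[(cols[2], value)].append(lineno)
    return {key: nums for key, nums in seen.items() if len(nums) > 1}


# Hypothetical sample: two genes share an ID, and two share a Name.
sample = [
    "##gff-version 3\n",
    "chr1\tGenBank\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=abc\n",
    "chr1\tGenBank\tgene\t2000\t2900\t.\t-\t.\tID=gene1;Name=xyz\n",
    "chr1\tGenBank\tgene\t4000\t4900\t.\t+\t.\tID=gene2;Name=abc\n",
]
print(find_duplicates(sample, tag="ID"))
print(find_duplicates(sample, tag="Name"))
```

The reported line numbers refer to lines in the GFF file, which makes it easier to locate and correct the entries before reloading.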