NOTE: We are working on migrating this site away from MediaWiki, so editing pages will be disabled for now.
Difference between revisions of "InterMine Presentation"
From GMOD
m (Text replace - "</xml>" to "</syntaxhighlight>") |
|||
(37 intermediate revisions by 4 users not shown) | |||
Line 3: | Line 3: | ||
====Background==== | ====Background==== | ||
− | InterMine was developed as the generic underpinnings of the [http://www.flymine.org FlyMine Project] | + | [http://www.intermine.org InterMine] was developed as the generic underpinnings of the [http://www.flymine.org FlyMine Project] |
* Team of 7 FTE | * Team of 7 FTE | ||
Line 13: | Line 13: | ||
* In use by others in Cambridge, Edinburgh, Vienna… + modENCODE DCC if funded | * In use by others in Cambridge, Edinburgh, Vienna… + modENCODE DCC if funded | ||
* modENCODE/ Chado | * modENCODE/ Chado | ||
− | |||
− | |||
====Technical Overview==== | ====Technical Overview==== | ||
+ | * Data model --> Java classes, relational schema, mappings through automatic code generation | ||
* Custom Java object/relational system | * Custom Java object/relational system | ||
** When we started, couldn’t select from multiple classes at one time using hibernate. | ** When we started, couldn’t select from multiple classes at one time using hibernate. | ||
* Optimised for read-only performance | * Optimised for read-only performance | ||
− | * Designed for big, complex queries | + | * Designed for big, complex queries, bulk data |
* Performance optimisation | * Performance optimisation | ||
**Transparent query re-writing | **Transparent query re-writing | ||
* Web application - Struts/JSP/Ajax | * Web application - Struts/JSP/Ajax | ||
− | ====Example InterMine XML==== | + | ====Loading Data==== |
+ | |||
+ | * Read-only in production environment (therefore Problems 3 and 5 skipped) | ||
+ | * Load data from InterMine XML | ||
+ | * Parsers from standard formats | ||
+ | ** e.g. UniProt, GFF3, PSI, FASTA | ||
+ | * Powerful integration system: coarse/fine grained data source priorities give load-order independence | ||
+ | |||
+ | =====Test problems===== | ||
+ | * Used SOFA as core data model - similar to Chado. | ||
+ | * Added Gene.description (absent from model), compiled, loaded data (here XML + FASTA), released webapp. | ||
+ | |||
+ | =====Example InterMine XML for Problem 1: load genes + annotation===== | ||
− | <xml> | + | <syntaxhighlight lang="xml"> |
<items> | <items> | ||
<item id="0_3" class=”” implements="http://www.flymine.org/model/genomic#Gene"> | <item id="0_3" class=”” implements="http://www.flymine.org/model/genomic#Gene"> | ||
Line 42: | Line 53: | ||
</item> | </item> | ||
... | ... | ||
− | </ | + | </syntaxhighlight> |
+ | =====Resulting webapp object page===== | ||
+ | [[Image:xfile_gene.png|xfile Gene details page]] | ||
+ | ====Code for Problem 2: Print gene annotation report==== | ||
− | + | <syntaxhighlight lang="java"> | |
− | + | ||
− | + | ||
− | < | + | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
public class BakeOff { | public class BakeOff { | ||
public static void main(String[] args) throws Exception { | public static void main(String[] args) throws Exception { | ||
Line 135: | Line 118: | ||
} | } | ||
− | </java> | + | </syntaxhighlight> |
+ | |||
+ | ====Quicksearch - Problem 4: find genes starting with x==== | ||
+ | =====Java API===== | ||
+ | |||
+ | <syntaxhighlight lang="java"> | ||
+ | Query q = new Query(); | ||
+ | QueryClass qcObj = new QueryClass(Gene.class); | ||
+ | q.addFrom(qcObj); | ||
+ | q.addToSelect(qcObj); | ||
+ | |||
+ | QueryField qf = new QueryField(qcObj, "identifier"); | ||
+ | |||
+ | SimpleConstraint sc = new SimpleConstraint(qf, ConstraintOp.MATCHES, new QueryValue("x-%")); | ||
+ | q.setConstraint(sc); | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | =====IQL===== | ||
+ | |||
+ | <syntaxhighlight lang="sql"> | ||
+ | SELECT DISTINCT a1_.identifier AS a2_ FROM org.flymine.model.genomic.Gene AS a1_ WHERE a1_.identifier LIKE 'x-%' | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | =====Perl API===== | ||
+ | |||
+ | <syntaxhighlight lang="perl"> | ||
+ | my $genes = InterMine::Gene::Manager->get_genes(query => [ | ||
+ | identifier => { like => 'x-%' },],); | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | ====Larger Query==== | ||
+ | |||
+ | Within FlyMine: | ||
+ | For one or more genes report: | ||
+ | * Gene, Transcripts, Exons, Chromosomal Locations, Lengths | ||
+ | |||
+ | * Query joins 7 classes | ||
+ | ** all are on select list of query | ||
+ | ** many more tables than classes are joined | ||
+ | |||
+ | * Performance: | ||
+ | ** One gene: | ||
+ | *** 2 rows in ~2 seconds | ||
+ | ** All genes, all organisms | ||
+ | ***~300,000 rows in 36 seconds (without using pre-computation to enhance performance) | ||
+ | ***~300,000 rows in ~1 second (using pre-computation) | ||
+ | |||
+ | ====Implications of Query Optimisation==== | ||
+ | |||
+ | * Performance optimisation not tied to schema design | ||
+ | * Can adapt performance optimisation to usage of live database | ||
+ | * Template queries pre-computed | ||
+ | ** ~40 template queries run per gene details page - renders in seconds | ||
+ | |||
+ | ====Acknowlegements==== | ||
+ | |||
+ | * Richard Smith | ||
+ | * Kim Rutherford | ||
+ | * Matthew Wakeling | ||
+ | * Xavier Watkins | ||
+ | * Julie Sullivan | ||
+ | * Rachel Lyne | ||
+ | * Hilde Janssens | ||
+ | * François Guillier | ||
+ | * Philip North | ||
+ | * Tom Riley | ||
+ | * Peter Mclaren | ||
+ | * Mark Woodbridge | ||
+ | * Debashis Rana | ||
+ | * Wenyan Ji | ||
+ | * Markus Brosch | ||
+ | * Florian Reising | ||
+ | * Andrew Varley | ||
+ | * Gos Micklem | ||
+ | |||
+ | InterMine/FlyMine are funded by the Wellcome Trust (grant no. 067205), | ||
+ | awarded to M. Ashburner, G. Micklem, S. Russell, K. Lilley | ||
+ | and K. Mizuguchi. | ||
+ | |||
+ | [[Category:InterMine]] |
Latest revision as of 21:16, 9 October 2012
This Wiki page is an edited version of Gos's presentation
Contents
Background
InterMine was developed as the generic underpinnings of the FlyMine Project
- Team of 7 FTE
- 5 developers, one sys admin,
- 1 biologist/ bioinformatician
- Java/ postgreSQL
- SVN repository: 125,000 lines of code + 57,000 lines of tests
- Under development since 2002
- In use by others in Cambridge, Edinburgh, Vienna… + modENCODE DCC if funded
- modENCODE/ Chado
Technical Overview
- Data model --> Java classes, relational schema, mappings through automatic code generation
- Custom Java object/relational system
- When we started, couldn’t select from multiple classes at one time using hibernate.
- Optimised for read-only performance
- Designed for big, complex queries, bulk data
- Performance optimisation
- Transparent query re-writing
- Web application - Struts/JSP/Ajax
Loading Data
- Read-only in production environment (therefore Problems 3 and 5 skipped)
- Load data from InterMine XML
- Parsers from standard formats
- e.g. UniProt, GFF3, PSI, FASTA
- Powerful integration system: coarse/fine grained data source priorities give load-order independence
Test problems
- Used SOFA as core data model - similar to Chado.
- Added Gene.description (absent from model), compiled, loaded data (here XML + FASTA), released webapp.
Example InterMine XML for Problem 1: load genes + annotation
<items> <item id="0_3" class=”” implements="http://www.flymine.org/model/genomic#Gene"> <attribute name="identifier" value="xfile" /> <attribute name="description" value="A test gene for GMOD meeting" /> <reference name="organism" ref_id="0_1" /> <collection name="transcripts"> <reference ref_id="0_9" /> </collection> </item> <item id="0_1" class="" implements="http://www.flymine.org/model/genomic#Organism"> <attribute name="taxonId" value="7227" /> </item> ...
Resulting webapp object page
Code for Problem 2: Print gene annotation report
public class BakeOff { public static void main(String[] args) throws Exception { // code to get the "xfile" gene ObjectStore os = ObjectStoreFactory.getObjectStore("os.production"); Query q = new Query(); QueryClass qcObj = new QueryClass(Gene.class); q.addFrom(qcObj); QueryField qf = new QueryField(qcObj, "identifier"); q.addToSelect(qf); SimpleConstraint sc = new SimpleConstraint(qf, ConstraintOp.EQUALS, new QueryValue("xfile")); q.setConstraint(sc); System.err.println("query: " + q); Results res = os.execute(q); // a Results object is a List of Lists List rr = (List) res.get(0); Gene gene = (Gene) rr.get(0); System.err.println ("symbol: " + gene.getIdentifier()); // a BioEntity in FlyMine has a collection of Synonym objects - // we need Synonym.value for each Synonym System.err.print ("synonyms: "); Iterator synIter = gene.getSynonyms().iterator(); while (synIter.hasNext()) { Synonym syn = (Synonym) synIter.next(); System.err.print (syn.getValue() + ' '); } System.err.println ("description: " + gene.getDescription()); // get the class name, but we already know that the gene is a Gene System.err.println ("type: " + gene.getClass().getName()); // make a List from a the Set of exons for this Gene List exons = new ArrayList(gene.getExons()); Exon exon1 = (Exon) exons.get(0); Exon exon2 = (Exon) exons.get(1); // get the start and end via the Location object System.err.println ("exon1 start: " + exon1.getChromosomeLocation().getStart()); System.err.println ("exon1 end: " + exon1.getChromosomeLocation().getEnd()); System.err.println ("exon2 start: " + exon2.getChromosomeLocation().getStart()); System.err.println ("exon2 end: " + exon2.getChromosomeLocation().getEnd()); // write out the first cds List cdss = new ArrayList(gene.getCDSs()); FlyMineSequence flymineSequence = FlyMineSequenceFactory.make((CDS) cdss.get(0)); // use BioJava to output the sequence Annotation annotation = flymineSequence.getAnnotation(); annotation.setProperty(FastaFormat.PROPERTY_DESCRIPTIONLINE, gene.getIdentifier() + " cds"); SeqIOTools.writeFasta(System.err, flymineSequence); } }
Quicksearch - Problem 4: find genes starting with x
Java API
Query q = new Query(); QueryClass qcObj = new QueryClass(Gene.class); q.addFrom(qcObj); q.addToSelect(qcObj); QueryField qf = new QueryField(qcObj, "identifier"); SimpleConstraint sc = new SimpleConstraint(qf, ConstraintOp.MATCHES, new QueryValue("x-%")); q.setConstraint(sc);
IQL
SELECT DISTINCT a1_.identifier AS a2_ FROM org.flymine.model.genomic.Gene AS a1_ WHERE a1_.identifier LIKE 'x-%'
Perl API
my $genes = InterMine::Gene::Manager->get_genes(query => [ identifier => { like => 'x-%' },],);
Larger Query
Within FlyMine: For one or more genes report:
- Gene, Transcripts, Exons, Chromosomal Locations, Lengths
- Query joins 7 classes
- all are on select list of query
- many more tables than classes are joined
- Performance:
- One gene:
- 2 rows in ~2 seconds
- All genes, all organisms
- ~300,000 rows in 36 seconds (without using pre-computation to enhance performance)
- ~300,000 rows in ~1 second (using pre-computation)
- One gene:
Implications of Query Optimisation
- Performance optimisation not tied to schema design
- Can adapt performance optimisation to usage of live database
- Template queries pre-computed
- ~40 template queries run per gene details page - renders in seconds
Acknowlegements
- Richard Smith
- Kim Rutherford
- Matthew Wakeling
- Xavier Watkins
- Julie Sullivan
- Rachel Lyne
- Hilde Janssens
- François Guillier
- Philip North
- Tom Riley
- Peter Mclaren
- Mark Woodbridge
- Debashis Rana
- Wenyan Ji
- Markus Brosch
- Florian Reising
- Andrew Varley
- Gos Micklem
InterMine/FlyMine are funded by the Wellcome Trust (grant no. 067205), awarded to M. Ashburner, G. Micklem, S. Russell, K. Lilley and K. Mizuguchi.