MITOMAP Data Curation
This project has been archived. Data curation is still an important concern, but this was an attempt to create an Open Office client tool for the MITOMAP curator. Because our informatics efforts have become so web-based, a curation tool implemented using AJAX seems a better choice. The work of this project is now consolidated back into the MITOMAP project.
Improve the harvesting of mtDNA data for MITOMAP.
- mitocurator - OO used to open the extracted variation list and a BASIC macro makes the appropriate spreadsheet entries
- mitoextractor - Perl program using wxCocoaDialog for data extraction (can this use FileIO.pm?, RefSeq.pm? . . ., other Mitomaster integration)
- crowd sourcing - Not sure exactly. Some ideas:
- expand the submission of data
- allow comments on data
- create a trigger in the mito database to generate edit dates and remove these tables from Mitomap.xls
Get OO set up on Marie's computer with a database connection
Test the use of Strawberry Perl with wxCocoaDialog on my computer
Get Strawberry Perl set up on Marie's computer
- mitocurator (mitomap_prototype) - focus on regular polymorphism curation, the references, then clinical variants
create a prototype GUI that is passing values
create OO BASIC data structure for the rCRS and use to automatically give the ref allele to the curator
add some dummy data
create a trigger function that will populate the Variant ID field for variants that have been previously seen
must search not only the polymorphism sheet, but also the clinical spreadsheets
give a descriptor value that indicates how the previously reported variant was classified
create function that will populate the 'Previously Reported' field (11/19)
get all references from the appropriate linker table
use a format that identifies the type of citation and the authors
- implement the predicted 'Coding Change' (11/28)
- create OO BASIC data structure for coding changes
- implement this in RefSeq.pm and then generate the declarations in OO BASIC, alternatively implement the declarations in psql and begin to make greater use of database function calls
- also make use of this to generate the complete set of mtDNA alleles (complete set of allele topic pages)
- create a trigger function that uses the data structure to give feedback to the curator
- implement the actual spreadsheet entries on submitting a variant (12/5)
- use the max ID value of the spreadsheet
- add automatic updating of the edit date values
- implement managing the 'Note' field (12/8)
- implement data syncing - mitosync (12/12)
- adjust mito database to authenticate a mitoadmin connection
- test DML in OO
- sync macro based on max. value of primary key
- rewrite rebuild script for OO format (12/23)
- how best to parse OO Calc format? perl modules??
- refactor FileIO.pm again!!
- add support for clinical field entries (12/29)
- auto-increment reference ID values when Marie pastes a new reference from biglist (1/5)
- can this be done using a cell definition?
- trigger on the spreadsheet?
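Several of the tasks above (using the max ID value of the spreadsheet, auto-incrementing reference ID values) reduce to the same small operation. A minimal sketch, written in Python rather than OO BASIC and assuming the ID column has already been read into a list that may contain header text or blank cells; `next_id` is a hypothetical helper name, not part of mitocurator:

```python
def next_id(id_column):
    """Return one more than the largest integer ID found in a column of cell values."""
    ids = []
    for cell in id_column:
        # Spreadsheet cells may hold numbers as floats or strings, or be empty.
        try:
            ids.append(int(float(cell)))
        except (TypeError, ValueError, OverflowError):
            continue  # skip headers, blanks, and other non-numeric cells
    # An empty column starts numbering at 1.
    return max(ids, default=0) + 1
```

Whether this lives in a cell definition, a sheet trigger, or a macro is still the open question; the sketch only shows the calculation itself.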
The mitocurator interface might be implemented just as easily (maybe more easily) using AJAX instead of OpenOffice. Doesn't look too hard:
- Create a Perl script on the server which receives the request, queries the database, and returns the data as JSON
- These could probably be facilitated with one of the Perl frameworks (Catalyst, Gantry, Mojolicious, or Titanium). Probably better to use a light-weight framework with Foswiki.
- Though PostgreSQL can be induced to return XML, there seems to be a reasonable argument for using JSON, which has a very simple structure.
- There are also numerous Perl modules for working with JSON.
- AJAX Perl article - good simple example
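The request, query, JSON flow described above can be sketched in a few lines. This is an illustration only: it uses Python instead of Perl, an in-memory SQLite database as a stand-in for PostgreSQL, and a hypothetical `variant` table with made-up column names and dummy rows.

```python
import json
import sqlite3

def variants_as_json(conn, position):
    """Query variants at a nucleotide position and return them as a JSON string."""
    cur = conn.execute(
        "SELECT id, position, ref, alt FROM variant WHERE position = ? ORDER BY id",
        (position,),
    )
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, row)) for row in cur.fetchall()]
    return json.dumps({"position": position, "variants": rows})

# Stand-in database: hypothetical schema, dummy rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variant (id INTEGER PRIMARY KEY, position INTEGER, ref TEXT, alt TEXT)"
)
conn.executemany(
    "INSERT INTO variant (position, ref, alt) VALUES (?, ?, ?)",
    [(73, "A", "G"), (263, "A", "G")],
)
print(variants_as_json(conn, 73))
```

On the server this function would sit behind whichever framework is chosen; the browser side only has to parse the returned JSON.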
How can we utilize community contributions to increase the quantity and accuracy of our data?
- Variant Pages - Create a wiki page for each mtDNA variant. Link to MITOMAP information, output from MITOMASTER, and include a section curated by the community.
Did some work getting AJAX functionality working with Foswiki's JQuery plugin. Very nice! This certainly seems like the way to go for a curation tool. No need to learn OOBasic, just do all the updates on a "hot" database within the system. Curator has nothing to install, maybe this could lead to crowd-sourcing. Also easier to implement. Plan: implement a prototype of what I've done in OpenOffice on a page accessible only to Marie. Get some feedback to make a final decision.
Killed Open Office framework project and incorporated that work into this project. Seems that OpenOffice serves a good niche for data management, but I'll stick to web-based or command-line apps for most of my work.
Ideas for improving data curation:
- curation tool - create standalone perl utilities and macros in OpenOffice that relieve Marie from some of the tedium of adding new data
- crowd sourcing - can we use social networking to have the community assist in data curation (e.g. mirroring MITOMAP tables in Mitowiki)
- mining - create programs to extract data from current sources and to derive new information
- standardization - any possibility of implementing a data standard that would allow automated extraction of data from published works?
Updating the 'Previously Reported' value is working for my test data. Some concern over the number of operations performed when checking a variant. May require optimization for the full data set. Maybe begin by sorting the rows of the sheets. Maybe call the number format conversion function at the beginning.
Searching for new variants seems to be working. It was desirable to check these spreadsheets: polymorph, mMut, rtMut, somat, unpub, ins, del. However, due to the non-standardized way in which the ins and del entries are represented, these sheets are not searched.
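One possible way to keep the check cheap on the full data set is to build a lookup index once, instead of scanning every sheet for each variant. A minimal sketch in Python (not the OO BASIC actually used), assuming each sheet has been read into a list of dictionaries; the keys `id`, `position`, and `change` are hypothetical names, and the ins and del sheets are skipped as described above:

```python
# Sheets that are searched; ins and del are left out because of their
# non-standardized representation.
SEARCHED_SHEETS = ("polymorph", "mMut", "rtMut", "somat", "unpub")

def build_index(sheets):
    """Map (position, change) to a list of (sheet name, variant ID) pairs."""
    index = {}
    for name in SEARCHED_SHEETS:
        for row in sheets.get(name, []):
            key = (row["position"], row["change"])
            index.setdefault(key, []).append((name, row["id"]))
    return index

def previously_reported(index, position, change):
    """Describe where a variant was seen, or report that it is new."""
    hits = index.get((position, change))
    if not hits:
        return "Not currently in MITOMAP"
    # The sheet name doubles as the descriptor of how the variant was classified.
    return "; ".join(f"{sheet}:{variant_id}" for sheet, variant_id in hits)
```

Each check is then a single dictionary lookup, at the cost of rebuilding the index whenever a sheet changes.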
Another GUI window called mitocurator was added. This GUI window will serve as the gateway to all curation tools. The GUI interface and underlying software will be called "mitocurator", while the spreadsheet document has been renamed to mitomap (although with a .ods file extension). "Openoffice mitomap" will represent the spreadsheet-managed version of MITOMAP data, and will replace "mitomap Excel".
Directly editing the database doesn't seem like a good idea. Instead, move the rebuilding activity into mitocurator as a "syncing" function. How best to implement syncing? Maybe do a basic syncing based on the maximum value of the primary key, and re-implement the rebuild script for open office.
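A basic sync keyed on the maximum primary key could look roughly like this. The sketch is in Python with an in-memory SQLite database standing in for the mito database; the `variant` table and its columns are hypothetical, and `sync_variants` is an illustrative name, not existing mitocurator code.

```python
import sqlite3

def sync_variants(conn, sheet_rows):
    """Insert sheet rows whose id exceeds the current maximum id in the table.

    Returns the number of rows inserted.
    """
    (max_id,) = conn.execute("SELECT COALESCE(MAX(id), 0) FROM variant").fetchone()
    new_rows = [
        (row["id"], row["position"], row["allele"])
        for row in sheet_rows
        if row["id"] > max_id
    ]
    with conn:  # commit on success, roll back if an insert fails
        conn.executemany(
            "INSERT INTO variant (id, position, allele) VALUES (?, ?, ?)", new_rows
        )
    return len(new_rows)
```

This only picks up rows added since the last sync. Edits to existing rows are not detected, which is one reason a re-implemented rebuild script would still be needed alongside it.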
Skype call with Marie:
- reference identifiers for "Previously Reported" variants should indicate the variant category
- if not found in the database, then "Previously Reported" should give feedback of "Not currently in MITOMAP"
- Submit button should give feedback about what was done
- Variants need a "notes" field that is for internal use only
- aachange column currently has a misleading representation of 'noncoding', generate values which are descriptive of the transcript product
- Clinical Fields:
- Disease: dropdown of diseases currently in the database with option to add
- Conservation: dropdown of H, M, L, NR, +, -
- Controls: textbox
- Homoplasmic: dropdown of +, -, NR
- Heteroplasmic: dropdown of +, -, NR
- Status: dropdown of everything currently in the database with option to add
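The fixed dropdowns in the clinical fields could be kept in a single table that drives both the GUI and validation. A minimal sketch in Python, using only the option lists given above; the lowercase field names and the `invalid_fields` helper are hypothetical. Disease and Status are left out because their options come from the database.

```python
# Allowed options for the fixed-choice clinical fields, as listed above.
FIXED_CHOICES = {
    "conservation": ("H", "M", "L", "NR", "+", "-"),
    "homoplasmic": ("+", "-", "NR"),
    "heteroplasmic": ("+", "-", "NR"),
}

def invalid_fields(entry):
    """Return the names of fixed-choice fields whose value is not an allowed option."""
    return sorted(
        field
        for field, allowed in FIXED_CHOICES.items()
        if field in entry and entry[field] not in allowed
    )
```

Free-text fields such as Controls pass through unchecked.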
Basic layout of the GUI for mitocurator looks okay. Have decided it would be best if we could ditch the spreadsheets and edit the database directly, but unsure of the technical difficulties. Get some feedback from Marie about the new mitocurator layout.
Explored database interaction with OO. Bain and Pitonyak are both great sources of information. New book about database development from a guy at Stanford looks like a must have. Have OO making database queries, but haven't tried DML. How do database forms differ from dialogs?
Skype call with Marie:
- mitomap_prototype and mitoextractor work on her computer
- add two generated fields to the right side of the variation_curation GUI: Variant ID# and Ref Info. Ref Info should be a listing of all the references that have been associated with the variant; each entry should begin with the Ref ID#, followed by a short descriptor similar to a citation entry.
- Variants being added are either new or previously seen. If previously seen, an entry may exist in the polymorphism sheet or any of the clinical sheets. All of these sheets should be checked, and any entries are reported back to the curator.
- Curating references is mostly a cut n' paste operation from "biglist", but a macro should handle the generation of new ref ID#s. Can this be embedded in the cell of the spreadsheet?
- Curating clinical variants is a very similar process, only more information is added. Ideally, the same tool could be used by having a "clinical button" that would act as a toggle for showing additional fields in the data entry.
- Priorities: 1) curating regular polymorphisms (Marie is waiting for this), 2) curating references, 3) curating clinical variants, 4) extracting data from phylotree and other sources.
- Direct editing of the database seems worth exploring. At the very least mitocurator should be enabled to make database calls. Marie would like the ability to submit a Genbank number and have that sequence automatically retrieved and entered into the database. Ultimately, we need to associate variants with all of the sequences in which they are found. These associations should be generated within the database and presented within MITOMAP and other places in CMEM Web.
- When entering a new variant: if the variant is not found, then populate the "Reference ID" field with the most recent entry of the Reference spreadsheet. The value populated should be composed of a prefix that is the reference ID# followed by a short human readable citation descriptor (e.g. "2300 - Wallace and Mishmar 2005"). This value can either be accepted by the curator or an alternative reference ID# can be entered. The mitocurator program will use the ID# prefix for associating the variant with a reference and ignore the rest of the value submitted.
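The prefix handling in the last item can be sketched as follows. This is a Python illustration rather than the OO BASIC implementation; the function names are hypothetical, and the example value is the one given in the item.

```python
import re

def format_reference(ref_id, descriptor):
    """Build the value shown to the curator: ID# prefix plus a short citation."""
    return f"{ref_id} - {descriptor}"

def reference_id_from_field(value):
    """Extract the leading reference ID# and ignore the rest of the submitted value."""
    match = re.match(r"\s*(\d+)", str(value))
    if match is None:
        raise ValueError(f"no reference ID# prefix in {value!r}")
    return int(match.group(1))
```

The curator can accept the populated value as is or type a bare ID#; both forms yield the same reference ID.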
Added amino acid names to tool tip, but they all appear on one line. Doesn't seem to be any way of adding a line break in OO tool tips.
Added functions to Spreadsheet library. 'getSheet' works well. More functions needed. Some snippets that will be useful for mitocurator:
- oCell.Value = now ' use this for updating modification dates
- oSheet.getRows.removeByIndex( 6, 3 ) ' use this to delete rows
- oRow = oSheet.getRows.getByIndex( 0 ) ' get a row
- oColumn = oSheet.getColumns.getByIndex( 1 ) ' get a column
Basic GUI for mitocurator is in mitomap_prototype. Values are being passed back. mitocurator should be aimed at managing the most tedious parts of MITOMAP curation, presumably polymorphisms. Other functionality might be added in the future. Aim at doing a good job with well-defined functionality, learn some more OO, and build my foundation libraries. Get input from Marie regarding the interface.
Got a prototype mitoprocessor dialog object in OO. How to change the drop down box list?? Polish this a bit more and get Marie's input.
pp works well for packaging mitoextractor into an executable. Attached prototype mitoextractor file selection executable.
Next steps: Begin designing a GUI Dialog in OO for inputting new polymorphisms. Begin adding file validation code to mitoextractor.
Strawberry Perl with wxCocoaDialog works on my XP Parallels, though wxCocoaDialog does not seem to be a full port. Remember to install wxCocoaDialog into the C:\wxCocoaDialog directory on Marie's computer to keep the paths the same. Unfortunately, Platypus only bundles programs for distribution on a Mac and there does not seem to be a Windows counterpart. Use pp instead. Would be nice to be using ActiveState's developer tools in a situation like this. Next steps: Get a basic file selection script implemented in Perl with wxCocoaDialog and packaged with pp.