- Main project page - This link was given in the Bioinformatics publication.
- Archive page - has attachments of data files for the initial implementation
- Adjusted compression factor values for HapMap data for compressing a full chromosome 1 sequence and posted a link to the software. Still need to test round-trip usage of mtDNA, finish implementing custom data sets, and provide a little more instruction.
- Checked new compression numbers with old. There at first seemed to be discrepancies. For the rCRS mtDNA data this was due to two values being incorrectly switched in a table included in the draft of the paper. Consensus data differs slightly due to a minor difference in generating the consensus sequence which changes the number of runlengths and their frequencies. Overall, there is good agreement.
- Compression factor is actually quite large when considering a full sequence (about 1,400). Numbers for raw file sizes of HapMap data and the space required to store a position-variant representation in ASCII have been checked again and seem accurate.
- double-check numbers for each encoding
- implement mtDNA encodings
- do round-trip testing and checking of numbers for mtDNA data
- Still searching for an error in the calculations, but now suspect they are accurate. Possible reasons for the compression factor being small in HapMap data relative to mtDNA data:
- chromosome 1 is roughly 15,000 times the size of mtDNA (247 Mbp vs. 16.5 kbp), giving greater freedom for the variety of runlengths that have to be encoded, and therefore larger symbol sizes
- mtDNA is very gene-dense, which tends to restrict the variety of variants seen, making them more predictable, and so less informative
- mtDNA data analyzed is heavily biased towards European haplogroups, while the HapMap data set has been specifically designed to cover a wide range of ethnic diversity (again mtDNA is more predictable)
- Added file size computations to the code that extracts the individual donor data. Renamed 'raw' files to 'pv' for position-variant. Added an easily parseable line at the end of each analysis containing values for generating the compression factor. Still not sure if there is a bug in computing sizes. Adding code for computing the compression factor.
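The compression factor calculation being added can be sketched like this (assuming the factor is the ASCII position-variant size divided by the expected encoded size; the function names and all numbers here are made up for illustration):

```python
# Hypothetical sketch of the compression factor computation: the expected
# encoded size comes from symbol frequencies and per-symbol bit lengths.
def expected_bits(freqs, symbol_bits):
    # freqs: symbol -> count; symbol_bits: symbol -> encoded length in bits
    return sum(count * symbol_bits[sym] for sym, count in freqs.items())

def compression_factor(ascii_bytes, encoded_bits):
    # ratio of the ASCII position-variant size to the encoded size in bytes
    return ascii_bytes / (encoded_bits / 8.0)

freqs = {"A": 700, "G": 300}
bits = {"A": 1, "G": 2}
enc = expected_bits(freqs, bits)      # 1300 bits
print(compression_factor(8000, enc))  # about 49.2
```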
- HapMap data format is not as wasteful as it first seems.
- S = # of sequences
- N = # of variable positions
- P = avg. # of characters per position value
- V = avg. # of characters per variant
- M = amount of meta data (this is the repetitive stuff in the HapMap file that describes things like the lab protocol)
- HapMap Size = N * ( P + M + ( S * V ) ) = ( N * P ) + ( N * M ) + ( N * S * V )
- Pos-Variant Size = N * S * ( P + V ) = ( N * S * P) + ( N * S * V)
- Cancel the ( N * S * V ) terms: ( N * P ) + ( N * M ) ~ ( N * S * P)
- Meta-data is essentially a fixed size, so when the number of sequences is large, the HapMap format becomes a more efficient representation. This makes sense because when the number of sequences is large, the position-variant notation repeats the same set of position numbers for each sequence. Consensus adjustments are not considered in this, but the relationship should still hold.
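The size comparison can be checked numerically; a quick sketch, with variable names following the derivation above and made-up sample values:

```python
# Variable names mirror the derivation above; sample values are made up.
def hapmap_size(S, N, P, V, M):
    # Each variable position carries its position value, the per-line
    # metadata, and one variant call per sequence.
    return N * (P + M + S * V)

def pos_variant_size(S, N, P, V):
    # Each sequence repeats the position number for every variable position.
    return N * S * (P + V)

S, N, P, V, M = 1000, 100000, 9, 2, 200
# With many sequences, the HapMap format wins despite the metadata:
print(hapmap_size(S, N, P, V, M) < pos_variant_size(S, N, P, V))  # True
```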
- Code for generating variants relative to the mtDNA consensus sequence is very nasty. Rewrite this later as a plpgsql function so that variation, relative to any sequence in the database, can easily be extracted for a sequence. Using %u with printf in perl will silently rewrite any number over 2^32 (about 4 billion) as 2^32. Writing the number as a float will preserve the magnitude of the value but results in a loss of precision beyond the first 16 digits. The most accurate approach is to use either Math::BigInt or Math::BigFloat. A few of my numbers exceed the 4 billion mark by one or two places, but the error is quite trivial. If results were desired for a larger dataset, such as the whole genome, then this code should be rewritten.
- Incorporating mtDNA code from previous modules.
- Generated donor statistics (about 5 hrs. to compute). Though the avg. number of variants per donor, relative to the consensus, is about 87,000 for chromosome 1, there is a wide range (lowest = 39,766 and highest = 170,050). Also encoded data by submitting the donor ID value.
- Got results for chromosome 1. Huffman code has slight bug. Approximate runtimes per chromosome:
- compiling donor statistics: 3.5 hrs.
- extracting raw genotype data and computing the consensus sequence: 1.5 hrs
- generating runlength and variant frequencies: 1.5 hrs
- generating a Huffman encoding (all other encodings are very fast): 1.1 hrs.
- Out of memory error in computing runlength frequencies for chromosome 1. Changed the frequency generating code. Works well now.
- All encodings working correctly on the test data. Adding a summary analysis.
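For reference, a Huffman code like the one used here can be built with a priority queue; a minimal sketch (the project code itself is perl, and the symbol frequencies below are made up):

```python
import heapq

def huffman(freqs):
    # Heap entries: (weight, tiebreak, {symbol: code-so-far}). The tiebreak
    # integer keeps the dicts from ever being compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing 0/1 onto their codes.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, i, merged))
        i += 1
    return heap[0][2]

codes = huffman({"A": 45, "C": 13, "G": 12, "T": 30})
# More frequent symbols get shorter codes:
print(len(codes["A"]) <= len(codes["C"]))  # True
```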
- Imported code from OO and my perl library for doing the encodings.
- Added population information for the donor and altered code to create separate databases for each chromosome. Though we are only demonstrating for one chromosome, the code should work for all of them, provided you have enough time (about 5 hours per chromosome, 4 of which are spent writing information to the database) to extract the data, enough disk space (about 11 gigs per chromosome), and enough memory (about 1.5 gigs free) to do the job. Still need to determine if the same reference sequence was used for each population. This should not matter since the consensus sequence is being computed, but should probably get this information; it is not overtly mentioned on the HapMap site. Under the current strategy, each chromosome receives its own encoding strategy, but without regard to any other information, i.e., nothing about the population of the donor is assumed.
- Added generation of the consensus sequence.
- Had problem matching the position data to the donor, while keeping the runtime and memory under control. Created one big grid with a hash indexed with position values that points to a CSV string of all the genotypes for all the donors. Using arrays instead of strings uses all the memory. Creating such a grid for chromosome 1 consumes about 1.5 gigs of memory during runtime. Consensus adjustments are made on the grid in memory and then entries for the donors and their associated variant genotypes are written out to a sqlite database. The resulting sqlite file for chromosome 1 is 11 gigs.
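A minimal sketch of the grid idea (made-up positions and genotypes; the real code is perl and holds one CSV string per position precisely to keep memory down):

```python
from collections import Counter

# Grid keyed by position; each value is a CSV string of genotypes across
# all donors, in donor order (strings instead of per-donor arrays to save
# memory, as noted above).
grid = {}

def add_genotypes(position, genotypes):
    grid[position] = ",".join(genotypes)

add_genotypes(12345, ["AA", "AG", "AA", "AA"])
add_genotypes(67890, ["CT", "CC", "CC", "CC"])

# Consensus at each position = most common genotype; consensus adjustments
# would then be made on this grid before writing donors out to sqlite.
consensus = {pos: Counter(csv.split(",")).most_common(1)[0][0]
             for pos, csv in grid.items()}
print(consensus[12345])  # AA
```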
- Database was a bit slow. Optimized by adding indexes on the tables, caching the variant ID values, and using transactions. Much quicker. Added SQL to generate the reference haplotype for each position and generate a fasta file of consensus variants. Worked well on test data.
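The optimizations amount to roughly the following pattern, sketched here with Python's sqlite3 on a hypothetical two-table schema (the real schema and table names are not reproduced in these notes):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE variant (id INTEGER PRIMARY KEY, symbol TEXT UNIQUE);
    CREATE TABLE donor_variant (donor_id INTEGER, position INTEGER, variant_id INTEGER);
    CREATE INDEX idx_dv_donor ON donor_variant (donor_id);  -- index the lookups
""")

variant_ids = {}  # cache variant ID values instead of a SELECT per row

def variant_id(symbol):
    if symbol not in variant_ids:
        cur = con.execute("INSERT INTO variant (symbol) VALUES (?)", (symbol,))
        variant_ids[symbol] = cur.lastrowid
    return variant_ids[symbol]

rows = [(1, 100, "AG"), (1, 250, "CT"), (2, 100, "AG")]
with con:  # one transaction for the whole batch instead of one per INSERT
    con.executemany(
        "INSERT INTO donor_variant VALUES (?, ?, ?)",
        [(d, p, variant_id(s)) for d, p, s in rows])

print(con.execute("SELECT COUNT(*) FROM donor_variant").fetchone()[0])  # 3
```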
- Creating the combined fasta files for each chromosome is working well on the test data, but uses too much memory for the full data set. Will use a simple sqlite database to store the data for each chromosome, and use that to compute the consensus haplotypes and remove variants that match the consensus. Schema:
- The HapMap SNPs are scored using an arbitrary reference sequence, analogous to the use of the rCRS for mitochondrial variation. The reference sequence used does not seem to be available for download, but regardless, a consensus list of SNP variants must be generated to optimize the compression. These consensus lists will have to be generated for each chromosome since encodings are being generated for each chromosome.
HapMap data also reports the SNP for both copies of DNA. All that is known is whether the donor is homozygous or heterozygous at a particular position, so it is impossible to reconstruct the exact sequence of either one of the copies. Therefore, the position-variant entries in the frequency files will have two symbols for the variant (representing both copies of DNA). Donors whose allelic makeup matches the consensus at a particular position will not have a position-variant entry.
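A sketch of how those diploid position-variant entries could be produced (hypothetical function name and made-up values):

```python
# Each entry keeps both genotype symbols (both copies of DNA); a donor whose
# genotype matches the consensus at a position produces no entry.
def position_variants(donor_genotypes, consensus):
    # donor_genotypes / consensus: dicts of position -> two-symbol genotype
    return [(pos, gt) for pos, gt in sorted(donor_genotypes.items())
            if gt != consensus[pos]]

consensus = {100: "AA", 250: "CC", 300: "GG"}
donor = {100: "AG", 250: "CC", 300: "GT"}
print(position_variants(donor, consensus))  # [(100, 'AG'), (300, 'GT')]
```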
- The Compression folder that bundles all of the files for the project will have these sub-directories and files:
- . . .
- . . .
- . . .
- . . .
- . . .
- . . .
- . . .
- . . .
- . . .
The sqlite directory holds databases needed for the HapMap data, organized by chromosome. Other datasets are presumed to be small enough that they can be manipulated in memory, so they do not require writing values to disk. Each of the chromosome databases has the following schema:
The data directory will contain the raw data files. All of the mtDNA data can be stored in two fasta files (one for consensus data and one for rCRS data). These files contain the variation list for each donor in position-variant format according to the specific reference sequence. HapMap genotype files are in a completely different file format, organized in groups of files associated with a particular chromosome.
- For each of the chromosomes, there are ten raw data files, each representing a particular population (denoted by a three-letter symbol in the filename - ASW, CEU, CHB, etc.). These data files are processed to extract the genotypes associated with each donor. About 100 donors are recorded in each data file. A large grid is created in memory, which is used to compute the consensus sequence files and to adjust the genotype data of each donor to be relative to this consensus. Due to the large volume of data, the aggregate consensus donor variants for all populations associated with a chromosome are stored in sqlite databases.
- Lists of genotypes for each individual are then extracted from the sqlite database and used to compute the runlength and variant frequencies for each chromosome. Analogous information is represented in the count directory - each time the frequencies for a set of runlengths or variants are computed, the corresponding set of counts is also stored.
Encoding files provide a mapping between either runlengths and symbols or variants and symbols. They are named according to runlength or variant, algorithm used, and by chromosome for HapMap data.
Predictions of compression efficiency are stored in the analysis directory. These predictions will combine the frequency data with symbol size (taken from the appropriate encoding file) to calculate the expected bit requirements for representing a sequence. Other calculations? Unlike the other files, these will attempt to integrate and summarize, so they will only be divided according to hapmap, mtdna_rcrs, and mtdna_consensus. Analysis of the donor data itself is also in this directory. The donor analysis files include such things as the total number of donors, average number of variants, and the expected space requirements in ASCII for donor files using the position-variant notation in a fasta formatted file.
The reference directory will hold lists of position-variant files used as references. Each of these files will contain position-variant entries for all variable positions (non-varying positions are always assumed to be the same as the reference and ignored). The symbol in the position-variant entries will match that of the reference sequence represented by the file.
- There are three data sets from HapMap that look potentially useful: allele frequencies, genotype frequencies, and genotypes. The allele and genotype frequency data are very close to what is needed; unfortunately they are not sufficient for computing runlength frequencies. It seems necessary to parse out the individual list of variants for each person reported in each file of the genotype data. This will put the data in the standard 'position-variant' format, so that runlength frequencies can be computed across all of the individuals.
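Assuming a runlength is the distance between successive variant positions (with counts starting at 1, as noted elsewhere in these notes), the frequency computation across individuals can be sketched as follows (made-up positions):

```python
from collections import Counter

# Runlengths for one individual: gaps between successive variant positions.
def runlengths(positions):
    prev = 0
    for pos in sorted(positions):
        yield pos - prev
        prev = pos

# Aggregate runlength frequencies across all individuals.
donors = [[5, 12, 40], [5, 30, 40]]
freqs = Counter()
for positions in donors:
    freqs.update(runlengths(positions))

print(freqs[5])  # 2 - both donors have their first variant at position 5
```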
- Have a simple interface to a framework of programs that will be needed and directories for the various data files. This will be bundled and made available for download.
- Each chromosome of the HapMap data is treated as an individual data set with its own set of DNA positions. To create runlength frequencies it is necessary to either stitch the chromosomes together and establish one comprehensive set of runlengths, or treat each chromosome as its own separate data set and create separate runlength encodings for each. Pierre thought either way acceptable. Treating each chromosome individually seems a bit easier, though it assumes that data is submitted for compression in the context of a particular chromosome (perhaps not really a limitation since that seems to be the current convention).
- Emailed Pierre and Doug the plan to re-implement the software as a standalone perl program and to do the analysis for the HapMap data set. Suggested to Pierre that we ask for an extension, but he'd like to see if we can finish in a couple of weeks and, if needed, ask for an extension then.
- Paper rejected by Bioinformatics; it will not be reconsidered without major revisions. Spoke with Pierre:
- Software needs to be generally available. Perhaps it would work on a standard format file, such as a list of position and change.
- Look at compressing the hapmap data which follows the format above.
- Look again at the paper "Human Genome as an Email Attachment"
- Uploaded new sequence_stats file to the supplementary data. Removed the variant_stats file. Statistics on the variants is contained within the pages of the spreadsheets. Added link for downloading OpenOffice.
- Posted supplementary data here
- OpenOffice data files moved to this topic.
- Finished Golomb implementation in OpenOffice for both rCRS and consensus data sets. Sent summary to Pierre. Need to prepare supplementary data that would allow one to reproduce the results: list of Genbank #s, fasta files, notes about how the algorithms were implemented (runlength counts start at 1, M-values). Also implement Elias-Gamma encoding.
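The Elias-Gamma encoding still to be implemented can be sketched as follows (since runlength counts start at 1, every value to encode is >= 1):

```python
# Elias-Gamma: write floor(log2(n)) zeros, then the binary form of n.
def elias_gamma(n):
    if n < 1:
        raise ValueError("Elias-Gamma is defined for integers >= 1")
    binary = bin(n)[2:]                      # binary representation of n
    return "0" * (len(binary) - 1) + binary  # zero prefix encodes the length

print(elias_gamma(1))  # 1
print(elias_gamma(9))  # 0001001
```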