Structural Genomics

Preliminary Sesame Genome Assemblies
The current draft assembly of Yuzhi 11 is 293.7 Mb in length, with a GC content of 34.65%. The N50 and N90 sizes of the scaffolds are 22.6 kb and 4.3 kb, respectively. Genome size was estimated to be 354 Mb using the well-established 17-mer method, in line with flow cytometry data, which suggest a genome size of 369 Mb. The 17-mer frequency distribution in 16.77 Gb of trimmed Solexa PE reads was calculated using Jellyfish (v1.1.4). We identified a total of 13,931,658,332 unique k-mers, and 87,207,553 k-mers which had a frequency <10. The frequency of peak k-mers was 39.
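The 17-mer estimate can be checked with a few lines of arithmetic. The sketch below assumes the common form of the method, in which genome size is the number of k-mers divided by the peak k-mer depth. It further assumes that the low-frequency (<10) k-mers are treated as sequencing errors and subtracted first. The text does not state the exact formula used, and the function name is illustrative.

```python
def estimate_genome_size(total_kmers: int, error_kmers: int, peak_depth: int) -> float:
    """Estimate genome size in bp as (k-mers - likely-error k-mers) / peak depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return (total_kmers - error_kmers) / peak_depth


# Values reported in the text for the 17-mer analysis
TOTAL_KMERS = 13_931_658_332
LOW_FREQ_KMERS = 87_207_553  # frequency < 10; assumed here to be error k-mers
PEAK_DEPTH = 39

size_bp = estimate_genome_size(TOTAL_KMERS, LOW_FREQ_KMERS, PEAK_DEPTH)
print(f"Estimated genome size: {size_bp / 1e6:.1f} Mb")
```

Under these assumptions the estimate comes to roughly 355 Mb, close to the reported 354 Mb. Without the subtraction it would be about 357 Mb.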

To determine the frequency and complexity of repetitive elements in the draft assembly, we compared the assembly with the Arabidopsis repetitive elements database from the RepeatMasker library (version 20120418) and with the sesame de novo database constructed for the Yuzhi 11 draft assembly (RepeatModeler, version 1.0.5), using RepeatMasker (version open-3.2.9). Thirty-eight percent of the draft assembly was identified as repetitive elements, only ~5.7% of which shared homology with the Arabidopsis database.

We validated the coding region coverage of the draft assembly using two different gene footprint coverage methods. Using the Core Eukaryotic Genes Mapping Approach (CEGMA), 444 (96.9%) of the 458 Core Eukaryotic Genes (CEG) mapped against the draft assembly were identified. An RNA-Seq-based method employing Velvet [80] and OASES allowed us to assemble 3.5 Gb of RNA-Seq reads (NCBI Accession: SRX061117) into 99,589 putative transcripts. Putative transcripts were then translated into 82,549 peptides using ESTScan (Version 2.1). These peptides were aligned against the SWISS-PROT [83] database using BLAST (E-value: 1e-5) to obtain high-confidence peptides. Redundant peptides (such as alternate-splicing transcripts) were filtered according to BLAST scores and the names of the hits. More than 99.5% of the 3,584 peptides obtained could be aligned to the draft assembly using GMAP. The above results indicate that the draft assembly has a high coverage of the coding region.
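The redundancy filtering step can be illustrated with a minimal sketch. It assumes BLAST results in the standard 12-column tabular format, and it assumes one plausible rule: among peptides that hit the same SWISS-PROT entry, keep only the one with the highest bit score. The text says only that filtering used BLAST scores and hit names, so this rule, the function name, and the toy records are illustrative rather than the authors' actual procedure.

```python
import csv
import io
from typing import Dict, Tuple


def filter_hits(tabular_text: str, max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float]]:
    """Map each subject (hit name) to the (query, bit score) of its best-scoring query."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query_id, subject_id = row[0], row[1]
        evalue, bit_score = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # below the confidence threshold
        if subject_id not in best or bit_score > best[subject_id][1]:
            best[subject_id] = (query_id, bit_score)
    return best


# Toy records (made-up identifiers and scores), columns:
# query, subject, %id, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore
rows = [
    ["pep1", "SP_A", "90.0", "100", "10", "0", "1", "100", "1", "100", "1e-50", "200"],
    ["pep2", "SP_A", "85.0", "100", "15", "0", "1", "100", "1", "100", "1e-40", "180"],
    ["pep3", "SP_B", "40.0", "50", "30", "0", "1", "50", "1", "50", "0.01", "30"],
    ["pep4", "SP_C", "70.0", "80", "24", "0", "1", "80", "1", "80", "1e-20", "120"],
]
toy_table = "\n".join("\t".join(r) for r in rows)

kept = filter_hits(toy_table)
for subject, (query, score) in sorted(kept.items()):
    print(subject, query, score)
```

In the toy data, pep2 is dropped as redundant with pep1 (same hit, lower score) and pep3 is dropped for failing the E-value threshold.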

Gene prediction for the draft assembly was based on RNA-Seq evidence: 3.5 Gb of RNA-Seq reads [GenBank: SRX061117] were assembled into 472,257 contigs using InchWorm and mapped to the draft genome using GMAP. The GMAP mapping results were used as a training set for ab initio prediction using AUGUSTUS. As a result, 23,713 gene models were obtained, with a total length of 28 Mb. Average gene length was 1.2 kb and average GC content was 45%. We obtained functional annotations of all genes using InterProScan, which also determines motifs and domains. A total of 10,656 genes were given Gene Ontology (GO) annotations using corresponding InterPro entries and the Pfam database. Visualization of the functional categories of these 10,656 genes was performed using WEGO.
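The summary statistics above can be cross-checked against one another. The short sketch below uses only figures reported in this section; the inputs are rounded as in the text, so the derived values are approximate, and the variable names are illustrative.

```python
# Figures reported in this section (rounded as in the text)
N_GENES = 23_713
TOTAL_GENE_LENGTH_BP = 28_000_000  # "28 Mb"
N_GO_ANNOTATED = 10_656
ASSEMBLY_LENGTH_BP = 293_700_000   # "293.7 Mb"

mean_gene_length_kb = TOTAL_GENE_LENGTH_BP / N_GENES / 1_000
go_fraction = N_GO_ANNOTATED / N_GENES
genic_fraction = TOTAL_GENE_LENGTH_BP / ASSEMBLY_LENGTH_BP

print(f"Mean gene length: {mean_gene_length_kb:.2f} kb")
print(f"Genes with GO annotation: {go_fraction:.1%}")
print(f"Assembly covered by gene models: {genic_fraction:.1%}")
```

With these inputs the mean gene length comes to about 1.18 kb, consistent with the reported 1.2 kb. Roughly 45% of gene models carry a GO annotation, and gene models span a little under 10% of the draft assembly.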