Sample collection and DNA sequencing
The C. barosmum sample was collected in Lijiang City, Yunnan Province, China (27°26′00″N, 100°10′43″E). Genomic DNA was extracted from C. barosmum leaves using the cetyltrimethylammonium bromide (CTAB) method. For short-read sequencing, the Agencourt AMPure XP-Medium kit was used to select DNA fragments of 200–400 bp. End repair, 3′ adenylation, adapter ligation, and PCR amplification were then performed on these fragments, and the products were recovered using the AxyPrep Mag PCR cleanup kit. After heat denaturation, the double-stranded PCR products were circularized using a splint oligonucleotide sequence to form single-stranded circular DNA (ssCir DNA), which constituted the final library and was subjected to quality control. Qualified libraries were sequenced on the MGISEQ-2000 platform, generating 33.77 GB of short-read data. To ensure data quality, Fastp11 software was used to filter the raw reads, removing low-quality reads and adapter contamination.
For long-read sequencing, BluePippin (Sage Science, USA) was used for size selection to obtain long DNA fragments within a specific length range. The ends of the DNA fragments were repaired and A-tailed, and adapters from the LSK109 kit were then ligated to construct the library. The library was quantified using a Qubit® 3.0 fluorometer (Invitrogen, USA). Finally, sequencing on the PromethION sequencer generated 67.46 GB of long-read data.
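As a rough sanity check on these yields, the expected fold coverage can be estimated by dividing the sequencing yield by the genome size. The sketch below assumes that the reported "GB" figures denote gigabases and uses the 551.56 Mb estimate reported in the Genome survey section; the function name is illustrative and not part of the authors' pipeline.

```python
def coverage_depth(yield_gb: float, genome_size_mb: float) -> float:
    """Return approximate fold coverage for a yield (Gb) and a genome size (Mb)."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    # 1 Gb = 1000 Mb, so convert the yield to Mb before dividing.
    return yield_gb * 1000.0 / genome_size_mb


GENOME_SIZE_MB = 551.56  # survey estimate reported in this paper

long_read_depth = coverage_depth(67.46, GENOME_SIZE_MB)   # PromethION yield
short_read_depth = coverage_depth(33.77, GENOME_SIZE_MB)  # MGISEQ-2000 yield

print(f"long reads:  ~{long_read_depth:.0f}x")
print(f"short reads: ~{short_read_depth:.0f}x")
```

Under these assumptions, the long-read data correspond to roughly 122-fold coverage and the short-read data to roughly 61-fold coverage.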
Hi-C sequencing
For chromosome conformation capture (Hi-C) sequencing, genomic DNA from C. barosmum samples served as the basis for Hi-C library construction. First, the sample cells were treated with formaldehyde to cross-link DNA to proteins and proteins to one another. After cell lysis and sample quality assessment, Hi-C fragments were prepared by chromatin digestion with a restriction endonuclease, biotin labeling, blunt-end ligation, and DNA purification. After DNA quality was reconfirmed, the qualified Hi-C samples underwent removal of terminal biotin, fragmentation by sonication, end repair, A-tailing, and sequencing adapter ligation to yield the adapter-ligated product. The library product was then selected and amplified by PCR. Library preparation was complete once the Hi-C fragment junctions passed quality control. Finally, the library was sequenced on the MGISEQ-2000 sequencer using a PE150 strategy, generating 53.58 GB of Hi-C data.
Transcriptome sequencing
We sampled the roots, stems, and leaves of C. barosmum separately during the vegetative growth period, set up three replicates, and extracted RNA from these samples. Following RNA extraction, we enriched eukaryotic mRNAs from total RNA using the QiaQuick PCR purification kit. We then fragmented the mRNAs into short segments and used these fragments as templates to synthesize double-stranded cDNA. After purification, end repair, A-tailing, ligation of sequencing adapters, and fragment size selection, we amplified the cDNA fragments by PCR to construct sequencing libraries. Finally, we sequenced the libraries on the MGISEQ-2000 sequencer. In total, the experiment produced 93.13 GB of data, of which 28.37 GB, 31.06 GB, and 33.7 GB were from roots, stems, and leaves, respectively.
Genome survey
Prior to genome assembly, we conducted a k-mer analysis of the short-read (second-generation) sequencing data of C. barosmum. We used the KMC12 tool for k-mer counting and statistical analysis. With the Arabidopsis thaliana genome as a reference, we simulated short-read data matching the sequencing depth of C. barosmum. We then estimated the genome size and heterozygosity using the SciPy package in Python. After fitting the k-mer curve across combinations of heterozygosity gradients, we estimated the genome size of C. barosmum to be 551.56 Mb, with a heterozygosity of 1.10% (Fig. 1a).
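The basic idea behind k-mer-based genome size estimation can be illustrated with a minimal sketch. It is not the curve-fitting procedure described above: it uses a simpler peak-based estimator (total k-mers divided by the depth of the main peak), and it assumes that the main peak is the tallest one beyond the low-depth error k-mers. The histogram is synthetic, and the function name and the `min_depth` cutoff are illustrative choices.

```python
import numpy as np
from scipy.signal import find_peaks


def estimate_genome_size(depths, counts, min_depth=5):
    """Peak-based estimate: genome size = total k-mers / main peak depth.

    depths: k-mer depths (1, 2, 3, ...)
    counts: number of distinct k-mers observed at each depth
    min_depth: depths below this are treated as sequencing errors (assumption)
    """
    depths = np.asarray(depths)
    counts = np.asarray(counts, dtype=float)
    keep = depths >= min_depth
    depths, counts = depths[keep], counts[keep]

    peaks, _ = find_peaks(counts)
    if peaks.size == 0:
        raise ValueError("no peak found in k-mer histogram")
    main = peaks[np.argmax(counts[peaks])]  # tallest peak past the error region
    peak_depth = int(depths[main])

    total_kmers = float(np.sum(depths * counts))
    return peak_depth, total_kmers / peak_depth


# Synthetic histogram: about 1,000,000 genomic k-mers centred on depth 60,
# plus an exponentially decaying error component at low depth.
depths = np.arange(1, 201)
signal = 1_000_000 * np.exp(-0.5 * ((depths - 60) / 8.0) ** 2) / (8.0 * np.sqrt(2 * np.pi))
errors = 5_000_000 * np.exp(-depths / 1.0)
peak_depth, genome_size = estimate_genome_size(depths, signal + errors)
print(peak_depth, round(genome_size))
```

For a heterozygous genome such as this one, the histogram typically shows a second peak at about half the main depth, which is why a model fitted across heterozygosity levels is preferable to this single-peak estimate.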
Genome assembly
We used NextDenovo v2.4.0 (https://github.com/Nextomics/NextDenovo) to assemble the genome from the raw reads. First, the raw reads were error-corrected using the NextCorrect module of NextDenovo. The corrected reads were then assembled using the NextGraph module (default parameters) to generate a preliminary assembly with a genome size of 517.52 Mb and an N50 of 10.93 Mb. This assembly was then polished over multiple iterations using NextPolish13 software: three rounds of correction with Oxford Nanopore Technologies (ONT) long-read (third-generation) data, followed by four rounds of correction with short-read (second-generation) data, to further improve the single-base accuracy of the genome. To further improve assembly accuracy and haplotype resolution, we used the Purge-Dups (version 1.2.5)14 tool to remove redundant sequences caused by heterozygous or repetitive sequences. These steps yielded a draft genome of 521.93 Mb with an N50 of 11.02 Mb. We then validated the Hi-C reads using Hicpro15 and ordered and anchored the contigs into pseudo-chromosomes with 3D-DNA316. Using the Juicebox17 assembly tool, we conducted manual verification and fine-tuning, including clustering optimization, order adjustment, and orientation correction. The final Hi-C-assisted assembly was 518.59 Mb, with a scaffold N50 of 21.12 Mb and a contig N50 of 10.67 Mb, comprising 24 chromosomes (Table 1). We visualized the gene density, GC content, Gypsy density, and Copia density of these 24 chromosomes using TBtools18 (Fig. 2). Heatmaps of the genome-wide Hi-C data for the C. barosmum chromosomes were then plotted using hicexplorer19 (Fig. 1b).
Lastly, we evaluated the genome quality using Benchmarking Universal Single-Copy Orthologs (BUSCO v5.4.5)20. The results showed that 98.95% of the conserved orthologs were complete (33.52% single-copy and 65.43% duplicated), while only 0.31% were fragmented and 0.74% were missing (Table 2).
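The contig and scaffold N50 values reported for the assembly follow the standard definition: the length L such that sequences of length L or longer together cover at least half of the total assembly. A minimal sketch of that calculation is given below; the input lengths are toy values, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total length 54
print(n50(toy_contigs))  # 10 + 9 + 8 = 27 reaches half of 54, so N50 is 8
```

Because N50 depends only on the length distribution, joining contigs into chromosome-scale scaffolds raises the scaffold N50 without changing the contig N50 much, which matches the pattern of the values reported above.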
Repeat sequence annotation
We analyzed the transposable elements (TEs) in the assembled C. barosmum genome. We first used Extensive de-novo TE Annotator (EDTA) v2.1.23521 to identify and annotate TEs in the genome. To further classify these elements, we used three specialized tools: TEsorter (v1.3), DeepTE22, and LTR_FINDER23. By integrating the results from these tools, we determined that 268,702,321 bp of the genome consists of repetitive elements, constituting 51.81% of the total genome length. Notably, long terminal repeat (LTR) elements represented a substantial proportion of all repetitive elements, accounting for 34.36% (Table 3).
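When annotations from several tools are combined, overlapping intervals must be counted only once for the total repeat length to be meaningful. The sketch below shows one common way to do this, by merging sorted intervals; it is an illustration, not the authors' integration procedure. The coordinates are toy values, and the genome length used in the final check is the rounded 518.59 Mb figure reported for the assembly, so the exact base count is an assumption.

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals, overlaps counted once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Toy example: two overlapping annotations and one separate annotation.
print(merged_length([(0, 10), (5, 15), (20, 30)]))  # 15 + 10 = 25

# Consistency check on the reported figures (genome length is rounded).
repeat_bp = 268_702_321
genome_bp = 518_590_000
repeat_pct = round(100 * repeat_bp / genome_bp, 2)
print(repeat_pct)
```

With the rounded genome length, the repeat fraction comes out at 51.81%, in agreement with the reported value.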
Structural annotation of genes
Our gene structure prediction combined three methods: ab initio prediction, homology-based prediction, and transcriptome-based prediction. First, we used PASA to align the transcriptome data to the genome, predicting UTRs and alternative splicing events for the initial gene set. We then combined the PASA predictions with the GeMoMa homology-based predictions and the ab initio predictions. EVidenceModeler (EVM) was used for weighted integration, prioritizing PASA predictions, followed by GeMoMa predictions, and lastly ab initio predictions, to construct the initial gene set. Finally, we used TransposonPSI comparison to screen out genes containing transposable elements or coding errors from the initial set, yielding the final gene set. This process resulted in 41,864 predicted genes with an average gene length of 3,226.14 bp, an average coding sequence (CDS) length of 1,242.39 bp, and an average of 5.73 exons per gene. The average exon length was 216.81 bp, and the average intron length was 419.38 bp.
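Summary statistics such as exons per gene, mean exon length, and mean intron length can be derived directly from exon coordinates. The following sketch assumes one transcript per gene and 1-based inclusive coordinates (as in GFF3); the gene identifiers and coordinates are hypothetical, and this is not the script used to produce the figures above.

```python
from statistics import mean


def gene_structure_stats(genes):
    """Summarize gene structure.

    genes: dict mapping gene ID -> list of (start, end) exon coordinates,
           1-based inclusive, one transcript per gene (assumption).
    """
    exon_counts, exon_lengths, intron_lengths = [], [], []
    for exons in genes.values():
        exons = sorted(exons)
        exon_counts.append(len(exons))
        exon_lengths.extend(end - start + 1 for start, end in exons)
        # An intron is the gap between consecutive exons.
        intron_lengths.extend(
            nxt[0] - prev[1] - 1 for prev, nxt in zip(exons, exons[1:])
        )
    return {
        "genes": len(genes),
        "mean_exons_per_gene": mean(exon_counts),
        "mean_exon_length": mean(exon_lengths),
        "mean_intron_length": mean(intron_lengths) if intron_lengths else 0.0,
    }


# Hypothetical toy annotation: one three-exon gene and one single-exon gene.
toy_genes = {
    "geneA": [(1, 100), (201, 300), (401, 450)],
    "geneB": [(1000, 1199)],
}
stats = gene_structure_stats(toy_genes)
print(stats)
```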
Functional annotation of genes
We functionally annotated the protein sequences using Blastp, comparing them against the Non-Redundant Protein Sequence Database (NR), Swiss-Prot24, KEGG25 (Kyoto Encyclopedia of Genes and Genomes), and KOG26 (Eukaryotic Protein Orthologous Groups) databases. The searches used an e-value threshold of 1e-5 and a maximum of one target sequence. We then compiled the annotation information for these protein sequences from the NR, Swiss-Prot, KEGG, and Gene Ontology (GO)27 databases, and categorized the annotated species information. Integrating the results from these databases provided a comprehensive overview of gene function annotations. In total, 38,471 genes, representing 91.90% of the total, were annotated in at least one functional database (Table 4).
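The filtering logic described here (keep hits at or below the e-value threshold, and keep one target per query) can be sketched for BLAST tabular output. The example assumes the default 12-column tabular format (`-outfmt 6`), in which the e-value and bit score are the 11th and 12th columns; the function name, sequence identifiers, and sample lines are hypothetical, and choosing the best hit by bit score is an illustrative choice.

```python
def best_hits(lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} with one best hit per query."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is weaker than the threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical tabular BLAST lines: gene1 has two hits, gene2 has only a weak hit.
sample = [
    "gene1\tprotA\t98.0\t300\t5\t0\t1\t300\t1\t300\t1e-150\t550",
    "gene1\tprotB\t60.0\t280\t100\t3\t1\t280\t5\t284\t1e-50\t200",
    "gene2\tprotC\t30.0\t90\t60\t2\t1\t90\t10\t99\t1e-3\t40",
]
hits = best_hits(sample)
print(hits)

# Annotation rate from the counts reported in this section.
annotation_pct = round(100 * 38_471 / 41_864, 2)
print(annotation_pct)
```

In this toy input, only gene1 retains a hit, because the single hit for gene2 has an e-value above the 1e-5 threshold.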