Characterization of the chloroplast genome sequence of Calophyllum inophyllum, a bioenergy tree species, using Oxford Nanopore Technologies

Calophyllum inophyllum is a forest tree species that offers significant renewable energy benefits. However, the genetic information available for tree improvement studies of this bioenergy tree species is still limited. This study aimed to assemble a draft chloroplast genome using MinION Oxford Nanopore Technologies (ONT) sequencing and to analyze genetic relationships based on molecular markers for C. inophyllum. The study involved several steps: DNA extraction and isolation using the CTAB method, DNA library preparation, DNA sequencing using the MinION, and phylogenetic analysis based on selected DNA markers. The results showed that 115.1 Mb of high-quality sequence data was generated using ONT long-read sequencing, and 114,708 bp was annotated as a partial chloroplast genome. The genetic relationship analysis using two selected DNA barcodes, rbcL and accD, showed that the accD gene is better suited than the rbcL gene for constructing a phylogenetic tree for the genus Calophyllum because it has a high bootstrap value (93%). Furthermore, the accD gene can also be considered a potential marker for further genetic analysis of C. inophyllum.


Introduction
Energy demand increases yearly in line with economic growth, population growth, and industrial development. Final energy consumption in Indonesia from 2000 to 2014 was still dominated by petroleum oil, petro solar oil, kerosene, and aviation fuel [1]. Under the KS scenario, national final energy demand, especially for fossil energy, is projected to reach 433 MTOE in 2050, an average annual growth rate of 3.7% [2]. In addition, Indonesia's oil reserves decreased by 3,530.43 MMSTB during 2015-2019, which indicates a growing scarcity of fossil energy [3]. These data show that fuel consumption tends to increase and fuel is continually needed while resources are limited, so alternative fuels such as biofuels are important.
Biodiesel can be produced as an alternative renewable energy source to achieve national energy security and reduce greenhouse gas emissions. Calophyllum inophyllum, known as tamanu or nyamplung in Indonesia, is a non-edible oilseed-producing tree that is naturally distributed in Southeast Asia, Australia, the South Pacific, southern coastal India, and eastern Africa [4]. C. inophyllum grows along the coast and in lowland forests throughout Indonesia. The seed kernel of C. inophyllum can be used as raw material for biodiesel production because its oil content is relatively high compared with other plants (Calophyllum inophyllum 40-73%, Schleichera oleosa 55-70%, Jatropha curcas 40-70%, and Elaeis guineensis 46-54%) [5]. C. inophyllum seed oil is a potentially environmentally friendly energy resource with a higher selling value than kerosene and castor oil, and it does not compete with food needs [6].
IOP Publishing, doi:10.1088/1755-1315/1315/1/012077
Genome-wide selection (GS) is a new breeding selection strategy that utilizes high-density molecular markers [7]. In GS, the genomic estimated breeding value (GEBV) is obtained by accumulating the effect values of chromosome fragments or of all markers across the whole genome. The breeding value of an individual can be estimated by combining marker and phenotypic information. Early prediction and selection of individuals can then be carried out based on the breeding values to shorten generation intervals, improve selection accuracy, and save costs [8]. For GS to be applied to the genetic improvement of a target species, reference genome sequencing is needed at an early stage. Long-read sequencing with Oxford Nanopore Technologies (ONT) can assemble the chloroplast genome, which contains a pair of long inverted repeats of 10-30 kb that can confound sequencing and assembly efforts when short-read technologies are used [9].
Genome sequencing research is needed to study breeding selection and the establishment of new forests, genetic variation, and gene expression, and to reconstruct the evolutionary development of organisms through phylogenetic analysis. Therefore, this study aimed to assemble a draft chloroplast genome using the MinION sequencer from Oxford Nanopore Technologies (ONT) and to analyze the molecular marker-based genetic relationships of C. inophyllum. The data produced in this study are expected to serve as a reference for compiling the whole chloroplast genome through genome resequencing and to reveal molecular markers or genes that could be selected as DNA barcodes for C. inophyllum.

Plant material collection
Fresh leaves were collected in November 2021 from one mature C. inophyllum tree growing in the Faculty of Forestry and Environment area of the IPB University Dramaga Campus, Bogor, West Java, Indonesia (latitude -6.560452, longitude 106.728187). The leaves were preserved in zip plastic bags containing silica gel and then stored in a freezer at -20°C.

DNA extraction and quality test
Genomic DNA was extracted from silica gel-dried leaves of C. inophyllum using the modified cetyl trimethyl ammonium bromide (CTAB) method [10]. First, 100 mg of silica gel-dried leaves was ground manually into a fine powder with a mortar and pestle in liquid nitrogen. The fine leaf powder was then placed into a 2 mL microtube with 1000 μL of preheated CTAB buffer. The samples were homogenized with a vortex and then incubated at 65°C for 30 min in a water bath. The microtubes containing the samples were inverted every 10 min to mix the contents evenly. After incubation, 500 μL of chloroform:isoamyl alcohol (CIA) 24:1 and 10 μL of phenol were added to each microtube. The mixture was then centrifuged for 10 min at 10,000 rpm.
Centrifugation separated the mixture into a supernatant layer, an organic matter layer, and a chloroform layer, and the uppermost supernatant layer was transferred to a new 1.5 mL microtube using a micropipette. This process was repeated twice, with no phenol added in the second repetition. Cold isopropanol was added to the supernatant at a ratio of 1:1, and 5 M NaCl was added at a ratio of 1:4 to the supernatant. The mixture was stored in a freezer at -20°C overnight and then centrifuged for 10 min at 10,000 rpm until a precipitate formed. The aqueous solution was discarded, leaving only the DNA pellet, and 500 μL of 70% ethanol was added before centrifuging for 5 min at 10,000 rpm. The 70% ethanol was then discarded. This washing step was carried out twice. The DNA pellet was dried in a desiccator for 10 min so that the tube was completely free of ethanol. In the final step, the DNA pellet was preserved in 50 μL of TE buffer in the microtube and stored in a freezer at -20°C.
DNA quality was tested by 1% (w/v) agarose gel electrophoresis at 100 V for 20 min. The DNA concentration was also measured using a Qubit 1.0 Fluorometer.

Data analysis

Sequence raw data analysis.
The output of MinION sequencing was raw Fast5 data, which was converted into FastQ files by base-calling. This conversion translated the ionic signal into nucleotide sequences using Guppy Basecaller v4.2.3+8aca2af8 [11]. The FastQ data were then analyzed using NanoStat v1.2.1 to evaluate the quality and statistics of the reads. Additionally, NanoPlot v1.31.0 was used to construct a distribution plot of read length and quality score [12] [13]. All reads were filtered using NanoFilt v2.7.1 to retain those with a Q score above 7 and a length greater than 500 bp [14]. The filtered reads were then assembled into contigs using the Rebaler program (v0.2.0). The resulting contigs were polished using the Medaka program to obtain contigs with a higher degree of accuracy [15]. The Quast program was used to assess the statistics of the polished contigs with the chloroplast genome as the reference. Afterward, the polished contigs were annotated using the GeSeq tool on the ChloroBox cloud server [16].
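The filtering criterion described above can be illustrated with a short Python sketch. It is not NanoFilt's implementation: the function names are hypothetical, Phred+33 quality encoding is assumed, the mean read quality is computed by averaging per-base error probabilities, and treating the length threshold as inclusive and the quality threshold as exclusive is also an assumption.

```python
import math


def mean_phred(quality_string, offset=33):
    """Mean read quality, averaged over per-base error probabilities (assumes Phred+33)."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual


def filter_reads(records, min_quality=7, min_length=500):
    """Keep reads at least min_length bases long with mean quality above min_quality."""
    return [
        rec for rec in records
        if len(rec[1]) >= min_length and mean_phred(rec[2]) > min_quality
    ]
```

In this sketch the length check runs first, so very short or empty reads are rejected before any quality calculation is attempted.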

Chloroplast marker analysis.
The GenBank annotation produced by GeSeq was visualized using SnapGene v6.2.1 to identify potential molecular markers [17]. Two coding gene sequences (rbcL and accD) were chosen for further analysis. These sequences were subjected to a BLASTn search on the NCBI website, specifically targeting the genus Calophyllum. In total, 50 homologous sequences of the rbcL marker and 12 homologous sequences of the accD marker were obtained and saved in FASTA format for phylogenetic tree construction.

Phylogenetic tree construction.
The MEGA-X program was used to construct the phylogenetic trees [18] [19]. The homologous sequences and the markers were aligned using ClustalW. The phylogenetic trees were generated by applying the Maximum-Likelihood algorithm to the aligned sequences. Bootstrap analysis with 1000 replicates was used to assess the reliability of each phylogenetic tree. The completed phylogenetic tree was displayed in a circular format using iTOL and edited using Inkscape.
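The idea behind bootstrap support can be sketched in Python as follows. This is a generic illustration of nonparametric bootstrapping of alignment columns, not the MEGA-X implementation. The function names are hypothetical, and the tree inference step that would be run on each replicate is deliberately left out.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps a sequence name to its aligned sequence; all sequences
    must have the same length. The replicate has the same dimensions.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    n_cols = lengths.pop()
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def support_percent(times_clade_recovered, n_replicates):
    """Bootstrap support: share of replicate trees that contain a given clade."""
    return 100 * times_clade_recovered / n_replicates
```

With 1000 replicates, a clade recovered in 930 replicate trees would receive a support value of 93%. Each call to `bootstrap_columns` would be followed by a tree search on the resampled alignment, which is the computationally expensive part.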

Genome sequencing and assembly
The initial stage of long-read sequencing analysis was a quality check (QC) of the FastQ file using the Galaxy Cloud software. This was done after base-calling to assess the read length and quality of the reads in the raw data [20]. Based on the distribution (Figure 1), the read lengths for C. inophyllum were generally spread from 0 bp to 10,000 bp. The average quality of the C. inophyllum sequences was in the range of Q8-Q15. The histogram also showed that the longest C. inophyllum read was about 103,794 bp. The highest-quality reads (> Q25) were also observed, although in relatively small numbers. Subsequent analysis requires a minimum read length of 500 bp and a read quality > Q7 [21] [22]. After filtering, over 99.9% of reads met the quality control criteria (30,794 reads), with a read length N50 of 12,503 bp and a total of 115.1 Mb of bases. Hence, the C. inophyllum sequence data produced in this study are suitable for downstream long-read analysis.
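For readers unfamiliar with the read length N50 statistic reported above, the following Python sketch shows one common way to compute it. The function name is hypothetical, and the definition used here (the length at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of all bases) is assumed to match the one used by the QC tools.

```python
def read_length_n50(lengths):
    """Return the length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("at least one read is required")
```

For example, read lengths of 2, 3, 4, 5, and 6 bases total 20 bases; the two longest reads together hold 11 bases, which passes the halfway mark, so the N50 is 5.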

Chloroplast genome annotation
The purpose of genome annotation was to identify functional genes across the genome sequence [23]. The partial chloroplast genome of Calophyllum inophyllum was 114,708 bp (about 114.7 kb) long and was constructed from contig 57 (Figure 2). A total of 93 genes were identified throughout the genome, 4 of which were duplicated in the inverted repeat regions, giving a total of 97 genes. These consisted of 1 rRNA gene, 25 tRNA genes, and 71 protein-coding genes. Genes with potential use as DNA barcodes in cpDNA include rbcL and accD [24]. A gene suitable for DNA barcoding must have low intra-specific and high inter-specific variation [25]. The annotation of C. inophyllum contig 57 showed an rbcL gene 1,428 bp in length with 43% GC and an accD gene 1,488 bp in length with 35% GC. This is consistent with the statement of CBoL [26] that the rbcL gene sequence is around 1,400 bp long.
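The GC percentages reported for the two genes can be reproduced from a sequence with a calculation like the one sketched below. The function name is hypothetical, and ignoring ambiguous bases such as N in both the numerator and the denominator is an assumption; annotation tools may handle them differently.

```python
def gc_percent(sequence):
    """Percentage of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(base in "GC" for base in counted)
    return 100 * gc / len(counted)
```

As a rough check, a GC content of 43% in a 1,428 bp gene corresponds to about 614 G or C bases, bearing in mind that the reported percentage is rounded.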

Phylogenetic analysis
In this study, the maximum likelihood approach was used to construct the phylogenetic tree. This method typically involves a trade-off between accuracy and computing demand, with more accurate tree reconstruction requiring deeper and more time-consuming exploration of tree space [27]. The maximum likelihood technique was applied to the multiple sequence alignment, and the resulting tree was visualized as a cladogram. The statistical support for the tree topology was assessed by bootstrap analysis, with the values computed from 1000 bootstrap replications.
Two molecular markers, the rbcL and accD genes, were selected for constructing the phylogenetic tree of C. inophyllum. The Consortium for the Barcode of Life recommends the rbcL gene as a reliable and efficient means of evaluating the genetic condition of a species thoroughly, precisely, and promptly [28]. The rbcL gene in cpDNA encodes RuBisCo and exhibits a high degree of similarity between species because of its low mutation rate [29]. The length of the rbcL gene is around 1,400 bp [30]. The accD gene is found across plants, including in the shortened chloroplast genomes of parasitic and non-photosynthetic plants. The expression of accD is thought to be crucial throughout the embryo development stage in Arabidopsis and to increase total ACCase levels in the plastid. Acetyl-CoA carboxylase (ACCase) facilitates the conversion of acetyl-CoA into malonyl-CoA and is thought to regulate de novo fatty acid production [31] [32].
A phylogenetic tree was constructed using the rbcL gene sequence from the chloroplast genome of C. inophyllum and 50 additional sequences from members of the genus Calophyllum reported in NCBI (Figure 3). The genetic relationship analysis using the rbcL gene with the maximum likelihood method and 1000 bootstrap replicates showed very weak topological support for C. inophyllum (bootstrap value of 39%). The highest bootstrap value obtained with the rbcL gene for the genus Calophyllum was only 59%. Bootstrap values are typically classified into four categories: very weak (<50%), weak (50-69%), moderate (70-85%), and high (>85%) [33]. The rbcL gene is highly effective for amplifying gene fragments, but it also has significant limitations, such as limited accuracy in distinguishing between closely related species [34]. Consequently, the rbcL gene proved inadequate for differentiating across genera and for resolving variation between species within the genus Calophyllum. Because few chloroplast genome sequencing studies have been carried out on the genus Calophyllum, the homology level of the nucleotide sequences obtained is not optimal. Therefore, another marker was needed that could be used as a reference sequence and was available in the NCBI database. The accD marker was chosen to construct the C. inophyllum phylogenetic tree because it was available in NCBI and had a high proportion of homology.
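The four bootstrap categories cited above can be written as a small Python helper. The function name is hypothetical, and the treatment of non-integer values that fall between the stated ranges (for example 69.5%, assigned to "weak" here) is an assumption, since the categories are given only for whole percentages.

```python
def classify_bootstrap(value):
    """Map a bootstrap percentage to the four support categories cited in the text."""
    if not 0 <= value <= 100:
        raise ValueError("bootstrap value must be between 0 and 100")
    if value < 50:
        return "very weak"
    if value < 70:
        return "weak"
    if value <= 85:
        return "moderate"
    return "high"
```

Applied to the values reported in this study, 39% falls in the very weak category, 59% in the weak category, and the 93% obtained with the accD gene in the high category.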
The accD gene is one of the genes in the chloroplast region and codes for the β-carboxyl transferase subunit of acetyl co-enzyme A carboxylase [35]. This gene plays a vital role in photosynthesis and affects plant leaf growth [36]. As a coding region, the accD gene is more conserved than non-coding regions. Other studies have shown that accD markers have regions that vary greatly in sequence and gene length [37][38].

Conclusion
The genome of Calophyllum inophyllum was successfully sequenced using long-read DNA sequencing on the MinION from Oxford Nanopore Technologies (ONT), yielding a total of 115.1 Mb of bases and a partial chloroplast genome of 114,708 bp. In this study, the rbcL gene is not recommended for phylogenetic tree construction in the genus Calophyllum because of its low bootstrap values. However, the phylogenetic tree based on the accD gene revealed that C. inophyllum in Java was closely related to C. inophyllum in the South Pacific, with a high bootstrap value (93%). Furthermore, the accD gene can be considered a potential marker for further genetic analysis of C. inophyllum.

Figure 1. Histogram of read length distribution data and average read quality for Calophyllum inophyllum sequences.

Figure 4. Phylogenetic tree of Calophyllum inophyllum based on accD gene sequences.

The phylogenetic tree of C. inophyllum based on accD gene sequences showed that the studied C. inophyllum was in the same clade as the other C. inophyllum accessions (Figure 4). The C. inophyllum sample used in this study was native to Java Island. C. inophyllum from Java forms a monophyletic clade with its sister group, C. inophyllum from Tahiti Avera (KT369681.1 and KT369680.1), with a bootstrap value of 93%. C. inophyllum from Java also forms a monophyletic clade with C. inophyllum from Tahiti Patio (KT384377.1). The formation of a monophyletic clade in a cladogram indicates shared genetic material and characters inherited from a common ancestor [39]. Several other studies have shown that Calophyllum is a genus of 190 species that is widely dispersed, primarily along coastlines in the Indo-Malesian region, Micronesia, Melanesia, northern Australia, and Central and South America [40]. The presence of C. inophyllum in both Java and Tahiti (South Pacific) [24] suggests that the distribution of C. inophyllum in Java is likely due to a shared ancestor with similar chemical and genetic characteristics or structures, as indicated by its natural occurrence in both regions.