In 2018, Lewin et al. proposed the ambitious goal of sequencing a reference genome for every eukaryotic species on Earth within 10 years, an effort called the moonshot of biology. This proposal led to the establishment of the Earth BioGenome Project (EBP) and gained considerable momentum in the following years. It also led to the establishment of several genome projects across the world and in Europe, to some of which our group has contributed, such as BGE, ERGA and InvertOmics. In 2022, the start of the clock was officially set to 2020, with progress planned in three phases. Specifically:
“Phase I: An annotated reference genome for one representative of each taxonomic family of eukaryotes (~9,400 species) in 3 y.
Phase II: Reference genomes for one representative of each genus (~180,000 species) in years 4 to 7.
Phase III: Reference genomes for remaining ~1.65 million known eukaryotic species in the final 3 y of the project.” (Lewin et al, 2022)
So this year we are at half-time of the EBP or, if we take 2022 as the true starting point, at the end of Phase I. Hence, what is the state of the art, and where are the challenges?

In a recent update article, Blaxter et al. stated: “By the end of 2024, EBP-affiliated projects had publicly released 2,000 high-quality genome assemblies, representing more than 500 eukaryotic families. In this article, we present a revised set of goals for Phases I and II of the EBP. For Phase II, we propose generating reference genomes for 150,000 species over 4 years, including representative genomes for at least 50% of all accepted genera and for additional species of biological and economic importance.” Hence, the original goals were not met, and for some taxa we are even far off the goals. For example, as of 03.12.2025, chromosome-level genomes are missing in NCBI and on GoaT for the following lophotrochozoan phyla: Phoronida, Dicyemida, Orthonectida, Gastrotricha, Gnathostomulida, Micrognathozoa, Entoprocta and Cycliophora (see the figure above showing the presence of genomes within Lophotrochozoa).
Howard et al. reviewed the progress and challenges of the DToL and showed that a recurring problem across all eukaryotic kingdoms was the amount of tissue available relative to genome size. Progress has so far mostly been accomplished by sequencing relatively easy-to-handle species, such as larger individuals of vertebrates, plants and arthropods. In a recent news article, AI was presented as the tool to solve all the problems associated with the lack of progress in the genome projects. This is perhaps not surprising nowadays, but is AI really the solution to these problems?

In our group, we predominantly target such challenging taxa ourselves, because the larger consortia do not prioritize them as of now in order to accomplish the goals agreed upon with their funders. Despite some progress that we achieved by applying protocols that amplify whole genomes, we also experienced some strong setbacks. In the meantime, we have tried to obtain high-quality (but not chromosome-level) genomes for 33 small-sized species from 11 phyla, with varying success across the phyla (see figure above), whereas we have had a 100% success rate so far for the few large-bodied species we tried.
Given this larger dataset, we have now dug deeper into the parameters that may determine the success of genome sequencing in such taxa. To cut a long story short, two major factors emerged. The first was contamination. A high degree of contamination correlated with a low BUSCO score and hence low recovery of the target genome. This is most likely not because the samples contained large amounts of contaminating bacteria and the like. Each sample consisted of a single specimen that was carefully cleaned and contained an absolute minimum of surrounding water (less than 1 µl). Hence, the host tissue should by far outweigh the contamination. However, prokaryotic (bacterial) DNA is naked, meaning that it is not covered by proteins like eukaryotic DNA. Hence, it is easier to amplify when the DNA extraction from the host was not good enough. This then led relatively quickly to an amplification bias towards the contamination. The second factor was a large genome relative to the small amount of tissue. This situation resulted in low contamination, but only medium-range BUSCO scores and very fragmented genome assemblies.
We are now working on both factors by improving DNA extraction methods to obtain cleaner (and especially naked) DNA from the eukaryotic target tissue and by sequencing the genomes more deeply where necessary. First results show promising progress, but optimizing the DNA extraction can still be tricky and may require adjustments for different phyla or animal groups. Also, sequencing deeper can mean substantially deeper than the usually advised 20-30x for PacBio HiFi sequencing. We are now exploring sequencing as deep as 60x. Hence, while AI can definitely be of great assistance in all steps after sequencing, especially in closing the annotation gap, it is not the primary solution to the first challenges, which require improved laboratory protocols. AI can assist in experimental design here, but in the end the work needs to be done in the laboratory by highly trained scientists with the proper skills for this work.
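To give a feeling for what deeper sequencing means in practice, the relationship between genome size, coverage and required sequencing yield can be sketched as follows. This is a minimal illustration and not part of our actual workflow: the function name `required_yield_gb` and the example genome sizes are hypothetical, and the only assumption is the standard definition of coverage as total sequenced bases divided by genome size.

```python
def required_yield_gb(genome_size_mb: float, target_coverage: float) -> float:
    """Return the sequencing yield (in gigabases) needed for a target coverage.

    Uses coverage = total sequenced bases / genome size, so
    yield = genome size * coverage. Input is in megabases, output in gigabases.
    """
    if genome_size_mb <= 0 or target_coverage <= 0:
        raise ValueError("genome size and coverage must be positive")
    return genome_size_mb * target_coverage / 1000.0


# Hypothetical genome sizes in megabases, chosen for illustration only.
for size_mb in (200, 1000, 3000):
    # 30x is the upper end of the usually advised range, 60x the depth being explored.
    for coverage in (30, 60):
        yield_gb = required_yield_gb(size_mb, coverage)
        print(f"{size_mb} Mb genome at {coverage}x: {yield_gb:.0f} Gb of data")
```

Doubling the coverage from 30x to 60x doubles the required yield, which weighs most heavily on large genomes, exactly where the amount of tissue is already limiting.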