Bioinformatics
K-mers for genomic analyses
A k-mer is simply a string of k characters; the k-mers of a sequence are all of its substrings of length k. For example, take this string:
AGCTTGACGTACT
With k = 3, reading from left to right, we get the following list of 3-mers:
AGC, GCT, CTT, TTG, TGA, GAC, ACG, CGT, GTA, TAC, ACT
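A minimal Python sketch of this sliding-window extraction. The function name `kmers` is made up for illustration and is not from any particular library:

```python
def kmers(sequence, k):
    """Return all overlapping k-mers of `sequence`, from left to right."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n has n - k + 1 k-mers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("AGCTTGACGTACT", 3))
# ['AGC', 'GCT', 'CTT', 'TTG', 'TGA', 'GAC', 'ACG', 'CGT', 'GTA', 'TAC', 'ACT']
```

The 13-character string yields 13 - 3 + 1 = 11 k-mers, matching the list above.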
In bioinformatics, estimating k-mer abundance histograms, or simply counting the number of unique k-mers and the number of singletons (k-mers that occur exactly once), is useful in many genome sequence analysis applications. These include predicting genome sizes, pre-processing data for de Bruijn graph assembly methods (for example, tuning the runtime parameters of analysis tools), repeat detection, sequencing coverage estimation, and measuring sequencing error rates. Several methods for cardinality estimation in sequencing data have been developed in recent years.
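To make these quantities concrete, here is a small sketch that counts them exactly with Python's `collections.Counter`. Note that this is exact counting on a toy string, not one of the estimation methods mentioned above, which are built for data sets too large to count this way. The helper names `kmer_counts` and `abundance_histogram` are made up for illustration.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count how many times each k-mer occurs in `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def abundance_histogram(counts):
    """Map each abundance to the number of distinct k-mers with that abundance."""
    return Counter(counts.values())


counts = kmer_counts("AGCTTGACGTACT", 2)
unique = len(counts)                                   # distinct k-mers
singletons = sum(1 for c in counts.values() if c == 1)  # k-mers seen exactly once
histogram = abundance_histogram(counts)

print(unique)           # 10
print(singletons)       # 8
print(dict(histogram))  # {1: 8, 2: 2}
```

With k = 2 the example string has 12 k-mers in total; CT and AC each occur twice, and the other eight occur once.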
An example of using k-mers for genome comparison and analysis:
1.0
1.0
0.3333333333333333
0.0
0.5
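The code that produced the values above is not included here. As an illustration only, one common way to get a similarity score between 0 and 1 from k-mers is the Jaccard similarity of the two sequences' k-mer sets. The sketch below assumes that approach; the function names and the input strings are made up, and it is not claimed to reproduce the numbers above.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity(seq_a, seq_b, k):
    """Jaccard similarity |A & B| / |A | B| of the two k-mer sets."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = a | b
    if not union:
        # Assumption: two sequences with no k-mers at all score 0.0.
        return 0.0
    return len(a & b) / len(union)


print(jaccard_similarity("AGCT", "AGCT", 2))  # 1.0 (identical k-mer sets)
print(jaccard_similarity("AGCT", "AGCA", 2))  # 0.5 (2 shared out of 4 distinct)
print(jaccard_similarity("AAAA", "CCCC", 2))  # 0.0 (no shared k-mers)
```

A score of 1.0 means the two sequences have exactly the same k-mers, and 0.0 means they share none. Because sets ignore how often a k-mer occurs, two different sequences can still score 1.0.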