Bioinformatics

K-mers for genomic analyses

Why we need to know about k-mers.

Photo by Chris Stenger on Unsplash

A k-mer is simply a substring of length k, that is, k consecutive characters taken from a longer sequence. For example, take this string:

AGCTTGACGTACT

With k = 3, sliding a window of three characters along the string one position at a time gives these k-mers:

AGC, GCT, CTT, TTG, TGA, GAC, ACG, CGT, GTA, TAC, ACT (from left to right)
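To make the sliding window concrete, here is a minimal Python sketch that reproduces the list above. The function name `kmers` is just an illustrative choice, not something from a particular library.

```python
def kmers(sequence, k):
    """Return every k-mer of `sequence`, in left-to-right order."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n has n - k + 1 overlapping windows of width k.
    # If the sequence is shorter than k, the range is empty and so is the result.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


three_mers = kmers("AGCTTGACGTACT", 3)
print(",".join(three_mers))
# AGC,GCT,CTT,TTG,TGA,GAC,ACG,CGT,GTA,TAC,ACT
```

A string of 13 characters gives 13 - 3 + 1 = 11 k-mers, which matches the list.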

In bioinformatics, estimating k-mer abundance histograms, or simply counting the number of unique k-mers and the number of singletons (k-mers that occur exactly once), is useful in many genome sequence analysis applications. These include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (for example, tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, and measuring sequencing error rates. Several methods for cardinality estimation in sequencing data have been developed in recent years.
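As a rough illustration of these quantities, the sketch below counts k-mers exactly with Python's `collections.Counter`. This is exact counting on a toy string, not one of the estimation methods used on real sequencing data, and the sequence `ACGACGT` and the helper name `kmer_counts` are made up for the example.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count how many times each k-mer occurs in `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence (hypothetical): its 3-mers are ACG, CGA, GAC, ACG, CGT.
counts = kmer_counts("ACGACGT", 3)

# Number of distinct k-mers.
n_unique = len(counts)

# Singletons are k-mers seen exactly once.
n_singletons = sum(1 for c in counts.values() if c == 1)

# Abundance histogram: how many distinct k-mers occur once, twice, and so on.
histogram = Counter(counts.values())

print(n_unique)         # 4
print(n_singletons)     # 3
print(dict(histogram))  # {2: 1, 1: 3}
```

Exact counting like this keeps every distinct k-mer in memory, which is one reason estimation methods are attractive for large datasets.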

An example of using k-mers for genome comparison and analysis:

1.0
1.0
0.3333333333333333
0.0
0.5
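The code behind these numbers is not shown here. As a minimal sketch, assuming the values are Jaccard similarity and containment scores between sets of k-mers, the following produces the same five numbers. The toy lists `a`, `b`, and `c` are hypothetical inputs picked to be consistent with the output above, not necessarily the original ones.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two k-mer sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(union)


def jaccard_containment(a, b):
    """Fraction of the k-mers in `a` that also appear in `b`."""
    a, b = set(a), set(b)
    if not a:
        return 0.0  # convention chosen for this sketch when `a` is empty
    return len(a & b) / len(a)


# Hypothetical toy k-mer lists (k = 4).
a = ["ATGG", "AACC"]
b = ["ATGG", "CACA"]
c = ["ATGC", "CACA"]

print(jaccard_similarity(a, a))   # 1.0
print(jaccard_containment(a, a))  # 1.0
print(jaccard_similarity(b, a))   # 0.3333333333333333
print(jaccard_similarity(a, c))   # 0.0
print(jaccard_containment(b, a))  # 0.5
```

Similarity is symmetric, while containment is not: it asks how much of the first set is found in the second, which matters when one sequence is much longer than the other.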

Hope it helps~~~

Thanks for reading my post.

PEACE!!!

Reference

https://hub.gke2.mybinder.org/user/dib-lab-sourmash-hbh66r0t/notebooks/doc/kmers-and-minhash.ipynb
