Get started with Bioinformatics

I recently joined a project on analyzing and visualizing big data from hospitals and healthcare systems. There is a lot to learn in bioinformatics, so in this post we will go over some fundamental knowledge in the field. Let’s dive in.

Photo by National Cancer Institute on Unsplash

In the simplest terms, bioinformatics is how we use computers to analyze biological data. It is a mixture of 3 things: Biology, Statistics, and Data Science. The tasks involved in bioinformatics are usually Data Analysis, Software Development, and Modeling.

Bioinformatics has become an important part of many areas of biology. In experimental molecular biology, bioinformatics techniques such as image and signal processing allow extraction of useful results from large amounts of raw data. In the field of genetics, it aids in sequencing and annotating genomes and their observed mutations. It plays a role in the text mining of biological literature and the development of biological and gene ontologies to organize and query biological data. It also plays a role in the analysis of gene and protein expression and regulation.

Bioinformatics tools aid in comparing, analyzing, and interpreting genetic and genomic data and, more generally, in understanding the evolutionary aspects of molecular biology. At a more integrative level, it helps analyze and catalogue the biological pathways and networks that are an important part of systems biology. In structural biology, it aids in the simulation and modeling of DNA, RNA, and proteins, as well as biomolecular interactions.

As an emerging discipline, bioinformatics combines mathematics, information science, and biology to help answer biological questions. The word ‘bioinformatics’ was first used in 1968 and its definition was first given in 1978. Bioinformatics has also been referred to as ‘computational biology’. Strictly speaking, however, computational biology deals mainly with the modeling of biological systems. The main components of bioinformatics are (1) the development of software tools and algorithms and (2) the analysis and interpretation of biological data using a variety of software tools and particular algorithms.

General Knowledge and Terms

1. On-premises vs cloud-based database: an on-premises database runs on a group of servers that you privately own and control. A cloud-based database relies instead on traditional cloud computing (as opposed to hybrid or private cloud computing models), which involves leasing data center resources from a third-party service provider.

2. A data dictionary: a centralized repository of metadata. Metadata is data about data.

A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a “centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format”. Oracle defines it as a collection of tables with metadata. The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

  • A document describing a database or collection of databases
  • An integral component of a DBMS that is required to determine its structure
  • A piece of middleware that extends or supplants the native data dictionary of a DBMS
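To make the second meaning a bit more concrete, here is a minimal sketch that reads column metadata back out of a database. It assumes SQLite (through Python's built-in `sqlite3` module) purely for convenience; the `patient` table, its columns, and the `describe_table` helper are made up for illustration and do not come from any real project.

```python
import sqlite3


def describe_table(conn, table):
    """Return one dict per column, read from SQLite's own metadata."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
    # for every column. The table name is interpolated directly, so only
    # pass trusted names in a sketch like this.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {
            "column": name,
            "type": col_type,
            "required": bool(notnull),
            "primary_key": bool(pk),
        }
        for _cid, name, col_type, notnull, _default, pk in rows
    ]


# Hypothetical table, created in memory so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient ("
    " patient_id INTEGER PRIMARY KEY,"
    " birth_year INTEGER NOT NULL,"
    " diagnosis TEXT)"
)

data_dictionary = describe_table(conn, "patient")
for entry in data_dictionary:
    print(entry)
```

Each printed entry is a small piece of "data about data": the column name, its declared type, and whether it is required or part of the primary key. A real data dictionary would usually also record meaning, origin, and usage, which the database itself cannot tell you.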

3. A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). A data lake can be established “on premises” (within an organization’s data centers) or “in the cloud” (using cloud services from vendors such as Amazon, Google and Microsoft).

From All Things Distributed

4. A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or provides little value.

5. Blob (binary large object): a collection of binary data stored as a single entity in a database management system. Blobs are typically images, audio, or other multimedia objects, though sometimes binary executable code is stored as a blob. Database support for blobs is not universal.

6. Ontology-based concepts:

  • Ontology: originally a philosophy term for the study of being, existence, and becoming. In information science, it refers to a formal description of concepts and the relationships between them.
  • Conceptual model ontology (CMO): a meta-model for representing conceptual models and their inter-relationships to logical models and vocabularies. Core to the CMO are 3 classes: type, quality, and relations.

7. SNV: single nucleotide variant, a change at a single position in a DNA sequence.

8. Genome: the genetic material of an organism. It consists of DNA. The genome includes both the genes and the noncoding DNA, as well as mitochondrial DNA and chloroplast DNA. The study of the genome is called genomics.

9. Chromosome: a DNA (deoxyribonucleic acid) molecule with part or all of the genetic material (genome) of an organism. Most eukaryotic chromosomes include packaging proteins which, aided by chaperone proteins, bind to and condense the DNA molecule to prevent it from becoming an unmanageable tangle.

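10. Somatic cell: any body cell other than the reproductive cells. In humans, somatic cells are diploid, which means they contain 2 copies of each chromosome.

11. Gametic cell (gamete): a reproductive cell such as a sperm or an egg. Gametes are haploid, which means they contain 1 copy of each chromosome.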

12. Phenotype: the observable physical properties of an organism, including its appearance, development, and behavior. An organism’s phenotype is determined by its genotype, which is the set of genes the organism carries, as well as by environmental influences upon these genes.
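To make the SNV entry (item 7) a little more tangible, here is a small sketch that compares a sample sequence against a reference and lists the positions where a single base differs. The sequences and the `find_snvs` helper are made up for illustration, and the sketch assumes the two sequences are already aligned and the same length. Real variant calling is far more involved than this.

```python
def find_snvs(reference, sample):
    """Return (position, ref_base, alt_base) for each single-base mismatch.

    Positions are 1-based. Both sequences are assumed to be aligned
    already, so they must have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length (already aligned)")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt
    ]


# Toy sequences, invented for this example.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

snvs = find_snvs(reference, sample)
for pos, ref, alt in snvs:
    print(f"position {pos}: {ref} -> {alt}")
```

With these toy sequences, the sample differs from the reference at positions 5 (A to T) and 10 (C to A), so two variants are reported.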

Bioinformatics Tools

  1. Bedtools : a powerful toolset for genome arithmetic
  2. ELK stack : “ELK” is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a “stash” like Elasticsearch. Kibana lets users visualize data with charts and graphs in Elasticsearch.
  3. Python programming language (because there are tons of libraries for data science and machine learning in Python)
  4. R language (for data analysis)
  5. MongoDB : a popular NoSQL database for handling big data
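"Genome arithmetic" in the Bedtools entry means doing set-like operations on genomic intervals, for example finding where two sets of regions overlap. The sketch below shows that idea in plain Python. It does not call Bedtools itself; it is only similar in spirit to what `bedtools intersect` does. The `intersect` helper and the interval data are made up for illustration, and coordinates are assumed to be 0-based and half-open, as in the BED format.

```python
def intersect(a, b):
    """Return the overlapping portions of intervals in a and b.

    Each interval is a (chrom, start, end) tuple with 0-based,
    half-open coordinates, as in the BED format.
    """
    overlaps = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue  # intervals on different chromosomes never overlap
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:  # non-empty overlap
                overlaps.append((chrom_a, start, end))
    return overlaps


# Toy intervals, invented for this example.
genes = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
peaks = [("chr1", 150, 250), ("chr1", 580, 590), ("chr2", 300, 400)]

shared = intersect(genes, peaks)
for chrom, start, end in shared:
    print(chrom, start, end)
```

This nested loop compares every pair of intervals, which is fine for a handful of regions but too slow for real genome-scale files. Dedicated tools use sorted input and smarter data structures for that.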


Learning bioinformatics involves quite a lot of work. You need to know about Biology, Statistics, and Data Science at the same time. Keep in mind that nothing great gets done overnight, so take your time. Just learn one thing at a time and you’ll be good.

This is it.

Thanks for reading my post.

Peace as always.~~~

A passionate automation engineer who strongly believes in “A man can do anything he wants if he puts in the work”.
