Bioinformatics
Parse GFF file with
To install gffutils:
pip install gffutils
gffutils lets us create a SQLite database from a GFF file.
import gffutils
gffutils.create_db(filename, database_filename)
We can then open the database and query the data easily.
db = gffutils.FeatureDB(dbfn=database_filename)
For example, let’s say we need to work with a GENCODE GFF3 file that looks like this:
##gff-version 3
#description: evidence-based annotation of the human genome (GRCh38), version 35 (Ensembl 101)
#provider: GENCODE
#contact: gencode-help@ebi.ac.uk
#format: gff3
#date: 2020-06-03
##sequence-region chr1 1 248956422
chr1 HAVANA gene 11869 14409 . + . ID=ENSG00000223972.5;gene_id=ENSG00000223972.5;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;level=2;hgnc_id=HGNC:37102;havana_gene=OTTHUMG00000000961.2
chr1 HAVANA transcript 11869 14409 . + . ID=ENST00000456328.2;Parent=ENSG00000223972.5;gene_id=ENSG00000223972.5;transcript_id=ENST00000456328.2;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;transcript_type=processed_transcript;transcript_name=DDX11L1-202;level=2;transcript_support_level=1;hgnc_id=HGNC:37102;tag=basic;havana_gene=OTTHUMG00000000961.2;havana_transcript=OTTHUMT00000362751.1
Suppose we want to get the “seqid”, “start”, “end”, and “attributes” fields from features of a given type. Sample code for this is below.
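The original snippet is not reproduced in this text, so what follows is only a sketch of one way to do it. It assumes that `db` is the `FeatureDB` opened earlier, that the feature type of interest is `"gene"`, and that the attribute we care about is `gene_name`. The helper names `describe` and `print_features` are made up for illustration.

```python
def describe(feature):
    """Return (gene_name, seqid, start, end) for one gffutils Feature."""
    # gffutils keeps every attribute value as a list, even single values,
    # so we take the first element when the key is present.
    names = feature.attributes.get("gene_name", [])
    gene_name = names[0] if names else None
    return gene_name, feature.seqid, feature.start, feature.end


def print_features(db, featuretype="gene"):
    """Print the feature count, then one field per line for each feature."""
    # Assumption: "gene" is the feature type we want; pass another to change it.
    print(db.count_features_of_type(featuretype))
    for feature in db.features_of_type(featuretype):
        for value in describe(feature):
            print(value)
```

Calling `print_features(db)` prints the number of matching features followed by each feature's fields. The field order here is a guess, so it may not line up exactly with the listing that follows.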
Running this prints the result:
84127
LINC02455
chr12
752579
911452
WNK1
chr12
...
Thanks for reading my post.
~~PEACE~~