Data Crawler

Crawl data from sites in Go for your specific needs.

Photo by Robert Anasch on Unsplash

To follow along, here is what we will need to prepare:

The example site we will interact with is https://www.ncbi.nlm.nih.gov. This site is a platform for bioinformatics. We will crawl some gene data.


Because the data is served as static HTML, we don’t need to launch a browser to execute JavaScript. We only need to send an HTTP request to the endpoint using Go’s native net/http package.

resp, _ := client.Get("https://www.ncbi.nlm.nih.gov/gene/" + geneId)
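The one-liner above discards the error returned by client.Get. A slightly fuller sketch of the same request might look like the following. The helper names geneURL and fetchGenePage, the 30-second timeout, and the status check are assumptions added for illustration; they are not taken from the original code.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"time"
)

// baseURL is the site used in this article.
const baseURL = "https://www.ncbi.nlm.nih.gov"

// geneURL builds the gene page URL for a gene ID (hypothetical helper).
func geneURL(base, geneID string) string {
	return base + "/gene/" + url.PathEscape(geneID)
}

// fetchGenePage downloads the raw HTML of one gene page (hypothetical helper).
func fetchGenePage(client *http.Client, geneID string) ([]byte, error) {
	resp, err := client.Get(geneURL(baseURL, geneID))
	if err != nil {
		return nil, err
	}
	// Always close the body so the connection can be reused.
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// The timeout value is an assumption; tune it for your own use.
	client := &http.Client{Timeout: 30 * time.Second}

	// 7157 is the TP53 gene ID that appears in the sample output of this article.
	body, err := fetchGenePage(client, "7157")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("fetched %d bytes\n", len(body))
}
```

Checking the error and the status code up front means a network failure or an error page is reported before we try to parse anything.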

The goquery library supports finding elements by CSS selector.

doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
	log.Fatal(err)
}
geneOfficialSymbol := doc.Find("#summaryDl > dd.noline").Contents().Text()

If you do not know the selector for an element, you can use Chrome Developer Tools to inspect the element and copy its selector.


Complete code to get the data:

Running this prints the JSON data for each gene.

{"MAFIP":{"ncbi":{"symbol":"MAFIP","symbol_source":"HGNC","id":"727764","gene_name":"MAFF interacting protein (pseudogene)","gene_synonyms":["MIP","pp5644","TEKT4P4"],"biotype":"pseudo","contig":"14 Unlocalized Scaffold","start":53589,"end":115073,"reference_genome":"GRCh38","strand":"","description":"This gene was originally thought to be a protein coding gene. However, the encoded protein sequence is highly similar to the C-terminal sequence of the tektin-4 protein, and the transcript sequences of this gene are highly similar to the TEKT4 pseudogenes, which are found on chromosomes 4, 21 and Y, respectively. Therefore, this gene is thought to be another pseudogene of the TEKT4 gene (GeneID:150483). Multiple alternatively spliced transcript variants have been found for this gene. [provided by RefSeq, Mar 2012]"},"ensemble":{"symbol":"","symbol_source":"","id":"","gene_name":"","gene_synonyms":null,"biotype":"","contig":"","start":0,"end":0,"reference_genome":"","strand":"","description":""},"gene_cards":{"symbol":"","id":"","gene_name":"","gene_synonyms":null,"biotype":"","contig":"","start":0,"end":0,"reference_genome":"","strand":"","description":null}}}
{"TP53":{"ncbi":{"symbol":"TP53","symbol_source":"HGNC","id":"7157","gene_name":"tumor protein p53","gene_synonyms":["P53","BCC7","LFS1","BMFS5","TRP53"],"biotype":"protein coding","contig":"17","start":7668402,"end":7687550,"reference_genome":"GRCh38","strand":"","description":"This gene encodes a tumor suppressor protein containing transcriptional activation, DNA binding, and oligomerization domains. The encoded protein responds to diverse cellular stresses to regulate expression of target genes, thereby inducing cell cycle arrest, apoptosis, senescence, DNA repair, or changes in metabolism. Mutations in this gene are associated with a variety of human cancers, including hereditary cancers such as Li-Fraumeni syndrome. Alternative splicing of this gene and the use of alternate promoters result in multiple transcript variants and isoforms. Additional isoforms have also been shown to result from the use of alternate translation initiation codons from identical transcript variants (PMIDs: 12032546, 20937277). [provided by RefSeq, Dec 2016]"},"ensemble":{"symbol":"","symbol_source":"","id":"","gene_name":"","gene_synonyms":null,"biotype":"","contig":"","start":0,"end":0,"reference_genome":"","strand":"","description":""},"gene_cards":{"symbol":"","id":"","gene_name":"","gene_synonyms":null,"biotype":"","contig":"","start":0,"end":0,"reference_genome":"","strand":"","description":null}}}
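For readers who want to produce output of a similar shape, here is a minimal sketch of Go types whose JSON field names are read off the output above. The type and function names (GeneInfo, GeneRecord, encodeGene) are assumptions. The sketch also reuses one struct for all three sources, so the gene_cards entry will differ slightly from the output above, where it has no symbol_source field and a null description.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// GeneInfo holds the fields collected from one source (hypothetical type).
type GeneInfo struct {
	Symbol          string   `json:"symbol"`
	SymbolSource    string   `json:"symbol_source"`
	ID              string   `json:"id"`
	GeneName        string   `json:"gene_name"`
	GeneSynonyms    []string `json:"gene_synonyms"` // a nil slice encodes as null
	Biotype         string   `json:"biotype"`
	Contig          string   `json:"contig"`
	Start           int      `json:"start"`
	End             int      `json:"end"`
	ReferenceGenome string   `json:"reference_genome"`
	Strand          string   `json:"strand"`
	Description     string   `json:"description"`
}

// GeneRecord groups the per-source data for one gene (hypothetical type).
type GeneRecord struct {
	NCBI      GeneInfo `json:"ncbi"`
	Ensemble  GeneInfo `json:"ensemble"`
	GeneCards GeneInfo `json:"gene_cards"`
}

// encodeGene wraps the record in a map keyed by gene symbol and encodes it.
func encodeGene(symbol string, rec GeneRecord) (string, error) {
	out, err := json.Marshal(map[string]GeneRecord{symbol: rec})
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// Values copied from the TP53 sample output in this article.
	rec := GeneRecord{NCBI: GeneInfo{
		Symbol:          "TP53",
		SymbolSource:    "HGNC",
		ID:              "7157",
		GeneName:        "tumor protein p53",
		Biotype:         "protein coding",
		Contig:          "17",
		Start:           7668402,
		End:             7687550,
		ReferenceGenome: "GRCh38",
	}}
	out, err := encodeGene("TP53", rec)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out)
}
```

Sources that have not been crawled stay at their zero values, which is why empty strings, zeros, and null show up for them in the JSON.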

Hope it helps.

Happy coding guys ~~

Written by

A passionate automation engineer who strongly believes in “A man can do anything he wants if he puts in the work”.
