Creating a Catalog in Seconds

Many users have data in an arbitrary format (e.g. an Excel file from a paper) or another source of annotation such as BioMART. BioR lets users integrate such additional sources of information into the system as catalogs very quickly. Unlike most other tools, BioR does not require a specific bioinformatics format: all you need to do is convert the files into JSON, and BioR has many utilities to do that for you!

As an example, let's integrate dbNSFP2.1 into BioR (available from here: https://sites.google.com/site/jpopgen/dbNSFP).

First, genes. We take the header line (the first line in the file), convert parentheses to underscores, then pipe the result into bior_create_config_for_tab_to_tjson to construct the config file we need to build the catalog:

$ head -n 1 dbNSFP2.1_gene | tr "(" "_" | tr ")" "_" | bior_create_config_for_tab_to_tjson > gene.config

$ vim gene.config (to identify the _landmark golden identifier)

$ cat dbNSFP2.1_gene | bior_tab_to_tjson -c gene.config > dbNSFP2.1_gene.tjson

$ bior_create_catalog -c -1 -i dbNSFP2.1_gene.tjson -o dbNSFP2.1_gene
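To see what the header clean-up in the first command does before the header reaches bior_create_config_for_tab_to_tjson, here is a minimal sketch that uses only standard Unix tools. The header line is a made-up example, not the real dbNSFP header, and the single tr call is an equivalent shorthand for the two tr calls shown above.

```shell
# Hypothetical header line containing parentheses (not the real dbNSFP header).
header=$(printf 'Gene_name\tExpression(egenetics)\tTrait_Association(GWAS)')

# tr maps each character of the first set to the character at the same
# position in the second set, so "(" and ")" both become "_".
cleaned=$(printf '%s\n' "$header" | tr '()' '__')
printf '%s\n' "$cleaned"
```

Tabs are left untouched, so the column boundaries are preserved.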

Then Variants:

$ cat dbNSFP2.1_variant* | grep -v "^#" > dbNSFP2.1_variant

$ head -n 1 dbNSFP2.1_variant.chr1 | bior_create_config_for_tab_to_tjson > variant.config

$ vim variant.config (to identify the columns for _landmark, _minBP, _maxBP, _refAllele, and _altAllele)

$ cat dbNSFP2.1_variant | bior_tab_to_tjson -c variant.config > variant.tjson

$ bior_create_catalog -c -1 -i variant.tjson -o dbNSFP2.1_variant
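The concatenation step for the variant files can be tried on toy data first. The sketch below uses two tiny made-up per-chromosome files (file names and contents are assumptions for illustration) and shows that the "#" header lines are dropped from the combined file while the header is still available from a single chromosome file.

```shell
# Work in a scratch directory with two tiny, made-up per-chromosome files.
tmp=$(mktemp -d)
printf '#chr\tpos\tref\talt\n1\t100\tA\tG\n' > "$tmp/demo_variant.chr1"
printf '#chr\tpos\tref\talt\n2\t200\tC\tT\n' > "$tmp/demo_variant.chr2"

# Concatenate every chromosome file and drop the repeated "#" header lines.
# The output name deliberately does not match the demo_variant.chr* glob,
# so re-running the command never feeds the combined file back into itself.
cat "$tmp"/demo_variant.chr* | grep -v '^#' > "$tmp/combined.tsv"

# The header is taken separately from a single chromosome file.
header=$(head -n 1 "$tmp/demo_variant.chr1")
rows=$(wc -l < "$tmp/combined.tsv" | tr -d ' ')
first=$(head -n 1 "$tmp/combined.tsv")
printf 'header: %s\nrows: %s\n' "$header" "$rows"
rm -rf "$tmp"
```

Note that in the dbNSFP commands the combined file is named dbNSFP2.1_variant, which matches the dbNSFP2.1_variant* pattern if the command is run a second time; deleting the combined file before re-running avoids duplicated rows.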

It is really that simple: dbNSFP is now integrated into BioR! To use it, make sure to index it as needed using the bior_index_catalog command.

BioR Catalog Shortcut

BioR commands commonly use long paths to files. One of the first things you will want to do when using BioR is to define a shortcut, the $bior environment variable, that points to the location of the BioR catalogs. For example, if the BioR catalogs are located in /data/path/, then, in bash, execute the following command at the command line:

$ export bior=/data/path/

You may want to put this command in your .bashrc or .bash_profile so that the $bior environment variable shows up next time you log in.
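As a minimal sketch of how the shortcut is used and persisted: /data/path is the placeholder location from the example above, and a temporary file stands in for ~/.bashrc so that trying the snippet has no side effects.

```shell
# Placeholder catalog root; substitute the real location on your system.
export bior=/data/path

# Build a catalog path from the shortcut (the sub-path is the NCBIGene
# example used elsewhere in this guide).
catalog="$bior/NCBIGene/GRCh37_p10/genes.tsv.bgz"
printf '%s\n' "$catalog"

# Append the export line to a startup file only if it is not already there.
# A temporary file stands in for ~/.bashrc here.
profile=$(mktemp)
line='export bior=/data/path'
grep -qxF "$line" "$profile" || printf '%s\n' "$line" >> "$profile"
grep -qxF "$line" "$profile" || printf '%s\n' "$line" >> "$profile"   # second run adds nothing
count=$(grep -cxF "$line" "$profile")
rm -f "$profile"
```

Guarding the append with grep keeps the startup file from accumulating duplicate export lines when the setup step is repeated.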

Finding out what is in a Catalog

Each data source is ‘published’ into a BioR catalog file for use by the BioR scripts. A catalog is a collection of files (both data and indexes) that is understood by the BioR Pipes infrastructure. BioR’s reference data consists of the raw files downloaded/updated and made available to BioR users. These files ARE NOT catalogs: the raw files are transformed into the BioR standard catalog structure so that pipes can work on the content. BioR catalogs are bgzipped files [1]_ that contain 4 columns (_landmark, _minBP, _maxBP, and JSON). A more comprehensive description of the BioR catalog format is in Chapter 3.

To see what is in a catalog, use the zcat command (gzcat on a Mac) followed by the catalog filename, and pipe the output into less:

$ zcat $bior/NCBIGene/GRCh37_p10/genes.tsv.bgz | less

1 10954 11507 {"_type":"gene","_landmark":"1","_strand":"+","_minBP":10954,"_maxBP":11507,"gene":"LOC100506145","note":"Derived by automated computational analysis using gene prediction method: GNOMON. Supporting evidence includes similarity to: 1 Protein","pseudo":"","GeneID":"100506145"}

Unix less is a good, low-memory command for looking at data. Type q <enter> to quit less. Type man less at the command line to see how to use the less command. You can use the up and down arrows to scroll through the data a line at a time, or ‘f’ and ‘b’ to scroll a page at a time.
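Beyond paging through a catalog, the four-column layout makes it easy to pull out individual pieces with standard tools. The following is a rough sketch on a made-up, one-row stand-in for a catalog: it is compressed with plain gzip rather than bgzip (an assumption made so that the example needs no extra tools), and the sed extraction is only adequate for flat, simple JSON.

```shell
tmp=$(mktemp -d)
# One made-up catalog row: _landmark, _minBP, _maxBP, then the JSON column.
printf '1\t10954\t11507\t{"_type":"gene","_landmark":"1","_minBP":10954,"_maxBP":11507,"gene":"LOC100506145"}\n' \
  | gzip -c > "$tmp/demo.tsv.gz"

# gzip -dc is the portable spelling of zcat/gzcat.
ncols=$(gzip -dc "$tmp/demo.tsv.gz" | head -n 1 | awk -F '\t' '{ print NF }')
json=$(gzip -dc "$tmp/demo.tsv.gz" | head -n 1 | cut -f 4)

# Pull one value out of the JSON column with sed.
gene=$(printf '%s\n' "$json" | sed 's/.*"gene":"\([^"]*\)".*/\1/')
printf 'columns=%s gene=%s\n' "$ncols" "$gene"
rm -rf "$tmp"
```

For anything beyond a quick look, the BioR commands described later (for example bior_drill) are the intended way to extract values from the JSON column.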

Several of the above functions use ‘Golden Identifiers’ to match records across catalogs. Table 2 shows the current golden identifiers used in the codebase and which function(s) use them. The bold ones are the primary attributes used by several commands.

| ‘Golden Identifier’ | Functions | Definition |
| _landmark | bior_overlap, bior_same_variant | Chromosome, or sequence ID that the interval is located on |
| _minBP | bior_overlap, bior_same_variant | Minimum 1-based position (e.g. NCBI coordinates) on the landmark sequence |
| _maxBP | bior_overlap, bior_same_variant | Maximum 1-based position on the landmark sequence |
| _refAllele | bior_same_variant | REF as in VCF standard |
| _altAlleles | bior_same_variant | ALT as in VCF standard |
| _type | (none yet) | The type of object each line refers to (Examples: “variant”, “gene”, “drug”, “pathway”, etc) |
| _strand | (none yet) | The strand direction (“+” or “-”) |
| _id | (none yet) | The id for the variant or object. For example, this may refer to the rsId for variants |
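Before handing a catalog to a command, it can be useful to check that its JSON column carries the golden identifiers that command relies on. The sketch below is a plain string check, not part of BioR; has_keys is a hypothetical helper name, and the key lists are taken from the golden identifier table.

```shell
# JSON column of the gene record shown earlier in this guide (shortened).
json='{"_type":"gene","_landmark":"1","_strand":"+","_minBP":10954,"_maxBP":11507,"gene":"LOC100506145"}'

# has_keys JSON KEY...  (hypothetical helper): succeeds only if every KEY
# occurs as a "KEY": member somewhere in JSON.
has_keys() {
  record=$1; shift
  for key in "$@"; do
    case "$record" in
      *"\"$key\":"*) ;;        # key present, keep checking
      *) return 1 ;;           # key missing
    esac
  done
  return 0
}

if has_keys "$json" _landmark _minBP _maxBP; then overlap_ready=yes; else overlap_ready=no; fi
if has_keys "$json" _landmark _minBP _maxBP _refAllele _altAlleles; then variant_ready=yes; else variant_ready=no; fi
printf 'overlap: %s, same_variant: %s\n' "$overlap_ready" "$variant_ready"
```

For the gene record this prints "overlap: yes, same_variant: no": it has the positional identifiers used by bior_overlap but none of the allele identifiers that bior_same_variant also needs.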

BioR Commands

Showing the Commands in the BioR Toolkit

All BioR commands start with bior_, so once BioRTools is installed and on your path you can type bior_ followed by the tab key (twice) to see all of the current commands in the toolkit:

$ bior_

bior_annotate bior_concat bior_lookup bior_snpeff

bior_annotate_blaster bior_count_catalog_misses bior_merge bior_tab_to_tjson

bior_annotate.sh bior_create_catalog bior_modify_tjson bior_tjson_to_vcf

bior_bed_to_tjson bior_create_catalog_props bior_overlap bior_trim_spaces

bior_build_catalog bior_create_config_for_tab_to_tjson bior_pretty_print

bior_variant_to_tjson bior_catalog_remove_duplicates bior_drill

bior_ref_allele bior_vcf_to_tjson bior_catalog_stats bior_gbk_to_tjson

bior_replace_lines bior_vep bior_chunk bior_gff3_to_tjson

bior_rsid_to_tjson bior_verify_catalog bior_compress bior_index_catalog

bior_same_variant bior_create_catalog_props bior_ref_allele

 

Or list them one command per line, in sorted order:

$ ls -1 $BIOR_LITE_HOME/bin | grep ^bior_

bior_annotate

bior_annotate_blaster

bior_bed_to_tjson

bior_build_catalog

bior_catalog_remove_duplicates

bior_catalog_stats

bior_chunk

bior_compress

bior_concat

bior_count_catalog_misses

bior_create_catalog

bior_create_catalog_props

bior_create_config_for_tab_to_tjson

bior_drill

bior_gbk_to_tjson

bior_gff3_to_tjson

bior_index_catalog

bior_lookup

bior_merge

bior_modify_tjson

bior_overlap

bior_pretty_print

bior_ref_allele

bior_replace_lines

bior_rsid_to_tjson

bior_same_variant

bior_snpeff

bior_tab_to_tjson

bior_tjson_to_vcf

bior_trim_spaces

bior_variant_to_tjson

bior_vcf_to_tjson

bior_vep

bior_verify_catalog

To find out in which version each command was added to BioR:

# Path to cmds is similar to:

# /usr/local/biotools/bior_scripts/4.3.0/bior_pipeline-4.3.0/bin/bior_drill

# Sort by cmd

$ for cmd in `ls -1 $BIOR_LITE_HOME/bin | grep ^bior`; do earliestVersion=`find $BIOR_LITE_HOME/../../ -name $cmd | sed 's#^.*bior_pipeline-##' | sed 's#^\.##' | sed 's#/.*##' | sort | head -1`; echo -e "$cmd\t$earliestVersion"; done

bior_annotate 0.0.3-SNAPSHOT

bior_annotate_blaster 2.3.0

bior_bed_to_tjson 2.1.0

bior_build_catalog 4.1.2

bior_catalog_remove_duplicates 3.0.0

bior_catalog_stats 4.3.0

bior_chunk 2.3.0

bior_compress 0.0.3-SNAPSHOT

bior_concat 2.3.0

bior_count_catalog_misses 4.1.2

bior_create_catalog 2.1.0

bior_create_catalog_props 2.1.0

bior_create_config_for_tab_to_tjson 2.1.0

bior_drill 0.0.3-SNAPSHOT

bior_gbk_to_tjson 2.4.0

bior_gff3_to_tjson 2.4.0

bior_index_catalog 2.1.0

bior_lookup 0.0.3-SNAPSHOT

bior_merge 2.3.0

bior_modify_tjson 4.3.0

bior_overlap 0.0.3-SNAPSHOT

bior_pretty_print 0.0.3-SNAPSHOT

bior_ref_allele 2.3.0

bior_replace_lines 4.3.0

bior_rsid_to_tjson 2.4.1

bior_same_variant 0.0.3-SNAPSHOT

bior_snpeff 0.0.3-SNAPSHOT

bior_tab_to_tjson 2.1.0

bior_tjson_to_vcf 2.1.0

bior_trim_spaces 2.2.1

bior_variant_to_tjson 3.0.0

bior_vcf_to_tjson 2.1.0

bior_vep 0.0.3-SNAPSHOT

bior_verify_catalog 4.1.2

# Sort by release where each command was introduced

$ for cmd in `ls -1 $BIOR_LITE_HOME/bin | grep ^bior`; do earliestVersion=`find $BIOR_LITE_HOME/../../ -name $cmd | sed 's#^.*bior_pipeline-##' | sed 's#^\.##' | sed 's#/.*##' | sort | head -1`; echo -e "$earliestVersion\t$cmd"; done | sort -k1,1

0.0.3-SNAPSHOT bior_annotate

0.0.3-SNAPSHOT bior_compress

0.0.3-SNAPSHOT bior_drill

0.0.3-SNAPSHOT bior_lookup

0.0.3-SNAPSHOT bior_overlap

0.0.3-SNAPSHOT bior_pretty_print

0.0.3-SNAPSHOT bior_same_variant

0.0.3-SNAPSHOT bior_snpeff

0.0.3-SNAPSHOT bior_vep

2.1.0 bior_bed_to_tjson

2.1.0 bior_create_catalog

2.1.0 bior_create_catalog_props

2.1.0 bior_create_config_for_tab_to_tjson

2.1.0 bior_index_catalog

2.1.0 bior_tab_to_tjson

2.1.0 bior_tjson_to_vcf

2.1.0 bior_vcf_to_tjson

2.2.1 bior_trim_spaces

2.3.0 bior_annotate_blaster

2.3.0 bior_chunk

2.3.0 bior_concat

2.3.0 bior_merge

2.3.0 bior_ref_allele

2.4.0 bior_gbk_to_tjson

2.4.0 bior_gff3_to_tjson

2.4.1 bior_rsid_to_tjson

3.0.0 bior_catalog_remove_duplicates

3.0.0 bior_variant_to_tjson

4.1.2 bior_build_catalog

4.1.2 bior_count_catalog_misses

4.1.2 bior_verify_catalog

4.3.0 bior_catalog_stats

4.3.0 bior_modify_tjson

4.3.0 bior_replace_lines
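The sed pipeline inside the two loops above reduces each installed path to its release number. A simplified sketch of that reduction, runnable without a BioR installation, is shown below; it uses the example path from the comment above the loops and a made-up list of releases.

```shell
# Example path copied from the comment above the loops.
path='/usr/local/biotools/bior_scripts/4.3.0/bior_pipeline-4.3.0/bin/bior_drill'

# Step 1: delete everything up to and including "bior_pipeline-".
# Step 2: delete everything from the next "/" onward, leaving only the release.
version=$(printf '%s\n' "$path" | sed 's#^.*bior_pipeline-##' | sed 's#/.*##')
printf '%s\n' "$version"

# With several releases of one command, a plain sort followed by head -1
# picks the earliest. Plain sort is lexicographic, so this is only reliable
# while major version numbers are single digits.
earliest=$(printf '%s\n' 4.3.0 2.1.0 3.0.0 | sort | head -1)
printf '%s\n' "$earliest"
```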

Table 1 has a more complete description of these commands.

Commands in the toolkit operate on tab-delimited data with a VCF-style header (lines starting with “#”). Each command inserts its additional annotation as new columns to the right. Raw annotation is obtained by comparing JSON objects in columns to JSON objects in catalogs. Table 1 shows the input and output format of each BioR command. For example, bior_vcf_to_tjson takes VCF columns (and the header) as input and outputs the VCF columns plus JSON in the last column.

| Command | Input, Output | Description |
| bior_annotate | VCF, TJSON | Append to the VCF ‘info’ field a set of commonly used annotations. |
| bior_annotate_blaster | VCF, TJSON | Similar to bior_annotate, but it uses the grid engine to split the input VCF into multiple smaller chunks and annotate those chunks concurrently before re-assembling them into a single file. |
| bior_bed_to_tjson | BED, TJSON | Load a BED file and convert to TJSON format. |
| bior_build_catalog | (various), Catalog | Creates a catalog bgz file from some data source, along with the accompanying columns.tsv and datasource.properties files. Also verifies the catalog for conformity with the catalog spec and consistency with reference assemblies. |
| bior_catalog_remove_duplicates | TJSON, TJSON | Keeps the first of several duplicate lines depending on keys specified by the user. |
| bior_catalog_stats | TJSON, (stats files) | Show statistics about a catalog, from frequency of characters occurring on what percentage of lines, to a list of 1000 possible values for each field. |
| bior_chunk | VCF, VCF | Breaks up a VCF into chunks based on start and end lines. |
| bior_compress | TJSON, TJSON | Compress entries from a provided set of identifiers into a single entry with each value separated by a delimiter. |
| bior_concat | VCF, VCF | Concatenate multiple VCF files together to form one large one. |
| bior_count_catalog_misses | catalog, report | Report the number of misses that would occur in a catalog due to the Tabix Reader bug that was found in Broad code. |
| bior_create_catalog | TJSON, catalog | Convert a text tabulated file into a catalog. Chromosome ID, Start and End genomic position fields have to be explicitly named. |
| bior_create_catalog_props | catalog, property | Create property files from the metadata extracted from a catalog. Property files are needed for proper metadata handling. |
| bior_create_config_for_tab_to_tjson | TSV, config | Create a configuration file that describes the columns. This file is needed when uploading a tab-delimited file. |
| bior_drill | TJSON, TJSON | Extract an element from a nested JSON string. |
| bior_gbk_to_tjson | Genbank, TJSON | Takes one or more genbank (gbk) files from input and outputs them as TJSON on STDOUT. |
| bior_gff3_to_tjson | gff3, TJSON | Takes variant data in GFF3 format from STDIN and converts it into JSON as an additional column that is output to STDOUT. |
| bior_index_catalog | identifier, index | Index the specified identifier in a catalog. Indexes are stored in a separate index file. |
| bior_lookup | TJSON, TJSON | Extract annotations from a catalog based on matching values of an identifier. |
| bior_merge | VCF, VCF | Merges multiple VCFs together into one large one. This is done by looking at the next line in each file to determine which should be inserted into the large VCF (vs bior_concat, which simply outputs one file after another without looking at the content of the lines). |
| bior_modify_tjson | TJSON, TJSON | Given a config file that specifies how to transform data types and values, modify the TJSON on the fly (during streaming). |
| bior_overlap | TJSON, TJSON | Extract annotations from a catalog based on genomic location overlap. The overlap is computed from the Start and End genomic position of a variant. |
| bior_pretty_print | TJSON, STDOUT | Convert TJSON into a readable format for screen or file output. |
| bior_ref_allele | TJSON, TJSON | Retrieves the reference allele from the NCBI Genome database that matches a chromosome, start, end position. |
| bior_replace_lines | TJSON, TJSON | Given two input files (1 with lines to find, and 1 with replacement lines), replace whole lines in the TJSON. |
| bior_rsid_to_tjson | text, TJSON | Converts rsIDs into JSON as an additional column. |
| bior_same_variant | TJSON, TJSON | Extract annotations from a catalog based on variant position, reference and alternate allele definition. |
| bior_snpeff | TJSON, TJSON | Use SnpEff¹ to annotate variants. Chromosome ID, Start and Stop genomic position, reference and alternate allele of the variant are required. |
| bior_tab_to_tjson | TSV, TJSON | Load a tab-delimited file and convert to TJSON format. |
| bior_tjson_to_vcf | TJSON, VCF | Convert TJSON to VCF format for file output. |
| bior_trim_spaces | TJSON, TJSON | Trims spaces from around tab-separated columns. Use this if you find spaces before or after your VCF columns that crash the tools or cause VEP to take a lot of memory. |
| bior_variant_to_tjson | TSV, TJSON | Converts rsID or position data into JSON as an additional column. |
| bior_vcf_to_tjson | VCF, TJSON | Load a VCF file and convert to TJSON format. |
| bior_vep | TJSON, TJSON | Use VEP² to annotate variants. Chromosome ID, Start and Stop genomic position, reference and alternate allele of the variant are required. |
| bior_verify_catalog | TJSON, Report | Verifies the catalog structure, reference base pairs, as well as columns.tsv and datasource.properties files against the values expected from crawling the catalog. |

Table 1: List of commands available in the BioR Toolkit. A detailed description and examples are displayed when executing a command with the -h flag.

¹ Cingolani, P. et al. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6(2):80-92.

² McLaren, W. et al. (2010) Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26(16):2069-70.
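To make the "annotation is added to the right" convention concrete, here is a rough awk sketch that appends a JSON column to a tiny made-up VCF-like input. It only imitates the column layout described above; it is not how bior_vcf_to_tjson is implemented, and the demo_json column name and the JSON field names are assumptions chosen for illustration.

```shell
# Tiny made-up input: one meta line, one column-header line, one data row.
input=$(printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n1\t100\trs1\tA\tG\n')

output=$(printf '%s\n' "$input" | awk -F '\t' -v OFS='\t' '
  /^##/ { print; next }                      # meta lines pass through unchanged
  /^#/  { print $0, "demo_json"; next }      # header row gains one column name
  { print $0, "{\"CHROM\":\"" $1 "\",\"POS\":" $2 "}" }   # data rows gain a JSON column
')
printf '%s\n' "$output"
```

The original columns are never modified; each step in a pipeline only adds columns on the right, which is what allows BioR commands to be chained with pipes.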

Nearly every one of these commands supports the -h (help) flag, which prints information about how to use the command. To get help on bior_vcf_to_tjson, type:

$ bior_vcf_to_tjson -h

NAME

bior_vcf_to_tjson – converts VCF data into JSON as an additional column

SYNOPSIS

bior_vcf_to_tjson [--log] [--help]