bior_drill
bior_drill extracts one or more values from a single JSON column. As a simple example, say we have one JSON column containing one value. By default, bior_drill removes the JSON column and inserts a new column that holds just the drilled value. Note the ##BIOR metadata header that is created for the new column: the prefix "bior." is added to the column name to mark it as a BioR column.
    $ cat my.tsv
    #myJson
    {"aString":"hi"}

    $ cat my.tsv | bior_drill -p aString
    ##BIOR=<ID="bior.myJson.aString",Operation="bior_drill",DataType="String",ShortUniqueName="",Path="">
    #bior.myJson.aString
    hi
You can also keep the JSON column with the -k flag. It is re-appended to the end of the line so you can continue to perform operations on it:
    $ cat my.tsv | bior_drill -p aString -k
    ##BIOR=<ID="bior.myJson.aString",Operation="bior_drill",DataType="String",ShortUniqueName="",Path="">
    #bior.myJson.aString    myJson
    hi    {"aString":"hi"}
If bior_overlap, bior_same_variant, or bior_lookup was run before drilling any of the fields, much more metadata is added about the catalog and about each field that is drilled:
    $ ctg=/data5/bsi/catalogs/bior/v1/1000_genomes/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites_GRCh37.tsv.bgz
    $ echo "{_landmark:1,_minBP:10583,_maxBP:10583}" | bior_overlap -d $ctg | bior_drill -p ID
    ##BIOR=<ID="bior.1kG_3",Operation="bior_overlap",DataType="JSON",ShortUniqueName="1kG_3",Source="1000 Genomes",Description="1000 Genomes Project goal is to find most genetic variants that have frequencies of at least 1% in the populations studied.",Version="3",Build="GRCh37",Path="/data5_mike/bsi/catalogs/bior/v1/1000_genomes/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites_GRCh37.tsv.bgz">
    ##BIOR=<ID="bior.1kG_3.ID",Operation="bior_drill",Field="ID",DataType="String",Number=".",FieldDescription="Semi-colon separated list of unique identifiers. If this is a dbSNP variant, the rs number(s) should be used. (VCF field)",ShortUniqueName="1kG_3",Source="1000 Genomes",Description="1000 Genomes Project goal is to find most genetic variants that have frequencies of at least 1% in the populations studied.",Version="3",Build="GRCh37",Path="/data5_mike/bsi/catalogs/bior/v1/1000_genomes/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites_GRCh37.tsv.bgz",Delimiter="|">
    #UNKNOWN_1    bior.1kG_3.ID
    {_landmark:1,_minBP:10583,_maxBP:10583}    rs58108140
You can use the "echo" command instead of "cat" to inject JSON into the STDIN stream. Here we also drill out multiple fields at once. (The ##BIOR headers are omitted from many of the following commands for brevity and shortened to "##BIOR....")
    $ echo "{'aString':'hi','aFloat':0.034,'aBool':false,'anInt':34,'aDot':'.'}" | bior_drill -p aString -p aFloat -p aBool -p anInt -p aDot
    ##BIOR....
    #bior.#UNKNOWN_1.aString    bior.#UNKNOWN_1.aFloat    bior.#UNKNOWN_1.aBool    bior.#UNKNOWN_1.anInt    bior.#UNKNOWN_1.aDot
    hi    0.034    false    34    .
Drill from a middle column (not the last one)
NOTE: This shows how to use tabs ("\t") and newlines ("\n") with the "echo -e" command to add columns and header rows (or multiple data rows).
NOTE: The drilled column is removed by default, and all following columns are shifted left one position.
    $ echo -e "#Id\tMyJson\tRef\tAlt\nRs1\t{'A':1}\tA\tC"
    #Id    MyJson    Ref    Alt
    Rs1    {'A':1}    A    C

    $ echo -e "#Id\tMyJson\tRef\tAlt\nRs1\t{'A':1}\tA\tC" | bior_drill -c 2 -p A
    ##BIOR....
    #Id    Ref    Alt    bior.MyJson.A
    Rs1    A    C    1
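Whether "echo" interprets "\t" and "\n" depends on the shell, so "printf" can be a more predictable way to build the same test input. The following is a minimal sketch using only standard POSIX tools; the variable name "input" is chosen for this illustration, and the bior_drill invocation is shown only as a comment.

```shell
# Build the two-line, four-column input from the example above with printf.
# "input" is a hypothetical variable name used only in this sketch.
input=$(printf '#Id\tMyJson\tRef\tAlt\nRs1\t%s\tA\tC\n' "{'A':1}")

# Print the number of tab-separated columns on each line as a sanity check
# before piping the data onward, for example:
#   printf '%s\n' "$input" | bior_drill -c 2 -p A
printf '%s\n' "$input" | awk -F '\t' '{ print NF }'
```

Passing the JSON through a "%s" argument keeps its quotes and braces out of the printf format string, so only the "\t" and "\n" escapes in the format are interpreted.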
Arrays and nulls are handled specially.
- Nulls and empty arrays when drilled are denoted as dots to maintain placeholders in an otherwise empty column
- If a field does not appear in the line, its drilled value will be a dot (ex: the “nonexistent” field specified below)
- Single-value arrays have the brackets ("[", "]") removed and only the value inserted
- Multi-value arrays have the brackets removed, and values are separated by pipes by default (ex: "anArray":["A","B","C"] drilled becomes "A|B|C").
- If the separator occurs in one of the values, an error occurs
- Change the default separator by using the -d flag
- Skip any nulls or dots in an array using the -s flag. Warning: some arrays keep nulls or dots to maintain order to correlate other values from another array within the same data source, so order and position may be important.
- NOTE: BioR v4.3.0 changed the way arrays are drilled. Prior to that version, "{'key':[1,2,3]}" would drill to "[1,2,3]" instead of "1|2|3" as is now done. This change allows easier parsing of drilled values.
- NOTE: BioR v4.2.0 fixed a bug in the way null values were handled when drilling. Previously, instead of inserting a dot as a placeholder for that column, the column was missing entirely, which shifted all columns left one position.
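The warning about the -s flag can be illustrated without running BioR at all. The sketch below uses two made-up drilled values (the names "alleles" and "freqs" and their contents are hypothetical) and pairs them by position with awk. Because the dot placeholder is kept in the first value, every entry still lines up with its counterpart in the second.

```shell
# Hypothetical drilled values from two parallel arrays in the same record.
# The "." in alleles is a placeholder that keeps positions aligned with freqs.
alleles='A|.|C'
freqs='0.10|0.25|0.65'

# Pair values by position, skipping placeholders only at the pairing step.
pairs=$(printf '%s\t%s\n' "$alleles" "$freqs" | awk -F '\t' '{
    n = split($1, a, "|")
    split($2, f, "|")
    for (i = 1; i <= n; i++)
        if (a[i] != ".") print a[i] "=" f[i]
}')
printf '%s\n' "$pairs"
```

If the first value had been drilled with -s and reduced to "A|C", the same positional pairing would match "C" with the second frequency instead of the third.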
    $ echo "{'aNull':null,'anArray':['A','B','C'],'anArraySingleVal':['D'],'anArrayEmpty':[],'anArrayNull':null}" | bior_drill -p aNull -p anArray -p anArraySingleVal -p anArrayEmpty -p anArrayNull -p nonexistent
    ##BIOR....
    #bior.#UNKNOWN_1.aNull    bior.#UNKNOWN_1.anArray    bior.#UNKNOWN_1.anArraySingleVal    bior.#UNKNOWN_1.anArrayEmpty    bior.#UNKNOWN_1.anArrayNull    bior.#UNKNOWN_1.nonexistent
    .    A|B|C    D    .    .    .

    # Error if separator occurs in value
    $ echo "{'anArray':['A|B','B','C']}" | bior_drill -p anArray
    Application error
    bior_drill: : Error: the delimiter '|' was found within one of the array values that was drilled: 'A|B'
    Execute bior_drill with logging enabled using the -l or --log option
    Command executed
    bior_drill -p anArray

    # Override the default separator
    $ echo "{'anArray':['A|B','B','C']}" | bior_drill -p anArray -d "##"
    ##BIOR=....
    #bior.#UNKNOWN_1.anArray
    A|B##B##C

    # Skip any nulls and dots within an array. NOTE: Empty strings are still added
    $ echo "{'anArray':['A',null,'.','B',null,'C',null,'D','']}" | bior_drill -p anArray -s
    ##BIOR=....
    #bior.#UNKNOWN_1.anArray
    A|B|C|D|
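Once an array has been drilled to a pipe-separated value, it can be split again downstream with ordinary text tools. The following sketch assumes the default "|" separator and uses a hard-coded string as a stand-in for a drilled column value; the variable names are hypothetical.

```shell
# Stand-in for one drilled column value; in practice this would be read
# from bior_drill output rather than hard-coded.
drilled='A|B|C'

# Split on the default pipe separator, producing one value per line.
values=$(printf '%s\n' "$drilled" | tr '|' '\n')

# Count the values and print them.
count=$(printf '%s\n' "$values" | wc -l | tr -d ' ')
printf '%s values:\n%s\n' "$count" "$values"
```

This approach only works for single-character separators, since "tr" translates one character at a time. A multi-character separator set with -d, such as "##", would need a tool like awk or sed instead.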