bior_catalog_stats

Generates the following catalog statistics:

  • Character stats - Tracks how often each character occurs in a column. These statistics are written to one file per catalog column, named <COLUMN_NAME>_char_stats.txt
  • Value stats - Tracks column values and their frequency (up to the given sampling size). These statistics are written to one file per catalog column, named <COLUMN_NAME>_value_stats.txt

From these statistics we can more easily determine whether a field should be a String (its values contain non-numeric characters, or dots), a Float (only digits and dots), an Integer (only digits, no dots), a Boolean (only the letters that make up “true” and “false”), etc.
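The rule of thumb above can be sketched as a small shell function. This is only an illustration: `infer_type` is a hypothetical helper, not part of BioR, and it assumes its argument is the set of distinct characters listed in a column's character stats. It ignores signs and exponents, and treats a column containing only dots as a String.

```shell
# Hypothetical helper (not part of BioR): guess a column type from the
# distinct characters observed in that column.
infer_type() {
  case "$1" in
    "")
      echo "Unknown" ;;                  # no characters observed
    *[!0-9.]*)
      # At least one character that is neither a digit nor a dot
      case "$1" in
        *[!truefalsTRUEFALS]*) echo "String" ;;
        *)                     echo "Boolean" ;;  # only letters of "true"/"false"
      esac ;;
    *.*)
      # Only digits and dots, with at least one dot
      case "$1" in
        *[0-9]*) echo "Float" ;;
        *)       echo "String" ;;        # dots only, e.g. a missing-value marker
      esac ;;
    *)
      echo "Integer" ;;                  # digits only
  esac
}

infer_type "0123456789MXY"   # prints: String
```

The characters passed in the call are the ones reported for the chrom column in the example further down, which is indeed typed as String.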

ctg=/data5/bsi/catalogs/bior/v1/ClinVar/20160515_GRCh37/variants_no dups.v1/macarthur-lab_xml_txt.tsv.bgz

# Generate a report for the ClinVar catalog with 100 sample values

$ bior_catalog_stats -d $ctg -o . -n 100

# The reports generated (with some unnecessary columns on the left cut out):

$ ls -l | cut -f 7- -d" "

6698 Apr 24 17:04 all_pmids_stats.txt

15209 Apr 24 17:04 all_submitters_stats.txt

18825 Apr 24 17:04 all_traits_stats.txt

5328 Apr 24 17:04 alt_stats.txt

2641 Apr 24 17:04 chrom_stats.txt

8878 Apr 24 17:04 clinical_significance_stats.txt

844 Apr 24 17:04 conflicted_stats.txt

5670 Apr 24 17:04 endPos_stats.txt

8460 Apr 24 17:04 hgvs_c_stats.txt

10325 Apr 24 17:04 hgvs_p_stats.txt

5668 Apr 24 17:04 measureset_id_stats.txt

1194 Apr 24 17:04 mut_stats.txt

836 Apr 24 17:04 pathogenic_stats.txt

5683 Apr 24 17:04 pos_stats.txt

5264 Apr 24 17:04 ref_stats.txt

2986 Apr 24 17:04 review_status_stats.txt

8642 Apr 24 17:04 symbol_stats.txt

# Looking at the first row in the catalog, we see the fields that are reported on above (minus the golden attributes like “_landmark”, “_id”, etc.)

$ zcat $ctg | head -1 | bior_pretty_print

# COLUMN NAME COLUMN VALUE

- ----------- ------------

1 #UNKNOWN_1 1

2 #UNKNOWN_2 949523

3 #UNKNOWN_3 949523

4 #UNKNOWN_4 {
  "chrom": "1",
  "pos": 949523,
  "ref": "C",
  "alt": "T",
  "mut": "ALT",
  "measureset_id": 183381,
  "symbol": "ISG15",
  "clinical_significance": "Pathogenic",
  "review_status": "no assertion criteria provided",
  "hgvs_c": "NM_005101.3:c.163C\u003eT",
  "hgvs_p": "NP_005092.1:p.Gln55Ter",
  "all_submitters": "OMIM",
  "all_traits": "Immunodeficiency 38;IMMUNODEFICIENCY 38 WITH BASAL GANGLIA CALCIFICATION",
  "all_pmids": "25307056",
  "pathogenic": 1,
  "conflicted": 0,
  "endPos": 949523,
  "_altAlleles": [
    "T"
  ],
  "_id": ".",
  "_landmark": "1",
  "_minBP": 949523,
  "_refAllele": "C",
  "_maxBP": 949523
}

# Here, the “COLUMN METADATA” is info pulled from the catalog’s columns.tsv file

# Character stats are obtained by crawling each line in the catalog and counting how often each character appears

$ cat chrom_stats.txt

# COLUMN METADATA

COLUMN chrom

TYPE String

COUNT 1

DESCRIPTION Chromosome

# CHARACTER STATS

Total Lines in file: 117329

Lines that had column: 117329 (100.0%)

Total column value chars: 173129

NON-Alphanumeric

CHARACTER LINE_FREQ LINE_% CHAR_FREQ CHAR_%

Alphanumeric

CHARACTER LINE_FREQ LINE_% CHAR_FREQ CHAR_%

0 5802 4.9% 5802 3.4%

1 60443 51.5% 68477 39.6%

2 23790 20.3% 25435 14.7%

3 11383 9.7% 11383 6.6%

4 5572 4.7% 5572 3.2%

5 9576 8.2% 9576 5.5%

6 9870 8.4% 9870 5.7%

7 15207 13.0% 15207 8.8%

8 4372 3.7% 4372 2.5%

9 8959 7.6% 8959 5.2%

M 81 0.1% 81 0.0%

X 8365 7.1% 8365 4.8%

Y 30 0.0% 30 0.0%

# VALUE STATS

VALUE_FREQ VALUE_% VALUE

13750 11.7% 2

9227 7.9% 17

8365 7.1% X

8034 6.8% 11

7987 6.8% 1

6351 5.4% 16

6229 5.3% 5

5980 5.1% 7

5829 5.0% 13

5554 4.7% 3

5469 4.7% 12

4608 3.9% 9

4351 3.7% 19

4103 3.5% 10

3519 3.0% 6

3347 2.9% 15

3007 2.6% 14

2861 2.4% 8

2565 2.2% 4

1699 1.4% 20

1645 1.4% 22

1511 1.3% 18

1227 1.0% 21

81 0.1% M

30 0.0% Y
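To make the LINE_FREQ and CHAR_FREQ columns concrete, here is a minimal sketch of how such counts could be computed. It is an illustration, not the tool's actual implementation: `char_stats` is a hypothetical function, and it assumes the values of a single column arrive one per line on standard input. LINE_FREQ counts the lines containing a character at least once, while CHAR_FREQ counts every occurrence, which is why "1" has a higher CHAR_FREQ than LINE_FREQ in the chrom report (a value such as 11 contributes one line but two characters).

```shell
# Hypothetical sketch (not the BioR implementation): per-character counts
# for column values read one per line from stdin.
# Output columns (tab-separated): CHARACTER, LINE_FREQ, CHAR_FREQ
char_stats() {
  awk '
    {
      split("", seen)                  # clear the per-line set of characters
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        char_freq[c]++                 # every occurrence counts
        if (!(c in seen)) {            # first time this character is seen on the line
          seen[c] = 1
          line_freq[c]++
        }
      }
    }
    END {
      for (c in char_freq)
        printf "%s\t%d\t%d\n", c, line_freq[c], char_freq[c]
    }
  ' | LC_ALL=C sort
}

# Three values: "1" appears on 2 lines and 3 times in total; "X" once.
printf '1\n11\nX\n' | char_stats
```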

# If we look at the “VALUE STATS” section of “clinical_significance_stats.txt”, for example, we can see the 10 most frequently occurring values:

$ grep -A 12 "VALUE STATS" clinical_significance_stats.txt

# VALUE STATS

VALUE_FREQ VALUE_% VALUE

31677 27.0% Pathogenic

30630 26.1% Uncertain significance

12700 10.8% Benign

12597 10.7% not provided

10660 9.1% Likely benign

6197 5.3% Likely pathogenic

2293 2.0% Benign;Likely benign

1772 1.5% other

1188 1.0% Likely benign;Uncertain significance

1077 0.9% Likely pathogenic;Pathogenic
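A comparable frequency table can be approximated with standard Unix tools. The following is a rough sketch under stated assumptions, not how bior_catalog_stats itself works: `value_stats` is a hypothetical function, the column values are assumed to arrive one per line on standard input, and the VALUE_% column is omitted for brevity.

```shell
# Hypothetical sketch (not the BioR implementation): print the N most
# frequent values from stdin as "count<TAB>value" (N defaults to 10).
value_stats() {
  LC_ALL=C sort |
    LC_ALL=C uniq -c |                 # prefix each distinct value with its count
    LC_ALL=C sort -k1,1nr |            # highest count first
    awk -v n="${1:-10}" '
      NR <= n {
        count = $1
        sub(/^ *[0-9]+ /, "")          # strip the count, keep the value verbatim
        print count "\t" $0
      }'
}

# Prints two tab-separated rows: "3 Pathogenic", then "2 Benign".
printf 'Pathogenic\nBenign\nPathogenic\nother\nBenign\nPathogenic\n' | value_stats 2
```

Stripping the count with `sub` rather than reassigning fields keeps the internal spacing of values such as "Uncertain significance" intact.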