Package 'QCGWAS'

Title:	Quality Control of Genome Wide Association Study Results
Description:	Tools for (automated and manual) quality control of the results of Genome Wide Association Studies.
Authors:	Peter J. van der Most and Ilja M. Nolte
Maintainer:	Peter J. van der Most <[email protected]>
License:	GPL (>= 3)
Version:	1.0-9
Built:	2025-02-14 03:09:29 UTC
Source:	https://github.com/cran/QCGWAS

Help Index

Quality Control of Genome Wide Association Study Results
Skewness and Kurtosis
Checking GWAS p-values
Convert imputation-status values to the QCGWAS standard
Create an allele-reference file from HapMap data
Automated filtering and reformatting of GWAS results files
Sample dataset for the QCGWAS package
Translation table for GWAS dataset headers
Select high-quality data in GWAS datasets
Identify non-standard column names
Generates an intensity plot from x, y datasets
Easy loading of GWAS results files
Check and correct alleles in GWAS result files
GWAS effect-Size distribution plot
GWAS Precision Plot
Regional Association Plot
GWAS Skewness vs. Kurtosis Plot
Automated Quality Control of GWAS Results files
Histogram(s) of expected and observed data distribution
QQ and Manhattan plots
QQ plot(s) of expected vs. reported p-values
Log entries for QC_GWAS
Convert alleles to the opposing DNA strand
Translate column names into standard names

Quality Control of Genome Wide Association Study Results

Description

Functions for automated and manual quality control of Genome Wide Association Study results.

Details

Package:	QCGWAS
Type:	Package
Version:	1.0-9
Date:	2022-05-30
License:	GPL (>= 3)

The core of the package is the function QC_GWAS. This function carries out an automated quality-control (QC) of a Genome Wide Association Study (GWAS) results file, reporting on the data-distribution, checking the SNPs for missing and invalid data, comparing the alleles and allele-frequency to a reference, and creating QQ and Manhattan plots.

Although the number of arguments in QC_GWAS may seem overwhelming, only three of them are required to run a basic QC. The name of the file to be QC'ed should be passed to the filename argument; the directory of said file to the dir_data argument; and a header-translation table to the header_translations argument. The results will be saved in a number of files and graphs in the data directory. For a quick introduction to QCGWAS, read the quick reference guide that can be found in "R\library\QCGWAS\doc".

Author(s)

Peter J. van der Most and Ilja M. Nolte

Maintainer: Peter J. van der Most <[email protected]>

References

Van der Most, P.J., Vaez, A., Prins, B.P., Loretto Munoz, M., Snieder, H., Alizadeh, B.Z. & Nolte, I.M. (2014). QCGWAS: A flexible R package for automated quality control of Genome-Wide Association results. Bioinformatics 30(8):1185-1186.

Skewness and Kurtosis

Description

Functions for calculating skewness and kurtosis

Usage

calc_kurtosis(input,
              FRQ_val = NULL, HWE_val = NULL,
              cal_val = NULL, imp_val = NULL, ...)
calc_skewness(input,
              FRQ_val = NULL, HWE_val = NULL,
              cal_val = NULL, imp_val = NULL, ...)
calc_kurtosis(input,
              FRQ_val = NULL, HWE_val = NULL,
              cal_val = NULL, imp_val = NULL, ...)
calc_skewness(input,
              FRQ_val = NULL, HWE_val = NULL,
              cal_val = NULL, imp_val = NULL, ...)

Arguments

`input`	either a vector of effect sizes or a data frame using the standard column names.
`FRQ_val`, `HWE_val`, `cal_val`, `imp_val`, `...`	arguments passed to `HQ_filter`.

Details

Kurtosis is a measure of how well a distribution matches a Gaussian distribution. A Gaussian distribution has a kurtosis of 0. Negative kurtosis indicates a flatter distribution curve, while positive kurtosis indicates a sharper, thinner curve.

Skewness is a measure of distribution asymmetry. A symmetrical distribution has skewness 0. A positive skewness indicates a long tail towards higher values, while a negative skewness indicates a long tail towards lower values.

Kurtosis is calculated as:

sum( (ES - mean(ES))^4) / ((length(ES)-1) * sd(ES)^4 )

Skewness is calculated as:

sum( (ES - mean(ES))^3) / ((length(ES)-1) * sd(ES)^3 )

Value

Respectively the kurtosis and skewness of the input effect-size distribution.

Note

Both functions accept vectors as input. If input is a data frame, the column names must match the standard names used by QC_GWAS ("EFFECT" for effect sizes, "EFF_ALL_FREQ" for allele frequency, etc.)

Examples

  data("gwa_sample")  
  
  calc_kurtosis(gwa_sample$EFFECT)
  calc_kurtosis(gwa_sample)
  calc_kurtosis(gwa_sample$EFF_ALL_FREQ)
  calc_kurtosis(gwa_sample,
                FRQ_val = 0.05, cal_val = 0.95,
                filter_NA = FALSE)
data("gwa_sample")  
  
  calc_kurtosis(gwa_sample$EFFECT)
  calc_kurtosis(gwa_sample)
  calc_kurtosis(gwa_sample$EFF_ALL_FREQ)
  calc_kurtosis(gwa_sample,
                FRQ_val = 0.05, cal_val = 0.95,
                filter_NA = FALSE)

Checking GWAS p-values

Description

A simple test to check if the reported p-values in a GWAS results file match the other statistics. This function calculates an expected p-value (from the effect size and standard error) and then correlates it with the actual, reported p-value.

Usage

check_P(dataset, HQ_subset,
        plot_correlation = FALSE, plot_if_threshold = FALSE,
        threshold_r = 0.99,
        save_name = "dataset", save_dir = getwd(),
        header_translations,
        use_log = FALSE, dataN = nrow(dataset), ...)
check_P(dataset, HQ_subset,
        plot_correlation = FALSE, plot_if_threshold = FALSE,
        threshold_r = 0.99,
        save_name = "dataset", save_dir = getwd(),
        header_translations,
        use_log = FALSE, dataN = nrow(dataset), ...)

Arguments

`dataset`	table with at least three columns: p-value, effect size and standard error.
`HQ_subset`	an optional logical or numeric vector indicating the rows in `dataset` that contain high quality SNPs.
`plot_correlation`	logical; should a scatterplot of the reported vs. calculated p-values be made? If `TRUE`, the plot is saved as a .png file.
`plot_if_threshold`	logical; if `TRUE`, the scatterplot is only saved when the correlation between reported and calculated p-values is lower than `threshold_r`.
`threshold_r`	numeric; the correlation threshold for the scatterplot.
`save_name`	character string; the filename, without extension, for the scatterplot.
`save_dir`	character string; the directory where the output files are saved. Note that R uses forward slash (/) where Windows uses backslash (\).
`header_translations`	translation table for column names See `translate_header` for more information. If the argument is left empty, `dataset` is assumed to use the standard column names used by `QC_GWAS`.
`use_log`, `dataN`	arguments used by `QC_GWAS`; redundant when `check_P` is used separately.
`...`	arguments passed to `plot`.

Details

check_P calculates the expected p-value by taking the chi-square (1 degree of freedom) of the effect size divided by the standard error squared.

In a typical GWAS dataset, the expected and observed p-values should correlate perfectly. If this isn't the case, the problem either lies in a misidentified column, or the wrong values were used when generating the dataset.

Value

The correlation between expected and reported p-values.

Examples

  data("gwa_sample")

  selected_SNPs <- HQ_filter(data = gwa_sample,
                             FRQ_val = 0.05,
                             cal_val = 0.95,
                             filter_NA = FALSE)
  # To calculate a correlation between predicted and actual p-values:
  check_P(gwa_sample, HQ_subset = selected_SNPs,
          plot_correlation = FALSE)
  
  # To plot the correlation:
  ## Not run: 
    check_P(gwa_sample, HQ_subset = selected_SNPs,
            plot_correlation = TRUE, plot_if_threshold = FALSE,
            save_name = "sample")
  
## End(Not run)
data("gwa_sample")

  selected_SNPs <- HQ_filter(data = gwa_sample,
                             FRQ_val = 0.05,
                             cal_val = 0.95,
                             filter_NA = FALSE)
  # To calculate a correlation between predicted and actual p-values:
  check_P(gwa_sample, HQ_subset = selected_SNPs,
          plot_correlation = FALSE)
  
  # To plot the correlation:
  ## Not run: 
    check_P(gwa_sample, HQ_subset = selected_SNPs,
            plot_correlation = TRUE, plot_if_threshold = FALSE,
            save_name = "sample")
  
## End(Not run)

Convert imputation-status values to the QCGWAS standard

Description

Converts imputation-status data to the standard values used by QC_GWAS

Usage

convert_impstatus(impstatus,
                  T_strings = c("1", "TRUE", "T"),
                  F_strings = c("0", "FALSE", "F"),
                  NA_strings = c(NA, "NA", ".", "-"),
                  use_log = FALSE, ...)
convert_impstatus(impstatus,
                  T_strings = c("1", "TRUE", "T"),
                  F_strings = c("0", "FALSE", "F"),
                  NA_strings = c(NA, "NA", ".", "-"),
                  use_log = FALSE, ...)

Arguments

`impstatus`	vector of imputation-status data.
`T_strings`	character-string(s) indicating imputed data.
`F_strings`	character-string(s) indicating genotyped data.
`NA_strings`	character-string(s) indicating missing data.
`use_log`, `...`	arguments used by `QC_GWAS`; redundant when `convert_impstatus` is used separately.

Details

This function is used to convert the imputation-status column into the standard format, where 0 is genotyped and 1 is imputed. Untranslated values (i.e. strings that do not appear in any of the string arguments) will trigger a warning message and are set to NA.

Value

Numeric vector with values 0 for genotyped, 1 for imputed and NA for unknown data.

Note

The implementation of this function has changed. Previously, only character data was translated using the string arguments. Since version 1.0-4, the string arguments are used for all data types, so the user can determine the conversion of logical and numeric values as well.

However, this does mean that values 1 and 0 are no longer converted automatically. They must be specified in the string arguments, or else the function will report them as untranslated and converts them to NA. The same applies to TRUE, FALSE and NA. Note the difference between the character string "NA" and the value NA.

Finally, if the imputation-status column contains character strings, the main QC function QC_GWAS requires that all values are translated. If not, QC_GWAS will abort the QC.

Examples

  status1 <- c(0,0,0,1,1,1,2,NA)
  status2 <- c(TRUE, FALSE, TRUE, FALSE, NA)
  status3 <- c("imputed", "genotyped", "NA", NA)
  status4 <- c(status1, status2, status3)
  
  ( outcome1 <- convert_impstatus(status1,
                                  T_strings = 1,
                                  F_strings = 0,
                                  NA_strings = NA) )
  # status 1 contains an untranslated value "2", which is
  # converted to NA. To avoid the warning message:
  
  ( outcome1 <- convert_impstatus(status1,
                                  T_strings = 1,
                                  F_strings = 0,
                                  NA_strings = c(NA, 2)) )
  
  
  ( outcome2 <- convert_impstatus(status2,
                                  T_strings = TRUE,
                                  F_strings = FALSE,
                                  NA_strings = NA) )
  
  ( outcome3 <- convert_impstatus(status3,
                                  T_strings = "imputed",
                                  F_strings = "genotyped",
                                  NA_strings = c("NA", NA)) )
  # Note that NA_strings includes both the character-string "NA"
  # and the value NA. Otherwise, one of the two would go 
  # "untranslated" and trigger a warning message.
  
  
  # And to check them all together
  ( outcome4 <- convert_impstatus(status4,
                          T_strings = c(1, TRUE, "imputed"),
                          F_strings = c(0, FALSE, "genotyped"),
                          NA_strings = c("NA", NA, 2)) )
status1 <- c(0,0,0,1,1,1,2,NA)
  status2 <- c(TRUE, FALSE, TRUE, FALSE, NA)
  status3 <- c("imputed", "genotyped", "NA", NA)
  status4 <- c(status1, status2, status3)
  
  ( outcome1 <- convert_impstatus(status1,
                                  T_strings = 1,
                                  F_strings = 0,
                                  NA_strings = NA) )
  # status 1 contains an untranslated value "2", which is
  # converted to NA. To avoid the warning message:
  
  ( outcome1 <- convert_impstatus(status1,
                                  T_strings = 1,
                                  F_strings = 0,
                                  NA_strings = c(NA, 2)) )
  
  
  ( outcome2 <- convert_impstatus(status2,
                                  T_strings = TRUE,
                                  F_strings = FALSE,
                                  NA_strings = NA) )
  
  ( outcome3 <- convert_impstatus(status3,
                                  T_strings = "imputed",
                                  F_strings = "genotyped",
                                  NA_strings = c("NA", NA)) )
  # Note that NA_strings includes both the character-string "NA"
  # and the value NA. Otherwise, one of the two would go 
  # "untranslated" and trigger a warning message.
  
  
  # And to check them all together
  ( outcome4 <- convert_impstatus(status4,
                          T_strings = c(1, TRUE, "imputed"),
                          F_strings = c(0, FALSE, "genotyped"),
                          NA_strings = c("NA", NA, 2)) )

Create an allele-reference file from HapMap data

Description

This function creates the standard allele reference file, as used by QC_GWAS and match_alleles, from data publicly available at the website of the international HapMap project (see 'References').

Usage

create_hapmap_reference(dir = getwd(),
   download_hapmap = FALSE, download_subset,
   hapmap_files = list.files(path = dir, pattern = "freqs_chr"),
   filename = "allele_reference_HapMap",
   save_txt = TRUE, save_rdata = !save_txt,
   return_reference = FALSE)
create_hapmap_reference(dir = getwd(),
   download_hapmap = FALSE, download_subset,
   hapmap_files = list.files(path = dir, pattern = "freqs_chr"),
   filename = "allele_reference_HapMap",
   save_txt = TRUE, save_rdata = !save_txt,
   return_reference = FALSE)

Arguments

`dir`	character string; the directory of the input and output files. Note that R uses forward slash (/) where Windows uses the backslash (\).
`download_hapmap`	logical; if `TRUE`, the required allele-frequency files are downloaded from the HapMap website into `dir`, and then turned into a reference. If `FALSE`, the files specified in `hapmap_files` are used.
`download_subset`	character-string; indicates the population to download for creating the reference. Options are: ASW, CEU, CHB, CHD, GIH, JPT, LWK, MEX, MKK, TSI, YRI.
`hapmap_files`	character vector of the filenames of HapMap frequency-files to be included in the reference. The default option includes all files with the string "freqs_chr" in the filename. (This argument is only used when `download_hapmap` is `FALSE`.)
`filename`	character string; the name of the output file, without file-extension.
`save_txt`, `save_rdata`	logical; should the reference be saved as a tab-delimitated text file and/or an RData file? If saved as RData, the object name `allele_ref_std` is used for the reference table.
`return_reference`	logical; should the function return the reference as it output value?

Details

The function removes SNPs with invalid alleles and with allele frequencies that do not add up to 1. It also removes all instances of duplicate SNPids. If such entries are encountered, a warning is printed in the R console and the entries are saved in a .txt file in the output directory.

Like the QC_GWAS, create_hapmap_reference codes the X chromosome as 23, Y as 24, XY (not available on HapMap website) as 25 and M as 26.

Both the .RData export and the function return store the alleles as factors rather than character strings.

Value

If return_reference is TRUE, the function returns the generated reference table. If FALSE, it returns an invisible NULL.

References

The required data is available at the Website of the International HapMap project, under bulk data downloads > bulk data > frequencies

http://hapmap.ncbi.nlm.nih.gov

The HapMap files downloaded by this function are subject to the HapMap terms and policies. See: http://hapmap.ncbi.nlm.nih.gov/datareleasepolicy.html

Examples

  # This command will download the CEU HapMap dataset and use
  # it to generate an allele-reference. Create a folder
  # "new_hapmap" to store the data and make sure there is
  # sufficient disk space and a reasonably fast internet
  # connection.

  ## Not run: 
    new_hapmap <- create_hapmap_reference(dir = "C:/new_hapmap",
                                download_hapmap = TRUE, download_subset = "CEU",
                                filename = "new_hapmap", save_txt = TRUE,
                                return_reference = TRUE)
  
## End(Not run)
# This command will download the CEU HapMap dataset and use
  # it to generate an allele-reference. Create a folder
  # "new_hapmap" to store the data and make sure there is
  # sufficient disk space and a reasonably fast internet
  # connection.

  ## Not run: 
    new_hapmap <- create_hapmap_reference(dir = "C:/new_hapmap",
                                download_hapmap = TRUE, download_subset = "CEU",
                                filename = "new_hapmap", save_txt = TRUE,
                                return_reference = TRUE)
  
## End(Not run)

Automated filtering and reformatting of GWAS results files

Description

This function was created as a convenient way to automate the removal of low-quality and non-autosomal SNPs. It includes the same formatting options as QC_GWAS.

Usage

filter_GWAS(ini_file,
            GWAS_files, output_names,
            gzip_output = TRUE,
            dir_GWAS = getwd(), dir_output = dir_GWAS,
            FRQ_HQ = NULL, HWE_HQ = NULL,
            cal_HQ = NULL, imp_HQ = NULL,
            FRQ_NA = TRUE, HWE_NA = TRUE,
            cal_NA = TRUE, imp_NA = TRUE,
            ignore_impstatus = FALSE,
            remove_X = FALSE, remove_Y = FALSE,
            remove_XY = FALSE, remove_M = FALSE,
            header_translations,
            check_impstatus = FALSE,
            imputed_T = c("1", "TRUE", "yes", "YES", "y", "Y"),
            imputed_F = c("0", "FALSE", "no", "NO", "n", "N"),
            imputed_NA = NULL,
            column_separators = c("\t", " ", "", ",", ";"),
            header = TRUE, nrows = -1, nrows_test = 1000,
            comment.char = "", na.strings = c("NA", "."),
            out_header = "original", out_quote = FALSE,
            out_sep = "\t", out_eol = "\n", out_na = "NA",
            out_dec = ".", out_qmethod = "escape",
            out_rownames = FALSE, out_colnames = TRUE, ...)
filter_GWAS(ini_file,
            GWAS_files, output_names,
            gzip_output = TRUE,
            dir_GWAS = getwd(), dir_output = dir_GWAS,
            FRQ_HQ = NULL, HWE_HQ = NULL,
            cal_HQ = NULL, imp_HQ = NULL,
            FRQ_NA = TRUE, HWE_NA = TRUE,
            cal_NA = TRUE, imp_NA = TRUE,
            ignore_impstatus = FALSE,
            remove_X = FALSE, remove_Y = FALSE,
            remove_XY = FALSE, remove_M = FALSE,
            header_translations,
            check_impstatus = FALSE,
            imputed_T = c("1", "TRUE", "yes", "YES", "y", "Y"),
            imputed_F = c("0", "FALSE", "no", "NO", "n", "N"),
            imputed_NA = NULL,
            column_separators = c("\t", " ", "", ",", ";"),
            header = TRUE, nrows = -1, nrows_test = 1000,
            comment.char = "", na.strings = c("NA", "."),
            out_header = "original", out_quote = FALSE,
            out_sep = "\t", out_eol = "\n", out_na = "NA",
            out_dec = ".", out_qmethod = "escape",
            out_rownames = FALSE, out_colnames = TRUE, ...)

Arguments

`ini_file`	(the filename of) a table listing the files to be processed and the filters to be applied. See 'Details'.
`GWAS_files`	character vector: when no `ini_file` is provided, this identifies the files to be processed. See 'Details'.
`output_names`	character vector: the filenames for the output files. The default option is to use the input filenames. Note that, unlike with other `QCGWAS` functions, the file extensions should be included (However, the function will automatically add `".gz"` when the files are compressed.
`gzip_output`	logical; should the output files be compressed?
`dir_GWAS`, `dir_output`	character-strings specifying the directory address of the folders for the input files and the output, respectively. Note that R uses forward slash (/) where Windows uses backslash (\).
`FRQ_HQ`, `HWE_HQ`, `cal_HQ`, `imp_HQ`	Numeric vectors. When no `ini_file` is provided, these arguments specify the filter threshold-values for allele frequency, HWE p-value, callrate and imputation quality, respectively. Passed to `HQ_filter`.
`FRQ_NA`, `HWE_NA`, `cal_NA`, `imp_NA`	Logical vectors. When no `ini_file` is provided, these arguments specify whether missing values (of allele frequency, HWE p-value, callrates and imputation quality, respectively) are excluded (`TRUE`) or ignored (`FALSE`). Passed to `HQ_filter`.
`ignore_impstatus`	Logical vector. When no `ini_file` is provided, this argument specifies whether imputation status is taken into account when applying the filters. If `FALSE`, HWE p-value and callrate filters are applied only to genotyped SNPs, and imputation quality filters only to imputed SNPs. If `TRUE`, the filters are applied to all SNPs regardless of the imputation status.
`remove_X`, `remove_Y`, `remove_XY`, `remove_M`	logical; respectively whether X-chromosome, Y-chromosome, pseudo-autosomal and mitochondrial SNPs are removed. Note: these arguments accept only a single `TRUE` or `FALSE` value. Unlike the above settings, it's not possible to specify them independently for every dataset.
`header_translations`	translation table for column names. See `translate_header` for more information. If the argument is left empty, `dataset` is assumed to use the standard column-names used by `QC_GWAS`.
`check_impstatus`	logical; should `convert_impstatus` be called to convert the imputation-status column into standard values?
`imputed_T`, `imputed_F`, `imputed_NA`	arguments passed to `convert_impstatus`.
`column_separators`	character string or vector; specifies the values used as column delimitator in the GWAS file(s). The argument is passed to `load_test`; see the description of that function for more information.
`nrows_test`	integer; the number of rows used for "trial-loading". Before loading the entire dataset, the function `load_test` is called to determine the dataset's file-format by reading the top `x` lines, where `x` is `nrows_test`. Setting `nrows_test` to a low number (e.g. `150`) means quick testing, but runs the risk of missing problems in lower rows. To test the entire dataset, set it to `-1`.
`header`, `nrows`, `comment.char`, `na.strings`, `...`	arguments passed to `read.table` when importing the dataset.
`out_header`	Translation table for the column names of the output file. This argument is the opposite of `header_translations`: it translates the standard column-names of `QC_GWAS` to user-defined ones. `output_header` can be one of three things: A user specified table similar to the one used by `translate_header`. However, as this translates standard names into non-standard ones, the standard names should be in the right column, and the desired ones in the left. There is also no requirement for the names in the left column to be uppercase. The name of a file in `dir_GWAS` containing such a table. Character string specifying a standard form. See `QC_GWAS`, section 'Output header' for the options.
`out_quote`, `out_sep`, `out_eol`, `out_na`, `out_dec`, `out_qmethod`, `out_rownames`, `out_colnames`	arguments passed to `write.table` when saving the final dataset.

Details

The easiest way to use filter_GWAS is by passing an ini file to the ini_file argument. The ini file can be generated by running QC_series with the save_filtersettings argument set to TRUE. The output will include a file 'Check_filtersettings.txt', describing the (high-quality) filter settings used for each file (taking into account whether there was enough data, i.e. whether the use_threshold was met, to apply the filters).

The ini_file argument accepts both a table or the name of a file in dir_GWAS or the current R working directory.

If no ini_file is specified, the function will use the GWAS_files, x_HQ, x_NA and ignore_impstatus arguments to construct such a table. GWAS_files can either be a character vector or a single value. If a single string, all filenames containing the string will be processed. The other arguments can also be a vector or a single value; if the latter, they will be recycled to create a vector of the correct length.

If neither ini_file nor GWAS_files are specified, the function will look for a file Check_filtersettings.txt in dir_GWAS and the current R working directory.

Note that ini_file overrules the other filter settings, i.e. one cannot adjust ini_file through the other arguments.

Value

An invisible logical vector, indicating which files were successfully filtered.

Note

R is not the optimal platform for filtering GWAS files. This function was added at the request of a user, but an UNIX script is likely to be faster.

Sample dataset for the QCGWAS package

Description

A fake GWAS results dataset for use in the examples.

Usage

data(gwa_sample)data(gwa_sample)

Format

A data frame with 10000 observations on the following 15 variables.

MARKER: a character vector; the SNP IDs.
STRAND: a character vector; the DNA strand of the listed alleles.
CHR: a character vector; the chromosome of the listed SNP.
POSITION: a numeric vector; the basepair position of the listed SNP.
EFFECT_ALL: a character vector; the effect allele.
OTHER_ALL: a character vector; the non-effect allele.
N_TOTAL: a numeric vector; the sample size.
EFF_ALL_FREQ: a numeric vector; frequency of the effect-allele in the dataset.
HWE_PVAL: a numeric vector; Hardy-Weinberg (HWE) p-value of the listed alleles.
CALLRATE: a numeric vector; SNP callrate.
EFFECT: a numeric vector; effect size of the listed effect allele.
STDERR: a numeric vector; standard error of the effect.
PVALUE: a numeric vector; p-value of the effect.
IMPUTED: a numeric vector; imputation status - i.e. whether the SNP is genotyped (0) or imputed (1).
IMP_QUALITY: a numeric vector; imputation quality.

Translation table for GWAS dataset headers

Description

This is a sample translation table, as used by translate_header to translate dataset column names to the QCGWAS standard. An editable .txt version can be found in "R\library\QCGWAS\doc".

Usage

data(header_translations)data(header_translations)

Format

A data frame with 104 observations on the following 2 variables.

STANDARD: a character vector; standard column names as used by QCGWAS.
ALTERNATIVE: a character vector; alternative column names, as used in the datasets.

Select high-quality data in GWAS datasets

Description

This function accepts a QC_GWAS dataset and returns a vector of logical values indicating which entries meet the quality criteria.

Usage

HQ_filter(data,
          ignore_impstatus = FALSE,
          FRQ_val = NULL, HWE_val = NULL,
          cal_val = NULL, imp_val = NULL,
          filter_NA = TRUE,
          FRQ_NA = filter_NA, HWE_NA = filter_NA,
          cal_NA = filter_NA, imp_NA = filter_NA)
HQ_filter(data,
          ignore_impstatus = FALSE,
          FRQ_val = NULL, HWE_val = NULL,
          cal_val = NULL, imp_val = NULL,
          filter_NA = TRUE,
          FRQ_NA = filter_NA, HWE_NA = filter_NA,
          cal_NA = filter_NA, imp_NA = filter_NA)

Arguments

`data`	table to be filtered. `HQ_filter` assumes the dataset uses the standard `QC_GWAS` column names.
`ignore_impstatus`	logical; if `FALSE`, HWE p-value and callrate filters are applied only to genotyped SNPs, and imputation quality filters only to imputed SNPs. If `TRUE`, the filters are applied to all SNPs regardless of the imputation status.
`FRQ_val`, `HWE_val`, `cal_val`, `imp_val`	numeric; the minimal required value for allele frequency, HWE p-value, callrate and imputation quality respectively. Note that the allele-frequency filter is two-sided: for a filter-value of `x`, it will exclude entries with `freq < x` and `freq > 1 - x`.
`filter_NA`	logical; if `TRUE`, then missing filter variables will be excluded; if `FALSE`, they will be ignored. `filter_NA` is the default setting for all variables. Variable-specific settings can be specified with the following arguments.
`FRQ_NA`, `HWE_NA`, `cal_NA`, `imp_NA`	logical; variable-specific settings for `filter_NA`.

Details

A SNP is considered high-quality if it meets all quality criteria. The thresholds are inclusive; i.e. SNPs that have a value equal or higher than the threshold will be considered high-quality.

To filter missing values only, set the filter argument to NA, and the corresponding NA-filter to TRUE.

To disable filtering entirely, set to NULL. This disables the filtering of missing values as well.

When imputation status is missing or invalid (and ignore_impstatus is FALSE), only the allele-frequency filter will be applied.

Value

A vector of logical values, indicating which values in data meet (TRUE) or fail (FALSE) the quality criteria.

Note

The table entered in the data argument must use the standard column names of QC_GWAS. Functions using HQ_filter usually allow the user to specify a translation table. If not, translate_header can be used to translate the header manually.

Examples

  data("gwa_sample")

  selected_SNPs <- HQ_filter(data = gwa_sample,
                             FRQ_val = 0.01,
                             cal_val = 0.95,
                             filter_NA = FALSE)
  summary(gwa_sample[selected_SNPs, ])
  
  selected_SNPs <- HQ_filter(data = gwa_sample,
                             FRQ_val = 0.01,
                             cal_val = 0.95,
                             filter_NA = FALSE,
                             ignore_impstatus = TRUE)
  summary(gwa_sample[selected_SNPs, ])
data("gwa_sample")

  selected_SNPs <- HQ_filter(data = gwa_sample,
                             FRQ_val = 0.01,
                             cal_val = 0.95,
                             filter_NA = FALSE)
  summary(gwa_sample[selected_SNPs, ])
  
  selected_SNPs <- HQ_filter(data = gwa_sample,
                             FRQ_val = 0.01,
                             cal_val = 0.95,
                             filter_NA = FALSE,
                             ignore_impstatus = TRUE)
  summary(gwa_sample[selected_SNPs, ])

Identify non-standard column names

Description

This function is a subroutine of translate_header and several other functions. It is used to translate non-standard column names into standard ones.

Usage

identify_column(std_name, alt_names, header)
identify_column(std_name, alt_names, header)

Arguments

`std_name`	character string; the standard name for the data-column.
`alt_names`	translation table; with in the left column the standard name(s) and in the right column possible alternatives. See `translate_header` for more details.
`header`	the column names of the dataset. The names should be entirely in uppercase.

Details

The purpose of identify_column is essentially to look up in the translation table (alt_names) which of the names in header can be translated into std_name.

Value

An integer vector of the entry(s) in header (i.e. the column-numbers) that can be translated into std_name.

Examples

  sample_data <-
    data.frame(SNP = paste("rs", 1:10, sep = ""),
               chrom = 2,
               effect = 1:10/10,
               misc = NA)
  sample_header <- toupper(names(sample_data))


  alt_headers <-
    data.frame(
      standard = c("MARKER", "MARKER", "CHR", "CHR"),
      alternative = c("MARKER", "SNP", "CHR", "CHROM"),
      stringsAsFactors = FALSE)
  
  identify_column(std_name = "EFFECT", alt_names = alt_headers,
                  header = sample_header)
  identify_column(std_name = "MARKER", alt_names = alt_headers,
                  header = sample_header)
  identify_column(std_name = "CHR", alt_names = alt_headers,
                  header = sample_header)
  identify_column(std_name = "MISC", alt_names = alt_headers,
                  header = sample_header)
sample_data <-
    data.frame(SNP = paste("rs", 1:10, sep = ""),
               chrom = 2,
               effect = 1:10/10,
               misc = NA)
  sample_header <- toupper(names(sample_data))


  alt_headers <-
    data.frame(
      standard = c("MARKER", "MARKER", "CHR", "CHR"),
      alternative = c("MARKER", "SNP", "CHR", "CHROM"),
      stringsAsFactors = FALSE)
  
  identify_column(std_name = "EFFECT", alt_names = alt_headers,
                  header = sample_header)
  identify_column(std_name = "MARKER", alt_names = alt_headers,
                  header = sample_header)
  identify_column(std_name = "CHR", alt_names = alt_headers,
                  header = sample_header)
  identify_column(std_name = "MISC", alt_names = alt_headers,
                  header = sample_header)

Generates an intensity plot from x, y datasets

Description

This function is used by match_alleles to generate an intensity plot. This function is currently only partially implemented, so we recommend that users do not bother with it.

Usage

intensity_plot(x, y, strata, nbin = 20,
               xmax = max(x), xmin = min(x),
               ymax = max(y), ymin = min(y),
               strata_colours = c("black", "red", "turquoise3"),
               verbose = TRUE, xlab = "x", ylab = "y", ...)
intensity_plot(x, y, strata, nbin = 20,
               xmax = max(x), xmin = min(x),
               ymax = max(y), ymin = min(y),
               strata_colours = c("black", "red", "turquoise3"),
               verbose = TRUE, xlab = "x", ylab = "y", ...)

Arguments

`x`, `y`	numerical vectors; the x and y coordiantes of the datapoints.
`strata`	logical vector; indicates whether the datapoint belongs to strata 1 or 2. If missing, all datapoints are assumed to belong to strata 1.
`nbin`	integer: the number of bins (categories) on the x and y axis.
`xmax`, `xmin`, `ymax`, `ymin`	numeric; the range of x and y values shown in the plot.
`strata_colours`	character vector of length `3`, indicating the colours to be used for entries of strata 1, 2 or mixed, respectively.
`verbose`	logical; determines whether a warning is printed in the console when datapoints are removed.
`xlab`, `ylab`, `...`	arguments passed to `plot`.

Details

This function is intended as an alternative to the standard scatter plot for the allele-frequency correlation graph.

Value

An invisible object of type list with the following two components:

`NA_removed`	The number of entries removed due to missing values
`outliers_removed`	The number of entries removed because they exceeded the thresholds specified by the min/max arguments.

Easy loading of GWAS results files

Description

load_GWAS is wrapper-function of read.table that makes loading large GWAS results files less of a hassle. It automatically unpacks .zip and .gz files and uses load_test to determine which column separator the file uses.

Usage

load_GWAS(filename, dir = getwd(),
          column_separators = c("\t", " ", "", ",", ";"),
          test_nrows = 1000,
          header = TRUE, nrows = -1,
          comment.char = "", na.strings = c("NA", "."),
          stringsAsFactors = FALSE, ...)

load_test(filename, dir = getwd(),
          column_separators = c("\t", " ", "", ",", ";"),
          test_nrows = 1000, ...)
load_GWAS(filename, dir = getwd(),
          column_separators = c("\t", " ", "", ",", ";"),
          test_nrows = 1000,
          header = TRUE, nrows = -1,
          comment.char = "", na.strings = c("NA", "."),
          stringsAsFactors = FALSE, ...)

load_test(filename, dir = getwd(),
          column_separators = c("\t", " ", "", ",", ";"),
          test_nrows = 1000, ...)

Arguments

`filename`	character string; the complete filename of the file to be loaded. Note that compressed files (.gz or .zip files) can only be unpacked if the filename of the archive contains the extension of the archived file. For example, if the archived file is named `"data1.csv"`, the archive should be `"data1.csv.zip"`.
`dir`	character string; the directory containing the file. Note that R uses forward slash (/) where Windows uses backslash (\).
`column_separators`	character string or vector of the column-separators to be tried by `load_test`. White-space can be specified by `""`, but it is recommended you try tab (`"\t"`) and space (`" "`) first.
`test_nrows`	integer; the number of lines that `load_test` checks in the trail-load. A smaller number means faster loading, but also makes it more likely that errors slip through. To check the entire dataset, set to `-1`.
`header`, `nrows`, `comment.char`, `na.strings`, `stringsAsFactors`, `...`	Arguments passed to `read.table`.

Details

load_test determines the correct column separator simply by trying them individually until it finds one that works (that is: one that results in a dataset with an equal number of cells in every row AND at least five or more columns). If none work, it reports the error-message generated by the last column separator tried.

The column separators are tried in the order specified by the column_separators argument.

By default, load_test only checks the first 1000 lines (adjustable by the test_nrows argument); if the problem lies further down in the dataset, it will not catch it. In such a case, load_GWAS and QC_GWAS will crash when attempting to load the dataset.

A common problem is employing white-space ("") as column separator for a file that uses empty fields to indicate missing values. The separators surrounding an empty field are adjacent, so R parses them as a single column separator. In this particular example, specifying a single space (" ") or tab ("\t") as column separator solves the problem (this is why the default setting of column_separators puts these values before white-space).

Value

load_GWAS returns the table imported from the specified file.

load_test returns a list with 4 components:

`success`	logical; whether `load_test` was able to load a dataset with five or more columns.
`error`	character string; if unable to load the file, this returns the error-message of the last column separator to be tried.
`file_type`	character string; the last three characters of `filename`.
`sep`	the first column-separator that succeeded in loading a dataset with five or more columns.

Note

load_GWAS uses the same default loading-settings as QC_GWAS. load_test, on the other hand, has no default values for header, comment.char, na.strings and stringsAsFactors, and uses the read.table defaults instead.

Examples

  ## As the function requires a GWAS file to work,
  ## the following code should be adjusted before execution.
  ## Because this is a demonstration, the nrows argument is used
  ## to read only the first 100 rows.
  
  ## Not run: 
   data_GWAS <-
      load_GWAS("GWA_results1.txt.zip",
                dir = "C:/GWAS_results",
                nrows = 100)
  
## End(Not run)
## As the function requires a GWAS file to work,
  ## the following code should be adjusted before execution.
  ## Because this is a demonstration, the nrows argument is used
  ## to read only the first 100 rows.
  
  ## Not run: 
   data_GWAS <-
      load_GWAS("GWA_results1.txt.zip",
                dir = "C:/GWAS_results",
                nrows = 100)
  
## End(Not run)

Check and correct alleles in GWAS result files

Description

This function checks the reported alleles and allele frequencies in GWAS results data by comparing them to a reference table. It will also uniformize the dataset by switching all SNPs to the positive strand and flipping the alleles so that the effect allele matches the reference minor allele.

Usage

match_alleles(dataset, ref_set, HQ_subset,
              dataname = "dataset", ref_name = "reference",
       unmatched_data = !all(dataset$MARKER %in% ref_set$SNP),
              check_strand = FALSE,
              save_mismatches = TRUE, delete_mismatches = FALSE,
              delete_diffEAF = FALSE, threshold_diffEAF = 0.15,
              check_FRQ = TRUE, check_ambiguous = FALSE,
              plot_FRQ = FALSE, plot_intensity = FALSE, 
              plot_if_threshold = FALSE,
              threshold_r = 0.95,
              return_SNPs = FALSE, return_ref_values = FALSE,
              header_translations, header_reference,
              save_name = dataname, save_dir = getwd(),
              use_log = FALSE, log_SNPall = nrow(dataset))
match_alleles(dataset, ref_set, HQ_subset,
              dataname = "dataset", ref_name = "reference",
       unmatched_data = !all(dataset$MARKER %in% ref_set$SNP),
              check_strand = FALSE,
              save_mismatches = TRUE, delete_mismatches = FALSE,
              delete_diffEAF = FALSE, threshold_diffEAF = 0.15,
              check_FRQ = TRUE, check_ambiguous = FALSE,
              plot_FRQ = FALSE, plot_intensity = FALSE, 
              plot_if_threshold = FALSE,
              threshold_r = 0.95,
              return_SNPs = FALSE, return_ref_values = FALSE,
              header_translations, header_reference,
              save_name = dataname, save_dir = getwd(),
              use_log = FALSE, log_SNPall = nrow(dataset))

Arguments

`dataset`	table containing the allele data. `dataset` should always contain columns for the SNPID, the effect allele and the other allele. Strand and allele frequency may be required, depending upon the settings, while effect size is optional. `match_alleles` accepts non-standard column names, provided a translation table is specified in `header_translations`. The order of columns does not matter; nor does the presence of other columns.
`ref_set`	table containing the reference data. `ref_set` should always contain columns for the SNPID, the minor allele and the major allele. Minor allele frequency is only required if `check_FRQ` or `check_ambiguous` are `TRUE`. The standard column-names are `"SNP"`, `"MINOR"`, `"MAJOR"` and `"MAF"`. Non-standard column names are accepted if a translation table is specified in `header_reference`. All SNPs must be aligned to the positive strand.
`HQ_subset`	an optional logical or numeric vector indicating the rows in `dataset` that contain high quality SNPs.
`dataname`, `ref_name`	character strings; the names of the dataset and reference, respectively. Used as identifiers in the output files.
`unmatched_data`	logical; are there SNPs in the dataset that do not appear in the reference? This argument is currently redundant: the function will determine it automatically.
`check_strand`	logical; should the function check for negative-strand SNPs? If `FALSE`, all SNPs are assumed to be on the positive strand.
`save_mismatches`	logical; should mismatching entries be exported to a .txt file before they are corrected?
`delete_mismatches`	logical; should mismatching SNPs (that could not be corrected by strand-switching) have their effect allele set to missing?
`delete_diffEAF`	logical; should SNPs that exceed the `threshold_diffEAF` have their effect allele set to missing?
`threshold_diffEAF`	numeric; the max. allowed difference between reported and reference allele frequency.
`check_FRQ`	logical; should the function correlate the reported allele-frequency with that of the reference?
`check_ambiguous`	logical; should the function do separate frequency correlations and create separate plots for SNPs with a strand-independent allele-pair (i.e. an A/T or C/G configuration)?
`plot_FRQ`	logical; should a scatterplot of reported vs. reference allele-frequency be made?
`plot_intensity`	logical; if `TRUE`, instead of a scatterplot an intensity plot is generated. This option is currently only partially implemented. Leave to `FALSE` for now.
`plot_if_threshold`	logical; if `TRUE`, the scatterplot is only made when frequency correlation is below the threshold specified by `threshold_r`.
`threshold_r`	numeric; the correlation threshold value.
`return_SNPs`	logical; should the return value include the relevant columns of `dataset`?
`return_ref_values`	logical; should the return-value include the matching entries in `ref_set`?
`header_translations`, `header_reference`	translation tables for converting the column names of `dataset` and `ref_set` to standard names, respectively. See `translate_header` for more information.
`save_name`	character string; the filename, without extension, for the various output files.
`save_dir`	character string; the directory where the output files are saved. Note that R uses forward slash (/) where Windows uses backslash (\).
`use_log`, `log_SNPall`	arguments used by `QC_GWAS`; redundant when `match_alleles` is used separately.

Details

match_alleles is one of the more complicated functions of QCGWAS. However, what it does is quite simple:

Check for incorrect allele pairs
Uniformize the output so that all SNPs are on the positive strand, and identical SNPs will have the same effect allele and other allele in all datasets
Check the allele frequency

The complexity stems from the fact that these three tasks have to be carried out together and often overlap. So the actual function schematic looks like this:

Switch negative-strand SNPs (i.e. SNPs with "-" in the strand-column) to the positive strand. This step can be disabled by setting check-strand to FALSE.
Correct (if possible) mismatching alleles. A mismatch is when the reported allele-pair does not match that in the reference. match_alleles will attempt to fix the mismatch by "strand-switching" the alleles. The assumption is that dataset merely reported the wrong strand, so converting them to the opposing strand should solve the mismatch. If it does not, the SNPs are truly mismatches. If delete_mismatches is TRUE, the effect alleles are set to NA; if FALSE, they are restored to their original configuration and excluded from the allele-frequency test. If save_mismatches is TRUE, the entries are exported in a .txt before being changed. Note that save_mismatches only exports true mismatches (i.e. not those that were fixed after strand-switching).
Align the allele-pairs with the reference. In order to have the same effect allele with the same SNP in every dataset, SNPs are "flipped" so that the effect allele matches the reference minor allele. Flipped alleles will also have their allele frequency and effect size inverted.
Check for undetected strand-mismatch by counting the number of ambiguous and (if check_FRQ is TRUE) suspect SNPs. Ambiguous SNPs are SNPs with an allele-pair that is identical on both strands (i.e. A/T or C/G). Suspect SNPs are ambiguous SNPs whose allele-frequency differs strongly from that of the reference.
Check allele-frequencies by correlating and/or plotting them against the reference. If check_ambiguous is TRUE, additional scatterplots will be made for the subsets of ambiguous and non-ambiguous SNPs. If delete_diffEAF is TRUE, SNPs whose allele-frequency differs from the reference by more than threshold_diffEAF have their effect alleles set to NA as well. This entire step can be disabled by setting check_FRQ to FALSE.

Value

An object of class 'list' with the following components:

`FRQ_cor`, `FRQ_cor_ambiguous`, `FRQ_cor_nonambi`	Allele-frequency correlations for all, ambiguous, and non-ambiguous SNPs respectively.
`n_SNPs`	Total number of SNPs in `dataset`
`n_missing`, `n_missing_data`, `n_missing_ref`	Number of SNPs with missing allele-data in either `dataset` or `reference`, dataset only, and reference only, respectively.
`n_negative_strand`, `n_negative_switch`, `n_negative_mismatch`	Number of negative-strand SNPs, the subset of negative-strand SNPs that were strand-switched twice because they did not match the reference, and the subset of double-switched SNPs that were still mismatching after the second strand-switch.
`n_strandswitch`, `n_mismatch`	Number of SNPs that was strand-switched because they did not match the reference, and the subset of those that still did not match after the strand-switch.
`n_flipped`	Number of SNPs whose alleles were flipped to align them with the reference.
`n_ambiguous`, `n_suspect`	Number of ambiguous SNPs, and the subset of those that had a large allele-frequency aberration.
`n_diffEAF`	Number of SNPs whose allele-frequency differs from the reference by more than `threshold_diffEAF`.
`MARKER`	When `return_SNPs` and/or `return_ref_values` is `TRUE`, this returns the column of `dataset` containing the SNP IDs. If not, this returns `NULL`.
`EFFECT_ALL`, `OTHER_ALL`, `STRAND`, `EFFECT`, `EFF_ALL_FREQ`	If `return_SNPs` is `TRUE`, these elements return the corrected columns of data-set. If `FALSE`, these return `NULL`. Note: `match_alleles` only returns those columns that were checked; if `check_FRQ` is `FALSE`, `EFF_ALL_FREQ` return `NULL`. The same goes for `check_strand` and `STAND`. `EFFECT` is only returned if present in `dataset`.
`ref_MINOR`, `ref_MAJOR`, `ref_MAF`	If `return_ref_values` is `TRUE`, these elements return the reference minor and major alleles and allele frequency column for the SNPs in `MARKER`.If `FALSE`, these return `NULL`. `ref_MAF` is only returned when `check_FRQ` is `TRUE`.

Interpreting the output

The output of match_alleles may seem a bit overwhelming at first, so here is a short explanation of what it means and what you should pay attention to.

The columns included in the return value when return_SNPs is TRUE are the post-matching dataset. This is only relevant if you want to continue working with the corrected dataset. Similarly, the output of return_ref_values is only used for comparing the post-matching dataset to the reference.

n_missing, n_missing_data and n_missing_ref report the prevalence of missing allele data, but are otherwise irrelevant.

The majority of return values serve to check whether strand-switching was performed correctly. n_strandswitch indicates how many SNPs were converted to the other strand because of a mismatch with the reference. In our experience, many cohorts do not include stand data, or simply set all SNPs to "+", so the presence of strand-switched SNPs isn't an indicator of problems by itself. However, if the strand-switching did not fix the mismatch, there may a problem. The subset of strand-switched SNPs that could not be fixed is reported as n_mismatch, and indicates incorrect allele data or, possibly, triallelic SNPs.

Depending on the argument save_mismatches, mismatching entries are exported as a .txt file, together with the reference data. This allows the user to see which SNPs are affected.

Another sign of trouble is when negative-strand SNPs (n_negative_strand) are present (i.e. the cohort included real strand data, rather just setting it to "+"), but strand-switching still occurred. Negative-strand SNPs are converted to the positive strand before their alleles are compared to the reference, so they should not appear here. If they do, it means that either the strand-column data is incorrect, or it is an ordinary mismatch (see above).

Negative-strand SNPs that are "strand-switched" will revert to their original allele configuration (but the strand column now reports them as being on positive strand). The output of QC_GWAS calls them double strand-switches, but here they are reported as n_negative_switch. The subset of those that could not be fixed is n_negative_mismatch.

Just to be clear: the relevant output is still n_strandswitch and n_negative_strand, not n_negative_switch and n_negative_mismatch. Whether it was the negative-strand SNPs that were switched or not is not important: the important thing is that there were negative-strand SNPs (i.e. the cohort included real strand data rather setting everything to "+"); yet strand-switches were still necessary and cannot be attributed to mismatch (i.e. the strand data is incorrect).

n_flipped counts how many SNPs had their alleles reversed to match the effect allele with the reference minor-allele. This is merely recorded for "administrative" purposes, and shouldn't concern the user.

n_ambiguous and n_suspect are another test of the strand information. Ambiguous SNPs are SNPs that have the same allele pair on the positive and negative strands (i.e. A/T or C/G). Matching them with the allele-reference therefor won't detected incorrect strand-information. In a normal-sized GWAS results file, about 15% of SNPs will be ambiguous.

Suspect SNPs are the subset of ambiguous SNPs whose allele frequency is significantly different from that in the reference ( < 0.35. vs > 0.65 or visa versa). In our experience, a GWAS results file with 2.5M SNPs will have only a few dozen suspect SNPs. However, if it's a sizable proportion of all ambiguous SNPs, it indicates that the ambiguous SNPs are listed for the wrong strand. This will also have resulted in the wrong SNPs being flipped in the previous step, so it should be visible in the allele-frequency correlation test as well.

n_diffEAF counts SNPs with significantly different allele-frequencies. A large number here indicates either that the allele-frequencies are incorrect or listed for the wrong allele (see below), or that the population used in the dataset does not match that of the reference.

The FRQ_cor value is the correlation between the reported and reference allele-frequencies. If allele frequency is correct, the correlation should be near 1. If it's close to -1, the listed frequency is that of the other (i.e. non-effect) allele.

the FRQ_cor_ambiguous and FRQ_cor_nonambi values are the same test for the subsets of ambiguous and non-ambiguous SNPs. If ambiguous SNPs are listed on the wrong strand, then they will have been flipped incorrectly, so allele-frequency correlation should also move towards -1.

Note

The function does not delete SNPs, regardless of the delete_mismatches delete_diffEAF arguments. Setting these to TRUE means that any such SNPs are marked by having their effect allele set to NA. The actual deletion takes place inside QC_GWAS.

Examples

# In order to keep the QCGWAS package small, no allele reference
# is included. Use the create_hapmap_reference function (see
# above) to create one.

## Not run: 
  data("gwa_sample")
  hapmap_ref <- read.table("C:/new_hapmap/new_hapmap.txt",
                  header = TRUE, stringsAsFactors = FALSE)

  match_alleles(gwa_sample, hapmap_ref,
    dataname = "sample data", ref_name = "HapMap",
    save_name = "test_allele1", save_dir = "C:/new_hapmap",
    check_strand = TRUE, plot_FRQ = TRUE)

 HQ_SNPs <- HQ_filter(data = gwa_sample, filter_NA = TRUE,
                          filter_FRQ = 0.01, filter_cal = 0.95)
  match_alleles(gwa_sample, hapmap_ref,
    HQ_subset = HQ_SNPs,
    dataname = "sample data", ref_name = "HapMap",
    save_name = "test_allele2", save_dir = "C:/new_hapmap",
    check_strand = TRUE, plot_FRQ = TRUE)

  match_output <-
    match_alleles(gwa_sample, hapmap_ref,
      HQ_subset = HQ_SNPs,
      delete_mismatches = TRUE, return_SNPs = TRUE,
      delete_diffEAF = TRUE, threshold_diffEAF = 0.15,
      dataname = "sample data", ref_name = "HapMap",
      save_name = "test_allele3", save_dir = "C:/new_hapmap",
      check_strand = TRUE, plot_FRQ = TRUE)
  
  if(sum(match_output$n_negative_strand,
         match_output$n_strandswitch, match_output$n_mismatch,
         match_output$n_flipped, match_output$n_diffEAF) > 0){
    gwa_sample$EFFECT_ALL   <- match_output$EFFECT_ALL
    gwa_sample$OTHER_ALL    <- match_output$OTHER_ALL
    gwa_sample$STRAND       <- match_output$STRAND
    gwa_sample$EFFECT       <- match_output$EFFECT
    gwa_sample$EFF_ALL_FREQ <- match_output$EFF_ALL_FREQ
  }
  
## End(Not run)
# In order to keep the QCGWAS package small, no allele reference
# is included. Use the create_hapmap_reference function (see
# above) to create one.

## Not run: 
  data("gwa_sample")
  hapmap_ref <- read.table("C:/new_hapmap/new_hapmap.txt",
                  header = TRUE, stringsAsFactors = FALSE)

  match_alleles(gwa_sample, hapmap_ref,
    dataname = "sample data", ref_name = "HapMap",
    save_name = "test_allele1", save_dir = "C:/new_hapmap",
    check_strand = TRUE, plot_FRQ = TRUE)

 HQ_SNPs <- HQ_filter(data = gwa_sample, filter_NA = TRUE,
                          filter_FRQ = 0.01, filter_cal = 0.95)
  match_alleles(gwa_sample, hapmap_ref,
    HQ_subset = HQ_SNPs,
    dataname = "sample data", ref_name = "HapMap",
    save_name = "test_allele2", save_dir = "C:/new_hapmap",
    check_strand = TRUE, plot_FRQ = TRUE)

  match_output <-
    match_alleles(gwa_sample, hapmap_ref,
      HQ_subset = HQ_SNPs,
      delete_mismatches = TRUE, return_SNPs = TRUE,
      delete_diffEAF = TRUE, threshold_diffEAF = 0.15,
      dataname = "sample data", ref_name = "HapMap",
      save_name = "test_allele3", save_dir = "C:/new_hapmap",
      check_strand = TRUE, plot_FRQ = TRUE)
  
  if(sum(match_output$n_negative_strand,
         match_output$n_strandswitch, match_output$n_mismatch,
         match_output$n_flipped, match_output$n_diffEAF) > 0){
    gwa_sample$EFFECT_ALL   <- match_output$EFFECT_ALL
    gwa_sample$OTHER_ALL    <- match_output$OTHER_ALL
    gwa_sample$STRAND       <- match_output$STRAND
    gwa_sample$EFFECT       <- match_output$EFFECT
    gwa_sample$EFF_ALL_FREQ <- match_output$EFF_ALL_FREQ
  }
  
## End(Not run)

GWAS effect-Size distribution plot

Description

This function generates the effect-size distribution boxplot created by QC_series.

Usage

plot_distribution(data_table,
                  names = 1:ncol(data_table),
                  include = TRUE,
                  plot_order = 1:ncol(data_table),
                  quantile_lines = FALSE,
                  save_name = "Graph_distribution",
                  save_dir = getwd(), ...)
plot_distribution(data_table,
                  names = 1:ncol(data_table),
                  include = TRUE,
                  plot_order = 1:ncol(data_table),
                  quantile_lines = FALSE,
                  save_name = "Graph_distribution",
                  save_dir = getwd(), ...)

Arguments

`data_table`	table with a column of effect sizes for every dataset.
`names`	vector; the names for the datasets, for use in the graph. Note: it's best to keep these very short, as long labels won't be plotted. The default is the column numbers of `data_table`.
`include`	logical vector indicating which columns of `data_table` are included in the plot. The default setting is to include all.
`plot_order`	numeric vector determining the left-to-right order of plotting the datasets (columns). `QC_series` uses the sample size for this.
`quantile_lines`	logical; should lines representing the median and quartile values be included?
`save_name`	character string; the filename, without extension, for the graph file.
`save_dir`	character string; the directory where the graph is saved. Note that R uses forward slash (/) where Windows uses backslash (\).
`...`	arguments passed to `boxplot`.

Details

When running a QC over multiple files, QC_series collects the values of the effectsize_HQ output of QC_GWAS in a table, which is then passed to this function. If there are significant differences in the distribution of effect sizes, it usually indicates that the datasets did not use the same model or unit.

Value

An invisible NULL.

Note

There is a known bug with this function when called by QC_series. As input for names, QC_series pastes together a shortened filename and a "N = x" string giving the dataset's sample size.

The filenames are truncated to the first unique element; e.g. files "cohortX_male_HB.txt" and "cohortX_female_HB.txt" become "cohortX_male; N = 608" and "cohortX_female; N = 643", respectively. However, if the unique element is longer than approx. 15 characters, the label is too long to be plotted. The only solution is to change the filenames prior to passing the files to QC_series.

Examples

## Not run: 
  data("gwa_sample")

  chunk1 <- gwa_sample$EFFECT[1:1000]
  chunk2 <- gwa_sample$EFFECT[1001:2000]
  chunk3 <- gwa_sample$EFFECT[2001:3000]

  plot_distribution(
    data_table = data.frame(chunk1, chunk2, chunk3),
    names = c("chunk 1", "chunk 2", "chunk 3"),
    quantile_lines = TRUE,
    save_name = "sample_distribution")
    
## End(Not run)
## Not run: 
  data("gwa_sample")

  chunk1 <- gwa_sample$EFFECT[1:1000]
  chunk2 <- gwa_sample$EFFECT[1001:2000]
  chunk3 <- gwa_sample$EFFECT[2001:3000]

  plot_distribution(
    data_table = data.frame(chunk1, chunk2, chunk3),
    names = c("chunk 1", "chunk 2", "chunk 3"),
    quantile_lines = TRUE,
    save_name = "sample_distribution")
    
## End(Not run)

GWAS Precision Plot

Description

This function generates the precision plot created by QC_series. Precision is defined as:

1 / median standard-error.

Usage

plot_precision(SE, N,
               labels = NULL,
               save_name = "Graph_precision",
               save_dir = getwd(), ...)
plot_precision(SE, N,
               labels = NULL,
               save_name = "Graph_precision",
               save_dir = getwd(), ...)

Arguments

`SE`, `N`	numeric vectors containing the median standard-error and sample size of the datasets, respectively.
`labels`	vector containing names or other identifiers for the datapoints, to be plotted in the graph. Note: it's best to keep these very short. To disable labeling, set to `NULL` (default).
`save_name`	character string; the filename, without extension, for the graph file.
`save_dir`	character string; the directory where the graph is saved. Note that R uses forward slash (/) where Windows uses backslash (\).
`...`	arguments passed to `plot`.

Details

When running a QC over multiple files, QC_series collects the values of the SE_HQ_median and sample_size_HQ outputs of QC_GWAS in a table, which is then passed to this function to convert it to a plot.

The plot is to provide a visual estimate whether the standard errors are within the expected range. As sample size increases, the median standard error is expected to decrease, so the plot should show a linear relation.

Value

An invisible NULL.

Examples

## Not run: 
  value_SE <- c(0.078, 0.189, 0.077, 0.040, 0.021, 0.072)
  value_N <- c(870, 830, 970, 690, 2200, 870)
  value_labels <- paste("cohort", 1:6)
  
  plot_precision(SE = value_SE, N = value_N,
                 labels = value_labels,
                 save_name = "sample_precision")
                 
## End(Not run)
## Not run: 
  value_SE <- c(0.078, 0.189, 0.077, 0.040, 0.021, 0.072)
  value_N <- c(870, 830, 970, 690, 2200, 870)
  value_labels <- paste("cohort", 1:6)
  
  plot_precision(SE = value_SE, N = value_N,
                 labels = value_labels,
                 save_name = "sample_precision")
                 
## End(Not run)

Regional Association Plot

Description

A regional association plot is essentially a zoomed-in Manhattan plot, allowing the researcher to look at associations in a small, pre-defined area of the genome.

Usage

plot_regional(dataset,
              chr, start_pos, end_pos,
              plot_cutoff_p = 1,
              name_cutoff_p,
              data_name = NULL,
              save_name = "regional_association_plot",
              save_dir = getwd(),
              header_translations,
              main = "Regional association plot", ...)
plot_regional(dataset,
              chr, start_pos, end_pos,
              plot_cutoff_p = 1,
              name_cutoff_p,
              data_name = NULL,
              save_name = "regional_association_plot",
              save_dir = getwd(),
              header_translations,
              main = "Regional association plot", ...)

Arguments

`dataset`	data frame containing the SNPs chromosome number, position, p-value and (if a `name_cutoff_p` is specified) their SNP IDs.
`chr`	character or numeric value indicating the chromosome of interest.
`start_pos`, `end_pos`	Numeric; the position values of the beginning and end of the region of interest.
`plot_cutoff_p`	numeric - the threshold of p-values to be shown in the QQ & Manhattan plots. Higher (less significant) p-values are not included in the plot. Note that, unlike in `QC_GWAS` or `QC_plots`, the default value is `1` (i.e. includes everything), rather than `0.05`. Setting it to `0.05` excludes 95% of data-points, which significantly reduces running time and memory usage in larger datasets.
`name_cutoff_p`	numeric; SNPs with p-values lower than or equal to this value will have their names plotted in the graph. If set to `NULL` (default), no names will be plotted.
`data_name`	character string; the subtitle for the plot.
`save_name`	character string; the filename, without extension, for the graph file.
`save_dir`	character string; the directory where the graph is saved. Note that R uses forward slash (/) where Windows uses backslash (\).
`header_translations`	translation table for column names. See `translate_header` for more information. If the argument is left empty, `dataset` is assumed to use the standard column-names of `QC_GWAS`.
`main`, `...`	arguments passed to `plot`.

Details

By default, plot_regional expects dataset to use the standard column-names used by QC_GWAS. A translation table can be specified in header_translations to allow non-standard names. See translate_header for more information.

Value

An invisible NULL.

Examples

## Not run: 
  data("gwa_sample")

  plot_regional(dataset = gwa_sample,
          chr = 2, start_pos = 55000000, end_pos = 75000000,
          data_name = "QC GWAS sample data",
          save_name = "sample_regional_association")
          
## End(Not run)
## Not run: 
  data("gwa_sample")

  plot_regional(dataset = gwa_sample,
          chr = 2, start_pos = 55000000, end_pos = 75000000,
          data_name = "QC GWAS sample data",
          save_name = "sample_regional_association")
          
## End(Not run)

GWAS Skewness vs. Kurtosis Plot

Description

This function generates the skewness vs. kurtosis plot created by QC_series.

Usage

plot_skewness(skewness,
              kurtosis,
              labels = paste("Study", 1:length(skewness)),
              plot_labels = "outliers",
              save_name = "Graph_skewness_kurtosis",
              save_dir = getwd(), ...)
plot_skewness(skewness,
              kurtosis,
              labels = paste("Study", 1:length(skewness)),
              plot_labels = "outliers",
              save_name = "Graph_skewness_kurtosis",
              save_dir = getwd(), ...)

Arguments

`skewness`, `kurtosis`	Vectors containing the skewness and kurtosis values of the datasets
`labels`	vector containing names or other identifiers for the datapoints, to be plotted in the graph. Note: it's best to keep these very short.
`plot_labels`	character string or logical determining whether the values in `labels` are plotted next to the data points. The possible settings are: `"none"` (or `FALSE`); `"all"` (or `TRUE`); and `"outliers"` for outliers only.
`save_name`	character string; the filename, without extension, for the graph file.
`save_dir`	character string; the directory where the graph is saved. Note that R uses forward slash (/) where Windows uses backslash (\).
`...`	arguments passed to `plot`.

Details

When running a QC over multiple files, QC_series collects the values of the skewness_HQ and kurtosis_HQ output of QC_GWAS in a table, which is then passed to this function to convert it into a plot. Note that this values are calculated over high-quality SNPs only.

Ideally, one expects both the skewness and kurtosis of effect sizes to be close to 0. In practice, these statistics can be hugely variable. QC_series uses only high-quality effect sizes to calculate these values in order to reduce some of the more extreme values. Still, it is recommended that you compare the values to those of other GWAS with the same phenotype, rather than relying on on the label outliers command to identify problems.

Value

An invisible NULL.

Examples

  value_S <- c(0.05, -0.27, 0.10, 0.11)
  value_K <- c( 6.7,  10.0, 10.1,  6.6)
  value_labels <- paste("cohort", 1:4)
  
  ## Not run: 
  plot_skewness(skewness = value_S,
                kurtosis = value_K,
                labels = value_labels,
                plot_labels = "outliers",
                save_name = "sample_skewness_kurtosis")
## End(Not run)
value_S <- c(0.05, -0.27, 0.10, 0.11)
  value_K <- c( 6.7,  10.0, 10.1,  6.6)
  value_labels <- paste("cohort", 1:4)
  
  ## Not run: 
  plot_skewness(skewness = value_S,
                kurtosis = value_K,
                labels = value_labels,
                plot_labels = "outliers",
                save_name = "sample_skewness_kurtosis")
## End(Not run)

Automated Quality Control of GWAS Results files

Description

QC_GWAS runs a full quality control (QC) over a single GWAS results file. It removes missing and invalid data, checks the alleles and allele frequency with a reference, tests the reported p-value against both calculated and expected values, creates QQ and Manhattan plots and reports the distribution of the quality-parameters within the dataset, as well as various QC statistics.

QC_series does the same thing for multiple GWAS results files. It's mainly a wrapper that passes individual files to QC_GWAS, but it has a few extra features, such as making a checklist of important QC stats and creating several graphs to compare the QC'ed files.

Although the number of arguments in QC_GWAS may seem overwhelming, only three of them are required to run a basic QC. The name of the file to be QC'ed should be passed to the filename argument; the directory of said file to the dir_data argument; and a header-translation table to the header_translations argument. For a quick introduction to QCGWAS, read the quick reference guide that can be found in "R\library\QCGWAS\doc".

Usage

QC_GWAS(filename,
        filename_output = paste0("QC_", filename),
        dir_data = getwd(),
        dir_output = paste(dir_data, "QCGWASed", sep = "/"),
        dir_references = dir_data,
        header_translations,
        column_separators = c("\t", " ", "", ",", ";"),
        nrows = -1, nrows_test = 1000,
        header = TRUE, comment.char = "",
        na.strings = c("NA", "nan", "NaN", "."),
        imputed_T = c("1", "TRUE", "T"),
        imputed_F = c("0", "FALSE", "F"),
        imputed_NA = c(NA, "-"),
        save_final_dataset = TRUE,
        gzip_final_dataset = TRUE, order_columns = FALSE,
        spreadsheet_friendly_log = FALSE,
        out_header = "standard",
        out_quote = FALSE, out_sep = "\t", out_eol = "\n",
        out_na = "NA", out_dec = ".", out_qmethod = "escape",
        out_rownames = FALSE, out_colnames = TRUE,
        return_HQ_effectsizes = FALSE,
        remove_X = FALSE, remove_Y = FALSE,
        remove_XY = remove_Y, remove_M = FALSE,
        calculate_missing_p = FALSE,
        make_plots = TRUE, only_plot_if_threshold = TRUE,
        threshold_allele_freq_correlation = 0.95,
        threshold_p_correlation = 0.99,
        plot_intensity = FALSE,
        plot_histograms = make_plots, plot_QQ = make_plots,
        plot_QQ_bands = TRUE, plot_Manhattan = make_plots,
        plot_cutoff_p = 0.05,
        allele_ref_std, allele_name_std,
        allele_ref_alt, allele_name_alt,
        update_alt = FALSE, update_savename,
        update_as_rdata = FALSE, backup_alt = FALSE,
        remove_mismatches = TRUE,
        remove_mismatches_std = remove_mismatches,
        remove_mismatches_alt = remove_mismatches,
        threshold_diffEAF = 0.15, remove_diffEAF = FALSE,
        remove_diffEAF_std = remove_diffEAF,
        remove_diffEAF_alt = remove_diffEAF,
        check_ambiguous_alleles = FALSE,
        use_threshold = 0.1,
        useFRQ_threshold = use_threshold,
        useHWE_threshold = use_threshold,
        useCal_threshold = use_threshold,
        useImp_threshold = use_threshold,
        useMan_threshold = use_threshold,
        HQfilter_FRQ = 0.01, HQfilter_HWE = 10^-6,
        HQfilter_cal = 0.95, HQfilter_imp = 0.3,
        QQfilter_FRQ = c(NA, 0.01, 0.05),
        QQfilter_HWE = c(NA, 10^-6, 10^-4),
        QQfilter_cal = c(NA, 0.95, 0.99),
        QQfilter_imp = c(NA, 0.3, 0.5, 0.8),
        NAfilter = TRUE,
        NAfilter_FRQ = NAfilter, NAfilter_HWE = NAfilter,
        NAfilter_cal = NAfilter, NAfilter_imp = NAfilter,
        ignore_impstatus = FALSE,
        minimal_impQ_value = -0.5, maximal_impQ_value = 1.5,
        logI = 1L, logN = 1L, ...)
QC_series(data_files, datafile_tag, output_filenames,
          dir_data = getwd(),
          dir_output = paste(dir_data, "QCGWASed", sep = "/"),
          dir_references = dir_data,
          header_translations, out_header = "standard",
          allele_ref_std, allele_name_std,
          allele_ref_alt, allele_name_alt,
          update_alt = FALSE, update_savename,
          update_as_rdata = FALSE, backup_alt = FALSE,
          plot_effectsizes = TRUE, lim_effectsizes = NULL,
          plot_SE = TRUE, label_SE = TRUE,
          plot_SK = TRUE, label_SK = "outliers",
          save_filtersettings = FALSE, ...)
QC_GWAS(filename,
        filename_output = paste0("QC_", filename),
        dir_data = getwd(),
        dir_output = paste(dir_data, "QCGWASed", sep = "/"),
        dir_references = dir_data,
        header_translations,
        column_separators = c("\t", " ", "", ",", ";"),
        nrows = -1, nrows_test = 1000,
        header = TRUE, comment.char = "",
        na.strings = c("NA", "nan", "NaN", "."),
        imputed_T = c("1", "TRUE", "T"),
        imputed_F = c("0", "FALSE", "F"),
        imputed_NA = c(NA, "-"),
        save_final_dataset = TRUE,
        gzip_final_dataset = TRUE, order_columns = FALSE,
        spreadsheet_friendly_log = FALSE,
        out_header = "standard",
        out_quote = FALSE, out_sep = "\t", out_eol = "\n",
        out_na = "NA", out_dec = ".", out_qmethod = "escape",
        out_rownames = FALSE, out_colnames = TRUE,
        return_HQ_effectsizes = FALSE,
        remove_X = FALSE, remove_Y = FALSE,
        remove_XY = remove_Y, remove_M = FALSE,
        calculate_missing_p = FALSE,
        make_plots = TRUE, only_plot_if_threshold = TRUE,
        threshold_allele_freq_correlation = 0.95,
        threshold_p_correlation = 0.99,
        plot_intensity = FALSE,
        plot_histograms = make_plots, plot_QQ = make_plots,
        plot_QQ_bands = TRUE, plot_Manhattan = make_plots,
        plot_cutoff_p = 0.05,
        allele_ref_std, allele_name_std,
        allele_ref_alt, allele_name_alt,
        update_alt = FALSE, update_savename,
        update_as_rdata = FALSE, backup_alt = FALSE,
        remove_mismatches = TRUE,
        remove_mismatches_std = remove_mismatches,
        remove_mismatches_alt = remove_mismatches,
        threshold_diffEAF = 0.15, remove_diffEAF = FALSE,
        remove_diffEAF_std = remove_diffEAF,
        remove_diffEAF_alt = remove_diffEAF,
        check_ambiguous_alleles = FALSE,
        use_threshold = 0.1,
        useFRQ_threshold = use_threshold,
        useHWE_threshold = use_threshold,
        useCal_threshold = use_threshold,
        useImp_threshold = use_threshold,
        useMan_threshold = use_threshold,
        HQfilter_FRQ = 0.01, HQfilter_HWE = 10^-6,
        HQfilter_cal = 0.95, HQfilter_imp = 0.3,
        QQfilter_FRQ = c(NA, 0.01, 0.05),
        QQfilter_HWE = c(NA, 10^-6, 10^-4),
        QQfilter_cal = c(NA, 0.95, 0.99),
        QQfilter_imp = c(NA, 0.3, 0.5, 0.8),
        NAfilter = TRUE,
        NAfilter_FRQ = NAfilter, NAfilter_HWE = NAfilter,
        NAfilter_cal = NAfilter, NAfilter_imp = NAfilter,
        ignore_impstatus = FALSE,
        minimal_impQ_value = -0.5, maximal_impQ_value = 1.5,
        logI = 1L, logN = 1L, ...)
QC_series(data_files, datafile_tag, output_filenames,
          dir_data = getwd(),
          dir_output = paste(dir_data, "QCGWASed", sep = "/"),
          dir_references = dir_data,
          header_translations, out_header = "standard",
          allele_ref_std, allele_name_std,
          allele_ref_alt, allele_name_alt,
          update_alt = FALSE, update_savename,
          update_as_rdata = FALSE, backup_alt = FALSE,
          plot_effectsizes = TRUE, lim_effectsizes = NULL,
          plot_SE = TRUE, label_SE = TRUE,
          plot_SK = TRUE, label_SK = "outliers",
          save_filtersettings = FALSE, ...)

Arguments

`filename`, `data_files`, `datafile_tag`	`filename` and `data_files` are, respectively, the name and names of the GWAS results file(s) to be QC'ed. If no `data_files` are specified, `QC_series` will process all filed in `dir_data` containing the string passed to `datafiles_tag` in their filename. See below for more information on the input requirements.
`filename_output`, `output_filenames`	respectively the filename or names of the output of the QC. This should not include an extension, since the QC will automatically add one. The default is to use the input filename with `"QC_"` prefixed.
`dir_data`, `dir_output`, `dir_references`	character strings specifying the directory dress of the folders for, respectively, the input file(s), the output file(s) and the auxillary files (header-translation tables and allele references). Note that R uses forward slash (/) where Windows uses backslash (\). If `dir_output` does not exist, it will be created. If no `dir_output` is specified, a folder named `"QCGWASed"` will be created in `dir_data`.
`header_translations`	Translation table for the column names of the input file. Alternatively, the name of a file in `dir_references` containing such a table. See `translate_header` for details.
`column_separators`	character string or vector; specifies the values used as column delimitator in the GWAS file. The argument is passed to `load_test`; see the description of that function for more information.
`nrows_test`	integer; the number of rows used for "trial-loading". Before loading the entire dataset, the function `load_test` is called to determine the dataset's file-format by reading the top `x` lines, where `x` is `nrows_test`. Setting `nrows_test` to a low number (e.g. `150`) means quick testing, but runs the risk of missing problems in lower rows. To test the entire dataset, set it to `-1`.
`nrows`, `header`, `comment.char`	arguments passed to `read.table` when importing the dataset.
`na.strings`	character vector describing the values that represent missing data in the dataset. Passed to `read.table`.
`imputed_T`, `imputed_F`, `imputed_NA`	character vectors; passed to `convert_impstatus` (as `T_strings`, `F_strings` and `NA_strings`, respectively) to translate the imputation-status column. Note that the current version of `QC_GWAS` always translates the imputation status. Even when the dataset already has the correct format, the user still needs to specify `1`, `0` and `NA` for these arguments, respectively. Also note that R distinguishes between the value `NA` and the character string `"NA"`.
`save_final_dataset`	logical; should the post-QC dataset be saved?
`gzip_final_dataset`	logical; should the post-QC dataset be compressed?
`order_columns`	logical; should the post-QC dataset use the default column order?
`spreadsheet_friendly_log`	logical; if `TRUE`, the final log file will be tab-separated, for easy viewing in a spreadsheet program. If `FALSE` (default), it will be formatted for pretty viewing in a text-processing program.
`out_header`	Translation table for the column names of the output file. This argument is the opposite of `header_translations`: it translates the standard column-names of `QC_GWAS` to user-defined ones. `output_header` can be one of three things: A user specified table similar to the one used by `translate_header`. However, as this translates standard names into non-standard ones, the standard names should be in the right column, and the desired ones in the left. There is also no requirement for the names in the left column to be uppercase. The name of a file in `dir_references` containing such a column. Character string specifying a standard form. See the section 'Output header' below for the options.
`out_quote`, `out_sep`, `out_eol`, `out_na`, `out_dec`, `out_qmethod`, `out_rownames`, `out_colnames`	arguments passed to `write.table` when saving the final dataset.
`return_HQ_effectsizes`	logical; return a vector of (max. 1000) high-quality effect-sizes? (In `QC_series`, this is set by the `plot_effectsizes` argument.)
`remove_X`, `remove_Y`, `remove_XY`, `remove_M`	logical; respectively whether X-chromosome, Y-chromosome, pseudo-autosomal and mitochondrial SNPs are removed from the dataset.
`calculate_missing_p`	logical; should the QC calculate missing/invalid p-values in the dataset?
`make_plots`	logical; should the QC generate and save QQ plots, a Manhattan plot, histograms of data distribution and scatter plots of correlation in `dir_reference`?
`only_plot_if_threshold`	logical; should the scatterplots only be made if the correlation is below a threshold value?
`threshold_allele_freq_correlation`, `threshold_p_correlation`	numeric; thresholds for reporting and plotting the correlation between respectively the allele frequency of the dataset and the reference, and the calculated and reported p-values.
`plot_intensity`	logical; if `TRUE`, instead of a scatterplot of allele correlations, an intensity plot is generated. This option is currently only partially implemented. Leave to `FALSE` for now.
`plot_histograms`	logical; should histograms of the effect sizes, standard errors, allele frequencies, HWE p-values, callrates and imputation quality be made?
`plot_QQ`, `plot_Manhattan`	logical; should QQ and Manhattan plots be made?
`plot_QQ_bands`	logical; include probability bands in the QQ plot?
`plot_cutoff_p`	numeric; significance threshold for inclusion in the QQ and Manhattan plots. The default value (`0.05`) excludes 95% of SNPs, significantly reducing running-time and memory usage. For this reason it is not recommend to set a higher value when QC'ing a normal-sized GWAS dataset.
`allele_ref_std`, `allele_ref_alt`	the standard and alternative allele-reference tables. Alternatively, the name of a file in `dir_references` containing said table. Files in `.RData` format are accepted, but the table's object name must match the argument name. See `match_alleles` for more information on the input requirements.
`allele_name_std`, `allele_name_alt`	character strings; these name the standard and alternative allele reference in the output. If no values are given, the function will use the reference's filename (if specified) or a default name.
`update_alt`	logical; if the function encounters SNPs not included in either the standard or alternative reference, should these be added to the alternative reference file? If no alternative reference was specified, this creates one.
`update_savename`	character string; the filename for saving the updated alternative reference, without extension. If `allele_ref_alt` is a filename, it is not necessary to specify this argument.
`update_as_rdata`	logical; should the updated alternative allele reference be saved as `.RData` (`TRUE`) or a tab-delimitated .txt file (`FALSE`).
`backup_alt`	logical; if the alternative allele reference is updated, should a back-up be made of the original reference file?
`remove_mismatches`, `remove_mismatches_std`, `remove_mismatches_alt`	logical; should SNPs with mismatching alleles be removed from the dataset? `remove_mismatches` serves as the default value; the other two arguments determine this setting for the standard and alternative references, respectively.
`threshold_diffEAF`	Numeric; the threshold for the difference between reported and reference allele-frequency. SNPs for which the difference exceeds the threshold are counted and (optionally) removed.
`remove_diffEAF`, `remove_diffEAF_std`, `remove_diffEAF_alt`	Logical; should SNPs that exceed the `threshold_diffEAF` be removed from the dataset? `remove_diffEAF` serves as the default value; the other two arguments determine this setting for the standard and alternative references, respectively.
`check_ambiguous_alleles`	logical; check for SNPs with strand-independent allele-configurations (i.e. A/T and C/G SNPs)?
`use_threshold`	numeric; threshold value. The relative or absolute number of usable values required for a variable to be used in the QC. These arguments prevent the QC from applying filters to variables with no data. If a variable has less non-missing, non-invalid values than specified in the threshold, it will be ignored; i.e. no filter will be applied to it and no plots will be made. Values `> 1` specify the absolute threshold, while values of `1` or lower specify the fraction of SNPs remaining in the dataset. This argument is the default threshold for all variables; variable-specific thresholds can be set with the following arguments.
`useFRQ_threshold`, `useHWE_threshold`, `useCal_threshold`, `useImp_threshold`, `useMan_threshold`	numeric; variable-specific thresholds for allele frequency, HWE p-value, callrate, imputation quality and Manhattan plot (i.e. chromosome & position values) respectively.
`HQfilter_FRQ`, `HQfilter_HWE`, `HQfilter_cal`, `HQfilter_imp`	numeric; threshold values for the high-quality SNP filter. SNPs that do not meet or exceed all four thresholds will be excluded from several QC tests. The filters are for allele frequency, HWE p-value, callrate & imputation quality, respectively, and are processed by `HQ_filter`. See 'Filter arguments' for more information. Note: the high-quality filter does not remove SNPs; it merely excludes them from several QC tests.
`QQfilter_FRQ`, `QQfilter_HWE`, `QQfilter_cal`, `QQfilter_imp`	numeric vector; threshold values for the QQ plot filters. SNPs that do not meet or exceed the value will be excluded from the QQ plot. Up to five values can be specified per filter. The filters are for allele-frequency, HWE p-value, callrate & imputation quality respectively, and are processed by `QC_plots`. See 'Filter arguments' for more information.
`NAfilter`, `NAfilter_FRQ`, `NAfilter_HWE`, `NAfilter_cal`, `NAfilter_imp`	logical; should the high-quality and QQ filters exclude (`TRUE`) or ignore (`FALSE`) missing values? `NAfilter` is the default setting; the others allow allow variable specific settings.
`ignore_impstatus`	logical; if `FALSE`, HWE p-value and callrate filters are applied only to genotyped SNPs, and imputation quality filters only to imputed SNPs. If `TRUE`, the filters are applied to all SNPs regardless of the imputation status.
`minimal_impQ_value`, `maximal_impQ_value`	numeric; the minimal and maximal possible (i.e. non-invalid) imputation quality values.
`logI`, `logN`	progress indicators used by `QC_series`: irrelevant for users.
`plot_effectsizes`, `plot_SE`, `plot_SK`	logical; additional plot options for `QC_series`. The arguments toggle the creation of plots comparing the effect-size distribution, precision and skewness vs. kurtosis of all successfully QC'ed datasets, respectively. See `plot_distribution`, `plot_precision` and `plot_skewness` for more information.
`lim_effectsizes`	specifies the y-axis range of the effect-size distribution plot.
`label_SE`	logical; should the datapoints in the precision plot be labeled?
`label_SK`	character string; determines whether the datapoints in the skewness vs. kurtosis plot are labeled. Options are `"none"`, `"all"`, or `"outliers"` (outliers only).
`save_filtersettings`	logical; saves the filtersettings used by the high-quality filter to a file 'Check_filtersettings.txt' in the output directory. If a file of that name already exists, the settings are added to the end (i.e. it updates rather than overwrites existing files). The file can be used as ini file by `filter_GWAS`.
`...`	in `QC_series`: arguments passed to `QC_GWAS`; in `QC_GWAS`: arguments passed to `read.table` when importing the dataset.

Details

The full quality-control carried out by QC_GWAS consists of 5 phases. The function takes a single dataset (or, rather, the location and filename of a single dataset) and runs it through the following phases:

1: Importing the dataset
2: Checking data integrity
3: Checking alleles
4: Generating QC statistics and graphs
5: Saving the output

Phase 1: importing the dataset

GWAS results files come in a variety of formats, so QC_GWAS is flexible about loading data. It uses an autoloader function (load_GWAS) to unpack .zip or .gz files and determine the column-separator used in the file. See the section 'Requirement for the input dataset' for more information.

Next, the function attempts to translate the dataset's column names (the header) to standard names, so that it knows what type of data a column contains. This is done by comparing the column names to a translation table (specified in the header_translations argument). See translate_header for more information.

Note that only the SNP ID, alleles, effect-size and standard error columns are required. The absence of other standard columns (chromosome, position, strand, allele frequency, HWE p-value, callrate, imputation quality, imputation status and used for imputation) will not cause the QC to abort. Instead, a warning is printed on screen and in the log file, and a dummy column filled with NA values is added to the dataset.

It is therefor important to check the log file: if a standard column is present but not identified (because it is missing or misspelled in the translation table) the QC will continue, but is unable to check/use the data inside. The unidentified column will be reported in the columns_unidentified value of the QC_GWAS return or in the "QC_checklist.txt" file generated by QC_series.

Phase 2: checking data integrity

The purpose of phase 2 is to ensure that the dataset can be QC'ed: that that all SNPs have the required data and that all columns contain only valid (or missing) values.

The first step is to remove SNPs that won't be used: monomorphic SNPs and (if specified by the arguments) allosomal, pseudo-autosomal and mitochondrial SNPs. The function considers SNPs monomorphic if they have a missing or invalid other (non-effect) allele, identical alleles or an allele-frequency of 1 or 0.

The second step is to check the imputation status column with the function convert_impstatus. See the section 'Requirement for the input dataset' for more information. Imputation status is one of the most important variables in the dataset: if unknown, the HWE p-value, callrate and imputation quality won't be used (unless ignore_impstatus is TRUE), as the function cannot determine which SNPs are imputed and which are not. For this reason, if convert_impstatus is unable to translate any character string in the column, the QC will abort.

The third step carries out three tests for all other standard variables:

Does the column contain the correct type of data (integer, numeric or character)?
How many values are missing (NA)?
How many values are invalid (= impossible)?

The exact nature of the three tests differs per variable: see the documentation file in "R\library\QCGWAS\doc" for more detail.

The presence of the wrong data-type will cause the QC to abort. Wrong data-type indicates either a problem in the file itself, or with the way it was imported (in which case it is most likely due to a mistranslated header).

The final step is the removal of the invalid values and of unusable SNPs. The variables MARKER, EFFECT_ALLELE, OTHER_ALLELE, EFFECT and STDERR are considered crucial. SNPs with missing or invalid values in any of these variables are removed the dataset. Missing values in the other variables are ignored, while invalid values are set to NA.

Phase 3: checking alleles

This phase has three functions:

To check if the correct alleles are reported for each SNP
To check if the allele-frequency is reported for the correct (effect) allele
To ensure that SNPs are aligned to the positive strand and use the same effect-allele in all post-QC datasets

This is achieved by comparing the data to a reference, using the function match_alleles. First, all SNPs are switched to the positive strand (the alleles are converted to their opposing base and the strand-value is set to "+"). If there are SNPs whose allele pair doesn't match the reference, match_alleles assumes the information in the strand column is absent or incorrect, and will also switch those SNPs to the other (presumably positive) strand. This step is referred to as strand-switching in QC output, and is independent from the negative-strand SNP conversion. It is therefor possible that a SNP is switched twice: once because the strand-column indicates it is on the negative strand, and twice because of a mismatch. This is referred to as double strand-switch in the output, and indicates either the wrong value in the strand column, or a mismatch with the reference. In the latter case, it will most likely be picked up in the next step.

If the strand-switch does not fix the mismatch, the SNPs are counted in the QC output as mismatches. Depending on the remove_mismatches arguments, the SNPs will either be removed from the dataset, or left in but excluded from the further tests of the allele-matching.

Next, match_alleles "flips" SNPs so that their effect allele matches the reference minor allele. This ensures that a SNP will have the same effect allele in all post-QC datasets.

match_alleles also counts the number of SNPs with a strand-independent allele configuration (A/T or C/G; these are designated as "ambiguous SNPs"), and the subset of those with an allele-frequency that is substantially different from the reference ("suspect SNPs"). If a substantial proportion of ambiguous SNPs is suspect, it indicates that the strand information is incorrect. In our experience, a regular, 2.5M SNP dataset usually consists of 15% ambiguous SNPs, of which a few dozen will be suspect.

The function also counts the number of SNPs whose allele frequency differs from the reference by more than a set amount (threshold_difEAF). If the relevant remove_diffEAF argument is TRUE, these SNP will be excluded from the dataset after the allele-matching is finished.

The final step is to correlate the reported allele-frequency against the reference. If allele-frequency is reported for the correct (effect) allele, the correlation should be close to 1. If the outcome is close to -1, the reported frequency is that of the other allele. Depending on the plot settings, a scatter plot of reported vs. reference frequency is made for all SNPs, and for the subsets of ambiguous and non-ambiguous SNPs.

The standard and alternative allele reference

The above steps describe what happens when the dataset is compared to a single reference. However, we found that many GWAS datasets contain SNPs not present in our standard HapMap reference, so we added a second, flexible reference that can be updated with any unknown SNPs the QC encounters.

SNPs that are not found in either reference are converted to the positive strand, and "flipped" if their allele frequency is > 0.50. If update_alt is TRUE, these SNPs are then added to the alternative reference and saved under the name update_savename. There are a few caveats to this system: see the section 'Updating the alternative reference' for details.

Phase 4: generating QC statistics and graphs

At this stage, no further changes will be made to the dataset (except, optionally, to recalculate missing p-values). The function will now start to calculate the QC statistics and generate the important graphs. These are:

Create histograms of variable distribution (optional)
Check p-values by correlating them to a p calculated from the effect-size and standard-error (via the check_P function).
Recalculate missing/invalid p-values (optional)
Calculate QC statistics
Create QQ and Manhattan plots (optional, see QC_plots function for more information).

Phase 5: saving the output

A series of tables is added to the bottom of the log file, reporting the QC statistics and the data distribution. If save_final_dataset is TRUE, the post-QC data will be exported as a .txt file. The column names and format of that file can be specified by the out arguments.

Value

The most important output of the QC is the log file. See the section 'QC output files' for more details. This section only describes the function return within R.

QC_series returns a single, invisible, logical value, indicating whether the alternative allele-reference has been updated.

QC_GWAS returns an object of class 'list'. If the QC was not successful, this list contains only five of the following components (QC_successful, filename_input, filename_output, all_ref_changed, effectsize_return). If it was, it will contain all of them:

`QC_successful`	logical; indicates whether the QC was completed. If `FALSE`, the function was either unable to load the dataset, encountered an unexpected datatype, or removed all SNPs during the QC. The log file and screen output will indicate what triggered the abort.
`filename_input`, `filename_output`	the filenames of the dataset pre- and post-QC respectively.
`sample_size`, `sample_size_HQ`	the highest reported sample size for all SNPs and high-quality SNPs only, respectively.
`lambda`, `lambda_geno`, `lambda_imp`	the lambda values of all, genotyped and imputed SNPs, respectively.
`SNP_N_input`	the number of SNPs in the original dataset.
`SNP_N_input_monomorphic`	the number of SNPs removed because they are monomorphic.
`SNP_N_input_monomorphic_identic_alleles`	the subset of above that had identical alleles, but allele-frequencies that were not `0` or `1`.
`SNP_N_input_chr`	the number of SNPs removed because they were X-chromosomal, Y-chromosomal, pseudo-autosomal or mitochondrial (depends on the remove-arguments). If all remove arguments were set to `FALSE`, this returns `NA`.
`SNP_N_preQC`	the number of SNPs that entered phase 2b (i.e. after removal of the monomorphic and excluded-chromosome SNPs).
`SNP_N_preQC_unusable`	the number of SNPs removed in phase 2d, due to missing or invalid crucial variables.
`SNP_N_preQC_invalid`	the number of SNPs with invalid, non-crucial values in phase 2d.
`SNP_N_preQC_min`	the number of negative-strand SNPs in phase 2d.
`SNP_N_midQC`	the number of SNPs in the dataset during allele-matching (phase 3).
`SNP_N_midQC_min`	the number of negative strand SNPs in phase 3.
`SNP_N_midQC_min_std`, `SNP_N_midQC_min_alt`, `SNP_N_midQC_min_new`	the number of negative strand SNPs matched against, respectively, the standard allele reference, the alternative allele reference or neither.
`SNP_N_midQC_strandswitch_std`, `SNP_N_midQC_strandswitch_alt`	the number of SNPs that were strand-switched because of a mismatch with the standard and alternative allele reference, respectively.
`SNP_N_midQC_strandswitch_std_min`, `SNP_N_midQC_strandswitch_alt_min`	the subset of previous that were negative-strand SNPs. NOTE: at this point in the QC, negative-strand SNPs have already been converted to the positive strand, i.e. they should not appear in this category. If they do, there is a problem with the reported strand, or with the reference table.
`SNP_N_midQC_mismatch`	the number of SNPs that were still mismatching after the strand-switch.
`SNP_N_midQC_mismatch_std`, `SNP_N_midQC_mismatch_alt`	subset of previous that were matched with the standard and alternative allele reference, respectively.
`SNP_N_midQC_mismatch_std_min`, `SNP_N_midQC_mismatch_alt_min`	subset of previous that are negative-strand SNPs.
`SNP_N_midQC_flip_std`, `SNP_N_midQC_flip_alt`, `SNP_N_midQC_flip_new`	Number of SNPs that were flipped (had their alleles reversed) when matched against, respectively, the standard allele reference, the alternative allele reference or neither.
`SNP_N_midQC_ambiguous`	the number of ambiguous SNPs
`SNP_N_midQC_ambiguous_std`, `SNP_N_midQC_ambiguous_alt`, `SNP_N_midQC_ambiguous_new`	subset of ambiguous SNPs matched against, respectively, the standard allele reference, the alternative allele reference or neither.
`SNP_N_midQC_suspect`	the subset of ambiguous SNPs whose allele frequencies differ strongly from those in the reference.
`SNP_N_midQC_suspect_std`, `SNP_N_midQC_suspect_alt`	the subsets of previous matched against the standard and alternative allele reference, respectively.
`SNP_N_midQC_diffEAF`	the number of SNPs whose allele frequency differs strongly from the reference.
`SNP_N_midQC_diffEAF_std`, `SNP_N_midQC_diffEAF_alt`	subset of previous that were matched with the standard and alternative allele reference, respectively.
`SNP_N_postQC`	the number of SNPs in the final dataset.
`SNP_N_postQC_geno`, `SNP_N_postQC_imp`	the number of genotyped and imputed SNPs in the final dataset.
`SNP_N_postQC_invalid`	the number of SNPs with invalid values remaining in the final dataset. Note: any invalid values have already been changed to `NA` at this point. This merely counts how many of those SNPs are still in the dataset.
`SNP_N_postQC_min`	the number of negative-strand SNPs in the final dataset. Note: all SNPs have been switched to the positive strand at this point. This merely counts how many of those SNPs are still in the dataset.
`SNP_N_postQC_HQ`	the number of high-quality SNPs in the final dataset.
`fixed_HWE`, `fixed_callrate`, `fixed_sampleN`, `fixed_impQ`	logical or character string; are HWE p-values, callrates, sample-size and imputation quality values identical for all relevant SNPs? If `TRUE`, it indicates that the parameters have not been calculated and are dummy values. If a parameter fails the threshold test, this returns `"insuf. data"` (`"no data"` for sample size).
`effect_25`, `effect_median`, `effect_75`	the quartile values of the effect-size distribution.
`effect_mean`	the mean of the effect-size distribution.
`SE_median`, `SE_median_HQ`	the median standard error of all SNPs and high-quality SNPs only, respectively.
`skewness`, `kurtosis`	the skewness and kurtosis value of the dataset.
`skewness_HQ`, `kurtosis_HQ`	the skewness and kurtosis value for high-quality SNPs only.
`all_ref_std_name`, `all_ref_alt_name`	the names used for the standard and alternative allele-references in the output.
`all_MAF_std_r`, `all_MAF_alt_r`	allele-frequency correlation with the standard and alternative allele references.
`all_ambiguous_MAF_std_r`, `all_ambiguous_MAF_alt_r`	allele-frequency correlations for the subset of ambiguous SNPs in the standard and alternative allele references, respectively.
`all_non_ambig_MAF_std_r`, `all_non_ambig_MAF_alt_r`	allele-frequency correlations for the subset of non-ambiguous SNPs in the standard and alternative allele references, respectively.
`all_ref_changed`	logical; has an updated alternative allele reference been saved?
`effectsize_return`	logical; is a vector of high-quality effect-sizes returned in `effectsizes_HQ`?
`effectsizes_HQ`	if `effectsize_return` is `TRUE`, a vector of 1000 high-quality effect-sizes; if not, `NULL`. If a dataset contains less than 1000 high-quality SNPs, the vector is padded with NA's to bring it to 1000 values.
`pvalue_r`	the correlation between reported and calculated p-values.
`visschers_stat`, `visschers_stat_HQ`	the Visscher's statistic for all SNPs and high-quality SNPs only, respectively.
`columns_std_missing`	the names of any missing, standard columns: if no columns are missing, this returns `0`.
`columns_std_empty`	the names of any empty, standard columns: if no columns are empty, this returns `0`.
`columns_unidentified`	the names of any unidentified columns in the input dataset. If none, this returns `0`.
`outcome_useFRQ`, `outcome_useHWE`, `outcome_useCal`, `outcome_useImp`, `outcome_useMan`	logical; indicates whether the variable passed the threshold test.
`...`	the remaining 'setting' components return the the actual filter settings used in the QC (i.e. taking into account whether the variables passed the threshold test).

QC output files

The results of the QC are reported in a variety of .txt and .png files saved in dir_output. The files use the same output name as the dataset, with an extension to indicate their contents (i.e. '_log.txt', '_graph_QQ.png'). The .txt files are tab-delimited and are best viewed in a spreadsheet program. The most important one is the log file. This file summarizes the results of the QC and the data inside the file.

The log file

The top of the file is table of log entries, reporting any (potential) problems encountered during the QC. Some of these are just routine updates; the removal of SNPs with missing data, for example. However, do check the other entries. These report important but non-fatal problems, relating to crucial missing data or invalid data. In such a case, and provided the QC did not abort, the affected SNPs will have been exported to a .txt file before being excluded, so the user can inspect them without having to reload the entire dataset. The .txt's are:

[filename_output]_SNPs_invalid_allele2.txt

[filename_output]_SNPs_duplicates.txt

[filename_output]_SNPs_removed.txt

[filename_output]_SNPs_improbable_values.txt

(Note: the names of the files are slightly confusing: the "SNPs_removed" file contains all SNPs removed in phase 2d. This does not include monomorphic SNPs, or SNPs from excluded chromosomes, as these are removed in phase 2a. Also, the "SNPs_improbable_values" file does not include SNPs with invalid values for crucial variables, as these are already in the "SNPs_removed" file.)

Another important but non-fatal problem is missing columns. QC_GWAS uses a translation table to determine the contents of a column. If the translation table is incomplete, or contains a typo, the function will be unable to translate (and therefor use) a column. If this involves, say, callrate, it merely means the function cannot apply the callrate filters, but the absence of p-values or imputation status will disable many features of the QC. If you know that a data column is present, yet the log reports it missing, check the translation table. The QC_series checklist output and the columns_unidentified value of the QC_GWAS return report the names of any unidentified columns in the dataset.

If the QC aborts, the log file should give some indication why this occurred. However, if it was successful, there will be several other tables in the log file.

The second table reports the number of SNPs in the dataset at various stages of the QC; as well as how many (and for what reason) SNPs were removed.

The third table gives an overview of the data itself. It reports how many values were missing and invalid per variable in both the pre- and post-QC datasets, as well as the quartile and mean values of the post-QC data. A few notes on the nomenclature: invalid values will have been removed (for crucial variables) or set to missing (for non-crucial variables) in stage 2d. The post-QC 'invalid' column merely records how many of these SNPs remain in the dataset. 'Unusable' values are the missing and invalid values combined (shown as percentage of the total data). Finally, pre-QC refers to the dataset during stage 2b-c, but after monomorphic SNPs and SNPs from excluded chromosomes (stage 2a) have been removed.

The fourth table reports on the allele-matching in phase 3. The concepts are explained in 'details' and the match_alleles function; here we just mention what the user needs to pay attention to. Strand-switching counts SNPs whose strand was switched because it mismatched with the reference. As many cohorts do not add strand data (or set every SNP to "+"), the presence of such SNPs is not a red flag by itself. However, if there are mismatching SNPs (the subset of strand-switched SNPs that could not be fixed), there is a problem with the allele data (or, possibly, a triallelic SNP). Check the [filename_output]_SNPs_mismatches-[ref-name].txt file to see the affected SNPs.

Another red flag is if there are strand-switched SNPs in a dataset that also contains negative-strand SNPs (i.e. the cohort included real strand data, rather just setting it to "+" for all SNPs). Negative-strand SNPs are converted to the positive strand beforehand, so they should not appear in this step (if they do, they are counted in the "double strand-switch" entry, but that is of minor importance). The real problem is that if a cohort includes negative-strand SNPs (i.e. real strand data), and there are still strand-switches, the strand data must be incorrect. Whether the strand-switches and the negative-strand SNPs overlap is unimportant.

The third possible problem is when a large proportion of ambiguous SNPs is suspect: it indicates that they are on the wrong strand.

Finally, a large number of SNPs with a deviating allele frequency indicates either that the frequency is reported for the wrong allele (see below) or that the dataset population does not match that of the reference.

The fifth table reports the QC outcome statistics. The p-value and allele-frequency correlations should be close to 1. An allele-frequency correlation of -1 means that the frequency was reported for the wrong (non-effect) allele. As for the p-value correlaction: in a typical GWAS dataset, the expected and observed p-values should correlate perfectly. If this isn't the case, it means either that a column was misidentified when loading the dataset or that the wrong values were used when generating the data.

The sixth table reports how many SNPs were removed by the various QQ plot filters.

The seventh table reports the chromosomes and alleles present in the final dataset.

The eighth table counts invalid values in the pre- and post-QC files for several variables. 'Extreme p' is a value that is only used when calculate_missing_p is TRUE. Any newly-calculated p-values that are < 1E-300 will be set to 1E-300, in order to exclude any values of 0 (1E-300 is close to the smallest numeric value that R can handle safely).

The final four tables contain the settings of the QC.

The QC_series output

QC_series saves a checklist, showing the most important QC stats of the various files side by side, and (depending on the plot-arguments) several graphs comparing effect-size distribution, precision and skewness vs. kurtosis of the QC'ed files.

Output header

The standard-format values used by out_header are:

"standard" retains the default column names used by QC_GWAS.
"original" restores the column names used in the input file.
"old" uses the default column names of pre-v1.0b versions of QCGWAS.
"GWAMA", "PLINK", "META" and "GenABEL" set the column names to those use by the respective programs. Note that META's alleleB corresponds to EFFECT_ALL in QC_GWAS.

Requirement for the input dataset

QC_GWAS will automatically unpack .gz and .zip files, provided the filename includes the extension of the packed file. For example, if the data file is named "data1.csv", the zip file should be "data1.csv.zip". If it is named "data1.zip", QC_GWAS won't be able to "call" the file inside.

QC_GWAS is flexible when it comes to file-format. By default, it can open datasets with a variety of column separators and NA values (the user can specify these via the column_separators and na.strings arguments). Read the documentation of the auto-loader function load_GWAS for more information.

Chromosome values can be coded as numeric or character: values of "X", "Y", "XY" and "M" will automatically be converted to 23, 24, 25 and 26, respectively.

By default, imputation status is coded as integers 0 (genotyped) and 1 (imputed). As of version 1.0-4, imputation status will always be translated using the imputed_T, imputed_F and imputed_NA arguments. This means that the user must specify values for these arguments, even when the dataset already uses the standard format. Because of the importance of imputation status, if the function is unable to translate character values, the QC will abort.

The minimal and maximal valid imputation quality values are determined by the minimal_impQ_value and maximal_impQ_value arguments.

Standard errors and p-values of 0 are considered invalid and removed in phase 2d, while values of -1 will be set to NA. Effect sizes of -1 are accepted, unless the standard error and/or the p value are also -1, in which case it is also set to NA.

Filter arguments

QC_GWAS has three sets arguments relating to filters: the arguments for the HQ (high-quality), QQ (plot) and NA (missing values) filters. The HQ and QQ arguments work mostly in the same way, but there are a few key differences.

The high-quality arguments accept single, numeric values that determine the minimal values of allele-frequency (FRQ), HWE p-value (HWE), callrate (cal) and imputation quality (imp) for a SNP to be of high-quality. The high-quality filter is used for the effect-size boxplot and the Manhattan plot, as well as several QC stats.

The QQ arguments accept a vector of max. 5 numeric values that are applied sequentially as filters in the QQ plots.

Both of these use the NA filter argument(s) to determine whether to exclude or ignore missing values.

Neither filter is used to remove SNPs; they merely exclude them from several QC tests. Both HQ and QQ filter criteria are only applied if the variable passed the threshold test, i.e. if there are sufficient non-missing, non-invalid values for the filter to be applied (see the use_threshold argument for details). It's pointless to filter an empty column.

If ignore_impstatus is FALSE (default), the imputation-quality criterion is only applied to imputed SNPs, and the HWE p-value and callrate criteria only to genotyped SNPs. If TRUE, the filters are applied to all SNPs, regardless of the imputation status.

The allele-frequency filters are two-sided: when set to value x, SNPs with frequency < x or > 1 - x are excluded.

To filter missing values only, set the filter to NA and the corresponding NA filter argument to TRUE. To disable entirely, set to NULL (this means the NA-filter setting is ignored as well).

The differences between the HQ and QQ filters are:

The HQ filter arguments accept a single value, the QQ filters can accept up to 5.

The HQ filter is a single filter: a SNP needs to meet all relevant criteria to be considered high-quality. The QQ filter values are applied separately.

The QQ filter has an additional feature: if passed a value x > 1, it will calculate a filter value of x / sample size. This is to allow size-based filtering of allele frequencies. Note that this filter uses the sample size listed for that specific SNP. If the sample size is missing, the relevant NA-filter setting is used to determine whether it should be excluded.

Updating the alternative reference

There are two drawbacks to the way the function updates the alternative reference file. One is a technical issue, but the other can affect the QC of subsequent files.

Firstly, the argument update_alt has a slightly misleading name: the alternative-reference file is updated, but the reference inside the R memory is not. If the user wants to do further QCs with the updated reference, (s)he will have to manually reload the updated file into R.

This is caused by the way R handles data-alterations that occur inside a function. Any changes made to data last only for the duration of the function. Once the function terminates, the memory reverts to its original state. In other words: the allele reference is updated inside QC_GWAS, but goes back to the pre-QC state once the QC is finished. QC_series deals with this by automatically reloading the reference file whenever it is updated, but, again, once the function terminates it will revert to its original state.

The second drawback is that the content of the alternative reference is arbitrary, depending on which file an unknown SNP is encountered in first. For example, suppose that SNP rs31 has alleles A and G, an allele frequency that varies around 0.5, and does not appear in the standard reference. When it is added to the alternative reference, the allele listed as the minor one depends entirely on the allele frequency in the first file it is encountered in.

More seriously, if the information in the first file is incorrect, the SNP may be strand-switched or excluded in subsequent files because it does not match the (incorrect) reference. This is another reason why it is important to check the log files: if there is a problem with a datafile's strand, alleles or allele-frequency, and the alternative reference was updated, the incorrect data may have been added to the reference. If so, one should go back to a previous reference file. The argument backup_alt is useful for this, though note that QC_series only does this the first time the reference is updated.

Also, if one wants to QC a large number of files for a meta-GWAS, one should use the same alternative allele reference file (and let QC_GWAS update it) for every file, otherwise it is possible that rs13 may have a different effect alleles in some post-QC files.

Examples

## For instructions on how to run QC_GWAS and QC_series
## check the quick start guide in /R/library/QCGWAS/doc
## For instructions on how to run QC_GWAS and QC_series
## check the quick start guide in /R/library/QCGWAS/doc

Histogram(s) of expected and observed data distribution

Description

QC_histogram creates two histograms: one showing the observed data distribution of a numeric variable, and one showing the expected distribution. It includes the option to filter the data with the high-quality filter. histogram_series generates a series of such histograms for multiple filter settings.

Usage

QC_histogram(dataset, data_col = 1,
             save_name = "dataset", save_dir = getwd(),
             export_outliers = FALSE,
             filter_FRQ = NULL, filter_cal = NULL,
             filter_HWE = NULL, filter_imp = NULL,
             filter_NA = TRUE,
             filter_NA_FRQ = filter_NA, filter_NA_cal = filter_NA,
             filter_NA_HWE = filter_NA, filter_NA_imp = filter_NA,
             breaks = "Sturges",
             graph_name = colnames(dataset)[data_col],
             header_translations, check_impstatus = FALSE,
             ignore_impstatus = FALSE,
             T_strings = c("1", "TRUE", "yes", "YES", "y", "Y"),
             F_strings = c("0", "FALSE", "no", "NO", "n", "N"),
             NA_strings = c(NA, "NA", ".", "-"), ...)
histogram_series(dataset, data_col = 1,
   save_name = paste0("dataset_F", 1:nrow(plot_table)),
   save_dir = getwd(), export_outliers = FALSE,
   filter_FRQ = NULL, filter_cal = NULL,
   filter_HWE = NULL, filter_imp = NULL,
   filter_NA = TRUE,
   filter_NA_FRQ = filter_NA, filter_NA_cal = filter_NA,
   filter_NA_HWE = filter_NA, filter_NA_imp = filter_NA,
   breaks = "Sturges",
   header_translations, ignore_impstatus = FALSE,
   check_impstatus = FALSE,
   T_strings = c("1", "TRUE", "yes", "YES", "y", "Y"),
   F_strings = c("0", "FALSE", "no", "NO", "n", "N"),
   NA_strings = c(NA, "NA", ".", "-"),
   ...)
QC_histogram(dataset, data_col = 1,
             save_name = "dataset", save_dir = getwd(),
             export_outliers = FALSE,
             filter_FRQ = NULL, filter_cal = NULL,
             filter_HWE = NULL, filter_imp = NULL,
             filter_NA = TRUE,
             filter_NA_FRQ = filter_NA, filter_NA_cal = filter_NA,
             filter_NA_HWE = filter_NA, filter_NA_imp = filter_NA,
             breaks = "Sturges",
             graph_name = colnames(dataset)[data_col],
             header_translations, check_impstatus = FALSE,
             ignore_impstatus = FALSE,
             T_strings = c("1", "TRUE", "yes", "YES", "y", "Y"),
             F_strings = c("0", "FALSE", "no", "NO", "n", "N"),
             NA_strings = c(NA, "NA", ".", "-"), ...)
histogram_series(dataset, data_col = 1,
   save_name = paste0("dataset_F", 1:nrow(plot_table)),
   save_dir = getwd(), export_outliers = FALSE,
   filter_FRQ = NULL, filter_cal = NULL,
   filter_HWE = NULL, filter_imp = NULL,
   filter_NA = TRUE,
   filter_NA_FRQ = filter_NA, filter_NA_cal = filter_NA,
   filter_NA_HWE = filter_NA, filter_NA_imp = filter_NA,
   breaks = "Sturges",
   header_translations, ignore_impstatus = FALSE,
   check_impstatus = FALSE,
   T_strings = c("1", "TRUE", "yes", "YES", "y", "Y"),
   F_strings = c("0", "FALSE", "no", "NO", "n", "N"),
   NA_strings = c(NA, "NA", ".", "-"),
   ...)

Arguments

`dataset`	vector or table containing the variable of interest.
`data_col`	name or number of the column of `dataset` containing the variable of interest.
`save_name`	for `QC_histogram`, a character string; for `histogram_series`, a vector of character strings; specifying the filename(s) of the graph, without extension.
`save_dir`	character string; the directory where the output files are saved. Note that R uses forward slash (/) where Windows uses the backslash (\).
`export_outliers`	logical or numeric value; should outlying entries (which are excluded from the plot) be exported to an output file? If numeric, the number specifies the max. number of entries that is exported.
`filter_FRQ`, `filter_cal`, `filter_HWE`, `filter_imp`	Filter threshold-values for allele-frequency, callrate, HWE p-value and imputation quality, respectively. Passed to `HQ_filter`. `QC_histogram` takes only single values, but `histogram_series` accepts vectors as well (see 'details').
`filter_NA`	logical; if `TRUE`, then missing filter variables will be excluded; if `FALSE`, they will be ignored. `QC_histogram` takes only single values, but `histogram_series` accepts vectors as well (see 'Details'). `filter_NA` is the default setting for all variables; variable-specific settings can be specified with the following arguments.
`filter_NA_FRQ`, `filter_NA_cal`, `filter_NA_HWE`, `filter_NA_imp`	logical; variable-specific settings for `filter_NA`. These arguments are passed to `HQ_filter`.
`breaks`	argument passed to `hist`; determines the cell-borders in the histogram.
`graph_name`	character string; used in the title of the plot.
`header_translations`	translation table for column names. See `translate_header` for more information. If the argument is left empty, `dataset` is assumed to use the standard column-names used by `QC_GWAS`.
`check_impstatus`	logical; should `convert_impstatus` be called to convert the imputation-status column into standard values?
`ignore_impstatus`	logical; if `FALSE`, HWE p-value and callrate filters are applied only to genotyped SNPs, and imputation quality filters only to imputed SNPs. If `TRUE`, the filters are applied to all SNPs regardless of the imputation status.
`T_strings`, `F_strings`, `NA_strings`	arguments passed to `convert_impstatus`.
`...`	in `histogram_series`: arguments passed to `QC_histogram`; in `QC_histogram`, arguments passed to `hist`.

Details

histogram_series accepts multiple filter-values, and passes these one by one to QC_histogram to generate a series of histograms. For example, specifying:

filter_FRQ = c(0.05, 0.10), filter_cal = c(0.90, 0.95)

will generate two histograms. The first excludes SNPs with allele frequency < 0.05 or callrate < 0.90; the second allele frequency < 0.10 or callrate < 0.95. The same principle applies to the NA_filter settings. If the vectors submitted to the filter arguments are of unequal length, the shorter vector will be recycled until it equals the length of the longer (if possible). To filter missing values only, set the filter to NA and the corresponding NA-filter argument to TRUE. Setting the filter argument to NULL will disable the filter entirely, regardless of the NA filter setting.

Value

Both functions return an invisible value NULL.

Examples

## Not run: 
  data("gwa_sample")

  QC_histogram(dataset = gwa_sample, data_col = "EFFECT",
             save_name = "sample_histogram",
             filter_FRQ = 0.01, filter_cal = 0.95,
             filter_NA = FALSE,
             graph_name = "Effect size histogram")

  histogram_series(dataset = gwa_sample, data_col = "EFFECT",
             save_name = "sample_histogram",
             filter_FRQ = c(NA, 0.01, 0.01),
             filter_cal = c(NA, 0.95, 0.95),
             filter_NA = c(FALSE, FALSE, TRUE))

## End(Not run)## Not run: 
  data("gwa_sample")

  QC_histogram(dataset = gwa_sample, data_col = "EFFECT",
             save_name = "sample_histogram",
             filter_FRQ = 0.01, filter_cal = 0.95,
             filter_NA = FALSE,
             graph_name = "Effect size histogram")

  histogram_series(dataset = gwa_sample, data_col = "EFFECT",
             save_name = "sample_histogram",
             filter_FRQ = c(NA, 0.01, 0.01),
             filter_cal = c(NA, 0.95, 0.95),
             filter_NA = c(FALSE, FALSE, TRUE))

## End(Not run)

QQ and Manhattan plots

Description

This function creates the most important graphs of the QC: the QQ plots and the Manhattan plot. It also calculates lambda, and determines the effect of the filters.

Usage

QC_plots(dataset,
         plot_QQ = TRUE, plot_Man = TRUE,
         FRQfilter_values = NULL, FRQfilter_NA = filter_NA,
         HWEfilter_values = NULL, HWEfilter_NA = filter_NA,
         calfilter_values = NULL, calfilter_NA = filter_NA,
         impfilter_values = NULL, impfilter_NA = filter_NA,
         impfilter_min = min(dataset$IMP_QUALITY, na.rm = TRUE),
         manfilter_FRQ = NULL, manfilter_HWE = NULL,
         manfilter_cal = NULL, manfilter_imp = NULL,
         filter_NA = TRUE,
         plot_cutoff_p = 0.05, plot_names = FALSE,
         QQ_colors = c("red", "blue", "orange", "green3", "yellow"),
         plot_QQ_bands = FALSE,
         save_name = "dataset", save_dir = getwd(),
         header_translations, use_log = FALSE,
         check_impstatus = FALSE, ignore_impstatus = FALSE,
         T_strings = c("1", "TRUE", "yes", "YES", "y", "Y"),
         F_strings = c("0", "FALSE", "no", "NO", "n", "N"),
         NA_strings = c(NA, "NA", ".", "-"))
QC_plots(dataset,
         plot_QQ = TRUE, plot_Man = TRUE,
         FRQfilter_values = NULL, FRQfilter_NA = filter_NA,
         HWEfilter_values = NULL, HWEfilter_NA = filter_NA,
         calfilter_values = NULL, calfilter_NA = filter_NA,
         impfilter_values = NULL, impfilter_NA = filter_NA,
         impfilter_min = min(dataset$IMP_QUALITY, na.rm = TRUE),
         manfilter_FRQ = NULL, manfilter_HWE = NULL,
         manfilter_cal = NULL, manfilter_imp = NULL,
         filter_NA = TRUE,
         plot_cutoff_p = 0.05, plot_names = FALSE,
         QQ_colors = c("red", "blue", "orange", "green3", "yellow"),
         plot_QQ_bands = FALSE,
         save_name = "dataset", save_dir = getwd(),
         header_translations, use_log = FALSE,
         check_impstatus = FALSE, ignore_impstatus = FALSE,
         T_strings = c("1", "TRUE", "yes", "YES", "y", "Y"),
         F_strings = c("0", "FALSE", "no", "NO", "n", "N"),
         NA_strings = c(NA, "NA", ".", "-"))

Arguments

`dataset`	vector of p-values or a data frame containing the p-value column and (depending on the settings) columns for chromosome number, position, the quality parameters, sample size and imputation status.
`plot_QQ`, `plot_Man`	logical; should QQ and Manhattan plots be saved?
`FRQfilter_values`, `HWEfilter_values`, `calfilter_values`, `impfilter_values`	numeric vectors; the threshold values for the QQ plot filters. The filters are for allele-frequency, HWE p-values, callrate and imputation-quality parameters, respectively. A maximum of five values can be specified per parameter. Set to `NULL` to disable the QQ filter for that parameter. To filter missing values only, set the filter value to `NA` and the corresponding `filter_NA` argument to `TRUE`. The allele-frequency filter is two-sided: for a filter-value of `x`, it will exclude entries with `freq < x` and `freq > 1 - x`. Values >= 1 will be divided by the SNP's sample size. This allows sample-size dependent filtering of allele frequencies. Note that this uses the sample size reported in the sample-size column for that specific SNP. SNPs without sample size will be excluded if the corresponding `filter_NA` argument is `TRUE` and ignored if it is `FALSE`.
`FRQfilter_NA`, `HWEfilter_NA`, `calfilter_NA`, `impfilter_NA`, `filter_NA`	logical; should the filters exclude (`TRUE`) or ignore (`FALSE`) missing values? `filter_NA` is the default setting, the others allow variable-specific settings.
`impfilter_min`	numeric; the lowest possible value for imputation-quality. This argument is currently redundant, as it is calculated automatically.
`manfilter_FRQ`, `manfilter_HWE`, `manfilter_cal`, `manfilter_imp`	single, numeric values; the filter-settings for allele-frequency, HWE p-values, callrate and imputation quality respectively, for the Manhattan plot. The arguments are passed to `HQ_filter`. To filter missing values only, set to `NA` and the corresponding `filter_NA` argument to `TRUE`. To disable filtering entirely, set to `NULL`.
`plot_cutoff_p`	numeric; the threshold of p-values to be shown in the QQ & Manhattan plots. Higher (less significant) p-values are excluded from the plot. The default setting is `0.05`, which excludes 95% of data-points. It's not recommended to increase the value above `0.05`, as this may dramatically increase running time and memory usage.
`plot_names`	argument currently redundant.
`QQ_colors`	vector of R color-values; the color of the QQ filter-plots. The unfiltered data is black by default. This argument sets the colors of the least (first value) to most (last value) stringent filters. (For this setting, filter values `>= 1` (i.e. sample-size based filtering) are considered less stringent than values `< 1`.)
`plot_QQ_bands`	logical; should probability bands be added to the QQ plot?
`save_name`	character string; the filename, without extension, for the graphs.
`save_dir`	character string; the directory where the graphs are saved. Note that R uses forward slash (/) where Windows uses the backslash (\).
`header_translations`	translation table for column names. See `translate_header` for more information. If the argument is left empty, `dataset` is assumed to use the standard column-names of `QC_GWAS`.
`use_log`	argument used by `QC_GWAS`; redundant when `QC_plots` is used separately.
`check_impstatus`	logical; should the imputation-status column be passed to `convert_impstatus`?
`ignore_impstatus`	logical; if `FALSE`, HWE p-value and callrate filters are applied only to genotyped SNPs, and imputation quality filters only to imputed SNPs. If `TRUE`, the filters are applied to all SNPs regardless of the imputation status.
`T_strings`, `F_strings`, `NA_strings`	arguments passed to `convert_impstatus`.

Details

The function QC_plots grew out of phase 4 of QC_GWAS. It carries out three functions, hence the vague name: it calculates lambda, it applies the QQ filters, and it creates the QQ and Manhattan plots (a separate function is available for creating regional-association plots: see below). The function schematic is as follows:

Preparing the dataset: this step involves translating the dataset header to the standard column-names (by identify_column) and converting imputation status (by convert_impstatus). Both steps are optional, and are disabled by default. If the function cannot identify the imputation status column, it will generate a warning message and disable the imputation-status dependent filters.
Calculating the QC stats: here it generates the filters an calculates how many SNPs are removed. Lambda is also calculated at this point.
Creating a QQ graph of every variable for which filters have been specified. Every graph contains an unfiltered plot, plus plots for every effective filter. ("Effective" means "excludes more SNPs than the previous, less-stringent filter".)
Creating the Manhattan plot. The default Manhattan plot covers chromosomes 1 to 23 (X). Fields for XY, Y and M are added when such SNPs are present.

Value

An object of class 'list' with the following components:

`lambda`	vector of the lambda values of all SNPs, genotyped SNPs and imputed SNPs, respectively.
`ignore_impstatus`	logical value indicating whether imputation status was used when applying the filters.
`FRQfilter_names`, `HWEfilter_names`, `calfilter_names`, `impfilter_names`	character vectors naming the specified QQ filters.
`FRQfilter_N`, `HWEfilter_N`, `calfilter_N`, `impfilter_N`	numeric vectors; the number of SNPs removed by the specified filters. Note that the filters are sorted before being applied, so the order may not match that of the input. Check the `filter_names` output to see the order that was used inside `QC_plots`.
`Manfilter_N`	numeric; the number of SNPs removed by the Manhattan filter. This does not include those SNPs removed because they lacked p or chromosome/position-values, or failed the p-cutoff threshold.

Note

By default, QC_plots expects dataset to use the standard column-names used by QC_GWAS. A translation table can be specified in header_translations to allow non-standard names. See translate_header for more information.

The function accepts both integer and character chromosome values. Character values of "X", "Y", "XY" and "M" are automatically converted to integers. By default, the Manhattan plot shows all autosomal chromosomes and chromosome X. Fields for Y, XY and M are added only when such SNPs are present.

There must be more than 10 p-values at or below the plot_cutoff_p threshold for the QQ and Manhattan plots to be created.

Examples

  ## Not run: 
  data("gwa_sample")

  QC_plots(dataset = gwa_sample,
    plot_QQ = TRUE, plot_QQ_bands = TRUE, plot_Man = TRUE,
    FRQfilter_values = c(NA, 0.01, 0.05, 3),
    calfilter_values = c(NA, 0.95, 0.99),
    manfilter_FRQ = 0.05, manfilter_cal = 0.95,
    filter_NA = TRUE, save_name = "sample_plots")
    
## End(Not run)
## Not run: 
  data("gwa_sample")

  QC_plots(dataset = gwa_sample,
    plot_QQ = TRUE, plot_QQ_bands = TRUE, plot_Man = TRUE,
    FRQfilter_values = c(NA, 0.01, 0.05, 3),
    calfilter_values = c(NA, 0.95, 0.99),
    manfilter_FRQ = 0.05, manfilter_cal = 0.95,
    filter_NA = TRUE, save_name = "sample_plots")
    
## End(Not run)

QQ plot(s) of expected vs. reported p-values

Description

QQ_plot generates a simple QQ plot of the expected and reported p-value distribution. It includes the option to filter the data with the high-quality filter. QQ_series generates a series of such QQ plots for multiple filter settings.

Usage

QQ_plot(dataset, save_name = "dataset", save_dir = getwd(),
        filter_FRQ = NULL, filter_cal = NULL,
        filter_HWE = NULL, filter_imp = NULL,
        filter_NA = TRUE,
        filter_NA_FRQ = filter_NA, filter_NA_cal = filter_NA,
        filter_NA_HWE = filter_NA, filter_NA_imp = filter_NA,
        p_cutoff = 0.05, plot_QQ_bands = FALSE,
        header_translations,
        check_impstatus = FALSE, ignore_impstatus = FALSE,
        T_strings = c("1", "TRUE", "yes", "YES", "y", "Y"),
        F_strings = c("0", "FALSE", "no", "NO", "n", "N"),
        NA_strings = c(NA, "NA", ".", "-"), ...)
QQ_series(dataset, save_name = "dataset", save_dir = getwd(),
          filter_FRQ = NULL, filter_cal = NULL,
          filter_HWE = NULL, filter_imp = NULL,
          filter_NA = TRUE,
          filter_NA_FRQ = filter_NA, filter_NA_cal = filter_NA,
          filter_NA_HWE = filter_NA, filter_NA_imp = filter_NA,
          p_cutoff = 0.05, plot_QQ_bands = FALSE,
          header_translations,
          check_impstatus = FALSE, ignore_impstatus = FALSE,
          T_strings = c("1", "TRUE", "yes", "YES", "y", "Y"),
          F_strings = c("0", "FALSE", "no", "NO", "n", "N"),
          NA_strings = c(NA, "NA", ".", "-"), ...)
QQ_plot(dataset, save_name = "dataset", save_dir = getwd(),
        filter_FRQ = NULL, filter_cal = NULL,
        filter_HWE = NULL, filter_imp = NULL,
        filter_NA = TRUE,
        filter_NA_FRQ = filter_NA, filter_NA_cal = filter_NA,
        filter_NA_HWE = filter_NA, filter_NA_imp = filter_NA,
        p_cutoff = 0.05, plot_QQ_bands = FALSE,
        header_translations,
        check_impstatus = FALSE, ignore_impstatus = FALSE,
        T_strings = c("1", "TRUE", "yes", "YES", "y", "Y"),
        F_strings = c("0", "FALSE", "no", "NO", "n", "N"),
        NA_strings = c(NA, "NA", ".", "-"), ...)
QQ_series(dataset, save_name = "dataset", save_dir = getwd(),
          filter_FRQ = NULL, filter_cal = NULL,
          filter_HWE = NULL, filter_imp = NULL,
          filter_NA = TRUE,
          filter_NA_FRQ = filter_NA, filter_NA_cal = filter_NA,
          filter_NA_HWE = filter_NA, filter_NA_imp = filter_NA,
          p_cutoff = 0.05, plot_QQ_bands = FALSE,
          header_translations,
          check_impstatus = FALSE, ignore_impstatus = FALSE,
          T_strings = c("1", "TRUE", "yes", "YES", "y", "Y"),
          F_strings = c("0", "FALSE", "no", "NO", "n", "N"),
          NA_strings = c(NA, "NA", ".", "-"), ...)

Arguments

`dataset`	a data frame containing the p-value column and (depending on the settings) columns for chromosome number, position, the quality parameters, sample size and imputation status.
`save_name`	for `QQ_plot`, a character string; for `QQ_series`, a vector of character strings; specifying the filename(s) of the graph, without extension.
`save_dir`	character string; the directory where the output files are saved. Note that R uses forward slash (/) where Windows uses the backslash (\).
`filter_FRQ`, `filter_cal`, `filter_HWE`, `filter_imp`	Filter threshold-values for allele-frequency, callrate, HWE p-value and imputation quality, respectively. Passed to `HQ_filter`. `QQ_plot` takes only single values, but `QQ_series` accepts vectors as well (see 'details').
`filter_NA`	logical; if `TRUE`, then missing filter variables will be excluded; if `FALSE`, they will be ignored. `QQ_plot` takes only single values, but `QQ_series` accepts vectors as well (see 'Details'). `filter_NA` is the default setting for all variables; variable-specific settings can be specified with the following arguments.
`filter_NA_FRQ`, `filter_NA_cal`, `filter_NA_HWE`, `filter_NA_imp`	logical; variable-specific settings for `filter_NA`. These arguments are passed to `HQ_filter`.
`p_cutoff`	numeric; the threshold of p-values to be shown in the QQ plot(s). Higher (less significant) p-values are excluded from the plot. The default setting is `0.05`, which excludes 95% of data-points. It's not recommended to increase the value above `0.05`, as this may dramatically increase running time and memory usage.
`plot_QQ_bands`	logical; should probability bands be added to the QQ plot?
`header_translations`	translation table for column names. See `translate_header` for more information. If the argument is left empty, `dataset` is assumed to use the standard column-names of `QC_GWAS`.
`check_impstatus`	logical; should the imputation-status column be passed to `convert_impstatus`?
`ignore_impstatus`	logical; if `FALSE`, HWE p-value and callrate filters are applied only to genotyped SNPs, and imputation quality filters only to imputed SNPs. If `TRUE`, the filters are applied to all SNPs regardless of the imputation status.
`T_strings`, `F_strings`, `NA_strings`	arguments passed to `convert_impstatus`.
`...`	arguments passed to `plot`.

Details

QQ_series accepts multiple filter-values, and passes these one by one to QQ_plot to generate a series of plots. For example, specifying:

filter_FRQ = c(0.05, 0.10), filter_cal = c(0.90, 0.95)

will generate two plots. The first excludes SNPs with allele frequency < 0.05 or callrate < 0.90; the second allele frequency < 0.10 or callrate < 0.95. The same principle applies to the NA_filter settings. If the vectors submitted to the filter arguments are of unequal length, the shorter vector will be recycled until it equals the length of the longer (if possible). To filter missing values only, set the filter to NA and the corresponding NA-filter argument to TRUE. Setting the filter argument to NULL will disable the filter entirely, regardless of the NA-filter setting.

Value

Both functions return an invisible value NULL.

Examples

  ## Not run: 
    data("gwa_sample")
  
    QQ_plot(dataset = gwa_sample,
            save_name = "sample_QQ",
            filter_FRQ = 0.01, filter_cal = 0.95,
            filter_NA = FALSE)
  
    QQ_series(dataset = gwa_sample,
              save_name = "sample_QQ",
              filter_FRQ = c(NA, 0.01, 0.01),
              filter_cal = c(NA, 0.95, 0.95),
              filter_NA = c(FALSE, FALSE, TRUE))
  
## End(Not run)
## Not run: 
    data("gwa_sample")
  
    QQ_plot(dataset = gwa_sample,
            save_name = "sample_QQ",
            filter_FRQ = 0.01, filter_cal = 0.95,
            filter_NA = FALSE)
  
    QQ_series(dataset = gwa_sample,
              save_name = "sample_QQ",
              filter_FRQ = c(NA, 0.01, 0.01),
              filter_cal = c(NA, 0.95, 0.95),
              filter_NA = c(FALSE, FALSE, TRUE))
  
## End(Not run)

Log entries for QC_GWAS

Description

A subroutine for reporting problems encountered while running QC_GWAS. It's not designed for use outside of this context, so we recommend that users do not bother with it. It is described here for the sake of completeness only.

Usage

save_log(phaseL, checkL, typeL, SNPL = allSNPs,
         allSNPs = 1L, actionL, noteL = "", fileL)
save_log(phaseL, checkL, typeL, SNPL = allSNPs,
         allSNPs = 1L, actionL, noteL = "", fileL)

Arguments

`phaseL`, `checkL`, `typeL`	character-strings indicating (very briefly) what occurred to trigger a log entry.
`SNPL`	integer value indicating the number of SNPs affected.
`allSNPs`	integer value indicating the total number of SNPs in the dataset.
`actionL`	character-string describing very briefly the response to situation.
`noteL`	character-string for longer explanations of what happened.
`fileL`	character-string of the directory path and the file name, without file extension, of the log file.

Details

save_log is used to add entries to the log file the moment they are triggered (as opposed to waiting until the QC concludes and then saving the entire log). In case of a fatal crash, the log file should therefor give some indication of where in the QC it occurred.

save_log does not create the log file (this is done by QC_GWAS): it merely appends entries to the bottom of the specified file, regardless of what that file is. Hence it is not recommended to use the function outside of QC_GWAS.

Value

save_log returns an invisible NULL. The relevant output is the entry added to the log file.

Convert alleles to the opposing DNA strand

Description

This function is a subroutine of QC_GWAS and match_alleles. It converts allele-pairs to the configuration on the opposing DNA strand.

Usage

switch_strand(input, strand_col = FALSE)switch_strand(input, strand_col = FALSE)

Arguments

`input`	table with the alleles in column 1 and 2, and (optionally) the strand-information (coded as `"+"` or `"-"`) in column 3. Note: the alleles must be uppercase characters.
`strand_col`	logical; if strand-information is present, this switches the sign in column 3 as well.

Value

A table with two or three columns, depending on strand_col.

Examples

  data("gwa_sample")

  switched_data <- gwa_sample[ , c("MARKER", "EFFECT_ALL",
                                   "OTHER_ALL", "STRAND")]
  switched_data[1:10, ]

  switched_data[ , 2:4] <- switch_strand(input =
            switched_data[ , 2:4], strand_col = TRUE)
  switched_data[1:10, ]
data("gwa_sample")

  switched_data <- gwa_sample[ , c("MARKER", "EFFECT_ALL",
                                   "OTHER_ALL", "STRAND")]
  switched_data[1:10, ]

  switched_data[ , 2:4] <- switch_strand(input =
            switched_data[ , 2:4], strand_col = TRUE)
  switched_data[1:10, ]

Translate column names into standard names

Description

This function is used to translate non-standard column names into the standard ones used by QC_GWAS and other functions. It can also translate the standard names into user-specified names (via the out_header argument of QC_GWAS).

Usage

translate_header(header,
    standard = c("MARKER", "CHR", "POSITION", "EFFECT_ALL",
                 "OTHER_ALL", "STRAND", "EFFECT", "STDERR",
                 "PVALUE", "EFF_ALL_FREQ", "HWE_PVAL",
                 "CALLRATE", "N_TOTAL", "IMPUTED",
                 "USED_FOR_IMP", "IMP_QUALITY"),
    alternative)
translate_header(header,
    standard = c("MARKER", "CHR", "POSITION", "EFFECT_ALL",
                 "OTHER_ALL", "STRAND", "EFFECT", "STDERR",
                 "PVALUE", "EFF_ALL_FREQ", "HWE_PVAL",
                 "CALLRATE", "N_TOTAL", "IMPUTED",
                 "USED_FOR_IMP", "IMP_QUALITY"),
    alternative)

Arguments

`header`	character vector; the header to be translated.
`standard`	character vector; the names `header` should be translated into.
`alternative`	translation table; see 'Details' for more information.

Details

In a nutshell: the header argument contains the names you have; the standard argument contains the names you want; and alternative is the conversion table.

The table in alternative should have two columns. The left column contains the standard names; the right column possible alternatives. Only one alternative name should be listed per row. translate_header automatically changes the contents of header to uppercase, so standard and the right column of alternative should be uppercase as well.

A sample translation table is provided in the package data folder. It can be loaded via data("header_translations"). An editable .txt version can be found in the "R\library\QCGWAS\doc" folder.

Value

translate_header returns an object of class 'list' with 6 components:

`header_h`	character vector; the translated header. Unknown columns are included under their old names.
`missing_h`	character vector; the standard column names that were not found. If none, this returns `NULL`.
`unknown_h`	character vector; column names that could not be converted to a standard name. Note that these columns are also included in `header_h`. If none, this returns `NULL`.
`header_N`, `missing_N`, `unknown_N`	integer; the lengths of the above three vectors

Examples

  sample_data <-
    data.frame(SNP = paste("rs", 1:10, sep = ""),
               chrom = 2,
               effect = 1:10/10,
               misc = NA,
               stringsAsFactors = FALSE)
  # Creates a table with four columns:
  #   SNP, chrom, effect and misc.

  ( alt_headers <-
      data.frame(
        standard = c("MARKER", "MARKER", "CHR", "CHR"),
        alternative = c("MARKER", "SNP", "CHR", "CHROM"),
        stringsAsFactors = FALSE) )
  # Creates the translation table
  #  with the standard names in column 1 and the alternatives
  #  in column 2.
  
  ( header_info <- 
      translate_header(header = names(sample_data),
        standard = c("MARKER", "CHR", "EFFECT"),
        alternative = alt_headers)  )
        
  # Despite not being in the translation table, EFFECT is
  #  changed to uppercase because it is present in standard.
  #  misc is neither in standard or the translation table, so
  #  it is marked as unknown and left unchanged.

  names(sample_data) <- header_info$header_h
sample_data <-
    data.frame(SNP = paste("rs", 1:10, sep = ""),
               chrom = 2,
               effect = 1:10/10,
               misc = NA,
               stringsAsFactors = FALSE)
  # Creates a table with four columns:
  #   SNP, chrom, effect and misc.

  ( alt_headers <-
      data.frame(
        standard = c("MARKER", "MARKER", "CHR", "CHR"),
        alternative = c("MARKER", "SNP", "CHR", "CHROM"),
        stringsAsFactors = FALSE) )
  # Creates the translation table
  #  with the standard names in column 1 and the alternatives
  #  in column 2.
  
  ( header_info <- 
      translate_header(header = names(sample_data),
        standard = c("MARKER", "CHR", "EFFECT"),
        alternative = alt_headers)  )
        
  # Despite not being in the translation table, EFFECT is
  #  changed to uppercase because it is present in standard.
  #  misc is neither in standard or the translation table, so
  #  it is marked as unknown and left unchanged.

  names(sample_data) <- header_info$header_h

Package 'QCGWAS'

Help Index

Quality Control of Genome Wide Association Study Results

Description

Details

Author(s)

References

See Also

Skewness and Kurtosis

Description

Usage

Arguments

Details

Value

Note

See Also

Examples

Checking GWAS p-values

Description

Usage

Arguments

Details

Value

Examples

Convert imputation-status values to the QCGWAS standard

Description

Usage

Arguments

Details

Value

Note

Examples

Create an allele-reference file from HapMap data

Description

Usage

Arguments

Details

Value

References

See Also

Examples

Automated filtering and reformatting of GWAS results files

Description

Usage

Arguments

Details

Value

Note

Sample dataset for the QCGWAS package

Description

Usage

Format

Translation table for GWAS dataset headers

Description

Usage

Format

Select high-quality data in GWAS datasets

Description

Usage

Arguments

Details

Value

Note

Examples

Identify non-standard column names

Description

Usage

Arguments

Details

Value

See Also

Examples

Generates an intensity plot from x, y datasets

Description

Usage

Arguments

Details

Value

See Also

Easy loading of GWAS results files