Hotspot Detector for Copy Number Variants
Copy number variants (CNVs) are a major source of genetic variation, and are important in the research of many organisms. Comparing copy number variants between samples is important in elucidating their potential effects in a wide variety of biological contexts.
HD-CNV is a program that compares CNV regions detected across multiple samples, identifying recurrent regions by finding cliques in an interval graph generated from the input. It creates as output a unique, graphical representation of the data, as well as summary spreadsheets and UCSC track files. The interval graph, when viewed with other software or by automated graph analysis, is useful in identifying genomic regions of interest for further study.
Download the runnable jar here
Download the full Java source code here
- Other Packages Required to compile source: Swing (1.0.1), JGraphT (0.8.3)
Download newer version – here
- Includes some bug fixes, and adds reciprocal overlap and Gephi script output
Download the full Java source code here
- Other Packages Required to compile source: Swing (1.0.1), JGraphT (0.8.3)
Getting help with HD-CNV:
email: mlocke2 – at – uwo.ca
- Check out the video tutorial series
- How do I format my input files?
- How do I run HD-CNV?
- What is each output file?
- What can I do with the output?
- How do I visualize the tracks?
- How do I use the Gephi Script files? – newer version only
How do I format my input files?
HD-CNV takes CSV files as input. When you run HD-CNV, select the Input tab for an example of what the input file should look like. The following headers must be present, in the following order: ID, Chromosome, Start, End, Sample and Type.
ID: This column contains unique ID numbers for each CNV
Chromosome: This column contains which chromosome the CNV is on (files should be separated by chromosome)
Start: The starting genomic location of each CNV event
End: The ending genomic location of each CNV event
Sample: A numeric sample ID identifying which sample each CNV is from. The newer version also supports text IDs.
Type: Amplification or deletion. The newer version also supports numeric states.
In addition to these columns, you may have any extra columns you would like. These could be a p-value from the software you used to detect the CNVs, the number of known SNPs in each region, or just a column for notes. Any extra columns will be carried through the software if you check “Should the optional columns be reprinted in the final output files?”.
Once you have made sure all necessary columns exist in the files, save them as .csv files and place them all in the same directory. You should have one file for each chromosome, containing all the CNV events found on that chromosome in every sample. When you run HD-CNV, you provide the directory location where your files are, and it will analyze each file in the directory. Include only the .csv files you want analyzed in this directory.
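As a rough illustration of the layout described above, the following Python 3 sketch checks that a file starts with the six required headers in order, and splits a combined table into one file per chromosome. The function names and the "chr<N>.csv" output naming are assumptions made for this example; they are not part of HD-CNV.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Required leading columns, in the order HD-CNV expects them.
REQUIRED = ["ID", "Chromosome", "Start", "End", "Sample", "Type"]


def header_is_valid(header):
    """Return True if the first six columns match REQUIRED exactly.

    Extra columns after the required six are allowed.
    """
    return [h.strip() for h in header[:len(REQUIRED)]] == REQUIRED


def split_by_chromosome(rows):
    """Group data rows (lists of strings) by their Chromosome column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1].strip()].append(row)
    return dict(groups)


def write_per_chromosome(in_path, out_dir):
    """Write one CSV per chromosome into out_dir (hypothetical file naming)."""
    with open(in_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        if not header_is_valid(header):
            raise ValueError("unexpected header: %r" % header)
        groups = split_by_chromosome(list(reader))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for chrom, rows in groups.items():
        with open(out_dir / ("chr%s.csv" % chrom), "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)
    # Return the chromosome labels that were written, for a quick check.
    return sorted(groups)
```

Calling write_per_chromosome("all_cnvs.csv", "hdcnv_input") on a combined table (both names are placeholders) would leave a directory that contains only per-chromosome .csv files, which is the arrangement HD-CNV expects.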
Example 2: Formatting Complete Genomics Diversity Data Set for HDCNV
- The data were downloaded from the Complete Genomics Data ftp site. There are 46 individuals in this set, each given an ID like “NA06985”. The number portion of the ID varies between individuals. We downloaded the base folder ASM_Build37_2.0.0, but specific parts can be selected as required. For example, the CNV calls are in the file “Diversity/ASM_Build37_2.0.0/NA06985/NA06985-200-37-ASM-VAR-files.tar”, for each ID.
- The files of interest are the diploid CNV calls, which are in “ASM_Build37_2.0.0/NA06985/NA06985-200-37-ASM/GS00392-DNA_D01/ASM/CNV/cnvSegmentsDiploidBeta-NA06985-200-37-ASM.tsv”, for each ID. These must be decompressed before they can be used.
- The files were collected into a single directory using the following commands in a terminal (Linux/Apple):
>cd directory containing all the NA##### directories.
>mkdir cnvSegmentFileDir
>mv NA*/*/*/ASM/CNV/cnvSegmentsDiploid* ./cnvSegmentFileDir/
This makes a directory to hold the files, then moves them out of each CNV subfolder so they are all in the same directory.
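For systems without a Unix-style terminal, the same collection step can be sketched in Python 3. This is an illustrative equivalent, not part of the HD-CNV distribution: the function name is hypothetical, and unlike `mv` it copies the files, leaving the originals in place.

```python
import shutil
from pathlib import Path


def collect_cnv_segments(base_dir, dest_name="cnvSegmentFileDir"):
    """Copy every cnvSegmentsDiploid* file under base_dir into one folder.

    base_dir is the directory that holds the NA##### sample directories.
    Returns the sorted names of the copied files.
    """
    base = Path(base_dir)
    dest = base / dest_name
    dest.mkdir(exist_ok=True)
    copied = []
    # Same pattern as the shell glob: sample / assembly / DNA folder / ASM / CNV
    for src in base.glob("NA*/*/*/ASM/CNV/cnvSegmentsDiploid*"):
        if src.is_file():
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)
    return sorted(copied)
```

Run it with the directory that contains the NA##### folders as the argument; the returned list makes it easy to confirm that one file per individual was found.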
- We created an R script to translate these files into the HD-CNV input format. You must edit this file to include the correct input/output directories. Then either copy and paste the modified text into an R console, or type the following into the R console:
>source("locationOf/CG_to_HDNCV.R")
You can now run HD-CNV on the directory you specified in the R script above.
How do I run HD-CNV?
Double click on the jar file you downloaded from above, and HD-CNV will open. On the main tab, you can select two parameters. The percent required for overlap is how much two CNV events should overlap in their genomic location in order to consider them “merged”. The percent required for family is how much two CNV events should overlap in order to be considered part of the same family. These values default to 40% and 99%, respectively. See the image below for a description of the terms:
Type the full directory where your files are stored in the text box beside “Input directory:”.
If you wish to generate UCSC track files or graph files (which can be visualized to see a representation of each chromosome and any associated hotspots), check the appropriate boxes.
If you wish to add extra columns, click on the Input tab and select that you wish to have extra columns carried through. Type in the total number of columns present in your files. When you are done, click Submit Input.
Lastly, if you wish to have sample specific data output, click on the “Extra” tab. Here you can specify how many samples are present in your file, and for each sample, you can describe the genotype and tissue type. Type the sample number (numerical characters only) followed by the genotype and tissue type separated by commas. Each sample goes on its own line. When you are done, click Submit Sample Data. There is an example of this in the software.
Once you have entered all the data you wish to, on the main HD-CNV tab click Run. If you wish to exit the program, click Exit. If you wish to provide a new directory and start again, click Reset.
What is each output file?
There are four types of output files created: the standard output file, the summarized output file, the UCSC track files and the graph files.
One standard output file is created for each chromosome. It contains all original CNV events and data. For each CNV event in the file, it says what other CNV events this merged with, and what other CNV events are in the same family. The starting and ending base pair positions of the full merged region are also present for each CNV event.
One summarized output file is also created for each chromosome. It has all the merges identified on the chromosome, their starting and ending positions, which CNV events are present in them, and a count of how many events in merges come from different genotypes and tissue types (if that information was provided when the software was run).
UCSC genome browser track files are created if desired, and these can be used to visually inspect each CNV event and each merge.
Lastly, graph files are created if desired. These are .csv files containing the adjacency matrix used in analyzing the CNV data. These files can be visualized using graph visualization software (such as Gephi (Bastian 2009)) and can create a visual karyotype of the data. An example karyotype can be seen below:
- Each node represents a CNV event, and edges are added between nodes that share 40% (default) overlap.
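To make the graph construction concrete, here is a minimal Python 3 sketch that builds such an adjacency matrix from a list of CNV intervals. The overlap measure used here (shared length divided by the length of the longer event, with inclusive coordinates, which amounts to a reciprocal overlap) is an assumption for illustration; HD-CNV's own definition may differ, and the function names are hypothetical.

```python
def overlap_fraction(a, b):
    """Reciprocal overlap of two (start, end) intervals, between 0 and 1.

    Assumption: coordinates are inclusive, and the shared length is divided
    by the length of the longer interval.
    """
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if shared <= 0:
        return 0.0
    longest = max(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return shared / longest


def adjacency_matrix(intervals, threshold=0.40):
    """Return a symmetric 0/1 matrix; 1 marks a pair meeting the threshold."""
    n = len(intervals)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if overlap_fraction(intervals[i], intervals[j]) >= threshold:
                matrix[i][j] = matrix[j][i] = 1
    return matrix


# Two overlapping events and one distant event on the same chromosome.
events = [(100, 199), (150, 249), (1000, 1099)]
for row in adjacency_matrix(events):
    print(",".join(str(cell) for cell in row))
```

With these three events, only the first two are connected: they share 50 of 100 bases, which is above the 40% threshold, while the third event touches neither.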
What can I do with the output?
The output from HD-CNV can be used to identify hotspots and coldspots of CNV events, and can assist in identifying recurrent CNVs, either from clonal events or hotspots. This can be applied across multiple levels, from comparing samples from different populations right down to comparing different parts of the same tissue. CNV data from a variety of software sources (such as Partek’s Genotype Console, PennCNV or MouseDivGeno) can be entered into HD-CNV, and the software can be used to do concordance analysis.
The graph files allow you to visualize areas of interconnected CNVs, that is, regions where CNVs show up across samples with a strong amount of overlap. They let you quickly identify which chromosomes are “hot” or, by comparison, which are “cold”.
The UCSC track files can be visualized with the UCSC Genome Browser, or similar programs. The file is in BED format. More information is in the next section:
How do I visualize the tracks?
Check out the “Using UCSC Genome Browser Track File Output” Video Tutorial above.
The UCSC track files are all stored in the UCSC_Tracks folder in BED format. To visualize them in the browser, go to the genome browser (http://genome.ucsc.edu/), select the correct species and then select “Manage Custom Tracks”. You can then upload the “Summary_UCSCTrack.BED” from the track folder, which contains all merges and CNVs as two separate tracks.
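For readers who have not worked with BED custom tracks before, the Python 3 sketch below shows one plausible shape for such a file: a track line with itemRgb switched on, followed by nine-column BED lines coloured green for amplifications and red for deletions. The exact columns HD-CNV writes are not documented here, so the layout, the function names and the coordinate shift are assumptions for illustration only.

```python
# Colours follow the HD-CNV track description: green for amplifications,
# red for deletions.
COLOURS = {"Amplification": "0,255,0", "Deletion": "255,0,0"}


def bed_line(chrom, start, end, sample, cnv_type):
    """Format one CNV as a 9-column BED line with an itemRgb colour.

    Assumption: start and end are 1-based inclusive, so start is shifted
    by one to match BED's 0-based, half-open convention.
    """
    bed_start = start - 1
    fields = [
        "chr%s" % chrom, bed_start, end,   # chrom, chromStart, chromEnd
        sample,                            # name shown in the browser
        0, ".",                            # score, strand (unused here)
        bed_start, end,                    # thickStart, thickEnd
        COLOURS[cnv_type],                 # itemRgb
    ]
    return "\t".join(str(f) for f in fields)


def bed_track(name, cnvs):
    """Return the text of one custom track for a list of CNV tuples."""
    lines = ['track name="%s" itemRgb="On"' % name]
    lines += [bed_line(*cnv) for cnv in cnvs]
    return "\n".join(lines) + "\n"


# Each tuple is (chromosome, start, end, sample ID, type).
print(bed_track("CNVs", [(1, 1001, 2000, 7, "Deletion"),
                         (1, 1501, 2600, 9, "Amplification")]))
```

Text in this form can be pasted or uploaded through “Manage Custom Tracks” in the same way as the files HD-CNV produces.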
How do I use the Gephi Script files?
If you are using Gephi, you may have found the formatting to be tedious. There is a plugin available for Gephi called “Scripting”, which allows you to run script files which can automate a lot of the layouts/node colouring/node size etc.
To install the plugin, see their FAQ here: https://marketplace.gephi.org/faq/
Once installed, you will be able to run the scripts by
- Starting a new project
- Opening the console tab:
- You should now be able to enter commands in the Jython scripting language. See some information here
- To run a script file, as output from HDCNV, you would use the “execfile” command:
>>> execfile( "/Your/File/Location/Output/chr1Script.gy")
- There is a limit to the number of lines a script can contain, so HDCNV will split large graph files over multiple scripts, using numbered suffixes to indicate the order in which they need to be loaded. To load them all automatically, set the upper bound of the loop to one more than the number of scripts. For example, if you have 25 scripts, use a loop like:
>>> for i in range(1,26):
...     execfile( "/Your/Files/Location/Output/chr1Script_"+str(i)+".gy")
... *just hit enter
- Note: You need the tab space before execfile so that the console knows you are filling in the loop. When you hit enter on the empty line, the loop will run.
- You may still want to manually fix some overlapping edges if you desire. It is very hard to get the layouts to be perfect without adjusting them
- Not all the layouts in the Gephi environment are available in the scripting environment, but there are several listed in the Wiki linked above.
Now comes the fun stuff. There are a bunch of commands that allow you to run layouts automatically, and recolour nodes. Here is an example layout script that I made, which will make a heatmap coloured, fairly well laid out graph, ultimately using the Fruchterman Reingold layout (circular)
You can also export it to a PNG file, using:
If you are having issues with HD-CNV, try the following:
- Are your input files correct? Make sure all the necessary headings are present in the right order
- Did you give it the correct directory? Make sure no other files are present except those files you wish to analyze
- Is your data in the correct format? Make sure the IDs, chromosome, start, end and sample columns only contain numbers and no letters or special characters
- If you are unsure what the problem is, you can look at the source code to potentially identify what is going wrong. It is well documented and openly available.
- Lastly, if you are still having problems, feel free to email Jenna Butler at email@example.com for assistance
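The checks in this list can be partly automated before launching HD-CNV. The Python 3 sketch below is a hypothetical pre-flight script, not part of the software: it flags files in the input directory that are not .csv files and reports any value in the ID, Chromosome, Start, End or Sample columns that is not a whole number. It follows the numbers-only rule stated above, so it will also flag the text sample IDs that the newer version accepts.

```python
import csv
from pathlib import Path

# Column positions of the fields that should contain only digits.
NUMERIC_COLUMNS = {"ID": 0, "Chromosome": 1, "Start": 2, "End": 3, "Sample": 4}


def row_problems(row, line_number):
    """Return a list of messages for one data row (a list of strings)."""
    problems = []
    for name, index in NUMERIC_COLUMNS.items():
        value = row[index].strip() if index < len(row) else ""
        if not value.isdecimal():
            problems.append("line %d: %s is not a whole number: %r"
                            % (line_number, name, value))
    return problems


def check_directory(directory):
    """Report stray files and non-numeric values in an input directory."""
    problems = []
    for path in sorted(Path(directory).iterdir()):
        if path.suffix.lower() != ".csv":
            problems.append("%s: not a .csv file" % path.name)
            continue
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            for number, row in enumerate(reader, start=2):
                if not row:
                    continue  # ignore blank lines
                problems += ["%s, %s" % (path.name, message)
                             for message in row_problems(row, number)]
    return problems
```

An empty list from check_directory means none of these particular problems were found; it does not guarantee that HD-CNV will accept the files.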
Each copy number variant from the input file is added to the UCSC track output. A copy number variant is a large (>1Kb) region where the number of copies, when detected by experimental methods, is not what is expected based on the reference genome. This can be either an amplification (coloured green in the HD-CNV track) or a deletion (coloured red in the HD-CNV track). The sample ID is used to label each track entry.
A merged region is the output of the HD-CNV program. It identifies a region of overlap among CNVs present in the input files. The amount of overlap required to have two CNVs considered in a merge is defined by the input parameters to HD-CNV (default 40%). A merge is reported whenever a group of CNVs are each overlapping with all others in the group. Merges are labeled with a Merge ID, which corresponds to the row ID in the tabular output files.
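The clique condition above can be sketched in a few lines of Python 3. This is only an illustration under stated assumptions: overlap is measured as shared length over the longer event, merges are taken to be maximal cliques of at least two events, and the enumeration uses the textbook Bron-Kerbosch recursion rather than whatever HD-CNV implements internally. All function names are hypothetical.

```python
def overlaps(a, b, threshold=0.40):
    """True if two inclusive (start, end) intervals meet the threshold.

    Assumption: the shared length is compared with the longer interval.
    """
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    longest = max(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return shared > 0 and shared / longest >= threshold


def find_merges(intervals, threshold=0.40):
    """Return maximal cliques of size >= 2 as sorted lists of indices."""
    n = len(intervals)
    neighbours = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if overlaps(intervals[i], intervals[j], threshold):
                neighbours[i].add(j)
                neighbours[j].add(i)

    merges = []

    def expand(clique, candidates, excluded):
        # Basic Bron-Kerbosch: a clique is maximal when nothing can extend it.
        if not candidates and not excluded:
            if len(clique) > 1:
                merges.append(sorted(clique))
            return
        for v in list(candidates):
            expand(clique | {v},
                   candidates & neighbours[v],
                   excluded & neighbours[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(range(n)), set())
    return sorted(merges)


# A overlaps B and B overlaps C, but A and C do not overlap each other,
# so the three events form two merges rather than one.
print(find_merges([(0, 99), (50, 149), (100, 199)]))
```

The example illustrates why overlap alone is not enough: every member of a merge must overlap every other member, so a chain of pairwise overlaps is reported as separate merges.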