-=RetroMap=-

 
   


Introduction

RetroMap is an application designed to help characterize LTR retroelements on a genome scale in a visually interactive manner. It is NOT particularly good at comprehensive identification of elements found in highly nested contexts. Only the innermost of a set of nested elements is likely to be identified as a complete element. The other elements will be treated as solo LTRs and internal regions, depending on how you have set up your element searches. However, it is still a handy way to get a quick visual overview of the situation and create pretty figures.

Any significant future changes will be reflected in this document. Please note that this software is highly experimental and has not been published yet though it is mentioned briefly in my 2004 Genome Biology paper. Until noted otherwise in this document, any citations should refer to the Genome Biology manuscript. I will work to address any problems you may have and incorporate your suggestions and requests for features as best as I am able. -=Brooke P-B


System Requirements

RetroMap is written in the Java language for enhanced cross-platform compatibility. However, there are currently a few external dependencies which have to be met.

  • First, ensure that you are using the most recent version of the RetroMap program, which may be downloaded from www.burchsite.com/bioi/java/RetroMap.jar. This will ensure that you have the latest features and bug fixes. The current version at this time is 0.021. You may check the version you have by selecting 'Help->About', which will display a simple dialog showing the program version.
  • Second, the machine running RetroMap must know how to run Java programs. If nothing happens at all when you start the executable RetroMap.jar file, then it is likely you will need to install a copy of the free Java Runtime from Sun. This MUST be version 5 or higher (it may be listed as version 1.5).
  • The free NCBI BLAST StandAlone software or at least the Blast2Sequences program of the BLAST suite must be installed in order to use the automatic LTR identification feature. If you are going to follow the tutorial below, then you will need to install the rest of the BLAST executables anyhow. In the future, I intend to add internal search functions to eliminate this reliance on external software.
  • RAM, and lots of it, when working with large datasets. The more the better. Hardware upgrades are beyond the scope of this document.

Tutorial

Concepts

RetroMap is set up to allow datasets to be combined with each other as the default. It is not very good at reversing the process so I would suggest that you keep that in mind when choosing when and what to save. This also means that the software will quite happily append data to existing files without warning you under the belief that it is your intention to do so.

Hidden data: some objects in the window allow you to hide them. RetroMap considers these objects (while hidden) to be completely unavailable to it. This means that information for hidden objects will not be saved to files or included in any search functions you perform while hidden. This is the only mechanism available for removing information from sessions and saved files.

Importing data files
RetroMap accepts several different file formats as input. It will automatically try to determine which file type you are attempting to import. If the file is not supported by this program the import will fail.

BLAST XML formatted output files are the primary source for data. RetroMap will import all BLAST hits from the XML file and, if necessary, condense overlapping hits into a single coverage format: overlapping hits are combined into a new hit whose boundaries encompass the greatest possible extent of the hits it replaces.
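
The condensing of overlapping hits can be pictured as an ordinary interval merge. The following is only a minimal Python sketch of that idea, not RetroMap's own code: the function name `merge_hits` and the `(start, end)` tuple representation are assumptions made for illustration, and it assumes that all hits lie on the same reference sequence with start <= end.

```python
def merge_hits(hits):
    """Merge overlapping (start, end) hits into single coverage intervals.

    Hypothetical representation: every hit is a (start, end) tuple on the
    same reference sequence, with start <= end.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: widen it to the greatest extent.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # No overlap: this hit starts a new coverage interval.
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(merge_hits([(100, 250), (200, 400), (900, 1000)]))
```

In this sketch the first two hits overlap and collapse into one interval, while the third stays separate. Whether RetroMap also joins hits that merely touch, or treats the two strands separately, is not stated here and is not modelled.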

The native RetroMap format is also XML based and comes in two types. After a BLAST file has been imported, your work may be saved as a RetroMap (*.hmx) file. A specialized XML format file (*.hgd) may be used to impose genome information, such as centromere locations, upon your dataset so that it can be drawn. This file also typically contains information about the sequences used to construct the BLAST database which was queried. Genome data files, pre-formatted BLAST databases, and source chromosome sequences are available for Arabidopsis thaliana, Drosophila melanogaster, Saccharomyces cerevisiae, and Schizosaccharomyces pombe.

Coordinates provided by RetroMap are relative to the orientation that the reference sequences are found in. An object found on the antisense strand relative to the reference sequence will have a start coordinate which is larger than the end coordinate.
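
As an illustration of this convention, the short Python sketch below recovers the strand and the left-to-right extent of an object from its start and end coordinates. The function name `describe_hit` is hypothetical, and the length calculation assumes 1-based inclusive coordinates, which this document does not state explicitly.

```python
def describe_hit(start, end):
    """Return (strand, leftmost, rightmost, length) for one object.

    Assumption: coordinates are 1-based and inclusive, so the length
    is rightmost - leftmost + 1.
    """
    # A start larger than the end marks the antisense strand.
    strand = "+" if start <= end else "-"
    left, right = min(start, end), max(start, end)
    return strand, left, right, right - left + 1


if __name__ == "__main__":
    print(describe_hit(4200, 5000))
    print(describe_hit(5000, 4200))
```

Both calls describe the same stretch of the reference sequence; only the reported strand differs.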

Phylogenetic information may be applied to the hits through import of a MEGA tree file (.tre, NOT a tree session file, .mts). Working with MEGA tree files will be covered in the Phylogenetic data section when I create it.

A step by step example

Here I'll walk you through everything you need to do to work with the features of the program using example data files which may be downloaded as indicated below.

  1. Prepare the sequences for the blast search. *NOTE* If you are just wanting to get a quick overview, the first 2 steps can be replaced with an XML BLAST search result file generated using the NCBI BLAST search pages for example. Those who are looking for a more interactive experience should follow along with this Drosophila chromosome arm example. You may use your own sequences in place of the Drosophila ones discussed here.
    1. Download the Drosophila melanogaster sequences.
    2. Rename the fasta headers for each sequence to something short but unique. This name is what RetroMap will use as the label for the large chromosome sequences or contigs. Examples of good identifiers are things like Chr1 or Dm_2L. The fasta header lines in each file should now look something like '>Dm_2L'.
    3. Create a large text document where you have appended all of the fasta sequences together. The file created for this example was called DmelGenomeV4.mfa. Tip: If you would like sequence retrieval to be as speedy as possible, ensure that the fasta files are set up to be as minimal as possible. This means that the header line should be followed by the sequence line, where the sequence is all on a single line and contains no non-DNA characters.
    4. Generate a BLAST database for the genome sequences. Blast commands will have to be run from a command or terminal prompt; see the Blast documentation for further details on available commands and settings. formatdb was used with the following command line:
      'formatdb -t "D. melanogaster genome rel. 4 blastdb created 2004-11-04" -i "DmelGenomeV4.mfa" -p F -o T -n DmelGenomeV4'
    5. Get an internal (located between the LTRs) known sequence to use as a blast query sequence to identify new retrotransposons. Save it as Endovir_IN.fan. This is a core (conserved region) of the integrase gene from the Arabidopsis thaliana Endovir1-1 Pseudoviridae retrotransposon.
  2. Run a first round blast search making sure you have set blast to generate XML output for the report. My command was:
    'blastall -p tblastn -d DmelGenomeV4 -i Endovir_IN.fan -m 7 -e 1e-5 -o IN_Rnd1.xml -F F -v 0 -b 1000000'
    Alternatively, a fasta file containing multiple fasta formatted sequences can be used for querying the database with a number of query sequences at the same time.
  3. Begin a RetroMap session
    1. Start RetroMap by double clicking on it for Windows and Macs, or by typing 'java -jar RetroMap.jar' in the directory where RetroMap.jar is located for Unix command lines. If neither of these works, please contact your system administrator or computer geek friend to ensure that java is properly installed and the java.exe executable can be found in your system path.
    2. Select File->Import (Ctrl+I) to open a file dialog. Navigate to the blast report you wish to use and click on Open. Note that this blast report file MUST be in XML format which you should have selected when running BLAST with the '-m 7' argument.
      1. A rudimentary 'Blast Import Options' window should open. For now just look at the 'Default output filename' and change it if you like with the 'Select Filename' button. This is the root save location and name for all RetroMap generated files. Click on 'OK' to start the import process.
      2. If all went well with the import, a new window called 'Main' should have opened on the RetroMap desktop. It displays the locations of all blast hits on their respective reference sequences from the database (a reference sequence is the 'subject' in blast parlance, and is the chromosome sequence in this example). You will probably want to adjust the size and position of this window.
      3. Problems with the import may be indicated by symptoms such as all hits being displayed on only one strand of the query sequence(s), or 'No definition line found' being displayed on items throughout the window. This usually means that the XML document was not structured in a way the importer knows how to handle. If these problems recur, please contact me so that I can fix this, as I consider it to be a bug. Please include a copy of the problematic XML file with your report.
    3. If you wish, you may save the current HitPlot (minus the scalebar currently, :-( sorry) by right clicking (or whatever needs to be done on your operating system to display popup menus) and clicking on 'Save Image'. Doing so will save a Scalable Vector Graphics (.svg) file in the current directory, with the same name as the one you selected during the import plus a '.svg' extension. This file may be opened and worked with in image editors capable of handling SVG files, such as Adobe Illustrator™.
    4. Go ahead and save the current data set so that you can revert to it if necessary. To save, follow these steps:
      1. Select File->Save or Ctrl+S to open the save dialog
      2. Select the 'Save HitMapper (hmx) data' and 'Save seq for hits' option buttons. You will see that the output names have now been set for those options. If you would like to change the base name for the output files, you can do so by clicking on the 'Change File' button. RetroMap native data will be saved to files with a '.hmx' extension, while the sequences will be saved with a '.fan' extension.
      3. When you are satisfied with the Save options, click on the 'Continue' button. Since RetroMap doesn't know where the source sequence files are located, it will prompt you to provide the source file for each sequence. The sequences may be contained in a single multi-FASTA file or in individual files. The file chooser title lists the name of the sequence that it currently wishes you to provide. The sequence MUST be in FASTA format and have a header beginning with the sequence name, e.g. '>SeqName Possibly other FASTA header info'. In this example, the title of my chooser says it is looking for Dm_2L. Navigate to the source file, select it and click the 'Open' button. If you are following along with my example, the source file will be named DmelGenomeV4.mfa. If you aren't, then you may have to provide the filenames for a number of sequences.
      4. It may take a while for RetroMap to index the fasta file (particularly for genome sized ones like the Drosophila example file) and the application will be unresponsive until indexing completes. The status bar at the bottom of the application will indicate when indexing has completed and tell you the number of sequences written. As long as the file and its location do not change, the indexing should only have to occur once. Please note that if a sequence file already exists, RetroMap will append to the end of it rather than replace it. However, the (.hmx) file WILL be replaced.
    5. Now we can attempt to find LTRs for all of the imported sequences. Select Tools->'Identify Complete Elements' or Ctrl+G to open the full-length element (LTR to LTR) identification dialog.
      1. If you are following along with the example, the dialog will appear and show a list of the reference sequences along with their file locations. If not, the dialog will ask that these be set using the 'Select file' button to tell RetroMap where the requested (if the blast file provides a name) source sequence file for the hits is located on your computer.
      2. Set a default name for the output file, e.g. test. The extensions are added automatically and noted in the dialog. If you fail to provide one, the default filename will be set to 'default', originally enough.
      3. Select the 'Save full length elements?' radio button and any others you'd like. This is currently the only chance you have to save the sequences for these.
      4. Click Continue. Since you haven't told the program where to find NCBI's bl2seq program yet, it will prompt you to find the file. Navigate to it and hit 'Open'. RetroMap will now start searching for LTRs for each of the blast hits you imported. For this reason, you really, really do not want to use LTRs as the queries when creating the blast report you import.
      5. RetroMap will appear to freeze while it performs the search. The text in the lower left of the status bar will say "Completed LTR Search" when the program finishes looking for LTRs. At this point you can continue to work with the program.
      6. Full-length sequences for those hits which appear to be part of elements with two LTRs will be found in the output files you selected.
      7. Selecting the 'Save full length elements' button makes RetroMap take a long time to retrieve the sequences: several minutes on my 2GHz 64-bit machine. This is because RetroMap assumes that your sequence file (DmelGenomeV4.mfa) is potentially full of non-sequence characters like spaces, newlines, and numbers. RetroMap is currently set up to minimize the amount of disk space it consumes when running, so it doesn't reformat your input sequence file into a version which enables much more efficient sequence retrieval. Doing so would entail creating new copies of your sequence files, which could consume a lot of space if you were working with a large eukaryotic genome. In the future I may add an option (or requirement!) to have RetroMap ensure that sequence files are formatted in a way that enables fast sequence retrieval.
    6. RetroMap can output its information on the objects it is displaying as a tab-delimited output file (*.tdf). This file is suitable for import into spreadsheet software such as MS Excel or Gnumeric.
      1. To do this select 'File->Save...'. This re-opens the save dialog we used above. This time select the 'Tab delimited data file' option. Change the filename if you need to and hit 'Continue'.
      2. Open the new file with a spreadsheet application. I'm writing this tutorial on a Linux machine right now, so I am going to use Gnumeric. You may have to go through an import dialog in your spreadsheet application to tell it that the data is in tab-delimited format.
      3. The top row is a header line providing headings for each of the columns in the table. The definitions follow.

        HitName: The name that has been assigned to a particular hit object.
        RefSeq: The name of the sequence that this object belongs to and is located on.
        Strand: A (+) or (-) symbol representing the orientation of the hit object relative to the RefSeq as it is found in the source fasta file.
        HitBeg: The position on the RefSeq where this hit object begins in its own sense orientation.
        HitEnd: The position on the RefSeq where this hit object ends in its own sense orientation. This means that a hit on the RefSeq's antisense strand will have a begin coordinate that is larger than its end coordinate.
        HitLength: The length of the original hit.
        LTR: This can be 5', 3', Solo, or unset.
        LTRbeg: The begin coordinate on the RefSeq if this row provides data about an LTR. Empty otherwise.
        LTRend: The end coordinate for this LTR if it is an LTR row.
        Tandem: True or False. True indicates that RetroMap believes that two elements share an LTR.
        Internal: True or False. True if this hit object is entirely contained or nested in another hit object's sequence.
        LTRscore: Higher scores are better. This represents how likely RetroMap believes the selected LTR is to be a genuine LTR. Currently it only: 1) checks to see if the LTR sequences begin with TG or TA and end with TCA or CA; 2) checks whether these residues are identical on both LTRs. Future improvements would include searching for a target-site duplication and nearby primer binding sites.
        LTRlength: Actual residue length of this particular LTR. Non-identical LTRs may have different lengths.
        numID: The number of identical aligned residues for the two LTRs.
        totalCompared: The length of the alignment between the two LTRs.
        %ID: ((numID / totalCompared) * 100)
        AgeEst: If you have provided a nucleotide substitution rate, RetroMap provides an age estimate in MYr for the time since an element with non-identical LTRs inserted.
        ElementLength: This encompasses the largest known bounds for the hit, which means that for a hit with LTRs it will represent the span from the beginning of the 5' LTR to the end of the 3' LTR.
        grp:groupname: This column will note members of a group you have set up by listing the group name next to members of that group. There will be as many of these columns as there are groups.
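
If you would rather process the tab-delimited file with a script than with a spreadsheet, something along the lines of the Python sketch below may be a starting point. It is only an illustration: it assumes that the header row uses the column names exactly as spelled above, the function names `read_tdf` and `percent_identity` are hypothetical, and the two sample rows are invented so that the example is self-contained. It recomputes %ID from numID and totalCompared using the formula given above.

```python
import csv
import io


def read_tdf(handle):
    """Parse a tab-delimited table whose first row is the header line."""
    return list(csv.DictReader(handle, delimiter="\t"))


def percent_identity(rows):
    """Yield (HitName, %ID) for rows that carry LTR alignment data."""
    for row in rows:
        num_id = (row.get("numID") or "").strip()
        total = (row.get("totalCompared") or "").strip()
        if not num_id or not total or int(total) == 0:
            continue  # no LTR alignment reported for this row
        yield row["HitName"], int(num_id) / int(total) * 100


if __name__ == "__main__":
    # Invented sample data; a real file would be opened with
    # open(path, newline="") and passed to read_tdf instead.
    sample = io.StringIO(
        "HitName\tRefSeq\tnumID\ttotalCompared\n"
        "hit1\tDm_2L\t495\t500\n"
        "hit2\tDm_2L\t\t\n"
    )
    for name, pid in percent_identity(read_tdf(sample)):
        print(f"{name}\t{pid:.1f}")
```

Rows without LTR alignment data (such as the second sample row) are skipped rather than reported as zero.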

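The tutorial recommends keeping FASTA files minimal, with each header line followed by the whole sequence on a single line. One way to produce such a file is sketched below in Python. This is not part of RetroMap: the function name `flatten_fasta` is hypothetical, and the sketch assumes that any non-letter character in a sequence line (spaces, digits, punctuation) can safely be dropped. Letters other than A, C, G and T, such as N, are kept.

```python
def flatten_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line.

    Assumption: only alphabetic characters in sequence lines are real
    sequence; spaces, digits and other characters are discarded.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                # Emit the record collected so far before starting the next.
                yield header
                yield "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    if header is not None:
        yield header
        yield "".join(chunks)


if __name__ == "__main__":
    messy = [">Dm_2L", "1 ACGT ACGT", "9 TTGA", ">Dm_2R", "ggcc", "aa"]
    print("\n".join(flatten_fasta(messy)))
```

The sketch holds one complete sequence in memory at a time, so a whole chromosome arm needs a corresponding amount of RAM. To process a real file, pass an open file handle in place of the `messy` list and write the yielded lines to a new file.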

    Bugs and Desired Enhancements

    Bugs

      Enhancements

      • Web interactivity for reference sequence imports
      • Custom chromosomal and element tag construction
      • Turn off hit merging across queries so that hits remain distinct by query
      • Move the save options from the Find complete elements dialog to the save menu
       
     

    Comments and Suggestions go to Brooke D. Peterson-Burch.
    Copyright 2002-2004 BioI. All rights reserved.
    Copyright 2002-2004 Brooke D. Peterson-Burch. All rights reserved.
    Information in this document is subject to change without notice.
    Any other products and companies referred to herein are trademarks or
    registered trademarks of their respective companies or mark holders.

    Last Modified: