Data Transfer Tools

Several tools are available for transferring data from external sources to HPC-supported resources. All but one of them are available on Sol and Hawk and can be run from the command line. Some of these tools can download data from web servers, cloud providers, and remote *nix systems.

cURL

cURL is a tool to transfer data from or to a server, using one of the supported protocols (DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, MQTT, POP3, POP3S, RTMP, RTMPS, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET or TFTP). The command is designed to work without user interaction.

cURL is installed system wide and does not require a module to be loaded.

Basic Example
curl www.example.com

By default, cURL writes what it retrieves to standard output (usually the terminal window), so on most systems the command above displays the HTML source of www.example.com in the terminal. Use the -o flag to store the output in a file instead:

curl -o example.html www.example.com

For more information, visit https://curl.se/ or view the cURL man page by typing man curl on the command prompt.
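The -o usage described above can be sketched in a script that also checks whether the transfer succeeded. This is only an illustration: the temporary directory, the file names and the file:// URL are assumptions chosen so that the sketch runs without network access (FILE is one of the protocols cURL supports); in practice the URL would be an http(s) address.

```shell
# Stand-in for a remote resource: a small local file read through curl's FILE protocol.
workdir=$(mktemp -d)
printf 'hello from curl\n' > "$workdir/source.txt"
url="file://$workdir/source.txt"   # hypothetical; replace with an http(s) URL in practice

# -sS : hide the progress meter but still print error messages
# -f  : return a non-zero exit status on HTTP errors instead of saving the error page
# -L  : follow redirects
# -o  : write to the named file instead of standard output
if curl -sS -f -L -o "$workdir/example.txt" "$url"; then
    echo "saved to $workdir/example.txt"
else
    echo "download failed" >&2
fi
```

Checking the exit status this way lets a job script stop early when a download fails instead of continuing with a missing or partial file.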

SCP

Secure copy protocol (SCP) is a means of securely transferring data between a local host and a remote host or between two remote hosts. It uses Secure Shell (SSH) for data transfer and uses the same mechanisms for authentication, thereby ensuring the authenticity and confidentiality of the data in transit. A client can send (upload) files to a server, optionally including their basic attributes (permissions, timestamps). Clients can also request files or directories from a server (download). 

SCP is installed system wide and does not require a module to be loaded.


Syntax
scp <options> <user>@<host>:/path/to/source/file <user>@<host>:/path/to/destination/file/or/directory

You can omit <user>@<host> for a path on the local system. If your username is the same on the local and remote systems, <user>@ can be omitted as well.

Data transfer between Sol and SSH Gateway
[alp514.ssh8](501): scp ~/.bashrc sol.cc.lehigh.edu:~/.bashrc.tmp

Commonly used options

Option    Description
-r        copy directories recursively
-p        preserve time stamps

For more information, view the scp man page by typing man scp on the command prompt.
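As a sketch of how the -r and -p options combine with the syntax above, the following builds one upload and one download command. The directory and file names are hypothetical, and the commands are printed rather than executed so that the sketch is safe to run anywhere; run the printed lines (or drop the echo) to perform the transfers.

```shell
# Account and host taken from the example above; all paths are hypothetical.
remote="alp514@sol.cc.lehigh.edu"
local_dir="$HOME/project/results"
remote_dir="~/results"   # quoted so that ~ is left for the remote side to expand

# Upload a whole directory: -r recurses into it, -p preserves time stamps.
upload_cmd="scp -r -p $local_dir $remote:$remote_dir"

# Download a single file from the remote directory into the current directory.
download_cmd="scp -p $remote:$remote_dir/summary.txt ."

# Print the commands instead of running them.
echo "$upload_cmd"
echo "$download_cmd"
```

Paths that contain spaces would need extra quoting before the printed commands are run.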


SFTP

SFTP is a command-line interface client program to transfer files using the SSH File Transfer Protocol (SFTP), which runs inside the encrypted Secure Shell connection. It provides an interactive interface similar to that of traditional command-line FTP clients.

SFTP is installed system wide and does not require a module to be loaded.

To start an sftp session, enter the following at the command prompt (replace user@host with your username and the hostname of the remote system):

sftp user@host

Example
[alp514.ssh8](502): sftp sol.cc.lehigh.edu
alp514@sol.cc.lehigh.edu's password:
Connected to sol.cc.lehigh.edu.
sftp>

At the sftp prompt, use Linux-style commands to navigate the file system and the get and put commands to transfer files from and to the remote system. Some standard sftp commands:

Command           Description
get               Copy a file from the remote host to the local computer
put               Copy a file from the local computer to the remote host
help (or ?)       Get help on the use of SFTP commands
exit (or quit)    Close the connection to the remote host and exit SFTP

For more information, view the sftp man page by typing man sftp on the command prompt.
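The get and put commands can also be run non-interactively by listing them in a batch file and passing it to sftp with the -b option. The sketch below only writes such a batch file and prints the command that would use it; the remote directory and file names are hypothetical. Batch mode cannot prompt for a password, so it assumes non-interactive authentication (for example SSH keys) is already set up.

```shell
# Write the sftp commands to a temporary batch file (file names are hypothetical).
batch_file=$(mktemp)
cat > "$batch_file" <<'EOF'
cd scratch
put input.dat
get output.log
quit
EOF

# Show the batch file and the command that would run it.
cat "$batch_file"
echo sftp -b "$batch_file" alp514@sol.cc.lehigh.edu
```

In batch mode sftp stops at the first command that fails, so the order of the lines matters.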

rsync

rsync is a fast and extraordinarily versatile file copying tool. It can copy locally, to/from another host over any remote shell, or to/from a remote rsync daemon. It offers a large number of options that control every aspect of its behavior and permit very flexible specification of the set of files to be copied. It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination. rsync is widely used for backups and mirroring and as an improved copy command for everyday use.

rsync is installed system wide and does not require a module to be loaded.

Generic Syntax:

rsync [OPTION] … SRC … [USER@]HOST:DEST
rsync [OPTION] … [USER@]HOST:SRC [DEST]

where SRC is the file or directory (or a list of multiple files and directories) to copy from, DEST is the file or directory to copy to, and square brackets indicate optional parameters.

Common Options

Option    Description
-a        archive mode
-r        recurse into directories
-v        increase verbosity
-z        compress file data during the transfer
-u        skip files that are newer on the DEST
-t        preserve modification times
-n        dry-run, perform a trial run with no changes made

For more information, view the rsync man page by typing man rsync on the command prompt.
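A minimal sketch of a typical workflow with these options: a trial run with -n first, then the real copy. To keep it runnable without a remote host, both directories here are temporary local directories (an assumption made for illustration); for a remote copy, the destination would take the [USER@]HOST:DEST form shown above.

```shell
# Build a small source tree in temporary directories (names are hypothetical).
src=$(mktemp -d)
dest=$(mktemp -d)
printf 'a\n' > "$src/a.txt"
mkdir "$src/sub"
printf 'b\n' > "$src/sub/b.txt"

# Trial run: -n lists what would be transferred without changing anything.
# The trailing slash on the source means "the contents of src", not the directory itself.
rsync -avn "$src/" "$dest/"

# Real run: -a (archive mode) recurses and preserves attributes, -v lists each file.
rsync -av "$src/" "$dest/"

# Show the result.
ls -R "$dest"
```

Running the second rsync command again transfers nothing new, since the destination already matches the source.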

Wget

GNU Wget or Wget is a free software package for retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely used Internet protocols. It is a non-interactive command line tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.

Wget is installed system wide and does not require a module to be loaded.

Basic Example
wget www.example.com

For more information, visit https://www.gnu.org/software/wget/ or view the Wget man page by typing man wget on the command prompt.
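Because Wget is non-interactive, it is often driven from a list of URLs. The sketch below writes such a list and prints a command that would fetch every entry; the URLs and the download directory are placeholders, and the command is printed rather than executed so the sketch makes no network requests.

```shell
# Write the URLs to fetch, one per line (placeholder addresses).
url_list=$(mktemp)
cat > "$url_list" <<'EOF'
https://www.example.com/
https://www.example.com/index.html
EOF

# -i : read the URLs from the given file
# -c : continue a partially downloaded file instead of starting over
# -P : directory in which to save the downloads (hypothetical path)
echo wget -c -i "$url_list" -P "$HOME/downloads"
```

The same command can be placed in a job script or cron entry, since it needs no terminal interaction.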

aria2

aria2 is a utility for downloading files. The supported protocols are HTTP(S), FTP, SFTP, BitTorrent, and Metalink. aria2 can download a file from multiple sources/protocols and tries to utilize your maximum download bandwidth. It supports downloading a file from HTTP(S)/FTP/SFTP and BitTorrent at the same time, while the data downloaded from HTTP(S)/FTP/SFTP is uploaded to the BitTorrent swarm. Using Metalink's chunk checksums, aria2 automatically validates chunks of data while downloading a file like BitTorrent.

Version    Module
1.35.0     aria2/1.35.0


Usage
[alp514.sol](1149): aria2c -h
Usage: aria2c [OPTIONS] [URI | MAGNET | TORRENT_FILE | METALINK_FILE]...

Example
aria2c -x10 http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr2.recalibrated_variants.vcf.gz


For more information, visit https://aria2.github.io

Axel

Axel is a program that downloads a file from an FTP or HTTP server through multiple connections. Each connection downloads its own part of the file. Unlike most other programs, Axel writes all the data directly to the destination file, which saves time at the end because the downloaded parts do not have to be concatenated.

Axel supports the HTTP, HTTPS, FTP and FTPS protocols.

Version    Module
2.16.1     axel/2.16.1


Usage
[alp514.sol](1142): axel -h
Usage: axel [options] url1 [url2] [url...]

--max-speed=x           -s x    Specify maximum speed (bytes per second)
--num-connections=x     -n x    Specify maximum number of connections
--max-redirect=x                Specify maximum number of redirections
--output=f              -o f    Specify local output file
--search[=n]            -S[n]   Search for mirrors and download from n servers
--ipv4                  -4      Use the IPv4 protocol
--ipv6                  -6      Use the IPv6 protocol
--header=x              -H x    Add HTTP header string
--user-agent=x          -U x    Set user agent
--no-proxy              -N      Just don't use any proxy server
--insecure              -k      Don't verify the SSL certificate
--no-clobber            -c      Skip download if file already exists
--quiet                 -q      Leave stdout alone
--verbose               -v      More status information
--alternate             -a      Alternate progress indicator
--help                  -h      This information
--timeout=x             -T x    Set I/O and connection timeout
--version               -V      Version information

Visit https://github.com/axel-download-accelerator/axel/issues to report bugs


Example
axel -n 10 http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr2.recalibrated_variants.vcf.gz

For more information, visit https://github.com/axel-download-accelerator/axel

Rclone

Rclone is a command-line program to manage files on cloud storage. It is a feature-rich alternative to cloud vendors' web storage interfaces. More details are available on the Rclone page.

Globus

Globus is a third-party, web-based tool for managing data transfers between two GridFTP endpoints. Sol and Hawk do not have a GridFTP endpoint, so you cannot transfer data from external sources directly to your home directory. Instead, first transfer the data to your local system or to Lehigh's Data Transfer Node (DTN), and then transfer it to Sol using scp, sftp or rsync as described above. More details are available on the Globus page.