Ceph is a free-software storage platform that implements object storage on a single distributed computer cluster and provides interfaces for object-, block-, and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalability to the exabyte level, and free availability.

Ceph replicates data and makes it fault-tolerant, using commodity hardware and requiring no specific hardware support. As a result of its design, the system is both self-healing and self-managing, aiming to minimize administration time and other costs.

Ceph Storage at Lehigh

LTS Research Computing provides a Ceph-based storage resource, also called Ceph. In Fall 2018, a 768TB storage cluster was designed, built, and deployed to replace the original Ceph cluster, a 1PB storage cluster.

How is Data Stored in Ceph?

Data is replicated across three disks on three nodes in three racks with distinct power feeds and network paths, securing it against the simultaneous failure of two full nodes in the primary data center. With current connectivity, the cluster supports an aggregate read/write speed of 3.75GB/s, with the capability to increase bandwidth as needed. The Ceph software performs daily and weekly data scrubbing to ensure replicas remain consistent. An option for daily snapshots of data, stored at a secondary data center on Lehigh's campus, is also available.

NOTE: Ceph does not do backups. If you need daily snapshots and storage for those snapshots, you need to purchase an additional block of Ceph storage. If you need backups, one alternative is to mount the Ceph project as a network drive and use CrashPlan to back up the contents of your Ceph project.

System Configuration

  • 7 storage nodes

    • One 16-core AMD EPYC 7351P, 2.4GHz

    • 128GB 2666MHz DDR4 RAM

    • Three Micron 1.9TB SATA 2.5-inch Enterprise SSDs

      • Total Raw Storage: 5.7TB for CephFS (Fast Tier)

    • Two Intel 240GB DC S4500 Enterprise SSDs (OS only)

    • 13 Seagate 8TB SATA HDDs

      • Total Raw Storage: 104TB for Ceph (Slow Tier)

    • 10 GbE and 1 GbE network interfaces

    • CentOS 7.x

  • Raw Storage: 728TB (Slow Tier) and 39.9TB (Fast Tier)

  • Available Storage: 206TB (Slow Tier) and 11.3TB (Fast Tier)

Why two tiers of storage?

The original Ceph cluster was designed for archival data storage. Around 2015, Research Computing decommissioned the storage resource on the HPC clusters of the time (Corona, Maia, Trits, Capella, and Cuda0) and used Ceph as the storage backend instead. This worked fine until Sol, built as a 34-node replacement cluster for Corona, Capella, Cuda0, and Trits, was expanded and upgraded to 56 nodes (81 nodes in Fall 2019) with 66 (120 in Fall 2019) NVIDIA GPUs. The increase in I/O from simulations on Sol caused instability in Ceph. After some research, it was decided that the Ceph replacement should include a fast tier of storage built using SSDs and based on the Ceph file system (CephFS) to handle I/O from the ever-expanding Sol cluster. The fast tier, CephFS, would provide a distributed global scratch space on Sol for writing simulation data from running jobs, while the slow tier, Ceph, would provide longer-term storage of simulation data.

How do I get access to Ceph storage?

To use Ceph as a storage device, Faculty, Staff, Departments, and Colleges need to purchase a storage project, minimum 1TB, for a duration of 5 years. The cost for 1TB of storage is $375 for the 5-year duration. The storage project can be shared with a named group of users, including students, at no charge. To request a Ceph storage project, please contact the Manager of Research Computing with the following information:

  • Name of the project; the default name is the PI's username followed by "group", e.g. alp514group

  • List of usernames of the users who will have access to the storage. The list can be modified at any time during the 5-year duration.

  • Amount of Storage desired (minimum 1TB)

  • Banner index to charge, with authorization from the Finance Manager

Ceph storage on Sol

Ceph is the storage backend for Sol. All users are provided with 150GB of home storage as part of the $50/user/year account fee. Principal Investigators are encouraged to purchase a Ceph storage project for their research group's use on Sol. Such PIs have the option of using their Ceph storage project for their home directories instead of the 150GB quota and having their annual user fees waived for the duration of the Ceph project (i.e. 5 years from purchase) for current and future users.

Using Ceph for storage

Ceph storage projects are shared using CIFS utilities and can be mounted as a network drive on Windows, macOS, and Linux. Ceph projects are mounted on Sol. Groups that use Ceph as their home directory have access to their projects when they log in to Sol. All others can access their Ceph projects at /share/ceph/projectname.
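For example, on a Linux machine connected to the campus network, a project could be mounted with the standard cifs-utils tools along the lines of the sketch below. The server name, share path, and domain shown here are placeholders, not the actual values; use the connection details provided by Research Computing when your project is created.

    # Install cifs-utils first (e.g. "yum install cifs-utils" on CentOS/RHEL)
    sudo mkdir -p /mnt/myproject

    # Placeholder server, share, and domain; substitute the values supplied for your project
    sudo mount -t cifs //ceph-server.example.lehigh.edu/projectname /mnt/myproject \
        -o username=yourlehighid,domain=EXAMPLE,uid=$(id -u),gid=$(id -g)

    # Unmount when finished
    sudo umount /mnt/myproject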

Using CephFS on Sol for running jobs

CephFS is available at /share/ceph/scratch/username on Sol login and compute nodes. Users should use CephFS storage for in-flight jobs only and are responsible for transferring simulation data from CephFS to their home directories or Ceph storage projects. The SLURM scheduler automatically creates a folder, ${SLURM_JOB_ID}, to store data generated by job ${SLURM_JOB_ID}. Users cannot create folders directly in CephFS, only subfolders within their ${SLURM_JOB_ID} folder. All data older than 7 days in CephFS will be deleted. It is the responsibility of the user to transfer data from CephFS to their home directories, Ceph project spaces, or external storage resources.
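A minimal SLURM batch script following this workflow might look like the sketch below. The partition name, solver executable, and input file are placeholders, and the script assumes the per-job folder is created under /share/ceph/scratch/username as described above.

    #!/bin/bash
    #SBATCH --job-name=cephfs-demo
    #SBATCH --partition=lts          # placeholder partition name
    #SBATCH --nodes=1
    #SBATCH --ntasks=16
    #SBATCH --time=04:00:00

    # Per-job directory created by the scheduler on the CephFS fast tier
    SCRATCH=/share/ceph/scratch/${USER}/${SLURM_JOB_ID}
    cd ${SCRATCH}

    # Stage input from home and run the (placeholder) solver
    cp ${HOME}/inputs/run.in .
    srun ${HOME}/bin/my_solver run.in > run.out

    # Copy results back to permanent storage before the 7-day purge
    mkdir -p ${HOME}/results/${SLURM_JOB_ID}
    cp -r ${SCRATCH}/* ${HOME}/results/${SLURM_JOB_ID}/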

Best Storage Practices on Sol

With multiple storage options (home/ceph, cephfs, and local scratch), it is the responsibility of users to develop a data management plan for their simulation data; a brief sketch of such a plan follows the list below.

  • home/ceph: permanent storage for the life of your account, limited to 150GB or the size of your Ceph project

  • cephfs: semi-permanent; data should be moved to permanent storage before the 7-day purge

  • local scratch: temporary, available only for in-flight jobs. This is 500GB of space on each compute node that is shared by all users assigned to that node.
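As an illustration of such a plan, the sketch below sweeps any data still sitting in CephFS scratch into a Ceph project before the 7-day purge removes it. The project name and destination layout are placeholders.

    # Run from a Sol login node; "projectname" is a placeholder for your Ceph project
    rsync -av --remove-source-files \
        /share/ceph/scratch/${USER}/ \
        /share/ceph/projectname/${USER}/archive/

    # Local scratch on a compute node is reachable only while your job is running,
    # so copy anything you need long-term off the node before the job ends.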
