I’ve been looking at building a robust NAS system over the past few days, with a focus on budget and on data redundancy via cold storage and off-site backups. I settled on the Raspberry Pi with renewed hard disk drives since it was an affordable option for a ton of data. I have roughly 4 TB of data, but I was trying to get the most storage that I could afford to give me many years to grow, and to make use of features such as ZFS cache drives, secondary logs, snapshots, and RAIDZ arrays that improve transfer speeds and the ability to recover from data loss. Most of my planning went into the maximum speeds that a Raspberry Pi could handle on its PCIe 2.0 bus when promoted to PCIe 3.0, and into increasing network speeds by using a 2.5 Gbps adapter on the USB 3.0 port. I was pushing the Raspberry Pi to its limit because of a limitation of file systems in general: everything must be on the same device in order to create one large pool of data.
Enter FrankeNAS with NetworkChuck, who demonstrated how to create a large pool of storage from a mismatched set of NAS devices of different ages, sizes, and speeds.
| Task | ZFS | Ceph |
|---|---|---|
| Licensing | Open-Source: Common Development and Distribution License (CDDL) | Open-Source: Lesser General Public License version 2.1 (LGPL-2.1) |
| Add Storage Capacity A: Running out of space | A: Replace all drives, one drive at a time, wait a few hours/days to resilver each volume. B: Add a duplicate pool of the same size and add more hard drives to the same machine. | A: Add a new drive to a device – any capacity/speed B: Add a new device to the cluster |
| Scalability | A: Limited by storage medium and latest protocols / interfaces on motherboard B: Single Node with addressable space of 256 Quadrillion Zettabytes C: Has been used to manage 100s of petabytes of data | A: No hard limit B: Multiple devices. Limited by network bandwidth, storage devices, and processing power C: Has been used to store data in exabytes |
| Data Transfer Speeds | Limited by number of bonded ethernet ports and speeds on device (PCIe bus), and number of simultaneous requests from multiple users. | Parallel access across multiple nodes, limited by speed of client interface and router speed. E.g., a 10 Gbps client can receive files from up to 10 nodes running at 1 Gbps in parallel, or 100 nodes running at 100 Mbps. |
| Reduce storage capacity | A: Remove drive B: Unhealthy state | A: Remove drive B: Automated redistribution |
| Drive recovery A: Corrupted drive | A: Replace failed volume B: Have hot-spare registered to take over immediately | A: Automated redistribution B: Replace volume not critical |
| Disaster recovery A: Server destroyed B: Server inoperable | Nothing works. A: Use failover NAS B: Rebuild from cold storage C: Rebuild from remote NAS | Everything still works. A: Replace failed device at will |
| Snapshots A: Server infected B: Ransomware | A: Recover from remote snapshots B: Rebuild from cold storage | Recover from RBD snapshots |
| At-rest encryption: A: Lost disks / Theft B: Untrusted storage location | A: Add layer of LUKS, VeraCrypt, etc. B: Use self-encrypting drives C: Encrypt specific datasets and use KMIP to store keys Note: Large performance impact on Deduplication. Based on how encryption is layered, inline compression may become unusable. | A: Object Gateway supports server-side encryption when files are sent via HTTPS B: Client manages encryption keys and provides them with each request C: Ceph uses key management service to retrieve keys on demand D: Ceph uses SSE-S3 to create/delete keys on demand |
| End-to-End encryption A: Packet sniffing B: Man-in-the-middle attack | Use secure protocols: SSL/TLS, PGP, AS2, FTPS, SFTP, HTTPS. Use SMB 3.0 or later for network attached storage. | Data moving through the network between daemons and clients can be encrypted via the messenger version 2 protocol (secure mode) |
| Mirroring / Replication | Snapshots sent to another ZFS file system | RBD Journaling can be used to mirror all images within a pool, or a specific subset to another ceph cluster |
| Geography A: Cross-Site Replication B: Two-Way Replication | A: Not built-in. Backups via snapshots / replication offsite (ZFS send/receive) B: Manual or Scripted Replication via rsync + ZFS send/receive C: Failovers usually involve having second system activated after primary system fails D: TrueNAS offers some support for two-way replication with extra steps, but manual interaction required to handle conflicts | A: Nodes within the cluster can be remote, but require high-bandwidth and low latency (1 Gbps+ < 100ms latency). MON should be at remote location to maintain quorum. B: Backups via snapshots / replication C: Setup separate cluster and link as geo-replication |
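The Data Transfer Speeds row comes down to simple arithmetic, which can be sketched in a few lines of Python. This is a best-case estimate only: it assumes no protocol overhead and no shared bottleneck between the nodes and the client, and the function names are made up for illustration.

```python
import math

def aggregate_mbps(node_count, node_mbps, client_mbps):
    """Best-case read rate: sum of the node links, capped by the client link."""
    return min(node_count * node_mbps, client_mbps)

def nodes_to_saturate(client_mbps, node_mbps):
    """Smallest number of nodes whose combined links fill the client link."""
    return math.ceil(client_mbps / node_mbps)

# Figures from the table: a 10 Gbps client fed by 1 Gbps or 100 Mbps nodes.
for node_mbps in (1_000, 100):
    count = nodes_to_saturate(10_000, node_mbps)
    print(f"{count} nodes at {node_mbps} Mbps fill a 10 Gbps client link")
```

With fewer nodes than that, the client link is not the limit: `aggregate_mbps(5, 1_000, 10_000)` returns 5000.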
I was looking at ZFS as a budget-friendly way to build a NAS whose data could be recovered after a failure, and I was planning to set up multiple devices as failover solutions (cold storage and remote replication). The system as a whole is limited by its total storage capacity, and upgrading that capacity over time becomes a lengthy process that can take days while data is resilvered. For example, it takes about 22 hours to write 20 TB of data at 250 MB/s. If you want to upgrade a ZFS RAIDZ array with five drives one drive at a time, that’s about five days to complete the process while crossing your fingers that another drive doesn’t fail during that time.
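The arithmetic behind those numbers can be checked with a short Python sketch. It assumes decimal units (1 TB = 1,000,000 MB), a sustained write speed for the whole run, and drives replaced strictly one after another. A real resilver depends on how full the pool is, so treat this as an upper-bound estimate.

```python
def hours_to_write(capacity_tb, speed_mb_s):
    """Hours needed to write capacity_tb terabytes at speed_mb_s megabytes/second."""
    seconds = capacity_tb * 1_000_000 / speed_mb_s  # 1 TB = 1,000,000 MB
    return seconds / 3600

per_drive_hours = hours_to_write(20, 250)
total_days = 5 * per_drive_hours / 24  # five drives, replaced in sequence

print(f"one drive: {per_drive_hours:.1f} h")              # one drive: 22.2 h
print(f"five drives in sequence: {total_days:.1f} days")  # five drives in sequence: 4.6 days
```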
In the past, statistics from data centers were showing that hard drives often have a high failure rate when first installed, then dropping drastically until 3-5 years pass when the failure rate gradually builds up again as the drives wear out. When graphed out, this is often referred to as the bathtub curve. However, that curve has been changing as technologies, shipping, and manufacturing processes improve.

Drive Failure Over Time: The Bathtub Curve Is Leaking
This initial high rate is due to drives that are dead on arrival, manufacturing defects, shipping and handling damage, installation stress, firmware issues, high workload from initial setup, or compatibility issues. If you have a RAID5 or RAIDZ1 setup, any drive that fails during the resilvering process, other than the drive being resilvered, will cause the pool to fail. RAID6 or RAIDZ2 survives two failed drives in total, and RAIDZ3 (or the non-standard RAID7) survives three.
Ceph
So – what is Ceph? How does it work? Can I add my existing NAS? What about ZFS?
You can think of Ceph as a distributed NAS or File System. Rather than having multiple drives within one server, you have multiple servers in one cluster. Each server may expose multiple drives individually, or combine them as a single volume.

There are several key components that make Ceph work.
- OSD Nodes are physical machines that have storage with one or more OSD daemons to store data and handle replication, recovery, and rebalancing.
- Object Storage Daemons (OSD) are usually assigned to one individual hard drive each. A daemon can also manage a set of disks in RAID and RAID-like pools, or a logical volume (LVM) that groups multiple drives into a single volume. The daemon itself is responsible for storing the actual data as well as handling tasks such as data replication, data placement, recovery, and rebalancing within the cluster.
- Monitors (MON) talk to OSDs to ensure the health of the cluster. They maintain the cluster map, which records which OSDs are up or down and the rules used to compute where data is placed, and they ensure that changes are propagated correctly to all OSDs. Clusters usually have three or more monitors on separate devices for fault tolerance and to avoid a single point of failure. A majority of monitors must agree (a quorum) for the cluster to continue operating after a failure.
- Managers (MGR) gather performance metrics and provide a REST API for monitoring and managing the cluster. Clusters typically have two or more manager nodes for high availability. The Manager helps with managing the cluster’s configuration and status checks.
- Metadata Servers (MDS) are optional, and are only needed when the cluster uses its own file system, the Ceph File System (CephFS). They manage metadata such as directories, file names, and permissions. Multiple Metadata Servers may be configured for scalability and fault tolerance.

Ceph has a few models that can be used to store the data as well.
- Object storage (RADOS Gateway – RGW) is compatible with interfaces like Amazon S3 and OpenStack Swift.
- Block Storage (RBD) is often used with virtual machines and other systems, similar to traditional SAN (Storage Area Network)
- File System (CephFS) sits on top of Object Storage as a POSIX-compliant file storage allowing shared storage for applications.
Data is distributed across all of the OSDs in the cluster using the Controlled Replication Under Scalable Hashing (CRUSH) algorithm. Rather than using a central lookup table, it calculates the placement of data with a hash-based algorithm to increase scalability and reduce single points of failure. Data is replicated across multiple OSDs to provide fault tolerance, with three copies by default (similar to RAIDZ2 or RAID6 in that two can be lost). In this configuration, you only have access to a third of raw capacity, but the primary benefits are that it’s not localized to a single machine failing, and you can scale the capacity by adding more devices to the cluster. Ceph also supports erasure coding for more efficient storage, where data is split into fragments with parity information instead of being replicated, reducing the storage overhead while still providing fault tolerance.
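To get a feel for placement by calculation instead of by lookup table, here is a toy Python sketch. It is not CRUSH: it uses rendezvous (highest-random-weight) hashing as a much simpler stand-in, and the OSD names and object name are hypothetical. It only illustrates that any client holding the same OSD list computes the same three locations on its own.

```python
import hashlib
from typing import List

def place(object_name: str, osds: List[str], replicas: int = 3) -> List[str]:
    """Pick `replicas` distinct OSDs for an object using nothing but a hash.

    Rendezvous hashing: score every (object, OSD) pair and keep the top
    scorers. A stand-in for CRUSH, not the real algorithm.
    """
    def score(osd: str) -> int:
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(6)]  # hypothetical six-OSD cluster
copies = place("photo-0001.jpg", osds)
print(copies)

# The order in which a client learned about the OSDs does not matter.
assert copies == place("photo-0001.jpg", list(reversed(osds)))

# Removing an OSD that held no copy leaves this object's placement untouched.
unused = next(o for o in osds if o not in copies)
assert place("photo-0001.jpg", [o for o in osds if o != unused]) == copies
```

Real CRUSH also accounts for device weights and failure domains such as host or rack, which this sketch leaves out entirely.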
Ceph automatically detects OSD failures and starts a recovery process by redistributing the data to healthy OSDs without manual intervention. With RAID and RAID-like systems, this usually involves having a hot spare on standby, or manually replacing the drive and telling the system to start resilvering. When OSDs are added or removed, Ceph automatically rebalances the data across the cluster.
Although Ceph does not use ZFS, each individual device can still use ZFS locally. However, Ceph already offers many of the features that ZFS offers such as snapshots, data placement, and RAID-like features. In turn, using both Ceph and ZFS has disadvantages such as less storage volume when using both local and distributed RAID, and the resource demand of large RAM, cache drives, and secondary log drives that are often used to improve ZFS performance.
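The capacity cost of stacking local and distributed redundancy is easy to estimate. The Python sketch below assumes a hypothetical cluster of three nodes with five 4 TB drives each, Ceph's three-way replication, and a simplified RAIDZ model that ignores metadata and padding overhead.

```python
def raidz_usable_tb(drives, drive_tb, parity):
    """Approximate usable TB of one RAIDZ vdev (ignores metadata and padding)."""
    return (drives - parity) * drive_tb

def replicated_usable_tb(raw_tb, replicas=3):
    """Usable TB of a replicated pool: every object is stored `replicas` times."""
    return raw_tb / replicas

# Hypothetical cluster: 3 nodes, each with five 4 TB drives.
nodes, drives, drive_tb = 3, 5, 4
raw_tb = nodes * drives * drive_tb

# Ceph alone: each drive is its own OSD, replicated three ways.
ceph_only = replicated_usable_tb(raw_tb)

# RAIDZ1 on each node first, then Ceph replication on top of that.
stacked = replicated_usable_tb(nodes * raidz_usable_tb(drives, drive_tb, parity=1))

print(f"raw {raw_tb} TB, Ceph only {ceph_only:.0f} TB, RAIDZ1 + Ceph {stacked:.0f} TB")
# raw 60 TB, Ceph only 20 TB, RAIDZ1 + Ceph 16 TB
```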
If you don’t have access to the OS on a NAS, or the NAS doesn’t have the resources to run the daemons, you could use virtual machines or Docker containers on another device to expose the NAS to the cluster, or mount the SMB share on a Ceph node. However, this setup has limited redundancy and performance benefits, and the NAS isn’t participating directly in the Ceph cluster. It could be used as a backup target for snapshots. You could also set up a RADOS gateway to interact directly with the SMB share. In the end, you can use these devices, but it’s preferable to install the Ceph software to expose native block-level access to the storage medium.
Change of Plan
I was laser focused on the ZFS file system and what it offered. However, I kept adding cost to support larger capacity storage, fast transfer speeds, and recovery. With Ceph, I can use less RAM, fewer hard drives, and slower network interfaces per machine, and distribute the load over multiple devices.
So what exactly does that mean? Where am I going to find that kind of benefit?
Simple. Zero.
The Raspberry Pi Zero W is a very inexpensive board that you can purchase for about $15. It has built-in WiFi, a 1 GHz CPU, and 512 MB of RAM. This pales in comparison to a Raspberry Pi 5 with 8 GB of RAM for $80, but I can purchase five Zeros for about $75 and have great data redundancy and speeds via parallel access. However, I still have to purchase a power supply, MicroSD card, and SATA adapter for each one as well. The benefits, however, seem substantial. I no longer need to plan for a Primary/Secondary NAS with two cold storage devices to swap. That’s four machines, four copies of data, and tons of redundancy within each copy. The Ceph approach allows me to get tons of redundancy and automated healing with only one set of data and one replication scheme. And it’s expandable. I don’t need to plan for maximum storage space for the cheapest price. I can just buy the storage that I need now, and add another node later.
Minimum Requirements
So here are some questions. What are the minimum requirements to run Ceph? How much RAM does it need? Can it run over WiFi? Is it compatible with USB drives? Can it use flash drives, or is it limited to SSDs and HDDs? If there are restrictions, are they limited to specific roles such as OSD, MON, MGR, MDS? Can the OSD run on a Raspberry Pi Zero W with 512 MB RAM, 1 GHz CPU, and USB drive over WiFi? What about the MON, MGR, and MDS? Are multiple nodes over WiFi, limited to 65 Mbps each, capable of delivering speeds up to 1 Gbps to a client connected to the router via 1 Gbps ethernet when transferring files in parallel?
Let’s throw all of these questions at a large language model. At this time, we can’t rely on the information, but we can at least use it as a jumping-off point and verify with other sources.
Running a Ceph cluster involves some minimum requirements and considerations depending on the specific roles (OSD, MON, MGR, MDS) and hardware capabilities. Below are answers to your specific questions:
1. Minimum Requirements to Run Ceph:
Ceph is designed to run on commodity hardware, but there are still some minimum resource requirements for optimal performance. While it can technically run on lower-end hardware for experimental purposes, it is best to follow the recommended specifications for a production cluster.
General Minimum Requirements:
- CPU: A modern multi-core processor (e.g., Intel or AMD x86_64).
- RAM: At least 2GB of RAM per node, but 4-8GB is recommended for better performance, especially for more nodes or larger clusters.
- Disk Storage: For OSD nodes, fast and reliable storage is essential. Minimum of 1TB per OSD is common in larger setups.
- Network: A gigabit Ethernet connection is recommended for network speed between nodes.
- Operating System: Linux (usually CentOS, Ubuntu, or similar).
2. How Much RAM Does Ceph Need?
- Ceph’s RAM usage depends on the number of nodes, OSDs, and the size of the data being managed. Generally, for a small test setup, 4GB of RAM per node is a bare minimum.
- For a production cluster, it is recommended to have at least 8GB of RAM per node for OSD, MON, and MGR roles, with 16GB or more for larger or more demanding setups. OSD nodes tend to consume more RAM because they handle data storage, replication, and recovery.
3. Can Ceph Run Over WiFi?
Technically, yes, Ceph can run over WiFi, but it is not recommended for production environments. The primary concerns with running Ceph over WiFi are:
- Latency: WiFi introduces higher latency compared to wired Ethernet, which can negatively affect performance.
- Instability: WiFi is more prone to interference and signal loss, leading to potential data inconsistencies or cluster health issues.
- Bandwidth: WiFi, especially with lower speeds (e.g., 65 Mbps), could bottleneck your cluster’s performance, especially for data-intensive tasks like OSD operations or recovery.
- Recommendation: For small test environments, WiFi may be acceptable, but for reliable performance, it’s best to use wired Ethernet.
4. Is Ceph Compatible with USB Drives?
- Yes, Ceph can run on USB drives, but there are limitations:
- Performance: USB drives, especially those using USB 2.0, are much slower than SSDs or HDDs and can create bottlenecks.
- Durability: USB drives may not have the same lifespan and reliability as SSDs or HDDs, especially under heavy read/write operations in an OSD.
- Role Limitations: It’s possible to use USB drives for OSDs, but they are not ideal for production-level OSDs in large Ceph clusters due to speed and reliability concerns.
5. Can Ceph Use Flash Drives, or Is It Limited to SSDs and HDDs?
- Flash drives (e.g., USB flash drives or SD cards) can technically be used, but they are generally not recommended for the OSD role in Ceph clusters for the following reasons:
- Performance: Flash drives are usually much slower compared to dedicated SSDs or HDDs, which can impact the overall performance of the Ceph cluster, especially for OSDs where high read/write performance is crucial.
- Durability: Flash drives have a lower number of write cycles compared to SSDs and HDDs, so they may wear out faster when handling Ceph’s constant read/write operations.
- For OSD roles, SSD or HDD is the ideal choice. Flash drives might be more suited for non-OSD roles, such as logging or metadata caching, but even then, SSDs are preferred for reliability and performance.
6. Can the OSD Run on a Raspberry Pi Zero W with 512MB RAM, 1GHz CPU, and USB Drive Over WiFi?
- Technically, you can run an OSD on a Raspberry Pi Zero W, but there are severe limitations:
- RAM: The Raspberry Pi Zero W has only 512MB of RAM, which is far below the minimum requirement for a functional Ceph OSD (which generally needs 2GB of RAM, with 4GB or more recommended).
- CPU: The 1GHz single-core CPU on the Raspberry Pi Zero W would also struggle to handle the heavy I/O and data processing Ceph requires, especially as part of the OSD role.
- USB Drive Over WiFi: USB drives on a Raspberry Pi Zero W would be much slower compared to dedicated storage solutions like SSDs, and running over WiFi adds additional latency and potential instability.
- Conclusion: Running an OSD on a Raspberry Pi Zero W is not recommended for anything beyond very basic testing or educational purposes. For real production environments, a Raspberry Pi with more RAM and processing power (such as the Raspberry Pi 4) and wired Ethernet would be more suitable.
7. What About the MON, MGR, and MDS on a Raspberry Pi?
- MON (Monitor): A Raspberry Pi can run a MON role (which is less resource-intensive than the OSD role) with 1GB of RAM or more. The Raspberry Pi 4 (with 4GB RAM) would be much more suitable for this role than the Pi Zero W.
- MGR (Manager): Like the MON, the MGR role doesn’t require a lot of resources. It can run on low-end hardware like a Raspberry Pi 3 or 4. However, the Pi Zero W is too limited for this purpose.
- MDS (Metadata Server): The MDS role can be resource-intensive, especially for clusters with large numbers of clients or heavy metadata usage. While a Raspberry Pi 4 may handle small workloads, the Pi Zero W is not suitable for running an MDS due to its limited RAM and CPU power.
8. Are Multiple Nodes Over WiFi, Limited to 65 Mbps Each, Capable of Delivering Speeds Up to 1 Gbps to a Client Connected to the Router via 1 Gbps Ethernet When Transferring Files in Parallel?
- No, the speed of the transfer would be limited by the slowest link in the network path. Even if the client is connected via 1 Gbps Ethernet, if the nodes are connected over WiFi at 65 Mbps each, the maximum transfer rate would be limited to 65 Mbps per node.
- Parallelism can help improve throughput, but each node’s WiFi connection will still be a bottleneck. Therefore, the overall performance would be far below 1 Gbps for large transfers, especially as the number of nodes increases and WiFi bandwidth becomes saturated.
- Recommendation: For good performance, Ceph nodes should be connected via gigabit Ethernet or faster, rather than WiFi.
Summary:
- Ceph can run on a variety of hardware, but certain roles (especially OSDs) require more resources and better hardware (RAM, CPU, and storage performance).
- WiFi and USB drives are not ideal for a production Ceph setup, especially for OSDs, due to performance limitations.
- Raspberry Pi Zero W is not suitable for Ceph roles like OSD due to limited resources, though it could potentially handle MON, MGR, or MDS roles in a minimal, experimental setup.
- Parallel transfers over WiFi at 65 Mbps per node would not deliver 1 Gbps speeds, as the transfer rate is limited by the WiFi bandwidth.
Well, that answers a few questions. I was hoping the Pi Zero W was a good candidate, as I have quite a few that I can pull from projects that are no longer in operation. It’s good to see that the Pi 4s can be used, as I have a few of them around the house as well. USB flash and USB external drives are supported but not recommended. WiFi is supported but slow and may cause issues with the health of the cluster. It seems like the answer to everything is: yes, technically it is possible, but no, you really don’t want to do that unless you are experimenting.
Well… I could experiment, and then add better nodes to the cluster. It would definitely give me the experience of what happens when I intentionally make a node fail, and watch the cluster heal itself with better hardware. The main thing is, I can start much earlier with a lot of components that I have on hand. The barrier to entry seems very low for budget-conscious people. I still have the four x86 PCs that I purchased from the library. Unraid is still in trial mode on one of them. I can uninstall it and put one of the memory sticks back in the other box and have two nodes ready to come online immediately. Yea… I may not need to buy anything. I’ve got my mother’s old computer in the closet. That’s three old computers on hand, ready to go. All I need is three to start. The hard drives are dismal, but that’s part of what makes this an interesting solution. I can take one node offline and upgrade it, while the other nodes keep the cluster operating and serving files as if nothing’s wrong to the end user.
I still need a second hard drive for each of the machines in order to keep the OS separate from the storage medium. However, it’s possible to just use an external USB flash drive, SSD, or HDD as a temporary solution until I can get the adapters and power to support a second drive, or find a way to get the on-board M.2 slot to be recognized by the OS. Maybe that was only an issue with the Unraid software. I hadn’t tried to see if TrueNAS could recognize the on-board M.2 card, and the motherboard’s BIOS recognized it just fine.
I had started moving data to the TrueNAS server, so I can’t simply take it offline and move the extra stick of memory to the other PC. Once I have the other PCs up and running, I can transfer files from the TrueNAS server. Or I can set things up now so that everything on TrueNAS is copied to the spare WD Cloud Backup NAS that I have on hand. Maybe I’ll do that, as it can sit and transfer all of the data while I’m away. Although the PCs are limited in physical space, SATA ports, and power cables for adding additional drives, I can always expand later by getting a PCIe to SATA adapter and setting up a JBOD enclosure next to the PCs.
I still want to set up the Raspberry Pis since they are smaller devices, portable, and could be demonstrated in a workshop at the library to explain the concepts of a NAS, RAID, and a distributed NAS. So with all of this redundancy with multiple roles, can I add the Raspberry Pis to the cluster, synchronize everything, and then separate them so that they can keep the cluster running among themselves at the same time? I think one main problem that I’d run into is meeting the quorum, since it requires a majority of MON nodes. Splitting the cluster in two means that one of the halves will not have a majority. I believe I can adjust the quorum requirements in order to bring that half back online. This seems like something you’d need to be able to do if too many monitors fail, and you want to add more monitors to recover the cluster. There is also an issue regarding redundancy of data, in that I’d have to figure out a way to make sure that no piece of data is available only on the Pis, or only on the non-Pis. It also seems that the OSD role itself is the heavy lifter in a cluster, which is kinda defeating, as each OSD manages the storage directly.
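The quorum problem with splitting a cluster can be shown with a tiny Python sketch. The monitor counts are hypothetical (three monitors on the PCs, two on the Pis), and the function only models the majority rule, not how Ceph actually elects or recovers monitors.

```python
def has_quorum(reachable, total):
    """A group of monitors keeps quorum only with a strict majority of all monitors."""
    return reachable > total // 2

total_monitors = 5  # hypothetical: 3 on the PCs, 2 on the Raspberry Pis

for side, count in (("PCs", 3), ("Pis", 2)):
    print(f"{side}: {count}/{total_monitors} monitors, quorum = {has_quorum(count, total_monitors)}")
# PCs: 3/5 monitors, quorum = True
# Pis: 2/5 monitors, quorum = False
```

An even split is worse: with four monitors divided two and two, neither side has a majority, which is one reason odd monitor counts are the usual choice.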
I think I would need to build up a separate cluster of Raspberry Pis to be safe, and experiment with replicating data to the Pi cluster. I just feel like it may not be ideal anyway, unless I’m backing up a Pi cluster with limited resources to another Pi cluster with limited resources. I think WiFi isn’t such a great idea, so the Pi Zero W boards are a no-go. I think with either cluster, I should probably look into getting a dedicated network switch to keep the network chatter of replication limited to the cluster.
Usually when you get a new NAS or external hard drive, you have to start copying data over to the new system. It takes so long that you end up only copying the most recent stuff that you use, and everything else remains in an archived state. Ceph addresses that problem, and rather than having separate storage devices, you now have increased redundancy and speeds as a new node is added to the cluster. After many years, the capacity, RAM, and CPU of a node may become outdated. Rather than simply removing the node physically, you can instruct Ceph to migrate data away from the existing node prior to removing the drive(s).
I see room for improvement with Ceph in terms of replication with remote sites if you don’t have the bandwidth or latency to support it, as it depends on real-time communication and high bandwidth for replication. It has workarounds using separate clusters, but it seems like an improvement could be made around scheduling, or around recognizing which nodes are not local and giving them extra time to account for synchronization issues or the occasional situations where communication fails between the locations. Mainly, I’m thinking along the lines of friends and family. If a friend and I agree to host a few of each other’s nodes, then yes, we have disaster recovery at a remote site, and are capable of transferring and storing encrypted data that an untrusted party can’t access, but the instability of remote sites only makes it viable for backup rather than fault tolerance, redundancy, and repairing the cluster. If the node(s) go offline while they are away on vacation, then the cluster is waiting a few days before it can access that node again. This in turn requires the entire storage to be backed up at a single remote site, rather than spread across multiple sites. Why let one untrusted site have access to the entire data store, rather than fragments?
I’d also like the ability to limit the bandwidth to that remote site to a trickle if possible so that their own personal internet usage is not affected, and/or set up scheduling to allow more bandwidth during midnight hours. If we really want to expand on the idea of remote sites, we can consider communications with Mars, where the signal travel time is between 3 and 23 minutes depending on the distance to Earth. That’s a crazy amount of latency. NASA’s Mars rover has data rates around 50 to 100 kbps, thus crippling the possibility of data replication on an OSD node on Mars. Perhaps this scenario may be ideal for one of SuckerPinch’s Harder drives that depend on high latency to improve storage capacity via PING requests. So yes, Ceph for a Distributed Interplanetary Storage System (DISS) seems ridiculous, but it’s fun to think about. Maybe it should be called a Planetary Redundant Array of Independent Disks (PRAID).
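Just for fun, here is how long it would take to seed a Mars node with the roughly 4 TB mentioned at the start of this post. This Python sketch assumes the 50 to 100 kbps link is available around the clock with no retransmissions, which is generous.

```python
def replication_days(data_tb, link_kbps):
    """Days needed to push data_tb terabytes (decimal) through a link_kbps link."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_kbps * 1e3)
    return seconds / 86_400

for kbps in (50, 100):
    years = replication_days(4, kbps) / 365
    print(f"4 TB at {kbps} kbps: about {years:.0f} years")
# 4 TB at 50 kbps: about 20 years
# 4 TB at 100 kbps: about 10 years
```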

The remote-site NAS is more reasonable, especially since it’s likely that a company may have multiple offices around town, or a large corporation has satellite offices across the globe. You could fall back to cloud services to address the synchronization issue, but that’s essentially what we are trying to pull off on our own with a distributed NAS.
Well, I’m about to head out of town on a short vacation, so I’m not going to post anything over the next few days. I have plenty to think about here. I like the fact that I can start setting up a small cluster with the resources that I have on hand, and then increase my storage capacity to meet the demands that I have today, without having to plan for the demands that I may or may not have in the future. Horizontal scaling and automated repairs are the primary advantages that I see in using Ceph over ZFS.
