LeoFS is an unstructured object store for the Web: a highly available, distributed, eventually-consistent storage system.
The model impacts the system's ability to scale while maintaining an efficient, secure and reliable service.
The logic allows for applications to benefit from more advanced functionalities on top of the low-level distribution layer.
An interface allows applications to access a storage logic through a standard protocol, a GUI or otherwise.
A storage solution can be deployed over an environment that is, or is not, completely controlled by a single entity, e.g. a company.
The ability to scale out heterogeneously allows for more diverse storage deployments.
Different redundancy mechanisms can be provided, each with its tradeoffs.
Keeping a storage system operational requires being able to cope with "unexpected" behaviors.
A secure storage system must be designed with encrypted communications, access control etc. from the ground up.
Storage products and projects may or may not have their source code available to the public.
Storage solutions differ in their pricing model, from free software to commercial licenses and more.
The way the different servers and clients are arranged in a storage system can have a drastic impact on the overall infrastructure's performance, scalability and reliability (redundancy & fault tolerance).
Centralized storage systems store both the data and metadata on a single server. Even though such models are interesting for their simplicity, the availability and durability of the whole infrastructure depends upon a single server.
Adding more servers would increase complexity to the overall infrastructure but would allow it to support a larger number of clients as well as increase its total storage capacity. Such storage systems distribute the data blocks between the different storage servers effectively rendering the system scalable and resilient.
Distributed storage systems handle such elasticity by introducing more specific and privileged master servers, also known as managers, to orchestrate the whole system, from routing requests to the other servers, known as slave nodes or workers, managing metadata and more.
For some systems, this distributed model, also known as the manager/worker model, is limited by the capacity of the metadata servers to process client requests, handle worker nodes and achieve consensus with the other managers. Under too heavy a load, the system can suffer bottlenecks, single points of failure, performance degradation, cascading effects and more; in addition, the managers cannot be scaled out as easily as the workers, further limiting the system's scalability and performance.
Decentralized systems are designed for extreme scalability by distributing the data, metadata and requests between all the nodes that have the particularity of being equally unprivileged. By doing away with managers, such systems exhibit better performance (natural distribution), scalability (per-block quorums), resilience (no critical node) and security (reduced surface of attack).
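One common technique behind such manager-free placement is consistent hashing: every node can compute locally which nodes own a given object, so no central router is needed. The sketch below is illustrative and not specific to any product named here; the node names and replica count are assumptions.

```python
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    # Stable 64-bit position on the ring, derived from SHA-256.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Virtual nodes smooth the distribution across physical nodes.
        self._ring = sorted(
            (_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def locate(self, obj_key: str, replicas=3):
        """Return the distinct nodes holding the replicas of obj_key."""
        idx = bisect_right(self._keys, _hash(obj_key))
        owners = []
        # Walk clockwise around the ring, wrapping at the end.
        for _, n in self._ring[idx:] + self._ring[:idx]:
            if n not in owners:
                owners.append(n)
            if len(owners) == replicas:
                break
        return owners

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.locate("photos/2016/cat.jpg"))  # e.g. three distinct node names
```

Because placement is a pure function of the key and the node list, any unprivileged node (or client) can route a request without consulting a manager, which is what removes the central bottleneck.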
Object storage systems manage data as objects or blobs. Object storage is mostly used for storing application-specific data: movies in the case of Netflix, file contents in the case of Dropbox, etc.
File storage solutions represent data in files organized in folders and subfolders. This organization of information into a hierarchical view is perfectly adapted for humans to store and collaborate. In addition, file systems often provide an access control mechanism, in particular in multi-user environments, to control which files/directories can be seen/edited by other users/groups.
Another storage logic exists, known as block storage, which manages data as blocks within a virtual raw partition, i.e. very much like a raw hard disk drive. Note that such block storage logics are made accessible through a virtual block device which needs to be formatted with a local file system, ext4 for instance. Block storage logics have the particularity of usually being accessible from a single node only, while a distributed file system handles concurrent accesses from multiple client nodes.
A storage solution may provide a set of interfaces for accessing the different storage logics it offers.
Block & File Storage
Block and file storage logics are usually accessed through standard low-level protocols: network block device (NBD), iSCSI, etc. for the former, and FUSE/OSX FUSE/Dokan, network file system (NFS), etc. for the latter.
Other interfaces include file synchronization as used by popular cloud file storage services like Dropbox, Box, Google Drive etc.
In addition to existing interfaces, many products, in particular consumer-oriented ones, provide graphical user interfaces (GUI) for desktop, mobile and Web, while others also offer specific application programming interfaces (API) or software development kits (SDK).
The environment in which a storage system is deployed radically changes the constraints it must take into account when it comes to security, reliability (redundancy & fault tolerance) and scalability mechanisms.
The vast majority of storage solutions have been designed to operate in a controlled environment, i.e. a set of servers and clients that are under the control of a single entity, e.g. a company.
As an example, the latency in a local area network will not be a constraint, while the data may not need to be encrypted at rest since it is hosted on trusted servers.
Storage solutions that can be deployed over a cluster of untrustworthy nodes are known to evolve in uncontrolled environments and as such face unique challenges when it comes to latency, security etc.
Indeed, in such environments, the data is spread across a large number of computing devices that are under the control of different entities. As such, the data blocks stored by a user may end up on computers located in different countries (e.g US, China, Germany etc.) and controlled by a potentially malicious user.
A user may not feel comfortable knowing that his/her files are stored on someone else's computer, even though the blocks are encrypted. This is particularly worrying for businesses that could be interested in the technology but need to control where and how data is actually stored.
Note that some solutions provide tools allowing developers and operators to create their own storage infrastructure. With Tahoe-LAFS and Infinit, an operator can decide which computers are involved in the infrastructure while defining scalability policies, allowing (or not) for untrustworthy nodes to join the network. As a result, such solutions can be deployed in both controlled and uncontrolled environments.
Some systems do not allow for the overall storage capacity to evolve over time. The ones that do are said to be scalable. Scalable storage systems could be further categorized into those that can scale without any interruption of service and those that require shutting down the system. In reality, the vast majority of existing scalable systems can scale (to some extent) dynamically i.e without requiring shutting down the service.
There exist two types of systems when it comes to scalability. The first category contains systems that can scale over homogeneous resources, i.e. resources under the control of the infrastructure's administrator: local disks (DAS), network-attached storage (NAS), storage area networks (SAN), etc.
Systems capable of heterogeneous scalability are able to integrate additional storage capacity from resources and providers of different nature leading to more complex deployment such as hybrid clouds, multi clouds etc.
Heterogeneous-scalable storage systems may rely on at-rest encryption to secure the data when stored through a storage provider with limited trust. Likewise, the most advanced storage systems can make use of a Byzantine consensus algorithm to cope with potentially malicious behaviors since the servers are not under the administrator's control.
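A classic result on Byzantine consensus is that tolerating f arbitrarily misbehaving replicas requires at least 3f + 1 replicas in total (versus 2f + 1 for simple crash faults), which is why Byzantine tolerance is noticeably more expensive. A minimal sketch of that arithmetic; the fault counts below are illustrative:

```python
# Replica counts required by classic fault-tolerance bounds:
# crash faults need N >= 2f + 1, Byzantine faults need N >= 3f + 1
# (as in PBFT-style protocols).

def min_replicas(faults: int, byzantine: bool = True) -> int:
    return 3 * faults + 1 if byzantine else 2 * faults + 1

for f in (1, 2, 3):
    print(f"tolerate {f} fault(s): crash={min_replicas(f, byzantine=False)}, "
          f"byzantine={min_replicas(f)}")
```

This explains the practical trade-off: a cluster built to survive even a single malicious node needs four replicas of every decision, not three.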
Note that should such a system provide these mechanisms, it would exhibit most of the prerequisites to be deployed in an uncontrolled environment.
Many storage systems provide no redundancy mechanism and therefore cannot ensure the reliability of the data they store. As a result, should a server fail in such a system, some (and possibly all) of the data would be made unavailable ― temporarily or permanently depending on the nature of the failure.
Reliability ― represented by both durability (data stored remain in the system) and availability (data is always accessible) ― implies that a system provides some sort of a redundancy mechanism. There are basically two categories of redundancy algorithms: replication and erasure codes (which are error-correcting codes).
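Redundant, eventually-consistent systems often tune availability versus consistency with Dynamo-style quorums: with N replicas, a write acknowledged by W nodes and a read consulting R nodes are guaranteed to overlap on at least one up-to-date replica whenever R + W > N. The configurations below are illustrative, not taken from any particular product:

```python
# Dynamo-style quorum arithmetic: every read quorum intersects every
# write quorum iff R + W > N, so a read always sees at least one
# replica that acknowledged the latest write.

def quorum_overlap(n: int, w: int, r: int) -> bool:
    """True if read and write quorums are guaranteed to intersect."""
    return r + w > n

print(quorum_overlap(3, 2, 2))  # strict quorum: overlap guaranteed
print(quorum_overlap(3, 1, 1))  # fast but only eventually consistent
```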
Replication consists in creating exact copies of a data item to ensure that, if a copy goes missing (following a server failure, corruption or else), the system can keep operating with the remaining copies.
There are many ways to decompose the original data piece, distribute it and replicate it between the storage media (server, disk etc.) in order to benefit from different properties. The RAID technology for instance provides several schemas, or RAID levels, to achieve different balance between reliability, availability, performance and capacity.
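The capacity side of those RAID trade-offs is easy to make concrete. A minimal sketch of usable capacity per level, assuming n identical disks of size d (the 4 × 4 TB array is an illustrative assumption):

```python
# Usable capacity of an array of n identical disks of size d (TB)
# under common RAID levels, showing the reliability/capacity trade-off.

def usable_capacity(level: int, n: int, d: float) -> float:
    if level == 0:       # striping only: no redundancy, full capacity
        return n * d
    if level == 1:       # mirroring: every disk is duplicated
        return n * d / 2
    if level == 5:       # single parity: survives 1 disk failure
        return (n - 1) * d
    if level == 6:       # double parity: survives 2 disk failures
        return (n - 2) * d
    raise ValueError("unsupported RAID level")

for level in (0, 1, 5, 6):
    print(f"RAID {level}: {usable_capacity(level, 4, 4.0):.1f} TB usable of 16.0 TB")
```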
Erasure codes, such as Reed-Solomon, do not create raw copies of a data item as replication does. Instead, error-correcting codes transform a data block of k symbols into a longer block of n symbols such that the original data block can be recovered from a subset of the n symbols.
Erasure codes are more interesting than replication when it comes to storage consumption since less storage capacity is required to achieve the same durability and availability. However, the process of writing is slower than with replication as more servers need to be contacted to host data symbols. Likewise, more computing power is required to reconstruct the original data from several pieces when accessing data, leading to more latency.
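The storage-consumption argument can be quantified. The comparison below is a sketch under assumed parameters: 3-way replication versus a Reed-Solomon-style (k, m) code storing k data symbols plus m parity symbols, both configured to tolerate the loss of any 2 units:

```python
# Bytes stored per byte of user data ("storage overhead"):
# replication stores full copies; a (k, m) erasure code stores
# n = k + m symbols, of which only k carry the original data.

def replication_overhead(copies: int) -> float:
    return float(copies)

def erasure_overhead(k: int, m: int) -> float:
    return (k + m) / k

print(replication_overhead(3))   # 3.0x, tolerates losing 2 copies
print(erasure_overhead(10, 2))   # 1.2x, tolerates losing 2 symbols
```

For the same fault tolerance, the erasure-coded layout here consumes 2.5 times less raw capacity than replication, at the cost of contacting more servers on writes and reconstructing data on reads, as noted above.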
Erasure codes are more adapted to archiving while replication is often preferred for primary storage. Note however that, beyond 100TB of data, the gain in storage capacity achievable through erasure coding becomes interesting enough to be considered even for primary storage.
Note that such redundancy mechanisms differ from backups. Should a server fail, redundancy mechanisms ensure that clients can continue accessing the data transparently (property known as availability). Backups however require restoring a snapshot before the data becomes accessible again, process which can take several days during which the whole system is non-operational.
The capacity to maintain a storage system operational in the event of failures is referred to as fault tolerance.
While some systems provide no fault tolerance mechanism, most storage products do. Depending on the nature of the resources composing a storage infrastructure, a system will need different algorithms to detect and deal with potential failures: bugs, crashes (temporary or permanent) and even malicious behaviors.
The scale-out software-defined storage systems that do offer some means of redundancy often integrate a mechanism that monitors storage servers to detect potential failures. Note however that such standard fault tolerance algorithms assume that the servers will always follow the system's protocol, because distributed storage solutions are generally deployed within a controlled environment, e.g. a company's network.
If some of the storage resources composing the infrastructure are not under the control of the administrator, this assumption cannot hold and the fault tolerance algorithm must be adapted to tolerate failures known as Byzantine, i.e. potentially malicious behaviors.
Finally, some systems ― in particular distributed storage without a redundancy mechanism ― may be partially fault tolerant. In such cases, a failing server may render a number of data pieces unavailable until it comes back online or an administrative operation is manually performed.
Large-scale deployments in uncontrolled environments require a storage solution that both encrypts in transit and at rest so that a piece of data cannot be decrypted by a potentially malicious storage server. Such at-rest encryption mechanisms are necessary because uncontrolled environments are usually considered untrustworthy since composed of many devices under the control of unknown entities.
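The principle behind at-rest encryption in such environments is that blocks are encrypted client-side, so a storage server only ever holds ciphertext. The sketch below is a toy illustration of that data flow only ― the hash-based keystream is NOT a real cipher; a production system would use an audited AEAD such as AES-GCM (e.g. via the `cryptography` package). Key and nonce values are illustrative assumptions.

```python
import hashlib
from itertools import count

# Toy client-side encryption: derive a keystream from (key, nonce)
# and XOR it with the block before upload. Insecure by design ---
# for illustrating the at-rest data flow only.

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    out = bytearray()
    for block in count():
        out += hashlib.sha256(key + nonce + block.to_bytes(8, "big")).digest()
        if len(out) >= length:
            return bytes(out[:length])

def xor_transform(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # XOR is its own inverse: the same call encrypts and decrypts.
    ks = _keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

key, nonce = b"user-secret-key", b"block-0001"
ciphertext = xor_transform(key, nonce, b"confidential block contents")
# The server stores only `ciphertext`; the client decrypts on read:
assert xor_transform(key, nonce, ciphertext) == b"confidential block contents"
```

Because only the client holds the key, even a malicious storage node in an uncontrolled environment learns nothing beyond the block's size.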
In some cases, for instance cloud-based storage services, only the communications (in transit) are encrypted while the blocks are stored in plain form on the servers. Although this is required for those solutions to be able to offer users a Web-based interface to browse and manipulate their data online, it is recommended to always encrypt data at rest, in particular on cloud storage services.