The Data-Scope's Mission and History
Technology and scientific methodologies have advanced greatly over the last decade where data is concerned. Scientists have been simultaneously fortunate and unfortunate: these advances provide larger and more intricate datasets than ever, yet evaluating such a dataset as a whole, and in all of its detail, poses serious problems.
Processing large datasets requires significant forethought; traditional supercomputing models are a poor match for big data, so compromises or heavily over-complicated solutions are often necessary to achieve science goals. The Data-Scope endeavors to overcome the issues big data faces on traditional HPC by doing the following:
Storing the data local to the compute.
Large datasets are traditionally stored on communal “head” or “storage” nodes, which are shared by hundreds or thousands of compute nodes. Putting aside IO-subsystem obstacles, the data also has to travel across high-speed InfiniBand or Ethernet networks to compute nodes that typically have less than a terabyte of slow local storage. The network and the slow local disk thus become bottlenecks.
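A rough back-of-the-envelope sketch makes the point; the dataset size and bandwidth figures below are illustrative assumptions, not Data-Scope specifications:

```python
# Back-of-the-envelope comparison: streaming a large dataset over a shared
# network link versus reading it from a node's local disks.
# All figures below are assumed for illustration, not measured values.

DATASET_TB = 100        # hypothetical dataset size
NETWORK_GBPS = 10       # assumed network link, e.g. 10 Gb/s Ethernet
LOCAL_DRIVES = 24       # locally attached drives per node
DRIVE_MBPS = 140        # assumed sequential throughput per drive (MB/s)

network_bytes_per_s = NETWORK_GBPS * 1e9 / 8
local_bytes_per_s = LOCAL_DRIVES * DRIVE_MBPS * 1e6
dataset_bytes = DATASET_TB * 1e12

print(f"over the network:  {dataset_bytes / network_bytes_per_s / 3600:.1f} hours")
print(f"from local disks:  {dataset_bytes / local_bytes_per_s / 3600:.1f} hours")
```

Under these assumptions the network transfer alone takes roughly three times as long as a sequential scan of the same data from local drives, before any computation begins.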
Mapping users one-to-one to nodes.
Giving users their own nodes eliminates the problem of sharing large head-node storage among users who require different access patterns and data layouts. Granting exclusive access to a single project (or to complementary projects) allows the data to be laid out in a fashion conducive to computation.
Leveraging GPUs for computation.
GPUs are a cost-effective and extremely fast solution to many big-data problems. A Fermi-generation NVIDIA GPU is capable of roughly half a teraflop (double precision) and delivers far more flops per dollar than a CPU-only node.
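As a crude illustration of the flops-per-dollar argument: the half-teraflop figure comes from the text above, while the prices and the CPU throughput below are purely hypothetical placeholders.

```python
# Crude flops-per-dollar comparison. The 0.5 TFLOPS double-precision figure
# for a Fermi-class GPU is from the text; prices and the CPU throughput are
# hypothetical placeholders for illustration only.

gpu_dp_flops = 0.5e12    # ~half a teraflop, double precision (Fermi class)
gpu_price_usd = 2500     # assumed card price
cpu_dp_flops = 0.1e12    # assumed multi-core CPU, double precision
cpu_price_usd = 1500     # assumed CPU price

print(f"GPU: {gpu_dp_flops / gpu_price_usd / 1e6:.0f} MFLOPS per dollar")
print(f"CPU: {cpu_dp_flops / cpu_price_usd / 1e6:.0f} MFLOPS per dollar")
```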
Eliminating bottlenecks.
The Data-Scope removes the network from the equation while the data is being processed. Each compute node is equipped with twenty-four 1 TB hard drives, mapped one-to-one to controller ports on the backplane, as well as four MLC SSDs. Because the design does not funnel traffic through SAS expanders, sequential IO throughput scales with the aggregate bandwidth of the drives, up to the limit of the host bus adapter.
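A minimal sketch of that scaling argument, using assumed per-device bandwidths and an assumed host-bus-adapter ceiling rather than measured Data-Scope numbers:

```python
# Sketch of sequential-IO scaling in a direct-attached layout: aggregate
# throughput is the sum of per-drive bandwidths, capped by the host bus
# adapter. Per-device figures below are assumed for illustration.

HDD_COUNT, HDD_MBPS = 24, 140    # twenty-four 1 TB hard drives (assumed MB/s each)
SSD_COUNT, SSD_MBPS = 4, 250     # four MLC SSDs (assumed MB/s each)
HBA_LIMIT_MBPS = 4000            # assumed host-bus-adapter ceiling

def aggregate_throughput_mbps(hdds: int, ssds: int) -> float:
    """Aggregate sequential bandwidth with no SAS-expander sharing."""
    raw = hdds * HDD_MBPS + ssds * SSD_MBPS
    return min(raw, HBA_LIMIT_MBPS)

print(aggregate_throughput_mbps(HDD_COUNT, SSD_COUNT), "MB/s per node")
```

With these assumed figures the raw per-node bandwidth sits just above the adapter's ceiling, so every drive added up to that point contributes its full sequential rate.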