
Data analytics benchmarking

  1. Data analytics benchmarking software
  2. Data analytics benchmarking Offline

For AI, we provide TensorFlow and Caffe implementations. For graph analytics, we provide Hadoop, Spark GraphX, Flink Gelly and GraphLab implementations.

Data analytics benchmarking Offline

For offline analytics, we provide Hadoop, Spark, Flink, and MPI implementations. To meet the benchmarking requirements of the system and data management communities, we provide diverse implementations using state-of-the-art techniques. To keep the benchmarks consistent across communities, we adopt state-of-the-art algorithms from the machine learning community that consider the model's prediction accuracy. Using real data sets as seeds, the data generator, BDGS, produces synthetic data by scaling the seed data while preserving the characteristics of the raw data. The included data sources are currently text, graph, table, and image data, so data variety covers the whole spectrum of data types: structured, semi-structured, and unstructured.
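
The seed-scaling idea behind BDGS can be illustrated with a minimal sketch for text data: estimate the word distribution of a small seed corpus, then sample a larger synthetic corpus that preserves that distribution. The function name and interface below are illustrative assumptions, not the actual BDGS API.

```python
# Hypothetical sketch of BDGS-style text generation (not the real BDGS code):
# scale a seed corpus while keeping its word-frequency distribution.
import random
from collections import Counter

def generate_synthetic_text(seed_text, scale_factor, rng=None):
    """Produce a corpus scale_factor times the seed size, sampling words
    with probabilities proportional to their seed frequencies."""
    rng = rng or random.Random(0)
    words = seed_text.split()
    counts = Counter(words)                 # data characteristics of the seed
    vocab = list(counts)
    weights = [counts[w] for w in vocab]
    target_len = len(words) * scale_factor  # scale the seed data
    return " ".join(rng.choices(vocab, weights=weights, k=target_len))

seed = "big data benchmark big data suite"
synthetic = generate_synthetic_text(seed, scale_factor=10)
print(len(synthetic.split()))  # 60: six seed words scaled 10x
```

The real generator applies the same principle to graph, table, and image data, where the preserved "characteristics" are correspondingly richer (e.g., degree distributions rather than word frequencies).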


Meanwhile, data sets have a great impact on workload behavior and running performance (CGO'18). We release an open-source big data benchmark suite, BigDataBench. Rather than creating a new benchmark or proxy for every possible workload, we propose data motif-based benchmarks: combinations of eight data motifs that represent the diversity of big data and AI workloads. For the first time, across a wide variety of big data and AI workloads, we identify eight data motifs (PACT'18 paper): Matrix, Sampling, Logic, Transform, Set, Graph, Sort, and Statistic computation, each of which captures the common requirements of one class of unit of computation. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on initial or intermediate data inputs, each of which we call a data motif.

Our benchmark suite includes micro benchmarks, each of which is a single data motif; component benchmarks, which are combinations of data motifs; and end-to-end application benchmarks, which are combinations of component benchmarks. The benchmarks cover six workload types (online services, offline analytics, graph analytics, data warehouse, NoSQL, and streaming) from three important application domains: Internet services (including search engines, social networks, and e-commerce), recognition sciences, and medical sciences. The current version, BigDataBench 5.0, provides 13 representative real-world data sets and 27 big data benchmarks. We capture the differences and collaborations among IoT, edge, datacenter, and HPC in handling big data and AI workloads. We specify the common requirements of big data and AI only algorithmically, in a paper-and-pencil approach, reasonably divorced from individual implementations. In addition, the benchmarks are consistent across different communities.
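
The "workload as a pipeline of data motifs" view can be sketched as follows. This is a toy illustration under our own assumptions, not BigDataBench code: each stage is tagged with one of the eight motifs, and a component benchmark is just a composition of such stages.

```python
# Illustrative sketch of the data-motif abstraction (not BigDataBench itself):
# a workload is a pipeline of units of computation, each drawn from the
# eight motifs, applied to initial or intermediate data.
import random
import statistics

MOTIFS = {"Matrix", "Sampling", "Logic", "Transform",
          "Set", "Graph", "Sort", "Statistic"}

def make_pipeline(*stages):
    """Compose (motif_name, fn) stages into a single workload."""
    for name, _ in stages:
        assert name in MOTIFS, f"unknown motif: {name}"
    def workload(data):
        for _, fn in stages:
            data = fn(data)  # output of one motif feeds the next
        return data
    return workload

# A toy "component benchmark": Sampling -> Sort -> Statistic.
pipeline = make_pipeline(
    ("Sampling",  lambda xs: random.Random(0).sample(xs, 5)),
    ("Sort",      sorted),
    ("Statistic", statistics.mean),
)
print(pipeline(list(range(100))))
```

A micro benchmark would exercise a single stage in isolation, while an end-to-end application benchmark would chain several such component pipelines.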

Data analytics benchmarking software

First, for the sake of conciseness, benchmarking scalability, portability, cost, reproducibility, and better interpretation of performance data, we need to understand the most time-consuming classes of unit of computation among big data and AI workloads. Second, for the sake of fairness, the benchmarks must include a diversity of data and workloads. Third, for co-design of software and hardware, we need simple but elegant abstractions that help achieve both efficiency and general-purpose applicability.


However, the complexity, diversity, frequently changing workloads, and rapid evolution of big data and AI systems raise great challenges for benchmarking. As the architecture, system, data management, and machine learning communities pay greater attention to innovative big data and AI (machine learning) algorithms, architectures, and systems, the pressure on benchmarking rises.
