Analytics on colossal time-series data with Summary Store
Global analytics in the face of bandwidth and regulatory constraints
Global-scale organizations produce large volumes of data across geographically distributed data centers. Querying and analyzing such data as a whole introduces new research issues at the intersection of networks and databases. Today, systems that compute SQL analytics over geographically distributed data operate by pulling all data to a central location. This is problematic at large data scales due to expensive transoceanic links, and may be rendered impossible by emerging regulatory constraints. The new problem of Wide-Area Big Data (WABD) consists in orchestrating query execution across data centers to minimize bandwidth while respecting regulatory constraints. WABD combines classical query planning with novel network-centric mechanisms designed for a wide-area setting such as pseudo-distributed execution, joint query optimization, and deltas on cached subquery results. Our prototype, WANalytics, builds upon Hive and uses 257x less bandwidth than centralized analytics in a Microsoft production workload and up to 360x less on popular analytics benchmarks including TPC-CH and Berkeley Big Data. WANalytics supports all SQL operators, including Joins, across global data.
WANalytics: Geo-distributed analytics for a data-intensive world
SIGMOD 2015 demo
Many large organizations collect massive volumes of data each day in a geographically distributed fashion, at data centers around the globe. Despite their geographically diverse origin, the data must be processed and analyzed as a whole to extract insight. We call the problem of supporting large-scale geo-distributed analytics Wide-Area Big Data (WABD). To the best of our knowledge, WABD is currently addressed by copying all the data to a central data center where the analytics are run. This approach consumes expensive cross-data center bandwidth and is incompatible with data sovereignty restrictions that are starting to take shape. We instead propose WANalytics, a system that solves the WABD problem by orchestrating distributed query execution and adjusting data replication across data centers in order to minimize bandwidth usage, while respecting sovereignty requirements. WANalytics achieves up to a 360x reduction in data transfer cost when compared to the centralized approach on both real Microsoft production workloads and standard synthetic benchmarks, including TPC-CH and Berkeley Big-Data. In this demonstration, attendees will interact with a live geo-scale multi-data center deployment of WANalytics, allowing them to experience the data transfer reduction our system achieves, and to explore how it dynamically adapts execution strategy in response to changes in the workload and environment.
WANalytics: Analytics for a geo-distributed data-intensive world
Large organizations today operate data centers around the globe where massive amounts of data are produced and consumed by local users. Despite their geographically diverse origin, such data must be analyzed/mined as a whole. We call the problem of supporting rich DAGs of computation across geographically distributed data Wide-Area Big-Data (WABD). To the best of our knowledge, WABD is not supported by currently deployed systems nor sufficiently studied in literature; it is addressed today by continuously copying raw data to a central location for analysis. We observe from production workloads that WABD is important for large organizations, and that centralized solutions incur substantial cross-data center network costs. We argue that these trends will only worsen as the gap between data volumes and transoceanic bandwidth widens. Further, emerging concerns over data sovereignty and privacy may trigger government regulations that can threaten the very viability of centralized solutions.
To address WABD we propose WANalytics, a system that pushes computation to edge data centers, automatically optimizing workflow execution plans and replicating data when needed. Our Hadoop-based prototype delivers 257x reduction in WAN bandwidth on a production workload from Microsoft. We round out our evaluation by also demonstrating substantial gains for three standard benchmarks: TPC-CH, Berkeley Big Data, and BigBench.
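As a rough illustration of why pushing computation to edge data centers saves WAN bandwidth, the toy sketch below compares shipping raw rows to a central site with shipping per-site partial aggregates. It is not the WANalytics implementation: the site names, the event data, and the helper names `centralized_transfer` and `pushed_down_count` are all hypothetical, and transfer cost is approximated by a record count rather than bytes.

```python
from collections import Counter


def centralized_transfer(sites):
    """Records shipped over the WAN if every site copies its raw rows centrally."""
    return sum(len(rows) for rows in sites.values())


def pushed_down_count(sites):
    """Count per key at each site, then merge only the partial counts.

    Returns the merged totals and the number of (key, count) records
    that had to cross the WAN.
    """
    total = Counter()
    shipped = 0
    for rows in sites.values():
        partial = Counter(rows)   # computed inside the edge data center
        shipped += len(partial)   # one record per distinct key leaves the site
        total.update(partial)     # merge step at the coordinating site
    return total, shipped


# Hypothetical event logs held at three data centers.
sites = {
    "us": ["view", "view", "click", "view"],
    "eu": ["click", "view", "view"],
    "asia": ["view", "purchase", "view", "view", "click"],
}

raw_records = centralized_transfer(sites)
totals, partial_records = pushed_down_count(sites)
print(f"centralized: {raw_records} records, pushed down: {partial_records} records")
```

The saving in this toy grows with the ratio of rows to distinct keys per site; queries that are not simple decomposable aggregates (joins, for example) need the richer planning described in the abstract.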
Low latency via redundancy
Low latency is critical for interactive networked applications. But while we know how to scale systems to increase capacity, reducing latency --- especially the tail of the latency distribution --- can be much more difficult. In this paper, we argue that the use of redundancy is an effective way to convert extra capacity into reduced latency. By initiating redundant operations across diverse resources and using the first result which completes, redundancy improves a system's latency even under exceptional conditions. We study the tradeoff with added system utilization, characterizing the situations in which replicating all tasks reduces mean latency. We then demonstrate empirically that replicating all operations can result in significant mean and tail latency reduction in real-world systems including DNS queries, database servers, and packet forwarding within networks.
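The core mechanism described above (start redundant operations on diverse resources, use whichever result completes first) can be sketched in a few lines. This is a minimal illustration using Python's standard `concurrent.futures` module, not code from the paper; `first_result` and `make_replica` are hypothetical names, and the replicas are simulated with sleeps rather than real DNS or database calls.

```python
import concurrent.futures as cf
import time


def first_result(operations):
    """Run every callable concurrently and return the first successful result."""
    if not operations:
        raise ValueError("need at least one operation")
    pool = cf.ThreadPoolExecutor(max_workers=len(operations))
    futures = [pool.submit(op) for op in operations]
    try:
        errors = []
        # as_completed yields futures in the order they finish.
        for fut in cf.as_completed(futures):
            try:
                return fut.result()
            except Exception as exc:  # a failed replica does not end the race
                errors.append(exc)
        raise RuntimeError(f"all {len(operations)} replicas failed: {errors}")
    finally:
        # Do not block on the slower replicas; they finish in the background.
        pool.shutdown(wait=False)


def make_replica(name, delay):
    """Simulated replica (assumption): answers with its name after `delay` seconds."""
    def op():
        time.sleep(delay)
        return name
    return op


winner = first_result([make_replica("slow", 0.5), make_replica("fast", 0.05)])
print(winner)
```

Note that the slower replicas still run to completion, which is exactly the added utilization whose tradeoff the abstract says the paper characterizes.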
More is less: Reducing latency via redundancy
Low latency is critical for interactive networked applications. But while we know how to scale systems to increase capacity, reducing latency --- especially the tail of the latency distribution --- can be much more difficult.
We argue that the use of redundancy in the context of the wide-area Internet is an effective way to convert a small amount of extra capacity into reduced latency. By initiating redundant operations across diverse resources and using the first result which completes, redundancy improves a system's latency even under exceptional conditions. We demonstrate that redundancy can significantly reduce latency for small but critical tasks, and argue that it is an effective general-purpose strategy even on devices like cell phones where bandwidth is relatively constrained.
How well can congestion pricing neutralize denial-of-service attacks?
Denial of service protection mechanisms usually require classifying malicious traffic, which can be difficult. Another approach is to price scarce resources. However, while congestion pricing has been suggested as a way to combat DoS attacks, it has not been shown quantitatively how much damage a malicious player could cause to the utility of benign participants. In this paper, we quantify the protection that congestion pricing affords against DoS attacks, even for powerful attackers that can control their packets' routes. Specifically, we model the limits on the resources available to the attackers in three different ways and, in each case, quantify the maximum amount of damage they can cause as a function of their resource bounds. In addition, we show that congestion pricing is provably superior to fair queueing in attack resilience.
Application of secondary information for misbehavior detection in VANETs
IFIP Networking 2010
Safety applications designed for Vehicular Ad Hoc Networks (VANETs) can be compromised by participating vehicles transmitting false or inaccurate information. Design of mechanisms that detect such misbehaving nodes is an important problem in VANETs. In this paper, we investigate the use of correlated information, called "secondary alerts", generated in response to another alert, called the "primary alert", to verify the truth or falsity of the primary alert received by a vehicle. We first propose a framework to model how such correlated secondary information observed from more than one source can be integrated to generate a "degree of belief" for the primary alert. We then show an instantiation of the proposed model for the specific case of Post-Crash Notification (PCN) as the primary alert and Slow/Stopped Vehicle Advisory as the secondary alert. Finally, we present the design and evaluation of a misbehavior detection scheme (MDS) for the PCN application using such correlated information, to illustrate that such information can be used efficiently for MDS design.
Unsupervised and supervised classification of hyperspectral image data using projection pursuit and Markov random field segmentation
International Journal of Remote Sensing, Vol. 33 Issue 18, 2012
This work presents a classification technique for hyperspectral image analysis both when concurrent ground-truth is available and when it is not. The method adapts a principal component analysis based projection pursuit (PP) procedure with an entropy index to reduce the dimensionality, followed by Markov Random Field (MRF) model based segmentation. An ordinal optimization approach to PP determines a set of good enough projections with high probability, the best among which is chosen with the help of MRF model based segmentation. When ground-truth is absent, the segmented output obtained is labeled with the desired number of classes so that it resembles the natural scene closely. When the landcover classes are defined at a detailed level, some special reflectance characteristics based on the classes of the study area in question are determined. These are later incorporated in the MRF model based segmentation stage while minimizing the energy function in the image space. Segments are evaluated with training samples so as to yield a classified image with respect to the type of ground-truth data. Two illustrations are presented for the supervised case: (i) an EO-1 Hyperion sensor image with concurrent ground-truth at detailed level classes and (ii) the AVIRIS-92AV3C image with concurrent ground-truth. Comparisons of the classification accuracies and computational times of some nonparametric approaches with those of the proposed methodology are provided for the illustrations. Experimental results demonstrate that the method provides high classification accuracy and is computationally faster than other methods.
Unsupervised hyperspectral image analysis with projection pursuit and MRF segmentation approach
2008 International Conference on Artificial Intelligence and Pattern Recognition (AIPR-08), pp. 120-127
This work deals with hyperspectral image analysis in the absence of ground-truth. The method adopts a projection pursuit (PP) procedure with an entropy index to reduce the dimensionality, followed by Markov Random Field (MRF) model based segmentation. An ordinal optimization approach to PP determines a set of "good enough projections" with high probability, the best among which is chosen with the help of MRF model based segmentation. The segmented output so obtained is labeled with the desired number of landcover classes in the absence of ground-truth. In a comparison against the original hyperspectral image, the methodology outperforms principal component analysis with respect to class separation, as exhibited in the illustration on an archive EO-1 hyperspectral image. Unlike most hyperspectral image analysis techniques, it is not computationally intensive. When training samples are available, the segmented regions yield a classified image with any cluster validation technique.