-
Building highly-available geo-distributed data stores for continuous learning
Nitin Agrawal, Ashish Vulimiri
Systems for ML workshop at NIPS 2018
[pdf]
-
Low-latency analytics on colossal data streams with SummaryStore
Nitin Agrawal, Ashish Vulimiri
SOSP 2017
[abstract] [pdf]
SummaryStore is an approximate time-series store, designed for analytics, capable of storing large volumes of time-series data (~1 petabyte) on a single node; it preserves high degrees of query accuracy and enables near real-time querying at unprecedented cost savings. SummaryStore contributes time-decayed summaries, a novel abstraction for summarizing data streams, along with an ingest algorithm to continually merge the summaries for efficient range queries; in conjunction, it returns reliable error estimates alongside the approximate answers, supporting a range of machine learning and analytical workloads. We successfully evaluated SummaryStore using real-world applications for forecasting, outlier detection, and Internet traffic monitoring; it can summarize aggressively with low median errors, 0.1% to 10%, across workloads. Under range-query microbenchmarks, it stored 1 PB of synthetic stream data (1024 x 1 TB streams) on a single node using roughly 10 TB (100x compaction), with 95th-percentile error below 5% and median cold-cache query latency of 1.3 s (worst-case latency under 70 s).
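A minimal sketch of the time-decayed summary idea (an illustration under simplifying assumptions, not SummaryStore's actual design): recent data is kept in fine-grained windows and, as data ages, neighboring windows are merged so that older data is covered by exponentially coarser summaries; range queries then combine window summaries, pro-rating the windows that only partially overlap the query. The window contents (count and sum) and the merge rule below are illustrative choices.

from dataclasses import dataclass

@dataclass
class Window:
    start: int      # first timestamp covered
    end: int        # last timestamp covered
    count: int      # number of values summarized
    total: float    # running sum, supports approximate SUM/AVG range queries

class DecayedStream:
    # Toy time-decayed stream: newest data in per-value windows, older
    # windows repeatedly merged so window size grows with age.
    def __init__(self):
        self.windows = []   # ordered oldest -> newest

    def append(self, t, value):
        self.windows.append(Window(t, t, 1, value))
        self._decay()

    def _decay(self):
        # Whenever three consecutive windows hold the same number of values,
        # merge the two older ones; window sizes then grow roughly
        # exponentially with age.
        i = len(self.windows) - 1
        while i >= 2:
            a, b, c = self.windows[i - 2], self.windows[i - 1], self.windows[i]
            if a.count == b.count == c.count:
                merged = Window(a.start, b.end, a.count + b.count, a.total + b.total)
                self.windows[i - 2:i] = [merged]
            i -= 1

    def range_sum(self, lo, hi):
        # Approximate answer: fully covered windows contribute exactly;
        # partially overlapping windows contribute pro-rata (the error source).
        est = 0.0
        for w in self.windows:
            span = w.end - w.start + 1
            overlap = max(0, min(hi, w.end) - max(lo, w.start) + 1)
            est += w.total * overlap / span
        return est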
-
Learning with less: Can approximate storage systems save learning from drowning in data?
Nitin Agrawal, Ashish Vulimiri
AI Systems workshop at SOSP 2017
[abstract] [pdf]
Data empowers learning. But soon, we may have too much of it to store, process, and analyze in a timely and cost-effective manner. We take the position that approximate storage systems have a role to play in alleviating this problem. The paper is intended to generate discussion on the merits and pitfalls of data approximation, its applicability, and lack thereof, to a variety of learning algorithms, and its broader appeal to AI. Tackling the challenges of large-scale data analysis requires expertise not only in systems, but also in machine learning, statistics, and algorithms. The paper borrows from the lessons the authors learnt in building SummaryStore, an approximate storage system capable of storing large streams of time-series data (1 petabyte on a single node), while preserving high degrees of accuracy and real-time querying at unprecedented cost savings.
-
Global analytics in the face of bandwidth and regulatory constraints
Ashish Vulimiri, Carlo Curino, Brighten Godfrey, Thomas Jungblut, Jitu Padhye, George Varghese
NSDI 2015
[abstract] [pdf]
Global-scale organizations produce large volumes of data across geographically distributed data centers. Querying and analyzing such data as a whole introduces new research issues at the intersection of networks and databases. Today, systems that compute SQL analytics over geographically distributed data operate by pulling all data to a central location. This is problematic at large data scales due to expensive transoceanic links, and may be rendered impossible by emerging regulatory constraints. The new problem of Wide-Area Big Data (WABD) consists in orchestrating query execution across data centers to minimize bandwidth while respecting regulatory constraints. WABD combines classical query planning with novel network-centric mechanisms designed for a wide-area setting, such as pseudo-distributed execution, joint query optimization, and deltas on cached subquery results. Our prototype, WANalytics, builds upon Hive and uses 257x less bandwidth than centralized analytics in a Microsoft production workload and up to 360x less on popular analytics benchmarks including TPC-CH and Berkeley Big Data. WANalytics supports all SQL operators, including joins, across global data.
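One of the mechanisms listed above, deltas on cached subquery results, can be sketched with a toy example (illustrative only, not the WANalytics implementation): for a recurring per-data-center subquery, each site remembers the result it last shipped and sends only the rows added or removed since, so stable results cost almost nothing on the WAN.

class DeltaShipper:
    # Toy delta-based transfer of a recurring subquery result (a set of rows).
    # The edge data center keeps the last snapshot it shipped; the coordinator
    # applies the delta to its cached copy instead of re-pulling everything.
    def __init__(self):
        self.last_sent = set()

    def ship(self, current_rows):
        current = set(current_rows)
        added = current - self.last_sent
        removed = self.last_sent - current
        self.last_sent = current
        return added, removed          # only this crosses the WAN

def apply_delta(cached_rows, added, removed):
    return (set(cached_rows) - removed) | added

# Example: a daily per-region aggregate where only one group changes.
edge = DeltaShipper()
day1 = {("us", 100), ("eu", 80), ("asia", 60)}
day2 = {("us", 100), ("eu", 85), ("asia", 60)}
coordinator_copy = set()
coordinator_copy = apply_delta(coordinator_copy, *edge.ship(day1))  # full result on first run
coordinator_copy = apply_delta(coordinator_copy, *edge.ship(day2))  # ships only the "eu" change
assert coordinator_copy == day2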
-
WANalytics: Analytics for a geo-distributed data-intensive world
Ashish Vulimiri, Carlo Curino, Brighten Godfrey, Konstantinos Karanasos, George Varghese
CIDR 2015
[abstract] [pdf]
Large organizations today operate data centers around the globe where massive amounts of data are produced and consumed by local users. Despite their geographically diverse origin, such data must be analyzed/mined as a whole. We call the problem of supporting rich DAGs of computation across geographically distributed data Wide-Area Big-Data (WABD). To the best of our knowledge, WABD is not supported by currently deployed systems nor sufficiently studied in the literature; it is addressed today by continuously copying raw data to a central location for analysis. We observe from production workloads that WABD is important for large organizations, and that centralized solutions incur substantial cross-data center network costs. We argue that these trends will only worsen as the gap between data volumes and transoceanic bandwidth widens. Further, emerging concerns over data sovereignty and privacy may trigger government regulations that can threaten the very viability of centralized solutions.
To address WABD we propose WANalytics, a system that pushes computation to edge data centers, automatically optimizing workflow execution plans and replicating data when needed. Our Hadoop-based prototype delivers 257x reduction in WAN bandwidth on a production workload from Microsoft. We round out our evaluation by also demonstrating substantial gains for three standard benchmarks: TPC-CH, Berkeley Big Data, and BigBench.
-
WANalytics: Geo-distributed analytics for a data-intensive world
Ashish Vulimiri, Carlo Curino, Brighten Godfrey, Thomas Jungblut, Konstantinos Karanasos, Jitu Padhye, George Varghese
SIGMOD 2015 demo
[abstract]
Many large organizations collect massive volumes of data each day in a geographically distributed fashion, at data centers around the globe. Despite their geographically diverse origin, the data must be processed and analyzed as a whole to extract insight. We call the problem of supporting large-scale geo-distributed analytics Wide-Area Big Data (WABD). To the best of our knowledge, WABD is currently addressed by copying all the data to a central data center where the analytics are run. This approach consumes expensive cross-data center bandwidth and is incompatible with data sovereignty restrictions that are starting to take shape. We instead propose WANalytics, a system that solves the WABD problem by orchestrating distributed query execution and adjusting data replication across data centers in order to minimize bandwidth usage, while respecting sovereignty requirements. WANalytics achieves up to a 360x reduction in data transfer cost compared to the centralized approach on both real Microsoft production workloads and standard synthetic benchmarks, including TPC-CH and Berkeley Big Data. In this demonstration, attendees will interact with a live geo-scale multi-data center deployment of WANalytics, allowing them to experience the data transfer reduction our system achieves, and to explore how it dynamically adapts its execution strategy in response to changes in the workload and environment.
-
Low latency via redundancy
Ashish Vulimiri, P. Brighten Godfrey, Radhika Mittal, Justine Sherry, Sylvia Ratnasamy, Scott Shenker
CoNEXT 2013
[abstract] [pdf]
Low latency is critical for interactive networked applications. But while we know how to scale systems to increase capacity, reducing latency --- especially the tail of the latency distribution --- can be much more difficult. In this paper, we argue that the use of redundancy is an effective way to convert extra capacity into reduced latency. By initiating redundant operations across diverse resources and using the first result which completes, redundancy improves a system's latency even under exceptional conditions. We study the tradeoff with added system utilization, characterizing the situations in which replicating all tasks reduces mean latency. We then demonstrate empirically that replicating all operations can result in significant mean and tail latency reduction in real-world systems including DNS queries, database servers, and packet forwarding within networks.
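The core mechanism is simple enough to sketch (a generic illustration, not the paper's measurement code): issue the same request to several independent replicas and return whichever answer arrives first. The sketch below uses Python threads; the DNS example in the comments assumes the dnspython package and an illustrative server list.

from concurrent.futures import ThreadPoolExecutor, as_completed

def first_of(request_fn, replicas, timeout=None):
    # Issue the same request to every replica concurrently and return the
    # first successful result; the slower copies finish in the background.
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(request_fn, r) for r in replicas]
    try:
        errors = []
        for fut in as_completed(futures, timeout=timeout):
            try:
                return fut.result()
            except Exception as e:      # a slow or failed replica is masked
                errors.append(e)
        raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")
    finally:
        pool.shutdown(wait=False)       # do not block on the redundant copies

# Example (hypothetical, requires dnspython): resolve a name against several
# public DNS servers and keep the fastest answer.
# import dns.resolver
# def query(server):
#     r = dns.resolver.Resolver()
#     r.nameservers = [server]
#     return r.resolve("example.com", "A")[0].to_text()
# print(first_of(query, ["8.8.8.8", "1.1.1.1", "9.9.9.9"]))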
-
More is less: Reducing latency via redundancy
Ashish Vulimiri, Oliver Michel, P. Brighten Godfrey, Scott Shenker
HotNets 2012
[abstract] [pdf]
Low latency is critical for interactive networked applications. But while we know how to scale systems to increase capacity, reducing latency --- especially the tail of the latency distribution --- can be much more difficult.
We argue that the use of redundancy in the context of the wide-area Internet is an effective way to convert a small amount of extra capacity into reduced latency. By initiating redundant operations across diverse resources and using the first result which completes, redundancy improves a system's latency even under exceptional conditions. We demonstrate that redundancy can significantly reduce latency for small but critical tasks, and argue that it is an effective general-purpose strategy even on devices like cell phones where bandwidth is relatively constrained.
-
How well can congestion pricing neutralize denial-of-service attacks?
Ashish Vulimiri, Gul A. Agha, P. Brighten Godfrey, Karthik Lakshminarayanan
SIGMETRICS 2012
[abstract] [pdf]
Denial-of-service protection mechanisms usually require classifying malicious traffic, which can be difficult. Another approach is to price scarce resources. However, while congestion pricing has been suggested as a way to combat DoS attacks, it has not been shown quantitatively how much damage a malicious player could cause to the utility of benign participants. In this paper, we quantify the protection that congestion pricing affords against DoS attacks, even for powerful attackers that can control their packets' routes. Specifically, we model the limits on the resources available to the attackers in three different ways and, in each case, quantify the maximum amount of damage they can cause as a function of their resource bounds. In addition, we show that congestion pricing is provably superior to fair queueing in attack resilience.
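As a toy, single-link illustration of why pricing bounds the damage (a Kelly-style proportional-share allocation, not the paper's model): if bandwidth is sold in proportion to payment, an attacker's share of the link, and hence the harm to benign users, is limited by its budget rather than by how much traffic it can generate.

def proportional_share(capacity, bids):
    # Toy congestion-pricing mechanism on one link: each sender pays its bid
    # and receives bandwidth in proportion to it. An attacker's impact is
    # bounded by how much it is willing (and able) to pay.
    total = sum(bids.values())
    return {user: capacity * bid / total for user, bid in bids.items()}

# Benign users bid 1 unit each; to claim most of the link the attacker must
# outspend everyone, so damage scales with its budget, not its packet rate.
print(proportional_share(100.0, {"alice": 1.0, "bob": 1.0, "attacker": 2.0}))
# -> {'alice': 25.0, 'bob': 25.0, 'attacker': 50.0}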
-
Application of secondary information for misbehavior detection in VANETs
Ashish Vulimiri, Arobinda Gupta, Pramit Roy, Skanda N. Muthaiah, Arzad A. Kherani
IFIP Networking 2010
[abstract] [pdf]
Safety applications designed for Vehicular Ad Hoc Networks (VANETs) can be compromised by participating vehicles transmitting false or inaccurate information. Design of mechanisms that detect such misbehaving nodes is an important problem in VANETs. In this paper, we investigate the use of correlated information, called "secondary alerts", generated in response to another alert, called the "primary alert", to verify the truth or falsity of the primary alert received by a vehicle. We first propose a framework to model how such correlated secondary information observed from more than one source can be integrated to generate a "degree of belief" for the primary alert. We then show an instantiation of the proposed model for the specific case of Post-Crash Notification as the primary alert and Slow/Stopped Vehicle Advisory as the secondary alerts. Finally, we present the design and evaluation of a misbehavior detection scheme (MDS) for the PCN application, using such correlated information to illustrate that it can be used efficiently in MDS design.
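A toy illustration of turning correlated secondary observations into a degree of belief (the Bayesian update, the likelihood values, and the independence assumption below are illustrative choices, not the paper's framework):

def degree_of_belief(prior, observations, p_true=0.8, p_false=0.2):
    # Start from a prior that the primary alert (e.g., Post-Crash Notification)
    # is genuine, then update it with each secondary observation (e.g., a
    # Slow/Stopped Vehicle Advisory seen or not seen nearby), treating the
    # observations as conditionally independent. p_true / p_false are the
    # assumed probabilities of seeing the secondary alert when the primary
    # alert is true / false.
    odds = prior / (1.0 - prior)
    for seen in observations:
        if seen:    # corroborating secondary alert observed
            odds *= p_true / p_false
        else:       # expected corroboration missing: evidence against
            odds *= (1.0 - p_true) / (1.0 - p_false)
    return odds / (1.0 + odds)

# Three nearby vehicles report slowing down, one does not:
print(round(degree_of_belief(0.5, [True, True, True, False]), 3))  # ~0.94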
-
Unsupervised and supervised classification of hyperspectral image data using projection pursuit and Markov random field segmentation
Anjan Sarkar, Ashish Vulimiri, Suman Paul, Md. Jawaid Iqbal, Avishek Banerjee, Rahul Chatterjee, Shibendu S. Ray
International Journal of Remote Sensing, Vol. 33 Issue 18, 2012
[abstract] [link]
This work presents a classification technique for hyperspectral image analysis both when concurrent ground-truth is unavailable and when it is available. The method adapts a principal component analysis based projection pursuit (PP) procedure with an entropy index to reduce the dimensionality, followed by Markov Random Field (MRF) model based segmentation. An ordinal optimization approach to PP determines a set of good enough projections with high probability, the best among which is chosen with the help of the MRF model based segmentation. When ground-truth is absent, the segmented output is labeled with the desired number of classes so that it resembles the natural scene closely. When the landcover classes are at a detailed level, some special reflectance characteristics of the classes in the study area are determined; these are later incorporated into the MRF model based segmentation stage while minimizing the energy function in the image space. Segments are evaluated with training samples so as to yield a classified image with respect to the type of ground-truth data. Two illustrations are presented: (i) an EO-1 Hyperion sensor image with concurrent ground-truth at detailed-level classes, and (ii) an AVIRIS-92AV3C image with concurrent ground-truth, for the supervised cases. Comparisons of classification accuracies and computational times of some nonparametric approaches with those of the proposed methodology are provided for the illustrations. Experimental results demonstrate that the method provides high classification accuracy and is computationally faster than other methods.
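The dimensionality-reduction step can be sketched as follows (an illustrative toy using random candidate directions and a histogram entropy index, not the paper's ordinal-optimization procedure; the MRF segmentation stage is omitted):

import numpy as np

def entropy_index(projected, bins=64):
    # Histogram-entropy score of a 1-D projection; on whitened data, lower
    # entropy means a less Gaussian, more structured projection.
    hist, _ = np.histogram(projected, bins=bins, density=True)
    p = hist[hist > 0]
    p = p / p.sum()
    return -np.sum(p * np.log(p))

def pursue_projections(pixels, n_keep=3, n_candidates=200, seed=0):
    # Toy projection pursuit for hyperspectral pixels (rows = pixels,
    # cols = bands): PCA-whiten, score many random unit directions with the
    # entropy index, and keep the most interesting (lowest-entropy) ones.
    rng = np.random.default_rng(seed)
    X = pixels - pixels.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs / np.sqrt(vals + 1e-12)     # PCA whitening transform
    Z = X @ W
    dirs = rng.normal(size=(n_candidates, Z.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    scores = [entropy_index(Z @ d) for d in dirs]
    best = np.argsort(scores)[:n_keep]
    return Z @ dirs[best].T              # reduced representation, n_keep features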
-
Unsupervised hyperspectral image analysis with projection pursuit and MRF segmentation approach
Anjan Sarkar, Ashish Vulimiri, Shantanu Bose, Suman Paul, Shibendu S Ray
2008 International Conference on Artificial Intelligence and Pattern Recognition (AIPR-08), pp. 120-127
[abstract] [pdf]
This work deals with hyperspectral image analysis in the absence of ground-truth. The method adopts a projection pursuit (PP) procedure with an entropy index to reduce the dimensionality, followed by Markov Random Field (MRF) model based segmentation. An ordinal optimization approach to PP determines a set of "good enough projections" with high probability, the best among which is chosen with the help of MRF model based segmentation. The segmented output so obtained is labeled with the desired number of landcover classes in the absence of ground-truth. When compared against the original hyperspectral image, the methodology outperforms principal component analysis with respect to class separation, as exhibited in the illustration of an archival EO-1 hyperspectral image. Unlike typical hyperspectral image analysis, the technique is not computationally intensive. When training samples are available, the segmented regions yield a classified image with any cluster validation technique.