2019 Lightning Talk Abstracts
Eliminating Bias in the Deployment of Machine Learning
The primary source of bias in machine learning is not the algorithms deployed, but rather the data used to train the predictive models. In this talk we will explain why this is a serious problem and what can be done about it, and we will discuss the implications for both supervised and unsupervised learning.
Data Summaries for Large Scale Aggregation
Streaming data summaries (sketches) have enabled analytics systems to answer approximate queries over large datasets with minimal memory and compute overheads. We have seen in our collaborations with Microsoft and Imply.io that these sketches are being used in a variety of workloads, including aggregation over data cubes, despite not being designed for such use cases. In particular, large collections of small-footprint sketches can be materialized and then aggregated quickly along different dimensions for interactive exploration. In this talk I will describe our work developing new data summaries for frequent items and quantile queries specifically for OLAP aggregation. By tuning our sketches to the characteristics of the workload, we can construct different summaries that are more efficient to aggregate and yield better final query accuracy after being aggregated. In particular I will discuss how balancing statistical bias and variance in a sketch can help optimize accuracy when aggregating many sketches.
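As a rough illustration of what it means to materialize many small sketches and then aggregate them along a dimension, the following minimal sketch uses the classic Misra-Gries frequent-items summary and its standard merge rule. This is a well-known textbook technique swapped in for illustration, not the new summaries described in the talk; the function names, the counter budget k, and the toy data are all assumptions.

```python
from collections import Counter


def mg_summary(items, k):
    """Build a Misra-Gries summary of `items` using at most k counters."""
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def mg_merge(a, b, k):
    """Merge two summaries into one that again holds at most k counters."""
    merged = Counter(a)
    merged.update(b)  # add counts item by item
    if len(merged) > k:
        # Subtract the (k+1)-th largest count and keep only positive counters.
        threshold = sorted(merged.values(), reverse=True)[k]
        merged = {key: c - threshold for key, c in merged.items() if c > threshold}
    return dict(merged)


# Hypothetical data cube with two cells, each pre-summarized independently.
cell_1 = ["a"] * 50 + ["b"] * 30 + list("cdefg")
cell_2 = ["a"] * 40 + ["c"] * 20 + list("hijkl")
k = 3
rollup = mg_merge(mg_summary(cell_1, k), mg_summary(cell_2, k), k)
print(rollup)
```

Each counter in the merged summary underestimates the true count by at most n/(k+1), where n is the total number of items across the merged cells. The estimates are therefore biased downward, which is one concrete instance of the bias and variance trade-off that matters when many sketches are combined.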
Towards Near-Data Processing in Deep and Cold Storage Hierarchies
Modern high-resolution observational instruments and complex models of the earth system and of physical, chemical, and biological processes generate several hundred petabytes of scientific data per year. Digital data archives store such scientific data in private clouds for further investigation and long-term preservation, and disseminate it through data platforms via order-based catalogs. To reduce the total cost of ownership, such data platforms employ hierarchical storage management with large, disk-based caches and robotic tape libraries. Prefetching all the data from a slower storage layer is typically not possible due to the ad-hoc nature of scientific analyses and the size of the data set required to achieve satisfactory results for long-term trend analysis and prediction. Consequently, data movement is one of the most time- and energy-consuming tasks in data-intensive scientific workflows. Near-data processing (NDP) has been advocated to reduce the amount of data to be transferred as early as possible. Unfortunately, for large-scale scientific data, this currently benefits only the faster layers of the storage hierarchy. In a deep storage hierarchy, as is common for active data archives, NDP can be even more beneficial if pushed further down the storage hierarchy. We propose CryoDrill, an NDP framework that pushes parts of the computation of a data analysis workflow down the storage hierarchy to enable processing close to the data while minimizing wasteful data movement up the storage hierarchy. CryoDrill specifically targets complex data analysis tasks on large amounts of scientific data residing in cold storage devices, such as archival disks, massive-array-of-idle-disks systems, and robotic tape libraries.
We plan to use in-storage processing resources by extending storage controllers to run simple computation tasks (e.g., filtering, data tiling and tile selection, and aggregation) directly within the storage device, or to exploit near-storage processing capabilities through FPGAs.
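To make the data-movement argument concrete, here is a toy model of pushing a filter and an aggregation down to the device instead of reading everything up the hierarchy. It is a sketch under stated assumptions, not CryoDrill's interface: the ColdStore class, its methods, and the record-count transfer metric are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ColdStore:
    """Hypothetical stand-in for a cold storage device holding numeric records."""
    records: List[float]
    transferred: int = 0  # number of values moved up the storage hierarchy

    def read_all(self) -> List[float]:
        # Conventional path: every record travels to the host.
        self.transferred += len(self.records)
        return list(self.records)

    def run_near_data(self, keep: Callable[[float], bool]) -> Tuple[float, int]:
        # Near-data path: filter and aggregate "inside" the device,
        # so only a (sum, count) pair leaves it.
        selected = [r for r in self.records if keep(r)]
        self.transferred += 2
        return sum(selected), len(selected)


def is_hot(value: float) -> bool:
    """Hypothetical selection predicate for the analysis."""
    return value > 900.0


# Host-side processing: move everything, then filter and average.
host_store = ColdStore([float(i) for i in range(1000)])
hot = [v for v in host_store.read_all() if is_hot(v)]
host_mean = sum(hot) / len(hot)

# Near-data processing: move only the partial aggregate.
ndp_store = ColdStore([float(i) for i in range(1000)])
total, count = ndp_store.run_near_data(is_hot)
ndp_mean = total / count

print(host_mean, host_store.transferred)
print(ndp_mean, ndp_store.transferred)
```

Both paths compute the same mean, but the near-data path moves two values instead of a thousand. On tape or idle-disk tiers, where each transfer is slow and energy-intensive, that difference is the motivation for pushing such simple tasks down.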
Teaching Open Datasets to Dance Together
Since 2009, municipal and national governments have been uploading datasets into open data portals. Already today, Open Government Data (OGD) is one of the biggest and most promising types of data that one can acquire on the web. Google recently released its first beta search engine that specializes in open datasets (https://toolbox.google.com/datasetsearch). In 2013, a McKinsey report estimated that this OGD revolution could add up to 5.4 trillion dollars annually in potential economic value. Nonetheless, virtually everywhere, OGD datasets contain a very thin metadata layer that tells us nothing about their potential to participate in economic activity outside the domains within which these datasets were originally created. Simply put, OGD datasets are out there on the web, but virtually no one can find the useful ones. So, to unleash the promise of the OGD revolution, we must first address the OGD metadata challenge: how do we describe an open dataset well? Going one step further, we can ask: is it possible to know in advance how an open dataset created in one domain might be meshed with a dataset created in another domain in order to discover an important insight in yet a third domain? I will propose a new approach by which some OGD datasets might interface with other OGD datasets and jointly improve their metadata. In a nutshell, I will propose ideas, concepts, and some evidence to demonstrate that it is possible to teach datasets how to help each other discover their potential economic value.
Evolving the computing technology in an extremely large eCommerce data platform: a JD.com case study
JD.com is one of the largest online eCommerce companies in the world, where big data analytics is a key enabler for effective business decisions. The JD big data platform centrally manages all the computation and storage needs for data analytics: with 40k+ modern servers, more than one million jobs are run every day and data occupies 800+ PB of storage. Most of the jobs run on the mature and stable Hadoop MapReduce technology. We are working on evolving the computing technology to: (1) be more cloud native and container friendly, in anticipation of co-locating with online jobs on a container-based private cloud to improve utilization; (2) deliver higher running efficiency to reduce cost; and (3) support batch, streaming, and machine learning data analytics jobs consistently. Apache Spark stands out as a good candidate. In practice, however, evolving the computing technology to Spark in such a large data platform turns out to be a very challenging design and implementation problem. For example, to leverage the elasticity of a container-based private cloud, we pushed for a design that decouples computing from storage, and built a centrally managed external data shuffle service to reduce reliance on local storage. The external shuffle service also improves the stability of Spark jobs by reducing fetch failures, which are known to be a source of instability in production. In addition to the improvements from the Spark community, we customize and build tools around Spark to meet our target: running each job reliably within a given deadline with minimal resources. This short talk will discuss our design for the technology evolution, as well as the lessons learned from implementation and deployment. It will conclude with the current status and the plan for next steps.