Abstracts 2019

3.7 Decades of Quantum Computing

Edward (Denny) Dahl, Principal Research Scientist, D-Wave

Since Richard Feynman's suggestion in 1982 that computers built from quantum building blocks could be more powerful, there has been much research in models of quantum computing, implementations and algorithms.  Shor's discovery of a quantum factoring algorithm motivated further activity, though significant practical barriers to implementation remain.  Quantum annealing has proved to be a more practical alternative and has come further in terms of applicability to real world problems. Proto-applications for this approach exist today and will come closer to economic viability with near-term improvements in qubit count, interconnectivity and control.

Databricks Delta: A Unified Data Management System for Real-time Big Data

Tathagata Das, Software Engineer, Apache Spark committer, member of its PMC

The increasing volume and variety of data and the imperative to derive value from it ever faster have created significant challenges for data management. Raw business data need to be ingested, curated, and optimized to enable data scientists and business analysts to answer their complex queries and investigations. Conventional architecture usually involves combining

- Streaming systems for low latency ingestion,
- Data lakes for cheap, large scale, long-term storage, and
- Data warehouses for the high concurrency and reliability (at higher cost) that data lakes are unable to provide.

Building solutions across this variety of storage systems leads to complex and error-prone ETL data pipelines. At Databricks, we’ve seen these problems throughout organizations of all sizes. In order to radically simplify data management, we built Databricks Delta, which is a new type of unified data management system that provides

1. *The reliability and performance of a data warehouse*: Delta supports transactional insertions, deletions, upserts, and queries; this enables reliable concurrent access from hundreds of applications. In addition, Delta indexes, compacts, and caches data, thus achieving up to 100x better performance over Apache Spark running on Parquet.

2. *The speed of streaming systems*: Delta transactionally incorporates new data in seconds and makes this data immediately available for high-performance queries using either streaming or batch.

3. *The scale and cost-efficiency of a data lake*: Delta stores data in open Apache Parquet format in cloud blob stores like S3. From these systems it inherits low cost, massive scalability, support for concurrent accesses, and high read and write throughput.

With Delta, organizations no longer need to make a tradeoff between storage system properties.
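
The abstract does not describe how Delta is implemented. As a rough illustration of how transactional upserts and consistent reads can sit on top of immutable files, the sketch below keeps an ordered commit log next to the data files. It is a toy in-memory model under that assumption, not Delta's implementation, and `ToyTable`, `upsert`, and `snapshot` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model: immutable data files plus an ordered commit log."""
    files: dict = field(default_factory=dict)  # file name -> list of (key, value) rows
    log: list = field(default_factory=list)    # each commit: {"add": [...], "remove": [...]}

    def snapshot(self, version=None):
        """Return {key: value} as of a given commit (default: the latest)."""
        version = len(self.log) if version is None else version
        live = []
        for commit in self.log[:version]:
            live = [f for f in live if f not in commit["remove"]] + commit["add"]
        rows = {}
        for name in live:
            rows.update(self.files[name])  # later files override earlier keys
        return rows

    def upsert(self, updates):
        """Write one new file, then publish it with a single log append."""
        name = f"part-{len(self.files)}"
        self.files[name] = list(updates.items())
        self.log.append({"add": [name], "remove": []})
        return len(self.log)  # version number of the new commit


table = ToyTable()
v1 = table.upsert({"a": 1, "b": 2})
v2 = table.upsert({"b": 20, "c": 3})
latest = table.snapshot()      # sees both commits
as_of_v1 = table.snapshot(v1)  # sees only the first commit
```

In this model a reader never sees a half-written update, because a file only becomes visible once its commit is appended to the log, and older versions stay readable by replaying a prefix of the log.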

Enabling Responsible Analytics: Generating Bias-Corrected and Privacy-Preserving Training Data

Bill Howe, Faculty, University of Washington

Data too sensitive to be "open" typically remains "closed" as proprietary information. This dichotomy undermines efforts to make algorithmic decision systems more fair, transparent, and accountable. Access to proprietary data is needed by government agencies to enforce policy, researchers to evaluate methods, and the public to hold agencies accountable; all of these needs must be met while preserving individual privacy and affording oversight by data owners on how the data is used.

We're developing algorithms for customized synthetic datasets that offer a) strong privacy guarantees, b) removal of signals that could expose competitive advantage, and c) removal of biases that could reinforce discrimination, all while maintaining fidelity to the original data. We find that releasing semi-synthetic data in conjunction with strong legal protections for the original data strikes a balance between transparency, proprietorship, privacy, and research objectives.

In this talk, I’ll describe the algorithms we’re developing to generate privacy-preserving and bias-corrected synthetic datasets, and touch on the legal protections that govern their use.  These datasets are intended to be shared with academic and private collaborators to experiment with advanced analytics without incurring significant legal risk, and to focus attention on pressing problems in housing, education, and mobility.

Supercharging Business Decision with AI: Insights, Optimize and Personalize

Eddie Ma, Engineering Director, Uber

Uber runs multi-entity marketplaces for food delivery, ride sharing, and freight, balancing the supply and demand for each of these verticals on a real time basis. The systems and services at Uber perform near real time analytics to forecast the needs of the marketplace and translate them into incentives for our driver partners and/or promotions for our riders. The technical services create weekly financial goals for our operations teams to remain within budget while aiming to satisfy the maximum number of orders.

Our finance data stack is built using our core messaging framework, data transformation pipelines, and real time data compute infrastructure that spans various clouds. The business intelligence services perform a continuous balancing act between speed and accuracy of data, intensified by the need to have the right level of granularity for the right level of visibility into our revenue and budget spend for operations teams across the globe. The combined power of offline and online data processing also aids in running machine learning models to identify financial fraud and safety incidents in a timely manner.

In this talk, we cover our data processing principles that are defined by our business needs. We also cover the technical infrastructure that is responsible for making millions of financial decisions. Finally, we will share with you our business requirements that are in search of technical answers.

Using High Capacity Flash based Storage in Extremely Large Databases

Keith Muller, Halıcıoğlu Data Science Institute, UCSD, and Fellow in the Technology and Innovation Office at Teradata Corporation

Extremely large database systems have been enjoying the benefits of a steady rate of improvements in flash-based storage sub-systems. As a result, these devices are slowly displacing HDD-based storage in extremely large database systems that can scale to tens of thousands of devices. Along this path, many design tradeoffs were made to accelerate the decline of HDDs. Examples include a focus by the flash-based SSD vendors on $/GB and GB/unit volume as key design metrics. We look at the accumulated impact of several design tradeoffs and illustrate some of the challenges we face going forward in terms of performance continuity, data protection, cost optimization and volume optimization.

Hardware Acceleration for Big Data Analytics and AI Workloads

Paolo Narvaez, Sr. Principal Engineer and Engineering Director for Analytics and AI Solutions, Intel Corporation

Big Data analytics presents new computational challenges with the volume of data movement and parallelization of operations required to be done at scale. In this talk we will present new hardware features such as Optane DC Persistent memory that enable new forms of faster storage and new instructions such as VNNI that enable wider parallelization of common operations. We will show how these features when enabled correctly can provide significant performance improvements to widely used workloads.

European Space Agency Gaia Mission as a Classification Machine for Variability Studies

Krzysztof Nienartowicz, Data architect and manager of Gaia Data Processing Centre, Observatory of Geneva, Switzerland

Gaia is a milestone mission of the European Space Agency and one of the biggest ESA astronomical missions to date (http://sci.esa.int/gaia/).

The main goal of the Gaia mission is to produce the largest and most precise three-dimensional map of our Galaxy by surveying a fraction of its 100 billion stars. Beyond that, Gaia is a multi-instrument spacecraft providing excellent photometric, spectroscopic and radial velocity measurements over its five-year mission span. I will present the mission, the Gaia consortium, and how we handle the classification and characterization of variable stars at the Gaia Data Processing Centre in Geneva for this petabyte-scale time-series analysis effort. The analysis of time series from nearly two billion light sources will ultimately produce a catalog of variable stars 100 times bigger than what we have today, reshaping our knowledge about the universe. I will also present the technology chosen, the technical and scientific challenges, and the solutions built at the centre on open source components and the parallel DBMS Postgres-XL.

Snorkel: A System for Training Set Modeling

Chris Ré, Associate Professor, Stanford University

High quality training data sets are critical ingredients to most machine learning applications. However, in comparison to the large amount of progress in improving discriminative prediction, comparatively little attention has been paid to modeling the generative process of labeling training data. This talk describes some of our recent work modeling the quality of training sets, including a theoretical framework that makes the connection to a class of latent variable models that allow us to understand basic questions of data quality, correlation among sources, and more. This theoretical work is the basis of a software system called Snorkel (Snorkel.stanford.edu) that has been successfully deployed by some of the world's largest machine learning organizations.
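
To make the idea of modeling the labeling process concrete, here is a minimal sketch in which several noisy labeling sources vote on each example and each source's agreement with the majority serves as a crude proxy for its quality. This is a deliberately simple stand-in: it is not Snorkel's API and not the latent variable model described in the talk, and all function names and example texts are hypothetical.

```python
ABSTAIN = 0  # a labeling function returns +1, -1, or ABSTAIN


def lf_mentions_free(text):
    return 1 if "free" in text.lower() else ABSTAIN


def lf_has_greeting(text):
    return -1 if text.lower().startswith(("hi", "hello")) else ABSTAIN


def lf_many_exclamations(text):
    return 1 if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_free, lf_has_greeting, lf_many_exclamations]


def majority_label(votes):
    """Sign of the vote total: +1, -1, or 0 when the sources tie or all abstain."""
    total = sum(votes)
    return (total > 0) - (total < 0)


def agreement_rates(docs):
    """Per source: fraction of its non-abstaining votes that match the majority."""
    matches = [0] * len(LABELING_FUNCTIONS)
    counts = [0] * len(LABELING_FUNCTIONS)
    for text in docs:
        votes = [lf(text) for lf in LABELING_FUNCTIONS]
        label = majority_label(votes)
        for j, vote in enumerate(votes):
            if vote != ABSTAIN:
                counts[j] += 1
                matches[j] += vote == label
    return [m / c if c else None for m, c in zip(matches, counts)]


docs = [
    "Free prize!! Click now!",
    "Hello, lunch tomorrow?",
    "Hi, the free tickets arrived",
]
labels = [majority_label([lf(d) for lf in LABELING_FUNCTIONS]) for d in docs]
rates = agreement_rates(docs)
```

The third document shows why a plain vote falls short: two sources conflict and the vote ties, which is the kind of case where estimates of source accuracy and correlation would be needed to decide.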

Detecting novel associations in large data sets

David Reshef, MIT

As data sets grow in dimensionality, making sense of the wealth of interactions they contain has become a daunting task, not just due to the sheer number of relationships but also because relationships come in different forms (e.g. linear, exponential, periodic, etc.) and strengths. If you do not already know what kinds of relationships might be interesting, how do you find the most important or unanticipated ones effectively and efficiently? This is commonly done by using a statistic to rank relationships in a data set and then manually examining the top of the resulting list. For such a strategy to succeed though, the statistic must give similar scores to equally noisy relationships of different types. In this talk we will formalize this property, called equitability, and show how it is related to a variety of traditional statistical concepts. We will then introduce the maximal information coefficient, a statistic that has state-of-the-art equitability in a wide range of settings, and discuss how its equitability translates to practical benefits in the search for dependence structure in high-dimensional data using examples from global health and the human gut microbiome.
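
The sketch below illustrates the kind of grid-based dependence score the abstract alludes to. It is a simplified approximation, not the maximal information coefficient itself: it tries only equal-frequency grids, whereas MIC optimizes where the grid lines fall. The function names and the budget exponent of 0.6 are assumptions made for this example.

```python
import math
import random
from collections import Counter


def rank_bins(values, k):
    """Assign each value to one of k equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def mutual_information(a, b):
    """Mutual information in bits between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def grid_score(x, y, budget_exponent=0.6):
    """Largest normalized mutual information over grids with nx * ny <= n ** budget_exponent."""
    budget = len(x) ** budget_exponent
    best = 0.0
    for nx in range(2, int(budget // 2) + 1):
        for ny in range(2, int(budget // nx) + 1):
            mi = mutual_information(rank_bins(x, nx), rank_bins(y, ny))
            best = max(best, mi / math.log2(min(nx, ny)))
    return best


x = list(range(200))
line_score = grid_score(x, x)                                  # noiseless linear
parabola_score = grid_score(x, [(i - 100) ** 2 for i in x])    # noiseless parabola
rng = random.Random(0)
noise_score = grid_score(x, [rng.random() for _ in x])         # no relationship
```

In the spirit of equitability, the noiseless line and the noiseless parabola both score near the top of the range, while unrelated noise scores much lower. A linear correlation coefficient, by contrast, would be close to zero for the parabola.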

Promises and Pitfalls of 'Big' Medical Records Data for Research

Kathryn Rough, Research Scientist, Google

The widespread adoption of electronic health records has facilitated passive collection of large amounts of computerized medical data. Researchers are eager to leverage these data into insights that can meaningfully improve clinical care and patient outcomes. Yet enthusiasm about the impressive size and availability of these datasets should not diminish our awareness of their weaknesses; as with all research, it is essential to draw careful conclusions that are well supported by the data. This talk will give an overview of electronic health record data, review its potential strengths, and outline five common pitfalls, with recommendations on how to mitigate them.

Design of BigQuery ML

Umar Syed, Research Scientist, Google

BigQuery ML is an in-database machine learning system that allows data scientists and analysts using Google BigQuery to easily train and deploy predictive models. BigQuery ML is tightly integrated with the underlying database query processing engine and reuses much of the same infrastructure. This talk will discuss the design choices our team made to leverage the strengths and respect the limitations of BigQuery, enabling BigQuery ML to run efficiently on datasets ranging from thousands to billions of examples.

Accelerating the Machine Learning Lifecycle with MLflow

Matei Zaharia, Cofounder and Chief Technologist at Databricks

Machine learning development creates multiple new challenges that are not present in traditional software development. These include keeping track of the myriad inputs to an ML application (e.g., data versions, code and tuning parameters), reproducing results, and production deployment. We describe MLflow, an open source platform to streamline the machine learning lifecycle that we launched in response to these challenges at Databricks. MLflow covers three key problems: experiment management, reproducibility, and model deployment, using generic APIs that work with any ML library, algorithm and programming language. The project has a rapidly growing open source community, with over 75 contributors representing more than 30 companies since its launch in June 2018.