Invited Speakers

Big data in the Intensive Care Unit

Big data has brought much promise for the discovery of treatments and therapies, drug safety, and care delivery processes by identifying which treatment would work best for which patients. The NIH has recently launched the 'All of Us' initiative to collect data from one million or more patients (electronic health records (EHR), imaging, genomics, environmental data, etc.) over the next few years. The intensive care unit represents a unique data source in this context, with carefully captured, detailed, high-volume data from different systems such as the EHR, electrocardiogram, blood pressure, infusion pumps, and photoplethysmogram, among others. However, the mere availability of data does not translate into knowledge or improved outcomes. Questions remain about what data are needed and how to integrate these high-volume data with high-throughput infrastructure for near-real-time decision making by clinicians. In this talk, we present our work on integrating this heterogeneous high-volume data with state-of-the-art technologies for retrospective analysis and near-real-time decision making, using the Medical Information Mart for Intensive Care (MIMIC-III) database, a nationally recognized data set.

Peter Bailis, Stanford, Machine Learning
Data Infrastructure for the DAWN of Widespread ML

Data volumes continue to rise, providing unprecedented resolution and fidelity across a wide range of applications in industry, science, and government. However, these volumes pose severe computational overheads and require us to rethink the design of data analytics to prioritize the scarcest resource in modern analytics: human attention. In this talk, I'll describe our work on next-generation ML-powered data infrastructure to address this problem and others in the Stanford DAWN (http://dawn.cs.stanford.edu) project, a research initiative designed to enable more usable and efficient machine learning infrastructure.

Chaitanya Baru, NSF, Senior Advisor for Data Science (CISE)
Common Model Infrastructure

The rapid accumulation of large amounts of data leads naturally to the question of how to manage the increasingly complex modeling process and the large numbers of data-driven models that are being generated from these data—for machine learning, deep learning, and other data-intensive analytics approaches. Current modeling practices are rather ad hoc and domain-specific, often depending upon the experience and expertise of individual data scientists and on the use of domain- and even application-specific pre-processing steps. Research is needed on a common model infrastructure to support discovery, sharing, reuse, and serving of machine learning and other analytic models. Such an infrastructure should also strive to support reproducibility, and research should investigate issues related to transparency and interpretability of models, and the ability to describe and explain computations in domain contexts. Advances in this area are relevant to NSF’s Harnessing the Data Revolution (HDR) Big Idea (https://www.nsf.gov/news/news_summ.jsp?cntn_id=244678&org=CI), which aims to engage NSF’s research community in the pursuit of a cohesive, federated, national-scale approach to advance fundamental data-centric research and data-driven domain discoveries, build data infrastructure for research, and develop a 21st-century data-capable workforce.

Stephen Brobst, Teradata
Digital Twins and the Future of Analytics

Digital twin technology is facilitating a new era of data intensive design and operation of machinery, organizations, business processes, and complex systems. The fuel for digital twins is data plus advanced machine learning algorithms. In this session we will discuss the design principles for digital twin solutions along with the best practice applications of the technology. We will also discuss the new business models that are enabled through use of digital twin technology for aggressively participating in the digital economy.

  • Learn about the value proposition for deploying digital twin solutions.
  • Learn about the distinct lifecycle stages of a digital twin.
  • Learn how to use machine learning and deep learning together with digital twins.

Surajit Chaudhuri, Microsoft Research
What Can Data Platforms Do for Machine Learning Workloads?

Almost two decades ago, some enterprise database systems, including Microsoft SQL Server, offered the ability to support data mining primitives (including predictive models) natively in the database system, with the goal of reducing data movement and supporting data governance. However, such attempts did not gain traction. In this talk, we reflect on that failure and explain how the architecture of modern data platforms (SQL Server 2017 and Azure Data Platforms) offers data scientists the flexibility to use the languages and IDEs of their choice (e.g., Python, R) while preserving the benefits of reduced data movement and data governance. In the second part of the talk, we discuss other fundamental ways in which data platforms can help data scientists with the task of exploration. Specifically, we focus on the challenges of finding and transforming structured data. We discuss our Search Engine for Transform-Data-By-Example that enables "Self-Service" Data Transformation.
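
As a concrete illustration of this pattern, here is a minimal sketch (hypothetical DSN and table names, not from the talk) of invoking SQL Server 2017's sp_execute_external_script from Python, so the script runs next to the data rather than pulling rows out of the database:

```python
import pyodbc

# Hedged sketch (hypothetical DSN and table): SQL Server 2017 runs the
# Python script inside the database engine, so rows never leave it.
conn = pyodbc.connect("DSN=sqlserver2017")
tsql = """
EXEC sp_execute_external_script
  @language = N'Python',
  @script = N'
import pandas as pd
# InputDataSet / OutputDataSet are the procedure-provided DataFrames.
OutputDataSet = InputDataSet.describe().reset_index()
',
  @input_data_1 = N'SELECT age, charges FROM dbo.patients';
"""
for row in conn.execute(tsql):
    print(row)
```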

Jason Crane and Duan Xu, UCSF School of Medicine
New Frontiers in Medical Imaging in the Age of Big Data and Precision Medicine

This talk introduces the challenges of managing medical images at the complexity, multi-factor inquiry, and massive scale that Precision Medicine demands.  This is a task for very-large-scale computing that extends beyond the realm of current systems.

The use of multi-modality medical imaging has dramatically expanded over the last two decades, and the push for increasingly precise diagnosis and treatment monitoring has accelerated.  At UCSF, since 2000 when Radiology went digital, we have accumulated over 1.5 billion images and over 1 PB of clinical and research data. The size, spatial resolution, and relevance of dynamic imaging scans continue to trend upward as newer and more advanced technologies become available. HIPAA and other privacy regulations have a major impact upon the process of making data available for machine learning and population-based investigations. Developing reliable mechanisms for preparing and organizing the data in a manner that overcomes these limitations remains a challenging and ongoing process.

Looking beyond imaging data, the integration of genetics, labs, electronic medical record (EMR), and pathology data adds to the complexity of storing and managing repositories for use in big-data-related investigations.  At UCSF, we have embarked on this process by taking a modular approach toward incrementally expanding the types of data included in our repository. In this talk we will share our experience at UCSF in developing a multi-modal data discovery platform that includes imaging and information from the hospital’s medical imaging records.

Somalee Datta, Stanford School of Medicine
Intersection of Healthcare, Machine Learning, and Big Data

In the last 25 years, Electronic Medical Records have become ubiquitous in healthcare, the cost of genome sequencing has dropped a million-fold, wearable sensors and IoMT devices are collecting billions of measurements a day, the cloud has democratized access to petascale computing, and data science has become the career of choice. I will give a bird’s-eye view of key successes and challenges at the resulting frontier of precision health.

Sujit Dey, UC San Diego, Director, Center for Wireless Communications
Predictive and Personalized Virtual Care using IoMTs and Hybrid Analytics

Future healthcare will see radical changes in the delivery of preventative, routine, and post-surgical care, yielding lower-cost care models with meaningful behavior change and patient engagement. Virtual care is evolving as a promising enabler of future care delivery, with approaches like video visits and mobile clinics gaining commercial adoption. This talk will describe our projects enabling the next generation of predictive and personalized virtual care, using wireless Internet of Medical Things (IoMT) sensors and other medical devices that provide patient health, activity, and contextual data, together with innovations in machine vision and intelligence, cloud computing, and IoMT communications. We will demonstrate the accuracy of our predictive models and show our ability to provide precision guidance and recommendations with positive impact. We will discuss a scalable hybrid edge-cloud data analytics architecture that can be used to address issues of mobility, privacy, and computing and communications resources. Finally, we will point out some of the challenges we are uncovering in our virtual care analytics research, and suggest future work and changes in the related ecosystem needed to make wireless virtual care a reality.

R. Adams Dudley, UCSF School of Medicine
From big data to real policy: Making EHR data matter in health care

While the creation of EHRs offers the promise of large data repositories and informatics applications that can revolutionize care, the truth is that change has come very slowly. Key barriers to change are entrenched interests and a slow-to-evolve system of payment and reputational incentives. In this talk, we will describe how analysis of big data from the ICU setting may finally be enough to overcome this inertia, by allowing measurement of key aspects of performance that are important to payers and policymakers.

Ozgun Erdogan, Citus Data Inc.
SQL, Scaling, and What's Unique About PostgreSQL

Relational databases are known to have tightly integrated components that make them hard to extend. This usually leads to new databases that are specialized for scale and certain workloads.

For its last three releases, PostgreSQL has been providing official APIs to extend the database *without forking it*. Users can read the documentation, implement their ideas according to the APIs, and then dynamically load in their shared object to override or extend any database submodule.

In this talk, we will demonstrate four example extensions, built on these APIs, that turn PostgreSQL into a distributed database focused on real-time workloads:

1. PostGIS turns Postgres into a spatial database. It adds support for geographic objects, allowing location queries to be run in SQL.
2. JSON and JSONB enable storing and querying semi-structured data.
3. HyperLogLog & TopN add approximation algorithms to Postgres. These sketch algorithms are used in distributed systems when real-time responses to queries matter more than exact results.
4. Citus transforms Postgres into a distributed database. Citus transparently shards and replicates data, performs distributed deadlock detection, and parallelizes queries.
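
To make the extension workflow concrete, here is a brief sketch (hypothetical connection settings and table) of exercising two of these capabilities from Python with psycopg2:

```python
import psycopg2
from psycopg2.extras import Json

# Hypothetical database and table, for illustration only.
conn = psycopg2.connect("dbname=demo")
cur = conn.cursor()

# Extensions load into a running server without forking it (assumes the
# extension's packages are installed on the host).
cur.execute("CREATE EXTENSION IF NOT EXISTS postgis;")

# Built-in JSONB: store and query semi-structured data.
cur.execute("CREATE TABLE IF NOT EXISTS events (payload jsonb);")
cur.execute("INSERT INTO events (payload) VALUES (%s);",
            (Json({"type": "click", "ms": 17}),))
cur.execute("SELECT payload->>'type' FROM events "
            "WHERE (payload->>'ms')::int < 100;")
print(cur.fetchall())
conn.commit()
```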

In summary, PostgreSQL's extensible architecture puts it in a unique place for scaling out SQL and also for adapting to evolving hardware trends. It could just be that the monolithic SQL database is dying. If so, long live Postgres!

David Glazer, Verily Life Sciences
A data biosphere for all of us (and All of Us)

There is a growing global community bringing data and life science together, by building tools and practices to support flagship scientific initiatives, including the NIH Data Commons, the Human Cell Atlas, and the All of Us Research Program. This session provides an overview of the opportunity for shared infrastructure, the emergence of shared approaches, and the early stages of a shared Data Biosphere vision. The vision will evolve with experience and community input, but the fundamental principles provide a solid foundation for using information to improve human health.

Brian Granger, Cal Poly, co-founder Project Jupyter
Project Jupyter: Reproducible, Scalable, Collaborative Data Science

Project Jupyter is an open-source project that develops software, open standards, and services for interactive and reproducible computing. The main application developed by the project is the Jupyter Notebook, a web application that allows users to create documents that combine live code with narrative text, mathematical equations, and visualizations. In this talk I will describe how the Jupyter architecture enables data scientists to collaborate productively with massive datasets, and share their results within and across organizations.
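
As a small illustration of the open document format behind this, here is a sketch (illustrative only) that builds a notebook programmatically with the nbformat library:

```python
import nbformat
from nbformat.v4 import new_notebook, new_code_cell, new_markdown_cell

# A notebook is a plain JSON document combining narrative text and live
# code, which is what makes it shareable and reproducible.
nb = new_notebook(cells=[
    new_markdown_cell("# Analysis\nNarrative text alongside the code."),
    new_code_cell("import math\nmath.sqrt(2)"),
])
nbformat.write(nb, "example.ipynb")
```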

Jim Green, Cisco, CTO of IoT Software
IoT Makes Big Data Look Small

The emerging world of IoT introduces the prospect of collecting data at frequencies measured in milliseconds from each of thousands of devices.  The sum of the collective telemetry has the potential to swamp networks and storage devices.  The only approach to handling this is to distribute computing and intelligent data reduction to “the edge”, before much of this data consumes network bandwidth.  This is a new form of distributed computing, requiring a new form of distributed microservices.  This presentation will include live use cases in industrial computing, and offer a more precise definition of “edge computing” and “fog computing”, as well as covering the new convergence between operational technologies (OT) and information technologies (IT).
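
As a hedged illustration of intelligent data reduction at the edge (not Cisco's implementation), a simple deadband filter forwards a reading only when it changes meaningfully:

```python
# Deadband filtering: emit a sensor reading only when it moves beyond a
# threshold, cutting millisecond-rate telemetry before it consumes
# network bandwidth.
def deadband(readings, threshold=0.5):
    last_sent = None
    for timestamp, value in readings:
        if last_sent is None or abs(value - last_sent) >= threshold:
            last_sent = value
            yield timestamp, value  # forward upstream

raw = [(0, 20.0), (1, 20.1), (2, 20.2), (3, 21.0), (4, 21.05)]
print(list(deadband(raw)))  # [(0, 20.0), (3, 21.0)]
```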

Bill Howe, University of Washington
Machine Learning meets Data Curation: Automatic Validation of Scientific Claims

Data in public repositories remains remarkably underused despite significant investments in open science. Making data available online turns out to be the easy part; making the data usable for science is the harder challenge, and it motivates ML algorithms that help automate curation tasks and enable longitudinal, integrative analysis.

But training data is scarce and expensive due to the specialized nature of the tasks, and semantic heterogeneity between datasets makes integration difficult.

To address these issues, we combine distant supervision and co-learning methods to provide high-quality labels with zero training data, and show that this approach outperforms even the state-of-the-art (and expensive) supervised methods.  We then use statistical claims extracted from the text of scientific papers to disambiguate schema mappings across disparate datasets.   Finally, we automate experiments to verify extracted claims against the integrated data, to help researchers, journal editors, and curators hold scientists accountable for weakly reproducible results.
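
To make the distant-supervision idea concrete, here is a toy sketch (hypothetical knowledge base and records, not the authors' system) that derives noisy labels by matching records against external knowledge instead of hand-labeled data:

```python
# Distant supervision, in miniature: external knowledge stands in for
# manual annotation, yielding noisy but free training labels.
KNOWN_CANCER_GENES = {"TP53", "BRCA1", "EGFR"}

records = [
    {"id": 1, "text": "Expression of TP53 in tumor tissue samples"},
    {"id": 2, "text": "Monthly rainfall measurements for 2016"},
]

def distant_label(record):
    # Heuristic: a known cancer-gene mention labels the dataset "oncology".
    tokens = set(record["text"].split())
    return "oncology" if tokens & KNOWN_CANCER_GENES else "other"

print({r["id"]: distant_label(r) for r in records})
# {1: 'oncology', 2: 'other'}
```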

These approaches are already beginning to have impact: computational biologists are beginning to use our curated gene expression corpus as a gold standard in the search for new cancer treatments, and social scientists are using our curated corpus of scientific figures to understand and optimize how researchers use visualization to communicate (an emerging field we call "viziometrics").

Sila Kiliccote, Stanford
DOE; Managing Director of Grid Innovations at Stanford and leader of the Grid Integration, Systems and Mobility research at SLAC National Accelerator Laboratory. Her work involves highly distributed power utilization and the optimization of power use, leveraging cloud-based analysis and ML techniques.


Learned Index Structures

Whenever efficient data access is needed, index structures are the answer, and a wide variety of choices exists to address the different needs of various access patterns. For example, B-Trees are the best choice for range requests (e.g., retrieve all records in a certain timeframe); Hash-Maps are hard to beat in performance for key-based lookups; and Bloom filters are typically used to check for record existence. Yet all of these indexes remain general-purpose data structures and do not take advantage of common patterns prevalent in real-world data. For example, if the goal were to build a highly tuned system to store and query ranges of fixed-length records with continuous integer keys (e.g., the keys 100M to 200M), one would not use a conventional B-Tree index over the keys, since the key itself can be used as an offset, making it an O(1) rather than O(log n) operation to look up the beginning of a range of keys. Perhaps surprisingly, the same optimizations are possible for other data patterns. In other words, knowing the data distribution allows for highly optimizing almost any index the database system uses.
In this talk, I will start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. Our initial results show that, by using simple neural nets, we are able to outperform cache-optimized B-Trees by up to 3x in speed while saving an order of magnitude in memory over several real-world data sets. More importantly, though, we believe that the idea of enhancing core components of a (data management) system through learned models has implications for future systems designs.
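
As a minimal sketch of the idea (illustrative only; the talk's learned indexes use neural nets rather than this linear stand-in), a model can approximate the cumulative distribution of sorted keys, predict a record's position, and correct within the model's worst-case error:

```python
import numpy as np

# Learn the key->position mapping of a sorted array, then use the model
# to narrow each lookup to a small error window.
keys = np.sort(np.random.lognormal(mean=10.0, sigma=1.0, size=100_000))
positions = np.arange(len(keys))

# "Model": a least-squares fit standing in for the paper's neural nets.
slope, intercept = np.polyfit(keys, positions, deg=1)
predicted = np.clip(slope * keys + intercept, 0, len(keys) - 1)
max_err = int(np.ceil(np.abs(predicted - positions).max()))

def lookup(key):
    # Predict the position, then binary-search only the error window.
    guess = int(np.clip(slope * key + intercept, 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
    return lo + np.searchsorted(keys[lo:hi], key)

i = lookup(keys[42_000])
assert keys[i] == keys[42_000]
```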

Michael Mahoney, UC Berkeley
Alchemist: An Apache Spark <=> MPI Interface

The need for efficient and scalable numerical linear algebra and machine-learning implementations continues to grow with the increasing importance of big data analytics, in business analytics applications as well as in scientific applications, e.g., astrophysics, climate science, and biomedicine.  Since its introduction, Apache Spark has become an integral tool in this field, with attractive features such as ease of use, interoperability with the Hadoop ecosystem, and fault tolerance.  However, it has been shown that many numerical linear algebra routines implemented using MPI, a tool for parallel programming commonly used in high-performance computing, can outperform the equivalent Spark routines by an order of magnitude or more, and moreover that Spark anti-scales for non-trivial linear algebra algorithms in the terabyte-scale regime.  We describe Alchemist, a system for interfacing between Spark and existing MPI libraries that is designed to address this performance gap.  The libraries can be called from a Spark application with little effort, and we illustrate how the resulting system leads to efficient and scalable performance on large datasets.  We describe use cases from scientific data analysis that motivated the development of Alchemist and that benefit from this system.  These large-scale scientific machine learning applications tend to have quite different requirements than common internet and social media applications.
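
For context, here is a small sketch of the kind of Spark MLlib linear-algebra routine whose MPI-based counterparts Alchemist exposes to Spark applications (the Alchemist calling interface itself is not shown):

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("svd-demo").getOrCreate()

# A tiny distributed matrix; in practice rows would come from a large RDD.
rows = spark.sparkContext.parallelize(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
mat = RowMatrix(rows)

# Truncated SVD; at terabyte scale, an equivalent MPI routine can be an
# order of magnitude faster than the pure-Spark implementation.
svd = mat.computeSVD(2, computeU=True)
print(svd.s)  # top-2 singular values
```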

Ian Mathews, Redivis
Connecting researchers, HPC, and big data: A UX challenge

In close collaboration with the Stanford Center for Population Health Sciences, Redivis (redivis.com) has been developed over the past 18 months to serve as foundational infrastructure for researchers to explore, access, and query large, population-level datasets. Co-founder Ian Mathews will discuss the many lessons learned over the platform's development, and the key insight that it is user-focused design, not (solely) technological innovation, that will bring data exploration capabilities to the broader population of biomedical researchers. He will also discuss the current state of the platform and its future goals and trajectory.

Rajat Monga, Google
Challenges in Machine Learning Systems

The rapid evolution of machine learning over the current decade has brought about a number of exciting challenges in building systems. Large datasets and the need for more compute have shifted computation from CPUs to GPUs and even to ML-specific accelerators like TPUs. Over the same time, research has moved from simple feed-forward models to much deeper convolutional ones, to sequential models like LSTMs, and beyond deep learning to reinforcement learning and evolutionary approaches. This talk will cover some of the ideas in TensorFlow that address the need for performance and scale, along with flexible programming models for researchers to explore new ideas.
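
As a point of reference, here is a minimal TensorFlow sketch (illustrative only) of the simple feed-forward models that the deeper architectures above grew out of:

```python
import tensorflow as tf

# A small feed-forward classifier; deeper convolutional and sequential
# models extend this same programming model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```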

Ben Nachman, LBNL/CERN
Technological limitations for analysis of the biggest scientific dataset: Particle physics at the Large Hadron Collider

Many thousand-person, multi-billion-dollar experiments at the Large Hadron Collider (LHC) are generating enormous datasets to investigate the smallest distance scales of nature. We are currently preparing to increase the data rate by a factor of 10 over the next 10 years to fully exploit the power of the LHC. This poses significant challenges at all stages of data taking and analysis. At the lowest level, we have to (re)design custom ASICs that can cope not only with the high particle rates but also withstand significant radiation damage. For offline analysis, our current computing and simulation models will not scale to this high-luminosity LHC era and will thus require heavy use of HPCs and other solutions. Modern machine learning will continue to play a growing and critical role in all aspects of this work, including accelerating simulation with techniques like Generative Adversarial Networks (GANs). Integrating modern tools into aging workflows is a significant challenge. This talk will review the status and prospects of these topics for the future of collider-based particle physics at the LHC and beyond.

Frank Nothaft, Databricks
Analyzing Massive Genomics Datasets using Apache Spark

Powered by the continuous decrease in the cost of sequencing a single human genome, "big data" sequencing studies (>10,000 samples) are becoming common in both industrial and research settings. To work with datasets of this size and scale, we need to allow bioinformaticians to write genomic analysis queries that can be distributed across large compute clusters. Recently, several prominent libraries like GATK4, ADAM, and Hail have used Apache Spark to achieve this goal. Apache Spark is a "map-reduce"-like system that allows code written in Scala, Java, Python, R, or SQL to be run in parallel across a cluster with hundreds to thousands of cores. In this talk, we will briefly explain what Apache Spark is and how it works. Then, we will look at a few genomic analyses where Apache Spark drops latency from hours to minutes, enabling a human-in-the-loop analysis workflow. As part of these analyses, we will also explore how Apache Spark can be used to integrate other data sources (clinical measurements, imaging) with genomics data, and we will extract best practices for architecting scientific analyses on Apache Spark.
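
To ground this, here is a hedged PySpark sketch (hypothetical file path; the column names follow the ADAM variant schema) of a distributed genomic aggregation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("variant-counts").getOrCreate()

# Assume variants have been converted to Parquet (as ADAM does), with at
# least 'contigName' and 'alternateAllele' columns.
variants = spark.read.parquet("s3://bucket/cohort/variants.parquet")

# Count variants per chromosome across the cohort; Spark distributes the
# scan and aggregation over the cluster's cores.
per_chrom = (variants
             .where(F.col("alternateAllele").isNotNull())
             .groupBy("contigName")
             .count()
             .orderBy("contigName"))
per_chrom.show()
```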

Jerry Pan, Facebook
Machine Learning at Facebook

Facebook applies machine learning extensively today, and its application is growing at a tremendous pace. This talk gives an end-to-end overview of ML pipelines at Facebook and their massive scale, a deep dive into one such PB-scale pipeline, and insights into the unique requirements for processing and storage of large-scale ML data.

Machine Learning meets Databases at Netflix

Netflix collects billions of data points from impressions, viewing history, actions, etc., to generate a personalization story for each user. With a catalog spanning thousands of titles and a diverse member base of over a hundred million accounts, recommending the titles that are just right for each member is crucial. We have therefore developed a software architecture that can deliver an enhanced experience and support rapid innovation. Our architecture handles large volumes of existing data, is responsive to user interactions, and makes it easy to experiment with new recommendation approaches. At the same time, Netflix, as a cloud-native enterprise, runs all of its infrastructure on the public cloud.
To achieve these goals, we run both online and offline processing services: for example, real-time model scoring and stream processing, as well as offline services using data lakes in Amazon S3. We also leverage pub/sub frameworks like Apache Kafka, and we transform data and store it in one of the largest Apache Cassandra deployments or in our caching layers, which are built on a globally replicated in-house Memcached deployment called EVCache. Hence, there are multiple data sources depending on the ML use cases and their respective SLAs. Real-time, latency-sensitive ML use cases with high refresh rates tend to be served by our caching infrastructure. We prefer to store structured data, like viewing history, in Cassandra because we can horizontally scale writes per second, apply customer filters (user, device, subtitle, episode, season, actor, etc.) given the columnar format, and use tunable consistency to trade off performance against data consistency. We also use Apache Parquet for its efficient columnar data representation, which reduces I/O for large-scale reads.
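
As an illustration of the tunable-consistency trade-off (hypothetical host, keyspace, and table, not Netflix's schema), using the DataStax Python driver:

```python
from datetime import datetime
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["cassandra-host"])
session = cluster.connect("viewing_history")

# LOCAL_ONE favors write throughput and latency; LOCAL_QUORUM would trade
# some performance for stronger consistency.
insert = SimpleStatement(
    "INSERT INTO events (user_id, ts, title_id) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_ONE)
session.execute(insert, (42, datetime(2018, 4, 30, 12, 0), 9001))
```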
In this talk, we will cover how the Netflix machine learning infrastructure leverages the data we store and the technology choices we have made to offer personalization to more than 117M subscribers across the world. Our story does not end here: as we grow along many dimensions (subscribers, catalog, countries, etc.), we must consider efficiency at large scale.

Resources:
Artwork Personalization https://medium.com/netflix-techblog/artwork-personalization-c589f074ad76
Distributed Time Travel for Feature Generation https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907

Neoklis Polyzotis, Google, Data Management
Data Analysis and Validation for Production ML Pipelines

In the context of machine learning, data becomes an important production asset on par with algorithms and infrastructure used for learning. In this talk, I will first cover a few fundamental data-management issues that arise in the context of production ML pipelines, related to data understanding, validation, and debugging. Then, I will describe a data-validation system that we built at Google to address these issues. This system is an integral part of TensorFlow Extended (TFX), an end-to-end machine learning platform at Google, and is used by hundreds of teams for continuous monitoring and data validation.
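
A brief sketch of this workflow using the open-source TensorFlow Data Validation (TFDV) library from TFX (hypothetical file paths): compute statistics, infer a schema, and validate new data against it:

```python
import tensorflow_data_validation as tfdv

# Summarize the training data and infer an initial schema from it.
train_stats = tfdv.generate_statistics_from_csv("train.csv")
schema = tfdv.infer_schema(train_stats)

# Validate fresh data against the schema; anomalies flag missing features,
# type changes, and distribution drift.
serving_stats = tfdv.generate_statistics_from_csv("serving.csv")
anomalies = tfdv.validate_statistics(serving_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```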

Manuel Rivas, Stanford Medical
Large-scale inference across genomes and health records with the Global Biobank Engine

Biobank resources like the UK Biobank contain a wealth of phenotypic information for each subject compared to traditional ascertainment-based studies. However, specific diseases are generally represented near their population prevalence in biobanks, which can hamper studies of individual diseases. New statistical methods that can jointly analyze multiple phenotypes are therefore needed to take full advantage of biobank resources with linked health records and other phenotypic measurements. We have developed a suite of methods to study disease genetics in the context of population-scale biobanks that jointly leverage information from multiple phenotypes to estimate genetic parameters such as genetic correlations, model disease risk, and identify likely disease-associated variants and genes. These methods use GWAS summary statistics as input, which allows easy meta-analyses across biobanks and with previous studies while protecting the privacy of participants. This presentation will describe these novel approaches, their application to data from the UK Biobank, and how extra-large database management systems like SciDB have been instrumental in our progress in gene discovery.
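
As a minimal worked example of meta-analysis over summary statistics (illustrative numbers only), fixed-effect inverse-variance weighting combines per-study effect sizes without touching individual-level data:

```python
import numpy as np

# One variant's effect across three biobanks (illustrative values).
betas = np.array([0.12, 0.09, 0.15])  # per-study effect sizes
ses = np.array([0.04, 0.05, 0.06])    # per-study standard errors

# Weight each study by the inverse of its variance.
w = 1.0 / ses**2
beta_meta = np.sum(w * betas) / np.sum(w)
se_meta = np.sqrt(1.0 / np.sum(w))
print(beta_meta, se_meta, beta_meta / se_meta)  # pooled beta, SE, z-score
```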

Building a machine learning health system

Using Electronic Health Records (EHRs), it is possible to examine the outcomes of decisions made by doctors during clinical practice, generating evidence from the collective experience of patients. We will discuss our efforts, and the bottlenecks encountered, at Stanford Medicine in transforming unstructured EHR data to discover hidden trends, build predictive models, and drive comparative effectiveness studies in a learning health system.

Ion Stoica, UC Berkeley, RISELab, Databricks, Conviva Networks
Ray: A system for distributed AI

Over the past decade, the bulk synchronous processing (BSP) model has proven highly effective for processing large amounts of data. However, today we are witnessing the emergence of a new class of applications: AI workloads. These applications exhibit new requirements, such as nested parallelism and highly heterogeneous computations. To support such workloads, we have developed Ray, a distributed system that provides both task-parallel and actor abstractions. Ray is highly scalable, employing an in-memory storage system and a distributed scheduler. In this talk, I will discuss some of our design decisions and our early experience using Ray to implement a variety of applications.
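
A minimal Ray sketch of the two abstractions mentioned, tasks and actors:

```python
import ray

ray.init()

@ray.remote
def square(x):
    # A task: stateless, scheduled anywhere; results land in shared memory.
    return x * x

@ray.remote
class Counter:
    # An actor: stateful computation pinned to a single worker process.
    def __init__(self):
        self.n = 0
    def incr(self):
        self.n += 1
        return self.n

print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]
counter = Counter.remote()
print(ray.get(counter.incr.remote()))  # 1
```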

Ian Willson, The Boeing Company, Senior Technical Fellow, Data Engineering
IoT @ Boeing: From Flight Data Recorders to Factory RFID Analytics

This talk focuses on milestones since Boeing solved our first IoT scalability challenge by analyzing fleet-level flight sensor data on integrated Hadoop Teradata clusters. This IoT pipeline provided nearly interactive predictive analytics capabilities, a 300x net improvement. This pathfinder, together with our long expertise leveraging Teradata in engineering and manufacturing, influenced our subsequent IoT designs, from lossless temporal compression of data (10x-1,000x) to interactive temporal geo-spatial analytics. As data increasingly flows into our platform in real time, analytics needed to move toward run-time invocation. This lets Boeing leverage our temporal EDW in combination with RFID location data to optimize factory processes, from autonomous vehicle tracking to asset utilization. Current efforts are focused on dynamic real-time optimization across the heterogeneous platforms comprising our digital thread, from next-generation operational systems and edge compute to our evolving hybrid cloud analytics platform.
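
As a hedged sketch of lossless temporal compression (not Boeing's implementation), delta-encoding a slowly changing sensor series and run-length-encoding the deltas shows why repetitive telemetry compresses so heavily:

```python
import numpy as np

def compress(series):
    # Delta-encode, then run-length-encode: long flat stretches collapse
    # to a handful of (delta, count) pairs, losslessly.
    deltas = np.diff(series, prepend=0)  # cumsum of deltas recovers series
    runs = []
    for d in deltas:
        if runs and runs[-1][0] == d:
            runs[-1][1] += 1
        else:
            runs.append([d, 1])
    return runs

def decompress(runs):
    deltas = np.concatenate([np.full(n, d) for d, n in runs])
    return np.cumsum(deltas)

signal = np.repeat([20, 20, 21, 21, 21, 22], 1000)  # flat stretches
runs = compress(signal)
assert np.array_equal(decompress(runs), signal)
print(len(signal), "samples ->", len(runs), "runs")
```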

Duan Xu and Jason Crane, UCSF School of Medicine
New Frontiers in Medical Imaging in the Age of Big Data and Precision Medicine

(See the abstract under Jason Crane and Duan Xu above.)