Lightning Talks

Accepted Lightning Talk Abstracts

TUE 1: Starting “Small” to Go Big – Building a Living Database
Michael Sabbatino, AECOM; Lucy Romeo, AECOM
With the ever-increasing torrent of data collected through scientific research, experiments, simulations, and analysis, there is a growing need for tool development and database management resources to ensure that data can live, adapt, and evolve beyond a given project. As research and science push increasingly into big data analytics for more complex, multivariate, multi-dimensional systems, there is a need for data to support those studies: data that can be gathered, integrated, and made interoperable on the fly, adapting to end-user needs. Starting with a single “small” Living Database project, researchers at NETL are leveraging U.S. fossil energy research efforts to develop an evolving, terabyte-scale virtual fossil energy data framework. The pilot portion of this project, the Global Oil and Gas Inventory (GOGI) geodatabase, focused on utilizing available tools and resources, including AI (Artificial Intelligence) keyword search, Big Data clusters, and subject matter experts, to rapidly (over 4 months) parse the World Wide Web and to access, acquire, and integrate relevant, authoritative, open data into the GOGI framework. Since the creation of the GOGI geodatabase, methods have expanded to collect new data, including real-time, spatial, experimental, model, and laboratory data, and to offer options to acquire and integrate other related datasets with these resources. These new tools and resources seek to evolve how datasets and databases are developed to improve efficiency, flexibility, and accessibility, so end users can focus on utilizing the resources within. The Living Database concept is evolving, working to incorporate advanced computing algorithms and capabilities, including AI and machine learning, to address these needs. This poster and presentation will highlight progress to date and future plans in developing a flexible, adaptable, living database capability.
TUE 2: Practical Automated Regression Testing of ETL Applications
Miro Dzakovic, Intel Corporation
MIDAS, the largest integrated database at Intel, manages Intel's internal manufacturing data from a variety of sources.  As Intel's manufacturing technologies keep changing, the MIDAS ETL applications must go through frequent upgrades.  When releasing these upgrades, the stakes are high: a single defect can spoil the data in the database, and the worst-case recovery from a backup would take days.  These challenges, which exist in many organizations that run on data, clearly call for fast and reliable regression testing. However, automated testing of database applications in industry is spotty.  No tool or technique is widely used.  This talk will highlight the approach behind 1TH, the regression testing tool that enabled weekly MIDAS upgrades while minimizing risks.  Apart from being a generic solution that is independent of database and ETL platforms, 1TH is particularly important for big data problems, such as entity resolution, and holds promise for applications that persist non-relational data.
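
1TH's internals are not described in the abstract; as a hedged illustration of the general snapshot-diff idea behind automated ETL regression testing (not the 1TH tool itself), the Python sketch below runs two ETL releases against the same fixture and compares fingerprints of their output tables. All table names and the toy ETL functions are hypothetical.

    # Minimal sketch of snapshot-based ETL regression testing (not the 1TH tool itself).
    # Table names, columns, and the ETL functions are hypothetical stand-ins.
    import hashlib
    import sqlite3

    def table_fingerprint(conn, table):
        """Order-independent hash of a table's full contents."""
        rows = conn.execute(f"SELECT * FROM {table}").fetchall()
        digest = hashlib.sha256()
        for row in sorted(map(repr, rows)):
            digest.update(row.encode())
        return digest.hexdigest()

    def load_fixture(conn):
        """Fixed input data shared by the baseline and the candidate release."""
        conn.execute("CREATE TABLE raw_lots (lot_id TEXT, yield REAL)")
        conn.executemany("INSERT INTO raw_lots VALUES (?, ?)", [("A1", 0.973), ("A2", 0.912)])

    def run_release(etl_fn):
        """Run one ETL release on the fixture and fingerprint its output table."""
        conn = sqlite3.connect(":memory:")
        load_fixture(conn)
        etl_fn(conn)
        return table_fingerprint(conn, "clean_lots")

    def etl_v1(conn):   # known-good release
        conn.execute("CREATE TABLE clean_lots AS SELECT lot_id, ROUND(yield, 2) AS yield FROM raw_lots")

    def etl_v2(conn):   # candidate upgrade under test
        conn.execute("CREATE TABLE clean_lots AS SELECT lot_id, ROUND(yield, 2) AS yield FROM raw_lots")

    assert run_release(etl_v1) == run_release(etl_v2), "regression detected in candidate ETL"

Any difference between the two fingerprints flags a potential defect before the upgrade touches production data.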
TUE 3: Real-time Fraud Detection with Innovative Big Graph Features
Yu Xu, CEO, TigerGraph; Mingxi Wu, VP Engineering, TigerGraph
Fraud detection is a hot topic that benefits real life in many areas, such as combating phone scams, wire/payment fraud, and business email compromise. There are supervised and unsupervised learning approaches. In supervised approaches, because labeled fraud training data is scarce, this is also known as an imbalanced classification problem.

We show that by collecting graph features from big discrete data in real time and feeding them to a classifier, we can leverage topology information to solve the universal fraud detection problem more effectively. We demonstrate this technique by presenting a real customer's anti-fraud case: combating phone scams on the China Mobile call graph with innovative graph features.

To elaborate, most existing telecom anti-scam systems leverage crowdsourcing: mobile users proactively report spam numbers to a central service via mobile apps, and the central service maintains a blacklist and uses it to alert potential victims in the future. In real life, however, such rule-based anti-scam systems are hardly effective due to the scarcity and short lifespan of the labeled data. Scam phone numbers are highly dynamic, short-lived, moving targets. As a result, anti-phone-scam detection is a difficult imbalanced classification problem.

To effectively combat phone scams, we researched and discovered a highly effective process: by collecting innovative graph features on a mutable phone call graph in real time, we can classify fraudulent callers and alert victims in real time. This process has been in production protecting hundreds of millions of mobile users from phone scams since December 2016. To the best of our knowledge, such a system has not been reported before in academia or industry, partly because there has been no large-scale, commercial-grade mutable graph database to support this kind of attempt.

Specifically, we will cover:
- The real-time anti-scam system architecture, which accepts real-time streaming updates to a 10B+-scale call graph and computes and collects graph features on the graph in real time.
- TigerGraph's innovative SQL-like query language, which is the enabler of the whole process.
- A set of innovative graph features that are universal and effective in combating phone scams.
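
The production features are not enumerated above; as a rough, hypothetical illustration of the collect-graph-features-then-classify pattern (plain Python and scikit-learn, not TigerGraph or its query language), the sketch below derives simple per-caller features from call records and fits a toy classifier.

    # Illustrative sketch of graph-feature-based scam classification on a call graph.
    # The features, the toy call records, and the labels are hypothetical, not the production system.
    from collections import defaultdict
    from sklearn.linear_model import LogisticRegression

    # (caller, callee, duration_seconds) call-detail records
    calls = [
        ("X", "a", 5), ("X", "b", 4), ("X", "c", 6), ("X", "d", 3),   # X: many short, one-way calls
        ("a", "b", 320), ("b", "a", 210), ("c", "d", 180), ("d", "c", 95),
    ]

    def caller_features(edges):
        out_deg = defaultdict(int)
        durations = defaultdict(list)
        callees = defaultdict(set)
        callers = defaultdict(set)
        for u, v, dur in edges:
            out_deg[u] += 1
            durations[u].append(dur)
            callees[u].add(v)
            callers[v].add(u)
        feats = {}
        for u in out_deg:
            callbacks = len(callees[u] & callers[u])          # how many callees ever call back
            feats[u] = [
                out_deg[u],                                   # fan-out
                sum(durations[u]) / len(durations[u]),        # mean call duration
                callbacks / len(callees[u]),                  # reciprocity of the caller's edges
            ]
        return feats

    feats = caller_features(calls)
    X = [feats[u] for u in ("X", "a", "b", "c")]
    y = [1, 0, 0, 0]                                          # toy labels: 1 = reported scam number
    clf = LogisticRegression().fit(X, y)
    print(dict(zip(("X", "a", "b", "c"), clf.predict_proba(X)[:, 1].round(2))))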

TUE 4: KGBuilder: A System for Large-Scale Scientific Domain Knowledge Graph Building
Yi Zhang, Renmin University of China; Xiaofeng Meng, Renmin University of China
In scientific domains, cutting-edge creations and discoveries are often openly accessible as text on the Web, in papers, and in other carriers. Decentralized and unordered, they are hard to follow. Through information extraction and reorganization, a knowledge graph can help professionals follow the latest scientific facts and discover unknown ones more effectively. We therefore take microbiology as an example to show how KGBuilder builds a large-scale scientific domain knowledge graph.
Taking text as input, KGBuilder creates a domain knowledge graph in three steps, identifying entities, relations, and new links via named entity recognition, relation extraction, and link prediction, respectively. For named entity recognition, classic open-domain methods and tools exist but usually perform poorly in a specific domain. Relation extraction is generally done with machine learning methods that require large amounts of annotated data and manual feature engineering, which is expensive and time-consuming. As for link prediction, the imbalance between head and tail entities makes its data distribution unique.
Based on these observations, KGBuilder alleviates or solves the corresponding problems. By combining BiLSTM, CRF, and probabilistic methods, domain knowledge is added to named entity recognition in a simple and extensible way. Its relation extraction automatically generates large amounts of annotated data and extracts features via distant supervision and neural networks, reducing annotation cost. Inspired by question answering and the rectilinear propagation of light, KGBuilder introduces TransMT to deal with the imbalance problem in link prediction, performing better than traditional models, especially translation-based ones.
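
TransMT's formulation is not given in the abstract; purely as a reference point, the sketch below scores candidate links with the classic translation-based objective (TransE, score = -||h + r - t||) that models like TransMT are compared against. The entities, relation, and embeddings are toy values.

    # Reference sketch of translation-based link-prediction scoring (classic TransE,
    # not the TransMT model proposed by KGBuilder). Entities and relations are toy examples.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 8
    entities = {e: rng.normal(size=dim) for e in ["E. coli", "lactose", "glucose"]}
    relations = {r: rng.normal(size=dim) for r in ["metabolizes"]}

    def transe_score(h, r, t):
        """Higher is better: a true triple should satisfy h + r ≈ t."""
        return -np.linalg.norm(entities[h] + relations[r] - entities[t])

    # Tail prediction: rank candidate tails for ("E. coli", "metabolizes", ?)
    candidates = ["lactose", "glucose"]
    ranked = sorted(candidates, key=lambda t: transe_score("E. coli", "metabolizes", t), reverse=True)
    print(ranked)
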
Even as a preliminary attempt, KGBuilder sketches a blueprint for following and discovering scientific facts through knowledge graphs, underpinned by machine learning and artificial intelligence.
TUE 5: Aadhaar - the largest biometric database in the world - XLDB in every way data is persisted
Anurag Choudhary, Director of Engineering, heavily involved in the system's development and delivery.
MapR implemented and runs the largest biometric database in the world, used for a wide variety of use cases, from citizens' entry into and exit from India to health care fraud detection. This system holds files and data on the majority of citizens in India for the purpose of supporting government functions. The system is live and fully operational; just as impressive, it has not had an outage in over 4 years.
TUE 6: Reproducible Computational and Data-Intensive Experimentation Pipelines with Popper
Ivo Jimenez, UC Santa Cruz; Carlos Maltzahn, UC Santa Cruz
Popper is a convention and CLI tool for implementing scientific explorations following best DevOps practices. Popper pipelines encompass the end-to-end experimentation workflow that is typically followed by researchers (building, deploying, executing software, as well as analyzing datasets) and are relatively easy to execute: clone a repository, define environment variables, and execute a bash script. This talk will introduce Popper and walk through examples relevant to the XLDB community.
TUE 7: Databases for an Engaged World: Requirements and Design Approach
Keshav Murthy, Senior Director, Couchbase R&D
Traditional databases have been designed for systems of record and analytics. Modern enterprises have orders of magnitude more interactions than transactions. Couchbase Server is a rethinking of the database for interactions and engagements, called Systems of Engagement. Memory today is much cheaper than disks were when traditional databases were designed back in the 1970s, and networks are much faster and much more reliable than ever before. Application agility is also an extremely important requirement. Today's Couchbase Server is a memory- and network-centric, shared-nothing, auto-partitioned, distributed NoSQL database system that offers both key-based and secondary-index-based data access paths as well as API- and query-based data access capabilities. This lightning talk gives an overview of the requirements posed by next-generation database applications and an approach to implementation, including “Multi-Dimensional Scaling.”
WED 1: Asynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU?
Florin Rusu, UC Merced; Yujing Ma, UC Merced
There is increased interest in building data analytics frameworks with advanced algebraic capabilities, both in industry and academia. Many of these frameworks, e.g., TensorFlow and BIDMach, implement their compute-intensive primitives in two flavors: as multi-threaded routines for multi-core CPUs and as highly parallel kernels executed on GPUs. Stochastic gradient descent (SGD) is the most popular optimization method for model training and is implemented extensively on modern data analytics platforms. While the data-intensive properties of SGD are well known, there is an intense debate on which of the many SGD variants is better in practice. We perform a comprehensive study of parallel SGD for training generalized linear models. We consider the impact of three factors (computing architecture, i.e., multi-core CPU or GPU; synchronous or asynchronous model updates; and data sparsity) on three measures: hardware efficiency, statistical efficiency, and time to convergence. In the process, we design an optimized asynchronous SGD algorithm for GPU that leverages warp shuffling and cache coalescing for data and model access. Our CPU and GPU implementations always outperform TensorFlow and BIDMach in time to convergence, sometimes by several orders of magnitude.
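
The GPU kernels themselves are not shown here; as a minimal CPU-side sketch of one of the variants the study compares, asynchronous (lock-free, Hogwild-style) SGD for a generalized linear model, the Python code below lets several threads update a shared logistic-regression model without synchronization on toy data.

    # Minimal Hogwild-style asynchronous SGD sketch for logistic regression on toy data.
    # This illustrates the synchronization-free variant discussed in the talk, not the GPU kernel.
    import threading
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 2000, 10
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=d)
    y = (X @ true_w + 0.1 * rng.normal(size=n) > 0).astype(float)

    w = np.zeros(d)                      # shared model, updated without locks

    def worker(rows, lr=0.05, epochs=5):
        for _ in range(epochs):
            for i in rows:
                p = 1.0 / (1.0 + np.exp(-X[i] @ w))
                w[:] = w - lr * (p - y[i]) * X[i]   # racy in-place update (asynchronous)

    threads = [threading.Thread(target=worker, args=(part,))
               for part in np.array_split(np.arange(n), 4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    acc = np.mean(((X @ w) > 0) == y.astype(bool))
    print(f"training accuracy after asynchronous updates: {acc:.3f}")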
WED 2: Postgres-XL for European Space Agency Gaia Variability Studies
Krzysztof Nienartowicz, University of Geneva
Gaia (http://sci.esa.int/gaia/28820-summary/) is a cornerstone ESA mission to create a 3D catalog of 1.5 billion stars in our galaxy with unprecedented accuracy. The Variability group hosted in the Gaia Data Processing Centre in Geneva developed a pipeline whose center of gravity is a parallel DBMS: Postgres-XL. I will describe the rationale, the architecture, and our involvement in the development of Postgres-XL, which enabled large-scale processing and analytics for variability studies that are already publicly available in the Gaia archive and will be expanded in the following years.
WED 3: XProd: Efficient and Scalable Cartesian Product distribution for Map-Reduce primitives
Gopal Vijayaraghavan, Apache Tez; Zhiyuan Yang, Apache Tez; Hitesh Shah, Apache Tez
Cartesian product distributions are an unintended consequence of distributing non-equi-join algorithms, which occur in geospatial and similarity problems. While there are existing methods to distribute such joins across a network using Map-Reduce primitives, they suffer from avoidable data duplication and network connection overheads. These overheads have given such join queries an undeserved reputation for being slow and hard to scale to billions of rows.

In this talk, we introduce a fast and scalable Cartesian distribution algorithm, built into Apache Tez, which allows for the execution of scalable non-equi joins within Apache Hive. The distribution algorithm uses the existing Map-Reduce primitives and only modifies the intermediate data interchange to achieve a full cross-product or any subset.

The Tez XProd Edge does not specify a join algorithm, allowing the use of Hive's distributed sorted-merge or distributed partitioned hash join algorithms, depending on the scale of operations. The physical distribution edge was designed to co-exist with a Bloom semi-join edge, to reduce the data sets moving over the network based on one of the inputs. The XProd edge reduces data duplication by allowing one compressed partition of data to be routed to multiple destinations without modification, injecting itself as a routing table for the shuffle algorithm. The routing algorithms come in multiple flavors, which allow the implementation to operate with fewer than O(N*M) network connections when the input data is without skew, to group partitions together when skew is observed at runtime, or to distribute partitions in a skew-safe fashion when input skew is always expected.
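
The Tez implementation is not reproduced here; the sketch below is a hedged, plain-Python illustration of the grouping idea only: each destination task is assigned one group of left partitions and one group of right partitions, so every partition pair is covered exactly once while each source partition is routed to far fewer than N*M destinations. The partition and group counts are hypothetical.

    # Conceptual sketch of grouped Cartesian-product routing (not the Apache Tez edge code).
    # Each destination task receives one group of left partitions and one group of right
    # partitions, so each source partition is sent to far fewer than N*M destinations.
    from itertools import product

    N, M = 8, 6                 # left / right partition counts (hypothetical)
    GL, GR = 4, 3               # number of partition groups per side

    left_groups  = [[p for p in range(N) if p % GL == g] for g in range(GL)]
    right_groups = [[p for p in range(M) if p % GR == g] for g in range(GR)]

    # Routing table: destination task (gl, gr) pulls left_groups[gl] x right_groups[gr].
    routing = {(gl, gr): (left_groups[gl], right_groups[gr])
               for gl, gr in product(range(GL), range(GR))}

    # Every (left, right) partition pair is covered exactly once ...
    covered = {(l, r) for lg, rg in routing.values() for l in lg for r in rg}
    assert covered == set(product(range(N), range(M)))

    # ... while each left partition is fetched by GR destinations instead of M,
    # and each right partition by GL destinations instead of N.
    print(f"destinations per left partition: {GR} (vs {M} with naive replication)")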

We have implemented all these algorithms as part of Apache Tez and Apache Hive, with specific allowances for failure tolerance and to minimize total runtime of non-equi join queries, without any modification to the map-reduce assumptions within the tasks executing these join algorithms.

WED 4: ArrayUDF Explores Structural Locality for Faster Scientific Analyses
John Wu, Lawrence Berkeley National Laboratory
Many analysis operations, especially those from scientific applications, depend on data values in ways that are hard to express in the popular data processing systems. Common operations, such as applying the Laplacian operator or computing vorticity from velocity, are typically represented as multi-way joins that are very time-consuming to evaluate. In this work, we propose to capture the data dependency using a form of structural locality. We implement a User-Defined Function (UDF) mechanism in a scientific data services framework to support operations with structural locality in data. This framework supports data stored as multi-dimensional arrays, the most common form of data structure used in large-scale simulations. We implement this UDF mechanism, named ArrayUDF, as an in situ procedure to support these operations with structural locality. ArrayUDF allows users to define computations on adjacent array cells without the use of join operations, and executes the UDF directly on arrays stored in data files without loading their content into a data management system. Additionally, we present a thorough theoretical analysis of the data access cost of exploiting the structural locality, which enables ArrayUDF to automatically select the best array partitioning strategy for a given UDF operation. In a series of performance tests on large scientific data sets, we have observed that ArrayUDF outperforms Apache Spark by as much as 2070X on the same high-performance computing system.
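
ArrayUDF's actual interface is not reproduced here; as a hedged NumPy illustration of the underlying idea, the sketch below expresses a Laplacian as a user-defined function over a cell and its structural neighbours and applies it across a 2-D array without any join.

    # Conceptual sketch of a structural-locality UDF (a 2-D Laplacian stencil), expressed
    # directly over neighbouring array cells instead of as a multi-way join. Not the ArrayUDF API.
    import numpy as np

    def laplacian_udf(cell, up, down, left, right):
        """User-defined computation over one cell and its 4 structural neighbours."""
        return up + down + left + right - 4.0 * cell

    def apply_stencil_udf(a, udf):
        out = np.zeros_like(a)
        for i in range(1, a.shape[0] - 1):          # interior cells only, for brevity
            for j in range(1, a.shape[1] - 1):
                out[i, j] = udf(a[i, j], a[i - 1, j], a[i + 1, j], a[i, j - 1], a[i, j + 1])
        return out

    field = np.random.default_rng(0).normal(size=(64, 64))
    print(apply_stencil_udf(field, laplacian_udf)[1:4, 1:4])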
WED 5: Alchemist: An Apache Spark <=> MPI Interface
Kai Rothauge, ICSI and Dept. of Statistics, UC Berkeley
Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations done in Spark are significantly slower than when using libraries written using a high-performance computing framework such as the Message-Passing Interface (MPI).

We introduce Alchemist, a system designed to call MPI-based libraries from Apache Spark to help accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment. We discuss the motivation behind the development of Alchemist, provide an overview of its design and implementation, and compare the performances of pure Spark implementations with those of Spark implementations that leverage MPI-based codes via Alchemist.

WED 6: Love at First Sight: MonetDB/TensorFlow
Ying Zhang, MonetDB Solutions; Richard Koopmanschap, MonetDB Solutions; Martin Kersten, MonetDB Solutions / Centrum Wiskunde & Informatica
Machine learning is quickly taking its well-deserved place in enriching Big Data applications by classifying facts, identifying trends, and proposing actions based on massive profile comparisons. Having started out as a rather static workflow, first-generation machine learning has covered the following steps: 1) selecting salient features of the objects considered, 2) cleaning the collections to avoid statistical over- or under-fitting, and 3) a CPU-intensive search to fit a model. The next challenge is to close this workflow loop and go after real-time learning and model adaptations.

This brings into focus a strong need to manage large collections of object features, avoiding their repetitive extraction from the raw objects, controlling the iterative learning process, and administering algorithmic transparency for post-decision analysis. These are the realms in which database management systems excel. Their primary role in this world has so far mostly been limited to administration of the workflow state, e.g., administering the training sets used and conducting distributed processing.

However, should the two systems be considered separate entities, where application glue is needed to exchange information? How would such isolated systems be able to arrive at the real-time learning experience required? Inevitably, machine-learning applications will face the volume, velocity, variety, and veracity challenges of big data applications.
Fortunately, the database community has spent tremendous effort on improving the analytical capabilities of database systems and the software ecosystem at large. So when the well-established analytical database system MonetDB [1] met the popular new member of the machine-learning family, TensorFlow, it was love at first sight. The click is immediate: they both speak the same language (i.e., Python); they both have a strong focus on optimised processing on heterogeneous hardware (e.g., from multi-core in-memory processing to GPU/FPGA acceleration); they both serve the same communities (i.e., AI-enabled analytical applications); and, most importantly, this couple started out with a means of seamless communication (i.e., through the same binary data exchange format) [2].
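
The talk itself will show the real integration; as a hedged sketch of what this shared, NumPy-based data exchange can look like from the Python side, the code below mimics the body of a vectorized UDF in the style of [2]: columns arrive as NumPy arrays and are scored with a tiny TensorFlow computation. The surrounding SQL function declaration is omitted, and the toy model weights are assumptions for illustration, not the IDEL pipeline of [3].

    # Sketch of a vectorized Python UDF body in the style of [2]: the database passes table
    # columns in as NumPy arrays and expects NumPy arrays back. The SQL wrapper and the
    # toy weights below are assumptions for illustration only.
    import numpy as np
    import tensorflow as tf

    TOY_WEIGHTS = tf.constant([[0.8], [-0.3], [0.5]], dtype=tf.float32)   # stand-in model

    def score_udf(f1, f2, f3):
        """Columns arrive as NumPy arrays; TensorFlow consumes the column vectors directly."""
        features = tf.stack([tf.convert_to_tensor(c, dtype=tf.float32) for c in (f1, f2, f3)], axis=1)
        scores = tf.sigmoid(features @ TOY_WEIGHTS)          # tiny TensorFlow model
        return scores.numpy().ravel()                        # back to NumPy for the database

    # Standalone usage with NumPy columns, as the UDF would receive them:
    print(score_udf(np.array([0.1, 2.0]), np.array([1.5, -0.4]), np.array([0.0, 3.1])))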

In this short talk, I will first tell the story of how the recent in-database machine learning relationship between MonetDB (an open-source analytical columnar DBMS) and TensorFlow (an open-source machine learning library) came about. Then, using an example application of entity linking with neural embeddings [3], I will argue that this relationship has all the potential to reach a happily-ever-after.

[1] P. A. Boncz, M. L. Kersten, and S. Manegold, Breaking the memory wall in MonetDB, Commun. ACM, 51(12): 77–85, 2008.
[2] M. Raasveldt and H. Mühleisen, Vectorized UDFs in Column-Stores, in Proceedings of the 28th International Conference on Scientific and Statistical Database Management, SSDBM ’16, pages 16:1–16:12, New York, NY, USA, 2016. ACM.
[3] T. Kilias, A. Löser, F. A. Gers, R. Koopmanschap, Y. Zhang, and M. L. Kersten, IDEL: In-Database Entity Linking with Neural Embeddings, submitted to VLDB 2018, https://github.com/tkilias/vldb2018-idel.

WED 7: Turing Meets Watson-Crick: A Massive Data Storage Platform for EXtreme Longevity, Density, and ReplicaBility
Swapnil Bhatia, Catalog Technologies Inc.
Catalog is building the world's first massive data storage platform using DNA as the storage medium. As an information storage medium, DNA is unique in its stability over millennia (fossils), information density (gigabits in cellular volume), and replicability (billions of copies, cheaply and efficiently). This makes it an attractive medium for high-density storage of latency-tolerant data. A central challenge in scaling DNA-based storage beyond narrowband rates is the cost-throughput trade-off in classical letter-by-letter synthesis of DNA.

In this talk, I will present a glimpse of Catalog's technology for broadband DNA data storage and some of the unique challenges and opportunities presented by this new medium. These innovations result in dramatic cost reductions, which will enable DNA to become an economically attractive solution for high-density, latency-tolerant applications.

WED 8: Adobe Identity Graphs
Leila Jalali, Adobe; Akanksha Nagpal, Adobe
Digital marketing solutions – whether Analytics, Advertising, Personalization, Email marketing, or Data management platforms – are evaluated based on their effectiveness in understanding and targeting end consumers. In a world where the consumer interacts with multiple devices, understanding the consumer on each device in isolation only provides fragmented insights and can lead to poor data-driven decisions. An identity graph, by linking the consumer's identities across different devices, browsers, applications, and customer data management systems, allows the marketer to obtain a holistic view of the consumer and enables better-informed decisions. The Identity Graph provides a comprehensive solution to the challenge posed by the fragmentation of identities: it resolves all the known and anonymous identities of a person. Given a known or anonymous identity, it provides all the other known and anonymous identities that belong to the same person. This allows for data management at the people level instead of the identity level, providing more accurate insights into consumers. The Identity Graph also enables building a holistic profile of a consumer across different identities, providing richer insights and more accurate targeting. In our proposed solution, we discuss experiments with graphs of more than 330 million identities. This talk provides an overview of the technical approach taken to build the Identity Graph and how it can be used effectively in different use cases.
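
The resolution algorithm behind the Identity Graph is not disclosed above; as a hedged sketch of the core idea, resolving linked identities into one person via connected components, the Python code below runs a small union-find over hypothetical device, cookie, and CRM identifiers and answers the "given one identity, return all the others" query.

    # Illustrative identity-resolution sketch: connected components via union-find.
    # The identifiers and link events below are hypothetical, not Adobe's production data model.
    from collections import defaultdict

    class UnionFind:
        def __init__(self):
            self.parent = {}
        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]   # path halving
                x = self.parent[x]
            return x
        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    # Observed identity links (e.g., a login stitching a cookie to a CRM id).
    links = [
        ("cookie:abc", "crm:1001"),
        ("device:ios-7f3", "crm:1001"),
        ("cookie:xyz", "device:android-22"),      # a different, still-anonymous person
    ]

    uf = UnionFind()
    for a, b in links:
        uf.union(a, b)

    clusters = defaultdict(set)
    for identity in list(uf.parent):
        clusters[uf.find(identity)].add(identity)

    def resolve(identity):
        """Given one known or anonymous identity, return every identity of the same person."""
        return clusters[uf.find(identity)]

    print(resolve("cookie:abc"))   # -> {'cookie:abc', 'crm:1001', 'device:ios-7f3'}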
WED 9: Reproducible Data Science at Scale: Exploring Allen Institute’s 200TB Open Microscopy Image Dataset in Python
Kevin Moore, CEO, Quilt Data, Inc.
The Allen Institute for Cell Science generates terabytes of microscopy images every week. To improve access to these datasets for our data scientists and external collaborators, we sought a platform that would enable: (1) plain-text search; (2) subsetting of large datasets; (3) version control to support reproducible experiments; and (4) easy accessibility from data science tools like Jupyter, Python, and Pandas. We discovered that software optimized for storing and versioning source code (e.g. GitHub) exhibits slow performance for large files and places hard limits on file size that preclude large data repositories altogether. In response, we are creating an open repository of image data that is enriched with metadata and encapsulated in "data packages." Data packages are versioned, immutable sets of data dependencies. The concept of package management is well-known in software development. To date, however, package management has largely been applied to source code. We propose to extend package management to the unique file size and format challenges of data by building on top of Quilt, an open-source data package manager. In combination with custom filtering software, Quilt enables efficient search and query of metadata, so that data scientists can filter terabyte-sized packages into megabyte-size subsets that fit on a single machine. Our package management infrastructure optimizes not only storage and network transfer, but serialization and virtualization. As a result, data scientists can interact with data packages in formats that are native to Jupyter and Python.
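
Quilt's own API is not reproduced here; the sketch below is a deliberately generic, hypothetical illustration of the data-package idea described above: an immutable, content-hashed manifest of files whose metadata can be filtered into a small subset before any large image data is fetched.

    # Generic sketch of the "data package" concept (versioned, immutable manifest plus metadata
    # filtering); illustrative only, not the Quilt package-manager API.
    import hashlib
    import json

    def make_manifest(entries):
        """A package is an immutable manifest: logical key -> (physical location, size, metadata)."""
        manifest = {key: {"uri": uri, "size": size, "meta": meta}
                    for key, (uri, size, meta) in entries.items()}
        # The package "version" is a content hash of the manifest itself.
        digest = hashlib.sha256(json.dumps(manifest, sort_keys=True).encode()).hexdigest()
        return manifest, digest

    entries = {  # hypothetical microscopy files; sizes in bytes
        "plate01/cell_0001.tiff": ("s3://bucket/ab12", 3_200_000_000, {"cell_line": "AICS-12", "channel": "membrane"}),
        "plate01/cell_0002.tiff": ("s3://bucket/cd34", 2_900_000_000, {"cell_line": "AICS-12", "channel": "dna"}),
        "plate02/cell_0101.tiff": ("s3://bucket/ef56", 3_100_000_000, {"cell_line": "AICS-57", "channel": "dna"}),
    }
    manifest, version = make_manifest(entries)

    # Metadata-level filtering shrinks a terabyte-scale package to a laptop-sized subset
    # before any image data is downloaded.
    subset = {k: v for k, v in manifest.items() if v["meta"]["channel"] == "dna"}
    print(version[:12], list(subset))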