Scope: The AT&T NYC Research Seminar Series is organized by members of the AT&T Labs Big Data Research organization in our NYC offices. Our research spans a wide range of computer and data science topics, including databases, machine learning, networking, security, statistics, and data visualization. Much of our work is motivated by the massive data sets generated by our network each day. Understanding the dynamics of this data helps AT&T better serve its customers, improve its network, and develop new products and services.
Attending: Seminars are open to all AT&T employees as well as external visitors. However, as seating is limited, we require guests to RSVP (links included below for each talk) at least one day before the event. A photo ID is required and must be presented to security upon entering the building. All events will take place at 33 Thomas Street, New York, NY 10007 (see below for map, directions, and some history on the building without windows).
Trust is an essential element of the human condition, the basic building block of human interaction, communication, and exchange. We are in an era that surfaces concerns about online misinformation, coupled with disturbing declines in trust in institutions like media and government. At the same time, there is evidence that people are increasingly trusting one another online, for example sharing their homes with strangers using Airbnb or jumping into the back of a hired car with an unknown driver. I will explore this tension by giving an overview of our recent work on trust in the sharing economy, and in particular on how language in self-descriptions leads to trust in Airbnb. I will use Signaling Theory to explain and expand on the results, and will discuss the application of these ideas (if any is possible) to the problem of online misinformation.
Bio: Mor Naaman is an associate professor of Information Science at the Jacobs Institute at Cornell Tech, where he is the founder of the Connective Media hub, leads a research group focused on social technologies, and directs the Oath-supported Connected Experiences laboratory. His research group designs, builds, and studies systems that support social interactions in online and physical spaces. Mor applies multidisciplinary methods to 1) gain a better understanding of people and their use of social tech; 2) extract insights about people, technology and society from social media and other sources of social data; and 3) develop new social technologies as well as novel tools to make social data more accessible and usable in various settings. Previously, Mor was on the faculty at the Rutgers School of Communication and Information, led a research team at Yahoo! Research Berkeley, received a Ph.D. in Computer Science from the Stanford University InfoLab, and played professional basketball for Hapoel Tel Aviv. He is a recipient of an NSF Early Faculty CAREER Award, research awards and grants from numerous corporations including AOL and Google, and multiple best paper awards.
In this talk I give a broad overview of Approximate Message Passing algorithms, a class of low-complexity, scalable algorithms that have recently been proposed for use in solving high-dimensional linear regression tasks when the size of the problem causes more traditional approaches to fail. These types of algorithms have been used in many applications including signal processing, machine learning, statistical data analysis, and information theory.
Bio: Cynthia Rush is an Assistant Professor in the Department of Statistics at Columbia University. Originally from North Carolina, she completed her undergraduate coursework in Mathematics at the University of North Carolina at Chapel Hill and in 2016 she received a Ph.D. from the Department of Statistics at Yale University under the supervision of Andrew Barron. Her research interests lie broadly in statistics and applied probability with a current focus on statistical machine learning algorithms, such as message passing. These algorithms can be used for inference and optimization in many applications and some that she studies include communications systems, compressed sensing, and image reconstruction.
It is well known that exposure to environmental pollutants aggravates acute health conditions like asthma and may increase the risk of cancer. Recently though, outdoor environmental pollution has been associated with lifelong damage to the health of unborn babies, motivating the Guardian to call this a potential "global health catastrophe." While we are now able to monitor and model pollution more accurately than ever before, we are still unable to effectively evaluate community-level exposure. In this talk, I will show how we are using publicly available pollution and temperature data, cell tower data, and the R programming language to fill this gap and better understand the exposure of demographic and spatial groups of people. The result of this work is accurate and precise community exposure data, and it motivates new methods and studies in the areas of epidemiology and environmental medicine.
Bio: Michael Kane is an Assistant Professor at Yale University and collaborator with researchers at AT&T Labs Research. His research interests are in the areas of scalable machine/statistical learning, applied probability, and computing.
The rapid democratization of data has placed its access and analysis in the hands of the entire population. While the tools for rapid and large-scale data processing have continued to reduce the time to compute analysis results, the techniques to help users better and more easily visualize their data, clean and prepare their data, and understand what their results mean are still lacking. In this talk, I will provide an overview of our lab's recent work on addressing each stage of data analysis—data cleaning, data visualization, and explanation.
Bio: Eugene Wu is broadly interested in technologies that help users play with their data. His goal is for users at all technical levels to effectively and quickly make sense of their information. His focus is on solutions that ultimately improve the interface between users and data, and he borrows techniques from fields such as data management, systems, crowdsourcing, visualization, and HCI. Eugene Wu received his Ph.D. from CSAIL at MIT, advised by the esteemed Sam Madden and Michael Stonebraker, in the database group. He spent the first half of 2015 at UC Berkeley before starting at Columbia University in Fall 2015. http://eugenewu.net/
Machine learning and advanced statistical models are increasingly being deployed by organizations to support decision making, for almost any application where data can be obtained for training. Now these organizations must turn their attention to how to deal with decision making based on machine-learned models. At the very least, decision making is different than before, possibly in ways not understood by the stakeholders. More insidiously, learned models may incorporate biases present in the data from which they were learned. These biases could be human biases; for example, prior discriminatory practices reflected in historical data might be reified in learned models. Or these biases could be statistical, for example based on selection bias in the data or peculiarities of what is harder or easier for an algorithm to learn.
In this talk I will present two ideas that help us deal with decision-making based on machine-learned models, and some related research results. First, I discuss a general method for giving transparency into the decisions made by arbitrary machine learning models. I illustrate the method by giving transparency into decisions based on the prediction of personal information about users of Facebook, and suggest that it also could be used to give users more control over how their personal data are used. Second, I discuss how the process of building machine learning models leaves us with a problem of "unknown unknowns." What is our decision-making missing? I define this concept precisely, and present a method for revealing (some of) the unknown unknowns of machine learning models, and some results.
Bio: Foster Provost is Professor of Data Science, Professor of Information Systems, Andre Meyer Faculty Fellow, and former Director of the Center for Data Science at New York University. He previously was Editor-in-Chief of the journal Machine Learning and was elected as a founding board member of the International Machine Learning Society. Foster’s research has won a number of awards, including (among others) the 2017 European Research Paper of the Year, the 2016 Best Paper Award in Information Systems Research, Best Paper awards at the ACM SIGKDD Conference across three decades, the 2009 INFORMS Design Science Award, and a President’s Award from NYNEX Science and Technology (now Verizon). His book, Data Science for Business, is a perennial best-seller. Foster also designed the founding data science architectures for several successful startups, including Forbes’ Most Promising Companies Dstillery and Integral Ad Science.
Big Data has become ubiquitous: large amounts of data are being collected with the hope of being useful in some way. The rather organic nature of many big datasets requires flexible tools to manage, transform and compute on the data. Numerous projects are emerging to address this need, but very often they are limited to aggregations and transformations, or provide only rudimentary modeling capabilities. R, on the other hand, provides a very wide variety of data analytic tools and models. Due to its vectorized nature, it is also very efficient at dealing with moderately sized data. In this talk we will show several ways in which R can be used very efficiently for Big Data analytics at scale, leveraging the Hadoop ecosystem. We will start with hmr, a faster way to use the map/reduce framework from R; introduce ROctopus, which allows us to perform arbitrary operations on large data without the constraints of a map/reduce framework; and show a general framework for developing and using models in R that can leverage distributed systems. We will illustrate the use of these approaches on a real dataset and a large cluster.
Bio: Simon Urbanek is a Lead Inventive Scientist at AT&T Labs in Big Data Research and a member of the R Core Development Team with interests in visualization, interactive graphics, distributed computing and Big Data analytics. He is the author of RCloud and numerous R packages including Rserve, rJava, multicore, iotools, iPlots, RJDBC, Cairo and many others. Simon received his Ph.D. in Statistics at the University of Augsburg in 2004.
In this talk I will provide an overview of my group’s research projects at Cornell Tech involving Computer Vision, Machine Learning and Human in the Loop Computing. Specific examples of projects we will cover include bird identification, learning perceptual embeddings of food and the Visipedia.org initiative.
Bio: Serge Belongie received a B.S. (with honor) in EE from Caltech in 1995 and a Ph.D. in EECS from Berkeley in 2000. While at Berkeley, his research was supported by an NSF Graduate Research Fellowship. From 2001-2013 he was a professor in the Department of Computer Science and Engineering at University of California, San Diego. He is currently a professor at Cornell Tech and the Department of Computer Science at Cornell University. His research interests include Computer Vision, Machine Learning, Crowdsourcing and Human-in-the-Loop Computing. He is also a co-founder of several companies including Digital Persona, Anchovi Labs and Orpix. He is a recipient of the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review “Innovators Under 35” Award and the Helmholtz Prize for fundamental contributions in Computer Vision.
Machine (data-driven learning-based) decision making is increasingly being used to assist or replace human decision making in a variety of domains ranging from banking (rating user credit) and recruiting (ranking applicants) to judiciary (profiling criminals) and journalism (recommending news-stories). Recently concerns have been raised about the potential for discrimination and unfairness in such machine decisions. Against this background, in this talk, I will pose and attempt to answer the following high-level questions:
(a) How do machines learn to make discriminatory decisions?
(b) How can we quantify discrimination in machine decision making?
(c) How can we control machine discrimination? That is, can we design learning mechanisms that avoid discriminatory decision making?
(d) Is there a cost to non-discriminatory decision making?
Bio: Krishna Gummadi is a tenured faculty member and head of the Networked Systems research group at the Max Planck Institute for Software Systems (MPI-SWS) in Germany. Krishna's research interests are in the measurement, analysis, design, and evaluation of complex Internet-scale systems. His current projects focus on understanding and building social computing systems. Specifically, they tackle the challenges associated with (i) assessing the credibility of information shared by anonymous online crowds, (ii) understanding and controlling privacy risks for users sharing data on online forums, (iii) understanding, predicting and influencing human behaviors on social media sites (e.g., viral information diffusion), and (iv) enhancing fairness and transparency of machine (data-driven) decision making in social computing systems. Krishna's work on online social networks, Internet access networks, and peer-to-peer systems has led to a number of widely cited papers and award papers at IW3C2's WWW, NIPS's ML & Law Symposium, ACM's COSN, ACM/Usenix's SOUPS, AAAI's ICWSM, Usenix's OSDI, ACM's SIGCOMM IMC, and SPIE's MMCN conferences. He has also co-chaired AAAI's ICWSM 2016, IW3C2 WWW 2015, ACM COSN 2014, and ACM IMC 2013 conferences.
To achieve high parallelism, modern processors use variations of a single-instruction, multiple-data (SIMD) model. For simple tasks like adding vectors, high degrees of parallelism can be achieved with relatively small effort. For more complex tasks, however, parallelism can be obstructed by explicit or implicit dependencies between data items in different SIMD lanes. In this talk, I will describe several techniques that our group has used to improve SIMD parallelism for complex operations used in modern database or other "big data" systems: hash table probing, string search, and regular expression matching. We evaluate these techniques on CPUs, graphics processors (GPUs) and novel SIMD processors such as the Intel Xeon Phi.
This talk represents joint work with Eva Sitaridi, Orestis Polychroniou, and Arun Raghavan.
Bio: Kenneth Ross is a Professor in the Computer Science Department at Columbia University in New York City. His research interests touch on various aspects of database systems, including query processing, query language design, data warehousing, and architecture-sensitive database system design. He also has an interest in computational biology, including the analysis of large genomic data sets. He has received several awards, including a Packard Foundation Fellowship, a Sloan Foundation Fellowship, and an NSF Young Investigator award.
We are in the middle of a remarkable rise in the use and capability of artificial intelligence systems. Much of this growth has been fueled by the success of deep learning architectures, and we are working on ways to direct these tools towards economic questions. Our approach uses economic theory to break complex questions into a series of machine learning tasks. Each task is then solved using mostly off-the-shelf ML, and we have recipes for combining the trained learners together to answer the original economic questions. I'll detail a couple of examples of this approach, including AI for optimal pricing and for causal inference in search advertisement. The end result is that we are able to automate and improve some common economic tasks, building towards a future system for Economic AI.
Bio: Matt Taddy joins Microsoft Research from the University of Chicago, where he is Professor of Econometrics and Statistics at the Booth School of Business and a fellow of the Computation Institute. He leads MSR’s Alice project on Economic AI. Taddy works at the intersections of statistics, economics, and machine learning. His research is directed towards development of new algorithms for machine learning, uncertainty quantification for these algorithms, and incorporation of artificial intelligence into the study of social and economic systems. Recent projects include optimization for complex demand and incentive systems, analysis of the polarization of political dialogue, and development of artificial intelligence for questions of causation. Taddy developed and teaches the Big Data class at Booth, an advanced MBA course that is designed to prepare students for careers at the interface of business strategy and Data Science. He has collaborated extensively with national laboratories and a variety of start-up ventures, and was a research fellow at eBay from 2014 to 2016. He earned his PhD in Applied Mathematics and Statistics in 2008 from the University of California, Santa Cruz, as well as a BA in Philosophy and Mathematics and an MSc in Mathematical Statistics from McGill University. He joined the Chicago Booth faculty in 2008 and Microsoft in 2016.
Abstract: "Trellis Display", also known as "small multiples" display, is a commonly used exploratory visualization technique where data is split into groups and a plot is made for each group, with the resulting plots arranged in a grid. This approach is very simple yet is considered by data visualization experts to be "the best design solution for a wide range of problems in data presentation" because of its ability to display data in more detail and effectively elicit comparisons across groups. Historically, small multiple display systems presume the data is small enough to be presented in a single static display. Larger and more complex datasets call for a small multiple display system that allows displays to come alive by providing the ability to interactively navigate a potentially extremely large number of plots. This interactive navigation is enabled through the use of "cognostics" -- interesting summary statistics automatically computed for each group. In this talk, I will cover some history of small multiple displays, the principles that make them useful in exploratory data analysis, and recent work toward interactive, scalable small multiple displays that can be used to efficiently explore large data sets in detail. I will demonstrate the new TrelliscopeJS R package and show how it can be used to plug into common data analysis workflows to easily create interactive small multiple displays.
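As a rough illustration of the idea described above, the following Python sketch splits a small table into groups (one panel per group), computes a few summary statistics per group to act as cognostics, and orders the panels by one of them. It is not TrelliscopeJS and does not use its API; the data, the function names (split_by_group, slope, cognostics), and the choice of statistics are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (group, x, y) records; names and values are made up.
records = [
    ("north", 1, 2.0), ("north", 2, 2.5), ("north", 3, 3.1),
    ("south", 1, 5.0), ("south", 2, 4.0), ("south", 3, 2.0),
    ("east", 1, 1.0), ("east", 2, 1.1), ("east", 3, 0.9),
]

def split_by_group(rows):
    """Split rows into one panel per group, as a small multiples display would."""
    panels = defaultdict(list)
    for group, x, y in rows:
        panels[group].append((x, y))
    return dict(panels)

def slope(points):
    """Least-squares slope of y on x for one panel."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in points)
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

def cognostics(points):
    """Per-panel summary statistics (an assumed, illustrative choice)."""
    ys = [y for _, y in points]
    return {
        "mean_y": mean(ys),
        "range_y": max(ys) - min(ys),
        "slope": slope(points),
    }

panels = split_by_group(records)
summary = {group: cognostics(points) for group, points in panels.items()}

# Navigate panels by a cognostic: steepest trend (in absolute value) first.
panel_order = sorted(summary, key=lambda g: abs(summary[g]["slope"]), reverse=True)

for group in panel_order:
    print(group, summary[group])
```

With many thousands of groups, sorting or filtering on such per-group statistics is what lets a viewer page through the most interesting panels first instead of scanning every plot.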
Bio: Ryan Hafen is a statistical consultant and remote adjunct assistant professor at Purdue University. Ryan's research focuses on methodology, tools, and applications in exploratory analysis, statistical model building, statistical computing, and machine learning on large, complex datasets. He is the developer of the datadr and Trelliscope components of the DeltaRho project (deltarho.org), as well as the rbokeh visualization package, and has developed several other R packages. Prior to his work as a consultant, Ryan worked at Pacific Northwest National Laboratory, where he analyzed large complex data spanning many domains, including power systems engineering, nuclear forensics, high energy physics, biology, and cyber security. Ryan has a B.S. in Statistics from Utah State University, an M.Stat. in Mathematics from the University of Utah, and a Ph.D. in Statistics from Purdue University.
Abstract: Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
Data profiling deserves a fresh look for three reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, current data profiling techniques hardly scale beyond what can only be called small data. Third, more and more data beyond the traditional relational databases are being created and beg to be profiled. The talk highlights the state of the art and proposes new research directions and challenges.
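To make the basic profiling tasks mentioned in this abstract concrete, here is a minimal Python sketch that scans a tiny in-memory table and derives per-column completeness and uniqueness, checks single-column key candidates, and tests one functional dependency. The table, the column names, and the helper functions are hypothetical, and the brute-force scans shown here are only workable on small data; they are not the methods of any particular profiling tool.

```python
# Hypothetical toy table; None marks a missing value.
columns = ["id", "email", "city", "zip"]
rows = [
    (1, "a@example.com", "Berlin", "10115"),
    (2, "b@example.com", "Berlin", "10115"),
    (3, None, "Potsdam", "14482"),
    (4, "d@example.com", "Potsdam", "14482"),
]

def completeness(table, i):
    """Fraction of rows with a non-missing value in column i."""
    return sum(r[i] is not None for r in table) / len(table)

def uniqueness(table, i):
    """Fraction of distinct values among the non-missing values in column i."""
    values = [r[i] for r in table if r[i] is not None]
    return len(set(values)) / len(values) if values else 0.0

def is_key(table, idxs):
    """A column set is a key candidate if it has no missing values and no repeats."""
    projected = [tuple(r[i] for i in idxs) for r in table]
    no_missing = all(None not in p for p in projected)
    return no_missing and len(set(projected)) == len(table)

def holds_fd(table, lhs, rhs):
    """Functional dependency lhs -> rhs: equal lhs values imply equal rhs values."""
    seen = {}
    for r in table:
        if seen.setdefault(r[lhs], r[rhs]) != r[rhs]:
            return False
    return True

profile = {
    name: {"completeness": completeness(rows, i), "uniqueness": uniqueness(rows, i)}
    for i, name in enumerate(columns)
}
single_column_keys = [name for i, name in enumerate(columns) if is_key(rows, (i,))]
city_determines_zip = holds_fd(rows, columns.index("city"), columns.index("zip"))
city_determines_id = holds_fd(rows, columns.index("city"), columns.index("id"))

print(profile)
print(single_column_keys, city_determines_zip, city_determines_id)
```

Checking every column combination for keys or dependencies this way grows exponentially with the number of columns, which is one reason naive profiling does not scale beyond small data.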
Bio: Felix Naumann studied mathematics, economics, and computer science at the University of Technology in Berlin. After receiving his diploma (MA) in 1997 he joined the graduate school "Distributed Information Systems" at Humboldt University of Berlin. He completed his PhD thesis on "Quality-driven Query Answering" in 2000. In 2001 and 2002 he worked at the IBM Almaden Research Center on topics of data integration. From 2003 to 2006 he was assistant professor for information integration at the Humboldt University of Berlin. Since then he has held the chair for information systems at the Hasso Plattner Institute at the University of Potsdam in Germany. His research interests are in data profiling, data cleansing, and text mining.
Abstract: We consider the potential to improve the efficiency and efficacy of broader advertising efforts through cross-channel coordination. Past work has demonstrated a positive relationship between television advertising and online search activity. Here, we consider the types of devices on which search response predominantly manifests following TV advertisements, and the degree to which shifts in search activity can be used to evaluate the success of TV advertisers’ targeting efforts. We leverage data on TV advertising around Microsoft Windows 10 and an Xbox video game, in combination with large-scale proprietary search data from Microsoft Bing. Our identification strategy hinges on a combination of geographic heterogeneity in TV advertising exposure and continuous variation in the cost of TV advertisements (a proxy for TV audience size). We first demonstrate that search response peaks within three minutes of the airing of a TV advertisement, and that this manifests primarily via second-screen devices. Our estimated elasticities indicate that a 20% increase in advertising spend equates to an approximately 2.5% (3.4%) increase in search volumes for Windows 10 (the Xbox game). Second, we show that, indeed, the demographic groups targeted by TV advertisements are those most likely to respond, and we thereby demonstrate that TV ad effectiveness can be usefully measured via online search data. Third, examining sponsored search clicks in our query-level data, for queries involving brand-related keywords, we demonstrate a significant increase in rank-ordering effects in searches that take place in the minutes immediately following a TV advertisement, which implies a complementarity between TV and sponsored search advertisements.
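The abstract reports the effect sizes but not the elasticities themselves, so the following Python sketch is only a back-of-the-envelope reading of the quoted figures (a 20% spend increase against a 2.5% or 3.4% search increase). Both function names are hypothetical, and the constant-elasticity (log-log) form is an assumption for illustration, not the authors' estimation method.

```python
import math

def ratio_elasticity(pct_change_spend, pct_change_search):
    """Elasticity as a simple ratio of percentage changes."""
    return pct_change_search / pct_change_spend

def log_elasticity(pct_change_spend, pct_change_search):
    """Elasticity under an assumed constant-elasticity (log-log) relationship."""
    return math.log(1 + pct_change_search) / math.log(1 + pct_change_spend)

# Figures quoted in the abstract: +20% spend, +2.5% (Windows 10) and +3.4% (Xbox game).
spend_change = 0.20
search_change = {"Windows 10": 0.025, "Xbox game": 0.034}

implied = {
    product: {
        "ratio": ratio_elasticity(spend_change, change),
        "log_log": log_elasticity(spend_change, change),
    }
    for product, change in search_change.items()
}

for product, values in implied.items():
    print(product, round(values["ratio"], 3), round(values["log_log"], 3))
```

Under either reading the implied elasticity is well below 1 (roughly 0.13 for Windows 10 and 0.17 to 0.18 for the Xbox game), meaning search volume responds much less than proportionally to advertising spend.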
Bio: Shawndra Hill is a Senior Researcher at Microsoft Research NYC. Before joining Microsoft, she was an Assistant Professor in Operations and Information Management at the Wharton School of the University of Pennsylvania, where she is still an Annenberg Public Policy Center Distinguished Research Fellow, a Wharton Customer Analytics Initiative Senior Fellow, and a core member of the Penn Social Media and Health Innovation Lab. Generally, she researches the value to companies of mining data on consumers, including how consumers interact with each other on social media, for targeted marketing, advertising, health and fraud detection purposes. Her current research focuses on the interactions between TV content and Social Media (www.thesocialtvlab.com). Dr. Hill holds a B.S. in Mathematics from Spelman College, a B.E.E. from the Georgia Institute of Technology and a Ph.D. in Information Systems from NYU's Stern School of Business.
The digital age has transformed how we are able to study social behavior. Unfortunately, researchers have not yet taken full advantage of these opportunities because we are too focused on "big data", such as digital traces of behavior. These big data can be wonderful for some research questions, but they have fundamental limitations for addressing many questions because they were never designed for research. This talk will argue that rather than focusing on "found data", researchers should use the capabilities of the digital age to create new forms of "designed data." I'll provide three templates that researchers can use to combine the strengths of found data and designed data, and I'll illustrate these templates with recent empirical studies. This talk is based on my forthcoming book, Bit by Bit: Social Research in the Digital Age, which is currently in Open Review at http://www.bitbybitbook.com.
Bio: Matthew Salganik is Professor of Sociology at Princeton University, and he is affiliated with several of Princeton's interdisciplinary research centers: the Office for Population Research, the Center for Information Technology Policy, the Center for Health and Wellbeing, and the Center for Statistics and Machine Learning. His research interests include social networks and computational social science. He is the author of the forthcoming book Bit by Bit: Social Research in the Digital Age. Salganik's research has been published in journals such as Science, PNAS, Sociological Methodology, and Journal of the American Statistical Association. His papers have won the Outstanding Article Award from the Mathematical Sociology Section of the American Sociological Association and the Outstanding Statistical Application Award from the American Statistical Association. Popular accounts of his work have appeared in the New York Times, Wall Street Journal, Economist, and New Yorker. Salganik's research is funded by the National Science Foundation, National Institutes of Health, Joint United Nations Program for HIV/AIDS (UNAIDS), Facebook, and Google. During sabbaticals from Princeton, he has been a Visiting Professor at Cornell Tech and a Senior Researcher at Microsoft Research.
Mobile network monitoring and analysis can provide insight into the activity of individual mobile devices as well as into collective user behavior. This creates opportunities for new applications and engineering optimizations, but also faces challenges in terms of privacy and
In the first part of the talk, I will present our work on analyzing mobile data (in particular, CDRs provided by cellular operators and geospatial data we collected from social networks) to characterize human activity in metropolitan areas, with applications to ride-sharing [UBICOMP 2014, SIGSPATIAL 2015], urban ecology [MOBIHOC 2015], and network provisioning [SmartCity 2016]. Time permitting, I will also present algorithms we designed to construct synthetic graphs that resemble real mobile and social network graphs within the dk-series framework [INFOCOM 2015, NetSci 2016].
In the second part of the talk, I will present our current work on AntMonitor, a system for on-device passive network monitoring, collection, and analysis. I will describe the design of AntMonitor as a user-space mobile app based on a VPN-service [SIGCOMM C2BID 2015], but without the need to route through a remote VPN server. Evaluation of our prototype shows that it significantly outperforms state-of-the-art approaches, both in terms of throughput and battery consumption. I will then describe the use of AntMonitor as a platform to enable a number of applications, including: (i) real-time detection and prevention of private information leakage from the device to the network; (ii) passive network performance monitoring; and (iii) application classification and user profiling.
Bio: Athina Markopoulou is an Associate Professor in EECS at the University of California, Irvine. She received the Diploma degree in Electrical and Computer Engineering from the National Technical University of Athens, Greece (1996), and the Master's (1998) and Ph.D. (2003) degrees in Electrical Engineering from Stanford University. She has held short-term appointments at Sprintlabs (2003), Arista Networks (2005), IT University of Copenhagen (2012-2013), and she co-founded Shoelace Wireless (2012+). She received the Henry Samueli School of Engineering Faculty Midcareer Award for Research (2014) and the NSF CAREER Award (2008). She has been an Associate Editor for IEEE/ACM Transactions on Networking (2013-2015) and for ACM CCR, the General Chair for CoNext 2016, and the Director of the Networked Systems program at UCI. Her research interests are in the area of networked systems including network measurement and modeling, mobile and social networks, network security and privacy.
In this talk I will present our recent results on detecting behavioural targeting in online advertising. I will describe the methods that we have developed to: 1) audit web domains for behavioural targeting by training artificial "personas", collecting ads, and identifying correlations between training and landing pages, 2) audit individual impressions by using only browser history and online taxonomies for web pages, and 3) audit individual impressions by using crowdsourced data from multiple users. I will also present our initial findings on the amount of targeting going on, the most targeted categories, the existence of targeting even in sensitive personal categories for which the law requires explicit user consent, as well as our results on identifying the chain of companies involved in the delivery of such ads.
Bio: Nikolaos is the Chief Scientist and one of the co-founders of the Data Transparency Lab, a community of technologists, researchers, policymakers and industry representatives working to create a new wave of transparency software that will permit end users to sneak a peek at what happens to their personal data behind the curtains of the web. He is currently working on answering questions like: "Why am I seeing this advertisement?", "Is the price that I see online for this ticket the same as the one seen by you?", "How can we reconcile the information needs of online advertising/marketing and the privacy concerns of everyday people?". Before dropping everything to work on privacy and transparency, Nikolaos spent many years conducting research and innovation in intelligent transportation, economics of networks, content distribution, new protocols for the Internet, energy efficient communications, social networks, algorithms, and others. More info at: http://laoutaris.info/
Abstract: According to recent surveys, data scientists spend most of their time collecting, curating, and organizing data from heterogeneous and often dirty sources. In this process, datasets have to be cleaned of errors, equal entities from different data sources have to be matched, and data values have to be transformed into a common desired representation. In this talk, I will share our experience in using data curation systems in the wild. I will first report on our recent findings from testing state-of-the-art data cleaning systems on real-world data and point out the limitations of current cleaning algorithms. Then, I will discuss the difficult task of data transformation discovery by presenting our data transformation discovery system, DataXFormer. Finally, I will shed light on our vision for future data curation systems and on how we intend to overcome the current limitations.
Bio: Ziawasch Abedjan is an assistant professor and the head of the "Big Data Management" (BigDaMa) Group at the TU Berlin in Germany and a Principal Investigator in the Berlin Big Data Center. Prior to that, Ziawasch was a postdoctoral associate at MIT CSAIL where he worked on various data cleaning topics. He received his PhD from the Hasso Plattner Institute in Potsdam, Germany, where he worked on methods for mining Linked Open Data. His current research focuses on data integration and data profiling. He is the recipient of the 2014 CIKM Best Student Paper Award, the 2015 SIGMOD Best Demonstration Award, and the 2014 Best Dissertation Award from the University of Potsdam.
Today, 50% of the world's population lives in cities and the number will grow to 70% by 2050. Cities are the loci of economic activity and the source of innovative solutions to 21st century challenges. At the same time, cities are also the cause of looming sustainability problems in transportation, resource consumption, housing affordability, and inadequate or aging infrastructure. The large volumes of urban data, along with vastly increased computing power, open up new opportunities to better understand cities. Encouraging success stories show better operations, more informed planning, improved policies, and a better quality of life for residents. However, analyzing urban data often requires a staggering amount of work, from identifying relevant data sets and cleaning and integrating them to performing exploratory analyses over complex, spatio-temporal data.
Our long-term goal is to enable domain experts to crack the code of cities by freely exploring the vast amounts of data cities generate. This talk describes challenges which have led us to fruitful research on data management, data analysis, and visualization techniques. I will present methods and systems we have developed to increase the level of interactivity, scalability, and usability for spatio-temporal analyses.
This work was supported in part by the National Science Foundation, a Google Faculty Research award, the Moore-Sloan Data Science Environment at NYU, IBM Faculty Awards, NYU Tandon School of Engineering and the Center for Urban Science and Progress.
Bio: Juliana Freire is a Professor of Computer Science and Data Science at New York University. She is the Executive Director of the NYU Moore-Sloan Data Science Environment. She holds an appointment at the Courant Institute of Mathematical Sciences, is a faculty member at the NYU Center for Urban Science and Progress and at the NYU Center for Data Science, where she is also the Director of Graduate Studies. Her recent research has focused on big-data analysis and visualization, large-scale information integration, provenance management, and computational reproducibility. Prof. Freire is an active member of the database and Web research communities, with over 150 technical papers, several open-source systems, and 11 U.S. patents. She is an ACM Fellow and a recipient of an NSF CAREER award, two IBM Faculty awards, and a Google Faculty Research award. She has chaired or co-chaired several workshops and conferences, and participated as a program committee member in over 70 events. Her research grants are from the National Science Foundation, DARPA, Department of Energy, National Institutes of Health, Sloan Foundation, Gordon and Betty Moore Foundation, W. M. Keck Foundation, AT&T, Google, Amazon, the University of Utah, New York University, Microsoft Research, Yahoo! and IBM.
Abstract: Many approaches have recently been introduced to automatically create or augment Knowledge Graphs (KGs) with facts extracted from Wikipedia, particularly from its structured components like the infoboxes. Although these structures are valuable, they represent only a fraction of the actual information expressed in the articles, and surprisingly many KGs miss facts that are indeed present in Wikipedia articles. In this work, we present Lector, an information extraction system that harvests new facts from the text of Wikipedia articles using information extraction techniques bootstrapped from the entities and relations of a given KG. Our preliminary experimental evaluations, which use Freebase as the reference KG, reveal that we can augment several relations in the domain of people by more than 10%, with facts whose accuracy is over 95%. Moreover, the vast majority of these facts are missing from the infoboxes, YAGO and DBpedia.
Bio: Paolo Merialdo has been with Università Roma Tre since 2001, first as a research associate and then as an associate professor. He graduated in Computer Engineering from Università di Genova (1990), and he received his PhD from Università di Roma "La Sapienza" (1998), under the supervision of Prof. Paolo Atzeni. In 1997 and 1998 he spent several months at the University of Toronto as a visiting researcher, working with Prof. Alberto Mendelzon. He has published his research results in important journals of the field, and in the refereed proceedings of major conferences. He is co-founder of InnovAction Lab, the most important Italian startup program for university students, and he serves as an advisor at the LuissEnlabs startup accelerator in Rome.
Abstract: In this talk, I’ll explore how I’ve used data science and my blog, I Quant NY, to make changes in the city I live in: New York City. From parking ticket geography, to restaurant inspection scores, to subway and taxi pricing, I will discuss best practices for data science in the policy space, explore how storytelling is an important aspect of data science, and highlight the various data-driven interactions I've had with City agencies. Along the way, I will point out that data science need not always use complicated math and complex programs. I will show examples of the power of simple arithmetic, and show how often it is more about your curiosity and the questions you ask than the complexity of the equations you use.
Bio: Ben Wellington is the creator of I Quant NY, a data science and policy blog that focuses on insights drawn from New York City's public data, and advocates for the expansion and improvement of that data. Ben is a contributor to The New Yorker, and is a Visiting Assistant Professor in the City & Regional Planning program at the Pratt Institute in Brooklyn. Ben holds a Ph.D. in Computer Science from New York University.
Abstract: There is a two-decade-long history of algorithms for dealing with data streams using small (sublinear) resources such as space, time, and communication. In this talk, I will review some of the achievements in this area, and will discuss emerging directions including stochastic streams, graphical models, graph and matrix algorithms, and others. These methods have applications in statistical data analysis and machine learning, social data analysis, and analytics for modern Big Data systems.
Bio: Muthu is a Professor at Rutgers University. His research interest is in algorithms, in particular, data stream algorithms and online advertising. He has a blog: http://mysliceofpizza.blogspot.com/.
Abstract: Topic modeling algorithms analyze a document collection to estimate its latent thematic structure. However, many collections contain an additional type of data: how people use the documents. For example, readers click on articles in a newspaper website, scientists place articles in their personal libraries, and lawmakers vote on a collection of bills. Behavior data is essential both for making predictions about users (such as for a recommendation system) and for understanding how a collection and its users are organized.
I will review the basics of topic modeling and describe our recent research on collaborative topic models, models that simultaneously analyze a collection of texts and its corresponding user behavior. We studied collaborative topic models on 80,000 scientists' libraries from Mendeley and 100,000 users' click data from the arXiv. Collaborative topic models enable interpretable recommendation systems, capturing scientists' preferences and pointing them to articles of interest. Further, these models can organize the articles according to the discovered patterns of readership. For example, we can identify articles that are important within a field and articles that transcend
Bio: David Blei is a Professor of Statistics and Computer Science at Columbia University, and a member of the Columbia Data Science Institute. His research is in statistical machine learning, involving probabilistic topic models, Bayesian nonparametric methods, and approximate posterior inference algorithms for massive data. He works on a variety of applications, including text, images, music, social networks, user behavior, and scientific data. David has received several awards for his research, including a Sloan Fellowship (2010), Office of Naval Research Young Investigator Award (2011), Presidential Early Career Award for Scientists and Engineers (2011), Blavatnik Faculty Award (2013), and ACM-Infosys Foundation Award (2013). He is a fellow of the ACM.
Abstract: Many computer applications are bound to a particular point in time; more precisely, to a given set of technologies and costs. The same is true of computer security. Unfortunately, once something becomes possible, people become wedded to it, and never look back at the environment and assumptions that made it possible or even necessary. This is especially serious for security, since it causes us to endure the costs and annoyances of marginally useful (or even harmful) mechanisms while blinding us to newer threats.
What can be done? How can we recognize the implicit assumptions in what we're doing? Can we do better in the future? How do differing threat models affect the
Bio: Steven M. Bellovin is the Percy K. and Vidal L. W. Hudson Professor of computer science at Columbia University, where he does research on networks, security, and especially why the two don't get along, as well as related public policy issues. In his copious spare professional time, he does some work on the history of cryptography. He joined the faculty in 2005 after many years at Bell Labs and AT&T Labs Research, where he was an AT&T Fellow. He received a BA degree from Columbia University, and an MS and PhD in Computer Science from the University of North Carolina at Chapel Hill. While a graduate student, he helped create Netnews; for this, he and the other perpetrators were given the 1995 Usenix Lifetime Achievement Award (The Flame). Bellovin has served as Chief Technologist of the Federal Trade Commission. He is a member of the National Academy of Engineering and is serving on the Computer Science and Telecommunications Board of the National Academies of Sciences, Engineering, and Medicine. In the past, he has been a member of the Department of Homeland Security's Science and Technology Advisory Committee, and the Technical Guidelines Development Committee of the Election Assistance Commission; he has also received the 2007 NIST/NSA National Computer Systems Security Award and has been elected to the Cybersecurity Hall of Fame. Bellovin is the co-author of Firewalls and Internet Security: Repelling the Wily Hacker, and holds a number of patents on cryptographic and network protocols.
Abstract: Digital advertising is one of the largest and most open playgrounds for machine learning, data mining, and related analytic approaches. This talk will touch on a number of challenges that arise in this environment: 1) high-volume data streams of around 30 billion daily consumer touch points, 2) low-latency requirements, with scoring and automated bidding decisions made within 100 ms, and 3) adversarial modeling in light of advertising fraud and bots. Specifically, we will discuss an automated learning system implemented at Dstillery that uses a privacy-friendly data representation to build sparse targeting models for thousands of products in millions of dimensions. The solution incorporates ideas from transfer learning, Bayesian priors, stochastic gradient descent, hashing, and learning rate estimation. On the sidelines, but of no less importance, are topics on bid optimization, data reliability, cross-device identification, and observational methods for causal inference. Finally, I will touch on a few higher-level lessons around incentive misalignments and measurement issues in the advertising industry, and on measuring causality with observational data.
Bio: Claudia Perlich leads the machine learning efforts that power Dstillery’s digital intelligence for marketers and media companies. With more than 50 published scientific articles, she is a widely acclaimed expert on big data and machine learning applications, and an active speaker at data science and marketing conferences around the world. Claudia is the past winner of the Advertising Research Foundation’s (ARF) Grand Innovation Award and has been selected for Crain’s New York’s 40 Under 40 list, Wired Magazine’s Smart List, and Fast Company’s 100 Most Creative People. Claudia holds multiple patents in machine learning. She has won many data mining competitions and awards at Knowledge Discovery and Data Mining (KDD) conferences, and served as the organization’s General Chair in 2014. Prior to joining Dstillery in 2010, Claudia worked at IBM’s Watson Research Center, focusing on data analytics and machine learning. She holds a PhD in Information Systems from New York University (where she continues to teach at the Stern School of Business), and an MA in Computer Science from the University of Colorado.
Abstract: Today, 50% of the world's population lives in cities and the number will grow to 70% by 2050. Urban data opens up many new opportunities to improve cities and people’s lives. In NYC, by integrating and analyzing data sets from multiple city agencies, the Bloomberg administration was able to improve the success rate of inspections. A marked reduction in crime both in New York and Los Angeles has been in part attributed to data-driven policing. Policy changes have also been triggered by data-driven studies that, for example, showed correlations between foreclosures and increases in crime, the effects of subsidized housing on surrounding neighborhoods, and how low-income households use the flexibility provided by vouchers to reach neighborhoods with high-performing schools. But in each of these successes, the level of effort required to gather, integrate, and analyze the relevant data, design and refine models, or develop and deploy apps is staggering. Further, as data volumes and data complexity continue to explode, these problems are only getting worse. In this talk, we will provide an overview of research in the development of new methods and systems for enabling interdisciplinary teams to better understand cities. We will also show some applications of our work.
Bio: Cláudio Silva is a professor of computer science and engineering and data science at New York University. Cláudio’s research lies in the intersection of visualization, data analysis, and geometric computing, and recently he has been interested in the analysis of urban data and sports analytics. He has published over 220 journal and conference papers and is an inventor of 12 US patents. His work has received over 10,000 citations according to Google Scholar, and he has an h-index of 50. Cláudio has served on the editorial boards of several journals, including IEEE Transactions on Big Data, ACM Transactions on Spatial Algorithms and Systems, Computer Graphics Forum, The Visual Computer, Graphical Models, Computers & Graphics, Computing in Science and Engineering, and IEEE Transactions on Visualization and Computer Graphics. He helped develop a number of award-winning software systems, most recently the Statcast player tracking system for Major League Baseball's (MLB) MLB.com. He is an IEEE Fellow and was the recipient of the 2014 IEEE VGTC Visualization Technical Achievement Award “in recognition of seminal advances in geometric computing for visualization and for contributions to the development of the VisTrails data exploration system.” He is currently Chair of the IEEE Technical Committee on Visualization and Graphics.
Abstract: I will highlight two data-motivated projects in Olympic figure skating. I will then concentrate in more detail on prediction and modeling challenges arising in a range of problems not unique to sports but illustrated through the analysis of Olympic diving and college basketball.
Bio: John W. Emerson (Jay) is Director of Graduate Studies in the Department of Statistics at Yale University. He teaches a range of graduate and undergraduate courses as well as workshops, tutorials, and short courses at all levels around the world. His interests are in computational statistics and graphics, and his applied work ranges from topics in sports statistics to bioinformatics, environmental statistics, and Big Data challenges. He is the author of several R packages including bcp (for Bayesian change point analysis), bigmemory and sister packages (towards a scalable solution for statistical computing with massive data), and gpairs (for generalized pairs plots). His teaching style is engaging and his workshops are active, hands-on learning experiences.
Abstract: Suppose you are given a software system that is composed of a set of packages, each at a particular version. You want to update some packages to their most recent versions possible, but you want your software to run after the upgrades, which may entail changes to the versions of other packages. One approach is trial and error, but that quickly ends in frustration. We advocate a reproducibility-based approach in which tools like ptrace, reprozip, pip, and virtual machines combine to enable us to explore version combinations of different packages, even on a variety of platforms. Because the space of versions to explore grows exponentially with the number of packages, we have developed a memoizing algorithm that avoids exponential search while guaranteeing an optimum version combination.
This is joint work with Christophe Pradal, Sarah Cohen-Boulakia, and Patrick Valduriez.
Bio: Dennis Shasha is a professor of computer science at the Courant Institute of New York University and an Associate Director of NYU Wireless. He works with biologists on pattern discovery for network inference; with computational chemists on algorithms for protein design; with physicists and financial people on algorithms for time series; on clocked computation for DNA computing; and on computational reproducibility. Other areas of interest include database tuning as well as tree and graph matching. Because he likes to type, he has written six books of puzzles about a mathematical detective named Dr. Ecco, a biography about great computer scientists, and a book about the future of computing. He has also written five technical books about database tuning, biological pattern recognition, time series, DNA computing, resampling statistics, and causal inference in molecular networks. He has written the puzzle column for various publications including Scientific American, Dr. Dobb's Journal, and the Communications of the ACM. He is a fellow of the ACM and an INRIA International Chair.