Conference wrap-up: DataEdge 2013

The 2nd DataEdge Conference, organized by UC Berkeley’s I School, has wrapped, and it was a doozy. The Geospatial Innovation Facility (GIF) was a sponsor, and Kevin Koy from the GIF gave a workshop titled “Understanding the Natural World Through Spatial Data.” Here are some of my highlights from what was a solid and fascinating 1.5 days. (All presentations are now available online.)

Michael Manoochehri, from Google, gave the workshop “Data Just Right: A Practical Introduction to Data Science Skills.” This was a terrific and useful interactive talk built around one question: who, or what, is a data scientist? One early definition he offered was a person with three groups of skills: statistics, coding (or an engineering approach to solving a problem), and communication. He further refined this definition with a list of practical skills for the modern data scientist:

  • Short-term skills: Have a working knowledge of R; be proficient in Python and JavaScript, for analysis and web interaction; understand SQL; know your way around a Unix shell; be familiar with distributed data platforms like Hadoop; understand the data pipeline: collection, processing, analysis, visualization, communication.
  • Long-term skills: Statistics: understand k-means clustering, multiple regression, and Bayesian inference. Visualization: understand both the technical and communication aspects of good viz.
  • Finally: Dive into a real data set; and focus on real use cases.
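
For readers who have not met k-means clustering, one of the statistics skills on the list above, here is a minimal sketch in plain Python. It is my own illustration, not something shown in the workshop: the function name `kmeans`, the toy points, and the choice of two clusters are all assumptions made for the example. It alternates two steps, assigning each point to its nearest center and then moving each center to the mean of its points, until nothing changes.

```python
import random


def assign(points, centers):
    """Group each 2-D point with the index of its nearest center."""
    clusters = [[] for _ in centers]
    for x, y in points:
        nearest = min(
            range(len(centers)),
            key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2,
        )
        clusters[nearest].append((x, y))
    return clusters


def kmeans(points, k, iterations=20, seed=0):
    """Basic k-means (Lloyd's algorithm) on a list of (x, y) tuples."""
    rng = random.Random(seed)
    # Assumption for this sketch: start from k randomly chosen data points.
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = assign(points, centers)
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centers.append((mean_x, mean_y))
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[i])
        if new_centers == centers:
            break  # Centers stopped moving, so the assignment is stable.
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical toy data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(points, k=2)
for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

On real data you would more likely reach for a library implementation and think carefully about how to choose k; the point here is only to show what the algorithm does.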

Many other great points were brought up in the discussion; the data storage conundrum in science was one. We are required to make our public data available, so where will we store datasets, and how will we share public scientific data and pay for access to it in the future?

Kate Crawford, Principal Researcher at Microsoft Research New England, gave the keynote address, entitled “The Raw and the Cooked: The Mythologies of Big Data.” She wove together an extremely thoughtful and informative talk about some of our misconceptions about Big Data: the “myths” of her title. She framed the talk by introducing Claude Lévi-Strauss’s influential anthropological work “The Raw and the Cooked,” a study of Amerindian mythology that presents myths as a type of speech through which a language and culture could be discovered and learned. You know you are in for a provocative talk at a Big Data conference when the keynote leads with CLS. She then presented a series of six myths about Big Data, each illustrated simply with a few slides. Here is a quick summary of the myths:

  1. Big Data is new: the term was first used in 1997, but the “pre-history” of Big Data originates much earlier, in 1950s climate science for example, or even earlier. What we have are new tools driving new foci.
  2. Big Data is objective: she used the example of post-Sandy tweets, and made the point that while widespread, these data are a subset of a subset. Muki Haklay makes the same point with his cautionary comment, “you are mining the outliers” (see previous post). She also pointed out that 2013 marks the point in the history of the internet when 51% of web traffic is non-human. Who are you listening to?
  3. Big Data won't discriminate: does Big Data avoid group-level prejudice? As we all know, people not only have different access to the internet, but because your user experience is framed by your previous use of and interaction with the web, the rich and the poor see different internets.
  4. Big Data makes cities smart: there are numerous terrific examples of smart cities (even many in the recent news), but resources are not allocated evenly. When smartphones are used to map potholes needing repair, for example, repairs are concentrated in areas where cell phone use is higher: the device becomes a proxy for the need.
  5. Big Data is anonymous: Big Data has a Big Privacy problem. We all know this, especially in the health fields. I learned the new term “Health Surrogate Data,” which is information about your health that results from your interaction with the Internet. Great stuff for Google Flu Trends, for example, but still worrying. Quoting Nicholas Terry, she said that relying on HIPAA, the standard law for protection in the public health field, is like “bringing a knife to a gunfight.”
  6. You can opt out: there are currently no clear ways to opt out. She asked: how much would you pay for privacy? And if the technological means to opt out were created and made widespread, we would likely see the development of privacy as a luxury good, further differentiating internet experience based on income.

The panel discussion “Digital Afterlife: What Happens to Your Data When You Die?”, moderated by Jess Hemerly from Google and including Jed Brubaker from UC Irvine and Stephen Wu, a technology and intellectual property attorney, was eye-opening and engaging. Each speaker gave a presentation drawing on his own expertise: Stephen Wu gave us a primer on digital identity estate planning, and Jed Brubaker shared his research on the spaces left in social media when someone dies. Both talks were utterly fascinating, thought-provoking and unique.

And finally, Jeffrey Heer from Stanford University gave a stunning and fun talk entitled “Visualization and Interactive Data Analysis,” which showcased his viz work and introduced many of us to Data Wrangler, which is awesome.

Great conference!