Public Data as Public History – and Future

Jun 18, 2019

“It is a supreme gift to realize that the past is a burden you don’t need to carry with you.”¹

In our current digital world, this advice feels both relevant and out of reach. As tech companies follow your every click, view, like, and search across the web, they build profiles of you and assign you a shadow identity even if you “opt out” of tracking, and they effectively make it impossible for you to let the past go.²

Not only is it unclear whether you can ever erase this past, but it’s also incredibly difficult to escape it — both within a single product and across the internet via advertisements. For example: A friend recently searched for “swallow” on eBay in order to find back issues of a food magazine of the same name, and after getting results that were pornographic rather than useful, she continued to see recommendations for the sex-related products for days after. She spent hours scouring the internet and finally talking to eBay support for ways to delete her search history or change recommendations, all to find out that the only path forward was to delete her account and make a new one, a process which could take days or weeks. There are numerous product issues here, but to me one of the most shocking things that even within a single product, users do not have the ability to control what’s saved and used about them.

For another example, take this powerful article from last year in which the author describes how after she suffered a stillbirth, she continued seeing ads targeted at pregnant women. When she reported them as not relevant, she was then shown ads for products for newborns, as though the ad algorithm had assumed that because she was no longer pregnant, she must’ve given birth happily. Facebook responded to this article with instructions on how to opt out of entire ad topics, but that’s just for their platform. How can someone possibly reshape their preferences, history, and identity across the internet when their data is being consumed, analyzed, and used for targeted ads (or other purposes) without their knowledge or consent by companies they may or may not even know about?³

Cities and the burden of their past

A lot has been written about the loss of agency and data ownership of individuals on the internet, and there are projects and legislation underway seeking to address these issues. But what does this mean for communities? For cities?

How does the current state of technology enable or prohibit cities and the people living in them from making their own history, re-making it, owning it, and disowning it?

Note: I’m focusing on cities here rather than communities or other levels of government, because they are a nice little unit with formal governance and plenty of examples to draw on.

Obviously, cities are a bit different than individuals. For one, cities are very much built on the past: they survive for centuries if not millennia, and they evolve and are constantly shaped by past decisions as well as the desires and needs of current inhabitants or stakeholders, whether they are locals or live in Silicon Valley. We see the past all around us: physical infrastructure like buildings, streets, and water systems, and cultural infrastructure, like public art, outdoor spaces, and memorials. We also see different versions of or remembrances of the past coexisting, like statues of Martin Luther King, Jr, sharing space with memorials to confederate soldiers or white supremacists.

And there are many pasts we don’t see: the villages, cities, trade routes, and culture of indigenous peoples erased or displaced by settlers. There are the voices and stories that have historically in this society not been heard or recorded: those of women, indiginous groups, minorities, lower classes, the disabled, and immigrant communities.

This past of a city not something we can or should easily discard, even if it’s a burden we don’t want to carry with us. It is is important to recognize and to seek to understand, because those past decisions impact the present. The construction of highways through historically black or blue-collar neighborhoods not only displaced communities and ensured the future difficulty of revitalizing those neighborhoods, but also led people and money out of cities and into suburbs, shaping American poverty today.

But a city’s past is constantly being reshaped: we reshape it when we uncover the untold stories, when we understand the influences shaping our present, when we make new decisions for our city’s present or future.

The digital history of cities is public data

With tech, we have the opportunity – or misfortune – of having another medium on and with which to write our cities’ and our communities’ histories.

We’re writing the digital history of cities in the same way our personal histories are being written for us online: through data. For individuals, digital history is the personal data that accumulates from our digital activity - the data we intentionally input and collect as well as the data collected about us.

For cities, that digital history is public data, by which I mean data that is generated by the public, though it may not necessarily be publicly accessible. Public data can take a few different forms – and if I’m missing any below, please let me know!

Surveys and observational analysis

For ages cities have been using public surveys to collect data to understand the stories of their communities and inform policies, zoning rules, etc. There are known issues with this, such as sample size, self-selection, truthfulness, and replicability.⁴ People have to opt in to taking the survey, so surveys are missing the voices of people who opt out, and even when taking surveys, people may not answer truthfully or consistently with what they’ve said in the past. Other tactics involve in-person observational analysis, but that’s only useful when not used in isolation, which I am told is unfortunately often the practice.

Operational data

Operational data is data the city agencies collect in the process of its daily operations. More governments are starting to understand the power of the data they generate simply by doing their jobs, and the stories they can tell with that data.

For example, New York City established the Mayor’s Office of Data Analytics (MODA) to start treating its operational data as a true asset that can help the city improve services, address issues, share data across the city, and implement NYC’s open data law. They are starting to tell the stories of this data and the people involved in its creation, such as those of drivers of for-hire-vehicles and their welfare.

Open data

Public data can be open data. It can be the data that’s available for citizens and companies and other organizations to download and browse or access with an API key. Not all public data that governments collect is actually – or should be – public in the sense of open and freely accessible. That same ride-hailing data that NYC has used to understand and inform policy was shown at one point to contain personally identifiable information which the public would surely not want to actually be public. The balance of privacy and transparency isn’t a problem that’s been solved, but that shouldn’t keep us from trying and promoting open when possible.

While I have heard government tech folks lament at the underutilization of open data portals, open data is critical in the effort for cities to own their narrative and be accountable to residents and themselves.

Social data can also be public data. An Australian start-up called Neighboulytics has recognized this and is using social data to help cities understand their communities and inform the city decisions. I saw their Head of Analytics, Gala Camacho Ferrari, give a talk at CSV,conf last month,⁵ and I’m equally cautious and excited about this.

This is an example of people in the community creating their own data and cities being able to read and incorporate that into the city’s story and use it to have a voice in shaping the city. I have a few concerns though:

People post on social media or review sites for a different purpose than city planning, and that context needs to be taken into account when trying to glean insights.
Furthermore, those people may not consent to their data being used that way. They are posting publicly, though, so they have at least dubiously consented to public use (whether they understand that or not is a different question).
Not everyone in the city engages with social media in a way that can be accessed and used, so their voices may not be represented.
That data lives on notoriously closed platforms like Facebook, Twitter, and Instagram (owned by Facebook), so we don’t necessarily know what’s being filtered out or pushed to the top, and those algorithms might impact the way data is presented and read.

Their founder does a good job addressing the some of these concerns in a recent interview, and the last is deeply related to the issue of personal data ownership that we’ve already talked about above. Regardless, I think social data is a valuable piece of the puzzle because it rethinks how cities find and incorporate the voices of their residents.

311 data

Technically 311 data is a subset of operational data and in some cases open data, but it’s worth calling out specifically because it is so important. 311 is the service that can be a hotline or other communication mechanism through which residents in a city can report issues or complaints with city services or neighbors, e.g. noise complaints, tenants rights issues, etc. More than 200 cities in the US have 311, though I’m not sure if cities in other countries have equivalent services.⁶

I’ve heard NYC government employees call 311 open data the single most important dataset in the city. It is the feedback loop between the city and residents.

Physical data

Cities are physical places, and they generate physical data. There’s been a lot of hype in the past few years about “Smart Cities” and the potential of unlocking the value of this data for cities through smart or wifi-connected devices placed around the city. For example, Syracuse recently announced a $32 million project to upgrade its streetlights to have smart controllers, and these new lights will be the foundation for future projects like sensors to collect traffic data.

At what point does this data collection for public good become privacy-violating surveillance? I don’t want to dive into that too much now, but what I find promising are the emergence of sensor companies like Numina that build devices that do “onboard computing” or “edge processing” – basically meaning that images and other identifying information are processed on the device and sent to the cloud in an anoymized form, and then that identifying information is deleted on the device. To me, this is the only acceptable and responsible way to do public Internet of Things data collection that I know of.

Geographic data

Another type of public physical data is geographic data. Also known as map data or geospatial data, this type of data is public because it describes the world that we all share. This may not necessarily include geospatial data describing private property, but it does include data describing streets, parks, locations of public institutions, etc. Cities and governments typically have departments responsible for a geographic information system (GIS) with detailed geographic data of their jurisdiction, though that data has historically been difficult or costly for the public to access.

Maps are an important part of the public data conversation because they are a “tool of both recognition and oppression.” I dive into this a bit more below, but for some positive news and a historical look at the social impact of maps, check out this article about a new mapping project called LandMark that aims to map and therefore help protect indigenous land.

Shaping public data with an eye towards equity

Data itself isn’t objective, and the act of collecting data isn’t enough. The stories we can tell from data are shaped by the way we collect the data, and what data we choose to collect. For example, if we collect data about medical service use and only collect binary gender options (male and female, rather than additional options such as trans-male and trans-female), then we are missing insight into the medical needs of the trans community.

Taking the same approach with 311 data, if we don’t overlay service request data with socioeconomic, demographic, or location data, we may miss valuable insights into why certain requests seem more prevalent than others. Studies like this one have shown that socio-economic and demographic factors do play a role in who is more likely to make service requests, meaning that we cannot use 311 data on its own to tell a definitive and unbiased story of all city service issues. From a practical perspective, this is important because the city uses this data to determine things like resource allocation and maintenance, and therefore needs to make sure additional data and analsysis are used alongside the raw data to provide context.

The hand that holds the pen

As the characters of the recent film, Colette, like to say, the hand that holds the pen writes history. If public data is the history being written, we have to make sure that the public is the one holding the pen (and the paper). We already see the disturbing consequences of individuals not owning their data or rights to their data in the current tech landscape. This has sobering implications for cities and communities that we can’t ignore.

We’ve already seen multiple instances of communities’s identities being shaped against their knowledge or will because of the power of tech companies like Google in owning and controlling the data that people use. Take for example the recent story about Google erasing a neighborhood and the aftereffects. A community in Buffalo that had referred to itself as the Fruit Belt for generations, suddenly found itself being referred to as “Medical Park” on Google Maps. The source of the name change is complex (read the article - it’s a good one!), but a local geographer and data scientist named Aaron Krolikowski quoted in the article summarizes a key point:

“We’ve historically tended to self-identify our communities…. If suddenly we become disconnected from that process, I think there’s a lot of questions that emerge about the ability of a community to determine its future, in some cases.”

The ability for a for-profit company, which is not accountable to the community (except perhaps when there is bad enough press), to issue an entirely new identity to that community without its consent and with clear economic and social consequences on that community’s shape and future, is a demonstrative and alarming example of the wrong person holding the pen.

We also have to be cognizant of who is making the pen. If the tools being used for civic planning and data collection are built by people who are not representative of the communities in which these tools are being deployed, they will not even be aware of the variations and types of data they need to be able to collect.

This is why many people are reasonably wary of “Smart Cities” programs, especially Alphabet’s Sidewalk Labs project in Toronto. Alphabet is the parent company of Google, and this project involves huge quantities of data being collected. For this project and all the other tech projects involving public data generation, collection, and analysis, we have to keep asking:

Who will truly own that data? Who will decide what types of data get collected, and who is collecting the data? Who is making the tools for this data collection? What policy decisions will this data influence, and what stories will be told from it? How will individuals’ privacy be protected? How will cities ensure this data doesn’t get passed to undisclosed companies to further target ads or seek profit or be used against the will of the public? Perhaps most importantly, how will city residents be able to control and shape that data, delete it when they choose to, and use it for their own self-determination?

1 I saw this quote on a little bronze plaque in a Budapest coffee house called Cafe Madal, though, full disclosure, the wording might have been slightly different. I didn’t take a picture and I haven’t found this in the online archives of Sri Chinmoy, to whom the cafe attributed these words.

2 You can read more about that here.

3 For a look at data brokering, but in the location/geospatial industry, check out this report from MIT.

4 Admittedly my source is a company trying to sell an alternative to surveys.

5 Also, thanks to Gala for pointing me to some of the articles linked in this post!

6 Check out this fun timeline for a history on 311.