New research examines the issues that have hindered the growth of data markets and the models that could soon accelerate their development

One of the internet’s most profound social and business impacts has been the change in how many things are bought and sold. The rise of Amazon and eBay, for example, revolutionized consumer marketplaces. Multiple travel sites now enable a buyer to shop around for the best travel bargains. The pandemic accelerated this shift, with even art and cars now routinely bought without physical inspection.

Interestingly, no similar revolution has happened with data, though data has long been traded between and among people and institutions. This is not to say that no data trading occurs: consumer spending and stock market data are bought and sold every day. However, there should be many more data markets connecting data buyers and sellers, especially given available data creation and cloud storage technologies. Why they don’t exist is an intriguing question. Fortunately, recent work by Pantelis Koutroumpis (Oxford), Aija Leiponen (Cornell), and Llewellyn D W Thomas (Imperial College) examines this very issue.

The authors start their analysis by noting that most data are in “raw” form, a state in which they have limited value. In the language of economics, they are intermediate goods that need to be processed in some way to reach maximum value. In other words, a list of numbers in a database is of little use until it is connected with identifying information and then utilized for a value-creating purpose, e.g., consumer analysis or investing. In some ways, data are like patents in that the idea alone creates little value — it must be put to work in some way in order to make it worthwhile. Given the exponential rise in data creation in the past few decades, one would expect to find many markets connecting data sellers and buyers, but the authors present several issues that have prevented the wider development of data markets.

Issue 1: Laws that govern data are unclear

A major hurdle to data markets is the protection the law must provide to the seller’s information. As the authors note, although databases are “theoretically protected under copyright, the strength and extent of the protection are limited and variable.” Indeed, for databases, “copyright typically only protects an empty shell—the structure and organization of the database, not the individual observations it contains (unless the data themselves are characterized as creative content), provided there is an original contribution in putting the dataset together.”

The mechanics that protect ownership of information are known as an appropriation regime, and the authors note that some countries lack one altogether or that the regimes of many countries do not align. For example, the United States has no specific database rights, Australian copyright law protects databases, and the Canadian approach is “somewhere in the middle.” While the European Union’s database directive of 1996 “sought to extend protection to the noncopyrightable aspects of databases, for example, when the data are provided in a different order or in a manipulated format,” in the United States, “it is difficult to prevent a competitor from taking substantial amounts of material from collections of data and using them in a competing product.”

When data have a weak appropriation regime, the authors note, they are typically protected through contracts, but these can be complex and expensive to design, deploy and enforce. 

Issue 2: Quality control is difficult

Quality is the lifeblood of data, but, as any data scientist will attest, it is the hardest dimension to attain and maintain. For data markets, note the authors, the varying quality of data sources hinders market development, so much so that third parties are often created with the sole purpose of guaranteeing the quality of someone else’s data. Unfortunately, they add, “product-level verification by intermediaries such as screening and disclosure is more difficult, given the vast heterogeneity in both the format and content of data.” Furthermore, “when data have been combined into hybrid datasets, consisting of a variety of industries, jurisdictions, and contractual conditions, and used in a variety of corporate functions, the legal status of the hybrid product may be impossible to define.” This legal limbo impacts quality, because “by not having certainty on the legal status of a dataset, the sellers themselves may be (perhaps unwittingly) offering a lower quality product.”

Issue 3: Privacy control is difficult

Most consumers are familiar with the challenges companies have in keeping data private in the face of technical and sometimes criminal forces. Moreover, even when data sellers honestly claim to provide anonymized data sets, the individual sources can often be identified despite the anonymization. “Computer scientists,” note the authors, “have convincingly demonstrated that they can rather easily ‘reidentify’ or ‘deanonymize’ individuals from anonymized data, highlighting that regulation of privacy is a crucial concern.”
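
To make the reidentification risk concrete, here is a minimal sketch of one common approach, a linkage attack, in which “anonymized” records are joined to a public list on shared attributes such as ZIP code, birth year, and sex. It is an illustration only and is not taken from the paper; every record, name, and field below is invented for the example.

```python
from collections import defaultdict

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
anonymized = [
    {"zip": "02139", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "10027", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public list (for example, a voter roll) with names attached.
public = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "10027", "birth_year": 1990, "sex": "F"},
]

# Attributes assumed to appear in both data sets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Build a join key from the shared quasi-identifiers."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymized_rows, public_rows):
    """Return (name, diagnosis) pairs where the join key matches exactly one person."""
    names_by_key = defaultdict(list)
    for person in public_rows:
        names_by_key[key(person)].append(person["name"])

    matches = []
    for row in anonymized_rows:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # a unique match links the record to a named person
            matches.append((names[0], row["diagnosis"]))
    return matches


matches = reidentify(anonymized, public)
for name, diagnosis in matches:
    print(name, "->", diagnosis)
```

In this toy data, two of the three records link to a single named person even though no names appear in the “anonymized” set. Real attacks and real defenses are considerably more involved.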

Issue 4: Data sources can be challenging to identify with certainty

Weak appropriation regimes and quality challenges mean that buyers often must infer data quality from the seller’s identity. In other words, “rather than attempting to verify the status of the data goods directly, trading partners usually rely on the reputation and legal liability of the original source, potentially with their contractual commitment to correct any mistakes found in the data.” As with a fine French wine, a data set without a guarantee of provenance is often worth less.

Unfortunately, note the authors, “there have been few institutional responses to the necessity for proving provenance, that is, disclosure of the sources and processes that created the data, although there have been calls to action for the development of sector-specific and trans sector standards for metadata, calibration, accuracy and timeliness to provide a firm and trusted foundation for data capture, trading and re-use.”
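
As a rough illustration of what a machine-readable provenance disclosure might look like, the sketch below attaches a small metadata record to a data set and stores a SHA-256 fingerprint of its contents. The record layout, field names, and sample values are assumptions made for this example; they do not follow any existing standard and are not proposed by the authors.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance metadata that travels with a data set."""
    source: str               # who produced the data
    collected_on: str         # ISO date of collection
    processing_steps: tuple   # what was done to the raw data
    sha256: str               # fingerprint of the delivered bytes


def fingerprint(data: bytes) -> str:
    """Hash the exact bytes being sold."""
    return hashlib.sha256(data).hexdigest()


def describe(data: bytes, source: str, collected_on: str, processing_steps) -> ProvenanceRecord:
    """Create a provenance record for a data set."""
    return ProvenanceRecord(source, collected_on, tuple(processing_steps), fingerprint(data))


def verify(data: bytes, record: ProvenanceRecord) -> bool:
    """Check that the received bytes are the ones the record describes."""
    return fingerprint(data) == record.sha256


# Invented sample data set and seller.
dataset = b"store_id,week,sales\n17,2020-06-01,1043\n"
record = describe(dataset, "Example Retail Co.", "2020-06-08",
                  ["deduplicated", "aggregated weekly"])

print(json.dumps(asdict(record), indent=2))
print("unchanged copy verifies:", verify(dataset, record))
```

A fingerprint of this kind only shows that the data were not altered after the record was written. Whether the stated source and processing steps are true still depends on the seller’s reputation and contractual commitments, which is the gap the authors describe.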

Given these issues, the authors identify four key attributes that a future data market model should have to overcome the current challenges.

Attribute 1: Thickness

In economics, a thick market is one with many buyers and sellers trading with each other, e.g., major Western stock markets. A thin market is the opposite, of course. Data markets need to be thick, but not overly so, explain the authors: “while thickness is a necessary precondition for an efficient market, popularity can also create ‘congestion’ by slowing down transaction times and thus limiting participants’ alternatives.”

Attribute 2: Speed

An efficient data market requires speedy transactions to ensure market clearing, but not so rapid that buyers don’t have an opportunity to evaluate alternatives. Indeed, the amazing speeds attained in several equity markets today are sometimes blamed for material inefficiencies in trading and demand allocation.

Attribute 3: Safety

Safe markets are those where participants present information correctly and are prevented from acting in ways that would reduce the overall efficiency of the market. For example, note the authors, “it would be important to prevent buyers from colluding and prevent sellers from making side contracts with buyers or other sellers or trade outside the market altogether.” In the case of data, “a safe marketplace will provide credible provenance information: if a buyer is unable to assess the origins (and thus the quality and legality) of the data, information asymmetries between the seller and the buyer are aggravated and the market becomes inefficient.” 

Attribute 4: Awareness

Any successful marketplace “needs to respect the social and ethical norms associated with the underlying commodity” and avoid engaging in transactions that violate social norms. For example, European regulators believe that there are classes of data that should never be traded, though in other markets some of those data are traded every day. As the authors note, “individuals or social groups may view trade in personal data as repugnant and seek to limit its legality and legitimacy.” Indeed, not only is there increasing public interest “in the societal impacts of data, privacy, and data trading, there is also increasing regulatory interest in the transparency and quantity of the personal data that has been amassed and is being traded.”

In the final section of their paper, the authors examine four market types, noting their strengths and weaknesses if applied to data buying and selling.

Type 1: One-to-one

One-to-one matching is a “bilateral relationship that involves one buyer and one seller and is typically characterized by negotiated terms of exchange, usually setting up a relational contract.” Data brokers are a typical example: these firms buy, aggregate, and sell consumer data from hundreds of online and offline sellers of consumer goods and services. For example, “Acxiom sells intricate profiles of US households including demographics, financial status, major purchases, political behavior, interests, and life events such as marriage, divorce, and birth of children.”

One-to-one markets are thin by nature and often operate in secrecy. On the plus side, they can allow for better provenance and quality control, as well as appropriation. “With a relatively small number of trades,” note the authors, “congestion is unlikely to be a concern, but transaction costs will be high due to costs of search, negotiation, and relationship management, including contract enforcement.” Moreover, because these markets often deal with confidential data, they may engage in the trading of information that society would condemn, e.g., markets that sell stolen consumer information or medical records. Thus, conclude the authors, “even though the relational aspect of bilateral data markets can ameliorate some of the issues, these markets are still likely to feature significant failures whereby sellers with valuable assets are not able to trade with buyers willing to pay a positive price.”

Type 2: One-to-many

One-to-many markets for data exist, and many are thriving. “Much financial market data (e.g. securities or commodities data as provided by the New York Stock Exchange, NASDAQ, or the Chicago Mercantile Exchange) is accessed this way,” note the authors. Because data delivery can be very efficient in these markets, and because their transparency makes it less likely they will enable socially repugnant activities, they provide many advantages to data market designers. However, buyers may “use the data in ways that reduce the value of data for the seller and for other buyers” and “automated standard contracts may also fail to comprehensively describe the sources and quality of the data, hence weakening provenance.” 

Type 3: Many-to-one

Many-to-one marketplaces have multiple sellers but only one (major) buyer. These marketplaces are characterized by “the harvesting of data, where users make their data available to a single service provider, under terms of exchange that often resemble barter.” The user typically receives access to a “free” service in return for giving up their information. An example of this is Waze, in which many users share their location data with one broker in return for real-time logistics services.

In a many-to-one market, “congestion and transaction costs of harvesting can be very low, because there is no need for individual negotiation or relationship management.” However, if the brokers collect data that society later decides to make untradable, costs can increase. For example, the European Union’s GDPR gives users a right to be forgotten. Should this become a popular right to exercise, “it might become very costly to online service providers aiming to monetize user data.” More generally, “users of the adjacent service may find it repugnant that their behavioral data is exploited by the service provider for other purposes than those in which the users participate.”

Type 4: Many-to-many

Multilateral or many-to-many marketplaces are trading platforms “upon which a large number of registered users can upload and maintain datasets, and where access to and use of the data are regulated through varying licensing models, either standardized or negotiated.” Unlike traditional market intermediaries, note the authors, “two-sided markets usually do not take ownership of the goods, instead alleviating (and profiting from) bottlenecks by facilitating transactions.”

The authors believe that multilateral markets may provide several desirable features over other market designs, as “they potentially enable economies of scale, scope, innovation, complementarity, transaction, and search.” In principle, such digital platforms could “generate value for data sellers and buyers through enhanced market efficiency due to high transaction volume, resource allocation efficiency, and stable matching.” Although costly to maintain, they can achieve economies of scale, with the associated risk that — as with eBay and Amazon — a small number of players will eventually dominate specific data markets. Moreover, designing technical or contractual systems that “incentivize and enforce appropriate behavior of the participants on a multilateral platform, in the absence of relational contracting, may be difficult if not impossible.” Indeed, Microsoft’s Azure Data Catalog may be the only thriving legal example of this model at work today.

Conclusions

Once we understand the challenges this paper outlines, it becomes easier to understand why there is no “eBay for data.” Some of the obstacles noted may seem unlikely to be resolved in the near future, but there is progress on the horizon. For example, ID Analytics (now owned by LexisNexis Risk Solutions) started life as a non-profit, one-to-many platform on which consumers would agree to trade their personal information in return for credit monitoring and protection services.

Yet another proposed approach is the use of personal data banks, “where a centrally organized ‘personal data management service’ enables consumers to exploit their personal data through the provision of secure and trusted space.” Tim Berners-Lee’s Solid initiative “seeks to give every user a choice about where their data is stored, which specific people and groups can access selected elements, and which applications can use them.” MIT researchers have proposed data cooperatives, and CitizenMe has created a data “exchange” that enables individuals to pool their data for surveys and other uses in exchange for compensation. However, the most complex issue, appropriation, is still a hotly debated topic. Scholars such as Alan Westin have argued forcefully for the right to data ownership, while others propose that data provides social benefits and thus should not be treated as personal property.

While it would be easy to dismiss the idea of ubiquitous data marketplaces for the reasons noted above, that’s a risky proposition. Time and again, technological models that were theoretical in one decade became reality the next. Indeed, paradoxically, the growing concern with data privacy and collection visibility may be the force that breaks some of the regulatory logjams that have hindered data marketplace development in the past. It is not that hard to imagine a firm such as Apple, which is increasingly using data privacy as a competitive advantage, creating a one-to-many market in which consumers deposit location, spending, and health data for storage (collected today via Apple iPhone, Apple Card and the Apple Watch). With their data in Apple’s “safe hands,” consumers could opt in to sharing it with businesses that want to use their information for other purposes in a data marketplace (much like Apple’s App Marketplace of today). This may seem a fanciful notion, but the value of such a data market would probably dwarf anything Apple or Google has built to date. It is an idea, as this illuminating paper hints, whose time may arrive sooner than we think.

The Research

Pantelis Koutroumpis, Aija Leiponen, Llewellyn D W Thomas, Markets for data, Industrial and Corporate Change, Volume 29, Issue 3, June 2020, Pages 645–660, https://doi.org/10.1093/icc/dtaa002

Posted by: Carlos Alvarenga