Google’s AI Blog published an article discussing how Dataset Search works and what signals it uses to rank datasets. Google is already showing dataset rich results and may begin showing more as publishers add the Schema.org markup. Understanding what datasets are and how to rank them is important because it can become a new source of traffic.
Google’s dataset developers page was updated in May 2018 to note that dataset rich data is coming to Google’s search results:
This feature is in pilot, and you may not see rich results for datasets yet. However, we recommend that you add dataset structured data to your site in preparation for new dataset features in Search results.
Every publisher should consider adding dataset markup in order to prepare for a wider rollout of this feature in Google’s search results.
Dataset Search
Dataset Search relies on Structured Data Metadata that uses the Schema.org/Dataset standard.
Google takes the structured data and links it to what it knows through the Knowledge Graph, as well as consider other ranking signals like links, then creates a dataset search index.
Duplicate Data Sets
Google indicates that it is partially relying on the Schema.org sameAS property. The same As property is meant to canonicalize the original publisher.
This is how the official Schema.org specifications for the sameAs property are defined:
“URL of a reference Web page that unambiguously indicates the item’s identity. E.g. the URL of the item’s Wikipedia page, Wikidata entry, or official website.”
In the context of dataset structured data, the sameAs property can be used to canonicalize the information to a specific URL which represents the original publisher of the data.
Additional signals that Google uses to identify duplicate datasets:
“Other signals include two datasets descriptions pointing to the same canonical page, having the same Digital Object Identifier (DOI), sharing links for downloading the dataset, or having a large overlap in other metadata fields.
None of these signals are perfect in isolation, therefore we combine them to get the strongest possible indication of when two datasets are the same.”
Google Knowledge Graph Scholar and Ranking
Google’s Knowledge Graph plays a role in the ranking of dataset information. The Knowledge Graph helps Google understand the context of the datasets, including understanding what language it’s for and for understanding acronyms.
Google’s Knowledge Graph provides a data layer for matching various entities. So if the dataset incorporates a brand, a currency or language it can match that to the dataset.
Here’s what Google’s AI Blog Post says about it:
“Google’s Knowledge Graph is a powerful platform that describes and links information about many entities, including the ones that appear in dataset metadata… This type of reconciliation opens up lots of possibilities to improve the search experience for users.
For instance, Dataset Search can localize results by showing reconciled values of metadata in the same language as the rest of the page. Additionally, it can rely on synonyms, correct misspellings, expand acronyms, or use other relations in the Knowledge Graph for query expansion.”
Google Scholar May be a Ranking Signal
According to Google, Google Scholar may provide a signal that a dataset is authoritative and who the author of the dataset is.
This can help a dataset publisher rank better. But it can also assist in combating scrapers from using someone else’s data.
Google’s official blog post describes it like this:
“Knowing which datasets are referenced and cited in publications serves at least two purposes:
1. It provides a valuable signal about the importance and prominence of a dataset.
2. It gives dataset authors an easy place to see citations to their data and to get credit.”
How Google Ranks Datasets
Google doesn’t have a lot of data to use for learning how users search for data. As a consequence Google is using regular ranking algorithms to rank datasets for users of it’s dataset search.
However, once Google has enough data on how people search it will begin developing a separate algorithm that is specifically tuned to dataset search.
That said, Google is using additional signals to better rank datasets:
“…ranking datasets is different from ranking Web pages, and we add some additional signals that take into account the metadata quality, citations, and so on.”
Optimize Your Datasets
Chances are that your site already has datasets. Now is the moment for marking them up with the appropriate Schema.org structured data.
The value of Google Datasets may be in people using Dataset Search to find your data and linking to your site. But another value is appearing in Google’s rich results for specific kinds of data. Better you than your competition, right?