Exploring schema.org with RDF and SPARQL

Pascal Heus
10 min read · Mar 27, 2023


I recently loaded the schema.org RDF definition into an OpenLink Virtuoso triple store to (1) support the development of a REST-based API around schema.org, (2) populate a JSON schema registry with schema.org objects, and (3) explore how to build a GraphQL-based API on top of the GraphQL/RDF bridge introduced in Virtuoso late last year (see Native GraphQL Support in Virtuoso — The Basics).

During my research, I wrote a few simple SPARQL queries to examine the schema.org content. They may be useful to anyone curious about how the vocabulary is structured. This post outlines the most valuable ones and highlights a few insights I gained during the process.

If you are unfamiliar with schema.org, RDF, SPARQL, or OpenLink Virtuoso, a high-level overview is provided at the end of this article. Naturally, numerous resources are also available on the web.

What are the differences between the HTTP, HTTPS, and current flavors?

As I started this exercise, I learned that there were different flavors of schema.org in the GitHub repository: HTTP, HTTPS, and current. The main difference between the HTTPS and HTTP versions is the use of HTTPS URLs in the HTTPS version, which indicates that the resources are being served over a secure connection.

Prior to the HTTPS version, Schema.org used HTTP URLs to identify resources, which could potentially be intercepted or modified by third parties. The HTTPS version addresses this issue by using secure URLs to ensure the integrity of the data.

As far as I can tell, that is also the only difference: the two files appear to define the same types and properties, and only the scheme of the term URIs changes. The flavor you load therefore mainly determines which namespace you use in your queries.

The HTTPS version of Schema.org was introduced with the release of version 3.5 on March 31, 2020.

Based on what I see, the current versions simply seem to exclude the entries that have been deprecated (see Pending and Attic below).

How to load the vocabulary in a triple store?

I used the HTTPS flavor of version 15, which can be found in the schema.org GitHub repository under data/releases. Loading this in Virtuoso is pretty straightforward using the web-based Conductor management UI. Go to Linked Data → Quad Store Upload, point the resource URL to the raw Turtle file in GitHub, provide the name of the graph you want to load it into, and click Upload. This takes seconds.

Note that the queries in this article should work with most RDF triple stores, so feel free to use another server, such as Apache Jena, if you prefer.
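For stores that support SPARQL 1.1 Update, the same load can in principle be scripted instead of going through a UI. The sketch below rests on two assumptions that are not part of the steps above: the raw file URL is my best guess at the path of the version 15 HTTPS file under data/releases, and the graph name urn:example:schemaorg:15 is purely illustrative. The server must also be configured to allow update requests and remote fetches.

```sparql
# Assumed URL of the raw Turtle file; verify the exact path in the GitHub repository.
# The target graph IRI is a hypothetical name; use whatever graph name you prefer.
LOAD <https://raw.githubusercontent.com/schemaorg/schemaorg/main/data/releases/15.0/schemaorg-all-https.ttl>
INTO GRAPH <urn:example:schemaorg:15>
```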

Which namespaces are used by schema.org?

Schema.org uses the following namespaces and prefixes, which you may need when writing your SPARQL queries:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcmitype: <http://purl.org/dc/dcmitype/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
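Because the flavor you loaded determines whether the schema: prefix should point to http://schema.org/ or https://schema.org/, a quick sanity check can save some confusion when queries unexpectedly return nothing. The following is a minimal sketch that groups classes by the scheme of their IRI; the ?flavor labels are just illustrative strings.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Count classes by the scheme of their IRI to see which flavor is loaded.
SELECT ?flavor (COUNT(?class) AS ?count)
WHERE {
  ?class a rdfs:Class .
  BIND(
    IF(STRSTARTS(STR(?class), "https://schema.org/"), "https",
      IF(STRSTARTS(STR(?class), "http://schema.org/"), "http", "other"))
    AS ?flavor)
}
GROUP BY ?flavor
```

If most classes fall under "http", declare the schema: prefix with http://schema.org/ in the queries that follow.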

What are the Pending and Attic Collections?

One thing I noticed is that the full RDF version contains additional properties that are either not visible or highlighted differently when you browse resources on the schema.org public website. There are two such collections (which are described on the schema.org documentation page):

  • pending: a staging area for new schema.org terms that are under discussion and review. These items are shown in blue instead of red on the website.
  • attic: a special area where terms are archived when deprecated from the core and other sections or removed from pending as not accepted into the full vocabulary.

This is indicated on a resource by the schema:isPartOf property:

schema:isPartOf <https://pending.schema.org> ;
schema:isPartOf <https://attic.schema.org> ;

In version 15, 12 entries are in the attic, and 718 are in pending. There is no particular way to tell how long these have been in that state (looking at GitHub history may be an option).

The following query lists the deprecated entries that are part of the attic collection (12 in v15):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
select ?s ?type ?label where {
?s rdf:type ?type;
rdfs:label ?label;
schema:isPartOf <https://attic.schema.org>.
}

And to count the candidate entries in the pending collection (718 in v15):

PREFIX schema: <https://schema.org/>
# Standard SPARQL 1.1 requires the aggregate to be parenthesized and aliased
SELECT (COUNT(*) AS ?count) WHERE {
  ?s schema:isPartOf <https://pending.schema.org> .
}

Note that schema:isPartOf is also used for associating classes and properties with other namespaces or standards.
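To see which collections and external namespaces are actually referenced this way, one option is to group the schema:isPartOf values and count the entries under each. This is a small sketch, not a query from my original exploration:

```sparql
PREFIX schema: <https://schema.org/>

# List every schema:isPartOf target with the number of terms that point to it.
SELECT ?collection (COUNT(?s) AS ?count)
WHERE {
  ?s schema:isPartOf ?collection .
}
GROUP BY ?collection
ORDER BY DESC(?count)
```

The pending and attic collections should appear in the result alongside any other values in use.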

How many classes?

A schema.org object type is represented in RDF by an rdfs:Class. There are 896 classes in the HTTPS flavor of version 15, as shown by this query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
# Parenthesized form so the query also runs on stores other than Virtuoso
SELECT (COUNT(?class) AS ?count)
WHERE {
  ?class a rdfs:Class .
}

Amongst these, 154 are in the pending collection, 3 are in the attic collection, and 16 have been superseded.
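One way to obtain this kind of breakdown in a single query is to label each class with a status and group on it. The sketch below is an assumption about how such counts could be produced, and the status strings are arbitrary labels; a class that is, for example, both in the attic and superseded is counted once under each status.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

# Count classes per status; the labels are illustrative, not schema.org terms.
SELECT ?status (COUNT(DISTINCT ?class) AS ?count)
WHERE {
  ?class a rdfs:Class .
  {
    ?class schema:isPartOf <https://pending.schema.org> .
    BIND("pending" AS ?status)
  } UNION {
    ?class schema:isPartOf <https://attic.schema.org> .
    BIND("attic" AS ?status)
  } UNION {
    ?class schema:supersededBy ?replacement .
    BIND("superseded" AS ?status)
  }
}
GROUP BY ?status
```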

How to quickly search for a Class?

To perform a quick search of a class by name, you can use the query below, in this case, matching the term ‘data’ in the label:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
select ?s ?label where {
  ?s rdf:type rdfs:Class;
     rdfs:label ?label.
  # regex matches anywhere in the string, so no surrounding .* is needed; "i" ignores case
  FILTER(regex(?label, "data", "i")).
  # skip deprecated (attic) and superseded classes
  FILTER(NOT EXISTS {?s schema:isPartOf <https://attic.schema.org>.} ).
  FILTER(NOT EXISTS {?s schema:supersededBy ?any.} ).
}

SPARQL provides various options for text search, regular expressions being a versatile one. Other characteristics of the class, such as rdfs:comment (which carries the description) or the names of its properties, could be used as well.

But for more advanced text search capabilities, rather than trying to do this in SPARQL, I would recommend indexing the content in a server such as Apache Solr or Elasticsearch.
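For simple cases that do not justify an external index, the rdfs:comment variant mentioned above could look like the following sketch. It uses the standard CONTAINS and LCASE functions for a case-insensitive substring match; the search term "dataset" is only an example.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

# Find classes whose description mentions a term, ignoring case.
SELECT ?s ?label ?comment
WHERE {
  ?s a rdfs:Class ;
     rdfs:label ?label ;
     rdfs:comment ?comment .
  FILTER(CONTAINS(LCASE(STR(?comment)), "dataset"))
  FILTER NOT EXISTS { ?s schema:isPartOf <https://attic.schema.org> }
}
ORDER BY ?label
```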

Class Definition

A specific class definition can be retrieved with this simple query:

PREFIX schema: <https://schema.org/>
select ?p ?o where {
schema:Dataset ?p ?o.
}

which in this case shows that Dataset is defined as:

Note that Dataset is a subclass of CreativeWork.
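The full chain of parent classes can be retrieved with a SPARQL 1.1 property path. This is a minimal sketch: the + operator follows rdfs:subClassOf one or more times, so it returns the direct parent as well as every ancestor above it.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

# All ancestors of Dataset, however many levels up.
SELECT ?ancestor
WHERE {
  schema:Dataset rdfs:subClassOf+ ?ancestor .
}
```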

What are all the classes’ properties?

The list and count of all properties associated with all schema.org classes can be found using the following query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
# Standard SPARQL 1.1 form: aliased aggregate plus an explicit GROUP BY
SELECT ?property (COUNT(?property) AS ?count)
WHERE {
  ?class rdf:type rdfs:Class.
  ?class ?property ?value.
}
GROUP BY ?property
ORDER BY desc(?count)

Which returns:

These properties are used as follows:

How to find the values of a class’ properties?

The following query lists the properties of a specific class:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
select ?s ?label ?part where {
?s a rdf:Property;
rdfs:label ?label;
schema:domainIncludes schema:Dataset.
OPTIONAL {?s schema:isPartOf ?part}
FILTER(NOT EXISTS {?s schema:isPartOf <https://attic.schema.org>.} ).
FILTER(NOT EXISTS {?s schema:supersededBy ?any.} ).
}
ORDER BY ?label

which for Dataset returns:

The query filters exclude the properties that are part of the attic (deprecated) or have been superseded. This reflects the behavior of the schema.org public Dataset page, where the pending properties are shown in blue.
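To also see what kind of value each property expects, the same pattern can be extended with schema:rangeIncludes, the counterpart of schema:domainIncludes. The following is a sketch along those lines; a property that accepts several types produces one row per type.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

# Properties of Dataset together with their expected value types.
SELECT ?property ?label ?range
WHERE {
  ?property a rdf:Property ;
            rdfs:label ?label ;
            schema:domainIncludes schema:Dataset .
  OPTIONAL { ?property schema:rangeIncludes ?range }
  FILTER NOT EXISTS { ?property schema:isPartOf <https://attic.schema.org> }
  FILTER NOT EXISTS { ?property schema:supersededBy ?any }
}
ORDER BY ?label ?range
```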

Inherited Properties

Note that the above query does not resolve properties inherited from the parent class. These can be found by following the rdfs:subClassOf property value. We can basically run the same query for CreativeWork, Dataset’s parent.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
select ?s ?label ?part where {
?s a rdf:Property;
rdfs:label ?label;
schema:domainIncludes schema:CreativeWork.
OPTIONAL {?s schema:isPartOf ?part}
FILTER(NOT EXISTS {?s schema:isPartOf <https://attic.schema.org>.} ).
}
ORDER BY ?label

which returns a much longer list of properties as partially illustrated here:

…more…
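Rather than running one query per parent, the own and inherited properties can probably be collected in one pass with a property path. In this sketch, rdfs:subClassOf* matches Dataset itself plus all of its ancestors, and the ?definedOn variable (a name I chose for illustration) reports which class each property comes from.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

# Own and inherited properties of Dataset, with the class that declares each one.
SELECT ?property ?label ?definedOn
WHERE {
  schema:Dataset rdfs:subClassOf* ?definedOn .
  ?property a rdf:Property ;
            rdfs:label ?label ;
            schema:domainIncludes ?definedOn .
  FILTER NOT EXISTS { ?property schema:isPartOf <https://attic.schema.org> }
  FILTER NOT EXISTS { ?property schema:supersededBy ?any }
}
ORDER BY ?definedOn ?label
```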

Conclusions

There are naturally many SPARQL queries we can come up with. The above are simple ones to help you get started. I hope you find this helpful. Let me know the interesting ones you write or come across. As I expect to keep exploring this approach, watch for future articles on this topic.

Annex

What is schema.org?

Schema.org is a collaborative, community-driven project that provides a standard vocabulary for describing structured data on the internet. It was launched in 2011 by Google, Microsoft, Yahoo, and Yandex, with the aim of creating a common language for search engines to better understand and interpret web content.

The vocabulary provided by Schema.org includes a set of schemas, or types, that describe entities, actions, and relationships between entities. These schemas are expressed using the semantic web technologies RDFa, Microdata, and JSON-LD, and can be used by web developers and content creators to mark up their web pages in a way that is recognized and understood by search engines.

By using Schema.org markup, webmasters can improve the visibility of their web pages in search engine results, as well as provide richer and more detailed information to users. In addition, the use of structured data can also enable new features and functionalities in search engines, such as rich snippets, knowledge graphs, and voice search.

What is RDF?

RDF (Resource Description Framework) is a standard framework for modeling and describing resources on the web. It is a data model that represents information in the form of subject-predicate-object statements, also known as “triples”. These triples form the building blocks of RDF data, which can be used to describe any kind of resource, such as people, places, organizations, events, and more.

In RDF, each triple consists of a subject, which is the resource being described, a predicate, which is the property or relationship being asserted about the resource, and an object, which is the value of the property or the resource being related to.
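As a concrete illustration using the vocabulary explored in this article, the statement "Dataset is a subclass of CreativeWork" is a single triple: schema:Dataset is the subject, rdfs:subClassOf the predicate, and schema:CreativeWork the object. A SPARQL ASK query, sketched below, simply checks whether that triple exists in the store.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

# subject          predicate        object
ASK {
  schema:Dataset   rdfs:subClassOf  schema:CreativeWork .
}
```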

RDF provides a flexible and extensible framework for representing data, as it allows users to define their own vocabularies and ontologies to describe resources in a specific domain. It also supports the integration of data from multiple sources, as it provides a standard format for exchanging data between different systems.

RDF is a fundamental technology for the Semantic Web, which is an extension of the web that enables the sharing and reuse of data across applications and systems. It is widely used in applications such as data integration, knowledge management, and semantic search, and is supported by a range of tools and technologies, including SPARQL, OWL, and RDFa.

What is SPARQL?

SPARQL (pronounced “sparkle”) is a query language for the Resource Description Framework (RDF), a framework for modeling and describing resources on the web. SPARQL is used to query RDF data sources, which can include data from multiple sources, such as databases, spreadsheets, and web pages.

SPARQL allows users to query RDF data using a syntax that is similar to SQL, the standard query language for relational databases. SPARQL queries can retrieve data based on specific patterns of triples (subject-predicate-object statements) in the RDF data, as well as filter, sort, and group the results.
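The sketch below puts pattern matching, grouping, filtering on an aggregate, and sorting together in one query against the schema.org vocabulary. The threshold of 10 properties and the limit of 10 rows are arbitrary values chosen for illustration.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <https://schema.org/>

# Classes that directly declare the most properties.
SELECT ?class (COUNT(?property) AS ?propertyCount)
WHERE {
  ?property a rdf:Property ;
            schema:domainIncludes ?class .
}
GROUP BY ?class
HAVING (COUNT(?property) >= 10)
ORDER BY DESC(?propertyCount)
LIMIT 10
```

Readers familiar with SQL will recognize the SELECT, GROUP BY, HAVING, ORDER BY, and LIMIT clauses; the triple patterns in the WHERE block are what is specific to RDF.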

SPARQL is designed to be flexible and powerful, allowing users to perform complex queries that span multiple data sources and data models. It includes features such as graph patterns, aggregation functions, and support for regular expressions, as well as the ability to query remote SPARQL endpoints and integrate with other web technologies such as RDFa and Linked Data.

Overall, SPARQL is a key technology for working with RDF data and for building applications that rely on the semantic web. It provides a powerful tool for querying and analyzing RDF data, and has become an important standard in the world of linked data and semantic web technologies.

What is OpenLink Virtuoso?

OpenLink Virtuoso is a powerful, multi-model database management system that provides support for both SQL and RDF-based data management. It was developed by OpenLink Software, a technology company based in the United States.

Virtuoso is a highly scalable system that can handle large volumes of data and a wide range of data models, including relational, XML, JSON, RDF, and geospatial data. It provides a unified platform for managing data across these different models, allowing users to perform queries and analysis across all of their data sources.

In addition to its database capabilities, Virtuoso also includes a number of advanced features such as support for SPARQL, the RDF query language, and support for Web Services and Linked Data. It also includes built-in support for a variety of data integration and management tasks, such as data federation, data virtualization, and data replication.

Overall, OpenLink Virtuoso is a comprehensive and versatile database management system that can help organizations to better manage and leverage their data assets across multiple data models and formats. Both open source and commercial editions of the platform are available.


Pascal Heus

Data Lead at Postman Open Technologies. Information Technologist / Data Engineer / Metadata Expert. Interests in Gamification, Quantum Physics, Astrophysics.