Semantic Web and LOD

Emerging Technology

The concept of the semantic web goes back to Tim Berners-Lee /tbl/ and is being standardized and developed by the W3C /w3c/. The semantic web is an emerging technology that links data from various sources through the use of URIs. Today’s websites use a similar principle when they link to other websites. In the semantic web, however, data are not linked through web pages but through so-called triplets (usually called triples in the W3C specifications), connected via URIs. A triplet always contains three elements in the form of a simple sentence: subject – verb/predicate – object. Everything that appears in a triplet is data or metadata.

To describe a complex object, for example a resource “X”, a finite set of triplets is used whose size is not fixed in advance; all triplets in the set share the same subject “X”.
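As a minimal sketch of this idea, the triplets describing a resource can be modelled as plain Python tuples. The URIs, property names and values below are invented for illustration and do not come from any real vocabulary.

```python
# Each triplet is a (subject, predicate, object) tuple.
# All URIs and values below are hypothetical examples.
X = "http://example.org/resource/X"

triplets = [
    (X, "http://example.org/prop/name", "Pump P-101"),
    (X, "http://example.org/prop/manufacturer", "http://example.org/resource/AcmeCorp"),
    (X, "http://example.org/prop/weightKg", 42.5),
]


def describe(subject, triplets):
    """Collect all properties of one subject as predicate -> list of values."""
    description = {}
    for s, p, o in triplets:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


description = describe(X, triplets)
for predicate, values in description.items():
    print(predicate, "->", values)
```

Because the description is just a set of triplets, a further property of “X” can be added at any time without changing a schema.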

Each property of “X” is thereby defined through one triplet: “X” is given a property (verb/predicate) and a value (object). [Triplets were extended to so-called quadruples (“quads”) as early as 2008. The fourth element of a quad is the context in which the triplet statement is valid /quads/.] Resources such as “X” can themselves be linked through triplets, which creates a semantic graph: each node represents a subject (or an object), and the edges that link the nodes denote the properties (verb/predicate); see /skos/ for an example. A semantic graph can contain thousands of concepts linked in this way.

Data in the semantic web is available as graphs and is normally public. Such public, freely available data in the form of semantic graphs is very easy to navigate. It provides precise information and relationships between pieces of information, and it greatly reduces search time. Following the common acronym, we speak of “LOD” (linked open data) /lod1/ or of the “LOD cloud”. At the end of 2011, the LOD cloud was estimated to comprise 30 billion triplets and approximately 500 million links between semantic graphs /lod2/.

A resource can therefore be defined through a number of triplets that is not known in advance, and these triplets need not all be in the same semantic graph. The resource is consequently described in a semi-structured manner, i.e. with a structure that can vary in both time and space. LOD technology requires standard formats such as RDF; see /rdf/.
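The following sketch illustrates, under simplifying assumptions, how quads form a navigable graph and how one resource can be described in more than one graph. The "ex:" names are hypothetical abbreviations for URIs, and the helper functions are illustrative only, not part of any RDF library.

```python
from collections import defaultdict

# Quads: (subject, predicate, object, context). All names are hypothetical.
quads = [
    ("ex:Pump", "ex:broader", "ex:Machine", "ex:graph/catalog"),
    ("ex:Machine", "ex:broader", "ex:Equipment", "ex:graph/catalog"),
    ("ex:Pump", "ex:suppliedBy", "ex:AcmeCorp", "ex:graph/suppliers"),
]


def build_graph(quads):
    """Index the edges by subject node: node -> list of (predicate, object, context)."""
    graph = defaultdict(list)
    for s, p, o, ctx in quads:
        graph[s].append((p, o, ctx))
    return graph


def follow(graph, start, predicate):
    """Walk from start along edges with the given predicate; return the nodes reached."""
    path, node, seen = [], start, {start}
    while True:
        targets = [o for p, o, _ in graph.get(node, []) if p == predicate]
        if not targets or targets[0] in seen:  # dead end or cycle
            return path
        node = targets[0]
        seen.add(node)
        path.append(node)


def contexts_of(graph, node):
    """Return the set of graphs (contexts) in which a node is described."""
    return {ctx for _, _, ctx in graph.get(node, [])}


graph = build_graph(quads)
print(follow(graph, "ex:Pump", "ex:broader"))
print(sorted(contexts_of(graph, "ex:Pump")))
```

Here "ex:Pump" is described in two different graphs, which mirrors the point that the triplets of a resource need not all live in the same semantic graph.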

Why semantic web?
The semantic web is not only a representation of LOD; it comprises a series of technologies that apply logical operators to LOD representations and thereby make it possible to discover (validate, infer /inf/) “new” (intensional, implicit) data. So-called “reasoners” (computer programs that can process semantic web or LOD representation formats) can find data within semantic graphs quickly and reliably, where a simple search query would otherwise deliver thousands of results. Examples of larger-scale use of semantic web technologies can be found in /wea/ and /swx/. Complex problems can be resolved quickly and efficiently with this approach. Semantic web technologies (precise representation and logical processing) are therefore indispensable for scientific companies because:
1.) company data can be linked precisely yet flexibly (semi-structured),
2.) knowledge units (standards, practices, rights…) about company processes can be modeled modularly, flexibly and sustainably (knowledge management /knr/),
3.) company data on supply and delivery can be linked precisely to one another so that a homogeneous, easily navigable representation of data is possible. Search, reporting and response times are therefore reduced drastically.
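The kind of inference a reasoner performs can be illustrated with a toy example. The sketch below applies two simple rules repeatedly until nothing new is found. It is a deliberately simplified stand-in for a real reasoner; the predicate names "type" and "subClassOf" and all class names are assumptions chosen for illustration.

```python
def infer(explicit):
    """Derive implicit triplets with two rules, applied until nothing new appears:
    1. subClassOf is transitive;
    2. an instance of a class is also an instance of every superclass.
    Returns only the newly inferred triplets.
    """
    facts = set(explicit)
    while True:
        new = set()
        for s1, p1, o1 in facts:
            for s2, p2, o2 in facts:
                # Chain a fact ending in class o1 with "o1 subClassOf o2".
                if o1 == s2 and p2 == "subClassOf" and p1 in ("subClassOf", "type"):
                    new.add((s1, p1, o2))
        if new <= facts:  # fixed point reached
            return facts - set(explicit)
        facts |= new


# Hypothetical explicit data: only three statements are stored.
explicit = {
    ("CentrifugalPump", "subClassOf", "Pump"),
    ("Pump", "subClassOf", "Machine"),
    ("P-101", "type", "CentrifugalPump"),
}

inferred = infer(explicit)
for triplet in sorted(inferred):
    print(triplet)
```

The statement that "P-101" is a "Machine" was never stored explicitly; it is implicit data made visible by the rules.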

Company knowledge can be provided in ontologies – but how?
Domain-specific knowledge (standards, practices, processes, rights, empirical results) is usually held for decades in the form of structured documents. Expert knowledge in particular should be transferred appropriately into LOD form, in which its specific behavior patterns are documented. The resulting document, a semantic subgraph, is known as an “ontology”. An ontology is a systematic representation of the knowledge within a domain, expressed through subjects and objects. Different branches use and share ontologies. Because ontologies describe specialized facts, they can be considered a special case of semantic graphs. Geneticists, for example, have been using ontologies for quite some time; see e.g. /gno/.

How do you get to the LOD cloud?
Led by the need to build a sustainable, clearly structured and easily navigable information platform, you want to “LODify” part of your company data in order to be able to:
1.) carry out precise searches in an easy-to-use way,
2.) gain or maintain an overview of your data,
3.) link company data with other data precisely and flexibly.

An as-is analysis first determines the type and form of your data.
The resulting specification defines the target services and how they are to be delivered. The major steps of the LODifying process are:
a) careful selection of vocabularies to describe the data;
b) mapping of the data in question onto semantic graphs; this step creates the triplets, which can be hosted within your company;
c) development of the specific applications that are necessary for your data processing.
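Steps a) and b) can be sketched as follows, assuming the company data is available as simple records. The namespace, the vocabulary mapping and the field names are hypothetical; in practice an established vocabulary would be chosen in step a).

```python
BASE = "http://example.org/company/"  # hypothetical company namespace

# Step a): vocabulary chosen to describe the data (hypothetical mapping
# from record field names to predicate URIs).
VOCAB = {
    "name": BASE + "vocab/name",
    "supplier": BASE + "vocab/suppliedBy",
}


def rows_to_triplets(rows, id_field="id"):
    """Step b): map each record onto triplets that share one subject URI."""
    triplets = []
    for row in rows:
        subject = BASE + "item/" + str(row[id_field])
        for field, value in row.items():
            if field == id_field or value in (None, ""):
                continue  # the id becomes the subject; empty values are skipped
            if field not in VOCAB:
                continue  # fields without a vocabulary term are not mapped
            triplets.append((subject, VOCAB[field], value))
    return triplets


rows = [
    {"id": 1, "name": "Pump", "supplier": "Acme", "internal_note": "check seal"},
    {"id": 2, "name": "Valve", "supplier": ""},
]

triplets = rows_to_triplets(rows)
for t in triplets:
    print(t)
```

Fields that have no term in the chosen vocabulary are silently skipped in this sketch; a real mapping would normally report them so that the vocabulary can be extended.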

To operate the semantic graphs as LOD, appropriate scaling measures are implemented to ensure that applications remain fast.

Does my company have to make all data public?
Although LOD in principle calls for and supports publication, it is limited by legal and competition-law constraints. A company need not, and should not, publish all of its LOD data. It can nevertheless profit from using the associated semantic web technologies internally and with trusted partners.
Since LOD by its nature requires web technology, this data is accessed through suitable URIs in the LOD repository. These URIs are subject to a suitable rights model that protects the web area in question and thereby keeps the data non-public. With today’s technology, access to a semantic graph, whether for internal or controlled use, can therefore be suitably protected. If participating users or market forces call for certain areas (e.g. suppliers) to be combined, a semantic graph can simply be opened up by granting authorization through the same web technology.
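One possible way to picture such a rights model is an access list per graph, as in the sketch below. This is an assumption made for illustration: the roles, graph names and the functions visible_triplets and grant are hypothetical, and a production setup would enforce access at the web server or repository level.

```python
# Hypothetical access list: graph (context) -> roles allowed to read it.
ACL = {
    "ex:graph/public": {"anonymous", "partner", "staff"},
    "ex:graph/suppliers": {"partner", "staff"},
    "ex:graph/internal": {"staff"},
}

# Quads: (subject, predicate, object, graph). All names are hypothetical.
quads = [
    ("ex:Pump", "ex:name", "Pump P-101", "ex:graph/public"),
    ("ex:Pump", "ex:suppliedBy", "ex:AcmeCorp", "ex:graph/suppliers"),
    ("ex:Pump", "ex:purchasePrice", 1200, "ex:graph/internal"),
]


def visible_triplets(quads, role):
    """Return the triplets a role may read; graphs missing from the ACL are denied."""
    return [(s, p, o) for s, p, o, g in quads if role in ACL.get(g, set())]


def grant(graph, role):
    """Open a graph to an additional role."""
    ACL.setdefault(graph, set()).add(role)


public_view = visible_triplets(quads, "anonymous")
partner_view = visible_triplets(quads, "partner")
staff_view = visible_triplets(quads, "staff")
print(len(public_view), len(partner_view), len(staff_view))
```

Opening the supplier graph to a wider audience then amounts to a single call such as grant("ex:graph/suppliers", "anonymous"), without touching the data itself.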

What is the “price” of LODifying?
To ensure precise and flexible processing of LOD with semantic web technology, the data must be built logically and maintained according to standards. Future use of company data requires proper data maintenance. The knowledge engineer /seng/ therefore plays a very important role in the company.

What are the risks of LODifying?
We see the fundamental risk of LODifying in LOD data becoming inconsistent over time. If the metadata portion is not properly maintained, the associated semantic graph becomes out of date, shows “logical holes”, or contains inappropriate concepts. LOD organized according to semantic web technology follows a logical structure, similar to a special bill of materials. An error in the structure can therefore lead to large-scale losses when the data is queried. A further risk lies in logically weak modeling of the data. In that case, linking or attaching LOD data that becomes available in the future is only partially possible, if at all. Data chunks that are LODified too weakly will require re-modeling, and this can also apply to entire semantic subgraphs.
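A simple maintenance check can make some of these problems visible early. The sketch below looks for two symptoms under stated assumptions: predicates that are not part of the agreed vocabulary, and resources that are referenced but never described (one possible reading of a “logical hole”). The "res:" and "voc:" prefixes and all names are hypothetical.

```python
def find_issues(triplets, vocabulary, resource_prefix="res:"):
    """Return (unknown_predicates, dangling_resources) for a list of triplets."""
    subjects = {s for s, _, _ in triplets}
    # Predicates that are used but not defined in the agreed vocabulary.
    unknown_predicates = sorted({p for _, p, _ in triplets if p not in vocabulary})
    # Objects that look like resources but never occur as a subject.
    dangling = sorted({
        o for _, _, o in triplets
        if isinstance(o, str) and o.startswith(resource_prefix) and o not in subjects
    })
    return unknown_predicates, dangling


vocabulary = {"voc:partOf", "voc:name", "voc:suppliedBy"}

triplets = [
    ("res:P-101", "voc:partOf", "res:Line-3"),
    ("res:Line-3", "voc:name", "Line 3"),
    ("res:P-101", "voc:suppliedBy", "res:AcmeCorp"),  # AcmeCorp is never described
    ("res:P-101", "voc:colour", "blue"),              # predicate not in vocabulary
]

unknown_predicates, dangling = find_issues(triplets, vocabulary)
print("unknown predicates:", unknown_predicates)
print("dangling resources:", dangling)
```

A check of this kind does not replace careful modeling, but running it regularly helps a knowledge engineer notice when the graph and its metadata drift apart.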

How do LOD clouds behave with regard to data archiving?
According to the principles of records management, data are either destroyed or archived, depending on whether their record class must be retained for legal, structural or time-related reasons. Suitable reports enable the archiving of LOD content. Archiving converts the required data into a readable and durable format. Archived company data no longer needs to be connected to the LOD cloud.

How does Semweb support me during LODifying?
Semweb assists you through every step of LODifying – including support and operation of LODified data.