Status: Draft
Version: 1.0, Feb 03
Author: Leo Meyer, meyer@software-services.de
The notions expressed here are not meant to be universally valid. They are intended to provide a basic "working philosophy" for the understanding and creation of an Associator. I have tried to keep them as simple and free of errors as possible. Please feel free to send feedback and comments.
Data. Data is anything which can be stored by a computer.
Information. Information is data which has been classified by an intelligent agent as such.
Knowledge. Knowledge is information retained by intelligent agents in the form of thoughts.
Thought. A thought is an idea which can be expressed in a language.
Note: You may, for the time being, read "human" instead of "intelligent agent".
"Data" is something "given" (lat. datum). More specifically, data is something which can be stored by a computer, provided that an adequate mathematical abstraction can be found. For example, data gathered by the measuring instruments of a missile to control that missile's flight path is data in this sense. It is, however, not necessary that this data is ever assessed by or even made available to humans. This constitutes the basic difference to information.
"Information" is something which has been derived from data by
humans. The data collected by the missile definitely contains information about
its flight path. However, this data has to be interpreted by a human to become
information; otherwise it is just data which is processed by computers. The
nature of data usually does not allow any conjecture as to what the information derived
from it might be. A sequence of 150 zeroes, for example, may well convey
information to a certain person, such as the number 150; or the mere fact that
the numbers are present might be information to somebody, while somebody else
might feel completely indifferent to it.
Deriving information from data requires specific context knowledge. Without it,
data cannot be interpreted successfully. If the context knowledge does not match the data,
false or no information will be derived.
If we accept this, a computer never contains information. It contains data which is turned into information by humans
who assign importance to it after having perceived it.
This information exists in the human brain in the form of concepts. As
soon as a human communicates it via a certain channel, be it speech, written
text, images, music, etc., this information again becomes data. Another human must
have a certain minimum of context knowledge to successfully turn this data into
conceptual information in their own mind.
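The role of context knowledge described above can be loosely illustrated in code. The following is only an analogy, not part of the original argument: a program stands in for the interpreting agent, and a format string stands in for its context knowledge. The byte values and variable names are assumptions chosen for illustration.

```python
import struct

# Four bytes of raw data, as they might arrive over some channel.
raw = bytes([0x00, 0x00, 0x80, 0x3F])

# Matching context: the receiver knows this is a little-endian 32-bit float.
as_float = struct.unpack("<f", raw)[0]

# Mismatched context: the receiver assumes a little-endian unsigned integer
# and derives a false value from the very same bytes.
as_int = struct.unpack("<I", raw)[0]

# Insufficient context: the receiver expects eight bytes and derives nothing.
try:
    as_double = struct.unpack("<d", raw)[0]
except struct.error:
    as_double = None

print(as_float, as_int, as_double)  # 1.0 1065353216 None
```

The same data yields a correct value, a false value, or nothing at all, depending solely on what the receiver assumes about it.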
"Knowledge" is information which has become a part of the memory of humans and which allows them to regulate their thoughts or actions. If they cannot think or act according to something they learnt, the learnt information may not be classified as knowledge. In this case, it is not much more than data. Sadly, it is the nature of knowledge to slip out of memory and become forgotten.
The internet consists entirely of data and the various means of exchanging it. It does not per se, in the sense stated above, contain information or knowledge. The amount of information on the internet to which I, for instance, assign importance is considerably smaller than the amount of data. Furthermore, it varies as the amount of data or my way of classifying it changes. And, according to the definition, the internet does not contain knowledge at all.
Searching for information on the internet can become a very tedious task. As most search engines offer full-text search only, many documents containing unwanted information will turn up. As the computer does not know the difference between the information you would like to see and the information it displays, it cannot come up with something that might save you time and nerves.
An obvious way to overcome this problem is to store metadata about the data, namely, a kind of context knowledge which will enable the computer to find the information you would like to have.
One serious attempt to tackle this problem is the Semantic Web approach. A quick outline of its features:
1. You can include annotation data in a web page using RDF (Resource
Description Framework - an XML application).
2. Annotation data consists of references to concepts which are organized
hierarchically in an ontology.
3. Inference engines can combine metadata information and text search to produce
results more exact than mere full text search.
For example, consider this document, which I want to annotate. Important information about this document includes its name, version, author, etc., which I usually give at the beginning of the document, e.g.:
Data, Knowledge and Information
Status: Draft
Version: 1.0, Feb 03
Author: Leo Meyer, meyer@software-services.de
To annotate this document, I would use an ontology like this (simplified example):
SimpleOntology:
root_concept ()
    Person (name, address)
        Author (name, address, publications)
    Publication (name, author, date)
        WebPage (name, author, date, status, url)
Each line names a concept, followed by its attributes in brackets. Sub-concepts inherit the attributes of their parent concepts: Author extends Person, and WebPage extends Publication. A web page can therefore be treated as a publication, which means it can appear in the list of publications of an author. For readers familiar with object-oriented programming languages: concepts correspond to classes and their attributes to fields, and concepts support polymorphism; a concrete person, for example, is called an instance of the concept Person.
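The object-oriented reading of the ontology can be sketched in Python. This is a minimal illustration, assuming that Author is a sub-concept of Person and WebPage a sub-concept of Publication, as the attribute lists suggest; the class names mirror the concepts, and everything else (the use of dataclasses, the field types) is an assumption made for the example.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class RootConcept:
    """root_concept (): carries no attributes of its own."""


@dataclass
class Person(RootConcept):
    name: str
    address: str


@dataclass
class Author(Person):
    # Inherits name and address from Person.
    publications: List["Publication"] = field(default_factory=list)


@dataclass
class Publication(RootConcept):
    name: str
    author: str
    date: str


@dataclass
class WebPage(Publication):
    # Inherits name, author and date from Publication.
    status: str
    url: str


# Instances of the concepts, using the values from this document's header.
page = WebPage(
    name="Data, Knowledge and Information",
    author="Leo Meyer",
    date="Feb 03",
    status="Draft",
    url="http://www.software-services.de/etc...",
)
leo = Author(
    name="Leo Meyer",
    address="meyer@software-services.de",
    publications=[page],  # a WebPage is accepted wherever a Publication is
)

print(isinstance(page, Publication))      # True
print([f.name for f in fields(WebPage)])  # ['name', 'author', 'date', 'status', 'url']
```

The inherited fields appear first in the list, which corresponds to the way each sub-concept in the ontology repeats its parent's attributes before adding its own.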
In the example above, I would annotate my web page as follows (simplified example):
<annotation ontology="SimpleOntology">
  <Author id="Leo Meyer" name="Leo Meyer" address="meyer@software-services.de" publications="Data, Knowledge and Information"/>
  <WebPage id="Data, Knowledge and Information" name="Data, Knowledge and Information" date="Feb 03" author="Leo Meyer" url="http://www.software-services.de/etc..."/>
</annotation>
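This annotation data can be stored anywhere in the web page. It provides a search engine with enough information to know, for example, that Leo Meyer is an author, that he can be reached at meyer@software-services.de, and that the web page "Data, Knowledge and Information" is one of his publications.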
In addition, I could define logical relationships in the ontology to specify rules which must hold for the elements I have described.
Now I can issue fairly complex queries to the search engine, for example: "Give me the URLs of all publications of Leo Meyer". This would be a vast improvement over present-day full-text search.
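How such a query might be answered from the annotation data can be sketched in a few lines of Python. This is a toy illustration rather than a real Semantic Web inference engine: the function name publication_urls and the set PUBLICATION_CONCEPTS are hypothetical, and the annotation is reproduced here with self-closing elements so that a standard XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# The simplified annotation from the example, written as well-formed XML.
ANNOTATION = """
<annotation ontology="SimpleOntology">
  <Author id="Leo Meyer" name="Leo Meyer"
          address="meyer@software-services.de"
          publications="Data, Knowledge and Information"/>
  <WebPage id="Data, Knowledge and Information"
           name="Data, Knowledge and Information"
           date="Feb 03" author="Leo Meyer"
           url="http://www.software-services.de/etc..."/>
</annotation>
"""

# Concepts that count as publications in SimpleOntology
# (a web page can be treated as a publication).
PUBLICATION_CONCEPTS = {"Publication", "WebPage"}


def publication_urls(annotation_xml, author_name):
    """Return the URLs of all annotated publications by the given author."""
    root = ET.fromstring(annotation_xml)
    return [
        element.get("url")
        for element in root
        if element.tag in PUBLICATION_CONCEPTS
        and element.get("author") == author_name
        and element.get("url") is not None
    ]


# "Give me the URLs of all publications of Leo Meyer"
print(publication_urls(ANNOTATION, "Leo Meyer"))
# ['http://www.software-services.de/etc...']
```

Unlike full-text search, the lookup matches on the author attribute of a publication concept, so a page that merely mentions the name "Leo Meyer" in its text would not be returned.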