src/site/apt/dev-csv-extractor.apt - any23 - Git at Google

                                     ------
                                     Apache Any23 - CSV Extractor Algorithm
                                     ------
                               The Apache Software Foundation
                                     ------
                                      2011-2012

 ~~  Licensed to the Apache Software Foundation (ASF) under one or more
 ~~  contributor license agreements.  See the NOTICE file distributed with
 ~~  this work for additional information regarding copyright ownership.
 ~~  The ASF licenses this file to You under the Apache License, Version 2.0
 ~~  (the "License"); you may not use this file except in compliance with
 ~~  the License.  You may obtain a copy of the License at
 ~~
 ~~     http://www.apache.org/licenses/LICENSE-2.0
 ~~
 ~~  Unless required by applicable law or agreed to in writing, software
 ~~  distributed under the License is distributed on an "AS IS" BASIS,
 ~~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 ~~  See the License for the specific language governing permissions and
 ~~  limitations under the License.

 CSV Extractor Algorithm

   The {{{./apidocs/org/apache/any23/extractor/csv/CSVExtractor.html}CSV Extractor}} produces
   an RDF representation of a CSV file compliant with the {{{http://www.ietf.org/rfc/rfc4180.txt}RFC 4180}}
   and that foresees an header.
   Such extractor relies on the presence of an header to use the named fields as RDF properties.
   Field delimiter could be automatically guessed or specified via {{{./configuration.html}Apache Any23 Configuration}}.

   Given a document with URL <url>, <<Apache Any23>> uses the following algorithm to extract RDF:

    * It tries to guess the fields delimiter and to detect the header

    * for each field <name>:

     * if <name> is a valid IRI keep it as an IRI since could be derefenceable.

     * if <name> is not a valid IRI, the associated RDF Property IRI <propUri>
       will be in the form of: <url> concatenated <name>

     * add label statement: <propUri> rdfs:label <name>

     * add column index statement: <propUri> \<http://vocab.sindice.net/csv/rowPosition\> <index>

    * for each <row>:

     * add RDFS type statement: \<url/row/<index>\> rdfs:type \<http://vocab.sindice.net/csv/Row\>,
     where <index> is the column index number.

     * for each <cell> value:

       * write statement, \<url/row/\<index\>\> <propUri> <cell> where:
          <cell> could be an IRI if the cell value is an IRI, or a typed literal
          according the value of the CSV actual value <cell>.

    * add RDF statements claiming number of rows and columns.

   For example, given this trivial CSV with an header and just two rows:

 +---------------------------------------------------------------
 first name; last name; http://xmlns.com/foaf/0.1/knows; age
 Davide; Palmisano; http://michelemostarda.com; 30; value should not appear
 Michele; Mostarda; http://g1o.net;
 +---------------------------------------------------------------

   the following RDF (serialized in RDF/XML) is produced:

 +---------------------------------------------------------------
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

   <rdf:Description rdf:about="http://bob.example.com/firstName">
     <label xmlns="http://www.w3.org/2000/01/rdf-schema#">first name</label>
     <columnPosition xmlns="http://vocab.sindice.net/csv/"
     rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">0</columnPosition>
   </rdf:Description>

   <rdf:Description rdf:about="http://bob.example.com/lastName">
     <label xmlns="http://www.w3.org/2000/01/rdf-schema#">last name</label>
     <columnPosition xmlns="http://vocab.sindice.net/csv/"
     rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1</columnPosition>
   </rdf:Description>

   <rdf:Description rdf:about="http://xmlns.com/foaf/0.1/knows">
     <columnPosition xmlns="http://vocab.sindice.net/csv/"
     rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2</columnPosition>
   </rdf:Description>

   <rdf:Description rdf:about="http://bob.example.com/age">
     <label xmlns="http://www.w3.org/2000/01/rdf-schema#">age</label>
     <columnPosition xmlns="http://vocab.sindice.net/csv/"
     rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">3</columnPosition>
   </rdf:Description>

   <rdf:Description rdf:about="http://bob.example.com/row/0">
     <rdf:type rdf:resource="http://vocab.sindice.net/csv/Row"/>
     <firstName xmlns="http://bob.example.com/"
     rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Davide</firstName>
     <lastName xmlns="http://bob.example.com/"
     rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Palmisano</lastName>
     <knows xmlns="http://xmlns.com/foaf/0.1/"
     rdf:resource="http://michelemostarda.com"/
     <age xmlns="http://bob.example.com/"
     rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">30</age>
   </rdf:Description>

   <rdf:Description rdf:about="http://bob.example.com/">
     <row xmlns="http://vocab.sindice.net/csv/" rdf:resource="http://bob.example.com/row/0"/>
   </rdf:Description>

   <rdf:Description rdf:about="http://bob.example.com/row/0">
     <rowPosition xmlns="http://vocab.sindice.net/csv/">0</rowPosition>
   </rdf:Description>

   <rdf:Description rdf:about="http://bob.example.com/row/1">
     <rdf:type rdf:resource="http://vocab.sindice.net/csv/Row"/>
     <firstName xmlns="http://bob.example.com/"
     rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Michele</firstName>
     <lastName xmlns="http://bob.example.com/"
     rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Mostarda</lastName>
     <knows xmlns="http://xmlns.com/foaf/0.1/" rdf:resource="http://g1o.net" />
   </rdf:Description>

   <rdf:Description rdf:about="http://bob.example.com/">
     <row xmlns="http://vocab.sindice.net/csv/"
     rdf:resource="http://bob.example.com/row/1"/>
   </rdf:Description>

   <rdf:Description rdf:about="http://bob.example.com/row/1">
     <rowPosition xmlns="http://vocab.sindice.net/csv/">1</rowPosition>
   </rdf:Description>

   <rdf:Description rdf:about="http://bob.example.com/">
     <numberOfRows xmlns="http://vocab.sindice.net/csv/"
     rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2</numberOfRows>
     <numberOfColumns xmlns="http://vocab.sindice.net/csv/"
     rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">4</numberOfColumns>
   </rdf:Description>

 </rdf:RDF>
 +---------------------------------------------------------------
	------
	Apache Any23 - CSV Extractor Algorithm
	------
	The Apache Software Foundation
	------
	2011-2012

	~~ Licensed to the Apache Software Foundation (ASF) under one or more
	~~ contributor license agreements. See the NOTICE file distributed with
	~~ this work for additional information regarding copyright ownership.
	~~ The ASF licenses this file to You under the Apache License, Version 2.0
	~~ (the "License"); you may not use this file except in compliance with
	~~ the License. You may obtain a copy of the License at
	~~
	~~ http://www.apache.org/licenses/LICENSE-2.0
	~~
	~~ Unless required by applicable law or agreed to in writing, software
	~~ distributed under the License is distributed on an "AS IS" BASIS,
	~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	~~ See the License for the specific language governing permissions and
	~~ limitations under the License.

	CSV Extractor Algorithm

	The {{{./apidocs/org/apache/any23/extractor/csv/CSVExtractor.html}CSV Extractor}} produces
	an RDF representation of a CSV file compliant with the {{{http://www.ietf.org/rfc/rfc4180.txt}RFC 4180}}
	and that foresees an header.
	Such extractor relies on the presence of an header to use the named fields as RDF properties.
	Field delimiter could be automatically guessed or specified via {{{./configuration.html}Apache Any23 Configuration}}.

	Given a document with URL <url>, <<Apache Any23>> uses the following algorithm to extract RDF:

	* It tries to guess the fields delimiter and to detect the header

	* for each field <name>:

	* if <name> is a valid IRI keep it as an IRI since could be derefenceable.

	* if <name> is not a valid IRI, the associated RDF Property IRI <propUri>
	will be in the form of: <url> concatenated <name>

	* add label statement: <propUri> rdfs:label <name>

	* add column index statement: <propUri> \<http://vocab.sindice.net/csv/rowPosition\> <index>

	* for each <row>:

	* add RDFS type statement: \<url/row/<index>\> rdfs:type \<http://vocab.sindice.net/csv/Row\>,
	where <index> is the column index number.

	* for each <cell> value:

	* write statement, \<url/row/\<index\>\> <propUri> <cell> where:
	<cell> could be an IRI if the cell value is an IRI, or a typed literal
	according the value of the CSV actual value <cell>.

	* add RDF statements claiming number of rows and columns.

	For example, given this trivial CSV with an header and just two rows:

	+---------------------------------------------------------------
	first name; last name; http://xmlns.com/foaf/0.1/knows; age
	Davide; Palmisano; http://michelemostarda.com; 30; value should not appear
	Michele; Mostarda; http://g1o.net;
	+---------------------------------------------------------------

	the following RDF (serialized in RDF/XML) is produced:

	+---------------------------------------------------------------
	<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

	<rdf:Description rdf:about="http://bob.example.com/firstName">
	<label xmlns="http://www.w3.org/2000/01/rdf-schema#">first name</label>
	<columnPosition xmlns="http://vocab.sindice.net/csv/"
	rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">0</columnPosition>
	</rdf:Description>

	<rdf:Description rdf:about="http://bob.example.com/lastName">
	<label xmlns="http://www.w3.org/2000/01/rdf-schema#">last name</label>
	<columnPosition xmlns="http://vocab.sindice.net/csv/"
	rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1</columnPosition>
	</rdf:Description>

	<rdf:Description rdf:about="http://xmlns.com/foaf/0.1/knows">
	<columnPosition xmlns="http://vocab.sindice.net/csv/"
	rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2</columnPosition>
	</rdf:Description>

	<rdf:Description rdf:about="http://bob.example.com/age">
	<label xmlns="http://www.w3.org/2000/01/rdf-schema#">age</label>
	<columnPosition xmlns="http://vocab.sindice.net/csv/"
	rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">3</columnPosition>
	</rdf:Description>

	<rdf:Description rdf:about="http://bob.example.com/row/0">
	<rdf:type rdf:resource="http://vocab.sindice.net/csv/Row"/>
	<firstName xmlns="http://bob.example.com/"
	rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Davide</firstName>
	<lastName xmlns="http://bob.example.com/"
	rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Palmisano</lastName>
	<knows xmlns="http://xmlns.com/foaf/0.1/"
	rdf:resource="http://michelemostarda.com"/
	<age xmlns="http://bob.example.com/"
	rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">30</age>
	</rdf:Description>

	<rdf:Description rdf:about="http://bob.example.com/">
	<row xmlns="http://vocab.sindice.net/csv/" rdf:resource="http://bob.example.com/row/0"/>
	</rdf:Description>

	<rdf:Description rdf:about="http://bob.example.com/row/0">
	<rowPosition xmlns="http://vocab.sindice.net/csv/">0</rowPosition>
	</rdf:Description>

	<rdf:Description rdf:about="http://bob.example.com/row/1">
	<rdf:type rdf:resource="http://vocab.sindice.net/csv/Row"/>
	<firstName xmlns="http://bob.example.com/"
	rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Michele</firstName>
	<lastName xmlns="http://bob.example.com/"
	rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Mostarda</lastName>
	<knows xmlns="http://xmlns.com/foaf/0.1/" rdf:resource="http://g1o.net" />
	</rdf:Description>

	<rdf:Description rdf:about="http://bob.example.com/">
	<row xmlns="http://vocab.sindice.net/csv/"
	rdf:resource="http://bob.example.com/row/1"/>
	</rdf:Description>

	<rdf:Description rdf:about="http://bob.example.com/row/1">
	<rowPosition xmlns="http://vocab.sindice.net/csv/">1</rowPosition>
	</rdf:Description>

	<rdf:Description rdf:about="http://bob.example.com/">
	<numberOfRows xmlns="http://vocab.sindice.net/csv/"
	rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2</numberOfRows>
	<numberOfColumns xmlns="http://vocab.sindice.net/csv/"
	rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">4</numberOfColumns>
	</rdf:Description>

	</rdf:RDF>
	+---------------------------------------------------------------