| ------ |
| Apache Any23 - CSV Extractor Algorithm |
| ------ |
| The Apache Software Foundation |
| ------ |
| 2011-2012 |
| |
| ~~ Licensed to the Apache Software Foundation (ASF) under one or more |
| ~~ contributor license agreements. See the NOTICE file distributed with |
| ~~ this work for additional information regarding copyright ownership. |
| ~~ The ASF licenses this file to You under the Apache License, Version 2.0 |
| ~~ (the "License"); you may not use this file except in compliance with |
| ~~ the License. You may obtain a copy of the License at |
| ~~ |
| ~~ http://www.apache.org/licenses/LICENSE-2.0 |
| ~~ |
| ~~ Unless required by applicable law or agreed to in writing, software |
| ~~ distributed under the License is distributed on an "AS IS" BASIS, |
| ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ~~ See the License for the specific language governing permissions and |
| ~~ limitations under the License. |
| |
| CSV Extractor Algorithm |
| |
| The {{{./apidocs/org/apache/any23/extractor/csv/CSVExtractor.html}CSV Extractor}} produces |
| an RDF representation of a CSV file compliant with the {{{http://www.ietf.org/rfc/rfc4180.txt}RFC 4180}} |
| and that foresees an header. |
| Such extractor relies on the presence of an header to use the named fields as RDF properties. |
| Field delimiter could be automatically guessed or specified via {{{./configuration.html}Apache Any23 Configuration}}. |
| |
| Given a document with URL <url>, <<Apache Any23>> uses the following algorithm to extract RDF: |
| |
| * It tries to guess the fields delimiter and to detect the header |
| |
| * for each field <name>: |
| |
| * if <name> is a valid IRI keep it as an IRI since could be derefenceable. |
| |
| * if <name> is not a valid IRI, the associated RDF Property IRI <propUri> |
| will be in the form of: <url> concatenated <name> |
| |
| * add label statement: <propUri> rdfs:label <name> |
| |
| * add column index statement: <propUri> \<http://vocab.sindice.net/csv/rowPosition\> <index> |
| |
| * for each <row>: |
| |
| * add RDFS type statement: \<url/row/<index>\> rdfs:type \<http://vocab.sindice.net/csv/Row\>, |
| where <index> is the column index number. |
| |
| * for each <cell> value: |
| |
| * write statement, \<url/row/\<index\>\> <propUri> <cell> where: |
| <cell> could be an IRI if the cell value is an IRI, or a typed literal |
| according the value of the CSV actual value <cell>. |
| |
| * add RDF statements claiming number of rows and columns. |
| |
| For example, given this trivial CSV with an header and just two rows: |
| |
| +--------------------------------------------------------------- |
| first name; last name; http://xmlns.com/foaf/0.1/knows; age |
| Davide; Palmisano; http://michelemostarda.com; 30; value should not appear |
| Michele; Mostarda; http://g1o.net; |
| +--------------------------------------------------------------- |
| |
| the following RDF (serialized in RDF/XML) is produced: |
| |
| +--------------------------------------------------------------- |
| <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> |
| |
| <rdf:Description rdf:about="http://bob.example.com/firstName"> |
| <label xmlns="http://www.w3.org/2000/01/rdf-schema#">first name</label> |
| <columnPosition xmlns="http://vocab.sindice.net/csv/" |
| rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">0</columnPosition> |
| </rdf:Description> |
| |
| <rdf:Description rdf:about="http://bob.example.com/lastName"> |
| <label xmlns="http://www.w3.org/2000/01/rdf-schema#">last name</label> |
| <columnPosition xmlns="http://vocab.sindice.net/csv/" |
| rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1</columnPosition> |
| </rdf:Description> |
| |
| <rdf:Description rdf:about="http://xmlns.com/foaf/0.1/knows"> |
| <columnPosition xmlns="http://vocab.sindice.net/csv/" |
| rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2</columnPosition> |
| </rdf:Description> |
| |
| <rdf:Description rdf:about="http://bob.example.com/age"> |
| <label xmlns="http://www.w3.org/2000/01/rdf-schema#">age</label> |
| <columnPosition xmlns="http://vocab.sindice.net/csv/" |
| rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">3</columnPosition> |
| </rdf:Description> |
| |
| <rdf:Description rdf:about="http://bob.example.com/row/0"> |
| <rdf:type rdf:resource="http://vocab.sindice.net/csv/Row"/> |
| <firstName xmlns="http://bob.example.com/" |
| rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Davide</firstName> |
| <lastName xmlns="http://bob.example.com/" |
| rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Palmisano</lastName> |
| <knows xmlns="http://xmlns.com/foaf/0.1/" |
| rdf:resource="http://michelemostarda.com"/ |
| <age xmlns="http://bob.example.com/" |
| rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">30</age> |
| </rdf:Description> |
| |
| <rdf:Description rdf:about="http://bob.example.com/"> |
| <row xmlns="http://vocab.sindice.net/csv/" rdf:resource="http://bob.example.com/row/0"/> |
| </rdf:Description> |
| |
| <rdf:Description rdf:about="http://bob.example.com/row/0"> |
| <rowPosition xmlns="http://vocab.sindice.net/csv/">0</rowPosition> |
| </rdf:Description> |
| |
| <rdf:Description rdf:about="http://bob.example.com/row/1"> |
| <rdf:type rdf:resource="http://vocab.sindice.net/csv/Row"/> |
| <firstName xmlns="http://bob.example.com/" |
| rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Michele</firstName> |
| <lastName xmlns="http://bob.example.com/" |
| rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Mostarda</lastName> |
| <knows xmlns="http://xmlns.com/foaf/0.1/" rdf:resource="http://g1o.net" /> |
| </rdf:Description> |
| |
| <rdf:Description rdf:about="http://bob.example.com/"> |
| <row xmlns="http://vocab.sindice.net/csv/" |
| rdf:resource="http://bob.example.com/row/1"/> |
| </rdf:Description> |
| |
| <rdf:Description rdf:about="http://bob.example.com/row/1"> |
| <rowPosition xmlns="http://vocab.sindice.net/csv/">1</rowPosition> |
| </rdf:Description> |
| |
| <rdf:Description rdf:about="http://bob.example.com/"> |
| <numberOfRows xmlns="http://vocab.sindice.net/csv/" |
| rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2</numberOfRows> |
| <numberOfColumns xmlns="http://vocab.sindice.net/csv/" |
| rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">4</numberOfColumns> |
| </rdf:Description> |
| |
| </rdf:RDF> |
| +--------------------------------------------------------------- |