Merge pull request #90 from mikewalch/export-rewrite
Export queue rewrite and refactored code to use 'uri'
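
The new `IndexClient` helpers in this change build row keys such as `t:<encoded count>:<uri>`, where the count is reverse-encoded so that URIs with the most inbound links sort first in a scan. Below is a minimal sketch of that idea only. It assumes a simplified fixed-width hex encoding of `Long.MAX_VALUE - count` instead of the `ReverseLexicoder`/`ULongLexicoder` pair the change actually uses, so the keys it produces are not byte-compatible with the real index. The class `ReverseRankSketch` and its methods `revEncode` and `totalRow` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: a simplified stand-in for the reverse-encoded row keys.
final class ReverseRankSketch {

  // Assumption: fixed-width hex of (Long.MAX_VALUE - count), NOT the Accumulo lexicoder.
  // Larger counts map to smaller strings, so they sort earlier lexicographically.
  static String revEncode(long count) {
    if (count < 0) {
      throw new IllegalArgumentException("count must be non-negative: " + count);
    }
    return String.format("%016x", Long.MAX_VALUE - count);
  }

  // Mirrors the shape of the "t:" rows in the diff: prefix, encoded count, then URI.
  static String totalRow(String uri, long linksTo) {
    return "t:" + revEncode(linksTo) + ":" + uri;
  }

  public static void main(String[] args) {
    List<String> rows = new ArrayList<>();
    rows.add(totalRow("com.a>>o>/", 3));
    rows.add(totalRow("com.b>>o>/", 250));
    rows.add(totalRow("com.c>>o>/", 17));

    // A plain lexicographic sort now lists the most-linked URI first.
    Collections.sort(rows);
    rows.forEach(System.out::println);
  }
}
```

Because the ordering lives in the row key, a reader can fetch the top-ranked URIs with a plain forward range scan and no client-side sorting.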
diff --git a/docs/code-guide.md b/docs/code-guide.md
index 5dab8ab..93f800a 100644
--- a/docs/code-guide.md
+++ b/docs/code-guide.md
@@ -23,81 +23,63 @@
### Page Loader
-This loader queues updated page content for processing by the page observer.
-
-**Code:** [PageLoader.java][PageLoader]
+The [PageLoader] queues updated page content for processing by the [PageObserver].
### Page Observer
-This observer computes changes to links within a page by comparing new and
-current pages. It computes links added and deleted and then pushes this
-information to the URI Map Observer and Page Exporter.
+The [PageObserver] computes changes to links within a page by comparing new and current pages. It
+computes links added and deleted and then pushes this information to the [UriMap] observer and
+[IndexExporter].
-Conceptually when a page references a new URI, a `+1` is queued up for the Uri
-Map. When a page no longer references a URI, a `-1` is queued up for the Uri
-Map to process.
-
-**Code:** [PageObserver.java][PageObserver]
+Conceptually when a page references a new URI, a `+1` is queued up for the [UriMap]. When a page no
+longer references a URI, a `-1` is queued up for the [UriMap] to process.
### URI Map Observer
-This observer computes per URI reference counts. The code for this this
-observer is very simple because it builds on the Collision Free Map Recipe. A
-Collision Free Map has two extension points and this example implements both.
-The first extension point is a combiner that processes the `+1` and `-1`
-updates queued up by the Page Observer. The second extension point is an
-update observer that handles changes in reference counts for a URI. It pushes
-these changes in reference counts to the Domain Map and URI Exporter.
+The [UriMap] observer computes per URI reference counts. The code for this observer is very
+simple because it builds on the Collision Free Map Recipe. A Collision Free Map has two extension
+points and this example implements both. The first extension point is a combiner that processes the
+`+1` and `-1` updates queued up by the [PageObserver]. The second extension point is an update
+observer that handles changes in reference counts for a URI. It pushes these changes in reference
+counts to the [DomainMap] and [IndexExporter].
-Changes to URI reference counts are aggregated per domain and `+1` and `-1`
-updates are queued for the domain map.
-
-**Code:** [UriMap.java][UriMap]
+Changes to URI reference counts are aggregated per domain and `+1` and `-1` updates are queued for
+the [DomainMap].
### Domain Map Observer
-This observer computers per domain reference counts. This is a Collision Free
-Map that tracks per domain information. When its notified that domain counts
-changed, it pushes updates to the export queue to update the Query table.
+The [DomainMap] observer computes per domain reference counts. This is a Collision Free Map that
+tracks per domain information. When it's notified that domain counts changed, it queues updates for
+the [IndexExporter] to update the Query table.
-**Code:** [DomainMap.java][DomainMap]
+### IndexExporter
-### Page Exporter
+The [IndexExporter] is an implementation of the 'AccumuloExporter' recipe. It makes updates to
+Accumulo using [IndexUpdate] objects that are placed on a shared 'ExportQueue' by the
+[PageObserver], [UriMap], and [DomainMap].
-For each URI, the Query table contains the URIs that reference it. This export
-code keeps that information in the Query table up to date. One interesting
-concept this code uses is the concept of inversion on export. The
-complete inverted URI index is never built in Fluo, its only built in Query
-table.
+[IndexUpdate] is an interface that is implemented by the following classes:
-**Code:** [PageExport.java][PageExport]
+1. [DomainUpdate] - Updates information related to a domain (like page count).
-### URI Exporter
+2. [PageUpdate] - Updates information related to a page (like links being added or deleted).
-Previous observers calculated the total number of URIs that reference a URI.
-This export code is given the new and old URI reference counts. URI reference
-counts are indexed three different ways in the Query table. This export code
-updates all three places in the Query table.
+3. [UriUpdate] - Updates information related to a URI.
-This export code also uses the invert on export concept. The three indexes are
-never built in the Fluo table. Fluo only tracks the minimal amount of
-information needed to keep the three indexes current.
+When [IndexExporter] receives these objects, it translates the update to mutations using code in
+the [IndexClient].
-**Code:** [UriCountExport.java][UriCountExport]
-
-### Domain Exporter
-
-Export changes to the number of URIs referencing a domain to the Query table.
-
-**Code:** [DomainExport.java][DomainExport]
[PageLoader]: ../modules/data/src/main/java/webindex/data/fluo/PageLoader.java
[PageObserver]: ../modules/data/src/main/java/webindex/data/fluo/PageObserver.java
[UriMap]: ../modules/data/src/main/java/webindex/data/fluo/UriMap.java
[DomainMap]: ../modules/data/src/main/java/webindex/data/fluo/DomainMap.java
-[UriCountExport]: ../modules/data/src/main/java/webindex/data/fluo/UriCountExport.java
-[PageExport]: ../modules/data/src/main/java/webindex/data/fluo/PageExport.java
-[DomainExport]: ../modules/data/src/main/java/webindex/data/fluo/DomainExport.java
+[IndexExporter]: ../modules/data/src/main/java/webindex/data/fluo/IndexExporter.java
+[IndexUpdate]: ../modules/core/src/main/java/webindex/core/models/export/IndexUpdate.java
+[DomainUpdate]: ../modules/core/src/main/java/webindex/core/models/export/DomainUpdate.java
+[PageUpdate]: ../modules/core/src/main/java/webindex/core/models/export/PageUpdate.java
+[UriUpdate]: ../modules/core/src/main/java/webindex/core/models/export/UriUpdate.java
+[IndexClient]: ../modules/core/src/main/java/webindex/core/IndexClient.java
[qt]: tables.md#query-table-schema
[tables]: tables.md
diff --git a/modules/core/pom.xml b/modules/core/pom.xml
index f402de6..25c7a63 100644
--- a/modules/core/pom.xml
+++ b/modules/core/pom.xml
@@ -51,9 +51,18 @@
<artifactId>accumulo-core</artifactId>
</dependency>
<dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-api</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.fluo</groupId>
+ <artifactId>fluo-recipes-accumulo</artifactId>
+ </dependency>
+ <dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
+ <!-- Test Dependencies -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
diff --git a/modules/core/src/main/java/webindex/core/Constants.java b/modules/core/src/main/java/webindex/core/Constants.java
index c41eee3..3597456 100644
--- a/modules/core/src/main/java/webindex/core/Constants.java
+++ b/modules/core/src/main/java/webindex/core/Constants.java
@@ -14,6 +14,10 @@
package webindex.core;
+import org.apache.fluo.api.data.Column;
+import org.apache.fluo.recipes.core.types.StringEncoder;
+import org.apache.fluo.recipes.core.types.TypeLayer;
+
public class Constants {
// Column Families
@@ -32,4 +36,12 @@
public static final String CUR = "cur";
// for domains
public static final String PAGECOUNT = "pagecount";
+
+ // Columns
+ public static final Column PAGE_NEW_COL = new Column(PAGE, NEW);
+ public static final Column PAGE_CUR_COL = new Column(PAGE, CUR);
+ public static final Column PAGE_INCOUNT_COL = new Column(PAGE, INCOUNT);
+ public static final Column PAGECOUNT_COL = new Column(DOMAIN, PAGECOUNT);
+
+ public static final TypeLayer TYPEL = new TypeLayer(new StringEncoder());
}
diff --git a/modules/core/src/main/java/webindex/core/IndexClient.java b/modules/core/src/main/java/webindex/core/IndexClient.java
index e28d7f8..ec7861f 100644
--- a/modules/core/src/main/java/webindex/core/IndexClient.java
+++ b/modules/core/src/main/java/webindex/core/IndexClient.java
@@ -14,6 +14,10 @@
package webindex.core;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
@@ -21,10 +25,19 @@
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
+import org.apache.accumulo.core.client.lexicoder.Lexicoder;
+import org.apache.accumulo.core.client.lexicoder.ReverseLexicoder;
+import org.apache.accumulo.core.client.lexicoder.ULongLexicoder;
import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
+import org.apache.commons.codec.binary.Hex;
+import org.apache.fluo.api.data.Bytes;
+import org.apache.fluo.api.data.Column;
+import org.apache.fluo.api.data.RowColumn;
+import org.apache.fluo.recipes.accumulo.export.AccumuloExporter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import webindex.core.models.DomainStats;
@@ -34,6 +47,10 @@
import webindex.core.models.Pages;
import webindex.core.models.TopResults;
import webindex.core.models.URL;
+import webindex.core.models.UriInfo;
+import webindex.core.models.export.DomainUpdate;
+import webindex.core.models.export.PageUpdate;
+import webindex.core.models.export.UriUpdate;
import webindex.core.util.Pager;
public class IndexClient {
@@ -62,7 +79,7 @@
if (entry.isNext()) {
results.setNext(row);
} else {
- String url = URL.fromPageID(row.split(":", 3)[2]).toString();
+ String url = URL.fromUri(row.split(":", 3)[2]).toString();
Long num = Long.parseLong(entry.getValue().toString());
results.addResult(url, num);
}
@@ -95,7 +112,7 @@
try {
Scanner scanner = conn.createScanner(accumuloIndexTable, Authorizations.EMPTY);
- scanner.setRange(Range.exact("p:" + url.toPageID(), Constants.PAGE));
+ scanner.setRange(Range.exact("p:" + url.toUri(), Constants.PAGE));
for (Map.Entry<Key, Value> entry : scanner) {
switch (entry.getKey().getColumnQualifier().toString()) {
case Constants.INCOUNT:
@@ -113,7 +130,7 @@
}
if (page == null) {
- page = new Page(url.toPageID());
+ page = new Page(url.toUri());
}
page.setNumInbound(incount);
return page;
@@ -154,8 +171,7 @@
pages.setNext(entry.getKey().getRowData().toString().split(":", 3)[2]);
} else {
String url =
- URL.fromPageID(entry.getKey().getRowData().toString().split(":", 4)[3])
- .toString();
+ URL.fromUri(entry.getKey().getRowData().toString().split(":", 4)[3]).toString();
Long count = Long.parseLong(entry.getValue().toString());
pages.addPage(url, count);
}
@@ -186,18 +202,18 @@
try {
Scanner scanner = conn.createScanner(accumuloIndexTable, Authorizations.EMPTY);
- String row = "p:" + url.toPageID();
+ String row = "p:" + url.toUri();
if (linkType.equals("in")) {
Page page = getPage(rawUrl);
String cf = Constants.INLINKS;
links.setTotal(page.getNumInbound());
Pager pager = Pager.build(scanner, Range.exact(row, cf), PAGE_SIZE, entry -> {
- String pageID = entry.getKey().getColumnQualifier().toString();
+ String uri = entry.getKey().getColumnQualifier().toString();
if (entry.isNext()) {
- links.setNext(pageID);
+ links.setNext(uri);
} else {
String anchorText = entry.getValue().toString();
- links.addLink(Link.of(pageID, anchorText));
+ links.addLink(Link.of(uri, anchorText));
}
});
if (next.isEmpty()) {
@@ -220,7 +236,7 @@
links.addLink(l);
add++;
} else {
- links.setNext(l.getPageID());
+ links.setNext(l.getUri());
break;
}
}
@@ -231,4 +247,79 @@
}
return links;
}
+
+ public static Collection<Mutation> genDomainMutations(DomainUpdate update, long seq) {
+ Map<RowColumn, Bytes> oldData = genDomainData(update.getDomain(), update.getOldPageCount());
+ Map<RowColumn, Bytes> newData = genDomainData(update.getDomain(), update.getNewPageCount());
+ return AccumuloExporter.generateMutations(seq, oldData, newData);
+ }
+
+ public static Map<RowColumn, Bytes> genDomainData(String domain, Long pageCount) {
+ if (pageCount == 0) {
+ return Collections.emptyMap();
+ }
+ return Collections.singletonMap(new RowColumn("d:" + domain, Constants.PAGECOUNT_COL),
+ Bytes.of(pageCount + ""));
+ }
+
+ public static Collection<Mutation> genPageMutations(PageUpdate update, long seq) {
+ int listSize = update.getAddedLinks().size() + update.getDeletedLinks().size() + 1;
+ ArrayList<Mutation> mutations = new ArrayList<>(listSize);
+
+ Mutation jsonMutation = new Mutation("p:" + update.getUri());
+ if (update.getJson().equals(Page.DELETE_JSON)) {
+ jsonMutation.putDelete(Constants.PAGE, Constants.CUR, seq);
+ } else {
+ jsonMutation.put(Constants.PAGE, Constants.CUR, seq, update.getJson());
+ }
+ mutations.add(jsonMutation);
+
+ // invert links on export
+ for (Link link : update.getAddedLinks()) {
+ Mutation m = new Mutation("p:" + link.getUri());
+ m.put(Constants.INLINKS, update.getUri(), seq, link.getAnchorText());
+ mutations.add(m);
+ }
+
+ for (Link link : update.getDeletedLinks()) {
+ Mutation m = new Mutation("p:" + link.getUri());
+ m.putDelete(Constants.INLINKS, update.getUri(), seq);
+ mutations.add(m);
+ }
+ return mutations;
+ }
+
+ public static Collection<Mutation> genUriMutations(UriUpdate update, long seq) {
+ Map<RowColumn, Bytes> oldData = genUriData(update.getUri(), update.getOldInfo());
+ Map<RowColumn, Bytes> newData = genUriData(update.getUri(), update.getNewInfo());
+ return AccumuloExporter.generateMutations(seq, oldData, newData);
+ }
+
+ public static Map<RowColumn, Bytes> genUriData(String uri, UriInfo info) {
+ if (info.equals(UriInfo.ZERO)) {
+ return Collections.emptyMap();
+ }
+
+ Map<RowColumn, Bytes> rcMap = new HashMap<>();
+ Bytes linksTo = Bytes.of("" + info.linksTo);
+ rcMap.put(new RowColumn(createTotalRow(uri, info.linksTo), Column.EMPTY), linksTo);
+ String domain = URL.fromUri(uri).getReverseDomain();
+ String domainRow = encodeDomainRankUri(domain, info.linksTo, uri);
+ rcMap.put(new RowColumn(domainRow, new Column(Constants.RANK, "")), linksTo);
+ rcMap.put(new RowColumn("p:" + uri, Constants.PAGE_INCOUNT_COL), linksTo);
+ return rcMap;
+ }
+
+ public static String revEncodeLong(Long num) {
+ Lexicoder<Long> lexicoder = new ReverseLexicoder<>(new ULongLexicoder());
+ return Hex.encodeHexString(lexicoder.encode(num));
+ }
+
+ public static String encodeDomainRankUri(String domain, long linksTo, String uri) {
+ return "d:" + domain + ":" + revEncodeLong(linksTo) + ":" + uri;
+ }
+
+ private static String createTotalRow(String uri, long curr) {
+ return "t:" + revEncodeLong(curr) + ":" + uri;
+ }
}
diff --git a/modules/core/src/main/java/webindex/core/models/Link.java b/modules/core/src/main/java/webindex/core/models/Link.java
index 6544224..5d8ed5e 100644
--- a/modules/core/src/main/java/webindex/core/models/Link.java
+++ b/modules/core/src/main/java/webindex/core/models/Link.java
@@ -22,16 +22,16 @@
private static final long serialVersionUID = 1L;
private String url;
- private String pageID;
+ private String uri;
private String anchorText;
public Link() {}
- public Link(String pageID, String anchorText) {
- Objects.requireNonNull(pageID);
+ public Link(String uri, String anchorText) {
+ Objects.requireNonNull(uri);
Objects.requireNonNull(anchorText);
- this.url = URL.fromPageID(pageID).toString();
- this.pageID = pageID;
+ this.url = URL.fromUri(uri).toString();
+ this.uri = uri;
this.anchorText = anchorText;
}
@@ -39,8 +39,8 @@
return url;
}
- public String getPageID() {
- return pageID;
+ public String getUri() {
+ return uri;
}
public String getAnchorText() {
@@ -48,27 +48,27 @@
}
- public static Link of(String pageID, String anchorText) {
- return new Link(pageID, anchorText);
+ public static Link of(String uri, String anchorText) {
+ return new Link(uri, anchorText);
}
- public static Link of(String pageID) {
- return new Link(pageID, "");
+ public static Link of(String uri) {
+ return new Link(uri, "");
}
public static Link of(URL url, String anchorText) {
- return new Link(url.toPageID(), anchorText);
+ return new Link(url.toUri(), anchorText);
}
public static Link of(URL url) {
- return new Link(url.toPageID(), "");
+ return new Link(url.toUri(), "");
}
@Override
public boolean equals(Object o) {
if (o instanceof Link) {
Link other = (Link) o;
- return url.equals(other.url) && pageID.equals(other.pageID);
+ return url.equals(other.url) && uri.equals(other.uri);
}
return false;
}
@@ -76,13 +76,13 @@
@Override
public int hashCode() {
int result = url.hashCode();
- result = 31 * result + pageID.hashCode();
+ result = 31 * result + uri.hashCode();
return result;
}
@Override
public int compareTo(Link o) {
- int c = pageID.compareTo(o.pageID);
+ int c = uri.compareTo(o.uri);
if (c == 0) {
c = url.compareTo(o.url);
}
diff --git a/modules/core/src/main/java/webindex/core/models/Page.java b/modules/core/src/main/java/webindex/core/models/Page.java
index 6bfb6dd..3f8117a 100644
--- a/modules/core/src/main/java/webindex/core/models/Page.java
+++ b/modules/core/src/main/java/webindex/core/models/Page.java
@@ -30,7 +30,7 @@
public static final String DELETE_JSON = "delete";
private String url;
- private String pageID;
+ private String uri;
private Long numInbound;
private Long numOutbound = 0L;
private String crawlDate;
@@ -47,10 +47,10 @@
this.isDelete = isDelete;
}
- public Page(String pageID) {
- Objects.requireNonNull(pageID);
- this.url = URL.fromPageID(pageID).toString();
- this.pageID = pageID;
+ public Page(String uri) {
+ Objects.requireNonNull(uri);
+ this.url = URL.fromUri(uri).toString();
+ this.uri = uri;
}
public String getServer() {
@@ -65,8 +65,8 @@
return url;
}
- public String getPageID() {
- return pageID;
+ public String getUri() {
+ return uri;
}
public Set<Link> getOutboundLinks() {
@@ -100,7 +100,7 @@
}
public String getDomain() {
- return URL.fromPageID(pageID).getDomain();
+ return URL.fromUri(uri).getDomain();
}
public Long getNumInbound() {
diff --git a/modules/core/src/main/java/webindex/core/models/URL.java b/modules/core/src/main/java/webindex/core/models/URL.java
index c090083..98722b6 100644
--- a/modules/core/src/main/java/webindex/core/models/URL.java
+++ b/modules/core/src/main/java/webindex/core/models/URL.java
@@ -32,7 +32,7 @@
private static final String URL_SEP_REGEX = "[/?#]";
private static final String HTTP_PROTO = "http://";
private static final String HTTPS_PROTO = "https://";
- private static final String PAGE_ID_SEP = ">";
+ private static final String URI_SEP = ">";
public static final InetAddressValidator validator = InetAddressValidator.getInstance();
private static final long serialVersionUID = 1L;
@@ -81,8 +81,8 @@
public static URL from(String rawUrl, Function<String, String> domainFromHost,
Function<String, Boolean> isValidHost) {
- if (rawUrl.contains(PAGE_ID_SEP)) {
- badUrl(false, "Skipping raw URL as it contains '" + PAGE_ID_SEP + "':" + rawUrl);
+ if (rawUrl.contains(URI_SEP)) {
+ badUrl(false, "Skipping raw URL as it contains '" + URI_SEP + "':" + rawUrl);
}
String trimUrl = rawUrl.trim();
@@ -227,21 +227,21 @@
return url.toString();
}
- public String toPageID() {
+ public String toUri() {
String reverseDomain = getReverseDomain();
String nonDomain = getReverseHost().substring(reverseDomain.length());
String portStr = "";
if ((!secure && port != 80) || (secure && port != 443)) {
portStr = Integer.toString(port);
}
- return reverseDomain + PAGE_ID_SEP + nonDomain + PAGE_ID_SEP + (secure ? "s" : "o") + portStr
- + PAGE_ID_SEP + path;
+ return reverseDomain + URI_SEP + nonDomain + URI_SEP + (secure ? "s" : "o") + portStr + URI_SEP
+ + path;
}
- public static URL fromPageID(String pageID) {
- String[] idArgs = pageID.split(PAGE_ID_SEP);
+ public static URL fromUri(String uri) {
+ String[] idArgs = uri.split(URI_SEP);
if (idArgs.length != 4) {
- throw new IllegalArgumentException("Page ID has too few or many parts: " + pageID);
+ throw new IllegalArgumentException("Page ID has too few or many parts: " + uri);
}
String domain = idArgs[0];
String host = idArgs[0] + idArgs[1];
@@ -257,7 +257,7 @@
port = 443;
} else if (!idArgs[2].startsWith("o")) {
throw new IllegalArgumentException("Page ID does not have port info beg with 's' or 'o': "
- + pageID);
+ + uri);
}
if (idArgs[2].length() > 1) {
port = Integer.parseInt(idArgs[2].substring(1));
diff --git a/modules/core/src/main/java/webindex/core/models/UriInfo.java b/modules/core/src/main/java/webindex/core/models/UriInfo.java
new file mode 100644
index 0000000..40637ff
--- /dev/null
+++ b/modules/core/src/main/java/webindex/core/models/UriInfo.java
@@ -0,0 +1,74 @@
+/*
+ * Copyright 2016 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.core.models;
+
+import java.io.Serializable;
+
+import com.google.common.base.Preconditions;
+
+/**
+ * Used by URI collision free map
+ */
+public class UriInfo implements Serializable {
+
+ private static final long serialVersionUID = 1L;
+
+ public static final UriInfo ZERO = new UriInfo(0, 0);
+
+ // the number of documents that link to this URI
+ public long linksTo;
+
+ // the number of documents with this URI. Should be 0 or 1
+ public int docs;
+
+ public UriInfo() {}
+
+ public UriInfo(long linksTo, int docs) {
+ this.linksTo = linksTo;
+ this.docs = docs;
+ }
+
+ public void add(UriInfo other) {
+ Preconditions.checkArgument(this != ZERO);
+ this.linksTo += other.linksTo;
+ this.docs += other.docs;
+ }
+
+ @Override
+ public String toString() {
+ return linksTo + " " + docs;
+ }
+
+ @Override
+ public boolean equals(Object o) {
+ if (o instanceof UriInfo) {
+ UriInfo oui = (UriInfo) o;
+ return linksTo == oui.linksTo && docs == oui.docs;
+ }
+ return false;
+ }
+
+ @Override
+ public int hashCode() {
+ return docs + (int) linksTo;
+ }
+
+ public static UriInfo merge(UriInfo u1, UriInfo u2) {
+ UriInfo total = new UriInfo(0, 0);
+ total.add(u1);
+ total.add(u2);
+ return total;
+ }
+}
diff --git a/modules/core/src/main/java/webindex/core/models/export/DomainUpdate.java b/modules/core/src/main/java/webindex/core/models/export/DomainUpdate.java
new file mode 100644
index 0000000..a3732ed
--- /dev/null
+++ b/modules/core/src/main/java/webindex/core/models/export/DomainUpdate.java
@@ -0,0 +1,45 @@
+/*
+ * Copyright 2016 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.core.models.export;
+
+/**
+ * Represents index updates for domain
+ */
+public class DomainUpdate implements IndexUpdate {
+
+ private String domain;
+ private Long oldPageCount;
+ private Long newPageCount;
+
+ public DomainUpdate() {} // For serialization
+
+ public DomainUpdate(String domain, Long oldPageCount, Long newPageCount) {
+ this.domain = domain;
+ this.oldPageCount = oldPageCount;
+ this.newPageCount = newPageCount;
+ }
+
+ public String getDomain() {
+ return domain;
+ }
+
+ public Long getOldPageCount() {
+ return oldPageCount;
+ }
+
+ public Long getNewPageCount() {
+ return newPageCount;
+ }
+}
diff --git a/modules/core/src/main/java/webindex/core/models/export/IndexUpdate.java b/modules/core/src/main/java/webindex/core/models/export/IndexUpdate.java
new file mode 100644
index 0000000..cec6e4f
--- /dev/null
+++ b/modules/core/src/main/java/webindex/core/models/export/IndexUpdate.java
@@ -0,0 +1,22 @@
+/*
+ * Copyright 2016 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.core.models.export;
+
+/**
+ * Marker interface implemented by all index update types
+ */
+public interface IndexUpdate {
+
+}
diff --git a/modules/core/src/main/java/webindex/core/models/export/PageUpdate.java b/modules/core/src/main/java/webindex/core/models/export/PageUpdate.java
new file mode 100644
index 0000000..f171ccb
--- /dev/null
+++ b/modules/core/src/main/java/webindex/core/models/export/PageUpdate.java
@@ -0,0 +1,55 @@
+/*
+ * Copyright 2016 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.core.models.export;
+
+import java.util.List;
+
+import webindex.core.models.Link;
+
+/**
+ * Represents index updates for pages
+ */
+public class PageUpdate implements IndexUpdate {
+
+ private String uri;
+ private String json;
+ private List<Link> addedLinks;
+ private List<Link> deletedLinks;
+
+ public PageUpdate() {} // For serialization
+
+ public PageUpdate(String uri, String json, List<Link> addedLinks, List<Link> deletedLinks) {
+ this.uri = uri;
+ this.json = json;
+ this.addedLinks = addedLinks;
+ this.deletedLinks = deletedLinks;
+ }
+
+ public String getUri() {
+ return uri;
+ }
+
+ public String getJson() {
+ return json;
+ }
+
+ public List<Link> getAddedLinks() {
+ return addedLinks;
+ }
+
+ public List<Link> getDeletedLinks() {
+ return deletedLinks;
+ }
+}
diff --git a/modules/core/src/main/java/webindex/core/models/export/UriUpdate.java b/modules/core/src/main/java/webindex/core/models/export/UriUpdate.java
new file mode 100644
index 0000000..0974c45
--- /dev/null
+++ b/modules/core/src/main/java/webindex/core/models/export/UriUpdate.java
@@ -0,0 +1,47 @@
+/*
+ * Copyright 2016 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.core.models.export;
+
+import webindex.core.models.UriInfo;
+
+/**
+ * Represents index updates for URIs
+ */
+public class UriUpdate implements IndexUpdate {
+
+ private String uri;
+ private UriInfo oldInfo;
+ private UriInfo newInfo;
+
+ public UriUpdate() {} // For serialization
+
+ public UriUpdate(String uri, UriInfo oldInfo, UriInfo newInfo) {
+ this.uri = uri;
+ this.oldInfo = oldInfo;
+ this.newInfo = newInfo;
+ }
+
+ public String getUri() {
+ return uri;
+ }
+
+ public UriInfo getOldInfo() {
+ return oldInfo;
+ }
+
+ public UriInfo getNewInfo() {
+ return newInfo;
+ }
+}
diff --git a/modules/core/src/test/java/webindex/core/models/LinkTest.java b/modules/core/src/test/java/webindex/core/models/LinkTest.java
index 59cb7c2..433e383 100644
--- a/modules/core/src/test/java/webindex/core/models/LinkTest.java
+++ b/modules/core/src/test/java/webindex/core/models/LinkTest.java
@@ -23,14 +23,14 @@
public void testBasic() {
Link link1 = Link.of("com.a>>o>/", "anchor text");
Assert.assertEquals("http://a.com/", link1.getUrl());
- Assert.assertEquals("com.a>>o>/", link1.getPageID());
+ Assert.assertEquals("com.a>>o>/", link1.getUri());
Assert.assertEquals("anchor text", link1.getAnchorText());
Link link2 = Link.of("com.a>>o>/", "other text");
Assert.assertEquals(link1, link2);
Link link3 = Link.of(URLTest.from("http://a.com"), "more other text");
- Assert.assertEquals("com.a>>o>/", link3.getPageID());
+ Assert.assertEquals("com.a>>o>/", link3.getUri());
Assert.assertEquals(link1, link3);
}
diff --git a/modules/core/src/test/java/webindex/core/models/PageTest.java b/modules/core/src/test/java/webindex/core/models/PageTest.java
index b252c3d..3ea7b2b 100644
--- a/modules/core/src/test/java/webindex/core/models/PageTest.java
+++ b/modules/core/src/test/java/webindex/core/models/PageTest.java
@@ -23,9 +23,9 @@
@Test
public void testBasic() {
- Page page = new Page(URLTest.from("http://example.com").toPageID());
+ Page page = new Page(URLTest.from("http://example.com").toUri());
Assert.assertEquals("http://example.com/", page.getUrl());
- Assert.assertEquals("com.example>>o>/", page.getPageID());
+ Assert.assertEquals("com.example>>o>/", page.getUri());
Assert.assertEquals(Long.valueOf(0), page.getNumOutbound());
Assert.assertTrue(page.addOutbound(Link.of(URLTest.from("http://test1.com"), "test1")));
Assert.assertEquals(Long.valueOf(1), page.getNumOutbound());
diff --git a/modules/core/src/test/java/webindex/core/models/URLTest.java b/modules/core/src/test/java/webindex/core/models/URLTest.java
index e0eb45d..1a9abb3 100644
--- a/modules/core/src/test/java/webindex/core/models/URLTest.java
+++ b/modules/core/src/test/java/webindex/core/models/URLTest.java
@@ -26,7 +26,7 @@
}
public static String toID(String rawUrl) {
- return from(rawUrl).toPageID();
+ return from(rawUrl).toUri();
}
@@ -124,24 +124,24 @@
public void testId() {
URL u1 = urlSecure("a.b.c.com", "/", 8329);
URL u2 = from("https://a.b.C.com:8329");
- String r1 = u2.toPageID();
+ String r1 = u2.toUri();
Assert.assertEquals("com.c>.b.a>s8329>/", r1);
- URL u3 = URL.fromPageID(r1);
+ URL u3 = URL.fromUri(r1);
Assert.assertEquals(u1, u2);
Assert.assertEquals(u1, u3);
Assert.assertEquals(u2, u3);
URL u4 = url80("d.com", "/a/b/c");
- String id4 = u4.toPageID();
+ String id4 = u4.toUri();
Assert.assertEquals("com.d>>o>/a/b/c", id4);
- Assert.assertEquals(u4, URL.fromPageID(id4));
+ Assert.assertEquals(u4, URL.fromUri(id4));
URL u5 = from("http://1.2.3.4/a/b/c");
- String id5 = u5.toPageID();
+ String id5 = u5.toUri();
Assert.assertEquals("1.2.3.4>>o>/a/b/c", id5);
- Assert.assertEquals(u5, URL.fromPageID(id5));
+ Assert.assertEquals(u5, URL.fromUri(id5));
- Assert.assertEquals("com.b>.a>s80>/", from("https://a.b.com:80").toPageID());
+ Assert.assertEquals("com.b>.a>s80>/", from("https://a.b.com:80").toUri());
}
@Test
@@ -205,9 +205,9 @@
Assert.assertEquals("au.com.d", from("http://www.d.com.au").getReverseDomain());
u = from("https://www.d.com.au:9443/a/bc");
- Assert.assertEquals("au.com.d>.www>s9443>/a/bc", u.toPageID());
+ Assert.assertEquals("au.com.d>.www>s9443>/a/bc", u.toUri());
Assert.assertEquals("https://www.d.com.au:9443/a/bc", u.toString());
- URL u2 = URL.fromPageID(u.toPageID());
+ URL u2 = URL.fromUri(u.toUri());
Assert.assertEquals("https://www.d.com.au:9443/a/bc", u2.toString());
Assert.assertEquals("d.com.au", u2.getDomain());
Assert.assertEquals("www.d.com.au", u2.getHost());
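Reviewer note: the assertions above imply that the string returned by `toUri()` has four `>`-separated fields: the reversed domain, the reversed host prefix, a scheme marker (`s` or `o`) followed by an optional port, and the path. The sketch below decodes that layout using only the standard library. It is inferred from the test expectations, not taken from `URL.fromUri`. The class and method names are hypothetical, IP-address hosts such as `1.2.3.4` are not handled, and treating any marker other than `s` as plain HTTP is an assumption.

```java
public class UriFormatSketch {

  // Reverses dot-separated labels, ignoring empty ones: "au.com.d" -> "d.com.au", ".www" -> "www".
  static String reverseLabels(String s) {
    String[] parts = s.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      if (parts[i].isEmpty()) {
        continue;
      }
      if (sb.length() > 0) {
        sb.append('.');
      }
      sb.append(parts[i]);
    }
    return sb.toString();
  }

  // Rebuilds a conventional URL string from the '>'-separated form seen in the tests.
  static String toUrlString(String uri) {
    String[] parts = uri.split(">", 4);
    if (parts.length != 4 || parts[2].isEmpty()) {
      throw new IllegalArgumentException("Expected four '>'-separated fields: " + uri);
    }
    String domain = reverseLabels(parts[0]);
    String prefix = reverseLabels(parts[1]);
    String host = prefix.isEmpty() ? domain : prefix + "." + domain;
    // Assumption: 's' marks https; anything else is treated as http.
    String scheme = parts[2].charAt(0) == 's' ? "https" : "http";
    String port = parts[2].substring(1);
    return scheme + "://" + host + (port.isEmpty() ? "" : ":" + port) + parts[3];
  }

  public static void main(String[] args) {
    System.out.println(toUrlString("au.com.d>.www>s9443>/a/bc"));
    System.out.println(toUrlString("com.d>>o>/a/b/c"));
  }
}
```

Keeping the reversed domain first means that all URIs of one domain sort next to each other, which is presumably why the row keys in this change are built from it.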
diff --git a/modules/data/src/main/java/webindex/data/CalcSplits.java b/modules/data/src/main/java/webindex/data/CalcSplits.java
index 422f83e..5f76624 100644
--- a/modules/data/src/main/java/webindex/data/CalcSplits.java
+++ b/modules/data/src/main/java/webindex/data/CalcSplits.java
@@ -28,7 +28,7 @@
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import webindex.core.models.Page;
-import webindex.data.fluo.UriMap.UriInfo;
+import webindex.core.models.UriInfo;
import webindex.data.spark.IndexEnv;
import webindex.data.spark.IndexStats;
import webindex.data.spark.IndexUtil;
diff --git a/modules/data/src/main/java/webindex/data/FluoApp.java b/modules/data/src/main/java/webindex/data/FluoApp.java
index 182b863..fafa640 100644
--- a/modules/data/src/main/java/webindex/data/FluoApp.java
+++ b/modules/data/src/main/java/webindex/data/FluoApp.java
@@ -15,13 +15,13 @@
package webindex.data;
import org.apache.fluo.api.config.FluoConfiguration;
-import org.apache.fluo.api.config.ObserverConfiguration;
-import org.apache.fluo.recipes.accumulo.export.AccumuloExport;
+import org.apache.fluo.api.config.ObserverSpecification;
import org.apache.fluo.recipes.accumulo.export.AccumuloExporter;
-import org.apache.fluo.recipes.accumulo.export.TableInfo;
import org.apache.fluo.recipes.core.export.ExportQueue;
import org.apache.fluo.recipes.kryo.KryoSimplerSerializer;
+import webindex.core.models.export.IndexUpdate;
import webindex.data.fluo.DomainMap;
+import webindex.data.fluo.IndexExporter;
import webindex.data.fluo.PageObserver;
import webindex.data.fluo.UriMap;
import webindex.serialization.WebindexKryoFactory;
@@ -30,21 +30,20 @@
public static final String EXPORT_QUEUE_ID = "eq";
- public static void configureApplication(FluoConfiguration appConfig, TableInfo exportTable,
- int numBuckets, int numTablets) {
+ public static void configureApplication(FluoConfiguration fluoConfig,
+ AccumuloExporter.Configuration aeConf, int numBuckets, int numTablets) {
- appConfig.addObserver(new ObserverConfiguration(PageObserver.class.getName()));
+ fluoConfig.addObserver(new ObserverSpecification(PageObserver.class.getName()));
- KryoSimplerSerializer.setKryoFactory(appConfig, WebindexKryoFactory.class);
+ KryoSimplerSerializer.setKryoFactory(fluoConfig, WebindexKryoFactory.class);
- UriMap.configure(appConfig, numBuckets, numTablets);
- DomainMap.configure(appConfig, numBuckets, numTablets);
+ UriMap.configure(fluoConfig, numBuckets, numTablets);
+ DomainMap.configure(fluoConfig, numBuckets, numTablets);
- ExportQueue.configure(appConfig, new ExportQueue.Options(EXPORT_QUEUE_ID,
- AccumuloExporter.class.getName(), String.class.getName(), AccumuloExport.class.getName(),
- numBuckets).setBucketsPerTablet(numBuckets / numTablets));
-
- AccumuloExporter.setExportTableInfo(appConfig, EXPORT_QUEUE_ID, exportTable);
+ ExportQueue.configure(
+ fluoConfig,
+ new ExportQueue.Options(EXPORT_QUEUE_ID, IndexExporter.class.getName(), String.class
+ .getName(), IndexUpdate.class.getName(), numBuckets).setBucketsPerTablet(
+ numBuckets / numTablets).setExporterConfiguration(aeConf));
}
-
}
diff --git a/modules/data/src/main/java/webindex/data/fluo/DomainExport.java b/modules/data/src/main/java/webindex/data/fluo/DomainExport.java
deleted file mode 100644
index 5f87c6d..0000000
--- a/modules/data/src/main/java/webindex/data/fluo/DomainExport.java
+++ /dev/null
@@ -1,42 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.data.fluo;
-
-import java.util.Collections;
-import java.util.Map;
-import java.util.Optional;
-
-import org.apache.fluo.api.data.Bytes;
-import org.apache.fluo.api.data.RowColumn;
-import org.apache.fluo.recipes.accumulo.export.DifferenceExport;
-import webindex.data.util.FluoConstants;
-
-public class DomainExport extends DifferenceExport<String, Long> {
-
- public DomainExport() {}
-
- public DomainExport(Optional<Long> oldCount, Optional<Long> newCount) {
- super(oldCount, newCount);
- }
-
- @Override
- protected Map<RowColumn, Bytes> generateData(String domain, Optional<Long> count) {
- if (count.orElse(0L) == 0) {
- return Collections.emptyMap();
- }
- return Collections.singletonMap(new RowColumn("d:" + domain, FluoConstants.PAGECOUNT_COL),
- Bytes.of(count.get() + ""));
- }
-}
diff --git a/modules/data/src/main/java/webindex/data/fluo/DomainMap.java b/modules/data/src/main/java/webindex/data/fluo/DomainMap.java
index d3f12f0..9ca25ac 100644
--- a/modules/data/src/main/java/webindex/data/fluo/DomainMap.java
+++ b/modules/data/src/main/java/webindex/data/fluo/DomainMap.java
@@ -20,17 +20,18 @@
import org.apache.fluo.api.client.TransactionBase;
import org.apache.fluo.api.config.FluoConfiguration;
import org.apache.fluo.api.observer.Observer.Context;
-import org.apache.fluo.recipes.accumulo.export.AccumuloExport;
import org.apache.fluo.recipes.core.export.ExportQueue;
import org.apache.fluo.recipes.core.map.CollisionFreeMap;
import org.apache.fluo.recipes.core.map.CollisionFreeMap.Options;
import org.apache.fluo.recipes.core.map.Combiner;
import org.apache.fluo.recipes.core.map.Update;
import org.apache.fluo.recipes.core.map.UpdateObserver;
+import webindex.core.models.export.DomainUpdate;
+import webindex.core.models.export.IndexUpdate;
import webindex.data.FluoApp;
-
public class DomainMap {
+
public static final String DOMAIN_MAP_ID = "dm";
/**
@@ -59,7 +60,7 @@
*/
public static class DomainUpdateObserver extends UpdateObserver<String, Long> {
- private ExportQueue<String, AccumuloExport<String>> exportQ;
+ private ExportQueue<String, IndexUpdate> exportQ;
@Override
public void init(String mapId, Context observerContext) throws Exception {
@@ -71,15 +72,16 @@
public void updatingValues(TransactionBase tx, Iterator<Update<String, Long>> updates) {
while (updates.hasNext()) {
Update<String, Long> update = updates.next();
- exportQ.add(tx, update.getKey(),
- new DomainExport(update.getOldValue(), update.getNewValue()));
+ String domain = update.getKey();
+ Long oldVal = update.getOldValue().orElse(0L);
+ Long newVal = update.getNewValue().orElse(0L);
+ exportQ.add(tx, domain, new DomainUpdate(domain, oldVal, newVal));
}
}
}
/**
* A helper method for configuring the domain map before initializing Fluo.
- *
*/
public static void configure(FluoConfiguration config, int numBuckets, int numTablets) {
CollisionFreeMap.configure(config, new Options(DOMAIN_MAP_ID, DomainCombiner.class,
diff --git a/modules/data/src/main/java/webindex/data/fluo/IndexExporter.java b/modules/data/src/main/java/webindex/data/fluo/IndexExporter.java
new file mode 100644
index 0000000..8079cbe
--- /dev/null
+++ b/modules/data/src/main/java/webindex/data/fluo/IndexExporter.java
@@ -0,0 +1,50 @@
+/*
+ * Copyright 2015 Webindex authors (see AUTHORS)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+ * or implied. See the License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package webindex.data.fluo;
+
+import java.util.Collection;
+
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.fluo.recipes.accumulo.export.AccumuloExporter;
+import org.apache.fluo.recipes.core.export.SequencedExport;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import webindex.core.IndexClient;
+import webindex.core.models.export.DomainUpdate;
+import webindex.core.models.export.IndexUpdate;
+import webindex.core.models.export.PageUpdate;
+import webindex.core.models.export.UriUpdate;
+
+public class IndexExporter extends AccumuloExporter<String, IndexUpdate> {
+
+ private static final Logger log = LoggerFactory.getLogger(IndexExporter.class);
+
+ @Override
+ protected Collection<Mutation> translate(SequencedExport<String, IndexUpdate> export) {
+ if (export.getValue() instanceof DomainUpdate) {
+ return IndexClient.genDomainMutations((DomainUpdate) export.getValue(), export.getSequence());
+ } else if (export.getValue() instanceof PageUpdate) {
+ return IndexClient.genPageMutations((PageUpdate) export.getValue(), export.getSequence());
+ } else if (export.getValue() instanceof UriUpdate) {
+ return IndexClient.genUriMutations((UriUpdate) export.getValue(), export.getSequence());
+ }
+
+ String msg =
+ "An object with an unrecognized IndexUpdate class (" + export.getValue().getClass().getName()
+ + ") was placed on the export queue";
+ log.error(msg);
+ throw new IllegalStateException(msg);
+ }
+}
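Reviewer note: `IndexExporter.translate` picks a mutation generator based on the runtime type of the queued value and fails loudly on anything else. A minimal, library-free sketch of that dispatch pattern follows. The `Update` interface and its subtypes are hypothetical stand-ins for `IndexUpdate`, `DomainUpdate`, and `UriUpdate`, and the method returns labels rather than Accumulo `Mutation` objects.

```java
import java.util.Collections;
import java.util.List;

public class ExportDispatchSketch {

  // Hypothetical stand-ins for IndexUpdate and its subtypes; the real classes carry more state.
  interface Update {}

  static final class DomainUpdate implements Update {
    final String domain;

    DomainUpdate(String domain) {
      this.domain = domain;
    }
  }

  static final class UriUpdate implements Update {
    final String uri;

    UriUpdate(String uri) {
      this.uri = uri;
    }
  }

  // A type the dispatcher does not know about, used to exercise the failure path.
  static final class UnknownUpdate implements Update {}

  // Returns a descriptive label instead of mutations so the sketch needs no external libraries.
  static List<String> translate(Update update, long seq) {
    if (update instanceof DomainUpdate) {
      return Collections.singletonList("domain:" + ((DomainUpdate) update).domain + "@" + seq);
    } else if (update instanceof UriUpdate) {
      return Collections.singletonList("uri:" + ((UriUpdate) update).uri + "@" + seq);
    }
    throw new IllegalStateException("Unrecognized update type: " + update.getClass().getName());
  }

  public static void main(String[] args) {
    System.out.println(translate(new DomainUpdate("com.example"), 7L));
    System.out.println(translate(new UriUpdate("com.example>>o>/"), 8L));
  }
}
```

One exporter for the whole queue means the update classes no longer need to know how to write themselves to Accumulo, at the cost of an `instanceof` chain that must be extended whenever a new update type is added.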
diff --git a/modules/data/src/main/java/webindex/data/fluo/PageExport.java b/modules/data/src/main/java/webindex/data/fluo/PageExport.java
deleted file mode 100644
index f3bf023..0000000
--- a/modules/data/src/main/java/webindex/data/fluo/PageExport.java
+++ /dev/null
@@ -1,71 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.data.fluo;
-
-import java.util.ArrayList;
-import java.util.Collection;
-import java.util.List;
-
-import com.google.common.collect.Sets.SetView;
-import org.apache.accumulo.core.data.Mutation;
-import org.apache.fluo.recipes.accumulo.export.AccumuloExport;
-import webindex.core.Constants;
-import webindex.core.models.Link;
-import webindex.core.models.Page;
-
-public class PageExport implements AccumuloExport<String> {
-
- private String json;
- private List<Link> addedLinks;
- private List<Link> deletedLinks;
-
- public PageExport() {}
-
- public PageExport(String json, SetView<Link> addedLinks, SetView<Link> deletedLinks) {
- this.json = json;
- this.addedLinks = new ArrayList<>(addedLinks);
- this.deletedLinks = new ArrayList<>(deletedLinks);
- }
-
- @Override
- public Collection<Mutation> toMutations(String referencingUri, long seq) {
-
- ArrayList<Mutation> mutations = new ArrayList<>(addedLinks.size() + deletedLinks.size() + 1);
-
- Mutation jsonMutation = new Mutation("p:" + referencingUri);
- if (json.equals(Page.DELETE_JSON)) {
- jsonMutation.putDelete(Constants.PAGE, Constants.CUR, seq);
- } else {
- jsonMutation.put(Constants.PAGE, Constants.CUR, seq, json);
- }
- mutations.add(jsonMutation);
-
- // invert links on export
- for (Link link : addedLinks) {
- Mutation m = new Mutation("p:" + link.getPageID());
- m.put(Constants.INLINKS, referencingUri, seq, link.getAnchorText());
- mutations.add(m);
- }
-
- for (Link link : deletedLinks) {
- Mutation m = new Mutation("p:" + link.getPageID());
- m.putDelete(Constants.INLINKS, referencingUri, seq);
- mutations.add(m);
- }
-
- return mutations;
- }
-
-}
diff --git a/modules/data/src/main/java/webindex/data/fluo/PageLoader.java b/modules/data/src/main/java/webindex/data/fluo/PageLoader.java
index 9b02deb..deb01e7 100644
--- a/modules/data/src/main/java/webindex/data/fluo/PageLoader.java
+++ b/modules/data/src/main/java/webindex/data/fluo/PageLoader.java
@@ -25,9 +25,9 @@
import org.apache.fluo.recipes.core.types.TypedTransactionBase;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
+import webindex.core.Constants;
import webindex.core.models.Page;
import webindex.core.models.URL;
-import webindex.data.util.FluoConstants;
public class PageLoader implements Loader {
@@ -57,20 +57,19 @@
@Override
public void load(TransactionBase tx, Context context) throws Exception {
- TypedTransactionBase ttx = FluoConstants.TYPEL.wrap(tx);
+ TypedTransactionBase ttx = Constants.TYPEL.wrap(tx);
Gson gson = new Gson();
RowHasher rowHasher = PageObserver.getPageRowHasher();
switch (action) {
case DELETE:
- ttx.mutate().row(rowHasher.addHash(delUrl.toPageID())).col(FluoConstants.PAGE_NEW_COL)
+ ttx.mutate().row(rowHasher.addHash(delUrl.toUri())).col(Constants.PAGE_NEW_COL)
.set(Page.DELETE_JSON);
break;
case UPDATE:
String newJson = gson.toJson(page);
- ttx.mutate().row(rowHasher.addHash(page.getPageID())).col(FluoConstants.PAGE_NEW_COL)
- .set(newJson);
+ ttx.mutate().row(rowHasher.addHash(page.getUri())).col(Constants.PAGE_NEW_COL).set(newJson);
break;
default:
log.error("PageUpdate called with no action");
diff --git a/modules/data/src/main/java/webindex/data/fluo/PageObserver.java b/modules/data/src/main/java/webindex/data/fluo/PageObserver.java
index 55eeaff..139141a 100644
--- a/modules/data/src/main/java/webindex/data/fluo/PageObserver.java
+++ b/modules/data/src/main/java/webindex/data/fluo/PageObserver.java
@@ -14,7 +14,9 @@
package webindex.data.fluo;
+import java.util.ArrayList;
import java.util.HashMap;
+import java.util.List;
import java.util.Map;
import java.util.Set;
@@ -24,7 +26,6 @@
import org.apache.fluo.api.data.Bytes;
import org.apache.fluo.api.data.Column;
import org.apache.fluo.api.observer.AbstractObserver;
-import org.apache.fluo.recipes.accumulo.export.AccumuloExport;
import org.apache.fluo.recipes.core.data.RowHasher;
import org.apache.fluo.recipes.core.export.ExportQueue;
import org.apache.fluo.recipes.core.map.CollisionFreeMap;
@@ -32,11 +33,13 @@
import org.apache.fluo.recipes.core.types.TypedTransactionBase;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
+import webindex.core.Constants;
import webindex.core.models.Link;
import webindex.core.models.Page;
+import webindex.core.models.UriInfo;
+import webindex.core.models.export.IndexUpdate;
+import webindex.core.models.export.PageUpdate;
import webindex.data.FluoApp;
-import webindex.data.fluo.UriMap.UriInfo;
-import webindex.data.util.FluoConstants;
public class PageObserver extends AbstractObserver {
@@ -44,7 +47,7 @@
private static final Gson gson = new Gson();
private CollisionFreeMap<String, UriInfo> uriMap;
- private ExportQueue<String, AccumuloExport<String>> exportQ;
+ private ExportQueue<String, IndexUpdate> exportQ;
private static final RowHasher PAGE_ROW_HASHER = new RowHasher("p");
@@ -61,18 +64,18 @@
@Override
public void process(TransactionBase tx, Bytes row, Column col) throws Exception {
- TypedTransactionBase ttx = FluoConstants.TYPEL.wrap(tx);
+ TypedTransactionBase ttx = Constants.TYPEL.wrap(tx);
Map<Column, Value> pages =
- ttx.get().row(row).columns(FluoConstants.PAGE_NEW_COL, FluoConstants.PAGE_CUR_COL);
+ ttx.get().row(row).columns(Constants.PAGE_NEW_COL, Constants.PAGE_CUR_COL);
- String nextJson = pages.get(FluoConstants.PAGE_NEW_COL).toString("");
+ String nextJson = pages.get(Constants.PAGE_NEW_COL).toString("");
if (nextJson.isEmpty()) {
log.error("An empty page was set at row {} col {}", row.toString(), col.toString());
return;
}
- Page curPage = Page.fromJson(gson, pages.get(FluoConstants.PAGE_CUR_COL).toString(""));
+ Page curPage = Page.fromJson(gson, pages.get(Constants.PAGE_CUR_COL).toString(""));
Set<Link> curLinks = curPage.getOutboundLinks();
Map<String, UriInfo> updates = new HashMap<>();
@@ -80,10 +83,10 @@
Page nextPage = Page.fromJson(gson, nextJson);
if (nextPage.isDelete()) {
- ttx.mutate().row(row).col(FluoConstants.PAGE_CUR_COL).delete();
+ ttx.mutate().row(row).col(Constants.PAGE_CUR_COL).delete();
updates.put(pageUri, new UriInfo(0, -1));
} else {
- ttx.mutate().row(row).col(FluoConstants.PAGE_CUR_COL).set(nextJson);
+ ttx.mutate().row(row).col(Constants.PAGE_CUR_COL).set(nextJson);
if (curPage.isEmpty()) {
updates.put(pageUri, new UriInfo(0, 1));
}
@@ -91,26 +94,26 @@
Set<Link> nextLinks = nextPage.getOutboundLinks();
- Sets.SetView<Link> addLinks = Sets.difference(nextLinks, curLinks);
+ List<Link> addLinks = new ArrayList<>(Sets.difference(nextLinks, curLinks));
for (Link link : addLinks) {
- updates.put(link.getPageID(), new UriInfo(1, 0));
+ updates.put(link.getUri(), new UriInfo(1, 0));
}
- Sets.SetView<Link> delLinks = Sets.difference(curLinks, nextLinks);
+ List<Link> delLinks = new ArrayList<>(Sets.difference(curLinks, nextLinks));
for (Link link : delLinks) {
- updates.put(link.getPageID(), new UriInfo(-1, 0));
+ updates.put(link.getUri(), new UriInfo(-1, 0));
}
uriMap.update(tx, updates);
- exportQ.add(tx, pageUri, new PageExport(nextJson, addLinks, delLinks));
+ exportQ.add(tx, pageUri, new PageUpdate(pageUri, nextJson, addLinks, delLinks));
// clean up
- ttx.mutate().row(row).col(FluoConstants.PAGE_NEW_COL).delete();
+ ttx.mutate().row(row).col(Constants.PAGE_NEW_COL).delete();
}
@Override
public ObservedColumn getObservedColumn() {
- return new ObservedColumn(FluoConstants.PAGE_NEW_COL, NotificationType.STRONG);
+ return new ObservedColumn(Constants.PAGE_NEW_COL, NotificationType.STRONG);
}
}
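Reviewer note: the core of `PageObserver.process` is a pair of set differences between the current and next outbound links, turned into `+1` and `-1` updates for the URI map. The sketch below shows that step in isolation. It uses plain strings instead of `Link` objects, `HashSet.removeAll` instead of Guava's `Sets.difference`, and a `Long` delta instead of `UriInfo`; all names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LinkDiffSketch {

  // Returns the elements of a that are not in b, leaving both inputs unchanged.
  static Set<String> difference(Set<String> a, Set<String> b) {
    Set<String> result = new HashSet<>(a);
    result.removeAll(b);
    return result;
  }

  // Maps each affected URI to the change in its inbound-link count: +1 if added, -1 if removed.
  static Map<String, Long> linkDeltas(Set<String> curLinks, Set<String> nextLinks) {
    Map<String, Long> deltas = new HashMap<>();
    for (String uri : difference(nextLinks, curLinks)) {
      deltas.put(uri, 1L);
    }
    for (String uri : difference(curLinks, nextLinks)) {
      deltas.put(uri, -1L);
    }
    return deltas;
  }

  public static void main(String[] args) {
    Set<String> cur = new HashSet<>(Arrays.asList("com.a>>o>/", "com.b>>o>/"));
    Set<String> next = new HashSet<>(Arrays.asList("com.b>>o>/", "com.c>>o>/"));
    System.out.println(linkDeltas(cur, next));
  }
}
```

Links present in both versions of the page produce no entry, so an unchanged page queues no work for the URI map.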
diff --git a/modules/data/src/main/java/webindex/data/fluo/UriCountExport.java b/modules/data/src/main/java/webindex/data/fluo/UriCountExport.java
deleted file mode 100644
index 9e2d616..0000000
--- a/modules/data/src/main/java/webindex/data/fluo/UriCountExport.java
+++ /dev/null
@@ -1,73 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.data.fluo;
-
-import java.util.Collections;
-import java.util.HashMap;
-import java.util.Map;
-import java.util.Optional;
-
-import org.apache.accumulo.core.client.lexicoder.Lexicoder;
-import org.apache.accumulo.core.client.lexicoder.ReverseLexicoder;
-import org.apache.accumulo.core.client.lexicoder.ULongLexicoder;
-import org.apache.commons.codec.binary.Hex;
-import org.apache.fluo.api.data.Bytes;
-import org.apache.fluo.api.data.Column;
-import org.apache.fluo.api.data.RowColumn;
-import org.apache.fluo.recipes.accumulo.export.DifferenceExport;
-import webindex.core.Constants;
-import webindex.core.models.URL;
-import webindex.data.fluo.UriMap.UriInfo;
-import webindex.data.util.FluoConstants;
-
-public class UriCountExport extends DifferenceExport<String, UriInfo> {
-
- public UriCountExport() {}
-
- public UriCountExport(Optional<UriInfo> oldCount, Optional<UriInfo> newCount) {
- super(oldCount, newCount);
- }
-
- @Override
- protected Map<RowColumn, Bytes> generateData(String pageID, Optional<UriInfo> val) {
- if (val.orElse(UriInfo.ZERO).equals(UriInfo.ZERO)) {
- return Collections.emptyMap();
- }
-
- UriInfo uriInfo = val.get();
-
- Map<RowColumn, Bytes> rcMap = new HashMap<>();
- Bytes linksTo = Bytes.of("" + uriInfo.linksTo);
- rcMap.put(new RowColumn(createTotalRow(pageID, uriInfo.linksTo), Column.EMPTY), linksTo);
- String domain = URL.fromPageID(pageID).getReverseDomain();
- String domainRow = encodeDomainRankPageId(domain, uriInfo.linksTo, pageID);
- rcMap.put(new RowColumn(domainRow, new Column(Constants.RANK, "")), linksTo);
- rcMap.put(new RowColumn("p:" + pageID, FluoConstants.PAGE_INCOUNT_COL), linksTo);
- return rcMap;
- }
-
- public static String revEncodeLong(Long num) {
- Lexicoder<Long> lexicoder = new ReverseLexicoder<>(new ULongLexicoder());
- return Hex.encodeHexString(lexicoder.encode(num));
- }
-
- public static String encodeDomainRankPageId(String domain, long linksTo, String pageId) {
- return "d:" + domain + ":" + revEncodeLong(linksTo) + ":" + pageId;
- }
-
- private static String createTotalRow(String uri, long curr) {
- return "t:" + revEncodeLong(curr) + ":" + uri;
- }
-}
diff --git a/modules/data/src/main/java/webindex/data/fluo/UriMap.java b/modules/data/src/main/java/webindex/data/fluo/UriMap.java
index 1842507..ef493d7 100644
--- a/modules/data/src/main/java/webindex/data/fluo/UriMap.java
+++ b/modules/data/src/main/java/webindex/data/fluo/UriMap.java
@@ -14,17 +14,14 @@
package webindex.data.fluo;
-import java.io.Serializable;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;
-import com.google.common.base.Preconditions;
import org.apache.fluo.api.client.TransactionBase;
import org.apache.fluo.api.config.FluoConfiguration;
import org.apache.fluo.api.observer.Observer.Context;
-import org.apache.fluo.recipes.accumulo.export.AccumuloExport;
import org.apache.fluo.recipes.core.export.ExportQueue;
import org.apache.fluo.recipes.core.map.CollisionFreeMap;
import org.apache.fluo.recipes.core.map.CollisionFreeMap.Options;
@@ -32,6 +29,9 @@
import org.apache.fluo.recipes.core.map.Update;
import org.apache.fluo.recipes.core.map.UpdateObserver;
import webindex.core.models.URL;
+import webindex.core.models.UriInfo;
+import webindex.core.models.export.IndexUpdate;
+import webindex.core.models.export.UriUpdate;
import webindex.data.FluoApp;
/**
@@ -42,59 +42,6 @@
public static final String URI_MAP_ID = "um";
- public static class UriInfo implements Serializable {
-
- private static final long serialVersionUID = 1L;
-
- public static final UriInfo ZERO = new UriInfo(0, 0);
-
- // the numbers of documents that link to this URI
- public long linksTo;
-
- // the number of documents with this URI. Should be 0 or 1
- public int docs;
-
- public UriInfo() {}
-
- public UriInfo(long linksTo, int docs) {
- this.linksTo = linksTo;
- this.docs = docs;
- }
-
- public void add(UriInfo other) {
- Preconditions.checkArgument(this != ZERO);
- this.linksTo += other.linksTo;
- this.docs += other.docs;
- }
-
- @Override
- public String toString() {
- return linksTo + " " + docs;
- }
-
- @Override
- public boolean equals(Object o) {
- if (o instanceof UriInfo) {
- UriInfo oui = (UriInfo) o;
- return linksTo == oui.linksTo && docs == oui.docs;
- }
-
- return false;
- }
-
- @Override
- public int hashCode() {
- return docs + (int) linksTo;
- }
-
- public static UriInfo merge(UriInfo u1, UriInfo u2) {
- UriInfo total = new UriInfo(0, 0);
- total.add(u1);
- total.add(u2);
- return total;
- }
- }
-
/**
* Combines updates made to the uri map
*/
@@ -121,7 +68,7 @@
*/
public static class UriUpdateObserver extends UpdateObserver<String, UriInfo> {
- private ExportQueue<String, AccumuloExport<String>> exportQ;
+ private ExportQueue<String, IndexUpdate> exportQ;
private CollisionFreeMap<String, Long> domainMap;
@Override
@@ -140,13 +87,13 @@
while (updates.hasNext()) {
Update<String, UriInfo> update = updates.next();
+ String uri = update.getKey();
UriInfo oldVal = update.getOldValue().orElse(UriInfo.ZERO);
UriInfo newVal = update.getNewValue().orElse(UriInfo.ZERO);
- exportQ.add(tx, update.getKey(),
- new UriCountExport(update.getOldValue(), update.getNewValue()));
+ exportQ.add(tx, uri, new UriUpdate(uri, oldVal, newVal));
- String pageDomain = URL.fromPageID(update.getKey()).getReverseDomain();
+ String pageDomain = URL.fromUri(uri).getReverseDomain();
if (oldVal.equals(UriInfo.ZERO) && !newVal.equals(UriInfo.ZERO)) {
domainUpdates.merge(pageDomain, 1L, (o, n) -> o + n);
} else if (newVal.equals(UriInfo.ZERO) && !oldVal.equals(UriInfo.ZERO)) {
@@ -160,7 +107,6 @@
/**
* A helper method for configuring the uri map before initializing Fluo.
- *
*/
public static void configure(FluoConfiguration config, int numBuckets, int numTablets) {
CollisionFreeMap.configure(config, new Options(URI_MAP_ID, UriCombiner.class,
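Reviewer note: `UriUpdateObserver` only touches the domain map when a URI's info crosses the zero boundary, so each domain count tracks the number of URIs with a non-zero `UriInfo`. The sketch below isolates that rule. `Info` is a hypothetical simplification of `UriInfo`, and the `-1` branch is an assumption: the hunk shown above ends before that branch body, so it is written here as the mirror of the `+1` case.

```java
import java.util.HashMap;
import java.util.Map;

public class DomainDeltaSketch {

  // Hypothetical simplification of UriInfo: inbound link count plus document count.
  static final class Info {
    final long linksTo;
    final int docs;

    Info(long linksTo, int docs) {
      this.linksTo = linksTo;
      this.docs = docs;
    }

    boolean isZero() {
      return linksTo == 0 && docs == 0;
    }
  }

  // Accumulates a per-domain delta only when a URI appears (zero -> non-zero) or disappears.
  static void record(Map<String, Long> domainUpdates, String domain, Info oldVal, Info newVal) {
    if (oldVal.isZero() && !newVal.isZero()) {
      domainUpdates.merge(domain, 1L, Long::sum);
    } else if (newVal.isZero() && !oldVal.isZero()) {
      // Assumption: mirrors the +1 case; this branch body is not visible in the diff.
      domainUpdates.merge(domain, -1L, Long::sum);
    }
  }

  public static void main(String[] args) {
    Map<String, Long> domainUpdates = new HashMap<>();
    record(domainUpdates, "com.example", new Info(0, 0), new Info(1, 0)); // new URI
    record(domainUpdates, "com.example", new Info(1, 0), new Info(2, 0)); // count change only
    record(domainUpdates, "org.other", new Info(3, 1), new Info(0, 0)); // URI removed
    System.out.println(domainUpdates);
  }
}
```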
diff --git a/modules/data/src/main/java/webindex/data/spark/IndexEnv.java b/modules/data/src/main/java/webindex/data/spark/IndexEnv.java
index ec1dda8..e8d50c6 100644
--- a/modules/data/src/main/java/webindex/data/spark/IndexEnv.java
+++ b/modules/data/src/main/java/webindex/data/spark/IndexEnv.java
@@ -38,7 +38,7 @@
import org.apache.fluo.api.data.Bytes;
import org.apache.fluo.api.data.RowColumn;
import org.apache.fluo.core.util.AccumuloUtil;
-import org.apache.fluo.recipes.accumulo.export.TableInfo;
+import org.apache.fluo.recipes.accumulo.export.AccumuloExporter;
import org.apache.fluo.recipes.accumulo.ops.TableOperations;
import org.apache.fluo.recipes.core.common.TableOptimizations;
import org.apache.fluo.recipes.spark.FluoSparkHelper;
@@ -56,9 +56,9 @@
import org.slf4j.LoggerFactory;
import webindex.core.WebIndexConfig;
import webindex.core.models.Page;
+import webindex.core.models.UriInfo;
import webindex.data.FluoApp;
import webindex.data.fluo.PageObserver;
-import webindex.data.fluo.UriMap.UriInfo;
public class IndexEnv {
@@ -197,11 +197,12 @@
}
}
- public void configureApplication(FluoConfiguration appConfig) {
- FluoApp.configureApplication(appConfig,
- new TableInfo(fluoConfig.getAccumuloInstance(), fluoConfig.getAccumuloZookeepers(),
- fluoConfig.getAccumuloUser(), fluoConfig.getAccumuloPassword(), accumuloTable),
- numBuckets, numTablets);
+ public void configureApplication(FluoConfiguration config) {
+ FluoApp.configureApplication(
+ config,
+ new AccumuloExporter.Configuration(fluoConfig.getAccumuloInstance(), fluoConfig
+ .getAccumuloZookeepers(), fluoConfig.getAccumuloUser(), fluoConfig
+ .getAccumuloPassword(), accumuloTable), numBuckets, numTablets);
}
public void initializeIndexes(JavaSparkContext ctx, JavaRDD<Page> pages, IndexStats stats)
diff --git a/modules/data/src/main/java/webindex/data/spark/IndexUtil.java b/modules/data/src/main/java/webindex/data/spark/IndexUtil.java
index 077a55a..0989f3c 100644
--- a/modules/data/src/main/java/webindex/data/spark/IndexUtil.java
+++ b/modules/data/src/main/java/webindex/data/spark/IndexUtil.java
@@ -36,16 +36,17 @@
import org.archive.io.ArchiveRecord;
import scala.Tuple2;
import webindex.core.Constants;
+import webindex.core.IndexClient;
import webindex.core.models.Link;
import webindex.core.models.Page;
import webindex.core.models.URL;
+import webindex.core.models.UriInfo;
import webindex.data.fluo.DomainMap;
import webindex.data.fluo.PageObserver;
-import webindex.data.fluo.UriCountExport;
+
import webindex.data.fluo.UriMap;
-import webindex.data.fluo.UriMap.UriInfo;
+
import webindex.data.util.ArchiveUtil;
-import webindex.data.util.FluoConstants;
import webindex.serialization.WebindexKryoFactory;
public class IndexUtil {
@@ -75,10 +76,10 @@
List<Tuple2<String, UriInfo>> ret = new ArrayList<>();
if (!page.isEmpty()) {
- ret.add(new Tuple2<>(page.getPageID(), new UriInfo(0, 1)));
+ ret.add(new Tuple2<>(page.getUri(), new UriInfo(0, 1)));
for (Link link : page.getOutboundLinks()) {
- ret.add(new Tuple2<>(link.getPageID(), new UriInfo(1, 0)));
+ ret.add(new Tuple2<>(link.getUri(), new UriInfo(1, 0)));
}
}
return ret;
@@ -92,7 +93,7 @@
public static JavaPairRDD<String, Long> createDomainMap(JavaPairRDD<String, UriInfo> uriMap) {
JavaPairRDD<String, Long> domainMap =
- uriMap.mapToPair(t -> new Tuple2<>(URL.fromPageID(t._1()).getReverseDomain(), 1L))
+ uriMap.mapToPair(t -> new Tuple2<>(URL.fromUri(t._1()).getReverseDomain(), 1L))
.reduceByKey(Long::sum);
domainMap.persist(StorageLevel.DISK_ONLY());
@@ -117,12 +118,12 @@
stats.addExternalLinks(links1.size());
List<Tuple2<RowColumn, Bytes>> ret = new ArrayList<>();
- String pageID = page.getPageID();
+ String uri = page.getUri();
if (links1.size() > 0) {
- addRCV(ret, "p:" + pageID, FluoConstants.PAGE_CUR_COL, gson.toJson(page));
+ addRCV(ret, "p:" + uri, Constants.PAGE_CUR_COL, gson.toJson(page));
}
for (Link link : links1) {
- addRCV(ret, "p:" + link.getPageID(), new Column(Constants.INLINKS, pageID),
+ addRCV(ret, "p:" + link.getUri(), new Column(Constants.INLINKS, uri),
link.getAnchorText());
}
return ret;
@@ -133,12 +134,12 @@
List<Tuple2<RowColumn, Bytes>> ret = new ArrayList<>();
String uri = t._1();
UriInfo uriInfo = t._2();
- addRCV(ret, "t:" + UriCountExport.revEncodeLong(uriInfo.linksTo) + ":" + uri,
- Column.EMPTY, uriInfo.linksTo);
- String domain = URL.fromPageID(t._1()).getReverseDomain();
- String domainRow = UriCountExport.encodeDomainRankPageId(domain, uriInfo.linksTo, uri);
+ addRCV(ret, "t:" + IndexClient.revEncodeLong(uriInfo.linksTo) + ":" + uri, Column.EMPTY,
+ uriInfo.linksTo);
+ String domain = URL.fromUri(t._1()).getReverseDomain();
+ String domainRow = IndexClient.encodeDomainRankUri(domain, uriInfo.linksTo, uri);
addRCV(ret, domainRow, new Column(Constants.RANK, ""), uriInfo.linksTo);
- addRCV(ret, "p:" + uri, FluoConstants.PAGE_INCOUNT_COL, uriInfo.linksTo);
+ addRCV(ret, "p:" + uri, Constants.PAGE_INCOUNT_COL, uriInfo.linksTo);
return ret;
}));
@@ -166,9 +167,9 @@
}
Set<Link> links1 = page.getOutboundLinks();
List<Tuple2<RowColumn, Bytes>> ret = new ArrayList<>();
- String pageID = page.getPageID();
+ String uri = page.getUri();
if (links1.size() > 0) {
- String hashedRow = PageObserver.getPageRowHasher().addHash(pageID).toString();
+ String hashedRow = PageObserver.getPageRowHasher().addHash(uri).toString();
addRCV(ret, hashedRow, new Column(Constants.PAGE, Constants.CUR), gson.toJson(page));
}
return ret;
diff --git a/modules/data/src/main/java/webindex/data/util/ArchiveUtil.java b/modules/data/src/main/java/webindex/data/util/ArchiveUtil.java
index 6ce5118..1491d6e 100644
--- a/modules/data/src/main/java/webindex/data/util/ArchiveUtil.java
+++ b/modules/data/src/main/java/webindex/data/util/ArchiveUtil.java
@@ -58,7 +58,7 @@
log.error("Unexpected exception while parsing raw page URL: " + rawPageUrl, e);
return Page.EMPTY;
}
- Page page = new Page(pageUrl.toPageID());
+ Page page = new Page(pageUrl.toUri());
page.setCrawlDate(archiveRecord.getHeader().getDate());
try {
JSONObject responseMeta =
diff --git a/modules/data/src/main/java/webindex/data/util/FluoConstants.java b/modules/data/src/main/java/webindex/data/util/FluoConstants.java
deleted file mode 100644
index 23f56b3..0000000
--- a/modules/data/src/main/java/webindex/data/util/FluoConstants.java
+++ /dev/null
@@ -1,30 +0,0 @@
-/*
- * Copyright 2015 Webindex authors (see AUTHORS)
- *
- * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
- * in compliance with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software distributed under the License
- * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
- * or implied. See the License for the specific language governing permissions and limitations under
- * the License.
- */
-
-package webindex.data.util;
-
-import org.apache.fluo.api.data.Column;
-import org.apache.fluo.recipes.core.types.StringEncoder;
-import org.apache.fluo.recipes.core.types.TypeLayer;
-import webindex.core.Constants;
-
-public class FluoConstants {
-
- public static final TypeLayer TYPEL = new TypeLayer(new StringEncoder());
-
- public static final Column PAGE_NEW_COL = new Column(Constants.PAGE, Constants.NEW);
- public static final Column PAGE_CUR_COL = new Column(Constants.PAGE, Constants.CUR);
- public static final Column PAGE_INCOUNT_COL = new Column(Constants.PAGE, Constants.INCOUNT);
- public static final Column PAGECOUNT_COL = new Column(Constants.DOMAIN, Constants.PAGECOUNT);
-}
diff --git a/modules/data/src/main/java/webindex/serialization/WebindexKryoFactory.java b/modules/data/src/main/java/webindex/serialization/WebindexKryoFactory.java
index c3e5035..d07f6e2 100644
--- a/modules/data/src/main/java/webindex/serialization/WebindexKryoFactory.java
+++ b/modules/data/src/main/java/webindex/serialization/WebindexKryoFactory.java
@@ -16,15 +16,15 @@
import java.io.Serializable;
import java.util.ArrayList;
-import java.util.Optional;
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.pool.KryoFactory;
import webindex.core.models.Link;
-import webindex.data.fluo.DomainExport;
-import webindex.data.fluo.PageExport;
-import webindex.data.fluo.UriCountExport;
-import webindex.data.fluo.UriMap.UriInfo;
+import webindex.core.models.UriInfo;
+import webindex.core.models.export.DomainUpdate;
+import webindex.core.models.export.IndexUpdate;
+import webindex.core.models.export.PageUpdate;
+import webindex.core.models.export.UriUpdate;
public class WebindexKryoFactory implements KryoFactory, Serializable {
@@ -38,12 +38,12 @@
// same order it would be ok) and ran into issue where Spark and Fluo code were using different
// ids for some reason.
kryo.register(UriInfo.class, 9);
- kryo.register(DomainExport.class, 10);
- kryo.register(UriCountExport.class, 11);
- kryo.register(PageExport.class, 12);
- kryo.register(ArrayList.class, 13);
- kryo.register(Link.class, 14);
- kryo.register(Optional.class, 15);
+ kryo.register(IndexUpdate.class, 10);
+ kryo.register(DomainUpdate.class, 11);
+ kryo.register(PageUpdate.class, 12);
+ kryo.register(UriUpdate.class, 13);
+ kryo.register(ArrayList.class, 14);
+ kryo.register(Link.class, 15);
kryo.setRegistrationRequired(true);
diff --git a/modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java b/modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java
index 6b29fec..3e213bc 100644
--- a/modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java
+++ b/modules/data/src/test/java/webindex/data/fluo/it/IndexIT.java
@@ -46,9 +46,9 @@
import webindex.core.models.Link;
import webindex.core.models.Page;
import webindex.core.models.URL;
+import webindex.core.models.UriInfo;
import webindex.data.SparkTestUtil;
import webindex.data.fluo.PageLoader;
-import webindex.data.fluo.UriMap.UriInfo;
import webindex.data.spark.Hex;
import webindex.data.spark.IndexEnv;
import webindex.data.spark.IndexStats;
@@ -98,7 +98,7 @@
if (p.isEmpty() || p.getOutboundLinks().isEmpty()) {
continue;
}
- pageMap.put(URL.fromPageID(p.getPageID()), p);
+ pageMap.put(URL.fromUri(p.getUri()), p);
}
ar.close();
return pageMap;
diff --git a/modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java b/modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java
index 75ea240..3ab30e6 100644
--- a/modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java
+++ b/modules/data/src/test/java/webindex/data/spark/IndexUtilTest.java
@@ -34,8 +34,8 @@
import webindex.core.models.Link;
import webindex.core.models.Page;
import webindex.core.models.URL;
+import webindex.core.models.UriInfo;
import webindex.data.SparkTestUtil;
-import webindex.data.fluo.UriMap.UriInfo;
public class IndexUtilTest {
@@ -106,11 +106,11 @@
private List<Page> getPagesSet1() {
List<Page> pages = new ArrayList<>();
- Page pageA = new Page(URL.from("http://a.com/1").toPageID());
+ Page pageA = new Page(URL.from("http://a.com/1").toUri());
pageA.addOutbound(Link.of(URL.from("http://b.com/1"), "b1"));
pageA.addOutbound(Link.of(URL.from("http://b.com/3"), "b3"));
pageA.addOutbound(Link.of(URL.from("http://c.com/1"), "c1"));
- Page pageB = new Page(URL.from("http://b.com").toPageID());
+ Page pageB = new Page(URL.from("http://b.com").toUri());
pageB.addOutbound(Link.of(URL.from("http://c.com/1"), "c1"));
pageB.addOutbound(Link.of(URL.from("http://b.com/2"), "b2"));
pageB.addOutbound(Link.of(URL.from("http://b.com/3"), "b3"));
diff --git a/modules/data/src/test/java/webindex/data/util/ArchiveUtilTest.java b/modules/data/src/test/java/webindex/data/util/ArchiveUtilTest.java
index 83b58e3..243fd2f 100644
--- a/modules/data/src/test/java/webindex/data/util/ArchiveUtilTest.java
+++ b/modules/data/src/test/java/webindex/data/util/ArchiveUtilTest.java
@@ -43,7 +43,7 @@
Assert
.assertEquals(
"com.1079ishot>>o>/presale-password-trey-songz-young-jeezy-pre-christmas-bash/screen-shot-2011-10-27-at-11-12-06-am/",
- page.getPageID());
+ page.getUri());
Assert.assertEquals("2015-04-18T03:35:13Z", page.getCrawlDate());
Assert.assertEquals("nginx/1.6.2", page.getServer());
diff --git a/modules/data/src/test/resources/data/set1/accumulo-data.txt b/modules/data/src/test/resources/data/set1/accumulo-data.txt
index b8af78f..e82938b 100644
--- a/modules/data/src/test/resources/data/set1/accumulo-data.txt
+++ b/modules/data/src/test/resources/data/set1/accumulo-data.txt
@@ -7,9 +7,9 @@
d:com.b:fefeff:com.b>>o>/|rank||0
d:com.c|domain|pagecount|1
d:com.c:fefdfdff:com.c>>o>/1|rank||2
-p:com.a>>o>/1|page|cur|{"url":"http://a.com/1","pageID":"com.a\x5cu003e\x5cu003eo\x5cu003e/1","numOutbound":3,"outboundLinks":[{"url":"http://b.com/1","pageID":"com.b\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"b1"},{"url":"http://b.com/3","pageID":"com.b\x5cu003e\x5cu003eo\x5cu003e/3","anchorText":"b3"},{"url":"http://c.com/1","pageID":"com.c\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"c1"}]}
+p:com.a>>o>/1|page|cur|{"url":"http://a.com/1","uri":"com.a\x5cu003e\x5cu003eo\x5cu003e/1","numOutbound":3,"outboundLinks":[{"url":"http://b.com/1","uri":"com.b\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"b1"},{"url":"http://b.com/3","uri":"com.b\x5cu003e\x5cu003eo\x5cu003e/3","anchorText":"b3"},{"url":"http://c.com/1","uri":"com.c\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"c1"}]}
p:com.a>>o>/1|page|incount|0
-p:com.b>>o>/|page|cur|{"url":"http://b.com/","pageID":"com.b\x5cu003e\x5cu003eo\x5cu003e/","numOutbound":3,"outboundLinks":[{"url":"http://b.com/2","pageID":"com.b\x5cu003e\x5cu003eo\x5cu003e/2","anchorText":"b2"},{"url":"http://b.com/3","pageID":"com.b\x5cu003e\x5cu003eo\x5cu003e/3","anchorText":"b3"},{"url":"http://c.com/1","pageID":"com.c\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"c1"}]}
+p:com.b>>o>/|page|cur|{"url":"http://b.com/","uri":"com.b\x5cu003e\x5cu003eo\x5cu003e/","numOutbound":3,"outboundLinks":[{"url":"http://b.com/2","uri":"com.b\x5cu003e\x5cu003eo\x5cu003e/2","anchorText":"b2"},{"url":"http://b.com/3","uri":"com.b\x5cu003e\x5cu003eo\x5cu003e/3","anchorText":"b3"},{"url":"http://c.com/1","uri":"com.c\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"c1"}]}
p:com.b>>o>/|page|incount|0
p:com.b>>o>/1|inlinks|com.a>>o>/1|b1
p:com.b>>o>/1|page|incount|1
diff --git a/modules/data/src/test/resources/data/set1/fluo-data.txt b/modules/data/src/test/resources/data/set1/fluo-data.txt
index 2e3750d..e085c81 100644
--- a/modules/data/src/test/resources/data/set1/fluo-data.txt
+++ b/modules/data/src/test/resources/data/set1/fluo-data.txt
@@ -1,8 +1,8 @@
dm:d:28:\x03\x01com.\xe3|data|current|\x09\x02
dm:d:57:\x03\x01com.\xe1|data|current|\x09\x02
dm:d:5a:\x03\x01com.\xe2|data|current|\x09\x08
-p:saxb:com.a>>o>/1|page|cur|{"url":"http://a.com/1","pageID":"com.a\x5cu003e\x5cu003eo\x5cu003e/1","numOutbound":3,"outboundLinks":[{"url":"http://b.com/1","pageID":"com.b\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"b1"},{"url":"http://b.com/3","pageID":"com.b\x5cu003e\x5cu003eo\x5cu003e/3","anchorText":"b3"},{"url":"http://c.com/1","pageID":"com.c\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"c1"}]}
-p:xdjd:com.b>>o>/|page|cur|{"url":"http://b.com/","pageID":"com.b\x5cu003e\x5cu003eo\x5cu003e/","numOutbound":3,"outboundLinks":[{"url":"http://b.com/2","pageID":"com.b\x5cu003e\x5cu003eo\x5cu003e/2","anchorText":"b2"},{"url":"http://b.com/3","pageID":"com.b\x5cu003e\x5cu003eo\x5cu003e/3","anchorText":"b3"},{"url":"http://c.com/1","pageID":"com.c\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"c1"}]}
+p:saxb:com.a>>o>/1|page|cur|{"url":"http://a.com/1","uri":"com.a\x5cu003e\x5cu003eo\x5cu003e/1","numOutbound":3,"outboundLinks":[{"url":"http://b.com/1","uri":"com.b\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"b1"},{"url":"http://b.com/3","uri":"com.b\x5cu003e\x5cu003eo\x5cu003e/3","anchorText":"b3"},{"url":"http://c.com/1","uri":"com.c\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"c1"}]}
+p:xdjd:com.b>>o>/|page|cur|{"url":"http://b.com/","uri":"com.b\x5cu003e\x5cu003eo\x5cu003e/","numOutbound":3,"outboundLinks":[{"url":"http://b.com/2","uri":"com.b\x5cu003e\x5cu003eo\x5cu003e/2","anchorText":"b2"},{"url":"http://b.com/3","uri":"com.b\x5cu003e\x5cu003eo\x5cu003e/3","anchorText":"b3"},{"url":"http://c.com/1","uri":"com.c\x5cu003e\x5cu003eo\x5cu003e/1","anchorText":"c1"}]}
um:d:06:\x03\x01com.b>>o>/\xb3|data|current|\x0b\x01\x00\x04
um:d:2d:\x03\x01com.a>>o>/\xb1|data|current|\x0b\x01\x02\x00
um:d:3c:\x03\x01com.c>>o>/\xb1|data|current|\x0b\x01\x00\x04
diff --git a/modules/integration/src/test/resources/5-pages.txt b/modules/integration/src/test/resources/5-pages.txt
index da94d8a..81316e0 100644
--- a/modules/integration/src/test/resources/5-pages.txt
+++ b/modules/integration/src/test/resources/5-pages.txt
@@ -1,5 +1,5 @@
-{"url":"http://app.cheezburger.com/Rokas08/TrophyDetails/13f82307-8f12-402e-a544-76db8a2dc19c","pageID":"com.cheezburger\u003e.app\u003eo\u003e/Rokas08/TrophyDetails/13f82307-8f12-402e-a544-76db8a2dc19c","numOutbound":19,"crawlDate":"2015-07-28T03:06:17Z","title":"Rokas08\u0026#39;s Profile - Trophy Details - Cheezburger.com","outboundLinks":[{"url":"https://www.facebook.com/Cheezburger","pageID":"com.facebook\u003e.www\u003es\u003e/Cheezburger","anchorText":"Facebook"},{"url":"https://plus.google.com/105247221600709734681","pageID":"com.google\u003e.plus\u003es\u003e/105247221600709734681","anchorText":"Google+"},{"url":"http://knowyourmeme.com/forums","pageID":"com.knowyourmeme\u003e\u003eo\u003e/forums","anchorText":"Forums"},{"url":"http://knowyourmeme.com/memes/popular","pageID":"com.knowyourmeme\u003e\u003eo\u003e/memes/popular","anchorText":"Popular Memes"},{"url":"http://knowyourmeme.com/photos/most-viewed","pageID":"com.knowyourmeme\u003e\u003eo\u003e/photos/most-viewed","anchorText":"All Images"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Aevent\u0026amp;sort\u003dnewest","pageID":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Aevent\u0026amp;sort\u003dnewest","anchorText":"New Events"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Aperson\u0026amp;sort\u003dnewest","pageID":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Aperson\u0026amp;sort\u003dnewest","anchorText":"New People"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Asite\u0026amp;sort\u003dnewest","pageID":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Asite\u0026amp;sort\u003dnewest","anchorText":"New Sites"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Asubculture\u0026amp;sort\u003dnewest","pageID":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Asubculture\u0026amp;sort\u003dnewest","anchorText":"New Subcultures"},{"url":"http://knowyourmeme.com/search?utf8\u003d%E2%9C%93\u0026amp;context\u003dentries\u0026amp;q\u003dstatus%3Aconfirmed+category%3Ameme","pageID":"com.knowyourmeme\u003e\u003eo\u003e/search?utf8\u003d%E2%9C%93\u0026amp;context\u003dentries\u0026amp;q\u003dstatus%3Aconfirmed+category%3Ameme","anchorText":"All Memes"},{"url":"http://knowyourmeme.com/videos/most-viewed","pageID":"com.knowyourmeme\u003e\u003eo\u003e/videos/most-viewed","anchorText":"All Videos"},{"url":"http://knowyourmeme.com?ref\u003dnavbar","pageID":"com.knowyourmeme\u003e\u003eo\u003e?ref\u003dnavbar","anchorText":"KYM Wiki"},{"url":"https://twitter.com/Cheezburger","pageID":"com.twitter\u003e\u003es\u003e/Cheezburger","anchorText":"Follow"},{"url":"http://chzb.gr/1riG0EZ?ref\u003dfooternav","pageID":"gr.chzb\u003e\u003eo\u003e/1riG0EZ?ref\u003dfooternav","anchorText":"Videos"},{"url":"http://chzb.gr/1riG0EZ?ref\u003dnavbar","pageID":"gr.chzb\u003e\u003eo\u003e/1riG0EZ?ref\u003dnavbar","anchorText":"Videos Find all our FAIL videos here!"},{"url":"http://chzb.gr/1riGhru?ref\u003dfooternav","pageID":"gr.chzb\u003e\u003eo\u003e/1riGhru?ref\u003dfooternav","anchorText":"Videos"},{"url":"http://chzb.gr/1riGhru?ref\u003dnavbar","pageID":"gr.chzb\u003e\u003eo\u003e/1riGhru?ref\u003dnavbar","anchorText":"Videos See all our Geek videos here!"},{"url":"http://chzb.gr/1riGzi6?ref\u003dfooternav","pageID":"gr.chzb\u003e\u003eo\u003e/1riGzi6?ref\u003dfooternav","anchorText":"Videos"},{"url":"http://chzb.gr/1riGzi6?ref\u003dnavbar","pageID":"gr.chzb\u003e\u003eo\u003e/1riGzi6?ref\u003dnavbar","anchorText":"Videos Watch and learn from all of our trolling videos here!"}]}
-{"url":"http://apple.stackexchange.com/help/badges/9/autobiographer?userid\u003d796","pageID":"com.stackexchange\u003e.apple\u003eo\u003e/help/badges/9/autobiographer?userid\u003d796","numOutbound":4,"crawlDate":"2015-07-28T01:32:26Z","server":"cloudflare-nginx","title":"Autobiographer - Badge - Ask Different","outboundLinks":[{"url":"http://apple.blogoverflow.com/","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","pageID":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","pageID":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}
-{"url":"http://apple.stackexchange.com/questions/15006/spotlight-sometimes-cant-find-a-file-that-actually-exists","pageID":"com.stackexchange\u003e.apple\u003eo\u003e/questions/15006/spotlight-sometimes-cant-find-a-file-that-actually-exists","numOutbound":6,"crawlDate":"2015-07-28T01:58:50Z","server":"cloudflare-nginx","title":"Spotlight sometimes can\u0026#39;t find a file. (that actually exists) - Ask Different","outboundLinks":[{"url":"http://askubuntu.com/questions/653335/using-sed-how-could-we-cut-a-specific-string-from-a-line-of-text","pageID":"com.askubuntu\u003e\u003eo\u003e/questions/653335/using-sed-how-could-we-cut-a-specific-string-from-a-line-of-text","anchorText":"Using sed, how could we cut a specific string from a line of text?"},{"url":"http://apple.blogoverflow.com/","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","pageID":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://stackoverflow.com/questions/31654274/is-it-ever-justified-to-have-an-object-which-has-itself-as-a-field","pageID":"com.stackoverflow\u003e\u003eo\u003e/questions/31654274/is-it-ever-justified-to-have-an-object-which-has-itself-as-a-field","anchorText":"Is it ever justified to have an object which has itself as a field?"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","pageID":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}
-{"url":"http://apple.stackexchange.com/users/208/john-allers","pageID":"com.stackexchange\u003e.apple\u003eo\u003e/users/208/john-allers","numOutbound":8,"crawlDate":"2015-07-28T01:40:51Z","server":"cloudflare-nginx","title":"User John Allers - Ask Different","outboundLinks":[{"url":"http://apple.blogoverflow.com/","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://serverfault.com/users/2870/","pageID":"com.serverfault\u003e\u003eo\u003e/users/2870/","anchorText":"Server Fault 111 111 3"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","pageID":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://stackoverflow.com/users/73986/","pageID":"com.stackoverflow\u003e\u003eo\u003e/users/73986/","anchorText":"Stack Overflow 2.2k 2.2k 11828"},{"url":"http://superuser.com/users/3552/","pageID":"com.superuser\u003e\u003eo\u003e/users/3552/","anchorText":"Super User 231 231 26"},{"url":"http://www.zooplet.com/","pageID":"com.zooplet\u003e.www\u003eo\u003e/","anchorText":"zooplet.com"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","pageID":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}
-{"url":"http://apple.stackexchange.com/users/3126/mjb?tab\u003dsummary","pageID":"com.stackexchange\u003e.apple\u003eo\u003e/users/3126/mjb?tab\u003dsummary","numOutbound":7,"crawlDate":"2015-07-28T01:53:49Z","server":"cloudflare-nginx","title":"User mjb - Ask Different","outboundLinks":[{"url":"http://apple.blogoverflow.com/","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","pageID":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://serverfault.com/users/117061/","pageID":"com.serverfault\u003e\u003eo\u003e/users/117061/","anchorText":"Server Fault"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","pageID":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://stackoverflow.com/users/581665/","pageID":"com.stackoverflow\u003e\u003eo\u003e/users/581665/","anchorText":"Stack Overflow"},{"url":"http://superuser.com/users/63808/","pageID":"com.superuser\u003e\u003eo\u003e/users/63808/","anchorText":"Super User"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","pageID":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}
+{"url":"http://app.cheezburger.com/Rokas08/TrophyDetails/13f82307-8f12-402e-a544-76db8a2dc19c","uri":"com.cheezburger\u003e.app\u003eo\u003e/Rokas08/TrophyDetails/13f82307-8f12-402e-a544-76db8a2dc19c","numOutbound":19,"crawlDate":"2015-07-28T03:06:17Z","title":"Rokas08\u0026#39;s Profile - Trophy Details - Cheezburger.com","outboundLinks":[{"url":"https://www.facebook.com/Cheezburger","uri":"com.facebook\u003e.www\u003es\u003e/Cheezburger","anchorText":"Facebook"},{"url":"https://plus.google.com/105247221600709734681","uri":"com.google\u003e.plus\u003es\u003e/105247221600709734681","anchorText":"Google+"},{"url":"http://knowyourmeme.com/forums","uri":"com.knowyourmeme\u003e\u003eo\u003e/forums","anchorText":"Forums"},{"url":"http://knowyourmeme.com/memes/popular","uri":"com.knowyourmeme\u003e\u003eo\u003e/memes/popular","anchorText":"Popular Memes"},{"url":"http://knowyourmeme.com/photos/most-viewed","uri":"com.knowyourmeme\u003e\u003eo\u003e/photos/most-viewed","anchorText":"All Images"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Aevent\u0026amp;sort\u003dnewest","uri":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Aevent\u0026amp;sort\u003dnewest","anchorText":"New Events"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Aperson\u0026amp;sort\u003dnewest","uri":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Aperson\u0026amp;sort\u003dnewest","anchorText":"New People"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Asite\u0026amp;sort\u003dnewest","uri":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Asite\u0026amp;sort\u003dnewest","anchorText":"New Sites"},{"url":"http://knowyourmeme.com/search?q\u003dcategory%3Asubculture\u0026amp;sort\u003dnewest","uri":"com.knowyourmeme\u003e\u003eo\u003e/search?q\u003dcategory%3Asubculture\u0026amp;sort\u003dnewest","anchorText":"New Subcultures"},{"url":"http://knowyourmeme.com/search?utf8\u003d%E2%9C%93\u0026amp;context\u003dentries\u0026amp;q\u003dstatus%3Aconfirmed+category%3Ameme","uri":"com.knowyourmeme\u003e\u003eo\u003e/search?utf8\u003d%E2%9C%93\u0026amp;context\u003dentries\u0026amp;q\u003dstatus%3Aconfirmed+category%3Ameme","anchorText":"All Memes"},{"url":"http://knowyourmeme.com/videos/most-viewed","uri":"com.knowyourmeme\u003e\u003eo\u003e/videos/most-viewed","anchorText":"All Videos"},{"url":"http://knowyourmeme.com?ref\u003dnavbar","uri":"com.knowyourmeme\u003e\u003eo\u003e?ref\u003dnavbar","anchorText":"KYM Wiki"},{"url":"https://twitter.com/Cheezburger","uri":"com.twitter\u003e\u003es\u003e/Cheezburger","anchorText":"Follow"},{"url":"http://chzb.gr/1riG0EZ?ref\u003dfooternav","uri":"gr.chzb\u003e\u003eo\u003e/1riG0EZ?ref\u003dfooternav","anchorText":"Videos"},{"url":"http://chzb.gr/1riG0EZ?ref\u003dnavbar","uri":"gr.chzb\u003e\u003eo\u003e/1riG0EZ?ref\u003dnavbar","anchorText":"Videos Find all our FAIL videos here!"},{"url":"http://chzb.gr/1riGhru?ref\u003dfooternav","uri":"gr.chzb\u003e\u003eo\u003e/1riGhru?ref\u003dfooternav","anchorText":"Videos"},{"url":"http://chzb.gr/1riGhru?ref\u003dnavbar","uri":"gr.chzb\u003e\u003eo\u003e/1riGhru?ref\u003dnavbar","anchorText":"Videos See all our Geek videos here!"},{"url":"http://chzb.gr/1riGzi6?ref\u003dfooternav","uri":"gr.chzb\u003e\u003eo\u003e/1riGzi6?ref\u003dfooternav","anchorText":"Videos"},{"url":"http://chzb.gr/1riGzi6?ref\u003dnavbar","uri":"gr.chzb\u003e\u003eo\u003e/1riGzi6?ref\u003dnavbar","anchorText":"Videos Watch and learn from all of our trolling videos here!"}]}
+{"url":"http://apple.stackexchange.com/help/badges/9/autobiographer?userid\u003d796","uri":"com.stackexchange\u003e.apple\u003eo\u003e/help/badges/9/autobiographer?userid\u003d796","numOutbound":4,"crawlDate":"2015-07-28T01:32:26Z","server":"cloudflare-nginx","title":"Autobiographer - Badge - Ask Different","outboundLinks":[{"url":"http://apple.blogoverflow.com/","uri":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","uri":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","uri":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","uri":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}
+{"url":"http://apple.stackexchange.com/questions/15006/spotlight-sometimes-cant-find-a-file-that-actually-exists","uri":"com.stackexchange\u003e.apple\u003eo\u003e/questions/15006/spotlight-sometimes-cant-find-a-file-that-actually-exists","numOutbound":6,"crawlDate":"2015-07-28T01:58:50Z","server":"cloudflare-nginx","title":"Spotlight sometimes can\u0026#39;t find a file. (that actually exists) - Ask Different","outboundLinks":[{"url":"http://askubuntu.com/questions/653335/using-sed-how-could-we-cut-a-specific-string-from-a-line-of-text","uri":"com.askubuntu\u003e\u003eo\u003e/questions/653335/using-sed-how-could-we-cut-a-specific-string-from-a-line-of-text","anchorText":"Using sed, how could we cut a specific string from a line of text?"},{"url":"http://apple.blogoverflow.com/","uri":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","uri":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","uri":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://stackoverflow.com/questions/31654274/is-it-ever-justified-to-have-an-object-which-has-itself-as-a-field","uri":"com.stackoverflow\u003e\u003eo\u003e/questions/31654274/is-it-ever-justified-to-have-an-object-which-has-itself-as-a-field","anchorText":"Is it ever justified to have an object which has itself as a field?"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","uri":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}
+{"url":"http://apple.stackexchange.com/users/208/john-allers","uri":"com.stackexchange\u003e.apple\u003eo\u003e/users/208/john-allers","numOutbound":8,"crawlDate":"2015-07-28T01:40:51Z","server":"cloudflare-nginx","title":"User John Allers - Ask Different","outboundLinks":[{"url":"http://apple.blogoverflow.com/","uri":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","uri":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://serverfault.com/users/2870/","uri":"com.serverfault\u003e\u003eo\u003e/users/2870/","anchorText":"Server Fault 111 111 3"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","uri":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://stackoverflow.com/users/73986/","uri":"com.stackoverflow\u003e\u003eo\u003e/users/73986/","anchorText":"Stack Overflow 2.2k 2.2k 11828"},{"url":"http://superuser.com/users/3552/","uri":"com.superuser\u003e\u003eo\u003e/users/3552/","anchorText":"Super User 231 231 26"},{"url":"http://www.zooplet.com/","uri":"com.zooplet\u003e.www\u003eo\u003e/","anchorText":"zooplet.com"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","uri":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}
+{"url":"http://apple.stackexchange.com/users/3126/mjb?tab\u003dsummary","uri":"com.stackexchange\u003e.apple\u003eo\u003e/users/3126/mjb?tab\u003dsummary","numOutbound":7,"crawlDate":"2015-07-28T01:53:49Z","server":"cloudflare-nginx","title":"User mjb - Ask Different","outboundLinks":[{"url":"http://apple.blogoverflow.com/","uri":"com.blogoverflow\u003e.apple\u003eo\u003e/","anchorText":"blog"},{"url":"http://apple.blogoverflow.com?blb\u003d1","uri":"com.blogoverflow\u003e.apple\u003eo\u003e?blb\u003d1","anchorText":"blog"},{"url":"http://serverfault.com/users/117061/","uri":"com.serverfault\u003e\u003eo\u003e/users/117061/","anchorText":"Server Fault"},{"url":"http://blog.stackoverflow.com/2009/06/attribution-required/","uri":"com.stackoverflow\u003e.blog\u003eo\u003e/2009/06/attribution-required/","anchorText":"attribution required"},{"url":"http://stackoverflow.com/users/581665/","uri":"com.stackoverflow\u003e\u003eo\u003e/users/581665/","anchorText":"Stack Overflow"},{"url":"http://superuser.com/users/63808/","uri":"com.superuser\u003e\u003eo\u003e/users/63808/","anchorText":"Super User"},{"url":"http://creativecommons.org/licenses/by-sa/3.0/","uri":"org.creativecommons\u003e\u003eo\u003e/licenses/by-sa/3.0/","anchorText":"cc by-sa 3.0"}]}