| <html> |
| |
| <head> |
| <title>Rationale behind using JNI as opposed to threads in a remote JVM process</title> |
| <style> |
| <!-- |
| h1 { font-size: 18pt } |
| h2 { font-size: 14pt; margin-top: 12; margin-bottom: 3 } |
| p { margin-top: 0; margin-bottom: 6 } |
| --> |
| </style> |
| </head> |
| |
| <body> |
| <h1>Rationale behind using JNI as opposed to threads in a remote JVM process.</h1> |
| <p><font size="2">Java™ is a registered trademark of Sun Microsystems, Inc. in the United States and other countries.</font></p> |
| <h2>Reasons to use a high level language like Java™ in the backend</h2> |
| |
| <p>A large part of the reason why JNI was chosen in favor of an RPC based, |
| single JVM solution was due to the expected use-cases. Enterprise systems today are almost always |
| 3-tier or n-tier. Database functions, triggers, and stored procedures are |
| mechanisms that extend the functionality of the backend tier. They typically |
| rely on a tight integration with the database due to a very high rate of |
| interactions and execute inside of the database largely to limit the number of |
| interactions between the middle tier and the backend tier. Some typical |
| use-cases: </p> |
| |
| <ul type=disc> |
| <li>Referential integrity |
| enforcement. Using Java, referential integrity that goes beyond what can |
| be provided using the standard SQL semantics can be provided. It might |
| involve checking XML documents, enforcing some meta-driven rule system, or |
| other complex tasks that put high demands on the implementation language. </li> |
| <li>Advanced pattern recognition. |
| Soundex, image comparison, etc. </li> |
| <li>XML support functions. Java |
| comes with a lot of XML support. Parsers etc. are readily available. </li> |
| <li>Support functions for O/R |
| mappers. A variety of support can be implemented depending on design. One |
| example is an O/R mapper that allows methods on persistent objects. A lot |
| can be gained if such methods are pushed down and executed within the |
| database. Consider the following (OQL): |
| |
| <p><code> |
| SELECT AVG(x.salary - x.computeTax()) FROM Employee x WHERE x.salary > 120000; |
| </code></p> |
| |
| <p>Pushing the computeTax logic down to the database instead of |
| computing it in the middle tier (where much or the O/R logic resides) is a huge |
| gain from a performance standpoint. The statement could be transformed into SQL |
| as: </p> |
| |
| <p><code> |
| SELECT AVG(x.salary - computeTax(x.salary)) FROM Employee x WHERE x.salary > 120000; |
| </code></p> |
| |
| <p>As a result, very few interactions (typically only one) need |
| to be made between the middle and the backend tier. </p> |
| </ul> |
| <ul type=disc> |
| <li>Views and indexes making use |
| of computed values. In the above example and index could be created on |
| computeTax(x.salary) and a view could express that as net_income. </li> |
| <li>Message queue management. |
| Delivering or fetching things using message queues or other delivery |
| mechanisms. As with most interactions with other processes, this requires |
| transaction coordination of some kind. </li> |
| </ul> |
| |
| <p>One might argue that since a JVM often is present running an app-server in |
| the middle tier, would it not be more efficient if that JVM also executed the |
| database functions and triggers? In my opinion, this would be very bad. One |
| major reason for moving execution down to the database is performance (by |
| minimizing the number of roundtrips between the app-server and the database) |
| another is separation of concern. Referential data integrity and other ways to |
| extend the functionality of the database should not be the app-servers concern, |
| it belongs in the backend tier. Other aspects like database versus app-server |
| administration, replication of code and permission changes for functions, and |
| running different tiers on different servers, makes it even worse. </p> |
| |
| <h2>Resource consumption</h2> |
| |
| <p>Having one JVM per connection instead of one thread per connection running |
| in the same JVM will undoubtedly consume more resources. There are however a |
| couple of facts that must be remembered: </p> |
| |
| <ul type=disc> |
| <li>The overhead of multiple |
| processes is already present due to the fact that each connection is a |
| process in a PostgreSQL system. </li> |
| <li>In order to keep connections |
| separated in case they run in the same JVM, some kind of "compartments" |
| must be created. Either you create them using parallel class loader chains |
| (similar to how EAR files are managed in an EJB server) or you use a less |
| protective model similar to a servlet engine. In order to get a separation |
| that comparable to what you get using separate JVM's, you have to go for |
| the former. That consumes some resources. </li> |
| <li>The JVM has undergone a |
| series of improvements in order to reduce footprint and startup time. Some |
| significant improvements where made in Java 1.4 and Java 1.5 introduces |
| Java Heap Self Tuning, Class Data Sharing, and Garbage Collector |
| Ergonomics (read more <a |
| href="http://java.sun.com/j2se/1.5.0/docs/relnotes/features.html#vm">here</a>), |
| technologies that will minimize the startup time and make the JVM adopt |
| its resource consumption in a much improved way.</li> |
| <li>PL/Java can make use of the <a |
| href="http://gcc.gnu.org/java">GCJ</a>. Using this technology, all core |
| classes will be compiled into binaries and optionally pre-loaded by the |
| postmaster. It also means that all modules that are loaded using the |
| install_jar/replace_jar can be compiled into real shared objects. Finally, |
| it means that the footprint for each "JVM" will be significantly |
| decreased. </li> |
| </ul> |
| |
| <h2>Connection pooling</h2> |
| |
| <p>In the Java community you are very likely to use a connection pool. The pool |
| will ensure that the number of connections stays as low as possible and that |
| connections are reused (instead of closed and reestablished). New JVMs are |
| started rarely. </p> |
| |
| <h2>Connection isolation</h2> |
| |
| <p>Separate JVMs gives you a much higher degree of isolation. This brings a |
| number of advantages: </p> |
| |
| <ul type=disc> |
| <li>There's no problem attaching |
| a debugger to one connection (one JVM) while the others run unaffected. </li> |
| <li>There's no chance that one |
| connection manages to accidentally (or maliciously) exchange dirty data |
| with another connection. </li> |
| <li>A process that performs tasks |
| that consume a lot of CPU under a long period of time can be scheduled |
| with a lower priority using a simple OS command. </li> |
| <li>The JVMs can be brought down |
| and restarted individually. </li> |
| <li>Security policies are much |
| easier to enforce. </li> |
| </ul> |
| |
| <h2>Transaction visibility</h2> |
| |
| <p>In order to maintain the correct visibility, the transaction must somehow be |
| propagated to the Java layer. I can see two solutions for this using RPC. |
| Either an XA aware JDBC-driver is used (requires XA support from PostgreSQL) or |
| a JDBC driver is written so that it calls back to the SPI functions in the |
| invoking process. Both choices results in an increased number of RPC calls and |
| a negative performance impact. </p> |
| |
| <p>The PL/Java approach is to use the underlying SPI interfaces directly |
| through JNI by providing a "pseudo connection" that implements the |
| JDBC interfaces. The mapping is thus very direct. Data need never be serialized |
| nor duplicated. </p> |
| |
| <h2>RPC performance</h2> |
| |
| <p>Remote procedure calls are extremely expensive compared to in-process calls. |
| Relying on an RPC mechanism for Java calls will cripple the usefulness of such |
| an implementation a great deal. Here are two examples: </p> |
| |
| <ul type=disc> |
| <li>In order for an update |
| trigger to function using RPC, you can choose one of two approaches. |
| Either you limit the number of RPC calls and send two full Tuples (old and |
| new) and a Tuple Descriptor to the remote JVM, and then pass a third Tuple |
| (the modified new) back to the original, or you pass those structures by |
| reference (as CORBA remote objects) and perform one RPC call each time you |
| access them. You have a tradeoff between on one hand, limited |
| functionality and poor performance, and on the other, good functionality |
| and really bad performance. </li> |
| <li>When one or several Java |
| functions are used in the projection or filter of a SELECT statement on a |
| query processing several thousand rows, each row will cause at least one |
| call to Java. In case of RPC, this implies that the OS needs to do at |
| least two context switches (back and forth) for each row in the query. </li> |
| </ul> |
| |
| <p>Using JNI to directly access structures like TriggerData, Relation, |
| TupleDesc, and HeapTuple minimizes the amount of data that needs to be copied. |
| Parameters and return values that are primitives need not even become Java |
| objects. A 32-bit int4 Datum can be directly passed as a Java int (jint in |
| JNI). </p> |
| |
| <h2>Simplicity</h2> |
| |
| <p>I've have some experience of work involving CORBA and other RPCs. They add a |
| fair amount of complexity to the process. JNI however, is invisible to the |
| user. </p> |
| |
| </body> |
| |
| </html> |