| Soft functional dependencies |
| ============================ |
| |
| Functional dependencies are a concept well described in relational theory, |
| particularly in the definition of normalization and "normal forms". Wikipedia |
| has a nice definition of a functional dependency [1]: |
| |
| In a given table, an attribute Y is said to have a functional dependency |
| on a set of attributes X (written X -> Y) if and only if each X value is |
| associated with precisely one Y value. For example, in an "Employee" |
| table that includes the attributes "Employee ID" and "Employee Date of |
| Birth", the functional dependency |
| |
| {Employee ID} -> {Employee Date of Birth} |
| |
| would hold. It follows from the previous two sentences that each |
| {Employee ID} is associated with precisely one {Employee Date of Birth}. |
| |
| [1] https://en.wikipedia.org/wiki/Functional_dependency |
| |
| In practical terms, functional dependencies mean that a value in one column |
| determines values in some other column. Consider for example this trivial |
| table with two integer columns: |
| |
| CREATE TABLE t (a INT, b INT) |
| AS SELECT i, i/10 FROM generate_series(1,100000) s(i); |
| |
| Clearly, knowledge of the value in column 'a' is sufficient to determine the |
| value in column 'b', as it's simply (a/10). A more practical example may be |
| addresses, where the knowledge of a ZIP code (usually) determines city. Larger |
| cities may have multiple ZIP codes, so the dependency can't be reversed. |
| |
| Many datasets might be normalized not to contain such dependencies, but often |
| it's not practical for various reasons. In some cases, it's actually a conscious |
| design choice to model the dataset in a denormalized way, either because of |
| performance or to make querying easier. |
| |
| |
| Soft dependencies |
| ----------------- |
| |
| Real-world data sets often contain data errors, either because of data entry |
| mistakes (user mistyping the ZIP code) or perhaps issues in generating the |
| data (e.g. a ZIP code mistakenly assigned to two cities in different states). |
| |
| A strict implementation would either ignore dependencies in such cases, |
| rendering the approach mostly useless even for slightly noisy data sets, or |
| result in sudden changes in behavior depending on minor differences between |
| samples provided to ANALYZE. |
| |
| For this reason, extended statistics implement "soft" functional dependencies, |
| associating each functional dependency with a degree of validity (a number |
| between 0 and 1). This degree is then used to combine selectivities in a |
| smooth manner. |
| |
| |
| Mining dependencies (ANALYZE) |
| ----------------------------- |
| |
| The current algorithm is fairly simple - generate all possible functional |
| dependencies, and for each one count the number of rows consistent with it. |
| Then use the fraction of rows (supporting/total) as the degree. |
| |
| To count the rows consistent with the dependency (a => b): |
| |
| (a) Sort the data lexicographically, i.e. first by 'a' then 'b'. |
| |
| (b) For each group of rows with the same 'a' value, count the number of |
| distinct values in 'b'. |
| |
| (c) If there's a single distinct value in 'b', the rows are consistent with |
| the functional dependency, otherwise they contradict it. |
| |
| |
| Clause reduction (planner/optimizer) |
| ------------------------------------ |
| |
| Applying the functional dependencies is fairly simple: given a list of |
| equality clauses, we compute selectivities of each clause and then use the |
| degree to combine them using this formula |
| |
| P(a=?,b=?) = P(a=?) * (d + (1-d) * P(b=?)) |
| |
| Where 'd' is the degree of functional dependency (a => b). |
| |
| With more than two equality clauses, this process happens recursively. For |
| example for (a,b,c) we first use (a,b => c) to break the computation into |
| |
| P(a=?,b=?,c=?) = P(a=?,b=?) * (e + (1-e) * P(c=?)) |
| |
| where 'e' is the degree of functional dependency (a,b => c); then we can |
| apply (a=>b) the same way on P(a=?,b=?). |
| |
| |
| Consistency of clauses |
| ---------------------- |
| |
| Functional dependencies only express general dependencies between columns, |
| without referencing particular values. This assumes that the equality clauses |
| are in fact consistent with the functional dependency, i.e. that given a |
| dependency (a=>b), the value in (b=?) clause is the value determined by (a=?). |
| If that's not the case, the clauses are "inconsistent" with the functional |
| dependency and the result will be over-estimation. |
| |
| This may happen, for example, when using conditions on the ZIP code and city |
| name with mismatching values (ZIP code for a different city), etc. In such a |
| case, the result set will be empty, but we'll estimate the selectivity using |
| the ZIP code condition. |
| |
| In this case, the default estimation based on AVIA principle happens to work |
| better, but mostly by chance. |
| |
| This issue is the price for the simplicity of functional dependencies. If the |
| application frequently constructs queries with clauses inconsistent with |
| functional dependencies present in the data, the best solution is not to |
| use functional dependencies, but one of the more complex types of statistics. |