An aggregate function or clause takes a group of records (which could be an entire table or a subgroup in that table) and from all the values for a field in that group computes one or more resulting values. For example, the aggregate function Sum(<field>) takes all values for a given field in the group, sums them and returns the sum. The aggregate function Avg(<field>) returns the average of values for a field in the group. Aggregates ignore NULL values.

Using aggregates is easy: given a table called Expenses that has a field called Payment if we want to know the sum of all payments and the average of all payments we can write:

SELECT Sum(Payment), Avg(Payment)

FROM Expenses;

Aggregates include Sum, Avg, Count, Max and Min. Max and Min aggregates work for dates and strings and also for uuid and binary values. Max and Min aggregates work for xN vector values. See also the SQL Functions and INLINE topics for examples.

Aggregates are often used with the GROUP BY clause, which creates subgroups from a table. The aggregate can then operate on each subgroup to create the desired aggregate summary of the record values in that subgroup. If our Expenses table has a field called Payment and a field called Job that give payments made for each job that was done, we can get a list of each job category and the sum of payments for each job subgroup with:

SELECT Job, Sum(Payment)

FROM Expenses GROUP BY Job;

If there is no GROUP BY clause, the aggregate function will operate on the entire table. See the GROUP BY, COLLECT, and SPLIT Tutorial topic for aggregate examples using COLLECT and Sum together with GROUP BY.

Most aggregates are so simple and obvious, like Sum, they do not require additional commentary. The following notes discuss aggregates that are less obvious or have nuances in their use.

Depending on the argument, the Count aggregate function counts either the number of non-NULL values in a field or expression within a group, or it counts the total number of records within a group:

Count(field) - will count the number of non-NULL values in field.

Count(expression) - will count the number of non-NULL values of the expression.

Count(*) - will always count the total number of records.

A classic example is finding the number of duplicates in a City field within a table, using Count(*):

SELECT [City], Count(*) FROM [Employees]

GROUP BY [City]

HAVING Count(*) > 1;

See the discussion of the above example in the Queries topic.

The COLLECT aggregate collects values from a subgroup into a table. COLLECT returns a table with one or more fields and zero or more records. The values in that table normally are just regular values and are not tables.

Modern databases such as Oracle or PostgreSQL often allow control over the values seen by aggregates. For example, a user might write a query that contains several Sum aggregate functions using the same field with each Sum computing a partial sum utilizing a filter specified within the Sum, for example,

SUM(x FILTER ...)

The COLLECT aggregate provides a generic aggregate for performing such operations and other operations as well. For example:

SELECT name, Count(*),

SPLIT (COLLECT property, value)

FROM mfd_meta GROUP BY name;

In the above example SPLIT is used merely to show which values are in each COLLECT.

The table returned by COLLECT is typically fed into a function. This essentially allows creating custom aggregates, including with scripts.

We can also COLLECT expressions similar to how SELECT can use expressions:

SELECT name, Count(*),

SPLIT (COLLECT property, StringToUppercase(value))

FROM mfd_meta GROUP BY name;

One way to understand what COLLECT does is to start with what GROUP does. GROUP takes a table, sorts the records within that table into groups and then produces a record for each group by applying aggregates to derive that one record for the group.

Each aggregate takes a group, which is a subset of records from the original table, and from the values in that group computes one or more resulting values for the resulting record. For example, the aggregate Sum(f) takes all values in the group, sums them and returns the sum.

COLLECT works the same way: it is like a SELECT which runs on a group. COLLECT takes a table and returns a table without requiring us to write a FROM section as we would with a SELECT.

Notes on COLLECT

COLLECT supports ORDER BY using the same syntax as using ORDER BY in SELECT statements, including ordering by expression.

COLLECT supports DISTINCT, for example as in

(COLLECT DISTINCT value)

COLLECT supports WHERE.

In addition to the aggregates mentioned above, other aggregate functions are available:

Diversity - Takes a set of numbers and computes the total number of different values.

DiversityIndex - Takes a set of numbers and computes a measure of diversity using the formula 1 - sum(individualcount^2) / (totalcount^2) A diversity index of 0 means that all values are the same.

GeomMergeAreas takes a set of areas and returns an area geom that contains branches. Curves and Z values are removed.

GeomMergeLines takes a set of lines and returns a line geom that contains branches. Curves and Z values are removed.

GeomMergePoints takes a set of points and returns a multipoint geom that contains branches. Z values are removed.

GeomUnionAreas takes a set of areas and returns their union.

GeomUnionRects takes a set of x4 values and returns their union.

First and Last return the first and last value respectively. Unlike most other aggregates, First and Last do not skip NULLs.

JoinAnd, JoinOr and JoinXor combine boolean values using And, Or and Xor operators.

JoinBitAnd, JoinBitOr and JoinBitXor combine numeric values using BitAnd, BitOr and BitXor operators.

Median returns the median value of an arbitrary type.

Major - Takes a set of numbers and returns the most frequently occurring value.

StDev and StDevPop compute sample standard deviation and population standard deviation.

StringJoinTokens takes a set of string values and joins them into a single string with the specified separator character(s).

Var and VarPop compute sample variance and population variance.

Covar and CovarPop compute sample covariance and population covariance, taking two parameters.

Corr computes correlation, taking two parameters.

Do aggregates require using GROUP BY? No. We can use aggregates without using a GROUP BY, in which case the aggregates will apply to the entire table. For example, given a table called Expenses that has a field called Payments we can write:

SELECT Sum(Payment), Avg(Payment)

FROM Expenses;

Using any aggregate in a SELECT list tells the query engine the SELECT is about aggregates even if there is no GROUP BY. If there is a GROUP BY, the aggregates in the SELECT list will be applied to each subgroup created by the GROUP BY, but if there is no GROUP BY the SELECT will take the entire table as a group, applying the aggregate to values in the entire table.

Geared up for aggregates, the SELECT will
process the field list insisting that any fields in that list are either
the result of an aggregate or are the subject of a GROUP BY. Therefore,
if we had some other field in our table called Description
we could *not*
write:

-- NOT a valid query

SELECT Description, Sum(Payment), Avg(Payment)

FROM Expenses;

The above will fail with an error message saying that Description must be either part of the GROUP BY or part of an aggregate. See the GROUP BY, COLLECT, and SPLIT Tutorial topic for examples.

When implementing aggregates any query engine must decide how far an aggregate can go from the producing GROUP, whether that is explicit or implicit. For example, considering the fragment

SELECT ... Max(a) ... FROM t GROUP BY b

the query engine must decide what can be allowed or disallowed within the ellipses. Manifold allows everything except nested SELECTs and functions.

Nested Selects - An aggregate may not go into a nested SELECT. So, for example, the following works:

FUNCTION f(@t TABLE, @u NVARCHAR) TABLE AS

(SELECT mfd_id, type FROM @t WHERE type>@u) END;

SELECT name, SPLIT CALL f(mfd_root, Max(value))

FROM mfd_meta GROUP BY name;

But the following does not work:

SELECT name, SPLIT

(SELECT mfd_id, type FROM mfd_root

WHERE type>Max(value))

FROM mfd_meta GROUP BY name;

The above does not work because the inner SELECT treats Max(value) as its own, not that of the outer SELECT. If the inner SELECT accepted aggregates from the outer SELECT, we could end up with something like the following, which does not work:

SELECT name, SPLIT

(SELECT mfd_id, type FROM mfd_root

WHERE Max(name) > Max(value)

GROUP BY mfd_id, type)

FROM mfd_meta GROUP BY name;

In the above it would be difficult for a reader to tell which Max in the inner SELECT belongs to that SELECT and which belonged to the outer SELECT. In such cases it would be easy to create queries that would compile and run but which did not work as intended.

To prevent such problems, in Manifold a SELECT acts as a fence for aggregates. All aggregates inside a SELECT belong to that SELECT.

An aggregate may not cross over between the outside and the inside of the body of a FUNCTION. That does not mean we cannot use aggregates within functions. Aggregates may be used within functions but they either must be completely inside a function or completely outside a function. The boundary between the inside and outside of a function is also like a fence that an aggregate may not cross.

For example, the following works because the aggregate is completely outside the function:

FUNCTION f(@a INT32) INT32 AS @a+100 END;

SELECT name, f(Count(*))

FROM mfd_meta GROUP BY name;

The following works as well (aggregate completely inside):

FUNCTION f(@t TABLE) TABLE AS

(SELECT name, Count(*)+100 FROM @t GROUP BY name) END;

TABLE CALL f(mfd_meta);

The following works too with aggregates inside and outside the function but separated from each other:

FUNCTION f(@t TABLE, @n NVARCHAR) TABLE AS

(SELECT Count(*) FROM @t WHERE name=@n) END;

SELECT SPLIT CALL f(mfd_meta, Max(name)) FROM mfd_root;

This works:

SELECT Min(mfd_id) + Max(mfd_id) FROM mfd_meta;

But the following does not work:

FUNCTION f(@a INT32) INT32 AS Min(@a) + Max(@a) END;

SELECT f(mfd_id) FROM mfd_meta;

But this does work:

FUNCTION f(@a INT32, @b INT32) INT32 AS @a + @b END;

SELECT f(Min(mfd_id), Max(mfd_id)) FROM mfd_meta;

In general, everything else apart from nested SELECTs and mixing between the inside and outside of functions is fair game. For example, the following works:

SELECT SPLIT (VALUES (Min(mfd_id)), (Max(mfd_id)))

FROM mfd_meta;

..and this works as well:

SELECT SPLIT

(EXECUTE WITH (@n NVARCHAR=Max(name))

[[ SELECT * FROM mfd_root WHERE name=@n ]])

FROM mfd_meta;

Merge and Union aggregates usually also have a non-aggregate "Pair" version of the function that operates between two such types. For example, the aggregate GeomMergeLines function has a GeomMergeLinesPair equivalent that operates to merge two line objects.

We will try passing a non-line to GeomMergeLinesPair:

--SQL

TABLE CALL GeomToCoords(GeomMergeLinesPair(

GeomConvertToArea(GeomMakeRect(VectorMakeX4(3, 3, 4, 4))),

GeomConvertToLine(GeomMakeRect(VectorMakeX4(7, 7, 8, 8)))

))

...it ignored the non-line and returned branches for the line, the second argument. If we passed two non-lines it would have returned a NULL value.

The non-aggregate merge and union functions only take two arguments. If we have three objects, how do we merge them?

We can do it like this:

--SQL

SELECT GeomMergePoints(f) FROM

(VALUES

(GeomMakePoint(VectorMakeX2(0, 0))),

(GeomMakePoint(VectorMakeX2(1, 1))),

(GeomMakePoint(VectorMakeX2(2, 2)))

AS (f))

Or like this in the case of five objects (the same approach works for more than five):

--SQL

FUNCTION MakeSeveralPoints(@p GEOM, @q GEOM, @r GEOM, @s GEOM, @t GEOM) TABLE AS

(VALUES (@p), (@q), (@r), (@s), (@t) AS (f))

END;

SELECT GeomMergePoints(f) FROM CALL MakeSeveralPoints(

GeomMakePoint(VectorMakeX2(0, 0)),

GeomMakePoint(VectorMakeX2(1, 1)),

GeomMakePoint(VectorMakeX2(2, 2)),

GeomMakePoint(VectorMakeX2(3, 3)),

GeomMakePoint(VectorMakeX2(4, 4)))

Finally, we take a look at the difference between GeomMergeAreas and GeomUnionAreas. We run the following one by one:

--SQL

SELECT SPLIT CALL GeomToCoords(GeomMergeAreas(f)) FROM

(VALUES

(GeomMakeRect(VectorMakeX4(0, 0, 7, 7))), -- A

(GeomMakeRect(VectorMakeX4(2, 2, 3, 3))), -- B

(GeomMakeRect(VectorMakeX4(8, 8, 9, 9))) -- C

AS (f))

--SQL

SELECT SPLIT CALL GeomToCoords(GeomUnionAreas(f)) FROM

(VALUES

(GeomMakeRect(VectorMakeX4(0, 0, 7, 7))), -- A

(GeomMakeRect(VectorMakeX4(2, 2, 3, 3))), -- B

(GeomMakeRect(VectorMakeX4(8, 8, 9, 9))) -- C

AS (f))

GeomMergeAreas produces three branches while GeomUnionAreas produces two branches because B is completely covered by A and disappears in the union.

2 billion / 1 billion record limitations - Median, Diversity, DiversityIndex, Major are currently limited to 2 billion values, as a practical matter limiting their use to aggregates of no more than 2 billion records.. Corr, Covar, CovarPop are currently limited to 1 billion pairs of values. Future builds will remove those limits. StDev, StDevPop, Var, VarPop were also limited to 2 billion values in the past, but no longer have that limit.

Median aggregate - The behavior of Median on an even number of values is an implementation choice. Consider taking the median of two integers, 1, and 2. What is the median? 8 selects the lowest value of a pair of central values. 9 selects the highest value, primarily for historical reasons. Manifold code could be adjusted to selecting the lowest value like 8 does, simply to avoid breaking compatibility for little reason. That might happen in future builds. However, it could be that was is really needed here is a variant of Median that takes an average.

Nulls not skipped - First and Last do not skip NULLs because if they did a construction like

SELECT First(a), First(b)

could return values from different records.

Unused arguments - A function that does not use one of its arguments will not fail to compile when passed an aggregate. Instead, the aggregate is optimized away and is not computed. For example:

FUNCTION f(@t NVARCHAR) INT32 AS 5 END;

SELECT name, f(Max(property))

FROM mfd_meta GROUP BY name;

Will compile and run OK.

Streaming mode - Aggregates operate in streaming mode, spilling excess data to disk if required. This both allows handling bigger amounts of data at predictable speed and also protects against running out of memory if the system is low on memory. Protection against running out of memory is not guaranteed by streaming as it is still possible to run out of memory, but doing so is much more difficult.

GROUP BY, COLLECT, and SPLIT Tutorial