Distinct and GroupBy

In the Laravel-Elasticsearch integration, distinct() and groupBy() methods play a pivotal role in data aggregation, especially when retrieving unique field values or aggregating grouped data summaries.

Overview

The distinct() and groupBy() methods are used to retrieve unique values of a given field. But they differ significantly in how they’re implemented under the hood

distinct() uses: Nested Term Aggs
- Full results, cannot paginate
- Potentially more expensive than groupBy()
- Sort on column names and by doc_count (useful for ordering by most aggregated values)
- Ideal For
  - Quick unique field lookups by doc count
groupBy() uses: Composite Aggregation
- Can paginate
- Less expensive than distinct()
- Sort on column names only (no doc_count)
- Ideal For
  - Large datasets
  - Paginated results

Distinct

Basic Usage

->distinct(string|array $field, $includeDocCount = false);

UserLog::where('created_at', '>=', Carbon::now()->subDays(30))->distinct('user_id');

// Alternative syntax, explicitly selecting the field
UserLog::where('created_at', '>=', Carbon::now()->subDays(30))->select('user_id')->distinct();

Retrieves All the unique user_ids of users logged in the last 30 days

{
    "index": "user_logs",
    "body": {
        "query": {
            "range": {
                "created_at": {
                    "gte": "2025-02-21T18:41:18+00:00"
                }
            }
        },
        "size": 0,
        "aggs": {
            "by_user_id": {
                "terms": {
                    "field": "user_id",
                    "size": 1000
                }
            }
        }
    },
    "_source": [
        "user_id"
    ]
}

Note on Limits

Distinct will run a Nested Term Aggs on a given query. The aggregation will be limited to the number of records returned by the query - which will come from the default limit in your connection settings (1000 if unchanged) if you do not set a limit in your query.

ex: UserLog::distinct('user_id') - will run a term aggregation on the first 1000 records returned by the query
ex: UserLog::limit(5)->distinct('user_id') - will run a term aggregation on the first 5 records returned by the query
ex: UserLog::orderByDesc('created_at')->distinct('user_id') - will run a term aggregation on the latest 1000 records returned by the query
ex: UserLog::limit(10000)->distinct('user_id') - will run a term aggregation on the first 10000 records returned by the query

Multiple Fields

UserLog::where('created_at', '>=', Carbon::now()->subDays(30))->distinct(['status', 'log_code']);

Retrieves All the unique combinations of status and log_code of users logged in the last 30 days

{
    "index": "user_logs",
    "body": {
        "query": {
            "range": {
                "created_at": {
                    "gte": "2025-02-21T18:42:19+00:00"
                }
            }
        },
        "size": 0,
        "aggs": {
            "by_status": {
                "terms": {
                    "field": "status",
                    "size": 1000
                },
                "aggs": {
                    "by_log_code": {
                        "terms": {
                            "field": "log_code",
                            "size": 1000
                        }
                    }
                }
            }
        }
    },
    "_source": [
        "status",
        "log_code"
    ]
}

Ordering by Aggregation Count

_count is a special internal alias used by the package to sort by aggregation result count.

_count is not available with groupBy().

UserLog::where('created_at', '>=', Carbon::now()->subDays(30))->orderByDesc('_count')->distinct('user_id');

Retrieves All the unique user_ids of users logged in the last 30 days Ordered by most logs by a user

{
    "index": "user_logs",
    "body": {
        "query": {
            "range": {
                "created_at": {
                    "gte": "2025-02-21T18:47:14+00:00"
                }
            }
        },
        "size": 0,
        "aggs": {
            "by_user_id": {
                "terms": {
                    "field": "user_id",
                    "size": 1000,
                    "order": [
                        {
                            "_count": "desc"
                        }
                    ]
                }
            }
        }
    },
    "_source": [
        "user_id"
    ]
}

Return with doc count

UserLog::where('created_at', '>=', Carbon::now()->subDays(30))
    ->orderByDesc('_count')
    ->distinct('user_id',true);

//OR

UserLog::where('created_at', '>=', Carbon::now()->subDays(30))
    ->orderByDesc('_count')->select('user_id')
    ->distinct(true);

Retrieves All the unique user_ids of users logged in the last 30 days Ordered by most logs by a user

{
    "index": "user_logs",
    "body": {
        "query": {
            "range": {
                "created_at": {
                    "gte": "2025-02-21T18:47:14+00:00"
                }
            }
        },
        "size": 0,
        "aggs": {
            "by_user_id": {
                "terms": {
                    "field": "user_id",
                    "size": 1000,
                    "order": [
                        {
                            "_count": "desc"
                        }
                    ]
                }
            }
        }
    },
    "_source": [
        "user_id"
    ]
}

{
    "user_id": "1",
    "user_id_count": 24
},
{
    "user_id": "2",
    "user_id_count": 18
},
{
    "user_id": "3",
    "user_id_count": 8
},

GroupBy

...->groupBy(string|array $field)->....

Basic Usage

UserLog::where('created_at', '>=', Carbon::now()->subDays(30))->groupBy(['user_id'])->get();

Retrieves All the unique user_ids of users logged in the last 30 days

{
    "index": "user_logs",
    "body": {
        "query": {
            "range": {
                "created_at": {
                    "gte": "2025-02-22T09:20:46+00:00"
                }
            }
        },
        "size": 0,
        "aggs": {
            "group_by": {
                "composite": {
                    "sources": [
                        {
                            "user_id": {
                                "terms": {
                                    "field": "user_id.keyword"
                                }
                            }
                        }
                    ]
                }
            }
        }
    },
    "_source": [
        "*"
    ]
}

GroupBy take and skip

If no limit is provided, then the aggregation will return the first 10 results as per Elasticsearch default.

You can set the limit to whatever you want, but it is not necessary to cover all the hits of your query since the composite is pagable.

UserLog::groupBy(['user_id'])->take(15)->skip(4)->get();

{
    "index": "user_logs",
    "body": {
        "size": 0,
        "aggs": {
            "group_by": {
                "composite": {
                    "sources": [
                        {
                            "user_id": {
                                "terms": {
                                    "field": "user_id.keyword"
                                }
                            }
                        }
                    ],
                    "size": 15,
                    "after": {
                        "user_id": "158e6cc9-fa3e-3258-99c2-571d052e923c"
                    }
                }
            }
        }
    },
    "_source": [
        "*"
    ]
}

Multiple Fields

UserLog::where('created_at', '>=', Carbon::now()->subDays(30))->groupBy(['user_id','status'])->get();

{
    "index": "user_logs",
    "body": {
        "query": {
            "range": {
                "created_at": {
                    "gte": "2025-02-22T11:26:48+00:00"
                }
            }
        },
        "size": 0,
        "aggs": {
            "group_by": {
                "composite": {
                    "sources": [
                        {
                            "user_id": {
                                "terms": {
                                    "field": "user_id.keyword"
                                }
                            }
                        },
                        {
                            "status": {
                                "terms": {
                                    "field": "status"
                                }
                            }
                        }
                    ]
                }
            }
        }
    },
    "_source": [
        "*"
    ]
}

Model Relations

The results from distinct() and groupBy() queries are returned as ElasticCollections, enabling the use of Laravel’s rich collection methods for further manipulation or processing.

Example: User has many user logs:

// Loading related user data from the distinct user_ids
$users = UserLog::where('created_at', '>=', Carbon::now()->subDays(30))->groupBy(['user_id'])->get();
return $users->load('user');

Loads the related user data for the distinct user_ids

GroupBy Pagination with Relations

Example:

UserLog::where('created_at', '>=', Carbon::now()->subDays(30))
    ->groupBy(['user_id'])
    ->paginate(10)->through(function ($userLog) {
        $userLog->load('user');
        return $userLog;
    }
);