Example: Clustering using the FoF algorithm

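A friends-of-friends (FoF) algorithm is useful when you want to find groups of objects that lie close to each other and form clusters. Objects separated by no more than a given tolerance are linked, and every chain of linked objects ends up in the same group. Here is an example of how to perform clustering using the pycorrelator package.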
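To make the idea concrete, the following is a minimal brute-force sketch of friends-of-friends linking. It is not pycorrelator's implementation: it assumes that two objects are linked when their great-circle separation is at most the tolerance, and it merges linked objects with a small union-find structure. The helper names angular_separation and naive_fof are hypothetical, and the O(n^2) pair loop is only practical for small catalogs.

```python
import numpy as np


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    h = (np.sin((dec2 - dec1) / 2) ** 2
         + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2) ** 2)
    return np.degrees(2 * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0))))


def naive_fof(ra, dec, tolerance):
    """Group objects by brute force; returns a list of lists of object indices."""
    n = len(ra)
    parent = list(range(n))  # union-find forest: each object starts as its own root

    def find(i):
        # Walk up to the root, shortening the path along the way (path halving)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of objects closer than the tolerance (assumed linking rule)
    for i in range(n):
        for j in range(i + 1, n):
            if angular_separation(ra[i], dec[i], ra[j], dec[j]) <= tolerance:
                parent[find(j)] = find(i)

    # Collect the members of each root into one group
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Same coordinates as the mock catalog used in this example
ra = [80.894, 120.689, 10.689, 10.688, 10.689, 10.690, 120.690]
dec = [41.269, -41.269, -41.269, -41.270, -41.270, -41.269, -41.270]
groups = naive_fof(ra, dec, tolerance=0.01)
print(groups)  # [[0], [1, 6], [2, 3, 4, 5]]
```

Note that membership is transitive: two objects farther apart than the tolerance still share a group if a chain of closer neighbours connects them.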

First, let’s create a mock catalog:

import pandas as pd

# Create a mock catalog as a pandas DataFrame
catalog = pd.DataFrame([[80.894, 41.269, 15.5], [120.689, -41.269, 12.3],
                        [10.689, -41.269, 18.7], [10.688, -41.270, 14.1],
                        [10.689, -41.270, 16.4], [10.690, -41.269, 13.2],
                        [120.690, -41.270, 17.8]], columns=['ra', 'dec', 'mag'])

Note

If you want to use a format other than a pandas DataFrame, see the supported formats for more information.

fof()

Then, we can perform clustering with the FoF algorithm, using a tolerance of 0.01 degrees, by calling the pycorrelator.fof() function.

from pycorrelator import fof
result_object = fof(catalog, tolerance=0.01)

The result object contains the clustering results. Four methods are available to get the results in different formats:

get_group_dataframe()

To get the clustering results together with the appended data ('mag' in this case), use the pycorrelator.FoFResult.get_group_dataframe() method:

groups_df = result_object.get_group_dataframe()
print(groups_df)

Expected output:

                   Ra     Dec   mag
Group Object
0     0        80.894  41.269  15.5
1     1       120.689 -41.269  12.3
      6       120.690 -41.270  17.8
2     2        10.689 -41.269  18.7
      3        10.688 -41.270  14.1
      4        10.689 -41.270  16.4
      5        10.690 -41.269  13.2

This method returns a pandas DataFrame with two layers of indices: the group index and the object index from the original catalog.
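As a sketch of how the two index levels can be used, the snippet below rebuilds the table above by hand (so that it runs without pycorrelator) and selects rows with standard pandas MultiIndex tools. It assumes the index levels are named 'Group' and 'Object', as the printed output suggests.

```python
import pandas as pd

# Hand-built copy of the table printed above (stand-in for get_group_dataframe())
index = pd.MultiIndex.from_tuples(
    [(0, 0), (1, 1), (1, 6), (2, 2), (2, 3), (2, 4), (2, 5)],
    names=["Group", "Object"],
)
groups_df = pd.DataFrame(
    {
        "Ra": [80.894, 120.689, 120.690, 10.689, 10.688, 10.689, 10.690],
        "Dec": [41.269, -41.269, -41.270, -41.269, -41.270, -41.270, -41.269],
        "mag": [15.5, 12.3, 17.8, 18.7, 14.1, 16.4, 13.2],
    },
    index=index,
)

# All members of group 2; the result is indexed by the remaining 'Object' level
group_2 = groups_df.loc[2]
print(group_2.index.tolist())  # [2, 3, 4, 5]

# Look up which group object 6 belongs to; the result is indexed by 'Group'
obj_6 = groups_df.xs(6, level="Object")
print(obj_6.index.tolist())  # [1]
```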

You can iterate over the groups with groupby():

for group_index, group in groups_df.groupby('Group'):
    print(f"Print group {group_index}:")
    print(f"The type of group is {type(group)}.")
    print(group, end="\n\n")

Expected output:

Print group 0:
The type of group is <class 'pandas.core.frame.DataFrame'>.
                  Ra     Dec   mag
Group Object
0     0       80.894  41.269  15.5

Print group 1:
The type of group is <class 'pandas.core.frame.DataFrame'>.
                   Ra     Dec   mag
Group Object
1     1       120.689 -41.269  12.3
      6       120.690 -41.270  17.8

Print group 2:
The type of group is <class 'pandas.core.frame.DataFrame'>.
                  Ra     Dec   mag
Group Object
2     2       10.689 -41.269  18.7
      3       10.688 -41.270  14.1
      4       10.689 -41.270  16.4
      5       10.690 -41.269  13.2

Each group is also a pandas DataFrame.

Note

The iterator returned by groupby() is extremely slow for large datasets. The current workaround is to flatten the DataFrame into a single index level and manipulate the index directly, or even to convert the DataFrame into a numpy array.

If you want a DataFrame with a single index level and the size of each group as a column, you can use the following code:

groups_df['group_size'] = groups_df.groupby('Group')['Ra'].transform('size')
groups_df.reset_index(level='Group', inplace=True)
print(groups_df)

Expected output:

        Group       Ra     Dec   mag  group_size
Object
0           0   80.894  41.269  15.5           1
1           1  120.689 -41.269  12.3           2
6           1  120.690 -41.270  17.8           2
2           2   10.689 -41.269  18.7           4
3           2   10.688 -41.270  14.1           4
4           2   10.689 -41.270  16.4           4
5           2   10.690 -41.269  13.2           4
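The note above recommends avoiding the groupby() iterator on large results. One possible way to do that, sketched here under the assumption that group indices are consecutive integers starting at 0 (as in the output above), is to pull the 'Group' column out as a numpy array and aggregate with vectorized numpy calls. The flattened table is rebuilt by hand so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the flattened table printed above
flat = pd.DataFrame(
    {
        "Group": [0, 1, 1, 2, 2, 2, 2],
        "Ra": [80.894, 120.689, 120.690, 10.689, 10.688, 10.689, 10.690],
        "Dec": [41.269, -41.269, -41.270, -41.269, -41.270, -41.270, -41.269],
        "mag": [15.5, 12.3, 17.8, 18.7, 14.1, 16.4, 13.2],
    },
    index=pd.Index([0, 1, 6, 2, 3, 4, 5], name="Object"),
)

group_ids = flat["Group"].to_numpy()
mags = flat["mag"].to_numpy()

# np.bincount counts (or sums weights) per integer label in one pass
sizes = np.bincount(group_ids)
mean_mag = np.bincount(group_ids, weights=mags) / sizes

# Brightest (smallest) magnitude per group via an unbuffered in-place minimum
min_mag = np.full(sizes.size, np.inf)
np.minimum.at(min_mag, group_ids, mags)

print(sizes.tolist())    # [1, 2, 4]
print(min_mag.tolist())  # [15.5, 12.3, 13.2]
```

None of these calls loops over groups in Python, which is why this style tends to scale better than iterating over groupby() results.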

get_group_sizes()

To get the size of each group in the order of the group index, use the pycorrelator.FoFResult.get_group_sizes() method:

print(result_object.get_group_sizes())

Expected output:

[1, 2, 4]
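Because the sizes are ordered by group index, position i in the result describes group i. The short sketch below uses the list printed above as a stand-in for the method's return value (its exact return type is not assumed here) to pick out groups with at least two members and to count isolated objects.

```python
import numpy as np

# Stand-in for result_object.get_group_sizes(), copied from the output above
group_sizes = [1, 2, 4]
sizes = np.asarray(group_sizes)

# Indices of groups that contain at least two objects
rich_groups = np.flatnonzero(sizes >= 2)
print(rich_groups.tolist())  # [1, 2]

# Number of objects that ended up alone in their group
n_isolated = int(np.count_nonzero(sizes == 1))
print(n_isolated)  # 1

# The sizes add up to the number of objects in the catalog
print(int(sizes.sum()))  # 7
```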

get_coordinates()

To get the coordinates of the objects in each group, use the pycorrelator.FoFResult.get_coordinates() method:

print(result_object.get_coordinates())

Expected output:

[[(80.894, 41.269)],
 [(120.689, -41.269), (120.69, -41.27)],
 [(10.689, -41.269), (10.688, -41.27), (10.689, -41.27), (10.69, -41.269)]]

get_group_coordinates()

To get the center coordinates of each group, use the pycorrelator.FoFResult.get_group_coordinates() method:

print(result_object.get_group_coordinates())

Expected output:

[(80.894, 41.269), (120.6895, -41.2695), (10.689, -41.2695)]
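This page does not state how the centers are computed. As an illustration only, the sketch below derives a center for each group from the per-object coordinates by averaging unit vectors on the sphere, which avoids the problem a plain average of right ascensions has at the 0/360 degree boundary. The helper name spherical_center is hypothetical, and pycorrelator may use a different definition.

```python
import numpy as np


def spherical_center(points):
    """Mean direction of (ra, dec) points given in degrees, via unit vectors."""
    ra, dec = np.radians(np.asarray(points, dtype=float)).T
    # Convert to Cartesian unit vectors and average each component
    x = np.mean(np.cos(dec) * np.cos(ra))
    y = np.mean(np.cos(dec) * np.sin(ra))
    z = np.mean(np.sin(dec))
    # Convert the mean vector back to angles; keep RA in [0, 360)
    center_ra = np.degrees(np.arctan2(y, x)) % 360.0
    center_dec = np.degrees(np.arctan2(z, np.hypot(x, y)))
    return float(center_ra), float(center_dec)


# Stand-in for result_object.get_coordinates(), copied from the output above
coordinates = [
    [(80.894, 41.269)],
    [(120.689, -41.269), (120.69, -41.27)],
    [(10.689, -41.269), (10.688, -41.27), (10.689, -41.27), (10.69, -41.269)],
]

centers = [spherical_center(group) for group in coordinates]
for ra_c, dec_c in centers:
    print(f"({ra_c:.4f}, {dec_c:.4f})")
```

For groups this compact, the result agrees with the centers listed above to well below the printed precision; the two approaches only diverge for extended groups or groups that straddle RA = 0.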