How to keep the only the top N values in a dataframe

You can use np.unique to sort and find the 5th largest value, and use where:

uniques = np.unique(df)

# what happens if len(uniques) < 5?
thresh = uniques[-5]
df.where(df >= thresh)

Output:

        day1  day2  day3  day4  day5
larry    NaN   4.0     4     3     5
gunnar   NaN   NaN     3     4     4
tin      NaN   5.0     5     6     7

Update: On the second look, I think you can do:

df.apply(pd.Series.nlargest, n=3,axis=1).reindex(df.columns, axis=1)

Output:

        day1  day2  day3  day4  day5
larry    NaN   4.0   4.0   NaN   5.0
gunnar   NaN   NaN   3.0   4.0   4.0
tin      NaN   5.0   NaN   6.0   7.0

To keep, for every row, the top 3 values you can use:

df = (df * df.astype(float).apply(lambda x: x.isin(x.nlargest(3)), axis=1)).replace('', np.nan)

You can migrate nlargest's parameter into a variable if needed.

Output:

        day1 day2 day3 day4 day5
larry    NaN    4    4  NaN    5
gunnar   NaN  NaN    3    4    4
tin      NaN    5    5    6    7

stack, sort, keep the top 3 per row with a groupby, and then unstack. We need a mergesort to ensure we always keep the first value per row in the case of ties. Reindex ensures we have all the original columns if there are no largest values in any of the rows for that column.

If you require Int in the end, the best you can do is 'Int64', pandas nullable-integer dtype.

# df = df.apply(pd.to_numeric)  # To make numeric, not object. 

N = 3
(df.stack()
   .sort_values(ascending=False, , kind='mergesort')
   .groupby(level=0).head(N)
   .unstack()
   .reindex(df.columns, axis=1)
   .astype('Int64'))

        day1  day2  day3  day4  day5
larry   <NA>     4     4  <NA>     5
gunnar  <NA>  <NA>     3     4     4
tin     <NA>     5  <NA>     6     7

Here are some timings vs the number of rows, and proof that these three methods are equivalent (before any type-casting). @anky's solution is very fast and simple.

import perfplot
import pandas as pd
import numpy as np


def alollz_stack(df, N=3):
    return (df.stack()
              .sort_values(ascending=False, kind='mergesort')
              .groupby(level=0).head(N)
              .unstack()
              .reindex(df.columns, axis=1))
    
def quang_nlargest(df, N=3):
    return df.apply(pd.Series.nlargest, n=N, axis=1).reindex(df.columns, axis=1)

def anky_rank(df, N=3):
    return df[df.iloc[:,::-1].rank(1,'first').ge(df.shape[1]-N+1)]
    

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.randint(0, 100, (n, 5)),
                                 columns=[f'col{i}' for i in range(1,6)]), 
    kernels=[
        lambda df: alollz_stack(df),
        lambda df: quang_nlargest(df),
        lambda df: anky_rank(df)],
    labels=['Stack', 'Apply_Nlargest', 'Rank'],
    n_range=[2 ** k for k in range(18)],
    equality_check=lambda x,y: np.allclose(x,y, equal_nan=True),
    xlabel='N_rows'
)

enter image description here

How to keep the only the top N values in a dataframe

Tags:

Python

Pandas

Numpy

Dataframe

Related

Recent Posts