Creating Probabilistic Pandas Datatypes
Imagine you’re doing some kind of tabular data work in Pandas, but rather than working with the same old concrete types, you’re working with some kind of uncertainty. Perhaps you have a benchmark suite that runs lots of microbenchmarks like this:
import random
from typing import List

my_list = [random.randint(0, 10) for _ in range(1000)]

def sum_list(x: List[int]) -> int:
    return sum(x)

%timeit sum_list(my_list)
6.29 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
(Putting it in the same terms, 6.29 µs ± 0.125 µs)
If I have a bunch of different files and I benchmark each of them, then I’ll have a group of independent random measurements. If I want to sum them up, I could just sum the means, but what if I want to estimate the uncertainty of the whole sum? Let’s create a new datatype that works in Pandas!
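Before building the datatype, it's worth seeing the underlying math: for independent measurements, means add and standard deviations combine in quadrature. A quick stdlib preview of what we're after (the benchmark numbers below are illustrative, not real measurements):

```python
import math

# three hypothetical benchmark results as (mean_us, std_us)
benchmarks = [(6.29, 0.125), (4.10, 0.200), (9.87, 0.090)]

total_mean = sum(mu for mu, _ in benchmarks)
# standard deviations of independent measurements add in quadrature
total_std = math.hypot(*(sigma for _, sigma in benchmarks))
print(f"{total_mean:.2f} us +/- {total_std:.3f} us")  # 20.26 us +/- 0.252 us
```

The goal of the datatype is to make this bookkeeping happen automatically whenever we use ordinary arithmetic.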
One possible solution is to create a Distribution datatype which can be computed upon in Pandas like a normal Numpy number. Internally this object could exist a number of ways, but let’s discuss an AnalyticNormal type which interacts with other values by dint of some simple rules.
A Normal distribution is parameterized by two values, a mean \(\mu\) and a standard deviation \(\sigma\). We can write that a random variable \(X\) is distributed normally according to the parameters like so:
\[X \sim N(\mu,\sigma)\]
All the operations we will want to define will exist in terms of these two parameters, so let’s create a class that stores them as instance variables:
import numbers

import numpy as np
import pandas as pd

class Normal(Distribution):  # Distribution: an abstract base class, not shown here
    def __init__(self, mu, sigma):
        self.mu = float(mu)
        self.sigma = float(sigma)
Maybe somehow I have a few benchmarks with no uncertainty, that is, they’re just floats, and I want to sum them up with my uncertain measurements. The easiest operation we can define is interaction with floating point numbers, which for some constant \(b\) follows the rule
\[\begin{aligned} X &\sim N(\mu, \sigma)\\ X + b &\sim N(\mu + b, \sigma)\end{aligned}\]
This requires a double underscore or “dunder” method to be added: __add__. Adding other types will require different rules, so let’s make sure we catch just numbers:
def __add__(self, other):
    if isinstance(other, numbers.Number):
        return Normal(self.mu + other, self.sigma)
    else:
        raise NotImplementedError
We will ask __add__ to do more later, so let’s separate concerns by creating a constant_add function:
def __add__(self, other):
    if isinstance(other, numbers.Number):
        return self.constant_add(other)
    else:
        raise NotImplementedError

def constant_add(self, b: numbers.Number):
    return Normal(self.mu + b, self.sigma)
Let’s try out our new method:
> Normal(0,1) + 3
<__main__.Normal at 0x130423d30>
It looks like it worked, but the interpreter doesn’t know how to print a representation of the object. Let’s fix that.
def __repr__(self) -> str:
    descr_str = f"N({np.round(self.mu,2)},{np.round(self.sigma,2)})"
    return descr_str

> Normal(0,1) + 3
N(3.0,1.0)
Nice!
Now we’ve taught a Normal how to add a float to itself, but what if we have an operation that goes the other way?
> 3 + Normal(0,1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [30], in <cell line: 1>()
----> 1 3 + Normal(0,1)

TypeError: unsupported operand type(s) for +: 'int' and 'Normal'
Normal has a method to handle an int, but int doesn’t have a method to handle a Normal. While we could probably extend int to handle our case, there’s a better way: the reverse-add dunder, __radd__.
Addition is commutative, so we can just re-order the operation. We don’t even need to screen for types because our existing __add__ method already handles that.
def __radd__(self, other):
    return self.__add__(other)

> 3 + Normal(0,1)
N(3.0,1.0)
Implementing arithmetic operators opens the door to treating Normals like garden-variety numbers in Pandas. For instance, now we can do something like
> pd.Series([Normal(1,1),2,3])
0 N(1.0,1.0)
1 2
2 3
dtype: object
> pd.Series([Normal(1,1),2,3]).sum()
N(6.0,1.0)
Let’s add a method so we can add normals to each other. Recall for independent normals
\[\begin{aligned}X &\sim N(\mu_x, \sigma_x)\\ Y &\sim N(\mu_y, \sigma_y)\\ X+Y &\sim N\left(\mu_x + \mu_y, \sqrt{\sigma_x^2 + \sigma_y^2}\right)\end{aligned}\]
so we can write
def normal_add(self, other):
    new_sigma = np.sqrt(self.sigma**2 + other.sigma**2)
    return Normal(mu=(self.mu + other.mu), sigma=new_sigma)
and we can dispatch to it inside our __add__:
def __add__(self, other):
    if isinstance(other, numbers.Number):
        return self.constant_add(other)
    elif isinstance(other, Normal):
        return self.normal_add(other)
    else:
        raise NotImplementedError
and now we can add normals:
> pd.Series([Normal(1,1), Normal(1,2), Normal(1,3)]).sum()
N(3.0,3.74)
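As a sanity check on that output: the variances add in quadrature, so the sum’s standard deviation should be \(\sqrt{1^2 + 2^2 + 3^2} = \sqrt{14}\).

```python
import math

# standard deviations of independent normals combine in quadrature
sigma = math.sqrt(1**2 + 2**2 + 3**2)  # sqrt(14)
print(round(sigma, 2))  # 3.74
```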
One last thing we may want is to use our datatype for grouping. Let’s try it:
foo = pd.Series([AnalyticNormal(1,2), AnalyticNormal(1,2),AnalyticNormal(2,3)])
bar = pd.Series([1,2,3])
df = pd.DataFrame({'f':foo,'b':bar})
df.groupby('f')['b'].sum()
File pandas/_libs/hashtable_class_helper.pxi:5394, in pandas._libs.hashtable.PyObjectHashTable.factorize()
File pandas/_libs/hashtable_class_helper.pxi:5310, in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'AnalyticNormal'
Welp.
The problem here is that pandas can’t tell which elements of the groupby key are unique, because they’re not hashable; that is, they can’t be reduced to unique integers. We need to implement __hash__, and also some comparison methods like __lt__.
def __hash__(self):
    return hash((self.mu, self.sigma))

def __lt__(self, other):
    if isinstance(other, numbers.Number):
        return self.mu < other
    elif isinstance(other, Normal):
        return self.mu < other.mu
    else:
        raise NotImplementedError
See what we get when we use our new function:
AnalyticNormal(1,2).__hash__()
-3550055125485641917
Our full class now looks like this:
class AnalyticNormal(Normal):
    def __init__(self, mu, sigma):
        assert sigma >= 0, "sigma cannot be negative"
        super().__init__(mu, sigma)

    def __eq__(self, other):
        if isinstance(other, numbers.Number):
            return self.mu == other
        elif isinstance(other, Normal):
            return (self.mu == other.mu) and (self.sigma == other.sigma)
        else:
            raise NotImplementedError

    def __repr__(self):
        descr_str = f"N({np.round(self.mu,2)},{np.round(self.sigma,2)})"
        return descr_str

    def __hash__(self):
        return hash((self.mu, self.sigma))

    def __lt__(self, other):
        if isinstance(other, numbers.Number):
            return self.mu < other
        elif isinstance(other, Normal):
            return self.mu < other.mu
        else:
            raise NotImplementedError
and if we try our groupby again:
foo = pd.Series([AnalyticNormal(1,2), AnalyticNormal(1,2),AnalyticNormal(2,3)])
bar = pd.Series([1,2,3])
df = pd.DataFrame({'f':foo,'b':bar})
df.groupby('f')['b'].sum()
f
N(1.0,2.0) 3
N(2.0,3.0) 3
Name: b, dtype: int64
Success!
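Under the hood, a groupby only needs consistent __eq__ and __hash__ on the key. Here is a dict-based sketch of roughly what pandas’ factorization is doing with our keys (MiniNormal is a stripped-down stand-in for illustration, not the full class above):

```python
class MiniNormal:
    """Stripped-down stand-in with just the groupby-relevant methods."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = float(mu), float(sigma)
    def __eq__(self, other):
        return (self.mu, self.sigma) == (other.mu, other.sigma)
    def __hash__(self):
        return hash((self.mu, self.sigma))
    def __repr__(self):
        return f"N({self.mu},{self.sigma})"

keys = [MiniNormal(1, 2), MiniNormal(1, 2), MiniNormal(2, 3)]
vals = [1, 2, 3]
groups = {}
for k, v in zip(keys, vals):
    groups[k] = groups.get(k, 0) + v  # equal keys hash to the same bucket
print(groups)  # {N(1.0,2.0): 3, N(2.0,3.0): 3}
```

Because the two N(1,2) keys compare equal and hash equal, they collapse into a single group, exactly as in the groupby output above.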
There are a lot of other arithmetic operators needed for general Pandas use, including
__add__
__radd__
__floordiv__
__truediv__
__mul__
__rmul__
__sub__
__rsub__
__repr__
__eq__
__ne__
__ge__
__gt__
__lt__
__le__
the implementation of which will be left as an exercise to the reader. I’ve also implemented a few more for convenience, including quantile to get the q-th quantile of the distribution, confidence_interval to get an x% CI, and a few more.
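For reference, here is one possible sketch of those two convenience methods using the standard library’s statistics.NormalDist, written as free functions over mu and sigma (the post’s actual implementation may differ):

```python
from statistics import NormalDist

def quantile(mu: float, sigma: float, q: float) -> float:
    # q-th quantile via the inverse CDF of N(mu, sigma)
    return NormalDist(mu=mu, sigma=sigma).inv_cdf(q)

def confidence_interval(mu: float, sigma: float, level: float = 0.95):
    # central interval covering `level` of the probability mass
    tail = (1 - level) / 2
    return quantile(mu, sigma, tail), quantile(mu, sigma, 1 - tail)

lo, hi = confidence_interval(0, 1)
print(round(lo, 2), round(hi, 2))  # -1.96 1.96
```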
Have fun!