I have a dataset similar to this one:

    Mother ID ChildID    ethnicity
0     1       1          White Other
1     2       2          Indian
2     3       3          Black
3     4       4          Other
4     4       5          Other
5     5       6          Mixed-White and Black

To simplify my dataset and make it more relevant to the classification I am performing, I want to categorise ethnicities into 3 categories as such:

  1. White: within this category I will include 'White British' and 'White Other' values
  2. South Asian: the category will include 'Pakistani', 'Indian', 'Bangladeshi'
  3. Other: 'Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian' values

So I want the above dataset to be transformed to:

    Mother ID ChildID    ethnicity
0     1       1          White
1     2       2          South Asian
2     3       3          Other
3     4       4          Other
4     4       5          Other
5     5       6          Other

To do this I have run the following code, similar to the one provided in this answer:


    col         = 'ethnicity'
    conditions  = [ (df[col] in ('White British', 'White Other')),
                   (df[col] in ('Indian', 'Pakistani', 'Bangladeshi')),
                   (df[col] in ('Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian'))]
    choices     = ['White', 'South Asian', 'Other']
        
    df["ethnicity"] = np.select(conditions, choices, default=np.nan)
    

But when running this, I get the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any idea why I am getting this error? Am I not handling the string comparison correctly? I am using a similar technique to manipulate other features in my dataset and it is working fine there.

2 Answers

I cannot say why in does not work, but isin definitely solves the problem; maybe someone else can explain why in fails.

conditions  = [ (df[col].isin(('White British', 'White Other'))),
                (df[col].isin(('Indian', 'Pakistani', 'Bangladeshi'))),
                (df[col].isin(('Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian')))]
print(conditions)
choices     = ['White', 'South Asian', 'Other']

df["ethnicity"] = np.select(conditions, choices, default=np.nan)
print(df)

Output:

   Mother ID  ChildID    ethnicity
0          1        1        White
1          2        2  South Asian
2          3        3        Other
3          4        4        Other
4          4        5        Other
5          5        6          nan
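
For reference, here is a minimal self-contained sketch of the isin approach run on the sample data from the question. It is an illustration, not the exact script used for the output above: the DataFrame is rebuilt by hand from the question's table, and the default is the string "Unknown" (an assumed placeholder label) rather than np.nan, because mixing string choices with a float default can, depending on the NumPy version, either turn the missing marker into the string 'nan' or fail with a dtype promotion error.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the table in the question.
df = pd.DataFrame({
    "Mother ID": [1, 2, 3, 4, 4, 5],
    "ChildID": [1, 2, 3, 4, 5, 6],
    "ethnicity": ["White Other", "Indian", "Black",
                  "Other", "Other", "Mixed-White and Black"],
})

col = "ethnicity"
# Each isin call returns a boolean Series with one entry per row.
conditions = [
    df[col].isin(["White British", "White Other"]),
    df[col].isin(["Indian", "Pakistani", "Bangladeshi"]),
    df[col].isin(["Other", "Black", "Mixed-White and Black",
                  "Mixed-White and South Asian"]),
]
choices = ["White", "South Asian", "Other"]

# "Unknown" is an assumed placeholder for rows matching no condition.
df[col] = np.select(conditions, choices, default="Unknown")
print(df)
```

With this data every row matches one of the three conditions, so the last row ('Mixed-White and Black') ends up labelled Other.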

2 Comments

This does indeed fix the problem, similar to @jezrael's answer to this question: stackoverflow.com/questions/56170164/…
Shouldn't your output have 'Other' for the 6th row here?

With df[col] in some_tuple you are searching for df[col] as a whole inside some_tuple, which is not what you want. What you want is df[col].isin(some_tuple), which returns a new Series of booleans of the same length as df[col].

So why do you get that error anyway? Searching for a value in a tuple works more or less like the following function:

# Roughly what "x in some_tuple" does, with x = df[col]
def contains(some_tuple, x):
    for v in some_tuple:
        if x == v:  # a Series when x is df[col]
            return True
    return False
  • df[col] == v evaluates to a Series, call it result; no problem here
  • Python then tries to evaluate if result: and raises the error, because a Series is being used as a condition, meaning that you are (implicitly) trying to evaluate a Series as a single boolean; pandas does not allow this.
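
The difference can be checked directly. The following sketch uses a small hand-made Series (the names s and whites are illustrative only) to show that the in operator raises the ValueError from the question, while isin returns an element-wise boolean Series.

```python
import pandas as pd

s = pd.Series(["White Other", "Indian", "Black"], name="ethnicity")
whites = ("White British", "White Other")

# "in" compares the whole Series with each tuple item and then
# needs a single True/False, which pandas refuses to provide.
try:
    s in whites
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# isin tests each element separately and returns a boolean Series.
mask = s.isin(whites)
print(mask)
```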

For your problem, though, I would use Series.apply. It takes a function that maps one value to another; in your case, a function that maps each ethnicity to a category. There are many ways to define it (see the options below).


import numpy as np
import pandas as pd

d = pd.DataFrame({
    'field': range(6),
    'ethnicity': list('ABCDE') + [np.nan]
})

# Option 1: define a dict {ethnicity: category}
category_of = {
    'A': 'X',
    'B': 'X',
    'C': 'Y',
    'D': 'Y',
    'E': 'Y',
    np.nan: np.nan,
}
result = d.assign(category=d['ethnicity'].apply(category_of.__getitem__))
print(result)

# Option 2: define categories, then "invert" the dict.
categories = {
    'X': ['A', 'B'],
    'Y': ['C', 'D', 'E'],
    np.nan: [np.nan],
}
# If you do this frequently you could define a function invert_mapping(d):
category_of = {eth: cat
               for cat, values in categories.items()
               for eth in values}
result = d.assign(category=d['ethnicity'].apply(category_of.__getitem__))
print(result)

# Option 3: define a function (a little less efficient)
def ethnicity_to_category(ethnicity):
    if ethnicity in {'A', 'B'}:
        return 'X'
    if ethnicity in {'C', 'D', 'E'}:
        return 'Y'
    if pd.isna(ethnicity):
        return np.nan
    raise ValueError('unknown ethnicity: %s' % ethnicity)

result = d.assign(category=d['ethnicity'].apply(ethnicity_to_category))
print(result)
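
Another possibility, not one of the options above, is Series.map with the same kind of dict. The sketch below assumes the category lists from the question and a small hand-made Series ("Not listed" is a made-up value for illustration). Values that are missing from the dict, including NaN, come back as NaN, so no explicit np.nan entry is needed. Note that, unlike Option 3, an unknown ethnicity is silently mapped to NaN instead of raising an error.

```python
import numpy as np
import pandas as pd

# Category lists taken from the question.
categories = {
    "White": ["White British", "White Other"],
    "South Asian": ["Pakistani", "Indian", "Bangladeshi"],
    "Other": ["Other", "Black", "Mixed-White and Black",
              "Mixed-White and South Asian"],
}
# Invert to {ethnicity: category}.
category_of = {eth: cat
               for cat, values in categories.items()
               for eth in values}

ethnicity = pd.Series(["White Other", "Indian", "Black",
                       np.nan, "Not listed"])
# Keys absent from the dict (and NaN) are mapped to NaN.
category = ethnicity.map(category_of)
print(category)
```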

5 Comments

I understand what you are saying about evaluating a series as a boolean. But then why does this evaluation work for other features in my dataset? See this answer: stackoverflow.com/questions/39109045/…
Also, how are NaNs handled in your code? In my code above, I handled them by passing default=np.nan in np.select.
@sums22 You are not doing the same thing. If you use "in" you are calling a method of tuple that will result in the code I wrote above: somewhere in the code there will be an if that will have a series as a condition (the result of series == value). In that answer, you use operators like ">" that are defined in Series and will return another Series. There's no if series: involved.
@sums22 Handling NaNs is as easy as writing category_of[np.nan] = np.nan. Keep in mind that apply just wants a function that maps one value to another. My code is just an example. You can define that function in many ways. I'll expand my answer.
@sums22 So, to recap, series in tuple is not the operation you want: you don't want to know if the series is inside the tuple, you want another series that tells you if the elements of series are in the tuple; that's what series.isin(tuple) does.
