0

I have sample data from 2014 through 2018 and need to plot a histogram to find outliers. First, though, I need to figure out how many of the 162 unique IDs appear in each year (2014, 2015, ..., 2018) and then plot that. I start with the mask data_2014 = data['DATE'].dt.year == 2014 for 2014, but how do I find which of the 162 unique IDs occurred in 2014? Thank you so much!

|        | ID     | DATE       | VIOLATIONS |
| 0      | CHI065 | 2014-07-08 |         65 |
| 1      | CHI010 | 2014-07-16 |         56 |
| 2      | CHI069 | 2014-07-08 |         10 |
| 3      | CHI010 | 2014-07-26 |        101 |
| 4      | CHI010 | 2014-07-27 |         92 |
| 5      | CHI068 | 2014-08-03 |         20 |
| 17049  | CHI040 | 2018-12-22 |         15 |
| 170496 | CHI168 | 2018-12-23 |         16 |
| 170497 | CHI103 | 2018-12-23 |          8 |
1
  • Thanks for revising the table for me. I pressed Ctrl+K today after formatting the table in the notebook, just like I did yesterday, but instead of formatting it on Stack Overflow, it kept jumping to the Google search bar in Chrome. Is there a faster way to format data on Stack Overflow? Commented Aug 6, 2019 at 21:15

2 Answers

3
import pandas as pd

df = pd.DataFrame({'date': {0: '26-1-2014', 1: '26-1-2014', 2:'26-1-2015', 3:'30-1-2014'}, 
                  'ID': {0:"id12", 1: "id13", 2: "id14", 3: "id12"}, 'violations': {0: 34, 1:3, 2: 45, 3: 15} } )
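# The sample dates are day-first, so state the format explicitly; year is stored as a string, e.g. '2014'
df['year'] = pd.to_datetime(df['date'], format='%d-%m-%Y').dt.strftime('%Y')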

Return the unique IDs per year as a dictionary or DataFrame for easy lookup:

d = df.groupby('year')['ID'].apply(set).to_dict() # as dictionary
d['2014'] #returns unique ids for 2014

The following line creates a df with unique IDs per year. This is good if you just want to know which ids are part of 2014.

df_ids = df.groupby('year')['ID'].apply(set).to_frame(name="id_per_year") #as dataframe

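You can now subset on year, for example to get only the rows from 2014. Assign the result to a new name so the full df stays available for the groupby calls that follow:

df_2014 = df.loc[df['year'] == '2014'] # subset for 2014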

If you only want to count the unique IDs for 2014 you can groupby year and use nunique()

df_unique = df.groupby('year')['ID'].nunique().to_frame(name="unique_counts")

The following line creates a frame with counts of IDs per year

df_counts = df.groupby('year')['ID'].count().to_frame(name="count")
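As a rough sketch of the same idea applied to the column names in the question (ID, DATE, VIOLATIONS), the boolean mask from the question can be used directly to list the IDs seen in 2014. The sample rows below are made up for illustration, and the variable names mask_2014, ids_2014 and ids_per_year are assumptions, not part of the original data.

```python
import pandas as pd

# Hypothetical sample that mirrors the question's columns
data = pd.DataFrame({
    "ID": ["CHI065", "CHI010", "CHI069", "CHI010", "CHI040", "CHI168"],
    "DATE": pd.to_datetime([
        "2014-07-08", "2014-07-16", "2014-07-08",
        "2014-07-26", "2018-12-22", "2018-12-23",
    ]),
    "VIOLATIONS": [65, 56, 10, 101, 15, 16],
})

# Boolean mask for 2014, as in the question
mask_2014 = data["DATE"].dt.year == 2014

# Which IDs occurred in 2014 (each ID listed once)
ids_2014 = sorted(data.loc[mask_2014, "ID"].unique())

# How many distinct IDs occurred in each year
ids_per_year = (
    data.groupby(data["DATE"].dt.year)["ID"]
    .nunique()
    .rename_axis("YEAR")
)
```

Here the year is kept as an integer, so lookups use ids_per_year.loc[2014] rather than the string '2014' used elsewhere in this answer.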

hope this helps

EDIT 1: included aggregations to address comments

This generates a table with, for each year and ID, the number of rows for that ID and its total number of violations in that year.

import pandas as pd

df = pd.DataFrame({'date': {0: '26-1-2014', 1: '26-1-2014', 2:'26-1-2015', 3:'30-1-2014'}, 
                  'ID': {0:"id12", 1: "id13", 2: "id14", 3: "id12"}, 'violations': {0: 34, 1:3, 2: 45, 3: 15} } )
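# The sample dates are day-first, so state the format explicitly; year is stored as a string, e.g. '2014'
df['year'] = pd.to_datetime(df['date'], format='%d-%m-%Y').dt.strftime('%Y')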

aggregations = {'ID': 'count', 'violations': 'sum'}

df_agg = df.groupby(['year', 'ID']).agg(aggregations)

corr = df_agg.groupby('year')[['ID', 'violations']].corr() #optional

If you want the number of unique IDs per year instead, adjust the aggregations and the grouping:

aggregations = {'ID': pd.Series.nunique, 'violations': 'sum'}
df_agg = df.groupby('year').agg(aggregations)

You can make a scatter plot like this. Make sure palette contains one color for each year.

import seaborn as sns
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=df_agg["ID"], y=df_agg["violations"], hue=df_agg.index.get_level_values("year"), palette=["r", "b"], legend='full')
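For the per-year comparison of active cameras against total violations, one possible approach is to build a single table indexed by year and plot from that with matplotlib. This is only a sketch: the sample rows are invented, df1 and the YEAR, ID and VIOLATIONS column names are taken from the comments, and summary, active_cameras and total_violations are names chosen here for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample; YEAR is a string column, as in the answer above
df1 = pd.DataFrame({
    "ID": ["CHI065", "CHI010", "CHI069", "CHI010",
           "CHI010", "CHI040", "CHI168", "CHI040"],
    "YEAR": ["2014", "2014", "2014", "2014",
             "2015", "2015", "2018", "2018"],
    "VIOLATIONS": [65, 56, 10, 101, 92, 20, 16, 8],
})

# One row per year: distinct cameras and summed violations
summary = df1.groupby("YEAR").agg(
    active_cameras=("ID", "nunique"),
    total_violations=("VIOLATIONS", "sum"),
)

# YEAR is the index, so a single year is looked up with .loc
row_2014 = summary.loc["2014"]

# Pearson correlation between the two per-year columns
r = summary["active_cameras"].corr(summary["total_violations"])

# Scatter plot with each point labelled by its year
fig, ax = plt.subplots()
ax.scatter(summary["active_cameras"], summary["total_violations"])
for year, row in summary.iterrows():
    ax.annotate(year, (row["active_cameras"], row["total_violations"]))
ax.set_xlabel("Active cameras")
ax.set_ylabel("Total violations")
# Call plt.show() when running this as a script
```

Because groupby moves YEAR into the index, expressions such as summary.YEAR raise an AttributeError; use summary.index, or summary.reset_index() if a regular YEAR column is needed.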

6 Comments

Yes, that helped, but I'm trying to create a final variable for 2014 that connects the number of citations and the number of cameras for 2014, and I would do the same for the rest of the years. Thanks to you, I have three versions of the data: the first holds the full 2014 subset, the second shows the active cameras each year, and the third shows the number of citations per year. Note: the key shared by all three should be the year. I'm not sure whether the final 2014 variable should use the 2014 subset and connect to the citation variable df_cit_yr, or whether it should connect to df_unique.
First one: data_2014 = df1.loc[df1['YEAR'] == '2014'] # all the records in 2014. Second one: df_unique = df1.groupby('YEAR')['ID'].nunique().to_frame(name="Number_of_active_cameras") # number of cameras from 2014 - 2018. Third one: df_cit_yr = df1.groupby(['YEAR'])['VIOLATIONS'].sum() # number of citations per year. The end result is that I will have a final variable for each year that has the citations and violations for that year.
In the end, I need to show a graph via matplotlib of whether there is any correlation between the number of cameras installed and the number of violations for that year.
I'm trying to connect citations and violations via year. fin_2014 = df_unique[df_cit_yr.YEAR == 2014] # this gives an error saying the object has no attribute
Make sure the year is in quotes: "2014". My edit 1 might provide an easier way to address your problem.
0

You can use the example in this answer to get the year in a new column

df['year'] = pd.DatetimeIndex(df['DATE']).year

or

df['year'] = df['DATE'].dt.year  # requires DATE to already be a datetime column

then use groupby and agg to get a count of each year:

counts = df.groupby('year').agg('count')

So each year
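Note that a count per year tallies rows, not distinct IDs, so an ID that appears several times in a year is counted several times. A small sketch of the difference, using made-up rows with the question's ID and DATE columns (rows_per_year and unique_ids_per_year are illustrative names):

```python
import pandas as pd

# Hypothetical sample: CHI010 appears twice in 2014
df = pd.DataFrame({
    "ID": ["CHI010", "CHI010", "CHI065", "CHI168"],
    "DATE": pd.to_datetime([
        "2014-07-16", "2014-07-26", "2014-07-08", "2018-12-23",
    ]),
})
df["year"] = df["DATE"].dt.year

# Number of rows recorded in each year
rows_per_year = df.groupby("year")["ID"].count()

# Number of distinct IDs seen in each year
unique_ids_per_year = df.groupby("year")["ID"].nunique()
```

For the question as asked (how many of the 162 unique IDs occur in each year), nunique is the one that matches.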

1 Comment

df_unique = df1.groupby('YEAR')['ID'].nunique().to_frame(name="Number_of_active_cameras") # number of cameras from 2014 - 2018. I'm trying to plot the year on the x-axis and the ID count on the y-axis, but plt.plot(df_unique.ID, df_unique.ID) is not working. It doesn't recognize ID and YEAR.
