Create a new column within dataframe using existing csv data

Question

I have the following CSV data. In the columns, PPID is the parent process ID and PID is the process ID. I want to add a new column called PPIDName to my existing dataframe that holds the name of the parent process rather than its ID. How can I go about doing this?

Following is an example:

The PID of services.exe is 768, and the PPID of svchost.exe is 768 (so its parent is services.exe). I want to add a new column so that every row shows the actual name of the parent process rather than just its PPID.

"TreeDepth","PID","PPID","ImageFileName","Offset(V)","Threads","Handles","SessionId","Wow64","CreateTime","ExitTime"
1,768,632,"services.exe","0xac8190e52100",7,,0,False,"2021-04-01 05:05:01.000000 ", 
2,1164,768,"svchost.exe","0xac8191053340",3,,0,False,"2021-04-01 05:05:02.000000 ",

"TreeDepth","PID","PPID","ImageFileName","Offset(V)","Threads","Handles","SessionId","Wow64","CreateTime","ExitTime"
0,4,0,"System","0xac818d45d080",158,,,False,"2021-04-01 05:04:58.000000 ",
1,88,4,"Registry","0xac818d5ab040",4,,,False,"2021-04-01 05:04:54.000000 ",
1,404,4,"smss.exe","0xac818dea7040",2,,,False,"2021-04-01 05:04:58.000000 ",
0,556,548,"csrss.exe","0xac81900e4140",10,,0,False,"2021-04-01 05:05:00.000000 ",
0,632,548,"wininit.exe","0xac81901ee080",1,,0,False,"2021-04-01 05:05:00.000000 ",
1,768,632,"services.exe","0xac8190e52100",7,,0,False,"2021-04-01 05:05:01.000000 ",
2,1152,768,"svchost.exe","0xac8191034300",2,,0,False,"2021-04-01 05:05:02.000000 ",
2,2560,768,"svchost.exe","0xac8191485080",6,,0,False,"2021-04-01 05:05:03.000000 ",
2,1668,768,"svchost.exe","0xac8191238080",6,,0,False,"2021-04-01 05:05:03.000000 ",
2,1924,768,"svchost.exe","0xac819132b340",6,,0,False,"2021-04-01 05:05:03.000000 ",
2,908,768,"svchost.exe","0xac8190076080",1,,0,False,"2021-04-01 05:05:01.000000 ",
2,1164,768,"svchost.exe","0xac8191053340",3,,0,False,"2021-04-01 05:05:02.000000 ",
2,2956,768,"svchost.exe","0xac81915d5080",3,,0,False,"2021-04-01 05:05:04.000000 ",
2,652,768,"svchost.exe","0xac8194af2080",11,,0,False,"2021-04-05 21:59:50.000000 ",
2,1680,768,"svchost.exe","0xac819123a700",9,,0,False,"2021-04-01 05:05:03.000000 ",
2,1172,768,"svchost.exe","0xac8191055380",4,,0,False,"2021-04-01 05:05:02.000000 ",
2,2964,768,"svchost.exe","0xac819163e080",7,,0,False,"2021-04-01 05:05:04.000000 ",
2,4500,768,"svchost.exe","0xac8192760080",4,,0,False,"2021-04-01 05:48:25.000000 ",
2,2196,768,"svchost.exe","0xac8191ff0080",4,,0,False,"2021-04-02 01:20:04.000000 ",
2,2456,768,"svchost.exe","0xac8191333080",6,,0,False,"2021-04-01 05:05:03.000000 ",
2,1688,768,"svchost.exe","0xac819267c2c0",7,,0,False,"2021-04-01 05:48:24.000000 ",
2,1180,768,"svchost.exe","0xac8191058700",4,,0,False,"2021-04-01 05:05:02.000000 ",
2,2588,768,"spoolsv.exe","0xac81914db0c0",15,,0,False,"2021-04-01 05:05:03.000000 ",
2,2716,768,"svchost.exe","0xac8192615340",4,,2,False,"2021-04-01 05:48:24.000000 ",

I didn't do any filtering; I just read the CSV into a dataframe, so it has the field names shown above: dfprocs = pd.read_csv(args.path + '/PsTree.csv')
– universepp, Commented May 3, 2022 at 12:00
Could you add your expected output column for maybe the first few rows to your question? It's not clear what you're after.
– Emi OB, Commented May 3, 2022 at 12:17

Emi OB · Accepted Answer · 2022-05-03 14:28:42Z

I think I understand what you're after.

I've made a smaller df with only the relevant columns for my answer (so you can assume Another Col replaces all the other columns):

     PID  PPID ImageFileName  Another Col
0      4     0        System            1
1     88     4      Registry            2
2    404     4      smss.exe            3
3    556   548     csrss.exe            4
4    632   548   wininit.exe            5
                 ...

Firstly, I got all of the PIDs with their corresponding name, and removed any duplicates (if they exist):

df_PID = df[['PID', 'ImageFileName']].drop_duplicates()

     PID ImageFileName
0      4        System
1     88      Registry
2    404      smss.exe
3    556     csrss.exe
4    632   wininit.exe
5    768  services.exe
6   1152   svchost.exe
        ...
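One caveat with this lookup table: `drop_duplicates()` with no arguments only drops rows where both PID and ImageFileName repeat. If a PID ever showed up under two different names (an assumption; it does not happen in the sample data), the later merge would duplicate rows. A minimal sketch of checking for that and forcing one name per PID, using a made-up frame called `procs`:

```python
import pandas as pd

# Hypothetical frame in which PID 768 appears under two different names.
procs = pd.DataFrame({
    "PID": [632, 768, 768, 1152],
    "PPID": [548, 632, 632, 768],
    "ImageFileName": ["wininit.exe", "services.exe", "other.exe", "svchost.exe"],
})

# Dropping duplicates on both columns still leaves two rows for PID 768.
lookup = procs[["PID", "ImageFileName"]].drop_duplicates()
pid_is_unique = lookup["PID"].is_unique

# Keeping only the first name seen per PID makes the lookup one-to-one.
lookup_unique = lookup.drop_duplicates(subset="PID", keep="first")
```

Another option is to pass `validate="many_to_one"` to `merge`, which raises an error instead of silently duplicating rows when the lookup keys are not unique.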

I then renamed these columns to PPID and PPIDName, to make it easier to merge onto the original df to get the desired result. That and the merge are below:

df_PID.columns = ['PPID', 'PPIDName']
df = df.merge(df_PID, on='PPID', how='left')

This gives the below output, which I think is what you want:

     PID  PPID ImageFileName  Another Col      PPIDName
0      4     0        System            1           NaN
1     88     4      Registry            2        System
2    404     4      smss.exe            3        System
3    556   548     csrss.exe            4           NaN
4    632   548   wininit.exe            5           NaN
5    768   632  services.exe            6   wininit.exe
6   1152   768   svchost.exe            7  services.exe
7   2560   768   svchost.exe            8  services.exe
8   1668   768   svchost.exe            9  services.exe
9   1924   768   svchost.exe           10  services.exe
                          ...
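As an alternative to the merge, the same column can probably be built with `Series.map`, which avoids the intermediate renamed frame. This is only a sketch on a few rows taken from the question, trimmed to the relevant columns; `csv_text` and `pid_to_name` are names made up for the example:

```python
import io
import pandas as pd

# A few rows from the question's CSV, reduced to the relevant columns.
csv_text = """PID,PPID,ImageFileName
4,0,System
88,4,Registry
632,548,wininit.exe
768,632,services.exe
1152,768,svchost.exe
"""
df = pd.read_csv(io.StringIO(csv_text))

# Series indexed by PID whose values are the process names.
pid_to_name = df.drop_duplicates(subset="PID").set_index("PID")["ImageFileName"]

# Look up each PPID in that Series; parents missing from the data become NaN.
df["PPIDName"] = df["PPID"].map(pid_to_name)
```

Rows whose parent is not in the file (PPID 0 and 548 here) end up with NaN, the same as with the left merge.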

Zero · Accepted Answer · 2022-05-03 12:49:18Z

This does the job:

ppid_name = df.loc[df["PID"].isin(df["PPID"]), ["PID", "ImageFileName"]].set_index("PID", drop = False)
replace_with = (ppid_name["PID"].astype(str) + "_" + ppid_name["ImageFileName"]).to_dict()
df["PPID"] = df["PPID"].replace(replace_with)

Output:

	TreeDepth	PID	PPID	ImageFileName	Offset(V)	Threads	Handles	SessionId	Wow64	CreateTime	ExitTime
0	0	4	0	System	0xac818d45d080	158	nan	nan	False	2021-04-01 05:04:58.000000	nan
1	1	88	4_System	Registry	0xac818d5ab040	4	nan	nan	False	2021-04-01 05:04:54.000000	nan
2	1	404	4_System	smss.exe	0xac818dea7040	2	nan	nan	False	2021-04-01 05:04:58.000000	nan
3	0	556	548	csrss.exe	0xac81900e4140	10	nan	0.0	False	2021-04-01 05:05:00.000000	nan
4	0	632	548	wininit.exe	0xac81901ee080	1	nan	0.0	False	2021-04-01 05:05:00.000000	nan
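Note that this overwrites PPID, so the column ends up holding a mix of plain IDs and `PID_Name` strings, as the output shows. If the numeric PPID should stay intact, a sketch of writing the same label into a separate column might look like the following. The column name `PPIDLabel` and the small frame are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "PID": [4, 88, 404, 556],
    "PPID": [0, 4, 4, 548],
    "ImageFileName": ["System", "Registry", "smss.exe", "csrss.exe"],
})

# Build "PID_Name" labels keyed by PID, e.g. 4 -> "4_System".
unique = df.drop_duplicates(subset="PID").set_index("PID", drop=False)
labels = unique["PID"].astype(str) + "_" + unique["ImageFileName"]

# Map each PPID to its label; fall back to the bare PPID when the parent is not listed.
df["PPIDLabel"] = df["PPID"].map(labels).fillna(df["PPID"].astype(str))
```

The original PPID column is left untouched, so it can still be used for numeric comparisons or further joins.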

edited May 3, 2022 at 12:49

answered May 3, 2022 at 11:59

6 Comments

universepp Over a year ago

That is not what I'm looking for. For example, services.exe has the PID 768 and svchost.exe has the PPID 768. I want to add a new column so that every row shows the actual name of the parent process rather than its PPID:
1,768,632,"services.exe","0xac8190e52100",7,,0,False,"2021-04-01 05:05:01.000000 ",
2,1164,768,"svchost.exe","0xac8191053340",3,,0,False,"2021-04-01 05:05:02.000000 ",

Zero Over a year ago

@universepp Update the question itself so everyone can understand it more clearly.

Zero Over a year ago

@universepp From what I understand you want the Parent name with the PPID for each row?

universepp Over a year ago

Yes, that is what I'm trying to do.

Zero Over a year ago

@universepp I have updated the answer. It should work now! And thanks for such an awesome question. It helped me in learning a few new things!
