I'm working with a DataFrame column in which each row holds Date | TimeStamp | Name | Message as a single string:

59770        [08/10/18, 5:57:43 PM] Luke: Message
59771   [08/10/18, 5:57:48 PM] Luke: Message
59772     [08/10/18, 5:57:50 PM] Luke: Message

I use the following function to capture the Date.

def getdate(x):
    res = re.search(r"\d\d/\d\d/\d\d", x)  # requires `import re`
    return res.group() if res else None    # None when no date is found

and the following code to capture the rest of the data (TimeStamp | Name | Message) into columns:

df['Data'].str.extract(r'\s*(.{10})](.*):(.*)')

Is there a way to capture and extract all four fields together?

Please Advise

  • If you can change the data file, you could convert it to CSV and then read it with pandas. Commented Feb 21, 2021 at 12:22

2 Answers

2

As an alternative you could use regex named groups together with pandas extractall.

import pandas as pd
import re

df = pd.DataFrame(
    ["        [08/10/18, 5:57:43 PM] Luke: Message",
     "   [08/10/18, 5:57:48 PM] Luke: Message",
     "     [08/10/18, 5:57:50 PM] Luke: Message"])

print(df)

regex = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2}),\s*"
    r"(?P<timestamp>\d+:\d+:\d+\s[AP]M)\]\s+"
    r"(?P<name>.+?):\s*"
    r"(?P<message>.+)$"
    )

df_out = df[0].str.extractall(regex).droplevel(1)
print(df_out)

Output from df_out

       date   timestamp  name  message
0  08/10/18  5:57:43 PM  Luke  Message
1  08/10/18  5:57:48 PM  Luke  Message
2  08/10/18  5:57:50 PM  Luke  Message

4 Comments

Well, this is exactly where I'm stuck: apparently both your method and mine put part of some messages into the name column as well. Here's an example: Luke: Message....
@Luke Try making the name capture non-greedy. I have edited the answer.
Worked like a charm! Would you care to explain what you changed?
Changed the name capture from ?P<name>.+ to ?P<name>.+?, making it non-greedy. This way the regex engine matches as few characters as possible, stopping at the first :.
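To illustrate the greedy versus non-greedy difference described in the comment above, here is a minimal sketch using only the standard `re` module. The sample line (a message that itself contains a colon) is an assumption made up for illustration, not taken from the asker's data.

```python
import re

# Hypothetical line whose message contains a colon ("6:30")
line = "[08/10/18, 5:57:43 PM] Luke: See you at 6:30"

# Shared prefix: date and timestamp inside the square brackets
prefix = (
    r"\[(?P<date>\d{2}/\d{2}/\d{2}),\s*"
    r"(?P<timestamp>\d+:\d+:\d+\s[AP]M)\]\s+"
)

# Greedy name: .+ grabs as much as it can, so it stops at the LAST colon
greedy = re.compile(prefix + r"(?P<name>.+):\s*(?P<message>.+)$")
# Non-greedy name: .+? stops at the FIRST colon after the bracket
lazy = re.compile(prefix + r"(?P<name>.+?):\s*(?P<message>.+)$")

greedy_match = greedy.search(line)
lazy_match = lazy.search(line)

print(repr(greedy_match.group("name")))  # part of the message leaks into name
print(repr(lazy_match.group("name")))    # just the name
print(repr(lazy_match.group("message")))
```

With the greedy pattern, the name group swallows everything up to the colon in "6:30"; the non-greedy version keeps the whole message intact.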
0

I changed the format of each line in the file "file.csv" as follows:

08/10/18, 5:57:43 PM, Luke, Message

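and then used this code to read it as a DataFrame:

import pandas as pd
# The file has no header row, so supply the column names explicitly;
# skipinitialspace drops the blank after each comma.
df = pd.read_csv("file.csv", names=["Date", "time", "name", "msg"], skipinitialspace=True)
print(df)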

Output:

       Date        time  name      msg
0  08/10/18  5:57:43 PM  Luke  Message

Suppose your data is in the file "file_data.txt" in the following format:

  [08/10/18, 5:57:43 PM] Luke: Message
  [08/10/18, 5:57:48 PM] Luke: Message
  [08/10/18, 5:57:50 PM] Luke: Message

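You can use these sed commands to convert the data to CSV:

 # -i edits the file in place (GNU sed; BSD/macOS sed needs -i '' instead)
 sed -i "s/\[//" file_data.txt
 # turn "] Name:" into ", Name," without touching the colons in the time
 sed -i "s/] \([^:]*\):/, \1,/" file_data.txt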

3 Comments

How do you change the format? It's actually a .txt file not .csv
The file format doesn't matter; you can change it with sed.
Yields an error sed: 1: "Chat.txt": invalid command code C
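That error is what BSD/macOS sed reports when `-i` is given without a backup-suffix argument. As a portable alternative to sed (a different technique from the one in this answer), the same conversion can be sketched in Python with only the standard library. The sample text, the pattern, and the helper name `to_csv_text` are assumptions for illustration; the column names follow the output shown above.

```python
import csv
import io
import re

# Hypothetical sample standing in for the contents of file_data.txt
raw_text = """\
  [08/10/18, 5:57:43 PM] Luke: Message
  [08/10/18, 5:57:48 PM] Luke: Message
  [08/10/18, 5:57:50 PM] Luke: Message
"""

# [date, time] name: message  (name ends at the first colon after the bracket)
line_pattern = re.compile(
    r"\[(?P<date>[^,\]]+),\s*(?P<time>[^\]]+)\]\s*(?P<name>[^:]+):\s*(?P<msg>.*)"
)

def to_csv_text(text):
    """Convert chat-log lines to CSV text; lines that do not match are skipped."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Date", "time", "name", "msg"])
    for line in text.splitlines():
        match = line_pattern.search(line)
        if match:
            writer.writerow(match.group("date", "time", "name", "msg"))
    return buffer.getvalue()

csv_text = to_csv_text(raw_text)
print(csv_text)
```

The resulting text can be written to a file and loaded with `pd.read_csv`, or passed to it directly via `io.StringIO(csv_text)`. Using `csv.writer` also means messages containing commas get quoted correctly, which the plain sed substitution does not handle.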
