0

I have a text file with a proprietary programming language and I want to extract the relevant information about various function calls.

The structure of the function is:

function name(input1, input2) returns (output1, output2) function body

I'm using Python and RegEx to capture this information, but I've hit a snag. I can capture the name, the inputs and the outputs, but I am unable to grab all of the function body.

I use the following line to capture this info:

re.findall(r"(function)(.*?)\((.*?)\) returns \((.*?)\)(.*)", file_contents)

However, this fails after the first occurrence of the word 'function'. Because the function body contains nested statements, I can't rely on a particular closing keyword to delimit the last group (the function body). I've tried several approaches, and none of them captures the entire body.

How can I group everything after a particular point and then repeat the pattern?

What I want: 'function', 'name', 'input1, input2', 'output1, output2', 'function body', repeated for every function in the file. The last group should grab everything after the outputs, and the pattern should restart at the next occurrence of the word 'function'. I've tried different variations of the (.*?) and (.*) groups, but I can't get it to work.

I am not a programmer by trade, so I am not that adept with RegEx or Python. I know just enough to do the very basics.

3
  • Can you give an example of the function? It would be useful to know whether it is always written on a single line; whether there may be variations in the whitespace (new lines versus spaces versus indents); and how the beginning and end of the function body are defined (e.g. until the end of the line, by curly brackets { }, by indents, or by a keyword). Commented Dec 11, 2019 at 20:12
  • You can see a sample of the format and the most recent RegEx expression here: regex101.com/r/2zqD02/1 Commented Dec 12, 2019 at 13:47
  • Most recent regex101.com/r/PkfofA/1 Commented Dec 13, 2019 at 16:10

3 Answers

1

Based on further information from the comments, I tested the following regex with re.findall in Python 3.6; it works with the example:

import re

file_contents = "function func1(in1 : bool; in2 : bool; in3 : bool) returns ( out : bool) var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1; end \n random code \nfunction func2(in1 : bool; in2 : bool; in3 : bool) returns ( out : bool) var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1;"

pattern = r"(function) (.*?)\((.*?)\) returns \((.*?)\) (.*)"
regex_results = re.findall( pattern, file_contents )

print( regex_results )

Output:

[('function', 'func1', 'in1 : bool; in2 : bool; in3 : bool', ' out : bool', 'var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1; end '), ('function', 'func2', 'in1 : bool; in2 : bool; in3 : bool', ' out : bool', 'var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1;')]
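One caveat worth spelling out: in Python, `.` does not match a newline unless `re.DOTALL` is set, so the final `(.*)` stops at the end of the line. The pattern above therefore relies on each function sitting on a single line, as in the example. A small sketch with made-up sample strings (not taken from the real file) shows the difference:

```python
import re

pattern = r"(function) (.*?)\((.*?)\) returns \((.*?)\) (.*)"

# Hypothetical inputs: the same function written on one line and on three lines.
one_line = "function f(a : bool) returns (x : bool) begin x = a; end"
multi_line = "function f(a : bool) returns (x : bool) begin\n x = a;\nend"

# Group 5 (index 4 of each tuple) is the function body.
body_one_line = re.findall(pattern, one_line)[0][4]
body_multi_line = re.findall(pattern, multi_line)[0][4]

print(repr(body_one_line))    # 'begin x = a; end'
print(repr(body_multi_line))  # 'begin'  (capture stops at the first newline)
```

If the bodies in the real file span several lines, the pattern needs `re.DOTALL` plus some way to stop before the next function.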


1

I figured out a different way to accomplish what I was trying to do.

I used the following line:

re.split(r'(function )(.*?)\((.*?)\) returns \((.*?)\)', contents)

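Because the pattern contains capture groups, re.split keeps the captured text in the resulting list, and the text that follows each header (the function body) comes right after it. I then chunk the list and assign the pieces to the variables I have.

Thanks to everyone who took the time to answer.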
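The chunking step could look roughly like the sketch below. The sample text in `contents` and the dictionary keys are made up for illustration; the only thing taken from the answer is the pattern itself.

```python
import re

# Hypothetical sample input with two functions.
contents = (
    "function f1(a : bool) returns (x : bool) var L1 : bool; begin x = a; end\n"
    "function f2(b : bool) returns (y : bool) begin y = b; end\n"
)

pattern = r"(function )(.*?)\((.*?)\) returns \((.*?)\)"
parts = re.split(pattern, contents)

# parts[0] is whatever precedes the first header. After that, every header
# contributes its 4 captured groups, followed by the text up to the next header.
chunk_size = 5
functions = []
for i in range(1, len(parts), chunk_size):
    keyword, name, inputs, outputs, body = parts[i:i + chunk_size]
    functions.append(
        {
            "name": name.strip(),
            "inputs": inputs.strip(),
            "outputs": outputs.strip(),
            "body": body.strip(),
        }
    )

for fn in functions:
    print(fn)
```

Note that anything between the end of one function and the next header (such as unrelated code) ends up in the preceding function's body with this approach.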


1

This grabs each function up to the next occurrence of the word function.
There are 5 capture groups.

With findall, each match is returned as a tuple of the 5 groups.

(?s)(\bfunction\b)(.*?)\((.*?)\)\s+returns\s+\((.*?)\)((?:(?!\bfunction\b).)*)

https://regex101.com/r/PkfofA/1

Expanded

 (?s)
 ( \b function \b )            # (1)
 ( .*? )                       # (2)
 \( 
 ( .*? )                       # (3)
 \) \s+ returns \s+ \( 
 ( .*? )                       # (4)
 \) 
 (                             # (5 start)
      (?:
           (?! \b function \b )
           . 
      )*
 )                             # (5 end)
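In Python, the expanded form can be used almost as written by compiling it with `re.VERBOSE`, which ignores whitespace and `#` comments inside the pattern. The following is a minimal sketch; the name `FUNCTION_RE` and the sample text are assumptions for illustration only.

```python
import re

FUNCTION_RE = re.compile(
    r"""
    ( \b function \b )            # (1) the keyword itself
    ( .*? )                       # (2) name
    \(
    ( .*? )                       # (3) inputs
    \) \s+ returns \s+ \(
    ( .*? )                       # (4) outputs
    \)
    (                             # (5) body: every character up to the next keyword
        (?:
            (?! \b function \b )
            .
        )*
    )
    """,
    re.DOTALL | re.VERBOSE,       # DOTALL replaces the inline (?s)
)

# Hypothetical sample with a body that spans several lines.
sample = (
    "function f1(a : bool) returns (x : bool)\n"
    "begin\n"
    "  x = a;\n"
    "end\n"
    "function f2(b : bool) returns (y : bool)\n"
    "begin y = b; end\n"
)

matches = FUNCTION_RE.findall(sample)
for keyword, name, inputs, outputs, body in matches:
    print(name.strip(), "|", inputs, "|", outputs)
```

One limitation to keep in mind: the body ends at the next standalone word `function`, so a body that itself contains that word (in a comment, for example) would be cut short.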

I guess finditer() is a way to get better control over each set of 5 groups:

import re

pattern = r"(?s)(\bfunction\b)(.*?)\((.*?)\)\s+returns\s+\((.*?)\)((?:(?!\bfunction\b).)*)"

# target is the text of the source file
for result in re.finditer(pattern, target):
    g1 = result.group(1)  # the keyword 'function'
    g2 = result.group(2)  # name
    g3 = result.group(3)  # inputs
    g4 = result.group(4)  # outputs
    g5 = result.group(5)  # body, up to the next 'function'
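As a variation on the loop above, named groups can make the results easier to work with than numbered ones. This is only a sketch: the group names (`name`, `inputs`, `outputs`, `body`) and the sample `target` text are invented for illustration, and the keyword is left uncaptured because it is always the same.

```python
import re

FUNCTION_RE = re.compile(
    r"(?s)\bfunction\b(?P<name>.*?)\((?P<inputs>.*?)\)\s+returns\s+"
    r"\((?P<outputs>.*?)\)(?P<body>(?:(?!\bfunction\b).)*)"
)

# Hypothetical sample input.
target = (
    "function func1(in1 : bool; in2 : bool) returns ( out : bool)\n"
    "var L1 : bool;\n"
    "begin L1 = in1 and in2; out = L1; end\n"
    "function func2(in1 : bool) returns ( out : bool)\n"
    "begin out = in1; end\n"
)

# One dictionary per function, with surrounding whitespace trimmed.
functions = [
    {key: value.strip() for key, value in match.groupdict().items()}
    for match in FUNCTION_RE.finditer(target)
]

for fn in functions:
    print(fn["name"], "|", fn["inputs"], "|", fn["outputs"])
```

Each entry can then be accessed by key, for example `functions[0]["body"]`, instead of by position.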

2 Comments

Thanks for responding. I tried this bit of code, and unfortunately it still does not want to capture the function body correctly. See here: regex101.com/r/2zqD02/1. Any advice? Again, thanks for taking the time out to respond.
@bitmeddler - My mistake, fixed it. regex101.com/r/PkfofA/1
