How to remove JavaScript & other tags using Python... without importing modules

Question

For the first part of a school project, I'm trying to figure out how to remove JavaScript <script {...}> and </script {...}> tags, as well as anything between < and >.

However, we can't import any modules (even built-in Python ones), because apparently the marker might not be able to access them.

I tried this:

text = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
while text.find("<script") >= 0:
    script_start = text.find("<script")
    script_end = text.find(">", text.find("</script")) + 1
    text = text[:script_start] + text[script_end:]

while text.find("<") >= 0:
    script2_start = text.find("<")
    script2_end = text.find(">") + 1
    text = text[:script2_start] + text[script2_end:]

And that does work for smaller files, but the project involves big text files (the simplified test file we were given is 10.4 MB), so it doesn't finish and just gets stuck.

Anyone got any ideas to make it more efficient?

Make sure the marker has a decent Python installation and use regex.
– Patrick Artner, Commented Sep 30, 2020 at 10:16
Why do you differentiate between <script ... > and everything between < ... >? Aren't those the same?
– Patrick Artner, Commented Sep 30, 2020 at 10:17
We can't use regex, apparently it defeats the purpose. Also no, because I need to delete what's between the script tags too, not just the tags themselves.
– user13292868, Commented Sep 30, 2020 at 10:20

Patrick Artner · Accepted Answer · 2020-09-30 12:28:05Z

You do not need to delete anything. In fact, you never want to modify strings.

Strings are immutable: every time you "modify" one, you actually create a new one and throw the old one away. That is a waste of processor time and memory.
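
To make that cost concrete, here is a small sketch (the variable names are only illustrative, and nothing is imported): a string cannot be changed in place, and slicing the way the question's loop does builds a brand-new string each time.

```python
text = "<hi> hey"
alias = text                 # second name for the same string object

# Item assignment is not allowed on a str.
try:
    text[0] = "["
    changed_in_place = True
except TypeError:
    changed_in_place = False

# "Removing" the tag by slicing builds a new string; the old one is untouched.
text = text[:0] + text[4:]

print(changed_in_place)      # False
print(repr(text))            # ' hey'
print(repr(alias))           # '<hi> hey'
```

On a 10 MB string, each such slice copies up to 10 MB, and the loop in the question does that once per tag.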

You are operating on files, so process the input character by character:

remember whether you are inside <...> or not
if inside, the only character that matters is >, which gets you outside again
if outside and the character is <, you go inside and ignore that character
if outside and the character is not <, you write it to the output (file)

# create file
with open("somefile.txt","w") as f:
    # raise the multiplier to about 140000 to create a file of roughly 10 MB
    f.write("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n"*10)

# open file to read from and file to write to
with open("somefile.txt") as f, open("otherfile.txt","w") as out:
    # starting outside
    inside = False
    # we iterate the file line by line
    for line in f:
        # and each line characterwise
        for c in line:
            if not inside and c == "<":
                inside = True
            elif inside and c != ">":
                continue
            elif inside and c == ">":
                inside = False
            elif not inside:
                # only case to write to out
                out.write(c)

# show the input and the result, closing both files again
with open("somefile.txt") as f:
    print(f.read() + "\n")
with open("otherfile.txt") as f:
    print(f.read())

Output:

<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata


 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
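
If one write call per character turns out to be slow, the same state machine can collect the kept characters of each line and hand them back in one piece. The following is only a sketch under that assumption; strip_tags_chunks is a hypothetical helper name, and it accepts any iterable of strings, for example an open file.

```python
def strip_tags_chunks(chunks):
    """Yield each chunk with everything between < and > removed.

    The inside/outside flag is carried over from one chunk to the next,
    so a tag may start in one chunk and end in a later one.
    """
    inside = False
    for chunk in chunks:
        kept = []
        for c in chunk:
            if inside:
                if c == ">":
                    inside = False
            elif c == "<":
                inside = True
            else:
                kept.append(c)
        yield "".join(kept)


# the <b ...> tag is split across two lines on purpose
lines = ["<p>one <b\n", "class='x'>two</b>\n"]
cleaned = "".join(strip_tags_chunks(lines))
print(repr(cleaned))   # 'one two\n'
```

With files this would be used as `for piece in strip_tags_chunks(f): out.write(piece)`, which still holds only one line in memory at a time.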

If you aren't allowed to operate directly on files, read the file into a list of characters, which consumes 11+ MB of memory:

data = list("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n" * 10)

result = []

inside = False
for c in data:
    if inside:
        if c == ">":
            inside = False
        # else ignore c - because we are inside
    elif c == "<":
        inside = True
    else:
        result.append(c)

print(''.join(result))

This is still better than repeatedly searching for the first occurrence of "<", but it might need up to twice the memory of your source (if the source does not contain any <...>, you end up with a second list of the same size).

Operating on the files is far more memory efficient than doing any in-place list modification (which would be a third way to do this).
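
For completeness, the in-place list variant could look roughly like the sketch below. The function name strip_tags_in_place is made up for illustration. It keeps a separate write position that never overtakes the read position, so no second list is needed.

```python
def strip_tags_in_place(chars):
    """Remove <...> spans from a list of characters without a second list."""
    write = 0                      # next slot to fill with a kept character
    inside = False
    for c in chars:                # the read position only moves forward
        if inside:
            if c == ">":
                inside = False
        elif c == "<":
            inside = True
        else:
            chars[write] = c       # write <= read, so nothing unread is overwritten
            write += 1
    del chars[write:]              # drop the leftover tail in one step


data = list("<hi> hey <bye> tata")
strip_tags_in_place(data)
print(repr("".join(data)))         # ' hey  tata'
```

Deleting single elements from the middle of the list instead would shift everything behind them each time, which is the same kind of repeated copying as slicing the string.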

There are some glaring things you would also need to work around, for example:

<script type="text/javascript">
var i = 10;
if (i < 5) {
  // some code
}
</script>

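will leave part of the "code" in the output: the < in i < 5 is treated as the start of a tag, so only the text from there up to the > of </script> is dropped.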
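
A quick way to see the problem is to run the plain inside/outside logic on that snippet. In this sketch, strip_tags is just a hypothetical wrapper around the same loop as in the list version:

```python
def strip_tags(text):
    """Drop everything between < and >, with no special handling for scripts."""
    result = []
    inside = False
    for c in text:
        if inside:
            if c == ">":
                inside = False
        elif c == "<":
            inside = True
        else:
            result.append(c)
    return "".join(result)


snippet = (
    '<script type="text/javascript">\n'
    "var i = 10;\n"
    "if (i < 5) {\n"
    "  // some code\n"
    "}\n"
    "</script>\n"
)
leftover = strip_tags(snippet)
print(repr(leftover))   # '\nvar i = 10;\nif (i \n'
```

Everything before the < of the comparison survives, which is exactly the part that should have been removed along with the script tags.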

This might handle the easier corner cases:

# open file to read from and file to write to
with open("somefile.txt") as f, open("otherfile.txt","w") as out:
    # starting outside
    inside = False
    insideJS = False
    # we iterate the file line by line
    for line in f:

        # string manipulation :/ - will remove <script ...> .. </script ..>
        # even over multiple lines - probably missed some cornercases.
        kept = ""
        while line:
            if insideJS:
                jsEnd = line.find("</script")
                if jsEnd < 0:
                    # the script continues on the next line: drop the rest
                    line = ""
                else:
                    # keep "</script ...>" itself, the loop below removes it as a tag
                    line = line[jsEnd:]
                    insideJS = False
            else:
                jsStart = line.find("<script")
                if jsStart < 0:
                    kept += line
                    line = ""
                else:
                    # keep what comes before the script, skip past "<script"
                    kept += line[:jsStart]
                    line = line[jsStart + len("<script"):]
                    insideJS = True
        line = kept

        # and each line characterwise
        for c in line:
            if inside:
                if c == ">":
                    inside = False
            elif c == "<":
                inside = True
            else:
                out.write(c)

woblob · Accepted Answer · 2020-09-30 11:18:16Z

Even though there are two nested while loops, the index i only ever moves forward, so the scan is still linear in the length of the string.

string = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
new_string = ''
i = 0
while i < len(string):
    if string[i] == "<":
        # skip ahead to the closing '>' (or to the end if the tag is never closed)
        while i < len(string) and string[i] != ">":
            i += 1
    else:
        new_string += string[i]
    i += 1

print(new_string)

Outputs:

 hello  hello  hey
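
The same forward-only scan can also be written with str.find and a start offset, which stays close to the loop in the question but never searches from the beginning again. This is a minimal sketch, with strip_tags_find as a hypothetical name. Like the code above, it removes only the tags, not the text between the script tags.

```python
def strip_tags_find(text):
    """Remove <...> spans by jumping from tag to tag with str.find."""
    pieces = []
    pos = 0
    while True:
        start = text.find("<", pos)        # search only from pos onwards
        if start < 0:
            pieces.append(text[pos:])      # no more tags: keep the rest
            break
        pieces.append(text[pos:start])     # keep the text before the tag
        end = text.find(">", start)
        if end < 0:                        # unclosed tag: drop the rest
            break
        pos = end + 1
    return "".join(pieces)


print(repr(strip_tags_find("<hi> hey <bye> tata")))   # ' hey  tata'
```

Collecting the pieces in a list and joining them once at the end avoids building a longer and longer string one character at a time.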

answered Sep 30, 2020 at 11:18


Tibebes. M · Accepted Answer · 2020-09-30 11:31:42Z

Here is one approach with a finite-state automaton (FSA):

output = ''

NORMAL, INSIDE_TAG = range(2) # available states

state = NORMAL # start with normal state

s = '<script beep beep> hello </script boop doop woop> hello <hi id="someid" class="some class"><a> hey </a><bye>'

for char in s:
  if char == '<': # if we encounter '<' we enter the INSIDE_TAG state
    state = INSIDE_TAG
    continue
  elif char == '>': # we can safely exit the INSIDE_TAG state
    state = NORMAL
    continue

  if state == NORMAL:
    output += char  # add the char to the output only if we are in normal state

print(output)

If you need to parse what the tags mean, make sure to use a stack (it can be implemented with a list).

That increases the complexity, but you can achieve robust checking with a finite-state machine (FSM).

See the following example:

output = ''

(
  NORMAL,
  TAG_ATTRIBUTE,
  INSIDE_JAVASCRIPT,
  EXITING_TAG,
  BEFORE_TAG_OPENING_OR_ENDING,
  TAG_NAME,
  ABOUT_TO_EXIT_JS
) = range(7) # available states

state = NORMAL # start with normal state

tag_name = ''

s = """
<script type="text/javascript">
  var i = 10;
  if (i < 5) {
    // some code
  }
</script>
<sometag>
  test string
  <a href="http://google.com"> another string</a>
</sometag>
"""

for char in s:
  # print(char, '-', state, ':', tag_name)
  if state == NORMAL:
    if char == '<':
      tag_name = ''  # start a fresh name, otherwise names of earlier tags pile up
      state = BEFORE_TAG_OPENING_OR_ENDING
    else:
      output += char
  elif state == BEFORE_TAG_OPENING_OR_ENDING:
    if char == '/':
      state = EXITING_TAG
    else:
      tag_name += char
      state = TAG_NAME
  elif state == TAG_ATTRIBUTE:
    if char == '>':
      if tag_name == 'script':
        state = INSIDE_JAVASCRIPT
      else:
        state = NORMAL
  elif state == TAG_NAME:
    if char == ' ':
      state = TAG_ATTRIBUTE
    elif char == '>':
      if tag_name == 'script':
        state = INSIDE_JAVASCRIPT
      else:
        state = NORMAL
    else:
      tag_name += char
  elif state == INSIDE_JAVASCRIPT:
    if char == '<':
      state = ABOUT_TO_EXIT_JS
    else:
      pass
      # output += char
  elif state == ABOUT_TO_EXIT_JS:
    if char == '/':
      state = EXITING_TAG
      tag_name = ''
    else:
      # output += '<'
      state = INSIDE_JAVASCRIPT
  elif state == EXITING_TAG:
    if char == '>':
      state = NORMAL

print(output)

OUTPUT:

  test string
  another string
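
The stack idea mentioned earlier in this answer is not shown in code, so here is a rough sketch of what it could look like. The function name tags_are_nested is made up. The sketch assumes that every tag has a closing partner (no void or self-closing tags), that attribute values contain no >, and that script bodies have already been removed.

```python
def tags_are_nested(text):
    """Return True if every opening tag is closed in reverse order."""
    stack = []                 # names of the tags that are currently open
    tag = None                 # text of the tag being read, None outside a tag
    for c in text:
        if tag is None:
            if c == "<":
                tag = ""
        elif c != ">":
            tag += c
        else:
            parts = tag.split()
            name = parts[0] if parts else ""
            if name.startswith("/"):
                # a closing tag must match the most recently opened one
                if not stack or stack.pop() != name[1:]:
                    return False
            else:
                stack.append(name)
            tag = None
    return not stack and tag is None


print(tags_are_nested('<sometag> test <a href="http://google.com">x</a></sometag>'))  # True
print(tags_are_nested("<a><b>x</a></b>"))                                             # False
```

A plain list is enough here: append pushes a tag name and pop takes the most recent one off again.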

edited Sep 30, 2020 at 11:31

answered Sep 30, 2020 at 10:49


2 Comments

Patrick Artner Over a year ago

same problem as my 1st version: '<script bla> here some code </script blubb>' will not remove 'here some code' from the output. He commented something along the lines of i need to delete whats between the script tags too, not just the tags themselves

Tibebes. M Over a year ago

oh I mis-understood the requirement.. I thought these were needed. I've updated the answer
