I have a Python script that runs a shell command to copy files from the local filesystem to HDFS.

import os
import logging
import subprocess


filePath = "/tmp"
keyword = "BC10^Dummy-Segment"
for root, dirs, files in os.walk(filePath):
    for file in files:
        if keyword in file:
            subprocess.call(["hadoop fs -copyFromLocal /tmp/BC10%5EDummy-Segment* /user/app"], shell=True)
            subprocess.call(["hadoop fs -rm /tmp/BC10%5EDummy-Segment*"], shell=True)

I'm seeing this error:

copyFromLocal: `/tmp/BC10^Dummy-Segment*': No such file or directory
rm: `/tmp/BC10^Dummy-Segment_2019': No such file or directory

Updated code:

import glob
import subprocess
import os
from urllib import urlencode, quote_plus

filePath = "/tmp"
keyword = "BC10^Dummy-Segment"

wildcard = os.path.join(filePath, '{0}*'.format(keyword))
print(wildcard)
files = [urlencode(x, quote_via=quote_plus) for x in  glob.glob(wildcard)]
subprocess.check_call(["hadoop", "fs", "-copyFromLocal"] + files + ["/user/app"])
#subprocess.check_call(["hadoop", "fs", "-rm"] + files)

Seeing error when I run:

Traceback (most recent call last):
  File "ming.py", line 11, in <module>
    files = [urlencode(x, quote_via=quote_plus) for x in  glob.glob(wildcard)]
TypeError: urlencode() got an unexpected keyword argument 'quote_via'
  • The real file name will be BC10^Dummy-Segment followed by a timestamp, so I want to fetch all the files beginning with this keyword. Commented Sep 11, 2019 at 15:58
  • Are you sure that these files exist? This might simply be caused by the * symbol matching no files at all. Commented Sep 11, 2019 at 16:00
  • When you use shell=True, there is no need to pass an array; passing the command as a string suffices. Commented Sep 11, 2019 at 16:03
  • Could you post the full path of a file that you know is there? Commented Sep 11, 2019 at 16:05
  • You are running rm in a loop; it will remove the files on the first iteration and fail on the next. For your use case you don't need a loop; in fact, a simple shell script with those two commands would suffice. Commented Sep 11, 2019 at 16:08

1 Answer

I'm guessing you are URL-encoding the path to pass it properly to Hadoop, but in doing so you basically hide it from the shell. There really are no files matching the wildcard /tmp/BC10%5EDummy-Segment* where % etc are literal characters.
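
As a rough illustration of that point, the sketch below creates a file with a literal ^ in its name and counts the matches for both spellings of the wildcard. Python's glob stands in for the shell's wildcard expansion here, and the scratch directory, the file name, and the count_matches helper are made up for the demonstration.

```python
import glob
import os
import tempfile

def count_matches(directory, prefix):
    """Count the files in directory whose names start with prefix."""
    return len(glob.glob(os.path.join(directory, prefix + "*")))

# Scratch directory holding one hypothetical file with a literal caret in its name.
scratch = tempfile.mkdtemp()
open(os.path.join(scratch, "BC10^Dummy-Segment_2019"), "w").close()

print(count_matches(scratch, "BC10^Dummy-Segment"))    # wildcard with the literal caret
print(count_matches(scratch, "BC10%5EDummy-Segment"))  # wildcard with the percent-encoded caret
```

Only the first pattern finds the file, because no file on disk has the characters %5E in its name.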

Try handling the glob from Python instead. With that, you can also get rid of that pesky shell=True; and with that change, it is finally actually correct and useful to pass the commands as a list of strings (never a list of a single space-separated string; and with shell=True, don't pass a list at all). Notice also the switch to check_call so we trap errors and don't delete the source files if copying them failed. (See also https://stackoverflow.com/a/51950538/874188 for additional rationale.)

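import glob
import os
import subprocess
from urllib import quote_plus  # Python 2; on Python 3: from urllib.parse import quote_plus

filePath = "/tmp"
keyword = "BC10^Dummy-Segment"

# Expand the wildcard in Python, where ^ is an ordinary character.
wildcard = os.path.join(filePath, '{0}*'.format(keyword))
# URL-encode each match; safe="/" keeps the path separators from being encoded as %2F.
files = [quote_plus(x, safe="/") for x in glob.glob(wildcard)]
# check_call raises CalledProcessError on failure, so -rm only runs after a successful copy.
subprocess.check_call(["hadoop", "fs", "-copyFromLocal"] + files + ["/user/app"])
subprocess.check_call(["hadoop", "fs", "-rm"] + files)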
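
To see why check_call matters for the copy-then-delete order, here is a small sketch that uses stand-in commands rather than hadoop. The run_steps helper and both commands are hypothetical; the first command simply exits with status 1 the way a failed copy would.

```python
import subprocess
import sys

def run_steps(steps):
    """Run each argument list in order and stop at the first failing command."""
    completed = []
    for argv in steps:
        try:
            subprocess.check_call(argv)  # a list of strings, no shell involved
        except subprocess.CalledProcessError:
            break  # skip every later step, such as the delete
        completed.append(argv)
    return completed

# Stand-ins for the real commands: a "copy" that fails and a "cleanup" that would succeed.
failing_copy = [sys.executable, "-c", "import sys; sys.exit(1)"]
cleanup = [sys.executable, "-c", "print('cleanup ran')"]

print(len(run_steps([failing_copy, cleanup])))
```

With subprocess.call the nonzero exit status would only be returned, so a script that ignores it would carry on to the delete step.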

The glob-based script above will not traverse subdirectories; but neither would your attempt with os.walk() do anything actually useful if it found files in subdirectories. If you actually want that to happen, please explain in more detail what the script should do.
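
If subdirectories do need to be searched, one possible approach is to collect full paths with os.walk() and pass that list to the hadoop commands. This is only a sketch under assumptions: find_files is a hypothetical helper, it matches names that begin with the keyword (the question's code tested for a substring), and the scratch tree exists purely for the demonstration.

```python
import os
import tempfile

def find_files(top, keyword):
    """Return the full paths of files under top whose names start with keyword."""
    matches = []
    for root, _dirs, names in os.walk(top):
        for name in names:
            if name.startswith(keyword):
                # Join with root so files in subdirectories keep their real location.
                matches.append(os.path.join(root, name))
    return sorted(matches)

# Scratch tree with hypothetical file names, one match at each level.
top = tempfile.mkdtemp()
nested = os.path.join(top, "sub")
os.mkdir(nested)
for directory in (top, nested):
    open(os.path.join(directory, "BC10^Dummy-Segment_2019"), "w").close()
open(os.path.join(top, "unrelated.txt"), "w").close()

found = find_files(top, "BC10^Dummy-Segment")
print(len(found))
```

The key difference from the loop in the question is that each match is joined with root, so the command receives the actual path of every file instead of a fixed wildcard.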

10 Comments

Hi, I'm seeing a SyntaxError: invalid syntax, with an arrow pointing to the "s" in subprocess.check_call(["hadoop", "fs", "-copyFromLocal"] + files + ["/user/app"])
"I'm guessing you are URL-encoding the path to pass it properly to Hadoop, but in doing so you basically hide it from the shell. There really are no files matching the wildcard /tmp/BC10%5EDummy-Segment* where % etc are literal characters." Good catch!
@tripleee Sorry, one more error: from urllib.parse import urlencode, quote_plus fails with ImportError: No module named parse
I'm using Python 2.6.6
In Python 2, use from urllib import urlencode, quote_plus.
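
For a script that may run under either major version, a common pattern is to try the Python 3 import first and fall back to the Python 2 location. The sketch below assumes that pattern; encode_path is a hypothetical helper, and safe="/" is passed so that the path separators are left alone, since quote_plus would otherwise encode them as %2F.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2

def encode_path(path):
    """Percent-encode a path while keeping its / separators intact."""
    return quote_plus(path, safe="/")

print(encode_path("/tmp/BC10^Dummy-Segment_2019"))
```

The TypeError in the question has a related cause: quote_via is a Python 3 addition to urlencode, and urlencode is meant for query parameters rather than a single path string, which is why quote_plus is the better fit here.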