0

I am new to Python and am trying to find the longest word in alice_in_wonderland.txt. I think I have a good system set up (see below), but my output is a "word" made of several words joined by dashes. Is there some way to remove the dashes when reading the file? For the text file visit here

sample from text file:

That's very important,' the King said, turning to the jury. They were just beginning to write this down on their slates, when the White Rabbit interrupted: UNimportant, your Majesty means, of course,' he said in a very respectful tone, but frowning and making faces at him as he spoke. " UNimportant, of course, I meant,' the King hastily said, and went on to himself in an undertone, important--unimportant-- unimportant--important--' as if he were trying which word sounded best."

code:


    # String input
    with open("alice_in_wonderland.txt", "r") as myfile:
        string = myfile.read().replace('\n', '')
    # Initialize list
    my_list = []
    # Split words into list
    for word in string.split(' '):
        my_list.append(word)
    # Initialize list
    uniqueWords = []
    # Fill in new list with unique words to shorten final printout
    for i in my_list:
        if i not in uniqueWords:
            uniqueWords.append(i)
    # Length of longest word
    count = 0
    # Longest word placeholder
    longest = ''
    for word in uniqueWords:
        if len(word) > count:
            longest = word
            count = len(longest)
        print longest
3
  • Can you provide a sample of input (not everyone wants to follow a link to get your data) and output that's required? Describe how it differs from what you've got now etc... Commented Aug 16, 2014 at 23:02
  • That's very important,' the King said, turning to the jury. They were just beginning to write this down on their slates, when the White Rabbit interrupted: UNimportant, your Majesty means, of course,' he said in a very respectful tone, but frowning and making faces at him as he spoke. " UNimportant, of course, I meant,' the King hastily said, and went on to himself in an undertone, important--unimportant-- unimportant--important--' as if he were trying which word sounded best." Commented Aug 16, 2014 at 23:08
  • This is a small subset with the unusual quote. I don't know if I can count the dashed phrase as the longest word though, so any thoughts on how to remove the dashes from the input? Maybe something like .replace('\n','\-','')? Commented Aug 16, 2014 at 23:10

3 Answers

3
>>> import nltk # pip install nltk
>>> nltk.download('gutenberg')
>>> words = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> max(words, key=len) # find the longest word
'disappointment'

3 Comments

Seems we get the same answer then
+1 from me for serious usage of word usage - NLTK is the way to go
thumbs up! I didn't know about nltk
2

Here's one way using re and mmap:

import re
import mmap

with open('your alice in wonderland file') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = re.finditer(r'\w+', mf)
    print max((word.group() for word in words), key=len)

# disappointment

Memory-mapping avoids reading the whole file into a Python string up front, which matters more as the file grows.
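The snippet above is written for Python 2. As a rough sketch of the same idea in Python 3 (the function name `longest_word_mmap` is made up for illustration, and UTF-8 text is assumed), the file has to be opened in binary mode and the pattern has to be a bytes pattern, because the mapped file exposes bytes rather than str:

```python
import mmap
import re


def longest_word_mmap(path):
    """Return the longest run of word characters in the file at `path`.

    Hypothetical helper for illustration; assumes a non-empty file,
    since mmap cannot map a zero-length file.
    """
    with open(path, "rb") as fin:
        with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as mf:
            # rb"\w+" is a bytes pattern; the regex scans the mapped file
            # directly and only the matched words are copied out.
            longest = max(
                (m.group() for m in re.finditer(rb"\w+", mf)),
                key=len,
                default=b"",
            )
    # Assumption: the text is UTF-8 (plain ASCII also decodes fine).
    return longest.decode("utf-8")
```

Because `\w+` never matches a dash, a run such as `important--unimportant--` is split into separate words without any extra cleanup.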

0

Use str.replace to replace the dashes with spaces (or whatever you want). To do this, add another call to replace after the first one, on line 3 of your code:

string=myfile.read().replace('\n','').replace('-', ' ')
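For a fuller picture, here is a minimal sketch of the replace-based approach in Python 3. The function name `longest_word` and the punctuation stripping are additions for illustration, not part of the original code. It also swaps newlines for spaces instead of deleting them, since deleting them glues the last word of one line onto the first word of the next:

```python
import string


def longest_word(text):
    """Return the longest whitespace-separated word after removing dashes."""
    # Turn newlines and dashes into spaces so neighbouring words stay separate.
    cleaned = text.replace("\n", " ").replace("-", " ")
    # split() with no argument collapses runs of whitespace; strip() trims
    # leading and trailing punctuation (quotes, commas) from each word.
    words = (w.strip(string.punctuation) for w in cleaned.split())
    return max(words, key=len, default="")


sample = "important--unimportant-- unimportant--important--' as if he were trying"
print(longest_word(sample))  # unimportant
```

One trade-off to be aware of: replacing every `-` also splits genuinely hyphenated words, so replacing only `--` may be closer to what you want if those should count as single words.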
