I created a simple command-line application that reads the CSV files in a specific folder and finds the values in a specified column that exist in all of the files. (Here's the complete source code.)
Here's the core of the logic:
func csvReader(inputColumn, fileLocation string) {
	startTime := time.Now()
	// get all the files from the specified folder
	files, err := ioutil.ReadDir(fileLocation)
	if err != nil {
		log.Fatal(err)
	}
	// both channels are buffered to len(files) so that no goroutine blocks on send
	noColumn := make(chan struct{}, len(files))
	results := make(chan map[string]bool, len(files))
	var wg sync.WaitGroup
	// read every file concurrently to collect the values of the given column
	for _, file := range files {
		wg.Add(1)
		go func(file os.FileInfo) {
			defer wg.Done()
			// open the file inside the folder that was just listed (needs path/filepath)
			f, err := os.Open(filepath.Join(fileLocation, file.Name()))
			if err != nil {
				log.Fatal(err)
			}
			defer f.Close()
			datas, ok := readFile(inputColumn, f)
			if !ok {
				noColumn <- struct{}{}
				return
			}
			results <- datas
		}(file)
	}
	wg.Wait()
	// check whether any file was missing the column
	select {
	case <-noColumn:
		fmt.Printf("The column name = %v doesn't exist\n", inputColumn)
		return
	default: // do nothing and continue
	}
	close(results)
	close(noColumn)
	// receive the results and find the values that exist in all of them
	theSameValue := getSameValues(&results)
	fmt.Printf("final result = %+v\n", theSameValue)
	fmt.Printf("final result size = %+v\n", len(theSameValue))
	fmt.Printf("time consumed = %+v\n", time.Since(startTime).Seconds())
}
The function lists the files in the specified fileLocation, then reads each file concurrently to get the values from the given column inputColumn. Each goroutine sends the values it found to a channel, which collects the results of reading the CSV files.
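readFile itself is not shown here. As a rough illustration of the per-file step, here is a minimal sketch built on encoding/csv. The names columnValues and columnValuesFromString are hypothetical, and the sketch assumes that the first record of each file is a header row; the real readFile may work differently.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// columnValues is a hypothetical stand-in for readFile. It returns the set of
// values found under the named header column, and false when the header row
// does not contain that column or the input cannot be parsed.
func columnValues(column string, r io.Reader) (map[string]bool, bool) {
	reader := csv.NewReader(r)

	// Assumption: the first record is a header row.
	header, err := reader.Read()
	if err != nil {
		return nil, false
	}
	index := -1
	for i, name := range header {
		if name == column {
			index = i
			break
		}
	}
	if index < 0 {
		return nil, false
	}

	// Collect the column as a set so duplicates within a file collapse.
	values := make(map[string]bool)
	for {
		record, err := reader.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, false
		}
		values[record[index]] = true
	}
	return values, true
}

// columnValuesFromString is a small convenience wrapper for in-memory data.
func columnValuesFromString(column, data string) (map[string]bool, bool) {
	return columnValues(column, strings.NewReader(data))
}

func main() {
	const sample = "id,email\n1,a@example.com\n2,b@example.com\n3,a@example.com\n"
	values, ok := columnValuesFromString("email", sample)
	fmt.Println(len(values), ok) // prints 2 true
}
```

Returning a set (map[string]bool) rather than a slice keeps the later "exists in every file" check a simple map lookup.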
The channel results := make(chan map[string]bool, len(files)) is passed to a function to begin searching for the values that exist in all files.
theSameValue := getSameValues(&results)
Notice that the function getSameValues() receives a pointer to results. I did this because I don't want the results channel to be copied into the function's argument, which I assumed would take extra memory.
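For comparison, a channel value in Go already acts as a reference to the underlying channel, so passing it by value copies only a small handle and not the buffered maps. The sketch below illustrates this with a hypothetical helper named drain; it is not part of the application.

```go
package main

import "fmt"

// drain receives the channel by value. Only the channel handle is copied;
// it still refers to the same underlying queue as the caller's variable.
func drain(results chan map[string]bool) int {
	total := 0
	for set := range results {
		total += len(set)
	}
	return total
}

func main() {
	results := make(chan map[string]bool, 2)
	results <- map[string]bool{"a": true, "b": true}
	results <- map[string]bool{"c": true}
	close(results)

	fmt.Println(drain(results)) // prints 3
	fmt.Println(len(results))   // prints 0: the caller's channel was drained too
}
```

If the function only receives, the parameter could also be declared as <-chan map[string]bool to make that intent explicit.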
And here is the getSameValues() function:
// getSameValues returns the values that exist in every result set
func getSameValues(results *chan map[string]bool) []string {
	// the channel is closed and fully buffered, so len() is the number of result sets
	var datas = make([]map[string]bool, len(*results))
	minIndex := -1
	minSize := int(MaxUint >> 1)
	i := 0
	for values := range *results {
		// remember the smallest non-empty set; only its values can be common to all
		sizeValues := len(values)
		if sizeValues < minSize && sizeValues > 0 {
			minSize = sizeValues
			minIndex = i
		}
		datas[i] = values
		i++
	}
	// without a non-empty set there is nothing to compare
	if minIndex < 0 {
		return nil
	}
	// keep the values of the smallest set that exist in every other set
	var theSameValue []string
	for value := range datas[minIndex] {
		isExistAll := true
		for _, data := range datas {
			if !data[value] {
				// one miss is enough; stop so a later hit cannot overwrite it
				isExistAll = false
				break
			}
		}
		if isExistAll {
			theSameValue = append(theSameValue, value)
		}
	}
	return theSameValue
}
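To make the expected behaviour concrete, here is a small worked example of the same "iterate the smallest set, check all the others" idea on in-memory data. The function intersect is a hypothetical variant that takes a slice instead of a channel and sorts its output so the result is stable. Unlike the function above, it treats an empty set as making the intersection empty.

```go
package main

import (
	"fmt"
	"sort"
)

// intersect returns the values present in every set, sorted for stable output.
// It is an illustrative variant that takes a slice rather than a channel.
func intersect(sets []map[string]bool) []string {
	if len(sets) == 0 {
		return nil
	}
	// Iterate over the smallest set: the intersection cannot be larger than it.
	smallest := sets[0]
	for _, set := range sets[1:] {
		if len(set) < len(smallest) {
			smallest = set
		}
	}
	var common []string
	for value := range smallest {
		inAll := true
		for _, set := range sets {
			if !set[value] {
				inAll = false
				break // one miss is enough to reject the value
			}
		}
		if inAll {
			common = append(common, value)
		}
	}
	sort.Strings(common)
	return common
}

func main() {
	sets := []map[string]bool{
		{"a": true, "b": true, "c": true},
		{"b": true, "c": true, "d": true},
		{"c": true, "b": true},
	}
	fmt.Println(intersect(sets)) // prints [b c]
}
```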
Hopefully this will help others, and if you find something that can be improved, please give some suggestions.