Hi all, I'm hoping someone can shed a little light on this. I'm trying to write a script that goes through a massive CSV (over 6GB in size!) and finds any duplicate lines in that file. Ideally the script would read the first line of the CSV, search the entire file for any copies of that line (which would indicate duplication), then move on to the second line, and so on. I'm hoping this won't complicate things too much, but I'd also like the script to delete any duplicates it finds, leaving a substantially smaller CSV with a single instance of each line. Any help would be greatly appreciated. Cheers
Welcome to KORG! There are two public functions: QSort(), which sorts the data, and Uniq(), which removes all dups. Both of these functions work with data in an array. The FileIO() function can load your file into an array. I'm not sure how it will handle the file size, as that is quite hefty. Here's the basic process:
Code:
    $aFileData = FileIO('mybigfile.csv', 'R')              ; read the big CSV file
    $aSort = QSort($aFileData)                             ; sort the array
    $aUniq = Uniq($aSort)                                  ; remove dups
    $aFileData = FileIO('mynewbigfile.csv', 'W', $aUniq)   ; write a new, sorted, deduped file
Actually, after checking, the Uniq() UDF posted on my site no longer needs the array to be sorted. That simplifies the code above and eliminates any potential issue with changing the data sequence.
1. Load the file into an array with FileIO().
2. Remove dups with the Uniq() UDF. You can use the same array name for the input and the result: $aFileData = Uniq($aFileData) gives you the unique result.
3. Save the array to a file (can be the same file): FileIO('filename', 'W', $aFileData)
Glenn
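For anyone who wants the same "keep the original sequence" behaviour outside KiXtart, here is a minimal shell sketch using awk. This is not the Uniq() UDF, just the same idea in a different tool. The function name dedupe_keep_order and the demo data are made up for illustration. awk remembers every distinct line it has seen, so memory use grows with the number of unique lines rather than with the total file size.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct line once, in first-seen order.
# awk stores every unique line in the "seen" array, so memory use grows
# with the number of distinct lines in the file.
dedupe_keep_order() {
    awk '!seen[$0]++' "$1"
}

# Small demonstration on a throwaway file (a stand-in for the real CSV).
demo=$(mktemp)
printf 'id,name\n2,bob\n1,ann\n2,bob\n1,ann\n3,cy\n' > "$demo"
dedupe_keep_order "$demo"
rm -f "$demo"
```

On a real file you would redirect the output to a new file, for example dedupe_keep_order mybigfile.csv > mynewbigfile.csv, and only replace the original once the result looks right.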
Plus, make sure you have enough memory for this, since the whole file gets loaded into an array.
Hi all, thank you very much for the prompt response. I might be missing something very simple here, but I keep getting [ERROR : expected ')'!] when I try to incorporate these functions into my script. Is there anything I need to change? Cheers, Tom
Hi, you will have to add the UDF code at the end of your script. See the "how to use UDFs" article in the UDF forum. Even so, you will probably still fail to do this because of the size of the file.
AHHH!!! Thank you! Yeah, the file size...
I have a server in my environment with Cygwin tools installed, specifically for special situations. It's not KiX, but for this situation it might be helpful:
    cat mybigfile | sort | uniq > mysmalluniquefile
Once the file size is reasonable, you can use KiX for ongoing maintenance. Glenn
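A slightly tighter variant of that pipeline, as a sketch only: sort -u does the sort and the dedupe in one step, and GNU sort spills to temporary files on disk when the input is larger than its buffer, which is why this route copes with files that won't fit in an array. One thing to watch is that sorting moves a header row somewhere into the middle of the output. The helper below assumes the first line is a header and keeps it on top. The function name dedupe_sorted and the sample data are hypothetical, and if your CSV has no header a plain sort -u is enough.

```shell
#!/bin/sh
# Hypothetical helper: dedupe a CSV with sort -u while keeping line 1
# (assumed to be a header row) at the top of the output.
# LC_ALL=C makes sort compare raw bytes instead of using locale rules.
dedupe_sorted() {
    head -n 1 "$1"                       # pass the header through untouched
    tail -n +2 "$1" | LC_ALL=C sort -u   # sort and dedupe the data rows
}

# Small demonstration on a throwaway file (a stand-in for the real CSV).
demo=$(mktemp)
printf 'id,name\n2,bob\n1,ann\n2,bob\n' > "$demo"
dedupe_sorted "$demo"
rm -f "$demo"
```

Note that, unlike the unsorted Uniq() approach, this changes the order of the data rows, and it only treats lines as duplicates when they match byte for byte.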
Works a treat! Thank you all for the help, much appreciated. Cheers
Great! (but which method did you use?) Mind posting your solution (either code or method review) to help others? Thanks! Glenn |