#211216 - 2016-03-18 12:12 PM Searching for duplicate strings in CSV file
Sweeny Offline
Fresh Scripter

Registered: 2016-03-18
Posts: 18
Loc: Hampshire
Hi all,

I'm hoping someone can shed a little light on this. I'm trying to write a script that searches a massive CSV file (over 6 GB!) for duplicate lines. Ideally the script would read the first line of the CSV, search the rest of the file for any copies of that line, then move on to the second line, and so on. I hope this won't complicate things too much, but I'd also like the script to delete any duplicates it finds, leaving a substantially smaller CSV with a single instance of each line.

Any help would be greatly appreciated.

Cheers :)

#211217 - 2016-03-18 02:45 PM Re: Searching for duplicate strings in CSV file [Re: Sweeny]
Glenn Barnas Administrator Offline
KiX Supporter
*****

Registered: 2003-01-28
Posts: 4396
Loc: New Jersey
Welcome to KORG!

There are two public functions: QSort(), which sorts the data, and Uniq(), which removes all dups. Both work on data in an array. The FileIO() function can load your file into an array.

I'm not sure how well this will cope with the file size, as that is quite hefty. Here's the basic process:
 Code:
$aFileData = FileIO('mybigfile.csv', 'R')   ; read the big CSV file
$aSort = QSort($aFileData)  ; sort the array
$aUniq = Uniq($aSort) ; remove dups
$aFileData = FileIO('mynewbigfile.csv', 'W', $aUniq) ; write a new, sorted, deduped file
Glenn
_________________________
Actually I am a Rocket Scientist! :D

#211218 - 2016-03-18 02:51 PM Re: Searching for duplicate strings in CSV file [Re: Glenn Barnas]
Glenn Barnas Administrator Offline
Actually, after checking, the Uniq() UDF posted on my site no longer needs the array to be sorted. That simplifies the code above and eliminates any potential issue with changing the data sequence.

Load the file into an array with FileIO

Remove dups with the Uniq() UDF. You can assign the result back to the same array, $aFileData = Uniq($aFileData), which leaves you with the unique result.

Save the array to a file (can be the same file) - FileIO('filename', 'W', $aFileData)
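
For readers without the UDFs to hand, the load / remove dups / save steps above can be sketched in Python. This is only an illustration of the idea, not the Uniq() or FileIO() implementation: the names dedupe_lines and dedupe_file are made up for this example, and UTF-8 input is an assumption. It keeps the first copy of each line and preserves the original order.

```python
from typing import Iterable, Iterator


def dedupe_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        # Compare without the line ending so a final line that lacks
        # a trailing newline still matches earlier copies of itself.
        key = line.rstrip("\r\n")
        if key not in seen:
            seen.add(key)
            yield line


def dedupe_file(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path without duplicate lines; return lines written."""
    written = 0
    # newline="" keeps the original line endings untouched (assumes UTF-8 text).
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in dedupe_lines(src):
            dst.write(line)
            written += 1
    return written
```

Unlike loading the whole file into an array, this reads one line at a time, but the set still holds every distinct line, so memory use grows with the amount of unique data.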

Glenn
#211219 - 2016-03-18 03:14 PM Re: Searching for duplicate strings in CSV file [Re: Glenn Barnas]
Jochen Administrator Offline
KiX Supporter
*****

Registered: 2000-03-17
Posts: 6380
Loc: Stuttgart, Germany
Plus make sure you have enough memory for the action :D
#211228 - 2016-03-21 01:02 PM Re: Searching for duplicate strings in CSV file [Re: Jochen]
Sweeny Offline
Hi all,

Thank you very much for the prompt response. I might be missing something very simple here, but I keep getting [ERROR : expected ')'!] when I try to incorporate these functions into my script. Is there anything I need to change?

Cheers,
Tom

#211229 - 2016-03-21 01:04 PM Re: Searching for duplicate strings in CSV file [Re: Sweeny]
Jochen Administrator Offline
Hi,
you will have to add the UDF code at the end of your script. See the how-to-use-UDFs article in the UDF forum.

Still, you will probably fail to do this due to the size of the file :D
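
One way to ease the memory problem, sketched here in Python as a swapped-in technique rather than anything the UDFs do: stream the file line by line and remember only a fixed-size digest of each line instead of the line itself. The function name dedupe_stream and the 16-byte digest size are choices made for this example. Memory still grows with the number of distinct lines, and two different lines could in principle share a digest, although at 128 bits that is extremely unlikely.

```python
import hashlib
from typing import BinaryIO


def dedupe_stream(src: BinaryIO, dst: BinaryIO) -> int:
    """Copy src to dst, skipping repeated lines; return how many were dropped."""
    seen = set()   # holds one 16-byte digest per distinct line
    dropped = 0
    for raw in src:  # binary files iterate one line at a time
        # Hash the line without its line ending so CRLF/LF differences
        # and a missing final newline do not hide a duplicate.
        digest = hashlib.blake2b(raw.rstrip(b"\r\n"), digest_size=16).digest()
        if digest in seen:
            dropped += 1
            continue
        seen.add(digest)
        dst.write(raw)
    return dropped
```

It would be called with two files opened in binary mode, for example open('mybigfile.csv', 'rb') and open('mynewbigfile.csv', 'wb').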
#211230 - 2016-03-21 02:24 PM Re: Searching for duplicate strings in CSV file [Re: Jochen]
Sweeny Offline
AHHH!!! Thank you! Yeah the file size...
#211231 - 2016-03-21 02:38 PM Re: Searching for duplicate strings in CSV file [Re: Sweeny]
Glenn Barnas Administrator Offline
I have a server in my environment with Cygwin tools installed, specifically for special situations. It's not KiX, but it might be helpful here: "cat mybigfile | sort | uniq > mysmalluniquefile"

Once the file size is reasonable, you can use Kix for ongoing maintenance.
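
To show what the pipeline above does: uniq only removes adjacent repeats, which is why the data goes through sort first. A small Python sketch of the same logic follows (sort_uniq is a name invented for this example). It works in memory, so it illustrates the idea rather than replacing the command-line tools for a 6 GB file, and Python's default string ordering may differ from the locale-dependent ordering sort uses.

```python
from itertools import groupby
from typing import Iterable, List


def sort_uniq(lines: Iterable[str]) -> List[str]:
    """Mimic `sort | uniq`: sort the lines, then keep one line per run of equals."""
    ordered = sorted(lines)
    # After sorting, identical lines sit next to each other, so taking
    # one key per group drops every repeat.
    return [key for key, _run in groupby(ordered)]
```

Note that sorting changes the line order, so a CSV header row will not necessarily stay at the top. With the real tools, sort -u gives the same result as sort | uniq in one step.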

Glenn
#211232 - 2016-03-21 04:53 PM Re: Searching for duplicate strings in CSV file [Re: Sweeny]
Sweeny Offline
Works a treat! Thank you all for the help, much appreciated.

Cheers

#211235 - 2016-03-22 04:15 PM Re: Searching for duplicate strings in CSV file [Re: Sweeny]
Glenn Barnas Administrator Offline
Great! (but which method did you use?)
Mind posting your solution (either code or method review) to help others?

Thanks!

Glenn