How To Sort 40 Lakh Records?

Pensu · 2012-09-30T06:14:51+00:00

So, here is the scenario. I have a csv (comma separated values) file with almost 40 lakh records in it. Each record is separated by a new line i.e. each line contains new record. The records contain a user id and then some related data. Now, I need to sort it user wise in a way that for every user i will create a separate file and the related data will be stored in it. I wrote a C program for that in which i simply took the user id, opened a file with the particular name and stored the data in it. It actually run continuously for 19 hours and then i had to shut it down for some reasons. The size of file is 388 Mb and i was able to sort 247 Mb of data out of it in those 19 hours. Now, instead of using the same approach, I am thinking of trying something else. Any ideas are welcome to make the algo more efficient. Any approach with parallelism will be appreciated, I have core2duo i3 second generation processor.

How To Sort 40 Lakh Records?

Pensu
@pensu-8tNeGU

Updated: Oct 26, 2024

Views: 1.1K

So, here is the scenario. I have a csv (comma separated values) file with almost 40 lakh records in it. Each record is separated by a new line i.e. each line contains new record. The records contain a user id and then some related data. Now, I need to sort it user wise in a way that for every user i will create a separate file and the related data will be stored in it. I wrote a C program for that in which i simply took the user id, opened a file with the particular name and stored the data in it. It actually run continuously for 19 hours and then i had to shut it down for some reasons. The size of file is 388 Mb and i was able to sort 247 Mb of data out of it in those 19 hours.

Now, instead of using the same approach, I am thinking of trying something else. Any ideas are welcome to make the algo more efficient. Any approach with parallelism will be appreciated, I have core2duo i3 second generation processor.

0

Replies

Howdy guest!

Dear guest, you must be logged-in to participate on CrazyEngineers. We would love to have you as a member of our community. Consider creating an account or login.

Replies

grsalvi

Member • Sep 30, 2012

How each user ID is different from other ? I mean how to detect different users ?

Are you sure? This action cannot be undone.
Cancel
Pensu

Member • Sep 30, 2012

Well how do IDs differ? I mean every user is assigned with a separate number like 1,2,3.....and so on. And the twist is I dont even know how many users are there exactly. There are supposed to around 275.

Are you sure? This action cannot be undone.
Cancel
grsalvi

Member • Sep 30, 2012

Okay . I assume before every comma,data of a particular user ends and after every comma 1st data is user id and then his information ...right ?

I actually don't know what logic you used or what codes you wrote.
But if data related to every user is same then,instead of checking id and commas you can use fread to read blocks of data from file and save.

Are you sure? This action cannot be undone.
Cancel
Pensu

Member • Sep 30, 2012

Well, thats a good approach and I actually used that only. The data of a particular user differs by a line meaning in every new line there is a new data. So, a particular line belongs to a particular user. And thats what I was doing, i checked the id, stored the data and moved on to next line. But this is too much time taking. I was hoping for a more efficient approach, or a more faster, to be more precise.

Are you sure? This action cannot be undone.
Cancel
grsalvi

Member • Sep 30, 2012

How about writing code to split the csv file into two parts...making two instances of your sorting program and allow them to work separately on the two different halves.
If enough RAM is there simultaneously two(or more) sorting program might work.What you think?

In fact if you just want it sorted :

Split the main (csv) file into 3 parts...
Use three different PCs(so three different RAMs) to run sorting programme which works on 3 different parts.👍

Are you sure? This action cannot be undone.
Cancel
Pensu

Member • Sep 30, 2012

Yes, it could work. But there are two issues in that. First of all, i have been searching a lot about how to implement parallel programming and truth be hold, I havent found much help. Secondly, to make two instances, I should know the number of records (I guess). I mean, if i am going to split the file, I should know the limits for those instances, which are missing in this case....😛

Are you sure? This action cannot be undone.
Cancel
grsalvi

Member • Sep 30, 2012

Buddy not exactly half.....use fread to read 50(or less) records and store to different files till you reach end.

So leave parallel programming run sorting programme on different machines ... Definately won't take 19 hours😉.

Let me know updates here...Because its quite interesting..Actually what this file is about & where you got it ?

Are you sure? This action cannot be undone.
Cancel
Pensu

Member • Sep 30, 2012

Haha.....Nice idea, i was thinking about that too....😉. Actually its part of my project work. I told one of my faculty about the problem and he said, "so what if it took 19 hours. Big simulations take time. It might take 3 days too"....😛. Let me try something else, will get back to you after that.

Are you sure? This action cannot be undone.
Cancel
Kaustubh Katdare

Administrator • Sep 30, 2012

Okay, I've never done anything like that; but how about migrating your CSV to a MySQL database and then perform the operations? 😉 . Obviously this approach isn't what you're looking at. Are you focused on solving the problem only using efficient programming or ready to alter the approach altogether?

Are you sure? This action cannot be undone.
Cancel
xero

Member • Sep 30, 2012

although m not much of a C programmer, but if i had to solve such a problem, i would rely on threads, wherein each thread will be responsible for a certain section of the file. With apropo file write lock handling mechanisms this operation can be done quickly.

Are you sure? This action cannot be undone.
Cancel
Uday Bidkar

Member • Sep 30, 2012

Are you creating a file per userid everytime a new userid is encountered and keeping all those file pointers in memory to write new data for that user as and when encountered?

Are you sure? This action cannot be undone.
Cancel
Uday Bidkar

Member • Sep 30, 2012

How about this approach,

First, parse the entire file and during parsing, maintain a map of key value pairs with key as userid and value as a linked list of integers giving start of the location of each entry in the file for that user.

Next take each entry from this map, loop through the link list of positions of entry into the file, fseek to that location, read that particular entry and write to file corresponding to that user. At the end of the loop, you have all entries for that user in one file, close the file and open a new one for next user. Repeat the process until all the entries in the map are processed.

Are you sure? This action cannot be undone.
Cancel
Uday Bidkar

Member • Sep 30, 2012

With this approach you dont even have to generate the files specific to every user if that's not your ultimate goal. If your ultimate goal is to just group the data in that file by user id to retrive it later for one particular user, you just save the map you generated in first pass in a file and next time when you need data for a particular user, first read map file, get the list of entry positions for required user and read only those entries

Are you sure? This action cannot be undone.
Cancel
Pensu

Member • Oct 1, 2012

#-Link-Snipped-#: I am actually working on that only. That would make things simpler later too as I also have to do a lot of analysis on that data. I can just use a simple sql query to get the data of a particular user. Lets see how that works out.👍
#-Link-Snipped-#: I dont think threads would be of much help here unless I implement them in parallel fashion. But again parallel implementation is also an issue. But if sql thing fails, I will look into this approach.
#-Link-Snipped-#:The idea is that there is a separate file for each user. So, when I encounter a user id, then the related data is saved in his particular file, if the file for that user already exists or I make a new file for that user and save the data in it. Now, about your approach actually the issue is the size of the data. I mean if I am going to parse it, then it would take a lot of time and then mapping process would take its time too.👎 Anyways I actually liked the idea of mapping.😀

Are you sure? This action cannot be undone.
Cancel
Uday Bidkar

Member • Oct 1, 2012

#-Link-Snipped-#, Can you post your code snippet where you read the main file and write the data to user specific file?

Are you sure? This action cannot be undone.
Cancel

Pensu

Member • Oct 1, 2012

Sure, here it is:

#include <stdio.h>
#include <string.h>
 
int fready = 0;
int fsearch(char name[]);
 
int main()
{
    FILE *fp, *fp1, *fp2;
    char c, user[3], fname[10];
    int i,count = 0, flag = 0, copy, fwrite = 0;
    fp = fopen("wtd.csv","r");
    fp1 = fopen("users.txt","w");
    fclose(fp1);
    c = fgetc(fp);
    i = 0;
    while(c != EOF)
    {
        if(count == 0)
        {
            user[i] = c;
            if(c == ',')
            {
                user[i] = '\0';
                printf("%s\n",user);
                copy = fsearch(user);
                printf("Copy = %d",copy);
                if(copy == 1)
                {
                    fp1 = fopen("users.txt","a");
                    printf("\nData into file %s",user);
                    fputs(user,fp1);
                    fputc('\n',fp1);
                    fclose(fp1);
                    puts(user);
                    fready = 1;
                }
                count = 1;
            }
            i++;
        }
        else if(c == '\n')
        {
            count = 0;
            i = 0;
            fready = 0;
            fwrite =0;
            fputc('\n',fp2);
            fclose(fp2);
        }
    if(fready == 1)
    {
        strcat(user,".txt");
        fp2 = fopen(user,"a");
        fputs(user,fp2);
        fready = 0;
        fwrite = 1;
    }
    if(fwrite == 1)
        fputc(c,fp2);
    c = fgetc(fp);
    }
    return 0;
}
 
// Function to check whether user id is saved or not
int fsearch(char str[])
{
    FILE *fp;
    char c, name[3];
    int i = 0 ,n;
    fp = fopen("users.txt","r");
    if(fp)
        puts("Success");
    else
        puts("Cant open file");
    c = fgetc(fp);
    while(c != EOF)
    {
        puts("Hello");
        if(c != '\n')
        {
            name[i] = c;
            i++;
        }
        else
        {
            name[i] = '\0';
            printf("Data from file");
            puts(name);
            n = strcmp(name,str);
            if(n == 0)
            {
                printf("These ids are same %s %s",name,str);
                fclose(fp);
                fready = 1;
                return 0;
            }
            i = 0;
        }
        c = getc(fp);
    }
    return 1;
}

The code is a little vague as it was prepared in a hurry, but I hope you will get the idea what I am trying to do....😀

Are you sure? This action cannot be undone.

Cancel

Uday Bidkar

Member • Oct 1, 2012

Lots of things to improve performance here.

Don't read/write file character by character, read it in large chunk, 5-10MB should be good I think. Processing the data once it's in memory would be a lot faster. Since you are also writing the data character by character, that's going to slow down the program a lot.

Every time you want to write user data, you are opening a file and closing it right after writing a line. Avoid that.

Remember, file I/O is too much expensive than memory operations so minimize them as much as you can.

You are maintaining another file "users.txt", is that really required? You can have that data in memory as well and if needed, save that to file at the end of entire program.

Are you sure? This action cannot be undone.
Cancel
Pensu

Member • Oct 1, 2012

Well, I have to read it character by character as to extract user id from every line. I can improve writing though. And actually I am closing each user file after writing, coz otherwise there would be a lot of file pointers to maintain and a lot of open files. Handling that could be a problem, I guess. And yeah, I have actually removed "users.txt", as it was slowing down the process.😀

Are you sure? This action cannot be undone.
Cancel
Uday Bidkar

Member • Oct 1, 2012

#-Link-Snipped-#, you don't "have" to read the file character by character just to extract user id. You can read large chunk of file in memory and do the same processing on that data. Believe me, reading characters from file will be a lot slow than reading characters from chunk in the memory.

Are you sure? This action cannot be undone.
Cancel
Pensu

Member • Oct 1, 2012

Hmmm.....okay then. I will try to read it from memory. Thanks a lot for your help....😀

Are you sure? This action cannot be undone.
Cancel
Umesh Kumar Rai

Member • Nov 6, 2012

Use SSIS technique to pull the records from CSV file and store it in database, then in SQL server, use SQL command to sort it out as you are want.
It will not take that long time as well. So here you can save your time 😁

Are you sure? This action cannot be undone.
Cancel
Manish Goyal

Member • Nov 8, 2012

why are you going too much complicated here

I am not sure about c , but in php you can write this program in just 10 lines MAY BE 2 -3 extra 😀

Also i am sure it won't take more than 1 hr to perform this operation,

Are you sure? This action cannot be undone.
Cancel