/* As discussed in class your assignment 9 due 6PM Dec 1 is to redo the word hashing problem using a disk hash file rather than the hash table in memory. However we can use this minor modification of lastweek's solution as the starting template. Notice that the hashtable array is gone so we just have a single hentry struct that will be reused thousands of times. I've also removed the hashwalk() function but have given you a separate small program asn09hashwalk.c that you should use to check your hash_file. My runs are shown below. [awetzel@europa /tmp]$ gcc -O asn09s.c [awetzel@europa /tmp]$ time a.out < vbgle10.txt 24858 total collisions real 0m0.776s user 0m0.170s sys 0m0.610s [awetzel@europa /tmp]$ ls -l hash_file -rw-r--r-- 1 awetzel users 538216 Nov 17 21:28 hash_file [awetzel@europa /tmp]$ gcc -O asn09hashwalk.c [awetzel@europa /tmp]$ a.out 12816 unique words out of 210333 total words most frequent word appears 16930 times Notice that you should get exactly the same number of collisions while building the hash file as we had from the solution handout in class using Duangduen's hash function that is already included in this template. (This presents a challenge ... I'll have to find an even better one now!) The block of code in the #ifdef PREVIOUS_CODE_TO_REPLACE is just for your reference as your job is to replace its functionality with your new disk based code. It is also interesting to note that the maximum possible size of the hash file would be 26*HTSIZE which is 538244 but the actual file is only 538216 since nothing actually fell into the last table position. The file creation is already done and the file will automatically close when your program ends so the only I/O calls you have to worry about are lseek(), read() and write(). There is no new reading for this assignment other than making sure you have read the hash chapter. We will have one more program after this one making it 10 in all. That program be a disk based tree structure. Have a Happy Thanksgiving!!! */ #include #include #include #define HTSIZE 19223 // A prime that is near 1.5 * 12816 #define MAXWLEN 24 struct hentry { int count; char word[MAXWLEN]; } hentry; // A single reusable hentry struct int ncollisions; // record your # of collisions int duangduen_hash(char *); int dumb_hash(char *); int main(int, char **); int get_word(char *); char word_buff[100]; int main(int argc, char *argv[]) { int h, inc, fd, n; unlink("hash_file"); // remove the old file if present // this open is more portable replacement for creat(() fd = open("hash_file", O_CREAT|O_RDWR|O_EXCL, 0666); while(get_word(word_buff) > 0) { inc = dumb_hash(word_buff); // dbl hash for jump size h = duangduen_hash(word_buff); // the hash starting point #ifdef PREVIOUS_CODE_TO_REPLACE while(strcmp(word_buff, hashtable[h].word)) { if(hashtable[h].count == 0) { strcpy(hashtable[h].word, word_buff); break; } ncollisions++; // wrong word is a collision h = (h + inc) % HTSIZE; // beware that inc is NOT 0 } hashtable[h].count++; // counts both old & new words #endif //**** YOUR DISKHASH CODE TO REPLACE THE ABOVE CODE FROM LAST WEEK //**** 15-20 lines using a different loop strategy than last week //**** since you have to do the lseek and read ops before you can do //**** the word comparison check that was in the while condition. } printf("%d total collisions\n", ncollisions); return(0); } int dumb_hash(char *wp) { unsigned int v = 0; while(*wp) v += *wp++; return(v % HTSIZE); } int duangduen_hash(char *wp) { unsigned int v = 0; /* unsigned to avoid shifting problems */ int i = 1; // for Duangduen's hash function while(*wp) v += ((*wp++) * (i++)*(i)) + ((4295)*(v+i)); // 33783 24858 Duangduen return(v % HTSIZE); } #include /* * Leave this routine EXACTLY as it stands */ int get_word(char *s) { int c; do { c = getchar(); if(c == EOF) return(0); } while(!isalpha(c) && !isdigit(c)); do { if(isupper(c)) c = tolower(c); *s++ = c; c = getchar(); } while(isalpha(c) || isdigit(c)); *s = 0; return(1); }