CS594 Information Retrieval Project Report

Yuanlei Zhang, Zhiao Shi

1. IFS structure

    In the parsing phase, if an XML file is parsed successfully without errors, all its information will be temporarily held in a 3-level associative array $resume. The structure of $resume is:

$resume['FIRST']
$resume['LAST']
$resume['OBJECTIVES']
$resume['SKILLS']
$resume['ANS1']
$resume['ANS2']
$resume['ANS3'];

 

(0 <= i <= 4)

$resume['EXPERIENCE'][i]['FROM']
$resume['EXPERIENCE'][i]['TO']
$resume['EXPERIENCE'][i]['EMPLOYER']
$resume['EXPERIENCE'][i]['POSITION']
$resume['EXPERIENCE'][i]['REASON']


(0 <= i <= 4)

$resume['EDUCATION'][i]['DEGREE']
$resume['EDUCATION'][i]['SCHOOL']
$resume['EDUCATION'][i]['MAJOR']
$resume['EDUCATION'][i]['COMPLETED']


 

    Fields of $resume['EXPERIENCE'] and $resume['EDUCATION'] will be used to create corresponding database tables, while terms in $resume['OBJECTIVES'], $resume['SKILLS'], $resume['ANS1'], $resume['ANS2'], $resume['ANS3'] will be used to build the IFS. When building the IFS, terms are extracted ignoring punctuation marks and digits, and all terms are converted to lowercase. Since we choose to use the classic vector model, we create both a IFS and an FS to facilitate the searching. The IFS has the structure:

ifs[term] -> [firstName_lastName.xml] -> [frequency]
                  [firstName_lastName.xml] -> [frequency]
                  ...
ifs[term]
....

 

    The FS has the structure:

fs[firstName_lastName.xml] -> [term] -> [frequency]
                                                [term] -> [frequency]
                                                ...
fs[firstName_lastName.xml]
....

 

2. Database design

     Three tables are created to hold relevant information about applicants in this project. They are:

         This tables contains each applicant's first name and last name. And each applicant is associated with a      

         unique  id. It is created with SQL command:

       

 CREATE TABLE applicants03(

                                                           aid int,

                                                           first varchar(20),

                                                           last varchar(20));

        

         Information about applicants' education is maintained by table education03.

 CREATE TABLE education03(

                                                           aid int,

                                                           iid int,

                                                           degree varchar(100),

                                                           school varchar(100),

                                                           completed date,

                                                           major varchar(100));

       

  

          Table experience03 keeps records of applicants' experience. Its structure is:

   CREATE TABLE experience03(

                                                            aid int,

                                                            jid int,

                                                            from_date date,

                                                            to_date d ate,

                                                            employer varchar(100),

                                                            position varchar(100),

                                                            reason varchar(20));

     

       These tables are dropped and created each time the php script is executed.

3. Routing Query

   Our objective is to find a game programmer for Nintendo. First we match all the applicants against some criteria regarding to their education and experience to select some candidates. We require that applicants must have degrees higher than (including) BS. Successful applicants must have experiences of over 3 years. Besides, no applicants should be fired on any of their previous jobs.  Here is the actual query used:

SELECT   DISTINCT  first , last
FROM     applicants03,education03
WHERE    applicants03.aid = education03.aid    AND
                   (degree = 'BS' OR degree ='MS' OR degree = 'PHD') AND
                   (applicants03.aid NOT IN (
                                                                        SELECT aid
                                                                        FROM experience03
                                                                       WHERE reason='fired')) AND
                    (applicants03.aid IN (

                                                                         SELECT aid FROM experience03
                                                                         GROUP BY aid
                                                                         HAVING SUM
(to_date-   from_date)>3*365 ));

After choosing the candidates, we use IFS to rank them.

4. Ranking Algorithms

    We use the classic vector model to do our ranking. The formal definition of the vector model is as below:

t - total number of terms in the vocabulary
N - number of documents in the collection
ni - number of documents containing term i
freqi,j - frequency of term i appearing in document j
freqi,q - frequency of term i appearing in query q

where        and   

    Our key-word based query is:

$query = array("pc"                        => 4,
                                "game"                  => 1,
                                "design"               => 1,
                                "designer"            => 1,
                                "programmer"     => 4,
                                "programming"  => 4,
                                "software"            => 4,
                                "developer"          => 1,
                                "development"    => 1,
                                "computer"          => 2,
                                "computing"       => 2,
                                "art"                       => 2,
                                "artist"                  => 2,
                                max freq              => 4);

   Important words are given higher frequency to weigh more in the calculation.

    The ranking code can be divided into two sections, the first section goes through the IFS and calculates the dot product part of the vector model.

while ($qEntry = each($query)) {
    $term = $qEntry[0];
    $qFreq = $qEntry[1];

    $logterm = log($N/$IFS[$term]["#"], 2);
    $w_iq = ($qFreq/$max) * $logterm;

    while ($dEntry = each($IFS[$term])) {
        $doc = $dEntry[0];
        $dFreq = $dEntry[1];
        if (!isset($candidates[$doc])) {
            continue
        }

        $w_ij = ($dFreq/$FS[$doc]["*"]) * $logterm;
        $candidates[$doc][0] += $w_iq*$w_ij;
    }
}

    The second section goes through the FS and calculates the magnitude part of the vector model.

while ($dEntry = each($candidates)) {
    $doc = $dEntry[0];
    $dot = $dEntry[1][0];

    $d_mag = 0.0;
    $max = $FS[$doc]["*"]; 
    while ($tEntry = each($FS[$doc])) {
        $term = $tEntry[0];
        $freq = $tEntry[1];
        $logterm = log($N/$IFS[$term]["#"], 2);
        $w_ij = ($freq/$max) * $logterm;
        $d_mag += $w_ij*$w_ij;
    }

    $d_mag = sqrt($d_mag);
    $rank[$doc] = $dot/($q_mag*$d_mag);
}

    The sim(d, q) is used as the final rank, which is a value between 0 and 1.


5. Credits

Yuanlei Zhang implemented all the tasks as described in the above sections 1 and 4 (searching the IFS)

Zhiao Shi implemented all the tasks as described in the above sections 2 and 3 (searching the database)

Yuanlei and Zhiao together designed the web interface.