php - Generate popular subjects from a collection of post titles


I have a content aggregator website. I'd like to process the post titles to generate a list of popular post subjects. A subject could be "software development"; the important point is that the two words "software" and "development" don't have to be directly next to each other in the post title.

The idea is to generate a list of possible collections (a collection being a group of posts based on the same subject).

I have started writing the code and have so far managed to generate a list of the most used words. I now need to generate a new list ordered by the number of occurrences of multiple words together in post titles, which I'm hoping will give me the final list of popular subjects.

I'm looking for help to finish this off. I've added comments to the code to describe what I've done so far.

Also, if you know of a better way, please let me know!

    /**
     * Execute the console command.
     *
     * @return mixed
     */
    public function fire()
    {
        $this->info('starting...');

        // Get the last 1000 posts
        $posts = Post::orderBy('created_at', 'desc')->take(1000)->get();

        // Create an array of post titles
        $titles_arr = array_map(function($n){
            return $n['title'];
        }, $posts->toArray());

        // Create one big string of post titles
        $titles_str = implode(' ', $titles_arr);

        // Create an array of words from the above string
        $words = explode(' ', strtolower($titles_str));

        // Words to ignore
        $stopwords = array(
            'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount',
            'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been',
            'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt',
            'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every',
            'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full',
            'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how',
            'however', 'hundred', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me',
            'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no',
            'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves',
            'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since',
            'sincere', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them',
            'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thickv', 'thin', 'third', 'this', 'those', 'though', 'three', 'through',
            'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were',
            'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole',
            'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves', 'the'
        );

        // $words after the stop words have been filtered out
        $final_words = array_filter($words, function($n) use($stopwords){
            return (!in_array(strtolower($n), $stopwords));
        });

        // Count the occurrences of each word
        $reduce = array_count_values($final_words);
        arsort($reduce); // sort, most frequent first

        // Take the 1000 most popular words
        $top_1000 = array_slice($reduce, 0, 1000);

        // Now we have the top 1000 used words.
        // Need to find which of them are present in each title (ordered)

        $matched_total = array();

        // For each post title
        foreach ($titles_arr as $title) {

            // Split the title into an array of words
            $words = explode(' ', strtolower($title));

            $matched = array();

            echo $title . "\n";

            foreach ($words as $word) {
                // If the word is in $top_1000, print it and add it to the $matched array
                if (array_key_exists($word, $top_1000)) {
                    echo $word . "\n";
                    $matched[] = $word;
                }
            }

            // If the title contains popular words, add them to $matched_total
            if (count($matched) > 0)
                $matched_total[] = $matched;
        }

        // Now we have the words that match for each post; need to find the ones that appear together

        $this->info('finished.');
    }
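One possible way to take that last step, as a minimal sketch rather than a finished solution: for each title's list of matched words, build every unordered pair and count how many titles each pair occurs in. The helper name `countWordPairs` is hypothetical, the sample data is made up for illustration, and the input is assumed to have the same shape as `$matched_total` above (one array of words per title). It assumes PHP 7 or later because of the `??` operator and scalar type declarations.

```php
<?php
/**
 * Count how many titles each unordered word pair appears in.
 * (Hypothetical helper, not part of the original command.)
 *
 * @param array $matchedTotal list of word lists, one per title
 * @return array map of "word1 word2" => number of titles, highest count first
 */
function countWordPairs(array $matchedTotal): array
{
    $pairCounts = [];

    foreach ($matchedTotal as $words) {
        // De-duplicate and sort so that "b a" and "a b" map to the same key.
        $words = array_values(array_unique($words));
        sort($words, SORT_STRING);

        $n = count($words);
        for ($i = 0; $i < $n - 1; $i++) {
            for ($j = $i + 1; $j < $n; $j++) {
                $key = $words[$i] . ' ' . $words[$j];
                $pairCounts[$key] = ($pairCounts[$key] ?? 0) + 1;
            }
        }
    }

    arsort($pairCounts); // most frequent pairs first
    return $pairCounts;
}

// Made-up example input: popular words already extracted per title.
$matched_total = [
    ['software', 'development', 'beginners'],
    ['beginners', 'learn', 'develop', 'software'],
    ['waldo', 'travelling', 'world'],
    ['world', 'waldo'],
];

$pairs = countWordPairs($matched_total);
print_r(array_slice($pairs, 0, 5, true));
```

Because the words are sorted before pairing, the position of the words inside a title does not matter, which matches the requirement that the words need not be adjacent. The work per title grows with the square of the number of matched words, which should be acceptable for short titles.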

I have a feeling this shouldn't be done in PHP, but here goes anyway.

  1. Remove the stop words.

  2. For each word, keep track of the titles it appears in (see title_keys in the debug output below).

  3. For each word, count the titles it has in common with each of the other words and store that count in an associative array (see count_other_word_in_same_title in the debug output). Also store the counts in a separate array, so that at the end of the loop you know how many times the Nth most frequent phrase appears.

  4. Grab the phrases that appear at least as many times as the Nth most frequent phrase, and stop once you've grabbed N phrases.

    $titles = array(
        'where in world waldo',
        'software development beginners',
        'waldo travelling world',
        'beginners learn develop software',
        'the big brown dog jumped on waldo',
    );

    $stopwords = array(
        'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount',
        'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been',
        'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt',
        'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every',
        'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full',
        'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how',
        'however', 'hundred', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me',
        'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no',
        'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves',
        'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since',
        'sincere', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them',
        'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thickv', 'thin', 'third', 'this', 'those', 'though', 'three', 'through',
        'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were',
        'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole',
        'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves', 'the'
    );

    // Flip the array so each stop word becomes a key and can be tested with isset()
    $stopwords = array_flip($stopwords);
    $allwords = array();

    foreach ($titles as $key => $title) {

        $words = explode(' ', strtolower($title));
        $words = array_unique($words); // optional

        foreach ($words as $word) {

            // Skip stop words
            if (isset($stopwords[$word]))
                continue;

            // Record that this word appears in title $key
            $allwords[$word]['title_keys'][$key] = $key;
        }
    }

    print_r($allwords);

    $copy_allwords = $allwords;
    $counts = array();

    // If you want combinations of more than 2 words, make this recursive
    foreach ($copy_allwords as $word => $stats) {

        // Remove the current word so that each pair is only compared once
        unset($copy_allwords[$word]);

        foreach ($copy_allwords as $word2 => $stats2) {
            // Titles that contain both $word and $word2
            $intersect = array_intersect_key($stats['title_keys'], $stats2['title_keys']);
            if (!empty($intersect)) {
                $allwords[$word]['count_other_word_in_same_title'][$word2] = count($intersect);
                $counts[] = count($intersect);
            }
        }
    }

    rsort($counts);

    $count_phrases = 2; // setting: retrieve the top N phrases

    $phrases = array();

    foreach ($allwords as $word => $stats) {

        if (!isset($stats['count_other_word_in_same_title']))
            continue;

        foreach ($stats['count_other_word_in_same_title'] as $other_word => $count) {
            // Keep phrases that appear at least as often as the Nth most frequent one
            if ($count >= $counts[($count_phrases - 1)]) {
                $phrases[$word . ' ' . $other_word] = $count;
                if (count($phrases) >= $count_phrases)
                    break 2;
            }
        }
    }

    arsort($phrases);
    print_r($phrases);

Output for the top 2 phrases:

    Array
    (
        [software beginners] => 2 // 2 = number of titles the phrase appears in
        [world waldo] => 2
    )

Contents of $allwords at the end:

    Array
    (
        [world] => Array
            (
                [title_keys] => Array
                    (
                        [0] => 0
                        [2] => 2
                    )
                [count_other_word_in_same_title] => Array
                    (
                        [waldo] => 2
                        [travelling] => 1
                    )
            )
        [waldo] => Array
            (
                [title_keys] => Array
                    (
                        [0] => 0
                        [2] => 2
                        [4] => 4
                    )
                [count_other_word_in_same_title] => Array
                    (
                        [travelling] => 1
                        [big] => 1
                        [brown] => 1
                        [dog] => 1
                        [jumped] => 1
                    )
            )
        [software] => Array
            (
                [title_keys] => Array
                    (
                        [1] => 1
                        [3] => 3
                    )
                [count_other_word_in_same_title] => Array
                    (
                        [development] => 1
                        [beginners] => 2
                        [learn] => 1
                        [develop] => 1
                    )
            )
        [development] => Array
            (
                [title_keys] => Array
                    (
                        [1] => 1
                    )
                [count_other_word_in_same_title] => Array
                    (
                        [beginners] => 1
                    )
            )
        [beginners] => Array
            (
                [title_keys] => Array
                    (
                        [1] => 1
                        [3] => 3
                    )
                [count_other_word_in_same_title] => Array
                    (
                        [learn] => 1
                        [develop] => 1
                    )
            )
        [travelling] => Array
            (
                [title_keys] => Array
                    (
                        [2] => 2
                    )
            )
        [learn] => Array
            (
                [title_keys] => Array
                    (
                        [3] => 3
                    )
                [count_other_word_in_same_title] => Array
                    (
                        [develop] => 1
                    )
            )
        [develop] => Array
            (
                [title_keys] => Array
                    (
                        [3] => 3
                    )
            )
        [big] => Array
            (
                [title_keys] => Array
                    (
                        [4] => 4
                    )
                [count_other_word_in_same_title] => Array
                    (
                        [brown] => 1
                        [dog] => 1
                        [jumped] => 1
                    )
            )
        [brown] => Array
            (
                [title_keys] => Array
                    (
                        [4] => 4
                    )
                [count_other_word_in_same_title] => Array
                    (
                        [dog] => 1
                        [jumped] => 1
                    )
            )
        [dog] => Array
            (
                [title_keys] => Array
                    (
                        [4] => 4
                    )
                [count_other_word_in_same_title] => Array
                    (
                        [jumped] => 1
                    )
            )
        [jumped] => Array
            (
                [title_keys] => Array
                    (
                        [4] => 4
                    )
            )
    )
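The comment in the code above suggests making the pair loop recursive to get combinations of more than two words. A rough sketch of one way to do that follows; it is not the approach used above but a simpler per-title variant that builds every k-word combination of each title and counts them. The function names `combinations` and `countPhrases`, the sample titles and the short stop word list are all hypothetical, and PHP 7 or later is assumed.

```php
<?php
/**
 * Return every k-element combination of $items, built recursively.
 * (Hypothetical helper for illustration.)
 */
function combinations(array $items, int $k): array
{
    if ($k === 0) {
        return [[]];
    }
    if (count($items) < $k) {
        return [];
    }

    $first = array_shift($items);

    // Combinations that include $first ...
    $with = array_map(
        function (array $rest) use ($first) {
            return array_merge([$first], $rest);
        },
        combinations($items, $k - 1)
    );

    // ... plus the combinations that skip it.
    return array_merge($with, combinations($items, $k));
}

/**
 * Count the number of titles each k-word combination appears in.
 * $stopwords is expected to be flipped (word => anything), as in the code above.
 */
function countPhrases(array $titles, array $stopwords, int $k): array
{
    $counts = [];
    foreach ($titles as $title) {
        $words = array_unique(explode(' ', strtolower($title)));
        $words = array_values(array_filter($words, function ($w) use ($stopwords) {
            return $w !== '' && !isset($stopwords[$w]);
        }));
        sort($words, SORT_STRING); // canonical order, so word order in the title is ignored

        foreach (combinations($words, $k) as $combo) {
            $key = implode(' ', $combo);
            $counts[$key] = ($counts[$key] ?? 0) + 1;
        }
    }
    arsort($counts); // most frequent phrases first
    return $counts;
}

// Made-up sample data.
$titles = [
    'software development beginners guide',
    'guide beginners software testing',
    'waldo travelling world',
];
$stopwords = array_flip(['the', 'in', 'on']);

$triples = countPhrases($titles, $stopwords, 3);
print_r($triples);
```

Note that the number of combinations grows quickly with both k and the number of words in a title, so this is probably only practical for short titles and small k.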
