Calculating the similarity of posts

PHP
慕少森 2022-06-17 14:27:57
I am using PHP 7.3 and calculating the similarity between posts.

<?php
$posts = [
    'post_count' => 3,
    'posts' => [
        [
            'ID' => 1,
            'post_content' => "Wrong do point avoid by fruit learn or in death. So passage however besides invited comfort elderly be me. Walls began of child civil am heard hoped my. Satisfied pretended mr on do determine by.",
        ],
        [
            'ID' => 2,
            'post_content' => "Lorem ipsum dolor sit"
        ],
        [
            'ID' => 3,
            'post_content' => "Months on ye at by esteem desire warmth former. Sure that that way gave any fond now. His boy middleton sir nor engrossed affection excellent."
        ],
        [
            'ID' => 4,
            'post_content' => "Lorem ipsum dolor sit"
        ],
    ]
];

print_r($posts);

function getNonSimilarTexts($posts)
{
    $similarityPercentageArr = array();

    for ($i = 0; $i <= $posts['post_count']; $i++) {
        // $posts->the_post();
        $currentPost = $posts['posts'][$i];
        if (!is_null($currentPost['ID'])) {
            for ($y = 0; $y <= $posts['post_count']; $y++) {
                $comparePost = $posts['posts'][$y];
                if (!is_null($comparePost['ID'])) {
                    similar_text(strip_tags($currentPost['post_content']), strip_tags($comparePost['post_content']), $perc);
                    // similarity is 100 if self compare
                    if ($perc != 100) {
                        array_push($similarityPercentageArr, [$currentPost['ID'], $comparePost['ID'], $perc]);
                    }
                }
            }
        }
    }
    return $similarityPercentageArr;
}

$p = getNonSimilarTexts($posts);
print_r($p);

As you can see, I get an array of the form [[ID, ID, similarity_percentage], ...] as output.

I want to filter this array and get rid of every similarity above 20%. In addition, I want to keep only one post out of each set of similar posts and remove the others.

The result I want is the post IDs 1, 2, 3.

Any suggestions on how to filter an array like this?
2 Answers

慕森卡

Contributed 1806 experience points, received more than 8 likes

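You can do the filtering right away by changing the condition if ($perc != 100) to if ($perc > 20), so that you only record the similar posts you want to remove. You can then skip storing the similarity percentage altogether, because all you need is, for each post, the list of IDs of the posts that are similar to it.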


So, once you have code like this:

if ($perc > 20) {
    $similarityPercentageArr[$currentPost['ID']][] = $comparePost['ID'];
}

you can then remove all the unwanted posts like this:

$postsToRemove = [];
$postsToKeep = [];

foreach ($similarityPercentageArr as $postId => $similarPostIds) {
    // this post has already appeared as similar somewhere, so its similar posts have already been added
    if (in_array($postId, $postsToRemove)) {
        continue;
    }

    $postsToKeep[] = $postId;
    $postsToRemove = array_merge($postsToRemove, $similarPostIds);
}

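Now you have the IDs of the original posts in $postsToKeep and the IDs of the posts similar to them in $postsToRemove.

I would also optimize the code a little so that similar_text is not called at all when you know you are comparing a post with itself. So instead of if (!is_null($comparePost['ID'])) you would have if (!is_null($comparePost['ID']) && $comparePost['ID'] !== $currentPost['ID']). Since a self-comparison always yields 100%, this also keeps a post's own ID out of its list of similar posts.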
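Putting the pieces above together, here is a minimal sketch under a few assumptions. The function name getPostIdsToKeep and the $threshold parameter are illustrative and not part of the original code. As a variation on the answer, the inner loop starts at $i + 1, so a post is never compared with itself, and posts that are similar to nothing are kept as well (they would never appear as keys in $similarityPercentageArr). The sample texts are hypothetical and chosen so that the similarity is either 0% or 100%.

```php
<?php
/**
 * Returns the IDs of the posts to keep: the first post of every group of
 * similar posts, plus every post that is similar to no other post.
 * Illustrative helper, not part of the original code.
 */
function getPostIdsToKeep(array $posts, float $threshold = 20.0): array
{
    $removed = []; // IDs marked as too similar to a kept post (used as a set)
    $keep = [];
    $count = count($posts);

    for ($i = 0; $i < $count; $i++) {
        $currentId = $posts[$i]['ID'];
        if (isset($removed[$currentId])) {
            continue; // too similar to an earlier post that is being kept
        }
        $keep[] = $currentId;
        $currentText = strip_tags($posts[$i]['post_content']);

        // Only look at later posts, so there is no self-comparison.
        for ($j = $i + 1; $j < $count; $j++) {
            $compareText = strip_tags($posts[$j]['post_content']);
            similar_text($currentText, $compareText, $percent);
            if ($percent > $threshold) {
                $removed[$posts[$j]['ID']] = true;
            }
        }
    }

    return $keep;
}

// Hypothetical sample data: posts 2 and 4 are identical, and the other
// texts share no characters with each other or with post 2.
$posts = [
    ['ID' => 1, 'post_content' => 'banana'],
    ['ID' => 2, 'post_content' => 'Lorem ipsum dolor sit'],
    ['ID' => 3, 'post_content' => 'xyzzy'],
    ['ID' => 4, 'post_content' => 'Lorem ipsum dolor sit'],
];

print_r(getPostIdsToKeep($posts)); // contains the IDs 1, 2 and 3
```

Using the removed IDs as array keys makes the lookup an isset() check instead of a linear in_array() scan, which matters only when there are many posts.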


Answered 2022-06-17
大话西游666

Contributed 1817 experience points, received more than 14 likes

PHP has several built-in functions for comparing strings:

similar_text: Calculate the similarity between two strings

levenshtein: Calculate Levenshtein distance between two strings

soundex: Calculate the soundex key of a string
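To make the difference between the three functions concrete, here is a short sketch that calls each of them on small example strings. The strings are illustrative and unrelated to the posts in the question.

```php
<?php
// similar_text returns the number of matching characters and, through the
// optional third argument, the similarity as a percentage.
$matching = similar_text('World', 'Word', $percent);
echo "similar_text: $matching\n";     // 4
printf("percent: %.2f\n", $percent);  // 88.89

// levenshtein returns the minimum number of single-character insertions,
// deletions and replacements needed to turn one string into the other.
$distance = levenshtein('kitten', 'sitting');
echo "levenshtein: $distance\n";      // 3

// soundex returns a four-character key; words that sound alike in English
// tend to share the same key.
echo soundex('Robert'), "\n";         // R163
echo soundex('Rupert'), "\n";         // R163
```

For a percentage threshold such as the 20% in the question, similar_text is the only one of the three that reports a percentage directly.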

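As for your question: after reading it, the title does not seem to quite match what you are asking.

Wouldn't adding one more condition be enough?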

<?php

$posts = [
    'post_count' => 3,
    'posts' => [
        [
            'ID' => 1,
            'post_content' => "Wrong do point avoid by fruit learn or in death. So passage however besides invited comfort elderly be me. Walls began of child civil am heard hoped my. Satisfied pretended mr on do determine by.",
        ],
        [
            'ID' => 2,
            'post_content' => "Lorem ipsum dolor sit"
        ],
        [
            'ID' => 3,
            'post_content' => "Months on ye at by esteem desire warmth former. Sure that that way gave any fond now. His boy middleton sir nor engrossed affection excellent."
        ],
        [
            'ID' => 4,
            'post_content' => "Lorem ipsum dolor sit"
        ],
    ]
];

print_r($posts);

function getNonSimilarTexts($posts)
{
    $similarityPercentageArr = array();

    for ($i = 0; $i <= $posts['post_count']; $i++) {
        // $posts->the_post();
        $currentPost = $posts['posts'][$i];
        if (!is_null($currentPost['ID'])) {
            for ($y = 0; $y <= $posts['post_count']; $y++) {
                $comparePost = $posts['posts'][$y];
                if (!is_null($comparePost['ID'])) {
                    similar_text(strip_tags($currentPost['post_content']), strip_tags($comparePost['post_content']), $perc);
                    // 100 means a self-comparison (or identical text); record only pairs above 20
                    if ($perc != 100 && $perc > 20) {
                        array_push($similarityPercentageArr, [$currentPost['ID'], $comparePost['ID'], $perc]);
                    }
                }
            }
        }
    }
    return $similarityPercentageArr;
}

$p = getNonSimilarTexts($posts);
print_r($p);

Output:

Array
(
    [0] => Array
        (
            [0] => 1
            [1] => 3
            [2] => 23.145400593472
        )

)
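The question also asks how to reduce an array of [ID, ID, similarity_percentage] rows to the posts worth keeping. The following is a minimal sketch of one way to do that. The function name filterSimilarPosts, its parameters and the sample rows are all hypothetical. It assumes the rows were built by comparing IDs to skip self-comparisons rather than with $perc != 100, because that condition also drops genuinely identical posts such as IDs 2 and 4. It also needs the full list of post IDs, since a post that is similar to nothing never appears in a row.

```php
<?php
/**
 * Illustrative helper, not part of the original code.
 *
 * @param array $allIds every post ID, in the order posts should be preferred
 * @param array $rows   rows of [ID, ID, similarity percentage]
 */
function filterSimilarPosts(array $allIds, array $rows, float $threshold = 20.0): array
{
    // Map each ID to the set of IDs it is too similar to (in both directions).
    $similarTo = [];
    foreach ($rows as [$idA, $idB, $percent]) {
        if ($idA !== $idB && $percent > $threshold) {
            $similarTo[$idA][$idB] = true;
            $similarTo[$idB][$idA] = true;
        }
    }

    $removed = [];
    $keep = [];
    foreach ($allIds as $id) {
        if (isset($removed[$id])) {
            continue; // already covered by an earlier, kept post
        }
        $keep[] = $id;
        foreach (array_keys($similarTo[$id] ?? []) as $similarId) {
            $removed[$similarId] = true;
        }
    }

    return $keep;
}

// Hypothetical rows: posts 2 and 4 are identical, posts 1 and 3 are below
// the threshold.
$rows = [
    [1, 3, 12.5],
    [2, 4, 100.0],
    [3, 1, 12.5],
    [4, 2, 100.0],
];

print_r(filterSimilarPosts([1, 2, 3, 4], $rows)); // contains the IDs 1, 2 and 3
```

With the percentages from the output above, posts 1 and 3 are about 23% similar, so a 20% threshold would drop post 3; getting IDs 1, 2, 3 from that data would need a higher threshold.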


Answered 2022-06-17