How to extract data from this page using simplehtmldom

I'm trying to use simplehtmldom to extract information from https://benthamopen.com/browse-by-title/B/1/. Specifically, I want to access the following part of the page:

<div style="padding:10px;"><strong>ISSN: </strong>1874-1207<br><div class="sharethis-inline-share-buttons" style="padding-top:10px;" data-url="https://benthamopen.com/TOBEJ/home/" data-title="The Open Biomedical Engineering Journal"></div></div>

I have this code:

$html = file_get_html('https://benthamopen.com/browse-by-title/B/1/');
foreach($html->find('div[style=padding:10px;]') as $ele) {
    echo("<pre>".print_r($ele,true)."</pre>");
}

...which returns (I'm only showing one item from the page):

simplehtmldom\HtmlNode Object
(
    [nodetype] => HDOM_TYPE_ELEMENT (1)
    [tag] => div
    [attributes] => Array
        (
            [style] => padding:10px;
        )
    [nodes] => Array
        (
            [0] => simplehtmldom\HtmlNode Object
                (
                    [nodetype] => HDOM_TYPE_ELEMENT (1)
                    [tag] => strong
                    [attributes] => none
                    [nodes] => none
                )
            [1] => simplehtmldom\HtmlNode Object
                (
                    [nodetype] => HDOM_TYPE_TEXT (3)
                    [tag] => text
                    [attributes] => none
                    [nodes] => none
                )
            [2] => simplehtmldom\HtmlNode Object
                (
                    [nodetype] => HDOM_TYPE_ELEMENT (1)
                    [tag] => br
                    [attributes] => none
                    [nodes] => none
                )

I'm not sure how to proceed from here. I want to extract:

The ISSN text (1874-1207 in the example above), which is not shown in the echo output, and I'm not sure why. It is in element zero of [nodes].
The 'data-url' (https://benthamopen.com/TOBEJ/home/ in the example above).
The 'data-title' (The Open Biomedical Engineering Journal in the example above).

Perhaps my understanding of PHP objects and arrays isn't good enough, and I don't know why the ISSN is missing from the echo output. I have tried many different approaches, but I'm struggling to get the data out of the element.


1 Answer

森林海


I'm not familiar with simplehtmldom, beyond knowing to avoid it. So I'll offer a solution that uses PHP's built-in DOM classes instead:

<?php

libxml_use_internal_errors(true);

// get the HTML
$html = file_get_contents("https://benthamopen.com/browse-by-title/B/1/");
if ($html === false) {
    exit("Could not fetch the page\n");
}

// create a DOM object and load it up
$dom = new DOMDocument();
$dom->loadHTML($html);

// create an XPath object and query it
$xpath = new DOMXPath($dom);
$elements = $xpath->query("//div[@style='padding:10px;']");

// loop through the matches
foreach ($elements as $el) {
    // skip elements without ISSN
    $text = trim($el->textContent);
    if (strpos($text, "ISSN") !== 0) {
        continue;
    }
    // get the first div inside this thing (it carries the data-* attributes)
    $div = $el->getElementsByTagName("div")->item(0);
    if ($div === null) {
        continue;
    }
    // dump it out
    printf("%s %s %s<br/>\n", str_replace("ISSN: ", "", $text), $div->getAttribute("data-title"), $div->getAttribute("data-url"));
}

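XPath can be a bit overwhelming, but for a simple search like this one it isn't much different from a CSS selector. Hopefully the comments explain everything; if not, let me know!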
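To make the CSS comparison concrete, here is a minimal sketch that runs against the HTML fragment quoted in the question instead of the live page, so it needs no network access. It is a variation on the code above, not a replacement for it: the ISSN check is moved into the XPath expression, and the results are collected in an array. The second `<div>` in the sample and the `$rows` array are assumptions added for illustration only.

```php
<?php
// Fragment copied from the question; the second div is a made-up non-match.
$html = <<<'HTML'
<div style="padding:10px;"><strong>ISSN: </strong>1874-1207<br><div class="sharethis-inline-share-buttons" style="padding-top:10px;" data-url="https://benthamopen.com/TOBEJ/home/" data-title="The Open Biomedical Engineering Journal"></div></div>
<div style="padding:10px;">No ISSN in this one</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// CSS:   div[style="padding:10px;"]
// XPath: //div[@style='padding:10px;']
// The second predicate keeps only divs whose <strong> child starts with "ISSN".
$query = "//div[@style='padding:10px;'][strong[starts-with(normalize-space(.), 'ISSN')]]";

$rows = [];
foreach ($xpath->query($query) as $el) {
    // The ISSN value is the first text node after <strong>.
    $issn = trim($xpath->evaluate("string(strong/following-sibling::text()[1])", $el));
    // The child div that carries the data-* attributes.
    $share = $xpath->query("div[@data-url]", $el)->item(0);
    $rows[] = [
        'issn'  => $issn,
        'title' => $share ? $share->getAttribute('data-title') : null,
        'url'   => $share ? $share->getAttribute('data-url') : null,
    ];
}

print_r($rows);
```

Filtering inside the query keeps the loop body short, at the cost of a denser expression; the `strpos` check in the code above does the same job in plain PHP.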

Output:

1874-1207 The Open Biomedical Engineering Journal https://benthamopen.com/TOBEJ/home/<br/>

1874-1967 The Open Biology Journal https://benthamopen.com/TOBIOJ/home/<br/>

1874-091X The Open Biochemistry Journal https://benthamopen.com/TOBIOCJ/home/<br/>

1875-0362 The Open Bioinformatics Journal https://benthamopen.com/TOBIOIJ/home/<br/>

1875-3183 The Open Biomarkers Journal https://benthamopen.com/TOBIOMJ/home/<br/>

2665-9956 The Open Biomaterials Science Journal https://benthamopen.com/TOBMSJ/home/<br/>

1874-0707 The Open Biotechnology Journal https://benthamopen.com/TOBIOTJ/home/<br/>
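As for why the ISSN was missing from the dump in the question: the value is not an attribute of the outer `<div>`. It sits in a text node between `<strong>` and `<br>`, which matches the `HDOM_TYPE_TEXT` entry at index 1 of `[nodes]` in the quoted `print_r` output. The sketch below uses the built-in DOM classes (not simplehtmldom) to list the children of the outer `<div>` in the question's fragment; the `$children` array is only there for illustration.

```php
<?php
// Fragment copied from the question.
$html = '<div style="padding:10px;"><strong>ISSN: </strong>1874-1207<br>'
      . '<div class="sharethis-inline-share-buttons" style="padding-top:10px;" '
      . 'data-url="https://benthamopen.com/TOBEJ/home/" '
      . 'data-title="The Open Biomedical Engineering Journal"></div></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

// The outer div is the first div in document order.
$outer = $dom->getElementsByTagName('div')->item(0);

$children = [];
foreach ($outer->childNodes as $child) {
    // Text nodes report "#text" as their node name and hold the raw text.
    $kind = $child->nodeType === XML_TEXT_NODE ? 'text' : 'element';
    $children[] = sprintf('%s %s "%s"', $kind, $child->nodeName, trim($child->textContent));
}

print_r($children);
```

The second entry is the text node that holds `1874-1207`, while the `data-url` and `data-title` values are attributes of the inner `<div>`, which is why the answer reads them with `getAttribute()`.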

2023-09-08
