Qt Notes: Parsing HTML in C++
Published: 2019-06-08


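When I used to write crawlers, I parsed pages in a crude way: I simply looked for patterns in the markup by hand. Later I came across gumbo online, tried it, and found that it works really well, so I am recommending it here.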

The following is reposted.

1. C++ does not seem to have many HTML parsing libraries to choose from. In the end I tried integrating htmlcxx into Qt. At first I wrote INCLUDEPATH += <path> in the .pro file, but that did not help, presumably because INCLUDEPATH only tells the compiler where to find headers and the library's source files still have to be compiled. It turned out that adding htmlcxx's .cc and .h files to SOURCES and HEADERS is enough, like this:

    SOURCES += main.cpp \
        html/utils.cc \
        html/Uri.cc \
        html/ParserSax.cc \
        html/ParserDom.cc \
        html/Node.cc \
        html/Extensions.cc

    HEADERS += mainwindow.h \
        html/utils.h \
        html/Uri.h \
        html/tree.h \
        html/ParserSax.h \
        html/ParserDom.h \
        html/Node.h \
        html/Extensions.h \
        html/debug.h \
        html/ci_string.h \
        html/wincstring.h \
        html/tld.h

Reference: htmlcxx for qt(mingw) http://blog.chinaunix.net/uid-21525518-id-1824657.html

2. Parsing with gumbo

The .c and .h files are added to the project in the same way as above. Some notes on gumbo's commonly used types:

GumboOutput: the result of parsing the HTML source. output->root is the root node.

    GumboOutput* output = gumbo_parse(htmlString.c_str());
    GumboNode* node = output->root;

GumboNode: a node. Its contents are reached through the union member v:

    GumboNode* node;
    node->v.text.text           // the node's text (text nodes)
    node->v.element.children    // the node's list of children (element nodes)
    node->type                  // the node's type

GumboVector: a container of nodes. For example, to get a node's list of children:

    GumboVector* children = &node->v.element.children;
    (GumboNode*)(children->data[i])    // the i-th node in the list

GumboAttribute: a node attribute.

    GumboAttribute* href;
    if (node->v.element.tag == GUMBO_TAG_A &&
        (href = gumbo_get_attribute(&node->v.element.attributes, "href"))) {
        std::cout << href->value << std::endl;
    }

Node types (DOM):

    ELEMENT_NODE: an ordinary element node
    ATTRIBUTE_NODE: an element attribute
    TEXT_NODE: a text node
    CDATA_SECTION_NODE: a CDATA section
    ENTITY_REFERENCE_NODE: an entity reference, which starts with &
    ENTITY_NODE: an entity
    PROCESSING_INSTRUCTION_NODE: a PI (processing instruction)
    COMMENT_NODE: a comment
    DOCUMENT_NODE: the root node, i.e. document.nodeType
    DOCUMENT_TYPE_NODE: the DTD (document type)

Gumbo's node types (usage: node->type == GUMBO_NODE_ELEMENT):

    typedef enum {
      /** Document node. v will be a GumboDocument. */
      GUMBO_NODE_DOCUMENT,
      /** Element node. v will be a GumboElement. */
      GUMBO_NODE_ELEMENT,
      /** Text node. v will be a GumboText. */
      GUMBO_NODE_TEXT,
      /** CDATA node. v will be a GumboText. */
      GUMBO_NODE_CDATA,
      /** Comment node. v will be a GumboText, excluding comment delimiters. */
      GUMBO_NODE_COMMENT,
      /** Text node, where all contents is whitespace. v will be a GumboText. */
      GUMBO_NODE_WHITESPACE
    } GumboNodeType;

Tag types (usage: node->v.element.tag != GUMBO_TAG_SCRIPT):

    typedef enum {
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#the-root-element
      GUMBO_TAG_HTML,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#document-metadata
      GUMBO_TAG_HEAD, GUMBO_TAG_TITLE, GUMBO_TAG_BASE, GUMBO_TAG_LINK,
      GUMBO_TAG_META, GUMBO_TAG_STYLE,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/scripting-1.html#scripting-1
      GUMBO_TAG_SCRIPT, GUMBO_TAG_NOSCRIPT, GUMBO_TAG_TEMPLATE,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/sections.html#sections
      GUMBO_TAG_BODY, GUMBO_TAG_ARTICLE, GUMBO_TAG_SECTION, GUMBO_TAG_NAV,
      GUMBO_TAG_ASIDE, GUMBO_TAG_H1, GUMBO_TAG_H2, GUMBO_TAG_H3, GUMBO_TAG_H4,
      GUMBO_TAG_H5, GUMBO_TAG_H6, GUMBO_TAG_HGROUP, GUMBO_TAG_HEADER,
      GUMBO_TAG_FOOTER, GUMBO_TAG_ADDRESS,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/grouping-content.html#grouping-content
      GUMBO_TAG_P, GUMBO_TAG_HR, GUMBO_TAG_PRE, GUMBO_TAG_BLOCKQUOTE,
      GUMBO_TAG_OL, GUMBO_TAG_UL, GUMBO_TAG_LI, GUMBO_TAG_DL, GUMBO_TAG_DT,
      GUMBO_TAG_DD, GUMBO_TAG_FIGURE, GUMBO_TAG_FIGCAPTION, GUMBO_TAG_MAIN,
      GUMBO_TAG_DIV,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#text-level-semantics
      GUMBO_TAG_A, GUMBO_TAG_EM, GUMBO_TAG_STRONG, GUMBO_TAG_SMALL, GUMBO_TAG_S,
      GUMBO_TAG_CITE, GUMBO_TAG_Q, GUMBO_TAG_DFN, GUMBO_TAG_ABBR,
      GUMBO_TAG_DATA, GUMBO_TAG_TIME, GUMBO_TAG_CODE, GUMBO_TAG_VAR,
      GUMBO_TAG_SAMP, GUMBO_TAG_KBD, GUMBO_TAG_SUB, GUMBO_TAG_SUP, GUMBO_TAG_I,
      GUMBO_TAG_B, GUMBO_TAG_U, GUMBO_TAG_MARK, GUMBO_TAG_RUBY, GUMBO_TAG_RT,
      GUMBO_TAG_RP, GUMBO_TAG_BDI, GUMBO_TAG_BDO, GUMBO_TAG_SPAN, GUMBO_TAG_BR,
      GUMBO_TAG_WBR,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/edits.html#edits
      GUMBO_TAG_INS, GUMBO_TAG_DEL,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/embedded-content-1.html#embedded-content-1
      GUMBO_TAG_IMAGE, GUMBO_TAG_IMG, GUMBO_TAG_IFRAME, GUMBO_TAG_EMBED,
      GUMBO_TAG_OBJECT, GUMBO_TAG_PARAM, GUMBO_TAG_VIDEO, GUMBO_TAG_AUDIO,
      GUMBO_TAG_SOURCE, GUMBO_TAG_TRACK, GUMBO_TAG_CANVAS, GUMBO_TAG_MAP,
      GUMBO_TAG_AREA,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/the-map-element.html#mathml
      GUMBO_TAG_MATH, GUMBO_TAG_MI, GUMBO_TAG_MO, GUMBO_TAG_MN, GUMBO_TAG_MS,
      GUMBO_TAG_MTEXT, GUMBO_TAG_MGLYPH, GUMBO_TAG_MALIGNMARK,
      GUMBO_TAG_ANNOTATION_XML,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/the-map-element.html#svg-0
      GUMBO_TAG_SVG, GUMBO_TAG_FOREIGNOBJECT, GUMBO_TAG_DESC,
      // SVG title tags will have GUMBO_TAG_TITLE as with HTML.
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/tabular-data.html#tabular-data
      GUMBO_TAG_TABLE, GUMBO_TAG_CAPTION, GUMBO_TAG_COLGROUP, GUMBO_TAG_COL,
      GUMBO_TAG_TBODY, GUMBO_TAG_THEAD, GUMBO_TAG_TFOOT, GUMBO_TAG_TR,
      GUMBO_TAG_TD, GUMBO_TAG_TH,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/forms.html#forms
      GUMBO_TAG_FORM, GUMBO_TAG_FIELDSET, GUMBO_TAG_LEGEND, GUMBO_TAG_LABEL,
      GUMBO_TAG_INPUT, GUMBO_TAG_BUTTON, GUMBO_TAG_SELECT, GUMBO_TAG_DATALIST,
      GUMBO_TAG_OPTGROUP, GUMBO_TAG_OPTION, GUMBO_TAG_TEXTAREA,
      GUMBO_TAG_KEYGEN, GUMBO_TAG_OUTPUT, GUMBO_TAG_PROGRESS, GUMBO_TAG_METER,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/interactive-elements.html#interactive-elements
      GUMBO_TAG_DETAILS, GUMBO_TAG_SUMMARY, GUMBO_TAG_MENU, GUMBO_TAG_MENUITEM,
      // Non-conforming elements that nonetheless appear in the HTML5 spec.
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/obsolete.html#non-conforming-features
      GUMBO_TAG_APPLET, GUMBO_TAG_ACRONYM, GUMBO_TAG_BGSOUND, GUMBO_TAG_DIR,
      GUMBO_TAG_FRAME, GUMBO_TAG_FRAMESET, GUMBO_TAG_NOFRAMES,
      GUMBO_TAG_ISINDEX, GUMBO_TAG_LISTING, GUMBO_TAG_XMP, GUMBO_TAG_NEXTID,
      GUMBO_TAG_NOEMBED, GUMBO_TAG_PLAINTEXT, GUMBO_TAG_RB, GUMBO_TAG_STRIKE,
      GUMBO_TAG_BASEFONT, GUMBO_TAG_BIG, GUMBO_TAG_BLINK, GUMBO_TAG_CENTER,
      GUMBO_TAG_FONT, GUMBO_TAG_MARQUEE, GUMBO_TAG_MULTICOL, GUMBO_TAG_NOBR,
      GUMBO_TAG_SPACER, GUMBO_TAG_TT,
      // Used for all tags that don't have special handling in HTML.
      GUMBO_TAG_UNKNOWN,
      // A marker value to indicate the end of the enum, for iterating over it.
      // Also used as the terminator for varargs functions that take tags.
      GUMBO_TAG_LAST,
    } GumboTag;

3. While using gumbo I hit the error "RtlWerpReportException failed with status code :-1073741823". At first I thought it was a stack overflow, but it turned out that my own code logic was wrong. It is best to write your code by following the usage in the official demo:

    if (node->v.element.tag == GUMBO_TAG_A &&
        (href = gumbo_get_attribute(&node->v.element.attributes, "href"))) {
        std::cout << href->value << std::endl;
    }

4. Compiling gumbo produced this error:

    error: 'for' loop initial declarations are only allowed in C99 mode

so these two lines need to be added to the project's .pro configuration:

    QMAKE_CFLAGS_DEBUG += --std=c99
    QMAKE_CFLAGS_RELEASE += --std=c99

 

When reposting, please credit the source.

Reposted from: https://www.cnblogs.com/fnlingnzb-learner/p/5835428.html
