|
Post by ezmoney on Feb 23, 2013 0:03:41 GMT -5
I was trying out the xmlparser. The input code was:

    url$ = "http://url.com/"   ' <--- not the real one, but representative
    xml$ = httpget$(url$)
    xmlparser #rxx, xml$

The error printout was:

    Runtime Error in program 'untitled': xmlparser #rxx, xml$ HttpClientError

What am I doing wrong?
|
|
neal
Full Member
Posts: 104
|
Post by neal on Feb 25, 2013 0:13:54 GMT -5
Try this:

    url$ = "http://url.com/"   ' <--- not the real one, but representative
    xml$ = httpget$(url$)
    xml$ = mid$(xml$, instr(xml$, "<html"))
    xmlparser #rxx, xml$

The extra line removes the <!DOCTYPE ...> line. This seems to overcome the HttpClientError, but I then get other errors - HTML is not XML, I guess.
|
|
|
Post by ezmoney on Feb 25, 2013 12:23:40 GMT -5
Thanks for the help.
It came back as:

    Runtime Error in program 'untitled': xmlparser #rxx, xml$ The close tag for link was not found
|
|
|
Post by StefanPendl on Feb 25, 2013 15:53:24 GMT -5
The problem is that HTML doesn't always follow the XML standard: HTML allows closing tags to be omitted, while XML does not.
It is best to use the XML parser only for real XML files.
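For example, this made-up fragment is acceptable HTML, but it is not well-formed XML, because the <li> elements are never closed:

```html
<ul>
  <li>first item
  <li>second item
</ul>
```

An XML parser expects every element to be closed, e.g. <li>first item</li>, which is why xmlparser complains about missing close tags on ordinary web pages.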
|
|
|
Post by ezmoney on Feb 26, 2013 4:31:29 GMT -5
Thanks for the info... Maybe some of those who have worked through the parsing can help out.
Anything would be appreciated.
All I want to do is identify the web page link and the description behind it.
Thanks...
|
|
|
Post by StefanPendl on Feb 26, 2013 15:59:52 GMT -5
|
|
|
Post by ezmoney on Mar 1, 2013 7:42:20 GMT -5
That works well, and I understand it better now. But now a funny thing is happening: I put in the URL string for page 1 and retrieved the data. Then I put in the URL string for page 2, but got back the page 1 links. All six pages came back as page 1, even though only the page number was different in the GET string. It has got me stumped why it is doing this. Just so I would not make a typo, I used copy-and-paste to replace the URL links, so I don't understand this one.

I also thought maybe the previous return was staying in memory, so just before the httpget$ I put in:

    ret$ = ""
    ret$ = httpget$(url$)

Thus the return had to come from the url$ that httpget$ fetched. Anybody got any ideas why it does this and how to fix it? Things to try, or ideas that will show why it is not picking the correct url$ page? There is a bug in the works someplace... It has me stumped. Thanks.
|
|
|
Post by StefanPendl on Mar 1, 2013 16:05:53 GMT -5
If you show your code and the URLs, then we will be able to help you.
|
|
|
Post by meerkat on Mar 1, 2013 16:05:54 GMT -5
Not sure exactly what you are trying to do, but maybe this will help.
Extracting stuff from HTML is a common practice in a language like RB that deals with the web. Remember that web pages are free-format, so don't expect to see perfect tags. If you look for "<a href" you may not find it: it could be written with extra spaces or a tab between "<a" and "href", the two parts could be on separate lines, or the case could differ, like "<A href", "<a Href", or "<A HrEf".
You need to get as decent a web page as possible. First change <CR> and <LF> to a single space, then reduce multiple spaces to a single space. Since letter case may be a problem, you should make it all upper or all lower case. If you want to maintain the original case, make the case change in a copy; that way you can search the case-changed copy but use the offsets to pull data from the unchanged web page.
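A minimal sketch of that cleanup in RB-style code (the helper name cleanPage$ is mine; the character-by-character loop is simple but slow on very large pages):

```basic
' normalize a page: turn CR/LF into spaces, then collapse runs of spaces
function cleanPage$(page$)
    for i = 1 to len(page$)
        c$ = mid$(page$, i, 1)
        if c$ = chr$(13) or c$ = chr$(10) then c$ = " "
        cleanPage$ = cleanPage$ + c$
    next i
    while instr(cleanPage$, "  ") > 0
        i = instr(cleanPage$, "  ")
        cleanPage$ = left$(cleanPage$, i) + mid$(cleanPage$, i + 2)
    wend
end function

page$   = cleanPage$(httpget$(url$))   ' original case, for extracting data
search$ = lower$(page$)                ' lower-case copy, for searching
```

Search in search$ with instr(), then use the same offsets against page$ to keep the original capitalization in what you save.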
When finding all occurrences of "<a href" you need to offset each instr() by the position of the last find, for example:

    i = instr(webPage$, "<a href")
    while i > 0
        j = instr(webPage$, ">", i)          ' close of the tag, searched from i
        hrefData$ = mid$(webPage$, i, j - i + 1)
        ' ... process hrefData$ here ...
        i = instr(webPage$, "<a href", i + 1)
    wend
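A quick worked sketch of that loop on a made-up fragment (the webPage$ literal is invented for illustration):

```basic
webPage$ = "<p><a href=" + chr$(34) + "http://one/" + chr$(34) + ">One</a>"
webPage$ = webPage$ + " <a href=" + chr$(34) + "http://two/" + chr$(34) + ">Two</a></p>"

i = instr(webPage$, "<a href")
while i > 0
    j = instr(webPage$, ">", i)
    print mid$(webPage$, i, j - i + 1)   ' prints the whole opening <a ...> tag
    i = instr(webPage$, "<a href", i + 1)
wend
```

This prints each opening anchor tag in turn, including the quoted URL, which you can then split at the quote mark.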
HTH
|
|
|
Post by ezmoney on Mar 2, 2013 6:34:27 GMT -5
I convert everything to lower case at the fetch:

    ret$ = lower$(httpget$(git$))

The problem is that when I change git$ from page 1 to page 2, I get the same return. For some reason the page 1 url$ is returned regardless of the page number; that is the issue. Any operations below the httpget$(git$) fetch would not affect what was returned.

I search for the URL link, which is rather lengthy, as it contains the URL address. Generally the web site address is in the link, and if none are found I'm done; I'm out of that search.

Each record begins with "http://www.abcdefghijklmnopqrst....xyz" and generally ends with "</a>". Somewhere in that record, between "http" and "</a>", is a quote mark. Thus the link search is:

    search$ = "http://www.thesearchsiteaddress"
    v = 0
    [again]
    ifr = instr(url$, search$, v)
    if ifr < 1 then [nxturl]        ' if not found then I'm done, search is over
    v = ifr + len(search$) + 1      ' this moves the search to the next search point
    ito = instr(url$, "</a>", v)    ' find the end string from last point v
    if ito < 1 then [nxturl]        ' there is no end to the record; malformed or...
    v = ito                         ' move the v pointer up to the last match found
    print " found end of record at = "; ito
The return of the link and what comes after it is:

    ret$ = mid$(url$, ifr, ito - ifr)        ' assign to ret$

    ' find the quote mark
    iqt = instr(ret$, chr$(34), 0)           ' quote mark position
    if iqt < 1 then [skipprt]                ' no quote mark: skip it, must be a malformed link
    lnk$ = left$(ret$, iqt - 1)              ' the web link is everything left of the quote mark
    dis$ = right$(ret$, len(ret$) - iqt - 1) ' the text or description is everything to the right (was len(ad$), a typo)
    dis$ = ascii$(dis$)                      ' this function cleans up all the unprintables
    print st$; " "; lnk$; space$(5); dis$    ' this forms the new record to save
    rc = rc + 1                              ' this counts the records for the input link page
    trc = trc + 1                            ' this counts the total records found
    print #4, st$; " "; lnk$; space$(5); dis$; " "; date$("yyyy/mm/dd")  ' this saves output in file #4
    [skipprt]
    goto [again]                             ' loop back for the next link

After development, once all is working, I make the extra prints into comments; if anything looks funny later I can quickly bring them back into the printing and see more of what is happening.
|
|