|
Post by ezmoney on Feb 23, 2013 0:03:41 GMT -5
I was trying out the xmlparser. The input code was:

    url$ = "http://url.com/"   ' <--- not the real one, but representative
    xml$ = httpget$(url$)
    xmlparser #rxx, xml$

The error printout was:

    Runtime Error in program 'untitled': xmlparser #rxx, xml$ HttpClientError

What am I doing wrong?
|
|
neal
Full Member
Posts: 104
|
Post by neal on Feb 25, 2013 0:13:54 GMT -5
Try this:

    url$ = "http://url.com/"   ' <--- not the real one, but representative
    xml$ = httpget$(url$)
    xml$ = mid$(xml$, instr(xml$, "<html"))
    xmlparser #rxx, xml$

The extra line removes the <!DOCTYPE ...> line. This seems to overcome the HttpClientError, but I then get other errors - HTML is not XML, I guess.
|
|
|
Post by ezmoney on Feb 25, 2013 12:23:40 GMT -5
Thanks for the help.
It came back as:

    Runtime Error in program 'untitled': xmlparser #rxx, xml$ The close tag for link was not found
|
|
|
Post by StefanPendl on Feb 25, 2013 15:53:24 GMT -5
The problem is that HTML doesn't always follow the XML standard: HTML allows closing tags to be omitted, while XML does not.
It is best to use the XML parser only for real XML files.
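For example, this made-up fragment is acceptable HTML, but it is not well-formed XML, because the <li> elements are never closed:

```html
<ul>
  <li>first item
  <li>second item
</ul>
```

An XML parser expects every element to be closed, e.g. <li>first item</li>, which is why xmlparser complains about missing close tags on ordinary web pages.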
|
|
|
Post by ezmoney on Feb 26, 2013 4:31:29 GMT -5
Thanks for the info... Maybe some of those who have worked through the parsing can help out.
Anything would be appreciated.
All I want to do is identify the web page link and the description behind it.
Thanks...
|
|
|
Post by StefanPendl on Feb 26, 2013 15:59:52 GMT -5
|
|
|
Post by ezmoney on Mar 1, 2013 7:42:20 GMT -5
That works well, and I understand it better now. But now a funny thing is happening: I put in the URL string for page 1 and retrieved the data. Then I put in the URL string for page 2, but got back the page 1 links. All six pages came back as page 1, even though only the page number was different in the GET string. It has got me stumped why it is doing this. Just so I would not make a typo, I used copy-and-paste to replace the URL links, so I don't understand this one.

I also thought maybe the previous return was staying in memory, so just before the httpget$ I put in:

    ret$ = ""
    ret$ = httpget$(url$)

Thus the return had to come from the url$ that httpget$ fetched. Anybody got any ideas why it does this and how to fix it? Things to try, or ideas that will show why it is not picking the correct url$ page? There is a bug in the works someplace... It has me stumped. Thanks.
|
|
|
Post by StefanPendl on Mar 1, 2013 16:05:53 GMT -5
If you show your code and the URLs, then we will be able to help you.
|
|
|
Post by meerkat on Mar 1, 2013 16:05:54 GMT -5
Not sure exactly what you are trying to do, but maybe this will help.
Extracting stuff from HTML is a common practice in a language like RB that deals with the web. Remember that web pages are free-format, so don't expect to see perfect tags. If you look for "<a href" you may not find it: it could be written with extra spaces or a tab between "<a" and "href", the two parts could be on separate lines, or the case could differ, like "<A href", "<a Href", or "<A HrEf".
You need to get as decent a web page as possible. First change <CR> and <LF> to a single space, then reduce multiple spaces to a single space. Since letter case may be a problem, you should make it all upper or all lower case. If you want to maintain the original case, make the case change in a copy; that way you can search the case-changed copy but use the offsets to pull data from the unchanged web page.
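A minimal sketch of that cleanup in RB-style code (the helper name cleanPage$ is mine; the character-by-character loop is simple but slow on very large pages):

```basic
' normalize a page: turn CR/LF into spaces, then collapse runs of spaces
function cleanPage$(page$)
    for i = 1 to len(page$)
        c$ = mid$(page$, i, 1)
        if c$ = chr$(13) or c$ = chr$(10) then c$ = " "
        cleanPage$ = cleanPage$ + c$
    next i
    while instr(cleanPage$, "  ") > 0
        i = instr(cleanPage$, "  ")
        cleanPage$ = left$(cleanPage$, i) + mid$(cleanPage$, i + 2)
    wend
end function

page$   = cleanPage$(httpget$(url$))   ' original case, for extracting data
search$ = lower$(page$)                ' lower-case copy, for searching
```

Search in search$ with instr(), then use the same offsets against page$ to keep the original capitalization in what you save.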
When finding all occurrences of "<a href" you need to offset each instr() by the position of the last find, for example:

    i = instr(webPage$, "<a href")
    while i > 0
        j = instr(webPage$, ">", i)          ' close of the tag, searched from i
        hrefData$ = mid$(webPage$, i, j - i + 1)
        ' ... process hrefData$ here ...
        i = instr(webPage$, "<a href", i + 1)
    wend
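A quick worked sketch of that loop on a made-up fragment (the webPage$ literal is invented for illustration):

```basic
webPage$ = "<p><a href=" + chr$(34) + "http://one/" + chr$(34) + ">One</a>"
webPage$ = webPage$ + " <a href=" + chr$(34) + "http://two/" + chr$(34) + ">Two</a></p>"

i = instr(webPage$, "<a href")
while i > 0
    j = instr(webPage$, ">", i)
    print mid$(webPage$, i, j - i + 1)   ' prints the whole opening <a ...> tag
    i = instr(webPage$, "<a href", i + 1)
wend
```

This prints each opening anchor tag in turn, including the quoted URL, which you can then split at the quote mark.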
HTH
|
|
|
Post by ezmoney on Mar 2, 2013 6:34:27 GMT -5
I convert everything to lower case at the fetch:

    ret$ = lower$(httpget$(git$))

The problem is that when I change git$ from page 1 to page 2, I get the same return. For some reason the page 1 url$ is returned regardless of the page number; that is the issue. Any operations below the httpget$(git$) fetch would not affect what was returned.

I search for the URL link, which is rather lengthy, as it contains the URL address. Generally the web site address is in the link, and if none are found I'm done; I'm out of that search.

Each record begins with "http://www.abcdefghijklmnopqrst....xyz" and generally ends with "</a>". Somewhere in that record, between "http" and "</a>", is a quote mark. Thus the link search is:

    search$ = "http://www.thesearchsiteaddress"
    v = 0
    [again]
    ifr = instr(url$, search$, v)
    if ifr < 1 then [nxturl]        ' if not found then I'm done, search is over
    v = ifr + len(search$) + 1      ' this moves the search to the next search point
    ito = instr(url$, "</a>", v)    ' find the end string from last point v
    if ito < 1 then [nxturl]        ' there is no end to the record; malformed or...
    v = ito                         ' move the v pointer up to the last match found
    print " found end of record at = "; ito
The return of the link and what comes after it is:

    ret$ = mid$(url$, ifr, ito - ifr)        ' assign to ret$

    ' find the quote mark
    iqt = instr(ret$, chr$(34), 0)           ' quote mark position
    if iqt < 1 then [skipprt]                ' no quote mark: skip it, must be a malformed link
    lnk$ = left$(ret$, iqt - 1)              ' the web link is everything left of the quote mark
    dis$ = right$(ret$, len(ret$) - iqt - 1) ' the text or description is everything to the right (was len(ad$), a typo)
    dis$ = ascii$(dis$)                      ' this function cleans up all the unprintables
    print st$; " "; lnk$; space$(5); dis$    ' this forms the new record to save
    rc = rc + 1                              ' this counts the records for the input link page
    trc = trc + 1                            ' this counts the total records found
    print #4, st$; " "; lnk$; space$(5); dis$; " "; date$("yyyy/mm/dd")  ' this saves output in file #4
    [skipprt]
    goto [again]                             ' loop back for the next link

After development, once all is working, I make the extra prints into comments; if anything looks funny later I can quickly bring them back into the printing and see more of what is happening.
|
|