.NET Framework - Multi-Threaded App

Asked By Robert Sheppard
14-Feb-08 03:42 PM
I am new to C# and am trying to build a multi-threaded web crawler. I want
to crawl many sites at once. I know how to use IHTMLDocument2 to parse
the document object, but I want to launch multiple threads, each parsing an
individual web page.

With the WebBrowser control I can start parsing when I get the
Document_Complete event, but how can I do this with each web site on a
different thread? How are the Document_Complete events
handled in a multi-threaded environment?

This is an asynchronous operation, so I cannot see how it can be
done.
  Nicholas Paldino [.NET/C# MVP] replied...
14-Feb-08 04:18 PM
Robert,

This is difficult in your situation.  You can't use the WebBrowser
control, because it needs to be tied to a UI thread.

You could use MSHTML through COM interop.  However, you would have to
make sure that the ApartmentState of every thread on which you use MSHTML
is STA.  I am not sure about this, but I also believe you would have to
pump messages on those threads in order for the events to work correctly.

It's a better idea in this case to use HttpWebRequest/HttpWebResponse to
download the content yourself, and then set the content of a new MSHTML
instance in your thread to the content downloaded.  This way, you don't
have to wait for MSHTML to download the document, and you can work with it
right away.
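
As a rough illustration of the approach described above, here is a minimal
sketch, assuming the .NET Framework on Windows. The names CrawlerSketch,
Download, CrawlAll, fetch and parse, and the example.com URLs, are
hypothetical and not from this thread. Each URL gets its own STA thread, the
page is downloaded with HttpWebRequest/HttpWebResponse, and the HTML string
is handed to a callback. Loading that string into MSHTML would happen inside
the callback; that step is left out because it needs the MSHTML interop
assembly.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading;

public static class CrawlerSketch
{
    // Downloads one page synchronously and returns its body as a string.
    public static string Download(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

    // Starts one STA thread per URL and waits for all of them to finish.
    // fetch turns a URL into HTML; parse receives (url, html) on the same thread.
    public static void CrawlAll(IEnumerable<string> urls,
                                Func<string, string> fetch,
                                Action<string, string> parse)
    {
        List<Thread> threads = new List<Thread>();
        foreach (string url in urls)
        {
            string current = url; // per-iteration copy captured by the lambda
            Thread worker = new Thread(() =>
            {
                try
                {
                    parse(current, fetch(current));
                }
                catch (Exception ex)
                {
                    // A failure on one page (download or parse) should not stop the others.
                    Console.Error.WriteLine(current + ": " + ex.Message);
                }
            });
            worker.SetApartmentState(ApartmentState.STA); // must be set before Start()
            worker.Start();
            threads.Add(worker);
        }
        foreach (Thread t in threads)
        {
            t.Join();
        }
    }

    public static void Main()
    {
        // Placeholder URLs, chosen only for illustration.
        string[] urls = { "http://example.com/", "http://example.org/" };
        CrawlAll(urls, Download, (url, html) =>
        {
            // Assumption: MSHTML (or any other parser) would be fed the html here.
            Console.WriteLine(url + ": " + html.Length + " characters");
        });
    }
}
```

One thread per URL keeps the sketch short; a real crawler would probably cap
the number of threads and queue the remaining URLs.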


--
- Nicholas Paldino [.NET/C# MVP]
- mvp@spam.guard.caspershouse.com
  Robert Sheppard replied...
14-Feb-08 05:54 PM
Thanks... I will look at HttpWebRequest/HttpWebResponse. The old VB6 crawler
that I am porting from used the WebBrowser control, which works fine
but is very slow. Let me stress SLOW.
Thanks again for the help.

  Nicholas Paldino [.NET/C# MVP] replied...
14-Feb-08 10:19 PM
Robert,

Do you have a specific need to parse the entire document, or are you
looking for specific parts?  If you don't need to parse the entire document,
and what you are looking to scrape from the HTML is specific, then using
HttpWebRequest and HttpWebResponse will probably simplify things
considerably.
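
To make the "specific parts" case concrete, here is a small sketch that
pulls only the link targets out of an HTML string that has already been
downloaded, without building a document object at all. The regular
expression, the LinkScraper class and the sample HTML are assumptions chosen
for illustration; nothing in this thread prescribes them, and a pattern this
simple will miss unquoted or unusually formatted attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkScraper
{
    // Matches href="..." or href='...' inside an <a ...> tag and captures the value.
    private static readonly Regex HrefPattern = new Regex(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the href values in the order they appear in the HTML.
    public static List<string> ExtractLinks(string html)
    {
        List<string> links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            links.Add(match.Groups[1].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Hypothetical sample input; in a crawler this would be the downloaded page.
        string html = "<p><a href=\"http://example.com/a\">A</a> <A HREF='/b'>B</A></p>";
        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link); // prints http://example.com/a, then /b
        }
    }
}
```

Because this works on plain strings, it has no STA or message-pump
requirements and can run on any worker thread.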

